Fine-tuning results in ValueError: Number of image tokens in input_ids different from num_images.

#40
by cvalore - opened

Hello everyone,
I've run the notebook I found here:
https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LLaVa-NeXT/Fine_tune_LLaVaNeXT_on_a_custom_dataset_(with_PyTorch_Lightning).ipynb

which explains how to fine-tune llava-v1.6-mistral-7b on the CORD dataset.

However, when starting training I got:
ValueError: Number of image tokens in input_ids (251) different from num_images (1).

Stack trace here: https://we.tl/t-nRuGejSct3

I'm not really experienced with this kind of thing, but after investigating a bit I found that the function `train_collate_fn`, called on a training example:
train_collate_fn([train_example])[0]

results in
input_ids = tensor([[ 1, 733, 16289, 28793, 28705, 32000, 32000, 32000, 32000, 32000, ..., 32000]])

with token id 32000 repeated 251 times, which, from what I read online, should be the `<image>` token id
(increasing MAX_LENGTH gives me a larger tensor, again filled with 32000; passing no MAX_LENGTH at all, I get token 32000 repeated 2160 times, with other values before and after, for a total of 3251 elements).
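
For reference, this is roughly how the image tokens can be counted (just a sketch, assuming the notebook's `processor`, `train_collate_fn`, and `train_example` are in scope; `convert_tokens_to_ids` is the standard tokenizer call, and for this model it returns 32000 for `<image>`):

```python
# Count how many <image> placeholder tokens survive the collate step.
input_ids = train_collate_fn([train_example])[0]

image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")  # 32000 here
n_image_tokens = (input_ids == image_token_id).sum().item()
print(f"<image> tokens in input_ids: {n_image_tokens}")  # 251 with MAX_LENGTH set
```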

It looks like truncation cuts the tensor down to MAX_LENGTH, reducing what would be (eventually) 2160 occurrences of token 32000 to 251, and this then no longer matches what the model expects to find for num_images = 1 during training.
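
If that's right, I guess one workaround would be to make MAX_LENGTH large enough to hold the fully expanded image tokens plus the text. Just a sketch of the call inside the collate function (assuming, as in the notebook, that `texts` and `images` are the batch lists and that the processor forwards `truncation`/`max_length` to the tokenizer):

```python
# Possible workaround: raise max_length above the ~3251 tokens observed
# without truncation, so the ~2160 expanded <image> tokens are not cut off.
batch = processor(
    text=texts,        # prompts for the batch, as built in the notebook
    images=images,     # corresponding images
    padding=True,
    truncation=True,
    max_length=3600,   # above the 3251 total elements seen without truncation
    return_tensors="pt",
)
```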

I don't know whether this is actually the problem or I'm completely off track.

Can anyone help me?

Thanks,
Carmelo
