Fine-tuning results in ValueError: Number of image tokens in input_ids different from num_images.
Hello everyone,
I've run the workbook I found here:
https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LLaVa-NeXT/Fine_tune_LLaVaNeXT_on_a_custom_dataset_(with_PyTorch_Lightning).ipynb
which explains how to fine-tune llava-v1.6-mistral-7b on the CORD dataset.
However, when starting training I got:
ValueError: Number of image tokens in input_ids (251) different from num_images (1).
Stack trace here: https://we.tl/t-nRuGejSct3
I'm not very experienced with this, but after investigating a bit I found that calling the "train_collate_fn" function on a training example:
train_collate_fn([train_example])[0]
results in
input_ids = tensor([[ 1, 733, 16289, 28793, 28705, 32000, 32000, 32000, 32000, 32000, ..., 32000]])
with token_id 32000 repeated 251 times, which, from what I read online, should be the <image> token id
(increasing MAX_LENGTH gives me a larger tensor, but again filled with 32000; passing no MAX_LENGTH at all, I get token 32000 repeated 2160 times, with other values before and after, for a total of 3251 elements)
It looks like truncation cuts the tensor down to MAX_LENGTH, reducing the (presumably) 2160 occurrences of token 32000 to just 251, which then no longer matches the num_images = 1 that the model expects to find during training.
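To make sure I'm describing the suspected mismatch clearly, here is a minimal sketch of what I think is happening, using plain Python lists instead of tensors. The token id 32000 and the counts (2160 image tokens, 3251 total elements, 251 after truncation) come from my observations above; MAX_LENGTH = 256 and the exact prefix tokens are my assumptions, chosen so the numbers line up:

```python
# Stand-in for the <image> placeholder token id observed in input_ids.
IMAGE_TOKEN_ID = 32000
MAX_LENGTH = 256  # assumed value; with a 5-token prefix it reproduces the 251 count

def count_image_tokens(input_ids):
    """Count how many <image> placeholder tokens are present in input_ids."""
    return sum(1 for tok in input_ids if tok == IMAGE_TOKEN_ID)

# An untruncated sequence: a few text tokens, then 2160 image-patch
# placeholders for the single image, then more text tokens
# (total length 3251, as observed without MAX_LENGTH).
prompt_prefix = [1, 733, 16289, 28793, 28705]            # leading text tokens
image_patches = [IMAGE_TOKEN_ID] * 2160                  # one image's patches
prompt_suffix = [2] * (3251 - len(prompt_prefix) - 2160) # trailing text tokens
full_input_ids = prompt_prefix + image_patches + prompt_suffix

# Truncating to MAX_LENGTH chops the placeholder run mid-sequence, so the
# sequence now contains fewer <image> tokens than the one image produces.
truncated = full_input_ids[:MAX_LENGTH]

print(count_image_tokens(full_input_ids))  # 2160 -> matches the image
print(count_image_tokens(truncated))       # 251  -> mismatch, hence the error
```

If this is right, the model counts 251 image tokens in input_ids but expects the full set for num_images = 1, which would explain the ValueError.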
I don't know whether this is actually the problem, or if I'm completely off track.
Can anyone help me?
Thanks,
Carmelo