Submission to ViDoRe

#1
by merve HF staff - opened

Hiya, congratulations on the release! Would be nice to submit to ViDoRe leaderboard 🤗 https://huggingface.co/spaces/vidore/vidore-leaderboard

Yeah I am running it right now @merve ;)
If the authors want, please verify my implementation here (https://github.com/illuin-tech/vidore-benchmark/pull/52/files#diff-8697e94428e08895c26cb52efbc99022c90c4198c1a8ed62cdadbc14d180590d)

I was able to match your reported scores on Shift (the vidore version).
As you'd expect, your contribution is more geared towards multilingualism on PDF-like data with Gemini queries, so the original DSE model (with the same resolution) tends to perform slightly better on both the synthetic datasets and the academic ones (DocVQA, ArxivQA, etc.).
It would be interesting to extend the evals to other languages; if you want to contribute to this, please don't hesitate to reach out!
I agree with your assessment of the bias introduced in the training set split. I think we want to tend towards 100% human data in the long term, but that's costly...

I just uploaded the results.json file as a PR, so you can just merge it in!
It's the branch I sent you, ran with this command:

vidore-benchmark evaluate-retriever --model-class dse-qwen2 --model-name <path>/mcdse-2b-v1 --collection-name ../colpali/data_dir/eval_vidore --split test

Hi @manu , thanks for running the evals!
Qwen DSE is trained with add_generation_prompt=False (both the query and the document prompt end with the user turn's <|im_end|>\n followed by the end-of-text token). But this doesn't seem to affect the eval results at all, so I think it's OK.
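To make the difference concrete, here is a minimal sketch of the two prompt endings. It assumes the usual ChatML layout and that the end-of-text token is spelled `<|endoftext|>`; `build_prompt` and the example query are hypothetical and are not taken from the DSE or vidore-benchmark code.

```python
# Assumed ChatML special tokens; the exact strings are an assumption.
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"
END_OF_TEXT = "<|endoftext|>"


def build_prompt(user_text: str, add_generation_prompt: bool) -> str:
    """Hypothetical helper: format a single user turn in ChatML style."""
    prompt = f"{IM_START}user\n{user_text}{IM_END}\n"
    if add_generation_prompt:
        # Opens an assistant turn, as chat templates do for generation.
        prompt += f"{IM_START}assistant\n"
    return prompt


query = "Query: what is the projected energy mix?"  # illustrative text only

# Training-style ending described above: user turn closes, then end of text.
train_style = build_prompt(query, add_generation_prompt=False) + END_OF_TEXT

# Ending when a generation prompt is added before the end-of-text token.
gen_style = build_prompt(query, add_generation_prompt=True) + END_OF_TEXT

print(repr(train_style[-30:]))
print(repr(gen_style[-50:]))
```

The only difference is the extra assistant header before the final token, which is consistent with the small effect on scores reported here.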

Overall it's an average ~3% drop from the base model (I think the one on the leaderboard is evaluated with 2560 image tokens). I was expecting a bigger drop due to the higher image resolution. In the next version I'm going to focus on just building a better, larger, high-quality dataset of multilingual query/image pairs.

Beyond multilingual performance, your work on Matryoshka / binarization is super interesting as well; it would be cool to have a leaderboard that reflects these aspects...
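For readers unfamiliar with the two ideas, here is a rough pure-Python sketch of Matryoshka-style truncation followed by sign binarization. The vectors, the target dimension of 8, and all helper names are made up for illustration; this is not the model's actual pipeline.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` values (Matryoshka-style) and L2-normalize them."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


def binarize(vec):
    """Map each value to one bit: 1 if positive, else 0."""
    return [1 if x > 0 else 0 for x in vec]


def pack_bits(bits):
    """Pack bits into bytes, most significant bit first, zero-padded."""
    out = bytearray()
    for i in range(0, len(bits), 8):
        chunk = bits[i:i + 8]
        byte = 0
        for b in chunk:
            byte = (byte << 1) | b
        byte <<= 8 - len(chunk)  # pad a short final chunk
        out.append(byte)
    return bytes(out)


def hamming(a, b):
    """Number of differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Toy 12-dimensional embeddings (invented values).
doc = [0.9, -0.2, 0.4, -0.7, 0.1, 0.3, -0.5, 0.8, 0.05, -0.01, 0.02, -0.03]
query = [0.8, -0.1, -0.3, -0.6, 0.2, 0.4, -0.4, 0.7, -0.02, 0.04, 0.01, -0.05]

doc_code = pack_bits(binarize(truncate_and_normalize(doc, 8)))
query_code = pack_bits(binarize(truncate_and_normalize(query, 8)))

print(len(doc_code), "byte(s) per vector")
print("Hamming distance:", hamming(doc_code, query_code))
```

In this toy case each vector shrinks from 12 floats to a single byte, which is the kind of index-size trade-off a leaderboard could report alongside retrieval quality.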

Perhaps I can integrate runtime / index size on a given fixed GPU or something + some graphs...
Excited to see people work on this! If you ever want to pool some of these datasets (we're also working on some things on our side), don't hesitate to reach out by mail or LinkedIn!

marco changed discussion status to closed
