ColFlor: Towards BERT-Size Vision-Language Document Retrieval Models
Introduction
In June 2024, ColPali [1] was introduced as an OCR-free document retrieval model built on PaliGemma [2]. It shifted the paradigm of PDF document retrieval by processing page images directly instead of relying on error-prone and resource-heavy OCR pipelines. However, with three billion parameters, ColPali can be computationally expensive, especially for large document collections. In contrast, text retrieval models like ColBERT [6] are more efficient, with just a few hundred million parameters, but they depend on those same error-prone and expensive OCR pipelines. To bridge this gap, we introduce ColFlor, an OCR-free visual document retrieval model with only 174 million parameters. ColFlor is 17 times smaller than ColPali, 9.8 times faster at query encoding, and 5.25 times faster at image encoding, with just a 1.8% performance drop on text-rich English documents.
Modeling
Architecture
ColFlor leverages Florence-2’s architecture [3], utilizing both its vision and text encoders while discarding the text decoder.
- Vision Encoder: Utilizes DaViT [4], producing N visual embedding vectors from the input images.
- Text Encoder: Based on the BART [5] encoder, which takes the N visual embeddings as input and produces contextualized embeddings. These contextualized embeddings are then projected to 128-dimensional vectors by a linear layer to reduce the storage requirements of the embeddings, similar to ColBERT and ColPali.
Retrieval System
Document retrieval often involves two main steps:
- Indexing: Machine learning models generate embeddings for each document, capturing key features. These embeddings are stored in an index for efficient retrieval later.
- Querying: A user’s query is encoded into an embedding, which is compared to the indexed document embeddings to retrieve and rank results based on similarity.
Traditional retrieval models generate a single embedding vector per document/query. In contrast, ColBERT-style models like ColPali use contextual late interaction, encoding documents and queries into bags of contextualized embeddings. Similarity is calculated between these bags of embeddings using the MaxSim operation.
ColFlor follows this approach. During indexing, both the image and one <OCR> text token (as used in the Florence-2 paper) are encoded to create text-aware contextual embeddings for the input image. At query time, the text encoder processes the query to produce text embeddings. Finally, the similarity between the image and query embeddings is measured using the MaxSim operation.
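To make the MaxSim scoring concrete, here is a minimal pure-Python sketch of late interaction as defined in ColBERT [6]: each query vector is matched with its most similar document vector, and those maxima are summed. The function names (`dot`, `maxsim`, `rank_documents`), the toy 2-dimensional vectors, and the page identifiers are illustrative assumptions, not part of the ColFlor code base, which operates on batched 128-dimensional tensors.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_embs: Sequence[Vector], doc_embs: Sequence[Vector]) -> float:
    """Late-interaction score between two bags of embeddings.

    For every query vector, keep the highest dot product against any
    document vector, then sum these maxima over the query vectors.
    """
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)


def rank_documents(
    query_embs: Sequence[Vector], index: Dict[str, List[Vector]]
) -> List[Tuple[str, float]]:
    """Score every indexed page against the query, best match first."""
    scores = [(doc_id, maxsim(query_embs, embs)) for doc_id, embs in index.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): 2-dimensional vectors instead of 128-dimensional ones.
query = [[1.0, 0.0], [0.0, 1.0]]
index = {
    "page_a": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    "page_b": [[0.5, 0.5], [0.0, 0.5]],
}

ranking = rank_documents(query, index)
print(ranking)  # [('page_a', 2.0), ('page_b', 1.0)]
```

Because the page embeddings depend only on the page, they can be computed once at indexing time; only the query needs to be encoded when a search is issued.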
Training Setup
We initialized the model weights from Florence-2-base, except for the new linear projection layer, which was randomly initialized. Initially, training was unstable, and the loss failed to converge despite some hyperparameter search. To address this, we first removed the randomly initialized projection layer and trained the model for 5 epochs, applying the MaxSim operation directly to the text encoder's output embeddings. This stabilized training and improved convergence. Afterward, we reintroduced the linear layer and fine-tuned the model for 40 epochs on the ViDoRe dataset [1], using a learning rate of 2e-5 and a batch size of 64 on 4 A100 GPUs.
Evaluation
We evaluated ColFlor on the ViDoRe benchmark, which consists of 10 subcategories of document retrieval tasks. We group them as follows:
- Text-rich English Documents: Includes academic datasets like DocVQA and TatDQA, and real-world practical data like AI, Energy, Government Reports, and Healthcare.
- Figure Documents: Includes InfoVQA and ArxivQA, which primarily consist of complex visuals such as figures, diagrams, and infographics.
- French Documents: Includes TabFQuAD and Shift, testing the model’s multilingual capabilities.
The results (as shown in the table below) indicate that ColFlor performs comparably to ColPali on text-rich English documents, with only a 1.8% decrease in average performance despite its significantly smaller size. Notably, ColFlor outperforms ColPali on TatDQA, a VQA dataset derived from publicly available real-world financial reports, as well as on the Healthcare dataset. This highlights ColFlor's potential for real-world applications and its ability to scale efficiently. The performance gap is more pronounced in the Figure Documents category, likely because the backbone model (Florence-2) focuses on text-rich documents and had limited training on figures. We plan to address this by continuing the pretraining of Florence-2 on figures in the future. Lastly, ColFlor performs poorly on French documents, as Florence-2 was designed for English only and lacks multilingual support.
Efficiency of ColFlor
The ColFlor model aims to offer an efficient, affordable, yet high-performing alternative to ColPali, making the new OCR-free document retrieval paradigm accessible to users with limited computing resources (GPU-poor). ColFlor is 17 times smaller in terms of parameters than ColPali. We benchmarked both models' forward passes on a free T4 GPU using the float32 data type. For image encoding, we used a batch size of 32 for ColFlor and 2 for ColPali. For query encoding, we used a batch size of 1 to simulate online querying. As shown in the figure below, ColFlor is 5.25 times faster for image encoding and 9.8 times faster for query encoding. Additionally, ColFlor processes images at a higher resolution (768x768 vs. 448x448 for ColPali) while producing fewer contextualized embeddings (587 vs. 1024), reducing storage costs.
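The storage claim can be checked with a quick back-of-the-envelope calculation. The sketch below uses the embedding counts quoted above (587 vs. 1024) and the 128-dimensional projection; storing each value as a 2-byte float16 is an assumption made only for illustration, and the helper name `bytes_per_page` is hypothetical.

```python
EMBEDDING_DIM = 128   # projection size used by ColFlor, as in ColBERT and ColPali
BYTES_PER_VALUE = 2   # assumption: embeddings are stored as float16


def bytes_per_page(num_vectors: int,
                   dim: int = EMBEDDING_DIM,
                   bytes_per_value: int = BYTES_PER_VALUE) -> int:
    """Raw index size for one page: vectors x dimensions x bytes per value."""
    return num_vectors * dim * bytes_per_value


colflor_bytes = bytes_per_page(587)    # 587 * 128 * 2 = 150,272 bytes
colpali_bytes = bytes_per_page(1024)   # 1024 * 128 * 2 = 262,144 bytes
ratio = colflor_bytes / colpali_bytes  # 587 / 1024, roughly 0.57

print(f"ColFlor: {colflor_bytes} bytes per page")
print(f"ColPali: {colpali_bytes} bytes per page")
print(f"ColFlor needs {ratio:.0%} of ColPali's per-page storage")
```

Under these assumptions, a ColFlor index takes a little over half the space of a ColPali index for the same number of pages. The ratio itself does not depend on the chosen byte width, only on the number of vectors per page.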
Conclusion and Future Work
We introduced ColFlor, a BERT-size model for OCR-free document retrieval. ColFlor is 17 times smaller than ColPali and delivers 5.25 times faster image encoding and 9.8 times faster query encoding, with only a 1.8% drop in performance on text-rich English documents. Looking ahead, we are exploring continual pretraining of Florence-2 to enhance figure and diagram understanding. We also plan to develop a multilingual variant of Florence-2, enabling ColFlor to support a wider range of languages for broader applications.
In this blog post, we shared preliminary findings and released early artifacts from our research on multimodal document retrieval. We are excited about the potential of this work within the open-source community. We are also working on further improvements, along with a more comprehensive technical report!
Resources
🧠 Model: https://huggingface.co/ahmed-masry/ColFlor
💻 Demo: https://huggingface.co/spaces/ahmed-masry/ColFlor-Demo
🏋️‍♂️ Training code: https://github.com/AhmedMasryKU/colflor
📊 Evaluation code: https://github.com/AhmedMasryKU/vidore-benchmark-colflor
If you have any questions about this work, feel free to reach out to Ahmed Masry at [email protected]
Acknowledgement
This work was carried out at the Intelligent Visualization Lab at York University in Canada. It was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada and the Canada Foundation for Innovation (CFI). Additionally, it received support through a GCP credits award from Google's PaliGemma Academic Program.
We greatly appreciate the well-documented training and evaluation GitHub repositories provided by the ColPali team, which were essential in our model development.
Citation
If you plan to use ColFlor in your research, please consider citing us as follows:
@misc{masry2024colflor,
title={ColFlor: Towards BERT-Size Vision-Language Document Retrieval Models},
url={https://huggingface.co/blog/ahmed-masry/colflor},
author={Masry, Ahmed},
month={October},
year={2024}
}
References
[1] Faysse, M., Sibille, H., Wu, T., Viaud, G., Hudelot, C. and Colombo, P., 2024. ColPali: Efficient Document Retrieval with Vision Language Models. arXiv preprint arXiv:2407.01449.
[2] Beyer, L., Steiner, A., Pinto, A.S., Kolesnikov, A., Wang, X., Salz, D., Neumann, M., Alabdulmohsin, I., Tschannen, M., Bugliarello, E. and Unterthiner, T., 2024. PaliGemma: A versatile 3B VLM for transfer. arXiv preprint arXiv:2407.07726.
[3] Xiao, B., Wu, H., Xu, W., Dai, X., Hu, H., Lu, Y., Zeng, M., Liu, C. and Yuan, L., 2024. Florence-2: Advancing a unified representation for a variety of vision tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4818-4829).
[4] Ding, M., Xiao, B., Codella, N., Luo, P., Wang, J. and Yuan, L., 2022, October. DaViT: Dual attention vision transformers. In European Conference on Computer Vision (pp. 74-92). Cham: Springer Nature Switzerland.
[5] Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V. and Zettlemoyer, L., 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
[6] Khattab, O. and Zaharia, M., 2020, July. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 39-48).