CLIP Model based on DistilBERT and ViT

This repository contains a CLIP (Contrastive Language-Image Pretraining) model built from two pretrained encoders:

  • DistilBERT (based on distilbert-base-uncased): A smaller, faster, and lighter version of BERT.
  • Vision Transformer (ViT) (based on google/vit-base-patch16-224): A powerful vision transformer architecture for image processing.

The model is trained to learn joint representations of images and text, enabling a variety of multimodal tasks such as image-text matching, zero-shot classification, and cross-modal retrieval.
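As a rough illustration of how zero-shot classification works once images and texts share an embedding space, the sketch below scores one image embedding against several label-prompt embeddings with cosine similarity and turns the scores into probabilities. The vectors, the prompts, the function names, and the temperature of 0.07 are all assumptions made for the example. They are not taken from this repository, and real embeddings would come from the two encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, label_embeddings, temperature=0.07):
    """Return a probability per label prompt for one image embedding.

    temperature is an assumed value, not a setting read from this model.
    """
    labels = list(label_embeddings)
    scaled = [
        cosine_similarity(image_embedding, label_embeddings[label]) / temperature
        for label in labels
    ]
    return dict(zip(labels, softmax(scaled)))


# Toy 3-dimensional embeddings (hypothetical, for illustration only).
image_embedding = [0.9, 0.1, 0.2]
label_embeddings = {
    "a photo of a dress": [1.0, 0.0, 0.1],
    "a photo of a shoe": [0.0, 1.0, 0.3],
}

probs = zero_shot_classify(image_embedding, label_embeddings)
best_label = max(probs, key=probs.get)
print(best_label)
```

Cross-modal retrieval follows the same pattern: rank a collection of image embeddings by their similarity to a single text embedding, or the reverse.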

Model Overview

CLIP combines a text encoder and an image encoder that map texts and images into a shared embedding space. Once trained on a large number of image-text pairs, the model can perform various downstream tasks without task-specific fine-tuning.
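The training objective is not spelled out here. CLIP-style models are commonly trained with a symmetric contrastive loss that pulls matching image-text pairs together and pushes mismatched pairs apart. The following is a minimal sketch of that loss in plain Python, assuming unit-length embeddings, that row i of each batch forms a matching pair, and a temperature of 0.07. The function names are hypothetical and this is not necessarily the loss used to train this model.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    top = max(row)
    log_sum = top + math.log(sum(math.exp(x - top) for x in row))
    return [x - log_sum for x in row]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix.

    Assumes both inputs are lists of unit-length vectors and that
    image_embs[i] and text_embs[i] describe the same item.
    """
    n = len(image_embs)
    # logits[i][j] = similarity of image i and text j, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_embs]
        for img in image_embs
    ]
    # Image-to-text direction: each row should pick its own column.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: each column should pick its own row.
    columns = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


# Toy batch of two pairs (hypothetical vectors).
images = [[1.0, 0.0], [0.0, 1.0]]
texts_aligned = [[1.0, 0.0], [0.0, 1.0]]
texts_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = clip_contrastive_loss(images, texts_aligned)
loss_swapped = clip_contrastive_loss(images, texts_swapped)
print(loss_aligned < loss_swapped)
```

The loss is low when each image is most similar to its own caption and high when the pairs are swapped, which is the signal that shapes the shared embedding space.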

Components:

  • Text Encoder: distilbert-base-uncased is used to encode the textual input into a dense vector.
  • Image Encoder: google/vit-base-patch16-224 processes image data by dividing images into patches and learning their contextual relationships.
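To make the patch splitting concrete, here is a small sketch of the arithmetic implied by the checkpoint name: a 224x224 input cut into 16x16 patches. The helper name is hypothetical, and the extra classification token is an assumption based on the standard ViT design.

```python
def vit_sequence_length(image_size=224, patch_size=16, class_token=True):
    """Number of tokens a ViT encoder sees for one square image.

    Defaults mirror the name google/vit-base-patch16-224; the class token
    is assumed, following the standard ViT design.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side * patches_per_side
    return num_patches + (1 if class_token else 0)


num_patches = vit_sequence_length(class_token=False)
sequence_length = vit_sequence_length()
print(num_patches, sequence_length)
```

With these defaults each image becomes a 14 by 14 grid of patches, and the transformer attends over that sequence to relate the patches to one another.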

Future work:

Train on larger datasets and with more compute resources.
