A comprehensive multimodal retrieval system built with Hugging Face's CLIP model for cross-modal search between images and text, featuring LoRA fine-tuning capabilities for domain-specific performance enhancement and knowledge distillation for model compression.
This project implements a production-ready multimodal retrieval system that supports:
- Text-to-Image Retrieval: Find relevant images based on text descriptions
- Image-to-Text Retrieval: Find relevant text descriptions based on images
- Performance Evaluation: Compute Recall@1 and Recall@5 metrics
- Embedding Analysis: Analyze embedding distributions and similarity patterns
- Interactive CLI: Command-line interface for real-time searching
- LoRA Fine-tuning: Low-rank adaptation for domain-specific model improvement
- Knowledge Distillation: Compress large teacher models to smaller, efficient student models
- Hard Negative Training: Enhanced training with hard negative mining for improved discrimination
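The Recall@K metrics listed above can be illustrated with a small NumPy sketch. It assumes one ground-truth match per query, located at the same index as the query; the function name `recall_at_k` is illustrative and not necessarily what `evaluation/metrics.py` uses.

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """Fraction of queries whose matching item appears in the top-k results.

    similarity[i, j] is the score between query i and candidate j.
    Assumption: the ground-truth match for query i is candidate i.
    """
    # Indices of the k highest-scoring candidates for each query.
    topk = np.argsort(-similarity, axis=1)[:, :k]
    targets = np.arange(similarity.shape[0])[:, None]
    hits = (topk == targets).any(axis=1)
    return float(hits.mean())

# Toy scores: queries 0 and 2 rank their match first, query 1 ranks it second.
sim = np.array([
    [0.9, 0.1, 0.2],
    [0.8, 0.7, 0.1],
    [0.1, 0.2, 0.6],
])
print(recall_at_k(sim, 1))
print(recall_at_k(sim, 2))
```

On this toy matrix, two of three queries are hits at K=1 and all three are hits at K=2.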
The system follows a modular architecture:
project/
├── config.py # Configuration management (includes LoRA & distillation config)
├── run_pipeline.py # Main pipeline orchestrator (supports LoRA option)
├── dataset/ # Data loading and preprocessing
│ ├── dataset.py
│ └── preprocessing.py
├── models/ # CLIP model wrapper (with LoRA support)
│ └── clip_model.py
├── training/ # Training scripts
│ ├── lora_finetune.py # LoRA fine-tuning script
│ ├── train_hard_negative.py # Hard negative training script
│ ├── trainer.py # Original LoRA trainer
│ ├── hard_negative_trainer.py # Hard negative trainer
│ ├── contrastive_loss.py # Original contrastive loss
│ ├── hard_negative_loss.py # Hard negative loss functions
│ └── model_utils.py
├── compression/ # Knowledge distillation module
│ ├── distill_loss.py # Distillation loss functions
│ ├── distill_trainer.py # Distillation trainer class
│ ├── distill_utils.py # Distillation utility functions
│ ├── run_distill.py # Distillation training script
│ └── README.md # Documentation for distillation module
├── embedding/ # Embedding computation and caching
│ └── build_embeddings.py
├── index/ # FAISS indexing and search
│ ├── build_index.py
│ └── search.py
├── evaluation/ # Metrics computation
│ └── metrics.py
├── analysis/ # Embedding analysis and visualization
│ └── embedding_analysis.py
├── demo/ # Interactive CLI (with LoRA support)
│ └── search_cli.py
├── utils/ # Utility functions
│ ├── logger.py
│ └── seed.py
└── requirements.txt # Dependencies
pip install -r requirements.txt
Key configurations are managed in config.py:
- DataConfig: Paths to data files and split ratios
- ModelConfig: Model name, batch size, and device settings
- TrainingConfig: Training parameters
- LoRAConfig: LoRA-specific settings (use_lora, lora_path, etc.)
- DistillationConfig: Knowledge distillation settings (teacher_model, student_model, distill_types, lambda weights, etc.)
- EmbeddingConfig: Cache directories and file paths
- IndexConfig: FAISS index settings
- EvaluationConfig: Metrics to compute
- AnalysisConfig: Visualization settings
Command-line arguments take precedence over configuration file values.
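One common way to get this precedence is to parse flags with a default of `None` and overlay only the flags that were actually given. The sketch below is an assumption about how that could look, not the project's code: the `LoRAConfig` fields `use_lora` and `lora_path` come from the list above, while the default values and the `resolve_config` helper are hypothetical.

```python
import argparse
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class LoRAConfig:
    # Hypothetical defaults; the real values live in config.py.
    use_lora: bool = False
    lora_path: str = "./outputs/lora_adapter"

def resolve_config(defaults: LoRAConfig, argv: list) -> LoRAConfig:
    """Return the config with any explicitly passed CLI flags applied on top."""
    parser = argparse.ArgumentParser()
    # default=None lets us tell "flag not given" apart from an explicit value.
    parser.add_argument("--use_lora", action="store_true", default=None)
    parser.add_argument("--lora_path", default=None)
    args = parser.parse_args(argv)
    overrides = {k: v for k, v in vars(args).items() if v is not None}
    return replace(defaults, **overrides)

cfg = resolve_config(LoRAConfig(), ["--use_lora", "--lora_path", "./my_lora_adapter"])
print(cfg)
```

Flags that are omitted leave the corresponding config value untouched, so `resolve_config(LoRAConfig(), [])` returns the defaults unchanged.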
The system expects a CSV file with the following columns:
image,description,display name,category
image001.jpg,"A beautiful mountain landscape","Mountain View","Nature"
image002.jpg,"Red sports car racing","Speed Demon","Vehicle"
...
Configure the data path in config.py.
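A minimal sketch of reading and validating a file in this format with the standard library is shown below. The `load_records` helper and `REQUIRED_COLUMNS` constant are hypothetical names for illustration; the project's own loader in `dataset/dataset.py` may work differently. Note that the `display name` column contains a space.

```python
import csv
import io

REQUIRED_COLUMNS = ("image", "description", "display name", "category")

def load_records(handle) -> list:
    """Read rows as dicts and fail early if an expected column is missing."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"CSV is missing columns: {missing}")
    return list(reader)

# In-memory stand-in for the CSV file, using the two example rows above.
sample = io.StringIO(
    'image,description,display name,category\n'
    'image001.jpg,"A beautiful mountain landscape","Mountain View","Nature"\n'
    'image002.jpg,"Red sports car racing","Speed Demon","Vehicle"\n'
)
records = load_records(sample)
print(records[0]["display name"])
```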
As a quick start example, you can use the product image dataset from Kaggle: Mini Product Image and Text Dataset
Search for images based on text query:
# Basic search
python demo/search_cli.py --query "a beautiful landscape with mountains" --topk 5
# Search with LoRA adapter
python demo/search_cli.py --query "a beautiful landscape with mountains" --use_lora --topk 5
# Image-to-image search
python demo/search_cli.py --image_query /path/to/image.jpg --use_lora --topk 5
Run the complete pipeline (build index, evaluate, analyze):
# With base CLIP model
python run_pipeline.py --task full_pipeline
# With LoRA fine-tuned model
python run_pipeline.py --task full_pipeline --use_lora
# With specific LoRA adapter
python run_pipeline.py --task full_pipeline --use_lora --lora_path ./path/to/lora/adapter
Run specific tasks:
# Build FAISS index with LoRA
python run_pipeline.py --task build_index --use_lora
# Evaluate model performance with LoRA
python run_pipeline.py --task evaluate --use_lora
# Analyze embeddings with LoRA
python run_pipeline.py --task analyze --use_lora
# Run demo search with LoRA
python run_pipeline.py --task demo --use_lora
Fine-tune the model for your specific domain using LoRA:
# Basic LoRA fine-tuning
python training/lora_finetune.py
# Custom fine-tuning parameters
python training/lora_finetune.py \
--epochs 10 \
--lr 5e-5 \
--batch_size 16 \
--save_path ./my_lora_adapter
Enhance your model with hard negative mining to improve fine-grained discrimination:
# Basic hard negative training
python training/train_hard_negative.py
# Custom hard negative training parameters
python training/train_hard_negative.py \
--epochs 10 \
--lr 5e-5 \
--batch_size 16 \
--lambda_hard 0.5 \
--hard_negative_type batch \
--save_path ./my_hard_negative_lora_adapter
# Use category-level hard negatives for fine-grained learning
python training/train_hard_negative.py \
--epochs 10 \
--lr 5e-5 \
--lambda_hard 0.3 \
--hard_negative_type category \
--save_path ./my_category_hard_negative_adapter
# Load existing checkpoint and continue training with hard negatives
python training/train_hard_negative.py \
--load_checkpoint ./outputs/lora_adapter_l14_r16/checkpoint_epoch_30 \
--epochs 20 \
--lambda_hard 0.2 \
--save_path ./continued_hard_negative_training
The hard negative training offers two approaches:
- Batch-level hard negatives: Mines the most challenging negative samples within each training batch
- Category-level hard negatives: Uses same-category samples as hard negatives to improve fine-grained discrimination
Compress large teacher models to efficient student models with multiple distillation strategies:
# Basic knowledge distillation
python compression/run_distill.py --data_path path/to/your/data.csv
# Using LoRA-fine-tuned teacher model
python compression/run_distill.py \
--teacher_model openai/clip-vit-large-patch14 \
--teacher_use_lora \
--teacher_lora_path ./outputs/lora_adapter \
--student_model openai/clip-vit-base-patch32 \
--epochs 10 \
--lr 1e-5 \
--batch_size 32 \
--lambda_embed 0.3 \
--lambda_similarity 0.3 \
--lambda_logits 0.4 \
--temperature 1.0 \
--output_dir ./outputs/distilled_model \
--data_path data/cleaned_data.csv
# Custom distillation parameters with selective methods
python compression/run_distill.py \
--teacher_model openai/clip-vit-large-patch14 \
--student_model openai/clip-vit-base-patch32 \
--distill_types embedding similarity logits \
--epochs 10 \
--lr 1e-5 \
--batch_size 32 \
--lambda_embed 0.3 \
--lambda_similarity 0.3 \
--lambda_logits 0.4 \
--temperature 2.0 \
--output_dir ./outputs/distilled_model \
--data_path data/cleaned_data.csv
# Only use logits distillation
python compression/run_distill.py \
--distill_types logits \
--lambda_logits 1.0 \
--data_path data/cleaned_data.csv
The --distill_types argument accepts one or more of: embedding, similarity, logits. The system automatically handles dimension mismatches between teacher and student models.
Hard negative training enhances the model's ability to distinguish between similar but non-matching samples. Two strategies are implemented:
- Batch-level Hard Negatives: Identifies the most difficult negative samples within each training batch by selecting the highest similarity scores among incorrect pairs
- Category-level Hard Negatives: Uses samples from the same category as hard negatives, forcing the model to learn fine-grained distinctions between similar items
- Composite Loss Function: Combines traditional CLIP contrastive loss with hard negative loss: Total Loss = CLIP Loss + λ × Hard Negative Loss
- Flexible Configuration: Adjustable lambda_hard parameter controls the influence of hard negative loss, and hard_negative_type selects between batch or category strategies
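The composite objective (CLIP loss plus λ times a hard negative loss) can be sketched in NumPy for the batch-level strategy. The exact form of the hard negative term is not specified here, so this sketch assumes a margin-based hinge against the highest-scoring incorrect pair in each row; the `margin` and `temperature` values and all function names are illustrative assumptions, not the contents of `training/hard_negative_loss.py`.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(sim, temperature=0.07):
    """Symmetric contrastive loss; matching pairs sit on the diagonal of sim."""
    logits = sim / temperature
    diag = np.arange(sim.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)

def batch_hard_negative_loss(sim, margin=0.2):
    """Assumed hinge form: penalize rows where the hardest negative
    comes within `margin` of the positive score."""
    n = sim.shape[0]
    pos = np.diag(sim)
    # Mask out the positives so the row maximum is the hardest negative.
    masked = np.where(np.eye(n, dtype=bool), -np.inf, sim)
    hardest = masked.max(axis=1)
    return np.maximum(0.0, margin + hardest - pos).mean()

def total_loss(sim, lambda_hard=0.5):
    # Total Loss = CLIP Loss + lambda * Hard Negative Loss
    return clip_loss(sim) + lambda_hard * batch_hard_negative_loss(sim)

sim = np.array([[0.50, 0.45],
                [0.10, 0.90]])
print(batch_hard_negative_loss(sim), total_loss(sim))
```

In the toy matrix only the first row contributes to the hard negative term, because its negative (0.45) is within the margin of its positive (0.50). A category-level variant would restrict the masked maximum to samples sharing the anchor's category.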
The knowledge distillation module allows compressing large teacher models (e.g., CLIP ViT-L/14) into efficient student models (e.g., CLIP ViT-B/32) while preserving performance. Our enhanced implementation includes multiple distillation strategies:
- Teacher Model: Frozen, used only for inference during training
- Student Model: Trained using combined objectives
- Three Distillation Methods:
- Embedding Distillation: Aligns teacher and student embeddings using MSE loss
- Similarity Distillation: Aligns similarity matrices using KL divergence with temperature scaling
- Logits Distillation: Directly compares final similarity scores using KL divergence
- Automatic Dimension Alignment: Handles dimension mismatches between teacher and student models with learnable projection layers
- Configurable Loss Weights: Independent lambda weights for each distillation method (lambda_embed, lambda_similarity, lambda_logits)
- Supports: GPU acceleration, checkpoint saving, LoRA-integrated teachers, and resume training
The system automatically handles cases where teacher and student models have different embedding dimensions (e.g., 768-dim ViT-L/14 teacher vs 512-dim ViT-B/32 student) by creating learnable linear projection layers.
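To make the dimension alignment concrete, the following NumPy sketch combines an embedding loss (MSE after a linear projection) with a similarity loss (KL divergence with temperature scaling). It is a sketch under stated assumptions: the projection maps student embeddings up to the teacher dimension, the KL direction is teacher-to-student, the projection matrix is random here rather than learned, and the function names are hypothetical. Logits distillation is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def softmax(x, axis=1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def embedding_loss(student, teacher, projection):
    """MSE between teacher embeddings and projected student embeddings."""
    return float(np.mean((student @ projection - teacher) ** 2))

def similarity_kl_loss(student_sim, teacher_sim, temperature=2.0):
    """KL(teacher || student) over row-wise softmax of the similarity matrices."""
    p = softmax(teacher_sim / temperature)
    q = softmax(student_sim / temperature)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=1)))

batch = 4
teacher_img, teacher_txt = (l2_normalize(rng.normal(size=(batch, 768))) for _ in range(2))
student_img, student_txt = (l2_normalize(rng.normal(size=(batch, 512))) for _ in range(2))
# Stand-in for the learnable 512 -> 768 projection layer (assumed direction).
projection = rng.normal(scale=0.02, size=(512, 768))

lambda_embed, lambda_similarity = 0.3, 0.3
loss = (
    lambda_embed * embedding_loss(student_img, teacher_img, projection)
    + lambda_similarity
    * similarity_kl_loss(student_img @ student_txt.T, teacher_img @ teacher_txt.T)
)
print(loss)
```

The similarity term needs no projection because both similarity matrices are batch-by-batch regardless of embedding width; only the embedding term depends on the projection layer.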
LoRA (Low-Rank Adaptation) enables efficient fine-tuning of large models:
- Target Modules: q_proj, k_proj, v_proj, out_proj layers
- Rank (r): Default 16
- Alpha: Default 32
- Dropout: Default 0.1
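Because only the small low-rank adapter matrices are trained while the base weights stay frozen, this approach significantly reduces the number of trainable parameters while maintaining strong performance.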
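The parameter savings can be checked with a small NumPy sketch of a single LoRA-adapted linear layer. The rank and alpha are the defaults listed above; the 768-wide square layer is an assumed size for illustration, dropout is omitted, and `lora_forward` is a hypothetical helper rather than the project's or any library's API.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=32, r=16):
    """y = x W^T + (alpha / r) * x A^T B^T, with W frozen and only A, B trainable."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

d_in = d_out = 768   # assumed layer size for illustration
r, alpha = 16, 32    # defaults listed above

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-initialized

x = rng.normal(size=(2, d_in))
# With B = 0 the adapted layer starts out identical to the frozen layer.
assert np.allclose(lora_forward(x, W, A, B, alpha, r), x @ W.T)

full_params = d_out * d_in
lora_params = r * (d_in + d_out)
print(full_params, lora_params, f"{lora_params / full_params:.1%}")
```

For this layer size the adapter trains 24,576 parameters against 589,824 in the full weight matrix, roughly 4.2%.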
- Embeddings are L2 normalized for efficient cosine similarity computation
- FAISS IndexFlatIP is used for exact similarity search
- Caching prevents redundant embedding computations
- Batch processing optimizes inference speed
- LoRA integration allows domain-specific model improvements without full retraining
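The first two points can be illustrated without FAISS: once embeddings are L2-normalized, the inner product equals cosine similarity, so an exact inner-product search returns cosine-ranked results. The sketch below is a brute-force NumPy stand-in for what a flat inner-product index computes; the `search` helper and the toy vectors are illustrative only.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def search(index_vectors, queries, topk):
    """Exact inner-product search over all indexed vectors (brute force)."""
    scores = queries @ index_vectors.T
    order = np.argsort(-scores, axis=1)[:, :topk]
    return np.take_along_axis(scores, order, axis=1), order

# Toy 2-D "embeddings"; vector lengths differ before normalization.
images = l2_normalize(np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]))
query = l2_normalize(np.array([[2.0, 1.9]]))

scores, ids = search(images, query, topk=2)
print(ids, scores)
```

The query points almost along the diagonal, so the third vector ranks first even though it had the largest raw norm before normalization; after normalization only direction matters.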
This project is open-source and available under the MIT license.