# FT-CLIP
**Repository Path**: HangbinZheng/FT-CLIP
## Basic Information
- **Project Name**: FT-CLIP
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-02-26
- **Last Updated**: 2024-02-26
## README
This repo is the official implementation of ["CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet"](https://arxiv.org/abs/2212.06138).
## Introduction
Recent studies have shown that CLIP achieves remarkable success in zero-shot inference, while its fine-tuning performance is considered unsatisfactory. In this paper, we identify that fine-tuning performance is significantly impacted by hyper-parameter choices. We examine various key hyper-parameters and empirically evaluate their impact on fine-tuning CLIP for classification tasks through a comprehensive study. We find that the fine-tuning performance of CLIP is substantially underestimated. Equipped with hyper-parameter refinement, we demonstrate that CLIP itself is better than, or at least competitive with, large-scale supervised pre-training approaches and the latest works that use CLIP as the prediction target in Masked Image Modeling. Specifically, CLIP ViT-Base/16 and CLIP ViT-Large/14 achieve 85.7% and 88.0% fine-tuning Top-1 accuracy, respectively, on the ImageNet-1K dataset. These observations challenge the conventional conclusion that CLIP is not suitable for fine-tuning, and motivate us to rethink recently proposed improvements based on CLIP.
## Results
Top-1 accuracy (%) on ImageNet-1K. The number in parentheses after each backbone is the input resolution.

| | ViT-Base/16 (224) | ViT-Base/16 (384) | ViT-Large/16 (384) | ViT-Large/14 (224) | ViT-Large/14 (336) |
|---|---|---|---|---|---|
| FLOPS | 17.5G | 55.4G | 190.7G | 80.7G | 190.6G |
| **Supervised Baseline** | | | | | |
| ImageNet-21K | 84.0 | 86.2 | 87.1 | ---- | ---- |
| JFT-300M | ---- | 86.7 | 88.0 | ---- | ---- |
| JFT-3B | ---- | 86.6 | 88.5 | ---- | ---- |
| **MIM with CLIP as prediction target** | | | | | |
| MVP | 84.4 | ---- | ---- | ---- | ---- |
| FD-CLIP | 84.9 | ---- | ---- | ---- | ---- |
| CAE-v2 | 85.3 | ---- | ---- | ---- | ---- |
| BEiT-2 | 85.5 | ---- | ---- | ---- | ---- |
| **Fine-tuning CLIP directly** | | | | | |
| FT-CLIP(ours) | 85.7 | 86.6 | ---- | 88.0 | 88.3 |
## Setup
[PyTorch](https://pytorch.org/), [Timm](https://github.com/rwightman/pytorch-image-models) and [DeepSpeed](https://github.com/microsoft/DeepSpeed) are required. Differences in CUDA version or GPU model may slightly affect the results.
```bash
pip install torch==1.10.2+cu113 torchvision==0.11.3+cu113 -f https://download.pytorch.org/whl/torch_stable.html
pip install --user timm==0.4.12
pip install --user deepspeed==0.4.0
```
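Since results can shift with the environment, it may help to confirm which versions are actually installed before launching a run. The snippet below is a minimal sketch using only the standard library (it assumes Python 3.8+ for `importlib.metadata`). The helper name `installed_versions` and the `PINNED` table are hypothetical and not part of this repo.

```python
from importlib import metadata

# Versions pinned in the install commands above
# (local build tags such as "+cu113" are left out on purpose).
PINNED = {
    "torch": "1.10.2",
    "torchvision": "0.11.3",
    "timm": "0.4.12",
    "deepspeed": "0.4.0",
}


def installed_versions(packages):
    """Map each package name to its installed version string, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, version in installed_versions(PINNED).items():
        # Drop a local build tag like "+cu113" before comparing with the pin.
        base = version.split("+")[0] if version else None
        status = "ok" if base == PINNED[name] else "differs from pin"
        print(f"{name}: installed={version} pinned={PINNED[name]} ({status})")
```

A mismatch is not necessarily a problem; it only flags where a run departs from the pinned environment.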
## Fine-tuning configs
The CLIP-Base/16 model can be fine-tuned on ImageNet-1k using 8 A100-40GB GPUs:
```bash
MODEL=CLIP_B16
OUTPUT_DIR=/path/to/save/your_model
DATA_PATH=/path/to/imagenet

echo "$OUTPUT_DIR"
mkdir -p "$OUTPUT_DIR"
# Keep a copy of this launch script next to the logs for reproducibility.
cp "$0" "$OUTPUT_DIR"

OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=8 run_class_finetuning.py \
    --model "${MODEL}" --data_path "$DATA_PATH" \
    --input_size 224 \
    --finetune True \
    --num_workers 8 \
    --output_dir "${OUTPUT_DIR}" \
    --batch_size 256 --lr 6e-4 --update_freq 1 \
    --warmup_epochs 10 --epochs 50 \
    --layer_decay 0.6 \
    --drop_path 0 \
    --dist_eval --eval_all --no_save_ckpt \
    --enable_deepspeed \
    --clip_mean_and_std \
    --layer_scale_init_value 0 \
    --abs_pos_emb --disable_rel_pos_bias \
    --weight_decay 0.05 --mixup 0 --cutmix 0 \
    --nb_classes 1000 --model_prefix visual. \
    --model_ema --model_ema_decay 0.9998 \
    2>&1 | tee -a "${OUTPUT_DIR}/log.txt"
```
- `--batch_size`: batch size per GPU.
- Effective batch size = `number of GPUs` * `--batch_size` * `--update_freq`. So in the above example, the effective batch size is `8*256*1 = 2048`.
- `--lr`: base learning rate.
- `--layer_decay`: layer-wise learning rate decay. The LR of the i-th layer is `lr * layer_decay ** i`.
- `--warmup_epochs`: learning rate warmup epochs.
- `--epochs`: total fine-tuning epochs.
- `--clip_mean_and_std`: normalize inputs with the CLIP mean and std instead of the ImageNet values.
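The arithmetic behind the batch-size and layer-decay options can be sketched in a few lines of Python. This is an illustration only, not code from this repo: the function names are hypothetical, and the layer ordering is an assumption. It takes `i` in `lr * layer_decay ** i` to count layers down from the classification head, as in BEiT-style layer decay, so the head trains at the full base LR and the embedding layer at the smallest one. Check the optimizer setup in `run_class_finetuning.py` before relying on this.

```python
def effective_batch_size(num_gpus, batch_size, update_freq):
    """Number of samples contributing to one optimizer step."""
    return num_gpus * batch_size * update_freq


def update_freq_for(target, num_gpus, batch_size):
    """Gradient-accumulation steps needed to reach a target effective batch size."""
    per_step = num_gpus * batch_size
    if target % per_step != 0:
        raise ValueError("target must be a multiple of num_gpus * batch_size")
    return target // per_step


def layer_lrs(base_lr, layer_decay, num_blocks):
    """Per-layer learning rates, indexed by layer id.

    ASSUMPTION (BEiT-style): layer id 0 is the embedding layer, ids
    1..num_blocks are the transformer blocks, and id num_blocks + 1 is the
    head. The exponent is the distance from the head.
    """
    top = num_blocks + 1
    return [base_lr * layer_decay ** (top - layer_id) for layer_id in range(top + 1)]


if __name__ == "__main__":
    # Settings from the example command: 8 GPUs, 256 per GPU, no accumulation.
    print(effective_batch_size(8, 256, 1))
    # With only 4 GPUs, accumulate gradients to keep the same effective batch.
    print(update_freq_for(2048, 4, 256))
    # ViT-Base has 12 transformer blocks; lr and decay from the example command.
    for layer_id, lr in enumerate(layer_lrs(6e-4, 0.6, 12)):
        print(layer_id, f"{lr:.3e}")
```

With a decay of 0.6, the learning rate shrinks quickly toward the input, so the lowest layers stay close to their pre-trained CLIP weights.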
See [scripts/](https://github.com/LightDXY/FT-CLIP/tree/main/scripts/) for more configs.
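To make the effect of `--clip_mean_and_std` concrete, the sketch below contrasts the two sets of per-channel statistics. The CLIP values are those used by the OpenAI CLIP preprocessing and the ImageNet values are the common defaults (as in timm). How this repo wires the flag internally is an assumption here, and `select_norm` and `normalize_pixel` are hypothetical helper names.

```python
# Per-channel RGB statistics for inputs scaled to [0, 1].
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def select_norm(clip_mean_and_std):
    """Return (mean, std) the way a flag like --clip_mean_and_std might choose them."""
    if clip_mean_and_std:
        return CLIP_MEAN, CLIP_STD
    return IMAGENET_MEAN, IMAGENET_STD


def normalize_pixel(rgb, mean, std):
    """Apply (x - mean) / std to a single RGB pixel with values in [0, 1]."""
    return tuple((c - m) / s for c, m, s in zip(rgb, mean, std))


if __name__ == "__main__":
    gray = (0.5, 0.5, 0.5)
    print("CLIP norm:    ", normalize_pixel(gray, *select_norm(True)))
    print("ImageNet norm:", normalize_pixel(gray, *select_norm(False)))
```

The two normalizations differ only slightly per channel, but matching the statistics the CLIP weights were pre-trained with keeps the input distribution consistent during fine-tuning.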
## Acknowledgments
This repository is modified from [BEiT](https://github.com/microsoft/unilm/tree/master/beit), built using the [timm](https://github.com/rwightman/pytorch-image-models) library, the [DeiT](https://github.com/facebookresearch/deit) repository and the [CLIP](https://github.com/openai/CLIP) repository. The CLIP model file is modified from [DeCLIP](https://github.com/Sense-GVT/DeCLIP).
## Citation
If you use this code for your research, please cite our paper.
```bibtex
@article{dong2022ftclip,
title={CLIP Itself is a Strong Fine-tuner: Achieving 85.7\% and 88.0\% Top-1 Accuracy with ViT-B and ViT-L on ImageNet},
author={Dong, Xiaoyi and Bao, Jianmin and Zhang, Ting and Chen, Dongdong and Gu, Shuyang and Zhang, Weiming and Yuan, Lu and Chen, Dong and Wen, Fang and Yu, Nenghai},
journal={arXiv preprint arXiv:2212.06138},
year={2022}
}
```