# FT-CLIP

This repo is the official implementation of ["CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet"](https://arxiv.org/abs/2212.06138).

## Introduction

Recent studies have shown that CLIP achieves remarkable success in zero-shot inference, while its fine-tuning performance is not satisfactory. In this paper, we identify that fine-tuning performance is significantly impacted by hyper-parameter choices. We examine various key hyper-parameters and empirically evaluate their impact on fine-tuning CLIP for classification tasks through a comprehensive study. We find that the fine-tuning performance of CLIP has been substantially underestimated. Equipped with hyper-parameter refinement, we demonstrate that CLIP itself is better than, or at least competitive with, large-scale supervised pre-training approaches and recent works that use CLIP as the prediction target in Masked Image Modeling. Specifically, CLIP ViT-Base/16 and CLIP ViT-Large/14 achieve 85.7% and 88.0% fine-tuning Top-1 accuracy on the ImageNet-1K dataset, respectively. These observations challenge the conventional conclusion that CLIP is not suitable for fine-tuning and motivate us to rethink the recently proposed improvements based on CLIP.

## Results
| Method | ViT-Base/16 @224 | ViT-Base/16 @384 | ViT-Large/16 @384 | ViT-Large/14 @224 | ViT-Large/14 @336 |
|---|---|---|---|---|---|
| FLOPS | 17.5G | 55.4G | 190.7G | 80.7G | 190.6G |
| **Supervised baseline** | | | | | |
| ImageNet-21K | 84.0 | 86.2 | 87.1 | ---- | ---- |
| JFT-300M | ---- | 86.7 | 88.0 | ---- | ---- |
| JFT-3B | ---- | 86.6 | 88.5 | ---- | ---- |
| **MIM with CLIP as prediction target** | | | | | |
| MVP | 84.4 | ---- | ---- | ---- | ---- |
| FD-CLIP | 84.9 | ---- | ---- | ---- | ---- |
| CAE-v2 | 85.3 | ---- | ---- | ---- | ---- |
| BEiT-2 | 85.5 | ---- | ---- | ---- | ---- |
| **Fine-tuning CLIP directly** | | | | | |
| FT-CLIP (ours) | 85.7 | 86.6 | ---- | 88.0 | 88.3 |
## Setup

[PyTorch](https://pytorch.org/), [Timm](https://github.com/rwightman/pytorch-image-models) and [DeepSpeed](https://github.com/microsoft/DeepSpeed) are needed. The CUDA version or GPU model may slightly influence the results.

```bash
pip install torch==1.10.2+cu113 torchvision==0.11.3+cu113 -f https://download.pytorch.org/whl/torch_stable.html
pip install --user timm==0.4.12
pip install --user deepspeed==0.4.0
```

## Fine-tuning configs

The CLIP-Base/16 model can be fine-tuned on ImageNet-1K using 8 A100-40GB GPUs:

```bash
MODEL=CLIP_B16
OUTPUT_DIR=/path/to/save/your_model
DATA_PATH=/path/to/imagenet

echo $OUTPUT_DIR
mkdir -p $OUTPUT_DIR
cp $0 $OUTPUT_DIR

OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=8 run_class_finetuning.py \
    --model ${MODEL} --data_path $DATA_PATH \
    --input_size 224 \
    --finetune True \
    --num_workers 8 \
    --output_dir ${OUTPUT_DIR} \
    --batch_size 256 --lr 6e-4 --update_freq 1 \
    --warmup_epochs 10 --epochs 50 \
    --layer_decay 0.6 \
    --drop_path 0 \
    --dist_eval --eval_all --no_save_ckpt \
    --enable_deepspeed \
    --clip_mean_and_std \
    --layer_scale_init_value 0 \
    --abs_pos_emb --disable_rel_pos_bias \
    --weight_decay 0.05 --mixup 0 --cutmix 0 \
    --nb_classes 1000 --model_prefix visual. \
    --model_ema --model_ema_decay 0.9998 \
    2>&1 | tee -a ${OUTPUT_DIR}/log.txt
```

- `--batch_size`: batch size per GPU.
- Effective batch size = `number of GPUs` * `--batch_size` * `--update_freq`. So in the above example, the effective batch size is `8*256*1 = 2048`.
- `--lr`: base learning rate.
- `--layer_decay`: layer-wise learning rate decay. The LR of the i-th layer is `lr * layer_decay ** i` (see the short numeric sketch at the end of this README).
- `--warmup_epochs`: learning rate warmup epochs.
- `--epochs`: total fine-tuning epochs.
- `--clip_mean_and_std`: use the CLIP normalization mean and std instead of the ImageNet ones.

See [scripts/](https://github.com/LightDXY/FT-CLIP/tree/main/scripts/) for more configs.

## Acknowledgments

This repository is modified from [BEiT](https://github.com/microsoft/unilm/tree/master/beit), built using the [timm](https://github.com/rwightman/pytorch-image-models) library, the [DeiT](https://github.com/facebookresearch/deit) repository and the [CLIP](https://github.com/openai/CLIP) repository. The CLIP model file is modified from [DeCLIP](https://github.com/Sense-GVT/DeCLIP).

## Citation

If you use this code for your research, please cite our paper.

```
@article{dong2022ftclip,
  title={CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet},
  author={Dong, Xiaoyi and Bao, Jianmin and Zhang, Ting and Chen, Dongdong and Gu, Shuyang and Zhang, Weiming and Yuan, Lu and Chen, Dong and Wen, Fang and Yu, Nenghai},
  journal={arXiv preprint arXiv:2212.06138},
  year={2022}
}
```
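The `--layer_decay` rule quoted in the fine-tuning options above is easy to check numerically. The snippet below is a minimal sketch, not the repository's implementation: it assumes `i = 0` for the classification head, larger `i` for blocks closer to the patch embedding, and a 12-block ViT-Base/16 encoder, and simply evaluates `lr * layer_decay ** i` for the flags used in the example command.

```python
# A minimal sketch of the layer-wise learning-rate decay rule described above
# (LR of the i-th layer = lr * layer_decay ** i). The indexing convention
# (i = 0 for the classification head, larger i toward the patch embedding)
# and the 12-block depth are illustrative assumptions, not code taken from
# run_class_finetuning.py.

def layerwise_lrs(base_lr, layer_decay, num_layers):
    """Per-layer learning rates, ordered from the head (i = 0) down to the
    patch embedding (i = num_layers)."""
    return [base_lr * (layer_decay ** i) for i in range(num_layers + 1)]


if __name__ == "__main__":
    # Values from the example command: --lr 6e-4 and --layer_decay 0.6,
    # with 12 transformer blocks (ViT-Base/16).
    for i, lr in enumerate(layerwise_lrs(6e-4, 0.6, num_layers=12)):
        print(f"i = {i:2d}: lr = {lr:.2e}")
```

With `--lr 6e-4` and `--layer_decay 0.6`, the blocks nearest the input end up with learning rates more than two orders of magnitude smaller than the head, which is what makes such a large base LR usable when fine-tuning a pre-trained CLIP encoder.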