Awesome Zero-shot VLM
Date:
Tags: vlm, foundation-model, multi-modal, zero-shot, image_classification
Reference paper: Vision-Language Models for Vision Tasks: A Survey (TPAMI 2024) (arXiv, GitHub)
The survey above covers papers roughly up to 2023, so follow-up work should be collected here as well. The approach is top-down: first summarize each method's performance in tables and charts, then fill in the details of individual methods later.
Zero-shot image classification performance table (top-1 accuracy, %)
| Methods | Image encoder | Text encoder | Data Size | ImageNet-1k | CIFAR-10 | CIFAR-100 | Food101 | SUN397 | Cars | Aircraft | DTD | Pets | Caltech101 | Flowers102 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP [1] | ViT-L/14 | Transformer | 400M | 76.2 | 95.7 | 77.5 | 93.8 | 68.4 | 78.8 | 37.2 | 55.7 | 93.5 | 92.8 | 78.3 |
| DeCLIP [2] | RegNetY | BERT | 88M | 73.7 | - | - | - | - | - | - | - | - | - | - |
| FILIP [3] | ViT-L/14 | Transformer | 340M | 77.1 | 95.7 | 75.3 | 92.2 | 73.1 | 70.8 | 60.2 | 60.7 | 92.0 | 93.0 | 90.1 |
| Florence [4] | CoSwin | RoBERTa | 900M | 83.7 | 94.6 | 77.6 | 95.1 | 77.0 | 93.2 | 55.5 | 66.4 | 95.9 | 94.7 | 86.2 |
| SLIP [5] | ViT-L | Transformer | 15M | 47.9 | 87.5 | 54.2 | 69.2 | 56.0 | 9.0 | 9.5 | 29.9 | 41.6 | 80.9 | 60.2 |
| PyramidCLIP [6] | ResNet50 | T5 | 143M | 47.8 | 81.5 | 53.7 | 67.8 | 65.8 | 65.0 | 12.6 | 47.2 | 83.7 | 81.7 | 65.8 |
| Chinese CLIP [7] | ViT-L/14 | Chinese RoBERTa | 200M | - | 96.0 | 79.7 | - | - | - | 26.2 | 51.2 | - | - | - |
| LiT [8] | ViT-g/14 | - | 4B | 85.2 | - | - | - | - | - | - | - | - | - | - |
| KELIP [9] | ViT-B/32 | Transformer | 1.1B | 62.6 | 91.5 | 68.6 | 79.5 | - | 75.4 | - | 51.2 | - | - | - |
| nCLIP [10] | ViT-B/16 | Transformer | 35M | 48.8 | 83.4 | 54.5 | 65.8 | 59.9 | 18.0 | 5.8 | 57.1 | 33.2 | 73.9 | 50.0 |
| NLIP [11] | ViT-B/16 | BART | 26M | 47.4 | 81.9 | 47.5 | 59.2 | 58.7 | 7.8 | 7.5 | 32.9 | 39.2 | 79.5 | 54.0 |
| UniCLIP [12] | ViT-B/32 | Transformer | 30M | 54.2 | 87.8 | 56.5 | 64.6 | 61.1 | 19.5 | 4.7 | 36.6 | 69.2 | 84.0 | 8.0 |
| RA-CLIP [13] | ViT-B/32 | BERT | 15M | 53.5 | 89.4 | 62.3 | 43.8 | 46.5 | - | - | 25.6 | - | 76.9 | - |
| LA-CLIP [14] | ViT-B/32 | Transformer | 400M | 64.4 | 92.4 | 73.0 | 79.7 | 64.9 | 81.9 | 20.8 | 55.4 | 87.2 | 91.8 | 70.3 |
| ALIP [15] | ViT-B/32 | Transformer | 15M | 40.3 | 83.8 | 51.9 | 45.4 | 47.8 | 3.4 | 2.7 | 23.2 | 30.7 | 74.1 | 54.8 |
| GrowCLIP [16] | ViT-B/16 | Transformer | 12M | 36.1 | 60.7 | 28.3 | 42.5 | 45.5 | - | - | 17.3 | - | 71.9 | 23.3 |
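For reference, all of these numbers come from the same zero-shot protocol: each class name is wrapped in a text prompt, embedded by the text encoder, and the image is assigned to the class whose text embedding is most similar, so the text embeddings act as a classifier head built purely from class names. Below is a minimal sketch of this, assuming the Hugging Face `transformers` checkpoint `openai/clip-vit-large-patch14` (the same backbone as row [1]); the label set and image path are placeholders, and the papers' reported numbers additionally use an ensemble of many prompt templates.

```python
# Minimal sketch of CLIP-style zero-shot classification (assumption: the
# Hugging Face `transformers` checkpoint "openai/clip-vit-large-patch14").
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

class_names = ["cat", "dog", "car"]                    # hypothetical label set
prompts = [f"a photo of a {c}" for c in class_names]   # single prompt template

image = Image.open("example.jpg")                      # hypothetical input image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**inputs)

# logits_per_image holds temperature-scaled cosine similarities between the
# image embedding and each prompt embedding; softmax turns them into a
# distribution over the label set.
probs = out.logits_per_image.softmax(dim=-1)[0]
print(class_names[probs.argmax().item()], probs.tolist())
```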
Performance comparison chart
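The original page embeds an interactive chart (models toggled on/off by clicking their names), which does not survive in plain text. A minimal matplotlib sketch that produces a static stand-in from the ImageNet-1k column of the table above (Chinese CLIP [7] is skipped since it reports no ImageNet-1k number):

```python
# Minimal sketch of a static version of the comparison chart, using the
# ImageNet-1k zero-shot top-1 numbers copied from the table above.
import matplotlib.pyplot as plt

methods = ["CLIP", "DeCLIP", "FILIP", "Florence", "SLIP", "PyramidCLIP",
           "LiT", "KELIP", "nCLIP", "NLIP", "UniCLIP", "RA-CLIP",
           "LA-CLIP", "ALIP", "GrowCLIP"]
imagenet_top1 = [76.2, 73.7, 77.1, 83.7, 47.9, 47.8,
                 85.2, 62.6, 48.8, 47.4, 54.2, 53.5,
                 64.4, 40.3, 36.1]

fig, ax = plt.subplots(figsize=(10, 4))
ax.bar(methods, imagenet_top1)
ax.set_ylabel("ImageNet-1k zero-shot top-1 (%)")
ax.set_title("Zero-shot ImageNet-1k accuracy by method")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
```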
Datasets
| Name | Year | Num. of Image-Text Pairs | Language | Public |
|---|---|---|---|---|
| SBU Caption [Link] | 2011 | 1M | English | ✓ |
| COCO Caption [Link] | 2016 | 1.5M | English | ✓ |
| Yahoo Flickr Creative Commons 100 Million (YFCC100M) [Link] | 2016 | 100M | English | ✓ |
| Visual Genome (VG) [Link] | 2017 | 5.4M | English | ✓ |
| Conceptual Captions (CC3M) [Link] | 2018 | 3.3M | English | ✓ |
| Localized Narratives (LN) [Link] | 2020 | 0.87M | English | ✓ |
| Conceptual 12M (CC12M) [Link] | 2021 | 12M | English | ✓ |
| Wikipedia-based Image Text (WIT) [Link] | 2021 | 37.6M | 108 Languages | ✓ |
| RedCaps (RC) [Link] | 2021 | 12M | English | ✓ |
| LAION400M [Link] | 2021 | 400M | English | ✓ |
| LAION5B [Link] | 2022 | 5B | Over 100 Languages | ✓ |
| WuKong [Link] | 2022 | 100M | Chinese | ✓ |
| CLIP | 2021 | 400M | English | ✗ |
| ALIGN | 2021 | 1.8B | English | ✗ |
| FILIP | 2021 | 300M | English | ✗ |
| WebLI | 2022 | 12B | 109 Languages | ✗ |
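During pretraining, all of these corpora are consumed the same way: as (image, caption) pairs feeding a contrastive objective. A minimal sketch of iterating one of the public ones (COCO Caption) via `torchvision`; the local paths are placeholder assumptions, and `pycocotools` must be installed:

```python
# Minimal sketch of reading image-text pairs from COCO Caption with
# torchvision (requires pycocotools; paths below are placeholders).
import torchvision.datasets as datasets
import torchvision.transforms as T

transform = T.Compose([T.Resize(224), T.CenterCrop(224), T.ToTensor()])
coco = datasets.CocoCaptions(
    root="coco/train2017",                               # image directory
    annFile="coco/annotations/captions_train2017.json",  # caption annotations
    transform=transform,
)

image, captions = coco[0]   # one image tensor and its reference captions
print(image.shape, len(captions), captions[0])
```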
References
- [1] CLIP: [ICML 2021] [Code] [Data: CLIP]
- [2] DeCLIP: [ICLR 2022] [Code] [Data: CC3M, CC12M, YFCC100M, WIT]
- [3] FILIP: [ICLR 2022] [Data: FILIP, CC3M, CC12M, YFCC100M]
- [4] Florence: [arXiv 2021] [Data: FLD-900M]
- [5] SLIP: [ECCV 2022] [arXiv] [Code] [Data: YFCC100M]
- [6] PyramidCLIP: [NeurIPS 2022] [Data: SBU, CC3M, CC12M, YFCC100M, LAION400M]
- [7] Chinese CLIP: [arXiv 2022] [Code] [Data: LAION5B, WuKong, VG, COCO]
- [8] LiT: [CVPR 2022] [Project] [Data: CC12M, YFCC100M, WIT]
- [9] KELIP: [ICLRW 2022] [Code] [Data: CUB200, WIT, YFCC15M, CC3M, CC12M, LAION400M, K-WIT]
- [10] nCLIP: [CVPR 2023] [Data: COCO, VG, SBU, CC3M, CC12M, YFCC14M]
- [11] NLIP: [AAAI 2023] [Data: YFCC100M, COCO]
- [12] UniCLIP: [NeurIPS 2022] [Data: CC3M, CC12M, YFCC100M]
- [13] RA-CLIP: [CVPR 2023] [Data: YFCC100M]
- [14] LA-CLIP: [NeurIPS 2023] [Code] [Data: CC3M, CC12M, RC, LAION400M]
- [15] ALIP: [ICCV 2023] [Code] [Data: YFCC100M]
- [16] GrowCLIP: [ICCV 2023] [Data: CC12M]