Awesome Zero-shot VLM

Date:

Tags: vlm, foundation-model, multi-modal, zero-shot, image_classification


Reference paper: Vision-Language Models for Vision Tasks: A Survey (TPAMI 2024) (arXiv, GitHub)

The survey above covers papers up to roughly 2023, so follow-up work should be collected here as well. The plan is top-down: first summarize each method's performance in tables and charts, then fill in the per-method details later.

Zero-shot image classification performance table (top-1 accuracy, %)

| Method | Image encoder | Text encoder | Data size | ImageNet-1k | CIFAR-10 | CIFAR-100 | Food101 | SUN397 | Cars | Aircraft | DTD | Pets | Caltech101 | Flowers102 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP [1] | ViT-L/14 | Transformer | 400M | 76.2 | 95.7 | 77.5 | 93.8 | 68.4 | 78.8 | 37.2 | 55.7 | 93.5 | 92.8 | 78.3 |
| DeCLIP [2] | RegNet-Y | BERT | 88M | 73.7 | - | - | - | - | - | - | - | - | - | - |
| FILIP [3] | ViT-L/14 | Transformer | 340M | 77.1 | 95.7 | 75.3 | 92.2 | 73.1 | 70.8 | 60.2 | 60.7 | 92.0 | 93.0 | 90.1 |
| Florence [4] | CoSwin | RoBERTa | 900M | 83.7 | 94.6 | 77.6 | 95.1 | 77.0 | 93.2 | 55.5 | 66.4 | 95.9 | 94.7 | 86.2 |
| SLIP [5] | ViT-L | Transformer | 15M | 47.9 | 87.5 | 54.2 | 69.2 | 56.0 | 9.0 | 9.5 | 29.9 | 41.6 | 80.9 | 60.2 |
| PyramidCLIP [6] | ResNet50 | T5 | 143M | 47.8 | 81.5 | 53.7 | 67.8 | 65.8 | 65.0 | 12.6 | 47.2 | 83.7 | 81.7 | 65.8 |
| Chinese CLIP [7] | ViT-L/14 | CN-RoBERTa | 200M | - | 96.0 | 79.7 | - | - | - | 26.2 | 51.2 | - | - | - |
| LiT [8] | ViT-g/14 | - | 4B | 85.2 | - | - | - | - | - | - | - | - | - | - |
| KELIP [9] | ViT-B/32 | Transformer | 1.1B | 62.6 | 91.5 | 68.6 | 79.5 | - | 75.4 | - | 51.2 | - | - | - |
| nCLIP [10] | ViT-B/16 | Transformer | 35M | 48.8 | 83.4 | 54.5 | 65.8 | 59.9 | 18.0 | 5.8 | 57.1 | 33.2 | 73.9 | 50.0 |
| NLIP [11] | ViT-B/16 | BART | 26M | 47.4 | 81.9 | 47.5 | 59.2 | 58.7 | 7.8 | 7.5 | 32.9 | 39.2 | 79.5 | 54.0 |
| UniCLIP [12] | ViT-B/32 | Transformer | 30M | 54.2 | 87.8 | 56.5 | 64.6 | 61.1 | 19.5 | 4.7 | 36.6 | 69.2 | 84.0 | 8.0 |
| RA-CLIP [13] | ViT-B/32 | BERT | 15M | 53.5 | 89.4 | 62.3 | 43.8 | 46.5 | - | - | 25.6 | - | 76.9 | - |
| LA-CLIP [14] | ViT-B/32 | Transformer | 400M | 64.4 | 92.4 | 73.0 | 79.7 | 64.9 | 81.9 | 20.8 | 55.4 | 87.2 | 91.8 | 70.3 |
| ALIP [15] | ViT-B/32 | Transformer | 15M | 40.3 | 83.8 | 51.9 | 45.4 | 47.8 | 3.4 | 2.7 | 23.2 | 30.7 | 74.1 | 54.8 |
| GrowCLIP [16] | ViT-B/16 | Transformer | 12M | 36.1 | 60.7 | 28.3 | 42.5 | 45.5 | - | - | 17.3 | - | 71.9 | 23.3 |
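
All of these zero-shot numbers come from the same protocol: wrap each class name in a prompt, embed the prompts with the text encoder, embed the image, and pick the class whose text embedding is most similar to the image embedding. Below is a minimal sketch of that protocol using the Hugging Face `transformers` CLIP API; the class names, prompt template, and image path are placeholders, and the papers above typically ensemble many prompt templates per class rather than using a single one.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Pretrained CLIP ViT-L/14, the checkpoint behind the first row of the table.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

labels = ["cat", "dog", "airplane"]   # placeholder: a dataset's class names
prompts = [f"a photo of a {label}" for label in labels]
image = Image.open("example.jpg")     # placeholder image path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the temperature-scaled cosine similarities between
# the image embedding and each prompt embedding; softmax turns them into scores.
probs = outputs.logits_per_image.softmax(dim=-1)
print({label: round(p.item(), 3) for label, p in zip(labels, probs[0])})
```

Benchmark accuracy is then just the fraction of test images whose highest-scoring class matches the ground truth.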

Performance comparison chart

(Interactive chart: click a model name below the chart to toggle it on/off.)

Dataset

| Name | Year | Num. of image-text pairs | Language | Public |
|---|---|---|---|---|
| SBU Caption [Link] | 2011 | 1M | English | Yes |
| COCO Caption [Link] | 2016 | 1.5M | English | Yes |
| Yahoo Flickr Creative Commons 100 Million (YFCC100M) [Link] | 2016 | 100M | English | Yes |
| Visual Genome (VG) [Link] | 2017 | 5.4M | English | Yes |
| Conceptual Captions (CC3M) [Link] | 2018 | 3.3M | English | Yes |
| Localized Narratives (LN) [Link] | 2020 | 0.87M | English | Yes |
| Conceptual 12M (CC12M) [Link] | 2021 | 12M | English | Yes |
| Wikipedia-based Image Text (WIT) [Link] | 2021 | 37.6M | 108 languages | Yes |
| RedCaps (RC) [Link] | 2021 | 12M | English | Yes |
| LAION400M [Link] | 2021 | 400M | English | Yes |
| LAION5B [Link] | 2022 | 5B | Over 100 languages | Yes |
| WuKong [Link] | 2022 | 100M | Chinese | Yes |
| CLIP | 2021 | 400M | English | No |
| ALIGN | 2021 | 1.8B | English | No |
| FILIP | 2021 | 300M | English | No |
| WebLI | 2022 | 12B | 109 languages | No |
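
Most of the public corpora above are distributed as (image URL, caption) pairs rather than packaged images. Here is a hedged sketch of iterating over one of them with the Hugging Face `datasets` library; the dataset id and column names are assumptions based on the Hub copy of CC3M and may differ from the version you find.

```python
from datasets import load_dataset

# Stream Conceptual Captions (CC3M) without downloading the whole corpus.
# Dataset id and column names are assumptions; verify against the Hub.
ds = load_dataset("conceptual_captions", split="train", streaming=True)

for i, sample in enumerate(ds):
    # Each record pairs an image URL with its alt-text caption; the images
    # themselves must be fetched separately (e.g. with a tool like img2dataset).
    print(sample["image_url"], "->", sample["caption"][:60])
    if i == 4:
        break
```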

References

  1. CLIP: [ICML 2021] [Code] [Data: CLIP]
  2. DeCLIP: [ICLR 2022] [Code] [Data: CC3M, CC12M, YFCC100M, WIT]
  3. FILIP: [ICLR 2022] [Data: FILIP, CC3M, CC12M, YFCC100M]
  4. Florence: [arXiv 2021] [Data: FLD-900M]
  5. SLIP: [ECCV 2022] [arXiv] [Code] [Data: YFCC100M]
  6. PyramidCLIP: [NeurIPS 2022] [Data: SBU, CC3M, CC12M, YFCC100M, LAION400M]
  7. Chinese CLIP: [arXiv 2022] [Code] [Data: LAION5B, WuKong, VG, COCO]
  8. LiT: [CVPR 2022] [Project] [Data: CC12M, YFCC100M, WIT]
  9. KELIP: [ICLRW 2022] [Code] [Data: CUB200, WIT, YFCC15M, CC3M, CC12M, LAION400M, K-WIT]
  10. nCLIP: [CVPR 2023] [Data: COCO, VG, SBU, CC3M, CC12M, YFCC14M]
  11. NLIP: [AAAI 2023] [Data: YFCC100M, COCO]
  12. UniCLIP: [NeurIPS 2022] [Data: CC3M, CC12M, YFCC100M]
  13. RA-CLIP: [CVPR 2023] [Data: YFCC100M]
  14. LA-CLIP: [NeurIPS 2023] [Code] [Data: CC3M, CC12M, RC, LAION400M]
  15. ALIP: [ICCV 2023] [Code] [Data: YFCC100M]
  16. GrowCLIP: [ICCV 2023] [Data: CC12M]
