CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy

1Alibaba Group, 2South China University of Technology
[Figure: leaderboard across the CC-OCR tracks]

Model performance varies across the subsets, revealing distinct strengths. Gemini-1.5-Pro and Qwen2-VL-72B are the two top-performing models: Gemini-1.5-Pro ranks first on the multi-scene OCR, multilingual OCR, and document parsing tracks, while Qwen2-VL-72B ranks first on the KIE track and second on the multi-scene OCR and document parsing tracks.

Introduction

The CC-OCR benchmark is specifically designed to evaluate the OCR capabilities of large multimodal models. It draws on publicly available task-specific data as well as carefully curated, representative data from real-world application scenarios. CC-OCR covers the core OCR tasks while also targeting the challenges and difficulties that arise in practical applications, spanning most key perception tasks over scene-text and document images. A guiding philosophy of this work is that high accuracy in perception is the foundation of multimodal cognition. CC-OCR comprises four OCR-centric tracks: multi-scene text reading, multilingual text reading, document parsing (including table and formula recognition), and key information extraction. It includes 39 subsets with 7,058 fully annotated images, of which 41% are sourced from real applications and released for the first time; this lets the benchmark better assess the zero-shot recognition capabilities of large models.
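As a concrete illustration of the zero-shot setting, the minimal sketch below sends a single image to one of the evaluated commercial APIs (GPT-4o) with a plain text-reading instruction and no task-specific fine-tuning. The prompt wording and image path are our own assumptions for illustration, not the benchmark's official evaluation protocol.

# Minimal zero-shot text-reading query, assuming an OpenAI API key is
# configured in the environment. The prompt and image path are illustrative,
# not the official CC-OCR evaluation prompt.
import base64
from openai import OpenAI

client = OpenAI()

with open("sample_scene.png", "rb") as f:  # hypothetical test image
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",  # the version evaluated in the leaderboard below
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Read all the text in this image and output it line by line."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)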

[Figure: overview of the CC-OCR datasets]

The main features of CC-OCR include:

  • Four OCR-centric tracks: Multi-Scene Text Reading, Multilingual Text Reading, Document Parsing, and Key Information Extraction.
  • Fine-grained visual challenges: multi-orientation, multi-scale, wild-scene noise, and various text fonts.
  • Well annotated: For OCR, both textual labels and positional boxes are annotated. For parsing, LaTeX and HTML formats are used for documents and tables, respectively. For KIE, JSON format is adopted (see the sketch after this list).
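To make the annotation scheme concrete, here is a minimal sketch of what single records could look like for the text-reading and KIE tracks. The field names and file paths are illustrative assumptions, not the benchmark's released schema.

# Illustrative annotation records; all field names and paths are our
# assumptions, not the official CC-OCR schema.
ocr_record = {
    "image": "scene/0001.jpg",
    "annotations": [
        # each text instance carries a transcription and a quadrilateral box
        {"text": "EXIT", "box": [[120, 40], [210, 40], [210, 88], [120, 88]]},
    ],
}

kie_record = {
    "image": "receipt/0042.jpg",
    # KIE ground truth is a JSON object of field/value pairs
    "label": {"company": "ACME Ltd.", "date": "2024-06-01", "total": "42.00"},
}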

Leaderboard on CC-OCR

Accuracy scores of models on our CC-OCR benchmark

| Model             | Multi-Scene OCR | Multilingual OCR | Document Parsing | Key Information Extraction |
|-------------------|-----------------|------------------|------------------|----------------------------|
| TextMonkey        | 56.9            | n/a              | n/a              | n/a                        |
| KOSMOS2.5         | 47.5            | 36.2             | n/a              | n/a                        |
| Florence          | 49.2            | 49.7             | n/a              | n/a                        |
| GOT               | 61.0            | 24.9             | 39.2             | n/a                        |
| InternVL2-76B     | 76.9            | 46.6             | 35.3             | 61.6                       |
| Claude-3.5-Sonnet | 72.9            | 65.7             | 47.8             | 64.6                       |
| GPT-4o            | 76.4            | 73.4             | 53.3             | 63.5                       |
| Qwen2-VL-72B      | 78.0            | 71.1             | 53.8             | 71.8                       |
| Gemini-1.5-Pro    | 83.2            | 79.0             | 62.4             | 67.3                       |

We evaluate nine representative LMMs, covering both open-source models and commercial APIs. The commercial APIs, pinned to specific versions, are GPT-4o-2024-08-06, Gemini-1.5-Pro-002, and Claude-3.5-Sonnet-20241022. The open-source LMMs are KOSMOS2.5, TextMonkey, Florence, GOT, InternVL2-76B, and Qwen2-VL-72B.
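The exact metric definitions are given in the paper; as a rough illustration of how a character-level accuracy in the spirit of these scores could be computed, the self-contained sketch below uses 1 minus the normalized Levenshtein distance (our simplification, not necessarily the paper's formula).

# Rough character-level accuracy: 1 - normalized edit distance, per sample.
# This is an illustrative simplification, not the paper's exact metric.
def edit_distance(a: str, b: str) -> int:
    # classic dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def char_accuracy(pred: str, gt: str) -> float:
    if not gt:
        return 1.0 if not pred else 0.0
    return max(0.0, 1.0 - edit_distance(pred, gt) / max(len(pred), len(gt)))

print(char_accuracy("H3LLO WORLD", "HELLO WORLD"))  # ~0.909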

🚨 For more details, please refer to the paper: https://arxiv.org/abs/2412.02210

Examples

Examples from different perspectives in CC-OCR



BibTeX

@misc{yang2024ccocrcomprehensivechallengingocr,
      title={CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy}, 
      author={Zhibo Yang and Jun Tang and Zhaohai Li and Pengfei Wang and Jianqiang Wan and Humen Zhong and Xuejing Liu and Mingkun Yang and Peng Wang and Shuai Bai and Lianwen Jin and Junyang Lin},
      year={2024},
      eprint={2412.02210},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.02210}, 
}