Accuracy scores of models on our CC-OCR benchmark
| Model | Multi-Scene OCR | Multilingual OCR | Document Parsing | Key Information Extraction |
| --- | --- | --- | --- | --- |
| TextMonkey | 56.9 | n/a | n/a | n/a |
| KOSMOS2.5 | 47.5 | 36.2 | n/a | n/a |
| Florence | 49.2 | 49.7 | n/a | n/a |
| GOT | 61.0 | 24.9 | 39.2 | n/a |
| InternVL2-76B | 76.9 | 46.6 | 35.3 | 61.6 |
| Claude-3.5-Sonnet | 72.9 | 65.7 | 47.8 | 64.6 |
| GPT-4o | 76.4 | 73.4 | 53.3 | 63.5 |
| Qwen2-VL-72B | 78.0 | 71.1 | 53.8 | 71.8 |
| Gemini-1.5-Pro | 83.2 | 79.0 | 62.4 | 67.3 |
We evaluate nine representative LMMs, covering both open-source models and commercial APIs. The commercial APIs, with their specific versions, are GPT-4o-2024-08-06, Gemini-1.5-Pro-002, and Claude-3.5-Sonnet-20241022. The open-source LMMs are KOSMOS2.5, TextMonkey, Florence, GOT, InternVL2-76B, and Qwen2-VL-72B.
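As a rough illustration of the commercial-API setting, the sketch below shows how a single document image might be sent to GPT-4o for transcription via the OpenAI Python SDK. The prompt wording, image handling, and scoring are assumptions for illustration and are not the exact CC-OCR evaluation pipeline.

```python
# Minimal sketch (assumptions noted above): query GPT-4o with one document
# image and ask for a plain-text transcription. Requires the openai package
# and an OPENAI_API_KEY in the environment.
import base64
from openai import OpenAI

client = OpenAI()

def ocr_with_gpt4o(image_path: str) -> str:
    """Ask the model to transcribe all visible text in the image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",  # version reported in the table above
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe all text in this image, preserving reading order."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        temperature=0,
    )
    return response.choices[0].message.content

# Hypothetical usage: prediction = ocr_with_gpt4o("sample_document.png")
```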
🚨 For more details, please refer to this link.