DanQing:
An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset

DanQing Team, Glint Lab.

Catalogue

  • Introduction
  • Data Construction Pipeline
  • Statistics of Dataset
  • Main Results
  • Analysis
  • License
  • BibTeX

Introduction

Overview of Existing VLP Datasets

Dataset Year Lang Avail Size Valid-Link Rate
CC3M 2018 EN Yes 3.1M ≈60%
CC12M 2021 EN Yes 12M ≈60%
RedCaps 2021 EN Yes 12M -
WIT 2021 Multi Yes 11.5M -
YFCC100M 2014 EN Yes 100M ≈70%
COYO 2022 EN Yes 700M -
LAION-400M 2021 EN Yes 400M -
RealSyn 2025 EN Yes 100M -
Product1M 2021 CN Yes 1M -
WudaoMM 2022 CN Yes 5M -
M6-Corpus 2021 CN No 60.5M -
Wukong 2022 CN Yes 100M ≈85%
TaiSu 2022 CN Yes 166M 100%
Zero 2022 CN Yes 250M ≈60%
DanQing 2025 CN Yes 100M 100%

Motivation

Existing Chinese datasets such as Zero and Wukong are now more than three years old and suffer from two critical issues:

  • Temporal Irrelevance: Missing contemporary concepts
  • Dead Links: High proportion of invalid image URLs

DanQing: A Modern Solution

DanQing provides ~100M image-text pairs from 2024-2025 web data.

1. Rigorous Filtration: Filters out 90% of the raw data to retain only high-quality pairs.
2. Up-to-Date Semantics: Captures evolving trends and modern concepts.
3. SOTA Performance: Outperforms existing Chinese datasets in zero-shot classification, cross-modal retrieval, and LMM reasoning.

Open-sourced under the CC-BY 4.0 license, DanQing provides a foundation for next-generation Chinese AI models.

Data Construction Pipeline

1. Data Collection

Raw data from Common Crawl (2024-2025) is processed in 7 parallel batches; filtering by the "zho" language tag yields 1.05B initial pairs. Three coarse-grained filters are then applied: Content Safety (a 1M-parameter binary classifier), Textual Constraints (5-60 words), and Source Reliability (blacklist exclusion), yielding 706M candidate pairs (67% retention). Downloading the images succeeds for 67% of URLs, leaving 475M pairs.
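The rule-based parts of this step are simple to express in code. Below is a minimal sketch, not the authors' implementation: the 5-60 word bound comes from the text above, while the jieba segmenter, the blacklist file name, and the omission of the 1M-parameter safety classifier are illustrative assumptions.

```python
# Illustrative sketch (not the authors' code) of the two rule-based coarse
# filters: textual constraints (5-60 words) and source-reliability blacklisting.
from urllib.parse import urlparse

import jieba  # a common Chinese word segmenter; the paper's tokenizer is unspecified

with open("domain_blacklist.txt", encoding="utf-8") as f:  # hypothetical blacklist file
    BLACKLISTED_DOMAINS = {line.strip() for line in f if line.strip()}

def passes_coarse_filters(caption: str, image_url: str) -> bool:
    """Apply the textual-constraint and source-reliability checks."""
    num_words = len(list(jieba.cut(caption)))
    if not 5 <= num_words <= 60:
        return False
    if urlparse(image_url).netloc in BLACKLISTED_DOMAINS:
        return False
    return True
```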

2. Textual-Level Purification & Filtering

A four-stage refinement pipeline is applied: Linguistic Structure (Chinese identification and Simplified Chinese standardization), Text Quality (requires nouns, ≤5 [UNK] tokens), Information Density (entropy-based filter, H ≥ 6e-4), and Safety (NSFW detection and sensitive-content filtering). This reduces the corpus from 475M to 397M pairs (a 16.4% reduction), significantly enhancing the signal-to-noise ratio.
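For illustration, a hedged sketch of two of these checks (the noun requirement and the [UNK]-token cap) is shown below; the bert-base-chinese tokenizer and the jieba POS tagger are assumptions, and the entropy filter is not reproduced because its exact normalization (H ≥ 6e-4) is not specified in the text.

```python
# Hedged sketch of two textual-refinement checks: the noun requirement and
# the [UNK]-token cap. Tokenizer checkpoint and POS tagger are assumptions.
import jieba.posseg as pseg
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")  # assumed checkpoint

def contains_noun(text: str) -> bool:
    # jieba noun tags all start with "n" (n, nr, ns, nt, nz, ...)
    return any(w.flag.startswith("n") for w in pseg.cut(text))

def unk_token_count(text: str) -> int:
    ids = tokenizer.encode(text, add_special_tokens=False)
    return ids.count(tokenizer.unk_token_id)

def passes_text_quality(text: str) -> bool:
    return contains_noun(text) and unk_token_count(text) <= 5
```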

3. Image-Level Filtration

Multi-stage filtering is performed across four dimensions: Visual Fidelity (aspect ratio between 1:3 and 3:1, shortest edge >100 px, pixel-intensity σ ≥ 2, Laplacian variance ≥ 1000 for blur detection), Information Density (image entropy H ≥ 3), Image Redundancy (Union-Find clustering with distance threshold β = 0.1, retaining one representative per cluster), and Content Safety (an 86M-parameter NSFW detector). This reduces the dataset from 397M to 178M pairs (44.8% retention).
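A minimal sketch of the per-image quality checks is given below using OpenCV and NumPy. The thresholds are the ones quoted above; the grayscale-histogram entropy and the Laplacian-variance blur measure are standard choices but remain assumptions, and the deduplication and NSFW stages are omitted.

```python
# Illustrative per-image quality checks; thresholds follow the text above.
import cv2
import numpy as np

def passes_image_filters(path: str) -> bool:
    img = cv2.imread(path)
    if img is None:
        return False
    h, w = img.shape[:2]
    # Visual fidelity: aspect ratio in [1:3, 3:1], shortest edge > 100 px
    ratio = w / h
    if not (1 / 3 <= ratio <= 3) or min(h, w) <= 100:
        return False
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Reject near-constant images (pixel-intensity std) and blurry images
    # (low variance of the Laplacian response).
    if gray.std() < 2 or cv2.Laplacian(gray, cv2.CV_64F).var() < 1000:
        return False
    # Information density: grayscale-histogram entropy in bits
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return entropy >= 3
```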

4. Cross-Modality Filtration

Chinese-CLIP-L14 computes image-text similarity scores. Pairs within interval [1.06, 1.24] are retained: scores below 1.06 indicate weak semantic correlation, while those exceeding 1.24 often correspond to OCR-heavy images. Prunes 25M pairs, culminating in approximately 100M high-quality image-text pairs.
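The sketch below shows how such pair-level scores could be computed with the public Chinese-CLIP package (OFA-Sys). Note that raw cosine similarity lies in [-1, 1]; the [1.06, 1.24] interval above implies an additional scaling that the text does not specify, so no threshold is applied here.

```python
# Hedged sketch of scoring one image-text pair with Chinese-CLIP (ViT-L/14).
import torch
from PIL import Image
import cn_clip.clip as clip
from cn_clip.clip import load_from_name

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = load_from_name("ViT-L-14", device=device)
model.eval()

@torch.no_grad()
def image_text_similarity(image_path: str, caption: str) -> float:
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([caption]).to(device)
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).item()  # cosine similarity in [-1, 1]
```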

5. Final Result

A curated collection of ~100M pairs (10TB storage), ready for state-of-the-art vision-language pre-training.

Statistics of Dataset

Data Characteristics

Image Resolution Distribution

[Figure: image resolution distribution]

The images cover a broad spectrum of visual scales: most fall within 300-500 pixels, and a considerable proportion exceeds 1,024 pixels. This wide coverage supports learning robust, scale-invariant features for vision-language models.

Text Length Distribution

[Figure: text length distribution]

The corpus contains 2.2B Chinese words in total, averaging 22 words per sample. The length distribution spans 5-60 tokens, with the majority between 6 and 40, and samples at both extremes add semantic richness.

Topic Modeling

[Figure: topic modeling visualization]

We analyze semantic diversity using BERTopic on 10M randomly sampled pairs. Text embeddings from Chinese-CLIP-L/14 are reduced via UMAP and clustered with HDBSCAN (min cluster size: 1,000). Six prevalent topics span fashion, technology, cuisine, and more, indicating wide real-world domain coverage for robust vision-language representation learning.
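A minimal BERTopic sketch of this setup is shown below; only the minimum cluster size of 1,000 comes from the text, and the remaining hyperparameters are illustrative assumptions.

```python
# Sketch of the topic-modeling setup: precomputed Chinese-CLIP text embeddings,
# UMAP reduction, and HDBSCAN clustering with min_cluster_size=1000.
import numpy as np
from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP

def run_topic_model(captions: list[str], embeddings: np.ndarray) -> BERTopic:
    """captions: sampled texts; embeddings: their Chinese-CLIP-L/14 features."""
    umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine", random_state=42)
    hdbscan_model = HDBSCAN(min_cluster_size=1000, metric="euclidean", prediction_data=True)
    topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, verbose=True)
    topic_model.fit_transform(captions, embeddings)  # uses the precomputed embeddings
    return topic_model
```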

Main Results

Dataset Caltech101 CIFAR10 Country211 DTD Food101 MNIST Flowers Pets RESISC45 Cars Memes VOC2007 Avg.
Model Architecture: SigLIP2-B/32@256
Baseline 77.0 85.1 8.2 35.9 55.1 81.9 37.6 61.9 56.3 76.3 49.4 69.0 57.8
Wukong 78.6 91.7 9.5 42.6 61.2 83.0 61.4 71.3 58.1 75.1 53.8 75.6 63.5
Zero* 79.3 92.2 10.8 45.1 64.7 86.3 63.2 76.7 58.9 74.5 49.6 77.3 64.9
TaiSu* 78.5 90.9 5.7 43.5 53.6 83.5 52.4 62.9 53.3 58.9 54.0 77.3 59.5
DanQing 79.7 93.0 9.9 46.4 66.6 83.4 58.5 78.7 61.4 76.0 54.4 77.1 65.4
Model Architecture: SigLIP2-B/16@256
Baseline 77.3 85.4 10.7 35.3 60.8 83.9 38.1 65.0 59.8 81.0 51.0 71.0 59.9
Wukong 78.4 90.3 12.7 44.8 68.7 81.5 63.6 76.0 59.0 80.8 55.0 78.4 65.8
Zero* 79.5 91.3 13.9 45.6 70.5 84.6 65.5 78.9 60.6 80.2 51.0 79.0 66.7
TaiSu* 78.6 89.3 7.0 44.6 58.1 82.2 54.3 65.9 55.8 62.1 54.2 79.2 60.9
DanQing 80.2 93.2 13.3 48.0 71.6 83.5 62.2 81.8 63.5 81.7 53.2 79.6 67.7
Model Architecture: SigLIP2-L/16@256
Baseline 76.7 88.5 15.9 44.8 72.0 80.8 49.7 84.3 63.9 87.4 49.2 68.9 65.2
Wukong 80.3 96.1 20.5 48.2 78.3 84.9 74.3 84.5 65.7 86.5 55.0 78.1 71.0
Zero* 82.4 96.3 22.6 48.9 81.9 86.4 75.9 89.5 65.3 87.8 52.0 79.7 72.4
TaiSu* 81.7 94.8 13.1 44.3 68.9 74.2 64.5 79.1 59.4 70.7 55.6 79.7 65.5
DanQing 83.5 96.7 22.4 49.2 83.8 85.2 75.0 90.0 64.8 88.7 55.8 79.9 72.9
Zero-shot image classification performance using models pretrained on different datasets. * indicates random sampling of 100 million image-text pairs. The best and second best scores are in boldface and underlined.
Dataset Flickr30K-CN MSCOCO-CN MUGE Avg.
Text → Image Image → Text Text → Image Image → Text Text → Image Image → Text
R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10
Model Architecture: SigLIP2-B/32@256
Baseline 45.4 71.2 80.6 67.7 88.9 94.6 49.6 77.2 87.8 51.8 81.1 90.9 38.3 61.7 69.9 35.3 60.7 69.8 67.9
Wukong 49.8 75.8 83.7 68.2 89.8 95.6 54.4 81.6 90.7 56.5 84.8 83.2 55.1 77.9 85.1 44.0 71.2 80.1 74.3
Zero* 49.5 76.5 84.4 68.7 90.5 95.1 53.9 84.0 92.1 56.9 84.6 93.3 54.5 77.7 84.9 42.1 69.4 78.5 74.3
TaiSu* 60.5 84.2 90.3 77.8 94.4 97.2 65.7 90.7 96.0 65.5 88.9 94.5 56.2 78.2 84.7 44.1 71.2 80.3 78.9
DanQing 54.2 79.0 86.6 73.0 92.2 96.3 60.1 84.5 93.8 61.0 88.3 96.3 54.8 78.1 84.9 45.3 72.1 80.7 76.7
Model Architecture: SigLIP2-B/16@256
Baseline 51.3 76.6 84.7 73.5 93.3 96.7 51.9 79.7 89.6 54.7 82.5 91.9 41.6 64.6 73.4 38.9 64.3 73.5 71.3
Wukong 56.5 81.8 88.4 74.8 94.2 97.8 57.5 83.3 92.0 61.0 86.0 93.7 60.1 81.7 87.7 48.8 75.3 83.2 78.0
Zero* 58.2 83.7 90.4 74.9 93.4 96.9 58.7 86.0 94.4 60.0 84.8 93.1 59.6 80.8 86.8 46.2 72.9 81.3 77.9
TaiSu* 68.2 89.0 93.9 83.8 97.2 99.4 68.8 93.0 97.1 67.1 90.1 95.9 60.3 81.0 86.8 48.4 74.9 83.0 82.1
DanQing 61.1 84.9 90.9 80.6 95.0 97.9 62.3 86.6 94.4 64.7 88.5 96.1 60.4 81.3 87.3 50.3 76.3 83.9 80.1
Model Architecture: SigLIP2-L/16@256
Baseline 53.5 78.1 85.5 79.6 95.7 98.3 51.7 79.9 89.0 55.4 81.9 90.5 50.2 71.1 78.5 45.6 70.4 78.5 74.1
Wukong 62.8 86.2 91.5 81.7 96.2 98.5 61.0 85.9 93.5 62.9 88.7 95.1 66.6 84.6 90.1 55.8 80.7 87.4 81.6
Zero* 64.3 87.9 93.4 78.4 95.5 98.6 61.6 87.2 94.7 62.1 87.2 94.6 65.9 85.3 90.3 53.9 79.0 86.2 81.5
TaiSu* 72.6 91.7 95.8 87.8 98.7 99.7 71.4 92.6 97.3 69.2 91.6 96.8 66.0 85.0 90.0 55.1 80.1 86.7 84.9
DanQing 70.2 90.3 94.7 86.3 98.7 99.6 65.9 90.5 95.4 68.0 92.6 97.4 67.5 84.9 90.1 56.8 81.2 87.5 84.3
Cross-modal retrieval performance on short-caption datasets for models pretrained on various large-scale Chinese image-text datasets. * indicates random sampling of 100 million image-text pairs. The best and second-best results are highlighted in bold and underlined, respectively.
Dataset DCI-CN DOCCI-CN Avg.
Text → Image Image → Text Text → Image Image → Text
R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10
Model Architecture: SigLIP2-B/32@256
Baseline 7.7 18.2 23.8 8.7 19.3 25.3 11.0 26.5 36.0 14.3 32.1 41.6 22.0
Wukong 10.2 22.7 29.6 11.3 23.8 30.7 16.3 34.8 44.8 15.7 35.7 46.3 26.8
Zero* 10.9 24.4 32.0 11.2 23.7 30.9 17.1 35.8 45.8 17.8 38.5 50.1 28.2
TaiSu* 11.3 24.0 31.5 12.5 26.2 33.1 16.8 37.1 47.4 16.6 38.8 49.1 28.7
DanQing 13.1 27.4 35.0 12.6 26.1 33.7 19.8 42.0 52.8 18.7 40.4 51.5 31.1
Model Architecture: SigLIP2-B/16@256
Baseline 8.7 19.6 25.9 10.4 21.0 26.8 13.0 30.2 39.8 16.6 35.8 46.8 24.6
Wukong 12.2 25.3 32.6 12.8 25.8 32.6 17.7 39.1 49.4 18.0 38.9 50.1 29.5
Zero* 12.9 26.9 34.8 12.9 25.6 33.0 18.8 39.5 49.5 19.0 40.5 51.7 30.4
TaiSu* 13.2 27.1 34.7 14.6 28.6 36.0 19.1 40.5 51.5 18.9 41.1 51.9 31.4
DanQing 15.3 30.7 38.4 15.0 29.3 36.9 23.6 47.3 57.8 22.3 44.9 56.6 34.8
Model Architecture: SigLIP2-L/16@256
Baseline 16.7 30.9 37.5 16.3 30.6 38.0 29.3 53.5 64.3 29.0 54.4 64.9 38.8
Wukong 23.1 41.0 48.6 21.8 38.3 46.0 37.1 64.2 74.2 33.3 59.2 69.8 46.4
Zero* 24.8 43.0 51.4 24.5 41.6 49.9 37.6 66.4 75.8 38.4 67.4 77.8 49.9
TaiSu* 26.6 44.4 52.2 26.0 43.2 51.1 41.1 67.3 76.8 37.1 63.2 73.2 50.2
DanQing 31.3 50.7 58.4 30.5 49.9 58.2 48.7 76.4 84.5 44.8 72.2 81.5 57.3
Cross-modal retrieval performance on long-caption datasets for models pretrained on various datasets. * indicates random sampling of 100 million image-text pairs. The best and second-best results are highlighted in bold and underlined, respectively.
Dataset MMBench-CN (Dev) MMBench-EN (Dev) MME-RW CMMMU OCRBench Avg.
Model Architecture: SigLIP2-L/16@256 + Qwen2-7B
Baseline 72.9 73.6 43.1 38.7 15.0 48.7
Wukong 73.5 75.6 43.8 39.7 15.0 49.5
Zero* 72.9 75.0 42.3 39.4 15.7 49.1
TaiSu* 73.5 75.3 42.8 39.8 15.1 49.3
DanQing 74.0 75.3 45.4 39.7 16.0 50.1
Performance of LLaVA-NeXT-style models on Chinese-centric LMM downstream benchmarks, utilizing vision encoders pretrained on various datasets. * indicates random sampling of 100 million image-text pairs. The best and second-best results are highlighted in bold and underlined, respectively.

Analysis

Text Quality Analysis

Text Semantic Word Density

[Figure: semantic word density comparison]

We compare 10M texts sampled from DanQing, Wukong, and Zero. DanQing exhibits a significantly higher density of semantic words (nouns, verbs, adjectives), enabling models to acquire more effective semantic information.
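One plausible way to compute such a density, assuming a jieba POS tagger (the paper does not name its tagger), is sketched below.

```python
# Sketch: semantic word density as the fraction of tokens tagged as
# nouns, verbs, or adjectives.
import jieba.posseg as pseg

CONTENT_TAG_PREFIXES = ("n", "v", "a")  # noun, verb, adjective tag families

def semantic_word_density(text: str) -> float:
    words = list(pseg.cut(text))
    if not words:
        return 0.0
    content = sum(w.flag.startswith(CONTENT_TAG_PREFIXES) for w in words)
    return content / len(words)
```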

Text Perplexity Distribution

[Figure: perplexity (PPL) distribution]

Sentence-level PPL is computed with a Chinese BERT model. DanQing has substantially more samples in the [50, 200] range, indicating appropriate linguistic complexity (neither overly simplistic nor incoherent).
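A hedged sketch of sentence-level pseudo-perplexity with a Chinese BERT masked language model is shown below; the exact checkpoint and scoring formula used by the authors are not given, so both are assumptions.

```python
# Pseudo-perplexity: mask each token in turn, score it with the masked LM,
# and exponentiate the mean negative log-likelihood.
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")  # assumed checkpoint
model = BertForMaskedLM.from_pretrained("bert-base-chinese").eval()

@torch.no_grad()
def pseudo_perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    nlls = []
    for i in range(1, len(ids) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        nlls.append(-log_probs[ids[i]].item())
    return float(torch.exp(torch.tensor(nlls).mean()))
```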

Scaling Ability

Data Scaling Comparison

[Figure: data scaling comparison]

SigLIP2-B/32 trained on varying scales (10M, 30M, 60M, 100M). DanQing consistently outperforms Wukong across all scales, with improvements more pronounced at larger scales. Wukong plateaus beyond 30M, while DanQing continues improving to 100M.

Model Scaling Comparison

[Figure: model scaling comparison]

Using 30M subsets, we train SigLIP2 models of increasing size (Base ≈86M, Large ≈303M, So400m ≈400M, g-opt ≈1B parameters). DanQing outperforms Wukong across all model sizes with a steeper scaling curve, better leveraging increased model capacity.

Image Semantic Balance

[Figure: image semantic diversity (cluster distribution)]

Clustering distribution of 10M images from DanQing and Wukong (10k clusters via FAISS). DanQing achieves a significantly more balanced and uniform semantic distribution than Wukong, effectively mitigating the long-tail effect. This increased uniformity suggests broader coverage of the visual manifold, essential for learning rare concepts.
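A minimal sketch of this analysis, assuming FAISS k-means with k = 10,000 and precomputed image embeddings, is given below; settings other than the cluster count are assumptions.

```python
# Assign image embeddings to 10k k-means centroids and inspect how evenly
# samples spread across clusters (a flatter histogram means a more balanced
# semantic distribution).
import faiss
import numpy as np

def cluster_size_histogram(embeddings: np.ndarray, k: int = 10_000) -> np.ndarray:
    x = np.ascontiguousarray(embeddings, dtype=np.float32)
    kmeans = faiss.Kmeans(x.shape[1], k, niter=20, verbose=True)
    kmeans.train(x)
    _, assignments = kmeans.index.search(x, 1)  # nearest centroid per image
    return np.bincount(assignments.ravel(), minlength=k)
```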

Image-Text Alignment

We compute image-text similarity distributions for 10M samples using FG-CLIP2-L/16. DanQing consistently achieves higher similarity scores than Wukong, with significantly more samples exceeding the 0.15 threshold, demonstrating stronger semantic consistency between images and texts. The higher proportion of samples in the 0-0.05 range reflects novel 2024-2025 content that the scoring model has not seen, which also helps explain the improved retrieval performance.

[Figure: image-text similarity distribution]

New Concept Understanding

[Figures: new-concept matching examples for "Black Myth: Wukong" and "Xiaomi SU7"]

We evaluate SigLIP2-L/16@256 models on emergent concepts (post-2024 buzzwords such as "黑神话:悟空" (Black Myth: Wukong) and "小米SU7" (Xiaomi SU7)). Models trained on DanQing consistently assign the highest confidence to the correct pairs, demonstrating that DanQing contains more up-to-date information and enables models to internalize contemporary knowledge and generalize to recent real-world concepts.
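A small sketch of how such per-candidate confidences could be computed from precomputed, L2-normalized embeddings is shown below; the softmax formulation and temperature are simplifying assumptions, and SigLIP-style models would natively use a per-pair sigmoid instead.

```python
# Score one image against several candidate captions (the correct buzzword
# caption plus distractors) and report a confidence per caption.
import torch

def candidate_confidences(img_feat: torch.Tensor, txt_feats: torch.Tensor,
                          temperature: float = 0.01) -> torch.Tensor:
    """img_feat: (d,), txt_feats: (n, d); both assumed L2-normalized."""
    sims = txt_feats @ img_feat  # cosine similarity per candidate caption
    return torch.softmax(sims / temperature, dim=0)
```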

License

The DanQing dataset is licensed under the CC-BY-4.0 License; the full license text is available in the LICENSE.cc-by-4.0 file. The dataset is collected from Common Crawl web pages and may contain biased or sensitive content. Each collected item remains subject to the license of its original content, and users are solely responsible for ensuring compliance with ethical and legal standards in their research or applications.

BibTeX


          coming soon...