DanQing

Introduction

Overview of Existing VLP Datasets

Dataset	Year	Lang	Avail	Size	Rate
CC3M	2018	EN	Yes	3.1M	≈60%
CC12M	2021	EN	Yes	12M	≈60%
RedCaps	2021	EN	Yes	12M	-
WIT	2021	Multi	Yes	11.5M	-
YFCC100M	2014	EN	Yes	100M	≈70%
COYO	2022	EN	Yes	700M	-
LAION-400M	2021	EN	Yes	400M	-
RealSyn	2025	EN	Yes	100M	-
Product1M	2021	CN	Yes	1M	-
WudaoMM	2022	CN	Yes	5M	-
M6-Corpus	2021	CN	No	60.5M	-
Wukong	2022	CN	Yes	100M	≈85%
TaiSu	2022	CN	Yes	166M	100%
Zero	2022	CN	Yes	250M	≈60%
DanQing	2025	CN	Yes	100M	100%

Motivation

Existing Chinese datasets (Zero, Wukong) are 3+ years old with two critical issues:

Temporal Irrelevance: Missing contemporary concepts
Dead Links: High proportion of invalid image URLs

DanQing: A Modern Solution

DanQing provides ~100M image-text pairs from 2024-2025 web data.

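1. Rigorous Filtration: Filters out roughly 90% of the raw pairs (1.05B down to ~100M) to keep only high-quality data.

2. Up-to-Date Semantics: Captures evolving trends and modern concepts.

3. SOTA Performance: Achieves the best average scores among the compared Chinese datasets on zero-shot classification, long-caption retrieval, and LMM reasoning, and the second-best average (behind TaiSu) on short-caption retrieval.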

Open-sourced under the CC-BY 4.0 license, providing a foundation for next-generation Chinese AI models.

Data Construction Pipeline

1. Data Collection

Raw data from Common Crawl (2024-2025) is processed in 7 parallel batches; filtering by the "zho" language tag yields 1.05B initial pairs. Three coarse-grained filters are then applied: Content Safety (1M-parameter binary classifier), Textual Constraints (5-60 words), and Source Reliability (blacklist exclusion). This leaves 706M candidate pairs (67% retention). Image downloading succeeds for 67% of these, leaving 475M pairs.
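As a rough illustration of the length and source checks described above, the sketch below applies the 5-60 word constraint and a host blacklist to one pair. The function name, the `blacklist` set, and the assumption that the bounds are inclusive are illustrative and not taken from the release; the content-safety classifier is omitted because its details are not given.

```python
from urllib.parse import urlparse

MIN_WORDS, MAX_WORDS = 5, 60  # textual constraint stated in the pipeline description


def passes_coarse_filters(words, image_url, blacklist):
    """Return True if a pair survives the length and source checks.

    `words` is the caption already split into words by some segmenter
    (which segmenter the authors used is not stated). `blacklist` is a
    hypothetical set of blocked host names.
    """
    # Assumption: both bounds are inclusive.
    if not (MIN_WORDS <= len(words) <= MAX_WORDS):
        return False
    host = urlparse(image_url).hostname or ""
    return host not in blacklist
```

For example, a 10-word caption whose image is hosted on a blacklisted domain is rejected even though its length is acceptable.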

2. Textual-Level Purification & Filtering

Four-stage refinement pipeline: Linguistic Structure (Chinese identification and Simplified Chinese standardization), Text Quality (requires nouns, ≤5 [UNK] tokens), Information Density (entropy-based filter H ≥ 6e-4), and Safety (NSFW detection and sensitive content filtering). This reduces the corpus from 475M to 397M pairs (a 16.4% reduction) and improves its signal-to-noise ratio.
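Two of these text checks can be sketched in a few lines. The [UNK] limit comes from the description above; the entropy function is plain Shannon entropy over characters, which is only one possible reading of "entropy-based filter". The exact definition behind the H ≥ 6e-4 threshold is not given here, so the threshold is left as a caller-supplied parameter and all names are hypothetical.

```python
import math
from collections import Counter

MAX_UNK = 5  # "≤5 [UNK] tokens" from the pipeline description


def char_entropy(text):
    """Shannon entropy, in bits per character, of the character distribution."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def passes_text_quality(tokens, text, min_entropy):
    """Check the [UNK] budget and a minimum-entropy requirement.

    `tokens` is the tokenizer output for `text`. `min_entropy` is an
    assumption: the source threshold may refer to a differently
    normalized quantity.
    """
    return tokens.count("[UNK]") <= MAX_UNK and char_entropy(text) >= min_entropy
```

A caption made of one repeated character has zero entropy under this definition and is rejected by any positive threshold.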

3. Image-Level Filtration

Multi-stage filtering across four dimensions: Visual Fidelity (aspect ratio between 1:3 and 3:1, shortest edge > 100 px, pixel-intensity standard deviation σ ≥ 2, Laplacian variance ≥ 1000 to reject blurry images), Information Density (image entropy H ≥ 3), Image Redundancy (Union-Find clustering with distance threshold β = 0.1, retaining one representative per cluster), and Content Safety (86M-parameter NSFW detector). This reduces the dataset from 397M to 178M pairs (44.8% retention).
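The redundancy step can be illustrated with a small Union-Find sketch: any two images whose embedding distance is at most β are merged into one cluster, and one index per cluster is kept. Cosine distance, the brute-force pairwise loop, and the rule of keeping the lowest index are assumptions made for illustration; the source does not state the metric or the search strategy, and a corpus of this size would need approximate nearest-neighbour search instead of an O(n²) loop.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def find(parent, i):
    """Return the root of i, shortening the path on the way (path halving)."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def deduplicate(embeddings, beta=0.1):
    """Return the indices of one representative per near-duplicate cluster."""
    n = len(embeddings)
    parent = list(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if cosine_distance(embeddings[i], embeddings[j]) <= beta:
                ri, rj = find(parent, i), find(parent, j)
                if ri != rj:
                    # Attach the larger root to the smaller one, so each
                    # cluster's root is its lowest index.
                    parent[max(ri, rj)] = min(ri, rj)
    return sorted({find(parent, i) for i in range(n)})
```

Because merging is transitive, images A and C can end up in the same cluster through an intermediate B even when A and C themselves are farther apart than β.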

4. Cross-Modality Filtration

Chinese-CLIP-L14 computes image-text similarity scores. Pairs within interval [1.06, 1.24] are retained: scores below 1.06 indicate weak semantic correlation, while those exceeding 1.24 often correspond to OCR-heavy images. Prunes 25M pairs, culminating in approximately 100M high-quality image-text pairs.
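A minimal sketch of the interval rule, assuming the similarity scores have already been computed and that both ends of the interval are inclusive; the function name and the `(pair_id, score)` layout are hypothetical.

```python
LOW, HIGH = 1.06, 1.24  # retention interval stated in the pipeline description


def filter_by_similarity(scored_pairs, low=LOW, high=HIGH):
    """Keep (pair_id, score) tuples whose score lies in [low, high].

    Scores below `low` are treated as weakly aligned and scores above
    `high` as likely OCR-heavy, following the description above.
    """
    return [(pid, s) for pid, s in scored_pairs if low <= s <= high]
```

Applying both bounds in one pass keeps the filter symmetric: a very high score is treated as a warning sign rather than as evidence of quality.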

5. Final Result

A curated collection of ~100M pairs (10TB storage), ready for state-of-the-art vision-language pre-training.
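The stage sizes quoted in the pipeline can be turned into a small retention table as a sanity check. The counts below are the rounded figures from the steps above, in millions, with the final size taken as the approximate 100M; the stage labels are paraphrases.

```python
# Pair counts in millions, as quoted for each pipeline stage.
STAGES = [
    ("Common Crawl, zho-tagged", 1050),
    ("after coarse filters", 706),
    ("after image download", 475),
    ("after text filtering", 397),
    ("after image filtering", 178),
    ("after cross-modal filtering", 100),  # approximate final size
]


def retention_table(stages):
    """Return (name, count, step_retention, overall_retention) per stage."""
    rows = []
    first = stages[0][1]
    prev = first
    for name, count in stages:
        rows.append((name, count, count / prev, count / first))
        prev = count
    return rows


for name, count, step, overall in retention_table(STAGES):
    print(f"{name:30s} {count:5d}M  step {step:6.1%}  overall {overall:6.1%}")
```

With these figures the overall retention comes out a little under 10%, which is in line with the statement that about 90% of the raw data is filtered out.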

Statistics of Dataset

Data Characteristics

Image Resolution

Broad spectrum of visual scales: most images fall within 300-500 pixels, with a considerable proportion exceeding 1,024 pixels. This wide coverage supports learning robust, scale-invariant features for vision-language tasks.

Text Length Distribution

A total of 2.2B Chinese words, averaging 22 words per sample. The distribution spans 5-60 tokens, with the majority between 6 and 40 and samples at both extremes for semantic richness.

Topic Modeling

We analyze semantic diversity using BERTopic on 10M randomly sampled pairs. Text embeddings from Chinese-CLIP-L/14 are reduced via UMAP and clustered with HDBSCAN (min cluster size: 1,000). Six prevalent topics span fashion, technology, cuisine, and more, indicating wide real-world domain coverage for robust vision-language representation learning.

Dataset	Caltech101	CIFAR10	Country211	DTD	Food101	MNIST	Flowers	Pets	RESISC45	Cars	Memes	VOC2007	Avg.
Model Architecture: SigLIP2-B/32@256
Baseline	77.0	85.1	8.2	35.9	55.1	81.9	37.6	61.9	56.3	76.3	49.4	69.0	57.8
Wukong	78.6	91.7	9.5	42.6	61.2	83.0	61.4	71.3	58.1	75.1	53.8	75.6	63.5
Zero^*	79.3	92.2	10.8	45.1	64.7	86.3	63.2	76.7	58.9	74.5	49.6	77.3	64.9
TaiSu^*	78.5	90.9	5.7	43.5	53.6	83.5	52.4	62.9	53.3	58.9	54.0	77.3	59.5
DanQing	79.7	93.0	9.9	46.4	66.6	83.4	58.5	78.7	61.4	76.0	54.4	77.1	65.4
Model Architecture: SigLIP2-B/16@256
Baseline	77.3	85.4	10.7	35.3	60.8	83.9	38.1	65.0	59.8	81.0	51.0	71.0	59.9
Wukong	78.4	90.3	12.7	44.8	68.7	81.5	63.6	76.0	59.0	80.8	55.0	78.4	65.8
Zero^*	79.5	91.3	13.9	45.6	70.5	84.6	65.5	78.9	60.6	80.2	51.0	79.0	66.7
TaiSu^*	78.6	89.3	7.0	44.6	58.1	82.2	54.3	65.9	55.8	62.1	54.2	79.2	60.9
DanQing	80.2	93.2	13.3	48.0	71.6	83.5	62.2	81.8	63.5	81.7	53.2	79.6	67.7
Model Architecture: SigLIP2-L/16@256
Baseline	76.7	88.5	15.9	44.8	72.0	80.8	49.7	84.3	63.9	87.4	49.2	68.9	65.2
Wukong	80.3	96.1	20.5	48.2	78.3	84.9	74.3	84.5	65.7	86.5	55.0	78.1	71.0
Zero^*	82.4	96.3	22.6	48.9	81.9	86.4	75.9	89.5	65.3	87.8	52.0	79.7	72.4
TaiSu^*	81.7	94.8	13.1	44.3	68.9	74.2	64.5	79.1	59.4	70.7	55.6	79.7	65.5
DanQing	83.5	96.7	22.4	49.2	83.8	85.2	75.0	90.0	64.8	88.7	55.8	79.9	72.9

Zero-shot image classification performance using models pretrained on different datasets. ^* indicates random sampling of 100 million image-text pairs. The best and second best scores are in boldface and underlined.

Dataset	Flickr30K-CN						MSCOCO-CN						MUGE						Avg.
	Text → Image			Image → Text			Text → Image			Image → Text			Text → Image			Image → Text
	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10
Model Architecture: SigLIP2-B/32@256
Baseline	45.4	71.2	80.6	67.7	88.9	94.6	49.6	77.2	87.8	51.8	81.1	90.9	38.3	61.7	69.9	35.3	60.7	69.8	67.9
Wukong	49.8	75.8	83.7	68.2	89.8	95.6	54.4	81.6	90.7	56.5	84.8	93.2	55.1	77.9	85.1	44.0	71.2	80.1	74.3
Zero^*	49.5	76.5	84.4	68.7	90.5	95.1	53.9	84.0	92.1	56.9	84.6	93.3	54.5	77.7	84.9	42.1	69.4	78.5	74.3
TaiSu^*	60.5	84.2	90.3	77.8	94.4	97.2	65.7	90.7	96.0	65.5	88.9	94.5	56.2	78.2	84.7	44.1	71.2	80.3	78.9
DanQing	54.2	79.0	86.6	73.0	92.2	96.3	60.1	84.5	93.8	61.0	88.3	96.3	54.8	78.1	84.9	45.3	72.1	80.7	76.7
Model Architecture: SigLIP2-B/16@256
Baseline	51.3	76.6	84.7	73.5	93.3	96.7	51.9	79.7	89.6	54.7	82.5	91.9	41.6	64.6	73.4	38.9	64.3	73.5	71.3
Wukong	56.5	81.8	88.4	74.8	94.2	97.8	57.5	83.3	92.0	61.0	86.0	93.7	60.1	81.7	87.7	48.8	75.3	83.2	78.0
Zero^*	58.2	83.7	90.4	74.9	93.4	96.9	58.7	86.0	94.4	60.0	84.8	93.1	59.6	80.8	86.8	46.2	72.9	81.3	77.9
TaiSu^*	68.2	89.0	93.9	83.8	97.2	99.4	68.8	93.0	97.1	67.1	90.1	95.9	60.3	81.0	86.8	48.4	74.9	83.0	82.1
DanQing	61.1	84.9	90.9	80.6	95.0	97.9	62.3	86.6	94.4	64.7	88.5	96.1	60.4	81.3	87.3	50.3	76.3	83.9	80.1
Model Architecture: SigLIP2-L/16@256
Baseline	53.5	78.1	85.5	79.6	95.7	98.3	51.7	79.9	89.0	55.4	81.9	90.5	50.2	71.1	78.5	45.6	70.4	78.5	74.1
Wukong	62.8	86.2	91.5	81.7	96.2	98.5	61.0	85.9	93.5	62.9	88.7	95.1	66.6	84.6	90.1	55.8	80.7	87.4	81.6
Zero^*	64.3	87.9	93.4	78.4	95.5	98.6	61.6	87.2	94.7	62.1	87.2	94.6	65.9	85.3	90.3	53.9	79.0	86.2	81.5
TaiSu^*	72.6	91.7	95.8	87.8	98.7	99.7	71.4	92.6	97.3	69.2	91.6	96.8	66.0	85.0	90.0	55.1	80.1	86.7	84.9
DanQing	70.2	90.3	94.7	86.3	98.7	99.6	65.9	90.5	95.4	68.0	92.6	97.4	67.5	84.9	90.1	56.8	81.2	87.5	84.3

Cross-modal retrieval performance on short-caption datasets for models pretrained on various large-scale Chinese image-text datasets. ^* indicates random sampling of 100 million image-text pairs. The best and second-best results are highlighted in bold and underlined, respectively.

Dataset	DCI-CN						DOCCI-CN						Avg.
	Text → Image			Image → Text			Text → Image			Image → Text
	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10
Model Architecture: SigLIP2-B/32@256
Baseline	7.7	18.2	23.8	8.7	19.3	25.3	11.0	26.5	36.0	14.3	32.1	41.6	22.0
Wukong	10.2	22.7	29.6	11.3	23.8	30.7	16.3	34.8	44.8	15.7	35.7	46.3	26.8
Zero^*	10.9	24.4	32.0	11.2	23.7	30.9	17.1	35.8	45.8	17.8	38.5	50.1	28.2
TaiSu^*	11.3	24.0	31.5	12.5	26.2	33.1	16.8	37.1	47.4	16.6	38.8	49.1	28.7
DanQing	13.1	27.4	35.0	12.6	26.1	33.7	19.8	42.0	52.8	18.7	40.4	51.5	31.1
Model Architecture: SigLIP2-B/16@256
Baseline	8.7	19.6	25.9	10.4	21.0	26.8	13.0	30.2	39.8	16.6	35.8	46.8	24.6
Wukong	12.2	25.3	32.6	12.8	25.8	32.6	17.7	39.1	49.4	18.0	38.9	50.1	29.5
Zero^*	12.9	26.9	34.8	12.9	25.6	33.0	18.8	39.5	49.5	19.0	40.5	51.7	30.4
TaiSu^*	13.2	27.1	34.7	14.6	28.6	36.0	19.1	40.5	51.5	18.9	41.1	51.9	31.4
DanQing	15.3	30.7	38.4	15.0	29.3	36.9	23.6	47.3	57.8	22.3	44.9	56.6	34.8
Model Architecture: SigLIP2-L/16@256
Baseline	16.7	30.9	37.5	16.3	30.6	38.0	29.3	53.5	64.3	29.0	54.4	64.9	38.8
Wukong	23.1	41.0	48.6	21.8	38.3	46.0	37.1	64.2	74.2	33.3	59.2	69.8	46.4
Zero^*	24.8	43.0	51.4	24.5	41.6	49.9	37.6	66.4	75.8	38.4	67.4	77.8	49.9
TaiSu^*	26.6	44.4	52.2	26.0	43.2	51.1	41.1	67.3	76.8	37.1	63.2	73.2	50.2
DanQing	31.3	50.7	58.4	30.5	49.9	58.2	48.7	76.4	84.5	44.8	72.2	81.5	57.3

Cross-modal retrieval performance on long-caption datasets for models pretrained on various datasets. ^* indicates random sampling of 100 million image-text pairs. The best and second-best results are highlighted in bold and underlined, respectively.

Dataset	MMBench-CN (Dev)	MMBench-EN (Dev)	MME-RW	CMMMU	OCRBench	Avg.
Model Architecture: SigLIP2-L/16@256 + Qwen2-7B
Baseline	72.9	73.6	43.1	38.7	15.0	48.7
Wukong	73.5	75.6	43.8	39.7	15.0	49.5
Zero^*	72.9	75.0	42.3	39.4	15.7	49.1
TaiSu^*	73.5	75.3	42.8	39.8	15.1	49.3
DanQing	74.0	75.3	45.4	39.7	16.0	50.1

Performance of LLaVA-NeXT-style models on Chinese-centric LMM downstream benchmarks, utilizing vision encoders pretrained on various datasets. ^* indicates random sampling of 100 million image-text pairs. The best and second-best results are highlighted in bold and underlined, respectively.

Analysis

Text Quality Analysis

Semantic Word Density

Comparison of 10M texts from DanQing, Wukong, and Zero. DanQing exhibits significantly higher semantic density (nouns, verbs, adjectives), enabling models to acquire more effective semantic information.

Perplexity (PPL) Distribution

Sentence-level PPL is computed using Chinese BERT. DanQing has substantially more samples in the [50, 200] range, indicating moderate linguistic complexity (neither simplistic nor incoherent).
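How PPL is obtained from a masked language model such as BERT is not specified here. One common choice, shown only as an assumption, is the pseudo-perplexity, where each token is masked in turn and scored given the rest of the sentence:

```latex
\mathrm{PPL}(s) \;=\; \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log P_{\mathrm{BERT}}\!\left( w_i \mid s_{\setminus i} \right) \right)
```

Here $s = (w_1, \dots, w_N)$ is the tokenized sentence and $s_{\setminus i}$ denotes $s$ with the $i$-th token masked. Under this definition, very low values tend to indicate highly predictable text and very high values incoherent text.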

Scaling Ability

Data Scaling

SigLIP2-B/32 trained on varying scales (10M, 30M, 60M, 100M). DanQing consistently outperforms Wukong across all scales, with improvements more pronounced at larger scales. Wukong plateaus beyond 30M, while DanQing continues improving to 100M.

Model Scaling

Using 30M subsets, we train SigLIP2 models (Base~86M, Large~303M, So~400M, g-opt~1B). DanQing outperforms Wukong across all model sizes with a steeper scaling curve, better leveraging increased model capacity.

Image Semantic Balance

Clustering distribution of 10M images from DanQing and Wukong (10k clusters via FAISS). DanQing achieves a significantly more balanced and uniform semantic distribution than Wukong, effectively mitigating the long-tail effect. This increased uniformity suggests broader coverage of the visual manifold, essential for learning rare concepts.
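One way to put a number on "balanced" is the normalized entropy of the cluster-size distribution: 1.0 means every cluster holds the same number of images, and values near 0 mean a few clusters dominate. This metric is an illustrative assumption, since the statistic behind the comparison is not stated here; `assignments` is a hypothetical list of cluster ids, one per image.

```python
import math
from collections import Counter


def normalized_cluster_entropy(assignments, num_clusters):
    """Entropy of the cluster-size distribution divided by log(num_clusters).

    Requires num_clusters > 1. Empty clusters contribute nothing to the
    sum but still count in the normalizer, so they lower the score.
    """
    counts = Counter(assignments)
    n = len(assignments)
    h = -sum(c / n * math.log(c / n) for c in counts.values())
    return h / math.log(num_clusters)
```

A long-tailed assignment, where one cluster absorbs most images, scores far below a uniform one under this measure.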

Image-Text Alignment

Image-text similarity distribution for 10M samples using FG-CLIP2-L/16. DanQing consistently achieves higher similarity scores than Wukong, with significantly more samples exceeding the 0.15 threshold, demonstrating stronger semantic consistency between images and texts. The higher proportion in the 0-0.05 range reflects novel 2024-2025 content, explaining improved retrieval performance.

New Concept Understanding

Evaluation of SigLIP2-L/16@256 models on emergent concepts (post-2024 buzzwords like "黑神话:悟空" and "小米SU7"). Models trained on DanQing consistently assign the highest confidence to the correct pairs, suggesting that DanQing contains more up-to-date information and lets models internalize contemporary knowledge and generalize to recent real-world concepts.

License

The DanQing dataset is licensed under the CC-BY-4.0 license. The full license text can be found in the LICENSE.cc-by-4.0 file. The dataset is collected from Common Crawl web pages and may contain biased or sensitive content. Each collected item remains subject to the license of its original source. Users are solely responsible for ensuring compliance with ethical and legal standards in their research or applications.

BibTeX


@misc{danqing,
  title={DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset},
  author={Hengyu Shen and Tiancheng Gu and Bin Qin and Lan Wu and Yuling Wu and Shuo Tan and Zelong Sun and Jun Wang and Nan Wu and Xiang An and Weidong Cai and Ziyong Feng and Kaicheng Yang},
  year={2026},
  eprint={2601.10305},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2601.10305},
}
