RICE

ICCV 2025 · Highlight
Yin Xie*1 Kaicheng Yang*1 Xiang An*1 Kun Wu1 Yongle Zhao1 Weimo Deng1 Zimin Ran1 Yumeng Wang1 Ziyong Feng1 Roy Miles2 Ismail Elezi2 Jiankang Deng2
1 GlintLab    2 Imperial College London    * Equal contribution

How to use it

Quickstart

Load the released RICE-ViT checkpoint from Hugging Face and extract visual features in a few lines.

from transformers import AutoModel, AutoProcessor
from PIL import Image
import requests
import torch

model = AutoModel.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")
processor = AutoProcessor.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

features = outputs.last_hidden_state[0]  # (num_tokens, hidden_dim); token 0 is the class token
print(features.shape)
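The returned `last_hidden_state` is a sequence of token embeddings. A common way to get a single image-level vector (a convention, not something the checkpoint prescribes) is to mean-pool the patch tokens and L2-normalize. A minimal NumPy sketch on a dummy array — the 1601×1024 shape is an assumption based on ViT-L/14 at 560px (40×40 patches plus the class token):

```python
import numpy as np

def pool_image_embedding(tokens: np.ndarray) -> np.ndarray:
    """Mean-pool the patch tokens (skipping the class token at index 0)
    and L2-normalize, yielding a single vector per image."""
    patch_tokens = tokens[1:]            # drop the class token
    emb = patch_tokens.mean(axis=0)      # (hidden_dim,)
    return emb / np.linalg.norm(emb)     # unit norm, ready for cosine similarity

# Dummy stand-in for outputs.last_hidden_state[0]:
# 1601 tokens (40x40 patches + class token) x 1024 dims.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((1601, 1024))
emb = pool_image_embedding(tokens)
print(emb.shape)  # (1024,)
```

The unit-norm output lets you compare images directly with a dot product.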

Adopted By

RICE-ViT is used as the vision encoder in the following MLLMs:

| Model | Organization | Downloads / month |
|---|---|---|
| LLaVA-OneVision-1.5-8B-Instruct | LMMs-Lab | 16,562 |
| VAETKI-VL-7B-A1B | NC-AI Consortium | 949 |
| Innovator-VL-8B-Instruct | InnovatorLab | 64 |

The Margin-based Vision Transformer (MVT) series is a family of state-of-the-art vision encoders for universal visual representation learning. Its latest member, RICE (Region-based Cluster Discrimination), advances visual understanding by processing diverse semantic regions within an image in a single forward pass.

RICE introduces a region-based approach to visual representation learning that jointly captures general visual semantics (objects, scenes), OCR semantics (text within images), and a unified representation that integrates both. This yields strong performance across vision tasks including image retrieval, visual question answering, and multimodal understanding.

Architecture

[Architecture diagram] Pipeline: the input image (H×W×3) is encoded by the Vision Transformer in a single forward pass, yielding a token grid (H×W patch tokens + 1 class token). A Region Transformer samples regions by mask (ROI Align) and applies region attention with a region-specific visibility mask, producing fixed-length region embeddings (object tokens and OCR tokens). Supervision is unified: an object region loss (single-label cluster discrimination against semantic cluster centers) and an OCR region loss (multi-label cluster discrimination against vocabularies / token embeddings).

Figure 1. RICE architecture efficiently processes diverse semantic regions within images using region-based cluster discrimination. The model jointly captures general visual semantics, OCR semantics, and unified representations in a single forward pass.
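As an illustration of the region pipeline above, here is a toy NumPy sketch: region features are pooled from the token grid with a binary mask (a crude stand-in for ROI Align plus masked region attention) and scored against cluster centers, as in single-label cluster discrimination. All shapes and the temperature value are illustrative, not taken from the paper:

```python
import numpy as np

def region_embedding(token_grid: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Pool the ViT tokens inside a binary region mask into one fixed-length
    embedding -- a simplified stand-in for the paper's ROI Align + region
    attention with a region-specific visibility mask."""
    feats = token_grid[mask.astype(bool)]   # (n_region_tokens, dim)
    emb = feats.mean(axis=0)
    return emb / np.linalg.norm(emb)

def cluster_logits(region_emb: np.ndarray, centers: np.ndarray,
                   temperature: float = 0.07) -> np.ndarray:
    """Single-label cluster discrimination head: cosine similarity of the
    region embedding to each semantic cluster center, temperature-scaled."""
    centers = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    return region_emb @ centers.T / temperature

rng = np.random.default_rng(0)
grid = rng.standard_normal((40, 40, 1024))   # toy 40x40 token grid
mask = np.zeros((40, 40))
mask[10:20, 5:15] = 1                        # one object region
r_emb = region_embedding(grid, mask)
logits = cluster_logits(r_emb, rng.standard_normal((1000, 1024)))
print(logits.shape)  # (1000,): one logit per cluster center
```

A softmax cross-entropy over these logits against the region's assigned cluster would complete the discrimination loss; the OCR branch uses a multi-label variant over token vocabularies.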

LLaVA Experimental Results

Comparison of RICE-ViT with other vision encoders using the LLaVA-NeXT framework. All models are evaluated using identical configurations: Qwen2.5-7B as the language model, LLaVA-NeXT training data, and the same training pipeline. We adopt LLaVA-NeXT's tiling strategy (up to 2×2+1 tiles) for handling high-resolution images.
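The "2×2+1" tiling can be sketched as follows — a hedged approximation, since the actual LLaVA-NeXT implementation also selects the grid from the image's aspect ratio; here we fix a 2×2 grid for illustration:

```python
def llava_next_tiles(width: int, height: int, grid=(2, 2)):
    """Return crop boxes (left, top, right, bottom) for a grid of local
    tiles plus one global view of the whole image (the '+1')."""
    cols, rows = grid
    tw, th = width // cols, height // rows
    boxes = [(c * tw, r * th, (c + 1) * tw, (r + 1) * th)
             for r in range(rows) for c in range(cols)]
    boxes.append((0, 0, width, height))  # the global tile
    return boxes

tiles = llava_next_tiles(1120, 1120)
print(len(tiles))  # 5: four 560x560 crops + one global view
```

Each box can be cropped and resized to the encoder's input resolution before being passed through the vision tower.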

Columns InfoVQA through OCR Avg cover OCR & document understanding; AI2D through Other Avg cover general vision understanding.

| Method | Vision Tower | InfoVQA | DocVQA | ChartQA | TextVQA | OCRBench | OCRBenchV2 | LiveXivQA | OCR Avg | AI2D | MMBench-EN | MME-Cog | MME-Per | POPE | RealWorldQA | MMStar | Other Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP | ViT-L-14-336px | 38.9 | 75.2 | 66.5 | 62.5 | 52.5 | 23.0 | 47.4 | 52.3 | 73.2 | 74.6 | 48.0 | 75.6 | 88.8 | 63.7 | 49.0 | 67.6 |
| MLCD | ViT-L-14-336px | 43.5 | 76.5 | 67.8 | 61.7 | 53.1 | 24.0 | 48.4 | 53.6 | 77.0 | 76.4 | 54.1 | 79.9 | 88.7 | 61.1 | 51.0 | 69.7 |
| AIMv2 | ViT-L-14-336px | 35.4 | 77.2 | 72.7 | 65.9 | 57.2 | 23.9 | 47.3 | 54.2 | 75.4 | 78.6 | 48.3 | 75.0 | 88.4 | 62.2 | 50.2 | 68.3 |
| RICE-ViT | ViT-L-14-336px | 45.2 | 79.2 | 72.3 | 65.9 | 57.5 | 24.1 | 48.9 | 56.2 | 77.9 | 76.6 | 54.6 | 80.7 | 88.5 | 63.1 | 51.8 | 70.5 |
| DFN5B | ViT-H-14-378px | 38.6 | 70.9 | 64.4 | 59.4 | 47.3 | 21.9 | 46.2 | 49.8 | 73.5 | 73.4 | 45.8 | 76.9 | 88.6 | 59.9 | 49.1 | 66.7 |
| SigLIP | ViT-SO400M-14-384px | 41.4 | 76.7 | 69.3 | 64.7 | 55.4 | 24.0 | 48.4 | 54.3 | 76.2 | 77.0 | 46.1 | 79.9 | 88.8 | 63.7 | 47.3 | 68.4 |
| SigLIPv2 | ViT-SO400M-14-384px | 43.7 | 79.1 | 70.2 | 66.2 | 58.7 | 25.4 | 48.6 | 56.0 | 77.0 | 77.1 | 46.6 | 80.4 | 89.3 | 63.4 | 52.8 | 69.5 |
| RICE-ViT | ViT-L-14-378px | 48.1 | 82.6 | 75.1 | 66.2 | 58.8 | 25.8 | 49.5 | 58.0 | 76.5 | 77.6 | 54.1 | 79.0 | 89.1 | 62.9 | 51.2 | 70.1 |
| SigLIPv2 | ViT-SO400M-16-560px | 50.2 | 86.2 | 77.4 | 70.2 | 62.7 | 26.5 | 52.9 | 60.9 | 77.0 | 76.5 | 53.5 | 79.9 | 89.3 | 68.2 | 53.1 | 71.1 |
| RICE-ViT | ViT-L-14-560px | 53.2 | 87.4 | 78.1 | 69.0 | 60.7 | 26.1 | 53.0 | 61.1 | 76.9 | 78.6 | 56.3 | 79.3 | 88.9 | 65.1 | 50.5 | 70.8 |
| Qwen-ViT from Qwen2.5-VL 7B | ViT-H-14-560px | 55.9 | 85.8 | 78.8 | 73.7 | 66.2 | 26.8 | 53.4 | 62.9 | 78.8 | 78.4 | 62.0 | 80.8 | 88.6 | 64.2 | 55.0 | 72.5 |
| RICE-ViT from OV-1.5 3B | ViT-L-14-560px | 53.7 | 87.1 | 81.9 | 73.8 | 73.3 | 30.4 | 53.6 | 64.8 | 80.3 | 79.6 | 58.6 | 82.2 | 89.0 | 67.3 | 56.6 | 73.4 |

Referring Image Segmentation

Table 2. Performance comparison of referring image segmentation across vision-language models. Results are reported as IoU scores (%) on the refCOCO, refCOCO+, and refCOCOg benchmarks. Our RICE vision encoder consistently outperforms all competing approaches, achieving state-of-the-art results across all benchmarks when integrated with Qwen2.5-7B in the LLaVA-NeXT framework.

| Vision Tower | LLM | Method | refCOCO val | refCOCO testA | refCOCO testB | refCOCO+ val | refCOCO+ testA | refCOCO+ testB | refCOCOg val | refCOCOg test |
|---|---|---|---|---|---|---|---|---|---|---|
| *Previous Methods* | | | | | | | | | | |
| CLIP | Vicuna-7B | GLaMM | 79.5 | 83.2 | 76.9 | 72.6 | 78.7 | 64.6 | 74.2 | 74.9 |
| CLIP | Vicuna-7B | VisionLLMv2 | 79.2 | 82.3 | 77.0 | 68.9 | 75.8 | 61.8 | 73.3 | 74.8 |
| CLIP | Vicuna-7B | LLaVA-G-7B | 77.1 | – | – | 68.8 | – | – | 71.5 | – |
| CLIP | LLaMA2-13B | GSVA | 79.2 | 81.7 | 77.1 | 70.3 | 73.8 | 63.6 | 75.7 | 77.0 |
| CLIP | LLaMA2-13B | PixelLM-7B | 73.0 | – | – | 66.3 | – | – | 69.3 | – |
| ConvNext-L | InternLM2-7B | OMG-LLaVA | 77.2 | 79.8 | 74.1 | 68.7 | 73.0 | 61.6 | 71.7 | 71.9 |
| InternViT2.5-300M | InternLM2.5-7B | Sa2VA | 81.6 | – | – | 76.2 | – | – | 78.7 | – |
| InternViT2.5-6B | InternLM2.5-20B | Sa2VA | 82.5 | – | – | 78.8 | – | – | 79.7 | – |
| *LLaVA-1.5 Framework* | | | | | | | | | | |
| CLIP | Vicuna-7B | LISA | 74.9 | 79.1 | 72.3 | 65.1 | 70.8 | 58.1 | 67.9 | 70.6 |
| RICE | Vicuna-7B | LISA | 76.3 | 80.3 | 75.1 | 67.4 | 72.7 | 60.6 | 69.0 | 73.4 |
| | | Improvement over CLIP (avg +2.00) | +1.4 | +1.2 | +2.8 | +2.3 | +1.9 | +2.5 | +1.1 | +2.8 |
| *LLaVA-NeXT Framework* | | | | | | | | | | |
| CLIP | Qwen2.5-7B | LISA | 81.8 | 84.0 | 79.1 | 76.6 | 80.5 | 70.9 | 77.3 | 78.5 |
| MLCD | Qwen2.5-7B | LISA | 82.8 | 84.6 | 80.2 | 77.4 | 81.6 | 73.1 | 78.5 | 79.7 |
| RICE | Qwen2.5-7B | LISA | 83.5 | 85.3 | 81.7 | 79.4 | 82.8 | 75.4 | 79.8 | 80.4 |
| | | Improvement over CLIP (avg +2.45) | +1.7 | +1.3 | +2.6 | +2.8 | +2.3 | +4.5 | +2.5 | +1.9 |
| | | Improvement over MLCD (avg +1.30) | +0.7 | +0.7 | +1.5 | +2.0 | +1.2 | +2.3 | +1.3 | +0.7 |

Detection Probe

We evaluate RICE against several state-of-the-art pretrained vision encoders across multiple benchmarks. All evaluations are conducted using the Cascade Mask R-CNN framework implemented in Detectron2. Experiments are performed on the COCO and LVIS validation sets, and on the Roboflow100 benchmarks. RICE achieves superior performance across all evaluation metrics, highlighting its strong representational quality for both natural images and specialized domains.

| Method | Arch | Res | COCO Det AP | COCO Seg AP | LVIS Det AP | LVIS Seg AP | Aerial | Video Games | Microscopic | Underwater | Documents | Electromagnetic | Real World | RF100 Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DINOv2 | ViT-B/14 | 518 | 31.6 | 24.3 | 18.7 | 14.1 | 2.3 | 14.3 | 10.6 | 19.9 | 18.8 | 15.3 | 26.8 | 15.4 |
| SigLIP | ViT-B/16 | 512 | 35.0 | 28.1 | 21.8 | 17.3 | 9.4 | 29.5 | 20.0 | 29.4 | 23.4 | 18.6 | 38.0 | 24.1 |
| MLCD | ViT-B/16 | 512 | 35.6 | 28.6 | 22.1 | 17.8 | 11.4 | 19.9 | 14.9 | 21.0 | 13.3 | 15.8 | 25.0 | 17.3 |
| RICE | ViT-B/16 | 512 | 38.9 | 31.5 | 26.5 | 21.4 | 14.9 | 31.7 | 23.4 | 30.7 | 27.0 | 18.7 | 39.1 | 26.5 |
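For reference, the Roboflow100 AVG column is simply the mean of the seven per-domain box APs; e.g. for the RICE row:

```python
# Per-domain Roboflow100 box APs for the RICE row (ViT-B/16, 512px).
rice_rf100 = {
    "Aerial": 14.9, "Video Games": 31.7, "Microscopic": 23.4,
    "Underwater": 30.7, "Documents": 27.0, "Electromagnetic": 18.7,
    "Real World": 39.1,
}
avg = sum(rice_rf100.values()) / len(rice_rf100)
print(round(avg, 1))  # 26.5, matching the AVG column
```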

Tracking Probe

We evaluate the temporal matching capability of local features within the general video object tracking framework OSTrack, using an attention-probing approach to compare the four pre-trained models. Two standard vision transformer blocks are inserted between the frozen backbone and the prediction head to enable information exchange between the template and search images. As shown in Table 4, RICE achieves the best performance on nearly all metrics across LaSOT, TrackingNet, GOT-10k, and TNL2k.

| Method | LaSOT Suc. | LaSOT Pre. | LaSOT N.Pre. | TrackingNet Suc. | TrackingNet Pre. | TrackingNet N.Pre. | GOT-10k AO | GOT-10k SR-0.5 | GOT-10k SR-0.75 | TNL2k Suc. | TNL2k Pre. | TNL2k N.Pre. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DINOv2 | 55.11 | 54.99 | 65.52 | 71.20 | 64.70 | 77.70 | 53.60 | 64.90 | 35.50 | 41.95 | 36.03 | 57.40 |
| SigLIP | 55.52 | 56.16 | 65.33 | 72.60 | 66.70 | 78.70 | 53.50 | 63.10 | 35.40 | 43.90 | 39.06 | 59.03 |
| MLCD | 58.05 | 60.75 | 68.31 | 75.30 | 70.20 | 80.80 | 53.80 | 62.80 | 39.70 | 45.22 | 40.62 | 60.64 |
| RICE | 60.24 | 63.16 | 69.66 | 76.30 | 71.80 | 81.30 | 55.40 | 63.50 | 41.60 | 45.70 | 41.95 | 61.18 |

Visualization

Semantic Feature Visualization
Figure 2. Semantic feature visualization using 2048-resolution images as input to ViT-B/16. Token features are projected onto RGB channels via PCA. Sequential frames (arranged vertically) show consistent attention on salient objects (ice skaters, deer, motorcyclists, cyclists), with stable color patterns maintained throughout the sequence.
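The PCA-to-RGB projection used in Figure 2 can be reproduced with plain NumPy: project centered token features onto their top three principal components and rescale each component to [0, 1]. A toy-sized sketch (the grid and feature dimensions here are illustrative, not the actual ViT-B/16 sizes):

```python
import numpy as np

def pca_rgb(tokens: np.ndarray, h: int, w: int) -> np.ndarray:
    """Project centered token features onto their top-3 principal
    components (via SVD) and rescale each to [0, 1] as RGB channels."""
    centered = tokens - tokens.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:3].T                   # (h*w, 3)
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    rgb = (proj - lo) / (hi - lo + 1e-8)
    return rgb.reshape(h, w, 3)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((64 * 64, 256))     # toy 64x64 token grid
img = pca_rgb(tokens, 64, 64)
print(img.shape)  # (64, 64, 3)
```

The resulting array can be displayed directly as an image, so tokens with similar features share similar colors.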

Citation

If you find this work useful, please cite our paper:

@inproceedings{yinxie_2025_rice,
  title     = {Region-based Cluster Discrimination for Visual Representation Learning},
  author    = {Xie, Yin and Yang, Kaicheng and An, Xiang and Wu, Kun and Zhao, Yongle and Deng, Weimo and Ran, Zimin and Wang, Yumeng and Feng, Ziyong and Miles, Roy and Elezi, Ismail and Deng, Jiankang},
  booktitle = {ICCV},
  year      = {2025}
}