RICE

ICCV 2025 · Highlight
Yin Xie*1 Kaicheng Yang*1 Xiang An*1 Kun Wu1 Yongle Zhao1 Weimo Deng1 Zimin Ran1 Yumeng Wang1 Ziyong Feng1 Roy Miles2 Ismail Elezi2 Jiankang Deng2
1 GlintLab    2 Imperial College London    * Equal contribution

How to use it

Quickstart

Load the released RICE-ViT checkpoint from Hugging Face and extract visual features in a few lines.

from transformers import AutoModel, AutoProcessor
from PIL import Image
import requests
import torch

model = AutoModel.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")
processor = AutoProcessor.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

features = outputs.last_hidden_state[0]  # (num_tokens, hidden_dim); token 0 is the class token
print(features.shape)
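The returned `last_hidden_state` is a sequence of token embeddings. A common way to get a single image-level vector (a convention, not something the checkpoint prescribes) is to mean-pool the patch tokens and L2-normalize. A minimal NumPy sketch on a dummy array — the 1601×1024 shape is an assumption based on ViT-L/14 at 560px (40×40 patches plus the class token):

```python
import numpy as np

def pool_image_embedding(tokens: np.ndarray) -> np.ndarray:
    """Mean-pool the patch tokens (skipping the class token at index 0)
    and L2-normalize, yielding a single vector per image."""
    patch_tokens = tokens[1:]            # drop the class token
    emb = patch_tokens.mean(axis=0)      # (hidden_dim,)
    return emb / np.linalg.norm(emb)     # unit norm, ready for cosine similarity

# Dummy stand-in for outputs.last_hidden_state[0]:
# 1601 tokens (40x40 patches + class token) x 1024 dims.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((1601, 1024))
emb = pool_image_embedding(tokens)
print(emb.shape)  # (1024,)
```

The unit-norm output lets you compare images directly with a dot product.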

Adopted By

RICE-ViT is used as the vision encoder in the following MLLMs:

| Model | Organization | Downloads / month |
|---|---|---|
| LLaVA-OneVision-1.5-8B-Instruct | LMMs-Lab | 16,562 |
| VAETKI-VL-7B-A1B | NC-AI Consortium | 949 |
| Innovator-VL-8B-Instruct | InnovatorLab | 64 |

The Margin-based Vision Transformer (MVT) series is a family of state-of-the-art vision encoders for universal visual representation learning. Its latest member, RICE (Region-based Cluster Discrimination), advances visual understanding by processing diverse semantic regions within an image in a single forward pass.

RICE introduces a region-based approach to visual representation learning that jointly captures general visual semantics (objects, scenes), OCR semantics (text within images), and a unified representation that integrates both. This yields strong performance across vision tasks including image retrieval, visual question answering, and multimodal understanding.

Architecture

[Architecture diagram] Pipeline: the input image (H×W×3) is encoded by the Vision Transformer in a single forward pass, yielding a token grid (H×W patch tokens + 1 class token). A Region Transformer samples regions by mask (ROI Align) and applies region attention with a region-specific visibility mask, producing fixed-length region embeddings (object tokens and OCR tokens). Supervision is unified: an object region loss (single-label cluster discrimination against semantic cluster centers) and an OCR region loss (multi-label cluster discrimination against vocabularies / token embeddings).

Figure 1. RICE architecture efficiently processes diverse semantic regions within images using region-based cluster discrimination. The model jointly captures general visual semantics, OCR semantics, and unified representations in a single forward pass.
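As an illustration of the region pipeline above, here is a toy NumPy sketch: region features are pooled from the token grid with a binary mask (a crude stand-in for ROI Align plus masked region attention) and scored against cluster centers, as in single-label cluster discrimination. All shapes and the temperature value are illustrative, not taken from the paper:

```python
import numpy as np

def region_embedding(token_grid: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Pool the ViT tokens inside a binary region mask into one fixed-length
    embedding -- a simplified stand-in for the paper's ROI Align + region
    attention with a region-specific visibility mask."""
    feats = token_grid[mask.astype(bool)]   # (n_region_tokens, dim)
    emb = feats.mean(axis=0)
    return emb / np.linalg.norm(emb)

def cluster_logits(region_emb: np.ndarray, centers: np.ndarray,
                   temperature: float = 0.07) -> np.ndarray:
    """Single-label cluster discrimination head: cosine similarity of the
    region embedding to each semantic cluster center, temperature-scaled."""
    centers = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    return region_emb @ centers.T / temperature

rng = np.random.default_rng(0)
grid = rng.standard_normal((40, 40, 1024))   # toy 40x40 token grid
mask = np.zeros((40, 40))
mask[10:20, 5:15] = 1                        # one object region
r_emb = region_embedding(grid, mask)
logits = cluster_logits(r_emb, rng.standard_normal((1000, 1024)))
print(logits.shape)  # (1000,): one logit per cluster center
```

A softmax cross-entropy over these logits against the region's assigned cluster would complete the discrimination loss; the OCR branch uses a multi-label variant over token vocabularies.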

LLaVA Experimental Results

Comparison of RICE-ViT with other vision encoders using the LLaVA-NeXT framework. All models are evaluated using identical configurations: Qwen2.5-7B as the language model, LLaVA-NeXT training data, and the same training pipeline. We adopt LLaVA-NeXT's tiling strategy (up to 2×2+1 tiles) for handling high-resolution images.
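The "2×2+1" tiling can be sketched as follows — a hedged approximation, since the actual LLaVA-NeXT implementation also selects the grid from the image's aspect ratio; here we fix a 2×2 grid for illustration:

```python
def llava_next_tiles(width: int, height: int, grid=(2, 2)):
    """Return crop boxes (left, top, right, bottom) for a grid of local
    tiles plus one global view of the whole image (the '+1')."""
    cols, rows = grid
    tw, th = width // cols, height // rows
    boxes = [(c * tw, r * th, (c + 1) * tw, (r + 1) * th)
             for r in range(rows) for c in range(cols)]
    boxes.append((0, 0, width, height))  # the global tile
    return boxes

tiles = llava_next_tiles(1120, 1120)
print(len(tiles))  # 5: four 560x560 crops + one global view
```

Each box can be cropped and resized to the encoder's input resolution before being passed through the vision tower.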

Columns InfoVQA through OCR Avg cover OCR & document understanding; AI2D through Other Avg cover general vision understanding.

| Method | Vision Tower | InfoVQA | DocVQA | ChartQA | TextVQA | OCRBench | OCRBenchV2 | LiveXivQA | OCR Avg | AI2D | MMBench-EN | MME-Cog | MME-Per | POPE | RealWorldQA | MMStar | Other Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP | ViT-L-14-336px | 38.9 | 75.2 | 66.5 | 62.5 | 52.5 | 23.0 | 47.4 | 52.3 | 73.2 | 74.6 | 48.0 | 75.6 | 88.8 | 63.7 | 49.0 | 67.6 |
| MLCD | ViT-L-14-336px | 43.5 | 76.5 | 67.8 | 61.7 | 53.1 | 24.0 | 48.4 | 53.6 | 77.0 | 76.4 | 54.1 | 79.9 | 88.7 | 61.1 | 51.0 | 69.7 |
| AIMv2 | ViT-L-14-336px | 35.4 | 77.2 | 72.7 | 65.9 | 57.2 | 23.9 | 47.3 | 54.2 | 75.4 | 78.6 | 48.3 | 75.0 | 88.4 | 62.2 | 50.2 | 68.3 |
| RICE-ViT | ViT-L-14-336px | 45.2 | 79.2 | 72.3 | 65.9 | 57.5 | 24.1 | 48.9 | 56.2 | 77.9 | 76.6 | 54.6 | 80.7 | 88.5 | 63.1 | 51.8 | 70.5 |
| DFN5B | ViT-H-14-378px | 38.6 | 70.9 | 64.4 | 59.4 | 47.3 | 21.9 | 46.2 | 49.8 | 73.5 | 73.4 | 45.8 | 76.9 | 88.6 | 59.9 | 49.1 | 66.7 |
| SigLIP | ViT-SO400M-14-384px | 41.4 | 76.7 | 69.3 | 64.7 | 55.4 | 24.0 | 48.4 | 54.3 | 76.2 | 77.0 | 46.1 | 79.9 | 88.8 | 63.7 | 47.3 | 68.4 |
| SigLIPv2 | ViT-SO400M-14-384px | 43.7 | 79.1 | 70.2 | 66.2 | 58.7 | 25.4 | 48.6 | 56.0 | 77.0 | 77.1 | 46.6 | 80.4 | 89.3 | 63.4 | 52.8 | 69.5 |
| RICE-ViT | ViT-L-14-378px | 48.1 | 82.6 | 75.1 | 66.2 | 58.8 | 25.8 | 49.5 | 58.0 | 76.5 | 77.6 | 54.1 | 79.0 | 89.1 | 62.9 | 51.2 | 70.1 |
| SigLIPv2 | ViT-SO400M-16-560px | 50.2 | 86.2 | 77.4 | 70.2 | 62.7 | 26.5 | 52.9 | 60.9 | 77.0 | 76.5 | 53.5 | 79.9 | 89.3 | 68.2 | 53.1 | 71.1 |
| RICE-ViT | ViT-L-14-560px | 53.2 | 87.4 | 78.1 | 69.0 | 60.7 | 26.1 | 53.0 | 61.1 | 76.9 | 78.6 | 56.3 | 79.3 | 88.9 | 65.1 | 50.5 | 70.8 |
| Qwen-ViT from Qwen2.5-VL 7B | ViT-H-14-560px | 55.9 | 85.8 | 78.8 | 73.7 | 66.2 | 26.8 | 53.4 | 62.9 | 78.8 | 78.4 | 62.0 | 80.8 | 88.6 | 64.2 | 55.0 | 72.5 |
| RICE-ViT from OV-1.5 3B | ViT-L-14-560px | 53.7 | 87.1 | 81.9 | 73.8 | 73.3 | 30.4 | 53.6 | 64.8 | 80.3 | 79.6 | 58.6 | 82.2 | 89.0 | 67.3 | 56.6 | 73.4 |

Referring Image Segmentation

Table 2. Performance comparison of referring image segmentation across vision-language models. Results are reported as IoU scores (%) on the refCOCO, refCOCO+, and refCOCOg benchmarks. Our RICE vision encoder consistently outperforms all competing approaches, achieving state-of-the-art results across all benchmarks when integrated with Qwen2.5-7B in the LLaVA-NeXT framework.

| Vision Tower | LLM | Method | refCOCO val | refCOCO testA | refCOCO testB | refCOCO+ val | refCOCO+ testA | refCOCO+ testB | refCOCOg val | refCOCOg test |
|---|---|---|---|---|---|---|---|---|---|---|
| *Previous Methods* | | | | | | | | | | |
| CLIP | Vicuna-7B | GLaMM | 79.5 | 83.2 | 76.9 | 72.6 | 78.7 | 64.6 | 74.2 | 74.9 |
| CLIP | Vicuna-7B | VisionLLMv2 | 79.2 | 82.3 | 77.0 | 68.9 | 75.8 | 61.8 | 73.3 | 74.8 |
| CLIP | Vicuna-7B | LLaVA-G-7B | 77.1 | – | – | 68.8 | – | – | 71.5 | – |
| CLIP | LLaMA2-13B | GSVA | 79.2 | 81.7 | 77.1 | 70.3 | 73.8 | 63.6 | 75.7 | 77.0 |
| CLIP | LLaMA2-13B | PixelLM-7B | 73.0 | – | – | 66.3 | – | – | 69.3 | – |
| ConvNext-L | InternLM2-7B | OMG-LLaVA | 77.2 | 79.8 | 74.1 | 68.7 | 73.0 | 61.6 | 71.7 | 71.9 |
| InternViT2.5-300M | InternLM2.5-7B | Sa2VA | 81.6 | – | – | 76.2 | – | – | 78.7 | – |
| InternViT2.5-6B | InternLM2.5-20B | Sa2VA | 82.5 | – | – | 78.8 | – | – | 79.7 | – |
| *LLaVA-1.5 Framework* | | | | | | | | | | |
| CLIP | Vicuna-7B | LISA | 74.9 | 79.1 | 72.3 | 65.1 | 70.8 | 58.1 | 67.9 | 70.6 |
| RICE | Vicuna-7B | LISA | 76.3 | 80.3 | 75.1 | 67.4 | 72.7 | 60.6 | 69.0 | 73.4 |
| | | Improvement over CLIP (avg +2.00) | +1.4 | +1.2 | +2.8 | +2.3 | +1.9 | +2.5 | +1.1 | +2.8 |
| *LLaVA-NeXT Framework* | | | | | | | | | | |
| CLIP | Qwen2.5-7B | LISA | 81.8 | 84.0 | 79.1 | 76.6 | 80.5 | 70.9 | 77.3 | 78.5 |
| MLCD | Qwen2.5-7B | LISA | 82.8 | 84.6 | 80.2 | 77.4 | 81.6 | 73.1 | 78.5 | 79.7 |
| RICE | Qwen2.5-7B | LISA | 83.5 | 85.3 | 81.7 | 79.4 | 82.8 | 75.4 | 79.8 | 80.4 |
| | | Improvement over CLIP (avg +2.45) | +1.7 | +1.3 | +2.6 | +2.8 | +2.3 | +4.5 | +2.5 | +1.9 |
| | | Improvement over MLCD (avg +1.30) | +0.7 | +0.7 | +1.5 | +2.0 | +1.2 | +2.3 | +1.3 | +0.7 |

Detection Probe

We evaluate RICE against several state-of-the-art pretrained vision encoders across multiple benchmarks. All evaluations are conducted using the Cascade Mask R-CNN framework implemented in Detectron2. Experiments are performed on the COCO and LVIS validation sets, and on the Roboflow100 benchmarks. RICE achieves superior performance across all evaluation metrics, highlighting its strong representational quality for both natural images and specialized domains.

| Method | Arch | Res | COCO Det AP | COCO Seg AP | LVIS Det AP | LVIS Seg AP | Aerial | Video Games | Microscopic | Underwater | Documents | Electromagnetic | Real World | RF100 Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DINOv2 | ViT-B/14 | 518 | 31.6 | 24.3 | 18.7 | 14.1 | 2.3 | 14.3 | 10.6 | 19.9 | 18.8 | 15.3 | 26.8 | 15.4 |
| SigLIP | ViT-B/16 | 512 | 35.0 | 28.1 | 21.8 | 17.3 | 9.4 | 29.5 | 20.0 | 29.4 | 23.4 | 18.6 | 38.0 | 24.1 |
| MLCD | ViT-B/16 | 512 | 35.6 | 28.6 | 22.1 | 17.8 | 11.4 | 19.9 | 14.9 | 21.0 | 13.3 | 15.8 | 25.0 | 17.3 |
| RICE | ViT-B/16 | 512 | 38.9 | 31.5 | 26.5 | 21.4 | 14.9 | 31.7 | 23.4 | 30.7 | 27.0 | 18.7 | 39.1 | 26.5 |
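For reference, the Roboflow100 AVG column is simply the mean of the seven per-domain box APs; e.g. for the RICE row:

```python
# Per-domain Roboflow100 box APs for the RICE row (ViT-B/16, 512px).
rice_rf100 = {
    "Aerial": 14.9, "Video Games": 31.7, "Microscopic": 23.4,
    "Underwater": 30.7, "Documents": 27.0, "Electromagnetic": 18.7,
    "Real World": 39.1,
}
avg = sum(rice_rf100.values()) / len(rice_rf100)
print(round(avg, 1))  # 26.5, matching the AVG column
```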

Tracking Probe

We evaluate the temporal matching capability of local features within the general video object tracking framework OSTrack, using an attention-probing approach to compare the four pre-trained models. Two standard vision transformer blocks are inserted between the frozen backbone and the prediction head to enable information exchange between the template and search images. As shown in Table 4, RICE achieves the best performance on nearly all metrics across LaSOT, TrackingNet, GOT-10k, and TNL2k.

| Method | LaSOT Suc. | LaSOT Pre. | LaSOT N.Pre. | TrackingNet Suc. | TrackingNet Pre. | TrackingNet N.Pre. | GOT-10k AO | GOT-10k SR-0.5 | GOT-10k SR-0.75 | TNL2k Suc. | TNL2k Pre. | TNL2k N.Pre. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DINOv2 | 55.11 | 54.99 | 65.52 | 71.20 | 64.70 | 77.70 | 53.60 | 64.90 | 35.50 | 41.95 | 36.03 | 57.40 |
| SigLIP | 55.52 | 56.16 | 65.33 | 72.60 | 66.70 | 78.70 | 53.50 | 63.10 | 35.40 | 43.90 | 39.06 | 59.03 |
| MLCD | 58.05 | 60.75 | 68.31 | 75.30 | 70.20 | 80.80 | 53.80 | 62.80 | 39.70 | 45.22 | 40.62 | 60.64 |
| RICE | 60.24 | 63.16 | 69.66 | 76.30 | 71.80 | 81.30 | 55.40 | 63.50 | 41.60 | 45.70 | 41.95 | 61.18 |

Visualization

Semantic Feature Visualization
Figure 2. Semantic feature visualization using 2048-resolution images as input to ViT-B/16. Token features are projected onto RGB channels via PCA. Sequential frames (arranged vertically) show consistent attention on salient objects (ice skaters, deer, motorcyclists, cyclists), with stable color patterns maintained throughout the sequence.
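The PCA-to-RGB projection used in Figure 2 can be reproduced with plain NumPy: project centered token features onto their top three principal components and rescale each component to [0, 1]. A toy-sized sketch (the grid and feature dimensions here are illustrative, not the actual ViT-B/16 sizes):

```python
import numpy as np

def pca_rgb(tokens: np.ndarray, h: int, w: int) -> np.ndarray:
    """Project centered token features onto their top-3 principal
    components (via SVD) and rescale each to [0, 1] as RGB channels."""
    centered = tokens - tokens.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:3].T                   # (h*w, 3)
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    rgb = (proj - lo) / (hi - lo + 1e-8)
    return rgb.reshape(h, w, 3)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((64 * 64, 256))     # toy 64x64 token grid
img = pca_rgb(tokens, 64, 64)
print(img.shape)  # (64, 64, 3)
```

The resulting array can be displayed directly as an image, so tokens with similar features share similar colors.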

Citation

If you find this work useful, please cite our paper:

@inproceedings{yinxie_2025_rice,
  title     = {Region-based Cluster Discrimination for Visual Representation Learning},
  author    = {Xie, Yin and Yang, Kaicheng and An, Xiang and Wu, Kun and Zhao, Yongle and Deng, Weimo and Ran, Zimin and Wang, Yumeng and Feng, Ziyong and Miles, Roy and Elezi, Ismail and Deng, Jiankang},
  booktitle = {ICCV},
  year      = {2025}
}