Load the released RICE-ViT checkpoint from Hugging Face and extract visual features in a few lines.
```python
import requests
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Load the released checkpoint and its paired image processor.
model = AutoModel.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")
processor = AutoProcessor.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")

# Fetch a sample image from the COCO validation set.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Preprocess and run a forward pass without tracking gradients.
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Per-token visual features for the (single) image in the batch.
features = outputs.last_hidden_state[0]
print(features.shape)
```
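The per-token features can be pooled into a single image-level embedding for retrieval. A minimal sketch of mean pooling plus L2 normalization, using a random tensor in place of the real `outputs.last_hidden_state[0]` (the token count and hidden size below are illustrative assumptions, not taken from the model config):

```python
import torch
import torch.nn.functional as F

# Stand-in for outputs.last_hidden_state[0]: (num_tokens, hidden_dim).
# Shapes here are assumptions for illustration only.
features = torch.randn(1600, 1024)

# Mean-pool the token features into one image-level embedding, then
# L2-normalize so that dot products become cosine similarities.
embedding = F.normalize(features.mean(dim=0), dim=-1)

# Cosine similarity against another (here random) image embedding.
other = F.normalize(torch.randn(1024), dim=-1)
similarity = torch.dot(embedding, other).item()
print(f"cosine similarity: {similarity:.3f}")
```

Ranking gallery images by this similarity is a standard recipe for embedding-based image retrieval.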
RICE-ViT is used as the vision encoder in the following MLLMs:
| Model | Organization | Downloads / month |
|---|---|---|
| LLaVA-OneVision-1.5-8B-Instruct | LMMs-Lab | 16,562 |
| VAETKI-VL-7B-A1B | NC-AI Consortium | 949 |
| Innovator-VL-8B-Instruct | InnovatorLab | 64 |
The Margin-based Vision Transformer (MVT) series is a family of state-of-the-art vision encoders designed for universal visual representation learning. The latest version, RICE (Region-based Cluster Discrimination), advances visual understanding by processing diverse semantic regions within an image in a single forward pass.
RICE introduces a novel approach to visual representation learning that jointly captures general visual semantics (objects, scenes), OCR semantics (text within images), and unified representations that seamlessly integrate both modalities. This enables superior performance across multiple vision tasks including image retrieval, visual question answering, and multimodal understanding.
Comparison of RICE-ViT with other vision encoders using the LLaVA-NeXT framework. All models are evaluated using identical configurations: Qwen2.5-7B as the language model, LLaVA-NeXT training data, and the same training pipeline. We adopt LLaVA-NeXT's tiling strategy (up to 2×2+1 tiles) for handling high-resolution images.
Columns InfoVQA through OCR Avg cover OCR & document understanding; AI2D through Other Avg cover general vision understanding.

| Method | Vision Tower | InfoVQA | DocVQA | ChartQA | TextVQA | OCRBench | OCRBench v2 | LiveXivQA | OCR Avg | AI2D | MMBench-EN | MME (Cog.) | MME (Per.) | POPE | RealWorldQA | MMStar | Other Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP | ViT-L-14-336px | 38.9 | 75.2 | 66.5 | 62.5 | 52.5 | 23.0 | 47.4 | 52.3 | 73.2 | 74.6 | 48.0 | 75.6 | 88.8 | 63.7 | 49.0 | 67.6 |
| MLCD | ViT-L-14-336px | 43.5 | 76.5 | 67.8 | 61.7 | 53.1 | 24.0 | 48.4 | 53.6 | 77.0 | 76.4 | 54.1 | 79.9 | 88.7 | 61.1 | 51.0 | 69.7 |
| AIMv2 | ViT-L-14-336px | 35.4 | 77.2 | 72.7 | 65.9 | 57.2 | 23.9 | 47.3 | 54.2 | 75.4 | 78.6 | 48.3 | 75.0 | 88.4 | 62.2 | 50.2 | 68.3 |
| RICE-ViT | ViT-L-14-336px | 45.2 | 79.2 | 72.3 | 65.9 | 57.5 | 24.1 | 48.9 | 56.2 | 77.9 | 76.6 | 54.6 | 80.7 | 88.5 | 63.1 | 51.8 | 70.5 |
| DFN5B | ViT-H-14-378px | 38.6 | 70.9 | 64.4 | 59.4 | 47.3 | 21.9 | 46.2 | 49.8 | 73.5 | 73.4 | 45.8 | 76.9 | 88.6 | 59.9 | 49.1 | 66.7 |
| SigLIP | ViT-SO400M-14-384px | 41.4 | 76.7 | 69.3 | 64.7 | 55.4 | 24.0 | 48.4 | 54.3 | 76.2 | 77.0 | 46.1 | 79.9 | 88.8 | 63.7 | 47.3 | 68.4 |
| SigLIPv2 | ViT-SO400M-14-384px | 43.7 | 79.1 | 70.2 | 66.2 | 58.7 | 25.4 | 48.6 | 56.0 | 77.0 | 77.1 | 46.6 | 80.4 | 89.3 | 63.4 | 52.8 | 69.5 |
| RICE-ViT | ViT-L-14-378px | 48.1 | 82.6 | 75.1 | 66.2 | 58.8 | 25.8 | 49.5 | 58.0 | 76.5 | 77.6 | 54.1 | 79.0 | 89.1 | 62.9 | 51.2 | 70.1 |
| SigLIPv2 | ViT-SO400M-16-560px | 50.2 | 86.2 | 77.4 | 70.2 | 62.7 | 26.5 | 52.9 | 60.9 | 77.0 | 76.5 | 53.5 | 79.9 | 89.3 | 68.2 | 53.1 | 71.1 |
| RICE-ViT | ViT-L-14-560px | 53.2 | 87.4 | 78.1 | 69.0 | 60.7 | 26.1 | 53.0 | 61.1 | 76.9 | 78.6 | 56.3 | 79.3 | 88.9 | 65.1 | 50.5 | 70.8 |
| Qwen-ViT from Qwen2.5-VL 7B | ViT-H-14-560px | 55.9 | 85.8 | 78.8 | 73.7 | 66.2 | 26.8 | 53.4 | 62.9 | 78.8 | 78.4 | 62.0 | 80.8 | 88.6 | 64.2 | 55.0 | 72.5 |
| RICE-ViT from OV-1.5 3B | ViT-L-14-560px | 53.7 | 87.1 | 81.9 | 73.8 | 73.3 | 30.4 | 53.6 | 64.8 | 80.3 | 79.6 | 58.6 | 82.2 | 89.0 | 67.3 | 56.6 | 73.4 |
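The tiling strategy mentioned above (up to 2×2+1 tiles) splits a high-resolution image into a grid of local tiles plus one global view. A simplified sketch of the crop-box arithmetic, assuming a fixed 2×2 grid (the real LLaVA-NeXT pipeline additionally resizes the image and selects the grid shape from aspect-ratio presets):

```python
def tiling_boxes(width: int, height: int, grid: int = 2):
    """Crop boxes (left, top, right, bottom) for grid tiling:
    one box for the global view plus grid*grid local tiles.

    Simplified sketch of 2x2+1 tiling, not the exact LLaVA-NeXT code.
    """
    tile_w, tile_h = width // grid, height // grid
    boxes = [(0, 0, width, height)]  # global (thumbnail) view
    for row in range(grid):
        for col in range(grid):
            left, top = col * tile_w, row * tile_h
            boxes.append((left, top, left + tile_w, top + tile_h))
    return boxes

boxes = tiling_boxes(1120, 1120)  # 1 global + 2x2 local = 5 views
print(len(boxes))  # -> 5
```

Each box would then be cropped, resized to the encoder's input resolution, and encoded independently.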
Table 2. Performance comparison of referring image segmentation across vision-language models. Results are reported as IoU scores (%) on the refCOCO, refCOCO+, and refCOCOg benchmarks. Our RICE vision encoder consistently outperforms all competing approaches, achieving state-of-the-art results across all benchmarks when integrated with Qwen2.5-7B in the LLaVA-NeXT framework.
| Vision Tower | LLM | Method | refCOCO val | refCOCO testA | refCOCO testB | refCOCO+ val | refCOCO+ testA | refCOCO+ testB | refCOCOg val | refCOCOg test |
|---|---|---|---|---|---|---|---|---|---|---|
| Previous Methods | ||||||||||
| CLIP | Vicuna-7B | GLaMM | 79.5 | 83.2 | 76.9 | 72.6 | 78.7 | 64.6 | 74.2 | 74.9 |
| CLIP | Vicuna-7B | VisionLLMv2 | 79.2 | 82.3 | 77.0 | 68.9 | 75.8 | 61.8 | 73.3 | 74.8 |
| CLIP | Vicuna-7B | LLaVA-G-7B | 77.1 | – | – | 68.8 | – | – | 71.5 | – |
| CLIP | LLaMA2-13B | GSVA | 79.2 | 81.7 | 77.1 | 70.3 | 73.8 | 63.6 | 75.7 | 77.0 |
| CLIP | LLaMA2-13B | PixelLM-7B | 73.0 | – | – | 66.3 | – | – | 69.3 | – |
| ConvNext-L | InternLM2-7B | OMG-LLaVA | 77.2 | 79.8 | 74.1 | 68.7 | 73.0 | 61.6 | 71.7 | 71.9 |
| InternViT2.5-300M | InternLM2.5-7B | Sa2VA | 81.6 | – | – | 76.2 | – | – | 78.7 | – |
| InternViT2.5-6B | InternLM2.5-20B | Sa2VA | 82.5 | – | – | 78.8 | – | – | 79.7 | – |
| LLaVA-1.5 Framework | ||||||||||
| CLIP | Vicuna-7B | LISA | 74.9 | 79.1 | 72.3 | 65.1 | 70.8 | 58.1 | 67.9 | 70.6 |
| RICE | Vicuna-7B | LISA | 76.3 | 80.3 | 75.1 | 67.4 | 72.7 | 60.6 | 69.0 | 73.4 |
| Improvement over CLIP (avg: +2.00 ↑) | | | +1.4 | +1.2 | +2.8 | +2.3 | +1.9 | +2.5 | +1.1 | +2.8 |
| LLaVA-NeXT Framework | ||||||||||
| CLIP | Qwen2.5-7B | LISA | 81.8 | 84.0 | 79.1 | 76.6 | 80.5 | 70.9 | 77.3 | 78.5 |
| MLCD | Qwen2.5-7B | LISA | 82.8 | 84.6 | 80.2 | 77.4 | 81.6 | 73.1 | 78.5 | 79.7 |
| RICE | Qwen2.5-7B | LISA | 83.5 | 85.3 | 81.7 | 79.4 | 82.8 | 75.4 | 79.8 | 80.4 |
| Improvement over CLIP (avg: +2.45 ↑) | | | +1.7 | +1.3 | +2.6 | +2.8 | +2.3 | +4.5 | +2.5 | +1.9 |
| Improvement over MLCD (avg: +1.30 ↑) | | | +0.7 | +0.7 | +1.5 | +2.0 | +1.2 | +2.3 | +1.3 | +0.7 |
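The referring-segmentation scores above are IoU values: the overlap between a predicted mask and the ground-truth mask, divided by their union. A minimal reference implementation for binary masks (a sketch of the metric's definition, not the evaluation code used for these benchmarks):

```python
def mask_iou(pred, gt):
    """Intersection-over-union of two binary masks, given as flat 0/1 lists."""
    inter = sum(p & g for p, g in zip(pred, gt))
    union = sum(p | g for p, g in zip(pred, gt))
    return inter / union if union else 1.0  # two empty masks: perfect match

# Toy 4x4 masks, flattened row-major: pred = top half, gt = left half.
pred = [1] * 8 + [0] * 8
gt = [1, 1, 0, 0] * 4
print(round(mask_iou(pred, gt), 4))  # 4 overlapping pixels / 12 in the union
```

Benchmark scores aggregate this per-example overlap across the evaluation split.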
We evaluate RICE against several state-of-the-art pretrained vision encoders across multiple benchmarks. All evaluations are conducted using the Cascade Mask R-CNN framework implemented in Detectron2. Experiments are performed on the COCO and LVIS validation sets, and on the Roboflow100 benchmarks. RICE achieves superior performance across all evaluation metrics, highlighting its strong representational quality for both natural images and specialized domains.
| Method | Arch | Res | COCO Det AP | COCO Seg AP | LVIS Det AP | LVIS Seg AP | Aerial | Video Games | Microscopic | Underwater | Documents | Electromagnetic | Real World | RF100 Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DINOv2 | ViT-B/14 | 518 | 31.6 | 24.3 | 18.7 | 14.1 | 2.3 | 14.3 | 10.6 | 19.9 | 18.8 | 15.3 | 26.8 | 15.4 |
| SigLIP | ViT-B/16 | 512 | 35.0 | 28.1 | 21.8 | 17.3 | 9.4 | 29.5 | 20.0 | 29.4 | 23.4 | 18.6 | 38.0 | 24.1 |
| MLCD | ViT-B/16 | 512 | 35.6 | 28.6 | 22.1 | 17.8 | 11.4 | 19.9 | 14.9 | 21.0 | 13.3 | 15.8 | 25.0 | 17.3 |
| RICE | ViT-B/16 | 512 | 38.9 | 31.5 | 26.5 | 21.4 | 14.9 | 31.7 | 23.4 | 30.7 | 27.0 | 18.7 | 39.1 | 26.5 |
We evaluate the temporal matching capability of local features within the general video object tracking framework OSTrack, adopting an attention probing approach to compare the four pre-trained models. Two standard vision transformer blocks are inserted between the frozen backbone and the prediction head to enhance information exchange between the template and search images. As shown in Table 4, RICE achieves the best performance across all metrics on LaSOT, TrackingNet, GOT-10k, and TNL2k.
| Method | LaSOT Suc. | LaSOT Pre. | LaSOT Norm. Pre. | TrackingNet Suc. | TrackingNet Pre. | TrackingNet Norm. Pre. | GOT-10k AO | GOT-10k SR-0.5 | GOT-10k SR-0.75 | TNL2k Suc. | TNL2k Pre. | TNL2k Norm. Pre. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DINOv2 | 55.11 | 54.99 | 65.52 | 71.20 | 64.70 | 77.70 | 53.60 | 64.90 | 35.50 | 41.95 | 36.03 | 57.40 |
| SigLIP | 55.52 | 56.16 | 65.33 | 72.60 | 66.70 | 78.70 | 53.50 | 63.10 | 35.40 | 43.90 | 39.06 | 59.03 |
| MLCD | 58.05 | 60.75 | 68.31 | 75.30 | 70.20 | 80.80 | 53.80 | 62.80 | 39.70 | 45.22 | 40.62 | 60.64 |
| RICE | 60.24 | 63.16 | 69.66 | 76.30 | 71.80 | 81.30 | 55.40 | 63.50 | 41.60 | 45.70 | 41.95 | 61.18 |
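The GOT-10k columns summarize per-frame IoU overlaps between predicted and ground-truth boxes: AO is the mean overlap, and SR-0.5 / SR-0.75 are the fractions of frames whose overlap exceeds those thresholds. A sketch of the metric definitions (not the official GOT-10k toolkit):

```python
def tracking_metrics(overlaps):
    """AO (average overlap) and success rates at IoU 0.5 / 0.75
    from a list of per-frame IoU values in [0, 1]."""
    n = len(overlaps)
    ao = sum(overlaps) / n
    sr50 = sum(o >= 0.5 for o in overlaps) / n
    sr75 = sum(o >= 0.75 for o in overlaps) / n
    return ao, sr50, sr75

# Toy sequence of four frames with their per-frame IoU.
ao, sr50, sr75 = tracking_metrics([0.9, 0.6, 0.4, 0.8])
print(ao, sr50, sr75)  # -> 0.675 0.75 0.5
```

LaSOT/TrackingNet/TNL2k success scores are built from the same per-frame overlaps, averaged over a sweep of thresholds.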
If you find this work useful, please cite our paper:
```bibtex
@inproceedings{yinxie_2025_rice,
  title     = {Region-based Cluster Discrimination for Visual Representation Learning},
  author    = {Xie, Yin and Yang, Kaicheng and An, Xiang and Wu, Kun and Zhao, Yongle and Deng, Weimo and Ran, Zimin and Wang, Yumeng and Feng, Ziyong and Miles, Roy and Elezi, Ismail and Deng, Jiankang},
  booktitle = {ICCV},
  year      = {2025}
}
```