VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning
VRAG.mp4
VRAG.mp4
- 🎉 We have released the demo, allowing you to customize your own VRAG.
- 🎉 Our framework integrates SOTA visual embedding models, enabling you to create your own retriever.
- 🎉 We are releasing our 7B model and will gradually release models in more sizes. Welcome to use!
- We introduce VRAG, a purely visual RAG agent that enables VLMs to progressively gather information from a coarse-grained to a fine-grained perspective.
- We propose VRAG-RL, a novel reinforcement learning framework tailored for training VLMs to effectively reason, retrieve, and understand visually rich information.
- We have released the training framework of VRAG-RL, a novel multi-turn and multimodal training framework with strong extensibility, capable of supporting training with various tools.
Please refer to run_demo.sh in the project root to quickly start the demo. Below is a step-by-step guide to help you run the demo on our example data:
Note: The Quick Start commands should be run from the project root (
VRAG/) directory.
# Create environment
conda create -n vrag python=3.10
# Clone project
git clone https://github.com/alibaba-nlp/VRAG.git
cd VRAG
# Install requirements for demo only
pip install -r VRAG-RL/requirements_demo.txtFirst, you need to launch the search engine, which utilizes the ColPali embedding model family. It is preferable to deploy the search engine independently on a single GPU.
# Deploy search engine server (run from project root VRAG/)
python search_engine/search_engine_api.pyThen download the model and deploy the server using vllm. For a 7B model, it can be deployed on a single A100 80G GPU.
vllm serve autumncc/Qwen2.5-VL-7B-VRAG --port 8002 --host 0.0.0.0 --limit-mm-per-prompt image=10 --served-model-name Qwen/Qwen2.5-VL-7B-InstructFinally, use Streamlit to launch the demo.
# Run from project root VRAG/
streamlit run demo/app.pyBelow is a step-by-step guide to help you run the VRAG on your own corpus, the entire process is divided into three steps:
- The 1st and 2nd step are aimed at building your own purely vision-based search engine,
- The 3rd step, similar to the quick start, is to launch the demo.
You should first convert your document to .jpg and store it in the search_engine/corpus/image/ directory using the script search_engine/corpus/pdf2images.py.
Our framework is built on the foundation of the Llama-Index. We preprocess the corpus in advance and then establish an index database.
The embedding models are located in search_engine/models/. You can test and use the search engine directly:
Try using the search engine in search_engine/search_engine.py (from project root):
from search_engine.search_engine import SearchEngine
# Initialize engine
search_engine = SearchEngine(
dataset_dir='search_engine/corpus',
node_dir_prefix='colqwen_ingestion',
embed_model_name='vidore/colqwen2-v1.0'
)
# Retrieve some results
recall_results = search_engine.batch_search(['some query A', 'some query B'])Once the corpus and models for the search engine are prepared, you can directly run the search engine API server:
# Run search engine server with FastAPI (from project root VRAG/)
python search_engine/search_engine_api.pyJust like in the quick start guide, you can run the demo after deploying the VLM service:
vllm serve Qwen/Qwen2.5-VL-7B-Instruct --port 8002 --host 0.0.0.0 --limit-mm-per-prompt image=10 --served-model-name Qwen/Qwen2.5-VL-7B-InstructUse Streamlit to launch the demo (from project root VRAG/).
streamlit run demo/app.pyOptionally, you can directly use our script for generation in demo/vrag_agent.py or integrate it into your own framework:
from demo.vrag_agent import VRAG
vrag = VRAG(
base_url='http://0.0.0.0:8002/v1',
search_url='http://0.0.0.0:8001/search',
generator=False
)
answer = vrag.run('What is the capital of France?')cd VRAG/VRAG-RL
# Install requirements for training
pip install -r requirements_train.txt
# Install the VRAG-RL package
pip install -e .Please download the original document repositories and queries for each benchmark separately from SlideVQA, ViDoSeek and MMLongBench-Doc. For training, we mixed part of the SlideVQA training set to create the training data. The SlideVQA-train can be used as an example to construct SFT data and RL data. During evaluation, we suggest merge all benchmark corpora into a single corpus to create a more challenging setting that simulates real-world scenarios.
Organize all data into the following format. Reference examples are provided in the examples/ directory.
{
"uid": "04d8bb0db929110f204723c56e5386c1d8d21587_2",
"query": "What is the temperature of Steam explosion of Pretreatment for Switchgrass and Sugarcane bagasse preparation?",
"reference_answer": "195-205 Centigrade",
"meta_info": {
"file_name": "Pretreatment_of_Switchgrass.pdf",
"reference_page": [10, 11],
"source_type": "Text",
"query_type": "Multi-Hop"
}
}Use the script scripts/hf_dataset_convert.py to convert the unified format to Parquet.
# Run from VRAG-RL/ directory
python scripts/hf_dataset_convert.pyFollow the above section to construct your own corpus and start the search engine.
To construct high-quality data using scripts/data_construct_pipeline.py, you can use DashScope based on Alibaba Cloud. You need to set the environment variable DASHSCOPE_API_KEY:
export DASHSCOPE_API_KEY=xxxPlease note that for expert models, we recommend using models with consistent coordinate systems. If different models are used, it is necessary to map the coordinates to the same coordinate system.
def convert_to_qwen25vl_format(bbox, orig_height, orig_width, factor=28, min_pixels=56*56, max_pixels=14*14*4*1280):
new_height, new_width = smart_resize(orig_height, orig_width, factor, min_pixels, max_pixels)
scale_w = new_width / orig_width
scale_h = new_height / orig_height
x1, y1, x2, y2 = bbox
x1_new = round(x1 * scale_w)
y1_new = round(y1 * scale_h)
x2_new = round(x2 * scale_w)
y2_new = round(y2 * scale_h)
x1_new = max(0, min(x1_new, new_width - 1))
y1_new = max(0, min(y1_new, new_height - 1))
x2_new = max(0, min(x2_new, new_width - 1))
y2_new = max(0, min(y2_new, new_height - 1))
return [x1_new, y1_new, x2_new, y2_new]Here, you can use the script scripts/cot_convert_sft.py to convert the sampled data into the LLaMA-Factory format and then proceed with training using LLaMA-Factory. When fine-tuning the Qwen2.5VL model, please pay special attention to the maximum and minimum values of the coordinates. You need to normalize the coordinates and images to the same scale, This is also the key to the crop&zoom action:
You can find relevant reference code in the https://github.com/QwenLM/Qwen3-VL/blob/main/qwen-vl-finetune/tools/process_bbox.ipynb, and our code also includes these functions, which you can use directly.
You can customize your own training reward function in verl/workers/reward_manager/rm.py. In this project, we simply modify the reward manager to implement a model-based reward. You can choose to deploy your own model with vLLM or use an API.
# works num for reward model, depends on your qps
reward_model.rm_workers_num=10 \
# reward model url, if you deploy your own model, you can use your own model here
reward_model.rm_url="https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions" \
# reward model key, if you deploy model with vLLM, you can use "EMPTY"
reward_model.rm_key=$DASHSCOPE_API_KEY \
# reward model name
reward_model.rm_model_name="qwen-max-latest" \You can customize your own rollout module in vrag_agent/generation.py. The main function is run_llm_loop, which contains Generation -> Parse Action -> Observation -> Check Termination:
- Generation
generate_with_gpu_paddingpads the training batch and performs generation. - Parse Action
execute_predictionsinterprets the model's output and call API based on various actions to obtain the raw observation. - Observation
process_next_obsinserts the retrieved or cropped images into the context after processing. - In the final step, check if there are any trajectories with unfinished interactions. For those trajectories that have not retrieved images, add image padding to facilitate batch generation by the vLLM engine.
# Run from VRAG-RL/ directory
./train_grpo_qwen2_5_vl_7b.shThis work is implemented based on ViDoRAG, LLaMA-Factory, Search-R1, and verl. We greatly appreciate their valuable contributions to the community.
@misc{wang2025vragrlempowervisionperceptionbasedrag,
title={VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning},
author={Qiuchen Wang and Ruixue Ding and Yu Zeng and Zehui Chen and Lin Chen and Shihang Wang and Pengjun Xie and Fei Huang and Feng Zhao},
year={2025},
eprint={2505.22019},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.22019},
}


