Fast multimodal reasoning, fine-grained visual grounding, and readiness for practical deployment.
Main Components:
- **Cross-modal Modulator (CMM):** replaces heavy cross-attention fusion with a lightweight modulator, paired with state-space modeling for fast generation.
- **Top-K grid selection:** computes token–grid correlations and keeps only the Top-K grids per token, so each text token focuses on the most relevant visual regions.
- **FiLM conditioning:** uses Feature-wise Linear Modulation (FiLM) to inject image-conditioned signals into text features without quadratic attention overhead.
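The Top-K selection and FiLM steps above can be sketched as follows. This is a minimal illustration, not the repo's actual implementation: the class name, tensor shapes, and the single-head, per-token design are assumptions.

```python
import torch
import torch.nn as nn

class CMMSketch(nn.Module):
    """Hypothetical sketch of a Cross-modal Modulator:
    Top-K grid selection followed by FiLM modulation."""

    def __init__(self, dim, k=4):
        super().__init__()
        self.k = k
        self.gamma = nn.Linear(dim, dim)  # FiLM scale projection
        self.beta = nn.Linear(dim, dim)   # FiLM shift projection

    def forward(self, text, grids):
        # text: (T, d) text tokens; grids: (G, d) visual grid features
        scores = text @ grids.T                  # token-grid correlations (T, G)
        vals, idx = scores.topk(self.k, dim=-1)  # top-k grids per text token
        w = vals.softmax(dim=-1)                 # (T, k) attention over selected grids
        cond = (w.unsqueeze(-1) * grids[idx]).sum(dim=1)  # image-conditioned signal (T, d)
        # FiLM: feature-wise scale and shift, no T x G attention map kept around
        return self.gamma(cond) * text + self.beta(cond)
```

Because each token attends to only K grids instead of all G, the fusion cost scales with K rather than with the full grid count.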
Training follows a two-stage recipe: (1) CMM warm-up with frozen backbones, and (2) end-to-end training.
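The two-stage recipe amounts to toggling which modules receive gradients. A minimal sketch, assuming the model exposes `vision_encoder`, `language_model`, and `cmm` submodules (those attribute names are assumptions):

```python
def set_trainable(module, flag):
    """Enable or disable gradients for every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(model, stage):
    # Hypothetical submodule names; adjust to the actual model class.
    if stage == 1:
        # Stage 1: CMM warm-up -- backbones frozen, only the modulator trains.
        set_trainable(model.vision_encoder, False)
        set_trainable(model.language_model, False)
        set_trainable(model.cmm, True)
    else:
        # Stage 2: end-to-end training of all modules.
        for m in (model.vision_encoder, model.language_model, model.cmm):
            set_trainable(m, True)
```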
Total parameters: ~0.8B
| Benchmark | Split | Score |
|---|---|---|
| VQAv2 | Test | 76.6 |
| POPE | Test | 69.4 |
| AI2D | Test | 46.2 |
| MMMU | Val | 26.4 |
| MME (Perception) | - | 1376.2 |
| SQA-Image | Test | 56.7 |
| MMBench | Dev | 64.6 |
Minimal inference example template:
```python
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor
from modeling.model import FireboltLMForCausalLM

# ... (See full example in repo) ...

if __name__ == "__main__":
    print(generate_answer(
        ckpt_dir="YOUR_CKPT",
        tokenizer_path="YOUR_TOKENIZER",
        processor_path="YOUR_PROCESSOR",
        image_path="demo.jpg",
        question="What is the destination city searched in this image?",
    ))
```
Active development is underway for V2, focusing on even greater efficiency and broader multimodal capabilities. Stay tuned!