
Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation

Fast multimodal reasoning, fine-grained visual grounding, and a design ready for practical deployment.

GitHub (Coming Soon) Paper (ECV'26 @ CVPR) HuggingFace (Coming Soon)
March 25, 2026 Accepted
Our paper "Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation" (v1) has been accepted to ECV'26 @ CVPR 2026! 🎉

Key Features

⚡

Efficient Inference

Replaces heavy cross-attention fusion with a lightweight Cross-modal Modulator (CMM) and state-space modeling for fast generation.

🎯

Fine-grained Grounding

Computes token–grid correlations and uses Top-K grid selection so each text token focuses on the most relevant visual regions.
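The Top-K selection step can be sketched as follows. This is a minimal illustration of the idea, not the released implementation: the function name `topk_grid_select`, the pooling via softmax over the k selections, and all shapes are assumptions for the sketch.

```python
import torch

def topk_grid_select(text_tokens, grid_feats, k=4):
    """Toy sketch: for each text token, keep only the k most-correlated
    visual grid cells (hypothetical helper, not the official API)."""
    # text_tokens: (T, d), grid_feats: (G, d)
    corr = text_tokens @ grid_feats.T                       # (T, G) token-grid correlation
    topv, topi = corr.topk(k, dim=-1)                       # (T, k) strongest cells per token
    selected = grid_feats[topi]                             # (T, k, d) gathered visual features
    weights = topv.softmax(dim=-1)                          # normalize over the k selections
    pooled = (weights.unsqueeze(-1) * selected).sum(dim=1)  # (T, d) per-token visual context
    return pooled

tokens = torch.randn(5, 16)
grid = torch.randn(49, 16)   # e.g. a 7x7 grid of patch embeddings
out = topk_grid_select(tokens, grid, k=4)
print(out.shape)  # torch.Size([5, 16])
```

Restricting each token to its top-k grid cells keeps the gather cost at O(T·k) rather than attending to every grid cell for every token.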

🧩

Cross-modality Modulation

Uses Feature-wise Linear Modulation (FiLM) to inject image-conditioned signals into text features without quadratic attention overhead.
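FiLM itself is a per-feature affine transform, `gamma * x + beta`, whose scale and shift are predicted from the conditioning signal. A minimal sketch, assuming a pooled image vector as the conditioner (layer names and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: scale and shift text features with
    image-conditioned gamma/beta (a minimal sketch, not the released module)."""
    def __init__(self, img_dim, txt_dim):
        super().__init__()
        self.to_gamma = nn.Linear(img_dim, txt_dim)
        self.to_beta = nn.Linear(img_dim, txt_dim)

    def forward(self, txt, img_ctx):
        # txt: (T, d_txt); img_ctx: (d_img,) pooled visual conditioning vector
        gamma = self.to_gamma(img_ctx)   # per-feature scale
        beta = self.to_beta(img_ctx)     # per-feature shift
        return gamma * txt + beta        # linear in sequence length: no quadratic attention

film = FiLM(img_dim=32, txt_dim=16)
out = film(torch.randn(10, 16), torch.randn(32))
print(out.shape)  # torch.Size([10, 16])
```

Because the conditioning collapses to one scale/shift pair per feature, the cost grows linearly with sequence length, unlike cross-attention's quadratic token-pair interactions.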

Architecture

Firebolt-VL Architecture Diagram

Main Components:

  • 🎨 Vision Encoder (SigLIP) – extracts grid-level visual embeddings
  • 🧩 Cross-modal Modulator (CMM) – token–grid correlation → FiLM → SSM → FiLM
  • 🧠 Language Decoder (LFM2-350M-based) – multimodal reasoning and response generation
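Chaining the CMM stages above (token–grid correlation → FiLM → SSM → FiLM) might look like the toy module below. Every layer choice here is an assumption: the diagonal-decay scan stands in for the actual state-space block, and the softmax-weighted correlation stands in for the Top-K selection.

```python
import torch
import torch.nn as nn

class ToyCMM(nn.Module):
    """Illustrative Cross-modal Modulator pipeline (not the released design):
    correlation -> FiLM -> state-space-style scan -> FiLM."""
    def __init__(self, d):
        super().__init__()
        self.film1 = nn.Linear(d, 2 * d)           # predicts (gamma, beta) from visual context
        self.film2 = nn.Linear(d, 2 * d)
        self.decay = nn.Parameter(torch.zeros(d))  # diagonal decay for a simple linear scan

    def modulate(self, film, txt, ctx):
        gamma, beta = film(ctx).chunk(2, dim=-1)
        return gamma * txt + beta

    def forward(self, txt, grid):
        # txt: (T, d), grid: (G, d)
        corr = (txt @ grid.T).softmax(dim=-1)    # token-grid correlation
        ctx = corr @ grid                        # (T, d) per-token visual context
        h = self.modulate(self.film1, txt, ctx)  # first FiLM
        a = torch.sigmoid(self.decay)            # scan: state_t = a * state_{t-1} + h_t
        out, state = [], torch.zeros_like(h[0])
        for x in h:
            state = a * state + x
            out.append(state)
        h = torch.stack(out)
        return self.modulate(self.film2, h, ctx)  # second FiLM

cmm = ToyCMM(d=16)
y = cmm(torch.randn(5, 16), torch.randn(49, 16))
print(y.shape)  # torch.Size([5, 16])
```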

Training follows a two-stage recipe: (1) CMM warm-up with frozen backbones, and (2) end-to-end training.
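The two-stage recipe amounts to toggling which parameter groups receive gradients. A sketch, assuming illustrative submodule names (`vision_encoder`, `cmm`, `decoder`) that need not match the repo:

```python
import torch.nn as nn

def set_stage(model, stage):
    """Toy two-stage schedule: stage 1 trains only the CMM with frozen
    backbones; stage 2 unfreezes everything for end-to-end training.
    Attribute names are illustrative, not the repo's."""
    for p in model.parameters():
        p.requires_grad = (stage == 2)      # stage 2: everything trains
    if stage == 1:
        for p in model.cmm.parameters():    # stage 1: CMM warm-up only
            p.requires_grad = True

model = nn.ModuleDict({
    "vision_encoder": nn.Linear(8, 8),
    "cmm": nn.Linear(8, 8),
    "decoder": nn.Linear(8, 8),
})
set_stage(model, stage=1)
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['cmm.weight', 'cmm.bias']
```

Warming up the CMM first lets the new fusion module adapt to the frozen SigLIP and LFM2 representations before end-to-end updates can disturb them.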

Benchmark Results

Total parameters: ~0.8B

| Benchmark        | Split | Score  |
|------------------|-------|--------|
| VQAv2            | Test  | 76.6   |
| POPE             | Test  | 69.4   |
| AI2D             | Test  | 46.2   |
| MMMU-val         | Val   | 26.4   |
| MME (Perception) | –     | 1376.2 |
| SQA-Image        | Test  | 56.7   |
| MMB-dev          | Dev   | 64.6   |

Usage

Minimal inference example template:

import torch
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor
from modeling.model import FireboltLMForCausalLM
 
# ... (See full example in repo) ...
 
if __name__ == "__main__":
    print(generate_answer(
        ckpt_dir="YOUR_CKPT",
        tokenizer_path="YOUR_TOKENIZER",
        processor_path="YOUR_PROCESSOR",
        image_path="demo.jpg",
        question="What is the destination city searched in this image?"
    ))
Usage Guide (Coming Soon)

Roadmap

🚀 Firebolt-VL V2

Active development is underway for V2, focusing on even greater efficiency and broader multimodal capabilities. Stay tuned!