Fast multimodal reasoning, fine-grained visual grounding, and readiness for practical deployment.
Main Components:
- **Cross-modal Modulator (CMM):** replaces heavy cross-attention fusion with a lightweight modulator, paired with state-space modeling for fast generation.
- **Top-K grid selection:** computes token–grid correlations and keeps only the Top-K grids per token, so each text token focuses on the most relevant visual regions.
- **FiLM conditioning:** uses Feature-wise Linear Modulation (FiLM) to inject image-conditioned signals into text features without quadratic attention overhead.
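The Top-K selection and FiLM steps above can be sketched as follows. This is a minimal illustration, not the repo's actual implementation: the class name, tensor shapes, and the single-head, per-token design are assumptions.

```python
import torch
import torch.nn as nn

class CMMSketch(nn.Module):
    """Hypothetical sketch of a Cross-modal Modulator:
    Top-K grid selection followed by FiLM modulation."""

    def __init__(self, dim, k=4):
        super().__init__()
        self.k = k
        self.gamma = nn.Linear(dim, dim)  # FiLM scale projection
        self.beta = nn.Linear(dim, dim)   # FiLM shift projection

    def forward(self, text, grids):
        # text: (T, d) text tokens; grids: (G, d) visual grid features
        scores = text @ grids.T                  # token-grid correlations (T, G)
        vals, idx = scores.topk(self.k, dim=-1)  # top-k grids per text token
        w = vals.softmax(dim=-1)                 # (T, k) attention over selected grids
        cond = (w.unsqueeze(-1) * grids[idx]).sum(dim=1)  # image-conditioned signal (T, d)
        # FiLM: feature-wise scale and shift, no T x G attention map kept around
        return self.gamma(cond) * text + self.beta(cond)
```

Because each token attends to only K grids instead of all G, the fusion cost scales with K rather than with the full grid count.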
Training follows a two-stage recipe: (1) CMM warm-up with frozen backbones, and (2) end-to-end training.
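The two-stage recipe amounts to toggling which modules receive gradients. A minimal sketch, assuming the model exposes `vision_encoder`, `language_model`, and `cmm` submodules (those attribute names are assumptions):

```python
def set_trainable(module, flag):
    """Enable or disable gradients for every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(model, stage):
    # Hypothetical submodule names; adjust to the actual model class.
    if stage == 1:
        # Stage 1: CMM warm-up -- backbones frozen, only the modulator trains.
        set_trainable(model.vision_encoder, False)
        set_trainable(model.language_model, False)
        set_trainable(model.cmm, True)
    else:
        # Stage 2: end-to-end training of all modules.
        for m in (model.vision_encoder, model.language_model, model.cmm):
            set_trainable(m, True)
```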
Total parameters: ~0.8B
| Benchmark | Split | Score |
|---|---|---|
| VQAv2 | Test | 76.6 |
| POPE | Test | 69.4 |
| AI2D | Test | 46.2 |
| MMMU | Val | 26.4 |
| MME (Perception) | - | 1376.2 |
| SQA-Image | Test | 56.7 |
| MMBench | Dev | 64.6 |
Minimal inference example template:
```python
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor
from modeling.model import FireboltLMForCausalLM

# ... (See full example in repo) ...

if __name__ == "__main__":
    print(generate_answer(
        ckpt_dir="YOUR_CKPT",
        tokenizer_path="YOUR_TOKENIZER",
        processor_path="YOUR_PROCESSOR",
        image_path="demo.jpg",
        question="What is the destination city searched in this image?",
    ))
```
Active development is underway for V2, focusing on even greater efficiency and broader multimodal capabilities. Stay tuned!