BraIn-to-Text (BIT): Neural Speech Decoding Framework
- Source: https://openreview.net/forum?id=Lp1noMpMUG
- Type: paper
- Date Ingested: 2026-04-05T20:00:00Z
- Tags: speech-bci, neural-decoding, word-error-rate, audio-llm
Key Contribution
The BraIn-to-Text (BIT) framework decodes neural activity into sentences via a single differentiable network. It achieves a 10% word error rate, down from the previous state-of-the-art 24.69%, and uses contrastive learning for cross-modal alignment with audio LLMs.
Summary
BIT introduces a novel approach to neural speech decoding that translates brain activity directly into text through a single end-to-end differentiable network, dramatically reducing word error rates compared to previous pipeline approaches.
Technical Architecture
- End-to-end model: Single differentiable network from neural signals to text (no separate stages)
- Contrastive learning: Cross-modal alignment between neural embeddings and audio LLM representations
- Audio LLM integration: Leverages pre-trained audio language models as a bridge between neural and linguistic domains
- Training approach: Neural activity aligned to audio representations, then decoded to text
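The contrastive alignment step can be sketched as a symmetric InfoNCE loss over paired neural and audio embeddings, CLIP-style. This is a minimal NumPy illustration of the general technique, not the paper's actual loss; the batch size, embedding dimension, and temperature value here are assumptions.

```python
import numpy as np

def info_nce_loss(neural_emb, audio_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss aligning paired embeddings.

    neural_emb, audio_emb: (batch, dim) arrays; row i of each is a
    matching neural/audio pair from the same utterance.
    """
    # L2-normalize so the dot product is cosine similarity
    n = neural_emb / np.linalg.norm(neural_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    logits = n @ a.T / temperature        # (batch, batch) similarity matrix
    labels = np.arange(len(logits))       # matching pairs lie on the diagonal

    def xent(l):
        # numerically stable log-softmax cross-entropy on the diagonal
        l = l - l.max(axis=1, keepdims=True)
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[labels, labels].mean()

    # average both directions: neural->audio and audio->neural retrieval
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss pulls each neural embedding toward its paired audio embedding while pushing it away from the other pairs in the batch, which is what places neural activity in the audio LLM's representation space.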
Key Results
- Word error rate: 10%, a dramatic improvement over the previous SOTA of 24.69%
- Improvement: ~60% relative reduction in word errors
- Benchmark: Tested on standard neural speech decoding datasets
- Generalization: Single unified model rather than per-patient tuning
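The relative-improvement figure follows directly from the two reported error rates:

```python
prev_wer, new_wer = 24.69, 10.0  # reported word error rates, in percent
relative_reduction = (prev_wer - new_wer) / prev_wer
# approximately 0.595, i.e. roughly a 60% relative reduction in errors
```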
Methodology
- Neural signals (intracortical recordings from speech/motor areas) serve as input
- Contrastive learning aligns neural signal embeddings with audio embeddings from pre-trained LLMs
- The audio LLM's text decoder then converts aligned embeddings to word sequences
- End-to-end training eliminates error propagation between pipeline stages
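The methodology above can be sketched as a single forward pass in which every stage is differentiable. This is a deliberately oversimplified NumPy sketch: the linear encoder/decoder, the mean-squared alignment term (contrastive in the paper), and all array sizes are hypothetical stand-ins, since the source does not specify the actual architectures.

```python
import numpy as np

rng = np.random.default_rng(1)
T, C, D, V = 50, 128, 64, 1000  # time steps, channels, embed dim, vocab (assumed sizes)

# Hypothetical learned components; real encoders/decoders are deep networks.
W_enc = rng.normal(scale=0.01, size=(C, D))  # neural encoder, simplified to linear
W_dec = rng.normal(scale=0.01, size=(D, V))  # text-decoding head

neural = rng.normal(size=(T, C))     # window of intracortical recordings
audio_emb = rng.normal(size=(T, D))  # frozen audio-LLM embeddings of the spoken audio

z = neural @ W_enc                          # neural embeddings in the audio space
align_loss = np.mean((z - audio_emb) ** 2)  # alignment term (stand-in for contrastive)
logits = z @ W_dec                          # word logits decoded from aligned embeddings
```

A full system would add a cross-entropy loss on `logits` against the target word sequence and backpropagate through both terms jointly: one combined loss, one network, so no errors accumulate between separately trained stages.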
Innovation
- Cross-modal alignment: Bridges the gap between neural signals and language by routing through audio representations
- Audio LLM leverage: Uses the linguistic knowledge already embedded in large audio-language models
- Unified training: Single loss function optimizes the entire pipeline jointly
- Scalability: Architecture can benefit from improvements in underlying audio LLMs
Significance
A 10% word error rate for brain-to-text decoding approaches the usability threshold for practical communication devices. The contrastive learning approach is elegant — rather than trying to decode neural signals directly (a very hard problem), BIT aligns neural representations with audio representations where powerful pre-trained models already exist. This "neural-to-audio-to-text" bridge dramatically reduces the amount of neural training data needed. If this approach generalizes across patients and recording modalities, it could accelerate the timeline for practical speech BCIs that restore communication for people with severe paralysis.