BraIn-to-Text (BIT): Neural Speech Decoding Framework
- Source: https://openreview.net/forum?id=Lp1noMpMUG
- Type: paper
- Date Ingested: 2026-04-05T20:00:00Z
- Tags: speech-bci, neural-decoding, word-error-rate, audio-llm
Key Contribution
The BraIn-to-Text (BIT) framework decodes neural activity into sentences via a single differentiable network. It achieves a 10% word error rate, down from the previous state-of-the-art 24.69%, and uses contrastive learning for cross-modal alignment with audio LLMs.
Summary
BIT introduces a novel approach to neural speech decoding that translates brain activity directly into text through a single end-to-end differentiable network, dramatically reducing word error rates compared to previous pipeline approaches.
Technical Architecture
- End-to-end model: Single differentiable network from neural signals to text (no separate stages)
- Contrastive learning: Cross-modal alignment between neural embeddings and audio LLM representations
- Audio LLM integration: Leverages pre-trained audio language models as a bridge between neural and linguistic domains
- Training approach: Neural activity aligned to audio representations, then decoded to text
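The contrastive alignment step can be sketched as a symmetric InfoNCE loss over paired neural and audio embeddings, CLIP-style. This is a minimal NumPy illustration of the general technique, not the paper's actual loss; the batch size, embedding dimension, and temperature value here are assumptions.

```python
import numpy as np

def info_nce_loss(neural_emb, audio_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss aligning paired embeddings.

    neural_emb, audio_emb: (batch, dim) arrays; row i of each is a
    matching neural/audio pair from the same utterance.
    """
    # L2-normalize so the dot product is cosine similarity
    n = neural_emb / np.linalg.norm(neural_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    logits = n @ a.T / temperature        # (batch, batch) similarity matrix
    labels = np.arange(len(logits))       # matching pairs lie on the diagonal

    def xent(l):
        # numerically stable log-softmax cross-entropy on the diagonal
        l = l - l.max(axis=1, keepdims=True)
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[labels, labels].mean()

    # average both directions: neural->audio and audio->neural retrieval
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss pulls each neural embedding toward its paired audio embedding while pushing it away from the other pairs in the batch, which is what places neural activity in the audio LLM's representation space.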
Key Results
- Word error rate: 10%, a dramatic improvement over the previous SOTA of 24.69%
- Improvement: ~60% relative reduction in word errors
- Benchmark: Tested on standard neural speech decoding datasets
- Generalization: Single unified model rather than per-patient tuning
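The relative-improvement figure follows directly from the two reported error rates:

```python
prev_wer, new_wer = 24.69, 10.0  # reported word error rates, in percent
relative_reduction = (prev_wer - new_wer) / prev_wer
# approximately 0.595, i.e. roughly a 60% relative reduction in errors
```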
Methodology
- Neural signals (intracortical recordings from speech/motor areas) serve as input
- Contrastive learning aligns neural signal embeddings with audio embeddings from pre-trained LLMs
- The audio LLM's text decoder then converts aligned embeddings to word sequences
- End-to-end training eliminates error propagation between pipeline stages
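The methodology above can be sketched as a single forward pass in which every stage is differentiable. This is a deliberately oversimplified NumPy sketch: the linear encoder/decoder, the mean-squared alignment term (contrastive in the paper), and all array sizes are hypothetical stand-ins, since the source does not specify the actual architectures.

```python
import numpy as np

rng = np.random.default_rng(1)
T, C, D, V = 50, 128, 64, 1000  # time steps, channels, embed dim, vocab (assumed sizes)

# Hypothetical learned components; real encoders/decoders are deep networks.
W_enc = rng.normal(scale=0.01, size=(C, D))  # neural encoder, simplified to linear
W_dec = rng.normal(scale=0.01, size=(D, V))  # text-decoding head

neural = rng.normal(size=(T, C))     # window of intracortical recordings
audio_emb = rng.normal(size=(T, D))  # frozen audio-LLM embeddings of the spoken audio

z = neural @ W_enc                          # neural embeddings in the audio space
align_loss = np.mean((z - audio_emb) ** 2)  # alignment term (stand-in for contrastive)
logits = z @ W_dec                          # word logits decoded from aligned embeddings
```

A full system would add a cross-entropy loss on `logits` against the target word sequence and backpropagate through both terms jointly: one combined loss, one network, so no errors accumulate between separately trained stages.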
Innovation
- Cross-modal alignment: Bridges the gap between neural signals and language by routing through audio representations
- Audio LLM leverage: Uses the linguistic knowledge already embedded in large audio-language models
- Unified training: Single loss function optimizes the entire pipeline jointly
- Scalability: Architecture can benefit from improvements in underlying audio LLMs
Significance
A 10% word error rate for brain-to-text decoding approaches the usability threshold for practical communication devices. The contrastive learning approach is elegant — rather than trying to decode neural signals directly (a very hard problem), BIT aligns neural representations with audio representations where powerful pre-trained models already exist. This "neural-to-audio-to-text" bridge dramatically reduces the amount of neural training data needed. If this approach generalizes across patients and recording modalities, it could accelerate the timeline for practical speech BCIs that restore communication for people with severe paralysis.