BraIn-to-Text (BIT): Neural Speech Decoding Framework

Paper
Key Contribution

The BraIn-to-Text (BIT) framework decodes neural activity into sentences through a single differentiable network, achieving a 10% word error rate (down from the previous state of the art of 24.69%) by using contrastive learning for cross-modal alignment with audio LLMs.

Summary

BIT introduces a novel approach to neural speech decoding that translates brain activity directly into text through a single end-to-end differentiable network, dramatically reducing word error rates compared to previous pipeline approaches.

Technical Architecture

  • End-to-end model: Single differentiable network from neural signals to text (no separate stages)
  • Contrastive learning: Cross-modal alignment between neural embeddings and audio LLM representations
  • Audio LLM integration: Leverages pre-trained audio language models as a bridge between neural and linguistic domains
  • Training approach: Neural activity aligned to audio representations, then decoded to text
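The contrastive alignment step above can be sketched with a standard symmetric InfoNCE objective. This is a minimal numpy illustration of the general technique, not the paper's actual loss; the temperature value and function name are assumptions.

```python
import numpy as np

def info_nce_loss(neural_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning neural and audio embeddings.

    neural_emb, audio_emb: (batch, dim) arrays; row i of each is a
    matched neural/audio pair, and all other rows act as negatives.
    """
    # L2-normalize so the dot product becomes cosine similarity
    n = neural_emb / np.linalg.norm(neural_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    logits = n @ a.T / temperature           # (batch, batch) similarity matrix
    targets = np.arange(len(logits))         # matched pairs sit on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)                     # stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))  # log-softmax
        return -logp[targets, targets].mean()

    # Average both directions: neural->audio and audio->neural retrieval
    return 0.5 * (xent(logits) + xent(logits.T))
```

Pulling matched neural/audio pairs together while pushing mismatched pairs apart is what lets the frozen audio LLM's decoder treat neural embeddings as if they were audio embeddings.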

Key Results

  • Word error rate: 10% — dramatic improvement from previous SOTA of 24.69%
  • Improvement: ~60% relative reduction in errors
  • Benchmark: Tested on standard neural speech decoding datasets
  • Generalization: Single unified model rather than per-patient tuning
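Word error rate, the metric behind these results, is edit distance at the word level divided by reference length. A self-contained sketch (standard definition, not code from the paper):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)
```

The headline "~60% relative reduction" is the arithmetic (24.69 - 10) / 24.69 ≈ 0.595.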

Methodology

  • Neural signals (intracortical recordings from speech/motor areas) serve as input
  • Contrastive learning aligns neural signal embeddings with audio embeddings from pre-trained LLMs
  • The audio LLM's text decoder then converts aligned embeddings to word sequences
  • End-to-end training eliminates error propagation between pipeline stages
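The end-to-end flow above can be illustrated as one forward pass with a single joint objective, so gradients reach the neural encoder through both the alignment and the text terms. Everything here is a toy stand-in: the dimensions, the linear "encoder" and "decoder" (`W_enc`, `W_dec`), and the 0.5 loss weight are hypothetical, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy sizes: batch, neural time steps, channels, model dim, vocabulary
batch, t_neural, d_neural, d_model, vocab = 4, 50, 128, 64, 100

W_enc = rng.normal(size=(d_neural, d_model)) * 0.1  # stand-in neural encoder
W_dec = rng.normal(size=(d_model, vocab)) * 0.1     # stand-in text decoder head

neural = rng.normal(size=(batch, t_neural, d_neural))  # recorded activity
audio_emb = rng.normal(size=(batch, d_model))          # from a frozen audio LLM
tokens = rng.integers(0, vocab, size=batch)            # target word ids

# Stage 1: encode neural activity, pool over time to one embedding per trial
neural_emb = (neural @ W_enc).mean(axis=1)             # (batch, d_model)

# Stage 2: contrastive alignment term (matched audio is the diagonal)
n = neural_emb / np.linalg.norm(neural_emb, axis=1, keepdims=True)
a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
diag = np.arange(batch)
align_loss = -np.log(softmax(n @ a.T / 0.07, axis=1)[diag, diag]).mean()

# Stage 3: decode the (now audio-like) embedding to text, cross-entropy
text_loss = -np.log(softmax(neural_emb @ W_dec, axis=1)[diag, tokens]).mean()

# One joint objective: no stage is trained in isolation, so errors
# cannot accumulate across frozen pipeline boundaries
total_loss = text_loss + 0.5 * align_loss
```

Because the whole path is differentiable, optimizing `total_loss` updates the encoder with respect to the final text objective, which is the point of the end-to-end claim.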

Innovation

  • Cross-modal alignment: Bridges the gap between neural signals and language by routing through audio representations
  • Audio LLM leverage: Uses the linguistic knowledge already embedded in large audio-language models
  • Unified training: Single loss function optimizes the entire pipeline jointly
  • Scalability: Architecture can benefit from improvements in underlying audio LLMs

Significance

A 10% word error rate for brain-to-text decoding approaches the usability threshold for practical communication devices. The contrastive learning approach is elegant — rather than trying to decode neural signals directly (a very hard problem), BIT aligns neural representations with audio representations where powerful pre-trained models already exist. This "neural-to-audio-to-text" bridge dramatically reduces the amount of neural training data needed. If this approach generalizes across patients and recording modalities, it could accelerate the timeline for practical speech BCIs that restore communication for people with severe paralysis.

Tags

speech-bci, neural-decoding, word-error-rate, audio-llm