Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Speech Recognition · Audiovisual Learning · Lipreading · Semi-Supervised Learning · Pseudo-Labeling
Abstract:

Unified Speech Recognition (USR) has emerged as a semi-supervised framework for training a single model for audio, visual, and audiovisual speech recognition, achieving state-of-the-art results on in-distribution benchmarks. However, its reliance on autoregressive pseudo-labelling makes training expensive, while its decoupled supervision of the CTC and attention branches increases susceptibility to self-reinforcing errors, particularly under distribution shifts involving longer sequences, noise, or unseen domains. We propose CTC-driven teacher forcing, where greedily decoded CTC pseudo-labels are fed into the decoder to generate attention targets in a single forward pass. Although these targets can be globally incoherent, in the pseudo-labelling setting they enable efficient and effective knowledge transfer. Because CTC and CTC-driven attention pseudo-labels have the same length, the decoder can predict both simultaneously, benefiting from the robustness of CTC and the expressiveness of attention without costly beam search. We further propose mixed sampling to mitigate the exposure bias introduced when the decoder relies solely on CTC inputs. The resulting method, USR 2.0, halves training time, improves robustness to out-of-distribution inputs, and achieves state-of-the-art results on LRS3, LRS2, and WildVSR, surpassing USR and modality-specific self-supervised baselines.
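To make the pipeline concrete, here is a minimal sketch of its first step, greedy CTC decoding of the pseudo-labels, assuming a PyTorch model whose CTC head emits frame-level log-probabilities. The blank index and tensor shapes are illustrative assumptions, not details from the paper.

```python
import torch

BLANK_ID = 0  # assumed blank index; the paper does not specify one

def ctc_greedy_decode(log_probs: torch.Tensor) -> list[list[int]]:
    """Greedy CTC decoding: per-frame argmax, merge repeats, drop blanks.

    log_probs: (batch, time, vocab) frame-level CTC log-probabilities.
    Returns one pseudo-label token sequence per batch element.
    """
    frame_ids = log_probs.argmax(dim=-1)  # (batch, time)
    decoded = []
    for seq in frame_ids.tolist():
        tokens, prev = [], None
        for tok in seq:
            # merge repeated frames first, then discard blank frames
            if tok != prev and tok != BLANK_ID:
                tokens.append(tok)
            prev = tok
        decoded.append(tokens)
    return decoded

# Toy usage: batch of 2, 50 frames, vocabulary of 32 units.
pseudo_labels = ctc_greedy_decode(torch.randn(2, 50, 32).log_softmax(-1))
```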

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes USR 2.0, an improved unified speech recognition framework that trains a single model for audio, visual, and audiovisual inputs using CTC-driven teacher forcing for pseudo-labelling. It resides in the 'Single-Model Unified Architectures' leaf, which contains only two papers, including the original USR work. This leaf sits under 'Unified and Multi-Task Learning Frameworks', a relatively sparse branch with just four papers across two leaves. This positioning suggests the paper operates in an emerging research direction focused on streamlined single-model approaches, rather than in the more crowded multi-task or fusion-heavy alternatives found elsewhere in the taxonomy.

The taxonomy reveals several neighboring directions that contextualize this work. The sibling leaf 'Multi-Task Hybrid Architectures' contains systems combining primary recognition with auxiliary tasks, representing a more complex alternative to pure unified models. Adjacent branches like 'Multimodal Fusion Strategies' (15 papers across multiple leaves) and 'Self-Supervised and Transfer Learning' (4 papers) explore complementary themes of feature integration and data efficiency. The paper's focus on efficient pseudo-labelling connects it to 'Automatic Pseudo-Labeling' under self-supervised learning, while its attention-CTC coupling relates to fusion strategies, though it maintains the single-model philosophy that distinguishes it from feature-level fusion architectures.

Among the 29 candidates examined, the contribution-level analysis reveals mixed novelty signals. For the CTC-driven teacher forcing mechanism, 10 candidates were examined and 1 refutable match was found, suggesting some prior exploration of CTC-based pseudo-labelling strategies. The mixed sampling strategy shows stronger overlap, with 3 refutable candidates among the 10 examined, indicating existing work on exposure-bias mitigation in similar contexts. For the overall USR 2.0 framework, 9 candidates were examined with no refutations, suggesting the specific combination of techniques may be novel. However, the limited search scope (29 papers, not exhaustive) means these statistics reflect only top-K semantic matches and immediate citations, not comprehensive field coverage.

Given the sparse taxonomy leaf and limited literature search, the work appears to offer incremental refinements to the unified architecture paradigm rather than opening entirely new research directions. The statistical evidence suggests individual components have precedent, though their integration within the USR framework may constitute a meaningful engineering contribution. The analysis covers semantic neighbors and citation links but cannot rule out relevant work in adjacent communities or non-indexed venues, particularly given the rapidly evolving nature of multimodal speech recognition research.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 4

(Per contribution: 10 + 10 + 9 = 29 candidates compared; 1 + 3 + 0 = 4 refutable matches.)

Research Landscape Overview

Core task: unified speech recognition across audio, visual, and audiovisual modalities. The field has evolved from early fusion experiments to sophisticated architectures that handle multiple input streams within a single framework. The taxonomy reflects this maturity through several major branches: Unified and Multi-Task Learning Frameworks focus on single-model architectures that process audio-only, video-only, and combined inputs without separate pipelines; Multimodal Fusion Strategies explore how to combine information from different sensory channels at various stages; Self-Supervised and Transfer Learning branches address data efficiency by leveraging unlabeled data or pretrained representations; Cross-Modal Knowledge Transfer and Large Language Model Integration represent newer directions that borrow strengths from related domains; while branches like Robustness and Noise Handling, Multilingual Recognition, and Application-Specific Systems tackle practical deployment challenges.

Historical works such as Deep Multimodal Learning[24] and End-to-End AVSR[38] laid foundational principles, whereas recent efforts like Auto-AVSR[2] and Unified Speech Recognition[1] demonstrate the shift toward flexible, end-to-end architectures. A particularly active line of work centers on single-model unified architectures that avoid modality-specific subnetworks, aiming for simplicity and scalability. CTC Pseudo-Labelling[0] sits squarely in this branch, emphasizing a streamlined training strategy that leverages pseudo-labels to unify audio and visual streams within one model.

This contrasts with approaches like Matryoshka Multimodal[3], which explores nested representations for flexible inference, and MUTUD[5], which tackles multi-task learning across diverse speech tasks. Meanwhile, works such as MultiAVSR[7] and mWhisper-Flamingo[8] integrate large pretrained models or cross-lingual capabilities, highlighting trade-offs between architectural simplicity and the incorporation of external knowledge. The central tension across these directions involves balancing the elegance of a truly unified architecture against the performance gains from specialized fusion mechanisms or auxiliary tasks, with CTC Pseudo-Labelling[0] representing a minimalist stance that prioritizes end-to-end learning without heavy reliance on complex multi-stage pipelines.

Claimed Contributions

CTC-driven teacher forcing for pseudo-labelling

The authors introduce a method where CTC outputs guide the generation of attention-based pseudo-labels through teacher forcing, eliminating the need for slow autoregressive decoding during pseudo-labelling. This enables parallel generation of attention targets in a single forward pass while maintaining knowledge transfer effectiveness.

10 retrieved papers · Can Refute
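A minimal sketch of the mechanism described above, assuming a standard attention decoder that takes (token ids, encoder features) and returns per-position logits. The module interface, `sos_id`, and right-padding of labels are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

@torch.no_grad()  # the teacher is not updated while pseudo-labelling
def ctc_driven_targets(teacher_decoder: nn.Module,
                       encoder_out: torch.Tensor,
                       ctc_labels: torch.Tensor,
                       sos_id: int) -> torch.Tensor:
    """One parallel decoder pass in place of token-by-token decoding.

    encoder_out: (batch, time, dim) teacher encoder features.
    ctc_labels:  (batch, len) greedy CTC pseudo-labels, right-padded.
    Returns attention pseudo-labels with the same (batch, len) shape.
    """
    sos = torch.full_like(ctc_labels[:, :1], sos_id)
    decoder_in = torch.cat([sos, ctc_labels[:, :-1]], dim=1)  # shift right
    logits = teacher_decoder(decoder_in, encoder_out)  # (batch, len, vocab)
    return logits.argmax(dim=-1)  # attention targets, length-matched to CTC
```

Because the decoder input is just the shifted CTC pseudo-label, every output position is computed in one forward pass, which is where the claimed speedup over autoregressive pseudo-labelling would come from.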
Mixed sampling strategy

The authors propose a mixed sampling approach that alternates between CTC-driven mode and autoregressive mode during training. This strategy addresses the train-test mismatch introduced by CTC-driven teacher forcing while maintaining efficiency and robustness benefits.

10 retrieved papers · Can Refute
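One plausible reading of this strategy, sketched below: at each training step, attention targets come either from the fast CTC-driven pass (reusing `ctc_driven_targets` from the sketch above) or from ordinary autoregressive decoding. The mixing probability `p_ctc`, the per-batch granularity, and the use of greedy rather than beam decoding are all assumptions, not details from the paper.

```python
import random
import torch

@torch.no_grad()
def autoregressive_decode(decoder, encoder_out, sos_id, eos_id, max_len=128):
    """Plain greedy autoregressive decoding (hypothetical helper)."""
    batch = encoder_out.size(0)
    ys = torch.full((batch, 1), sos_id, dtype=torch.long,
                    device=encoder_out.device)
    for _ in range(max_len):
        next_tok = decoder(ys, encoder_out)[:, -1].argmax(-1, keepdim=True)
        ys = torch.cat([ys, next_tok], dim=1)
        if (next_tok == eos_id).all():
            break
    return ys[:, 1:]  # strip <sos>

def mixed_sampling_targets(teacher_decoder, encoder_out, ctc_labels,
                           sos_id, eos_id, p_ctc=0.5):
    """Alternate decoding modes so the decoder is not exposed
    exclusively to CTC-shaped inputs during training."""
    if random.random() < p_ctc:
        # fast path: single parallel pass conditioned on CTC pseudo-labels
        return ctc_driven_targets(teacher_decoder, encoder_out,
                                  ctc_labels, sos_id)
    # slow path: the decoder conditions on its own previous predictions
    return autoregressive_decode(teacher_decoder, encoder_out, sos_id, eos_id)
```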
USR 2.0 framework

The authors develop USR 2.0, a unified speech recognition framework that combines CTC-driven teacher forcing and mixed sampling to achieve faster training (approximately 2× speedup), improved out-of-distribution robustness, and state-of-the-art performance across audio, visual, and audiovisual speech recognition tasks using a single model.

9 retrieved papers · No refutable matches
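Putting the pieces together, here is a minimal sketch of a student update on pseudo-labelled data. The detail worth illustrating is the one from the abstract: because the CTC and CTC-driven attention pseudo-labels share a single length L, one teacher-forced decoder pass can be scored against both streams. The module names (`student.encoder`, `student.ctc_head`, `student.decoder`), the loss weight `alpha`, and the omission of padding masks are all illustrative assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def usr2_student_step(student, feats, feat_lens, ctc_pl, ctc_pl_lens,
                      attn_pl, sos_id, blank_id=0, alpha=0.3):
    """Joint CTC + attention loss on pseudo-labels (hedged sketch)."""
    enc = student.encoder(feats)                          # (B, T, D)

    # CTC branch: frame-level loss against the CTC pseudo-labels.
    ctc_logp = student.ctc_head(enc).log_softmax(-1)      # (B, T, V)
    loss_ctc = F.ctc_loss(ctc_logp.transpose(0, 1), ctc_pl,
                          feat_lens, ctc_pl_lens, blank=blank_id)

    # Attention branch: one teacher-forced pass, scored against BOTH
    # equal-length pseudo-label streams at once.
    sos = torch.full_like(ctc_pl[:, :1], sos_id)
    dec_in = torch.cat([sos, ctc_pl[:, :-1]], dim=1)      # shifted CTC labels
    logits = student.decoder(dec_in, enc).flatten(0, 1)   # (B*L, V)
    loss_att = (F.cross_entropy(logits, ctc_pl.flatten()) +
                F.cross_entropy(logits, attn_pl.flatten())) / 2

    return alpha * loss_ctc + (1 - alpha) * loss_att
```

A hybrid weighting between the two branches is shown here as a standard choice; the paper's actual balancing of CTC robustness against attention expressiveness may differ.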
