Special Sessions/Challenges

Special Sessions

Date: Monday, September 2

Time: 11:30 – 13:30

Hall: Yanis Club

Introduction

The main topic of this special session is speech-related biosignals, such as signals of articulatory or neurological activity during speech production or perception. Since these biosignals reflect human speech processes, they can serve as alternative modalities to the acoustic signal for speech-driven systems. Biosignal-enabled speech systems therefore have the potential to enable spoken communication when the acoustic speech signal is not available or perceivable. For instance, Silent Speech Interfaces are developed to restore the ability of spoken communication for speech-impaired persons, e.g., after a laryngectomy. The aim is to enable spoken communication by predicting the acoustic speech signal from biosignals such as Electromyography (EMG), Electromagnetic Articulography (EMA), or Ultrasound Tongue Imaging (UTI). Likewise, biosignals related to speech perception, such as Electroencephalography (EEG), are investigated for neuro-steered hearing aids, which detect selective auditory attention in order to single out and enhance the attended speech stream. Progress in the field of speech-related biosignal processing will lead to the design of novel biosignal-enabled speech communication devices and speech rehabilitation for everyday situations.

With the special session “Biosignal-enabled Spoken Communication”, we aim to bring together researchers working on biosignals and speech processing in order to exchange ideas on interdisciplinary topics.

Topics

Topics of interest for this special session include, but are not limited to:

  • Processing of biosignals capturing respiratory, laryngeal, or articulatory activity during speech, e.g., Electromagnetic Articulography (EMA), Electromyography (EMG), High Speed Nasopharyngoscopy (HSN), Ultrasound Tongue Imaging (UTI), videos of lip movements, etc.
  • Speech-related processing of biosignals reflecting brain activity, such as Electroencephalography (EEG), Electrocorticography (ECoG), or functional Magnetic Resonance Imaging (fMRI).
  • Application of biosignals for speech processing tasks, e.g., speech recognition, synthesis, enhancement, voice conversion, or auditory attention detection.
  • Utilization of biosignals to increase the explainability or performance of acoustic speech processing methods.
  • Development of novel machine learning algorithms, feature representations, model architectures, as well as training and evaluation strategies for improved performance or to address common challenges.
  • Application of methods successful in acoustic speech processing to biosignals, e.g., self-supervised learning, knowledge distillation, end-to-end training, etc.
  • Health-focused applications of biosignal processing, such as speech restoration, training and therapy, or (mental) health assessments.
  • Other applications, such as speech-related brain-computer interfaces (BCIs), speech communication in noisy environments, or acoustic-free speech communication for preserving privacy.

Website:

https://www.uni-bremen.de/en/csl/interspeech-2024-biosignals

Organizers:

  • Kevin Scheck (University of Bremen, Germany)
  • Peter Wu (UC Berkeley, USA)
  • Siqi Cai (National University of Singapore, Singapore)
  • Yashish Siriwardena (University of Maryland College Park, USA)
  • Tanja Schultz (University of Bremen, Germany)
  • Gopala Anumanchipalli (UC Berkeley, USA)
  • Carol Espy-Wilson (University of Maryland College Park, USA)
  • Satoshi Nakamura (Nara Institute of Science and Technology, Japan)
  • Prasanta Kumar Ghosh (Indian Institute of Science (IISc), Bangalore, India)
  • Alan W Black (Carnegie Mellon University, USA)
  • Haizhou Li (National University of Singapore, Singapore)

Date: Tuesday, September 3

Time: 13:30 – 15:30

Hall: Yanis Club

Introduction

Recent technological advancements, such as speech foundation models and large language models, are leading to the rapid integration of speech and language technologies into healthcare. This presents many exciting opportunities for personalized care, early disease detection, and improved communication between patients and providers. However, this integration also poses unique challenges regarding data quality, clinical translation, and the responsible application of language models.

This special session focuses on two key themes in speech-healthcare research:

  1. From Collection and Analysis to Clinical Translation

This theme explores how various factors impact the analysis, interpretation, and, ultimately, the clinical application of speech data. Examples include:

  • Technical and human challenges in speech data collection
  • Understanding how health changes affect speech and language mechanisms
  • Investigation of novel methods for capturing, analyzing, and quantifying speech patterns for health assessment
  • Novel approaches to utilizing speech technologies for disease detection, monitoring, and intervention
  • Considering practical feasibility to ensure findings reliably translate into clinical practice

  2. Speech and Language Technology for Medical Conversations

This theme includes the growing field of “ambient intelligence,” where technologies like speech recognition and natural language processing (NLP) are used, for example, to:

  • Automatically transcribe and analyze doctor-patient conversations.
  • Generate accurate medical documentation from these conversations.
  • Provide feedback to medical students on their communication skills.
  • Offer physicians diagnostic support based on spontaneous speech.
  • Automatically detect health disorders from conversation/spontaneous speech.
  • Develop novel applications for speech and language technology in healthcare.

By bringing together these themes, this session fosters discussion on innovative ideas, challenges, and opportunities for utilizing speech technologies to advance healthcare.

Website:

https://sites.google.com/view/splang-health-interspeech2024/

Organizers:

  • Nicholas Cummins (nick.cummins@kcl.ac.uk) – King’s College London
  • Thomas Schaaf (tschaaf@solventum.com) – 3M | Solventum
  • Visar Berisha – Arizona State University
  • Sneha Das – Technical University of Denmark
  • Judith Dineley – King’s College London
  • Matt Gormley – Carnegie Mellon University
  • Sandeep Konam – HITLOOP
  • Daniel Low – Harvard University
  • Bahman Mirheidari – University of Sheffield
  • Emily Mower Provost – University of Michigan
  • Paula Andrea Pérez-Toro – Friedrich-Alexander-Universität Erlangen-Nuremberg
  • Thomas Quatieri – MIT Lincoln Laboratory
  • Vikram Ramanarayanan – Modality AI
  • Chaitanya Shivade – Amazon.com
  • Helmer Strik – Radboud University
  • Tanya Talkar – Aural Analytics

Date: Tuesday, September 3

Time: 16:00 – 18:00

Hall: Yanis Club

 

Introduction

Large-scale automatic speech recognition (ASR) models have made significant strides in performance, resulting in low error rates for high-resource languages. This success is attributed to their semi-/unsupervised training on datasets that predominantly feature a concentrated set of languages. With this session, we aim to encourage researchers to propose approaches that improve the performance of large-scale pretrained ASR models on under-utilized languages. While high-resource language performance can be continuously improved by scaling datasets and models, the same approach cannot be readily applied to under-represented languages, which are typically characterized by their low-resource nature. Therefore, a greater emphasis on technical advancements is necessary to achieve low error rates for under-represented languages. Building upon the success of previous Interspeech benchmarks such as SUPERB and ML-SUPERB, this session extends the exploration to tackle the challenges linked to languages lacking prominence in the training data of pretrained models. Notably, there exists substantial variance in performance across languages in multilingual benchmarks such as Fleurs. Thus, with this special session, we provide a platform for the focused analysis and improvement of ASR for languages that are under-utilized in large-scale pretrained speech models. Additionally, we extend an invitation to researchers to contribute new datasets in under-represented languages, fostering collaboration within the research community and enabling substantial advancements in this domain.

Topics

Fine-tuning and adaptation approaches for self-supervised learning (SSL) models in ASR, such as the following (a brief illustrative sketch of adapter tuning appears after the list):

  • Adapter/prompt tuning
  • Re-scoring approaches
  • Data augmentation
  • Zero-shot/few-shot transfer
  • Test-time adaptation
  • Pseudo-label training
  • Ensembling
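
As a concrete illustration of the first item, the minimal sketch below shows one common form of adapter tuning: a small bottleneck module added on top of a frozen block of a pretrained SSL encoder, so that only the adapter parameters are updated. Module names, sizes, and the stand-in "pretrained" block are illustrative assumptions, not a reference implementation for this session.

```python
# Minimal sketch of adapter tuning for a frozen SSL encoder block (illustrative only).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual add."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

class AdaptedBlock(nn.Module):
    """Wraps a pretrained (frozen) Transformer block with a trainable adapter."""
    def __init__(self, pretrained_block: nn.Module, dim: int):
        super().__init__()
        self.block = pretrained_block
        for p in self.block.parameters():
            p.requires_grad = False          # keep the SSL weights frozen
        self.adapter = Adapter(dim)          # only these parameters are trained

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.block(x))

# Toy usage: a stand-in "pretrained" block and a batch of frame-level features.
block = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
adapted = AdaptedBlock(block, dim=768)
features = torch.randn(2, 100, 768)          # (batch, frames, dim)
out = adapted(features)
trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
print(out.shape, trainable)                   # only the adapter parameters are trainable
```

In practice such adapters are inserted inside every encoder layer of the pretrained model; freezing the backbone keeps per-language adaptation cheap, which is what makes the approach attractive for under-represented languages.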

Analysis and datasets for under-utilised languages with SSL models

  • Detailed analysis of performance with pretrained models with under-represented datasets
  • Open sourcing new low-resource datasets and studying the performance of large pre-trained models

Website:

https://sites.google.com/view/is24-ssl-ul

Organizers:

  • Brian Kingsbury (IBM Research, USA)
  • Thomas Hain (The University of Sheffield, UK)
  • Hung-yi Lee (National Taiwan University, Taiwan)
  • David Harwath (The University of Texas at Austin, USA)
  • Peter Bell (University of Edinburgh, UK)
  • Jayadev Billa (Information Sciences Institute, University of Southern California, USA)
  • Chenai Chair (Mozilla Foundation, South Africa)
  • Prasanta Kumar Ghosh (Indian Institute of Science, Bangalore, India)

Date: Wednesday, September 4

Time: 10:00 – 12:00

Hall: Yanis Club

 

Introduction

This first-ever special session on speech and gender at Interspeech brings together scholars from across all of the speech disciplines. In addition to our technical programme, an interdisciplinary discussion panel will provide practical advice for understanding and talking about gender in speech research and for dealing with it in practical applications.

Gender is an extremely important speaker characteristic in speech science and speech technology. It has been shown to affect speech production and interaction in meaningful and complex ways. Also essential is the compilation and use of inclusive and representative speech datasets to train and test speech technologies. Gender therefore relates to many popular research areas of the Interspeech community including dataset development, speech production and perception, algorithmic bias, privacy, and applications to healthcare.

Topics

This special session highlights the importance of gender as a speaker characteristic and provides a unique opportunity to address theoretical and practical questions relating to sex and gender in speech technology and speech science. Submissions can be full research papers or position papers.

Topics include:

  • Empirical investigations of algorithmic bias
  • Developing and using inclusive datasets
  • Gender, sexuality and other types of diversity in language and speech production and processing
  • Ensuring privacy and diversity for high-stakes speech applications
  • Applications for gender representation in media
  • Applications related to gender and sexuality in inclusion and healthcare (e.g., voice and speech therapy)

Organizers:

  • Nina Markl (University of Essex, United Kingdom)
  • Cliodhna Hughes (University of Sheffield, United Kingdom)
  • Odette Scharenborg (Delft University of Technology, The Netherlands)

Date: Wednesday, September 4

Time: 13:30 – 15:30

Hall: Yanis Club

 

Deadline: 2 March 2024

Submissions: https://interspeech2024.org/paper-submission/

Introduction

The main goal of the special session is to bring together researchers who use computational modeling as a means to study human language acquisition, perception, or production.

Computational modeling is a research method that complements empirical and theoretical research on the aforementioned topics by creating algorithmic implementations of theories of human language processing, thereby enabling testing of the feasibility and scalability of these theories in practice. Models can also act as learnability proofs by demonstrating how language behaviors are achievable with a specific set of cognitive processes (“innate mechanisms and biases”), input/output data (“the environment model”), and potential physiological constraints.

Due to the interdisciplinary nature of spoken language as a phenomenon, computational modeling research is currently dispersed across different conferences and workshops in areas such as speech technology, natural language processing, phonetics, psycholinguistics, cognitive science, and neuroscience. Our primary aim is to bring together people who work in these different areas but use computational modeling as part of their work.

Recent advances in machine learning and statistical modeling have resulted in increasingly powerful computational models capable of explaining several aspects of human language behavior. This means that the methodologies used in computational modeling and in contemporary speech technologies are becoming increasingly similar, such as the use of self-supervised learning to learn useful speech representations. At the same time, computational models of human speech processing benefit from technical expertise in speech processing and theoretical understanding of speech as a phenomenon, traits that are characteristic of the Interspeech community. Therefore, another goal of this special session is to encourage and facilitate cross-disciplinary collaboration between technology- and science-oriented researchers in the study of human language.

Topics

We will accept submissions from any of the three areas and papers bridging them: language acquisition, speech perception, and speech production. The key requirement is that the submitted papers should use computational models to study human language processing, provide novel theoretical or conceptual insights to computational research, or describe novel technical tools, datasets, or other resources that support computational modeling research. The emphasis will be on speech behaviors and studies involving speech audio, articulatory data, and other biosignals.

Potential topics include both L1 and L2 processing, such as (but not limited to):

  • Models of sub-word, lexical, and/or syntactic perceptual learning
  • Models of articulatory learning
  • Models of spoken word recognition or other aspects of speech perception
  • Models of speech planning and motor control
  • Multimodal models (e.g., visually grounded speech learning, audiovisual perception)
  • Articulatory models
  • Models bridging language acquisition and adult speech behaviors
  • Models of bilingual acquisition, comprehension, and production
  • Models of dyadic or multi-agent interaction
  • Embodied models
  • Self-supervised learning of speech representations
  • Bayesian models of language acquisition and speech communication
  • Data, evaluation practices, or other tools and resources for computational modeling of human speech behaviors

We also encourage machine learning and speech technology researchers to submit their technical work on unsupervised or self-supervised acquisition of speech units (e.g., phones, syllables, words, syntax) and analysis of language representations in deep learning models, if these models can explain the acquisition of language structures without supervised learning.

Organizers:

  • Okko Räsänen (Tampere University, Tampere, Finland)
  • Thomas Hueber (GIPSA-lab, Grenoble, France)
  • Marvin Lavechin (GIPSA-lab, Grenoble, France)

Date: Wednesday, September 4

Time: 16:00 – 18:00

Hall: Yanis Club

 

Introduction:

Large language models (LLMs) have achieved remarkable success in natural language processing through in-context learning. Recent studies have applied LLMs to other modalities including speech, leading to an emerging research topic for speech processing, i.e., spoken language models (SLMs). SLMs simplify the modeling of speech, making it easier to scale up to more data, languages, and tasks. A single model can often perform multiple speech processing tasks such as speech recognition, speech translation, speech synthesis, and natural dialogue modeling. By using pre-trained LLMs, certain types of SLMs exhibit strong instruction-following capabilities. This presents a promising avenue for developing “universal speech foundation models”, which take natural language instructions as input and proficiently execute diverse downstream tasks. This special session will provide an opportunity for researchers working in these fields to share their knowledge with each other, which can greatly benefit the entire speech community.

Topics

This special session aims to promote and advance the study of SLMs. We anticipate a combined panel and poster format for the session.

We welcome submissions on various topics related to spoken language models, including but not limited to:

  • Data creation
  • Speech representation learning (e.g., speech tokenizers)
  • Modeling architectures and algorithms
  • Training strategies (e.g., supervised fine-tuning, reinforcement learning)
  • Efficient adaptation of pre-trained models (e.g., adapters, low-rank adaptation; see the sketch after this list)
  • Model compression (e.g., pruning, distillation, quantization)
  • Novel applications
  • Evaluation benchmarks and analysis methods
  • Fairness and bias
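
To make the efficient-adaptation item above more concrete, below is a minimal, hypothetical sketch of low-rank adaptation (LoRA) applied to a single linear projection: the pretrained weight stays frozen and only a small low-rank update is trained. The layer sizes and class name are illustrative assumptions, not the API of any particular SLM toolkit.

```python
# Minimal sketch of low-rank adaptation (LoRA) for one linear layer (illustrative only).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W x + (alpha / r) * B(A x), with W frozen and only A, B trained."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # freeze the pretrained weight
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)          # start as an exact no-op
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Toy usage on a stand-in projection from a pretrained model.
proj = nn.Linear(1024, 1024)
lora_proj = LoRALinear(proj, r=8)
x = torch.randn(4, 1024)
print(lora_proj(x).shape)
print(sum(p.numel() for p in lora_proj.parameters() if p.requires_grad))  # ~16k trainable vs ~1M frozen
```

The appeal for SLMs is that only the small A and B matrices need to be stored and swapped per task or language, while the large pretrained backbone is shared.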

Website:

https://www.wavlab.org/activities/2024/interspeech2024-slm/

Organizers:

  • Yifan Peng (Carnegie Mellon University)
  • Siddhant Arora (Carnegie Mellon University)
  • Karen Livescu (TTI-Chicago)
  • Shinji Watanabe (Carnegie Mellon University)
  • Hung-yi Lee (National Taiwan University)
  • Yossi Adi (Hebrew University of Jerusalem)

Date: Thursday, September 5

Time: 10:00 – 12:00

Hall: Acesso

 

Introduction

The application of large language models has significantly advanced artificial intelligence, including developments in language and speech technologies. The purpose of this special session is to promote the exploration of how large language models and contextual features can be leveraged for phonetic analysis and speech science. It will bring together experts from diverse fields to discuss the opportunities and challenges arising from the integration of large language models in phonetic research, and to draw more attention to AI for science within the phonetics community.

Topics

  1. Utilizing large language models in the development of tools and methods for speech science.
  2. Employing contextual representations for phonetic analysis.
  3. Discovering phonetic and linguistic properties embedded in pretrained and finetuned large language models.
  4. Investigating the interpretability and explainability of large language models from the perspective of human speech processes.
  5. Establishing connections between contextual representations and traditional phonetic features.
  6. Revisiting classic problems in phonetics and phonology through the re-analysis of speech data using large language models and contextual representations.
  7. Addressing the challenge of data sparsity in the application of large language models.
  8. Leveraging existing large language models to improve understanding of understudied and endangered languages.

Organizers:

  • Mark Liberman (University of Pennsylvania)
  • Mark Tiede (Haskins Laboratories and Yale University)
  • Jianwu Dang (Chinese Academy of Sciences)
  • Tan Lee (Chinese University of Hong Kong)
  • Jiahong Yuan (University of Science and Technology of China)

Date: Thursday, September 5

Time: 10:00 – 12:00

Hall: Yanis Club

 

Introduction

Speech foundation models are emerging as a universal solution to various speech tasks. Indeed, their superior performance has extended beyond ASR. For instance, Whisper has proven to be a noise-robust audio event tagger, showcasing its potential beyond its original training objectives. Despite the advancements, the limitations and risks associated with speech foundation models have not been thoroughly studied. For example, it has been found that wav2vec 2.0 exhibits biases in different paralinguistic features, emotions, and accents, while HuBERT lacks noise robustness in certain downstream tasks. Besides this, foundation models present challenges in terms of ethical concerns, including privacy, sustainability, fairness, and safety. Furthermore, risks and biases of one model may propagate in usage alongside other models, especially in a unified framework, such as Seamless.

Thus, it is necessary to investigate speech foundation models with respect to de-biasing (e.g., consistent accuracy across languages, genders, and ages), improving reliability (i.e., avoiding mistakes in critical applications), preventing malicious applications (e.g., using TTS to attack speaker verification systems, or deploying the models for surveillance), and various other aspects.
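
As a small illustration of the de-biasing point above, the sketch below computes word error rate separately per speaker group and reports the largest gap. The transcripts and group labels are made-up placeholders, and jiwer is used only as one convenient WER implementation; this is not an evaluation protocol endorsed by the session.

```python
# Hypothetical sketch: quantifying ASR performance gaps across speaker groups via per-group WER.
from collections import defaultdict
from jiwer import wer

# Toy data: (group label, reference transcript, ASR hypothesis) -- placeholders, not real results.
utterances = [
    ("group_a", "turn on the kitchen light", "turn on the kitchen light"),
    ("group_a", "call my sister tomorrow",   "call my sister tomorrow"),
    ("group_b", "play some jazz music",      "play some jars music"),
    ("group_b", "set an alarm for six",      "set an alarm for six"),
]

refs, hyps = defaultdict(list), defaultdict(list)
for group, ref, hyp in utterances:
    refs[group].append(ref)
    hyps[group].append(hyp)

# Corpus-level WER per group, plus the largest pairwise gap as a simple fairness indicator.
per_group = {g: wer(refs[g], hyps[g]) for g in refs}
gap = max(per_group.values()) - min(per_group.values())
print(per_group, f"max WER gap: {gap:.3f}")
```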

This special session focuses on the responsible aspects of speech foundation models, which are not adequately covered by regular sessions. We aim to facilitate knowledge sharing across diverse speech areas and to pioneer discussions on both technical and non-technical issues. Furthermore, in line with the IS 2024 “Speech and Beyond” theme, we aim to foster connections with other communities such as NLP and ML, which have long been investigating responsible and trustworthy models. Position papers from those communities with theoretical or conceptual justifications for bridging the gap between speech and NLP/ML are also welcome.

Topics

  • Limitations of speech foundation models and/or their solutions
    • inability to capture certain information
    • biases and inconsistent performance for different speech types
    • risks that propagate in actual use
  • Better utilization of speech foundation models
    • adaptation methods for low-resource/out-of-domain speech for fairness (e.g., parameter-efficient tuning)
    • joint training of multiple tasks for reliable and holistic speech perception
    • integration with language foundation models to address the limitations of speech (e.g., Whisper + LLaMA)
  • Interpretability, generalizability, and robustness of speech foundation models
    • whether and/or why the models perform well or poorly in certain types of speech, tasks, or scenarios (e.g., noisy, prosodic, pathological, multi-talker, far-field speech)
  • Foundation models for understudied speech tasks
    • prosody in context
    • dialog behaviors
    • speech-based healthcare
    • emotion in conversations
    • disfluencies and non-verbal vocalizations, including fillers, backchannels, and laughter
  • Potential risks of employing speech foundation models
    • privacy breaches through the representations
    • lack of sustainability due to computing resources
    • lack of fairness concerning gender or other speaker characteristics
  • Strategies for integrating tech and non-tech elements to ensure model responsibility
    • theoretical or conceptual justifications are welcome

Website:

https://sites.google.com/view/responsiblespeech

Organizers:

  • Yuanchao Li (University of Edinburgh)
  • Jennifer Williams (University of Southampton)
  • Tiantian Feng (USC)
  • Vikramjit Mitra (Apple AI/ML)
  • Yuan Gong (MIT)
  • Bowen Shi (Meta AI)
  • Catherine Lai (University of Edinburgh)
  • Peter Bell (University of Edinburgh)

Date: Thursday, September 5

Time: 13:30 – 15:30

Hall: Yanis Club

 

Introduction

Speech technology is increasingly becoming embedded in everyday life, with applications spanning from critical domains such as medicine, psychiatry, and education to more commercial settings. This rapid growth is largely due to the successful use of deep learning to model large amounts of speech data. However, the performance of speech technology applications varies considerably depending on the demographics of the population the technology has been trained on and is applied to. That is, inequity in speech technology appears across age, gender, speakers with vocal disorders or from atypical populations, and speakers with non-native accents.

This interdisciplinary session will bring together researchers working on child speech from speech science and speech technology. In line with the theme of Interspeech 2024, Speech and Beyond, the proposed session will address the limitations and advances of speech technology and speech science, focusing on child speech. Furthermore, the session will aid in the mutual development of speech technology and speech science for child speech, while benefiting and bringing both the communities together.

Topics

The session will consist of a series of oral presentations (or posters) on topics of interest including, but not limited to:

  • Using speech science (knowledge from children’s speech acquisition, production, perception, and generally natural language understanding) to develop and improve speech technology applications.
  • Using techniques used for developing speech technology to learn more about child speech production, perception and processing.
  • Computational modelling of child speech.
  • Speech technology applications for children, including (but not limited to) speech recognition, voice conversion, language identification, segmentation, diarization, etc.
  • Use and/or modification of data creation techniques, feature extraction schemes, tools and training architectures developed for adult speech for developing child speech applications.
  • Speech technology for children from typical and non-typical groups (atypical, non-native speech, slow-learners, etc.)

Website:

https://sites.google.com/view/sciencetech4childspeech-is24/home

Organizers (in alphabetical order):

  • Nina R. Benway (University of Maryland A. James Clark School of Engineering, USA)
  • Odette Scharenborg (Delft University of Technology, the Netherlands)
  • Sneha Das (Technical University of Denmark, Denmark)
  • Tanvina Patel (Delft University of Technology, the Netherlands)
  • Zhengjun Yue (Delft University of Technology, the Netherlands)

Challenges

Date: Monday, September 2

Time: 15:00 – 17:00

Hall: Yanis Club

Introduction

Cognitive problems, such as memory loss, speech and language impairment, and reasoning difficulties, occur frequently among older adults and often precede the onset of dementia syndromes. Due to the high prevalence of dementia worldwide, research into cognitive impairment for the purposes of dementia prevention and early detection has become a priority in healthcare. There is a need for cost-effective and scalable methods for the assessment of cognition and detection of impairment, from its most subtle forms to severe manifestations of dementia. Speech is an easily collectable behavioural signal which reflects cognitive function, and therefore could potentially serve as a digital biomarker of cognitive function, presenting a unique opportunity for the application of speech technology. While most studies to date have focused on English speech data, the TAUKADIAL Challenge aims to explore speech as a marker of cognition in a global health context, providing data from two major languages, namely Chinese and English. The TAUKADIAL Challenge’s tasks will focus on the prediction of cognitive test scores and the diagnosis of mild cognitive impairment (MCI) in older speakers of Chinese and English. The TAUKADIAL Challenge therefore seeks the participation of teams aiming to develop language-independent models based on comparable acoustic and/or linguistic features of speech. TAUKADIAL will provide a forum for the discussion of different approaches to this task.
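
The sketch below illustrates, in the simplest terms, the kind of language-independent acoustic pipeline the challenge invites: extract a standard acoustic feature set (here eGeMAPS via openSMILE) and regress cognitive test scores from it. The file names and scores are placeholders, and this is not the challenge baseline system.

```python
# Hypothetical sketch of a language-independent acoustic pipeline for cognitive-score prediction.
import opensmile
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# eGeMAPS functionals: one fixed-length, language-independent acoustic vector per recording.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

# Placeholder recordings and cognitive test scores; replace with the actual challenge data.
wav_files = ["spk001.wav", "spk002.wav", "spk003.wav"]
scores = [28.0, 24.0, 21.0]

features = pd.concat([smile.process_file(f) for f in wav_files])

# Simple ridge regression of scores from acoustic features, scored by cross-validated MAE.
model = Ridge(alpha=1.0)
print(cross_val_score(model, features.values, scores, cv=3, scoring="neg_mean_absolute_error"))
```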

Website:

https://luzs.gitlab.io/taukadial/

Organizers:

  • Saturnino Luz (Usher Institute, Edinburgh Medical School, University of Edinburgh, Scotland)
  • Sofia de la Fuente Garcia (School of Health in Social Science, University of Edinburgh, Scotland)
  • Fasih Haider (School of Engineering, University of Edinburgh, Scotland)
  • Davida Fromm (Psychology Department, Carnegie Mellon University, USA)
  • Brian MacWhinney (Carnegie Mellon University, USA)
  • Chia-Ju Chou (Department of Neurology at Cardinal Tien Hospital, Taiwan)
  • Ya-Ning Chang (Miin Wu School of Computing, National Cheng Kung University, Taiwan)
  • Yi-Chien Li (Neurology Department, Cardinal Tien Hospital, Taiwan)

Date: Tuesday, September 3

Time: 13:30 – 15:30

Hall: Poster Area 2A

 

Introduction

In multilingual cultures, social interactions frequently comprise code-mixed or code-switched speech. These instances pose significant challenges for speech-based systems, such as speaker and language identification or automatic speech recognition, in extracting various analytics to produce rich transcriptions. To address these challenges, we organized the first DISPLACE challenge in 2023. Despite extensive global efforts in system development, the tasks remain substantially challenging. Building on the DISPLACE 2023 challenge, we have launched the second edition, DISPLACE 2024.

The DISPLACE 2024 challenge draws attention to new limitations and advancements in multilingual speaker diarization, language diarization within multi-speaker settings, and automatic speech recognition in code-mixed/switched and multi-accent scenarios, evaluated using the same dataset. The challenge reflects the theme of Interspeech 2024, “Speech and Beyond - Advancing Speech Recognition and Meeting New Challenges”, in its true sense.

For this challenge, we release more than 100 hours of data (both supervised and unsupervised) for development and evaluation purposes. The unsupervised, domain-matched data is released for participants to use in model adaptation. No training data will be provided, and participants are free to use any resources to train their models. A baseline system and an online leaderboard will also be made available to participants. To the best of our knowledge, no publicly available dataset matches the diverse characteristics observed in the DISPLACE dataset, including code-mixing/switching, natural overlaps, reverberation, and noise.
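
For readers new to the task, the toy example below shows how diarization output is typically scored with diarization error rate (DER) using pyannote.metrics; the segment times and labels are invented, and this is not the official DISPLACE scoring setup.

```python
# Toy example of diarization error rate (DER), the usual metric for speaker diarization.
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# Reference: who spoke when (seconds). Hypothesis: a system's guess. Values are made up.
reference = Annotation()
reference[Segment(0.0, 10.0)] = "spk_A"
reference[Segment(10.0, 20.0)] = "spk_B"

hypothesis = Annotation()
hypothesis[Segment(0.0, 12.0)] = "s1"     # hypothesis labels need not match; they are mapped optimally
hypothesis[Segment(12.0, 20.0)] = "s2"

metric = DiarizationErrorRate()
print(f"DER = {metric(reference, hypothesis):.3f}")   # 2 s of speaker confusion over 20 s -> 0.10
```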

Tracks:

  1. Speaker diarization in multilingual scenarios
  2. Language diarization in multi-speaker settings
  3. Automatic speech recognition in multi-accent settings

Website:

https://displace2024.github.io/

Organizers:

  • Kalluri Shareef Babu (Indian Institute of Science (IISc), India)
  • Shikha Baghel (National Institute of Technology Karnataka, India)
  • Deepu Vijayasenan (National Institute of Technology Karnataka, India)
  • Sriram Ganapathy (Indian Institute of Science (IISc), India)
  • S. R. Mahadeva Prasanna (Indian Institute of Technology Dharwad, India)
  • K. T. Deepak (Indian Institute of Information Technology Dharwad, India)

Date: Tuesday, September 3

Time: 16:00 – 18:00

Hall: Melambus

 

Introduction

In conventional speech processing approaches, models typically take either raw waveforms or high-dimensional features derived from these waveforms as input. For instance, spectral speech features continue to be widely employed, while learning-based deep neural network features have gained prominence in recent years. A promising alternative arises in the form of discrete speech representations, where the speech signal within a temporal window is represented by a discrete token, as demonstrated in recent work.
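
The minimal sketch below illustrates the common recipe behind discrete speech units: frame-level SSL features are quantized with k-means so that each frame becomes a single integer token. The random features stand in for a real SSL encoder output (e.g., an intermediate HuBERT layer); this is not the challenge's official tokenizer.

```python
# Minimal sketch: turning frame-level SSL features into discrete tokens with k-means (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for SSL encoder output: (num_frames, feature_dim), e.g. features from an intermediate layer.
frame_features = rng.normal(size=(500, 768))

# Learn a codebook over the frames; real systems fit this on many hours of speech.
codebook = KMeans(n_clusters=100, n_init=10, random_state=0).fit(frame_features)

# Each ~20 ms frame is now a single integer token usable by LM-style downstream models.
tokens = codebook.predict(frame_features)

# Run-length "deduplication" of repeated tokens, often applied before ASR/TTS modeling.
dedup = [t for i, t in enumerate(tokens) if i == 0 or t != tokens[i - 1]]
print(tokens[:20], len(dedup))
```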

Three challenging tasks are proposed for using discrete speech representations.

  1. Automatic speech recognition (ASR): We will evaluate the ASR performance of submitted systems on the designated challenge data.
  2. Text-to-speech (TTS): We will evaluate the quality of the generated speech.
  3. Singing voice synthesis (SVS): We will evaluate the quality of the synthesized singing voice.

Participation is open to all, and each team may take part in any task. This challenge is held as a special session at Interspeech 2024, and participants are strongly encouraged to submit papers to the special session. The focus of the special session is to promote the adoption of discrete speech representations and to encourage novel insights.

Topics

We welcome submissions on various topics related to discrete speech representation and downstream tasks, including but not limited to:

  • Discrete speech/audio/music representation learning
  • Application of discrete representations to downstream speech/audio processing tasks (ASR, TTS, etc.)
  • Evaluation of discrete speech/audio representations
  • Efficient discrete speech/audio representations
  • Interpretability of discrete speech/audio representations
  • Other novel uses of discrete representations in speech/audio

Website:

https://www.wavlab.org/activities/2024/Interspeech2024-Discrete-Speech-Unit-Challenge/

Organizers:

  • Xuankai Chang (Carnegie Mellon University, U.S.)
  • Jiatong Shi (Carnegie Mellon University, U.S.)
  • Shinji Watanabe (Carnegie Mellon University, U.S.)
  • Yossi Adi (Hebrew University, Israel)
  • Xie Chen (Shanghai Jiao Tong University, China)
  • Qin Jin (Renmin University of China, China)

New Track - Blue Sky

Introduction

This year, we also encourage authors to consider submitting to the new BLUE SKY track of highly innovative papers with strong theoretical or conceptual justification, in fields or directions that have not yet been explored. Large-scale experimental evaluation will not be required for papers in this track. Incremental work will not be accepted. If you think your work satisfies these requirements, please consider submitting a paper to this challenging and competitive track. Please note that, to achieve the objectives of the BLUE SKY track, we will ask the most experienced reviewers (mainly our ISCA Fellow members) to assess the submissions.

Chairpersons:

  • John H.L. Hansen (Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas, USA)
  • Dr.-Ing. Tanja Schultz (Mathematics and Computer Science: Cognitive Systems Lab, University of Bremen, Germany)

Tuesday, September 3

  1. The MARRYS Helmet: A New Device for Researching and Training “Jaw Dancing” (2039) – Vidar Freyr Gudmundsson (University of Southern Denmark); Keve Márton Gönczi (University of Southern Denmark); Malin Svensson Lundmark (Lund University); Donna M Erickson (Haskins Laboratories); Oliver Niebuhr (University of Southern Denmark)
  • Session: Pathological Speech Analysis 1 (A13-O1)
  • Time: 10:00-10:40
  • Location: Panacea Amphitheater

The paper introduces a new device for analyzing, teaching, and training jaw movements: the MARRYS helmet. We outline the motivation for the development of the helmet, describe its key advantages and features relative to those of the Electromagnetic Articulograph (EMA) and illustrate by means of selected study portraits the possible uses of the MARRYS helmet in the various fields of the empirical and applied speech sciences.

Index Terms: MARRYS, EMA, articulation, jaw, prosody.

Thursday, September 5

  1. Human-like Linguistic Biases in Neural Speech Models: Phonetic Categorization and Phonotactic Constraints in Wav2Vec2.0 (2490) – Marianne LS de Heer Kloots (University of Amsterdam); Willem Zuidema (ILLC, UvA)
  • Session: Leveraging Large Language Models and Contextual Features for Phonetic Analysis (SS-4)
  • Time: 10:00-10:40
  • Location: Acesso

What do deep neural speech models know about phonology? Existing work has examined the encoding of individual linguistic units such as phonemes in these models. Here we investigate interactions between units. Inspired by classic experiments on human speech perception, we study how Wav2Vec2 resolves phonotactic constraints. We synthesize sounds on an acoustic continuum between /l/ and /r/ and embed them in controlled contexts where only /l/, only /r/, or neither occur in English. Like humans, Wav2Vec2 models show a bias towards the phonotactically admissible category in processing such ambiguous sounds. Using simple measures to analyze model internals on the level of individual stimuli, we find that this bias emerges in early layers of the model’s Transformer module. This effect is amplified by ASR finetuning but also present in fully self-supervised models. Our approach demonstrates how controlled stimulus designs can help localize specific linguistic knowledge in neural speech models.
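
For readers who want to probe model internals in a similar layer-wise fashion, the hedged sketch below extracts per-layer hidden states from a Wav2Vec2 model with the Hugging Face transformers library. The synthetic waveform stands in for a stimulus such as a step on an /l/–/r/ continuum, and the model checkpoint is an illustrative choice; this is not the authors' analysis code.

```python
# Hedged sketch: extracting layer-wise Wav2Vec2 hidden states for stimulus-level analysis.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

model_name = "facebook/wav2vec2-base-960h"   # illustrative checkpoint choice
extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name, output_hidden_states=True)
model.eval()

# Stand-in for a synthesized stimulus (1 s at 16 kHz); replace with a real continuum step.
waveform = torch.randn(16000)
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# hidden_states: one (batch, frames, dim) tensor per layer, starting from the CNN feature encoder output.
for layer_idx, states in enumerate(outputs.hidden_states):
    print(layer_idx, states.shape)   # per-layer representations for downstream probes or similarity measures
```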