A team of researchers at Carnegie Mellon University is working to expand automated speech recognition to 2,000 languages. Right now, only a fraction of the estimated 7,000 to 8,000 languages spoken around the world benefit from modern language technologies like voice-to-text transcription or automatic captioning.
Xinjian Li is a Ph.D. student in the School of Computer Science’s Language Technologies Institute (LTI).
“A lot of people in this world speak diverse languages, but language technology tools aren’t being developed for all of them,” he said. “Developing technology and a good language model for all people is one of the goals of this research.”
Li belongs to a team of experts seeking to reduce the data requirements a language needs to develop a speech recognition model.
The team also includes LTI faculty members Shinji Watanabe, Florian Metze, David Mortensen and Alan Black.
The research, titled “ASR2K: Speech Recognition for Around 2,000 Languages Without Audio,” was presented at Interspeech 2022 in South Korea.
Most current speech recognition models require both text and audio data sets. While text data exists for thousands of languages, the same is not true for audio. The team wants to eliminate the need for audio data by focusing on linguistic elements common to many languages.
Speech recognition technologies usually focus on a language’s phonemes, the distinct sound units that distinguish it from other languages. These are unique to each language. At the same time, languages have phones, which describe how a word physically sounds, and multiple phones can correspond to a single phoneme. While separate languages can have different phonemes, the underlying phones may be the same.
The team is working on a speech recognition model that relies less on phonemes and more on information about how phones are shared across languages. This reduces the effort needed to build separate models for every individual language. Pairing the model with a phylogenetic tree, a diagram that maps the relationships between languages, helps with pronunciation rules. The team’s model and the tree structure have enabled them to approximate speech models for thousands of languages even without audio data.
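The phone-versus-phoneme distinction at the heart of this approach can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration (the allophone tables and function names are invented for this example, not taken from the ASR2K paper): a single universal phone inventory is collapsed into each language’s phonemes, which is why one shared phone recognizer can serve many languages.

```python
# Hypothetical allophone tables: in English, aspirated [ph] and plain [p]
# are variants of the same phoneme /p/; in Hindi they are distinct phonemes.
ALLOPHONES = {
    "english": {"p": "p", "ph": "p", "t": "t", "th": "t"},
    "hindi":   {"p": "p", "ph": "ph", "t": "t", "th": "th"},
}

def phones_to_phonemes(phones, language):
    """Collapse a universal phone sequence into one language's phonemes."""
    table = ALLOPHONES[language]
    return [table[p] for p in phones]

# The same universal phone output yields different phoneme sequences
# depending on the target language's sound system.
print(phones_to_phonemes(["ph", "t"], "english"))  # aspiration merged away
print(phones_to_phonemes(["ph", "t"], "hindi"))    # aspiration preserved
```

Because the phone layer is shared, only the small per-language mapping changes, which is what lets the approach scale without per-language audio.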
“We are trying to remove this audio data requirement, which helps us move from 100 to 200 languages to 2,000,” Li said. “This is the first research to target such a large number of languages, and we’re the first team aiming to expand language tools to this scope.”
The research, while still at an early stage, has improved existing language approximation tools by 5%.
“Each language is a very important factor in its culture. Each language has its own story, and if you don’t try to preserve languages, those stories might be lost,” Li said. “Developing this kind of speech recognition system and this tool is a step to try to preserve those languages.”