
In 2019 we launched Recorder, an audio recording app for Pixel phones that helps users create, manage, and edit audio recordings. It leverages recent advances in on-device machine learning to transcribe speech, recognize audio events, suggest tags for titles, and help users navigate transcripts.
However, some Recorder users found it difficult to navigate long recordings that have multiple speakers because it is not clear who said what. During the Made By Google event this year, we announced the “speaker labels” feature for the Recorder app. This opt-in feature annotates a recording transcript with unique and anonymous labels for each speaker (e.g., “Speaker 1”, “Speaker 2”, etc.) in real time during the recording. It significantly improves the readability and usability of the recording transcripts. This feature is powered by Google’s new speaker diarization system named Turn-to-Diarize, which was first presented at ICASSP 2022.
[Figure: Recorder transcript without speaker labels (left) and with speaker labels (right).]
System Architecture
Our speaker diarization system leverages several highly optimized machine learning models and algorithms to diarize hours of audio in a real-time streaming fashion with limited computational resources on mobile devices. The system mainly consists of three components: a speaker turn detection model that detects a change of speaker in the input speech, a speaker encoder model that extracts voice characteristics from each speaker turn, and a multi-stage clustering algorithm that assigns speaker labels to each speaker turn in a highly efficient way. All components run fully on the device.
[Figure: Architecture of the Turn-to-Diarize system.]
Detecting Speaker Turns
The first component of our system is a speaker turn detection model based on a Transformer Transducer (T-T), which converts the acoustic features into text transcripts augmented with a special token <st> representing a speaker turn. Unlike previous customized systems that use role-specific tokens (e.g., <doctor> and <patient>) for conversations, this model is more generic and can be trained on and deployed to various application domains.
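To make the role of the <st> token concrete, here is a minimal sketch of how a downstream step might split the model's token output into speaker turns. The function name `split_into_turns` and the plain list-of-strings token format are assumptions for illustration; the real system also carries timing information, which this sketch leaves out.

```python
from typing import List

SPEAKER_TURN_TOKEN = "<st>"


def split_into_turns(tokens: List[str]) -> List[List[str]]:
    """Group a token sequence into speaker turns, splitting at each <st>."""
    turns: List[List[str]] = []
    current: List[str] = []
    for token in tokens:
        if token == SPEAKER_TURN_TOKEN:
            # A turn boundary: close the current turn if it holds any words.
            if current:
                turns.append(current)
                current = []
        else:
            current.append(token)
    # Keep the trailing turn, which has no closing <st>.
    if current:
        turns.append(current)
    return turns


tokens = "how are you <st> fine thanks <st> good to hear".split()
turns = split_into_turns(tokens)
print(turns)
```

Note that the token only marks that the speaker changed, not who is speaking; assigning identities is left to the later stages.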
In most applications, the output of a diarization system is not directly shown to users, but combined with a separate automatic speech recognition (ASR) system that is trained to have smaller word errors. Therefore, for the diarization system, we are relatively more tolerant to word token errors than errors of the <st> token. Based on this intuition, we propose a new token-level loss function that allows us to train a small speaker turn detection model with high accuracy on predicted <st> tokens. Combined with edit-based minimum Bayes risk (EMBR) training, this new loss function significantly improved the interval-based F1 score on seven evaluation datasets.
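The intuition of penalizing <st> errors more than word errors can be illustrated with a simple weighted negative log-likelihood. This is not the loss function from the Turn-to-Diarize work: it assumes the targets and predictions are already aligned one-to-one (which a transducer does not provide directly), and `st_weight` is a hypothetical value chosen only for the example.

```python
import math
from typing import Dict, List

SPEAKER_TURN_TOKEN = "<st>"


def weighted_token_nll(
    target_tokens: List[str],
    predicted_probs: List[Dict[str, float]],
    st_weight: float = 5.0,  # hypothetical weight, not from the source
) -> float:
    """Weighted average negative log-likelihood; <st> positions count more."""
    if len(target_tokens) != len(predicted_probs):
        raise ValueError("targets and predictions must be aligned")
    total = 0.0
    weight_sum = 0.0
    for token, probs in zip(target_tokens, predicted_probs):
        weight = st_weight if token == SPEAKER_TURN_TOKEN else 1.0
        # Clamp to avoid log(0) when the target token received no probability.
        p = max(probs.get(token, 0.0), 1e-12)
        total += -weight * math.log(p)
        weight_sum += weight
    return total / weight_sum


targets = ["hi", "<st>"]
# Case A: the model is unsure about the word but confident about <st>.
loss_word_error = weighted_token_nll(targets, [{"hi": 0.1}, {"<st>": 0.9}])
# Case B: the model is confident about the word but unsure about <st>.
loss_turn_error = weighted_token_nll(targets, [{"hi": 0.9}, {"<st>": 0.1}])
print(loss_word_error < loss_turn_error)
```

With a weight above 1, a missed speaker turn costs more than an equally uncertain word, which is the behavior the paragraph above describes.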
Extracting Voice Characteristics
Once the audio recording has been segmented into homogeneous speaker turns, we use a speaker encoder model to extract an embedding vector (i.e., d-vector) to represent the voice characteristics of each speaker turn. This approach has several advantages over prior work that extracts embedding vectors from small fixed-length segments. First, it avoids extracting an embedding from a segment containing speech from multiple speakers. At the same time, each embedding covers a relatively large time range that contains sufficient signals from the speaker. It also reduces the total number of embeddings to be clustered, thus making the clustering step less expensive. These embeddings are processed entirely on-device until speaker labeling of the transcript is completed, and then deleted.
Multi-Stage Clustering
After the audio recording is represented by a sequence of embedding vectors, the last step is to cluster these embedding vectors and assign a speaker label to each. However, since audio recordings from the Recorder app can be as short as a few seconds or as long as 18 hours, it is critical for the clustering algorithm to handle sequences of drastically different lengths.
For this we propose a multi-stage clustering strategy to leverage the benefits of different clustering algorithms. First, we use the speaker turn detection outputs to determine whether there are at least two different speakers in the recording. For short sequences, we use agglomerative hierarchical clustering (AHC) as the fallback algorithm. For medium-length sequences, we use spectral clustering as our main algorithm, and use the eigen-gap criterion for accurate speaker count estimation. For long sequences, we reduce computational cost by using AHC to pre-cluster the sequence before feeding it to the main algorithm. During streaming, we keep a dynamic cache of previous AHC cluster centroids that can be reused for future clustering calls. This mechanism allows us to enforce an upper bound on the entire system with constant time and space complexity.
This multi-stage clustering strategy is a critical optimization for on-device applications where the budget for CPU, memory, and battery is very small, and allows the system to run in a low power mode even after diarizing hours of audio. As a tradeoff between quality and efficiency, the upper bound of the computational cost can be flexibly configured for devices with different computational resources.
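As an illustration of the eigen-gap criterion, the sketch below uses one common form: sort the eigenvalues of the affinity matrix in descending order and pick the count at the largest ratio between consecutive eigenvalues. Whether Recorder uses exactly this form is an assumption, as are the function name, the `max_speakers` cap, and the requirement that the eigenvalues be non-negative.

```python
from typing import Sequence


def estimate_speaker_count(
    eigenvalues: Sequence[float],
    max_speakers: int = 8,  # hypothetical cap, not from the source
) -> int:
    """Estimate the cluster count from the largest consecutive eigenvalue ratio.

    Assumes non-negative eigenvalues, e.g. from a positive semi-definite
    affinity matrix.
    """
    ordered = sorted(eigenvalues, reverse=True)
    limit = min(max_speakers, len(ordered) - 1)
    if limit < 1:
        return 1
    best_k, best_ratio = 1, 0.0
    for k in range(1, limit + 1):
        # Ratio of the k-th to the (k+1)-th largest eigenvalue (1-indexed).
        denominator = max(ordered[k], 1e-10)
        ratio = ordered[k - 1] / denominator
        if ratio > best_ratio:
            best_k, best_ratio = k, ratio
    return best_k


# Two dominant eigenvalues followed by a sharp drop suggest two speakers.
speaker_count = estimate_speaker_count([4.9, 4.7, 0.3, 0.2, 0.1])
print(speaker_count)
```

The eigenvalues themselves would come from an eigendecomposition of the affinity matrix built from the speaker-turn embeddings, which is omitted here.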
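The choice between stages can be sketched as a small dispatch function. The thresholds `short_max` and `medium_max`, the strategy names, and the `ClusteringConfig` type are all hypothetical; the source only states that short, medium-length, and long sequences are handled differently and that the cost bound is configurable per device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusteringConfig:
    # Hypothetical thresholds, counted in speaker-turn embeddings.
    short_max: int = 10
    medium_max: int = 300


def choose_strategy(
    num_embeddings: int,
    multiple_speakers: bool,
    config: ClusteringConfig = ClusteringConfig(),
) -> str:
    """Pick a clustering strategy from the sequence length."""
    if not multiple_speakers:
        # Turn detection found no speaker change, so no clustering is needed.
        return "single_speaker"
    if num_embeddings <= config.short_max:
        return "ahc"
    if num_embeddings <= config.medium_max:
        return "spectral"
    # Long input: shrink it with AHC first, then run spectral clustering.
    return "ahc_precluster_then_spectral"


print(choose_strategy(5, True))
print(choose_strategy(120, True))
print(choose_strategy(5000, True))

# A device with fewer resources could lower the bound on the spectral input.
low_power = ClusteringConfig(short_max=10, medium_max=100)
print(choose_strategy(120, True, low_power))
```

Lowering `medium_max` in this sketch caps the number of points that ever reach the more expensive spectral step, which mirrors the configurable quality and efficiency tradeoff described above.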
[Figure: Diagram of the multi-stage clustering strategy.]
Correction and Customization
In our real-time streaming speaker diarization system, as the model consumes more audio input, it accumulates confidence on predicted speaker labels, and may occasionally make corrections to previously predicted low-confidence speaker labels. The Recorder app automatically updates the speaker labels on the screen during recording to reflect the latest and most accurate predictions.
At the same time, the Recorder app’s UI allows the user to rename the anonymous speaker labels (e.g., “Speaker 2”) to customized labels (e.g., “car dealer”) for better readability and easier memorization for the user within each recording.
[Figure: Recorder allows the user to rename the speaker labels for better readability.]
Future Work
Currently, our diarization system mostly runs on the CPU block of Google Tensor, Google’s custom-built chip that powers more recent Pixel phones. We are working on delegating more computations to the TPU block, which will further reduce the overall power consumption of the diarization system. Another future work direction is to leverage multilingual capabilities of speaker encoder and speech recognition models to expand this feature to more languages.
Acknowledgments
The work described in this post represents joint efforts from multiple teams within Google. Contributors include Quan Wang, Yiling Huang, Evan Clark, Qi Cao, Han Lu, Guanlong Zhao, Wei Xia, Hasim Sak, Alvin Zhou, Jason Pelecanos, Luiza Timariu, Allen Su, Fan Zhang, Hugh Love, Kristi Bradford, Vincent Peng, Raff Tsai, Richard Chou, Yitong Lin, Ann Lu, Kelly Tsai, Hannah Bowman, Tracy Wu, Taral Joglekar, Dharmesh Mokani, Ajay Dudani, Ignacio Lopez Moreno, Diego Melendo Casado, Nino Tasca, Alex Gruenstein.




