ASR transcript synthesis

What and why

Recently I was involved with some work focused on transcribing call-centre audio. The application was solid: improve process transparency and QA via ASR, then apply downstream analytics over the transcripts, with the ultimate goal of better anticipating the nature and volume of busy periods within the call centre.

However, there was a significant delay in acquiring the audio. The organisation was (naturally) bound to third-party call-centre software, which made exporting the call audio slow. This left me among a team of nervous engineers who had constructed an untested ASR pipeline, with the time available to test it rapidly running out. We needed a way to exercise the pipeline with realistic (enough) data to get ahead of any bugs.

Project delays and client dynamics aside, I thought it would be interesting to see if we could manufacture some call audio which featured the same “contours” we were anticipating within the call-centre data. In my mind, the process would roughly look like this:

[diagram: the rough plan: generate a two-speaker transcript, convert each utterance to speech, then splice the utterances into a channel-separated stereo recording]

The gist is that we’re trying to reverse-engineer a single source of call audio from a transcript, which we can then feed to our ASR pipeline, with the following attributes:

  • Two speakers. We assume that each channel is occupied by a single speaker. This assumption is easily broken by background noise/interference, speaker switching, etc., but it will do for the sake of testing our ASR pipeline.
  • Stereo channel separation. Similarly, it is assumed that each speaker’s audio is contained exclusively in either the left or the right channel. Variations of this structure could define n channels of audio to model conference calls and/or line transfers. A minimal sketch of the target layout follows.
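To make the target structure concrete, here’s a minimal, illustrative sketch (file name and sample values are made up): a two-channel float array in which one speaker occupies the left channel and the other the right, with silence in the opposing channel during each utterance.

import numpy as np
import soundfile

sample_rate = 22050  # matches librosa's default, used later in this post

# two made-up utterances: speaker 1 speaks first, then speaker 2
utterance_1 = np.random.uniform(-0.1, 0.1, 1 * sample_rate).astype(np.float32)
utterance_2 = np.random.uniform(-0.1, 0.1, 2 * sample_rate).astype(np.float32)

# left channel: speaker 1's audio, then silence while speaker 2 talks
left = np.concatenate([utterance_1, np.zeros_like(utterance_2)])
# right channel: silence while speaker 1 talks, then speaker 2's audio
right = np.concatenate([np.zeros_like(utterance_1), utterance_2])

# stack into an (n_samples, 2) array and write a stereo WAV
soundfile.write("target_layout_example.wav", np.stack([left, right], axis=1), sample_rate)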

Transcript generation

I perused some transcript generation options:

  • Samsum summarization dataset. A labelled dataset typically used for fine-tuning summarization models, featuring interleaved, messenger-style chat transcripts, each labelled with a short, multi-sentence summary. Using the samsum dataset, I thought we could skip the synthesis entirely and use the raw transcript data without invoking a model. So far so good; the data looks great!
["Hannah: Hey, do you have Betty's number?",
 'Amanda: Lemme check',
 'Hannah: <file_gif>',
 "Amanda: Sorry, can't find it.",
 'Amanda: Ask Larry',
 'Amanda: He called her last time we were at the park together',
 "Hannah: I don't know him well",
 'Hannah: <file_gif>',
 "Amanda: Don't be shy, he's very nice",
 'Hannah: If you say so..',
 "Hannah: I'd rather you texted him",
 'Amanda: Just text him 🙂',
 'Hannah: Urgh.. Alright',
 'Hannah: Bye',
 'Amanda: Bye bye']

However, I noted that samsum is released under a non-commercial licence, so no dice.

  • microsoft/DialoGPT (medium/large). A large-scale pre-trained response-generation model that can be used for multi-turn conversations. Sounds promising! Though after some tinkering with the basic demo, I found the model to be unresponsive to changes in input and a little dull.
>> User:Does money buy happiness?
DialoGPT: Money buys happiness, but it also buys a lot of things that make you happy.
>> User:Fascinating! like what?
DialoGPT: Money buys happiness, but it also buys a lot of things that make you happy.
>> User:And then what?
DialoGPT: Then you buy more happiness.
>> User:Hmm I think I'm going to go now!
DialoGPT: I'm going to go now!
>> User:I just said that?
DialoGPT: I just said that!
  • ParlAI. Facebook's dialogue-model training framework. Importantly, ParlAI offers a model zoo of pre-trained models, pre-existing task definitions, and a CLI that lets users generate episodes of “self-chat”. This last feature is the key one, as it allows transcripts to be generated automatically between two agents; the models essentially talk amongst themselves. An example transcript is below, complete with priming context:
[context]: your persona: i have one brother.
your persona: i like tacos.
your persona: i have three sisters.
your persona: i love dogs.
   [context]: your persona: my bedroom is purple and lime green.
your persona: i am a vegan.
your persona: i like riding horses.
your persona: i'll graduate from high school in two years.
[TransformerGenerator_1]: hello ! i ' ve a dog . what kind of dogs do you have ? mine is a cat .
   [TransformerGenerator_2]: i don ' t have any pets . i ' m a veggie . do you like meat ?
[TransformerGenerator_1]: i do . i do not eat meat . what is your favorite food ? my three siblings eat meat a lot .
   [TransformerGenerator_2]: i love pizza . do they have purple and green in their room ? i have purple in mine .
[TransformerGenerator_1]: pizza is great . they ' re all vegan . i like to eat tacos and pizza .
   [TransformerGenerator_2]: i have a horse . what do you do for fun ? i ride horses in my free time .
[TransformerGenerator_1]: that ' s a nice hobby . i enjoy hanging out with my three sisters , who live on the farm .
   [TransformerGenerator_2]: do you go to school ? i finished high school two years ago . i want to graduate in two .
[TransformerGenerator_1]: i am a junior in high school . congratulations on graduation ! i am studying to be a vet .
   [TransformerGenerator_2]: that is awesome ! i was going to go to vet school but i decided to be an animal vet instead .
  • This is definitely.. a conversation of sorts, if a little janky and nonsensical, with tinges of bias (something the Meta/ParlAI team appears to be actively trying to mitigate). Consistency and bias aside, it’ll suffice for now. You’ll also notice that the output tokens from the model are joined together naively; this shouldn’t be much of a problem for the downstream TTS (save for some awkward elongations/pauses due to the use of ' '.join()). A rough sketch of the self-chat invocation follows this list.
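For reference, generating these episodes looks roughly like the following. Treat this as a sketch only: the model-zoo path, task and option names are assumptions that may differ between ParlAI versions, so check parlai self_chat --help for the exact flags.

# rough sketch: generate self-chat episodes via ParlAI's Python entry point
# (model file, task and option names below are assumptions; versions differ)
from parlai.scripts.self_chat import SelfChat

SelfChat.main(
    model_file="zoo:blender/blender_90M/model",  # any generative model-zoo model
    task="convai2",                              # supplies the persona "context" lines
    num_self_chats=1,                            # number of episodes/transcripts
    selfchat_max_turns=10,                       # utterances per episode
    outfile="output/self_chat_episodes.jsonl",   # where the raw episodes land
)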

Text-to-speech

Originally I had a “Microsoft Sam” style of program in mind for converting the transcript utterances into audio, so I perused some options that could be invoked from within a Python script.

  • pyttsx3. A well-regarded package that works offline, using espeak, ffmpeg and libespeak1 for the underlying synthesis. The best and worst thing about it is that it requires no internet connection and relies entirely on local libraries.
  • gTTS. A thin wrapper around Google’s text-to-speech API. The nice thing about gTTS is that it requires a network connection and nothing more. Despite this advantage, we’ll probably still need ffmpeg to do some of the heavy lifting when it comes to audio splicing. In any case, I noticed that gTTS also offers a range of localised accents selected via Google’s regional domains (.google.com, .google.ad, .google.ae, etc.), which could be useful to spice up the speech across a group of transcript episodes; a small sketch follows this list.
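To illustrate the accent point, here is a small sketch that renders the same utterance under a few different regional domains via gTTS’s tld argument (the TLD values and file names here are just examples):

import gtts

utterance = "thanks for calling , how can i help you today ?"

# same text, different regional accents via the tld argument
for tld in ["com", "com.au", "co.uk", "co.in"]:
    tts = gtts.gTTS(utterance, lang="en", tld=tld)
    tts.save(f"accent_sample_{tld.replace('.', '_')}.mp3")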

Audio splicing

So, now that we have decent transcript-generation and text-to-speech mechanisms, we can trace through the original plan. Given a conversation transcript:

transcript = """[TransformerGenerator_1]: hello ! i ' ve a dog . what kind of dogs do you have ? mine is a cat .
   [TransformerGenerator_2]: i don ' t have any pets . i ' m a veggie . do you like meat ?
[TransformerGenerator_1]: i do . i do not eat meat . what is your favorite food ? my three siblings eat meat a lot .
   [TransformerGenerator_2]: i love pizza . do they have purple and green in their room ? i have purple in mine .
[TransformerGenerator_1]: pizza is great . they ' re all vegan . i like to eat tacos and pizza .
   [TransformerGenerator_2]: i have a horse . what do you do for fun ? i ride horses in my free time .
[TransformerGenerator_1]: that ' s a nice hobby . i enjoy hanging out with my three sisters , who live on the farm .
   [TransformerGenerator_2]: do you go to school ? i finished high school two years ago . i want to graduate in two .
[TransformerGenerator_1]: i am a junior in high school . congratulations on graduation ! i am studying to be a vet .
   [TransformerGenerator_2]: that is awesome ! i was going to go to vet school but i decided to be an animal vet instead ."""

We’d like to process each interleaved speaker utterance using gTTS:

import time
from pathlib import Path
import gtts

# store temp TTS utterance audio
output_dir = Path("./output/synth_calls/sample_transcript")
output_dir.mkdir(exist_ok=True, parents=True)

# TTS for each utterance
for idx, utterance in enumerate(transcript.split("\n")):
    time.sleep(1)  # throttle requests to avoid rate limiting / an IP ban
    speaker = "speaker_1" if idx % 2 == 0 else "speaker_2"
    formatted_utterance = utterance.split(":", 1)[1].strip()  # drop the speaker-tag prefix
    output_file = output_dir / f"{speaker}_utterance_{idx}.mp3"
    tts = gtts.gTTS(formatted_utterance, lang="en", tld="com.au", slow=True)
    tts.save(output_file)

creating temporary, sequential audio fragments:

[screenshot: the output directory of sequential speaker_N_utterance_idx.mp3 fragments]

Which we can then collate:

import librosa
import pandas as pd

audio_fragment_records = []
for file in output_dir.glob("*.mp3"):
    # collate utterance audio files into raw samples
    y, sr = librosa.load(str(file))  # librosa resamples to its default 22050 Hz
    audio_fragment_records.append(
        {"file": file.name, "sample_array": y, "sample_array_shape": y.shape[0]}
    )

audio_fragments = (
    pd.DataFrame(audio_fragment_records)
    # recover the utterance index from the filename (glob order isn't guaranteed)
    .assign(sequence_idx=lambda x: x.file.apply(lambda y: int(y.split("_")[-1].split(".")[0])))
    .sort_values("sequence_idx")
    # speaker as channel
    .assign(channel=lambda x: x.sequence_idx.apply(lambda y: 1 if y % 2 == 0 else 2))
    .reset_index(drop=True)
)
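As a quick, purely illustrative sanity check, we can confirm the fragments came back in conversational order and get a rough per-speaker talk time from the sample counts:

# sanity check: confirm ordering and estimate per-speaker talk time
default_sr = 22050  # librosa's default sample rate, used when loading above
audio_fragments["duration_s"] = audio_fragments["sample_array_shape"] / default_sr

print(audio_fragments[["file", "sequence_idx", "channel", "duration_s"]])
print(audio_fragments.groupby("channel")["duration_s"].sum())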

And then splice the fragments together into padded, channel-specific amplitude arrays, which are combined into a single stereo audio file:

import numpy as np
from pydub import AudioSegment
import soundfile

# pad channel 1/2 chunks so the two channels interleave without overlapping
channel_1_segments = []
channel_2_segments = []
for idx, e in audio_fragments.iterrows():

    if e.channel == 1:
        channel_1_segments.append(e.sample_array)
        # pad alternating channel (channel 2) with equivalent size zero array to create interleave
        channel_2_segments.append(np.zeros(e.sample_array.shape[0], dtype=np.float32))
    else:
        # odd indices are channel 2
        channel_2_segments.append(e.sample_array)
        # pad the alternating channel (channel 1) with an equivalent-size zero array
        channel_1_segments.append(np.zeros(e.sample_array.shape[0], dtype=np.float32))

# temp save for channel 1/2 audio - saves as mono
default_sr = 22050  # librosa's default sample rate, matching the loaded fragments
channel_1_padded = np.concatenate(channel_1_segments)
soundfile.write(output_dir / "channel_1_temp.wav", channel_1_padded, default_sr)

channel_2_padded = np.concatenate(channel_2_segments)
soundfile.write(output_dir / "channel_2_temp.wav", channel_2_padded, default_sr)

# consolidate into an interleaving, channel-separated stereo source
left_channel = AudioSegment.from_wav(output_dir / "channel_1_temp.wav")
right_channel = AudioSegment.from_wav(output_dir / "channel_2_temp.wav")

stereo_sound = AudioSegment.from_mono_audiosegments(left_channel, right_channel)
stereo_sound.export(output_dir / "consolidated_final.wav", format="wav")  # pydub defaults to mp3 unless a format is given
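As a final (illustrative) check that the export really is a two-channel file of the expected length:

import soundfile

# verify the consolidated file is stereo and check its duration
data, sr = soundfile.read(str(output_dir / "consolidated_final.wav"))
print(data.shape)                              # (n_samples, 2): one column per speaker
print(f"{data.shape[0] / sr:.1f} s at {sr} Hz")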

A final example below:

[embedded audio: “Ba dum”, a sample consolidated stereo call]

And just like that, we’re able to synthesise an arbitrary number of WAV files to use within an ASR data pipeline. The nice thing about the files we’ve created is that they contain no PII, and so aren’t subject to the typical data controls often found in industry.

You can find the repo here, which contains step-by-step notebooks for each of the processes described above, some sample outputs, as well as a final script that ties it all together if you’re in a rush.

Banner art developed with Stable Diffusion.