What and Why
Across a lot of projects I’m involved in, ASR (automatic speech recognition) is the cool new thing. As I’ve written about previously, ASR has the potential to improve process transparency and couples effectively with downstream analytics, search, and modelling applications.
Most of the use cases I’ve seen also carry a significant cost implication. Consider a small, hypothetical support team (10 people) fielding a modest, uniformly distributed 500 calls a day with an Average Handle Time of 10 minutes; we’d be anticipating roughly 5,000 minutes of audio a day. For each of the major cloud providers:
- GCP. Our call assumptions sum to roughly 100k minutes/month. If we opt for a default model, make use of channel-summed audio, and account for the 60 minutes of free transcription, we’re looking at a monthly ballpark cost of $3,500 AUD, not including supporting infrastructure (cloud storage, remote instances/run configurations).
- AWS. Similarly, if we opt for standard models, batch processing, and the AP Sydney region, we fall into AWS’s T1 bracket (≤ 250k minutes). Accounting for the 60 minutes of free transcription time, we’re looking at a monthly ballpark cost of $3,600 AUD, again not including supporting infrastructure.
- Azure. Again, we opt for a standard model and account for a generous 5 free hours of transcription time, landing at a monthly ballpark cost of $2,400 AUD. This is substantially cheaper, though I suspect some hour-rounding errors in the Azure pricing calculator.
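For the curious, the arithmetic behind these ballparks is simple enough to sketch. The per-minute rate below is an assumed placeholder for illustration (real rates vary by provider, region, and tier):

```python
# Back-of-envelope monthly ASR cost, using the call-volume assumptions above.
# The per-minute rate passed in is an illustrative placeholder, not a quoted price.
CALLS_PER_DAY = 500
AHT_MINUTES = 10           # Average Handle Time
WORK_DAYS_PER_MONTH = 20

monthly_minutes = CALLS_PER_DAY * AHT_MINUTES * WORK_DAYS_PER_MONTH  # 100,000

def monthly_cost(rate_aud_per_minute, free_minutes=60):
    # billable minutes times rate, after any free-tier allowance
    return max(monthly_minutes - free_minutes, 0) * rate_aud_per_minute
```

An assumed rate in the vicinity of $0.035 AUD/minute lands near the GCP ballpark above; the point is that at this volume, small per-minute differences compound into hundreds of dollars a month.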
Given these cost implications, I thought it would be interesting to explore creating and evaluating an open-source equivalent of the cloud provider ASR services. What follows is the poor man’s ASR.
Data collection
Requirements. Ideally, we have access to audio, as well as an accompanying transcript that we can use to evaluate any ASR model/API.
YouTube. YouTube seems like a natural resource worth pursuing, with access to vast collections of audio via video through libraries like pytube. There are ways to access the transcriptions that accompany most YouTube videos but, as detailed here, these transcriptions are themselves generated via ASR. Not particularly ideal if we want to evaluate the performance of an ASR model, which is best done against human-validated/generated ground truth.
Radio national podcasts. Eventually, I settled on scraping Radio National podcasts. Importantly these podcasts feature:
- Diverse content. It’s the ABC! Teach me about unlearning chronic pain and chase it with a segment about re-wilding the Scottish highlands. Practically, the broad scope of this content will provide a good test of generalisability for any ASR solution.
- Variable speakers and audio quality. Clicking through the above examples, we can see that these podcasts feature anywhere from 3-5 guests. A quick listen also reveals that many of the guests are calling into the show, and that the podcasts have been compressed for web consumption into MP3 form. This means we’ll be contending with telephony audio that has been speaker-summed and down-sampled into low-quality MP3, which could be a real challenge. As an aside, here’s a nice podcast diving into Bell Labs’ historical design of telephony and the essential quality/size trade-off that leaves most telephony sounding lo-fi and crackled.
- Variable length. Similarly, Radio National podcasts range in length from 9 to 60 minutes. The sheer length of the audio, as input to an ASR model, will almost certainly be another challenge.
- Accompanying transcript. All Radio National podcasts feature an excellent accompanying transcript, which is punctuated and in all likelihood has been proofread by an accessibility specialist from the ABC. Quality assured!
RN Scraping. So Radio National it is. Browsing the relevant transcript sub-page via the main RN website, we can view a range of transcript pages:
Using beautiful soup, we can collect a list of episode-specific page URLs by iterating through the parent page links:
import requests
from bs4 import BeautifulSoup
from pathlib import Path

def get_podcast_page_urls(page_url, base_url):
    res = requests.get(page_url)
    soup = BeautifulSoup(res.content, "html.parser")
    podcast_page_urls = []
    for a in soup.find_all("a", href=True):
        # episode pages sit deeper than the bare /radionational/programs listing
        if "/radionational/programs" in a["href"] and len(Path(a["href"]).parts) > 3:
            podcast_page_urls.append(f"{base_url}{a['href']}")
    return podcast_page_urls
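The filtering condition is the interesting part: episode links sit deeper in the URL hierarchy than the bare programs listing. A stand-alone check over some made-up hrefs (note that pathlib counts the leading `/` as a path component):

```python
from pathlib import Path

# Made-up hrefs for illustration; only links under /radionational/programs
# with more than 3 path components (the leading "/" counts as one) survive.
hrefs = [
    "/radionational/programs",                  # bare listing: 3 parts, dropped
    "/radionational/programs/all-in-the-mind",  # 4 parts, kept
    "/news/some-article",                       # unrelated, dropped
]

episode_hrefs = [
    h for h in hrefs
    if "/radionational/programs" in h and len(Path(h).parts) > 3
]
```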
We can then extract/download the MP3 file URL and transcript text for each episode by searching for the relevant tags:
def get_podcast_mp3_link(page_soup):
    audio_elements = page_soup.find_all("audio")
    mp3_candidate_links = [e["src"] for e in audio_elements]
    if not mp3_candidate_links:
        pod_scrape_logger.warning("No candidate mp3 URL found")
        return None
    if len(mp3_candidate_links) > 1:
        pod_scrape_logger.warning("More than 1 candidate mp3 URL found")
    return mp3_candidate_links[0]

def download_podcast_mp3(mp3_url, audio_dir, file_name):
    doc = requests.get(mp3_url)
    with open(audio_dir / f"{file_name}.mp3", "wb") as f:
        f.write(doc.content)

def get_podcast_transcript(page_soup):
    results = page_soup.find(id="transcript")
    return results.get_text(separator="\n")
Post-processing. One thing I did notice is that the transcripts feature explicit speaker tags, as well as some odd miscellaneous new-lines and production overlay tags (e.g. [sound intro], describing the intro music to the podcast/segment). An example below:
Robyn Williams:
Who got his PhD in forestry in Melbourne. Is he right? Well, here's a thought from the late James Lovelock:
James Lovelock:
To me, clearance of the tropical forests is by far the most damaging thing that we are doing to the Earth and to people. You see, they are talked about in connection with the CO2
All of these could pose problems for downstream evaluation, where we’ll be concerned with raw token sequences rather than formatting. Some additional post-processing is as follows:
import re

def remove_excess_char(input_string):
    # collapse repeated new lines, tabs, carriage returns, and vertical tabs
    text = re.sub(r"\n{2,}", "\n", input_string)
    text = re.sub(r"\t{2,}", "\t", text)
    text = re.sub(r"\r{2,}", "\r", text)
    text = re.sub(r"\v{2,}", "\v", text)
    # collapse repeated spaces
    text = re.sub(r" {2,}", " ", text)
    return text
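To sanity-check the collapsing behaviour, here’s a condensed, stand-alone equivalent run over a fabricated snippet:

```python
import re

def collapse_whitespace(text):
    # collapse runs of each whitespace character to a single instance,
    # a condensed stand-alone version of remove_excess_char above
    for ch in ("\n", "\t", "\r", "\v", " "):
        text = re.sub(re.escape(ch) + "{2,}", ch, text)
    return text

sample = "Robyn Williams:\n\n\nWelcome    to the   show.\n\n"
cleaned = collapse_whitespace(sample)
```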
def remove_transcript_artefacts(transcript):
    filtered = []
    for line in transcript.replace("\n:", ":\n").split("\n"):
        line = line.strip()
        # colon in the initial fragment > probably a speaker tag
        if ":" in line[:20]:
            line = line.split(":", 1)[1].strip()
        # remove production audio overlay brackets/parens
        if "[" in line:
            line = re.sub(r"\[.*?\]", "", line)
        if "(" in line:
            line = re.sub(r"\(.*?\)", "", line)
        line = remove_excess_char(line)
        if len(line) == 0:
            continue
        if line.endswith(":"):
            # probably a bare speaker utterance mark
            continue
        filtered.append(line)
    return " ".join(filtered)
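As a quick check, here’s a condensed, stand-alone version of the speaker-tag and overlay stripping, run over a couple of fabricated lines in the style of the transcript excerpt above:

```python
import re

def strip_artefacts(line):
    # drop a leading "Speaker Name:" tag and any [production overlay] text,
    # a condensed stand-alone version of remove_transcript_artefacts above
    if ":" in line[:20]:
        line = line.split(":", 1)[1].strip()
    return re.sub(r"\[.*?\]", "", line).strip()

lines = [
    "Robyn Williams: Who got his PhD in forestry in Melbourne.",
    "[sound intro]",
    "James Lovelock: To me, clearance of the tropical forests is damaging.",
]
cleaned = " ".join(s for s in (strip_artefacts(l) for l in lines) if s)
```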
We also want to perform some basic set checks to make sure each podcast features both the show audio and the accompanying transcript text:
def prune_pairless_transcripts(audio_output_dir, transcript_output_dir):
    # keep only stems present as both audio and transcript
    all_audio = set(e.stem for e in audio_output_dir.glob("./*.mp3"))
    all_transcript = set(e.stem for e in transcript_output_dir.glob("./*.txt"))
    intersecting_transcripts = all_audio.intersection(all_transcript)
    for file in audio_output_dir.glob("./*.mp3"):
        if file.stem not in intersecting_transcripts:
            pod_scrape_logger.warning(
                f"Could not find {file.name} in audio/transcript intersection; removing"
            )
            file.unlink()
    for file in transcript_output_dir.glob("./*.txt"):
        if file.stem not in intersecting_transcripts:
            pod_scrape_logger.warning(
                f"Could not find {file.name} in audio/transcript intersection; removing"
            )
            file.unlink()
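The pairing logic can be exercised end-to-end in a temporary directory (file names fabricated for illustration):

```python
import tempfile
from pathlib import Path

# Sketch of the pairing check: only stems present as BOTH .mp3 and .txt survive.
with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    (d / "ep1.mp3").touch()
    (d / "ep1.txt").touch()
    (d / "ep2.mp3").touch()  # no matching transcript

    paired = {p.stem for p in d.glob("*.mp3")} & {p.stem for p in d.glob("*.txt")}
    for p in list(d.glob("*.mp3")) + list(d.glob("*.txt")):
        if p.stem not in paired:
            p.unlink()  # removes the orphaned ep2.mp3

    remaining = sorted(p.name for p in d.glob("*"))
```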
We’d also like to filter out some of the larger podcasts for now, to improve the iteration speed of our ASR pipeline prototyping. Conveniently, this can be accomplished by generating a dataset manifest, a concept borrowed from Nvidia’s NeMo library which we’ll see more of in part 2. Manifests are useful because they allow us to manipulate our transcripts (run preliminary queries, then file actions) as vanilla data frames:
import pandas as pd
from mutagen.mp3 import MP3

def create_manifest(
    audio_output_dir, transcript_output_dir, podcast_min_len=5, podcast_max_len=15
):
    transcript_records = []
    for audio, transcript in zip(
        sorted(audio_output_dir.glob("./*.mp3")),
        sorted(transcript_output_dir.glob("./*.txt")),
    ):
        assert audio.stem == transcript.stem
        with open(transcript, "r") as f:
            transcript_text = f.read()
        audio_len_seconds = MP3(audio).info.length  # read audio metadata once
        transcript_records.append(
            {
                "transcript": transcript_text,
                "len_seconds": audio_len_seconds,
                "len_minutes": audio_len_seconds / 60,
                "audio_path": audio.resolve(),
                "transcript_path": transcript.resolve(),
            }
        )
    podcast_manifest = (
        pd.DataFrame(transcript_records)
        .assign(stem=lambda x: x.audio_path.apply(lambda y: y.stem))
        .assign(
            transcript_len=lambda x: x.transcript.apply(lambda y: len(y.split(" ")))
        )
        .query("len_minutes >= @podcast_min_len & len_minutes <= @podcast_max_len")
        .assign(
            wpm=lambda x: x.apply(lambda y: y.transcript_len / y.len_minutes, axis=1)
        )
    )
    return podcast_manifest
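Stripped of the pandas/mutagen machinery, the manifest query boils down to a length filter plus a derived words-per-minute column. A plain-Python sketch over fabricated records:

```python
# Fabricated records standing in for the scraped podcasts.
records = [
    {"stem": "ep1", "len_minutes": 9.5,  "transcript_len": 1520},
    {"stem": "ep2", "len_minutes": 54.0, "transcript_len": 8100},  # too long
    {"stem": "ep3", "len_minutes": 12.0, "transcript_len": 1800},
]

def build_manifest(records, min_len=5, max_len=15):
    # keep podcasts within the length bounds and derive words-per-minute
    return [
        {**r, "wpm": r["transcript_len"] / r["len_minutes"]}
        for r in records
        if min_len <= r["len_minutes"] <= max_len
    ]

manifest = build_manifest(records)
```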
def prune_transcripts_not_in_manifest(manifest, audio_output_dir, transcript_output_dir):
    audio_file_names = [e.name for e in manifest.audio_path]
    transcript_file_names = [e.name for e in manifest.transcript_path]
    for file in audio_output_dir.glob("./*.mp3"):
        if file.name not in audio_file_names:
            pod_scrape_logger.warning(f"Could not find {file.name} in manifest; removing")
            file.unlink()
    for file in transcript_output_dir.glob("./*.txt"):
        if file.name not in transcript_file_names:
            pod_scrape_logger.warning(f"Could not find {file.name} in manifest; removing")
            file.unlink()
Ta da
So, to recap: we’ve scoped out an application/motive for building our own ASR pipeline, and assembled a small dataset of Radio National podcasts and transcripts. Cool. Some known issues and future improvements for the above work:
- Scraping brittleness. The perennial problem with web scraping is that it is brittle to changes in the source site. Each change in the site’s markup potentially breaks the scraper, and there is a good chance that if the above scripts were run again in a few months they would be broken and require some tweaking.
- ABC podcast usage. Radio National podcasts are subject to the same non-commercial terms of use policy as other ABC content. This means the transcripts are essentially only good for blog posts like this without obtaining additional permissions.
- Trove archives. I noticed that Trove features RN media listings which link back to the originating RN page. Trove provides an explicit search interface (Elasticsearch under the hood, I’m pretty sure), which would allow for more specific searches of RN podcasts (time periods? content types? presenters?).
You can find all the code from Part 1 here, which also includes a script to download ABC podcasts/transcripts. Jump on over to Part 2!
Banner art developed with stable diffusion.