Behold the humble nautilus
🐚

What and why

Sometime in the throes of 2020 I impulse bought myself a subscription to Nautilus magazine. Nautilus is a current affairs science periodical, covering all types of great STEM subjects in a similar style to Quanta and other publications.

The problem was that I had very little time to actually read the articles. I tried a few alternatives, such as exporting the issues to my phone's ebook reader and shuffling my reading schedule around, but I still found myself reading precious little Nautilus. Around this time I realised that I always have time for podcasts, so maybe I should try converting the issues into a listenable form instead. The following details precisely this process, in which I:

  • Acquire a bunch of Nautilus articles
  • Prep and parse the articles into an open format
  • Create audio renders of the articles using Text-to-speech (TTS)
  • Build a small streaming API to consume the articles

Nautilus Articles

The Nautilus subscriber portal contains a historical listing of all previous issues, which can be accessed as part of your subscription. Each issue exists as a downloadable e-publication, and at the time of writing there were something like 93 past issues. Downloading them all by hand is obviously a tedious exercise, so I used the downloadthemall Chrome extension to automate the process. So now I have all these past Nautilus editions, covering topics like Biology, Neuroscience, Insects and Philosophy. Excellent.

Article Parsing

We can't feed the Nautilus issues to a TTS engine in their raw epub form, so some parsing is in order. I used ebooklib to extract the content of each issue's epub, which returns an iterable of book "items". Each "item" represents an article within the issue, and is extracted as a record with minimal metadata and the chapter content in HTML form.

from pathlib import Path
import logging

import pandas as pd
from ebooklib import epub

logger = logging.getLogger(__name__)


def parse_issue_articles(ebook_path):
    book = epub.read_epub(ebook_path)
    article_records = []
    for item in book.get_items():
        # keep only the HTML items that look like standalone articles
        if Path(item.get_name()).suffix == ".html" and "chap" not in item.get_name():
            logger.info(f"Processing {item.get_name()}")
            article_record = parse_article_record(item.get_body_content())

            # add issue metadata
            article_record.update(
                {"item_name": item.get_name(), "issue_title": book.title}
            )
            article_records.append(article_record)
    return pd.DataFrame(article_records)
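
Run over every downloaded issue, this rolls up into the single CSV used later on. A rough sketch (the epubs/ directory name is an assumption; DATA_DIR and naut_all.csv match the paths that appear in the TTS step below):

# a minimal sketch: roll every downloaded issue up into one CSV
# (the epubs/ directory name is an assumption; DATA_DIR and naut_all.csv
#  match the paths used further below)
issue_frames = [
    parse_issue_articles(str(epub_path))
    for epub_path in sorted(Path("epubs").glob("*.epub"))
]
pd.concat(issue_frames, ignore_index=True).to_csv(DATA_DIR / "naut_all.csv", index=False)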

The existing record form of the articles is close to what we want, though ideally we'd also extract a few other pieces of metadata to make querying articles easier. So I wrote some additional code to extract/tag the segment (Editor's Note, Numbers, Biology, Astronomy), author, headline and byline. BeautifulSoup has the fix:

import codecs

import ftfy
from bs4 import BeautifulSoup


def decode_soup_element(element):
    # strip tags, undo escape sequences and repair any mojibake
    text_raw = element.get_text()
    decoded = codecs.decode(text_raw, "unicode_escape")
    return ftfy.fix_text(decoded).strip()


def parse_article_record(html_content):
    soup = BeautifulSoup(html_content, "html.parser")
    h1s = [decode_soup_element(e) for e in soup.find_all("h1")]
    h2s = [decode_soup_element(e) for e in soup.find_all("h2")]
    h3s = [decode_soup_element(e) for e in soup.find_all("h3")]
    h4s = [decode_soup_element(e) for e in soup.find_all("h4")]
    ps = [decode_soup_element(e) for e in soup.find_all("p")]

    article_record = {}
    if any("editor" in e.lower() for e in h1s):
        # editor segments use a slightly different heading layout
        article_record["headline"] = h2s[0] if h2s else None
        article_record["segment"] = h1s[0]
        article_record["author"] = h4s[1] if len(h4s) > 1 else None
    else:
        article_record["headline"] = h2s[0] if h2s else None
        article_record["byline"] = h3s[0] if h3s else None
        if h4s:
            article_record["segment"] = h4s[0]
            article_record["author"] = h4s[1] if len(h4s) > 1 else None

    article_record["article"] = "\n".join([e for e in ps if len(e) > 1])
    return article_record

Boom boom! We now have a single CSV full of individual article records that can be grouped and filtered by issue, author, segment etc.
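
To give a feel for the slicing this enables, a quick pandas sketch (column names follow the parsing code above; the filter value is purely illustrative):

import pandas as pd

articles = pd.read_csv(DATA_DIR / "naut_all.csv")

# article counts per issue
print(articles.groupby("issue_title").size().sort_values(ascending=False))

# just the Astronomy segment, say
astronomy_articles = articles[articles["segment"] == "Astronomy"]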

Text-to-speech

TTS is another sub-field of generative AI that has really been having a moment. As such, TTS solutions range from paid to open-source, and from generic to customised. I decided to review a smattering of options:

  • Coqui. A popular, open-source and well-regarded TTS kit. I quickly tested it on a fraction of an article consisting of 196 words and got a realistic rendering in about 55 seconds, suggesting that I'm probably going to need some acceleration to convert all of the Nautilus articles (a rough version of this quick test is sketched just after this list). I tested a few single-speaker models, mainly from the Tacotron and Glow series, as well as a multi-speaker model trained on the VCTK dataset. An example below:
  • Tortoise. An emerging, open-source TTS library with a focus on diffusion-based models. The installation and use of Tortoise was particularly cumbersome and prone to replicability issues (RE: manual clone + install), and was generally much slower to run even with GPU acceleration. In terms of an open-source solution, I decided to discard Tortoise in favour of Coqui’s faster, more readily available models. No examples to show here, because man it was S L O W.
  • AWS Polly. AWS's cloud-based TTS solution. I sampled their "Australian" offering of voices, which sadly only seemed to extend to "Nicole" and "Russell". A little flat for my liking; an undesirable characteristic that permeates TTS solutions. Some samples below:
  • GCP TTS. GCP's cloud-based TTS solution. I sampled voices from their Neural, News, Standard and Wavenet families, opting for "Australian" voices where possible. GCP tended to feature a much broader suite of options compared to AWS here. A little exaggerated in delivery at times, but IMO generally better than AWS Polly. Some samples below:
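
For reference, the Coqui quick test mentioned above looked roughly like the sketch below; the model name is one of Coqui's catalogued single-speaker models and the timing setup is simplified, so treat it as indicative rather than the exact script:

# minimal sketch of the single-speaker quick test
import time

from TTS.api import TTS

sample_text = "..."  # ~196 words lifted from one article

# one of Coqui's catalogued LJSpeech models, used here as an example
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC", gpu=False)
start = time.time()
tts.tts_to_file(text=sample_text, file_path="coqui_sample.wav")
print(f"Synthesized in {time.time() - start:.1f} seconds")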

Two things became apparent after testing all these TTS providers:

  • Cost. Whilst AWS Polly and GCP TTS both offer some entry-level concessions to lure people into using their services, the scale of what we have in mind would result in significant out-of-pocket expenses. As an example, I clocked a rough GCP estimate of ~US $353.80 to TTS all of our articles (see the back-of-the-envelope sketch after this list).
  • Variability. It became clear that even if I converted the articles from text to audio I would still have a sizeable listening task ahead of me. This being the case, I wanted a solution with good scope to vary the pace, style, intonation, accent and prosody of the voices. Excluding the cloud options, and factoring for speed/cost of inference, this left Coqui's VCTK multi-speaker model as a prime candidate, featuring over 100 distinct voices.
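
As a rough sanity check on that GCP figure, the arithmetic looks something like the sketch below, assuming a WaveNet-tier rate of roughly US $16 per million characters and an average word length of about 5.9 characters (both ballpark assumptions), so treat the output as indicative only:

# indicative only: assumed WaveNet-tier pricing and an average word length
N_TOKENS = 3_755_382               # total words across the parsed articles (see below)
CHARS_PER_WORD = 5.9               # rough average, including the trailing space
GCP_USD_PER_MILLION_CHARS = 16.0   # assumed WaveNet-tier rate

total_chars = N_TOKENS * CHARS_PER_WORD
gcp_estimate = total_chars / 1_000_000 * GCP_USD_PER_MILLION_CHARS
print(f"~US ${gcp_estimate:.2f}")  # lands in the same ballpark as the ~$353.80 quoted above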

Scaling TTS

I fired up a single NVIDIA RTX 3090 on runpod.io, costing US $0.44 an hour and generally available. Some back-of-the-envelope run-time calculations follow:

  • I logged the runtime for a random allocation of VCTK voices (of which there are over 100) across a sample of 20 articles and calculated a median per-token TTS conversion rate of 0.0076 seconds
  • I extrapolated this time estimate across the remaining corpus of articles, with a projected conversion time of 7.4 hours
  • I arrived at a rough runpod estimate of US $3.25 to convert 3,755,382 tokens contained within 1,551 articles

Even if we were to incorporate some "fumble buffer" and double this runpod estimate (e.g. idle GPU usage during environment setup/installation, debugging/dev time), we're still running these conversions at a fraction of what the GCP and AWS counterparts would cost. And as a nice bonus, we have over 100 voices. So, that's good I think.
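
Spelled out, that extrapolation is just the per-token rate times the token count, then hours times the hourly price. A quick sketch (it lands within roughly 10% of the figures above, which is close enough for back-of-the-envelope purposes):

# reproducing the extrapolation above (rate x tokens, then hours x hourly price)
SECONDS_PER_TOKEN = 0.0076       # median rate from the 20-article sample
N_TOKENS = 3_755_382             # tokens across the corpus of articles
RUNPOD_USD_PER_HOUR = 0.44       # RTX 3090 rate on runpod.io

hours = SECONDS_PER_TOKEN * N_TOKENS / 3600
print(f"~{hours:.1f} hours, ~US ${hours * RUNPOD_USD_PER_HOUR:.2f}")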

Anyway, the main TTS routine looks like this: for each article we choose a VCTK voice at random and write the output to disk:

def tts_coqui_vctk_multi_speaker(speaker_index, text, save_path):
    gpu_available = bool(torch.cuda.is_available())
    tts = TTS(model_name="tts_models/en/vctk/vits", gpu=gpu_available)
    save_path.parent.mkdir(parents=True, exist_ok=True)
    start = time.time()
    tts.tts_to_file(text=text, file_path=save_path, speaker=speaker_index)
    end = time.time()
    logger.info(
        f"Successfully synthesized speech for {str(save_path)} using Coqui VCTK {speaker_index} in {end - start} seconds"
    )

def tts_all_articles():
    # re-assign issue/article numbers, whoops
    df = (
        pd.read_csv(DATA_DIR / "naut_all.csv")
        .assign(issue_number=lambda x: x.issue_title.factorize()[0] + 1)
        .assign(article_number=lambda x: x.groupby("issue_number").cumcount() + 1)
    )
    output_dir = Path(__file__).parents[0] / "data/tts_output"
    log_records = []

    for idx, row in df.iterrows():
        issue_dir = output_dir / f"{row.issue_number}_{to_snake_case(row.issue_title)}"
        issue_dir.mkdir(parents=True, exist_ok=True)

        article_fp = (
            issue_dir / f"{row.article_number}_{to_snake_case(row.headline)}.mp3"
        )
        if article_fp.exists():
            logger.info(f"{article_fp} already exists; skipping..")
            continue

        speaker_index = random.choice(COQUI_VKTS_SPEAKER_INDICES)
        try:
            start = time.time()
            tts_coqui_vctk_multi_speaker(speaker_index, row.article, article_fp)
            end = time.time()
            log_records.append(
                {
                    "issue_number": row.issue_number,
                    "article_number": row.article_number,
                    "n_tokens": len(row.article.split(" ")),
                    "time_elapsed": end - start,
                }
            )
        except Exception:
            logger.error(f"Unable to synthesize text for: {row.headline}")

    pd.DataFrame(log_records).to_csv(DATA_DIR / "tts_logs.csv", index=False)

Once completed, we push the outputs into an S3 bucket, alongside the original e-pubs, the parsed articles and the rest of the TTS logging.
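
The upload itself is nothing fancy; a minimal boto3 sketch along these lines (the bucket value and key prefix are placeholders, and the real script may differ):

import boto3
from pathlib import Path

s3 = boto3.client("s3")
BUCKET_NAME = "naut-tts-bucket"  # placeholder value; the real bucket name lives in config

for mp3_path in Path("data/tts_output").rglob("*.mp3"):
    # mirror the local issue/article folder structure as the object key
    key = f"tts_output/{mp3_path.relative_to('data/tts_output')}"
    s3.upload_file(str(mp3_path), BUCKET_NAME, key)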

Streaming API

We could probably stop there and just ferry the audio over onto a device, and in all honesty, this is what I'll probably end up doing to save myself building a full-blown web/streaming application. But coming from a mainly text-centric background I feel like I have some knowledge gaps about handling "non-text" data within APIs, databases etc. So I thought I'd prototype a small streaming API.

The gist of the API (so far) is very simple: allow users to request a small, random sample of articles to get a feel for what an article "record" looks like, with an option to retrieve the article metadata and/or stream the article TTS:

@app.get("/")
def read_root():
    return RedirectResponse(url="/docs")


@app.get("/articles")
def read_articles(
    session: Session = Depends(get_db_session), api_key: str = Depends(api_key_header)
):
    result = session.execute(select(Article).limit(10))
    return result.scalars().all()


@app.get("/articles/{id}")
def read_article(
    id: int,
    session: Session = Depends(get_db_session),
    api_key: str = Depends(api_key_header),
):
    if article := session.get(Article, id):
        return article
    else:
        raise HTTPException(status_code=404, detail="Article not found")


@app.get("/articles/{id}/stream")
def stream_article(
    id: int,
    session: Session = Depends(get_db_session),
    api_key: str = Depends(api_key_header),
):
    article = session.get(Article, id)

    # Generate a pre-signed URL for the S3 object with a specific expiration time
    presigned_url = S3_CLIENT.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET_NAME, "Key": article.object_key},
        ExpiresIn=3600,  # URL expiration time in seconds (adjust as needed)
    )

    if not article:
        raise HTTPException(status_code=404, detail="Article not found")
    if not article.object_key:
        raise HTTPException(status_code=404, detail="No audio for this article")

    return RedirectResponse(presigned_url)

Here, the “streaming” is actually offloaded to the S3 service via a RedirectResponse using pre-signed URLs, a common pattern I’ve seen implemented in other APIs dealing with media-heavy content. I haven’t optimised any of the S3 config or stress-tested the API.
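
From the client side, consuming a stream just means following that redirect. A hypothetical example using requests (the X-API-Key header name, base URL and article id are assumptions; they depend on how api_key_header and the deployment are configured):

import requests

BASE_URL = "http://localhost:8000"          # assumed local deployment
headers = {"X-API-Key": "my-secret-key"}    # header name depends on api_key_header

# requests follows the RedirectResponse to the pre-signed S3 URL automatically
resp = requests.get(f"{BASE_URL}/articles/1/stream", headers=headers)
resp.raise_for_status()

with open("article_1.mp3", "wb") as f:
    f.write(resp.content)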

We also need a small database to store and query the articles; I've used a monolithic Article entity here:

from typing import Optional

from sqlalchemy import UniqueConstraint
from sqlmodel import Field, SQLModel


class Article(SQLModel, table=True):
    # composite uniqueness across headline/author/issue
    __table_args__ = (UniqueConstraint("headline", "author", "issue_number"),)

    id: Optional[int] = Field(default=None, primary_key=True)
    headline: str
    segment: str
    author: str
    article: str
    item_name: str
    issue_title: str
    byline: str
    issue_number: int
    article_number: int
    object_key: str

And we tie it into the actual API via a start-up routine: we create the DB tables using SQLModel (a v. nice library by Tiangolo which pairs well with FastAPI), pull our article CSV from S3 and ingest each article into a Postgres instance:

@app.on_event("startup")
def on_startup():
    # create the DB tables (a Session isn't needed for DDL)
    SQLModel.metadata.create_all(engine)

    # re-assign S3 keys to each record
    contents = list_s3_bucket_contents(BUCKET_NAME)
    s3_keys = pd.DataFrame({"object_key": [e for e in contents if ".mp3" in e]}).assign(
        join_key=lambda x: x.object_key.apply(lambda y: Path(y).name)
    )
    # ingest all articles
    object_data = S3_CLIENT.get_object(Bucket=BUCKET_NAME, Key=NAUT_ALL_OBJECT_KEY)
    file_data = object_data["Body"].read()

    # Convert the file data to a pandas DataFrame
    data = pd.read_csv(StringIO(file_data.decode("utf-8")))

    df = (
        data.assign(issue_number=lambda x: x.issue_title.factorize()[0] + 1)
        .assign(article_number=lambda x: x.groupby("issue_number").cumcount() + 1)
        .assign(
            join_key=lambda x: x.apply(
                lambda row: f"{row.article_number}_{to_snake_case(row.headline)}.mp3",
                axis=1,
            )
        )
        .merge(s3_keys, how="inner", on="join_key")
        .drop(columns=["join_key"])
    )
    articles = [Article(**e) for e in df.to_dict(orient="records")]

    with Session(engine) as session:
        session.add_all(articles)
        session.commit()

Anddddd we tie the whole thing together with a compose file like so:

version: "3.7"

services:
  api:
    build:
      dockerfile: Dockerfile
      context: ./app
    image: api
    env_file:
      - .env
    environment:
      DB_USER: ${DB_USER}
      DB_PASSWORD: ${DB_PASSWORD}
      DB_NAME: ${DB_NAME}
      DB_HOST: ${DB_HOST}
      DB_PORT: ${DB_PORT}
    ports:
      - "${API_PORT}:${API_PORT}"
    volumes:
      - ./app:/app
    depends_on:
      - db
    command: python api.py

  db:
    image: postgres:12.2
    env_file:
      - .env
    environment:
      POSTGRES_USER: ${DB_USER}
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_DB: ${DB_NAME}
    ports:
      - "${DB_PORT}:${DB_PORT}"
    volumes:
      - db:/var/lib/postgresql/data

volumes:
  db:

Done

Now I can listen and learn and LISTEN about the wonders of science, courtesy of Nautilus and Coqui! Some obvious improvements/extensions:

  • TTS speed variation. I did notice that some of the TTS outputs come through a little fast given the technicality of the content. It might be worth probing around for a speed parameter adjustment within Coqui.
  • TTS cloning. Throughout the project I had a rough idea to voice clone somebody active in the science communication space to really “align” content and delivery, perhaps Doctor Karl or Brian Cox. There are huge, topical and emerging ethical issues in the generative AI space and in the end I thought better of it. Don’t clone without permission.
  • Interview-style TTS. I noticed that a lot of the articles are actually interviews with various distinguished boffins. We have an abundance of "voices" available for selection, but this would require some reworking of the TTS pipeline (if "interview" in text > splice/interleave text? > assign voice ID > TTS each segment?), so it's skipped for now for simplicity's sake.
  • Properly normalize the DB schema. Lots to do here, a refactor would probably see distinct Issue, Article and Author entities articulated.
  • Full-text search. I love a good search, so ideally, I would also be able to search for articles about a thing, or for articles from a particular author or similar to a pre-existing thing. This could be accomplished by properly indexing article headlines, bylines, authors and text and making these fields searchable with something like Postgres full-text search (a rough sketch of what that query might look like follows this list).
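
To make that last idea a little more concrete, here's a minimal sketch assuming SQLModel's default article table name and searching headline and body together; a proper version would store and index a tsvector column rather than computing it per query:

from sqlalchemy import text
from sqlmodel import Session

SEARCH_SQL = text(
    """
    SELECT id, headline, author
    FROM article
    WHERE to_tsvector('english', coalesce(headline, '') || ' ' || coalesce(article, ''))
          @@ plainto_tsquery('english', :query)
    LIMIT 10
    """
)

def search_articles(query: str):
    # naive full-text search; a GIN index over a stored tsvector would make this fast
    # (engine is the same SQLModel/SQLAlchemy engine used by the API above)
    with Session(engine) as session:
        return session.execute(SEARCH_SQL, {"query": query}).all()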

Anyway, we’re here now, aren’t we? Check the repo here.

Banner art developed with stable diffusion. High-level technical details developed in collaboration with GPT-4.