Property oracle

What and why

Like most Australians, I have acquired the property brain virus. It's difficult to say exactly when I became afflicted, but I now find myself doom-scrolling Domain and realestate.com.au listings, searching for something that might get me out of the truly desperate rental market.

Part of these efforts involves doing lots of research about all aspects of the buying process: should I use a broker? How about a buyer's agent? What about a deposit? Is my deposit too small? What about hidden and ongoing costs? All very important, boring, and essential questions.

An interesting side effect the property brain virus has had on the nation is that there is an insufferable number of podcasts covering all aspects of the market, of greater or lesser quality. I'm not quite at the level of enthusiasm required to actually listen to these podcasts, and as I've documented in other posts, I don't think I have the attention surplus to do so. So I thought I'd try to distil the content of these podcasts into a queryable, text-based form by:

  • Collating the audio from the most "reputable" property/first home buyer podcasts I could find
  • Transcribing the podcast audio
  • Chunking and ingesting the transcripts into a vector store
  • Affixing an LLM "head" on top of the vector store retrieval, which can be used to RAG (Retrieval Augmented Generation) relevant sections of these podcasts in response to silly questions I have

Technocratic solutions are the only solutions, watch me fly etc.

The podcasts

I originally turned to Reddit to gauge the hive mind's opinion on such matters. This initially returned The First Home Guidebook by Amy Lunardi, which is actually very sick and I would recommend listening to it in the traditional way, but we ideally want to corroborate wisdom across a few different sources. So I found some other podcasts:

Nice, so we have a list of podcasts. But we don't have the associated audio. A quick Google search yielded podcast downloader as a promising candidate library, capable of downloading one, many, or all episodes from a given RSS feed. I used getrssfeed.com to convert the Google Podcasts link for each show into the equivalent RSS feed.

Small problem though: when I attempted to scale the downloads to "every episode" ever published for each show, I could only manage to download the "latest" one, no matter how hard I huffed and tweaked my config. Bummer. So I wrote my own parser for the XML of the RSS feeds. We first retrieve the RSS feed contents as an XML string, which we parse into a dictionary:

import xml.etree.ElementTree as ET


def etree_to_dict(t):
    """Recursively convert an ElementTree element into a nested dictionary."""
    d = {t.tag: {} if t.attrib else None}
    children = list(t)
    if children:
        dd = {}
        for dc in map(etree_to_dict, children):
            for k, v in dc.items():
                try:
                    dd[k].append(v)
                except KeyError:
                    dd[k] = [v]
        d = {t.tag: {k: v[0] if len(v) == 1 else v for k, v in dd.items()}}
    if t.attrib:
        # prefix attribute keys so they don't collide with child tags
        d[t.tag].update((f"@{k}", v) for k, v in t.attrib.items())
    if t.text:
        text = t.text.strip()
        if children or t.attrib:
            if text:
                d[t.tag]["#text"] = text
        else:
            d[t.tag] = text
    return d


def parse_xml_objects(xml_string):
    """Parse an RSS feed (XML string) into a nested dictionary."""
    root = ET.ElementTree(ET.fromstring(xml_string)).getroot()
    return etree_to_dict(root)

We can then simply retrieve the relevant fields and download the associated podcast audio:

import pandas as pd
import requests


def format_xml_objects(xml_dict):
    """Flatten the parsed RSS feed into a dataframe of episode metadata."""
    return (
        pd.DataFrame(xml_dict["rss"]["channel"]["item"])
        .pipe(lambda x: x[["title", "pubDate", "enclosure", "link", "description"]])
        .assign(enclosure=lambda x: x.enclosure.apply(lambda y: y["@url"]))
        .rename(columns={"enclosure": "url"})
    )


def download_file(url, title, directory):
    """Download a single episode's audio to <directory>/<title>.mp3."""
    try:
        response = requests.get(url)
        response.raise_for_status()
        file_path = directory / f"{title}.mp3"
        with open(str(file_path), "wb") as f:
            f.write(response.content)
        logger.info(f"Successfully downloaded {url}")
    except Exception:
        logger.warning(f"Error downloading {url}")

The nice thing about this approach is that I was also able to extract a series of metadata records associated with each piece of audio, and not just the audio itself. Tying the whole thing together looks like this:

from concurrent.futures import ThreadPoolExecutor


def retrieve_podcasts():
    # create output dirs
    if not META_DIR.exists():
        META_DIR.mkdir(parents=True, exist_ok=True)
    if not PODCAST_DIR.exists():
        PODCAST_DIR.mkdir(parents=True, exist_ok=True)

    for podcast in PODCAST_META:
        # fetch/format metadata
        logger.info(f"Retrieving XML feed for: {podcast['name']}")
        res = requests.get(podcast["rss_link"])
        xml_dict = parse_xml_objects(res.text)
        meta_df = format_xml_objects(xml_dict).assign(
            title=lambda x: x.title.apply(snake_case_string)
        )
        meta_file_save = META_DIR / f"{snake_case_string(podcast['name'])}.csv"
        logger.info(
            f"Saving meta data to: {meta_file_save}, found {len(meta_df)} episodes"
        )
        meta_df.to_csv(meta_file_save, index=False)

        audio_download_dir = PODCAST_DIR / snake_case_string(podcast["name"])
        if not audio_download_dir.exists():
            audio_download_dir.mkdir(parents=True, exist_ok=True)

        # parallelize the downloads
        with ThreadPoolExecutor(max_workers=NUM_CORES) as executor:
            tasks = [
                (record.url, record.title, audio_download_dir)
                for idx, record in meta_df.iterrows()
            ]
            list(executor.map(lambda params: download_file(*params), tasks))

So now we have 457 podcasts, totalling ~236 hours of audio, with a mean/median podcast length of 31/33 minutes. Quite the haul.
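The download code above doesn't compute those duration figures; one rough way to get them from the downloaded mp3s (a sketch, assuming the mutagen library is installed):

from statistics import mean, median

from mutagen.mp3 import MP3

# Rough duration stats over the downloaded audio (PODCAST_DIR as above)
durations_min = [MP3(str(p)).info.length / 60 for p in PODCAST_DIR.rglob("*/*.mp3")]
print(
    f"{len(durations_min)} episodes, {sum(durations_min) / 60:.0f} hours total, "
    f"mean {mean(durations_min):.0f} min, median {median(durations_min):.0f} min"
)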

Transcription

So now we need to transcribe the damn things. In a previous series of posts, I benchmarked several (then) SOTA ASR models. Fast forward less than a year, and most of those models have been superseded by OpenAI's Whisper, lol. So Whisper it is.

The big draw for me is that Whisper incorporates punctuation out of the box, solving the "wall of text" problem (and its downstream consequences) often encountered with ASR systems. Specifically, I've been playing with the CTranslate2 variation of Whisper (faster-whisper), which provides some additional options around weight precision (int8, float16, etc.) and parallelisation (number of workers).
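The transcription code below references MODEL_SIZE, NUM_WORKERS and COMPUTE_TYPE, which live in the repo's config; the values here are illustrative rather than the exact ones used:

# Illustrative faster-whisper settings (the repo's actual values may differ)
MODEL_SIZE = "large-v2"   # e.g. "base", "medium", "large-v2"
COMPUTE_TYPE = "float16"  # or "int8" / "int8_float16" to trade accuracy for memory
NUM_WORKERS = 2           # parallel transcription workers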

Since the application I have in mind is scoped as a point-in-time exercise, I'll take a few shortcuts with the transcription by:

  • Firing up a temporary runpod instance, which we'll use to accelerate the transcription process (GPU vroom)
  • Cloning the repo directly into the instance (via VSCode remote attachment)
  • Pulling the audio onto the runpod instance (see the above code)
  • Transcribing the audio, using the following code:
import torch
import pandas as pd
from pathlib import Path
from faster_whisper import WhisperModel

# Use the GPU if the runpod instance exposes one, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = WhisperModel(
    MODEL_SIZE, device=device, num_workers=NUM_WORKERS, compute_type=COMPUTE_TYPE
)


def transcribe_audio(audio_file):
    """Transcribe a single audio file into a dataframe of timestamped segments."""
    audio_file = Path(audio_file)
    segments, info = model.transcribe(str(audio_file), beam_size=5)
    transcript = [
        {"start": segment.start, "end": segment.end, "text": segment.text}
        for segment in segments
    ]
    return pd.DataFrame(transcript)

def transcribe_podcasts():
    for podcast in list(PODCAST_DIR.rglob("*/*.mp3")):
        start = time.time()
        file_name = snake_case_string(podcast.name.replace(".mp3", ".csv"))
        save_name = TRANSCRIPT_DIR / podcast.parent.name / file_name

        if not save_name.parent.exists():
            save_name.parent.mkdir(parents=True, exist_ok=True)

        if save_name.exists():
            continue

        try:
            transcription = transcribe_audio(podcast)
            transcription.to_csv(save_name, index=False)
            end = time.time()
            logger.info(f"Transcribed {podcast.name} in {end - start} seconds")
        except Exception:
            logger.error(f"Could not transcribe: {podcast.name}")

Finally, all artefacts get pushed into an S3 bucket. We now have a transcript for each podcast episode, which looks like this:

[Image: sample transcript dataframe with start, end, and text columns]
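The S3 push itself isn't shown above; a minimal sketch with boto3 (bucket name and key layout are placeholders):

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket; the real value lives in the repo's config
BUCKET = "property-oracle-artefacts"

# Push the transcripts and episode metadata produced above
for artefact_dir, prefix in [(TRANSCRIPT_DIR, "transcripts"), (META_DIR, "meta")]:
    for path in artefact_dir.rglob("*.csv"):
        s3.upload_file(str(path), BUCKET, f"{prefix}/{path.parent.name}/{path.name}")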

Vector Store and Ingestion

Following along with the Langchain docs, we'll instantiate the embedding and LLM components with minimal parameters:

embedding_config = OpenAIEmbeddings(openai_api_key=os.environ["OPEN_API_KEY"])
llm = OpenAI(openai_api_key=os.environ["OPEN_API_KEY"])

Which interestingly gives us text-embedding-ada-002 as our embedding model and a separate model, text-davinci-003, as our LLM "head". So what are these models, and why are they different from one another? Can't we just use a single model for both the embedding and LLM parts of our application?

Well, according to OpenAI, text-embedding-ada-002 consolidates performance across text search, code search, sentence similarity and text classification use cases whilst being substantially cheaper than its predecessor.

On the other hand, text-davinci-003 is OpenAI's "most powerful" out-of-the-box model available within the API suite. It's also significantly more expensive than the Ada series we'll be using to embed chunks of the transcripts. These differences make sense when we consider that the two models fulfil different functions within the broader application:

  • Embedding and retrieving transcript chunks. A volume-intensive exercise with a fairly generous margin for error, since the embeddings are only used to retrieve the top X relevant chunks as an intermediate output. Given these relaxed constraints, it's helpful to use a model optimised for generating cheap embeddings, such as text-embedding-ada-002.
  • Transforming and contextualising relevant search results. By contrast, once we retrieve some candidate search results, we want to use them as the basis of text generation that responds to the original prompt/question/input. This higher-level transformation requires a model optimised for "reasoning" tasks rather than just spitting out embeddings; text-davinci-003 fills this need. A quick illustration of these two roles follows below.
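To make the division of labour concrete, here's how the two components get called directly (a sketch; the prompt strings are made up, and 1536 is ada-002's documented embedding size):

# Embedding model: text -> vector, used purely for similarity search
query_vector = embedding_config.embed_query("How much deposit do I need?")
print(len(query_vector))  # ada-002 returns a 1536-dimensional vector

# LLM: prompt -> generated text, used to compose the final answer
print(llm("In one sentence, what does a buyer's agent do?"))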

Anyway, as far as embedding/ingesting our transcripts into a vector store goes, Langchain provides wrappers for lots of different stores. Here, we'll instantiate a Chroma instance. Chroma is open-source and allows us to store the index on disk locally, which is useful for quick prototyping:

from langchain.vectorstores import Chroma

# Persist the index to a local "db" directory for quick prototyping
vectordb_persist_dir = DATA_DIR.parents[0] / "db"
if not vectordb_persist_dir.exists():
    vectordb_persist_dir.mkdir(parents=True, exist_ok=True)
vectordb = Chroma(
    embedding_function=embedding_config, persist_directory=str(vectordb_persist_dir)
)

We'll also need to format and chunk each transcript dataframe. We need to chunk the inputs because the text-embedding-ada-002 model can only embed pieces of text with a max length of 8191 tokens, and each transcript exceeds this limit. We'll use a RecursiveCharacterTextSplitter in conjunction with a DataFrameLoader to chunk the transcripts:

from langchain.document_loaders import DataFrameLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter


def ingest_transcript_df_chromadb(transcript_df, transcript_col):
    """Chunk a transcript dataframe and add the chunks to the Chroma index."""
    documents = DataFrameLoader(
        transcript_df, page_content_column=transcript_col
    ).load()
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP
    )
    texts = text_splitter.split_documents(documents)
    logger.info(
        f"Supplied {len(transcript_df)} records, ingesting as {len(texts)} documents"
    )
    vectordb.add_documents(texts)
    vectordb.persist()
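As an aside, the 8191-token ceiling mentioned above can be checked exactly with tiktoken, rather than the rough whitespace split used for cost estimation further down (a sketch, assuming tiktoken is installed):

import tiktoken

# ada-002 uses the cl100k_base encoding
enc = tiktoken.encoding_for_model("text-embedding-ada-002")


def count_tokens(text: str) -> int:
    """Exact token count as the embedding endpoint would see it."""
    return len(enc.encode(text))


print(count_tokens("Should I use a mortgage broker?"))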

My initial thought was that we should try to chunk the inputs based on some meaningful unit; perhaps each per-speaker utterance or silence-delimited section? But I came across a school of thought suggesting that, practically, how the inputs are chunked doesn't matter that much: unless document attributes like page, paragraph, etc. are crucial for your application, you can just slide a chunking window over the inputs. So we'll do that in the interest of time. We'll also apply some minimal preprocessing to the transcripts (merging adjacent utterances) before ingesting the rest of them:

def merge_adjacent_utterances(df):
    dff = (
        df.assign(start_shift=lambda x: x.start.shift(-1))
        .assign(delta=lambda x: x.start_shift - x.end)
        .assign(group=lambda x: (x.delta > MERGE_THRESHOLD).cumsum())
        .groupby("group")
        .agg({"start": "min", "end": "max", "text": "".join})
        .reset_index(drop=True)
    )
    logger.info(f"Reducing {len(df)} records to {len(dff)}")
    return dff


def estimate_cost_of_ingest(transcript_df):
    n_tokens = len([e for e in "".join(transcript_df.text.tolist()).split(" ") if e])
    logger.info(f"{n_tokens} tokens found in transcript")
    cost_estimate = (n_tokens / 1000) * ADA_COST_PER_1000_TOKENS
    logger.info(f"Estimate ingestion cost: US ${cost_estimate}")


def ingest_transcripts():
    for transcript in list(TRANSCRIPT_DIR.rglob("*/*.csv")):
        logger.info(f"Ingesting transcript: {transcript.name}")
        df = (
            pd.read_csv(transcript)
            .assign(start=lambda x: round(x.start, 2))
            .assign(end=lambda x: round(x.end, 2))
            .sort_values(["start", "end"])
        )
        transcript_df = (
            merge_adjacent_utterances(df)
            .assign(podcast=transcript.parent.name)
            .assign(episode=transcript.stem)
        )
        estimate_cost_of_ingest(transcript_df)
        chroma.ingest_transcript_df_chromadb(
            transcript_df, transcript_col=TRANSCRIPT_COL
        )

Langchain and LLM Outputs

The nice thing about the vector store + LLM architecture is that we can flexibly use the vector store in different ways. We could run a basic question/answer style pipeline over the query results, or we could summarise the results of a query, all without changing the ingestion method. I experimented with a few "chains" and realised that I often found myself asking clarifying questions as I interacted with the content, since I'm essentially trying to teach myself how property works. So I opted to use a ConversationalRetrievalChain, which essentially injects a memory component containing the chat history into a QA model.

Great, so this is what I want the application to do, but I have a question. We're issuing the same kind of API request to OpenAI and are bound by the same input limit for the final LLM call (4097 tokens for text-davinci-003) regardless of which Langchain "chain" we use. So what is the cost of including this extra input (the conversation history), and is there a trade-off against the vector store query results? Are we including fewer of them? Are we progressively including fewer query results as the conversation "chain" grows?

Some clever design work from LangChain here: they first condense the current chat history + latest question into a single standalone question, which is then embedded and used for retrieval exactly as in the standalone QA use-case. So regardless of how small or large the chat history is, it has no bearing on how many intermediate search results are fed as context into the final LLM API call.
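For the curious, the default "condense" prompt that drives that first step ships with Langchain (at least in the version current at the time of writing), and printing it is an easy way to see what happens under the hood:

from langchain.chains.conversational_retrieval.prompts import CONDENSE_QUESTION_PROMPT

# The prompt used to rewrite (chat history + follow-up) into a standalone question
print(CONDENSE_QUESTION_PROMPT.template)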

Anyway, we'll use the ConversationalRetrievalChain, and wrap the whole thing in a small typer app:

import os

import typer
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

# example_questions: a list of canned prompts defined elsewhere in the repo
example_questions_joined = "\n* ".join(example_questions)
example_questions_joined = f"\n* {example_questions_joined}"


def app_intro():
    typer.echo(
        typer.style(
            (
                f"\n\nWelcome to Property Oracle! Ask me anything about Australian Property. "
                f"\n\nHere are some example questions you can ask:\n{example_questions_joined}"
                f"\n\nType 'exit' to exit or 'new' to clear the conversation history."
            ),
            fg=typer.colors.BRIGHT_YELLOW,
        )
    )


def main():
    app_intro()
    memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
    qa = ConversationalRetrievalChain.from_llm(
        llm, vectordb.as_retriever(), memory=memory
    )
    while True:
        question = typer.prompt(typer.style("\nYou ", fg=typer.colors.BRIGHT_GREEN))

        if question.lower() == "exit":
            typer.echo(
                typer.style("Property Oracle: Goodbye!", fg=typer.colors.BRIGHT_YELLOW)
            )
            break
        elif question.lower() == "new":
            # flush memory, clear terminal
            typer.echo(
                typer.style(
                    "Property Oracle: Clearing chat history..",
                    fg=typer.colors.BRIGHT_YELLOW,
                )
            )
            memory.clear()
            os.system("clear")
            app_intro()
            continue  # restart the while loop

        result = qa({"question": question})
        typer.echo(
            typer.style(
                f"\nProperty Oracle:{result['answer']}", fg=typer.colors.BRIGHT_YELLOW
            )
        )


if __name__ == "__main__":
    typer.run(main)

Now to take it for a test chat:

[Image: terminal screenshot of a test chat with Property Oracle]

Not bad for what is essentially a slightly customised LangChain demo. Some initial impressions:

  • Continuity and context switching. I was pleasantly surprised by the agent's ability to flex and move across topics. I had originally anticipated some "confusion" in this respect and built in a memory flush function, though in hindsight perhaps this is not required.
  • Brevity of response. I'm on the fence about this; on one hand, it allows you to quickly slice through the conversation, but on the other, I know there is definitely more preliminary material being returned as part of the query. I suppose I'm also biased toward ChatGPT-like experiences, where similar questions would fetch a few paragraphs as a response.
  • Debugging. The previous point raises an interesting question about debugging these types of "chains"; two thoughts on this. One is that LLM-based applications introduce new failure modes and are generally less transparent than traditional software systems. The other is that this lack of transparency is something Langchain seeks to address via tracing. I was also able to inspect the intermediate retrieval outputs to sanity-check the results for some of the questions. This generalisation/specificity trade-off in a software context strikes me as an essential dilemma of LLM-based applications. Lots of movement in this space, like this library from Microsoft.
  • Determinism. Another thing I noticed during development was that final LLM outputs are generally not deterministic, despite controlling for the input question and the intermediate retrieval. An example below:
"Should I use mortgage broker? And if so, why?"

Yes, you should use a mortgage broker because they provide a long-term relationship which can be beneficial as your life plan changes and they can provide advice on improving borrowing power and minimizing expenses. Additionally, they are usually paid by the banks, so there are usually no fees to see them.

Yes, you should use a mortgage broker because they can provide valuable advice and guidance throughout the process. They can help you figure out what you can afford and what type of assets are good investments, as well as coach you through improving your borrowing power and minimizing expenses. Additionally, most mortgage brokers do not charge any fees and are remunerated by the banks.

Yes, you should use a mortgage broker because they can provide you with valuable advice and guidance on how to improve your borrowing power, minimize expenses, and save money. Additionally, mortgage brokers are typically paid by the bank, so there should be no extra cost to you.

Though this indeterminism just seems to be par for the course when creating LLM applications. Not exactly a great outcome ¯\_(ツ)_/¯
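If reproducibility mattered more, the obvious knob is the sampling temperature on the LLM side, which the setup above leaves at Langchain's default (a sketch, not something the current code does):

# temperature=0 makes generation (close to) deterministic for a fixed prompt;
# the retrieval step over the vector store was already stable across runs
llm = OpenAI(openai_api_key=os.environ["OPEN_API_KEY"], temperature=0)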

Wrapping up

That's all for now. Some extensions to the existing work:

  • Open-source embedding and LLM components. Langchain provides a rich collection of interfaces for vector stores and LLM implementations, the OpenAI-based examples being the obvious starting points. The sentence-transformers library could be used to replace the embedding mechanism (a minimal sketch of this swap follows the list), whilst llama-cpp and models hosted in the huggingface repo format are explicitly supported within Langchain.
  • Hosted variations. The current work runs locally only, and perhaps it would be nice to share a resource like this with other jaded millennials seeking entry into the property market. I'll probably create a follow-up entry where we take the existing work and host it via a simple web application.
  • Add metadata and reference handling. During the initial evaluation, I found it difficult to discern what was based upon actual sources and what was an LLM hallucination. One way of guarding against this is to ensure reference sources are returned with outputs. A similar extension would be to expose specific bits of metadata (podcast, participants, topic tags) for use within the query, a method Langchain terms "self-querying".
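On that first extension, swapping in an open-source embedding model is close to a one-liner, since Chroma accepts any Langchain-compatible embedding function. A sketch using sentence-transformers via Langchain's HuggingFaceEmbeddings wrapper (untested against the rest of the pipeline):

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

# A small, free, locally-run sentence-transformers model in place of ada-002
local_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
vectordb = Chroma(
    embedding_function=local_embeddings, persist_directory=str(vectordb_persist_dir)
)

Note the existing index would need to be re-ingested, since the embedding dimensions differ from ada-002's.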

You can suss the repo here.

Banner art developed with stable diffusion. High-level technical details developed in collaboration with GPT-4.