What and why
Since ChatGPT came out, I gotta say that staying on top of current ML trends and best practices has become a bit of a chore. There are some fantastic open-source and paid initiatives for tracking emerging research, but to my mind they don’t cover the “applied engineering” aspects I’m more interested in. My fix has been to follow a diverse set of topics and respected figures in the field across various social media feeds and to use their opinions as a starting point for informing myself. This generally works well, as most of these feeds offer some kind of bookmarking function. Naturally, the bookmarks are siloed between platforms, and moving content between them is tedious, so I thought I’d try pulling these different pieces of content into one place. This post details how I:
- Fetch data from Twitter, Reddit, GitHub and LinkedIn
- Apply and evaluate an LLM-based relevancy classification for each piece of content, using OpenAI APIs and Argilla
- Store the results in a Notion database
Source content
Over time I’ve found that each of these platforms scratches different itches, with my general impressions for each being:
- Twitter. Academics spruiking new research, devs advertising new closed/open-source software, builders and tinkerers. Dank memes. Some hype depending on who you follow.
- Reddit. As above, but usually longer form opinions/pieces. Definitely not as current as Twitter, and I often find there is a lack of discussion about SOTA methods, erring on the side of beginner question/answer style threads. Generally still useful, plus I spend quite a bit of time here for other interests/reasons.
- GitHub. Strictly related to tools and code (duh). I mainly use GitHub to star interesting projects and repos. Fairly recently (2020? MSFT acquisition time?) I’ve noticed GitHub has been building out feed-like functionality, which allows users to follow one another’s activity, at which point it becomes considerably more useful for this style of discovery work.
- LinkedIn. Corporate/paid software and developments. A bit bleh, and if you spend too much time here it will rot your brain. In saying that, I do find it useful for major announcements from “mainstream” tech companies. And similarly to Reddit, I’m kind of already here to see what colleagues/friends are up to, so I may as well bookmark useful stuff for later.
Source Ingestion
- Zapier. Automation-as-a-service seemed like a good starting point to prototype with, so I investigated Zapier integrations and was disappointed to learn that (at the time of writing) only Twitter and Notion are supported, with a fairly strict cap on “Zaps” before prompting for payment. Bummer.
- Twitter. I configured a Twitter application, which generated a key, secret key, access token and access token secret. I used the tweepy client library to retrieve all liked tweets associated with my profile. Interestingly, my initial application was randomly rejected halfway through development, at about the same time Elon was tampering with the Twitter API business model.
- Reddit. Similarly, I configured a Reddit application, which generated a client ID and client secret, used in conjunction with my Reddit username, password and a user agent string. I used the praw library to retrieve all my saved comments/posts.
- GitHub. Here, I only needed to create a personal access token. I used raw requests to access the GitHub API instead of a dedicated Python client. A rough sketch of all three fetchers follows this list.
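For illustration, here’s a minimal sketch of what those three fetchers (get_liked_tweets, get_saved_posts, get_starred_repos, as used in the main script later) might look like. The environment variable names are assumptions, and the real functions return DataFrames with the full record schema (user, url, date_created, type, source_system, text, meta) rather than the trimmed-down dicts shown here.

import os
import requests
import praw
import tweepy

def get_liked_tweets():
    # Twitter API v2 client, using the app + user credentials described above
    client = tweepy.Client(
        consumer_key=os.environ["TWITTER_API_KEY"],
        consumer_secret=os.environ["TWITTER_API_SECRET"],
        access_token=os.environ["TWITTER_ACCESS_TOKEN"],
        access_token_secret=os.environ["TWITTER_ACCESS_TOKEN_SECRET"],
    )
    me = client.get_me()
    liked = client.get_liked_tweets(me.data.id, max_results=100)
    return [{"text": t.text, "source_system": "twitter"} for t in liked.data or []]

def get_saved_posts():
    # Script-type Reddit app; praw handles auth and pagination
    reddit = praw.Reddit(
        client_id=os.environ["REDDIT_CLIENT_ID"],
        client_secret=os.environ["REDDIT_CLIENT_SECRET"],
        username=os.environ["REDDIT_USERNAME"],
        password=os.environ["REDDIT_PASSWORD"],
        user_agent="bookmark-consolidator",
    )
    saved = reddit.user.me().saved(limit=None)  # mix of submissions and comments
    return [
        {"text": getattr(item, "title", None) or item.body, "source_system": "reddit"}
        for item in saved
    ]

def get_starred_repos():
    # Plain requests against the GitHub REST API, authenticated with a PAT
    headers = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}
    res = requests.get("https://api.github.com/user/starred", headers=headers)
    res.raise_for_status()
    return [
        {"text": repo.get("description") or repo["full_name"], "source_system": "github"}
        for repo in res.json()
    ]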
As it turns out, extracting the posts I’d reacted to from my own LinkedIn account was exceedingly difficult. The official LinkedIn API is generally geared towards business usage (boo), whilst the “consumer docs” are roundabout and generally unhelpful.
- linkedin-api. An unofficial Python library exists which goes a long way towards fulfilling standard use cases like profile retrieval, message sending, connection deletion etc., but it still lacked access to the reacts/likes associated with my profile. I decided to investigate some scraping options.
- Selenium. Specifically, the Python bindings for Selenium. Useful as a starting point, but I found the headful browser to be extremely slow compared to my regular browser and, importantly, slow compared to the drivers that ship with Playwright.
- Playwright. An alternative to Selenium, similarly featuring headless/headful browsing options intended for front-end testing/development. I found Playwright to be the more modern, more actively maintained option of the two.
The following code pulls in my LinkedIn credentials, logs in to my profile, navigates to my reacted posts, and then iterates through and parses my liked items. For each liked item, we then (annoyingly) have to “click” the “copy to clipboard” button associated with the item to retrieve its URL.
def parse_post(update_container):
    if text := get_post_description(update_container):
        return {
            "user": get_post_author(update_container),
            "url": get_post_url(update_container),
            "date_created": datetime.now(timezone.utc).strftime(
                "%Y-%m-%dT%H:%M:%S.%fZ"
            ),
            "type": "post",  # TODO: post taxonomy?
            "source_system": "linkedin",
            "text": text,
            "meta": {},
        }
    logger.warning("Unable to retrieve text for post")
    return None


def get_liked_posts():
    # Start Playwright with a headless Chromium browser
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=HEADLESS)
        page = browser.new_page()

        # Login to LinkedIn
        logger.info("Logging into LinkedIn..")
        page.goto("https://www.linkedin.com/login")
        page.fill("#username", os.environ["LINKEDIN_EMAIL"])
        page.fill("#password", os.environ["LINKEDIN_PASSWORD"])
        page.press("#password", "Enter")
        page.wait_for_selector("input.search-global-typeahead__input")

        # Go to reaction posts
        logger.info("Navigating to reaction page..")
        page.goto("https://www.linkedin.com/in/samhardyhey/recent-activity/reactions/")
        # Prevent against weird page failures?
        page.reload()

        # Scroll down a bit to load more posts
        page.evaluate("window.scrollBy(0, 10000)")
        time.sleep(N_WAIT_TIME)

        # Wait for any DM dialog to appear, then close it
        button_selector = "button.msg-overlay-bubble-header__control"
        button_elements = page.query_selector_all(button_selector)
        dm_button = button_elements[1]
        dm_button.click()

        # Get the update containers for each liked post
        logger.info("Retrieving update containers..")
        update_containers = page.query_selector_all(
            ".profile-creator-shared-feed-update__container"
        )
        update_containers = [c for c in update_containers if len(c.text_content()) > 2]

        # Parse each post and collect the results
        posts = []
        logger.info("Parsing update containers..")
        for update_container in update_containers:
            parsed_post = parse_post(update_container)
            posts.append(parsed_post)
        logger.info(f"LinkedIn: found {len(posts)} saved posts")

        # Clean up the browser
        browser.close()

        # Return the parsed posts as a Pandas DataFrame
        posts = [e for e in posts if e]  # filter None
        return pd.DataFrame(posts).drop_duplicates(subset=["user", "text"])
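The DOM-parsing helpers (get_post_author, get_post_description, get_post_url) are omitted above. As a rough illustration of the clipboard dance mentioned earlier, a hypothetical get_post_url could look something like the sketch below; the selectors, the accessible names and the extra page argument are assumptions rather than the repo’s actual code.

def get_post_url(page, update_container):
    # Hypothetical: open the post's overflow menu, click the "copy link" item,
    # then read the URL back off the browser clipboard.
    menu_button = update_container.query_selector("button[aria-label*='menu']")
    if not menu_button:
        return None
    menu_button.click()
    page.get_by_text("Copy link to post").first.click()
    # evaluate() awaits the returned promise; the browser context must be created
    # with clipboard permissions granted, e.g.:
    #   context = browser.new_context(permissions=["clipboard-read", "clipboard-write"])
    return page.evaluate("navigator.clipboard.readText()")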
Notion Storage
I opted to use a Notion database as the storage mechanism. To do so, I created an internal Notion integration and explicitly shared an existing database with that integration. Conveniently, an unofficial Python client for Notion exists which can be used to query databases, write pages etc. Given a post record, we first check whether it already exists within the Notion database, then format the record and write it to Notion:
def format_notion_database_record(record):
    notion_text_char_limit = 1800  # slightly less than 2000
    meta = json.dumps(record["meta"]) if record["meta"] != {} else "None"
    text = record["text"][:notion_text_char_limit]
    date_created = (
        record["date_created"].to_pydatetime().isoformat()
        if type(record["date_created"]) == pd.Timestamp
        else record["date_created"]
    )
    return {
        "id": {"title": [{"text": {"content": secrets.token_hex(4)}}]},
        "text": {"rich_text": [{"text": {"content": text}}]},
        "user": {"rich_text": [{"text": {"content": record["user"]}}]},
        "url": {"url": record["url"]},
        "date_created": {"date": {"start": date_created}},
        "type": {"select": {"name": record["type"]}},
        "source_system": {"select": {"name": record["source_system"]}},
        "meta": {"rich_text": [{"text": {"content": meta}}]},
        "is_tech_related": {"select": {"name": record["is_tech_related"]}},
    }


@retry(
    wait=wait_random_exponential(min=1, max=60),
    stop=stop_after_attempt(NOTION_MAX_RETRY),
)
def write_notion_page(new_database_record, database_id):
    res = notion_client.pages.create(
        parent={"database_id": database_id},
        properties=new_database_record,
    )
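The main script below also relies on a find_record_by_property helper for de-duplication, which isn’t shown in the repo excerpt above. A minimal sketch of what it might look like, assuming the same notion_client and database_id globals, and a standard Notion rich_text filter (the exact implementation may differ):

@retry(
    wait=wait_random_exponential(min=1, max=60),
    stop=stop_after_attempt(NOTION_MAX_RETRY),
)
def find_record_by_property(property_name, value):
    # Query the database for pages whose rich_text property exactly matches `value`;
    # returns a (possibly empty) list of matching pages.
    res = notion_client.databases.query(
        database_id,
        filter={"property": property_name, "rich_text": {"equals": value}},
    )
    return res.get("results", [])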
I found the Notion APIs to be... interesting, to say the least. Lots of nesting, lots of strings, weird integrity checks for some things but not others, etc. I’ve also wrapped all calls to Notion with a retry mechanism using the tenacity library, which I thought was pretty neat. Anyway, the main ingestion script looks like this:
if __name__ == "__main__":
    # 1. retrieve
    reddit_posts = get_saved_posts()
    twitter_posts = get_liked_tweets()
    github_repos = get_starred_repos()
    linkedin_posts = get_liked_posts()

    # 2. format
    all_records = pd.concat(
        [reddit_posts, twitter_posts, github_repos, linkedin_posts]
    ).to_dict(orient="records")
    logger.info(f"Found {len(all_records)} records to write to Notion")

    # 3. write
    for record in all_records:
        if find_record_by_property("text", record["text"]):
            logger.warning(
                f"Record **{record['text'][:200]}** already exists, skipping"
            )
            continue
        else:
            # 3.1 include a relevancy prediction
            time.sleep(API_THROTTLE)  # ~60 requests a minute
            truncated_input = " ".join(
                record["text"].split(" ")[:TOKEN_TRUNCATION]
            )  # input limits
            record["is_tech_related"] = chain.run({"text": truncated_input}).strip()
            # 3.2 format/write to notion
            new_database_record = format_notion_database_record(record)
            write_notion_page(new_database_record, database_id)
Here we first fetch content across platforms, format each piece of content into a record, and then write each record into the Notion database. You’ve probably also noticed that at step 3.1 we calculate a relevancy prediction; more on that below.
Classification
In its present form, the script lets us write bookmarked content to Notion, which is great! But now we have a new problem: no distinction is made between ML-related content and everything else I’ve liked, so we need a way to classify incoming content in order to filter it later. At this point I’d normally reach for some traditional supervised learning to classify the relevancy of incoming records, but I’d seen a couple of interesting uses of the OpenAI API for zero-shot classification, so we’ll try that instead.
- Langchain. In their words, a framework for developing applications powered by language models. To my mind, the value it offers is that, at a time when there is a lot of inertia and uncertainty about how LLM-based applications should be built, Langchain provides a convenient (if sometimes bloated) set of abstractions and opinions that can be used to expedite the building process. That, and they seem to have seized the first-mover advantage, are open-source and have strong community/developer support. The gist of what we want to do is define a prompt (which is just a fancy way of interpolating a string) that includes the text of the content we’re seeking to classify. Refreshingly straightforward, and kind of boring actually:
_PROMPT_TEMPLATE = """You are a subject matter expert specializing in computer science, programming and machine learning technologies.
You are to classify whether the following text:
{text}
Is likely to relate to computer science, programming or machine learning or not. Please provide one of two answers: tech, not_tech.
"""

prompt = PromptTemplate(input_variables=["text"], template=_PROMPT_TEMPLATE)
llm = OpenAI(
    model_name="text-davinci-003",
    temperature=0,
    openai_api_key=os.environ["OPEN_API_KEY"],
)
chain = LLMChain(llm=llm, prompt=prompt)
We can then invoke this chain in the main ingestion script, as shown above at step 3.1.
Evaluation. Ok mate, so you’re telling me I can skip the traditional, tedious process of supervised learning with an API call and still obtain competitive results? Maybe! We should test this. So I stood up an Argilla annotation instance on my local machine. I did this with the standard docker-compose config, being mindful to tweak the platform arg to target AMD64 (M1 Mac) and to add data volumes so datasets persist between uses. I then wrote some more scripts to pull the entire Notion database down as a dataframe (yuck):
@retry(
    wait=wait_random_exponential(min=1, max=NOTION_MAX_RETRY_TIME),
    stop=stop_after_attempt(NOTION_MAX_RETRY),
)
def notion_db_to_df(notion_client, database_id):
    # Create an empty list to hold all pages
    data = []
    # Initialize start_cursor as None to get the first page of results
    start_cursor = None
    while True:
        time.sleep(0.2)
        # Get a page of results
        response = notion_client.databases.query(database_id, start_cursor=start_cursor)
        results = response.get("results")
        # Convert the pages to records and add them to data
        for page in results:
            record = {
                prop_name: get_property_value(page, prop_name)
                for prop_name in page["properties"].keys()
            }
            data.append(record)
        if next_cursor := response.get("next_cursor"):
            # Set 'start_cursor' to 'next_cursor' to get the next page of results in the next iteration
            start_cursor = next_cursor
        else:
            break
    # Convert the data to a dataframe and return it
    return pd.DataFrame(data)


def get_property_value(page, property_name):
    # for a notion page/db record
    prop = page["properties"][property_name]
    if prop["type"] == "title":
        return prop["title"][0]["text"]["content"] if prop["title"] else None
    elif prop["type"] == "rich_text":
        return prop["rich_text"][0]["text"]["content"] if prop["rich_text"] else None
    elif prop["type"] == "number":
        return prop["number"]
    elif prop["type"] == "date":
        return prop["date"]["start"] if prop["date"] else None
    elif prop["type"] == "url":
        return prop["url"]
    elif prop["type"] == "email":
        return prop["email"]
    elif prop["type"] == "phone_number":
        return prop["phone_number"]
    elif prop["type"] == "select":
        return prop["select"]["name"] if prop["select"] else None
    elif prop["type"] == "multi_select":
        return (
            [option["name"] for option in prop["multi_select"]]
            if prop["multi_select"]
            else []
        )
    else:
        return None
We then upload this dataframe as a dataset within Argilla:
def format_metadata(record):
    meta_cols = set(record.keys()) - {"text", "vector"}
    return {k: v for k, v in record.to_dict().items() if k in meta_cols}


def format_classification_record(record):
    record = TextClassificationRecord(
        prediction=[(record.is_tech_related, 1.0)],
        text=record.text,
        multi_label=False,
        metadata=format_metadata(record),
    )
    return record


def log_notion_db_to_argilla():
    classification_records = (
        notion_db_to_df(notion_client, database_id)
        .pipe(lambda x: x[x.text.apply(lambda y: len(y.split(" ")) > 5)])
        .apply(format_classification_record, axis=1)
        .tolist()
    )
    logger.info(f"Logging {len(classification_records)} records to Argilla")
    dataset_rg = DatasetForTextClassification(classification_records)
    log(
        records=dataset_rg,
        name=DATASET_NAME,
        tags={"overview": "Verify zero-shot LLM classifications"},
        background=False,
        verbose=True,
    )
Here, I’ve specifically opted to use the OpenAI tech-relevancy predictions as a pre-annotation on each record. This essentially turns the annotation exercise into a validation exercise, reducing the amount of data entry we need to undertake. I ended up labelling ~260 records, mainly from Reddit and Twitter, in ~20 minutes, which I then pulled down from Argilla and ran through a scikit-learn classification report:
def evaluate_argilla_dataset():
    # after some annotations, load in dataset
    labelled = (
        load(DATASET_NAME)
        .to_pandas()
        .pipe(lambda x: x[~x.annotation.isna()])
        .assign(prediction=lambda x: x.prediction.apply(lambda y: y[0][0]))
    )
    cr = classification_report(
        labelled.annotation, labelled.prediction, output_dict=True
    )
    cr = pd.DataFrame(cr).T
    print(tabulate(cr, headers="keys", tablefmt="psql"))
The classification report looks like this:
+--------------+-------------+----------+------------+------------+
| | precision | recall | f1-score | support |
|--------------+-------------+----------+------------+------------|
| Not_tech | 0.894737 | 0.990291 | 0.940092 | 103 |
| Tech | 0.993464 | 0.926829 | 0.958991 | 164 |
| accuracy | 0.951311 | 0.951311 | 0.951311 | 0.951311 |
| macro avg | 0.9441 | 0.95856 | 0.949541 | 267 |
| weighted avg | 0.955378 | 0.951311 | 0.9517 | 267 |
+--------------+-------------+----------+------------+------------+
If you’re suspicious, I was too. But I manually inspected the results a few times, and I can confirm they are accurate. To recap: I was able to obtain near-perfect classification results using OpenAI’s text-davinci-003 model, in about 30 minutes. Yee haw.
Finito
So now I can read my silly little tweets and hype-machine posts from the convenience of Notion! Some future improvements/extensions:
- Containerisation + scheduling. Initially, I had in mind a Docker image that I could schedule regularly via a cloud function, or somewhere else that wasn’t just my local machine, but the LinkedIn scraping has created some complications. Specifically, the “copy to clipboard” function requires access to a host machine’s clipboard to copy to. For most Docker base images, this means installing an X11 server to support the X11 protocol that lets applications copy to the clipboard, which also creates an additional security risk. All very annoying when I only need to run this script periodically and in an ad hoc fashion. So I didn’t do that, but I could have.
- Supervised model. The other thing I found was that (quite understandably) OpenAI rate-limits its APIs at around 60 requests/minute or a per-model maximum token throughput, whichever is hit first. This grates quite a bit, as each record I’m trying to write to Notion requires its own separate API call/classification. I think this blog post by Matthew Honnibal contains some sensible advice here: LLMs are useful as a starting point for prototyping ideas, but if the task can be clarified/replicated with traditional supervised learning, then those traditional models benefit from speed (RE: rate limiting), control and extensibility. All good things we want our software to be. A rough sketch of such a replacement follows below.
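For what it’s worth, here is a minimal sketch of that kind of replacement model, assuming the labelled Argilla annotations have been pulled into a dataframe with text and annotation columns (as in evaluate_argilla_dataset above). TF-IDF plus logistic regression is just one obvious baseline choice, not something the repo currently implements.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def train_relevancy_baseline(labelled_df):
    # labelled_df is assumed to have 'text' and 'annotation' (tech/not_tech) columns
    train, test = train_test_split(
        labelled_df, test_size=0.2, stratify=labelled_df.annotation, random_state=42
    )
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),
        LogisticRegression(max_iter=1000),
    )
    model.fit(train.text, train.annotation)
    print(classification_report(test.annotation, model.predict(test.text)))
    return model

# e.g. in the main script: record["is_tech_related"] = model.predict([truncated_input])[0]

Once trained, a model like this runs locally with no rate limits, which is the whole point of swapping it in.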
Anyway, you can find the repo here if you want to scratch around some more.
Banner art developed with stable diffusion. High-level technical details developed in collaboration with GPT-4.