What and why
Since ChatGPT came out, I gotta say that staying on top of current ML trends and best practices has become a bit of a chore. There are some fantastic open-source and paid initiatives for tracking emerging research, but to my mind they don’t cover the “applied engineering” aspects I’m more interested in. My fix has been to follow a diverse set of topics and respected figures in the field across various social media feeds and to use their opinions as a starting point for informing myself. This generally works well, as most of these feeds offer some kind of bookmarking function. Naturally, the bookmarks are siloed between platforms, and moving content between them is tedious, so I thought I’d try pulling these different pieces of content into one place. This post details how I:
- Fetch data from Twitter, Reddit, GitHub and LinkedIn
- Apply and evaluate an LLM-based relevancy classification for each piece of content, using OpenAI APIs and Argilla
- Store the results in a Notion database
Source content
Over time I’ve found that each of these platforms scratches different itches, with my general impressions for each being:
- Twitter. Academics spruiking new research, devs advertising new closed/open-source software, builders and tinkerers. Dank memes. Some hype depending on who you follow.
- Reddit. As above, but usually longer form opinions/pieces. Definitely not as current as Twitter, and I often find there is a lack of discussion about SOTA methods, erring on the side of beginner question/answer style threads. Generally still useful, plus I spend quite a bit of time here for other interests/reasons.
- GitHub. Strictly related to tools and code (duh). I mainly use GitHub to star interesting projects and repos. Fairly recently (2020? MSFT acquisition time?) I’ve noticed GitHub has been building out feed-like functionality, which allows users to follow one another’s activity, at which point it becomes considerably more useful for this style of discovery work.
- LinkedIn. Corporate/paid software and developments. A bit bleh, and if you spend too much time here it will rot your brain. In saying that, I do find it useful for major announcements from “mainstream” tech companies. And similarly to Reddit, I’m kind of already here to see what colleagues/friends are up to, so I may as well bookmark useful stuff for later.
Source Ingestion
- Zapier. Automation-as-a-service seemed like a good starting point to prototype with, so I investigated Zapier integrations and was disappointed to learn that (at the time of writing) only Twitter and Notion are supported, with a fairly strict cap on “Zaps” before prompting for payment. Bummer.
- Twitter. I configured a Twitter application, which generated a key, secret key, access token and access token secret. I used the tweepy client library to retrieve all liked tweets associated with my profile. Interestingly, my initial application was randomly rejected halfway through development, at about the same time Elon was tampering with the Twitter API business model.
- Reddit. Similarly, I configured a Reddit application, which generated a client ID and client secret, used in conjunction with my Reddit username, password and a user agent string. I used the praw library to retrieve all my saved comments/posts.
- GitHub. Here, I only needed to create a personal access token. I used raw requests to access the GitHub API instead of a dedicated Python client. A rough sketch of all three fetchers follows this list.
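For illustration, here’s a minimal sketch of what those three fetchers (get_liked_tweets, get_saved_posts, get_starred_repos, as used in the main script later) might look like. The environment variable names are assumptions, and the real functions return DataFrames with the full record schema (user, url, date_created, type, source_system, text, meta) rather than the trimmed-down dicts shown here.

import os
import requests
import praw
import tweepy

def get_liked_tweets():
    # Twitter API v2 client, using the app + user credentials described above
    client = tweepy.Client(
        consumer_key=os.environ["TWITTER_API_KEY"],
        consumer_secret=os.environ["TWITTER_API_SECRET"],
        access_token=os.environ["TWITTER_ACCESS_TOKEN"],
        access_token_secret=os.environ["TWITTER_ACCESS_TOKEN_SECRET"],
    )
    me = client.get_me()
    liked = client.get_liked_tweets(me.data.id, max_results=100)
    return [{"text": t.text, "source_system": "twitter"} for t in liked.data or []]

def get_saved_posts():
    # Script-type Reddit app; praw handles auth and pagination
    reddit = praw.Reddit(
        client_id=os.environ["REDDIT_CLIENT_ID"],
        client_secret=os.environ["REDDIT_CLIENT_SECRET"],
        username=os.environ["REDDIT_USERNAME"],
        password=os.environ["REDDIT_PASSWORD"],
        user_agent="bookmark-consolidator",
    )
    saved = reddit.user.me().saved(limit=None)  # mix of submissions and comments
    return [
        {"text": getattr(item, "title", None) or item.body, "source_system": "reddit"}
        for item in saved
    ]

def get_starred_repos():
    # Plain requests against the GitHub REST API, authenticated with a PAT
    headers = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}
    res = requests.get("https://api.github.com/user/starred", headers=headers)
    res.raise_for_status()
    return [
        {"text": repo.get("description") or repo["full_name"], "source_system": "github"}
        for repo in res.json()
    ]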
As it turns out, extracting the posts I’d reacted to from my own LinkedIn account was exceedingly difficult. The official LinkedIn API is generally geared towards business usage (boo), whilst the “consumer docs” are roundabout and generally unhelpful.
- linkedin-api. An unofficial Python library exists which goes a long way towards fulfilling standard use cases like profile retrieval, message sending, connection deletion etc., but it still lacked access to the reacts/likes associated with my profile. I decided to investigate some scraping options.
- Selenium. Specifically, the Python bindings for Selenium. Useful as a starting point, but I found the headful browser to be extremely slow compared to my regular browser and, importantly, slow compared to the drivers that ship with Playwright.
- Playwright. An alternative to Selenium, similarly featuring headless/headful browsing options intended for front-end testing/development. I found Playwright to be the more modern, more actively maintained option of the two.
The following code pulls in my LinkedIn credentials, logs in to my profile, navigates to my reacted posts, and then iterates through and parses my liked items. For each liked item, we then (annoyingly) have to “click” the “copy to clipboard” button associated with the item to retrieve its URL.
def parse_post(update_container):
    if text := get_post_description(update_container):
        return {
            "user": get_post_author(update_container),
            "url": get_post_url(update_container),
            "date_created": datetime.now(timezone.utc).strftime(
                "%Y-%m-%dT%H:%M:%S.%fZ"
            ),
            "type": "post",  # TODO: post taxonomy?
            "source_system": "linkedin",
            "text": text,
            "meta": {},
        }
    logger.warning("Unable to retrieve text for post")
    return None


def get_liked_posts():
    # Start Playwright with a headless Chromium browser
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=HEADLESS)
        page = browser.new_page()

        # Login to LinkedIn
        logger.info("Logging into LinkedIn..")
        page.goto("https://www.linkedin.com/login")
        page.fill("#username", os.environ["LINKEDIN_EMAIL"])
        page.fill("#password", os.environ["LINKEDIN_PASSWORD"])
        page.press("#password", "Enter")
        page.wait_for_selector("input.search-global-typeahead__input")

        # Go to reaction posts
        logger.info("Navigating to reaction page..")
        page.goto("https://www.linkedin.com/in/samhardyhey/recent-activity/reactions/")
        # Prevent against weird page failures?
        page.reload()

        # Scroll down a bit to load more posts
        page.evaluate("window.scrollBy(0, 10000)")
        time.sleep(N_WAIT_TIME)

        # Wait for any DM dialog to appear, then close it
        button_selector = "button.msg-overlay-bubble-header__control"
        button_elements = page.query_selector_all(button_selector)
        dm_button = button_elements[1]
        dm_button.click()

        # Get the update containers for each liked post
        logger.info("Retrieving update containers..")
        update_containers = page.query_selector_all(
            ".profile-creator-shared-feed-update__container"
        )
        update_containers = [c for c in update_containers if len(c.text_content()) > 2]

        # Parse each post and collect the results
        posts = []
        logger.info("Parsing update containers..")
        for update_container in update_containers:
            parsed_post = parse_post(update_container)
            posts.append(parsed_post)
        logger.info(f"LinkedIn: found {len(posts)} saved posts")

        # Clean up the browser
        browser.close()

        # Return the parsed posts as a Pandas DataFrame
        posts = [e for e in posts if e]  # filter None
        return pd.DataFrame(posts).drop_duplicates(subset=["user", "text"])
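The DOM-parsing helpers (get_post_author, get_post_description, get_post_url) are omitted above. As a rough illustration of the clipboard dance mentioned earlier, a hypothetical get_post_url could look something like the sketch below; the selectors, the accessible names and the extra page argument are assumptions rather than the repo’s actual code.

def get_post_url(page, update_container):
    # Hypothetical: open the post's overflow menu, click the "copy link" item,
    # then read the URL back off the browser clipboard.
    menu_button = update_container.query_selector("button[aria-label*='menu']")
    if not menu_button:
        return None
    menu_button.click()
    page.get_by_text("Copy link to post").first.click()
    # evaluate() awaits the returned promise; the browser context must be created
    # with clipboard permissions granted, e.g.:
    #   context = browser.new_context(permissions=["clipboard-read", "clipboard-write"])
    return page.evaluate("navigator.clipboard.readText()")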
Notion Storage
I opted to use a Notion database as the storage mechanism. To do so, I created an internal Notion integration and explicitly shared an existing database with that integration. Conveniently, an unofficial Python client for Notion exists which can be used to query databases, write pages etc. Given a post record, we first check whether it already exists within the Notion database, then format the record and write it to Notion:
def format_notion_database_record(record):
    notion_text_char_limit = 1800  # slightly less than 2000
    meta = json.dumps(record["meta"]) if record["meta"] != {} else "None"
    text = record["text"][:notion_text_char_limit]
    date_created = (
        record["date_created"].to_pydatetime().isoformat()
        if type(record["date_created"]) == pd.Timestamp
        else record["date_created"]
    )
    return {
        "id": {"title": [{"text": {"content": secrets.token_hex(4)}}]},
        "text": {"rich_text": [{"text": {"content": text}}]},
        "user": {"rich_text": [{"text": {"content": record["user"]}}]},
        "url": {"url": record["url"]},
        "date_created": {"date": {"start": date_created}},
        "type": {"select": {"name": record["type"]}},
        "source_system": {"select": {"name": record["source_system"]}},
        "meta": {"rich_text": [{"text": {"content": meta}}]},
        "is_tech_related": {"select": {"name": record["is_tech_related"]}},
    }


@retry(
    wait=wait_random_exponential(min=1, max=60),
    stop=stop_after_attempt(NOTION_MAX_RETRY),
)
def write_notion_page(new_database_record, database_id):
    res = notion_client.pages.create(
        parent={"database_id": database_id},
        properties=new_database_record,
    )
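The main script below also relies on a find_record_by_property helper for de-duplication, which isn’t shown in the repo excerpt above. A minimal sketch of what it might look like, assuming the same notion_client and database_id globals, and a standard Notion rich_text filter (the exact implementation may differ):

@retry(
    wait=wait_random_exponential(min=1, max=60),
    stop=stop_after_attempt(NOTION_MAX_RETRY),
)
def find_record_by_property(property_name, value):
    # Query the database for pages whose rich_text property exactly matches `value`;
    # returns a (possibly empty) list of matching pages.
    res = notion_client.databases.query(
        database_id,
        filter={"property": property_name, "rich_text": {"equals": value}},
    )
    return res.get("results", [])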
I found the Notion APIs to be... interesting, to say the least. Lots of nesting, lots of strings, weird integrity checks for some things but not others, etc. I’ve also wrapped all calls to Notion with a retry mechanism using the tenacity library, which I thought was pretty neat. Anyway, the main ingestion script looks like this:
if __name__ == "__main__":
    # 1. retrieve
    reddit_posts = get_saved_posts()
    twitter_posts = get_liked_tweets()
    github_repos = get_starred_repos()
    linkedin_posts = get_liked_posts()

    # 2. format
    all_records = pd.concat(
        [reddit_posts, twitter_posts, github_repos, linkedin_posts]
    ).to_dict(orient="records")
    logger.info(f"Found {len(all_records)} records to write to Notion")

    # 3. write
    for record in all_records:
        if find_record_by_property("text", record["text"]):
            logger.warning(
                f"Record **{record['text'][:200]}** already exists, skipping"
            )
            continue
        else:
            # 3.1 include a relevancy prediction
            time.sleep(API_THROTTLE)  # ~60 requests a minute
            truncated_input = " ".join(
                record["text"].split(" ")[:TOKEN_TRUNCATION]
            )  # input limits
            record["is_tech_related"] = chain.run({"text": truncated_input}).strip()
            # 3.2 format/write to notion
            new_database_record = format_notion_database_record(record)
            write_notion_page(new_database_record, database_id)
Here we first fetch content across platforms, format each piece of content into a record, and then write each record into the Notion database. You’ve probably also noticed that at step 3.1 we calculate a relevancy prediction; more on that below.
Classification
In its present form, the script lets us write bookmarked content to Notion, which is great! But now we have a new problem: no distinction is made between ML-related content and everything else I’ve liked, so we need a way to classify incoming content in order to filter it later. At this point I’d normally reach for some traditional supervised learning to classify the relevancy of incoming records, but I’d seen a couple of interesting uses of the OpenAI API for zero-shot classification, so we’ll try that instead.
- Langchain. In their words, a framework for developing applications powered by language models. To my mind, the value it offers is that, at a time when there is a lot of inertia and uncertainty about how LLM-based applications should be built, Langchain provides a convenient (if sometimes bloated) set of abstractions and opinions that can be used to expedite the building process. That, and they seem to have seized the first-mover advantage, are open-source and have strong community/developer support. The gist of what we want to do is define a prompt (which is just a fancy way of interpolating a string) that includes the text of the content we’re seeking to classify. Refreshingly straightforward, and kind of boring actually:
_PROMPT_TEMPLATE = """You are a subject matter expert specializing in computer science, programming and machine learning technologies.
You are to classify whether the following text:
{text}
Is likely to relate to computer science, programming or machine learning or not. Please provide one of two answers: tech, not_tech.
"""

prompt = PromptTemplate(input_variables=["text"], template=_PROMPT_TEMPLATE)
llm = OpenAI(
    model_name="text-davinci-003",
    temperature=0,
    openai_api_key=os.environ["OPEN_API_KEY"],
)
chain = LLMChain(llm=llm, prompt=prompt)
We can then invoke this chain in the main ingestion script, as shown above at step 3.1.
Evaluation. Ok mate, so you’re telling me I can skip the traditional, tedious process of supervised learning with an API call and still obtain competitive results? Maybe! We should test this. So I stood up an Argilla annotation instance on my local machine. I did this with the standard docker-compose config, being mindful to tweak the platform arg to target AMD64 (M1 Mac) and to add data volumes so datasets persist between uses. I then wrote some more scripts to pull the entire Notion database down as a dataframe (yuck):
@retry(
    wait=wait_random_exponential(min=1, max=NOTION_MAX_RETRY_TIME),
    stop=stop_after_attempt(NOTION_MAX_RETRY),
)
def notion_db_to_df(notion_client, database_id):
    # Create an empty list to hold all pages
    data = []
    # Initialize start_cursor as None to get the first page of results
    start_cursor = None
    while True:
        time.sleep(0.2)
        # Get a page of results
        response = notion_client.databases.query(database_id, start_cursor=start_cursor)
        results = response.get("results")
        # Convert the pages to records and add them to data
        for page in results:
            record = {
                prop_name: get_property_value(page, prop_name)
                for prop_name in page["properties"].keys()
            }
            data.append(record)
        if next_cursor := response.get("next_cursor"):
            # Set 'start_cursor' to 'next_cursor' to get the next page of results in the next iteration
            start_cursor = next_cursor
        else:
            break
    # Convert the data to a dataframe and return it
    return pd.DataFrame(data)


def get_property_value(page, property_name):
    # for a notion page/db record
    prop = page["properties"][property_name]
    if prop["type"] == "title":
        return prop["title"][0]["text"]["content"] if prop["title"] else None
    elif prop["type"] == "rich_text":
        return prop["rich_text"][0]["text"]["content"] if prop["rich_text"] else None
    elif prop["type"] == "number":
        return prop["number"]
    elif prop["type"] == "date":
        return prop["date"]["start"] if prop["date"] else None
    elif prop["type"] == "url":
        return prop["url"]
    elif prop["type"] == "email":
        return prop["email"]
    elif prop["type"] == "phone_number":
        return prop["phone_number"]
    elif prop["type"] == "select":
        return prop["select"]["name"] if prop["select"] else None
    elif prop["type"] == "multi_select":
        return (
            [option["name"] for option in prop["multi_select"]]
            if prop["multi_select"]
            else []
        )
    else:
        return None
We then upload this dataframe as a dataset within Argilla:
def format_metadata(record):
    meta_cols = set(record.keys()) - {"text", "vector"}
    return {k: v for k, v in record.to_dict().items() if k in meta_cols}


def format_classification_record(record):
    record = TextClassificationRecord(
        prediction=[(record.is_tech_related, 1.0)],
        text=record.text,
        multi_label=False,
        metadata=format_metadata(record),
    )
    return record


def log_notion_db_to_argilla():
    classification_records = (
        notion_db_to_df(notion_client, database_id)
        .pipe(lambda x: x[x.text.apply(lambda y: len(y.split(" ")) > 5)])
        .apply(format_classification_record, axis=1)
        .tolist()
    )
    logger.info(f"Logging {len(classification_records)} records to Argilla")
    dataset_rg = DatasetForTextClassification(classification_records)
    log(
        records=dataset_rg,
        name=DATASET_NAME,
        tags={"overview": "Verify zero-shot LLM classifications"},
        background=False,
        verbose=True,
    )
Here, I’ve specifically opted to use the OpenAI tech-relevancy predictions as a pre-annotation on each record. This essentially turns the annotation exercise into a validation exercise, reducing the amount of data entry we need to undertake. I ended up labelling ~260 records, mainly from Reddit and Twitter, in ~20 minutes, which I then pulled down from Argilla and ran through a scikit-learn classification report:
def evaluate_argilla_dataset():
    # after some annotations, load in dataset
    labelled = (
        load(DATASET_NAME)
        .to_pandas()
        .pipe(lambda x: x[~x.annotation.isna()])
        .assign(prediction=lambda x: x.prediction.apply(lambda y: y[0][0]))
    )
    cr = classification_report(
        labelled.annotation, labelled.prediction, output_dict=True
    )
    cr = pd.DataFrame(cr).T
    print(tabulate(cr, headers="keys", tablefmt="psql"))
The classification report looks like this:
+--------------+-------------+----------+------------+------------+
| | precision | recall | f1-score | support |
|--------------+-------------+----------+------------+------------|
| Not_tech | 0.894737 | 0.990291 | 0.940092 | 103 |
| Tech | 0.993464 | 0.926829 | 0.958991 | 164 |
| accuracy | 0.951311 | 0.951311 | 0.951311 | 0.951311 |
| macro avg | 0.9441 | 0.95856 | 0.949541 | 267 |
| weighted avg | 0.955378 | 0.951311 | 0.9517 | 267 |
+--------------+-------------+----------+------------+------------+
If you’re suspicious, I was too. But I manually inspected the results a few times, and I can confirm they are accurate. To recap: I was able to obtain near-perfect classification results using OpenAI’s text-davinci-003 model, in about 30 minutes. Yee haw.
Finito
So now I can read my silly little tweets and hype-machine posts from the convenience of Notion! Some future improvements/extensions:
- Containerisation + scheduling. Initially, I had in mind a Docker image that I could schedule regularly via a cloud function, or somewhere else that wasn’t just my local machine, but the LinkedIn scraping has created some complications. Specifically, the “copy to clipboard” function requires access to a host machine’s clipboard to copy to. For most Docker base images, this means installing an X11 server to support the X11 protocol that lets applications copy to the clipboard, which also creates an additional security risk. All very annoying when I only need to run this script periodically and in an ad hoc fashion. So I didn’t do that, but I could have.
- Supervised model. The other thing I found was that (quite understandably) OpenAI rate-limits its APIs at around 60 requests/minute or a per-model maximum token throughput, whichever is hit first. This grates quite a bit, as each record I’m trying to write to Notion requires its own separate API call/classification. I think this blog post by Matthew Honnibal contains some sensible advice here: LLMs are useful as a starting point for prototyping ideas, but if the task can be clarified/replicated with traditional supervised learning, then those traditional models benefit from speed (RE: rate limiting), control and extensibility. All good things we want our software to be. A rough sketch of such a replacement follows below.
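For what it’s worth, here is a minimal sketch of that kind of replacement model, assuming the labelled Argilla annotations have been pulled into a dataframe with text and annotation columns (as in evaluate_argilla_dataset above). TF-IDF plus logistic regression is just one obvious baseline choice, not something the repo currently implements.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def train_relevancy_baseline(labelled_df):
    # labelled_df is assumed to have 'text' and 'annotation' (tech/not_tech) columns
    train, test = train_test_split(
        labelled_df, test_size=0.2, stratify=labelled_df.annotation, random_state=42
    )
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),
        LogisticRegression(max_iter=1000),
    )
    model.fit(train.text, train.annotation)
    print(classification_report(test.annotation, model.predict(test.text)))
    return model

# e.g. in the main script: record["is_tech_related"] = model.predict([truncated_input])[0]

Once trained, a model like this runs locally with no rate limits, which is the whole point of swapping it in.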
Anyway, you can find the repo here if you want to scratch around some more.
Banner art developed with stable diffusion. High-level technical details developed in collaboration with GPT-4.