Collagey segmenty

What and Why

In my downtime, I like to make collages. Specifically, I like to collage with old biological paintings of plants and animals, and the Biodiversity Heritage Library has so far been the best place to collect this type of material. I’ve been tinkering with a few workflows over time, but things usually look like this:

  • Collecting images. Ideally, of a decent resolution, from the biodiversity website or some other source like Flickr.
  • Cropping the images. Procreate has a pretty decent auto-selection tool which makes the cropping/extracting process intuitive enough, though it is slow going (one image at a time) and a bit hit-and-miss when it comes to intuiting an effective threshold value.
  • Arranging into a collage. Yay.

I thought there might be a better way to speed up the first two aspects of this process, using the Flickr API and some computer vision techniques. I sketched a plan as follows:

  • Collecting images. Automatically, using the Flickr API, which, crucially, allows access to specific profiles such as the Biodiversity Heritage Library.
  • Cropping the images. Using… some sort of… computer vision? At the time of writing (and probably still), CV is quite new to me, but I had been eyeing off a few standard libraries like OpenCV and skimage, which I figured would act as good starting points.

Collecting images

  • Flickr developer token and authentication. I started by booting up my Flickr account and re-using a developer token from an old uni assignment; Jim Hogan’s cloud computing API mashup, as I recall. Creating new developer tokens is pretty straightforward. The Flickr API can be accessed within Python using the flickrapi package like so:
import flickrapi

flickr = flickrapi.FlickrAPI(api_key, api_secret, format="etree")
flickr.authenticate_console()  # throws a decode error, yet auth still works (see below)

Why this much detail about the authentication? Once run, this snippet will pop open a new browser window requesting manual authorisation, but piping the authorisation back into your Python script throws a pretty nasty string decode error. Despite this, the core authentication process appears to execute just fine. Bit of a mystery to me, but since we’re notebook-bound for the most part, this type of jankiness isn’t so much of a problem.
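If it does bother you, here’s a minimal workaround sketch; it assumes the failure really is a UnicodeDecodeError, and uses flickr.test.login (a genuine, auth-requiring Flickr endpoint) to verify the token actually took:

try:
    flickr.authenticate_console()
except UnicodeDecodeError:
    pass  # assumption: auth has already completed despite the decode error

flickr.test.login()  # raises if we are not actually authenticated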

  • Retrieve a selection of BDHL albums. The BDHL Flickr profile contains the profile ID within its URL, allowing us to paginate through all albums currently hosted by the BDHL. Each page contains up to 500 albums (overwhelming!), hence the random down-sample at the end of this snippet. This initial API call retrieves album metadata only.
import pandas as pd

n_albums = 10
user_id = "61021753@N02"

# retrieve the first page of biodiversity albums (paginated, up to 500 per page)
bdhl = flickr.photosets.getList(user_id=user_id, page=1)
bdhl_df = pd.DataFrame(
    [dict(e.items()) for e in bdhl.find("photosets")]
).sample(n=n_albums, random_state=42)
  • BDHL album curation. Upon running the first version of this download script, I ended up pulling quite a few albums which featured only a handful of images, photos/photo-realistic images, or artwork without a useful “pre-segmentation”, which determines how easy the downstream segmentation will be. So instead of randomly sampling albums, I manually perused the available albums and explicitly provided a list of album IDs that I thought were a better fit for the project, along the lines of the snippet below.
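The ID list here is a placeholder for my actual list, bar the first entry, which is the album shown later in this post:

# replace the random sample with an explicit, hand-picked ID list
curated_album_ids = [
    "72157719480387299",
    # ... further hand-picked albums
]
bdhl_df = bdhl_df.query("id in @curated_album_ids")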
  • Retrieve image metadata. Additionally, we’ll have to walk through all/some of the album contents to retrieve image metadata, including the download URL we’ll eventually use to pull the images down with:
def get_image_url_etree(image_id):
    sizes = flickr.photos.getSizes(photo_id=image_id)
    largest_available_size = (
        pd.DataFrame([dict(e.items()) for e in sizes.find("sizes")])
        .astype({"width": int, "height": int})  # sizes arrive as strings
        .sort_values(by=["width", "height"], ascending=False)  # largest first
        .iloc[0]
    )
    return largest_available_size.to_dict()


import logging

from tqdm import tqdm

flickr_retrieval_logger = logging.getLogger("flickr_retrieval")


def retrieve_image_meta_data(album_id, n_images_per_album=30, min_resolution=800):
    image_records = []
    try:
        images_raw = list(flickr.walk_set(album_id))
    except Exception:
        flickr_retrieval_logger.error(f"Unable to walk images for album: {album_id}")
        return pd.DataFrame()  # empty frame, so the later concat still works

    for image in tqdm(
        images_raw, desc=f"Retrieving image meta data for album: {album_id}"
    ):
        image = dict(image.items())  # unpack the silly e-tree format
        try:
            largest_size = get_image_url_etree(image["id"])
            image["image_meta"] = largest_size
            image_records.append(image)
        except Exception:
            flickr_retrieval_logger.error(
                f"Unable to retrieve image size for: {image['id']}"
            )

    if not image_records:
        flickr_retrieval_logger.warning(f"No image records found for album: {album_id}")
        return pd.DataFrame()

    images = (
        pd.DataFrame(image_records)
        .assign(album_id=album_id)
        .assign(download_url=lambda x: x.image_meta.apply(lambda y: y["source"]))
        # filter out small images
        .assign(width=lambda x: x.image_meta.apply(lambda y: int(y["width"])))
        .assign(height=lambda x: x.image_meta.apply(lambda y: int(y["height"])))
        .query("height >= @min_resolution & width >= @min_resolution")
        # cap the number of images per album
        .pipe(
            lambda x: x.sample(n=n_images_per_album, random_state=42)
            if x.shape[0] > n_images_per_album
            else x
        )
    )
    if len(images) == 0:
        flickr_retrieval_logger.warning(
            f"No images meet the minimum resolution of {min_resolution}; "
            f"{len(image_records)} initial records found"
        )
    return images

# walk the albums, retrieve individual photo details, and collate
all_photos = []
for idx, album in bdhl_df.iterrows():
    all_photos.append(retrieve_image_meta_data(album.id))
all_photos = pd.concat(all_photos, ignore_index=True)

There’s some pretty gross code going on here, such as the request try/excepts, and in particular lots of “walk” API calls and dictionary casting, which are required when using the XML eTree version of the Flickr API. The gist is simple enough though: for each album > retrieve metadata for each photo > collate > filter for decent dimensions > take a random sample to keep the outputs manageable.

  • Download images. Once we have a shiny new dataframe full of photo metadata we can download each image separately.
from pathlib import Path

import requests


def download_flickr_image(url, save_path):
    response = requests.get(url)
    response.raise_for_status()
    with open(save_path, "wb") as file:
        file.write(response.content)


def download_image_record(record, download_dir):
    # mkdir the album save dir if it doesn't exist
    (download_dir / record.album_id).mkdir(parents=True, exist_ok=True)

    save_path = f"{(download_dir / record.album_id / record.id).as_posix()}{Path(record.download_url).suffix}"
    if Path(save_path).exists():
        flickr_retrieval_logger.info(f"Previously saved: {save_path}; skipping")
    else:
        try:
            download_flickr_image(record.download_url, save_path)
        except Exception:
            flickr_retrieval_logger.error(
                f"Unable to download image at: {record.download_url}"
            )

# download each photo
download_dir = Path("../output/bdhl_flickr_downloads/")
download_dir.mkdir(parents=True, exist_ok=True)
all_photos.apply(lambda y: download_image_record(y, download_dir), axis=1)

Cropping the images

  • The images. Loading, arranging and displaying the contents of one of these curated albums (a surprisingly tedious exercise with matplotlib, hence ipyplot) goes as follows:
import ipyplot

all_images = list(
    Path("../output/bdhl_flickr_downloads/72157719480387299/").rglob("*.jpg")
)
ipyplot.plot_images([str(e) for e in all_images], max_images=10, img_width=200)

And here they are! Rhinos, badgers, bats and snakes! A real zoo on our hands. 

[image: a sample of the downloaded album images]
  • Threshold-based segmentation. As always, it’s good to start with something simple. I initially set out to define a threshold which would be used to create a binary mask over the image, separating the animals from their background. Here, the lower bound of the threshold is manually set (lower_thresh), whilst the upper bound is set to 255 (white); pixel values within this range return True, and vice versa for values outside it. Inverting the resulting binary array gives us a mask, which we can then smooth using an elliptical kernel. After experimenting with kernel sizes, I found that smaller kernels tended to provide better results, but already during these early stages I could tell we would be in for a lot of hit-and-miss style development.
import cv2
import numpy as np


def remove_background_white(img_path):
    img = cv2.imread(str(img_path))
    lower_thresh = 120
    lower = np.array([lower_thresh, lower_thresh, lower_thresh])
    upper = np.array([255, 255, 255])

    # select everything within thresh, i.e. the light page background
    thresh = cv2.inRange(img, lower, upper)

    # smooth mask morphology
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (4, 4))
    morph = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel)
    mask = 255 - morph  # invert, so the subject is kept

    # apply mask to image
    result = cv2.bitwise_and(img, img, mask=mask)

    return result
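As a quick sanity check, we can eyeball the output over a handful of the images loaded earlier (re-using the all_images list from above):

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, img_path in zip(axes, all_images[:4]):
    result = remove_background_white(img_path)
    ax.imshow(cv2.cvtColor(result, cv2.COLOR_BGR2RGB))  # OpenCV loads as BGR
    ax.axis("off")
plt.show()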
Pretty good for a first pass, and lightning quick to run! There are problems though; in particular, animals with dynamic values are not “stamped” out in their entirety; the albatross in the bottom right-hand corner, for instance.
  • Filter-based segmentation. The threshold-based results were a good starting point, but I was a little concerned about having to toggle the threshold lower bound as a magic number. So, back to the drawing board. This time I chanced upon skimage, a very capable library with plenty of out-of-the-box options for this particular problem. In particular, I was initially interested in the filters module, which contains lots of… stuff. Here, the approach is still to use the available filter algorithms to calculate a threshold value that captures the majority of the page background, but we’re seeking to localise how this thresholding is done on a per-image basis (no magic threshold numbers). skimage has our back, and provides a way to test/compare multiple filters:
# find a decent filter by comparing skimage's built-in thresholding algorithms
from skimage.filters import try_all_threshold

for image in all_images[:10]:
    image = cv2.imread(str(image))
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    fig, ax = try_all_threshold(gray, figsize=(12, 12), verbose=False)
    plt.show()
Excuse the awkward formatting. Even so, I noted that none of the out-of-the-box filtering algorithms was capable of capturing the outlines of animals with dynamic values.
Note the improved segmentation via the use of Triangle within the Raccoon drawing (lowest, left) compared to the original binary filter.
Images with dark, consistent values are captured well by most filter algorithms.

It’s difficult to “scientifically” pick a winner here, but after reviewing a variety of outputs I settled on Triangle, with Mean as a close second. These two filters seemed to strike a good balance between capturing whole animals with dynamic values and not capturing incidental/spurious background elements. Both are one-liners to apply, as sketched below.
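Both skimage functions return a scalar threshold localised to the given image:

from skimage.filters import threshold_mean, threshold_triangle

gray = cv2.cvtColor(cv2.imread(str(all_images[0])), cv2.COLOR_BGR2GRAY)

# True wherever the pixel is lighter than the computed threshold
mask_triangle = gray > threshold_triangle(gray)
mask_mean = gray > threshold_mean(gray)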

  • Edge detection and watershed segmentation. Restless as I am, I kept digging and explored other possible ways to segment and extract elements from images. I chanced upon an official skimage tutorial which made use of Canny edge detection and watershed segmentation, along the lines of the sketch below. This provided some new, interesting results, though I still couldn’t reliably segment pictures without trading off quality on images with dynamic values and/or entangled backgrounds.
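The sketch loosely follows that tutorial; the 30/150 marker values are illustrative assumptions rather than tuned numbers, and gray is a uint8 grayscale image as elsewhere in this post:

import numpy as np
from scipy import ndimage as ndi
from skimage import feature, filters, segmentation


def tutorial_style_segment(gray):
    # edge-based pass: Canny edges, then fill any fully enclosed regions
    edges = feature.canny(gray, sigma=2)
    filled = ndi.binary_fill_holes(edges)

    # region-based pass: watershed over a Sobel "elevation" map, seeded with
    # confidently-dark (subject) and confidently-light (background) markers
    elevation_map = filters.sobel(gray)
    markers = np.zeros_like(gray, dtype=int)
    markers[gray < 30] = 1  # assumed subject marker threshold
    markers[gray > 150] = 2  # assumed background marker threshold
    labels = segmentation.watershed(elevation_map, markers)
    return filled, labels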
Far out, man. I thought the default colour scheme which came out of the watershed segmentation was quite nice as a consolation.
[image: watershed segmentation outputs]
Even for images featuring very dark values within the subject matter (rhinos) there are problems.
  • Transformer-based segmentation. At this point, I was starting to appreciate some of the subtleties of image segmentation, and how difficult it is to tease out results which align with your “human” expectations of what a segmentation should look like. As a last-ditch effort, I pulled in a Segformer model which had been trained on the scene-centric ADE20K dataset, yielding some… interesting results. In all likelihood, my dataset was a poor match for ADE20K, but I thought this rounded off my list of approaches, ranging from simple to complex. Something like the following:
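This is a sketch via Hugging Face transformers; the checkpoint is one public ADE20K-trained Segformer, not necessarily the exact model I used:

import torch
from PIL import Image
from transformers import SegformerForSemanticSegmentation, SegformerImageProcessor

checkpoint = "nvidia/segformer-b0-finetuned-ade-512-512"
processor = SegformerImageProcessor.from_pretrained(checkpoint)
model = SegformerForSemanticSegmentation.from_pretrained(checkpoint)

image = Image.open(str(all_images[0])).convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# upsample logits back to the input resolution, then take a per-pixel argmax
segmentation_map = processor.post_process_semantic_segmentation(
    outputs, target_sizes=[image.size[::-1]]
)[0]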
[image: Segformer segmentation outputs]
  • Final transformation pipeline and background removal. Moving on to the final filtering pipeline, which reads an image, calculates a mask using mean thresholding, applies the mask, and removes the masked (black) pixels via an alpha channel before saving the output.
from skimage.filters import threshold_mean


def get_threshold_mask(image):
    # True where the pixel is lighter than the image-wide mean, i.e. background
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    return gray > threshold_mean(gray)


def apply_mask(image, mask):
    # invert: keep the subject (below-threshold pixels), drop the background
    mask = (~mask).astype(np.uint8)

    # smooth mask morphology
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (2, 2))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)

    # apply mask to image
    return cv2.bitwise_and(image, image, mask=mask)


def remove_background(masked_image):
    # alpha channel: opaque wherever any masked pixel survives
    tmp = cv2.cvtColor(masked_image, cv2.COLOR_BGR2GRAY)
    _, alpha = cv2.threshold(tmp, 0, 255, cv2.THRESH_BINARY)
    b, g, r = cv2.split(masked_image)
    return cv2.merge([b, g, r, alpha])

FILTER_DIR = Path("../output/bdhl_filtered/")  # output location (name assumed)
FILTER_DIR.mkdir(parents=True, exist_ok=True)

for image_file in tqdm(all_images, desc="Segmenting images"):
    image = cv2.imread(str(image_file))
    mask = get_threshold_mask(image)
    masked = apply_mask(image, mask)
    bg_removed = remove_background(masked)
    cv2.imwrite(str(FILTER_DIR / f"{image_file.stem}_mask.png"), bg_removed)

Finito

Ultimately, the results, when considered together, were a little patchy, but this particular application is a case of good enough being good enough; I was still able to post-process enough of the outputs (with a large time reduction) into collages which I thought were pretty cool!

[images: finished collages]

I think there’s still scope to explore model-heavy instance segmentation and/or semantic segmentation, or potentially some more exotic edge detection algorithms like HED. But for now, if I process enough images, there are enough quality outputs to cherry-pick for use.

Anyway, you can find all of the above code here, as well as a sample of BDHL images. Enjoy!