Auspex is a simple example of a data pipeline that runs from extracting data out of PDFs right through to feature engineering. It’s built with Python, and uses a few external APIs for convenience.

The process is linear. It starts by extracting data from daily PDF reports from Vista box office (using Tabula), and then adds further film details using the OMDb API.
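As a rough illustration of this first stage, here is a minimal sketch rather than the actual Auspex code. It assumes the tabula-py wrapper for Tabula and OMDb's public title lookup (the `apikey`, `t` and `y` query parameters). The function names and the row fields `film` and `weekend_gross` are hypothetical, chosen only for the example.

```python
import json
import urllib.parse
import urllib.request

OMDB_URL = "http://www.omdbapi.com/"


def extract_tables(pdf_path):
    """Read every table in a PDF report into a list of DataFrames."""
    # Assumption: tabula-py is installed (it also needs a Java runtime).
    import tabula
    return tabula.read_pdf(pdf_path, pages="all")


def omdb_query_url(title, api_key, year=None):
    """Build an OMDb title-lookup URL."""
    params = {"apikey": api_key, "t": title}
    if year is not None:
        params["y"] = str(year)
    return OMDB_URL + "?" + urllib.parse.urlencode(params)


def fetch_film(title, api_key, year=None):
    """Call OMDb and return the decoded JSON response as a dict."""
    with urllib.request.urlopen(omdb_query_url(title, api_key, year)) as resp:
        return json.load(resp)


def enrich(row, omdb):
    """Return a copy of a box office row with a few OMDb details added.

    If OMDb reports no match, the row is returned unchanged.
    """
    enriched = dict(row)
    if omdb.get("Response") != "True":
        return enriched
    enriched["imdb_id"] = omdb.get("imdbID")
    # OMDb returns genres as a comma-separated string.
    enriched["genres"] = [
        g.strip() for g in omdb.get("Genre", "").split(",") if g.strip()
    ]
    # Runtime arrives as text such as "117 min"; keep only the number.
    parts = omdb.get("Runtime", "").split()
    enriched["runtime_min"] = (
        int(parts[0]) if parts and parts[0].isdigit() else None
    )
    return enriched
```

In use, each row pulled from a report would be passed through `enrich(row, fetch_film(row["film"], api_key))`. Keeping the URL building and the merging separate from the network call makes both easy to check without an API key.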

It also retrieves a trailer link for each film through the YouTube API, then downloads those trailers, uploads them to Google Cloud Storage, and runs them through the Google Vision API to collect trailer annotations: a text description for every frame. This was inspired by this piece on predicting box office using these annotations from Google and 20th Century Fox.
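The trailer lookup and the step from per-frame annotations to model features can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's own code: it uses the YouTube Data API v3 search endpoint, assumes the annotations have already been collected as one list of labels per frame, and all function names plus the "official trailer" query suffix are hypothetical.

```python
from collections import Counter
import urllib.parse

YOUTUBE_SEARCH_URL = "https://www.googleapis.com/youtube/v3/search"


def trailer_search_url(title, api_key):
    """Build a YouTube Data API v3 search URL for a film's trailer."""
    params = {
        "part": "snippet",
        "q": f"{title} official trailer",  # assumed query wording
        "type": "video",
        "maxResults": 1,
        "key": api_key,
    }
    return YOUTUBE_SEARCH_URL + "?" + urllib.parse.urlencode(params)


def first_video_id(search_response):
    """Return the video ID of the first search result, or None."""
    items = search_response.get("items", [])
    return items[0]["id"]["videoId"] if items else None


def label_features(frame_labels, top_n=3):
    """Turn per-frame labels into the share of frames showing each label.

    frame_labels is a list with one list of labels per frame. Only the
    top_n most frequent labels are kept.
    """
    total = len(frame_labels)
    if total == 0:
        return {}
    # Count each label at most once per frame.
    counts = Counter(
        label for labels in frame_labels for label in set(labels)
    )
    return {label: n / total for label, n in counts.most_common(top_n)}
```

Expressing each label as a share of frames, rather than a raw count, keeps trailers of different lengths comparable.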

The same process is also set up for the UK Box Office dashboard database, and for upcoming releases.

I have some exploratory data analysis to share using this pipeline, as well as a machine learning model to predict upcoming releases. Given the current collapse in box office data, this doesn’t feel like a priority, as future real-world test data is now probably months away.

The source code is available on GitHub.