These are a few notes on building a scraping pipeline. A scraping pipeline is an application that takes data from a website and outputs it in a format suitable for another project.
These thoughts are based on a current project where I need to scrape various websites and build a triple store. An example is the website ug.dk, which holds Danish education programmes, institutions, jobs, campuses, etc. I have built a setup that can create an updated triple store, with fallback on old data for resources that are not available.
To build this scraper, I followed some core principles that have led my decision making.
- Take care of the host. Don't needlessly burden the site that provides the data.
- Expect interruptions. Resources might not always be available on the internet.
- Have fun. It should be fun to develop, and I want to reduce iteration time.
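The first two principles can be sketched as a small wrapper around whatever function actually downloads a page. This is only a sketch under my own assumptions: `polite_fetch`, the `fetch` callable, and the delay and retry values are hypothetical names and numbers, not part of the project described here.

```python
import time
from typing import Callable, Optional


def polite_fetch(
    url: str,
    fetch: Callable[[str], str],
    delay_seconds: float = 1.0,
    retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch(url) with a pause before every attempt.

    Returns None when every attempt fails, so the caller can fall back
    on data saved by an earlier run.
    """
    for attempt in range(retries):
        # Pause before each request; wait longer after each failure.
        sleep(delay_seconds * (attempt + 1))
        try:
            return fetch(url)
        except OSError:
            # Network errors such as urllib.error.URLError derive from OSError.
            continue
    return None
```

Passing the download function in as `fetch` keeps the wrapper independent of any HTTP library and makes it possible to try it out without touching the network.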
Think incremental scraping
Get a full minimal pipeline up and running ASAP. From there, it is a matter of enriching the data and incrementally enhancing the pipeline.
Check both the code and the data into git. When a full version of the scraper is done and the data is scraped, commit the resulting data. Diffs can be used to see what data has changed since the last scrape, or to revert the data when a bug has been introduced in the pipeline. For that reason, save data in a human-readable format, e.g. JSON. Make sure the serializer saves it in a prettified format to make beautiful diffs.
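As a rough illustration of diff-friendly output, the sketch below writes JSON with indentation and sorted keys, so the same data always produces the same file. The function name `save_pretty` and the example entity are made up for this note.

```python
import json
from pathlib import Path


def save_pretty(data: dict, path: Path) -> None:
    """Write data as indented JSON with sorted keys and a trailing newline."""
    text = json.dumps(
        data,
        indent=2,            # one field per line gives line-by-line diffs
        sort_keys=True,      # stable key order, so only real changes show up
        ensure_ascii=False,  # keep characters such as æ, ø, å readable
    )
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text + "\n", encoding="utf-8")
```

Sorting the keys matters more than it looks: without it, a change in dictionary ordering upstream can make every file appear modified in the diff.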
Find a good tradeoff between saving entities as individual files and bundling them into larger files. GitHub accepts files up to 100 MB; on the other hand, having several hundred thousand files can be less practical.
Separate data fetch and processing
Make a script that saves data to disk in a format that closely mimics the format of the source website. Make sure that runs are non-destructive, enriching and updating existing entities instead of destroying existing data and re-fetching it.
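One way to keep runs non-destructive is to merge freshly fetched fields into the entity already on disk, and to leave the file alone when the fetch failed. The sketch below assumes one JSON file per entity; `update_entity` and the field names in the usage note are hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def update_entity(path: Path, fresh: Optional[dict]) -> dict:
    """Merge fresh fields into the saved entity and return the result.

    Fields missing from the fresh data keep their old values. When fresh
    is None (the fetch failed), the saved data is returned unchanged.
    """
    existing = {}
    if path.exists():
        existing = json.loads(path.read_text(encoding="utf-8"))
    if fresh is None:
        return existing
    merged = {**existing, **fresh}  # fresh values win, old fields survive
    path.parent.mkdir(parents=True, exist_ok=True)
    text = json.dumps(merged, indent=2, sort_keys=True, ensure_ascii=False)
    path.write_text(text + "\n", encoding="utf-8")
    return merged
```

For example, calling `update_entity(path, {"title": "New"})` on a file that already holds `{"id": "1", "title": "Old"}` leaves `id` in place and only replaces `title`.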
Make another script that takes the data saved from the website and transforms it into the target format. Let the scraper keep fetching files while you work on this. Develop this script against a subset of the source files to ensure fast iteration times.
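A minimal sketch of such a transform script is shown below. It assumes the fetch step left one JSON file per entity in a directory and that each entity has an `id` field; both are assumptions for illustration, as are the names `to_triples` and `transform`. The `limit` argument restricts the run to a subset of files while iterating.

```python
import json
from itertools import islice
from pathlib import Path
from typing import Iterator, List, Optional, Tuple

Triple = Tuple[str, str, object]


def to_triples(entity: dict) -> Iterator[Triple]:
    """Turn one saved entity into (subject, predicate, object) triples."""
    subject = entity["id"]  # assumed identifier field
    for key, value in sorted(entity.items()):
        if key != "id":
            yield (subject, key, value)


def transform(raw_dir: Path, limit: Optional[int] = None) -> List[Triple]:
    """Transform saved files into triples; limit=None processes all files."""
    files = sorted(raw_dir.glob("*.json"))  # sorted for a stable subset
    triples: List[Triple] = []
    for path in islice(files, limit):
        entity = json.loads(path.read_text(encoding="utf-8"))
        triples.extend(to_triples(entity))
    return triples
```

Because the script only reads files from disk, it can be rerun as often as needed without sending a single request to the source website.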