Main point: We need good data platforms to ingest, process, and analyze data.

Understanding our surroundings is an art. Good artists need good tools, not because making art is impossible without them, but because good tools enable expression.

The same is true for data science. We need a good data platform. It should help us separate concerns, it should assist with and assure the quality of the deliverables, and it should let people focus on their work rather than on everything around it.

But what concerns are central to understanding data? The first is taking data from the source and processing it into a format we can use; this is called ingestion. Next is the intermediary representation. It acts as an indirection and lets us disregard the data source: it integrates data so we can work across data sets, and it applies anonymization to the datasets that need it. Lastly, we need a product. This is the concern of understanding the data, producing trained models that can be used in applications, or something else entirely.
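As a rough sketch of this split, the toy pipeline below models the three concerns as plain functions. The names ingest, integrate, and analyze and the record format are illustrative, not the API of any particular platform.

```python
# A minimal sketch of the three concerns as pipeline stages; names and
# record shapes are illustrative, not an actual platform API.
from typing import Iterable


def ingest(raw_rows: Iterable[str]) -> list[dict]:
    """Source concern: parse raw input into structured records."""
    return [{"name": row.strip()} for row in raw_rows if row.strip()]


def integrate(records: list[dict]) -> list[dict]:
    """Intermediary concern: normalize across sources and attach provenance."""
    return [{**r, "name": r["name"].lower(), "source": "demo.csv"} for r in records]


def analyze(records: list[dict]) -> dict:
    """Product concern: derive something useful, here a simple count."""
    return {"n_records": len(records)}


if __name__ == "__main__":
    print(analyze(integrate(ingest(["Alice\n", "Bob\n", "\n"]))))
```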

Ingestion: The first step is to ingest data into the system. This consists of figuring out how the data should be modeled and building the data package. The central concerns here are data hygiene and local integrity.
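A minimal sketch of what such a data package could look like, assuming a hypothetical DataPackage class; the schema fields and checks are made up for illustration.

```python
# A hypothetical "data package": rows plus a declared model, with a hygiene
# check (does the row fit the model?) and a local integrity check (is the
# row internally consistent?).
from dataclasses import dataclass, field


@dataclass
class DataPackage:
    schema: dict[str, type]                 # how the data is modeled
    rows: list[dict] = field(default_factory=list)

    def add(self, row: dict) -> None:
        # Data hygiene: reject rows that do not fit the declared model.
        for name, typ in self.schema.items():
            if name not in row or not isinstance(row[name], typ):
                raise ValueError(f"bad value for {name!r}: {row.get(name)!r}")
        # Local integrity: a check that only needs this one row.
        if "age" in row and row["age"] < 0:
            raise ValueError("age must be non-negative")
        self.rows.append(row)


package = DataPackage(schema={"name": str, "age": int})
package.add({"name": "Ada", "age": 36})
```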

An ideal data platform should give the fullest freedom to express the model. That is, the engineer should not be forced to fit the data onto an existing model. Furthermore, the data platform should use the ontology to assist with data hygiene.
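One way to picture ontology-assisted hygiene is a lookup of allowed terms per field. The sketch below uses a plain dictionary as a stand-in for whatever richer ontology representation a real platform would use; the fields and terms are invented.

```python
# A toy "ontology": the set of known terms for each field.
ONTOLOGY = {
    "country": {"DK", "SE", "NO"},
    "status": {"active", "inactive"},
}


def hygiene_issues(row: dict) -> list[str]:
    """Return human-readable issues for values outside the ontology."""
    issues = []
    for column, allowed in ONTOLOGY.items():
        value = row.get(column)
        if value is not None and value not in allowed:
            issues.append(f"{column}={value!r} is not a known term")
    return issues


print(hygiene_issues({"country": "Denmark", "status": "active"}))
# ["country='Denmark' is not a known term"]
```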

We might even want to work with data under different ontologies. The data platform should let the user handle multiple ingestions of the same data, modeled slightly differently to accommodate different needs.
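A small sketch of the same raw records ingested under two different models, assuming one consumer needs full birth dates while another only needs birth years; both views are illustrative.

```python
# One raw source, two ingestions with different models of the same data.
raw = [{"name": "Ada Lovelace", "born": "1815-12-10"}]

detailed_view = [{"person": r["name"], "birth_date": r["born"]} for r in raw]
reporting_view = [{"person": r["name"], "birth_year": int(r["born"][:4])} for r in raw]

print(detailed_view)   # [{'person': 'Ada Lovelace', 'birth_date': '1815-12-10'}]
print(reporting_view)  # [{'person': 'Ada Lovelace', 'birth_year': 1815}]
```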

Internal Representation: The internal representation should provide an indirection to the source. It should, with as little effort as possible, make sure that identical entities are resolved to the same entity, or at least that reasonable relations are set up between them. In particular, this is the place to keep track of provenance, i.e. where data comes from and what intermediary steps it has been through.
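A minimal sketch of entity resolution with provenance, assuming records are matched on a normalized email address; the matching rule, record shape, and source names are illustrative.

```python
# Resolve records from two sources into one entity per normalized key,
# recording which sources contributed to each entity.
from collections import defaultdict

source_a = [{"email": "Ada@Example.org", "name": "Ada Lovelace"}]
source_b = [{"email": "ada@example.org", "city": "London"}]

entities: dict[str, dict] = defaultdict(lambda: {"attributes": {}, "provenance": []})

for source_name, records in [("source_a", source_a), ("source_b", source_b)]:
    for record in records:
        key = record["email"].strip().lower()     # identical entities resolve to one key
        entity = entities[key]
        entity["attributes"].update({k: v for k, v in record.items() if k != "email"})
        entity["provenance"].append(source_name)  # remember where each piece came from

print(dict(entities))
# {'ada@example.org': {'attributes': {'name': 'Ada Lovelace', 'city': 'London'},
#                      'provenance': ['source_a', 'source_b']}}
```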

This is also the place to apply differential privacy to the data that needs it. This ensures that analysts can work with the data without friction and release the results they see without further redaction. It is especially important when training statistical models such as neural networks on personally identifiable data, as there is a risk of covertly releasing sensitive data through the model.
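As one concrete example, the Laplace mechanism adds noise scaled by sensitivity/epsilon to a counting query so the released number is differentially private. The sketch below uses only the standard library; the epsilon value and the query itself are illustrative choices, not a prescription.

```python
# Laplace mechanism for a counting query: noise with scale sensitivity/epsilon
# is added before the count is released.
import random


def private_count(values: list[bool], epsilon: float = 1.0) -> float:
    true_count = sum(values)
    sensitivity = 1.0  # one person changes the count by at most 1
    rate = epsilon / sensitivity
    # The difference of two independent exponentials is Laplace-distributed
    # with scale sensitivity / epsilon.
    noise = random.expovariate(rate) - random.expovariate(rate)
    return true_count + noise


has_condition = [True, False, True, True, False]
print(private_count(has_condition, epsilon=0.5))
```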

Analysis: The last part is where the data is utilized to create value. This involves tooling to quickly get an intuitive sense of the data: the ontology, how datasets interrelate, what data is available, and its quality.
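A small sketch of what quick profiling could mean in practice: listing which columns exist, how complete they are, and a few example values. The dataset is a stand-in.

```python
# Profile a handful of records: column names, completeness, sample values.
rows = [
    {"name": "Ada", "age": 36, "city": "London"},
    {"name": "Alan", "age": None, "city": "London"},
    {"name": "Grace", "age": 85, "city": None},
]

columns = sorted({key for row in rows for key in row})
for column in columns:
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    completeness = len(present) / len(rows)
    print(f"{column}: {completeness:.0%} complete, examples {present[:2]}")
```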

The data scientist also needs tools for extracting the data relevant to their task, for example exemplars to train models on.
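For example, extracting exemplars could be as simple as selecting the labeled rows and splitting them; the label field and the split ratio below are illustrative.

```python
# Pull labeled exemplars out of the data, shuffle, and split for training.
import random

records = [
    {"text": "great service", "label": "positive"},
    {"text": "never again", "label": "negative"},
    {"text": "no opinion", "label": None},
]

exemplars = [r for r in records if r["label"] is not None]  # only labeled rows
random.shuffle(exemplars)
split = int(0.8 * len(exemplars))
train, test = exemplars[:split], exemplars[split:]
print(len(train), "training examples,", len(test), "held out")
```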

At Spor we are trying to solve some of these problems. In particular, we have devised methods for resolving data from disparate datasets while still retaining the ability to reason about the data locally. We are developing everything to work in the browser yet scale to the cluster. Where data resides matters, especially when you want to work on private data such as emails or personal notes.