This page collects my thoughts on data science and engineering. Data science is multidisciplinary, ranging from philosophical and theoretical topics such as ontology to practical considerations about scaling a platform to meet requirements.

Part 1: Data Interpretation

  • The Probability Monad: The probability monad provides a clean interface for lifting ordinary, sound code into probability distributions, letting us avoid working with opaque statistical concepts (see the sketch after this list).
  • Fraud and Networks: A case study on how to catch identity theft using techniques from the complex networks sciences.
  • Adversarial Statistics: Techniques to increase the reliability of produced statistical results.
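
To make the first item concrete, here is a minimal sketch of a discrete probability monad in Python. The Dist class, its method names, and the dice example are my own illustration of the general idea, not code taken from the article.

```python
from collections import defaultdict

class Dist:
    """A finitely supported probability distribution: outcome -> probability."""
    def __init__(self, probs):
        self.probs = dict(probs)

    @staticmethod
    def unit(x):
        # Wrap a plain value as a point distribution (monadic return).
        return Dist({x: 1.0})

    def map(self, f):
        # Lift an ordinary function over the distribution (functor map).
        out = defaultdict(float)
        for x, p in self.probs.items():
            out[f(x)] += p
        return Dist(out)

    def bind(self, f):
        # Monadic bind: sequence a computation that itself returns a Dist.
        out = defaultdict(float)
        for x, p in self.probs.items():
            for y, q in f(x).probs.items():
                out[y] += p * q
        return Dist(out)

die = Dist({i: 1 / 6 for i in range(1, 7)})

# Lift ordinary integer addition into distributions: the sum of two dice.
two_dice = die.bind(lambda a: die.map(lambda b: a + b))
print(round(two_dice.probs[7], 4))  # 0.1667, i.e. 6/36
```

The point of `bind` is that ordinary code (here, addition) composes with distribution-returning code without the caller ever manipulating probabilities by hand.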

Part 2: Data Processing

The Data Platform

For practical purposes, the data platform is essential. Not only does it reduce turnaround time when an analysis changes; it also provides the foundation needed to scale data analysis across appropriately large data sets.

One part is about ingesting and processing data. This could mean loading data into data frames, scraping websites, or consuming APIs, but it also covers techniques such as entity recognition and linking to extract information from text, object recognition to extract information from images, and speech recognition in conjunction with NLP techniques to get data out of auditory resources. The problems at this level are engineering-specific: keeping code up to date as sources change, and so on.

  • Building a Scraper: Thoughts on the process of building a scraper that needs to run continuously and provide new data; a minimal sketch follows below.
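
As a small sketch of such a continuously running scraper, consider the polling loop below. The URL, the JSON response shape, and the in-memory dedup store are all assumptions made for the example; a real scraper would parse the actual site and persist what it has seen.

```python
import time
import requests

SEEN_IDS = set()  # in-memory dedup store; a real scraper would persist this

def scrape_once(url):
    """Fetch a listing endpoint and return only the items not seen before."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    # Parsing is site-specific; here we assume the endpoint returns JSON
    # of the form {"items": [{"id": ..., ...}, ...]}.
    fresh = [item for item in resp.json()["items"] if item["id"] not in SEEN_IDS]
    SEEN_IDS.update(item["id"] for item in fresh)
    return fresh

if __name__ == "__main__":
    while True:
        try:
            for item in scrape_once("https://example.com/api/listings"):
                print("new item:", item["id"])
        except requests.RequestException as exc:
            print("fetch failed, retrying later:", exc)  # tolerate transient errors
        time.sleep(60)  # poll politely at a fixed interval
```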

Next is processing the data into an intermediary representation. This is necessary when the amount of data and the number of sources are considerable. This step does a few things. First and foremost, it integrates data: it links entities from different data sources that refer to one another, which is desirable because we do not want to redo that work over large amounts of data every time we want to ask a question (a sketch follows below). Secondly, it ensures scalability: for sufficiently large amounts of data and sufficiently complex pipelines, it is not feasible to run everything on a single laptop (or phone). The problems here are problems of scale.
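
As a small illustration of the integration step, here is a sketch that links records from two hypothetical sources on a normalised key using pandas. The tables, column names, and values are assumptions made for the example.

```python
import pandas as pd

# Two hypothetical sources describing the same entities.
crm = pd.DataFrame({
    "email": ["Ada@Example.com", "bob@example.com"],
    "name": ["Ada Lovelace", "Bob Ross"],
})
billing = pd.DataFrame({
    "email": [" ada@example.com", "BOB@example.com"],
    "total_spend": [120.0, 40.0],
})

# Normalise the join key once, up front, so later questions can be
# answered against the integrated table instead of the raw sources.
for df in (crm, billing):
    df["entity_key"] = df["email"].str.strip().str.lower()

integrated = crm.merge(billing[["entity_key", "total_spend"]], on="entity_key")
print(integrated[["entity_key", "name", "total_spend"]])
```

Doing this once, at ingestion time, is exactly the work the intermediary representation saves us from repeating per question.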

Lastly, the production step. This step produces data for further use: for reports, as training data for models, or as finished models to integrate into an application. The data platform could also expose endpoints for real-time data insight. The problems here are of a semantic nature: how do we ensure the correctness of produced claims? Articles relevant to this are to follow; in the meantime, the sketch below shows one narrow, practical check.
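
The sketch treats a produced table as carrying claims and asserts them before publication. The column names continue the hypothetical integration example above; this is one possible check, not a complete answer to the correctness question.

```python
import pandas as pd

def validate_output(df: pd.DataFrame) -> pd.DataFrame:
    """Cheap semantic checks run before a produced data set leaves the platform."""
    assert {"entity_key", "total_spend"} <= set(df.columns), "schema drift"
    assert df["entity_key"].is_unique, "an entity appears twice in the output"
    assert (df["total_spend"] >= 0).all(), "a negative spend would be a false claim"
    return df
```

Calling `validate_output` as the last stage of the pipeline turns silent data errors into loud failures before anyone reads a report built on them.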

Service Architectures

There are a number of ways data is put to use. One is through written pieces, as when data journalists interpret data and build a story around it. Another is assembling predictive services from machine-learned models (a minimal sketch follows below).
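
As a toy illustration of the second case, here is a minimal prediction service, assuming Flask. The route, the payload shape, and the stubbed-out model are all assumptions made for the example; a real service would load a trained model from disk.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict(features):
    # Stand-in for a trained model; returns the mean of the inputs as a "score".
    return {"score": sum(features) / max(len(features), 1)}

@app.route("/predict", methods=["POST"])
def predict_endpoint():
    payload = request.get_json()
    return jsonify(predict(payload["features"]))

if __name__ == "__main__":
    app.run(port=8000)  # development server only
```

A request such as `curl -X POST -H 'Content-Type: application/json' -d '{"features": [1, 2, 3]}' http://localhost:8000/predict` would return the stub score.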