This page collects my thoughts on the theme of data science and engineering. Data science is multi-disciplinary, ranging from philosophical and theoretical topics such as ontology to practical considerations about scaling a platform to meet requirements.
This page is evolving. It takes my individual posts on the subject and aggregates them into a coherent whole, one that often changes shape.
We can consider everything stored on a computer to be data, that is, everything that can be represented in binary: programs, configuration files, images, sound, documents, etc. In essence, everything on a computer is data and might yield insights used to solve a problem.
One dimension of data characterization is the signal-to-noise ratio. The ratio is said to be high, i.e. high signal and low noise, when small amounts of data are valuable, while noisy data has many bits that do not provide information.
Canonical examples of high-noise data would be sound and images, while a semantic triple usually has a high signal.
AI applications mostly consider high-signal data. It should be noted that the signal can be increased. This is the task of machine learning. For example, instance recognition in images reduces noise by constructing triples, such as this one:
image.png depicts a chair.
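As a minimal sketch of that idea, the snippet below turns the output of an object detector into triples. The `Detection` type, the `depicts` predicate, the confidence threshold, and the detection values are all assumptions made for illustration. No actual detector is run here.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    """Hypothetical output of an object detector for one image."""
    label: str
    confidence: float


Triple = tuple[str, str, str]  # (subject, predicate, object)


def detections_to_triples(
    image: str, detections: list[Detection], threshold: float = 0.8
) -> list[Triple]:
    """Keep only confident detections and express each one as a triple."""
    return [
        (image, "depicts", d.label)
        for d in detections
        if d.confidence >= threshold
    ]


# Made-up detections standing in for a real model run on image.png.
detections = [Detection("chair", 0.93), Detection("dog", 0.12)]
triples = detections_to_triples("image.png", detections)
print(triples)
```

The many bits of the image are reduced to a single statement, and the low-confidence guess is dropped as noise.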
Depending on the nature of the data, there are several ways to interpret it. Sometimes statistical aggregation suffices, other times we employ an exploratory approach to assess the potential.
As for statistical methods, the most widely used are probably those available in spreadsheets. More generally, however, the Probability Monad provides tooling for writing probability distributions as programs.
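To give a feel for what writing distributions as programs can look like, here is a small sketch of a discrete probability monad. It is not taken from any particular library. The names `unit`, `bind`, and `uniform` are my own choices for illustration, and a distribution is simply assumed to be a dictionary from outcomes to exact probabilities.

```python
from collections import defaultdict
from fractions import Fraction
from typing import Callable, Hashable

# A distribution maps each outcome to its probability.
Dist = dict[Hashable, Fraction]


def unit(value: Hashable) -> Dist:
    """The distribution that yields `value` with certainty."""
    return {value: Fraction(1)}


def bind(dist: Dist, f: Callable[[Hashable], Dist]) -> Dist:
    """Feed every outcome of `dist` into `f` and weight the results."""
    result = defaultdict(Fraction)  # missing outcomes start at probability 0
    for outcome, p in dist.items():
        for new_outcome, q in f(outcome).items():
            result[new_outcome] += p * q
    return dict(result)


def uniform(values: list) -> Dist:
    """Each value is equally likely (assumes the values are distinct)."""
    return {v: Fraction(1, len(values)) for v in values}


die = uniform([1, 2, 3, 4, 5, 6])

# The sum of two dice, written as a small probabilistic program.
two_dice = bind(die, lambda a: bind(die, lambda b: unit(a + b)))

print(two_dice[7])  # probability that the two dice sum to 7
```

Using `Fraction` keeps the probabilities exact, so the distribution sums to exactly one. A real implementation would likely also want continuous distributions and sampling, which this sketch leaves out.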
Other techniques are those based on networks. Modeling data as networks formatted under an ontology can be advantageous as it lays out clear semantics of the data. It clearly specifies what classes entities in the data belong to and how they relate to other entities.
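A minimal sketch of what "formatted under an ontology" can mean in practice is shown below. The ontology, the entity names, and the relations are hypothetical. Each relation declares which class it may start from and which class it may point to, and edges are checked against that declaration.

```python
# Hypothetical mini-ontology: relation -> (domain class, range class).
ONTOLOGY = {
    "owns": ("Person", "Account"),
    "transfers_to": ("Account", "Account"),
}

# Every entity is declared with the class it belongs to.
entity_class = {
    "alice": "Person",
    "acc-1": "Account",
    "acc-2": "Account",
}

# The network itself: (subject, relation, object) edges.
edges = [
    ("alice", "owns", "acc-1"),
    ("acc-1", "transfers_to", "acc-2"),
    ("acc-2", "owns", "alice"),  # violates the ontology on purpose
]


def conforms(edge):
    """True if the edge uses a known relation between the right classes."""
    subject, relation, obj = edge
    if relation not in ONTOLOGY:
        return False
    domain, range_ = ONTOLOGY[relation]
    return entity_class.get(subject) == domain and entity_class.get(obj) == range_


valid = [e for e in edges if conforms(e)]
invalid = [e for e in edges if not conforms(e)]
print(invalid)
```

The point of the design is that the semantics live in one place (`ONTOLOGY`), so any data source loaded into the network can be checked against the same rules.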
These techniques are being used in fraud detection. Fraud and Networks is an article that touches on some aspects of doing that. Furthermore, the network approach can also be used to increase trust in statistical results. I investigate some of the opportunities in Adversarial Statistics.
Data processing concerns how to process large amounts of data. But more than that, it also concerns how to reuse existing assets. This is essential as it allows for serendipity and for finding patterns across different data sets. To do this, an organization can employ a data platform: a common place to put data assets for cross-leverage between use cases.
Furthermore, an integral problem in data processing is scale. Data often comes in large amounts, so both the storage and the processing architectures need to scale. Scaling can only to some extent be separated from the data and processing techniques, and it is a serious concern when designing data processing architectures.
The Data Platform
For practical purposes the data platform is essential. Not only does it reduce the time needed to change an analysis, it also provides the foundation needed to scale data analysis to span the appropriate data sets.
- The data platform: A deeper walk-through on desirable properties of a data platform.
One part is about processing and ingesting data. This could be loading data into data frames, scraping websites, or utilizing APIs, but also using techniques such as entity recognition and linking to extract information from text, object recognition to extract information from images, and speech recognition in conjunction with NLP techniques to get data out of audio sources. The problems at this level are very engineering-specific, such as keeping code up to date.
- Building a Scraper: Thoughts on the process of building a scraper that needs to run continuously and provide new data.
Next is processing the data into an intermediary representation. This is necessary when the amounts of data and the number of sources are considerable. This step does a few things. First and foremost, it integrates data: it links related entities across different data sources. This is desirable as we do not want to redo this linking for large amounts of data every time we want to ask a question. Secondly, it ensures scalability. For sufficiently large amounts of data and sufficiently complex pipelines, it is not feasible to run everything from a single laptop (or phone). Problems here are about scale.
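The integration part of this step can be sketched as follows. This is a deliberately naive illustration, not a description of any particular platform: the sources, the field names, and the `normalize` linking key are all hypothetical, and real entity linking usually needs far more than string normalization.

```python
from collections import defaultdict


def normalize(name: str) -> str:
    """Naive linking key: lowercase the name and collapse whitespace."""
    return " ".join(name.lower().split())


def integrate(*sources: list[dict]) -> dict[str, dict]:
    """Merge records that share a linking key into one entity each.

    Later sources overwrite earlier ones when fields collide.
    """
    merged = defaultdict(dict)
    for source in sources:
        for record in source:
            merged[normalize(record["name"])].update(record)
    return dict(merged)


# Two made-up sources that describe the same company differently.
crm = [{"name": "Acme  Corp", "country": "DK"}]
registry = [{"name": "acme corp", "employees": 42}]

# Done once, stored as the intermediary representation, and reused
# by every later question instead of being recomputed each time.
entities = integrate(crm, registry)
print(entities)
```

Once the linked entities are stored, later analyses can start from `entities` rather than from the raw sources.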
Lastly, the production step. This step produces data for further use. This could be data for reports, data for training models, or the actual models to integrate into an application. The data platform could also produce endpoints for real-time data insight etc. Problems here are of a semantic nature. How do we ensure the correctness of produced claims? Articles relevant to this are.
There are a number of ways data is put to use. One way is through written work, such as a data journalist producing a piece from data they interpret. Another is assembling predictive models using machine-learned models.