TL;DR: If we are a bit careful about provenance, we can view a knowledge graph as the virtual machine for a probabilistic program with the ontology defining the semantics.

The fundamental idea is that we write programs in a simple probabilistic programming language whose only composition pattern is the probability monad. Functions, written as distributions, may draw from themselves, which makes it possible to traverse inductive structures. In the end, we write distributions over entities. A sketch of this core follows below.
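
Here is a minimal sketch of that core in Haskell, representing a distribution as a weighted list of outcomes. The names (`Dist`, `coin`, `countdown`) are illustrative, not the language's actual implementation:

```haskell
-- A distribution is a weighted list of outcomes.
newtype Dist a = Dist { runDist :: [(a, Double)] }

instance Functor Dist where
  fmap f (Dist xs) = Dist [(f x, w) | (x, w) <- xs]

instance Applicative Dist where
  pure x = Dist [(x, 1.0)]
  Dist fs <*> Dist xs = Dist [(f x, wf * wx) | (f, wf) <- fs, (x, wx) <- xs]

instance Monad Dist where
  -- Bind draws from the outer distribution, then from the distribution
  -- it yields, multiplying the weights along the way.
  Dist xs >>= f = Dist [(y, wx * wy) | (x, wx) <- xs, (y, wy) <- runDist (f x)]

coin :: Dist Bool
coin = Dist [(True, 0.5), (False, 0.5)]

-- Functions may draw from themselves, traversing an inductive structure:
countdown :: Int -> Dist Int
countdown 0 = pure 0
countdown n = coin >>= \heads -> if heads then pure n else countdown (n - 1)
```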

We provide an expectation function, which takes a distribution and a measure function and returns the average of the measure over the distribution. For example, over companies, the measure function could yield 1.0 when a company is bankrupt and 0.0 otherwise; the result would then be the weighted proportion of bankrupt companies in the distribution.
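
Continuing the `Dist` sketch above, expectation and the bankruptcy example might look like this (`expect`, `Company`, and `bankrupt` are illustrative names):

```haskell
data Company = Company { companyName :: String, bankrupt :: Bool }

-- Expectation: the weighted average of a measure over a distribution.
expect :: Dist a -> (a -> Double) -> Double
expect (Dist xs) f = sum [w * f x | (x, w) <- xs] / sum [w | (_, w) <- xs]

-- The example from the text: the weighted proportion of bankrupt companies.
bankruptcyRate :: Dist Company -> Double
bankruptcyRate d = expect d (\c -> if bankrupt c then 1.0 else 0.0)
```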

Built-in operations are based on the meta-ontology, which defines what values the properties of an ontology can take. A text field, for example, comes with an equality function. A location field could come with an in-range operator, so that we can write in-range (location company) (location address) 50m or the like.
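
One way to sketch this in Haskell, where the field kind determines the operators a property supports (`FieldValue`, `eqField`, and `inRange` are illustrative names, and the distance check uses a flat-plane approximation for brevity):

```haskell
-- Field kinds of the meta-ontology, each with its own operators.
data FieldValue
  = TextField String              -- supports equality
  | LocationField Double Double   -- coordinates in metres, supports range checks

-- Equality on text fields.
eqField :: FieldValue -> FieldValue -> Bool
eqField (TextField a) (TextField b) = a == b
eqField _ _ = False

-- in-range (location company) (location address) 50m, roughly:
inRange :: FieldValue -> FieldValue -> Double -> Bool
inRange (LocationField x1 y1) (LocationField x2 y2) r =
  sqrt ((x1 - x2) ** 2 + (y1 - y2) ** 2) <= r
inRange _ _ _ = False
```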

We augment the system with two basic distributions. The first yields elements drawn uniformly at random from a given ontology class; the second yields exactly and only the entity from which it was constructed: it is the singleton distribution.
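
In the sketch above, modelling a class by its extension (the list of its entities), these two primitives could be written as follows; note that the singleton distribution is just the monad's `pure`:

```haskell
type Entity = String

-- Uniform distribution over the entities of a class.
uniformClass :: [Entity] -> Dist Entity
uniformClass es = Dist [(e, 1.0 / fromIntegral (length es)) | e <- es]

-- The singleton distribution: exactly and only the given entity.
singletonEntity :: Entity -> Dist Entity
singletonEntity = pure
```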

Furthermore, relations between classes in the ontology represent functions from distributions to distributions. For example, given a knowledge graph of companies with a hasEmployee relation linking companies to persons, the language would provide a function hasEmployee : Dist Company -> Dist Person that maps a distribution over companies to a distribution over persons.
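
Assuming a relation is a lookup from an entity to its related entities, binding through the monad lifts it to the Dist-to-Dist function the language exposes (`liftRelation` and `hasEmployeeRel` are illustrative names):

```haskell
-- Lift a graph relation to a function on distributions.
liftRelation :: (a -> [b]) -> Dist a -> Dist b
liftRelation rel d = d >>= \x -> uniformOver (rel x)
  where
    -- Note: if rel x is empty, this yields the empty distribution;
    -- see the discussion of empty distributions below.
    uniformOver ys = Dist [(y, 1.0 / fromIntegral (length ys)) | y <- ys]

-- hasEmployee :: Dist Company -> Dist Person
-- hasEmployee = liftRelation hasEmployeeRel
```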

Lastly, error handling: distributions might be empty. Handling empty distributions in different ways can have a tangible impact on the resulting distributions, so we need to be careful about how we do it. In the initial version of the language, we handle these situations explicitly in order to surface the decisions.
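
One illustrative way to make the decision explicit, rather than letting the expectation above silently divide by zero, is to force the caller to handle the empty case:

```haskell
-- Expectation that surfaces emptiness as an explicit error.
expectE :: Dist a -> (a -> Double) -> Either String Double
expectE (Dist []) _ = Left "empty distribution: no entities matched"
expectE d         f = Right (expect d f)
```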

Outside of the core language, there are concerns the execution system must also address. One such concern is provenance: the statistical results are only as precise as the data the program is run on. A library is made available to create the dataset (the execution runtime). This library comes with tools to help ensure soundness and completeness with respect to the data source, and ways to indicate the soundness and completeness that was achieved.
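
As a rough sketch of what such provenance annotations could look like, with every name here hypothetical: each fact carries its source, and the dataset records the soundness and completeness claimed with respect to that source.

```haskell
-- What the dataset claims about itself relative to its source.
data Coverage = Sound | Complete | SoundAndComplete | Unknown

-- A fact annotated with the source it was extracted from.
data Fact = Fact
  { subject   :: String
  , predicate :: String
  , object    :: String
  , source    :: String }

-- The execution runtime: the facts plus their declared coverage.
data Dataset = Dataset
  { facts    :: [Fact]
  , coverage :: Coverage }
```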