What are the foundational structures of probability? How do we design a language that makes it easy to model probabilistic problems? Oftentimes the modeling happens directly in terms of vectors and matrices, but there are better ways.
In this article, we explore a structure, the monad, that captures probability distributions. It is based on the paper Stochastic Lambda Calculus and Monads of Probability Distributions by Ramsey and Pfeffer.
The source for this article is available as a gist on Github.
A monad is a mathematical structure with wide applications in functional programming. The reason is that it helps to hide away some state and provides primitives for sequencing computations in a linear fashion.
It is useful to know about monads, not just for understanding the probability monad, but because they are an important tool for architecting applications using functional techniques.
Monads are a well-covered subject, and as such, I will not elaborate on them beyond their algebraic structure. For a proper overview, plenty of resources exist.
For our practical use of the monad, we need to implement the following functions in a way that satisfies the monad laws.
(>>=)  :: m a -> (a -> m b) -> m b  -- pronounced bind
return :: a -> m a
These are what constitute a monad.
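For reference, these are the monad laws the implementation has to satisfy, written as equations:

return a >>= k   ==  k a                       -- left identity
m >>= return     ==  m                         -- right identity
(m >>= k) >>= h  ==  m >>= (\x -> k x >>= h)   -- associativity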
When talking about events and the probability that they happen, we often talk in terms of distributions. In its most abstract form, a distribution is simply a total function, i.e. one where all inputs are mapped to outputs. This is also the view we take here, and the monad turns out to be a way of packaging these programs so that only certain operations are allowed on them.
We only discuss discrete probability distributions here. This makes them considerably easier to reason about, but also poses some limitations, as continuous distributions can then only be approximated to an arbitrary degree of precision.
On probability distributions, we want to perform certain queries. Some often-used ones are the expectation, support, and sample queries. The support query is much easier to think about when only considering discrete distributions.
Lastly, we also need an operation for building distributions. In this work, we start from a single function called choose, which makes a binary choice between two distributions with a certain probability.
To terminate a chain of choices, we use the return construction, which is the same as the Dirac distribution. In some papers and languages, dirac is used as the name for return.
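As a rough sketch, this building block can be captured in a type class like the following, where the class name follows the paper and Probability is assumed to be a synonym for Double:

type Probability = Double

class Monad m => ProbabilityMonad m where
  -- choose p d1 d2 picks from d1 with probability p
  -- and from d2 with probability 1 - p
  choose :: Probability -> m a -> m a -> m a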
To get an overview of the implementation of the probability monad it makes sense to look at the class hierarchy we are considering.
The following figure shows the monads we are implementing.
The lowermost three boxes are the most interesting. Those are the queries, the operations we can use on a distribution to make sense of it.
The Expectation Monad
This section elaborates on a measure-theoretic approach to the probability monad. We represent the distribution as a continuation that takes a measure function and returns an expectation.
First, we will take a look at our monad type.
-- Probability monad
newtype PExp a = PExp ((a -> Double) -> Double)
It is worth looking into what happens here. We have the constructor PExp, which takes a function, a measure function, and gives back the expectation as a Double.
The monadic structure of probability distributions is, in this setting, implemented as follows.
instance Monad PExp where
  return x = PExp (\h -> h x)
  (PExp d) >>= k = PExp (\h ->
    let apply (PExp f) arg = f arg
        g x = apply (k x) h
    in  d g)
From here, the expectation monad is easily implemented, as the whole monad type is built around it.
instance ExpMonad PExp where
  expectation h (PExp d) = d h
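As a small, hypothetical usage example (the coin distribution below is not from the gist), we can write a fair coin directly as an expectation functional and query its mean:

-- A fair coin over 0 and 1, written directly as an expectation functional.
coin :: PExp Int
coin = PExp (\h -> 0.5 * h 0 + 0.5 * h 1)

-- The mean of the coin: the expectation of the identity measure, i.e. 0.5.
coinMean :: Double
coinMean = expectation fromIntegral coin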
We could now start to do experiments based on this. But we would much rather like to play with the support and sample queries as well. These, however, are quite difficult to implement using the PExp type, so we are going to attack this from another angle.
The above implementation of the monad type is not well suited for anything other than the expectation query. To make something more suitable, we will try to stay true to the paper and implement a type that holds on to the data as it is defined:
{-# LANGUAGE GADTs #-}

data P a where
  R :: a -> P a                           -- return (the Dirac distribution)
  B :: P a -> (a -> P b) -> P b           -- bind
  C :: Probability -> P a -> P a -> P a   -- choose
The quick reader will see that we now use GADTs instead of ordinary ADTs. This is due to the B (bind) constructor: its result type P b differs from the type parameter of its first argument, a change of type that cannot be expressed with ordinary ADTs.
In the instances for the monad and the probability monad, we simply defer everything to the constructors, which are then used directly by the queries.
-- P is a monad
instance Monad P where
  return x = R x
  d >>= k  = B d k

-- P is a probability monad
instance ProbabilityMonad P where
  choose p d1 d2 = C p d1 d2
This allows the definitions of the queries to be copied directly from the paper. In addition to the expectation query we had before, we can now also implement the support query.
instance SupportMonad P where
  support (R x)       = [x]
  support (B d k)     = concat [support (k x) | x <- support d]
  support (C p d1 d2) = support d1 ++ support d2
...
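The elided sample query recurses over the constructors in the same way. A sketch, assuming a SamplingMonad class as in the paper, with sample threading a random generator (the exact code in the gist may differ):

import System.Random (RandomGen, randomR)

instance SamplingMonad P where
  sample (R x)       g = (x, g)
  sample (B d k)     g = let (x, g') = sample d g in sample (k x) g'
  sample (C p d1 d2) g = let (r, g') = randomR (0, 1) g
                         in  sample (if r < p then d1 else d2) g'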
For the full source I once again refer to the gist on Github.
In this section, we will take a look at how to construct some interesting distributions and how the queries work on them.
As so often when working with probabilities, we will look at dice. To make it easy, we make a data type deriving Enum for easy enumeration.
data Dice = One | Two | Three | Four | Five | Six
  deriving (Enum, Eq, Show, Read, Ord)
From here we can make a uniform distribution over the sides of a die as follows.
dist :: P Dice
dist = uniform [One .. Six]
The uniform function is a simple helper that builds a uniform distribution over a list, using the choose function and the length of the list.
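One possible implementation, assuming the choose function from above: pick the first element with probability 1/n and otherwise recurse on the remaining n - 1 elements (the gist may define it differently):

-- Uniform distribution over a non-empty list.
uniform :: [a] -> P a
uniform [x]      = return x
uniform (x : xs) = choose p (return x) (uniform xs)
  where p = 1 / fromIntegral (length (x : xs))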
A support query can be performed as follows.
example01a = let dist :: P Dice
                 dist = uniform [One .. Six]
             in support dist
This yields the result [One,Two,Three,Four,Five,Six]. To create a posterior distribution from a prior, we use bind. This translates into elegant and easily readable dependent distributions.
example02a = let dist :: P Dice
                 dist = do d <- uniform [One .. Six]
                           return (if d == Six then One else d)
             in support dist
This yields the result [One,Two,Three,Four,Five,One]. Note how One appears twice. This makes immediate sense from the definition of the distribution, though we might have expected duplicates to be removed. However, this is quite easily done in Haskell, as sketched below.
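As a minimal sketch, nub from Data.List does the deduplication (example02b is a hypothetical name, reusing the distribution from above):

import Data.List (nub)

example02b :: [Dice]
example02b = nub (support dist)   -- [One,Two,Three,Four,Five]
  where dist = do d <- uniform [One .. Six]
                  return (if d == Six then One else d)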
The last example in the gist is one of the more interesting ones. The distribution defined is potentially infinite. Intuitively, it stops at the current value with probability one half, and otherwise counts up and continues.
walk x = do bit <- uniform [True, False]
            if bit then return x
                   else walk (x + 1)
So what happens when we throw the various queries at it?
The support query should return a list of all potential outcomes of the distribution, and so it does. Running it yields [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,^CInterrupted. If not stopped, it would continue counting up.
An important remark here concerns the order of True and False in the inner distribution. If we switch it such that the inner distribution reads uniform [False, True], the query will never return anything. Intuitively, the recursive branch is then unfolded first, so no element is ever produced. In the current ordering, the walk function creates a list which is reducible, though infinite.
Secondly, we have the expectation query. Issuing this yields an infinite computation without any result, because we attempt to unroll the full structure to get the expectation of an event.
Lastly, we have the sample query. This query works completely as expected, yielding a list of samples. Issuing example03c returns [0,1,3,0,0,2,0,1,0,1], which intuitively makes sense.
Though the expectation query is not feasible in the current implementation, we can still get an idea about the expectation using the Monte Carlo technique. Sampling 10,000 elements of the walk distribution and taking the frequencies gives us the following.
> mc
[(0,4937), (1,2483), (2,1294), (3,630), (4,326), (5,159), (6,87), (7,34),
 (8,27), (9,10), (10,5), (11,5), (12,2), (15,1)]
This is roughly what we would expect: the walk distribution is geometric, so each count is about half of the previous one.
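The mc in the session above is essentially just a tally of the samples. A minimal sketch of such a frequency count, assuming the samples come back as a plain list (Data.Map is used here only for the counting):

import qualified Data.Map.Strict as Map

-- Tally a list of samples into ascending (value, count) pairs.
mc :: Ord a => [a] -> [(a, Int)]
mc = Map.toAscList . Map.fromListWith (+) . map (\x -> (x, 1))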
We have discussed Ramsey and Pfeffer's paper in this article. It provides a fundamental, programming-language-based understanding of distributions and probabilities. It is interesting because it makes probabilistic languages like Hakaru much easier to understand. Furthermore, it greatly simplifies reasoning about the whole world of probabilities.