-
Notifications
You must be signed in to change notification settings - Fork 1
/
propbody.tex
150 lines (110 loc) · 12.2 KB
/
propbody.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
% !TeX root = ./proposal.tex
\vfil
\begin{center}
{\Large Computer Science Tripos -- Part II -- Project Proposal} \\
\vspace{0.4in}
{\huge \bf A probabilistic programming language in OCaml } \\
\vspace{0.4in}
% {\large A. Roy, Christ's College} \\
{\large 2349E}
\vspace{0.1in}
% {\large \today} \\
\end{center}
\vspace{0.4in}
\vfil
\textbf{Project Originator:} Dr. R. Mortier
\vspace{0.1in}
\textbf{Project Supervisor:} Dr R. Mortier
\vspace{0.1in}
% \textbf{Director of Studies:} Dr R. Mortier
\textbf{Director of Studies:} (name removed)
\vspace{0.1in}
\textbf{Project Overseers:} Dr~J.~A.~Crowcroft \& Dr~T.~Sauerwald
\vfil
\clearpage
% Main document
\section*{Introduction}
A probabilistic programming language (PPL) is a framework in which one can create statistical models and have inference run on them automatically. A PPL can take the form of its own language (i.e. a separate DSL), or be embedded within an existing language (such as OCaml). The ability to write probabilistic programs within OCaml would allow us to leverage the benefits of OCaml, such as expressiveness, a strong type system, and memory safety. The use of a numerical computation library, Owl, will allow us to perform inference in a performant way.
PPLs work well when working with \textit{generative} models, meaning the model describes how some data is generated. This means the model can be run 'forward' to generate outputs based on the model. The more interesting application, however, is to run it 'backwards' in order to infer a distribution for the model.
The power of a probabilistic programming language comes in being able to describe models using the programming language - in my case, probabilistic models would be described in OCaml code, with basic distributions being able to be combined using functions and operators as in OCaml code.
\section*{Starting Point}
There do exist PPLs for OCaml, such as IBAL \cite{ibal}, as well as PPLs for other languages, such as WebPPL - JS \cite{mobus2018structure}, Church - LISP \cite{goodman2012church} or Infer.Net - F\# \cite{wang2011using} to name a few. My PPL can draw on some of the ideas introduced by these languages, particularly in implementing efficient inference engines.
I will be using an existing OCaml numerical computation library (Owl). This library does not contain methods for probabilistic programming in general, although it does contain modules which will help in the implementation of an inference engine such as efficient random number generation and lazy evaluation.
I have experience with the core SML language, which will aid in learning basic OCaml due to similarities in the languages, however I will still have to learn the modules system. 1B Foundations of Data Science also gives me an understanding of basic statistics and Bayesian inference. I do not have experience with domain specific languages in OCaml, although the 1B compilers course did implement a compiler in OCaml.
\section*{Substance and Structure of the Project}
I will be building a PPL in OCaml, essentially writing a domain specific language. There are 2 main components to the system, namely the modelling API (language design) and the inference engine.
\subsection*{Modelling}
The modelling API is used to represent a statistical model. For example, in mathematical notation, a random variable representing a coin flip may be represented as $X \sim N(0,1)$, but in a PPL we need to represent this as code. An example would be
\begin{verbatim}
Variable<double> x = Variable.GaussianFromMeanAndVariance(0, 1)
\end{verbatim}
in the Infer.Net language. In OCaml, there will be many different options for representing distributions, and a choice will need to be made about whether to create a separate domain specific language (DSL) or whether to embed the language in OCaml as a library.
I will also need to make sure the design of the modelling language is suitable for the implementation of the inference engine. For example, WebPPL \cite{mobus2018structure} uses a continuation passing style transformation, recording continuations when probabilistic functions are called in order to build an execution trace. This then allows the engine to perform inference. A similar approach could be applied here. There are many alternative approaches to build execution traces, such as algebraic effects \cite{DBLP:journals/corr/abs-1811-06150} or monads \cite{scibior2015practical}, and one such method will need to be chosen. However I decide to implement this, I will need to ensure that features of OCaml can be used appropriately.
\subsection*{Inference Engine}
There are many different options for a possible inference engine. A decision also need to be made about whether to use a trace-based model (as mentioned) or a graph based model (such as Edward, where a computational graph is generated). This decision will need to be made before implementing the inference engine since it will affect the modelling language.
In both these cases, I will need to decide how to convert from a program into a data structure that allows inference to be performed, and then actually carry it out. Ideally, the inference algorithm used is separated from the definition of the models, so that different algorithms can be chosen, or new algorithms added in the future.
\section*{Evaluation}
The PPL developed here will be compared to existing PPLs - for example, IBAL (written in OCaml), comparing performance for programs describing the same models. I will also use the PPL developed on example problems in isolation to ensure it can be used correctly and delivers correct results. An example would be to use it on an established dataset (e.g. the stop-and-search dataset used in 1B Foundations of Data Science) to attempt to fit a model.
I will also want to quantify exactly what kind of problems need to be supported by my PPL and make sure these kind of programs can be run. I will also support a minimum number of standard distributions, e.g. bernoulli, normal, geometric, etc. or enable users to define custom distributions.
\subsection*{Success Criteria}
The project will succeed if a usable probabilistic programming language is created. Usable is defined by the following:
\begin{itemize}
\item \textbf{Language Features}: I will aim to support some subset of language features, such as 'if' statements (to allow models to be conditional), operators and functions. Some of these features may only be available by special added keywords (e.g. a custom 'if' function).
\item \textbf{Available distributions}: I will aim to make sure my PPL has at minimum the bernoulli and normal distributions available as basic building blocks to build more complex probabilistic programs.
\item \textbf{Correctness of inference}: I will use the PPL developed on sample problems mentioned before to ensure correct results are produced. This would be determined by comparing to results produced in other PPLs. I will aim to include at least one inference algorithm.
\item \textbf{Performance}: This is a quantitative measure, comparing programs written in my PPL to equivalent programs in other PPLs. I can use the spacetime program to profile my OCaml code. Performing inference should be possible within a reasonable amount of time, even though the project does not have a significant focus on performance. I will also benchmark the performance with regards to scalability, i.e. ensure the performance is still reasonable as traces/graphs get larger.
\end{itemize}
\section*{Extensions}
There are several extensions which could be considered, time permitting:
\begin{enumerate}
\item There could be more options for the inference engine, i.e. implementing more than one inference algorithm. Different algorithms are suited to different inference tasks, so this would be a worthwhile extension
\item Optimisations could be considered to ensure the performance of inference was better than other comparable languages - especially looking into making use of multicore systems.
\item I could add more distributions, as well as the ability to create custom distributions
\item Include the ability to visualise results using the plotting module in owl.
\item Include the ability to visualise the model in which inference is being performed (e.g. the factor graph)
\end{enumerate}
\section*{Schedule}
Planned starting date is {28/10/19}, the Monday after handing in project proposal. Work is broken up into roughly 2 week sections.
\subsection*{Michaelmas Term}
\begin{itemize}
\item {\bf Weeks 3-4 (28/10/19 -- 10/11/19)} \\ Set up IDE and local environment - installing Owl and practicing using it. Read the first 10 chapters of Real World OCaml, available online. Read papers on past PPLs and implementations, both ocaml or otherwise. Set up project repository and directory structure.
\\ \textit{Milestone: learn OCaml basics, set up project}
\item \textbf{Weeks 5-6 (11/10/19 -- 24/11/19)}\\ Design a basic modelling API and write module/function signatures. Decide how to implement this (e.g. DSL vs library). Specify what language features I will include as a baseline. Research inference algorithms and make sure they will fit into the modelling API designed.
\\ \textit{Milestone: specification of which language features and inference algorithms will be implemented}
\item \textbf{Weeks 7-8 (25/10/19 -- 08/12/19)}\\ Begin to implement the modelling API, and allow running a model 'forward', i.e. generating samples.
\\ \textit{Milestone: A basic working DSL}
\end{itemize}
\subsection*{Christmas Holidays}
\begin{itemize}
\item \textbf{Weeks 1-2 (09/12/19 -- 22/12/19)}\\ Begin to implement a basic inference algorithm (such as MCMC) allowing programs to be run 'backwards' to infer parameters.
\item \textbf{Weeks 3-4 (23/12/19 -- 05/01/20)} [\textit{Christmas Break}]
\item \textbf{Weeks 5-6 (06/01/20 -- 19/01/19)}\\ Aim to finish the main bulk of implementation and get a baseline system working by the end of this week and consider extensions if finished early. Begin writing up progress report.
\\ \textit{Milestone: finish baseline implementation}
\end{itemize}
\subsection*{Lent Term}
\begin{itemize}
\item \textbf{Weeks 1-2 (20/01/19 -- 02/02/20)}\\ Finish progress report and implementation as well as any extensions, time permitting. Use the PPL developed on example problems in order to evaluate it, comparing against problems in other languages.
\\ \textit{Milestone: Progress report deadline (31/01/19)}
\item \textbf{Weeks 3-4 (03/02/20 -- 16/02/20)}\\ Prepare for the presentation, begin planning the dissertation, particularly the structure and the content I need to write for each section. Begin writing, starting with the first sections (i.e. introduction and preparation).
\\ \textit{Milestone: Progress report presentations (06/02/20), finish introduction and preparation}
\item \textbf{Weeks 5-6 (17/02/20 -- 01/02/20)}\\ Finish writing up the bulk of the implementation section.
\\ \textit{Milestone: Finish implementation section}
\item \textbf{Weeks 7-8 (02/03/20 -- 15/03/20)}\\ Complete first draft of dissertation, finish the evaluation and conclusion sections and complete any unfinished tasks.
\\ \textit{Milestone: Finish first draft}
\end{itemize}
\subsection*{Easter Holidays}
\begin{itemize}
\item \textbf{Weeks 1-6 (16/03/20 -- 26/04/20)}\\ Improve dissertation based on supervisor feedback
\end{itemize}
\subsection*{Easter Term}
\begin{itemize}
\item \textbf{Weeks 1-2 (27/04/20 -- 07/04/20)}\\ Finalise disseration after proof reading and hand in.
\\ \textit{Milestone: Electronic Submission deadline (08/04/20)}
\end{itemize}
\section*{Resources Required}
\paragraph*{Hardware}
I intend to use my personal laptop for the main development and subsequent write up (HP Pavilion 15, 8GB RAM, i5-8265U CPU, running Ubuntu and Windows dual booted).
\paragraph*{Software}
The required software includes the ocaml compiler, with a build system (dune) and a package manager (opam). I will also use the IDE VSCode with an OCaml extension, as well as git for version control and latex for the write up.
\paragraph*{Backups}
For backups, I will use GitHub to host my git repository remotely, pushing frequently. I will also backup weekly to a USB stick in case of failures. The software I require is available on MCS machines, so I'll be able to continue work in the event of a hardware failure with my laptop.