
Mike Lee Williams on Probabilistic Programming, Bayesian Inference, and Languages Like PyMC3

Aug 31, 2018

Probabilistic Programming has been discussed as a programming paradigm that treats statistical approaches to dealing with uncertainty in data as a first-class construct. On today’s podcast, Wes talks with Mike Lee Williams of Cloudera’s Fast Forward Labs about Probabilistic Programming. The two discuss how Bayesian Inference works, how it’s used in Probabilistic Programming, production-level languages in the space, and some of the implementations/libraries that we’re seeing.

Key Takeaways

Federated machine learning is an approach in which models are trained on edge devices and only the models are returned to a centralized location. By averaging the edge models, you can protect privacy and distribute the work of building models.
Probabilistic Programming is a family of programming languages that make statistical problems easier to describe and solve.
It is heavily influenced by Bayesian Inference, an approach to experimentation that turns what you know before the experiment and the results of the experiment into concrete answers on what you should do next.
The Bayesian approach to unsupervised learning comes with the ability to measure uncertainty (or the ability to quantify risk).
Most of the tooling used for Probabilistic Programming today is highly declarative. “You simply describe the world and press go.”
If you have a practical, real-world problem today for Probabilistic Programming, Stan and PyMC3 are two languages to consider. Both are relatively mature languages with great documentation.
Prophet, a time-series forecasting library built at Facebook as a wrapper around Stan, is a particularly approachable place to use Bayesian Inference for general-purpose forecasting use cases.


01:55 I did my PhD in astrophysics too many years ago; it turns out that astrophysics is a great preparation for a career in data science.
02:05 Firstly, there are a lot of stars out there - it’s a big data problem.
02:15 Secondly, there are problems with the data all the time.
02:25 This mirrors what we see in tech - big data, with various problems, trying to draw conclusions from it that allow you to take actions - publish or perish in the academic world.
02:40 It turns out the preparation is appropriate.
02:45 The downside of astronomy is that many of us don’t have a CS background, we’re self-taught coders, we maybe don’t know what Big O notation is.
02:55 It’s something I’ve personally worked on with my own skillset - one of the challenges in migrating from astrophysics.

Many CS developers feel that there are gaps in their knowledge.
03:20 My PhD was measuring how much matter is missing in galaxies.
03:30 I used to say that what I was doing was saying accurately how big the hole is.
03:35 Particle physicists and theoretical physicists tried to fill in the hole.

What is federated learning?
04:00 At Fast Forward Labs, we publish a report four times a year that we think will be of interest to our subscribing clients.
04:10 Federated learning is what we’re working on right now.
04:15 Imagine you are training your learning model on an edge device, like a mobile phone, which has the training data.
04:20 The traditional way of handling this is to ship the training data to a central computer somewhere from all the mobile devices, in order to train the model.
04:30 The problem with that is that most people are uncomfortable with that idea.
04:45 In certain legal areas like healthcare it’s not even possible.
04:50 We’ve all got data plans with finite amounts of data so we don’t want to ship a lot of data to a central place.
05:00 Federated learning comes in to train little local models, that aren’t very big, on a variety of devices.
05:10 Instead of shipping back the training data, you ship back the model, which is a much smaller piece of data.
05:20 Additionally, the model is not the data itself, so it’s not the most privacy-sensitive thing to send.
05:30 There are ways, given a model, of finding out what went into it, but it’s a much harder job - and you can add in differential privacy to give guarantees.
05:45 The central authority’s job is to average the models, and that becomes the global model over all the data, but in a privacy preserving way.
05:50 We preserve data locality and privacy, which has a lot of implications for consumer tech and healthcare.
06:00 Also, in industrial situations where there is a lot of data, and potentially in remote locations or the data connection isn’t stable, the reduction in data can be a big win.
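The averaging step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the "devices", the tiny linear model, the learning rate, and the helper names (make_device, train_local, federated_average) are all hypothetical, and real federated learning systems add secure aggregation, differential privacy, and much larger models.

```python
import random

def make_device(rng, n):
    """Hypothetical private data on one device: noisy samples of y = 2x + 1."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 2 * x + 1 + rng.gauss(0, 0.1)) for x in xs]

def train_local(data, weights, lr=0.1, epochs=50):
    """Fit y ~ w*x + b on one device with plain gradient descent.

    Training starts from the current global weights; only the updated
    weights (never the data) are returned.
    """
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum((w * x + b - y) * x for x, y in data) / n
        grad_b = sum((w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)

def federated_average(local_models, sizes):
    """Average local parameters, weighted by each device's sample count."""
    total = sum(sizes)
    w = sum(m[0] * n for m, n in zip(local_models, sizes)) / total
    b = sum(m[1] * n for m, n in zip(local_models, sizes)) / total
    return (w, b)

rng = random.Random(0)
devices = [make_device(rng, n) for n in (20, 50, 30)]
sizes = [len(d) for d in devices]

global_model = (0.0, 0.0)
for _ in range(10):  # communication rounds between devices and the server
    local_models = [train_local(d, global_model) for d in devices]
    global_model = federated_average(local_models, sizes)

w, b = global_model  # should land close to the true slope 2 and intercept 1
```

The only things that cross the network in this sketch are two numbers per device per round, which is the data-volume and privacy argument made in the notes above.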

What is probabilistic programming?
06:40 Probabilistic programming is a programming paradigm in the same style as functional or object-oriented programming.
06:45 It is a family of programming languages that make a kind of problem easier to describe and solve.
06:55 The problem that probabilistic programming solves is statistical problems - analysing data and drawing conclusions from data.
07:00 The reason that it is difficult in general is that all data is flawed in some way.
07:15 The conclusions that you draw are uncertain because the data might be flawed.
07:20 You may not have all the data - an A/B test isn’t unlimited but rather run over a week, for example.
07:25 You might have data that is dubious in some other way - if you are measuring a physical system with variability, such as the speed of a car, the measurement may not be accurate to a millimetre per second.
07:40 That uncertainty manifests itself as uncertain conclusions, and we’d like to know the bounds of that uncertainty.
07:45 At a high level, what probabilistic programming does is provide a more user-friendly way to answer some of those questions.

How is the A/B testing affected?
08:20 If you do an A/B test, and users convert for layout A 4% of the time, and for layout B 5% of the time, you might conclude that layout B was better.
08:30 But if it turns out that the A/B test was very small - say 5% of the users converted but there were only 100 visitors - then you’re less certain that layout B is better.
08:45 What probabilistic programming gives you is a number that can tell you how uncertain these results are.
09:10 It’s important to note that probabilistic programming is a family of languages, but what we are doing is Bayesian inference, which is a 250-to-300-year-old algorithm.
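To make the 4% versus 5% layout example concrete, here is a minimal sketch that estimates the probability that layout B really converts better than layout A. It assumes a uniform Beta(1, 1) prior on each conversion rate (an assumption for illustration, not something stated in the podcast) and uses only the standard library; the function name prob_b_beats_a is hypothetical.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) given the observed data."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # With a uniform prior, the posterior for a conversion rate is
        # Beta(1 + conversions, 1 + non-conversions).
        p_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        p_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += p_b > p_a
    return wins / draws

# Same observed rates (4% vs 5%), very different amounts of data.
small = prob_b_beats_a(4, 100, 5, 100)
large = prob_b_beats_a(4000, 100000, 5000, 100000)
```

With 100 visitors per layout the answer is only modestly above a coin flip, while with 100,000 visitors per layout it is essentially certain that B is better. That single number is the kind of uncertainty measure the notes describe.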

What is Bayesian inference?
09:30 The Reverend Bayes, hundreds of years ago, came up with Bayes’ rule, which is a relatively simple one.
09:40 It relates the thing you know before the experiment, and the results of the experiment, and tells you what you know after the experiment.
09:50 In the case of the A/B test, you might have a prior belief that the conversion rate is 5%, give or take a couple of percent.
10:05 The data may be at a high level consistent with the beliefs, but the beliefs will necessarily change after that.
10:15 Bayesian inference is a set of machinery for taking the prior beliefs and turning them into beliefs after the experiment (the posterior beliefs).
10:35 It’s a tool for turning what you know before the experiment, and the results of the experiment, into answers about what to do next.
10:45 The equation is really simple - but as a practical matter, its implementation in code can be quite tricky.
11:00 What that implies is that, for you and me, it was quite a difficult approach to use on real problems.
11:15 Reverend Bayes used it to solve simple problems hundreds of years ago, but internet scale problems with large data and large numbers of parameters are difficult computationally to solve.
11:30 Probabilistic programming is exciting, because it abstracts a huge amount of that complexity away and provides a gold standard technique accessible to less sophisticated statisticians.
11:45 Even if the last time you took stats was in high school, this approach is accessible - it is within reach of non-professional statisticians.
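Bayes’ rule itself, posterior proportional to likelihood times prior, can be applied by brute force on a grid of candidate conversion rates. The sketch below assumes a prior of "5%, give or take 2%" modelled as a Gaussian bump, and a hypothetical experiment with 30 conversions from 400 visitors; both the numbers and the function name grid_posterior are illustrative assumptions.

```python
import math

def grid_posterior(prior_mean, prior_sd, conversions, visitors, points=1001):
    """Apply Bayes' rule on an evenly spaced grid of conversion rates."""
    grid = [i / (points - 1) for i in range(points)]

    def prior(p):
        # Unnormalised Gaussian belief about the conversion rate.
        return math.exp(-0.5 * ((p - prior_mean) / prior_sd) ** 2)

    def likelihood(p):
        # Binomial likelihood up to a constant; the grid endpoints are
        # skipped to avoid log(0).
        if p <= 0.0 or p >= 1.0:
            return 0.0
        return math.exp(conversions * math.log(p)
                        + (visitors - conversions) * math.log(1 - p))

    unnormalised = [prior(p) * likelihood(p) for p in grid]
    evidence = sum(unnormalised)  # normalising constant
    return grid, [u / evidence for u in unnormalised]

grid, posterior = grid_posterior(0.05, 0.02, conversions=30, visitors=400)
posterior_mean = sum(p * w for p, w in zip(grid, posterior))
```

The posterior mean lands between the prior belief (5%) and the observed rate (7.5%), which is the "beliefs will necessarily change" behaviour described above. The grid approach works for one parameter but becomes impractical with many, which is the computational difficulty the notes mention.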

The process can be iterative to improve the beliefs?
12:20 It’s a virtuous circle - if you’re in a streaming environment where the data is arriving every day, you don’t have to wait until you have gathered a week’s data to see results.
12:35 You can use the stream of data to update the beliefs as new data comes in.
12:40 As each piece of data arrives, your prior is updated to a posterior distribution.
12:45 That posterior distribution becomes the next prior for the next piece of data coming in.
12:50 The posterior at the end of this sequence of data is available to you throughout the stream as a live picture of your knowledge of the world.
13:05 In the layout test, you know how A and B are doing, and you can end that test early if the results are obvious.
13:10 Perhaps most famously this analysis is done during drug trials - is the drug working, or is it killing people?
13:20 You don’t want to wait until the end of the trial to find out that three people have died, and that is statistically more than you’d expect.
13:30 You want to be able to end the trial as soon as you know the drug is effective (or it is dangerous).
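The "posterior becomes the next prior" loop can be sketched with a Beta distribution over a conversion rate, where each arriving visitor nudges the belief. The stream of outcomes and the uniform starting prior are hypothetical, chosen only to illustrate the update.

```python
def update(belief, converted):
    """One Bayesian update of a Beta(alpha, beta) belief for one visitor."""
    alpha, beta = belief
    return (alpha + 1, beta) if converted else (alpha, beta + 1)

# Hypothetical stream of visitors: 1 = converted, 0 = did not convert.
stream = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]

belief = (1, 1)        # uniform prior before any data arrives
history = []
for converted in stream:
    belief = update(belief, converted)   # posterior becomes the next prior
    alpha, beta = belief
    history.append(alpha / (alpha + beta))  # live estimate of the rate

# Processing the same data in one batch gives exactly the same posterior.
batch = (1 + sum(stream), 1 + len(stream) - sum(stream))
```

Here belief and batch end up identical, and history holds the running estimate that would be available at every point in the stream, for example to decide whether a test can be stopped early.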

So you can bring forward the prediction models to be almost live with the data coming in?
14:05 It’s a coherent way of looking at things - it works in the batch setting as well.
14:10 You can give me a million data points, and I can run the batch analysis telling you whether layout A or layout B is better.
14:20 You can also give me a thousand data points at a time, or a day’s worth of data, and you can get the results with a day’s worth of latency.
14:35 Alternatively, you can go with streaming data and keep this posterior data updated as each visitor arrives.
14:40 Obviously there are engineering practicalities with that approach, but all three of these approaches are using Bayes’ rule.
15:00 It’s a difference of degrees, rather than a new system, but you can migrate from batch analysis to streaming analysis.
15:10 As you say, the really exciting possibility is being able to forecast about the future.
15:15 Bayesian inference or probabilistic programming aren’t the only ways to be able to do this; supervised machine learning in general is capable of doing that.
15:30 The thing I like about it is that Bayesian predictions come with measures of uncertainty.
15:40 If we are doing the layout test, we can imagine predicting how many dollars we are going to make in the future.
15:50 What is at least as useful is a measure of how certain I am about that.
16:00 Do I think I am going to make a million dollars, plus or minus a dollar, or plus or minus a hundred thousand dollars?
16:10 Those have different implications for how the CFO should plan or how much money is needed in the bank.
16:15 You can quantify the risk - in financial situations, that’s very relevant.
16:20 In situations where you are planning for the future, you need resources in proportion to how often something happens.
16:30 Being able to predict when something is going to happen, and how uncertain you are, makes your ability to prepare that much richer and more rigorous.
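The "a million dollars, plus or minus how much?" question can be sketched by pushing posterior samples of the conversion rate through a revenue calculation and reading off an interval. Everything specific here is an assumption for illustration: the price per conversion, the number of future visitors, the uniform prior, and the function name revenue_interval. The sketch also only propagates uncertainty about the rate, not the randomness of individual future visitors.

```python
import random

def revenue_interval(conversions, visitors, future_visitors, price,
                     draws=10000, seed=0):
    """Return (5th percentile, median, 95th percentile) of forecast revenue."""
    rng = random.Random(seed)
    samples = sorted(
        rng.betavariate(1 + conversions, 1 + visitors - conversions)
        * future_visitors * price
        for _ in range(draws)
    )
    return samples[int(0.05 * draws)], samples[draws // 2], samples[int(0.95 * draws)]

# Same observed 5% conversion rate, but a short versus a long data archive.
low_s, mid_s, high_s = revenue_interval(5, 100, future_visitors=50000, price=20.0)
low_l, mid_l, high_l = revenue_interval(5000, 100000, future_visitors=50000, price=20.0)
```

Both forecasts centre on a similar figure, but the interval from 100 visitors is far wider than the one from 100,000 visitors, which is the distinction a CFO would care about.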

What did your QCon New York talk about?
17:40 The approach I took in my talk was trying to build a probabilistic programming system from scratch.
17:55 One of the things you learn when you build a probabilistic programming system from scratch is how slow it can be unless you are very clever.
18:05 To get it to be fast, you need to have a graduate degree in statistics.
18:10 A number of deep mathematical problems were solved in the last couple of decades, and they are robust.
18:20 That means the user of the algorithm doesn’t need to tune it for each problem - it’s idiot proof in a lot of ways.
18:35 The most robust algorithm - already not the most cutting edge, as it’s a fast moving field - is Hamiltonian Monte-Carlo with automatic differentiation and a no u-turn sampler.
18:50 It’s harder to code than it is to say, and it’s pretty difficult to say.
18:55 It turns out my background in astrophysics is surprisingly relevant to that algorithm - it shares a lot in common with the orbital mechanics of the Solar System.
19:15 All this is building up to saying that probabilistic programming languages to an extent become one line.
19:20 It’s up to you to describe your data and your problem, but you essentially push the Hamiltonian Monte-Carlo button, and out pops your posterior.
19:30 In practice, it can be harder or easier for some kinds of problems.
19:35 It is, in terms of code, a one-liner.
19:40 Probabilistic programming languages have industrial-strength, very fast implementations of algorithms that would otherwise be tremendously difficult for most people to implement.
19:50 The other thing they do is define primitive data types for the objects we think about when doing probabilistic programming.
20:00 If you have a GUI toolkit library, you’re going to define a window and a close button.
20:05 Probabilistic programming languages do the same thing - they provide a library of primitives relevant to these kinds of problems.
20:10 In particular, random variables and probability distributions are off-the-shelf, well designed, fast implementations that you can compose together to describe your problem.
20:25 The problem description you end up writing in these languages is very declarative.
20:30 You don’t find yourself writing a lot of for loops, or the order in which things need to happen: you simply describe the world and press go, and the probabilistic programming language figures out the implications of that.
20:40 That at least is the goal; the devil is in the detail.
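To give a feel for what building a sampler "from scratch" involves, here is a minimal random-walk Metropolis sampler. Note the substitution: this is a much simpler and slower relative of the Hamiltonian Monte-Carlo approach named above, not an implementation of it, and the target (a conversion rate with 50 conversions from 1,000 visitors under a uniform prior), step size, and burn-in length are illustrative assumptions.

```python
import math
import random

def log_posterior(p, conversions, visitors):
    """Log of the unnormalised posterior for a rate p under a uniform prior."""
    if not 0.0 < p < 1.0:
        return -math.inf
    return conversions * math.log(p) + (visitors - conversions) * math.log(1 - p)

def metropolis(log_prob, start, steps, step_size, seed=0):
    """Random-walk Metropolis: propose a nearby value, accept or reject it."""
    rng = random.Random(seed)
    current, current_lp = start, log_prob(start)
    samples = []
    for _ in range(steps):
        proposal = current + rng.gauss(0, step_size)
        proposal_lp = log_prob(proposal)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples

samples = metropolis(lambda p: log_posterior(p, 50, 1000),
                     start=0.5, steps=20000, step_size=0.01)
kept = samples[2000:]                 # discard burn-in from the poor start
estimate = sum(kept) / len(kept)      # posterior mean of the conversion rate
```

Even this toy needs a hand-picked step size and burn-in, and it scales poorly as the number of parameters grows. Removing that tuning burden is what the robust algorithms in probabilistic programming languages are for.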

What are some of the languages being used in real world settings?
20:55 If you have a practical, real-world problem today, there are two languages: Stan and PyMC3.
21:00 Both of these are appropriate and fast enough to solve big problems - they are relatively mature languages with great documentation.
21:10 PyMC3 as you may have guessed from the name is like a super-set of Python - and in that sense PyMC3 is probably the more user friendly for most people listening to this podcast.
21:20 On the back end it is using a library called ‘Theano’, which if you have done any deep learning you may have come across before - it’s an end-of-life 1.0 deep learning library.
21:30 What PyMC3 uses it for is to do automatic differentiation - they didn’t implement it themselves, they used a deep learning library.
21:40 It’s not deep learning we are talking about but it has strong overlaps at a high level and low level.
21:50 PyMC3 is going to do all of these things - it has got a fast implementation of algorithms like Hamiltonian Monte-Carlo, and other algorithms that may be more appropriate in other situations.
22:00 PyMC3 is pretty smart about figuring out which of those algorithms to use, in addition to doing a good job of implementing them.
22:05 It’s got random distribution and random variable primitives.
22:10 You write Python syntax, which looks a little unusual, but fundamentally you are describing a problem and then pressing go.
22:20 Because you’re using Python, that raises the possibility of wiring the application into a web server, and then you’ve got an API that serves off samples from the posterior distribution or the answers to statistical questions over a web API.
22:40 Your front-end can talk to your back-end, which is essentially a dialect of a general purpose programming language.
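As a rough illustration of "describe the world and press go", the layout A/B test might be written in PyMC3 as below. This is a sketch assuming PyMC3 3.x is installed; the uniform Beta(1, 1) priors, the variable names, and the wrapper function ab_test_model are hypothetical choices, not something taken from the podcast.

```python
def ab_test_model(conv_a, n_a, conv_b, n_b, draws=2000):
    """Describe an A/B test declaratively and sample from its posterior."""
    # Imported inside the function so this module still loads on machines
    # where PyMC3 is not installed.
    import pymc3 as pm

    with pm.Model():
        # Prior beliefs about each layout's conversion rate (assumed uniform).
        p_a = pm.Beta("p_a", alpha=1, beta=1)
        p_b = pm.Beta("p_b", alpha=1, beta=1)

        # How the observed conversion counts arise from those rates.
        pm.Binomial("obs_a", n=n_a, p=p_a, observed=conv_a)
        pm.Binomial("obs_b", n=n_b, p=p_b, observed=conv_b)

        # The quantity we actually care about.
        pm.Deterministic("uplift", p_b - p_a)

        # "Press go": PyMC3 picks a sampler and returns posterior samples.
        trace = pm.sample(draws, tune=1000)
    return trace
```

Calling trace = ab_test_model(4, 100, 5, 100) and then (trace["uplift"] > 0).mean() would give the estimated probability that layout B is better. There are no loops and no sampler code in the model description, and a function like this is the sort of thing that could sit behind a web API as described above.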

What are Stan’s strengths?
22:55 Stan has historically had the fastest implementations of all of these libraries, though that is perhaps a little less true than it used to be.
23:00 Stan is academic code that is used in production.
23:05 It is used in the pharmaceutical industry for drug trials, in environmental sciences, and in industrial applications.
23:15 In terms of the way most production software engineers think, it’s a slightly unusual language.
23:20 The documentation is voluminous as well as good.
23:25 The community around Stan is incredible; the mailing list is amazing.
23:30 If you have a PyMC3 question, and you can translate it to Stan, then ask it on the Stan mailing list and you’ll get a great response.
23:35 They are both worth checking out.

Is Stan used by Facebook?
23:45 Stan is a general purpose programming language; in a CS sense, it is Turing complete.
23:50 You wouldn’t use it to solve general problems.
24:00 It is very flexible; that flexibility is for some use-cases a disadvantage.
24:05 If you have a very specific problem you want to solve, and you want to make it accessible to data analysts or data scientists, you might encapsulate a Stan program with some plumbing to allow it to be used in languages like Python and R.
24:25 That’s what the Facebook data science team did when they created Prophet.
24:30 Prophet is a general time-series forecasting library.
24:35 You feed in a time series, and it predicts the future.
24:40 It captures any seasonality in your time-series; if you get more visitors to your website Monday through Friday than at the weekend, Prophet is going to capture that.
24:45 It also captures seasonality at lower and higher levels, not just days of the week.
24:50 Crucially, its forecast incorporates that seasonality and incorporates uncertainty bounds on what can happen in the future that are implied by your data.
25:00 The more data that you have in your archive of data in general, the more confident your prediction about the future will be.
25:05 If you have a very short archive of data, Prophet will still make a prediction and won’t give up, but the prediction will be correspondingly uncertain.
25:20 Because of very good UX decisions about how the model talks to Python and R, this is a particularly accessible place for most people to start.
25:30 You can be up and running in four or five lines of code, and looking at real output of Bayesian inference; namely the posterior distribution, without necessarily opening up the (frankly messy) black box.
25:55 It’s a generic problem, and most of us have time-series data.
26:00 I’m a machine learning person with a background in time-series, and I find working with time-series pretty tricky.
26:05 There’s a reason why the ‘hello world’ problems of machine learning are classifying irises and predicting deaths on the Titanic - they are simple tabular data-sets.
26:20 Time-series are not like that - they can be arbitrarily long, have all sorts of long-range dependencies (periodic on different time-scales).
26:30 They can be a nightmare to analyse - but they are really common.
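The "four or five lines of code" Prophet workflow mentioned above might look like the following sketch. It assumes Prophet and pandas are installed and that the history is a pandas DataFrame with a date column named ds and a value column named y, which is the input format Prophet expects; the wrapper function forecast_with_prophet is a hypothetical name.

```python
def forecast_with_prophet(history, periods=30):
    """Fit Prophet to a DataFrame with 'ds' and 'y' columns and forecast ahead."""
    # The package was published as 'fbprophet' at the time of the podcast
    # and was later renamed 'prophet'; try the newer name first.
    try:
        from prophet import Prophet
    except ImportError:
        from fbprophet import Prophet

    model = Prophet()
    model.fit(history)
    future = model.make_future_dataframe(periods=periods)  # extend by N days
    forecast = model.predict(future)
    # yhat is the prediction; yhat_lower / yhat_upper are the uncertainty bounds.
    return forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]]
```

The gap between yhat_lower and yhat_upper is where the uncertainty discussed above shows up: with a short archive of data you would expect those bounds to be correspondingly wide.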

How do the data structures around Bayesian and probabilistic programming languages affect core programming languages? Where do you see them going?
27:30 The first is probabilistic data structures - like Bloom filters - being incorporated in the language.
27:45 If you’re interested in Bloom filters, check out Cuckoo filters - they are the new hotness.
27:50 You can delete things from Cuckoo filters, which is something that famously you can’t do in Bloom filters.
28:00 Analysis at the programming language level is a very useful affordance in the world we live in, where data is coming in at such a rate that you can’t store it to do batch analysis on it.
28:20 At the language level, you can imagine that becoming more and more useful.
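For readers who have not met a Bloom filter, here is a minimal sketch using only the standard library. The sizes, the number of hash functions, and the way hashes are derived from SHA-256 are illustrative assumptions rather than a tuned design.

```python
import hashlib

class BloomFilter:
    """Set-like structure: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))

seen = BloomFilter()
for i in range(100):
    seen.add(f"user-{i}")

false_positives = sum(f"stranger-{i}" in seen for i in range(1000))
```

Because several items can set the same bit, clearing bits to delete one item could silently remove others. That is the limitation Cuckoo filters address by storing small fingerprints that can be removed individually.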
28:25 The one place that I’ve seen Bayesian inference incorporated at the language level is an MIT project called BayesDB.
28:45 BayesDB is essentially a super-set of SQL which allows you to do selects and filters and all those sorts of SQL things.
28:50 The really nice thing is it allows you to impute missing data and select on a criterion whose values are missing, and the gaps are filled in in a probabilistic way.
29:00 It adds to SQL’s set of primitives an ‘infer’ statement which allows you to infer missing values, and under the hood it’s using Bayesian inference.
29:10 That code is public, and you could use it on the database side with missing data, and think about how you could expose potentially quite complicated algorithms to users; BayesDB is very interesting.

What is PyMC4 and how does it relate to TensorFlow?
30:00 PyMC3 uses Theano, which isn’t going to see any further development.
30:15 They have decided to switch over to a new back-end, TensorFlow Probability, a Google project.
30:30 I don’t recommend you go out and install TensorFlow Probability immediately; it’s a low-level set of tools that are geared towards probabilistic programming languages or very innovative research and analysis.
30:50 You can think of PyMC4 as a user-level implementation of probabilistic programming that uses TensorFlow at the back-end.
31:00 This is similar to when people ask me where to start with deep learning; I don’t recommend TensorFlow, because it is a relatively low-level library.
31:10 There are higher level libraries that abstract away some of the complexity and provide building blocks of neural networks that are perhaps more user-friendly.
31:20 TensorFlow Probability is kind of the same deal; identify a library that sits on top of it rather than using it directly, if this is the first time you are hearing about probabilistic programming.
31:40 If you already knew about everything here, go and check out TensorFlow Probability - it is very cool work.

TensorFlow is more of the brand name?
31:50 Yeah, if you go onto GitHub and go to the TensorFlow organisation, there are lots of packages (including the TensorFlow deep learning library).
32:00 Not all of these are immediately directly related to one another.
32:05 I think TensorFlow Probability falls under that umbrella.

Mentioned

Fast Forward Labs
PyMC3
Stan
Theano
BayesDB
TensorFlow

More about our podcasts

You can keep up-to-date with the podcasts via our RSS Feed, and they are available via SoundCloud, Apple Podcasts, Spotify, Overcast and Google Podcasts. From this page you also have access to our recorded show notes. They all have clickable links that will take you directly to that part of the audio.