Okay, so welcome everybody. Today we're going to start digging under the hood, so you can get a better sense of how to actually design interesting AI systems using this building material. Now, the material I present today draws on the work of a number of wonderful colleagues and students, but I especially want to highlight Alex Lew and Marco Cusumano-Towner, who put together a lot of the examples and probabilistic programming systems. Alex Lew is starting a lab at Yale next year, so for people who are interested in grad school in probabilistic programming, I encourage you to go check that out.

Okay, so we're going to spend the first section of the lecture getting an idea of how to make inference processes that have more of the expressiveness and dynamism of programs, what kind of automated math is needed, and how that goes beyond what you can do with deep learning frameworks like TensorFlow or PyTorch or JAX. Then, given that high-level context, we're going to spend the bulk of the time going deeper on two examples, particular settings where you can see, hopefully in more technical detail, how the gears mesh to produce the kinds of results I showed last time. Then finally, we'll take a very quick look at a new probabilistic programming system we've been building on top of JAX. It's designed to scale out on parallel accelerators like GPUs, but also, actually, on edge devices.

Okay? So first, just to get a feeling for some of the kinds of things you can do with probabilistic programming, let's just review. Here what I'm showing is an animation of a Gen program that is building a map of one of the floors in CSAIL from sensor data that's being collected by the robot represented by the dashed red line. You can see, as it moves around the environment, the system is inferring the location of the robot and the positions of the walls simultaneously. For people who know what particle filtering is, you might find it interesting to know that this is running with just a single particle.

Another type of interesting capability that emerges with these kinds of inference processes is the ability to detect when things don't really hang together. So here, what I'm showing in one panel is the simulated world, where we put the robot in the far top right of the map. But then we told the robot, in its internal world model that you see in the middle, that it's actually one room over. So the robot's being told it's in the wrong location. Let's watch what happens as this robot tries to localize itself: it starts, leaves what it thinks is the second room, gets confused, and then corrects itself. Let's watch that once again. Initially, the robot thinks it's in the wrong room, with certainty. Then, once the predictions from its model are sufficiently incompatible with the data that it's sensing, the posterior distribution, which I'm showing as the heat map in the top right, over parameters that govern how much discrepancy to expect between the predictions and the sensor data, concentrates on high probabilities of outliers and high degrees of tolerance for inliers. That state of uncertainty triggers a controller that says: actually, maybe I need to reconsider the location across a broader range. And that's how the system first detects the discrepancy between its predictions and the environment, and then corrects it, using probabilistic inference.
So that kind of reasoning, guided by a high-level understanding of uncertainty, is one of the capabilities that becomes available generically using probabilistic programming.
How recent are things like this?
I guess I would say it kind of depends. Deep inside many robotics projects, in probabilistic robotics and self-driving, people have tried to put together heuristics that approximate some of these kinds of capabilities. The idea that it might be possible to have multiple layers of modeling and control in a single AI system, as an engineering approximation, has been around; but principled hierarchical Bayesian math automated by a probabilistic programming system, I think that's really only the
last couple of years. And one final example, to give you a sense of where this can go. Here I'm showing, on the right, an animation of a localization process for a robot. The inferred robot position is shown in green, and you're seeing the inference code on the left; it fits in about a page, for what's a kind of adaptively controlled inference algorithm, where the robot's inferences about where it is are extended step by step. And then occasionally you see those little red marks, which show up whenever the robot infers too much discrepancy between where it thinks it is and the data, using its online estimates of how good the fit should be, and spends a little bit more computation to try to narrow the gap. We're in the early stages of both engineering the implementation of this kind of adaptive inference compute graph, and also having theory that lets us understand the design trade-offs of these kinds of things. But I wanted you to see this at the beginning, because we write probabilistic programs in lots of domains. Last time I showed you examples drawn from 3D scene perception, or inferring the structure of time series or other multivariate data streams, and I maybe at least mentioned probabilistic programs that infer clean, linked databases from millions of dirty records. And it's a maybe surprising fact that all of these inference processes have the basic affordances that are needed for these interesting levels of metacognitive control, which they get as a byproduct of doing the inference engineering using the approach that you're going to hear more about in the lecture today. You will have designed an inference system that exports not just guesses, but also weights that give local assessments of how accurate the inference process is, and those can then themselves be analyzed by higher-level control processes to adapt how much work needs to be done, where, and under what circumstances. And that's a fundamental, I think new and exciting, affordance for AI engineering that comes from probabilistic programs. So just keep that in the back of your mind, and hopefully, by the end of the lecture, it'll at least be a little bit clearer how we're doing this.

Now, one of the goals of probabilistic programming is to make it possible to implement a broad range of inference and learning algorithms: optimization by gradient descent, the major workhorse for training in deep learning, but also variational methods, MCMC, sequential Monte Carlo methods, and exact methods such as dynamic programming. This breadth of algorithms for inference and learning really necessitates revisiting much deeper levels of mathematical automation than the people building deep learning platforms have had to consider. So let's just get a brief overview of that before we dive into examples.

In probabilistic programming as a whole, I've argued there are basically two fundamental technical ideas. The first idea is that we're going to represent probabilistic models not using math, but as code: generative programs that make stochastic choices for the latent variables. And the second big idea, which we're really going to focus on today, is that we're going to represent the algorithms for inference and learning as meta-programs that take generative programs as input and/or produce generative programs as output. And that will turn out to give us a way to directly translate fairly deep and abstract ideas from probability, and their interaction with calculus and symbolic math, into a form that we can automate.
And that's what gives us such an expressive building material for a much broader range of AI systems that have the interesting capabilities that I was showing last lecture. So just briefly, to remind ourselves of what this means: here I'm showing a probabilistic program on the left which takes a parameter theta, flips a coin with weight theta, and if that coin lands heads, returns zero; otherwise it returns theta over two. And on the right, I'm showing one mathematical description of the distribution on outputs that I've defined concretely, in code, as a generative program. On the left, the code makes a stochastic choice for b, and that's essentially implicit in the math on the right. And because we've got explicit code, of course, we can simulate it on the computer, we can automate many mathematical operations, and we can scale it to distributions that may have millions of latent variables and modular systems with tens of parts, which is very, very difficult to do if all we have is mathematical notation.

Now, to really make this work, we need to be able to do things with the code representing probability distributions that correspond to the mathematical operations that we can, at least in theory, do with the mathematical objects the code corresponds to. So on the left, I'm showing a schematic of a capability we're all quite familiar with, automatic differentiation, where we take code representing a function and feed it into a machine that gives the gradient of that function, which measures how small changes to a function's input change its output. But if you want to handle probability theory, not just calculus, we also need to be able to automate estimating expected values, where we have code representing a distribution and code representing a function, and we want to answer the question: for input sampled from this distribution, what's the average output of that function? And actually, once we get started, we can realize there's a much broader space of mathematical operations we might like to automate, including other kinds of derivatives: measure-theoretic derivatives like the Radon-Nikodym derivative. For those of you who aren't familiar with it: importance weights and acceptance ratios in Monte Carlo algorithms essentially turn out to be special cases of this operation. What it's measuring is how likely some outcome is under one distribution relative to how likely it is under another distribution; that ratio of likelihoods is essentially what a Radon-Nikodym derivative is. These are just a subset of the mathematical operations in probability theory, and modern probabilistic programming systems like Gen automate these operations, their interactions with calculus, and their interactions with symbolic math.

So ultimately, maybe zooming out, we get the ability to do things like take a probabilistic program representing a random process and actually get the derivative of the expected value of that random process, so we can figure out how the inputs to some random process affect its average behavior. That's central for stochastic optimization, decision theory, control, and reinforcement learning, including fine-tuning of large language models, which basically boils down to having tractable, hopefully modest-variance, estimates of the gradients of expected values. And then there's also Radon-Nikodym differentiation, which shows up in many problems in inference algorithms and also statistical hypothesis testing.
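(To make that concrete, here is a minimal sketch in Gen.jl of the coin program from the slide, plus Monte Carlo estimation of its expected value and a score-function gradient estimate of that expected value. The helper names `expected_value` and `grad_estimate` are illustrative, not Gen API, and the gradient math is written out by hand for this two-outcome program.)

```julia
using Gen

# The program from the slide: flip a coin with weight theta;
# heads returns 0, tails returns theta/2.
@gen function coin_program(theta::Float64)
    b ~ bernoulli(theta)
    return b ? 0.0 : theta / 2
end

# Estimate E[output] by simulating traces of the program.
function expected_value(theta, n)
    total = 0.0
    for _ in 1:n
        total += get_retval(simulate(coin_program, (theta,)))
    end
    return total / n   # converges to (1 - theta) * theta / 2
end

# Score-function estimate of d/dtheta E[output]:
# E[f * d/dtheta log p(b; theta)] plus the pathwise term E[df/dtheta].
function grad_estimate(theta, n)
    total = 0.0
    for _ in 1:n
        b = rand() < theta
        f = b ? 0.0 : theta / 2
        dlogp = b ? 1 / theta : -1 / (1 - theta)  # d/dtheta log p(b)
        dfdtheta = b ? 0.0 : 0.5                  # pathwise term
        total += f * dlogp + dfdtheta
    end
    return total / n   # converges to 1/2 - theta, the true derivative
end
```

(Naively differentiating through a single simulated sample would drop the score-function term and give a biased gradient; that failure mode comes back in the JAX demo at the end of the lecture.)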
Yeah, so good. So what do probabilistic programming systems give you? If you have the distributions represented as code, then you can get automated implementations of this math, and if you don't, you've got to do it by hand. And now we could debate: okay, well, what's the value of automation? I think the most important value is correctness, the second most important value is speed of experimentation, and the third is actually performance of the implementation. I may have said this last time, but I remember implementing special cases of backpropagation, automatic differentiation for neural nets, by hand, at a time when most people believed it just didn't work. Why is that? Well, we all thought we were implementing the algorithm correctly. One fundamental issue is that it's actually very hard to debug code for gradients. A second issue is, if you change the problem slightly, it's very hard to change the code for gradients that you wrote; in fact, a small change to the target problem can result in a large change to the code for calculating its gradients, especially if you want a high-performance implementation. So that means people might say, well, gradient-based training of neural nets doesn't work very well, when what they're actually saying is they have a hard time correctly implementing gradients that are fast enough for the networks that they're applying them to. My own view is that that's actually one of the core issues underlying the apparent successes or failures of broad classes of AI methodology: software engineering, performance engineering, and debugging, especially of semi-numerical programs, is extremely hard. We need automation to do it. So keep that in the back of your mind, because for most of this lecture, I'm just going to be showing code and the capabilities of that code, but if you're watching carefully, you'll notice sometimes I'll call out places where the system had to automate some of the math that you just heard about.

So let's look at this example, again kind of related to the SLAM problem I showed earlier: actually just trying to track the pose of a camera from RGBD video in an office environment, where the model, which I'm showing rendered in the middle, is going to be used to model data that's, of course, way too complex for it. The scene has a bunch of stuff the model doesn't know about: tables and chairs and light fixtures. But the model is just going to use a simple hierarchical Bayesian technique to infer, on a frame-by-frame basis, which parts of the sensor data it can explain, and which parts it just can't explain, shouldn't even try to, and should ignore, so that they don't corrupt its estimates. And that masking is what you're seeing in the blue and yellow in the right panel, where the blue pixels are the ones the model has inferred it can explain, and the yellow are the pixels it inferred it can't explain, like the light fixture or the chair that
holds it. So how do we do this? Again, so do you have to tell it what's important to you, before it knows what it should ignore and what it shouldn't ignore? Right, yeah, you have to design a model that makes assumptions; you're giving it some objective function, in effect. Let's just watch.

So let's start by writing down the model. The system whose results I was showing has two components. One is a generative program, which generates hypothetical worlds and data that would be likely given those worlds, and the second is an inference program, which finds probable worlds given the data. So let's start with the generative program. Maybe the generative program starts by generating a floor plane; I'm showing a visual representation in the top right, and then I'm building up an object called a trace, which is central to probabilistic programming in Gen, in the bottom right. Then maybe we select a height z for the camera. I'm showing one randomly sampled value in the trace, where the random variable denoted z has the value 0.7, and I'm showing that schematically in the top row. Then maybe we sample a random orientation for the camera by choosing its pitch and its roll. And then, once we've done that, we actually render the depth image, in this case just the floor image, that would be likely given a camera at that z, pitch, and roll. And finally, we're going to add some noise, which is going to be the way that we account for the discrepancy between the model's predictions and the real data. And of course, the design of good sensor and noise models is not straightforward, and that's where some of the answer to your question is, Charlie.

Now, once we have that generative program, let's see how we might write the program to do inference. So here's one very simple kind of inference program, which grabs a frame from the depth camera and then constructs a trace of the generative model. It asks Gen to do this generate operation, which takes the generative program that I showed you on the last slide, and some constraints. What are those constraints? Those constraints are a data structure called a choice map, which maps names of random variables (in this case, the observation) to values (in this case, the frame that was loaded from the depth camera). So we're asking Gen, essentially, to start from a random initial world, but ensure that whatever observation came out matches the frame. You're seeing that rendered here, where there's a random pose pitched toward the floor at a particular angle, and we're asking Gen to consider the hypothesis that noisily rendering that floor produced this observed data. And what you're seeing in blue are the relatively small number of depth pixels that the model thinks are reasonably well explained. So that's the inferred floor overlaid on the real data. Yeah? Oh, sure.
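(In code, that generate call looks roughly like the following minimal sketch, where `camera_model`, a generative program with latent addresses :z, :pitch, :roll and observation address :obs, and the variables `floorplan` and `frame`, are stand-ins for the actual model and data on the slides.)

```julia
using Gen

# Constrain the named random choice :obs to the observed depth frame.
constraints = choicemap((:obs, frame))

# generate runs the model, sampling the unconstrained latents (a random
# initial world) while forcing :obs to equal the frame; it also returns
# a log importance weight for that constraint.
trace, weight = generate(camera_model, (floorplan,), constraints)

# The trace maps addresses to values:
z     = trace[:z]      # sampled camera height
pitch = trace[:pitch]  # sampled pitch
```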
Actually, can we ask the AV booth here? I'm not sure where the controls are.
Maybe they're here. Let's try this. OK, great. So now let's try to find a better explanation, because this one's kind of bogus. For a while, we can do some updates, which select particular subsets of values from the trace and perturb them, using the Metropolis-Hastings algorithm. As we go through this process, I'll show you more about what's under the hood. Some of you may know about the Metropolis-Hastings algorithm; in fact, was that covered in one of the problem sets at this point? Okay, good. So all the math for Metropolis-Hastings is automated. We're going to first make proposals, and what you get is an inference process that, I'll just show it to you again in a second, starts from a random initial configuration and keeps perturbing the random variables, seeking a better and better match. And you can judge the quality of the match in part by the segmentation of the depth image, which is shown on the right. This inference process is kind of interesting in that it alternates between randomly resetting the whole pose, which will only be accepted based on the Metropolis-Hastings rule, and making small perturbations to the pitch, roll, and height values. And by the way, this proposal for Metropolis-Hastings, the random-walk proposal, is itself just another generative program. So now, once we run that inference program, maybe we get a final trace like this one on the bottom, which is actually a pretty reasonable, if incomplete, explanation of the data, using a model that we wrote.

Now let's make the model a little better. Maybe we want to model the ceiling as well. Well, that's just an incremental change to this model: we can add in another latent variable, room height, sample it from a uniform distribution, and then include the ceiling in the list of objects that's going to get rendered by the depth renderer. So here we've made a modular, incremental change to the model. When we've done that, maybe we also need to add an update to the inference algorithm. We also want to be able to make small changes to the room height, which we're going to do by adding a little line that's going to perturb that. And we're also going to allow the system to try to reset the room height, along with the pitch, roll, and z of the camera, when it's updating the other variables.

So the thing I'm trying to emphasize: many of you may be used to designing software systems by writing a program and then making incremental refactorings to the program, as your understanding of the problem you're solving and the methods by which you're solving it improves. You can do the same thing with probabilistic computations using Gen. And this is a very different way to get working code than instead trying to make incremental changes to, let's say, a black-box neural net architecture and a training data set, which you're then just going to optimize to try to solve a problem. And you may recall, in the first lecture, I talked about interesting performance gains that have started to appear from probabilistic programs relative to deep learning. In some sense, what I'm pointing out here is that there's no secret about where maybe that comes from. It comes from the ability to make modular solutions that are incrementally edited based on understanding, as opposed to searching through a very large space of network architectures using a lot of compute.
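(A sketch of that Metropolis-Hastings loop, under the same assumptions as the sketch above; note that the drift proposal really is just another generative program.)

```julia
using Gen

# A random-walk proposal is itself a generative program: propose a
# small Gaussian perturbation to one address of the current trace.
@gen function drift_proposal(trace, addr, sigma)
    {addr} ~ normal(trace[addr], sigma)
end

function infer_pose(trace, n_iters)
    for iter in 1:n_iters
        # Occasionally try resetting the whole pose: Gen resimulates
        # the selected choices and accepts or rejects by the MH rule.
        trace, _ = mh(trace, select(:z, :pitch, :roll))
        # Small perturbations to each pose variable; the density math
        # and acceptance ratios are automated.
        for (addr, sigma) in [(:z, 0.02), (:pitch, 0.05), (:roll, 0.05)]
            trace, _ = mh(trace, drift_proposal, (addr, sigma))
        end
    end
    return trace
end
```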
Yeah? Question: just to make sure I'm following, can I think of the trace as some underlying parameter that fits the data?
Yeah, see, the trace is a data structure that collects all the stochastic choices. I wouldn't say parameter; I just want to draw a distinction, because some traces may have different structures, different numbers of parameters. But yeah, it's a latent hypothesis, or latent world, or latent explanation for the data.
So implementation-wise, does the dimension of the trace start to increase as you add modules?

Often that's the case, yeah. And in some probabilistic programs you can actually do inference over the size and shape and structure of the trace, like in the system for learning probabilistic programs from data that I showed a little bit of last time. The feasibility of making incremental changes is something I just want to highlight. This problem might seem pretty simple, and in some ways it is, but what it adds up to is the ability to incrementally explore very large design spaces for probabilistic models and computations by making incremental changes. That's the important capability, and that requires a whole lot of automation, which I hope by the end of this lecture you'll at least have started to get a little bit of a feel for. Yeah?
Instead of you kind of designing those modules, could something generate them?
Yeah. Well, last time I showed a system that learned probabilistic programs from data; I'll talk a little bit more about that next lecture. So one thing you could certainly do is write a meta-program which generates generative programs, and then do inference in that generative meta-program. Yeah, that's right, that's something you can do, and if you want to learn probabilistic programs from data, that's one approach. Another thing you can do is try to use an in-context-prompted transformer to synthesize probabilistic code from natural language. But again, I just want to make sure that you're not fooled: there's no silver bullet. All those approaches have some interesting capabilities, but really interesting and important limitations. So our goal in probabilistic programming is to come up with a building material that makes it possible to explore the full logically possible design space of probabilistic computations, that shifts the burden between explicit engineering and meta-engineering methods that use compute, and that allows those meta-engineering methods to be much smarter than the fairly generic random search, or blind search, or even gradient-informed but still quite generic uninformed search, that's been typical. And again, not to say that those other methods are, quote unquote, bad, but they certainly aren't informed. And if we want AI that learns much more data-efficiently and compute-efficiently, we might expect we'd see some gains from systems that are learning modular explanations of the world via modular inference processes, and that don't have to reconsider all the weights for every incremental change.

So then, let's say we want something which actually runs in real time. Right now, we have this little for loop that's doing 1,000 iterations of some MCMC algorithm. But what if we just change it to a while loop where, in between single passes over all the latent variables, we're just going to grab a new frame from the camera and overwrite the constraints in the trace? The way to do that is to ask Gen to do something called update, which I'll tell you a little bit more about in the next section of the lecture, that takes the trace and the new constraints and swaps in the new frame. And now you get something which can track small changes, because maybe it only takes roughly a single sweep of Metropolis-Hastings over all the variables to update the hypothesis given the change, which is actually small, because the camera is moving continuously. So now we have a real-time tracker. Here I'm showing trace plots of what happens when you run this from many random initializations. You can see, initially the hypotheses are over-dispersed, but actually fairly quickly they concentrate on what I hope you can see was the right answer, panning around the room. Now, what happens if we remove, as in we just comment out, the line in the inference program that occasionally tries to propose resetting the hypothesis? If we do that, it still converges, but sometimes quite a bit more slowly, because occasionally it just gets stuck off in, you know, the middle of nowhere, and has to make a lot of small changes to get into the correct zone. What if we remove all the incremental changes? If we do that, the situation is a lot worse. So I'm just trying to give you a feeling for some of the kinds of engineering design choices that you can make explicitly; and of course, you can start to write code that tries to make them automatically, either using machine learning or decision theory or random search.
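(The real-time variant might look like this sketch, continuing the hypothetical camera example; `camera_running`, `grab_frame`, and `depth_camera` are illustrative stand-ins for the actual I/O.)

```julia
using Gen

function track!(trace)
    while camera_running()
        frame = grab_frame(depth_camera)   # hypothetical I/O helper
        # Swap the new frame into the existing trace: update rescores
        # incrementally and returns a weight for the change.
        trace, _, _, _ = update(trace, get_args(trace), (NoChange(),),
                                choicemap((:obs, frame)))
        # Roughly one MH sweep per frame is enough to track a
        # continuously moving camera.
        for (addr, sigma) in [(:z, 0.02), (:pitch, 0.05),
                              (:roll, 0.05), (:room_height, 0.05)]
            trace, _ = mh(trace, drift_proposal, (addr, sigma))
        end
    end
    return trace
end
```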
So now let's go one level deeper, and move to an inverse planning domain, where I can show you a little bit more about what's going on under the hood in Gen. The problem we're going to consider here is making a system which can take telemetry, say from a transponder reporting someone's location in an elder-care scenario or a smart home, so it has access to where a person is, and takes data from their motion and tries to infer basic aspects of their goals. By the way, systems like these have actually come up in clinical evaluation of medical devices and elder care, as I mentioned earlier, so there are many real-world problems in close adjacency to this. So we'd like a system which can answer questions like: is someone leaving their apartment? Imagine you saw location data corresponding to the sequence of black dots here, starting on the couch. I'd like you to raise your right hand if you think they're definitely leaving the apartment, left hand if you think they're definitely not, and use the height to indicate definitely, maybe, maybe not, definitely not, right? So the confidence shifts quite rapidly, right? Okay, so now here's another trajectory. Okay, great, so now we all think they're probably leaving, especially once they pass the bathroom door. How about now? Okay, well, what if I told you that their keys tend to be left by the bedside table, right? So I'm just trying to give you a feeling for the richness of common-sense inferences that we can make, you know, for an apartment we've never seen before. So maybe in this case we could consider two explanations: they looked for their keys on the kitchen table or by the kitchen island and didn't find them, or they looked for their keys by the bedside and did find them. Or maybe they're just going to take a nap.

Okay, so let's look at a key subproblem of this problem, and solve it using Gen, by writing an inference program that does online goal inference. We'd like something that takes a sequence of input data points and produces some predictions for the next waypoint, in a possibly multi-waypoint path, along with a future trajectory prediction, which I'm showing in the far right. You can see that the places where the gray lines are concentrated are the likely places the agent might be. Now, for a different set of data, maybe we all agree it's pretty likely that the agent's going into the room in the top right corner of the apartment, and you can see that coming out of the inference algorithm as well.

So let's start by writing a generative program. I'm showing the generative code here, and I'm showing a mathematical description, kind of a summary of the probability model, just below. The model takes a floor plan as input, as well as an approximate starting location shown by this blue circle, and it's going to randomly sample z_start, a start location within that approximate start region that was given as an input parameter. Then it's going to sample a destination, z_dest; let's say maybe that shows up here. Then it's going to plan a trajectory from start to goal, and here it's doing that using a rapidly-exploring random tree path planner with stochastic gradient trajectory optimization. I'll say a little bit more about that; for those of you who've done real-time robotics before, you may be familiar with that class of algorithms.
We're just going to use it as a subroutine in our generative model. Then we're going to generate some measurements by walking along that trajectory with stochastically variable speed and generating some noisy position measurements, shown in black. So what are the traces of this generative program? Here I'm showing the generative code, the code description of the probability model, and then two traces, both as data structures and visually rendered. The most important idea to get in probabilistic programming is the relationship between generative code and its traces, so that you can understand how the static description of the process corresponds to an actual stochastic instantiation, or realization, of its behavior, how that behavior is represented in the computer as a data structure, and also how it corresponds to your own intuition, which you can see through visualization. So here I'm showing a trace where the destination is in the room at the top right, and here a trace where the destination is, I guess, the second closet near the bathroom, something like that. And the trace has a start location, a destination, and then a sequence of measurements, which are shown visually.

Okay, so now the inference problem is to go from a partially specified trace, one that just has the measurements, and try to infer the start location and destination: to do something like sample from the posterior on z given x, where z are these hidden variables, the start location and destination, and x are the measurements. And here I've just shown Bayes' rule, which gives one way of at least thinking about the posterior, but not really a way of obtaining it computationally.

OK, so let's dig in a little bit deeper. Suppose you try to imagine what this probability model corresponds to as a Bayesian network or graphical model. How many of you took a class on graphical models, or a unit on graphical models in some other class? A good number of you, right. So this graphical model turns out to have about 10,000 variables, because the rapidly-exploring random tree path planner is a fairly complex algorithm that makes many recursive choices as it explores branching paths through the apartment. And if you were to trace all those stochastic choices, the trace is about a megabyte. One thing that Gen lets you do is have random processes where you don't trace all the random variables, and that actually compresses the trace of a scenario like this down to about a kilobyte per explanation, so much more manageable, and maybe closer to the intuitive size you would have expected. Okay, so now, once we've defined the model and a space of possible traces, let's try to do inference. Yes, question?
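(Here's a minimal sketch of the shape of that generative program. The planner helper `plan_path`, the path follower `walk_path`, and the distribution `uniform_from_region` are hypothetical stand-ins; in the real model, the planner's thousands of internal choices are untraced, and the stochastic speed variables are handled as discussed below.)

```julia
using Gen

@gen function agent_model(floorplan, start_region, T::Int)
    # Latents: start location within the given region, and a destination.
    z_start ~ uniform_from_region(start_region)
    z_dest  ~ uniform_from_region(floorplan)
    # Plan a trajectory with an RRT planner plus trajectory optimization.
    path = plan_path(floorplan, z_start, z_dest)
    # Walk the path at stochastically variable speed, emitting noisy
    # 2D position measurements.
    for t in 1:T
        pos = walk_path(path, t)
        {(:meas, t)} ~ broadcasted_normal(pos, fill(0.1, 2))
    end
end
```

(Inference then means constraining the (:meas, t) addresses and filling in :z_start and :z_dest.)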
Yeah, my question is: the previous thing that you were showing, the possible paths and stuff, was this generated on the spot, or was there a training time and then...
No training; no parameters have been optimized in any of the examples I'm showing you right now. I'll later show you how you can use Gen to express hybrids of more explicit knowledge engineering and machine learning. But so far there's been no training; we've just written down the generative model.

So, we'd like an online inference algorithm which, as data points come in, rapidly concentrates the traces that it produces on just the plausible ones. There are a bunch of technical challenges we have to overcome to deal with that. Here I'm just going to give you a high-level overview of the inference programming capabilities on this slide. The first challenge is efficiently finding probable traces given data, and the way Gen facilitates that is through inference programming capabilities that let you combine sequential Monte Carlo, Markov chain Monte Carlo, variationally trained surrogate models, systematic reasoning via dynamic programming, and, where appropriate, deep learning methods. But to implement those: all those algorithms have a lot of math in them. They require automated probability densities, importance weights, gradients, and acceptance probabilities, and those often involve integrals summing over large subsets of random choices in the traces. So Gen provides affordances for controlling when you're actually doing tracing of stochastic choices, when you use exact densities versus what are called pseudo-marginal estimators, and then automated math for the core operations at whatever granularity you've chosen in your inference architecture. So those are some of the design-space considerations for inference engineering.

Now, for all of this to actually work efficiently in practice, you also need the implementations of all of this to be fast, especially in a case where you're considering counterfactual changes to traces and scores. So I need to make incremental changes to my explanation: moving the destination over here, or adding a waypoint, without re-executing the whole generative program from scratch. And for that, Gen provides controllable incremental computation for generative programs; internally it has intermediate representations that let you predict the cost and efficiency of static incrementalization, as well as mechanisms for dynamically tracking dependencies. Sorry, I should say, other probabilistic programming systems do that; Gen.jl doesn't currently, but some of its ancestors did dynamic dependency tracking as well.

Okay, so the style of inference program that we've been focusing on is sequential Monte Carlo inference programs. At sort of the 10,000-foot level, they do something like: generate a collection of traces; weight them by estimating how likely they make each incremental piece of data; resample to concentrate on traces that at least locally look promising; perturb those traces to try to improve the quality of the latent explanations of the data; and then estimate weights again for the next data point. And that turns out to be a very general building block for 3D scene perception, record linkage, learning probabilistic programs from data, and doing online Bayesian goal inference. In fact, in the last lecture I'll give in this class, you'll see a proposal for how we think cortical computation in the brain might work as an implementation of the same basic pattern. So let's see how to implement that class of algorithms in Gen.
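(A sketch of that sequential Monte Carlo loop using Gen.jl's particle-filtering library, assuming the hypothetical `agent_model` sketched earlier; the rejuvenation moves use MH on the start and destination.)

```julia
using Gen

function online_goal_inference(floorplan, start_region, data, n_particles)
    # Initialize particles with the first measurement constrained.
    init_obs = choicemap(((:meas, 1), data[1]))
    state = initialize_particle_filter(agent_model,
        (floorplan, start_region, 1), init_obs, n_particles)
    for t in 2:length(data)
        # Resample to concentrate on locally promising traces.
        maybe_resample!(state)
        # Extend each trace with the next measurement; only the model
        # argument T changes.
        obs = choicemap(((:meas, t), data[t]))
        particle_filter_step!(state, (floorplan, start_region, t),
            (NoChange(), NoChange(), UnknownChange()), obs)
        # Rejuvenate: perturb each particle's latents by MH.
        for i in 1:n_particles
            state.traces[i], _ = mh(state.traces[i],
                                    select(:z_start, :z_dest))
        end
    end
    return state
end
```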
So here I'm showing the code in the top left for an inference program that initializes a bunch of hypotheses and then incrementally perturbs them and extends them to try to solve the whole inference problem. And in the bottom left, I'm showing the mathematical description of this type of algorithm that you might see in a NeurIPS or UAI or AISTATS or ICML paper that's doing sequential Monte Carlo inference. By the way, this is a multi-threaded inference program; we're going to update each of the traces in parallel. So we're going to start by generating a bunch of traces just from the prior; that's what I'm showing you here. Then we're going to resample, to concentrate on just the ones that are compatible with the first data point. Then we're going to perturb them using Metropolis-Hastings with a random-walk proposal. If we zoom into that: what the random-walk proposal is doing is taking each of the traces, simulating a small random walk, and then accepting or rejecting that new change according to the Metropolis-Hastings acceptance rule, and Gen has automated the math for all of these. In fact, you might find it interesting to know that Metropolis-Hastings isn't some built-in feature of Gen, but in fact is an inference program itself that you could write in user space. The way Metropolis-Hastings is implemented is: you simulate a proposal trace; you update the trace to be that new proposal trace, calculating a weight associated with that update; then you calculate the reverse weight, of going from that new trace back to the trace you started from; and then you accept or reject that change. And that uses facilities in Gen that automate the math and give efficient implementations of the counterfactual edits that I was just describing, regardless of which modeling language you wrote the model in. In fact, Gen provides a set of abstract interfaces for a range of these automated math operations, and it generates sound implementations using a compiler from the probabilistic program that you wrote. Those different operations include a mix of probabilistic sampling and scoring; incremental computation, for the cases where you're just changing one random variable; and also automatic differentiation, in the case where you want to calculate a gradient and perturb some parameters by, let's say, taking a step in the direction of the gradient.

Okay, so that's all under the hood. So then you, as a user, can just say things like Gen.metropolis_hastings, or Gen.update to extend each trace with the next data point, and then repeat. And as we keep iterating over this process, we get an algorithm which starts to concentrate the explanations in just the room at the top right.
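(A sketch of how MH over a custom proposal can be written in user space on top of Gen's trace operations. This mirrors, in simplified form, the propose/update/assess pattern just described: propose forward, counterfactually update, score the reverse move, then accept or reject.)

```julia
using Gen

# User-space Metropolis-Hastings with a custom proposal generative
# function that takes (trace, args...) and addresses the choices it
# wants to overwrite.
function my_mh(trace, proposal, proposal_args)
    # Forward: simulate the proposal and get its log score.
    fwd_choices, fwd_score, _ = propose(proposal, (trace, proposal_args...))
    # Counterfactual edit: weight is the log density ratio of the two
    # traces, with bookkeeping for the discarded old choices.
    args = get_args(trace)
    argdiffs = map(_ -> NoChange(), args)
    new_trace, weight, _, discard =
        update(trace, args, argdiffs, fwd_choices)
    # Reverse: log score of proposing the way back.
    bwd_score, _ = assess(proposal, (new_trace, proposal_args...), discard)
    alpha = weight - fwd_score + bwd_score
    if log(rand()) < alpha
        return new_trace, true    # accept
    else
        return trace, false       # reject
    end
end
```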
So what we've seen now is one way of doing purely generative model-based inference, where we wrote down a model, we designed a sequential Monte Carlo inference algorithm, and we asked Gen to automate the math to implement it. But you may be wondering: well, what about data-driven inference, where we have a bunch of training data and we train some neural net to make guesses? How does that fit in? Well, there are lots of ways it could fit in. Let me just show you one thing you can do, which is a very old idea in AI and machine learning: use a generative model as a synthetic data generator, to generate fully labeled training data for supervised training of a neural network that tries to make guesses about the latent variables. But with Gen you don't just have to trust those guesses blindly. You can actually use the generative model in the loop, to check whether those guesses actually cohere with the generative model the way you want them to. That's what we call neural Monte Carlo inference. So in Gen, you can write a neural net in PyTorch, wrap it in a generative program so that it can be a valid proposal, a generator of guesses, and then train the neural net on simulated data using an appropriate probabilistic loss, with automatic differentiation that handles batching in a reasonable way across, in this case, the CPU implementation and, you know, the GPU where the PyTorch model is living. So here I'm showing the cross-entropy loss versus log training time, going up to about 1,000 seconds, for an LSTM proposal for making guesses about the target location in one of these floor-plan scenarios. Now, if we just use that trained LSTM to make a guess, it's pretty fast, maybe, let's say, call it a millisecond to run, but it actually doesn't get great answers. So, for example, you see data corresponding to these black dots here, and a lot of the guesses don't really make a lot of sense; they're too broad. But Gen can clean those up. Those guesses are much better than the prior; this dashed line here corresponds to the prior. But with Gen we can actually do neural Monte Carlo, where we accept or reject those proposals based on how well they fit the generative model. And if we do about 10 proposals, we can drive down the error to be essentially arbitrarily low, and then we get something where, relatively quickly, maybe a couple of tenths of a second, a couple hundred milliseconds, we see data like this, and all the proposals concentrate in just the room at the top. So this is one way to shift a lot of the work of inference engineering to meta-engineering: we just designed a model that we understand, and we used that model to generate labeled training data for a neural inference architecture that we're going to use to make guesses. Now, what do people think about this? How much of AI inference could be done this way? Okay, you're saying, one person with a strong thumbs down. Curious. But what are the strengths and weaknesses of this construction?
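(A sketch of that neural Monte Carlo pattern in Gen.jl, assuming the hypothetical `agent_model` from earlier and a hypothetical `lstm_guess_dest` that wraps the trained PyTorch net, returning a mean and per-coordinate standard deviation for the destination given the measurements.)

```julia
using Gen

# Wrap the neural guesser as a generative program so it's a valid
# proposal: its :z_dest address matches the model's.
@gen function neural_proposal(measurements)
    mu, sigma = lstm_guess_dest(measurements)
    z_dest ~ broadcasted_normal(mu, sigma)
end

# Constrain the observed measurements.
observations = choicemap()
for (t, m) in enumerate(data)
    observations[(:meas, t)] = m
end

# Neural Monte Carlo: propose from the net, then weight each proposal
# by how well it coheres with the generative model; ~10 proposals
# drive the error down, as on the slide.
trace, log_ml_est = importance_resampling(
    agent_model, (floorplan, start_region, length(data)),
    observations, neural_proposal, (data,), 10)
```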
And if you make the gesture, you've got to give an answer. No, just
contrarian. You're just being contrarian.
Anyone else? There's something maybe quite appealing about it: we didn't have to design the proposal distribution by hand, and it's a lot better, maybe, than just random Monte Carlo from the prior. But what are some of the limitations? Yeah?
Right. So, you know, it may amplify problems with the model: it's only going to be trained in regions that have high probability under the model. Another problem, though, is: what if the map is wrong, or what if the map changes? Are you going to change the weights of the neural net to track that change? I mean, if we want continually learning AI systems, you know, ones that can learn to track all the cars on all the roads and keep up to date as roads change and new models of cars appear and driving behavior changes, you don't want some pre-trained neural net that's going to have to get re-trained every time the environment or the driving behavior changes. This ability to make incremental edits to models and inference processes is also important for broad deployment, in the real world, of AI systems that we'd like to be continually learning. So we'll come back to that in depth. Now I want to show you briefly... yes?
Question: is that what you previously meant by fast counterfactual editing of traces and scores? Well, no; so
fast counterfactual editing, come back to that question if I haven't answered it in a bit. I'm just going to drill a little bit under the hood, and then we're going to return to this big-picture question about neural inference versus alternative approaches. So let's dive into the motion model. That motion model, I said, is going to follow the path at a random speed, right? Here I'm showing the generative code for that, which is a loop that takes steps, where it chooses the steps from some prior, walks along the trajectory, and then adds some noise. And that's what you're seeing simulated down here. So how does Gen handle these internal step variables? You might have noticed those weren't in the traces I showed you earlier; they were just the measurements. So what was going on? Well, Gen actually allows the engineer to make trade-offs. You could choose complete tracing, where the whole model's trace is preserved; exact integration, what's sometimes called Rao-Blackwellization in robotics and statistics; or pseudo-marginal estimation, where you're going to infer the step values by treating that inference problem as an inference problem in its own right: guess them, and then use those guesses to effectively integrate out, get rid of, the variables.

So complete tracing would result in traces that look like this, where I trace the model, and instead of just having measurements, I have interleaved measurements and steps. Now, suppose I ask Gen to do an update, which is one of those counterfactual edits, where I'm saying: take the previous trace, where the agent started here, was headed to this destination, but for some inexplicable reason saw these sensor measurements, and change it, because I want to make the destination more in line with the measurements. The problem that has is that if I just update the destination, I actually haven't yet updated all the other variables. So Gen automates the math for this, and makes it efficient; but there's a problem, which is that the details of the implementation of this motion model have leaked through into our process for editing traces. For people who are used to concerns of modularity in software engineering, maybe you can get a feeling for the problem: now you have a team of 10 engineers building a big system for inferring humans' goals as they wander around buildings, and one team is improving the model for motion over here, but every change it makes to how they represent motion affects everybody else's code. And maybe, if you think deep learning is the only solution, you say, okay, well, this is still maybe better than that; at least there's a modular interface of sorts. But we'd like to encapsulate. We'd like the rest of the system to not have to worry about, or even know about, this internal choice of step variables, this particular low-level detail of how the motion model was implemented. Just so you get a sense of what the counterfactual edits look like: you can also ask Gen's update facility to take a choice map and change not just the destination but all the steps. That would map this trace on the left to this trace on the right, which might be a trace that you all think is more intuitive; it's one where the rate of movement down this path is such that the measurements are actually kind of likely deviations from the actual position.
And to Angie's question about fast counterfactual edits: what update actually returns is not just the new trace, but also a score, which is essentially measuring the change; it's really the ratio of the joint scores of the two traces on either side of the change. Okay, so that's what complete tracing looks like, which has the disadvantage that it's not modular. So let's look at another approach, pseudo-marginalization. We could just take this motion model, which has these latent step variables, and come up with an algorithm for guessing the probable steps given just adjacent sequences of measurements, and then plug that in, to kind of hide the details of the steps from the rest of the inference process. Once we do that, the whole update mechanism works, where now our traces just have measurements; they don't have steps. And when you change the destination, we're effectively filling in new steps that fit with the destination update that we make.

So let me just go one step deeper into what's happening mathematically. What's really happening is we started with a Gen program which has some latent variable z and some data x, and which maybe also is integrating out, getting rid of, some nuisance variables we don't care about, like the steps u. So the density of the latents and the observed data is p(z, x) = the integral, du, of the joint density of u, z, and x. But it turns out that we really can relax our requirements: if we don't need exact probability densities, but we're OK with just stochastic estimates of probabilities, probabilistic probabilities, then instead of p(z, x), maybe we'll use an estimate p-hat(z, x), which is the ratio of the joint probability p(u, z, x) to the proposal probability q(u | z, x) of u, sampled from some probabilistic program that guesses the u's. And it turns out that estimator is unbiased, so on average it actually gives you the right answer. It'll have some variance, but it has the interesting property that if your q(u | z, x) actually is the exact posterior over u, so if you fill in an inference algorithm for inferring the steps that gives you the exact posterior, then you actually get zero-variance, exact marginal densities out of this approach. So it recovers exact integration in the special case where you can do exact integrals; and when you can't, when you just make rough guesses or use a learned method or something else, you pay for it with a little bit of variance, but you have to put in much less effort. And in fact, this whole thing works recursively: maybe your proposal program for making guesses, q, itself has some intractable variables, and you can apply this recursively. So this gives you a very generic way of making very deep generative models and deep inference processes, and controlling how much you specialize, how much you learn, how much you optimize, how much variance there is, how much parallel compute you use; if you have a GPU and you want a better proposal, you can just run it with more replicates. I just want to try to give you a feeling for how big the design space is for AI architectures that don't have the same fundamental scaling bottlenecks that transformers have. Okay, so for people who are interested in this, I recommend checking out a paper by Alex Lew, Marco Cusumano-Towner, and me, from UAI a couple of years ago, that gives a general mathematical framework for unifying variational and Monte Carlo inference in this recursive setting, and also another paper of Alex's, written with students in this class, on probabilistic programming with stochastic probabilities, that kind of shows you how this part of the Gen ecosystem and Gen stack works.
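(In symbols, the pseudo-marginal construction just described, with a one-line check of unbiasedness:)

```latex
p(z, x) = \int p(u, z, x)\, du,
\qquad
\hat{p}(z, x) = \frac{p(u, z, x)}{q(u \mid z, x)}, \quad u \sim q(\cdot \mid z, x)

% Unbiasedness:
\mathbb{E}_{u \sim q}\!\left[ \hat{p}(z, x) \right]
  = \int q(u \mid z, x)\, \frac{p(u, z, x)}{q(u \mid z, x)}\, du
  = \int p(u, z, x)\, du
  = p(z, x)

% If q(u | z, x) equals the exact conditional p(u | z, x), then
% \hat{p}(z, x) = p(z, x) identically: zero variance, exact integration.
```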
Okay, now let me talk a little bit about incremental computation. This problem I was showing you was a tracking problem, and of course you'd like this to run in real time; maybe you're getting sensor readings from some wireless transponder at 50 Hz, and you want the inference process to be fast enough that you're not dropping sensor readings. So another interesting feature of Gen is that it lets you do a little bit of work to optimize its incremental computation. If you just naively reran the generative program from scratch, then it scales quadratically in the number of sensor readings, because each time you get a new sensor reading, you have to re-score the whole history so far. But it turns out you can do a little optimization in Gen that'll turn this into something that scales linearly, with a constant factor that's sufficiently low that you can keep pace with a real-time data stream. I'm not going to spend too much time on the details, but the basic idea is to take your generative program and, instead of just writing it in raw Julia, write it in Gen's static DSL, which replaces control-flow constructs like a Julia for loop, highlighted here in red, with a combinator, the Unfold combinator, highlighted in red below. That's essentially a compiler hook that allows Gen to automate the incremental computation. In some ways it's kind of similar to what JAX programmers might be used to, and GenJAX actually goes quite a bit farther in this direction. But that turns the quadratic complexity into linear, and produces much more efficient implementations in practice. So hopefully you're now starting to get a feeling for how you could do modular performance engineering and inference engineering to explore a very broad class of inference architectures.
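(A sketch of that transformation: a per-time-step kernel in Gen's static DSL, chained with the Unfold combinator so that extending T by one re-scores only the new step. The helper `advance` is a hypothetical deterministic path follower; note that older Gen.jl versions also require calling Gen.@load_generated_functions() after defining static functions.)

```julia
using Gen

# One time step of the motion model, written in the static DSL so
# Gen can incrementalize it. The Markov state is the current position.
@gen (static) function step_kernel(t::Int, prev_pos, path)
    pos = advance(path, prev_pos)
    meas ~ broadcasted_normal(pos, fill(0.1, 2))
    return pos
end

# Unfold replaces the Julia for loop: chain T applications of the
# kernel, threading the position through as state.
chain = Gen.Unfold(step_kernel)

@gen (static) function motion_model(T::Int, init_pos, path)
    steps ~ chain(T, init_pos, path)
end
```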
Okay, so now, with all of this in hand, let's revisit this question of how to come up with good proposals. Previously I said, all right, you could just try to use deep learning: generate a bunch of traces, use them as labeled training data, make guesses. Here's another approach. Maybe we can roughly guess where the person's trying to go by looking at a very low-bit-precision, crude approximation to the problem. So let's just discretize the room, or the overall apartment, into, I guess, roughly 20 large block regions. Once we've done that, we can estimate the probabilities under our model of transitioning between any pair of them using, you know, just 1,000 training runs, and just use simple maximum likelihood estimation to calculate the transition probabilities under our model from one region to another, conditioned on the destination cell. So we estimate conditional probability tables for: given a current cell and a destination, what's the probability of the next cell? Like, given the current cell being room 10 and the destination being 1, probably it'll go to 9, maybe 6, almost certainly not 14, 15, 13, etc. Now this little model can be extracted into efficient graphical-model representations that can be highly compressed, and that you can solve exactly, essentially by dynamic programming. And once you've done that, you can use this little symbolic surrogate model to make very quick, rough inferences, like: if the person starts here and heads downward, probably they're going to one of these regions, though we're not exactly sure which one; or if the person goes all the way down through 11 and 17, heading towards 18, they're almost certainly just going to stop at 18, because there's no other reason to go that way. And in fact, with this approach, one can get really high-quality inferences with almost instantaneous training time, because you don't have to do gradient descent over a bunch of parameters of some large neural network. All you have to do is count a few transitions between cells of a less-than-five-bit discretization of the world, and then the dynamic program is also extremely fast: again, it's a five-bit state space. You could run it on a GPU, but this is all CPU-based; you can imagine highly efficient representations. So, you know, I'm highlighting this because we've done a lot of work in probabilistic programming at this point to have automation for the interactions between symbolic metaprogramming, calculus, and probability theory, and my goal in this lecture was just to start giving you a feeling for the expressive power that follows from that, and some interesting, maybe surprising, consequences. Like: if you want automated inference for fairly broad domains, you could try to train a neural net, or you could do a tiny bit of very generic logic for discretization and then get an exact algorithm for that symbolic surrogate, which is extremely fast to train and also much faster to keep up to date if the world changes. Because, actually, you can incrementally, locally update the surrogate: you don't have to update all the conditional probabilities when the model changes, only local ones, although it's actually fast enough that it's probably tractable to do that in this case.
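(A sketch of fitting that surrogate in plain Julia: count cell-to-cell transitions in simulated trajectories, conditioned on the destination cell. `simulate_trajectory`, a run of the full model, and `cell_of`, mapping a position to one of roughly 20 regions, are hypothetical stand-ins.)

```julia
const N_CELLS = 20

# counts[d, a, b]: transitions from cell a to cell b given destination d,
# initialized to 1 for Laplace smoothing.
counts = ones(N_CELLS, N_CELLS, N_CELLS)

for _ in 1:1000
    traj, dest = simulate_trajectory(floorplan)
    cells = map(cell_of, traj)
    d = cell_of(dest)
    for (a, b) in zip(cells[1:end-1], cells[2:end])
        counts[d, a, b] += 1
    end
end

# Conditional probability tables: P(next cell | current cell, destination),
# ready for exact dynamic programming over the 20-cell state space.
cpt = counts ./ sum(counts, dims=3)
```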
And just to be clear, this approach involves two layers of modeling and inference. There's the model of the apartment with the path planner, which we showed you the code for, and the Monte Carlo algorithm; then there's the surrogate model of that model, and the dynamic programming inference algorithm for it; and then, in fact, there's a learning procedure for estimating the surrogate model. So this is another kind of depth in the inference process.

Zooming out a little bit: we've covered a lot of ground, but the reason why I've done that is because I want to invite you to consider a possible future trajectory for scaling probabilistic computation in AI. Deep learning has been enormously productive, and told us, in some sense, what some of the most powerful uses of neural nets as a building material actually are. What I'm trying to argue is that what probabilistic programs enable is a kind of generative software engineering, which can use deep learning as components, but gives us much more modular control, and various kinds of paths to scale, in depth and breadth and efficiency and robustness, that aren't available with deep learning on its own.

Okay, so here's the attendance tracker, and I'm supposed to do a question of the day, right? Okay, so here's the question: what were probably the first probabilistic programming languages with some kind of programmable inference? And my answer is, there are two of them, I think, with sort of very different notions of "first." One of them is a language called Monty, and the other one is a language called LibBi; for the TAs, any capitalization is fine. Okay. Yeah, question? How old are these compared to, like, Venture or Anglican? LibBi is concurrent with, sorry, LibBi is concurrent with Venture; Anglican was built by a visitor in my lab after looking at Venture. Monty is, I think, the earliest, but LibBi has a bunch of interesting features, and Monty and LibBi are kind of orthogonal. Venture and Anglican are similar, with Venture the more expressive; LibBi was focused on scientific computing, and basically still is. I think it's good for what LibBi does, but, you know, Gen is substantially more expressive in most ways, although Venture, I think, still has some degrees of expressiveness
From the first lecture, I didn't quite understand your explanation about the robustness and generality trade-off. Is that out of scope today? Oh, I thought perhaps the example that you just illustrated could show it: I understand where the generality comes from, but I don't know where the robustness comes from.
Oh, that's interesting. Well, I can say, for this example: I will point out that this whole symbolic method, discretizing the domain, training an approximation, and then doing exact inference, has much easier-to-understand error properties than training a neural net. You know, if you want arbitrary accuracy in the surrogate, you can add more samples to calculate the conditional probabilities, which will converge, essentially by standard Monte Carlo arguments. And the exact inference process is exact. So this is one way to get a much more robust inference algorithm. And in fact, if you want to, you can even remove the Monte Carlo sampling by systematically enumerating current, start, and destination cells. So there are all kinds of ways that you can use this type of architecture to make an inference engine that has bounded failure probabilities, rather than just hoping to cover things by generating enough training data. Thank you.
Okay, so now let's zoom out a little bit. The examples I've shown you today were in the Julia version of Gen, which is available online. I don't remember whether this year you guys have class problem-set options in Gen or not; maybe you don't, or maybe it's an optional unit. But okay. We're currently working on other implementations of Gen that are embedded in JAX and C++. Here I'm just showing the runtime of a simple benchmark across different Gen implementations that are part of the OpenGen ecosystem; this is work by a nonprofit that partnered with MIT, which we helped get started with support from the Quest for Intelligence, and they're building these implementations in partnership with us. So Gen's dynamic language is significantly slower than the static language, which basically replaces control flow with combinators; it's a bit less expressive, but allows for much more compiler optimization. And GenJAX's performance is kind of interesting, for a very simple benchmark; GenJAX scales out on the GPU. GenTL is a low-level C++ template library that implements the basic ideas in generative functions that I showed you, and those operations, like update and importance, that we were using as building blocks for inference algorithms, with very low overhead over C++. So if you wanted to do memory-limited inference for embedded devices like smartphones or drones, it's actually potentially appealing. And I'm personally very excited about the possibility of real democratization of AI across billions of Android devices: the global Android user population being able to make use of much richer natural language and video understanding capabilities, if probabilistic programming platforms like Gen are made mature enough to be adopted for Android applications. We've had some very interesting discussions with the Gates Foundation, actually, about where that might fit in public health. Some of you may recall the expert system I showed in the last lecture, the conversational public health expert system that presented early results; that's the kind of thing we might like to have available to a couple billion people, without large server-side infrastructure for a growing user base.

Okay. So the reason we work on GenJAX is because we'd like to scale the Gen programming model to take advantage of the gains in massively parallel hardware and GPUs. Here's a reminder of this graph that I showed last time, showing how the number of cores has been increasing even as processor frequency has flattened, basically driven by fundamentals of device physics and semiconductor manufacturing. So we're now at a point where the programming model that I just described, with its recursive pseudo-marginal approximations and incremental computation, is available on a GPU form factor. So we can start to see what becomes possible as we start to scale out massively parallel compute, and I think that's going to be one of the next phases of research in the probabilistic computing project. And so, just to give you a little teaser, here's an example of automatic differentiation of expected values in action.
So what you're seeing here is an animation involving a probability distribution that's actually very closely related to the example I showed earlier, where there's a parameter theta that determines a coin that's being flipped, and if the coin lands heads, you get zero points, and if it lands tails, you get theta over two points. It's a noisier version of that, essentially, where the noise is something you can adjust with the slider, and we're showing what happens if you ask JAX to optimize that as a function of theta. Here you see the gradient estimates as a function of theta, and down on the bottom left, what you actually see are the optimization trajectories you get.
So, yeah: on the bottom right, you're seeing the expected value as it's being driven by the JAX optimization, in yellow, or by the correct automatic differentiation of expected values; in the top right, you're seeing the values that are being visited by the optimization, along with the estimated tangent line at each point in the optimization. So this is just a visual reminder of what can go wrong when you ask a differentiable programming or deep learning platform to deal with interactions with probability: you can get gradients that are biased, and that lead you to local minima instead of the maximum, whereas if you do correct automatic differentiation of expected values, then you get automated math that drives the answer to the right place. And so that's one of the types of automated math that's available.

Okay, that's the lecture for today. I just want to give you a preview: for next time and the final lecture, I'll basically focus on two topics. One of them is learning probabilistic programs from data, so a little bit about how that works. And the second is mapping sequential Monte Carlo for probabilistic programs onto brain computation, specifically cortical computation. So, with that, thank you very much.