Last time was recitations; that's fine, it's pretty self-explanatory. I'll be going over some related material today, and this week's recitations will also involve it. I think Thursday's recitation is going to be in a different room because of scheduling conflicts, but just look for announcements from the staff. Also, there's an interesting kind of AI event tonight. You might have seen this announced. It's related to something called the ARC Prize. ARC stands for Abstraction and Reasoning Corpus. It's a really interesting challenge problem for people interested in human-like thinking in machines, developed by a guy named François Chollet, who is best known to the broader world as the creator of Keras, the deep learning framework that really helped bring deep learning to the masses. He works as an engineer, but he has a sideline hobby of studying intelligence. And it's very interesting, because on the one hand he's been one of the leading developers of tools for deep learning, but he's clearer than almost anyone, I think, on the difference between what you can do with deep learning and machine learning and how we should really think about thinking. So he developed this thing called the ARC challenge a few years ago and released it to the world, saying, you know, there are all sorts of issues around benchmarks for intelligence and around making any one benchmark, but this is a pretty interesting, valid problem. And it really connects to the themes of this class in terms of one-shot or few-shot learning, and the idea of learning with rich program structures. If you're not familiar with it, this problem was recently revived. It had been a little bit dormant, because basically François put it out there, it was too hard for most of the standard approaches to AI, and so people worked on it a little and then put it aside.
But it's still a really good problem, and interest was revived by somebody announcing a million-dollar prize. I think it's one million-dollar prize, or maybe two half-million-dollar prizes, but some serious cash. And the organizers of this ARC Prize competition are on a little tour of universities, I think trying to get interest from students. They're visiting MIT today, and we're hosting them, if you want to go hear about this competition and why it's interesting. The financial incentive is a small part of it; it's really intellectually interesting, and I encourage you to check it out. That will be at five o'clock today, in room 203, I think, or whatever I wrote there, in Building 45, unless I messed that up, which is the College of Computing building right next door. Okay, so check that out if you're interested. So today we're going to be talking about languages for thinking probabilistically. We'll do a little bit on Bayesian networks and then move into probabilistic programs, hopefully giving some conceptual motivation for why we're interested in these kinds of languages and the aspects of thought they help us capture. And again, in the guest lecture a couple of weeks ago, you saw some hints and teases of the kinds of things you can do with modern probabilistic programming languages. Here we want to give, I think, a more basic introduction to what this is about and how it relates to the themes of this class in terms of modeling thinking as some kind of inference over richly structured representations. Up until this point, we've used some math, you know, Bayesian inference basically, and we've written some programs in the psets, but we haven't really focused on a formal language for probabilistic modeling.
But there are a number of reasons why we need to do this, which directly address some of the things that people found intriguing but maybe dissatisfying about some of our first models, like the number-game-type models, where we would just enumerate a list of hypotheses. What we present here are very general languages for expressing a wide range of models, languages that really are tools for thinking. They're tools for us, thinking as modelers about problems and data. But most importantly, they're tools that can represent the mind's own thinking processes. So these languages will help make clear the structure of a model, and they will give us tools for efficient and scalable learning and inference. Just to use these terms in what hopefully will be a standard way: by inference, we mean, when you write down a particular probabilistic model, effectively updating the prior to the posterior. By learning, we mean making inferences at a higher level. So in this class, learning is also a kind of inference. You might think of it as inference in a meta-program, a program that generates programs, or that generates a distribution over programs. It means making inferences about something in the model, rather than just using the model to make inferences about the world. So it might be inferences about numerical parameters or about some discrete structure. We want tools that let us do these kinds of things, and ultimately to give us hypotheses about what the representations of the world in our own minds are. That includes representations of the things in the world which are other minds, or maybe our own minds, and those representations are recursive. So we're going to be building up to languages which support recursive reasoning about reasoning, and that's absolutely essential.
If we're interested in anything like our intuitive theory of mind, there's no way we're going to be able to do that without the kinds of languages that are being developed. So there are a few different classes of languages you could talk about as languages for probabilistic modeling, going back a couple of decades. The standard languages were various versions of graphical models. These are probabilistic models defined on graphs: nodes and edges. The edges could be directed, like arrows, or they could be undirected. The directed edges represent some kind of causal dependency; we'll talk about those a bunch. In the undirected ones, the edges represent some kind of association. We'll spend just a little bit of this class today talking about graphical models, and by the end we'll be moving into these more general classes of things called probabilistic programs. What I normally do in class is mention undirected models but skip them, because they are the least cognitive. They're models which are really nice at capturing the relations between a bunch of variables, but the undirected part just captures associations. So they're good in some machine learning contexts, and they've been used in things like image processing, and they were the basis of some really interesting kinds of early learning systems in AI, things you might have heard of: Hopfield networks, or Boltzmann machines, or today's modern energy-based models and diffusion models. I decided to highlight them a little bit because just today, the Nobel Prize in Physics was awarded to two pioneers of AI, in particular the people who developed these undirected graphical models. So I don't know what that says about my choices in this class. And I realized, as I was modifying this slide, that I put Hopfield's name on there but misspelled it as "Hope field."
Boltzmann, of course, is a famous physicist who is no longer alive, so he can't win the Nobel Prize. But Boltzmann machines are a kind of undirected graphical model associated with Geoff Hinton, who was the other recipient of this year's Nobel Prize in Physics. And you know, even though I think these models have relatively little to tell us about cognition, I think it's fantastic that the Nobel Prize in Physics recognizes them, because they're beautiful examples of physics. They apply the math of statistical mechanics, and really helped develop it, in ways that laid the groundwork for a lot of what's going on right now, not just in neural networks but in machine learning and probabilistic modeling more generally: making the link between ways of representing structured probability distributions of some form and how to think about learning and inference in many different kinds of systems. So it's beautiful math, it has lots of applicability, and both Hopfield and Hinton, who received this prize, and I know Hopfield a little bit and Hinton very well, deserve it, whatever the Nobel Prize is supposed to recognize. So I think it's great to celebrate that. But in some ways, the fact that this is really good physics is also why it's kind of not great cognitive science or AI, because the whole theme of what we're trying to do when we study the mind is to understand what this particular kind of adaptive system is, as distinct from just, you know, a spin glass or a big statistical thermodynamic system, right? You might take that as a first-pass way to think about neural networks: using the same math that was developed for thinking about gases, for example, and how the properties of gases emerge from interactions between molecules.
There's a lot more structure to minds, and that's what the tools of this class and these other kinds of languages are going to be getting at.
Yeah, go ahead. "I wanted to ask about what you're saying, because I didn't understand why these models are not cognitive. Why is that?"
Well, unfortunately, I'm not going to give a spontaneous lecture on that. I can't really tell you without telling you how they work, and I'd rather tell you about some other models instead; I'm just explaining why I'm not focusing on them. But to try to say this one more time: these models represent interactions between nodes in a graph, or random variables, and they model them in an undirected, associative way. So they model a general field of how these things might depend on those other things, but they're not causal. They don't have some things which make other things happen, and they don't have abstraction; they don't represent knowledge about knowledge, effectively. The tools of Bayes nets, these directed graphical models, focus on causal, directed dependencies, as opposed to those undirected ones. So undirected models are, I would say, an unusual language for thinking, because the basic thing about our thinking is that we understand some things as giving rise to others, and that's crucial. That's also crucial if you want to do any kind of learning as inference. If you want to think about not just inference in a model, but a meta-model that generates the model, then learning the structure of the model is inference in a higher-level program, and that fundamentally requires a directed model. Those are the most basic things, and that's the difference between directed and undirected models. Then, when we go to probabilistic programs, that's where we get to things like recursion and abstraction, which are not present in the notion of a graph. At the same time, there are insights and things that are shared across all of these.
And I urge you, if you're interested in this, to take some of those classes; MIT has several classes on graphical models, where you could really learn the math and learn some of the insights. It's not that they have nothing to do with intelligence or cognition. It's just that, of everything in this probabilistic-language landscape, they're the things I think are the least cognitively relevant and the most like beautiful physics. All right, so the graphical models that we will talk about today are these directed graphs, so-called Bayes nets. This is just an old slide with a bunch of early applications of these models. I don't expect you to get anything from this other than to see a variety of graphs of different sizes, shapes, and colors, but you might get a sense of the areas where these have been applied. People have developed graphical or network models especially when they want to model the ways that some variables you might measure in the world depend on others. For example, people interested in predicting the stock market, or just making better investments, might model how the movements of various stocks or other securities you could buy and sell depend on other factors you could measure in the economy. There are also a lot of applications in biology, computational biology, and bioengineering, where you might want to understand processes that go on inside cells. These models are extremely popular. Now, the idea of a Bayesian network: I'll give what will be a bit of a hurried, one-slide introduction, and I want to do it in a way that draws out some themes that we'll then build on with probabilistic programs.
You can think of a Bayesian network as a way to represent a joint distribution on a large set of variables, although on these slides the large set will be rather small; you can scale it up. The joint distribution might be unconditional, or we sometimes might talk about marginal probabilities, but it'll be most interesting when we're doing inference: when we condition on observing the values of some of the variables and make inferences about the others. The graphical structure captures dependencies between variables that let us efficiently reason from the variables we condition on to the others. A simple Bayesian analysis, like what we did in the number game, where you have a hypothesis and data, you put a prior on hypotheses, you have a likelihood function, and then you condition on the data and make inferences about hypotheses, corresponds to a really simple Bayesian network with two nodes, H and D, and an arrow from H to D; you'll see in a second why that's true. But here we're going to be thinking, in general, about graphs with more than two nodes and more interesting kinds of dependencies. Bayesian networks, or directed models, provide a very nice way of thinking about and formally modeling certain aspects of causality, as I've said. Though often when they're introduced, especially in an engineering class, either the word causality is not mentioned, or it's mentioned only to say: you might think these are causal, but they aren't, or don't worry about that, don't think about that. There's a long history of reasons why people who've developed Bayes nets have said that. In this class, that's not the philosophy we're going to follow. I don't want to get into philosophical debates about what causality really means, but I am going to bring in some intuitions.
I think people say that to avoid those philosophical debates, as well as to avoid bringing in bad intuitions or baggage that some people might have, or unhelpful, distracting ones; that's why people in machine learning or other engineering classes often say, don't think of these as causal. There are other people, though, like Judea Pearl, another very important person in this field, who is often considered one of the founders of the probabilistic approach to AI and specifically of Bayesian networks; that was his history with these directed graphical models. He developed them mostly in the 1980s, and then, going through the later 90s and 2000s, he really leaned into the causal interpretation and developed an additional layer of mathematics, which he's best known for these days, of causal modeling that builds on the foundations of Bayes nets. We're not going to go into that too much here, although if you're interested I can point you in that direction. But for the purposes of what we're going to be doing in this class today, I will point informally to what I think of as helpful causal intuitions. The formal basis, where you don't have to say anything about causality or dependency, comes most basically from the way conditional probability is defined and the relation between conditional probability and joint probability. You can think of the Bayes net as a formal way of representing a complex joint distribution in terms of a bunch of simpler conditional pieces. What makes it usefully simple, as opposed to unhelpfully not simplifying, is, intuitively, when there's some actual structure in the world that the directed dependencies of the Bayes net can capture. So formally, we have a set of variables that we're going to define the joint distribution over; that's the set x1 through xn here. Those will be the nodes of a graph.
And then we have a set of directed edges over those nodes; that's this directed acyclic graph. It's important for the formal definition, which we'll see in a second, that the graph is acyclic. Let's say you had three variables like this. You could have arcs like that, and you could also have arcs like this, which might look like there's a cycle: there's a loop in the undirected graph, but there's no directed cycle, right? There's no directed path you can follow that loops back, as opposed to, say, that sort of directed graph that does have a cycle; that would not be allowed. The reason it's not allowed is, in a simple sense, that if you look at the definition we're going to see in a second of how the directed graph represents a joint distribution, a cyclic graph wouldn't correspond to a well-defined distribution according to that definition.
Many people will ask: well, if these arrows are supposed to represent some kind of causal dependencies, what about cyclic processes? We know many kinds of dynamical systems involve one thing that causes another thing, which then causes maybe some other thing, which comes back and causes the first, right? There are lots of feedback cycles like that in the world, and the way that's usually handled in this language is by unrolling in time. So suppose we had what we thought of, conceptually, as causal relations between three variables that you might represent cyclically. If we wanted to capture something like that in what's called a dynamic Bayesian network, you might do something like this: draw the three variables at different time slices and say, well, this one causes this one at the next time step, this one causes that one at the next time step, and this one causes the first one back at the time step after that. Each of these columns corresponds to a different time slice, and if you keep going in time with a picture like this, you just iterate. What you have is a way to unroll a cycle of causal dependencies in time, while in the underlying graph there are still no directed cycles: there's no way to follow through the graph and come back to where you started, which, again, would be a violation of the definition. I'm just pointing you in that direction, because if you're thinking ahead and wondering how that works, the answer is what's called a dynamic Bayesian network, unrolling in time. But let's go back to the most basic definition.
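As a quick aside, the unrolling idea is easy to check mechanically. Here is a small Python sketch, my own illustration rather than anything from the lecture: we take a conceptually cyclic dependency among three hypothetical variables A, B, C and unroll it over time slices, then verify that the unrolled graph really has no directed cycle.

```python
# Unroll a conceptually cyclic dependency A -> B -> C -> A over time slices,
# then verify the unrolled graph is a DAG (hypothetical example, not course code).

def unroll(cyclic_parents, T):
    """cyclic_parents maps each variable to the variables that cause it
    one time step earlier; returns the parents of each (var, t) node."""
    parents = {}
    for t in range(T):
        for var, pars in cyclic_parents.items():
            # at t=0 the variables have no parents (they come from a prior)
            parents[(var, t)] = [(p, t - 1) for p in pars] if t > 0 else []
    return parents

def is_dag(parents):
    """Topological-sort check: repeatedly remove nodes whose parents are gone."""
    remaining = set(parents)
    changed = True
    while changed:
        changed = False
        for node in list(remaining):
            if all(p not in remaining for p in parents[node]):
                remaining.remove(node)
                changed = True
    return not remaining  # empty iff there was no directed cycle

cyclic = {"A": ["C"], "B": ["A"], "C": ["B"]}  # a feedback loop
dbn = unroll(cyclic, T=4)
print(is_dag(dbn))  # prints True
```

The same check on a graph with a genuine directed cycle, like two variables pointing at each other within one time slice, would return False.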
What is this graph representing? The edges present in the graph represent a particular factorization of the joint distribution, shown in the math here, where the parents of a node, that function, is telling you which nodes send directed arrows into a node. So the parents are anything that points to a node xi here. The definition of the Bayes net as a joint distribution model is just: take the product, over all nodes or all variables, of the conditional probability of each variable given its parent set. Where this comes from is something hopefully familiar from a first class in probability, sometimes called the chain rule of probability, which says that for any set of variables, say the joint distribution on four variables x1 through x4, I can always write it as the product of four conditional distributions, each variable conditioned on the ones before it in some order. For example, I could write this as the probability of x1 on its own, times the probability of x2 given x1, times the probability of x3 given x1 and x2, times the probability of x4 given x1, x2, and x3. That kind of decomposition works for any number of variables. You can think of it as generalizing what you're familiar with for two variables: the joint distribution of x1 and x2 can be written as p of x1 times p of x2 given x1. Of course, as you also know, I can flip that around; the ordering is completely arbitrary. I could also write it as p of x2 times p of x1 given x2, and that applies for any number of variables.
So I wrote one particular way of factorizing that joint distribution on four variables as a product of four terms, where each term conditions on all the variables before it.
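In symbols, the chain-rule decomposition just described, together with the general Bayes net definition from the slide for comparison, is:

```latex
% chain rule on four variables, in the ordering used above:
p(x_1, x_2, x_3, x_4) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2)\, p(x_4 \mid x_1, x_2, x_3)

% general Bayes net factorization (product over nodes, conditioned on parents):
p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p\bigl(x_i \mid \mathrm{parents}(x_i)\bigr)
```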
"Do you need the variables to be mutually independent of each other?" No: this is a completely general statement that handles all patterns of dependence. The point is that when there are some interesting patterns of independence, or what are called conditional independences, then you'll get a more interesting graph.
So in this case, this factorization of four variables, x1, x2, x3, x4, corresponds to a directed graph that is a complete graph: everything is connected to everything else by some directed edge. In particular, it's convenient to write it as a sort of chain, x1, x2, x3, x4 (sorry, my handwriting is terrible), and I draw the following graph.
Notice how in this graph x1 doesn't depend on anything; it has no parents. x2 just depends on x1; x3 depends on x1 and x2; and x4 depends on everything: on x1, x2, and x3, right? It's a complete graph, but it doesn't have any directed cycles; it only has cycles in the undirected sense. As you can see from the definition here, if I write out a product of terms with the dependencies based on the parent sets, I get this expression. But suppose I drop some edges from this graph. For example, suppose I erase the edge from x1 to x4, so x1 is no longer a parent of x4. In some sense this is saying: there's still a dependency between x1 and x4, but it's mediated by the other variables. That would correspond to dropping x1 from the last term in this expression. Does that make sense? It's just a concrete example of what's going on in that equation. In general, when we have a really complex set of lots of variables, what we hope is that there'll be some structure: not everything will depend on everything else, maybe there's some locality, maybe each thing depends on only a few things. That can be represented in a sparser graph, which corresponds to a factorization of the big joint distribution into a bunch of smaller terms. Now, one place where we can quantify what we're simplifying, what we're sparsifying, here: suppose we restrict ourselves to binary variables. So we have n variables, and two to the n possible states of the world that we're talking about, and we want to specify a joint distribution over all of those variables.
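Concretely, erasing the edge from x1 to x4 changes only the last factor of the chain-rule expression:

```latex
p(x_1, x_2, x_3, x_4) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2)\, p(x_4 \mid x_2, x_3)
```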
So one way to think about what this directed graph is buying us here is to say: we have n binary variables, so two to the n states of the world. To specify the full joint distribution, with no simplification and no independence assumptions, we need two to the n minus one numbers: two to the n because we need to directly specify a probability for each of the two to the n possible settings of those n binary variables, but minus one because the probabilities have to sum to one, right? Those numbers are not all independent. So if I have four binary variables, there are 16 possible worlds, and I need 15 numbers: probabilities between zero and one, constrained to sum to one. If I specify any 15 numbers between zero and one, then the 16th is just one minus the sum of all the rest. That's one way of specifying the joint distribution. What we'll see when we simplify a distribution using the graph is that we reduce the number of parameters we need; that's the most minimal way this language buys us some representational power. It might be helpful to see what happens when we have the full joint distribution, which corresponds to the full directed graph; I'm going to put x1 back into this. One way to see that the full graph is not buying us anything in terms of simplification is to ask: how many numbers do we need to represent the joint distribution, not by enumerating over all possible worlds, but using these terms here? For a binary variable, p of x1 here, we basically need one number, to specify the probability of that variable being true. So that's one number that we need here.
For this next term, we need two numbers: one to specify the probability of x2 given that x1 is in one of its states, say true, and another for x1 being in the other state, false. So that's going to require two numbers. This one is going to require four numbers, because there are four possible settings of the parents x1 and x2, and for each of those four settings we need a number for the probability of x3. And then the last one requires eight: in general, two to the number of parents. One plus two plus four plus eight equals 15, right? And as you know, if we add up powers of two like this, the sum of the first n powers of two, starting from one, is equal to two to the n minus one. So those are just two different ways of counting the number of degrees of freedom in a joint distribution on n variables, and all this says is that the fully connected Bayesian network doesn't buy us anything. But let's see what happens with a very standard, simple, maybe intuitive example from a medical domain, where we can see how this might buy us a little bit. Now I have four nodes that I'm labeling with medical conditions: tuberculosis, flu, coughing, and sneezing. We might intuitively start to put all sorts of structure on these. You might say, well, two of these we know to be diseases and two we know to be symptoms, and if our arrows are going to represent something like causal dependency, we might want them to go from diseases to symptoms. I encourage you to think that way, but right now we'll put on some arrows that represent just a more minimal sort of direct dependency, in the sense of which parents, which variables, we need to track in order to define the probabilities of the other variables.
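As a sanity check on this counting, here is a small Python sketch; it's my own illustration, not course code, and the `num_parameters` helper and variable names are hypothetical, but the two parent structures match the graphs just discussed.

```python
# Count the numbers (CPT entries) needed to parameterize a Bayes net of
# binary variables: each node needs 2^(number of parents) probabilities.

def num_parameters(parents):
    return sum(2 ** len(p) for p in parents.values())

# Fully connected graph on x1..x4 (each node's parents = all earlier nodes):
full = {"x1": [], "x2": ["x1"], "x3": ["x1", "x2"], "x4": ["x1", "x2", "x3"]}
print(num_parameters(full))  # 1 + 2 + 4 + 8 = 15, same as 2^4 - 1

# The sparser medical network from the lecture:
medical = {"TB": [], "flu": [], "sneezing": ["flu"], "coughing": ["TB", "flu"]}
print(num_parameters(medical))  # 1 + 1 + 2 + 4 = 8
```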
If you want to think causally, we can say: coughing is something that could be caused by either TB or flu, so both of those are possible things that could generate it. Sneezing, maybe, is something that could be caused by flu but not by TB, so there's only the one arrow like that. And again, if you want to lean into the causal interpretation, you might think of this, as I sometimes will, from the God's-eye view of the world: somebody's creating the world, and you have to think, how do I generate this world? What choices do I have to make? That's a causal way to think of it. You might think: first, I make choices about the variables that don't have any parents; those are just chosen from some prior distribution. Then I make the choices for the other variables, which depend on those. We'll say more about that later today and next time, but at this point, just by specifying the parent set for each node, what we've specified is a simpler way to parameterize the full joint distribution, in terms of some tables like this. This is the same kind of counting I was doing there, but in this slightly sparser graph. The variables that don't have any parents just require one number each, to specify the probability of that variable being true a priori. That's kind of like our prior in the simplest H-to-D Bayes net. You might think of those as, I don't know, the base rates of having these diseases. Although I think 0.1 is kind of high for tuberculosis and 0.2 is high for flu, those are the numbers we put in. Then for sneezing, which has one parent, flu, we need two numbers to specify the probability of sneezing being true: one given that the parent is true, and the other given that the parent is false.
That's this so-called conditional probability table up here. When you have a variable, coughing, that has two parents, that's going to require four numbers, to specify the probability of coughing for each of the four possible configurations of the parents. Okay, so crucially, this might be a little confusing, because you look at 0.8 and 0.2 and they add up to one, but it's not that these numbers have to add up to one.
What adds up to one?
Yeah, so these numbers add up to one, but that's just an accident. The probabilities that have to add up to one are the probabilities of sneezing being true versus false, for a given value of the parent. Sometimes we might write 0.8, 0.2 to show the two different values of sneezing. For coughing it's easier to see: we're specifying the probability that coughing is true. Again, these numbers have an intuitive basis; we just made them up. But we're saying: if both of the things that can cause it are present, then it's very likely that you're going to be coughing. So if both tuberculosis and flu are true, then coughing is true with probability 0.9. If only one of the causes is present, then maybe you're still likely to cough, but with a little bit lower probability; that's the 0.8 or 0.75 there. If neither of the causes is true, you still might cough sometimes; that's the 0.1 probability. Hopefully that seems intuitive as a way of thinking about how the probability of one variable depends on the others. Coughing doesn't in any way directly depend on sneezing. You might say, if you know that somebody's sneezing, maybe it's more likely that they're coughing. But the way the Bayes net captures that is through the indirect connection: effectively, sneezing makes it more likely that somebody has some underlying illness, like flu or a cold or some other respiratory illness, and because those things can also cause coughing, that inferentially raises the probability of coughing. But there's not a direct link between them; rather, that link is mediated by the underlying causes. Again, that's the causal interpretation, but the formal definition of the Bayes net is just asserting that the joint probability over these four binary variables, over possible patients, is determined not by 15 numbers but by four plus two plus one plus one, or eight, numbers.
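To see that indirect effect numerically, here is a hedged sketch in Python: it enumerates all sixteen worlds using the conditional probability table numbers from the slide (0.1, 0.2, 0.8/0.2, and 0.9/0.8/0.75/0.1) and computes how much observing sneezing raises the probability of coughing. Only the numbers come from the lecture; the function names are my own.

```python
# Enumerate all joint settings of (TB, flu, sneezing, coughing) using the
# CPTs from the lecture, then condition on sneezing to see it raise the
# probability of coughing via the shared cause (flu).
from itertools import product

P_TB, P_FLU = 0.1, 0.2
P_SNEEZE = {True: 0.8, False: 0.2}                      # given flu
P_COUGH = {(True, True): 0.9, (True, False): 0.8,       # given (TB, flu)
           (False, True): 0.75, (False, False): 0.1}

def joint(tb, flu, sneeze, cough):
    p = (P_TB if tb else 1 - P_TB) * (P_FLU if flu else 1 - P_FLU)
    p *= P_SNEEZE[flu] if sneeze else 1 - P_SNEEZE[flu]
    p *= P_COUGH[(tb, flu)] if cough else 1 - P_COUGH[(tb, flu)]
    return p

def prob(pred):
    return sum(joint(*w) for w in product([True, False], repeat=4)
               if pred(*w))

p_cough = prob(lambda tb, flu, sn, c: c)
p_cough_given_sneeze = (prob(lambda tb, flu, sn, c: c and sn)
                        / prob(lambda tb, flu, sn, c: sn))
print(round(p_cough, 4))                # 0.289
print(round(p_cough_given_sneeze, 4))  # 0.4675
```

So observing sneezing raises the probability of coughing from about 0.29 to about 0.47, even though there is no direct edge between the two symptoms; the effect flows entirely through the shared cause, flu.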
So I've simplified, by a factor of two or so, from 15 numbers down to eight numbers, and in some ways made my life easier. Does that make sense, that this is what the formal meaning of these models is? And hopefully you can start to see why: the more I have a really complex system in which I can carve out a sparse set of causal dependencies, the more the large number of numbers I might need over all possible worlds gets reduced to just the ones I need to specify for each variable conditioned on its parents. And whether what we want to do is inference, which means, for example, observing that somebody's sneezing and inferring how likely they are to be coughing, or inferring from coughing how likely they are to have some disease, TB versus some other disease, or whether we want to do learning, which means making inferences about the structure of the model itself, all of the algorithms we want to run will be simpler and more efficient, both in compute time and in data if we're doing some kind of learning process. The efficiency comes from the sparsity. Now, you could take a whole class on algorithms for inference, and I mentioned the graphical models classes over in course 6, which are good ways to learn about these. What I'm summarizing on this slide is a range of standard ways to take a fixed graphical model, where we know the structure and the parameters, everything I just put on the past slide, and have general recipes for inference, which means conditioning on one or more variables having certain values and making inferences about the probability distributions of the remaining variables.
Many of these algorithms have really nice properties, especially the message-passing, or sum-product, type algorithms, which are especially efficient if the graph isn't just sparse but has a tree structure, or nearly a tree structure. By a tree here we usually mean a tree in the undirected graph. So a graph like this, which is singly connected, would count as a tree-structured model for these purposes. In general, when people develop these graphical models, they like ones where there is only, or close to only, one path in the undirected graph between any two variables. That's often the heart of making these things work really efficiently, and if the models are tree-structured, there are very efficient dynamic-programming-like algorithms for inference. In this class, the algorithms we'll emphasize are the Monte Carlo, sampling-based ones. You saw some of those from Vikash. WebPPL, the probabilistic programming language in the browser that we use with the ProbMods web book, also implements a range of sampling-based inference algorithms, and we'll see some of their properties, both nice and less appealing. In general, the reasons we use those methods in this class are several. One is that they are completely general: you can apply them to any well-defined Bayesian network or any well-defined probabilistic program. They are also, in many cases, very cognitively appealing, both in their dynamics and in their results. They seem to capture both some of the dynamics of inference in our brains and minds and the results, which often means their speed-accuracy tradeoffs. That is, the ways in which we do not make fully perfect Bayesian inferences, but make pretty good shortcuts, can often be captured by the right kind of approximate sampling.
One last class of approach, which has been very popular in machine learning, is what's sometimes called amortized inference, or learned inference: basically using the graphical model to generate training data for a neural network, drawing many samples from the model, like imagining possible worlds, and training a function, often a neural network though it doesn't have to be, to map from some of those variables to some of the other variables. This is the kind of thing that, for example, transformers are quite good at, or other kinds of generic function approximators. We won't use those methods too much in this class, although we'll see a couple of ways in which they've been applied to model aspects of perception in the brain. I don't want to say don't be interested in them, but we'll tend to emphasize the more sampling-based approaches, because I think they're actually more brain-like and much more cognitively flexible. The thing about the amortized approaches is that they're also very general: unlike, say, the message-passing algorithms, they can be applied to any structure. But they require a lot of offline training, which is fine if you have a fixed structure and lots of time. In cognition, though, and in some sense this is almost, as some people have actually proposed, a difference between perception and cognition: in perception, your visual system solves the same problem over and over, every single time you open your eyes, so in some sense there's a standard problem to amortize. Whereas in cognition, with structured models, we're often thinking with models we never thought of before until someone told us something, and so we want inference algorithms that can support that.
So this is just a little background on the kinds of things you can do in inference and why we're going to be focusing specifically on sampling-based inference. But if you're interested in these others, especially if you're interested in perception, you might want to check out learned inference; there are possible projects that could be done there, and you'll see a couple of examples.
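The simplest sampling-based inference method mentioned above can be sketched in a few lines: rejection sampling, which simulates many possible worlds and keeps only the ones that match the observation. This is a Python sketch, not the WebPPL code from the class; the cough table is the lecture's, while the disease priors are illustrative placeholders.

```python
import random

# Estimate P(flu | cough = True) by rejection sampling:
# simulate patients, keep only worlds where the observation holds.
P_TB, P_FLU = 0.02, 0.10          # placeholder priors (not from lecture)
P_COUGH = {(True, True): 0.90, (True, False): 0.80,
           (False, True): 0.75, (False, False): 0.10}

def infer_flu_given_cough(n_samples=200_000, seed=0):
    rng = random.Random(seed)
    kept = flu_count = 0
    for _ in range(n_samples):
        tb = rng.random() < P_TB
        flu = rng.random() < P_FLU
        cough = rng.random() < P_COUGH[(tb, flu)]
        if cough:                  # condition on the observation
            kept += 1
            flu_count += flu
    return flu_count / kept        # Monte Carlo posterior estimate
```

This is exactly the sense in which these methods are completely general: nothing in the loop depends on the model being small or tree-structured, only on being able to run the generative process forward.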
The graph you drew from hypothesis to data, that was very helpful to understand. Yes, could you walk me through how the size principle and the conservation of belief can be interpreted in this?
examples right now. But in some sense, those are all ways to think about the internal structure of what's going on. I mean, that graph of H and D, I used that to tell you about Bayes networks, but it doesn't really tell you very much about our models, because all the interesting stuff is hidden inside the node, right? And so the size principle and those things are ways of making sense of what goes on when you have this really thick node, which has lots of different possible hypotheses inside it. And it's just, you know,
sort of, I'll be expecting more from you later.
Maybe another way to put it is, and again, I think this is part of why we're interested in probabilistic programs rather than just a graphical model: in the Bayes net picture for the number game, there's no interesting structure inside the node; the node is just as fine-grained as we get. It just has some possible values, which could be, like, all the subsets of the numbers up to 100. One of many reasons to have a probabilistic program is so that we have much more fine-grained structure. Instead of just writing down a node that corresponds to H, we can actually write down what's inside of it, and what's inside of it might be a program that generates programs, for example, and then all the interesting structure in there can be used, basically, to evaluate the things that the size principle is telling us. Would you say
the probabilistic program is more dynamic compared to the Bayes net? Well,
in this case, it's just more fine-grained, right? In this class, especially for lecture purposes, I'll often put up these graphs that are basically Bayes nets, with really thick arrows and nodes, and then we'll go inside and see what's actually going on, where a program is going to be a better way of capturing it than just a big table. Effectively, in the language of Bayes nets, all you can say about the way one variable depends on another is a table. So it's better than having one table for all possible worlds; it's having a bunch of smaller tables. But if we want to go beyond that, that's one of the many reasons to move to the program representation.
Thank you.
So in this example, we saw a simplification from 15 numbers to eight numbers. And if you apply any of these standard algorithms, or kinds of algorithms, to doing inference, like observing the value of x1 and making an inference about x4, or observing x1 and x2, like coughing and sneezing, and making inferences about TB and flu, they will all benefit from the fact that the model is simpler than the full joint distribution, but not very much, because the model is still very small. It's so small that it's trivial; you can just do it all by hand, or enumerate all 16 possible worlds. Where things get more interesting, again, is when the model gets bigger. So I'll show you one classic example from Bayes nets. This is one of the best-known Bayes nets, from when these tools were originally developed and had a lot of impact in what have sometimes been called expert systems: models of reasoning expertise in some domain, like, again, a medical domain. Medicine continues to be a place where AI of all stripes is widely used; many tools that your doctor might use, or that people might use when developing or trying out new kinds of therapies, build on the ideas that we're talking about here. So let me give this example of what was called the QMR network, where QMR stands for Quick Medical Reference. It came from, well, if you've ever done any sort of medical training, you might know that one of the things physicians learn from, or are often given, is big tables or handbooks that are kind of guides for diagnosis.
I mean, nowadays it's all just digested into ChatGPT, but it's still the case that when you're learning to be a doctor, you will learn about many diseases and their symptoms, as well as what's sometimes called differential diagnosis: which symptoms tend to go with which diseases, and what you look for if you want to tell those apart. So you want to say, okay, this symptom could go with these diseases, this disease and that one, but then I should look at these other symptoms that will tell me which of the different diseases that could cause the symptom is the actual cause. So the QMR network was an attempt to take a physician's handbook, a quick medical reference for a number of common diseases and symptoms, and turn it into a probabilistic model. In particular, in this network there were about 4,000 symptoms, sometimes called findings. These are variables representing things that a patient could present to their doctor, or they could be medical tests, but I think they were all binary, so things that are true or false of a potential patient. And there were 600 or so diseases or conditions. So it's a pretty large model, in the sense that if these are all binary variables, there are about 4,600 binary variables, which means there are two to the 4,600 possible states the world could be in, or possible things a patient could present to you. And that's a lot of numbers to keep track of. If you were to try to do exhaustive, exact inference in that full joint distribution, it would be intractable. But there were nice forms of structure that made it tractable. The first nice part is the bipartite structure of the directed graph. You can see that up here, capturing what I think is a very widely shared intuition, even for those of us who aren't doctors: the arrows always go from one set of nodes to another.
So bipartite means that the graph isn't fully connected by the arrows; rather, the parent of each arrow is in one set, and the child is in a different set. And what that allows you to do is factorize the joint distribution, which is written down here in this language from the original QMR network, using m to represent diseases and s to represent symptoms. We can represent the entire thing in terms of a distribution on m and a distribution on s given m. So this is showing you a little bit of internal structure, sort of a meta-network; let me see if I can draw that here. What this factorization represents is something like saying M and S, where this M corresponds to the two to the 600 sets of possible diseases you could have, and the S corresponds to the two to the 4,000 sets of possible symptoms. Already, that's a huge space to deal with. But with this factorization, I'm saying I can represent the full joint distribution as one of these products, and then I can look inside and start to see the kind of structure you're looking for: well, actually, inside this there's a bunch of separate variables representing each of the different diseases. You could have more than one, though usually you're not going to have very many, and you're also not going to have very many symptoms, and each symptom depends on only some of the diseases. Those are the kinds of simplifications this model is giving. But the first step is to simplify into just a bunch of independent terms on the diseases. Under this model, the diseases, by not having any parents, are modeled as all being independent a priori, right? Unconditionally, or marginally, all of the diseases are independent.
That means, coming into the doctor's office, you could have one disease, you could have multiple diseases, but there's some independent probability that a patient could have any one of these diseases. All these assumptions are, of course, oversimplified. For example, we know there are conditions that affect the immune system, which make it more likely that you're going to get other diseases, but not in this model. So we first simplify to having an independent prior on each of the diseases. And then we also have only the dependencies between the s, or symptom, variables and the diseases, and that also breaks down into a bunch of independent terms, where each symptom depends only on the diseases that are its parents. If we count this up, then roughly, instead of two to the 4,600 numbers, we have about 600 numbers for the prior on each of the diseases. And then the way I'm counting it here: there are about 4,000 terms, one for each of the symptoms, times two to the k, where k is the number of parents for that symptom. Now, this imagines the simplification that every symptom depends on only k possible diseases. In fact, it's not exactly like that. Some symptoms, like a cough, could be caused by many things. Other symptoms, like, I don't know, red splotches in your inner armpits or something, I don't know if there's any disease that causes that, but probably not as many as cause coughing, right? And lots of other more technical things, like some elevated level of some particular chemical you can measure in the blood: some of those are very general markers of inflammation; others are very specific markers of a small number of certain kinds of diseases.
But let's just say you could think of k as representing the typical, or maybe the maximum, number of parents; let's just take a rough order of magnitude. If each symptom has maybe 10 or 20 things that could cause it, then we have 4,000 times two to the 10. Now, two to the 10 is a big number; it's roughly 1,000. And 4,000 times 1,000 is a big number; it's about 4 million, right? And 4 million numbers plus another 600 numbers is a lot of numbers, but it's a lot, a lot fewer than two to the 4,600. So you can see how we're simplifying. Now, this is the factorization, the simplification, that comes from the bipartite structure of this graph and from the fact that it's pretty sparse; namely, each node in the symptom part, the effect part, depends only on a small number, k, of parents in the cause part of the graph. The actual QMR network made an additional simplification, which corresponds to something called a noisy-OR. I'm not putting up the math for the noisy-OR; you can see it in the ProbMods web book, or I can write it down. It's quite intuitive, and it's a very standard construct in a lot of the Bayesian network literature, and also, in some form, in the probabilistic programming literature. There are different ways to think of the noisy-OR. One way is this: you have a variable which is an effect of multiple causes, so it has multiple parents, and again we're in the setting where all the variables are binary, so they're either present or absent. The noisy-OR says an effect will occur if one or more of its possible causes, or parents, makes it happen. So it's a kind of causal power theory, as it's sometimes called,
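The back-of-envelope counts above are worth writing out explicitly. A small arithmetic sketch, using the lecture's round numbers (600 diseases, 4,000 symptoms, roughly k = 10 parents per symptom):

```python
# Parameter counts for the QMR example, using the lecture's rough numbers.
n_diseases, n_symptoms, k = 600, 4000, 10

full_joint = 2 ** (n_diseases + n_symptoms)   # unstructured joint: 2^4600 entries
factored = n_diseases + n_symptoms * 2 ** k   # bipartite Bayes net with CPT tables
noisy_or = n_diseases + n_symptoms * k        # one causal strength per edge
                                              # (a leak term adds one more per symptom)

print(factored)   # 600 + 4000 * 1024 = 4,096,600 -- about 4 million
print(noisy_or)   # 600 + 40,000 = 40,600
```

The last line anticipates the noisy-OR simplification discussed next: once each edge needs only a single causal strength, the count drops from millions of table entries to tens of thousands of weights.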
or in other words: for any one of the parent nodes, if that node is on, so if that potential cause is present, then it has an independent probability of making the effect happen. It doesn't always happen; that's why it's a noisy OR. You can see how, in the limit where those independent probabilities are probability one, or deterministic, this is just a logical OR, right? The logical OR says you have a certain symptom if one or more of the diseases that cause it are present. But we allow here for the fact that these things are just probabilistic. That's reasonable: there's a lot of complex biophysical, biochemical, physiological stuff going on in the body, so we don't require all the causal links to be deterministic. Rather, we say every cause has some probability between zero and one of causing the effect, and the effect happens if one or more of the things that could cause it do cause it. So you can think of it as flipping a bunch of independent coins, one for each parent node that is on, or present, where the weight of the coin corresponds to that causal strength. Sometimes in the noisy-OR there's also what's called a leak probability, which is just the probability that the effect could happen anyway. If you remember, the example I had is not exactly a noisy-OR, but it roughly has that property, right? The leak probability is the probability 0.1 that says, well, you could be coughing even if neither of the parents is present. That's often put into these models because, again, they're just probabilistic summaries of a lot of more complicated things going on; like any other model, they're simplifications of reality. But then each of the two parents has some reasonable probability of causing the thing.
And so the numbers in the second and third rows go way above the 0.1 baseline, and then there's an even higher probability of the effect happening if both of the causes are present, because each one has an independent chance of causing it. Okay, yeah,
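The coin-flipping description of the noisy-OR above can be sketched directly. This is a generic Python sketch of the construct (not the QMR code); the weights in the test are illustrative, and the closed form is the standard one: the effect fails only if the leak and every active cause all fail independently.

```python
import random

def noisy_or_sample(parent_states, weights, leak=0.1, rng=random):
    """Sample a binary child under a noisy-OR.

    Each active parent independently 'tries' to cause the effect with
    its own causal strength; the leak lets the effect occur anyway.
    """
    if rng.random() < leak:
        return True
    return any(state and rng.random() < w
               for state, w in zip(parent_states, weights))

def noisy_or_prob(parent_states, weights, leak=0.1):
    """Closed form: P(effect) = 1 - (1 - leak) * prod over active parents of (1 - w_i)."""
    p_none = 1 - leak
    for state, w in zip(parent_states, weights):
        if state:
            p_none *= (1 - w)
    return 1 - p_none
```

Note the economy this buys: one strength per cause-effect edge (plus one leak per effect), rather than a full table with an entry for every combination of parents.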
the one after this, yeah. Can you explain, no, can you explain why the indices for the conditional and the prior are different, and, like, where, what, how? A
little bit of an abuse of notation, so that could be, yeah. But basically, in this case, the product over j is just the product over the j diseases, and then I'm writing down the product over the i symptoms, conditioned on the full set of diseases: P(s_i | m), where m is the full set of diseases. But then the directed graph says, well, actually, it's simpler than that, because each s_i depends only on a small number of the m's. Does that make sense? Those are its particular parents.
So you're saying J is the number of it's just indexes, and then I's Number
No, no. What this product here really is: I just took the definition of the Bayes network, the product over all variables, and broke that product into two separate products, which you can do with any product, one over one set of variables and one over the other. The one on the right is the set of variables that have no parents at all; that's the diseases. And then there are the ones that do have some parents, and all their parents are in the other subset. Does that make sense? Any graphical model, anything I can fit with a sparse directed graph, or a bipartite directed graph, I could also fit with a denser one, where each s_i has all the m's as its parents. If I did that, then when I counted the number of parameters, it wouldn't be 4,000 times two to the k; it would be 4,000 times two to the 600. That would be a lot less than two to the 4,600, but still way too much. Does that make sense? But if actually each symptom depends on only maybe 10 or 20 diseases, let's say k equals 10, then I have this much simpler structure, and then it's even simpler with the noisy-OR. The thing about the noisy-OR is that for each pair of a disease and a symptom, or each pair of a cause and an effect, or a parent and a child, you really only need one number, which is its causal strength, because these correspond to independent causal mechanisms. When I say the effect happens, the child node is on, if one or more of the parents cause it, those are each independent coin flips, right? Or you can think of them as independent causal mechanisms. So if I say, well, TB has some way of causing cough and flu has another way of causing cough, and they each are stronger or weaker, whatever they are, I have one number for each cause of each possible effect to specify how strongly it causes that effect.
And that means you now have something that looks a little bit like a neural network, if you're familiar with that, which we all are: each edge now gets one number, which is like a weight. Some of the work of Geoff Hinton, I don't know exactly what the Nobel Prize in physics was given to Hinton for, but one of his really interesting contributions was something called the Helmholtz machine, as opposed to the Boltzmann machine, which is the thing he had in common with Hopfield and which looks most like physics. The Helmholtz machine is named for a different physicist, and it's more in this direction. It was a really interesting proposal from the mid-90s: one of the first ways to couple a directed graphical model, like a Bayes network with at least some causal interpretation, like for medical diagnosis, with a neural network that had the same directed topology but with the arrows going in the opposite direction. So when I was talking about learned inference, where you draw samples from the directed generative model and train a machine learning model to capture the conditional dependencies, that was something Hinton introduced there. And in this QMR-type model, it's both very intuitive and extremely simple, because each edge can be interpreted, in the causal or generative direction, as having a weight, much like the weight in a neural network. But this weight is not just a parameter in the model; it has a causal significance, right? When I put a number on an arrow, what I'm saying is: how strongly does this cause independently increase the probability of that effect happening? So those numbers are interpretable; they're not just free parameters. And the way the QMR model was built was actually by asking doctors to estimate those things.
You could also look at data, but it turned out you could build a pretty good model by just asking people. You can ask doctors: how often do you think it is that a patient with no other diseases, if they had this disease, like TB, would have this symptom? Or how much would it elevate the probability of the symptom? That's another way of defining the noisy-OR: how much do you elevate the probability of the effect when no other causes are present, and then when you turn that cause off. So again, this is a very interpretable model, if you connect the mathematical parameterization to the causal semantics. If you look at the examples in the ProbMods web book, in sections two and three, where we talk about generative models, we have a bunch of medical examples that follow this noisy-OR structure; we'll go through some in class to capture some of the patterns of influence. But again, just to put some more words on what we mean by causality here: we don't mean a process that necessarily unfolds in time, but some kind of abstract notion of which choices depend on which other choices. So, if you're familiar with the Old Testament, the Hebrew Bible, it starts off with God creating the world, right? And I'm sure there are creation stories in many other cultures; I just know that one better. This is not a religious lesson; it's just an example of a familiar, directed causal process that happens to unfold in time. It describes a story of God creating the world over seven days. In the beginning, God creates the universe, and then God separates the ground from the sky, and then God puts some animals and fish on there, some trees, and then people, something like that.
But the point is, there's a set of choices, and some of the later choices depend on the earlier choices. Like, God decides to put the fish in the sea and some animals on land after having carved out the land from the sea, or something. It's just an example; again, there's no theological message here. In general, when we talk about causality, what we need is the fact that some choices depend on others. So if we want to make a bunch of random choices to generate a possible patient who comes into a doctor's office, we want the choices of which symptoms they're going to present with to depend on earlier choices, earlier in the logical sense, which are the choices about what diseases they have. Does that make sense, what I mean by a set of causal dependencies in this form? So Bayes nets are a language for that, and probabilistic programs, which we'll turn to in a moment, are a much richer language for that. Basically, think of them as a rich language for describing how these choices unfold; that's what the program will do. But before I get into that, we'll have our attendance check and our question of the day. The question today is: what is my favorite probabilistic programming language? Most people will not be able to guess. You might think it's WebPPL, but it's actually a language called Church, which is named after Alonzo Church, who was a logician and one of the founders of, not probabilistic programming, but computer science and the idea of computation. You might have heard of the Church-Turing thesis, or the idea of the lambda calculus. So Church, which is spelled like 'church', and I realize that might sound like a very religious end to the story of how the probabilistic programming language was named. I'll say a little bit later, or some other time, why I like Church so much.
It was a language developed by Vikash, who you met, and Noah Goodman; they were both in our group some time ago. So I take a little credit for how this language came to exist, in that I didn't strenuously object, but really they developed it, and I'll say something about its nice properties. If you're interested in learning a little more about Church, one of the readings we have, the chapter called "Concepts in a Probabilistic Language of Thought", uses Church. The version of ProbMods that we use in the class, the v2 version of the book, is written with WebPPL, but you'll see a link to the v1 version, which is written in Church, and you can still go back to that version; it still works remarkably well. The version of Church there runs in the browser just like WebPPL; it has a JavaScript back end. WebPPL is written natively in JavaScript, so it's a little easier for people who have learned more Python-type languages, or JavaScript. Church is based on Lisp, which, if you have learned Lisp, you might think is beautiful, like I do; if not, you might see a big mess of parentheses. But to foreshadow: there are many reasons why I like Church especially, but once you learn to read it,
it really corresponds very nicely to what is a good candidate for the language of thought, in the sense of the kinds of semantic representations linguists study. It has many properties, but one of them is that it has a basic function-argument structure and predicate structure, which looks a lot like the core of the way linguists analyze the semantics of sentences. So if you want to think of a language of thought where the statements in your programming language are something analogous to the meanings of sentences, then Lisp, or the Lispy version of probabilistic programming in Church, is a really nice one. We'll come back to that next time, or the time after, when we see some of the ways people have now been using modern large language models as ways of effectively translating natural language into a probabilistic language of thought. But that's just a foreshadowing of where we're going.
Yeah, can QMR capture the interaction between the diseases? Because you mentioned how much... okay, you mean the
fact that, like, you might have diseases that would cause other things, or, no,
you mentioned how much having TB elevates the probability of coughing. But I thought, like, having a cough and TB versus having no cough, sorry, having flu and additionally having TB, versus having no flu and additionally having TB, would affect the probability of cough differently.
Yes. So QMR, and any Bayes net with a noisy-OR, captures the fact that having multiple diseases raises the probability of any one symptom that is a child in the graph, a potential effect of each of those diseases. Yes, QMR captures that; any noisy-OR captures that. The notion of interaction between diseases that it might not capture is if a disease can make another disease more likely, or if the diseases interact in more complex ways. The noisy-OR isn't exactly a linear model, but it's kind of close to linear; it's basically a whole bunch of independent mechanisms. So it's not linear, but in some sense it's the simplest notion of causal interaction. And in work that we'll see later on in the class, it also turns out to be not a bad model of the way people intuitively, naively think about causes. Causes that interact in more nonlinear or more complex ways are often hard for people to think about, especially if there are cyclic patterns of dependency. And this is again foreshadowing for possible projects: if you're interested in the ways people misunderstand important challenges in society, like climate change or pandemics, it often comes down to our having mental models that simplify the nature of the interactions between a bunch of variables, in ways that are importantly different from the way we actually understand them scientifically, although that might be changeable. So anyway, these tools will be very valuable for thinking about both scientific models and mental models, if you want to understand how we collectively come to understand the world better.
So we're almost out of time for today, but I'll just introduce the key thing about what we mean by probabilistic programs, and a little bit of how this is different from Bayes nets, and then we'll do more of it next time. In contrast to the language of graphical models, what we're going to be using here is the language of universal probabilistic programs. You may be familiar with the idea of the lambda calculus, or Church's work and the Church-Turing thesis; probably everybody is familiar with the idea of a Turing machine. That's a kind of automaton, basically a finite state automaton with a tape that you can read and write to, and it's one universal model of computation. By computation we mean, roughly, any procedural or algorithmic process. There's a bit of circularity here, because sometimes computation is defined as anything a Turing machine can do, but intuitively it's any step-by-step process, no matter how complex and no matter how many steps there are. The lambda calculus, the formalism that Church developed, is another formalism for the same thing, and in many ways it has more potential for the kind of abstraction you see in human thinking. But all of these formalisms, Turing machines, Church's lambda calculus, and pretty much any programming language you've ever learned, are Turing equivalent: any one language can be translated into any other, and they give a universal way of describing processes. The difference now is that the processes are not going to be processes for solving a problem, but processes for generating worlds. When we talk about a model of possible worlds, with fine-grained structure in how the parts depend on each other, that's where the programming comes in: the programming language is a general, universal language for describing any possible world.
A few simple things to keep in mind about this way of thinking about distributions on possible worlds. One nice one is the equivalence between the procedural way of putting distributions on possible worlds and the more standard mathematical way. To illustrate these two different ways: in a probabilistic programming language, we might have some statements like this, which describe how to make some random choices, in this case flipping some coins, each with probability 0.3 of coming up 1 (the outcomes are 0 or 1). We generate a bunch of zeros and ones from these coin flips, and then the procedure returns the sum of those 0/1 flips. We can run this program a few times, and each time we run it, the coin flips come out differently. Here they came out 1, 0, 1; here they came out 0, 0, 0. Doing this a bunch of times generates a distribution on return values of the program. In the limit of running the program an infinite number of times, we get a distribution which follows a certain mathematical form. When you study probability theory, you learn to recognize things like that: OK, that looks like a Bernoulli distribution, or rather a binomial distribution, weighted coins and their sum. The Bernoulli is the single coin, and the binomial is the sum. So it's a certain binomial distribution, and that's one way to mathematically describe a way of assigning probabilities over, in this case, a very small set of possible worlds: there are only four possible return values, because there are only four possible outcomes of flipping three 0/1 coins and adding them up. That's one way to describe that distribution.
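The program just described might look like the following sketch in Python (the class itself uses a probabilistic programming language rather than this syntax): three weighted coin flips, summed, run many times to build up the distribution on return values.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so this sketch is reproducible

def flip(p=0.3):
    # A weighted coin: 1 with probability p, else 0.
    return 1 if random.random() < p else 0

def program():
    # Flip three weighted coins and return their sum: 0, 1, 2, or 3.
    return flip() + flip() + flip()

# Each run makes fresh random choices, so the return value varies run to run.
# Running it many times approximates the Binomial(3, 0.3) distribution,
# whose exact probabilities are P(k) = C(3, k) * 0.3**k * 0.7**(3 - k).
n = 100_000
counts = Counter(program() for _ in range(n))
empirical = {k: counts[k] / n for k in range(4)}
```

For example, the exact probability of returning 0 is 0.7 cubed, 0.343, and the empirical frequency from 100,000 runs lands very close to that.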
But another way to describe that distribution is as the limit of running this program many, many times. And there's a theorem, which I won't attempt to formally state, let alone prove, that in a universal language these are equivalent. So you can think of these as two different notions of what it means to be a computable probability distribution. One notion is to be generated by sampling from the outputs of a program in one of these universal probabilistic programming languages; the other is to have a program that can compute the numbers in that graph, that is, assign probabilities between zero and one to the possible outcomes. When we say a distribution is computable, we mean those probabilities can be generated by a program. And for any distribution that's computable, there's going to be some probabilistic program whose limiting behavior is that distribution. Now, it's not a one-to-one correspondence: there might be many programs which generate the same distribution. You might think, oh, is that a bad thing, that it's not unique? But in many ways it's a good thing, because working with a program that generates samples of possible worlds is often a much more tractable way to do the underlying computations. That's the kind of thing we'll come back to next time: ways we can use these programs to look forward and predict what's going to happen in a complex world. This is a little example of an intuitive physics setup. If you haven't used a probabilistic programming language yet, this is definitely the time to start; you'll be using one in the pset. One of the first examples is one of these little worlds where you drop a ball, see how it bounces around, and generate a distribution over where it lands.
You can also reason backwards from seeing where it lands to where it was dropped, and many other things. So we're a couple of minutes over. Thank you.
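That backwards reasoning, conditioning a generative program on an observed outcome, can be sketched with the simplest possible inference method, rejection sampling. Here it is applied to the earlier three-coin program as a toy stand-in for the physics example: run the program forward, keep only the runs that match the observation, and read the posterior off the kept runs.

```python
import random

random.seed(1)  # fixed seed so this sketch is reproducible

def flip(p=0.3):
    return 1 if random.random() < p else 0

def run():
    # Forward model: the full latent "world" plus its observable summary.
    flips = [flip(), flip(), flip()]
    return flips, sum(flips)

def posterior_first_flip(observed_sum, n=200_000):
    """Estimate P(first coin = 1 | sum = observed_sum) by rejection:
    run the program forward and keep only runs matching the data."""
    kept = [flips[0]
            for flips, total in (run() for _ in range(n))
            if total == observed_sum]
    return sum(kept) / len(kept)

est = posterior_first_flip(2)
```

Observing a sum of 2 means exactly two of the three coins came up 1; by symmetry the first coin is one of those two in two out of three equally likely arrangements, so the estimate should be close to 2/3.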
I just don't know if you're aware of work on how people learn social norms and internalize them,
not just, like... internalizing?
Oh, yeah. I mean, what does it mean to internalize a norm versus fully agree with it? That's a hard question.
Yeah, so I actually read that when you're just obeying a social norm, it's incurring a cost on you, but if you fully agree with it, it actually...
So, thinking about that... which paper is that? It's pretty famous?
I don't remember. But is it a cognitive science paper, or... send it to me. I'm curious. Okay, yeah. I mean, I don't know that particular paper, but people have worked on things like that. It's a very good question, and it'd be really interesting to see how people mentally represent it; I wouldn't be surprised at all. There's been a lot of work, and we'll talk more about this later on, on internalization. A lot of things like... I forget whether you took this class before, but do you remember things like the naive utility calculus, or any of these inverse planning models? Maybe let me see if there are other shorter questions, but I can give you some pointers. I'll just say it's a good direction: there's been a lot of interesting work looking at how people think about other people's behavior, and think about norms, using these tools. What you're describing is something I've often thought about, and I don't know if anybody has exactly modeled it, but that's all the more reason why it'd be a good thing for the class. There are definitely some closely related things.
I think the interesting thing would be to see... yeah, how do you mean? One of the things about social norms is that they're not just things we think apply to us; we think other people should follow them too. Yes, and so when you're assigning a positive utility to following a norm yourself, you might also assign a positive utility to other people following it. Does that make sense? Yeah. Anyway, let me talk to other folks. Does anyone have any quick administrative questions?
I wanted to understand your last point about why it is better to have different programs representing the same distribution.
I will come back to that. Yeah, that was just foreshadowing. The many-to-one mapping is not itself the good thing. The good thing is that some of those many programs you can sample from are much easier to work with than others, and being able to find the right ones, the ones that capture the causal structure of the world, is something our minds can exploit. Does that make sense at all? It's not something I can fully explain yet; it's more like, as we will see.
Do you have any examples? Like, is it because people have programs they're more comfortable with?
Maybe. Here's another thing I can say, which is not an example, but it's related: programs compose well. Say I give you one distribution as a big table of probabilities, and then another distribution; there's often not a good way to combine them. But if I have a program that can generate a trace of a possible world, then I can have another program that does something with that trace. So composition, either in the predictive direction or in the inferential direction, works much better when you're using processes, programs, to describe the distributions. Or say I want to learn or change things: I have a model, then I learn something new and want to change my model. If I have a program that generates possible worlds, I can just change the piece corresponding to what I learned, because what I learned is about processes in the world. But if all I have is a big joint distribution, it's kind of like a neural network, a black box; I don't know how to change it, right? In
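As an illustration of that composability (a toy model; all the names and numbers are hypothetical): one sampler generates a latent state, a second sampler consumes it, and swapping out either piece leaves the rest of the model untouched.

```python
import random

random.seed(2)  # fixed seed so this sketch is reproducible

def weather():
    # Module 1: sample a latent cause.
    return "rain" if random.random() < 0.3 else "sun"

def commute_minutes(w):
    # Module 2: consumes module 1's output; rain slows the commute.
    mean, sd = (25.0, 5.0) if w == "rain" else (12.0, 3.0)
    return max(random.gauss(mean, sd), 0.0)

def world():
    # Composition: chain the samplers to generate one possible world (a trace).
    w = weather()
    return {"weather": w, "commute": commute_minutes(w)}

# Learning something new about rain means replacing only `weather`;
# the rest of the generative program is reused unchanged.
sample = world()
```

Contrast this with representing the same model as one flat joint table over (weather, commute): updating a belief about rain alone would require recomputing every entry, whereas here it's a one-function edit.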
operations management, we call that: instead of inspecting the product, you go and fix the process. Yeah,
exactly. So it's like many other things: you have a mechanistic understanding, a model,
computational and algorithmic, is that kind of okay? Your question really interests me, because in entrepreneurship, someone might intentionally not follow the norm. Oh, that's interesting, yeah. So could you copy me when you're sending the paper to Josh as well? Oh, actually, yeah, thanks, norms and rationality, yeah. When you're sending the mail, if you don't mind, because I want to add some thoughts as well, and I'm curious what Josh thinks about it, because
his address is A-M-O-O-N, at MIT.
Thanks. No, sorry, there are things that are a little bit like that that would be great projects, I think. What I would say is that the way we can connect it with what Josh explained would be through something like the model of desirability, where we have values; so we can interpret entrepreneurs as somehow weakening some norm and following their own utility. That's cool. I'll definitely ask you about it. Yeah, thanks. Actually,
are you in Course 9?
What's Course 9?
Brain and Cognitive Sciences. Which course are you in? Oh, you're in the business school, yeah. Are you a PhD student?
What group are you in? Marketing?
May I ask who your advisor is? He has a lot of work on resource rationality, right? It's okay, it's okay, somewhat successful.
Isn't that basically trying to learn to invert one of these graphical models?