matin

9:03PM Aug 19, 2024

Speakers:

angie

matin

Keywords:

reward

belief

game

optimal

prior

market

entrepreneur

state

negative

action

agent

model

positive

dynamics

tool

information

product

observe

xm

feature

They call the elevator. Call the elevator.

All right,

how have you been? Not too bad. How about you?

Yeah, it's

interesting. This game, like, just implementing the demo

was interesting. There are multiple different things that I want to discuss about it, but I want to know, like, what you wanted to discuss today, and

you can go from there. Oh,

I see, yeah, I just had a chat with Charlie, and I think I will reorganize the whole slide deck around this; he really liked this diagram. This is from Josh's precautious, because Charlie and I have been talking about how to use AI as thought partners for entrepreneurs for, I think, around one and a half years. And this somehow captures it, and we can contribute in a way that shows how the machine can learn about the entrepreneur, and how the human's belief and the machine's belief can be communicated to each other. Yeah. Like, for instance, this existing tool that helps you build or elicit a human's belief about different theories. So this was an example with Porsche: in order for them to have an exclusive segment, how should the different future states of motor racing and engineering performance, mid-price competition, and high-end customers be? And they have some format for asking the probability for each node and edge. So I feel this is a way of communicating the belief, one and three. So

someone wrote that model there. Someone sat down and said, what are the, you know, relevant characteristics of this market? Well, it's mid-price competition, you know, whether we can get high-end customers, blah, blah, blah. Like, they sat down and wrote that after the fact, after Porsche had done the thing that they've done, or during the time there. So

I think this is more like a playground for theory building. So maybe if you click,

yeah, these are these games, yeah. So

since we were working on a car, I chose this one to play, and they have some background knowledge of Porsche, and how they define the future states is something like this. So for mid-price competition, this is the probability you believe this would happen. That

makes sense, but this is more for, like, teaching people,

right? Or what is this game?

This allows people to have something firm-specific. So this was from a paper. The authors of this paper are kind of the state of the art in the entrepreneurship-as-science approach, and what they are calling for is firm-specific causal logic for how they imagine they create value, which guides their downstream choices for measurement, experiment, and evidence gathering. So we're developing this tool, and Charlie frames it as a way that we have a fancier tool than what they currently have, based on probabilistic programs. What

Yeah, for sure. What I'm just trying to understand is, are they expecting people to play these games and actually — like, if I own a company, let's say I have a TV company, am I expected to go play the Porsche game? And then, based on that, try to, like, literally input what I think is the state of the market, and see what the outcome of the game is, and then go do that action in the world? Or is it more like,

you know,

I'm playing here and understanding how I should yeah.

Oh, bye. Uh, right?

I mean, it's more like, you know, you're expected to play many of these games and learn how to think about a certain specific business scenario, yeah, and then be like, okay, now I have to ask, what are the relevant parameters? For my case it was these five; for me, it might be some other things, because I make ice cream, I don't make cars, you know? Right, so is it that it's more of a tool to teach a way of thinking, kind of, right? And

that's our first goal, but that doesn't mean they can use this as a prediction tool or an action prescription,

for them. It is more like a teaching tool.

It makes sense, okay,

but I agree that there would be a very difficult gap from teaching to, like, the real-life

application. I think there are, depending on how you want to frame it. This is very close to the way I think about how these tools are going to help. You know, they are not just for automation. A lot of successful, kind of, like, established computer tools are automation tools: we have a task that we know exactly what it is, you know, and we just automate it. We have a compiler, you know. Did you know that automation is included in the tools

definition I just learned today? Oh, really.

But yeah. Again, these successful

tools, a lot of them are, like, for automation. But there was just recently a newer class of tool — it's been there, like, spreadsheets are kind of like thought partners, right? — but things like ChatGPT, you know, they're not automating any specific thing, right? They are there to help you interact with a bunch of information, with computational power, in a way that's easier. You know, you don't have to go get a PhD in machine learning or a PhD in something to be able to extract some information from some documents. It's a general purpose tool that helps you think. A lot of high-level programming languages, I put them in that category. I don't know if you've worked with something like sympy, these symbolic algebra softwares; they're great, because if you have a very big, hard integral that you need to do, you still can do the high-level planning of how you do the integral — okay, I'm going to do integration by parts, and I'm going to break that into three terms, and then our first term... — you can use sympy to describe those steps, and then sympy does the, like, polynomial multiplication and rational function decomposition, and, you know, calculates all the derivatives and things like that. And I think one useful kind of longer-term tool that could be really helpful as a thought partner, in a setting similar to the one these games are targeting, is a kind of programming language. Or, if you don't like programming languages, it could even be, like, a diagram-based kind of software where you can describe the dynamics of some decision process, you know. And it could be very simple dynamics; a lot of these models, I realized, don't use continuous variables — they're like, you know, the sentiment of the market about something is positive or negative, you know, and the value, yeah. And then you can make your own experiments and play with that experiment and modify the experiment. And basically, instead of having these 20 games that you can pick and play, you can make your own games and play, and maybe that will help you with your decision making.
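A minimal sketch of the kind of sympy workflow being described — you state the high-level plan, sympy does the symbolic grunt work. The particular integral is just an illustration, not something from the conversation:

```python
# A minimal sketch of the sympy interaction described above: the human decides
# the high-level plan (e.g., integrate by parts), and sympy handles the
# symbolic bookkeeping. The integral here is only an illustrative example.
import sympy as sp

x = sp.symbols('x')
f = x**2 * sp.exp(x)

result = sp.integrate(f, x)          # sympy does the low-level algebra
print(sp.simplify(result))           # (x**2 - 2*x + 2)*exp(x)
```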

That may be what the system dynamics people, using kind of end rules, are trying to do. Like, they have a tool where, for instance, you can imagine how many factors would affect the energy supply, and you can somehow have the levers that affect them. But the thing is, I don't know — we should find some middle ground between the continuity and the discontinuity of the state that we... Yeah, the continuous or discontinuous state

is really not an issue. Yeah, the things that are harder are, like... okay, the thing that makes this problem hard. I thought, okay, I'll go home and do it, and... I'm glad it was harder than you thought.

And I also because it's, it is a bandit

problem. It is like a form of a bandit problem. And I thought, like, bandit problems are solved, in the sense that the most general specification of a bandit problem has approximate algorithms that basically get, you know, close to theoretically optimal performance. And for simple cases of the bandit problem — and this pivot game is a very, very simple bandit problem, right? —

That is, like, totally, exactly

solved, right? But what I realized is that, like, it's not solved at all. And even though you can implement heuristic strategies, even though you can implement greedy strategies for this specific case, really quickly, right? You can't implement strategies that are optimal up to a finite horizon or optimal for, like, maximizing discounted, you know, kind of like

revenue over time.

So you.

Yeah, the place where I'm at right now is like

having the structure of the decision process encoded in the model, you know — the fact that you make decisions, and that a decision gives you information in addition to some reward, and you need to make decisions that maximize the value, and, you know, there's value of information and there's, like, regular reward value. That is a technical problem that is not super looked into, definitely not in the PPL setting; I know of no PPL that advertises itself as good for decision processes, you know, and

there are none? There

are none. No. The state of the art when it comes to these types of decision processes is reinforcement learning, where basically what happens is they approximate the dynamics of the problem in a way that's differentiable, and that allows them to use gradient descent to optimize a neural policy. So, like, in our case... yeah. And also I want to stop just right now, okay, because I feel like I'm going on a tangent, and I would want to bring it back and ask you where your thoughts are, and then I can describe more of the game. And I think it would be good, actually, for me to go over the game, like what is happening in the demo, yeah. I think that would be helpful, yeah. And

then you can think more about

kind of, like future steps, and I would have, right,

yeah, that's and if you have anything else that you want to put

like on the agenda before we go on, yeah, I have one question. So last

time we were talking about kind of sequential segments, and Charlie and I were discussing — I think instead of sequentially segmenting certain chosen parts, it's a better modeling principle to start from a bigger model and then decide whether it can be aggregated or not. You can think of it as modeling: instead of going from no pooling to partial pooling, this is more like going from complete pooling to partial pooling. So the direction is different, and I think this is better. Is there any theory to support that? Well, I mean, the

Okay, do you generally agree?

I don't see a big difference, I think,

like, may I add one more? Yes. So

one worry

was that here you can get — because this is path dependent — let's say, in our case, I want to build an AI product that teaches how to use AI, and this is a professional type of customer, and this is a student type of customer. Even though this one is much higher, if you happen to sample this one, that would inform you that, oh, I should target students, even though... So that can affect your future segments. That was my argument against this. So if you do this, it might help you somehow. Let me understand what you mean by this and that — states. So

in this specific game, right?

There's only

what is the state of the world? The state of the world is, like, what is the sentiment of the market about luxury cars and economy cars, and what is the sentiment of the market about EVs and hybrids, right? And depending on how that sentiment is set up, one of the four decisions is the most profitable thing to do, right? I thought it was more like two types of product feature

and two types of customer segments, right? Okay, let's call it — yeah, two product

features and two customer segments. But do you see where the difference is? No — if you have two categories, you can, like, merge them into one. So if I have a single variable that is, like, the market segment, right, and it can take on two discrete values, one and two, I can make it a continuous variable that takes values from negative infinity to positive infinity. And if that continuous value is positive, I would associate those cases with the discrete variable being two, and if that continuous variable is negative, I can call that the variable being one. So to me, the fact that you have a continuous versus a discrete parameterization of the state, on this level of model expressivity — what things you can write in this model — doesn't make a difference. Now, sometimes it doesn't make sense to model things as continuous; sometimes it does. The reason why I'm using this continuous variable is because of the reward mechanism that you showed me last time, where you look at the sum or the difference of these two indicators and divide them by two. So that's why they're continuous. Oh, sorry, sorry. Just to clarify, I'm not talking about

the discrete versus continuous, but rather whether it is one-sided asymmetric information versus two-sided asymmetric information. Let me — this is kind of important; it seems to be important because at the conference there was the philosophy of the lean startup. Have you heard about Lean Startup? I've heard about lean startups, but I

don't know what... I don't know if they're good or bad, but the amazing idea is kind

of — they emphasize making an MVP and testing it with customers, because they assume that customers have much more information than the producers. And there are lines of research, including my advisor's, that are against that, because it is in fact two-sided information: customers don't know the details of the product features. So the reason I designed this as not 1, 2, 3, 4, but one-two multiplied by one-two — kind of product times market — is based on the idea that there's, like, two-sided information asymmetry. Okay, I don't see how that, like,

I can understand that. To me, these two on the level of

the information that these two sets convey,

it's only the number of elements. So to me, you can treat one-one as one, you know — yeah, put them in one-to-one correspondence. So what other feature of the model would make this different from that? Yeah. So if you have the demand side

and the supply side — or this is the supply side and this is the demand side — if you somehow discover some unexpected thing: oh, I went out to test how fast this machine is, but people are more interested in how fancy it is. They discover a new feature, and that can feed back to the product feature level. So that was why — this is how, on the entrepreneur side, their belief model is a multiplication of need and feature, and on the market side it is also need and feature, but they have different measure spaces, if you will. And the final product-market fit is a correlation between the values that the entrepreneur (the producer) and the demand side give to them. So kind of this is coming in here. I just wanted to put this on your radar, because this affecting this is very central to startup pivoting. Yeah, I don't... so is this a Bayesian model, or what

is this diagram shown here? This doesn't seem to be a very Bayesian

model. Is it? I don't think they have Bayes in here, but this is

just like an observation, yeah, right. Of like saying

you have two vectors: there's a belief vector of the, you know, entrepreneur, and there's a belief vector of the market, and you have a simple linear model that says the product-market fit is equal to the correlation of these two vectors.
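A minimal sketch of that product-market-fit-as-correlation idea — two belief vectors over the same needs and features, with fit as their correlation. The numbers here are made up purely for illustration:

```python
# A minimal sketch of the linear product-market-fit idea described above:
# the entrepreneur and the market each hold a belief vector over the same
# needs/features, and fit is the correlation between the two vectors.
# All numbers are made up for illustration.
import numpy as np

# Beliefs over (n1, n2, n3, f1, f2, f3)
entrepreneur = np.array([0.9, 0.2, 0.4, 0.8, 0.7, 0.1])
market       = np.array([0.7, 0.3, 0.5, 0.2, 0.6, 0.2])

product_market_fit = np.corrcoef(entrepreneur, market)[0, 1]
print(product_market_fit)
```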

And what are n1 through n3, and f1 through f3 —

like, what are those variables? Different needs and different features. So

people usually confuse this, like, for customers in the model itself. Like, so, needs of

M — is it that math, that, like, expression on the right-hand side of that B-

plus line? So in there, in this blob over here — is needs-M a three-

dimensional vector that's just n1 and 2 and 3, and then needs-E is another one that's n1 and 2 and 3? I would say they have more

information on this one, and they would have more information on this one. So there are those edges,

basically going from... If you wanted to draw this as a Bayes net, right, as, like, a Bayesian network, there are these latent variables that are, like, you know, needs in the market. And, I mean, product features

are not like something that you

would want to infer as some unknown, you know —

needs in the market is the latent variable, in the sense that

that's the thing that an entrepreneur wants to discover. But are we modeling product features as also some random variable that someone wants to discover, or are those things that we want to optimize? Yeah.

I don't know what is being optimized here, but, like, this seems to be some nodes, yeah, of entrepreneurs and markets. And my point is, they have different... there are different resolutions. I understand that part.

I just want to also so I get that,

I understand at a high level of like,

like, there is this information asymmetry: you know, entrepreneurs might know more about the feature space, right — they have more information about the features based off what the product can look like — and the market knows... We are the best example: you know much about what

the program can do, but I know much about, like, what they need, right? So that part I understand. What I want to

also understand, is like, when you want to, like, write a model about that sort of a dynamic, right? How are you thinking about the belief variable,

or how are you

formulating the model so that that asymmetry — that, like, asymmetry that exists in the world, you know, the natural thing that you want to model — how does that show itself in the model, mathematically? Yeah, I don't have a good answer for that. And I think

that is why kind of this question comes in, right? And that's why I'm saying

in both of these, under both of these scenarios, you can have that asymmetry modeled. I mean, because these are the latents, right? The entrepreneur doesn't know them, you know. And in our model there's no customer, but if there was, the customer might get different resolution on different parts, and the entrepreneur might get different resolution on different parts, right? So if you had a bigger model, like that market model that I shared a couple of weeks ago, you know, you can assume the world state is much simpler than what I said — there are 10 different world states — but you have a function from the world state to, you know, the information, the belief state of the entrepreneur, of the different firms, and the belief state of the different customers. And that function, based on the world state, can give them different values. So no matter what the actual state is, the consumers will know more about certain parts, and the entrepreneurs will know more about certain parts. To me it doesn't seem like it has anything to do with the state formulation — that's why I'm trying to clarify. It has more to do with, like, the formulation of, you know, the emission model, in a sense.
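A minimal sketch of that point: one shared world state, and the asymmetry lives in two different emission functions that give the entrepreneur and the customer different resolution on different parts. The names and noise levels here are made up for illustration:

```python
# A minimal sketch of the idea above: the information asymmetry lives in the
# emission model, not in the state. One world state, two observation
# functions with different resolution on different components.
# All names and noise levels are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
world_state = np.array([1.4, -0.7])   # [market indicator, product indicator]

def observe_entrepreneur(state):
    # Entrepreneur sees the product side sharply, the market side noisily.
    return state + rng.normal(0.0, [2.0, 0.1])

def observe_customer(state):
    # Customer sees the market side sharply, the product side noisily.
    return state + rng.normal(0.0, [0.1, 2.0])

print(observe_entrepreneur(world_state))
print(observe_customer(world_state))
```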

Got it. I don't know how to put this in language, but this seems to me like just one agent, and this seems to be, like, two agents interacting. And when that happens, some more dynamics come in, like how they communicate with each other. So that was, like, kind of... Yes, two agents — that, to me, makes sense, and that's so

— I resonate with that a lot. And that brings you back to the thing that I said last time: in the simple game, there's no... there's one agent, yeah. And yeah, I can see — back to what I said at the beginning of this — one thing that no one does, that PPLs don't do and system dynamics doesn't do, is model this: there are multiple agents, each one of them is planning, and your decision has something to do with how you think I will make a decision. Right? Exactly. Yeah, right. And

so that is something that

it's technically a hard problem. So, you know, it would be good to find a tame technical formulation, yeah, but I think that's an interesting area. I think it's something that probably business people will care about; AI people will care about it in a different setting. But it would be cool to try to nail down some sort of project with the main feature being: there are multiple agents, they interact, and you want to capitalize on that interaction. Yeah, the way I framed our pivot game to Charlie

was, it helps this person, like, come to trust the machine. Because after playing this, like I did, I always lose, and I somehow start to believe, oh, I should follow this person — the person being this one, which is a machine, right? Yeah. So in this diagram, it is kind of a thought partner: the machine has a belief about the world, and it shares its belief with me, and it turns out the product of its thought is more profitable, and that somehow affects my belief about the machine. Yes, sure. So I think that's how I'm trying to present this dynamic to the faculty. Yeah, that makes sense. I can explain to

you now. Maybe now is a good time. Yeah, I'm glad we are on the same page — I am 100%

with you, that you need to have multiple

agents interacting in a lot of these, right? That's... okay, so let me go over this. I'm gonna go over, just at a high level, what is happening, and then I'm gonna go over the code, okay, but I'm gonna be more concrete — I'm gonna write things down exactly. Oh, I love that. What is going on... you can use the next one. Okay,

thank you. I can sense the pain.

Okay, so this is exactly like what you told me before. So, you know, this is my understanding from last week; it hasn't changed. It was clear to me once we... maybe I didn't get your point quickly, so I'm gonna go over just what's happening in the game. So the game has a world state, and the world state has two variables. I call them XM and XP, right? And, so, what is this? So XM

is just

a real number, right? And it indicates, so this indicates, let's call it market sentiment,

and let's say, yeah, there is, like two market

maybe not sentiment — I'm just going to call it an indicator. And in the code I just called it an indicator, too, because I don't know how else to... But basically, let's say there are two markets, right, a positive market and a negative one, right? If XM is positive, then that means the positive market is the good one — that's the one that you have to, like, invest in and go towards. And similarly, if XM is negative, that means that the negative market is good, okay,

and that's the overall... that's, like, the high-level

modeling assumption. How do they define good, I guess? So, from... what are—

so what? Let's go back to the case of Tesla, right?

They want to, like, make this car. They have these four options: make a hybrid or an EV, and make that a luxury sports car or an economy sedan, and they just don't know which market is a better fit for their abilities — basically, what is the best choice they need to make, right? That's what I mean by good. So it's a very simplified kind of setting. There's only one agent in the game, and that's, like, us, the one agent who's playing the game. And the world state is only these two variables. So the market can be positive or negative, which means that if the market is positive, you'd better go and participate in the positive market; if the market is negative, you'd better go into the negative market. Is sedan the product, and luxury and economy the

kind of market. So luxury and economy are the

markets. So do you want to go, like, into the luxury car market, be a player there, where richer people are your target? Exactly — the market segment is going to be richer people, and then the product is a sedan versus a sports car. So I changed this: instead of sedan and sports, it's, like, hybrid and EV now. I think, I think if we

choose luxury and economy as the market, sedan and sports is better, but, yeah, let's just say luxury and economy are the market.

The product would be sedan or sports, but the product right now is EV or hybrid. Got it. So the car — the type of car that they can make: are you good at making sports cars? Are you good at making sedans? Okay? And then, you know, what market is a better fit for your... The point I'm making is, I'm not

the car expert. And the example Charlie made was — Charlie is a supercar expert; he was a director of,

I think, international

Vehicle Program, and the example he made was a 300-mile range versus a 350-mile range, for an EV versus a hybrid car. So the product is the range in miles, and the market, right, is suburban versus urban. Largely, it's very confusing. Oh, I see. But let's just stick with this one. Yeah, I'm 100%... I have no idea about, like,

what market I'm, just, like, being in as well, so I'm willing to change this. Yeah, yeah. I was, like, out eating dinner, and I was like, damn, it wasn't sedan/sports, it was EV/hybrid. So, like, the next day I woke up and I pushed the change to GitHub. Yeah, but I think some visual aids might be helpful

for this, because I found it a little hard to map, like, what did I select and... kind of, do you know what I mean? Yes, I know, a visual would be very welcome, okay. Since they have a sports car — did you know that?

Yeah, you have this full... Okay, I'll put an emoji in there.

Let's call this a sedan, although

it doesn't seem like it, right? It doesn't mean it's a compact,

yeah, let's just stick with this. Okay, yeah, but that's

what I mean, basically, like,

what market

is the one you should go into? Should you go into luxury or economy, right? Positive could mean, you know, luxury; negative might mean economy, right? But yeah, the reason I'm just calling it positive or negative is because this is a real random variable, and, like, I can look at the sign. But otherwise I can just call it, you know — you can call this one luxury,

and this one you can call economy.

Okay, I'll go, okay. And similarly, with

XP, you know, that's for, it's a real number, and I call it the product indicator, right? And that means, like, What product are you good at making? Yep, you know.

Okay, good. And, you know, similar, you know, meaning to XM, right? So if it's positive, you have, you're good at the positive, you know, product,

okay, and then, so that's the world state. And I'm gonna, is it defined in game, or

it's in the game definition, yeah.

So if you look at the pivot game, it has a state, okay, right? And the state is just an array of two numbers, right? There's a market indicator, which is the first number in there, and the product indicator, which is the second. Got it, got it? Yep. Okay.

And then, so that's the world state.

What are so how many? What was the agent situation? There's only one agent, one

agent, and that's like, the entrepreneur, whatever. You know,

that's that. And, okay, so that's that. There's no agent interaction. Big caveat, what are the actions?

So the actions are also, like

you can think of, there's, there's four actions, right? Just picking one of these four, right? I want to go in the positive market and a negative product, you know, so actions, you know, there is, like, you know,

there is a market action,

right? And a product or,

let's say, a market choice and a product choice. Okay, so the state of the world is constant at every stage. You can

just say this market and that product, right? Okay? And then, what is the reward situation? After you made your action, you have to get a reward.

Okay? So, you know, the one thing is like, I will I

encode this action, right, in the following way. So

I introduced this encoding right for an action.

If you

choose, you know, let's say a positive or a negative market.

I don't know if this, this notation might be confusing to you, so if it is, let me know. But this is some notation that I've seen in some math text. So if you choose positive or negative market, one of them and

you know, let's say negative, positive. I mean, actually, that's bad notation here, so let's say this — I'm gonna do it like this: if you choose the positive market — thank you — and the positive product, your action is encoded as plus one half, plus one half. It's a vector. So just, you know, let me write a few other ones. So, similarly, maybe I can make a table here. So, market,

product, encoding. Okay, so if you go plus-plus, you get (plus one half, plus one half), right? If you go plus-minus, you get (plus one half, minus one half), right? So I'm basically just picking the signs and putting them next to one half, and you will see why in a second. Maybe you can already tell,

okay, so, like, you can pick a product,

you can pick a market, right?

And, you know, each one of them has two cases — it can be positive or negative — and whatever sign you pick for the market, I give that to the first element of this vector, and the absolute value is always one half. The product sign goes to the second element of the vector, and again the absolute value is always one half, right? So this encoding might seem a little strange right now, but let's see why it comes in, in the reward. So

you know, if

the agent takes an action with encoding like, let's say

a one, a two, right? That's just like this, positive one half, negative one half, it receives

a reward distributed

according to a normal distribution, right, whose mean is the dot product of this (a1, a2) vector with (XM, XP), and with some sigma. And this sigma is just the reward noise — here it's called emission noise, right? So I call it emission noise, something like that. So, you know, what is an example? For instance,

if you pick positive and positive, you get the dot product with (plus one half, plus one half), which is just the sum of the two (divided by two); if you do positive, negative, the second one comes in negatively. That's exactly right. And so that's the reward function, or the reward model, and this should have all of that. And what's the goal? The goal here is to maximize the reward up to a certain finite horizon. It could be to maximize discounted reward, given a discount factor, or something like that. And in the code implementation you can also see — so if you go up to game again — game, yeah, like, you're already here — so the game: we talked about how there's a state, there's a market indicator, a product indicator, and there's this emission noise, which is another one of the parameters of the game. And you can see that if you want to get a reward for a certain move, you can make a move, and that move is one of these four vectors, right?
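A minimal sketch of the setup just described — the two-number state, the ±1/2 action encoding, and the Normal reward draw. This is a paraphrase for illustration, not the actual game.py; the class and variable names are made up:

```python
# A minimal sketch of the game described above: state = [x_m, x_p], actions
# encoded as (+/- 1/2, +/- 1/2), and reward ~ Normal(a . x, sigma).
# This is a paraphrase of the demo, not the actual game.py.
import numpy as np

class PivotGameSketch:
    def __init__(self, market_indicator, product_indicator, emission_noise, seed=0):
        self.state = np.array([market_indicator, product_indicator])
        self.emission_noise = emission_noise
        self.rng = np.random.default_rng(seed)

    def get_reward(self, move):
        # move is one of (+.5, +.5), (+.5, -.5), (-.5, +.5), (-.5, -.5)
        mean = float(np.dot(move, self.state))
        return self.rng.normal(mean, self.emission_noise)

MOVES = [np.array([+0.5, +0.5]), np.array([+0.5, -0.5]),
         np.array([-0.5, +0.5]), np.array([-0.5, -0.5])]

game = PivotGameSketch(market_indicator=1.43, product_indicator=-0.74, emission_noise=1.0)
print(game.get_reward(MOVES[1]))   # expect something near (1.43 + 0.74) / 2
```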

yeah, what's the definition of the move, move?

So the move is this vector, so it's the action. That move

is another name for action. So you make a move in the game — you know, maybe I should rename it to action when I go home — but, yeah, so you give it an action, right, and it gives you a reward. That's what this function is basically describing. So in game.py, in the get_reward function, you just say, this is the action I took, give me my reward, right? And different players have different strategies for coming up with moves. Each player has a belief, right? So I'll get into what that MC value player is — that's just a Monte Carlo value estimate player, yeah. And the moves array, it's in the .py again —

possible moves. That's all the

possible moves. So you pick

one of these moves, you know, and you got it. The game knows based on that, right? Okay, so what is str? So

that's just string. It's just to print the move.

So, if you... that's why, like, zero — in that function you can kind of see how it doesn't have to be, you know... Right now, the move is represented as a vector of, like, plus one half or minus one half, but to print it out I just flatten it into, like, a number from zero to four, and there's no information lost. I use it to print, like, the plus/minus stuff, but I could have done that also by looking at the signs and making that into a string or whatever. But my point is, it doesn't matter whether the state and the move/action representation is two-by-two or 1, 2, 3, 4, because it's all discrete. Okay, so that's that. Does this match what you said last week? Is that what you had in mind for the dynamics of the rewards, right? Okay, so quickly, how the strategy of the, like, near-optimal player is set up. Should I go to utils?

Utilities, just the utility functions that I have

for the computation. So there's, like, you know, a function that splits matrices. And, like, you know, where should I go to see the near-optimal —

— and the game, in game.py still? Yeah,

so that. So if you scroll down, there are players; there are different types of player classes. So, like, a human player is just a player that asks the human to play. This MC value player is the one that — okay, so let me, let's go, right,

if I may, in a new page. Yeah.

Okay, so at first I misread this as "bandage," so I already knew, yeah.

So why is this a bandit? By the way,

Have you... do you know the formulation of the bandit problem? Are you familiar with them? The bandit is, like — if you've been to a casino, there are these machines where you can pull an arm, but instead of having one arm, in multi-arm bandits there are multiple arms. So here there are two arms, yeah, right — actually, here there are four arms, right? Positive and negative market — each combination is an arm. The four little squares that you click on — each one is an arm, right? The classical formulation of a bandit problem is that each arm has its own independent distribution. Each arm has a distribution; we don't know it. Here, the values of the distributions are correlated, and we know a little more about how they're correlated, right? So actually, from first principles — if you were a high school student and someone came to you with this game and wanted you to solve it — a pretty good strategy for solving this game is: initially, you can pick any one of the actions. Let's say I picked the one that's positive, positive, right? So, positive market, positive product. When I select that action, I get a noisy reading of the sum of these two numbers, basically divided by two. I look at that; if that sum is big and positive — let's say it's bigger than twice the standard deviation, or something like that — then I know that with high probability those guys had the same sign and they were both positive, right? So I know that the optimal choice is to keep playing the positive, so I can just keep playing the positive. If the answer I got after playing positive, positive was a negative number with large absolute value — again, let's say twice bigger than the noise — then I'm sure that both of the numbers were negative, right? So I go to the other corner; I keep playing negative, negative. Now, on the other hand, if the absolute value of the number that I received after playing plus-plus was kind of between negative sigma and plus sigma, then I know that the signs of these two numbers are opposite to each other, right? So I go and play one of the moves with positive-negative, right? And based on that — let's say I play positive-negative — in the two more probable cases, the answer has to have a large absolute value, either in the positive direction or the negative direction. If it's the positive direction, I've found my optimal play; otherwise, the other positive-negative hand is my optimal play. If, with very low chance, I get an answer that is still small, what I've learned is that these two numbers are pretty close to each other, right — the difference is comparable to the noise, right? Because I knew that their signs are different, since I added them and got something small, and this time I added them with different signs — or, equivalently, looked at their difference — and it was still small. So both of them have to be small, right? And there, being more Bayesian can help you more. But at the end of the day, that's how you would squeeze a lot of reward out of that game in the average case, right?
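A minimal sketch of that ad-hoc strategy — probe with (+,+), read the sign and magnitude of the reward, and either commit or probe a mixed move. The thresholds and structure are paraphrased from the description above, not taken from the repo:

```python
# A minimal sketch of the ad-hoc strategy described above. Thresholds are
# paraphrased from the discussion ("twice the standard deviation"), not from
# the actual implementation.
import numpy as np

PLUS_PLUS   = np.array([+0.5, +0.5])
MINUS_MINUS = np.array([-0.5, -0.5])
PLUS_MINUS  = np.array([+0.5, -0.5])
MINUS_PLUS  = np.array([-0.5, +0.5])

def ad_hoc_policy(get_reward, sigma):
    r1 = get_reward(PLUS_PLUS)
    if r1 > 2 * sigma:          # both indicators are probably positive
        return PLUS_PLUS
    if r1 < -2 * sigma:         # both indicators are probably negative
        return MINUS_MINUS
    # Sum is small: signs probably differ, so probe a mixed move.
    r2 = get_reward(PLUS_MINUS)
    return PLUS_MINUS if r2 > 0 else MINUS_PLUS

# Example use against the sketch game defined earlier:
# game = PivotGameSketch(1.43, -0.74, emission_noise=0.5)
# best = ad_hoc_policy(game.get_reward, sigma=0.5)
```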
So an optimal player that tries to maximize the expected value of the long-term gains over many possible games, right, in an approximate sense, can kind of ignore the case where they're both close to each other, in a sense, right — and still squeeze out... So, like, you know, let's say XM is, like, plus, you know, 1.43,

you know. And

you know, XP is negative 0.74, right? So I should pick positive here and negative there, you know, and get the sum of these two guys and get something around, like, two, you know. And that's a small reward at every step, versus if that was, like, plus 10 and this was, like, minus one — or if both of these numbers were large, basically, or different in magnitude or direction with one of large magnitude — that's a case where I can really get a lot of reward. So even this simple ad hoc strategy that I just described — pull plus-plus, get an idea of the signs, and go from there — should play close to an optimal player, right? I should add this to the experiment; I'll get into what the experiment is doing also. But let me quickly just sketch out: how do you make an optimal decision in this game? It's with the value equation; that's what you need to use, right? So, you know, we define a value function.

You define

value function.

I'm gonna define this, even this one, right? You're explaining how

you implement this one. That's even, that's a God that knows what the

state of the market is? It's not... it's not someone who has side information. What I'm describing is, like, an infinite-steps look-ahead, you know, that just thinks through all the different possible things, you know. Kind of AlphaGo? Yeah, basically AlphaGo — I'll explain; AlphaGo is not that different from this problem. The states are much bigger, you know, but it's pretty similar. So I define a value function — I call it V, you know — and it takes in an action and gives me a value, a real number. And, you know, V of A — the meaning of this is going to be the expected... let's call it... I'm gonna add a superscript H.

May I choke them? Oh, yeah, of course,

thanks, yeah, sure.

Don't even ask

so, yeah. So a horizon value, H, just means how many steps ahead I want to maximize my action for. So V_H of A is the expected optimal

return, or the expected

return of taking action A

and playing optimally

for H steps. Okay, so some examples,

V zero of A, right, just means, like, the expected return of taking action A and playing optimally for zero steps after that. So, basically, if I just want to maximize my reward in the step that I'm at right now — which one should I pull, right? And it's this one, right? So this is, like, just

only the expected return

of the immediate move,

right? This reminds me — just one quick piece of feedback. It would be

helpful if there were some feedback that tells me how these values underneath are updated, and the different values. I think that's another thing I need to add. Here's the thing —

I definitely... and I will most likely add that. I'm not a web developer, though, so, like, I have ChatGPT open and I have lots of Stack Overflow tabs open, and I'm, like, making this. But, yeah, it would be good to show a little belief state of the optimal player. Yeah, that was what I was

trying to do here. Like, these are the chosen states. Yeah, the beliefs. So if you see the sequence of choices, yeah — because here I don't — it would be great if I could learn how my previous choices have affected my next choice. Yeah, I haven't found the best way. That's actually cool. Yeah, you can show that.

You can show: given that that's what you played, and this is the information you received, here is what your optimal belief should be about XM and XP. And computationally, that's easy to write; making a graph for it and putting it in the web page is going to take, you know, maybe an hour for me. But I'll show you actually how easy that is. So, but —

great point. Just one last comment.

So here I made a... So I have — I think this is HTML, I'm not sure about that — but I have code that shows which option you chose, and from the color you know that it went from here, to here, to here, right? And this is the reward that you observed from each chosen cell, right? So here, this is minus two: it has an expectation of this, but it observed minus two. Then it escaped from here, which is really bad — I mean, this is a really bad cell to start from — and it chose to go here, and you can see it's one; this is why it observed one, right? I see. And based on its updated belief, it went here, yeah. So I can — the way I was

planning to... So your beliefs about the values of these cells are not independent; they're correlated, right? So initially your belief should be, like, you know — if you have a Gaussian belief, it's just the standard Gaussian blob, right? I see. And then you pick plus-plus — interesting — and you observe some value, subject to the noise, right? Let's say you played plus-plus and you got something positive; that shows that both of these numbers are positive and big, right? So then... Okay, but let's say you selected plus-plus and you observe something negative, right? What does that mean? That means that they're close in magnitude but opposite in sign, right? So this belief jumps — it goes over here, okay? And then after a couple of selections, you should really have a tight idea of what the correct answer is, and this circle just keeps getting smaller and smaller. The size of this depends on the size of the noise — the emission noise in the reward process. If you have big emission noise — let's say the emission noise is bigger than your prior — then almost every observation is really not that important; you need many, many observations. If the emission noise is really small, almost no noise, then you immediately can... if you play two games, you have two equations with two unknowns, and you can solve for XM and XP. So after two moves it should really collapse your posterior.
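A minimal sketch of how that shrinking-circle belief update can be computed, under the assumption of a Gaussian prior over (XM, XP) and the Normal reward model above; this is the standard linear-Gaussian update, not the demo's code:

```python
# A minimal sketch of the belief update being described: Gaussian prior over
# (x_m, x_p), reward r ~ Normal(a . x, sigma), and a Kalman-style posterior
# update. Assumes Gaussian beliefs; not taken from the demo implementation.
import numpy as np

def update_belief(mu, Sigma, a, r, sigma):
    a = np.asarray(a, dtype=float)
    s = a @ Sigma @ a + sigma**2          # predictive variance of the reward
    K = (Sigma @ a) / s                   # gain: how much to move the mean
    mu_post = mu + K * (r - a @ mu)       # shift toward the observed reward
    Sigma_post = Sigma - np.outer(K, a) @ Sigma
    return mu_post, Sigma_post

# Broad prior centred at the origin; observe a big positive reward on (+,+).
mu, Sigma = np.zeros(2), 25.0 * np.eye(2)
mu, Sigma = update_belief(mu, Sigma, [0.5, 0.5], r=3.0, sigma=1.0)
print(mu)      # both components move up
print(Sigma)   # uncertainty shrinks along the (1, 1) direction
```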

Yeah, I don't know whether this is the right moment to say this, but, um — so what experimental people have observed is that scientific entrepreneurs usually have weaker priors, meaning they are willing to reject what they believed to be true yesterday if some surprising thing comes up, yeah. And the question I asked them was: they end up having a weak prior, but they must have started from a very strong prior. So I was listening to this and got interested, because this always seemed kind of contradictory. No, you can have a weak prior. And so —

let's think about that setting. You can have a very weak prior and a very strong conviction to start from a certain point. I mean, it depends on what horizon you're optimizing for, right? But long-horizon thinking in these models is hard, right? Okay — if I do this and get some reward, and then do that and get some reward, what is my expectation, right? I'll write the equation in a second, and you'll see the Bellman equation. But let's say you are at one state and you want to do a one-step look-ahead kind of strategy, right? Actually, one-step look-ahead is also exactly implementable for this, okay — it's like Monte Carlo, but you can also exactly implement it for this simple game; I tried, but it's not easy. But let's say you are a one-step look-ahead agent, or you're a greedy agent — it's most obvious in the greedy setting. You're a greedy agent. You have a very weak prior, so the standard deviation of your prior is really, really high, but it's concentrated — the mean of it is at one, one or something; it means, like, positive market, positive product. Then, because you're a greedy agent, when you think about your next move, you're like, okay, the expected value of my next move is going to be the expected value of the reward under my prior. That's the best I can do, right? The expected value of the reward under my prior is just going to be a sum, right? If I play positive, positive, and my prior concentrates on positive, positive, that's going to be big; all the other ones are going to be small or negative, right? So if I look at the values of these actions, the value of positive, positive is obviously bigger than all of them. So there's an obvious choice: I should go pick plus-plus, right? But my prior is really, really weak — the standard deviation is really big, so the Gaussian ball is really spread out. So as soon as I get some negative reward, my prior concentrates on the negative section of the game, right? Whereas I can have a very strong prior that is not informative — a very strong prior that's concentrated around zero, right? I think what that is trying to tell you is that it doesn't commit to what the sign of this thing is, but it is telling you that the indicators of both market and product are going to be small, right? And the tighter that initial prior standard deviation is, the more samples you need to finally move away from your initial beliefs and go to the result that the data has indicated to you. Maybe another feature — tell me if you like it — could be just a little bar where you move the confidence of the player in their prior, and then you can see that an agent with a strong prior belief needs more evidence to move its belief to some other place in the state of the world, whereas an agent that had a broad belief would quickly snap to it — would quickly change its mind with the data. Yeah. So

do the mindful court, yeah. So he has kind of three C's of how prior and posterior relate — compromise, consistent, and contract. I couldn't find the figure, but basically, it's like: this is the prior; if you have a very wide prior, then it's more likely that your posterior contracts. And if the prior and the likelihood are like this, yeah, then sometimes, yeah, that's a compromise, yeah. I think this really helped me see clearly. And if we could somehow replicate it, yeah. And can you draw it? We just said it verbally, yeah. So I've got to come back to this, yeah. I like drawings.

Okay, yeah, what I was saying about there is, like,

let's first go with the one that had

actually — let me write the Bellman equation for this, V zero maybe. Or let me write it for the full thing, and then you'll see it, and that will also inform the drawing. So, at the end of the day, what is the value of taking this action for a certain horizon, right? That is going to be some expected value, right? Let's say, you know, this A is just that vector, the vector of plus one half, minus one half, and let's say x is the state vector of the game, right? So a dot x is your immediate reward, right? And then,

so, you know,

you have an expected, expected value of that under the current belief,

right? Plus, okay,

the expected value,

let me put this inside that actually, of the maximum. Let me just write it. It looks like it's just, I know you know this? Yeah,

there is a difference

this. I want to be explicit about this S prime, right? There is this summation happening here, over S prime — the expectation here is under... like, it's a conditional expectation now, right? So let me call this R, actually, just so I can have room, because it also has some noise, right? So R of A is the reward that you get by choosing action A, yeah, right. And

here it's conditional on R of A, right?

So the value

of action A for H steps, right, is the immediate value of the reward that action A gives us — that's the term that, you know, indicates exploitation; if this is big, you have to keep pulling this arm. And this is the exploration term, you know: plus, in the future, for the next H minus one moves, right — let's say you got this reward, right? What is the best you can do in the next step, right? You have to pick another action, right; we call that A prime, right. And, you know, so first, let's say I played plus-plus, right, and I get some reward, and now for the next action, given that I know that reward, right, I gotta... sorry. So when you say this, do you mean this

current belief is updated? Like, I've never seen... yeah. This is — oh, the conditional expectation notation, yeah.

So this means, yeah, the current belief is updated. So this is under the updated belief, yeah, yes. So that means, like, okay, under my current belief — note that this depends on the current belief, so I'm conditioning on something, you know — so my beliefs about the reward will inform how much I think I need to explore, right? And this is the value that comes through exploration, right? This is like the prior, and this is the posterior after

one. But then if you look into VH minus one, it's

like another one. So it has another condition, so the posterior becomes the prior

in the next one. Exactly, right? So,

so let's look at the zero case now, and let's look at the thing that I was saying — that there are places where you can have a weak prior and a strong reason to take one action, and that might shift quickly, right? So V zero of A is the base case of this recursion, right? And it's just the expected value of, like, what is the next reward that I can get, right? So it's just the expected value, under the current belief, of the reward of some action, okay.
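Written out, the recursion being sketched on the board is roughly the following, with b the current belief over the state x, b' the belief after updating on the observed reward, and H the horizon:

```latex
% A rough reconstruction of the recursion described above, not a verbatim
% copy of what was written on the board.
V_0^{b}(a) = \mathbb{E}_{x \sim b}\!\left[ a \cdot x \right]

V_H^{b}(a) =
  \underbrace{\mathbb{E}_{x \sim b}\!\left[ a \cdot x \right]}_{\text{exploitation}}
  \;+\;
  \underbrace{\mathbb{E}\!\left[ \max_{a'} V_{H-1}^{\,b'}(a') \,\middle|\, r(a),\, b \right]}_{\text{exploration / value of information}}
```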

And let's say my current... actually, let's put some coordinate axes in here, right? This is, like, the market indicator, and that's the product indicator.

And let's say my belief is someplace like this. To make it spicier, let's put it at one positive, one negative, right? But it's, like, very wide, you know. Let's say it has a very

wide, you know, that's like one standard deviation. So like two

standard deviations, like, so I'm like, yeah, I'm pretty sure it's here, right? And let's call this probabilistic programs and business, right?

So like negative five and like five,

okay. If you look at the value functions for each one of the actions, let's write them down. So V zero of (plus one half, plus one half) is going to be zero, right — it just adds them and gets zero. V zero of (minus one half, plus one half) is doing the exact wrong thing: it counts the positive one negatively and the negative one positively, so I get half of negative 10, which is negative 5. V zero of (plus one half, minus one half) is going to be five, right. And V zero of (minus one half, minus one half) is also going to be zero, right? So if I wanted to go — let's say here I'm just deterministically picking the best action, I'm maximizing. There are these softmax policies, let's say, that pick an action based on the probability of the expected reward, you know; they would exponentiate these things, normalize them, and get probabilities. If I wanted to follow that sort of policy, or if I wanted to use the magnitude of this value function as an indicator of how good this action is compared to other actions — this is an awesome action compared to all the other ones, right? It's way higher than all of the other ones, right? So obviously I will take this action; if I had a softmax agent, it would take this action almost all the time, right? But the prior is weak. So if it observes something around here, the prior will quickly concentrate around here. That's the exact type of thing: you have a very broad prior, you get some observation with low noise, and you snap to that, right? So that's where you can have a very good reason to take a certain action, but have a weak prior and quickly pivot, right?
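The same four numbers, computed from a belief whose mean is roughly (+5, −5) for (market indicator, product indicator), as in the example:

```python
# The four greedy values from the example above, for a belief mean of roughly
# (+5, -5) over (market indicator, product indicator).
import numpy as np

belief_mean = np.array([5.0, -5.0])
moves = {"(+,+)": [+0.5, +0.5], "(-,+)": [-0.5, +0.5],
         "(+,-)": [+0.5, -0.5], "(-,-)": [-0.5, -0.5]}

for name, a in moves.items():
    print(name, np.dot(a, belief_mean))
# (+,+) 0.0, (-,+) -5.0, (+,-) 5.0, (-,-) 0.0  ->  (+,-) is the greedy choice
```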

Sorry — there is another setting where you can have a tight prior; that's kind of fun. Or, you know,

I can have a very tight prior, like

around this, around zero? Hmm, what does that say? That says in a greedy setting, right?

V zero of every action is going to be zero,

because the mean value of XM and XP under this belief is zero, right? So I really don't have a good reason to take any of these actions, right? I'm ambivalent about what action to take, but I have, like, strong priors about this thing, right? And no matter what action I take — let's say I took an action and the observation ended up here, right — because this prior is really tight, it will not move all the way over there and concentrate. It will move a little bit, maybe in this direction, right, and still not even... so it's still kind of moving in that direction, you know. But by making this tighter, you can make the effect even smaller — you can make this really tight and have it move only slightly, yeah — and then, still, at the next step it doesn't have a great reason to move in a certain direction. But —

because, yeah — coming back from the

conference — do you remember the paper I shared about overconfidence bias? I do remember you sharing it,

but I haven't had a chance to, like, read it. This one? Yes. So this was, like, a

trilogy, and they were conditioning on one agent entering and exiting, with different ideas and how they pivot, yeah. And this author gave me an insightful comment: when we suffer from overprecision, we may sometimes want to have estimation bias — so somehow, like, the two biases can help each other. This reminds me of that. Interesting. When we suffer from overprecision,

you may want to be confident... But, so, when you say overprecision, do you mean you want the reward to be not noisy at all? You want the reward to be very indicative. So, like, very strong-headed entrepreneurs — they

don't want to learn much, is my image of that. But yeah, I was trying to connect this estimation and precision error with, you know, the inductive bias decomposed into approximation, statistical, and optimization error. Have you heard about that by any chance?

Yeah, because, long story short,

I think all of this seems to be kind of a functional approximation of the idea-to-product, product-to-market concept. Let me elaborate a little more. So the inductive bias is decomposed into statistical error, which is proportional to the size of the space. So here the optimization problem they want to solve is the argmin of the complexity of

so you have a hypothesis space,

and you want to, like, pick a minimally complex

hypothesis such that, like, it explains the data to a certain given level. Yeah, okay, yeah. And that's, like, their view: they have this inductive bias for, like, simpler hypotheses. So in machine learning, they have this

theory of universal approximation, and how this error changes according to the size of the hypothesis and function space; and also optimization error, which I think is very related to the inference algorithm, meaning whether it actually converges to the optimum or not. And when I showed this to the author, he wasn't sure, but he shared his prior that this seems to map, the precision error here and the approximation error here. So this one is how many layers your neural network has, this one is how much data you have, and this one is roughly how long you ran the inference algorithm. So yeah, he connected this with the estimation error, right? So
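For reference, one standard way to write the decomposition being described here (e.g. in Bottou and Bousquet's analysis of large-scale learning); the notation below is assumed, not taken from the paper under discussion:

```latex
% f* is the best possible predictor, f*_F the best one in the hypothesis class F,
% f_n the empirical minimizer on n data points, and \tilde{f}_n what the optimizer
% actually returns after limited compute.
\mathcal{E}
  = \underbrace{\mathbb{E}\big[R(f^{*}_{\mathcal{F}}) - R(f^{*})\big]}_{\text{approximation error: size/shape of }\mathcal{F}}
  + \underbrace{\mathbb{E}\big[R(f_{n}) - R(f^{*}_{\mathcal{F}})\big]}_{\text{estimation error: amount of data}}
  + \underbrace{\mathbb{E}\big[R(\tilde{f}_{n}) - R(f_{n})\big]}_{\text{optimization error: time the algorithm ran}}
```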

I, yeah, maybe we can spend more time trying to go down this analogy here, but,

yeah, I guess there is some sort of analogy there. I'm

not sure. Yeah, I'm interested to learn more about it, but, like, I'm not... yeah, yeah, let's continue

this discussion on Slack.

Because I think, can you send me this paper? If I can read the actual paper, I

might be able to get it better. I keep coming back to these thoughts every so often. But

so. What this is trying to tell me is,

like two stories. If you are very

far away from the ground truth, you may benefit from having your prior broad. No, yes, exactly. Or if we, like, turn it the other way: maybe if you're very strong-headed, you may want to stay near the ground truth. Yeah, but if you're really strong-headed, so

that's the loss-averse strategy: if you're really strong-headed, you want to, like, stay near the ground truth, right? I mean, in our scenario there's no cost to taking actions, right? There's only the expected reward that you can get. If you have really strong beliefs, if you're really sure about your beliefs, and you're way out there, yeah, you are going to take many, many, many steps to get to the correct answer if it's not there, right? So you better be closer to the origin

so that you can move towards the correct answer with fewer moves.

Because what we are kind of fishing

for here is whether our simulation can explain phenomena that are observed but that people don't understand why they happen; that's the key, right? Meaning,

it would be a big hit if our simulation found an explanation

for the phenomena we want to understand. Because I see it as a theory-and-phenomena pairing. We're kind of pushing our simulation in the hope that it can explain something, right? Yeah. I mean, it can. I think, if you go to

someone, which, I mean, I

don't know, but from my understanding of, like, just the last time that we spoke about this and stuff, I don't think a random MBA student has an intuition about Bayes' rule, yeah. Is that fair to say or not?

I don't know. Okay,

but this

playing with this, like

simple simulation. I mean, I would even make it simpler for that purpose. It would give a pretty good, like, people can develop really good intuition about what it means to update your beliefs in a Bayesian-optimal way. If you're really confident about something, you know how your belief should change; if you're not confident about something, how your belief should optimally change. So it can show that sort of thing, and it can show one other thing that would be good to add to the experiment section, which I didn't explain. The experiment section is just running 1,000 games or 10,000 games, yeah, for all these agents, and averaging out the rewards that they get. And yeah, it shows, and I ran the experiment. So
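A rough sketch of what that experiment loop could look like; `play_game` here is only a placeholder standing in for the demo's real game loop, and the agents and their mean rewards are made up for illustration:

```python
import numpy as np

def play_game(agent, rng):
    # Placeholder for the real game loop: just returns a noisy reward.
    return rng.normal(loc=agent["mean_reward"], scale=1.0)

def run_experiment(agents, n_games=1000, seed=0):
    rng = np.random.default_rng(seed)
    results = {}
    for name, agent in agents.items():
        rewards = [play_game(agent, rng) for _ in range(n_games)]
        # Average reward per agent, plus its standard error across games.
        results[name] = (np.mean(rewards), np.std(rewards) / np.sqrt(n_games))
    return results

print(run_experiment({"greedy": {"mean_reward": 1.0},
                      "1-step": {"mean_reward": 1.5},
                      "2-step": {"mean_reward": 1.7}}, n_games=10_000))
```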

I think I know, okay, yeah, great. So you can see that there is a good jump from

greedy to one-step, right? If you just think one step ahead, you increase your average. These bars are all, like, relative to greedy, so this

part is added, and you're saying it adds this amount, right? So that's with one-step lookahead.

This is the greedy term now. So a one-step lookahead agent says: okay, the thing I base my decision on is the reward I get now plus the reward I will get at the next step, and I don't care beyond that. A two-step agent says: the reward I get now, then I play another turn and get some reward, and then I play another turn and get some reward based on the information I've gathered after those two turns, right, and I want to maximize the sum of all these rewards. So a two-step lookahead agent gives more weight to, it spends a lot more compute to find, like, optimal exploration strategies. Could you make me one story that

somehow fits the car example, yeah. And also, it's like 6:15. Oh, it's okay, yeah,
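One way to sketch the lookahead idea just described is a simplified one-dimensional version of the game: a single hidden indicator x with a Gaussian belief, actions of plus or minus one half, and reward a * x that also serves as a noisy observation. The real demo's two-indicator setup and exact reward model are assumptions here; this only illustrates the recursion.

```python
import numpy as np

OBS_VAR = 1.0           # noise on the reward signal (assumed)
ACTIONS = (-0.5, +0.5)  # 1-D stand-in for the (+/- 1/2, +/- 1/2) action pairs

def update(mean, var, a, reward):
    # Observing reward = a*x + noise is a linear-Gaussian observation of x.
    obs, obs_var = reward / a, OBS_VAR / a**2
    post_var = 1.0 / (1.0 / var + 1.0 / obs_var)
    post_mean = post_var * (mean / var + obs / obs_var)
    return post_mean, post_var

def value(mean, var, depth, n_samples=50, rng=np.random.default_rng(0)):
    # Value of the best action with `depth` steps of lookahead (depth=0 is greedy).
    best = -np.inf
    for a in ACTIONS:
        v = a * mean  # expected immediate reward under the current belief
        if depth > 0:
            # Average the future value over simulated rewards for this action.
            x = rng.normal(mean, np.sqrt(var), n_samples)
            r = a * x + rng.normal(0.0, np.sqrt(OBS_VAR), n_samples)
            v += np.mean([value(*update(mean, var, a, ri), depth - 1, n_samples, rng)
                          for ri in r])
        best = max(best, v)
    return best

print(value(0.0, 25.0, depth=0))  # greedy: 0, no reason to prefer either sign
print(value(0.0, 25.0, depth=1))  # lookahead: > 0, it values the information it will gain
```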

let's walk. Yeah, I

would like to make a story with you on how looking ahead can help in this situation. Yeah, this is too I mean, the kind of,

let me, I mean, the heuristic strategy I described, it does some form of lookahead, right? It does something like, you know, when it first plays plus plus and it gets a number, you

know, that's positive, right?

It is looking ahead, you know, based on the answer it gets, and says, uh,

that, for instance, let's say you do positive, positive, and

then you get a small number, and then the optimal thing to do is to go positive, negative, right? And the reason to do that is because that will reveal the signs of the two numbers. Now I know the signs are opposite, because I did plus plus and I got something close to zero, so they mostly canceled out. Right now, I have to go play plus minus. And that's not because it's a greedy move that will increase my immediate reward; it's an exploration move that will clearly tell you the signs of these two indicators, and then I can go and play the optimal move for the two indicators. And interesting, yeah, yeah,
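A tiny sketch of that heuristic, again assuming reward = a_m * x_m + a_p * x_p with plus/minus one half actions; the threshold for "close to zero" is made up for illustration:

```python
import numpy as np

def heuristic_second_move(first_reward, threshold=1.0):
    # first_reward is what the (+1/2, +1/2) opening move returned.
    if abs(first_reward) < threshold:
        # (+,+) nearly cancelled out, so x_m and x_p likely have opposite signs:
        # play (+,-) as an exploration move to reveal which one is positive.
        return (+0.5, -0.5)
    # Otherwise both indicators share the sign of the reward; exploit immediately.
    sign = np.sign(first_reward)
    return (sign * 0.5, sign * 0.5)

print(heuristic_second_move(0.3))   # -> (0.5, -0.5), the exploration move
print(heuristic_second_move(-7.0))  # -> (-0.5, -0.5), exploit the clear signal
```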

I can go and play this optimally. So, the thing about, interesting, yeah, but the reason why you