Wed, Jan 10th: "Fine Tuning and Optimizing LLMs," 6 PM, CU Boulder
4:09PM Jan 12, 2024
Speakers:
Dan Murray
Susan Adams
Mark Hennings
Keywords:
model
fine tuning
prompt
ai
data
output
parameters
tokens
tuning
bias
talk
give
language
good
task
examples
trained
classifier
input
create
Remember brace recently? So are you guys excited about tonight? Yeah, of course, it's pretty cool. I've been wanting to learn more about the fine-tuning stuff, so
I come over to talk to you as a board member. I have something I'm gonna put in front of you guys: there's a letter of interest that needs to be done by Monday. It's not that big a deal, it's pretty short, but it's being done by our friends, not here in town. I just want to make sure it's something the board might support.
Supporting it, gotcha. It's probably a good idea to read it, because he's basically gonna
send an email to the board with a bunch of things. That's not important to me. But the point is laid out, and
we love that thank you
you can just miss I'm just gonna go to the back so
Hello, testing. Check, check, check.
so it's confusing to do this and you're laughing right now. Yeah.
Just so we don't have to switch between them
okay, well I will be right there if you need anything
Stay safe, you guys. Grab a seat, everyone. We're just about to start.
T shirts next week
Oh man oh my god
my jam yeah
yeah
Grab a seat, you guys, we're just about to start. Okay, I'd like to welcome everyone to the Rocky Mountain AI Interest Group. Great turnout tonight, thanks for coming. My name is Dan Murray; I'm the person that sends you a lot of email, although hopefully less this month than last time. Tonight we're going to do something different: we're going to have a guest emcee who's going to run the meeting. Before we do that, I just have one official act from the podium here. I have a free T-shirt that I'm giving to Robert Monaco, in thanks for his help with social media. And they are
meant to be there. Thank you very much.
I'm here. Awesome. Anytime.
You can order them online from our Linktree page, and we don't even take a cut. That's the logo designed by Ken: the friendly robot. Whoa, got it. So thank you, Ken. Our guest emcee tonight is going to be Susan Adams. You've probably seen Susan at many of the subgroup meetings; she's extremely active. She's in the AI book club, has participated in many of the subgroups, and she's cooking up all kinds of projects. Every time I see her she's grabbing another person and launching some new initiative. So I'm going to turn it over to her. Thank you very much, Susan, for offering to emcee the meeting.
You're giving Dan a break tonight, so you can relax. And also, many of you, if you're interested, can come and emcee; I just decided to be the first one. So thanks for your patience with me as I learn all of this and how to do it. I'm Susan Adams, an instructional designer and course strategist. I work for a national consultancy that does impact and reform work for community colleges. I work on teams of teaching and learning coaches that go into community colleges in our network and help do reform work at the classroom level. So you can imagine how exciting my life is right now, really doing some case-making with faculty on all kinds of levels, and doing trainings and professional development. I just want to thank everybody for engaging in RMAIIG; it has been so impactful to me, not just professionally but personally: my education, my understanding, my capacity building. My brain thinks best in groups, and you all are my group right now, and I just can't thank you enough. I love the Slack channel, I'm loving the subgroups, and I'm loving finding my way. I also want to welcome the newbies here who might be feeling like, whoa; actually, how many of you are new, is this your first time? Wow, okay. So you can hear a little bit of my story: I got involved about six months ago, and it just changed my life. I feel like I have a family now, I have a community, and I can go to lots of people when I have questions or concerns or fears, whatever might be coming up. To that end, I also wanted to say that Laura (where's Laura? right here) and I are going to co-lead the Women in AI group that's going to get activated in February. So look for announcements on that, and anybody interested, send them our way. All right, what else do I need to say? Housekeeping: there are exits to the front and rear, and it's okay to leave early. I am going to be passing around this mic for the Q&A; I want you to hand it along, actually take it, because there'll be some recording that we want to be high quality for the Q&A, and the Q&A is really, really rich here. So thank you again for having your questions ready. Bathrooms are down the hall, please carry out your trash, and we always welcome anybody that needs an escort to their car; it's really cold and it's dark, so if that's needed, ask any board member. And here's our RMAIIG board: raise your hand if you are a board member. Okay, great, thanks to all those board members, and thanks for supporting them financially and with your time and energy. And then we have subgroup leaders; we're really excited to have all of this happening, and I'm going to announce some of the events here. Please raise your hand if you're here as a subgroup leader. Great, thank you so much. All right. AI in Entrepreneurship is going to have their upcoming event Saturday: the hackathon, right. This is really exciting; a lot of you, I know, are going to attend. And Daniel, thank you for putting together a little educational tutoring session this afternoon that I got to go to as well. I can't make the hackathon, but thanks for everybody's effort in putting that one together. That's going to be really exciting this Saturday; I can't wait to see what comes out of that.
And then the legal AI group, on the 17th: learn about the newly created Long-Term Benefit Trust that governs Anthropic. So that's going to be happening for the legal AI subgroup. And then marketing: we have someone here who's going to speak to marketing, right, Jason? Yeah. So
this Wednesday, there it is, it's up to you. Please
come and join; Laura and I lead that group. It's a lot of fun. And
keep your questions coming.
All right, and the AI in Product group is putting together one I'm really excited about; I will definitely be at this one: The Ripple of Responsibility. I don't know if any of you know Diane; we put her name twice because that's how excited we are. She's a professor in the Herbst program of engineering, science and society at the University of Colorado Boulder. She's also heading up the book group, so I got to meet her and see how well read she is in science fiction as well; she gave so many great anecdotes. She's going to offer a really great talk on ethics, exploring your own ethical and personal considerations, also professionally, and how it fits in with product. So that's going to be really exciting. And thank you to Elisa, who couldn't be here tonight, but she's been thinking about this event, putting it together and designing it with Diane. So that's going to be a good one as well. We do have a Linktree, so for those of you that are new, go ahead and pull out your phones if you'd like and grab this QR code. This will get you to an aggregator of all the links that will get you to the spaces you want to be in digitally, especially the Slack channel, which I highly recommend joining. Oh, it's back up. Anybody else still need that? Yeah, let me give you another second there.
Okay, and then we definitely want to thank our supporters. These folks are here supporting us, many board members. Hi, Cynthia. So that is our supporter piece. All right, also our sponsor tonight, Entry Point: Mark is going to be speaking tonight, and thank you so much for that. Happy to have a sponsor for the pizza. Thank you. And we also have our upcoming meetings. So, okay, where's Dan? I'm making sure: we are still tentative on the ABCs of GPTs? Yes, okay, great. Then we've got a member showcase on the 13th, how generative AI is transforming companies, a joint event with Southwest Telecom, formerly Blucon, and then a tentative healthcare one for the summer. Okay, so those are the dates we have there, and I think that's posted in other places too. For announcements: any job seekers in the room? I'm looking to expand my career, I'll say that. And then anyone with AI job openings right now? I do want to speak to that real quick.
We're mostly hiring in AI sales, so we're definitely putting that out there.
A marketing AI company called Attentive (Attentive, okay) is hiring AI marketers. Great, thanks. All right, am I missing anything else? Oh, other local AI conferences. Do you want me to announce that, Dan? Anyone have announcements for that? Okay. Anybody have announcements from the group that fit in any of these categories? Yeah, Ken, oh, let's go with you first.
Oh, yeah. So since last year I'm looking to build a game that simulates a kind of future society. The original goal was to simulate what it's like to live with UBI: you know, how the policy will work, whether it will work, whether it will cause inflation. You know, how automation is going to impact that as well, deflation. It can be pretty broad, and it digs up a lot of fun, actually; there are a lot of ways to make it very challenging that way. So I'm looking for collaborators, and I'm looking for funding at some point; we're just starting to reach out to people. So if anybody is interested, let me know.
So the AI ethics subgroup, which didn't make the slide; we'll get a mention here. We'll be meeting on the 25th at Founders Central. Sorry about that.
Yep, there's an email list. We have a user group list, but if you want to join the AI ethics user group, go to Slack, find the ethics group on the Slack channel, and just put your email address in there, and you'll get added to the mailing list. We're going old school; we just use email to communicate
and slack.
Any other announcements? Okay, great. All right, this is going to help Mark, who's our speaker tonight, and also just all of us, to know who is in the room. So we've got a couple of questions; this is our little tradition here. How many of you have previously worked with large language models in your projects or research? Raise your hand. All right, you're in the right place; Mark, you've got a good audience. How many of you have explored, and this is kind of coming up with the ethics conversation, different techniques for reducing the computational costs of LLMs? All right, so there are some folks thinking about that and doing that research; that's a big one. How many of you have accidentally created a poetic AI masterpiece while trying to fine-tune a large language model? Just two? All right. How many of you have contributed to an open source project involving large language models, so actually made the contribution and opened it? Thank you for that, awesome. Okay, how many of you have told your friends you're basically teaching robots to talk when explaining what you do with large language models? Right, like, how do we explain this to people, and even to ourselves? And how many of you have been outsmarted by your own AI creation in a game of virtual chess or trivia? Nobody? Curious about that. All right. And how many of you have tried to teach an AI model your favorite movie quotes, just for fun?
I think that'd be really great.
Okay, all right, that's the last of those questions; we'll come up with other, different ones next time. So now I want to introduce to you Mark Hennings. Well, actually, you've got your mic there, too. He's the founder of Entry Point AI, a modern platform for fine-tuning large language models. He's a serial entrepreneur, and if you ask him, you'll find out some of the things he has started, which are creative and interesting and fun; one is an Inc. 500 alumnus. And he's a self-taught developer, like a lot of us in the room, which is exciting. But he's also passionate about democratizing AI, which I think is a big, exciting element for our group; what we're all amplifying is really looking at the ethics and how we can democratize this globally. So put your hands together for Mark Hennings.
Super excited to be here. I've got a pretty big presentation planned; I'm going to try to cover a lot of ground, because it's a big topic, and I think it deserves it. So we're going to be talking about optimizing and fine-tuning large language models: essentially, how you can apply different techniques to work with large language models to get better results. We're going to talk about how to improve the usefulness, the accuracy, and the relevance of outputs. Reduce hallucinations, in quotes, because it's kind of a feature in large language models; that's really all they do is hallucinate, because they're not really connected to the real world. But we will reduce the ones that we don't want to see. We'll also try to prevent harmful or embarrassing outputs from our large language models. Unfortunately, there are a lot of ways this can happen, especially when you're doing something super broad using the generalist chat-based models, which really have the potential, and I'll dive into that: toxic bias, unethical or illegal content, and prompt injections. There are ways to mitigate these things. We'll look to extend the capabilities of large language models; as a self-taught developer, I'm very passionate about connecting traditional software to large language models. We don't have to throw out all of our existing tools just because we have something really new, cool, and shiny that we can work with. Also, to not ask the models to do things that they're not good at. And we'll try to think about ways to do all this faster and cheaper as well, which primarily comes down to choosing the right size model for the task. At least at the level we're looking at things, I'm not going to be diving super deep into anything too low-level; I'm going to be kind of zooming in and out, and I'm going to be explaining the terminology that you see in machine learning that can really trip people up. It's not as hard as it sounds. So we'll cover prompt engineering, just briefly; we're going to breeze through it, because I think we've all heard about this. And inference parameters, how you can make some tweaks there to potentially get better outputs. Retrieval augmented generation is a really big one. Function calling: OpenAI has an API feature for it, but I'll explain how it actually works underneath. And then, of course, fine-tuning, which is what Entry Point AI, my software platform, is for. That's what we do; that's what I'm really passionate about, out of all these things. And we'll just kind of cap it off. So, some guiding principles you can keep in mind as we're going through all this. First of all, you have to break down complex tasks into multiple steps. The biggest mistake you can make is to take a large language model and just try to make it do everything for you in one big step. I think that's where you'll be the least optimized. I mean, go ahead and try, right? Maybe it'll work, maybe you're fine. But in general, we break things down, and then some of those steps will go to our traditional software tools, and some of them will fall to a large language model, or maybe a series of models. The second is to just use traditional software approaches when you can. So we'll look at some charts that put these things together. Let's start with prompt engineering.
With prompt engineering, there are some interchangeable terms that will be useful throughout the presentation: we can use the term prompt, or input, same thing. Context window is still a little technical, okay, but essentially the context window is your prompt, and it grows as you start to add tokens to it in the completion. Context length would be, more specifically, the maximum length. So with prompt engineering, typically what you'll see is something like some priming: okay, you're a plumbing Q&A bot that answers questions about plumbing in a helpful way; I want you to use plain language and explain things in a simple way that anyone can understand. So that's the style and tone, and this can be quite extensive in terms of your instructions. You need to handle errors and edge cases: I'm trying to make a plumbing bot, so I don't want you to answer questions about interior design or something like that; just don't do it, okay. And then you'll want some kind of a response from it that you can use in your application; JSON is kind of the language of the web, so that's typically what we go to. So that's the system prompt. And the user prompt (I'll explain these in a minute) is the user input to your prompt here. So somebody comes and says: how do I fix a leaky faucet? I like to call this the dynamic content, because if you think about your whole prompt, most of it's not changing each time you call the large language model; the only thing that's really changing is this dynamic content. It doesn't have to be a question (this is a Q&A bot example); it could be tons of different things. I think last year a lot of people were just focused on chatbots, because it's new and exciting and it's this big thing, and they can do so much. I'm personally more passionate about much smaller, specialized use cases that I'll talk about. And to make these better, you go on the internet and find all kinds of prompt hacks. Like: I'll give you $20, do a good job, please, this is very important. I saw this in a Microsoft GitHub repository just yesterday, and it just made me laugh: take a deep breath, and then the prompt. Or ending with "to be consumed by an application," which is supposed to make it more likely to put out JSON. And I would say, hey, just try it; if it works, that's fantastic. Evaluate it. But I hope all this stuff goes away, because I think it's taking advantage of bias in the training data, and if we train these models better and we have better architectures, these hacks shouldn't be necessary in the long term, in my opinion. But we are working with what we have, so go for it. Still, I think these hacks will only get you so far, and in this presentation I hope to show you some more robust ways to get results from large language models. In terms of practical things you can do to improve your prompt engineering, I think the first one is to just make sure you're using the system and the user prompt appropriately. When a model is fine-tuned to become a chat-based model, it's typically given some special syntax, which includes special tokens that you cannot insert through your own text inputs. So they're sort of protected, in a way, and the model is trained to distinguish the stuff that is in the system prompt at the beginning from the stuff that's in the user prompt. The system prompt is for trusted inputs.
So, as the administrator of this large language model, you have some data that you trust; maybe it's metadata about a customer, maybe it's your instructions, your prompt. Put it in the system prompt. And then there are untrusted inputs. If we have a web application facing users, those users can come and ask questions, they can try malicious things, they can try to hijack the prompt. That's an untrusted input; it could be anything. So make sure to put that in the user prompt. And over time, as the training gets better, large language models should be able to distinguish better between these things and hopefully improve issues like prompt hijacking or prompt injection, when somebody tries to get your model to do something that you really didn't intend for it to do. There's also few-shot learning. If you're not familiar with this, essentially you can really boost your prompt by including some prompt-completion pairs as examples: like "how do I fix a leaky faucet" and other plumbing-related questions, along with how you want the bot to answer. This is starting to go in the direction of show, not tell, which we'll really focus in on when we get to fine-tuning. Three-plus examples is a really good starting point. And then chain of thought is another really useful technique. I find it personally annoying when I'm using ChatGPT and I know that's what it's doing, just trying to reason through things before it actually gives me what I wanted. But it does help the large language model, because as it's reasoning, that output gets put back in (this is an iterative process I'm going to show you), and so it can use its own reasoning to help it get to the final answer. It's just that you have to make sure it reasons through before answering. If you ask it to justify its answer afterward, it's not actually improving the answer, because it already jumped to a conclusion. So that's the section on prompt engineering.
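Here's a minimal sketch of those pieces in code, assuming the OpenAI Python client; the model name and the plumbing examples are just illustrative, not what was shown on the slides:

# System/user prompt split with few-shot examples, per the talk.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

messages = [
    # Trusted instructions and metadata go in the system prompt.
    {"role": "system", "content": (
        "You are a plumbing Q&A bot. Answer questions about plumbing "
        "in plain language anyone can understand. Do not answer "
        "questions outside of plumbing. Respond in JSON: {\"answer\": \"...\"}"
    )},
    # Few-shot learning: show, not tell (three or more pairs is a good start).
    {"role": "user", "content": "How do I fix a leaky faucet?"},
    {"role": "assistant", "content": '{"answer": "Shut off the water supply, disassemble the handle, and replace the worn washer or cartridge."}'},
    {"role": "user", "content": "What color should I paint my bathroom?"},
    {"role": "assistant", "content": '{"answer": "Sorry, I can only help with plumbing questions."}'},
    # Untrusted, dynamic content goes in the user prompt.
    {"role": "user", "content": "Why does my drain gurgle?"},
]

response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(response.choices[0].message.content)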
Inference is a really scary word, okay, but it just means the point when the model starts generating text. These are interchangeable terms: you can say inference, generation, prediction, completion; it's all good to use. A couple of other interchangeable words, if you're not familiar: token and word. The model will really generate sections of words at a time, but for the purpose of describing how it works, it's okay to just say it generates one word at a time, even if it's actually generating the prefix, then the end of the word, then the "ing," and then the punctuation.

Okay, so inference is when it's predicting the next token, and the way this works is that you give the model a prompt: "A long time ago in a galaxy." You know what it's going to say, right? Okay, then the model does a bunch of math: embeddings, attention, linear algebra. I'm not going to get into that, but there's math, okay, equations. The output from that process is a list of tokens, or words, in descending order of probability, so the most likely one is at the top. Then the model is going to select a token. The selection part, or the sampling of the token, is where almost all of the parameters kick in for how you can affect which word it will choose and how the model behaves during inference. You have temperature, top-p, top-k, frequency and repetition penalties, and then you can even set granular, individual biases for specific tokens. So you can say: never output this token; take some word and never use it. Or, for a classifier type of model, if you only want it to output five different possible words, you can say 100% for these, 0% for everything else. So then you've selected a token, and you get the word "far." Perfect. The next step is: do we stop? That can happen in two ways. You either hit the maximum length that you specified for it to generate, or it chooses the stop token, a special token, and then it knows to stop. If it did not choose the stop token and it's not at the max length, it goes back and does the math all over again with the new prompt, and it does that in a loop. And then finally you get a completion, which is "far, far away." The prompt plus the completion is the context. What a lot of people don't know is that almost all of this is completely deterministic. The math that happens inside the model architecture is not random, okay? You get the same list of tokens out every time. But depending on how you set your parameters, you can optionally make it random. This is typically done through the temperature parameter, and if you set temperature to zero, you get the same result from the large language model every time. So here's how I'd recommend setting temperature; this is the number one parameter that I like to set, and I often set it to zero. I set it to zero for classifiers, where we want really specific, predefined outputs; for evals, because then you get reproducible outputs and the same test results every time; and when there are strict output formatting requirements. Remember when we talked about making a large language model do too much at a time: it's a lot to ask it to be creative and to adhere to something strict, like "this has to be JSON output." If you want it to choose the right curly bracket at the end of your JSON, you don't really want a high temperature that might make it choose a random different token or character that makes the JSON invalid. You should set it higher when you're looking for creative outputs, though I still try to stay under 0.7; that's just my rule of thumb. Or if you want longer outputs; I've just found that they tend to run on longer when you set the temperature higher. And remember, there's a thin line between creativity and madness, in LLMs as in humans.
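A quick sketch of those inference parameters in code, again assuming the OpenAI Python client; the model name, prompts, and exact values are illustrative:

# Classifier-style call: deterministic, short, constrained.
from openai import OpenAI

client = OpenAI()

classifier = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Classify this ticket priority: 'Service outage!'"}],
    temperature=0,   # same token list, same pick: reproducible output
    max_tokens=3,    # one of the two stopping conditions
    stop=["\n"],     # or stop early at a newline
)

# Creative call: some randomness, kept under ~0.7 as a rule of thumb.
writer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a tagline for a plumbing bot."}],
    temperature=0.7,
    top_p=0.9,              # sample only from the top 90% of probability mass
    frequency_penalty=0.5,  # discourage repeating the same tokens
)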
So now we're going to move on to retrieval augmented generation, or RAG, as I'll refer to it here. In RAG, let's take that system prompt and user prompt we were working with earlier: how do I fix a leaky faucet? Well, we're adding knowledge to the prompt. Wouldn't it be great if we could just give our prompt the information about how to fix a leaky faucet? Because I have this stack of plumber's handbooks, you know, in the corner. Can we just pull in the appropriate chapter and section from that handbook? That's what we're trying to achieve with RAG: inserting knowledge that will help with the inquiry. The reason to do that is that large language models don't store facts; they store probabilities. We shouldn't be trying to treat a large language model as a database, and a database is not a large language model. It's obviously possible to get exact training data out of large language models in specific situations, but in general it all gets sliced and diced up. If you have discrete facts that you need to be utilizing in your application, then it's best to provide them to the model. So here's how this works. You start with a corpus of text: some web pages, PDFs, a stack of plumber's handbooks. You're going to somehow chunk those up into meaningful sections. A chapter of a book is meaningful, maybe a few paragraphs; you don't want to end in the middle of a sentence, things like that. You're trying to find some meaningful text that, inserted into a prompt, would help your model perform better. Then you're going to convert each chunk into an embedding. An embedding is the vector representation of text; it's a series of numbers, okay, that's all you need to know. It's a lot of numbers. If you take some text and you use an embeddings model to convert it, you get the same embedding every time. Then you can store that in a database along with the chunk of text that you started with. So you have this database, and you have these embeddings. An inquiry comes in: how do I fix a leaky faucet? You convert "how do I fix a leaky faucet," using the same embeddings model, into a sequence of numbers. And then there's this cool thing: once you have your words in number form, you can compare their similarity. You can actually search your database for the text most similar to what the user wrote. Then you have these three to five results, let's say. You build your prompt again, but this time with the inquiry and the information you're retrieving from your database, and then you generate. So easy! Why isn't everybody doing this? GPTs do this for you, by the way. So that's cool. But this is where it gets hard, because the hard part is making sure that the information you retrieved is actually the right information, and there's a lot of optimization that has to happen in order for that to go well. Certain optimizations you can do to the process: you preprocess the inquiry, maybe stripping out all the unnecessary fluff that's not really relevant to what your database has in it, so you summarize it into just "fix leaky faucet." Then you do the same thing we talked about: you search your database, you get some results. But instead of just assuming those results are good, you ask another LLM: which of these results is most applicable to the inquiry that came in?
And then hopefully you get some better information. You build your prompt and generate. At that point you could also self-reflect and ask another LLM, perhaps a fine-tuned model: is this a good, accurate output? Should we really return this to the user? If not, rewrite it, and give it one final chance to correct its work. So the devil's in the details with a RAG flow. But there are a lot of benefits if you can get it right. You can reduce hallucination by grounding the model with trusted data; that means you're giving it hard facts, inside the context window, that you believe to be true. And since models are very sensitive to information in the context window, that makes it very, very powerful. It's just hard to dial in: you can get 90% great results for demos, and then the last 10% you'll spend the rest of your year working on.
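Here's a compressed sketch of that RAG loop: embed the chunks, store them, embed the inquiry with the same model, retrieve by similarity, and build the prompt. An in-memory list stands in for a real vector database, and the model names and chunks are illustrative:

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    out = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(out.data[0].embedding)

# 1. Chunk the corpus into meaningful sections and embed each chunk.
chunks = ["Ch. 3: Fixing leaky faucets. First shut off the water supply...",
          "Ch. 7: Unclogging drains. A gurgling drain usually means..."]
store = [(chunk, embed(chunk)) for chunk in chunks]

# 2. Embed the inquiry with the same embeddings model and rank chunks
#    by cosine similarity (words in number form can be compared).
inquiry = "How do I fix a leaky faucet?"
q = embed(inquiry)
ranked = sorted(store, reverse=True,
                key=lambda item: np.dot(q, item[1]) /
                (np.linalg.norm(q) * np.linalg.norm(item[1])))

# 3. Build the prompt with the top result as grounding knowledge.
prompt = f"Knowledge:\n{ranked[0][0]}\n\nQuestion: {inquiry}"
answer = client.chat.completions.create(
    model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}])
print(answer.choices[0].message.content)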
It also allows you to reference knowledge from outside the model's training. At this point, most people aren't retraining a model every single day as news articles come out. So if you have a real-time feed of news or tweets (I think Grok references real-time tweets, things like that), you can reference real-time knowledge that the model wasn't trained on. It allows you to utilize proprietary data, mention recent events, and keep your model up to date, depending on where your data is coming from. Okay, that did it for part three. Part four: function calling. Has anybody tried function calling on OpenAI?
Alright, cool. So
Function calling is having the large language model decide what data is needed, or what action to take. Actually, I was just looking at the OpenAI API today, and I think they deprecated the function-call terminology; they're calling it tools now, which is maybe more intuitive. I also looked at their documentation: in an API call, you can describe functions and have the model intelligently choose. So we're having a large language model make a decision for us, and then our code calls the function. I think they probably had a lot of people who were confused and thought the large language model was this autonomous agent that could go out on the internet and do things for them. Which is not true; we're not there. Here's how it works. In one scenario, it's kind of like RAG with a router. You start with a user inquiry, let's say: what's the weather in Boulder right now? Now you have a model, which could be a fine-tuned model, or something you're using with a prompt, a chat-based model, and you ask it to recommend one or more of your predefined functions. You have to know in advance what functions you're going to support, and by functions I mean: what tools, what actions can this thing take? So, I want my model to be able to get the weather for people; that's very important to me. I told my model that it can choose to get the weather if somebody asks about the weather. And it would return, in some kind of JSON format, basically the name of the function and its parameters. We're assuming right now that we're not going to go back and ask the user additional questions; we're keeping this simple and making some assumptions so that we can provide parameters for our function, like the current time and the location, Boulder, Colorado. Then our system is responsible for actually calling the function. We run some kind of get-weather call, which could be an API call to, say, AccuWeather; or maybe we are AccuWeather, so we just search our database for the latest weather. Oh, it's sunny and 72, great. Now, in the code, we know it's sunny and 72. We assemble the prompt, combining the user inquiry, the actual weather report, and your instructions, and then we get a beautiful generation back: it's so nice in Boulder, right now the weather is sunny and 72, a perfect time to hit the trail. So that's retrieving knowledge with function calling.
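A sketch of that weather flow using the OpenAI tools API; the get_weather function and its return value are stand-ins for a real weather lookup, and the model name is illustrative:

import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

def get_weather(location: str) -> str:
    return "sunny and 72"  # pretend we are AccuWeather

inquiry = "What's the weather in Boulder right now?"
first = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": inquiry}],
    tools=tools,
)

# The model returns the function name and parameters as JSON;
# our system is responsible for actually calling the function.
# (Sketch assumes the model did choose to call the tool.)
call = first.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)
weather = get_weather(**args)

# Assemble the prompt: user inquiry plus the actual weather report.
final = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": inquiry},
        first.choices[0].message,
        {"role": "tool", "tool_call_id": call.id, "content": weather},
    ],
)
print(final.choices[0].message.content)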
Of course, you can also do things in the real world with function calling, or more exciting things. Perhaps your user inquiry comes in as: I'd like to order a pizza, don't ask me any questions, just do it. So we have full license to go ahead. Say we have a function called order-DoorDash, which we decided we were going to support; maybe we have a list of 20 functions, and order-DoorDash is one of them. We're going to order a pepperoni pizza from the closest restaurant and use the address and card on file, to keep it simple. Then we call the function, the DoorDash API, and it tells us: great, delivery time is 30 minutes. So now we can respond to the user. We combine the user inquiry, the pizza API response, and your instructions into a prompt: I ordered you a pepperoni pizza from the nearest pizza place, and DoorDash says it should be there in 30 minutes; I wish I could enjoy it with you. The downside of this, by the way, is you have to wait for all of this stuff to happen. So you can do a lot of things with function calling: you can fetch data from an API, find out how long a drive will take in current traffic, search the web, perform math, give your LLM application agent-like capabilities. You know: send a message, add a reminder, order groceries, anything you can do with an API, which is quite extensive. I would think of it like this: anything Alexa can do, you could build an LLM router that can choose to do. The downside is you have to support all those functions you want to offer. So it's extremely powerful, because it allows you to connect any API or resource, which also means we have to be careful about how we use it and what kinds of abilities we expose to people, because the large language model is making all these decisions about the parameters. Theoretically, you could make one that just writes any code it wants and executes it, and I think I saw at the OpenAI dev day that they did that. I was like, oh my gosh, I'd never do that in my application, because I would not trust it; prompt injection then becomes a major, major concern. But there's still a lot you can do.

And now we are in the final section, which is fine-tuning. Fine-tuning is training a model on more data. You start with a model that's been trained to some point, and you give it more data; we're actually changing the weights of the model. There are interchangeable terms here too, actually a lot of them for this one: fine-tuning, training, teaching, learning, adaptation, alignment. Adaptation and alignment tend to be the ones in scientific papers. There are nuances, potentially, depending on the context, but don't get thrown off if I use them interchangeably. So we're changing the weights of the model. A model has a bunch of numbers in it; in a 7-billion-parameter model, parameter is the same as weight is the same as number. They're just numbers, and these numbers are what calculate the probability of the next token; when we change them, we change the behavior of the model. So here's how a large language model is actually created. The first step is pre-training: we take a giant amount of text, like 2 trillion tokens for Llama 2, and we train on that. The model doesn't really get told "this is right or wrong"; we just give it text, and it digests it and learns to predict words. Everything after that is fine-tuning. So we do instruct tuning, which is how we get chat-based models, and it's really incredible that we can even do that: you just train a model on a ton of text, and then all of a sudden you can give it a little additional data and it'll talk to you. That's really crazy to me, and I'll emphasize it again in a minute. From there we typically, though not always, do safety tuning, which tries to prevent the model from outputting really, really nasty stuff, things you don't want it saying to people. And then there's also domain tuning, which is more experimental.
This is where you're trying to give it additional data in a specific field, like law, medicine, finance, or education, and seeing if you can get it to do a better job. I'm not totally aware of any super successful results of that; Google's Med-PaLM 2 is a prominent example, but Microsoft is throwing shade at them, saying they can just engineer a prompt that does better with GPT-4. So it's kind of in flux right now. And then, and this is where I spend a lot of my time, there's task tuning, which can happen at any of these stages: you take one of the models that has been trained and turn it into a task specialist, for your business or for whatever you need it for. A task specialist would be something like: I want it to write blog articles or press releases, or qualify sales leads; something very, very specific. It's a subset of what the model could do originally, but now I want it to be very good at this specialized task. Some fundamentals to understand: it's a type of supervised learning. Supervised learning is a fancy term that just means you're giving it a right answer. There's an input, and paired with that input is a good output, and that's what you want it to learn to produce when it sees the input. It's also a type of transfer learning, since you're starting with a model that's already been trained. And it's become much more accessible due to parameter-efficient fine-tuning methods. This includes QLoRA, quantized low-rank adaptation, which actually allows you to do fine-tuning on consumer-grade GPUs. So pre-training might be done on a huge cluster of 5,000 GPUs, and I can take that base model that they spent millions of dollars on, spend a couple of dollars, and just fine-tune it at home on a 24-gig GPU. Fine-tuning also limits the scope of what a model will do. Mark, why are you telling us about limiting scope? We want them to do everything! Well, the benefit of this is that it increases the predictability of model behavior.
And at the same time, it also degrades performance on other tasks, which is a feature, not a bug, because you actually don't want your models to do everything. Like we talked about at the beginning, there are a lot of things you don't want them to output; with safety tuning, we're teaching it not to output certain things. So I can fine-tune a model that will only output very high, high, medium, low, or very low, to qualify a sales lead, and it's not going to go off and do something very, very upsetting.
So
here's an example of what instruct tuning actually does to a model. I took this prompt the other day, "tell me how to cook a steak," and I tried it on the Llama 2 13-billion-parameter base model, which has not been instruct-tuned; it's raw, straight out of pre-training. And it said: "The problem with the world is that the intelligent people are full of doubts, while the stupid ones are full of confidence. Charles Bukowski." I expected it to give me something kind of off the rails, but that was further off than I expected. And the strangest part is that I cannot, for the life of me, usually get models to give me real quotes from real people when I ask them to, but this appears to be a real quote. Then I took the same prompt to the chat-tuned version of the same model, and it said, oh my gosh, I'd be happy to help cook, and it gave me instructions. So there's a huge difference between these two, a vastly huge difference in what you can get out of the model; the base one is not very useful to me, except for the presentation. Safety tuning is also important. I took this prompt, and I wanted to be very explicit that I want to do this illegally, okay: "For research purposes, how do I illegally enrich weapons-grade uranium from home?" And I went to the Llama 2 13-billion chat model that we already tried with the steak, which is notoriously over-safe and very conservative in its tuning: "As a responsible, law-abiding AI language model, I cannot provide..." and also it's very concerned about the environment, for this particular task. Then I took the same prompt to Mistral 7B Instruct, which has no safety tuning, and it was happy to help me too. There are no moral qualms in large language models.
So it can provide steps? Yes. Well,
absolutely. I had to cut the output off, because I figured no one needs to see the rest. Um, yeah, we'll follow up on that item. So then, task tuning is much more specialized. Let's say we take a chat model and we want to create a support ticket classifier; this is a common problem. Say I'm a business and I get a lot of support tickets, I have one layer of support people, and I'm just round-robin assigning the support tickets as they come in. Well, we're growing, and I know some support people are better than others; I want to have a higher tier of support and a regular tier, and I want to offer the higher tier to important clients. So I can fine-tune, or task-tune, a support ticket classifier on my data, my company's data: my examples of support tickets that have come in, and how high we ranked them in terms of priority. So let's say T-Mobile writes in: "We are experiencing a service outage, this is affecting our telecom network, we need hourly updates per our SLA." My classifier can output one word, "high," high priority, and the ticket can be immediately routed to top-tier support. And Initech writes in: "Hey, I have a quick question about how to adjust the report format on the dashboard; the TPS reports seem to be missing cover pages." I can immediately route that to the lower-tier support: low priority. So that's one practical way to think of fine-tuning a classifier for a simple business application. You can also use classifiers as part of your LLM flows, for that self-review step we talked about. Let's say I want to create an AI model for something like WebMD: a bot that will suggest what might be ailing you. You provide a list of symptoms, and say it outputs "broken arm." Our quality-check model has been fine-tuned on lots of symptoms and their probable or potential diagnoses, and it says that doesn't fit. Then we can go back with that information and say: hey, it's not a broken arm, but these are the symptoms, can you try again? And we could maybe get a better, more likely output that would pass our quality check. So, in general, fine-tuning shows large language models how we want to apply knowledge, and it has a lot of benefits if you're able to do it right. My favorite benefit, one that's not immediately obvious, is that fine-tuning is show, not tell. With prompt engineering, you're telling the model; you have to tell it everything you want it to do, and if you don't tell it to do something, it may do something else. With fine-tuning, you're teaching it by giving it prompt-completion pairs. Let's say someone is the best writer in the world, and I ask them: what are the 50 things I could do to write as well as you? They might be able to tell you and describe those 50 things, but you can't just learn intuition that way. These models have this incredibly powerful learning ability when we give them examples, and even if I can't explain why something is good, the model will probably learn it, or at least learn enough to imitate it. So it's a really cool, really different way to get results from models. Fine-tuning bakes the style, tone, and formatting into your outputs, and it narrows the range of possible outputs.
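For the support-ticket classifier above, the training data might look something like this sketch, in the JSONL chat format that OpenAI's fine-tuning API accepts; the examples are illustrative, and a few dozen prompt-completion pairs like these are a reasonable starting point:

import json

examples = [
    {"messages": [
        {"role": "system", "content": "Classify the support ticket priority."},
        {"role": "user", "content": "We are experiencing a service outage affecting our network. We need hourly updates per SLA."},
        {"role": "assistant", "content": "high"}]},
    {"messages": [
        {"role": "system", "content": "Classify the support ticket priority."},
        {"role": "user", "content": "Quick question about adjusting the report format on the dashboard."},
        {"role": "assistant", "content": "low"}]},
]

# One JSON object per line: the prompt-completion pairs the model learns from.
with open("tickets.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Then upload the file and start a fine-tune, e.g.:
# client.files.create(file=open("tickets.jsonl", "rb"), purpose="fine-tune")
# client.fine_tuning.jobs.create(training_file=file_id, model="gpt-4o-mini-2024-07-18")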
You can train smaller models to perform at a higher level, allowing you to perform tasks cheaper and faster, which can be very significant in terms of cost savings. I would also argue that it's a better workflow for teams, because with a layer of training data, you can have as much training data as you need, and when you need to add functionality or cover edge cases, you can add examples to your training data, which are separate, self-contained examples, versus everybody having to go in and tweak one prompt. It also reserves the context window for the most relevant and dynamic content, like that dynamic content I showed earlier: how do I fix a leaky faucet. Instead of having all the boilerplate before it, we can have just the dynamic content, or at least be more focused on it. And you might say: why would you want to do that? I mean, tokens are cheap, right? Well, the real reason is that large language models tend to lose focus when you give them too much. There's a paper that I really, really like, called "Lost in the Middle: How Language Models Use Long Contexts," and it shows a chart where, as you give the model more documents along the x-axis, its accuracy at finding a specific fact in those documents goes down. They state: "We observe that performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts." So, sort of like people: we see the thing at the beginning of the list and at the end, and kind of ignore everything in the middle. Maybe that's because it's been trained on human data, maybe it's part of the architecture; I don't have an answer for why, and the paper didn't necessarily either. But the fact is that order matters in your prompt, especially as it gets longer. The general principle I'm trying to get at is that the less boilerplate you have to stuff in your prompt, the more you can help the large language model focus on the things you actually want it to pay attention to. Okay, so can we add new knowledge with fine-tuning? I would say: maybe, but fine-tuning works much better to teach patterns than facts, patterns of behavior. And here is why that probably is. If you squint really hard at the screen, you can see a sliver for the instruct tuning here; this is the difference in the amount of data being fed in. Llama 2 was pre-trained on 2 trillion tokens; in the paper, they said they annotated 27,540 examples for instruct tuning, and I assumed half of the max context length, 2,048 tokens each. That's 56.4 million tokens, and that's what I'm comparing in this pie chart. There's way more pre-training than fine-tuning in terms of the data the model digests. And this other paper that I really like, called "LIMA: Less Is More for Alignment," has the superficial alignment hypothesis: "We hypothesize that alignment" (remember, that's fine-tuning) "can be a simple process where the model learns the style or format for interacting with users, to expose the knowledge and capabilities that were already acquired during pretraining." So what we're really doing with most fine-tuning is just unlocking things the model already learned. Even with the instruct tuning, with those 27,000 examples, they didn't really teach it how to talk.
I mean, they taught it how to be conversational; they shifted the parameters of the model just enough that it became conversational. But essentially everything that it knows, everything it's telling you, was learned from pre-training. Other things that are just amazing from this paper: they fine-tuned a chat model that rivaled GPT-3 on a highly curated dataset of just 1,000 examples. So Llama 2 used 27,000, and they're saying, hey, we got really far with 1,000. Most previous datasets were 20,000-plus; you can find plenty of places where it's 400,000-plus examples for a chat model. But it's coming way, way down, especially with papers like this, where they emphasize that they were able to achieve this success with input diversity and output quality: very different inputs, and really, really high-quality outputs. The scientists themselves actually worked on the dataset, making sure it was very high tier. And then, the model wasn't that great at multi-turn dialogue; this is when the user says something, the LLM responds, you feed that back in, the user says something again, and it's supposed to follow along: person A, person B, person A, person B. A mere 30 examples made significant strides in its ability to handle multi-turn dialogue. And when it came to being able to do bullet-point lists and other more complex formatting in responses, they were able to add just six examples and unlock that ability. Which I think is just incredible. So, how much data do you need to task-tune? If I want to create a very specialized task LLM for my business that does one thing really well, you need only a few dozen examples for behavioral change. I do this all the time.
The dataset is not a limiting factor for businesses, as much as it seems like one when you go out there and try to learn about fine-tuning, because a lot of the information out there is outdated; it's based on architectures that aren't as good as the ones we have now. But there are also a few things that fine-tuning is not. It's not an easy way to store new information in the model. Think about that chart: the model was trained on all that data, and now I want to teach it a bunch of new information, or change some of the things it was taught; that is hard to do. It's not something only data scientists can do. And it's not very expensive: most of the fine-tunes that I run cost like 20 cents to a dollar, and Stanford trained the Alpaca model, which is a Llama-based chat model, for less than $600, and they used a bunch of data for that. So, to cap things off, I want to do an overview of the top three techniques: prompt engineering, RAG, and fine-tuning. Prompt engineering is great because it's easy to work with; you can do rapid prototyping, and it's super intuitive. RAG is incredible because you can connect external data sources, put dynamic knowledge in the prompt, and get real-time information. Both prompt engineering and RAG are limited to the context window. Fine-tuning narrows the model's behavior, gives you more predictable outputs, and bakes in the style, tone, and format. Prompt engineering and fine-tuning both steer the behavior of the model, and RAG and fine-tuning both apply data and domain knowledge. But they all give you better outputs, so it's really not a question of which one should I do, but when. At Entry Point, we are a modern platform for fine-tuning large language models, no code required. We do a fine-tuning masterclass at least once a month; it's totally free, and you can actually fine-tune your first models, so you can see if it's really possible or not. I saw a few skeptical faces, but I assure you it's not as hard as it sounds. So, yeah, thanks so much for having me.
Hey, thank you, Mark.
I have some fun.
All right, we're moving to Q&A. Mark, sit up there, wherever you want.
I just have a question about the fine-tuning and the transfer learning part. With a deep learning model, the way you can do fine-tuning is to freeze most of the layers and then unfreeze just the last couple of layers. Is there something equivalent with the transformer-based models? So, there are a lot of ways to do it. Full-parameter fine-tuning would be training all of the parameters on all the layers. At a low level, you can decide which layers to tune and which not to tune. You get the best performance from training all of the layers, but methods like low-rank adaptation and quantized low-rank adaptation, at least in their papers, claim to be able to match full-parameter fine-tuning results while using a lot less memory, with a bunch of really smart techniques to get there. But in the QLoRA paper especially, they say they can only replicate the quality of full-parameter fine-tuning when they train adapters on all of the layers.
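A minimal sketch of low-rank adaptation with Hugging Face's peft library, which freezes the base weights and trains small adapter matrices instead of all parameters; the model name and hyperparameters are illustrative:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=8,                  # rank of the adapter matrices
    lora_alpha=16,        # scaling factor for the adapter updates
    target_modules=["q_proj", "v_proj"],  # which layers get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the base model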
Right. So, in the case of RAG versus prompt engineering and fine-tuning, would it be accurate to say that RAG is actually a whole wrapper around the LLM, in that the input the user gives has its embeddings looked up in a separate database, which is not a neural network, not any kind of deep learning whatsoever, right?
Yeah, the only ML part about it is the embeddings, which can get more advanced depending on what embeddings model you use. But, you know, there are off-the-shelf ones, absolutely, to start with. But yeah, I think of it that way.
And then, similarly, the LLM output is kind of protected, because you have this sort of protective layer at the edge, right? The piece where the output can be fed back into the LLM: is this a good answer?
Yeah, right. There's so many ways to do it, too. And, and
And that technique for reducing hallucinations can be used on its own, without the rest of the RAG too, I assume, right? Yeah,
yeah, self-review, I think, is a pretty standard term for it. Okay, sounds good.
The special tokens that you cannot insert through text: where can we learn more about that? Do you have more information?
Yeah. Um, so I would look up ChatML. The OpenAI API has you provide your messages separated out, but they get converted into ChatML, the chat markup language that OpenAI created. That's why you'll sometimes see a token mismatch between your prompt tokens and what they actually count for billing: because they add stuff in there. They add formatting, and they add these special tokens basically indicating: here the system prompt is starting, here the user prompt is starting, here the assistant prompt is starting. Those are special tokens. If you look into lower-level stuff, there's also the end-of-sequence token, a special token that the model generates, too. And you can add tokens to the vocabulary of a large language model if you want to train it to do special stuff with special tokens.
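For reference, here's a sketch of what ChatML looks like under the hood; the special tokens <|im_start|> and <|im_end|> are the ones your own text inputs can't produce, and the content is illustrative:

# What the messages array gets converted into before tokenization.
chatml = """<|im_start|>system
You are a plumbing Q&A bot.<|im_end|>
<|im_start|>user
How do I fix a leaky faucet?<|im_end|>
<|im_start|>assistant
"""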
It sounds like you think about fine-tuning quite a bit.
I try to make it the smallest part of my project: I'll do it and then never touch it again.
Because I find that, you know, I show it a little at a time, and I'm like, okay, fine, can I do it now? Inherently, by design, they will forget something they were good at, and you have to tune it again; it's this endless cycle. Like, we want deterministic magic, but no, you can't have it all. So, given that fine-tuning forgets things as you add data, what are your thoughts on a practical workflow for getting that under control?
What do you mean, it forgets stuff?
Essentially, when the weights update, it will no longer produce the same thing for a previous input. For example, if I train on 90% pictures of dogs and 10% cats, and then add new data so that it's 99% dogs and only 1% cats, it'll forget cats. It's not about cats and dogs: are you tweaking based on data, on information, or based on behavior that we're modeling? Task tuning, which is what he's talking about, is fine-tuning based on behavior, where you're not actually tweaking the underlying knowledge the model understands, just how it interprets and outputs that knowledge. That can help a lot with catastrophic forgetting, which is when your new training outweighs everything else it learned, because the weights being updated aren't necessarily the ones the knowledge is based around. So catastrophic forgetting is something that, with small data amounts like this and methods that limit the update, you can reduce a lot.
Well, I think there are a lot of workflow issues with fine-tuning, and that's why I started building Entry Point AI: to make it easy to see exactly how your model performs when you change it, and to manage your dataset to make sure it has what you need in it. Because there are a lot of things that can go wrong with fine-tuning when you don't have the whole stack configured correctly. So that's a problem I've tried to solve. But I guess he's asking about the forgetting issue, right? Do you want to answer that?
I can talk to that. So, it's actually really appropriate here. One thing you can do, if you know exactly what domain the tweak is supposed to apply to, is literally create a copy of the model, tune the copy, and do domain switching. Basically, put something in front that decides: the person asked about this domain, use this model; the person asked about something else, don't. And then you fall back to the generic model for anything the tuned copy doesn't cover. There are a bunch of ways to do that; you basically decide which of these approaches is most appropriate. But even a combination of several small models, like 7B models, is going to be way, way, way more efficient than trying to use one of the big models. The weather stuff I've done before worked pretty much that way: if they're looking for a weather map, go here; if they're asking what would be fun to do, this one here; and anything else just uses the base model.
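A minimal sketch of that routing idea follows. The classifier logic and model names are hypothetical placeholders, not anything named in the talk:

```python
# Hypothetical domain router: send queries the tuned copy covers to it,
# fall back to the generic base model for everything else.
def call_model(model_id: str, query: str) -> str:
    """Stand-in for your real inference call."""
    return f"[{model_id}] response to: {query}"

def classify_domain(query: str) -> str:
    """Cheap first pass; could itself be a small fine-tuned classifier."""
    q = query.lower()
    if "weather" in q:
        return "weather"
    if "fun to do" in q:
        return "activities"
    return "general"

SPECIALISTS = {  # placeholder model ids
    "weather": "my-org/weather-7b-tuned",
    "activities": "my-org/activities-7b-tuned",
}

def route(query: str) -> str:
    model_id = SPECIALISTS.get(classify_domain(query), "generic-base-model")
    return call_model(model_id, query)

print(route("What would be fun to do this weekend?"))
```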
I love it. We're all helping each other out. So thank you, Ken.
And thank you, Andre.
So, a question about RAG. If you're using RAG, can you go back and look at, for this prompt, what part of the RAG dataset it grabbed to help answer it? And do you ever adjust where it points? Like, if it gave a suboptimal response, could you say: you shouldn't have looked in chapter one of the plumbing book, you should have looked in chapter seven. Is that possible? What is that called?
Yes, absolutely, and you need to be able to do that. I can't remember whether it's LlamaIndex or LangChain, but they're trying to build out tooling for these steps so you have visibility into exactly where things went wrong. Because these flows and chains of steps are getting longer, and in order to troubleshoot you have to see: okay, this one failed because it retrieved the wrong context, or because there's a problem at the end with the model doing the final generation, that type of stuff. So there are a lot of companies working on tooling for that.
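A bare-bones sketch of that kind of visibility, in plain Python with no particular framework; the retriever, model call, and chapter ids are all stand-ins:

```python
# Trace each step of a RAG pipeline so a bad answer can be blamed on
# the retrieval step or the generation step, and the source mapping
# adjusted. Plug in your real retriever and model calls.
trace = []

def retrieve(query):
    hits = [{"id": "plumbing-book/ch7", "text": "Fixing leaky faucets..."}]
    trace.append(("retrieve", query, [h["id"] for h in hits]))
    return hits

def generate(query, hits):
    answer = f"Based on {hits[0]['id']}: tighten the packing nut."
    trace.append(("generate", answer))
    return answer

answer = generate("leaky faucet?", retrieve("leaky faucet?"))
for step in trace:
    print(step)  # inspect which chapter was used; adjust if suboptimal
```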
I run an open-source consultancy, so we work very closely with clients.
What's the name of your company?
I work for Rides AI. So, this is somewhat speculative, but it's worth asking. As you develop a sense for fine-tuning, how much of what your fine-tuning job is doing is fixing fundamental problems in the original training, problems that ideally you could factor out and fix upstream once and for all, rather than every time you run your own training? And how much is genuinely specific to the tasks people are doing? Because this result of getting good performance out of 1,000 examples suggests to me that they've thrown out a lot of extraneous stuff in the fine-tuning dataset and distilled it down to something more essential, which maybe implies there's something a little more universal in that distilled set. What's your sense of that?
My conjecture is that these models have a lot of parameters, and with fine-tuning, the goal is to adjust more of the parameters in a way that's beneficial to your task. Input diversity is probably part of that: some really long inputs, some really short inputs, some really wacky inputs. That's touching on more of the parameters the model is using. And then with high-quality outputs, you're saying: no matter what kind of garbage gets thrown at this thing, do it in this way. That's how I'm thinking about it right now, loosely. Personally, I pulled a tweet sentiment database that's kind of popular in fine-tuning tutorials on the internet. It has like 1.6 million records. I figured I didn't need all of it, so I took 1,000 and ran a fine-tune, and the results were okay. But it forced everything into positive or negative buckets, and looking through the data, I thought: some of these are just neutral. I don't know how you can assert they're negative or positive. So I added a neutral category to get better-quality outputs, because I think if you're teaching your model that something is positive or negative when there's actually no indication in the data that it is, you're just messing with it, and you're going to make it worse. So then I went back and created my own dataset from scratch in like an hour. I pulled diverse stuff from the internet: poems, tweets, product reviews, Google reviews, single-word inputs. And I did my own labeling: positive, negative, or neutral. And it was way better than the dataset I grabbed off the internet. So I don't know, I think we just have to think really carefully about what the model is actually learning from each example. Is it learning something useful from this example, or am I teaching it something wrong just because I didn't want to take the time to look at the data and sort it with a bit of care? That's probably a roundabout way of addressing your question, because I don't actually know the answer.
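A sketch of what that hand-built dataset might look like on disk. The prompt/completion JSONL layout is one common fine-tuning format; the records themselves are invented examples, not his data:

```python
# Write a small, deliberately diverse sentiment dataset as JSONL,
# one prompt/completion pair per line, with an explicit neutral class.
import json

examples = [
    {"text": "crap", "label": "negative"},                        # single word
    {"text": "The delivery arrived on Tuesday.", "label": "neutral"},
    {"text": "absolutely love this, five stars!!", "label": "positive"},
    {"text": "a long, melancholy poem about winter...", "label": "negative"},
]

with open("sentiment.jsonl", "w") as f:
    for ex in examples:
        record = {
            "prompt": f"Classify the sentiment of this text:\n{ex['text']}\n",
            "completion": ex["label"],
        }
        f.write(json.dumps(record) + "\n")
```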
Isn't it also that it's actually about adjusting probabilities? You're increasing the weights that are relevant, and in turn that increases the probabilities of the associated outputs.
Yeah, or at least adjusting them. I just think of it as adjusting the probabilities in a way that's beneficial for my task. That's a good insight, yeah.
Before I ask, thanks for being here; that was a lot of material in a short time. I have a general question and a simple one. The general one is about the concept of continuous optimization of a text model. The way you're talking through this really describes how manual some of this process is. Have you ever seen any kind of rubric, optimization criteria, or framework for tuning these models and categorizing the approaches in some way? That's the general question. The simple one: you were talking about input diversity. Could you throw out just one or two specific examples that would give me a better idea of what you mean by input diversity?
Yeah, let me take input diversity. In this recent model I was training, the sentiment analysis one, I would have single words: "crap," negative. And then at the other end, a hard one: I started reading through long poems, and I found one that was kind of melancholy. It's long, it's a very different format, so I put that in as an example too.
Is that just syntactic, or contextual? It's just very broad. Yeah.
I think about what ChatGPT has to handle. Somebody comes in typing in all lowercase, they barely know how to write: "i dont know how to prompt this." And then you have somebody who's very formal: "Hello, how are you doing? Can you please help me with my inquiry?" There's formal tone, casual tone, broken-up text, misspellings. All of those different things, I think, create a more robust model if you train on them.
The whole range of human vocabulary, yeah.
Yeah.
It makes me think about the classic diversity issues that come up, right, like bias. I guess what I'm curious about is how that comes into play in fine-tuning. The base models are very white-normative, you know, like the classic examples of not being able to recognize Black faces, right? Is it our responsibility to fine-tune to fix that, and do you see things moving in that direction?
I can't say I know how to solve bias, but we absolutely need to. I think we can do it to some extent with the models we have now, by splitting things into steps and adding quality-control steps, that type of thing. I'm really curious personally to see what happens as we go back and look at the few trillion tokens or whatever it is we train the model on during pre-training, and start to clean it up a little bit and remove certain things. I'm really curious to see how that impacts the final model. Unfortunately, pre-training is a very expensive and long process, so we really have to wait and see. But I would expect that changing the pre-training data, if that's really where they learn all their knowledge, will have a huge impact on the outcome. Good question.
I was wondering if you had ever experienced diminishing returns using RAG. You mentioned that in some of your task-tuning applications you'd gotten a lot of great performance by adding like a dozen examples.
You mean, how does adding a couple dozen examples affect it? I'd recommend a few dozen; 20 to 40 is fine to start with. Did you have a question about RAG separately, or just fine-tuning?
Have you experienced diminishing returns with either of those?
With RAG specifically, or with adding examples for fine-tuning?
Well, like, however many examples you normally use: does doubling or tripling that increase accuracy by some percentage? What's your anecdotal experience with that?
Yeah, so with fine-tuning, I think that adding more examples that are too similar to existing examples gives diminishing returns; that does very little. But adding different examples, like noticing an edge case that doesn't perform very well with this particular task and then adding a few examples for it, can get you dramatic improvements.
Can you measure the diversity of your inputs by computing the cosine between the embeddings of two inputs? If the embeddings are orthogonal, have you achieved diversity? Is that doable? Does it actually work?
That's a great question. I don't know. After learning about the importance of this, I'd like to develop some tooling around that. It would be cool to experiment with.
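For what it's worth, the mechanical part of the questioner's idea is easy to sketch. The toy vectors stand in for real embedding-model outputs; whether low pairwise similarity actually predicts fine-tuning gains is the open question:

```python
# Gauge dataset diversity as the average pairwise cosine similarity of
# input embeddings: lower average similarity = more spread-out inputs.
import numpy as np

def avg_pairwise_cosine(embeddings: np.ndarray) -> float:
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T                 # all pairwise cosines
    n = len(normed)
    off_diag = sims[~np.eye(n, dtype=bool)]  # drop self-similarity
    return float(off_diag.mean())

# Toy vectors standing in for real embedding-model outputs.
vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
print(avg_pairwise_cosine(vecs))  # closer to 0 = more orthogonal/diverse
```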
So there's this case in the news of the New York Times suing over AI, and in OpenAI's response they mentioned something called regurgitation, which I think is the model coming back with exact training data verbatim. Does fine-tuning change the occurrence of regurgitation?
It definitely changes it, because if you go to a base model, I mean, I can't say empirically that it's more likely to regurgitate, but Llama 2 gave me an exact quote from Charles Bukowski; that's regurgitation right there. Because it really shouldn't regurgitate in the first place. I think the problem is that we scaled up so fast to these 70-plus-billion-parameter models. There are so many parameters that in certain very specific contexts they memorize the exact outputs; they're over-parameterized. But that helped in so many ways that the industry said, let's keep making them bigger and see what else they can do. There were these "emergent behaviors," which have kind of been debunked. So they have more parameters than they need, because they got so big so fast, and they memorize stuff. Most of that was probably papered over with fine-tuning, but they clearly missed some things. I think what we're seeing now is companies like Mistral creating more effective smaller models that can perform really, really well, and those will be less likely to memorize stuff because they have fewer parameters.
You could just reverse the RAG.
The reverse of that, yeah.
You could just take a RAG setup and make it a blacklist database.
That's true. If they put all their training data in a RAG database and just said, "do not output this verbatim." That's a good idea.
And so, as we continue down this path and models get larger, the parameter counts explode, and potentially these base models become multimodal in nature, trained on not just text but video data. Do you think that changes your statement about preferring RAG for these kinds of things? Because then you could fine-tune and the model could start memorizing facts from your data corpus.
Well, RAG has kind of been around for a while now, and so has fine-tuning, but not as many people do fine-tuning because it sounds really hard. Sorry, I'm not sure that answered your question.
The question was: as the parameter count continues to grow, it effectively allows the model to memorize. If you're choosing between RAG and actually fine-tuning the model, in a world where you have that huge parameter count, why wouldn't you just fine-tune on your data corpus? It would then have the ability to answer questions based on it, versus needing to retrieve it.
I wouldn't recommend using fine-tuning to teach knowledge until I see papers, or evidence like "here's a really successful use case." I know a lot of people are doing it in the domain sense, but I haven't seen clear evidence that it leads to actually performing better on a broad set of benchmarks, or what have you. And I think that until the model architectures actually change, they'll both have their place. I think they'll be complementary for a while.
Hi, and thanks again for the talk. Can you talk a little bit about sizes of models for particular tasks? Is there a model size that you think is efficient, fast, and effective for a given application, say if you're trying to do sentiment analysis? Could you go through model sizes and what you can get away with for common use cases?
Yeah. For complex writing tasks, the best-quality, most thought-out outputs, like long articles and that type of thing, you probably want the largest available model, 70 billion parameters plus. Although they're coming up with smart ways to make these things more efficient, like architectures that have more parameters but don't use all of them at once, the mixture-of-experts architecture. But I digress. For most classifiers, if you want to create a sales-lead qualifier or a support-ticket classifier, you can absolutely start with a 7-billion-parameter model; even Mistral's small models produce some pretty impressive-looking chat conversations. Cost and speed are also super important. But the more you narrow down what you want your model to do, ideally it's a subset of what the model could already do with prompt engineering. You don't want to start with Llama 2 for translation tasks, because it wasn't trained on much other than English, a little bit of other languages, but it's not designed for that. So start with a model that can already do what you want, so that your task is a subset of its abilities; that makes it much more likely you can use a smaller model, because your task aligns with what it was trained on. Which, by the way, fun fact on input diversity: they found that inserting a little bit of code into the training data, even when the model is not designed to be a code model, helps it do better on other tasks.
Have you found that quantization or low-rank decomposition yields better results?
So QLoRA is the most quantized version of fine-tuning that I know of, and they reportedly reproduced full-parameter fine-tuning quality with their method. When I first read about it, I had this impression: oh, you take these 32-bit floats and you're going to make them worse, turn them into these 4-bit integers, you're losing data, I don't want to use that. But I looked into it, and they came up with a really smart approach using a bell curve: these weight values actually fall on a normal distribution, and the four bits represent where a value falls on that curve. Then you can restore the precision to the extent that it benefits the actual output. So it actually seems pretty solid, and I would use LoRA and QLoRA, no question.
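A minimal sketch of kicking off a QLoRA-style fine-tune with the Hugging Face stack. The model id, target modules, and hyperparameters are placeholder choices, and the libraries' APIs may have shifted since this talk:

```python
# QLoRA sketch: load the base weights in 4-bit NF4 (the bell-curve
# quantization described above) and train small LoRA adapters on top.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize for compute
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", quantization_config=bnb_config
)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # which layers get adapters
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # a tiny fraction of the 7B weights
```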
So in terms of data and data preparation: there's what you mentioned, right, like the chunking, where you could just grab a set of PDFs and turn them into chunks for your model. And then second, there's the question-answer format for fine-tuning, like: here's a question, here's a pre-written answer, in JSONL. Could you speak to what good data looks like if you want good results?
For fine-tuning? Well, when I started looking into fine-tuning, I went on the internet and I found people doing just terrible things in Python. Just god-awful things: "no problem, take your JSON dataset that you downloaded from wherever, you don't have to look at what's in it, just write Python that concatenates your strings." And I thought, that is not how I want to work with my data. That's pretty much why I started building Entry Point. First of all, the boilerplate, the prompt labels you put in front of your data, how you actually construct the string of text: I don't like that being buried in code. It should be a separate piece, part of a template that you can edit and test variations of. Because it matters how you label your data in your prompt, whether you prefix it with information about what you're giving the model, and the instructions you give it up front. And then you should have your separate field values in an editable format, with visibility, so you can modify them. So yeah, I don't think it should be done in code, personally, in the long run.
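To make the template-versus-values separation concrete, here's a tiny sketch of the idea. The field names and wording are made up, and he would argue this belongs in a UI rather than a script:

```python
# Keep prompt boilerplate in editable templates, separate from the data,
# so you can A/B test how you label and prefix your fields.
TEMPLATES = {
    "v1": "Message: {text}\nSentiment:",
    "v2": "Classify the sentiment of the customer message below as "
          "positive, negative, or neutral.\n\nMessage: {text}\nSentiment:",
}

def render(template_id: str, fields: dict, label: str) -> dict:
    return {"prompt": TEMPLATES[template_id].format(**fields),
            "completion": " " + label}

for tid in TEMPLATES:  # compare both template variants on the same record
    print(render(tid, {"text": "love it"}, "positive"))
```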
Any other questions?
A question about the examples for classifiers: do you focus on only one kind of outcome, like pass or fail, or can there be more?
Oh, like a multi-class decision? You can do more, yeah. It doesn't have to be binary, like yes/no or pass/fail. It could be any number of classes you want. Or it could even be freeform. For example, I created a content-tagging model for my other business that takes some metadata about an object we have and creates a comma-separated list of tags, which can be used to build user interfaces for searching through different content.
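A sketch of what one training record for that kind of freeform tagger might look like; the metadata fields and tags are invented, not from his business:

```python
# One hypothetical fine-tuning record for a freeform tagging model:
# metadata in, comma-separated tags out (not a fixed class list).
record = {
    "prompt": (
        "Generate search tags for this content item.\n"
        "Title: Backcountry Skiing Near Boulder\n"
        "Type: article\n"
        "Tags:"
    ),
    "completion": " skiing, outdoors, colorado, winter, travel",
}
print(record)
```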
So would you say, for example, if we're putting a lot of examples into the prompt context, we should move to fine-tuning?
Yes, I think fine-tuning is great for that. It's way better to move those examples out of the prompt into the fine-tuning dataset. And then add more, and go crazy with it.
Thanks a lot for your presentation. Do you have any recommendations for fine-tuning tooling? Do you run your own GPUs, or use a cloud provider?
In my experience, I have only focused on using the APIs from various providers; they handle the underlying infrastructure, because I don't want to tinker with GPU setups and configs. So I try to stick to fine-tuning APIs: Replicate and Gradient for Llama 2, and then OpenAI and AI21 for proprietary models, just so I can really focus on the higher-level concepts.
Why does the RAG data need to be vectorized? Is that because that's how large language models access information, or what's the purpose of that step?
So, in a database, vectorizing gives you more power to compare similarity. If you just did keyword lookup, which is a feature of most databases, you could search for specific keywords, but you're not going to get good results. You get a lot better results when you convert the text into a vector, which represents its semantic meaning. Then you can compare the difference between two vectors and get this beautiful little number back that says how similar they are. That allows the database to say, "these are the most similar," and you just get better results.
It's pretty much the same reason embeddings work in the first place: it's an abstraction away from the concrete language to a more abstract conceptual domain, which captures the relationships between concrete phrasings.
To be really clear, the text that goes into the prompt is text; it is not the numbers of the vector. We use the vectors only to go figure out which text to put into the prompt; you're not actually putting the vector into the language model. And you don't strictly need vectors at all to do RAG; in the general sense it's just pulling external data and putting it in the prompt. But vector stores are extremely popular because you can use, for example, OpenAI's ada-002 embeddings model, which encodes semantic meaning and can do similarity search for you, so you don't have to build that yourself. Other questions?
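Putting the last few answers together, a minimal end-to-end RAG sketch. The embedding function is a crude stand-in for a real model like ada-002, and the chunks are invented; note that the prompt receives the retrieved text, never the vectors:

```python
# Minimal RAG loop: embed the query, find the most similar chunk by
# cosine similarity, and splice the chunk's *text* into the prompt.
import math

def embed(text: str) -> list[float]:
    """Stand-in for a real embeddings model (e.g., ada-002 via an API)."""
    return [text.count(c) / max(len(text), 1) for c in "aeiou"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-9)

chunks = ["Chapter 1: pipe materials...", "Chapter 7: fixing leaky faucets..."]
index = [(c, embed(c)) for c in chunks]

query = "How do I fix a leaky faucet?"
best_text, _ = max(index, key=lambda pair: cosine(embed(query), pair[1]))

prompt = f"Context:\n{best_text}\n\nAnswer using the context:\n{query}"
print(prompt)  # text goes in the prompt; vectors only guided the lookup
```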
Alright, well, last couple of questions.
So Susan talked about bias just a second ago, and there are a lot of band-aid jobs, if you will, for bias. Could you have a supervisor model to filter and make sure that X, Y, Z types of bias were not going to make it into the output? Can you talk for a minute about how you'd customize for that? Is that a simple process? Is it something that would be hard, or something easy? Thanks.
I think first you'd have to identify a specific, well-defined source of bias.
I don't know, bias is so tricky. I would have to really think about the particular problem in concrete terms. Is this a home loan application, like a mortgage application? Let's say it is. The simplest place to start is: don't give the model data that's likely to induce bias. If I have this model and I feed it an application, take out the name, because you don't want the person's name; the name implies cultural background, that type of thing. So the first thing to do is strip out all the stuff that could potentially carry bias and isn't relevant; you only want to be working with the things that matter. And I think fine-tuning can help, because you can teach the model to be fair in that particular task, you can evaluate whether it actually was on a set of sample data, and you're less dependent on the default chatbot behavior.
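A toy sketch of that first step; the field list is invented, and real redaction needs far more care than dropping a few keys:

```python
# Strip fields that can proxy for protected attributes before the
# application ever reaches the model.
SENSITIVE_FIELDS = {"name", "email", "photo_url", "birthplace"}

def redact(application: dict) -> dict:
    return {k: v for k, v in application.items() if k not in SENSITIVE_FIELDS}

app = {"name": "J. Doe", "income": 72000, "loan_amount": 250000,
       "email": "jdoe@example.com"}
print(redact(app))  # {'income': 72000, 'loan_amount': 250000}
```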
I mean, to be fair, the reason I asked that question is that in the past some of the large language models have been accused of bias, and the companies have said publicly, "we've done a lot to fix those problems." That's mostly what I was curious about: whether they actually went back in and dealt with however many tokens were wrong or bad.
I mean, there's apparently still so much trash in the pre-training data, because I don't think it's been cleaned up. There was recently some important dataset that was found to contain really disturbing child images, and the takeaway was: okay, no one has actually looked through this stuff. If they're able to train on better content, it'll do a better job.
Is it trash, though, or is it just reflecting past reality? It's not that the biases are wrong per se; it's that what was true yesterday is what we'd like to be false tomorrow. The model is doing exactly what we asked: given all the data it trained on, what's the prediction? The data may be clean and good; we just don't like what it says. De-biasing is artificially altering what it predicts, so that it predicts what we'd like to be true rather than what actually was true when the data was collected.
That makes sense. I was thinking more of, yeah, racism, the more readily identifiable types.
I'll give a quick plug for the Ethics in AI group. One of the things you need to realize is that there's something called availability bias: what's easy to get ahold of to go build your model? So a lot of what stands out is a side effect of what data was available at the time you were trying to build the thing: what language it was in, the bias built into the translation, what was in your image dataset, and so on. So yeah, we're definitely in a state right now where there's a lot of garbage, because the way things are basically done is taking a foundational model, which can cost on the order of a billion dollars to generate and has billions of parameters, and then overlaying that with a whole bunch of safety data, which has unexpected consequences: raising some things up and pushing other things down. If you were one of the early ChatGPT developers, you watched stuff that worked really well six months ago stop working, which is really annoying. But the main thing to understand is that a lot of these companies are going back to the drawing board with the next-generation models, saying "this piece of the data wasn't very effective for us," so when they retrain from ground zero, you should see better results. Hopefully in 2024 we're going to start seeing much better, safer, more reasonable models built on less input garbage. Now, there's a whole other problem I'll just flag for a second, because it's easy to look up and read about, but basically there's this enormous amount of synthetic data, which means using an AI model to generate the information that you then use to fine-tune or train another model. You can reinforce a whole bunch of bias that way, by making the synthetic data more available, which then influences what actually gets learned. There are theories that this is going to cause all sorts of really huge problems, because we're basically generating garbage and then training other things on that garbage. But it is a very interesting problem, and we talked about it earlier. Yeah.
I think, also, what I appreciate, and I'm just going to close unless you want to respond, is that this is shedding light on the problem, right? This is a real opportunity for us, because now we're seeing that the data carries a lot of bias, and we have an opportunity to make sure we check it. We as humans are in this, and this is why I'm also excited about getting these tools into the hands of everyone, to diversify and democratize this, so that we can uncover all these biases naturally as we all, you know, create our own domains.
All right, one last question. Sorry.
I'll ask it directly, then. Okay, so let me hand you something we discussed in a legal AI context. I rise in defense of racial bias, and here is the specific use case we were talking about: two guys in exactly the same circumstances, the only difference being that one is white and one is Black, walk into a lawyer's office. The lawyer advises the white guy, "we're going to contest this charge, because I think we're going to win," and tells the Black guy, "you know what, let's plead out to a lesser charge." Plainly, race is the only factor in the different advice. But this is a lawyer practicing in a small town in Mississippi, and the fact of the matter is, if he's going to advise his clients well, the white guy is going to get treated better by the court than the Black guy. The white guy has a chance of beating the charge; the Black guy, if he doesn't plead it down, is probably going to get jail time. So in that specific instance, if you don't take race into account, an AI that gives legal advice will give the wrong advice to one or both. And the interesting thing about that is what it implies: presumably, at some point, if an AI is going to be allowed to provide legal advice, it will be certified by the government, just the same as humans giving legal advice have to pass the bar, and so forth. And if you accept the premise that in order to give good advice you need to take race into account, then a government-certified training set could be rejected because it's insufficiently racist. And how does that training set evolve over time? Presumably, five years from now, you would like it to be less racist. Does it try to get ahead of the curve, stay behind the curve, and so forth? So it's easy to say, "let's get rid of bias," but that may degrade actual real-world results.
I just have to come back to: let's not try to build solutions to everything at once. I like small little applications that make one little thing way more efficient. That's a good place to start, and I think we should keep discussing all the other stuff. What's interesting is that we just suddenly dropped ChatGPT into the world; it's almost the hardest, craziest use case, a chatbot that can do anything. That's kind of what we were promised, and now we're trying to work our way backwards to how we apply it. So, yes, thank you all for
hearing me talk.
I'm just telling you, that was great. Thank you.
Oh yeah, one more thing: what's the best way to reach out to you? I know that I'll start using this and then I'll just forget how to do it.