Panel Discussion | Synthetic Data Generation in the Era of GenAI
9:10PM May 31, 2024
Speakers: Philip (moderator), Kevin McNamara (Parallel Domain), Emrah Gultekin (Chooch), Matt Lee (Talc)
Keywords: data, synthetic, model, customers, accuracy, labeled, simulation, world, humans, AI, question, case, companies, sign, generate, computer vision, type, domain, set, Kevin
find the hidden relationships between different data connections more easily, quickly, and efficiently. Yeah, now I will give the stage to Philip and let him introduce our panelists. Over to you.
All right, thank you. So, excellent. Kevin, why don't you come on out? Actually, how about everyone comes out, and then you can each introduce yourselves a bit after I introduce the topic. So welcome. I'm
excited about this topic. So it's synthetic data generation in the age of Gen AI, and the reason I'm excited is that people talk about data being the new oil — or actually, data being the new code, right? All of what would be code gets pushed up as data into some machine learning algorithm that then, you know, becomes rewired. So if data is something between gold and the new oil, and we can synthesize it, that sounds pretty darn good. We've got three experts here who are doing this themselves and are deep into it. So why don't we start here — I'll let each panelist introduce themselves and talk about their companies and what they're doing. Great.
Hi, my name's Kevin McNamara, founder and CEO of Parallel Domain. We're a company that provides simulation software for AI developers, and the big problem we're trying to solve — we'll hear a lot about this today — is that acquiring and labeling real-world data is really hard. It takes a long time, it's super expensive, and it's often tough to label that data accurately, especially with humans. So we provide simulation software to our customers, who are developers of autonomous systems — autonomous vehicles, drones, all sorts of computer vision and perception systems — that allows them to run 3D simulations. First, they can run those simulations to test their models, see how they're performing, and use that to identify weaknesses. Then they can run millions more simulations to generate big training data sets and actually train those models to perform better. We'll get deeper into that and its use cases today, but that's what we do. Emrah?
Thanks, Kevin. Emrah Gultekin, I'm the co-founder and chairman of Chooch. We're a computer vision AI company based here in the Bay Area, and basically what we do is try to replicate human visual cognition in machines — so further down the pipeline, in terms of creating the models, inferencing, and putting the analytics on-prem, in the cloud, or on edge solutions. We use a lot of synthetic data from the industry, and we also create our own synthetic data to be able to train the models to high accuracy. Good to be here. Matt?
Hey everyone, I'm Matt Lee. I'm the head of a company called Talc, and we also provide data to train and test models. But unlike these two, we focus on text and NLP — specifically, language models let you do a ton more with NLP and text data than has ever been possible before. Digging into it, it turns out humans kind of suck at labeling: they're slow, they're not good. If you've ever worked with data labeling, the labels come back and, I don't know, 30% of them are wrong on basic things, and for a lot of NLP tasks we shouldn't have to use humans for that anymore. So we provide this data, and it becomes training and testing data for all sorts of use cases, from accounting to medical care. Thanks.
All right, so the first question we'll go around on: what kinds of decisions do you see as being well suited for synthetic data? And then maybe, along with that, the flip side — what kinds of things are not well suited for synthetic data? And maybe you could answer in the context of how you use it inside your own business, or how your customers are using it.
Yeah, happy to give some examples. I think something that's nice about a lot of the vision- and perception-based work our customers do is that it's very concrete — it's easy to picture these examples. A lot of our customers, like I mentioned, might be automotive companies, drone delivery companies, mobile computer vision, warehouse logistics, agriculture; all of these companies want to deploy perception models into the real world that can then do useful things. I think driving is probably one of the easier examples. One of our big customers is Toyota. They'll work on something like improving their pedestrian detection model, right? With a computer vision algorithm on a car that's driving down the street, it's really useful if you can detect a person or a bicyclist in the road and use that information to avoid hitting them. Training models to do that is really hard, though — it's really hard to get a lot of data for those use cases. And we find that synthetic data is fantastic for object detection, semantic segmentation, depth estimation — all vision tasks in autonomy that are really helpful for understanding the world around you. A great example of this would be jaywalker detection. A lot of our customers will generate data sets of people running into the street at night, or in the daytime, or in the rain, or wearing different kinds of clothing, or chasing a ball —
which would be a horribly dangerous thing to stage with real humans, not something that
you want to set up in the real world, and frankly something you hope you don't encounter if you're out collecting data in the real world. That's really dangerous, and we've all seen examples of companies based in San Francisco who are trying to do their best at being safe while testing and collecting data in the real world. But there's a limit to how safe you can be: if there's a car driving down the road, there's always a chance there's going to be an accident. We saw some of that unfortunate stuff with Cruise recently, halting a lot of their real-world operations because of an accident. Our perspective is that synthetic data and simulation are a great way to do that work in a safe environment, and it's really rapid to iterate and run thousands of simulations. So that's just one example of how our customers use it today.
Yeah, just building on that, Kevin — by definition, what we're trying to do in computer vision usually is to catch anomalies, and anomalies don't happen very often. So even if you have some data about that anomaly, you need to be able to replicate it and then scale it. One of the things we've been doing is producing different types of synthetic data based on 2D and 3D imagery — this is through our own system — and we also use a lot of augmentation. The story really begins with one of our customers who wanted to detect things in the air. We said, okay, we can build the model for it, and we asked for the data. They had no data — absolutely no data on what they were trying to detect in the air. We said, how could that be? You must have something. They said, we have CAD files. We said, okay, let's take the CAD files. So we created a system where we could take the CAD files and recreate them, scaling up to 50,000 or 60,000 images, to produce a model that could detect these objects. So ours was very much pushed by the customer. The customer wants accurate models; they're not really interested in how you build them, where the data comes from, et cetera. They really want you to give them a very accurate model. You cannot do that with only real data today — you always need some type of synthetic data to reach the kinds of accuracies we see today. All of our models — we have 26 ready-now models — have an element of synthetic data in them, the least of them 20%, and this goes up to 90% depending on the type of model. So we didn't have a particular interest in synthetic data; we were trying to build models, but we had to get into it as we moved forward. That's kind of the story with what we're doing in computer vision. Matt?
Yeah, thanks. I actually have a slightly different experience with this, because with vision there are all sorts of anomalies, and there's some of that in NLP as well. But what we really find is that for a lot of customers, their use case is bounded and pretty well defined, right? In the text world, you want to classify over 50 categories, not a million. Or — the simple example is always, oh, I have a chatbot, right? It needs to understand HR policies and things about our product. It's not usually this AGI-focused grand vision. And what we see in these cases that are bounded and pretty well defined is that the problem you need to solve is not, generally, how far can RLHF go and what can this model do. It's: hey, did I answer this set of, like, 500 questions correctly? Am I going to get sued for this week's set of outputs? And so what we find is really rich about synthetic data is that if you have knowledge bases and documents that already back the type of knowledge you want to convey — via whatever entity extraction you're doing, or whatever transcription you're doing — we have a really powerful set of tools that translate that into data you can effectively use for model building and model training.
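To make that concrete, here is a minimal sketch of that kind of pipeline: chunk an existing knowledge base, draft grounded question/answer pairs from each passage, and split the result into training and test sets. The `generate_qa_pairs` function is a hypothetical stand-in for whatever language model call you would actually use; nothing here is Talc's actual tooling.

```python
import json
import random
from typing import Iterable


def chunk_document(text: str, max_words: int = 200) -> Iterable[str]:
    """Split a source document into small passages the generator can cover."""
    words = text.split()
    for i in range(0, len(words), max_words):
        yield " ".join(words[i:i + max_words])


def generate_qa_pairs(passage: str) -> list:
    """Stand-in for a language model call that drafts question/answer pairs
    grounded in the passage. Replace with your own model of choice."""
    return [{"question": "What does this passage say about the policy?",
             "answer": passage[:120],
             "source": passage}]


def build_dataset(documents: list, test_fraction: float = 0.2):
    """Chunk every document, generate grounded examples, and split them."""
    examples = []
    for doc in documents:
        for passage in chunk_document(doc):
            examples.extend(generate_qa_pairs(passage))
    random.shuffle(examples)
    split = int(len(examples) * (1 - test_fraction))
    return examples[:split], examples[split:]


if __name__ == "__main__":
    train, test = build_dataset(["Our HR policy allows 20 vacation days per year."])
    print(json.dumps({"train_examples": len(train), "test_examples": len(test)}))
```

Because every generated example carries its source passage, the same data doubles as a test set you can audit against the original documents.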
Maybe to jump into the second part of your question, which is: where is synthetic data not good? My view would be that synthetic data is not inherently good or bad, just like real data is not inherently good or bad. Good data is good and bad data is bad — poorly labeled data is bad, well-labeled data is good. I think the real challenge, and we've probably all encountered this, is that getting a ton of good data is really hard. You can get some good data, but getting as much as you need is probably a problem for anybody developing AI models here. Getting that data labeled well enough to be super, super accurate is probably a problem for everybody here. So the challenge is: how do I augment the real data I can get, and label well, with synthetic data? And it's really hard to generate good synthetic data. That's why you see companies spinning up that are solely focused on synthetic data generation: if you're trying to develop an autonomous vehicle, an autonomous drone that delivers packages, or satellite computer vision algorithms, it's often really hard to build great synthetic data and that product at the same time. This is also an area where, when we're working with customers, we very often talk through their question of whether to build their own synthetic data systems internally or buy them from somebody else. And the question really comes down to how much you can invest, in both time and money, to generate good synthetic data. Because it is possible to generate bad synthetic data and damage your model, just like it is possible to make a labeling mistake and damage your model with real data. So the real question comes down to high-quality data, and making sure it's actually teaching your model the right thing.
Yeah, you make me think of a conversation I sometimes have with potential customers, where they ask whether the decision is between real data and synthetic data, or how they'll know that synthetic data works. My response is always along the lines of: it kind of doesn't matter if it's synthetic or real. It matters if it's right or wrong, if it represents your use case — and if it's wrong, it doesn't matter if a human came up with it. You're really worried about quality more than the source.
And if you know what right and wrong look like, but that data doesn't show up very often in the real world — such as the pedestrian-walking-into-the-street case — then I can totally see how that would add a lot of value and be pretty low risk. It's not like the machine is just making its own decisions; someone's sitting down, identifying the scenarios, and then creating something that represents what the machine would see. Is that right?
Yeah, and there are a lot of dimensions to good and bad data, right? Bad could be poor labels. With synthetic data, it could be low-fidelity or unrealistic data that doesn't look real enough to keep the domain gap small. It could be differences in the domain itself: very often you can collect a real-world data set in Germany and another data set in Japan, and those two data sets might not work well together in training a model if you want that model to perform in Ireland. So it's really important to be very thoughtful about curating the right data from the right domain, with high-quality labels and high quality in general. There are a lot of dimensions to that.
Cool. So Emrah, a question for you. You mentioned you didn't actually have a predisposition towards synthetic data or not — at the end of the day, the metric is: are you getting accuracy, is the customer getting what they need? And I'm curious, as you've discovered that synthetic data is perhaps more useful than you might have guessed initially, were there any tipping points in that journey? And are there any particular areas you've cordoned off and said, here we definitely want real data or human-generated data?
Yeah, great question. So the tipping point was that customer who didn't have any data — that pushed us totally towards synthetic data. Not much of a choice, really. But generally what we try to do is collect as much real data as possible. Real data is really important; it's also important for seeding the synthetic data. But it turns out that collecting real data is very difficult. You have collection problems, you have labeling problems, as we mentioned here, you have copyright issues, you have to anonymize the data, and so forth. That process has very high friction. So what you want to do is seed your data sets with real data and extrapolate as much as possible. But it depends on the domain, it depends on what you're trying to detect, and it depends on whether you're merging language with your models as well — which we've been seeing in the last two years, where computer vision and language are merging. So it really depends on the domain, but I think it's very critical that we take synthetic data very seriously in how we build our models today. And the irony is, we started out with synthetic data generation, and now it's been rebranded as Gen AI — the original synthetic data people actually did it to train models, and now we've become the consumers of synthetic data. So there is some irony there as we move forward. We really don't know which way it's moving, but as far as our experience and our real-world experience with clients goes, we are using more and more synthetic data as we move forward.
Do you view obfuscation as something different, or in between? Because it's like real-world data where part of the fields have been masked, so it almost seems like a hybrid in some way. I'm wondering how you see that — maybe you can start, Emrah, or either of you.
I'll just add one more thing to that. Yeah, absolutely. The good thing about synthetic data is that you don't have to label it — it comes pre-labeled from the machine. So the obfuscation and all the other stuff that goes on is very critical to how you put that data into the model and retrain it. We haven't taken the position of trying to inspect the labeled data. We've taken another approach, which is: we train the model, then we check the accuracy and whether that accuracy works for that specific project or use case. We don't really go back and check every image or every label that goes into the data set. It's a different approach, but it's possible because we're able to train the models very quickly, with low GPU or CPU power, and check the model itself.
Any other comments about obfuscation?
When you say obfuscation, are you talking about it from a privacy perspective?
Yeah, from a privacy perspective.
Yeah, absolutely. It's somewhat to Emrah's point. There are lots of definitions of synthetic data, but at Parallel Domain we run 3D simulations to create that synthetic data — that's how we generally define it. One of the big advantages is that it comes labeled, because it's a simulation: the computer knows where everything is. One of the other big advantages is that, because it is a simulation, there are no real people in it. Every person, every license plate, every car, every time a person was at a specific point in a virtual world — those are all simulated. So we don't have to do a lot of work on obfuscation; that's almost a natural advantage of that method of generating synthetic data. It's worth covering that synthetic data is really a whole spectrum, right? A lot of people would just define synthetic data as data that is generated by a computer in some form. So is data augmentation synthetic data? If you take a real frame of data and augment a picture of a person into it, I'd say yes, it's a form of synthetic data, and in that case you would need to think about obfuscation and protecting privacy. It's a huge problem for autonomy companies in general, across drones and cars and agriculture: if you have real humans who are not part of your operation captured in your data, how do you obfuscate and protect their privacy, so that when you go and train models you're not using their PII — personally identifiable information? Again, this is an advantage that certain types of synthetic data can provide. But I'd be curious to hear, from a text perspective, how you get around that.
Yeah, the main thing that makes me think of is a lot of healthcare use cases, where we have an analogous situation. We're not usually participating in the obfuscation of data, but we talk to a lot of clinics where getting the number of samples you need to train a model — if it's patient data or transcripts — is a nightmare: dealing with the PII, even getting contracts set up with a clinic to say you can use their data for these use cases, and then the clinic actually getting the data over to you. We've seen teams take months just to get enough samples to build a basic model. The huge advantage we've seen is that for a team that right now only has, say, 100 or 1,000 rows, we can really turn that into something you can build a model off of — because what we generate looks like patient data, but it's a made-up patient from Calgary, not an actual patient. Functionally, in its ability to train models and provide these pre-labeled ground-truth samples, it's just as effective.
I'm wondering if you see any defensive value in synthetic data. For example, I remember a couple of years ago there were a bunch of news stories about people who had taped stuff or scribbled things on stop signs, and all of a sudden the autonomous cars were just ignoring the stop sign. So there's real potential for all kinds of crazy attacks, and it seems like maybe synthetic data can help harden against some of those.
Yeah, absolutely. We see this with simulation all the time. Traffic sign detection is a big part of what our customers do, and they all want random graffiti taped or painted onto the sign, dents and scratches, things getting in the way of that sign, so they can train and test their models under all these different conditions. If you're creating simulations with procedural content generation, you have rule sets that generate lots of variations: you can generate millions of stop signs with random graffiti patterns on them and make sure your model works in those cases. So I'd say that's a huge value of simulation and synthetic data in general. I also think you can use simulation to find those weaknesses in the first place. This is something called domain randomization, where you intentionally inject random elements into a scene. That might be something simple, like putting random objects in the scene, or something even crazier, like randomly perturbing colors, pixels, and noise patterns in the data itself, to see if you can find those weak points. Because there are some adversarial attacks that are somebody painting graffiti on a sign, and there are other crazy ones — some of you have probably seen the research papers where you inject an invisible noise pattern into the RGB pixels and all of a sudden that wall becomes an open road, from the object detection perspective. So I think simulation can be a great way to find those weaknesses. I would also say the adversarial attack angle sometimes gets a little over-hyped, because in a lot of cases what works as an adversarial attack on one model is not going to work on another model, and so it's often very hard to find an attack that is broadly going to create a lot of danger for a lot of people. That said, in the real world, vandals and teenagers deface mailboxes and signs all the time, and we have to deal with that as humans as well. So there's some degree of just making sure that your models are robust to lots of variations of things, and simulation is great for that.
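As an aside, here is a minimal sketch of that domain randomization idea in Python with NumPy: perturb an image with random noise, brightness shifts, and occluding patches, and count how often a detector's confidence collapses. The `detector` callable and the specific perturbations are illustrative assumptions, not Parallel Domain's actual tooling.

```python
import numpy as np

rng = np.random.default_rng(0)


def randomize(image: np.ndarray) -> np.ndarray:
    """Apply a few cheap random perturbations to an RGB image array."""
    out = image.astype(np.float32)
    out += rng.normal(0, 10, size=out.shape)                # sensor-like noise
    out *= rng.uniform(0.7, 1.3)                            # global brightness shift
    h, w = out.shape[:2]
    y = int(rng.integers(0, h // 2))
    x = int(rng.integers(0, w // 2))
    out[y:y + h // 4, x:x + w // 4] = rng.uniform(0, 255)   # random occluding patch
    return np.clip(out, 0, 255).astype(np.uint8)


def probe_weak_points(detector, image: np.ndarray,
                      trials: int = 100, threshold: float = 0.5) -> float:
    """Fraction of randomized variants where the detector's confidence falls
    below the threshold -- a crude map of where the model is fragile."""
    failures = sum(detector(randomize(image)) < threshold for _ in range(trials))
    return failures / trials
```

In a real pipeline the randomization would happen inside the simulator (graffiti, weather, occluders) rather than on rendered pixels, but the probing loop is the same idea.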
Cool. And I'm wondering, Matt, on the text side, there's maybe the issue of red teaming, but also negative testing of things you don't want the model to do. Again, there were early stories of car dealerships that would just expose ChatGPT with a thin veneer over it, and people were able to negotiate cars down to $1. It seems like maybe there's an opportunity there too.
Yeah, absolutely. It's definitely an area we think about and see a lot. There's an analogy I want to draw, actually. Before my work at Talc on data, I worked on elections at Facebook, and a lot of that job involved, you know, one day seeing a text classifier that was incredible in its performance — 99.99% precision — and then 24 hours later looking at the New York Times and finding the 0.01%.
They're always going to do that.
Always. A big lesson we took was that there was some amount of process we had to go through to figure out ahead of time what the mistakes were going to be, and almost over-represent them in the sets we were going to use. An example: there might have been a question about what constitutes voter suppression versus a reasonable call to gather at a polling site, right? That's incredibly difficult to nail down. And it was a process of not just looking at samples of what people said in Facebook posts, but also saying, like you were saying, we don't have enough samples of this — let's get a bunch of good samples of where the actual line is between prohibited and allowed content, over-represent that, grow that data set, and make sure it pushes its way into the AI outputs. The traditional way you think of a neural net is: there are a million samples, you put them through this weird thing, and on the other side it has learned. And one thing we're seeing is that if you add the right levers and knobs and control buttons over your inputs, you're able to better tune what kinds of things come out of it, and have a more fine-grained approach to what your model is going to do.
Yeah, I just wanted to add to that. One thing we do a lot, because we're more downstream, is this: you have a training set, and then you have a test set — multiple test sets, really — and you want to get to a certain F1 score, which is basically the harmonic mean of precision and recall, or whatever other accuracy metric you use, before putting the thing into production. Then you need to put the model into real production to see how it reacts in the real world. What we do is run an active learning process that samples the model every day — maybe 1% or 2% of actual production traffic — and sees what the accuracies are in real time. Then you go back, collect more data, and retrain the model. This active learning process happens after you put something into production, in order to get to higher accuracies. But you need to simulate, as Kevin said, because statistically speaking your model is not going to work very well in the beginning anyway, so you need to simulate your way to a certain amount of accuracy before you take it into production. This is all part of the process of putting your models into production for real-life use cases.
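For reference, a small sketch of the loop Emrah describes: F1 as the harmonic mean of precision and recall, plus daily sampling of a slice of production traffic for review and retraining. The function names and the 2% sampling rate are illustrative, not Chooch's actual pipeline.

```python
import random


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0


def sample_for_review(todays_predictions: list, rate: float = 0.02) -> list:
    """Pull roughly 1-2% of the day's production predictions for human review;
    reviewed items go back into the training set before the next retrain."""
    if not todays_predictions:
        return []
    k = max(1, int(len(todays_predictions) * rate))
    return random.sample(todays_predictions, k)


print(f1_score(tp=90, fp=10, fn=20))  # ~0.857 for 0.90 precision, ~0.82 recall
```

The point of the loop is that the test set is never final: the daily sample keeps re-measuring accuracy against what the model actually sees in production.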
Cool. So a question for each of you: how do you — or, ultimately, your customers — measure success? And is the way they initially measure success different from the full set of benefits that ultimately accrue to them? In other words, when looking at solutions, there's a view of "here are the metrics I'm looking at," but looking back one or two years later, here are all the things I actually experienced — perhaps benefits that were more significant than I expected. At least I see that happen with us. Maybe you can go down the line. Kevin?
Sure, yeah. So I think the metric that comes to mind most quickly when people talk about machine learning in general is some kind of accuracy metric, right? Mean average precision, F1 score, et cetera. But ultimately, what is machine learning — or in our case, what is computer vision? It's pattern recognition. What you're really testing is: has my model learned a pattern in the training set well enough to detect that same pattern in the test set? In other words, how well has my model memorized the training set, as measured on the test set? And there's a problem there, inherent to the quality of your data. If you have a poorly labeled test set — say your human labelers always make the same mistake, drawing bounding boxes that are too large on a cyclist — your model is going to learn to overestimate the size of that cyclist, but it's going to come out as 100% accuracy, because your training set and your test set both had the same bad labels. So oftentimes our customers' success metric is some form of accuracy, but we try to push beyond that and develop a more holistic understanding of how models are performing, by mixing together multiple data sources that might not share the same biases, so those biases can start to balance each other out. And ultimately — because our customers are in commercial applications trying to get products to market — while accuracy is important, and it is very much a measure of how safe or reliable a system is, I would actually say the most important metric for our customers is time: how quickly can you get through an iteration? How fast is your iteration speed? Anybody here who's a software developer understands that the more iterations on your software you can do, the better it gets over time. So our primary objective, the success metric for our customers, is how quickly we can get through multiple iterations of new data sets, retrain that model, and make it better. The end goal is still to improve accuracy, but being able to iterate more quickly means that, within the time and resource constraints you have when you need to ship your product, you're probably going to get to a better model much sooner. So it's those two things put together: make sure you're curating great benchmarks for testing your models, and then optimize how many iterations you can actually do to make those models better.
Yeah, for us it's a little further downstream than accuracy. Accuracy is what you want your model to get to, but the customer is really focused on the business case. And the business case might be: if it's inventory management, how fast can you count the inventory without humans being involved? If it's detecting wildfires, how quickly can you detect the wildfire — can you beat the 911 call? If it's something else, like whether people are wearing gloves or hard hats at an EHS level, can you bring down the insurance premium? And so forth. At the end of the day, all of this has to boil down to a business case that the customer needs to know, and so a lot of what we've been doing over the years is working with our customers to provide that business case to them, because ultimately the CFOs and CEOs sign off on the budget that gets paid for these types of systems. So yes, accuracy — without accuracy, you're not going to get anywhere. That's what we've been struggling with in the AI community for many, many years: the accuracy. Once you get to the accuracy, the business case becomes much, much clearer, and you can provide that to the customer and charge based on it as well. If you have an ROI and a business case, you're not charging man-hours for your services; you're actually charging for the solution, and that makes your revenue go up as a company as well. So for us, accuracy is foundational — and we're talking about very upstream things here with synthetic data — but at the end of the day it's really the business case that's going to drive the company and drive the revenue.
Yeah, I've seen a lot of that as well. The thing we particularly deal with a lot is trust, since we work with a lot of folks dealing with language models. In the text-based world there's this question of: okay, sure, it's accurate, but what does 99% mean? What does this F1 score mean? Especially when you get the experts to dive into it — think of a doctor evaluating a language model in healthcare. The big question is: if it answers these five or ten things correctly, does that actually scale out? Can I trust this as a copilot, as an assistant, to be making important decisions beyond rote text output? So I totally agree that accuracy is important, but for us the biggest conversation, in addition to that, is what it actually means. Let's go into the use cases: what types of things can it recognize or fail to recognize? And the high-level question is always: when this is released, how do we know that these accuracy numbers, these metrics we're looking at, reflect the real-world performance of the model at large, in a way that's a little more fine-grained than just precision and recall?
Great. So we're nearing the end — I think everything's running a bit over, so we're getting the flag to start wrapping up. But I have one more question before we do that, and then I'd encourage audience members to come and flag these fine people down. The question is: coming back to "data is the new code," which I started with, there's this AI tipping point people talk about, where machines will be able to enhance themselves, right? I've always thought about that in terms of code, but it's probably more accurate to think about it in terms of data — machines generating data that is used to train machines, and so forth. So to the degree that you're building more and more sophisticated means of synthetic data generation, could that actually change the direction of AI in any significant way? Maybe just one minute each.
A bold yes for me — though I hope somebody disagrees, because that's more fun. Our data generation platform, like many synthetic data platforms, is API-based, so anything that can write code can control the generation of that data. In a limited way we've already done this: we have ML models that tweak the distributions, frequencies, and probabilities of things happening in our simulation, to try to make a more effective data set relative to what the customer has already collected. But I think the really exciting part — since this is the generative AI conference — is where you start to use generative models in the data generation itself, models with a really high degree of freedom in what they generate. A good example is that we've inserted foundation-model-based asset generation, where you type a prompt and get a new asset in the street or in a backyard, if that's what you're training a model for. There's nothing that would stop a language model from writing those prompts itself, writing up new scenarios, and controlling those objects in the simulation. And it could do that at a speed, and with a kind of consistency and persistence, that a human never could — quickly finding more optimal ways to generate training sets for models, and maybe even starting to generate cases and scenarios we would never think of as humans. So I think that's really exciting, and I think it's not far off, if not already here today.
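A hedged sketch of that closed loop, with every function a hypothetical stand-in (this is not Parallel Domain's API): a language model proposes scenario prompts, a generator turns them into labeled samples, the model retrains, and whatever it still fails on gets oversampled in the next round.

```python
import random


def llm_write_prompts(weak_spots):
    """Stand-in for a language model drafting new scenario variations."""
    return [f"{w}, variation {i}" for w in weak_spots for i in range(3)]


def generate_scene(prompt):
    """Stand-in for a simulation API that returns a labeled sample."""
    return {"prompt": prompt, "labels": "..."}


def retrain(model, dataset):
    """Stand-in for a training job."""
    return model


def evaluate(model, prompt):
    """Stand-in for per-scenario accuracy."""
    return random.random()


def closed_loop(model, rounds=3):
    weak_spots = ["pedestrian crossing at night in rain"]   # seed scenario
    for _ in range(rounds):
        prompts = llm_write_prompts(weak_spots)
        dataset = [generate_scene(p) for p in prompts]       # simulator renders + labels
        model = retrain(model, dataset)
        # keep only the scenarios the model still fails on; fall back to the old list
        weak_spots = [p for p in prompts if evaluate(model, p) < 0.8] or weak_spots
    return model


closed_loop(model=None)
```

The interesting design point is that no human sits in the loop between "find a weakness" and "generate more data for it"; the human's job moves to defining the evaluation threshold and the seed scenarios.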
Yeah. So we actually use a lot of language models to write code today — it's already happening. The question here is always: can these models actually have unique experiences the way humans do? Can there be something foundational in how they experience things, as humans do? If that's the case, then I agree. However, I'm skeptical. I think the fuel for models is human behavior, is real human data. Without human data — without our experiences in life and what we're doing — it's very difficult for models to produce those things without the real experiences behind them. But having said that, we are using it, and I think in the future we'll probably see more and more synthetic data become part of our lives, and also become part of the new models being created.
Yeah, I generally think synthetic data isn't just changing the game — it has already really changed the game, especially for the large transformer models. If you look at the latest LLMs from basically all the big companies, they have either publicly stated it, or, if you read between the lines, it's just highly probable that they're using a ton of synthetic data — just look at the scale of the amount of text going into some of these models, right? So in a world where the latest models are jumping ahead in leaps and bounds not just from the real-world data they consume but from the synthetic data they consume, I think we're already at a point where you're not going to get Claude 3 levels of performance unless you're able to transform the real-world text you're already seeing. I brought this up a moment ago, but I think we're getting to a place, similar to what you were saying, where data is not just a real-world phenomenon that a neural net tries to understand — it's a control plane, a set of levers and knobs and buttons, where you get to turn a knob and tune how these AI models are going to steer in a particular direction.
Well — Kevin, Emrah, Matt — thank you. I love it. Synthetic data is clearly a lot more important than many of us thought, including me. A round of applause for these folks.