From Big Data to Good Data - Andrew Ng (REMASTERED)
8:26PM Apr 6, 2022
Speakers:
Andrew Ng
Keywords:
ai
data
projects
system
images
build
subject matter experts
computer vision
labeler
engineers
manufacturing
data centric
code
problems
recipe
application
executing
tools
customization
people
Hey everyone, good morning. Good to see all of you. I have to admit, I'm rusty going to in-person events. I was really looking forward to meeting all of you in person, as well as all of the viewers watching this online. Who would have thought that at my first event back in person, I'd be introduced by a virtual AI? This is an exciting world we live in. Thank you also to George and Lonne, and the whole Insight Partners and ScaleUp: AI team, for having me today. So what I hope to talk to you about today is that there is still a long way for AI to go. Despite the rapid progress of AI in the last decade, I think it will not reach its full potential until it is accessible to everyone, which is far from true today. And there is a set of technologies I hope to share a view of, called data-centric AI, which I think will be key to democratizing these technologies.

But what is data-centric AI, and why should you pay attention to it? Over the last many decades, the dominant paradigm for AI development has been what is sometimes called the model-centric approach, in which, you know, building an AI system requires writing code, the code that implements some algorithm or some neural network model, and that code is then run for the system to learn from data. The dominant way that researchers have made progress in AI for many decades now has been to find a dataset, maybe download a standard benchmark dataset like ImageNet or Switchboard off the internet, and hold the data fixed while you work on the code to try to make your code do better on the data. This has driven tremendous progress in AI, so it is a great recipe for AI development. But because of this decades-long focus, I think that today, for many practical applications, many commercial applications you may wish to work on, the code, the algorithm, is basically a solved problem. There will be an open-source implementation of some neural network that works just fine for many applications. This was not true three, four, or five years ago. But because it is true now, I think for many applications you may wish to build, it is more promising to instead take what is called a data-centric approach, in which it is even okay to hold the code fixed, or to spend a much smaller proportion of your effort on the code, and to instead systematically work on the data, so that that combination allows you to build a successful application more quickly.

So data-centric AI is the discipline of systematically engineering the data used to build an AI system. This term is relatively new, but over the last year it has seen a lot of momentum, and there are now large and growing numbers of researchers and developers embracing this philosophy and finding it a more efficient way to build AI systems. You may have seen studies or reports like this widely cited one by McKinsey, saying that AI will create tremendous value. According to this one, AI will create $13 trillion of annual value per year by the year 2030; PwC estimates, I think, $16 trillion. The interesting thing to me about this report was that a lot of the value to be created is in industries like retail, travel, transportation, automotive, materials, semiconductors, electronics, healthcare, and so on; it is not just in the consumer software internet industry. But if we walk around today and look at where AI has had the biggest impact, we all see that it has transformed the way that internet companies, especially the large consumer internet companies, operate, and has created massive value there.
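To make the contrast between the two approaches concrete, here is a minimal sketch of a model-centric loop versus a data-centric loop on a synthetic dataset. It assumes scikit-learn and NumPy; the noisy labels and the "expert review" step are simulated stand-ins for illustration, not Landing AI's actual tooling or method.

```python
# Sketch: model-centric iteration (fix data, vary models) vs.
# data-centric iteration (fix model, improve label quality).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import numpy as np

# Synthetic stand-in for a real dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Simulate noisy labels: flip 10% of the training labels.
rng = np.random.default_rng(0)
flip = rng.random(len(y_train)) < 0.10
y_noisy = np.where(flip, 1 - y_train, y_train)

# Model-centric loop: hold the (noisy) data fixed, iterate over models.
for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=0)):
    model.fit(X_train, y_noisy)
    print(type(model).__name__, accuracy_score(y_test, model.predict(X_test)))

# Data-centric loop: hold one model fixed, iterate on label quality.
# "Fixing labels" is simulated by restoring the true labels for the examples
# the model finds least plausible; in practice a subject matter expert would
# review those examples in a labeling tool.
model = LogisticRegression(max_iter=1000).fit(X_train, y_noisy)
label_confidence = model.predict_proba(X_train)[np.arange(len(y_noisy)), y_noisy]
suspect = np.argsort(label_confidence)[:50]   # most suspicious labels
y_fixed = y_noisy.copy()
y_fixed[suspect] = y_train[suspect]           # expert review, simulated
model = LogisticRegression(max_iter=1000).fit(X_train, y_fixed)
print("after label review:", accuracy_score(y_test, model.predict(X_test)))
```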
But candidly, when I look at a lot of these other industries, when I work in manufacturing plants, or when we go to see a doctor, the adoption of AI in all of these other industries is still very, very nascent. A recent MIT and BCG study estimated that only 10% of large companies have reported any significant financial benefits from AI. And this AI thing has been going on, well, either for multiple decades or, for the recent wave, for about a decade or so, but adoption is still very low, and I think there is still massive room for additional adoption. So why is AI having such a big impact on consumer software internet companies, but still so nascent in other industries? I'd like to share with you two of what I think are the key barriers to AI adoption, as well as a recipe that my team has been executing to try to get past these barriers. And this is a recipe that I think will be relevant to many of you as well, as you build your businesses.
And I think at the heart of it is that the recipe that, you know, some of my friends and I, and many others, have worked out for AI adoption in consumer software internet companies unfortunately does not work for a lot of these other industries. And, you know, I'm from Silicon Valley, and the tech world is a big world, but I'd love to see AI break out into all of these other industries as well.

So what are these barriers to AI adoption? Here, I think, are the two biggest. The first is small datasets. An application in the internet companies, where a lot of AI grew up, will often have 100 million-plus data points to learn from. I once built a face recognition system using 350 million images, and today there are many teams using even larger numbers of data points to build AI systems. The era of big data has really paid off for some of these large internet companies that may have hundreds of millions or a billion-plus users and just massive amounts of data. At Landing AI, we have been doing a lot of work in computer vision, and one of the verticals we focus on is manufacturing. I gave a presentation at a manufacturing conference and ran a live poll asking the manufacturing audience how many images they typically have. The result of that live poll was this: the most common answer was 50 or fewer images of a defect that they may be trying to detect. So technology that was designed, you know, by people like me and many others, for 100 million images does not work for many other industries. One piece of good news, though: even as we read the news about this or that big data AI system trained on a gazillion images working really well, and nothing is wrong with that, one thing that still surprises me these days is how often you can get an AI system to work with just 50 images, if you take a data-centric approach and make sure that that much smaller dataset is designed well and engineered well. So it's not just about big data; when we have small datasets, which is all that exists in some applications, the focus has to shift to good data. And I'll say more about this in a minute.

So small datasets is the first barrier. Here, I think, is the second one, which is the customization, or long-tail, problem. Here's what I mean. Let's take all current and potential AI projects and sort them in decreasing order of value, so you have the most valuable projects, financially and economically, on the left, going all the way down to projects that may individually be less valuable. You get a distribution like this, with what turns out to be a long tail. Maybe the single most financially valuable AI application in the world is some online ad system; maybe the second most valuable is some web search system; maybe the third most valuable is some e-commerce product recommendation system. And in the AI world, we've figured out how to hire dozens or hundreds of machine learning engineers to execute these large monolithic projects, in which you build one AI system, and if you have a user base of 100 million or a billion users, that one AI system serves this massive user base and creates huge amounts of value. So that works; we've figured out, in a way, how to do that. But when you go to other industries, the numbers we get to operate on are often not in the hundreds of millions or billions range, and there's much more heterogeneity and much more customization needed.
So taking examples from manufacturing,
I have seen, there may be a pharmaceutical plant that makes pills, you know, about one centimeter across, and you have pictures of pills like this on which you need a computer vision system to detect defects. Down the street, there may be a different plant that makes sheets of steel, and you have pictures of giant sheets of steel, one or multiple meters across, that you want to inspect. And, you know, in the next neighborhood over, a company makes semiconductor wafers, and you need to take pictures of those too and inspect them. The challenge is, clearly, these different plants make different products, and so each one of them will need a custom AI system. From where I'm sitting, I think we have done well in the AI community executing those, let's call them, billion-dollar projects, or even hundred-million-dollar projects, or maybe tens-of-millions-of-dollars projects. We know how to do that; we've done a lot of it in our community. But I also see a very large number of, let's call them, one-to-five-million-dollar projects that are kind of sitting there, because we've not yet figured out the economics of how to do them, how to make them scale. And there are tens of thousands of these projects. So I think the net value in the tail of this distribution may be even bigger; certainly, according to maybe the McKinsey study, it could be even bigger than the value created so far in the large consumer software internet industries. But if there are 10,000 manufacturing projects that need to be done, how do we get this done? Because no one is going to, say, hire 10,000 machine learning engineers to do the projects one at a time.

This is actually an industry-wide dilemma. To take another quick example, I've deployed a few systems in hospitals where we read electronic health records and make recommendations to the doctor or to the patient's healthcare system. Similar to this diagram, I don't think you can build a single monolithic AI system to read all the electronic health records across all healthcare systems, because every hospital, every healthcare system, tends to code its EHRs, electronic health records, a little bit differently, and doctors, even in different departments within a hospital, use slightly different conventions for coding EHRs. So the challenge in healthcare, too, is that if every hospital needs a custom AI system, how can we help every hospital do that customization without, again, one team, my team or your team, hiring 10,000 machine learning engineers to build all these custom projects? The problem of customization is one that has plagued many AI teams for years. I think the only way out of this dilemma, and the only way to unlock this massive value, is to start to build vertical platforms. What I mean is to build vertical platforms that enable the end customer to do the customization: to let the IT team of the manufacturing plant, or the IT team of the healthcare system, maybe with help from the doctors, build the customized system they need in order to create that value in their plant or their healthcare system, or in a logistics system or some other application. And how will they do this? Well, if we go to the IT team of a hospital or a manufacturer and ask them to write their own code and invent a next-generation neural network, that's actually pretty challenging.
So a much more promising approach is to provide tools to help the subject matter experts not engineer the code, but instead engineer the data, in a way that lets them express their domain knowledge, which then results in the customized system being built and that value being created. Now, what does engineering the data mean?
And why do I think that this is a promising activity for subject matter experts in different industries, as opposed to only machine learning engineers, to tackle? We all know that data is messy: data is not clean, data is missing, data has lots of quality issues. Let me just share with you one quick example that I've seen in many manufacturing contexts. If you take an image of a pharmaceutical pill like this and ask two different people to label it, it is not at all uncommon for one labeler to say that's a chip and a second labeler to say that's a scratch. I don't know who's right, I'm not the subject matter expert, but this type of inconsistency happens a lot in data, and it turns out to be confusing to an AI system and will hurt the system's performance. Or if you ask labelers to draw rectangles, called bounding boxes: if I ask them to draw a rectangle to show me where the discoloration is, it's very common that one labeler will do it one way and a second labeler will do it another way. Or if you go to labelers and say, draw bounding boxes to show me where the defects are, one labeler will do this and another labeler will do that. I've been to many manufacturing plants where there have been, say, two experts inspecting things for decades, and they did not agree with each other for the last decade, but no one realized it. The other interesting thing is when an expert in the morning disagrees with themselves in the afternoon; we see that a lot too. And this is one of many examples of data quality issues that hurt the performance of an AI system.

So, a quick case study. One of the earliest projects that Landing AI did was using a computer vision system to inspect sheets of steel. When we worked with this steel plant, they had internally already built an AI system with 76.2% accuracy. They wanted 90% accuracy, based on their perception of the performance of the human experts. A few large companies, including at least one globally leading AI team, went in and worked on this dataset for months and got zero improvement taking a model-centric approach: multiple AI teams took this steel dataset, tried to innovate on the code, the algorithm, to improve performance, and got basically no improvement for months. One of my engineers went in and used our tools to, you know, spend very little time on the code or the model, and instead spent almost all the time helping the steel plant figure out where the data was inconsistent, where the data was wrong or incorrect, and used our tools to help them improve the data quality. It took about two weeks to get to a much higher level of accuracy. One of the lessons I took away from this is that taking a data-centric approach allows subject matter experts, the people who actually know what a defect on a sheet of steel is (is this a scratch or is it not? I honestly don't know), to look at the data, label the data, and improve the quality of the data in a way that expresses clearly to the AI what they want it to learn. And that allows the AI system to more efficiently reach a high level of performance. So the recipe that my team at Landing AI has been executing on computer vision problems has been to build a system called LandingLens, which is an end-to-end computer vision platform to help subject matter experts execute these data engineering tasks.
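As one concrete way to surface the kind of labeler disagreement described above, here is a minimal sketch that compares two labelers' class labels and bounding boxes using intersection-over-union (IoU). The box format, the 0.5 threshold, and the toy annotations are assumptions for illustration, not the LandingLens implementation.

```python
# Sketch: flag images where two labelers disagree on class or bounding box.
# Boxes are (x_min, y_min, x_max, y_max); threshold and data are illustrative.

def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

def flag_disagreements(labels_a, labels_b, iou_threshold=0.5):
    """Return image ids where the two labelers' class or box differs."""
    flagged = []
    for image_id in labels_a.keys() & labels_b.keys():
        cls_a, box_a = labels_a[image_id]
        cls_b, box_b = labels_b[image_id]
        if cls_a != cls_b or iou(box_a, box_b) < iou_threshold:
            flagged.append(image_id)
    return flagged

# Toy example: labeler 1 calls the defect a "chip", labeler 2 a "scratch".
labeler_1 = {"pill_001": ("chip", (40, 40, 60, 60))}
labeler_2 = {"pill_001": ("scratch", (42, 41, 61, 63))}
print(flag_disagreements(labeler_1, labeler_2))  # -> ['pill_001']
```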
I've been very focused on computer vision, but I think, for many of you, there's a lot of room here: the world needs a lot of these vertical platforms to tackle many other problems as well, and I see more and more companies thinking about this recipe I'm describing as a way to approach other verticals, not just computer vision. To give you a sense of the types of data issues we run into and that we try to help people deal with: I talked about inconsistent labels, right, just two opinions, I don't know who's right, maybe no one's wrong, it's just inconsistent. Or incorrect labels: quite often you just have data that is simply wrong; even the most revered datasets in AI often have incorrect labels. So how do you quickly help customers find those problems, fix them, and engineer the data? Image quality: images can be out of focus or have poor contrast; can you find those problems and fix them? Insufficient data of certain types: in AI, we always want more data, right, the big data instinct is to collect more data, but that's expensive and inefficient. So sometimes you can help the customer discover: hey, dear customer, you have plenty of pictures of scratches and chip marks, but for discoloration you don't have enough. You can help subject matter experts make a much more focused effort to collect more data, just the data you need, rather than collect more data of everything, and this makes the process much more efficient (there is a short sketch of this idea at the end of this section). Data changes in deployment: you train an AI system, push it to production, and the world changes; maybe the manufacturer uses a different coating on the pills. The technical term is concept drift or data drift. Can you detect that and go back and fix your data when any of these things happen? You can have a lot of images, but how do you prioritize what data to label? Data augmentation and synthesis, creating more data using various image distortions or sometimes even computer graphics techniques; how to manage metadata; managing data cascades with long, complex pipelines; and so on and so forth. So this process of engineering the data is very complex and multifaceted. And I think that while for decades this has mainly relied on the skill, or luck, of individual engineers, building tools to help subject matter experts do this in a more systematic and reliable way will help unlock the value of AI in a lot of new applications.

Just to show you very quickly my philosophy for managing data: a lot of people focus on training models in AI, but when I execute a machine learning project, I now tend to think of the full cycle of a machine learning project, in which we scope the project, collect data, train the model (and when you train the model, you often find your data is not good enough and go back), and then push to deployment. And after you have deployed, sometimes you have to go back and update the model, or update the data. My philosophy in taking a data-centric approach to AI is to build tools to ensure consistently high quality data through all stages of an AI project, because what I've found is that if you can ensure, for all of these stages, that you have the right tooling for consistently high quality data, that often solves a lot of the practical problems I was otherwise running into.

Just to give a shout-out: with me here today are Kai Yang, who is actually over at the back, VP of Product at Landing AI, and also Kelly Seelig, our Head of Marketing and Communications. A lot of the ideas I alluded to today actually came from Kai, so afterwards, while we're mingling and chatting, do catch him or me to chat more about any of these ideas. And what I've found is that even for very good engineers, there were projects that used to take us 12 months that, with data-centric tools, we can now do in one month. And I've been surprised at the number of non-technical people, including, for example, someone on our team who is brilliant but not technically trained, who are training a lot of machine learning models now. So it's really interesting seeing the difference that tools make to letting more people build AI systems. Let me just wrap up. Data-centric AI as a discipline is the systematic engineering of the data needed to build an AI system, and I think that by addressing the small data and customization problems, it is key to democratizing access to AI. I've seen the way a lot of tools evolve: at first, a handful of experts do it intuitively.
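Returning to the "insufficient data of certain types" issue mentioned above, here is a small sketch that counts labeled examples per defect class and flags under-represented ones, so data collection can be targeted rather than indiscriminate. The class names and the minimum-count threshold are made up for the example.

```python
# Sketch: flag defect classes with too few labeled examples so that data
# collection can focus on them. Names and threshold are illustrative only.
from collections import Counter

def underrepresented_classes(labels, min_count=50):
    """Return classes whose labeled-example count falls below min_count."""
    counts = Counter(labels)
    return {cls: n for cls, n in counts.items() if n < min_count}

# Toy label list: plenty of scratches and chips, very little discoloration.
labels = ["scratch"] * 300 + ["chip"] * 180 + ["discoloration"] * 12
print(underrepresented_classes(labels))  # -> {'discoloration': 12}
```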
Fifteen years ago, it was a few of us coding up deep learning algorithms in C++, and it was kind of error prone. Then eventually enough papers were published that the ideas became more widespread, and more people implemented deep learning in C++, still error prone. And eventually there were tools like TensorFlow and PyTorch that made the application of deep learning more systematic. I see data-centric AI essentially going through a similar journey: for decades, experts have been doing it intuitively. I think we're now firmly in the phase where more and more people are talking about it and trying to manually apply these ideas. And I think the next phase will be, you know, Landing AI, but also hopefully many other teams, executing this type of recipe to build tools that make the application of these ideas more systematic. Let me wrap up with one last thought, which is this: I think AI has come a long way, but we still have a long way to go to democratize access to it, and data-centric AI will, I think, help us democratize AI, which will benefit everyone. And I think all of us here in person and online, as part of this community, have a big role to play in building out this recipe and the set of tools that make the benefits of AI available to everyone. Thank you very much.