AI: Sam Liang Otter AI
9:14PM Jan 21, 2021
Hello and welcome to the Nvidia AI Podcast. I'm your host, Noah Kravitz. With the big shift to remote work and remote school earlier this year, Zoom has become a newfound part of many people's vocabulary. Last month, Otter.ai, a Silicon Valley tech company, announced live video captioning for Zoom calls, a potentially quite big feature to boost remote work across many industries. And while the news is quite big, Otter.ai has actually been around for a few years now, and its AI-powered transcription capabilities have made it a favorite of podcasters and other folks who need to transcribe conversations regularly. Otter.ai CEO and co-founder Sam Liang joins the show today to talk about Otter's new capabilities, the challenges and potential of using AI to transcribe conversations, and his own history in the tech industry. Sam, thank you so much for joining the Nvidia AI podcast.
Thank you Noah for having me here.
And so, just to set the table for folks who have not seen Otter in action: before we hit record, you set up a live transcription. So in a window that I'm hiding, so I don't, you know, distract or scare myself with my own words, we've got a real-time or near-real-time transcription of our conversation running, showing up on the screen in front of me. Which is one of those things where, like I told you, I'm not surprised, because I know a little bit about Otter, but it's still remarkable to see it happen. So congratulations on all you've done so far.
Thank you. Yeah, we've actually transcribed tens of millions of meetings already.
So let's get into it. Maybe you can tell the listeners a little bit about what Otter.ai does and how long you've been around, and then kind of lead up to the new announcement with Zoom.
Yeah, sure. Otter.ai is a startup based in Los Altos in Silicon Valley. We started in 2016. We built the speech recognition system for meeting transcription and meeting notes from the ground up. When we started, a lot of people actually asked us: doesn't Amazon Alexa do speech recognition? Doesn't Siri do it? Why do you guys need to do it? But the problem is, when you use Siri or Alexa, you ask a question like "What's the weather tomorrow?" or "Set the alarm for 3pm," and then Siri or Alexa will answer your question. But that's not a long-form, multi-speaker conversation like what happens in a meeting. So if you use Siri or Alexa for your meeting, it doesn't work.
We had to build the AI model from the ground up to take notes in complicated meeting situations.
How do you spec that? How many concurrent voices can Otter listen to? What are some of the capabilities that make it good for meetings and set it apart from the other services you mentioned?
Yeah, as I mentioned, when you have a meeting, you usually have at least two people, sometimes five or 10 or even more. Every person speaks with a different accent, a different pace, a different volume. On virtual meetings, people have different background noise as well.
So the situation is way more complicated than the Siri use case, where only a single speaker is asking a short question. And meetings usually run long, could be 30 minutes, 60 minutes, and people interrupt each other all the time. People may speak a little faster or a little slower. They don't always speak grammatically correctly. They may pause and hesitate and restart and suddenly change topic. All of this makes taking notes for meetings way more complicated. So that's why we built AI technologies to optimize note-taking for meetings.
And so right now, how does somebody use Otter? I mentioned the new Zoom announcement, and then you're running this through... we're recording this on Skype, and you're able to transcribe our Skype conversation. Does it function as a standalone platform? Is it a plug-in? How does one go about using it?
Yeah, Otter is designed as a standalone product. It is virtual-meeting-platform agnostic. We are using it for this Skype call right now, and you can use Otter with any meeting platform, including Zoom, Webex, Microsoft Teams, Google Meet, or even your phone calls; regular PBX phone calls, you can use Otter for as well. Obviously, Zoom is one of the most popular virtual meeting platforms today. Earlier this year, we heard they had 300 million daily active users. Wow, that's just insane. So Otter does have a special integration with Zoom: when you have a Zoom call, Otter can be started automatically, and it plugs into the Zoom audio stream, so the quality of both the audio and the transcription accuracy is very high. Separately, Zoom actually came to us three years ago asking for help. So we licensed a subset of our technologies to Zoom, and Zoom built a meeting transcription feature inside Zoom, but they only made that available to their enterprise customers.
So even if you're a paying Zoom user, if you're not one of their enterprise customers, you cannot get Otter features inside Zoom. So yeah, as I mentioned, Otter is a standalone product. It can transcribe, and it actually stores both the audio and the transcript for you, and everything is searchable as well. For our own company, for example: we are four years old, and all our meetings in the last four years are actually captured in Otter. So at any time, anywhere, I can use Otter to search for anything I have heard before, you know, from our four-year history.
I was gonna ask you about this later in the conversation, but since you brought it up: how does that change the way that a company, or even a person, approaches meetings and approaches work? Is there kind of a fundamental mindset shift that you see, knowing that you've got this automatically archived and searchable record of all of your meetings?
Yes, we do believe there will be a fundamental change, and it's actually happening today. Otter had been growing steadily before COVID happened. People have been using Otter on their laptops, or using Otter on iOS or Android devices. So this is a cross-platform product; you can use it anywhere.
If I were a student, or a reporter in my former life, and I was out, you know, interviewing somebody; if you and I were talking in person, I could pull out my mobile phone, run the Otter app, and get a transcription?
Yes, you can. And actually, this was quickly adopted by a lot of journalists, because they do interviews all the time. In the past, they had to hire human transcribers to transcribe their interviews, which was very expensive and slow as well. Now, Otter is fast, it's cheap, it's instant. So it was quickly adopted by reporters. However, the main market we are targeting is actually the business and enterprise meeting market. And this is huge, as demonstrated by Zoom, Google Meet, Microsoft Teams, and Webex as well. There are probably millions of meetings happening every day. In the past, if you think about it, most of those voice conversations were all lost; they were not even captured. So after a meeting, how much can you remember?
During meetings, actually, people are pretty stressed. You know, they're really afraid of forgetting things. They have to take notes like crazy, especially in complicated meetings where people are discussing a lot of numbers, a lot of facts people need to remember. So you have to either type, or in the past people used a pencil or pen and wrote in a paper notebook. Right? The problem with that is it distracts you, it reduces engagement. You lose eye contact when you are, you know, watching your laptop taking notes.
You know, I've been in meetings, and I'm sure most people listening have, where there's been a mandate that nobody's allowed to have their laptops open because we want full engagement, which in some situations is impossible, but in some it is. But to your point, then you're kind of losing that archival ability, which can be so important when you need to refer back. So the enterprise fit makes a lot of sense. I'm wondering: NLP, natural language processing, is obviously a huge field, a huge problem with many, many subproblems and applications that people under the umbrella of AI are working on from all different angles. Between four years ago, when you started Otter, or even earlier, when you had the idea, how much of a background did you or your co-founders have in NLP? And what sorts of challenges have you been using deep learning and AI to overcome as you're building the product and the company?
Yeah, the AI technology, to be honest, wasn't built by me. I did my PhD at Stanford, specializing in large-scale distributed systems, so I'm actually a systems person by training, not an AI person. I was the lead of the Google Maps location service for four years, handling, you know, a huge amount of map and location data, which is critical for Google's mobile maps. But I got into this space when I was thinking about a new startup. Actually, I quit Google in 2010 and built my first mobile startup, Alohar; it was acquired, and then I was trying to figure out something else to do. So one thing that came to my mind was voice. One reason was that, actually, I always forget things. After meetings and conversations, I had a hard time remembering information. And also, it's very hard to share information with the team: when I talk to someone and then I need to discuss the issue with the team, how do I make sure I conveyed the information correctly? So I figured, okay, it would be great if we could capture all the voice in the world and transcribe everything. And we did some calculations: how many meetings are happening? How many words does every person speak in their life?
Do you have a given number? Is there an estimate for that?
Actually, there has been some research. I've seen data suggesting that in a person's lifetime, he or she may speak a few hundred million words. The number I saw was 800 million.
Okay, I was thinking maybe a billion or two billion, but I talk a lot. So that sounds about right.
So if you think about it, almost all of this data is actually lost.
It's actually, you know, quite wasteful, if you think about the amount of information and insights and intelligence that's embedded in that voice data. So yeah, that's the origin of the company. Back to the AI part: when we started, with my co-founder Yun Fu, who is also a computer science PhD and also a systems person, we actually looked at the market. We tested the AI technologies, the voice and speech recognition APIs from Google and from Microsoft, and they actually didn't work for the meeting situation, due to the reasons I mentioned earlier.
So we decided to do some experiments, and we looked at what we could build. And also, we started to look for talent. Very luckily, we actually found a few good people from Google and convinced them to join us. And, you know, we started to build all this technology from the ground up, focusing on multi-speaker, long-form conversations, which is, you know, what happens in meetings. And over the last four years, we've just made tremendous progress; we trained our model on millions of voice recordings. And as I mentioned, since we launched the product in 2018, we have transcribed over 50 million meetings, I think over 2 billion minutes now.
Wow, that's amazing. And just the tip of the iceberg, I'm sure, of what's to come. As you were talking and mentioned the multi-speaker problem again, it made me wonder: is there an equation, even if it's fuzzy math? Is it exponentially more difficult with each new voice that's brought in? How does that work, moving from single-voice language processing to being able to recognize and differentiate multiple users or multiple speakers?
Yeah, it's hard to give a number for how complicated it is. But as I mentioned, the complexity comes from multiple angles. Accents are actually really challenging. Think about my own accent. And English by itself is pretty difficult; a lot of words...
It's not the easiest language. Yeah.
A lot of words have very similar pronunciations, and it really depends on the context. So this is where NLP and, you know, the language model and all that really matter. It's not an isolated speech recognition problem; it's really embedded in the context. So it's a lot of work. Also, you know, whether it's traditional speech recognition or deep-learning-based, we use a combination end to end. It's complicated; it's not just one or the other. You need to use the right combination of technologies for different problems.
Sure, how many languages does Otter support?
Yeah, at this moment, we're actually just focusing on English.
This is a huge market, obviously, right now. But we do get a lot of requests to support other languages as well.
I'm speaking with Sam Liang. Sam is the CEO and co-founder of Otter.ai, a voice recognition and natural language processing deep learning company that is transcribing audio conversations. And again, we led with the news of their new Zoom feature, auto-captioning on Zoom video calls, but as Sam mentioned, they're platform agnostic and work across all manner of tech platforms and voice conversations. It's really remarkable stuff to see. Sam, you alluded briefly to your time at Google. I wanted to switch gears just for a moment and talk about your background. You also mentioned your company before Otter, Alohar. Am I pronouncing that right?
So you've got a little bit of a background there. Maybe you can walk us back; you mentioned you're a systems guy and did your PhD in distributed systems, but kind of start from there and briefly walk us through your career.
Yeah, I studied computer science in college, and then, you know, later I did my PhD at Stanford, worked at Cisco before that, and then Google. But, you know, I was always itchy about doing a startup. So that's why I quit Google in 2010 and built that first mobile startup. For that one, we were actually focusing on mobile location, contextual data, and personalization based on behavior. But the important part of the startup was about getting a large amount of data and analyzing it. So this first startup was about getting tons of mobile and location data, and other sensor data as well; we looked at accelerometers, you know, even Bluetooth data, and used that to analyze mobile behavior. So after my company was acquired, I was thinking, you know, what's something more crazy we can do? And we realized there's one sensor we were not using on a mobile device: that's actually the microphone. And in the meantime, as I mentioned earlier, I realized that I really needed to have my meetings transcribed, so I could search them, analyze them, and share them with my team. So multiple reasons motivated us to start Otter.ai, and Otter.ai, in some sense, is really trying to capture the voice information and make it useful for people. And today, with remote work and a distributed workforce, most meetings are happening on Zoom or Webex or Microsoft Teams, so people are finding Otter really useful for improving their collaboration remotely.
So that leads me, in the last couple of minutes we have here together: you mentioned, you know, Otter was growing before COVID came about, and certainly since then (we're recording in mid-November, so over the past six months or so) there's been this boom in remote work and remote learning, remote everything. Whatever happens with people continuing to work remotely, or whatever percentage of people go back into offices, do you think this trend toward capturing the world's information, in particular the world's audio data, and then having that searchability, will change the way we approach meetings and conversations? Do you feel like that's here to stay? And I guess, even more, my question is: where do you think this might take us, relative to Otter, the way we work, or even just, you know, the part of AI that's working on language and audio data? Do you see any trends, or any places you think we're headed?
Yes, we do. I don't think this trend will stop, in terms of using AI to help people capture and share information. With Otter now, during the meeting, Otter is your new meeting assistant. People will have the peace of mind that they don't have to write down everything themselves; they know that Otter is doing it for them. So gradually, people will just take it for granted. They will assume Otter will always be there, and whatever they heard, they can always retrieve later. And on top of that, not just retrieving the original thing: Otter will actually use NLP to analyze the conversation as well. If you look at your Otter notes, Otter can already recognize summary keywords, for example, and can detect the topics you're discussing. And then over time, you know, Otter, or whatever other tools in this domain, can summarize your conversation and detect important information. For a product manager, for example, in this week's meeting, Otter may remind them of the action items they discussed last week, and, you know, can help them manage the agenda for this meeting. And suppose you're not in the meeting, but somebody mentioned your name in the meeting; maybe you'll get an automatic notification.
Okay, say Michael assigned this action item to you, Noah, and it's due on Monday. That notification can be sent to you directly from Otter.
Right. Yeah, there's a lot of potential. I'm preaching to the choir, I know, but there's so much potential now to act on all this information. For folks who want to find out more about Otter.ai and try it for themselves: obviously, the name of the company, Otter.ai, is the website. But can they also just go get the mobile apps?
Oh, yeah. You can use Otter.ai in your web browser, or you can download it from the Apple App Store or Google Play Store onto your mobile device, iPad, or Android tablet. And again, it's a cloud-based service: no matter where you use it, you can search for information anytime, anywhere.
Fantastic. Well, Sam Liang, thank you so much for taking the time to join the podcast. It's remarkable what you guys have done in four years, and all the best on the next four and then some.
Thank you, Noah. Really my pleasure to be here.