Sam Liang unedited interview transcript
10:07PM Sep 1, 2018
Speakers:
Larry Magid
Sam Liang
Keywords:
transcribe
transcript
otter
technology
voice
speaker
conversation
record
people
privacy implications
recording
real
speech
larry
phone call
speaking
podcast
accents
dictation
android

Hi, I'm Larry Magid* of CBS News. I had an opportunity to sit down with Sam Leon, who's the co founder and CEO of otter IT company, which has a free app for Android, iOS, and the web that will transcribe audio into text automatically. And in real time. In fact, if you want to follow along, or just read the transcript of this interview, you can do so at Larry's world dot com. Because I used order to transcribe it both the raw interview in real time and the end of that interview that you're listening to. We began by having him Tell me what otter is, and how it's different from other speech recognition engine that you might know from the likes of Google, Amazon and Apple
*Correcting the spelling of my last name is the ONLY edit in this transcript. The rest is exactly as Otter transcribed the audio.
honor is a mobile app and a web application that transcribe human voice conversations. So this is very different than Siri or Alexa Google Home, they handle a conversation between the human being in a robot, you can ask a short question like, what's the weather tomorrow? And the robot will answer that question. However, otter is doing something totally different. It listens to human to human conversations, and transcribe the conversation in real time. You can also upload a old recording to order it will also transcribe it. In addition, it's able to recognize different speakers voice and separate them properly in the US as new technology called diarization and speaker ID, or the voice was the first technology diarization is a technology to separate one person's voice from another speaker's voice. So
I would imagine, for example, that it know if your voice that never met me before, how will it know when it transcribed this podcast? It you're speaking, or I'm speaking, one, eight, here's a new person, it doesn't know the identity of the person, but it does, no, it's a new person. So
at the end of the recording, you can tag a segment of the speech and you tell it this is Larry and that technology in the cloud will create a voice profile or a voice print similar to your fingerprint conceptually, which can be used to match the rest of the recording. Okay, so you feel similar to the way for example, Google and Apple recognized faith and what our facial recognition when it knows this is what Larry looks like, from then on any photograph of Larry, it's gonna absolutely its various conceptually very similar to face recognition. Once you label a few faces, it's able to remember that so if I were to record, let's say, a debate between Hillary Clinton and Donald Trump's, would it because they're famous? Would it know their voices, for example? Or could it It could, you know, I in the because they have a lot of a public speech recording, we can easily download their speech in advance and label their voices in advance and create their voice print in advance. So when they are engaging in a debate, the system actually is able to recognize their voice in real time as well. Although it right now the real time speaker ID recognition is now available in the current product. Yeah,
and what are some of the practical applications of the technology, we see
a very broad range of applications. Obviously, for reporters, they do a lot of interviews. So this definitely helps them a lot. Traditionally, you have to spend $1 per minute for human being to transcribe it for you. And, you know, turn around time, could it be slow with author, you can get it instantly. So
we're recording this in real time at the Edward conducting this podcast. I can actually see our words on the screen. But I've got hundreds of podcasts and broadcasts on my hard drive. Can I go back and retro actively transcribe those? Absolutely. It's very easy to upload the old podcast on to the website. It's called author dot AI. Once you login, there is a import button. You can even use that button to import all your old podcast, as I recall dragon system, for example, which is very good speech recognition. It does well when you're talking into it. But it hasn't done that. Well, for example, when you load an old recording event, your understanding of previous technology,
I think they can do old recording to, but the accuracy is very different.
nuance has been doing this for 20 years. But they are actually lagging behind in the last few years. Because their technology is pretty old. The new technologies are all based on deep learning. So that's what we have created in the last few years. Could you
use it? For example, if you wanted to just write a letter and have it transcribed as opposed to doing what we're doing today? Yeah,
for that purpose is usually called dictation, right? dictation is one person tries to use voice to write a letter or email otter can definitely support that actually support that very well. However, Audra actually does something even more difficult than dictation. When people do pace a usually speaks a little slower in more clearly. But when people are engaged in a conversation with another human being or several other speakers, they speak much faster. could
imagine for example, let's say I'm riding in a car or I'm an airplane, I have this great idea that I want to write down. I could just go ahead and load otter and use it for that purpose and speak my idea. And I'd have I'd have a record of a transcription. Absolutely. That's not the main purpose. But it's a you could you could use it that way.
We actually just saw a YouTube video yesterday, somebody said, I'm using otter to write a book. Sure. When they're walking their dog, they're actually writing a book using otter
exactly, I guess it transcript barking By the way, if the dog if feeding the conversation,
unfortunately, we're not able to understand dogs barking yet. But eventually, with deep learning, you could figure that one. Yeah. When the dog barks, you know, what does it mean, see hungry? Or
is there actually are people working on that on that very problem. But But seriously, what's unique about this, I think, is the fact that he does for allow for a group conversation. So for example, if I were to conference and there was a panel and a number of people were speaking, I could essentially transcribe every one of those speakers at the panel. And ultimately, we would know who the speaker was, that would actually label their names once I trained to do so. Yeah,
absolutely. We have actually done that many times already. ourself. In addition, we actually used the product ourselves in our own company, we actually record all our project meetings, marketing interviews, so we actually eat our own dog food. Yeah, all our company meetings are in the otter system.
And so you have a transcript of who said, watch it at all of your meetings. Yeah, somebody had a great idea in someone else takes credit for it, it's not going to work,
right? Sometimes people have different interpretation about who about somebody's opinion, and we can always go back and listen to it again.
And again. In other words, if you don't, for whatever reason, don't trust the transcript, you can go back and get the actual audio? Yeah,
both the audio and transcript is available in one you played back the audio and transcribe it is synchronized a word by word. So I
saw that on the website. So they just to the audience knows, this operates on iOS on Android. And then there's a web app, which I presume, operates on all web browsers, right? Mac, Windows, whatever. Yeah. And I did notice it on the web. I could play back an audio simultaneously while reading the transcript, which was really great, because I did find in one of my I actually uploaded an audio portion, what am I broadcast and have my eye on tech broadcasts, and I found one or two mistakes, but when I listened to it, I could see exactly what I had fit. Yeah,
and that you're on the if you listen to what on the phone as well, it does the same thing on the phone works on iPhone, Android, you know, so you mentioned any web browser on PC or Macintosh, the engine, we built the speech recognition engine or by ourself, we're not using Google Voice meetings. So
for just to review, this is a free app that runs on iOS and Android, anybody can get it just from the Play Store. Search for otter, that would Ott er, O TT er, in App Store on iPhone, or in Google Play Store on Android. Or you can go to otter AI, that's how you get it. And you can use it on the web as well. So for example, if there's a YouTube video, for example, where television program or a Netflix show that you would like a transcription of if you had the mp3 file, you could lower the end. If you didn't, I suppose you could run the otter app in simply listen to the speaker and get a transcription. Yes, Yes, you did. Of course. Now, I have to ask you about privacy used to work for Google. So you understand the complexity effect I know for that used to work on Google Maps. And that's an example of a wonderful product that I use every day, which has enormous privacy implications. It strikes me that this product is also quite useful. But it brings up some interesting privacy implications as well. Yeah,
we definitely take privacy extremely seriously. We see this as a personal tool, and the user owns the data himself. Whenever the user wants to delete the data, we erase everything, absolutely, we're not going to sell the data for advertisement, you know, we have a freemium model, so that we make money from the user subscription, and also for enterprises, so we don't need to, we don't want to sell the data.
So speaking of law, the state law varies as to whether it's allowed, you're allowed to be court a phone call, with only one person knowing, for example, I think in New York, it is legal to because I know this Michael Cohen case, I believe it's legal to record a call and not disclose it to the other person in California, you're required to disclose it. But either way, it might be useful to have a transcript of your phone calls. Is that a possibility? With this technology,
it can, you can record phone calls, as long as you tell the other guy you're doing it. And as
a technologist did the app, allow that on both iOS and Android
and iOS, you cannot record a phone call on the same iPhone, you could use another phone or use a PC or Mac Book to record the phone call when you have it on speakerphone, and enjoy it technically possible to record a phone call on the same Android phone. You and
I have different accents and but of course of many other accents. Is that an issue for some speakers that were there, it's harder for them to understand their voice, it does make the speech recognition engine
more difficult to build. However, you know, with the deep learning technology we are building we actually collected a lot of different speeches with different accents and the train the engine to teach it, how to do the mapping between different accent to the English words. So specifically, we in Silicon Valley, there are a lot of Indian engineers and Chinese engineers. So specifically, we did a lot of enhancement and training for Chinese accent and Indian accent. But in addition to that, there are UK exits from Britain, from Australia, even in southern America, you know, people have different accents. So we have in on the internet, there are a lot of a public
speech, we applaud and use that to train the speech track. And of course, also people with speech impediment, I imagine you can eventually get to the point where you could translate transcribed virtually any, any speech, right,
right, right. But if somebody has, you know, very different pronunciation, different pace or style good, could do a personal training as well, right now, it's, you know, for the general public.
And then the other thing that excites me about this technology is simultaneous translation, the ability for example, I go to a lot of conferences that are sponsored by the UN and they have very high skilled people who not only translate it in real time, but also put it up on the screen in real time. I'm always amazed how they can do that. But I presume we could get to a point where I could be speaking in English and my words could be in French or Chinese or Spanish or Arabic on a screen in real time as I'm speaking. Yeah, absolutely.
You know, we already
transcribe the songs in speech into English words. And we can easily use another API to translated into Spanish or mentoring or Japanese and show that in real time as well with them. Speaking of time, we've run out of it. This is great. And of course, as I mentioned there with a transcript of this entire conversation at Larry the world. com So if you go to Larry's world you'll find both the audio and a transcript of the conversation Assam Thank you very much. Thank you, Larry.