Building a New Voice with Sam Liang (AISense) and Cathy Pearl (Google) | Disrupt SF (Day 2)
2:57AM Sep 7, 2018
We are going to mine, once again, that subject of the perfect voice interface. And to join us is Sam Liang from AISense and Cathy Pearl from Google. And you can guess what they're going to be discussing. Yes, it's the future of voice, and what consumers can expect to come. And to join them is Frederic Lardinois, who's the news editor of TechCrunch. Let's have a round of applause everybody, thank you.
I have no idea what that British man with his leather jacket was all about there, but I know what he's talking about.
Thank you.
I've heard of it?
All right. Good. Good. Good for you.
We're here to talk about voice. But before that, I just want to get a quick show of hands. How many people here have heard of AISense?
Yes, you have, because you work for them. I get that. All right. There's a few people.
You have. How many of you have heard of Google? Google, the little search company? A couple of people didn't raise their hands. I'm worried about you here. You're at the wrong conference.
Sam, can you just briefly explain what AISense is all about?
We're a little bit smaller than Cathy's company.
We're a small startup in Silicon Valley. We built a new technology called ambient voice intelligence.
Well, in plain words,
we do something very different from Alexa or Google Assistant. How many people have been to the main stage and have seen the real time transcription for all the keynotes and panels?
That's built by us.
So we have done something that Google has never done before. We raised an animal. It's called Otter. For people who don't know, the otter is one of the smartest animals in the world. Otter is the app that TechCrunch is using to do the transcription for all the keynotes and panels. So how many people have actually downloaded Otter?
Do you like it? All right, I have a few otters here. Let me see.
Here you go.
All right. One more.
All right. Yeah, if you haven't downloaded it,
download Otter right now. It's a free app; you're going to have all the sessions, including this session, although it's not done in real time on the next stage. So we built our own speech recognition technology,
we focus on something different from Google. We don't want to compete against Cathy.
The thing we do is a human-to-human voice interface. That's what people have been doing for thousands of years: they talk to each other.
Now, Cathy, I know there's a Google Mini underneath everybody's seat. Just, you know, take a look.
Nobody did. Geez. Nobody's falling for this.
I just want to roll this up from behind a little bit.
Because, and you've worked on this, actually, we've had speech interfaces for a while.
They sucked. Most of them were not very good.
In the last 10 years or so, five years, they got a little bit better; in the last three, four years, they really got good. What has changed?
Yeah, you know, just this morning I was thinking about when I first got started in this space, which was 1999. I was looking for a new job and I saw a job ad for a company called Nuance Communications, and they said they did speech recognition. And I said, that doesn't work. But they had a phone line you could call with some demos, and you could transfer $500 from checking to savings in this fake banking account. And I was like, this is so cool. And I went and worked there for a long time and learned a lot about conversational systems. But I got really disillusioned and I decided I was going to change careers.
And I obviously got sucked back in, and one of the main reasons for that was the advent of the smart speakers, because it's really a fundamentally different use case. Part of it is the technology: we've certainly gotten way better with speech recognition accuracy, and the far-field microphones obviously make it a lot more attainable. But it's also the use case, because so often in the world of building IVRs, or phone systems, it was to keep humans away from humans, because that was the cost-saving measure. These new smart speakers are a whole different use case; they're adding things that I wouldn't necessarily be doing with a human. And I'm very excited about the possibility of these systems.
I think a lot about the potential for accessibility: for people who are perhaps visually impaired, or maybe don't have good fine motor control; seniors who might not have a smartphone; places in the world where the literacy rate is not as high, so you might get a phone but you can't really use it very well. And I think these voice systems present a new opportunity to let people into a world that perhaps they didn't have access to before. We're still in the infancy of this technology, but I see so much potential and I'm really excited about it.
Absolutely. And I want to get back to accessibility a little bit later on. But first, for Sam: you decided to build your own voice recognition system, even though there's this Google out there. There's Amazon's. There's Microsoft's. There's all kinds of other systems. Why build your own?
We actually started two years ago, in 2016. We tested a bunch of speech recognition APIs, including Google, Microsoft, IBM, and Nuance, and we found that their systems are actually really good for voice search.
When you do voice search, you ask a short question like, what's the weather tomorrow? How's the traffic to San Francisco? And the machine answers the question really well. However, when I tried it for human-to-human conversations, when I used it to record a meeting and transcribe it, the quality was actually pretty low.
I guess, for people who have been doing speech recognition for a while, you know there are a lot of differences when you do human-to-human conversations: first of all, it's far field versus near field, multiple speakers versus a single speaker, short questions versus long conversations. So we actually built a team to do our own training and built a specific system that's optimized for multiple people and long conversations.
Is that fair, Cathy? Is that true for the Google system?
I mean, there are certainly different use cases. At the moment I'm not going to use my Google Assistant for transcribing meetings, but there are certainly
more engaging, involved experiences that people do have with their Assistant, whether they're doing a long transaction like shopping, whether they're playing a fun game,
whether they're all sitting around as a family and playing a game, or listening to music. There are a lot of different journeys that you might want to go on with your voice assistant. And I think at
Google, we're looking at all these different places where you might be, where you might want to use voice or typing to interact, and making sure that we can provide the right thing for all these different ways in which you might want to use it.
This is something else that has changed too, I think, and it's true for both of your use cases: people actually are not that afraid of talking to machines anymore. If you'd asked five years ago, people were just nervous about the whole "oh, it looks weird, and I don't want to be talking to a machine." That has really changed too, but there's still some hesitancy. So what can you do to take away that impediment?
I think part of it is being very transparent. For example, with the Google Assistant you can look in the app on your phone, and it will show you any time that the Google Assistant woke up and thought you were talking to it. You can delete those recordings. We, of course, anonymize those recordings and we don't store them long term. So being transparent is one thing.
But I think also it's a new technology. And it's interesting to me to see the difference between people who are perhaps a little suspicious about something like a smart speaker sitting on their kitchen counter, but are happy to carry a phone around with them all day that has a microphone and is more easily, you know, hackable. So part of it is perception, but perception is more important than reality. So it's important to us to make sure that we address that and don't try to dismiss it, but really listen to what people are concerned about and make sure we work to really understand and address it.
And Sam, when we talked last week, you were talking about how, ideally, maybe you want to record everything.
Yeah, this is something very scary and it may make a lot of people uncomfortable, but eventually I think the system could be always on. For myself, I would love to have something I could use to ... listen to what my mother told me when I was in high school. Actually, a lot of precious moments are not scheduled; they just happen naturally, spontaneously. Of course there's a privacy concern, and we can talk about that later, but if you think about your life, it's actually pretty short. How long can you live, 100 years? All the people you meet, all the things you do, are extremely precious, and I hope our system eventually can record all of them, transcribe all of them, make everything searchable. Coming to Disrupt, I've probably already talked to 50 or a hundred people, and there's no way for me to remember everything I heard. So when I walk around now, I just turn on Otter, so whatever people tell me is captured, I remember what they said, and then later I can ...
Is that legal? Is that legal? Can you just record people? I forget.
This is a public event; actually, legally you don't have to tell people, because people don't expect privacy at this type of event. And federal law is one-party consent, although California is two-party consent.
All right. How many people here in the audience want everything recorded? One guy, two guys, okay, three people. Three out of a few hundred. Well, you've got some work to do.
There is some psychological resistance, but look at the history: look back 30 years, 40 years. How many things were recorded 40 years ago compared to what we're doing today? Think about what Google is doing, what Facebook is doing. When you're walking on the street, you are being recorded by the video cameras on the street anyway.
All right, Cathy, you look rather skeptical. You guys just had the Duplex thing happening. I don't know if you guys remember Duplex, but it was the Google Assistant making phone calls on your behalf, and everything was a little bit complicated about that, legally and ethically.
Yeah, I mean, I think personally for me, although it's true we're becoming a society that records so many things, that doesn't necessarily mean it's the right direction to go. And so it's this real catch-22, because if we had never collected data from people speaking (remember Google 411, things like that) to build these data models, there would be no Google Assistant or any of these things.
But that being said, we can still try to be ethically responsible with how we record the data, how we tell people we're recording their data, and let them control that as much as we can. It's a complicated subject, but I think it's something we have to think about carefully as we continue on this road where more and more things are being recorded.
All right, I didn't think we were going to go down this rabbit hole, but we did, and that's very much for the better. On the technical side, to switch gears a little bit:
have we solved voice recognition?
So I'd break voice recognition down into two main things, which is the ASR, the automated speech recognition, which is actually knowing the words that somebody said, versus the natural language understanding, or the intent. It's kind of like, you know, I don't speak German. If you were speaking German to me, I could probably write down the sounds you were making and kind of get that correct, but I would have no idea what you said. As far as recognition accuracy goes, we're doing very well. It's not perfect by any means, and neither is human speech recognition, but it's good enough that we can do a lot of things.
Where we really are a long way from solving is more the intent side of things. That's a huge problem. One of the biggest challenges in this space is what we call discoverability. With something like a voice assistant, it could do thousands of things, but how do you know what they are or what they aren't? We need to move towards a world where we can ask it anything, even if it can't do everything, kind of like if I go to a concierge in a hotel and I say, can you rent me a car? They might say no, but they're not going to say "I don't understand." Or they might say, go across the street, there's a car rental place. So we need our voice assistants to get to the place where I can ask anything and it can correctly direct me to something, or just admit, you know, no, we don't know how to do that. And we're a long way from that. It's a very big challenge.
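As a rough illustration of that split (a minimal, hypothetical sketch, not how the Google Assistant is actually built), you can think of the pipeline as an ASR step that produces the words and a separate intent-matching step that either handles the request or fails gracefully, like the concierge; the intent names and phrase lists here are invented for the example.

```python
def recognize_speech(audio_bytes):
    # Stand-in for an ASR engine: in reality this would run a speech model
    # and return the words that were said.
    return "can you rent me a car"

# Hypothetical intent catalog: a tiny stand-in for real NLU.
INTENT_PATTERNS = {
    "set_timer": ["set a timer", "timer for"],
    "play_music": ["play music", "play some"],
    "weather": ["weather", "forecast"],
}

def match_intent(transcript):
    # NLU step: map the recognized words to a known intent, if any.
    for intent, phrases in INTENT_PATTERNS.items():
        if any(phrase in transcript for phrase in phrases):
            return intent
    return None

def respond(audio_bytes):
    transcript = recognize_speech(audio_bytes)   # ASR: the words
    intent = match_intent(transcript)            # NLU: the meaning
    if intent is not None:
        return f"Handling intent: {intent}"
    # The "concierge" behavior: admit the limit and redirect,
    # rather than a bare "I don't understand."
    return "I can't do that yet, but here is something that might help."

print(respond(b"fake audio"))
```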
Actually, a big part of addressing that challenge is
context awareness. When you're using a mobile device (I used to work on Google Maps' location service, so I know a lot about location context), whether you're at the conference here, or you're at work, or you're at home,
the things you speak about are actually quite different. The phone knows the location, so it can adjust the speech recognition system and adjust the language model. And there's also a newer technology called diarization and speaker ID, which we actually built into Otter to recognize who the speaker is.
So once the system understands who is talking, it can look up their LinkedIn page, look up their Facebook page, understand that person's background, and then incorporate that knowledge into the speech recognition system to better understand what they're talking about.
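A simplified sketch of what combining diarization with speaker ID might look like (illustrative only; the segment fields, speaker labels, and profile lookup are hypothetical, not Otter's actual pipeline): diarization says who spoke when with anonymous labels, and speaker ID maps those labels to known people whose background can then inform the rest of the system.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float    # seconds
    end: float
    text: str
    speaker: str    # anonymous diarization label, e.g. "spk_0"

# Diarization output: who spoke when (labels only, no identities yet).
segments = [
    Segment(0.0, 4.2, "welcome everyone to the meeting", "spk_0"),
    Segment(4.5, 9.1, "thanks, let's review the roadmap", "spk_1"),
]

# Speaker ID maps anonymous labels to enrolled voice profiles; that profile
# (name, role, typical vocabulary) could then be used to bias the language
# model or make the transcript searchable by person.
voice_profiles = {"spk_0": "Sam", "spk_1": "Cathy"}

for seg in segments:
    name = voice_profiles.get(seg.speaker, "Unknown speaker")
    print(f"[{seg.start:5.1f}s] {name}: {seg.text}")
```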
So the ideal would be a Google Glass that's recording everything and gives you their LinkedIn profile ...
Yeah, my view is that it's not about logging the data; it's about who can get access to the data. You have to control the access. But logging the data actually has a lot of benefits for the user.
All right, fair enough. Fair enough. One thing Google has done recently with the Assistant program is the smart displays, which adds another dimension to working with a digital assistant and working with voice.
How's that working out so far? What's your experience been, and how have people been using those? There's only one on the market so far, but how does it change voice interaction?
Yeah, I think sometimes people think that when you design voice user interfaces, you have a voice-only philosophy, like everything has to be voice only and screens are going to disappear forever. And we don't really think that way.
Voice is a great medium for a lot of cases, but a lot of times a visual medium is better, and so some of these smart displays are great if you want to browse for something, like you're shopping for a new shirt or you're picking a paint color. Imagine if it's like, "which color blue would you like, light blue, light light blue?" I mean, that would be a terrible voice-only interface. So visuals, of course, come in very handy there. And again, back to the hands-free thing: one of the big use cases right now is in the kitchen. You're cooking, your hands are dirty, but you want to be following a recipe, and maybe you want to watch a video of how somebody stirs something or chops something, and having that additional visual element is really handy.
But I think it's kind of a brave new world, and I can't wait to find out how people are going to build stuff for these new displays. I think there's going to be a lot of creativity and interesting things coming.
I've tried that recipe thing. It's great. I never have the ingredients, so there's no point for me to use it, but ...
You just have to add shopping.
A shopping list as well, yeah, sure, I can do that.
What's the future of voice at this point? What would that look like for you? Recording everything and understanding everything. But what's next in voice?
Well, to achieve that will actually take a long time: to fully understand what people are talking about, their intention. One of the use cases is actually related to health.
If you can record everything, it listens to everything you say, it hears what other people say, and it can detect depression before you realize you're depressed. It can understand your emotion, understand when you feel upset; it knows all the historical conversations you had, who you talked to in the last week.
You mentioned accessibility: students are actually using Otter in university lecture rooms to take lecture notes, and so are people who have hearing difficulties. So what's next? All of this, I think; there are thousands of use cases you can build by leveraging this kind of technology. Listen to everything, understand everything.
I'd add to that. I think personalization is going to play a big part. Right now we can do mildly personal things: during breakfast I can say, "Hey Google, was my flight on time?", which is great because I've got like 15 airline apps on my phone and I can't log into them half the time. I don't even have to say the name of the airline or anything.
But I can imagine a future where you're kind of growing up with your own virtual assistant and it gets to know you, because designing one assistant that is all things to all people is very, very challenging; we're all different. Even each one of us during the day is different. So maybe in the morning I'm grumpy and I don't want to hear your dad jokes, but in the evening I'm relaxed and I want to have a chat. And so I think having these assistants know you a little better, and be able to modify the things they might offer you or the way they might communicate with you, could be a potential thing happening.
No dad jokes in the morning?
Maybe. But after about 10:30.
All right, 10:30, that sounds good. Our time is up, so we can start doing the dad jokes now backstage. Thank you.