Building for Voice with Jason Mars (Clinc) and Alex Smola (Amazon Web Services) | Disrupt SF (Day 2)
2:50AM Sep 7, 2018
Now, we're all familiar, are we not, with the voice interface as a new way to interact with computers? Okay, let's do an audience survey: who's got an Amazon Alexa? Quite a lot. Who's got a Google Home?
Do the people who have an Amazon Alexa also have a Google Home? What's going on that I don't know about? Who's got a HomePod, with Siri? Just me then, right?
But it's clearly going to be a new way of interacting with the cloud, with computing. And of course, it'll be pretty much everywhere: in our homes and in our cars. So, here to discuss the new wave of voice communication, we're going to hear now from Jason Mars of Clinc and Alex Smola of Amazon Web Services, in conversation with my fellow editor-at-large at TechCrunch, John Biggs. A round of applause, everybody.
Just to set it up: we have an Echo over here, and one of your machines right here, and I'll just talk to the machines. Wouldn't that be better? Great idea. Let's do it. Although it might at some point get into an endless loop.
But that's fine, because that's what Disrupt is, ultimately, for me: it's an endless loop. I come here every year and do exactly the same thing over and over again, 'til I die, ultimately. That's awesome. So, that was sad. Okay.
Talk to me about your approaches. You guys are coming from different worlds: you're a startup, you're well established. Why don't we start from there? We'll go down the line and talk about your approaches to speech and speech integration and the products that you're working on.
Okay, so I'm wearing my AWS hat here, so this is not about Alexa, even though, of course, we work with Alexa and parts of it run on AWS. One thing that AWS really emphasizes is doing things the way our customers want to do them. Usually there are many different ways of getting to a solution, and not every customer needs every component of that solution, right? So, for instance, we have four or five different databases, and yes, if you want to run Oracle on AWS, you can do it. Now, the reason we do this is that our customers are very diverse, both in terms of problems, skills, intents, and just personal preferences. Now, zooming in on voice: voice isn't just a single thing. It's many, many components. So if you think about it, the first
thing you want to be able to do is have your system produce something that's human-understandable: in other words, text-to-speech in a natural way that can be customized and adjusted to humans. That's called Polly; it has, I think, 15 different languages and many, many different voices. If you go to the AWS demo page, you can hear all those voices, for instance Australian English, British English, or American English. But be that as it may: besides being able to speak, you also need to be able to listen. For that there's something called Transcribe, which will turn long-form conversations into text and segment, diarize, and partition them, and so on. The next thing you want is to be able to have a conversation, so first you need a device with which to have that conversation. You could use anything, but you can, for instance, use Amazon Connect as a call-center solution, so you can call a phone number.
Then you need a system that takes care of the dialogue that happens. We've dealt with sound in, sound out, and the number that you can call, and all of those components can of course be used separately. The dialogue part is exactly what Lex provides, and, as the name suggests, it is to some extent the heart of Alexa, though different products have slightly different preferences and requirements. You can then go and build the entire interaction flow with the dialog manager: you can have it fill slots, and you can suggest what types of queries, responses, and interactions might fit in.
And then, of course, in addition to that, you can fall back to humans, and you can execute things: you can make it all serverless, you can interact with databases, load from databases, dump into databases, just as you would normally.
In the end, if you want to do finer analysis, then Comprehend will, for instance, tell you how a customer felt, what the sentiment was, and
basically perform additional analysis. So even if you have a call center with humans, you can apply this. The key point is really that you have all the components: you can take them all, you can take only a subset, you can take only one, and integrate them in whichever way works for you. And this way, you can get a very broad range of services. So, that's the high-level view, and I should probably hand over to Jason here.
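The slot-filling loop Smola describes, where the dialog manager keeps prompting until every required slot is filled and then fulfills the request, can be sketched in plain Python. This is a toy illustration of the pattern, not the Lex API; the intent name, slots, and prompts are all invented for the example:

```python
# Toy slot-filling dialog manager, illustrating the pattern that a
# service like Lex automates. Intent, slots, and prompts are invented.
REQUIRED_SLOTS = {"BookHotel": ["city", "check_in", "nights"]}
PROMPTS = {
    "city": "Which city?",
    "check_in": "What day do you arrive?",
    "nights": "How many nights?",
}

def next_action(intent, filled):
    """Return the next prompt, or a fulfillment message once all slots are filled."""
    for slot in REQUIRED_SLOTS[intent]:
        if slot not in filled:
            return ("elicit", slot, PROMPTS[slot])
    return ("fulfill", None,
            f"Booking {filled['nights']} nights in {filled['city']} "
            f"from {filled['check_in']}.")

# The user fills slots turn by turn until the intent can be fulfilled.
state = {}
print(next_action("BookHotel", state))  # ('elicit', 'city', 'Which city?')
state.update(city="San Francisco", check_in="2018-09-07", nights="2")
print(next_action("BookHotel", state)[2])
```

A real dialog manager adds slot validation, confirmation prompts, and a hook into fulfillment code (a Lambda function, in the AWS case), but the control flow is essentially this loop.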
Yeah, so our approach is based on the mission and the goal of making these experiences feel like a human in the room, right, where you can interact with it, it understands you, and you can rely on it understanding you even if you use messy, unconstrained speech and unconstrained language.
Right. So, you know, where most technologies take the route of specifying what could be said to the system, we take an approach where we collect observations of how people talk about a certain domain and use that data itself to train the system. So there are no top-down specifications of grammars; there are no rules, no keywords, synonyms, or dictionaries, nothing specified other than the raw language use as the corpus to train the system end to end. It turns out most approaches out there will leverage something called part-of-speech tagging and build trees out of the nouns, verbs, and adjectives that were said. However, if I said I'd give you $100 if you could tell me the last 10 verbs I said, most folks won't be $100 richer. Yet you understand everything that I'm saying right now: even though I'm speaking fast, the words are washing over you and you understand me, right? And so we took that end-to-end approach in how we designed our technology stack.
So you can say long, messy utterances. You can say something like, "Hey, man, you know, I remember I was in California last year, and I dropped a bunch of cash on groceries. Could you just sum that up real quick and give it to me?" There are a lot of salient bits of information in that messy utterance, and our system is able to extract all of the bits relevant to answering your question cleanly and crisply. So it's designed from scratch with this in mind.
You can correct things: we have conversational healing along the conversation. You can divert and say, "Oh, wait, let me go back to the last thing that I said and change something real quick." At any point, you have the flexibility and the freedom to go off script, correct things, and then come back, and so forth.
It's built in-house from scratch, based on research from the University of Michigan. In another life, I'm a professor of computer science and engineering. And, you know, we've taken it to market and have had massive amounts of success over the last six quarters.
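The distinction Mars is drawing, learning intents from observed utterances rather than from hand-written grammars or keyword lists, can be caricatured in a few lines. To be clear, this is not Clinc's system (theirs is a deep end-to-end network); it's a bag-of-words nearest-intent toy with invented training data, just to show the shape of a data-driven classifier:

```python
# Toy contrast with grammar-based NLU: intents are learned from example
# utterances, not keyword rules. NOT Clinc's model; training data invented.
TRAINING = {
    "spending_query": ["how much did i spend on groceries",
                       "sum up my grocery spending last year"],
    "balance_query": ["what is my account balance",
                      "how much money do i have right now"],
}

def classify(utterance):
    """Pick the intent whose best example shares the most words with the input."""
    words = set(utterance.lower().split())
    def best_overlap(intent):
        return max(len(words & set(ex.split())) for ex in TRAINING[intent])
    return max(TRAINING, key=best_overlap)

print(classify("hey can you sum up what i dropped on groceries"))  # spending_query
```

Because the model is defined entirely by example utterances, covering a new way of phrasing a request means adding data, not editing rules, which is the property Mars is describing.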
How much of Alexa's software is using your software?
Absolutely zero. Yeah, exactly. And, honestly, the way we've designed the stack is this: it's a platform that empowers you to build virtual assistants, right? So if you have an app, if you have a product, if you have a call center, if you have a device and you want a conversational assistant, you use our platform to build that, much like you use a programming language to build a program.
Yeah. What's your view on the future of voice interaction? Is this a feature that's added on to modern products, or is this an entirely new track of interaction? Are we going to be talking to our computers most of the time, or are we going to be looking at and interacting with our computers most of the time? You can start, Alex.
I think it's both. There are certain established products where a voice interface will make them much more easily usable. I mean cars: right now, you mostly press buttons, and I don't think that'll entirely go away, because haptic feedback is really nice, like a steering wheel, as long as you need one. You're probably not going to say "hey, turn left, turn right"; the wheel is so much faster. At the same time, if you need to go into some very detailed settings, a voice command is much easier. So this is one of those cases where the product will get better by adding voice.
And then you have products which are essentially voice-only. I mean, by now there's an Echo with a display, it's called the Echo Show, but ultimately it's a voice product. You talk to the device, it talks back to you, and as time progresses it gets smarter: it learns, based on you and on interactions with everybody else, how to make itself better. Now, if you want to
build your own Alexa-like product, you can go to AWS and use Lex and other tools to build this, and we've given you all the components. So, for instance, if you want to build a dialogue, you can essentially model the various parts that you require, give a couple of examples of what you expect as an answer, and then we use appropriate deep learning to match this against more freeform conversations as well. Now, I don't know whether we are necessarily able to catch slang quite as well as what we just heard, but
okay, I think I would sometimes have trouble understanding that myself, and I'm not a native speaker. But that's it:
it's each individual component. So, for instance, if you have a very simple device where you just press a button and it says "the garage door is open" or "the garage door is closed," then the only thing you really need is text-to-speech, right? So it's not clear that you always need the full dialogue agent; sometimes you just need voice as an augmenting device. And this is really what we're here for.
Yeah, you know, voice and conversational experiences are a means to an end, right, and not an end in and of themselves.
What we aim to do with technology, period, the overarching thing about technology, is to reduce complexity in our lives and improve our quality of life. That's it. That's the motivating function for evolutions and revolutions in technology. And insofar as speaking and interacting with language, which we learn to do from the age of, I don't even know, one,
if we can just use what's natural and innate in all of us to make our lives easier, to do something complex: "I want to take a long road trip to, you know, New York. What do you think, am I going to be able to get there with two tanks of gas?" And then, from that interaction, getting the insight of: "Well, actually, you're going to have to stop at four gas stations for this trip, and optimally, given how you've driven for the last year, you should stop at this location, this location, this location, and this location on the way for the most efficient trip." That's what we want. That's improved quality of life. That's reduced complexity. And so, when we're doing it right, that is the outcome: when we apply conversational AI, voice AI, to a problem, the outcome of doing it right is that things get better. So I do think it's coming, and as we develop it, it'll become more prevalent, because it's natural and innate, you know.
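Once the assistant has extracted the facts, the road-trip answer itself is ordinary arithmetic. A back-of-envelope sketch, with all numbers invented for illustration:

```python
import math

def fuel_stops(trip_miles, tank_gallons, mpg, start_full=True):
    """Fuel stops needed for a trip, assuming the tank is refilled at each stop."""
    range_per_tank = tank_gallons * mpg
    tanks_needed = math.ceil(trip_miles / range_per_tank)
    # If you start on a full tank, the first tank doesn't require a stop.
    return tanks_needed - (1 if start_full else 0)

# A ~2,800-mile drive to New York, 14-gallon tank, 40 mpg: 560 miles per tank.
print(fuel_stops(2800, 14, 40))  # 4 stops (5 tanks total, the first one is free)
```

The hard part of the product is not this calculation; it's reliably pulling the destination, the question, and the driver's history out of a messy utterance, which is where the conversational AI lives.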
Are you guys in charge of that interaction, or are you just in charge of the voice going back and forth? I guess you can start, Jason. Because, I mean, with that interaction you just described, and there was a demo online of what you're working on, and everybody's probably seen AWS's services, is the expectation that eventually one of the big players is going to connect to your platforms and say, okay, this is the data we have accessible to that user, let's spit it out when they ask for it? Are you guys in charge of that?
So the way this has to look is as a toolkit for development, right? Because when you look at code bases and tech and integrations, it's code, it's APIs; the data is behind some kind of API or query language, SQL or Cypher or whatever query language, etc. So the conversational AI piece is another kind of toolkit for development. You'll build the AI to understand how humans speak, and then that AI will have the
tooling so that you can integrate it into code. And so developers and designers will have at their disposal SDKs and so forth for the code pieces, and then they'll have a platform or a toolkit for building the brain that they then hook in. And we have a brain: Clinc has a brain, an extremely powerful brain that's built from scratch, and it looks different than anything else in the market. We believe that brain represents how this should work in the future. And then you have other services out there with other approaches and other brains. So I think that's how it fits into our ecosystem.
Alex? Okay. So it's quite reasonable to assume that some customers may not be NLP or conversational-AI experts. Actually, a lot of customers probably aren't, and for them, templates may be the right way to go: to have
a couple of reference solutions and tell them: look, start from there, modify it, maybe in a minor way, and you get something. And we have some blueprints right now, for instance, for simple examples of how to do this. At the same time, you don't want to constrain
the developer; you want to allow them to add to, modify, and extend the system as far as they're able to, right? And this goes as far as us, for instance, making available in source code the full NLP suite of tools for deep learning in the GluonNLP toolkit. You can basically take any of the latest papers, and we have an implementation that's tuned efficiently; you can just git clone the open source and build your own system entirely from scratch. So that's how far we go if you want to do it all by yourself.
Now, for most customers it's somewhere in between. They may have some very specific expertise, and this is where they may actually innovate, and then for the rest, we take care of it. Now, one thing that's quite often confused, and this is just because humans work this way, is that just because something sounds natural, in terms of voice, maybe, or in terms of language, doesn't actually mean it's intelligent.
Probably the best example would be ELIZA. I don't know who... okay, this is proof half the audience is too young to remember this, but it was one of the first text-based communication systems, and it would basically just turn whatever you say back to you as a query. It was meant as a therapeutic tool for psychoanalysis, mind you. And the first time you use it, if you don't know what's going on, for the first minute or two, maybe four or five minutes in some cases, you don't notice
that you're actually talking to a machine. So the bar for building something that seems intelligent isn't that high, at least if you want to fool people for a short time. For something that's more persistent, there's quite the gap between the pieces you need for the actual chatty conversation and the intelligence that goes on behind the scenes: you want to be able to query a knowledge base, and you want to integrate it with a lot of other systems, which can, for instance, ingest data. And this is the full system where, essentially, we're helping developers build things. So basically, stay tuned until re:Invent and you might see some nice things; that's all I can say.
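The ELIZA trick Smola describes, turning a statement back at the user as a question, takes only a few lines to reproduce. This is a minimal sketch in the spirit of Weizenbaum's 1966 program, not its actual DOCTOR script:

```python
# Minimal ELIZA-style reflection: swap pronouns and echo the statement
# back as a question. A sketch of the idea, not Weizenbaum's script.
REFLECTIONS = {"i": "you", "me": "you", "my": "your", "am": "are",
               "you": "I", "your": "my"}

def reflect(statement):
    """Echo the user's statement back as a question with pronouns swapped."""
    words = [REFLECTIONS.get(w, w) for w in statement.lower().rstrip(".!?").split()]
    return "Why do you say " + " ".join(words) + "?"

print(reflect("I am worried about my job."))
# Why do you say you are worried about your job?
```

The output feels conversational for a minute or two precisely because it reuses the user's own words; there is no knowledge base or intelligence behind it, which is exactly Smola's point about sounding natural versus being intelligent.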
I want to go back to the sci-fi stuff near the end, because this is cool.
And one of the most interesting things I've learned about natural language processing is the use of the Enron corpus. I don't know if you've heard about this thing. It was basically just a massive data dump from Enron, because it
was the only example of human text that they had, of people actually talking. But it was all a bunch of dudes just talking about their boats and, like, I don't know, Dallas, and which strip club they're going to go to at night. So that's the Enron corpus, and it actually created a prejudice within a lot of natural language processing systems. How do you guys avoid that? How do you avoid the tendency for the machine to not be able to hear you as a non-native speaker, I guess, and how do you fix those things that are actually going to be issues over the next couple of years?
Yeah, I mean, that's a really important point. It turns out that these models are completely subject to the data that you expose them to, right? So everyone knows about Microsoft's Tay debacle. Tay went out on Twitter, and there's some bad data on Twitter. This was a bot that started tweeting and learned from what it was seeing on Twitter, and it learned the most offensive, the worst stuff possible. The worst of humanity: racist, sexist, all that stuff.
And so you have to be very careful. There's going to be known bad data that trains these models to not be great, and then you'll also have unknown biases in the data sets you have. Depending on how you're applying the AI, and how that could manifest itself, that needs to be something that designers, and folks wielding this technology generally, need to be conscientious and cautious about. I mean, that mistake was very expensive for Microsoft, and we need to watch out not to make those mistakes ourselves.
Alex? So I think this is really where a lot of the hidden cost in building such systems comes in: you really want to do good quality control. So, for instance, if you
have a picture of monkeys and they get misrecognized as humans, or vice versa, then this is a no-no, and it's officially a no-no in this country, right? For a machine learner in general, well, okay, fine, you miscategorize category A as B, no big deal; but that specific error is really embarrassing, and you really need to take great care to avoid that specific one. So this is where a lot of the testing comes in: making sure that the distribution of what you're getting from your customers matches what you train on. And to some extent, you're only going to see that once you start deploying the system. Or, for instance, suppose you have a product that wants to detect what is not safe for work. Well, you'd say: hey, look, that's an easy thing, right? As long as people are clothed, it's fine. Well, actually, no, not quite, because what goes in the New York Times and what goes in a tabloid may be very different, and what is considered offensive here and what is considered offensive,
for instance, in the Arab world may be very different still. So that is something where only by working with your customers, by getting their feedback, can you really ensure that you're going to take care of the, well, unintended biases; that you iterate very quickly; that you allow customers to define some of their own data sets, some of their specific terminology, conversation flows, and so on; and that you make sure the entire system remains debuggable. And to some extent, that's a little bit of the tension we're seeing right now with a lot of fully integrated deep learning models, which can be non-trivial to debug, right?
And we've had fun with that ourselves. For instance, you might use a machine translation system, and then you run some experiment and all of a sudden it produces a translation that you didn't expect, and then you need to work backward. And the
more end-to-end trainable it is, the harder it becomes to analyze. So this is essentially where all the engineering comes in. Because otherwise you can git clone the models, like in the GluonNLP toolkit, and just use them, but then you need to know what you're doing. So that's why I think it's an exciting time.
In terms of privacy: it is tempting to leave the microphone on all the time so you get as much data as possible. How would you react to that? How would you handle that?
Yeah, well, data is gold in this AI revolution, right? Data enables us to make these models really useful. So we do have a very delicate balance to strike between having copious amounts of data to build good stuff and making sure that we're conscientious about how that data might be misused or what hands it might fall into. Now, from my perspective, I want to get as much data as possible. I mean, come on. Look, Google has Gmail; they love getting all that data, the whole planet's emails, come on, right? And you have to be very cautious and really self-regulate how you manage that data. And when you make a big boo-boo, like Facebook (sorry, Facebook), society comes down on you like a sack of bricks. But I think what we're going to see, and I'll just say what is actually happening, is that we're erring on the side of development and tech; we're erring on the side of collecting the data and then trying to be responsible, because that's what's allowing us to create the kinds of technologies that are improving our lives. And so I'm a big proponent of: collect data, do good stuff with it, and self-regulate as we go forward. Right?
So this is actually the wonderful tension between doing a startup and working in a slightly larger place. The thing is, you absolutely do not want to violate the trust that your customers place in you, right? And so you want to be super careful that you never do anything that you didn't tell your customers about, that you treat the data of the customer with respect, that you make sure there is no data leakage, that you make absolutely no mistakes.
Because with a startup, let's face it, okay, fine: you fail, you go to an angel investor, you do another version, and you're back in business, or maybe you just rename the thing. We can't rename, right?
So this is one of the things where, to some extent, you're going to get a much more conservative, and with that maybe also somewhat more expensive, approach from a place like AWS, because we really value the privacy of the data that our customers have. So,
you know, that's why you have HIPAA compliance. That's why you have all the other security, encryption, and protection guidelines, which make sure that the data the customer entrusts us with is really safe and secure and not abused in any way.
Whereas otherwise, I mean, I could go and scrape data and, copyright be damned, train a model; nobody's going to come after me if I'm a startup, right? Been there as well. So this is
okay, yeah, I think the statute of limitations is probably over by now. So there you go.
Yeah, I mean, first of all, I wasn't advocating taking people's data and doing nefarious things with it. But the thing is, there are some industries that are regulated. You mentioned HIPAA compliance: healthcare is regulated, your privacy there is regulated legally. In financial services, there are laws as to how data can and can't be used, and how data can and cannot be transferred. However, when it comes to other kinds of data, your personal data, it's not regulated. You have a terms of service, you have a contract that you sign, a EULA, right? You click "I agree," you log in, you approve this message, "I read the thing," which no one reads. That's how it's regulated. No one reads that; we just have an innate trust, right? And when it's violated... Facebook isn't being criminally prosecuted for breaking a law right now, though Zuckerberg did have to speak to Congress. But in this unregulated world, there are norms and ethics that are emerging, social mores that are being defined now, as to the things that are just not good to do with data. And we're behind the curve on establishing legal specifications for how data can and can't be used. Now, I don't know where you guys sit on this, but
you can imagine some folks would want more government telling us how our data should be used, and then you have libertarians who will say, I'll sign the contract, contract law will handle it, etc. So I think we have to understand that landscape and be conscientious of it as we move forward, right?
Yeah. Okay, our sci-fi question; we've only got about two and a half minutes left. At what point, what year, are we literally going to be able to set up three machines to talk to each other the way that we just talked to each other? You're speech guys, right? So let's think about that. When will we be able to ELIZA this thing out to a degree that it sounds human?
You know, I love that question. That's a very interesting question, and I think it's coming. I think that eventually we're going to get to the place, we see it in sci-fi movies now, but we haven't realized it in industry or technology, where we start building these assistants for recreation. We'll build companions to literally be our friends.
And those companions will have personalities, etc. That's going to happen, right? That's just human psychology: we're going to want to recreationally have a buddy, once it's good enough to be interesting. When we have that, then we're going to have a little bit of diversity in these personalities, and then we'll be able to set them up to talk to each other. That's my thought. What do you think, Alex?
Okay. So there's one part where I think it's a terrible idea to have this happen. Let me get to this, then go eat lunch. So,
okay. So basically, if you want to actually negotiate, discuss, and achieve some outcome, language is a horribly imprecise way of defining this. I mean, mind you, this is why people are putting contracts on Ethereum, right, such that it's a computer program that is executed, and it's irrevocable, and so on.
And you want to have, for instance, theorem provers that prove that a voting machine is correct. If the law is written in ambiguous ways, all those things are cases where people said, well, no, that's actually not what I meant, even though
they tweeted about it the previous day, right? So this is why language as a means of communication between machines makes things a lot less precise than they could be, and I don't think it should be a goal unto itself to accomplish this. Might we have social interactions in the context of, well, entertainment? Yes. Might we have it in a context where no other way of communication between those devices exists? Yes. But short of that, I think we have a lot of other challenges to overcome before we even get there. Now, being able to talk to a device like a companion, I think that is a great thing that we will eventually see in the future, especially with people getting older and being lonely. It's a lot better to talk to Alexa than to have no nurse show up in the old-age care home when you push the button. And so
there are plenty of opportunities, and I hope they'll make our lives better.
Wonderful. Well, thank you guys. This is fascinating, fascinating stuff. Thank you. Cool. Thanks.