Blaise Thomson, Vocal IQ
Blaise Thomson from Vocal IQ, a spin-off company from the University of Cambridge, explains to a live audience at the Cambridge Science Centre how he designs spoken language interfaces to allow computers to understand us and respond appropriately...
Chris - So Blaise, let’s kick-off with you. Tell us a bit about what you actually do to try and make computers understand what we say. Why is that so difficult?
Blaise - Well, it’s difficult because there are so many ways that we produce language. There are so many different accents and there are so many ways of phrasing things. More problematic actually, people keep coming up with new ways. So, when you think you've got everything sorted then people will start adding some slang that never got seen before. They rephrase what they were saying and also, they speak in ways which are not actually grammatical. So, a lot of the time, people think that we’re speaking in full sentences, but of course, we’re just speaking in phrases or sections of words, and people have some way of understanding that even if it doesn’t really make sense.
Chris - Didn’t the people of Birmingham become rather offended recently when it turned out that the speech software on the council’s telephone system couldn’t understand what they were saying? That is true, isn’t it?
Blaise - I believe that is true actually.
Chris - But why should the system object to the population of Birmingham?
Blaise - Well, it’s not a personal thing.
Chris - I should hope not.
Blaise - I mean, of course, there are some engineers who might have personal vendettas against the population of Birmingham. But in this case, it’s probably because the computer hasn’t seen any examples of people speaking in that particular way, and that way of speaking is possibly so different from what it has seen that it found it difficult to understand. So, I suppose what you need to understand is that the way these systems understand language is by looking at examples of how people have spoken in a database they’ve been given. They'll look through a whole collection of sounds. So for example, we have sounds like "aah", "bah", "kah", etc., and each of those sounds has a different representation as a soundwave. And so, when you look at the soundwave and do some analysis of it, each of these different sounds has a different pattern that comes out. Effectively, what's happening is, the computer is trying to find the most likely sounds for a soundwave that it gets shown. In the case of the people in your example, probably what happened was that the most likely soundwaves, according to what it had seen from other people, were actually different to what the person was saying.
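The matching Blaise describes can be sketched in miniature. In this toy version, each phone is modelled as a simple Gaussian over a single invented acoustic feature, and the computer picks the phone whose model makes the observed feature most likely. Real recognisers use many features per frame and far richer statistical models; all the numbers below are made up for illustration.

```python
import math

# Toy acoustic models: each phone is a Gaussian over one invented
# acoustic feature. Real systems use many features per frame and far
# richer models; these (mean, standard deviation) pairs are made up.
phone_models = {
    "aa": (2.0, 0.5),
    "b":  (0.3, 0.2),
    "k":  (1.1, 0.4),
}

def log_likelihood(x, mean, sd):
    # Log of the Gaussian density: how well a phone's model fits feature x.
    return -0.5 * math.log(2 * math.pi * sd ** 2) - (x - mean) ** 2 / (2 * sd ** 2)

def most_likely_phone(x):
    # Pick the phone whose model gives the observed feature the highest likelihood.
    return max(phone_models, key=lambda p: log_likelihood(x, *phone_models[p]))

print(most_likely_phone(1.9))  # a feature close to the "aa" mean
```

An accent the system has never seen is, in this picture, a speaker whose features sit far from every model's mean, so the "most likely" phone can simply be the wrong one.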
Chris - So, it doesn’t have a problem with Scousers going ‘eh’ at the end of each line then?
Blaise - Well same principle. It wouldn't, as long as it had seen examples of it and I think that's actually true of humans as well.
Ginny - This is actually a really good point for me to do the first little bit of my demonstration. We can't really know what the people from Liverpool sounded like to a computer, but I've got an example of something that might give you an idea of what they would’ve sounded like to it. I just have to get back to the laptop. Okay, so I want you to listen to this...
Ginny - Who thinks they understood that? A few people, but not everyone. Did some of you just not understand that at all? Yeah, we’ve got quite a few people who didn’t understand. So, if you imagine you're a computer, that's what someone from Liverpool would’ve sounded like. But your brains are amazing and they can use experience and learn from it. So, if I carry on playing this...
Recording - Please wave your arms in the air.
Ginny - Now, I'm going to play the exact same thing I played to you before again, but hopefully this time, you should understand it.
Ginny - Did everyone understand it the second time?
Audience - Yes.
Ginny - So, that's something called sine-wave speech. It’s just been put through a sort of scrambling programme to make it sound weird. And to start with, your brain doesn’t know what to do with it. It’s never heard those kinds of noises before. I think I found it sounded a bit like bird song, but I wasn’t really sure what was going on. But once you know that it’s speech and you know what it should say, your brain can unscramble it and it can figure out what it’s saying. And actually, now you've heard one example of sine-wave speech, you should be able to hear almost anything said in sine-wave speech. So, I've got an example of me saying something different...
Ginny - If you understood that, can you do what it says?
Audience - Clapping.
Ginny - Thank you very much. So, once you've had an example of sine-wave speech to use, you can understand other sine-wave speech, and that, I believe, is what you're trying to train your computers to do.
Blaise - Exactly. So, computers are exactly like that. The more examples they see of someone speaking, the better they are at understanding. And probably, what happened in that particular case is that they hadn’t seen enough examples of people from that area. If they had done, then they'd be able to learn.
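For the curious, the idea behind sine-wave speech can be sketched roughly: the speech signal is reduced to a handful of sine waves whose frequencies follow the main resonances (formants) of the voice. The frequency tracks below are invented placeholders; real sine-wave speech extracts them from a recording.

```python
import math

# Toy sine-wave speech synthesis: a few sine waves whose frequencies
# follow invented "formant" tracks. Real sine-wave speech derives the
# tracks from an actual recording; these numbers are placeholders.
sample_rate = 8000                      # samples per second
track1 = [700, 700, 500, 500]           # first "formant" per 100 ms segment (Hz)
track2 = [1200, 1400, 1500, 1100]       # second "formant" per segment (Hz)

samples = []
for f1, f2 in zip(track1, track2):
    for t in range(sample_rate // 10):  # 100 ms of audio per segment
        s = (math.sin(2 * math.pi * f1 * t / sample_rate)
             + math.sin(2 * math.pi * f2 * t / sample_rate)) / 2
        samples.append(s)

print(len(samples))  # 4 segments * 800 samples = 3200
```

Played back, a signal like this keeps the broad frequency movement of the original sentence while throwing away everything else, which is why it sounds like whistling or bird song until you know it is speech.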
Chris - There was one bit of medical transcription I saw where the doctor had actually said into one of these speech interpretation things, “I would partially halve the narcotics” and the computer translated it as, “I would prescribe parties and halve the narcotics.” So, it must be quite difficult when some people tend to link their words together while others speak with their words very clearly separated. How can you get a computer to tell those things apart, because the patterns must end up looking quite similar?
Blaise - They do, but there's a process by which the computer can just try out the different alternatives and it’ll essentially try every possible alternative and just look at which one is the most likely. So in your case, what's happening actually is that the computer is splitting up its analysis into two parts. So, there's one part which looks at the sound, but there's also another part which looks at the sequence of words. And some sequences of words are more likely than others. In fact, that's one place where you can get a lot of context. So for example, if you type into Google or any search engine, you can just write a few words and then it’ll predict another word for you. The same happens on your phone if you use something like SwiftKey or Swype. These are systems where you type a few words into your phone and then they predict what the next one is to help you go more quickly.
Chris - I've seen a few accidents where that's concerned though.
Blaise - Yeah, so that's exactly the same kind of accident that can happen. The speech recogniser will incorporate that and of course, that can go wrong, but for the most part, it actually helps. So, one big thing I've been looking at is how you can actually incorporate a whole dialogue’s context and incorporate the fact that you know what the person’s been talking about, and actually use that as well as just the most recent words.
Chris - Oh, that's clever. What you're saying is that by looking at the context, in other words the conversation that a person has already had, then when it sees a new word, you're narrowing down your search for what the possible words might be based on analysing a bit about the conversation. That sounds a bit spooky though. It sounds a bit like the computer's eavesdropping on what you're saying.
Blaise - Well, I think it’s something that we all do. So, when we’re talking to someone, we know what we’re talking about and what the subject is, and so, if there are two possible interpretations then we’ll choose the one which makes the most sense in the context.
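The word-sequence prediction Blaise mentions can be illustrated with a toy bigram model: count which word follows which in some example text, then predict the most frequent successor. Real language models are trained on vastly more text; the little corpus here is invented.

```python
from collections import Counter, defaultdict

# Toy bigram language model: count which word follows which in a tiny
# invented corpus, then predict the most likely next word. Real systems
# train on billions of words.
corpus = ("my broadband does not work "
          "my broadband is slow "
          "my phone does not work").split()

following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    # Return the word most often seen after `word`.
    return following[word].most_common(1)[0][0]

print(predict_next("my"))    # "broadband" (seen twice, vs "phone" once)
print(predict_next("does"))  # "not"
```

A recogniser combines scores like these with the acoustic scores, which is how "prescribe parties" can beat "partially halve" if the acoustics are ambiguous and the word statistics happen to favour it.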
Chris - Has anyone got any questions for Blaise? We need some questions now on anything to do with electronic technology, speech recognition, your funniest word prediction moment. Anyone got any howlers that they've seen when they've texted something to somebody and something they didn’t expect to come up? Yes, who’s this gentleman here?
Rhys - Hello. I'm Rhys Edmunds from Comberton. What is the highest wavelength and the lowest wavelength the computer can transmit?
Chris - Can you just clarify what you mean by transmit? Do you mean as in sounds that the computer can make?
Blaise - To be honest, I'm not really sure what the answer to that question is. I believe it’s several kilohertz. So, that's the frequency rather than the wavelength, but there is a range. Actually, the computer can both record and transmit higher and lower frequencies than our ear can hear.
Chris - Yeah, I mean your ear is sensitive to about 50 hertz which is 50 cycles a second. That's the lowest sound-ish that you can hear, about 50. Some people, down to 20 and it’s as high as 20,000 hertz when you're very young and most people in this room who are a bit older will stop hearing sounds above 15,000 hertz. In fact, people know that young people can hear some of these sounds better than older people and they're using them as a form of repellent for young people. There is in fact a shop. Is it in Wales Ginny? Do you remember this? They actually had a problem with youths loitering outside so they installed this buzzing system that bleeps at about 18,000 to 20,000 hertz and they can hear it and it’s enormously annoying, but the adults don't get deterred from going in and buying their paper because they can't hear it.
Ginny - But then the teenagers got their own back because they started using it as their ringtone because when their phone went off in class, the teacher couldn’t hear it going off, but they knew it was. So, they got their own back.
Chris - How is that for ingenuity? Next question. Who else has got a question?
Mark - Hello. I'm Mark from Cambridge. I was just wondering what's the limiting factor at the moment on the sort of processing of speech? Is it our understanding of how speech works or are we waiting for processing power to increase?
Blaise - I’d say it’s more of an understanding thing. We already have quite a lot of processing power. So actually, I’d say there are two big things. One is collecting examples of people speaking – what we call data, a big data source. A big reason for big companies being much better at speech recognition is that they have a lot more of this data. The actual processing power is not such a big problem at the moment. It’s very easy for us to get a large collection of servers and computers together to do the processing. So, there is quite a lot more to do on that understanding of how to build better speech recognisers. And even that's been developing quite dramatically in the last few years. In the last 10 years or so, these systems have become really a lot more useable.
Chris - I've been quite impressed though because I've rung up a number of companies that have automatic telephone answering systems and, whilst I find them infuriating, they do work really rather well. And they say, “Tell me in a few words what your call is about” and you might say “My broadband doesn’t work” and then it says, “Are you calling me about your broadband?” and you say, “Yes.” Well, it's very tempting to be incredibly sarcastic, isn’t it? Does it understand sarcasm? Does that wash, if you go, “No, I'm not.” Will it nonetheless still put you through to the broadband person?
Blaise - I think it probably depends on how the person's designed it, but in most cases, not. So actually, this is another thing we’ve been looking at: trying to get computers to learn, as well as the analysis of the words, the meanings of sentences. And a big thing for that is trying to work out whether you've done something right or wrong. What we’re doing is getting the computer to try out different strategies of what it might do. So, for example, in your case, it might try saying, “I think you want broadband” and then see if the person did. And then we get the computer to learn by itself what these things mean. And for that, it’s very important to see if you did the right thing or the wrong thing because that can inform you about how you should change your strategy in the future. Actually, a pretty good signal for that, for example, is whether people swear at the system. Generally, that means that you've done something wrong. But to check if you've done something right is a lot more difficult actually.
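The "try out strategies and learn from success or failure" idea can be sketched as a simple trial-and-error loop. The two dialogue strategies and their hidden success rates below are invented, and the reward signal (caller helped or not) stands in for cues like swearing; real dialogue systems use much richer reinforcement-learning methods.

```python
import random

random.seed(0)

# Toy trial-and-error learner: two invented dialogue strategies with
# hidden success rates. Reward is 1 if the simulated caller was helped,
# 0 otherwise (think: nobody swore at the system).
strategies = ["confirm_first", "transfer_immediately"]
success_rate = {"confirm_first": 0.8, "transfer_immediately": 0.5}  # hidden truth
wins = {s: 0 for s in strategies}
tries = {s: 0 for s in strategies}

def choose():
    # Mostly pick the strategy that has worked best so far,
    # but explore a random one 10% of the time.
    if random.random() < 0.1 or not all(tries.values()):
        return random.choice(strategies)
    return max(strategies, key=lambda s: wins[s] / tries[s])

for _ in range(2000):
    s = choose()
    reward = 1 if random.random() < success_rate[s] else 0
    tries[s] += 1
    wins[s] += reward

best = max(strategies, key=lambda s: wins[s] / tries[s])
print(best)  # the strategy that helped callers more often
```

The occasional random exploration matters: without it, an unlucky early failure could make the system abandon the better strategy before it has a fair estimate of how well each one works.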
Chris - Do you remember that Microsoft paper clip that used to pop up? Do you remember that Microsoft paper clip? Who here found that bloody annoying? I saw a gag someone had done once and they were so fed up with it. They'd written this sort of thing saying, “I've decided to kill myself and whatever” and then the paper clip pops up and said, “This looks like you're writing a suicide note. Would you like some help with that?” Are you able to do emotion?
Blaise - It’s not something that I've looked at. People have started looking at this generally, but it’s actually very difficult, partly because it’s very difficult even for humans to decide what the emotion is just from listening to someone. So, if you get two or three people and all of them listen to the same segment of audio, some will say that the person is very angry, others will say they have no emotion, and others will say that they're happy. Actually, the agreement amongst humans is very low, around 60%. So, that makes it very difficult for computers to learn, and of course, their agreement is even lower.
Chris - Any other questions coming in so far? There's one gentleman at the back.
Carlo - Good evening. I'm Carlo from Cambridge. I would like to know whether there is any language which is easier for computers to understand, or whether they are more or less all the same.
Blaise - Yeah, that's an interesting question. So, there is a difference in the ease of understanding. In some respects, English is actually the easiest of all, and that's partly because we have a lot of what I call data, a lot of examples, because most of the internet is in English. There are other languages which are more difficult for various reasons. One is Chinese, and that's more difficult in some ways for two reasons. One is that there are tonal variations in the understanding of the language. There are examples of words which to you or me might sound the same, but one might mean horse and the other mother. If you say the wrong one, you can get into deep trouble.
Chris - In some supermarkets, horse and beef ends up being confused, doesn’t it?
Blaise - That's true, but I believe that's not because of a change in tone. So, that's one interesting reason for the difficulty. Another, in the case of Chinese, is that the writing doesn’t actually segment words in the same way that we do in English, so there's a different idea of what a word means. There are other languages which are difficult for other reasons – one is Turkish, and that's because of the way the words are structured. So in English, we have prefixes and suffixes and that's quite easy to deal with. But in other languages you get what are called infixes, so the words can change because of a change in the middle of the word. That makes things a lot more difficult for various technical reasons because you get this huge explosion in the number of words that are possible.
Chris - Presumably, the system can learn though. Can it not teach itself and slowly get better?
Blaise - Yes. So, that's what we do in all of these cases with Chinese, Turkish, Arabic, etc. For all of them, the best systems now are ones that have been learned automatically. But it becomes more difficult when you have a very large explosion in the number of words or when you start having to add extra things that affect the meaning.
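The explosion in word forms that Blaise describes is easy to illustrate: if every stem can combine with several layers of optional affixes, the number of distinct word forms multiplies. The stems and affixes below are loose placeholders inspired by Turkish, not real morphology.

```python
from itertools import product

# Toy illustration of vocabulary explosion in heavily inflected
# languages: stems combine with layered optional affixes, and the
# number of distinct forms is the product of the options at each layer.
# These stems and affixes are invented placeholders, not real Turkish.
stems = ["ev", "kitap", "araba"]        # 3 stems
plural = ["", "-ler"]                   # 2 options
possessive = ["", "-im", "-in", "-i"]   # 4 options
case = ["", "-de", "-den", "-e"]        # 4 options

forms = {"".join(parts) for parts in product(stems, plural, possessive, case)}
print(len(forms))  # 3 * 2 * 4 * 4 = 96 distinct "words" from only 3 stems
```

With thousands of stems and more affix layers, the count runs into the millions, which is why a recogniser can't simply list every word form the way it can for English.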
Chris - It’s obvious that if you want to make someone understand, you just speak loudly and clearly in English, don't you? Any other questions?
Fergal - Hi, this is Fergal from Cambridge. I'm wondering how the speech recognition works in general. Is it more by pattern matching, or does it break all sounds down into basic elements – wavelength, frequency, amplitude – and what similar parameters make for different sounds and different accents and tones?
Blaise - So, what happens is that it takes a soundwave and then it breaks it up into a sequence of what we call features. So, these are a bit like what you called amplitude and so on. Actually, they're slightly different but effectively, the idea is the same. And then what it does is, it tries to find a sequence of sounds which we call phones. So, these are things like ahh, uhh, buhh, etc. So, there's a collection which pretty much defines all the sounds that we can produce as humans, a subset of which is used in English. And then it tries to find the most likely sequence of these phones, given the example of speech that it’s been given. The way it decides if something is likely is it looks at the distribution, or the patterns, of these features that you would usually get for that sound. So, things like the amplitude. The specific ones which are typically used are a little bit more complicated. They're called mel-frequency cepstral coefficients, which is...
Chris - Say that again.
Blaise - Mel-frequency cepstral coefficients. Also perceptual linear prediction coefficients.
Chris - How does that go down at parties?
Blaise - It usually kills off all conversation to be honest which is why I usually try to leave it for as long as possible before bringing that in.
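The "sequence of features" step Blaise describes can be sketched crudely: chop the waveform into short frames and compute a number or two per frame. Here a single feature (log energy) stands in for the mel-frequency cepstral coefficients a real recogniser would compute per frame, and the waveform is synthetic.

```python
import math

# Crude stand-in for feature extraction: split a waveform into 25 ms
# frames and compute one feature (log energy) per frame. Real systems
# compute a dozen or so mel-frequency cepstral coefficients per frame;
# this synthetic waveform is a quiet tone followed by a loud one.
sample_rate = 8000                  # samples per second
frame_len = 200                     # 25 ms frames at 8 kHz

wave = [0.01 * math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(800)]
wave += [0.5 * math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(800)]

features = []
for start in range(0, len(wave), frame_len):
    frame = wave[start:start + frame_len]
    energy = sum(x * x for x in frame) / len(frame)
    features.append(math.log(energy + 1e-10))  # small offset avoids log(0)

# The feature sequence jumps when the loud tone starts, four frames in.
print([round(f, 1) for f in features])
```

The recogniser never sees the raw soundwave again after this step; everything downstream, including the phone matching, works on sequences of feature vectors like this one.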
Chris - Any other questions?
Jenny - Hi, Jenny from Cambridge. I was just wondering if the video phones were developed more, would that help, especially with emotion? So, would you be able to match image with sound and you'd be able to tell by someone’s face, their emotion more?
Blaise - Yes, so video gives a lot of help for emotion detection. It actually helps with a few other things as well. So, you wouldn't think it’s a problem, but actually, one of the most difficult things is deciding when the computer should talk and when it thinks you're talking. Because if there's noise in the environment, then it’s very difficult to know whether the person's stopped talking or is carrying on talking. And people pause, and it’s never quite clear whether the computer should start talking now or not. Even people struggle with this, because people will interject and fight about who's talking. So if you have a video, it’s much easier because, as well as listening to the audio, you can look at their lips and see whether they're still planning to speak. You can look at their face, and that also gives some indication of whether they're planning to speak. So, video helps with more than just the emotion. It also helps with what we call turn-taking.
Chris - Can you also use facial expression to work out what someone might be trying to say because there's this famous McGurk effect, isn’t there Ginny, where actually, if you show someone mouthing a word one way, but play them a different sound, the person who’s listening can get confused and hear a very different thing?
Ginny - Yeah, exactly. We very much use the shapes being made by the face to help us work out what someone is saying, particularly in noisy environments. So, if you see someone going bah and making the kind of face they make when they make a bah sound, and you hear something that could be a bah or could be a mah, you'd assume that it’s a bah because of what you can see them doing. So, we definitely use that. So, I would’ve thought that that would be helpful for the computer as well, at least for some of those sounds.
Blaise - Probably, yeah.