Artificial tongue could improve speech recognition

27 July 2008

Interview with

Roger Moore

Chris - Improved voice recognition software usually just means a bigger library of sounds to compare your voice against. Scientists at the University of Sheffield are trying to take a different approach to tackling the problem. They've done it by building an artificial tongue and jaw which replicate the movements we make during speech so we can understand a bit more about how speech gets made. To find out a bit more we've invited Professor Roger Moore from the University of Sheffield to join us. Hello Roger.

Roger - Hello Chris. Hi, how are you?

Chris - Thank you for joining us on the Naked Scientists. Very well, thank you. Tell us a bit about your study.

Roger - This is actually the research of one of my PhD students, called Robin Hofe. He's essentially put together an animatronic tongue and vocal tract which we call AnTon. It's just beginning to move right now. It's at a very early stage of development. What we're going to be doing with this is studying the speech production mechanism in animatronic form. We'll be able to access data about how speech is produced which is very difficult to get directly from the human system.

Chris - Describe the system a bit for us. What does this tongue look like?

Roger - It actually does look like a tongue, you'll be relieved to hear. It's made from a silicone gel with a latex cover. It's been moulded on a medical model. It's embedded in a plastic skull. The skull is also derived from a real skull. We're, right now, filling in the remainder of the vocal tract in order to complete the cavity which will surround the tongue. The most important part of all is that it has to move. We are connecting motors through filaments in the tongue itself to pull the equivalent of the different muscle groups.

Chris - You can actually accurately model the muscles that are natively in the tongue in order to faithfully reproduce the movements a tongue would make when saying different words?

Roger - This is the idea, yes. We're trying to be as anatomically correct as we possibly can because this is important to the research. One of the key things that we need to do, which is very difficult to do in any other way, is to measure the energy that is used to move the tongue into different positions.

Chris - How will you do that with this model?

Roger - It turns out that because it's a physical model all we have to do is measure the energy being used by each of the different motors which are pulling on the respective muscle groups. We will therefore have a very easy way of assessing the different energy uses across the whole of the model.

Chris - How do you know which movements to make the tongue make in order to say different things?

Roger - Initially we're using some MRI scans from our colleagues in Grenoble. We're doing that to establish static positions that we will set up as targets.

Chris - So in other words, what people do with their tongue when they're saying something.

Roger - That's what we have to work with: static sounds, simple vowel sounds. Then we'll move on to dynamic sounds as the complexity of the model progresses. It would be possible to use machine learning techniques to allow the device (because it's all controlled by computer, of course) to learn the relationship between the sound output and the control structures in order to then create the sound that it wants by moving the tongue into the appropriate positions, using auditory feedback on the sound that's produced.

Chris - Will it actually speak to us? Is it possible to make it speak rather than just make movements?

Roger - Eventually it will speak. Right now it's moving but it doesn't actually make a sound at all at the moment. It will be vocalising, we estimate, in a couple of months, with the rest of the vocal tract. It would be possible to control it with a whole range of sequences of movements which would correspond to it saying something.

Chris - The ultimate end point of doing this is exactly what?

Roger - As I say, we really need to understand something about the energetics of speech. This is implicated in a lot of the so-called random variation that a lot of our automatic speech recognition devices experience right now. Although speech is highly variable it's not random at all. People are compensating typically for all kinds of situations that they find themselves in and they change their voices. For example, the way that I'm speaking to you right now is very different to the way I would speak to my family. We all know that if you're in the pub you're virtually shouting at your friends in order to be heard above the noise level. Those kinds of changes in the voice are well-known but not well-modelled. That's what we're trying to get a handle on.

Chris - Thank you, Roger. If people want to find out a bit more about your work where should they look?

Roger - Go to the Department of Computer Science website at Sheffield University. They will find myself and Robin Hofe. Robin, in fact, has set up a site which has some videos. There's a bit of an article in New Scientist this week which also points to that.
