Chris - Improved voice recognition software usually just means a bigger library of sounds to compare your voice against. Scientists at the University of Sheffield are trying to take a different approach to tackling the problem. They've done it by building an artificial tongue and jaw which replicates the movements we make during speech, so we can understand a bit more about how speech gets made. To find out a bit more we've invited Professor Roger Moore from the University of Sheffield to join us. Hello Roger.
Roger - Hello Chris. Hi, how are you?
Chris - Very well, thank you. And thank you for joining us on the Naked Scientists. Tell us a bit about your study.
Roger - This is actually the research of one of my PhD students, called Robin Hofe. He's essentially put together an animatronic tongue and vocal tract which we call AnTon. It's just beginning to move right now. It's at a very early stage of development. What we're going to be doing with this is studying the speech production mechanism in animatronic form. We'll be able to access data about how speech is produced which is very difficult to get directly from the human system.
Chris - Describe the system a bit for us. What does this tongue look like?
Roger - It actually does look like a tongue, you'll be relieved to hear. It's made from a silicone gel with a latex cover. It's been moulded on a medical model. It's embedded in a plastic skull. The skull is also derived from a real skull. We're, right now, filling in the remainder of the vocal tract in order to complete the cavity which will surround the tongue. The most important part of all is that it has to move. We are connecting motors through filaments in the tongue itself to pull the equivalent of the different muscle groups.
Chris - You can actually accurately model the muscles that are natively in the tongue in order to faithfully reproduce the movements a tongue would make when saying different words?
Roger - This is the idea, yes. We're trying to be as anatomically correct as we possibly can because this is important to the research. One of the key things that we need to find out, which is very difficult to do in any other way, is to measure the energy that is used to move the tongue into different positions.
Chris - How will you do that with this model?
Roger - It turns out that because it's a physical model, all we have to do is measure the energy being used by each of the different motors, each of which is pulling on its respective muscle group. We will therefore have a very easy way of assessing the different energy uses across the whole of the model.
Chris - How do you know which movements to make the tongue make in order to say different things?
Roger - Initially we're using some MRI scans from our colleagues in Grenoble. We're doing that to establish static positions that we will set up as targets.
Chris - So in other words, what people do with their tongue when they're saying something.
Roger - That's what we have to work with: static sounds, simple vowel sounds. Then we'll move on to dynamic sounds as the complexity of the model progresses. It would be possible to use machine learning techniques to allow the device (because it's all controlled by computer, of course) to learn the relationship between the sound output and the control structures, in order to then create the sound that it wants by moving the tongue into the appropriate positions, using auditory feedback on the sound that's produced.
Chris - Will it actually speak to us? Is it possible to make it speak rather than just make movements?
Roger - Eventually it will speak. Right now it's moving but it doesn't actually make a sound at all at the moment. We estimate it will be vocalising in a couple of months, once the rest of the vocal tract is in place. It would be possible to control it with a whole range of sequences of movements which would correspond to it saying something.
Chris - The ultimate end point of doing this is exactly what?
Roger - As I say, we really need to understand something about the energetics of speech. This is implicated in a lot of the so-called random variation that a lot of our automatic speech recognition devices experience right now. Although speech is highly variable, it's not random at all. People are typically compensating for all kinds of situations that they find themselves in, and they change their voices. For example, the way that I'm speaking to you right now is very different to the way I would speak to my family. We all know that if you're in the pub you're virtually shouting at your friends in order to be heard above the noise level. Those kinds of changes in the voice are well-known but not well-modelled. That's what we're trying to get a handle on.
Chris - Thank you, Roger. If people want to find out a bit more about your work where should they look?
Roger - Go to the Department of Computer Science website at Sheffield University. They will find myself and Robin Hofe. Robin, in fact, has set up a site which has some videos. There's a bit of an article in New Scientist this week which also points to that.