Smashing research teaches machines to listen
How do you teach a machine to listen, and why would you want to do so? Georgia Mills broke the ice, or in this case glass, with Chris Mitchell from the Cambridgeshire-based company Audio Analytic...
Chris - Audio Analytic does a completely new type of sound signal processing. We call it audio artificial intelligence and it looks at giving computers the ability to respond to sounds such as a baby crying, or glass breaking, as opposed to speech or music type signals. One of the shortest descriptions I’ve seen of what we do is that we’re like the Shazam for real world sounds which, I think, is a reasonably short way of describing what we do. It’s technically highly inaccurate but it’s very short.
Georgia - How are you teaching a machine to tell apart sounds?
Chris - First of all you need to get the sounds. One of the fundamental problems we have is what’s referred to as a zero data problem, which means there is no data to train a machine off. However clever AI is, it’s not clever at all if you don’t have data to train off in the first place so we had to go out and collect the data. That means literally smashing, crashing, and bashing, and beeping our way through all of the sounds that you can possibly imagine. Once we’ve collected all the data you’ve then got to go through and label it.
Then you move onto the more AI side of it, which is then trying to model these sounds and the combination and ordering in which they occur. Now that is very different in itself from speech in the fact that there’s no language associated with sounds, so you and I having this conversation is governed by a relatively formal communication structure we both understand. We know the rules around it, and that means that at a sound analysis level you can guess which sounds are going to come where based on those rules. My dog doesn’t really have a sense of those rules in any meaningful sense so that doesn’t apply. A lot of that language technology and other aspects that you would approach from the speech side just simply don’t translate and, hence, you have to build a whole new set of AI to be able to accurately detect and respond to those sounds as they occur.
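As a rough illustration of the collect, label and model steps Chris describes, the toy sketch below cuts labelled clips into short frames, measures two simple properties of each frame and assigns a new clip to the label whose average is closest. It is not Audio Analytic's method: the synthetic "tone" and "hiss" clips, the two features, the nearest-average classifier and every function name are assumptions made for illustration, and the sketch ignores the ordering of sounds that Chris mentions.

```python
import math
import random

SAMPLE_RATE = 8000  # assumed sample rate for the synthetic clips


def tone(freq, n=4096):
    """Synthetic stand-in for a labelled 'tone' recording."""
    return [math.sin(2 * math.pi * freq * i / SAMPLE_RATE) for i in range(n)]


def hiss(seed, n=4096):
    """Synthetic stand-in for a labelled 'hiss' (noise) recording."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


def frames(signal, size=256):
    """Split a clip into short, non-overlapping frames."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]


def frame_features(frame):
    """Two toy features: mean energy and zero-crossing rate."""
    energy = sum(x * x for x in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return (energy, crossings / (len(frame) - 1))


def clip_features(signal):
    """Average the frame features over a whole clip."""
    feats = [frame_features(f) for f in frames(signal)]
    return tuple(sum(f[k] for f in feats) / len(feats) for k in range(2))


def train(labelled_clips):
    """Average the clip features per label: label -> centroid."""
    grouped = {}
    for label, signal in labelled_clips:
        grouped.setdefault(label, []).append(clip_features(signal))
    return {
        label: tuple(sum(v[k] for v in vecs) / len(vecs) for k in range(2))
        for label, vecs in grouped.items()
    }


def classify(model, signal):
    """Return the label whose centroid is nearest to the clip's features."""
    feats = clip_features(signal)
    return min(model, key=lambda label: math.dist(model[label], feats))


# "Collect and label" a tiny data set, then model it.
model = train([
    ("tone", tone(200)), ("tone", tone(250)),
    ("hiss", hiss(seed=1)), ("hiss", hiss(seed=2)),
])
print(classify(model, tone(300)))
print(classify(model, hiss(seed=99)))
```

A real system would need far richer features and models, and, as Chris stresses, a great deal of real recorded data; the sketch only shows the shape of the pipeline.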
Georgia - So the process, vastly simplified, involves breaking the sounds down into tiny components and getting artificial intelligence to learn which components come from which sounds, and in what context. But I was most interested in the side of data collection, especially the sounds of smashing windows. Now a lot of this happened in a soundproof portakabin in Cambridge where I was shown their glass breaking rig…
Moray - My name’s Moray Grieve. I’m the data and QA manager of Audio Analytic.
Georgia - So I can see a sledge hammer in the back there - it looks very exciting. So I’m going to smash this window?
Moray - You’re going to smash it - correct. What we’ll do is we’ll get you kitted up in safety equipment. You’ll be behind the glass break rig so you will be as if you were a burglar breaking into the room. It may take you several attempts. It’s actually a lot harder than a lot of people think. A lot of people think you can just tap the sledge hammer to the glass and it will break but, in reality, you do have to give a reasonable amount of force in order to break it.
Georgia - I’m so excited. I cannot tell you how excited I am to swing a sledge hammer into a window. This must be the best place to work ever!
Moray - It’s great fun - it is. It’s all been a bit of an initiation for everyone when they start.
Georgia - Right. So I’d better get kitted up then.
I was wearing about five layers of protective clothing to prevent glass from shredding me, and ear defenders as it was incredibly loud. So, as fun as this was, not something to try again at home. And, believe it or not, it took seven attempts with a sledge hammer to get through a pane of the famously breakable material… glass.
But eventually… glass breaking.
Georgia - What’s the point of this learning to hear for machines?
Chris - There’s security and safety applications to what we do. If we’re selling into cameras that people store in their homes, detecting whether the windows are smashed and then sending you a text message alert is, obviously, extremely valuable. Through to modern smart speakers and understanding the sound environment in which you project sounds. If you can have an understanding of that - so that if you know you’re frying bacon, that has an impact if you’re trying to play jazz - so understanding that broader scene from a sound is very important as well.
Georgia - Oh, I see. I have this problem when I’m trying to listen to my podcasts while making dinner. So how would understanding the sounds help me, I suppose, listening to my podcast?
Chris - Speakers are generally, at the moment, pretty dumb. If it could understand that you’re making bacon - I sound like I’m obsessed with bacon but fair enough! If we go back to the bacon piece, knowing the sorts of sounds involved in that and inflecting the way the music is produced so that it has the best chance of getting over the top of that sound and around it, means that you’ll be able to enjoy your podcast while still cooking the bacon in this case.
Georgia - How accurate is this? I guess compared to humans because quite often, we mentioned babies crying, I’ve been spooked by what I thought was a baby and it turned out to be a fox in the garden screaming. Are there any sounds it’s confused by and how accurate is it?
Chris - It’s highly accurate in the metrics we use. There are, inevitably, things that will sound very similar to it and will confuse humans if you want to use that as a metric, which some people will do. For example, I have an engineer who’s extremely good at reproducing a baby crying - very, very eerie. It’s taken him many years to be able to reproduce that so I don't know how much of a practical problem that is.
We’ve got a large deployment in Europe and we saw that there was a specific type of bird that’s kept in houses, especially in the south part of France, which sounds exactly like a North American smoke alarm. They are literally identical and the guys had to spend a bunch of effort getting rid of that to make sure they could be discerned. So you do come across these instances, but generally it’s highly accurate.