A new and improved sarcasm detector? Yeah, right.
As frequent users of current AI assistants know, Siri and Alexa can struggle to pick up on the double meanings we encode in our speech, and it can make for some pretty unnatural interactions.
To work around the problem, Matt Coler and Shekhar Nayak used an innovative new method to train their model, feeding it heavily annotated clips from US sitcoms to capture the tone of voice and facial gestures of sarcastic asides, and the context in which they were deployed…
Matt - Let me start off by saying how excited I am to be here. And I want to underscore that I mean that sincerely, and not sarcastically. Sarcasm, like humour, irony, hyperbole, politeness and so on, is what linguists call a pragmatic phenomenon. These involve the use of language and context to express meaning that goes beyond the literal interpretation of the words. So a sentence like 'oh, what a terrific party this is' can mean very different things in different contexts, right?
James - Obviously it's directly the opposite meaning to what we're saying. And I suppose that is a big reason why we're going to struggle when it comes to AI appreciation of this phenomenon.
Matt - It comes very naturally to us as humans, right? We're pretty good, although remarkably not perfect at understanding pragmatic phenomena like irony and sarcasm. And we recognise that this meaning is not always in the words, but in the melody of the words. It's not what you say, but how you say it. And the machines around us don't have any sensitivity to this. And so for us, sarcasm isn't just about sarcasm, it's about natural human machine interaction.
James - It's not like they're going to find it funny though, is it? I mean, you might be right that my conversation with my Alexa is perhaps not as fluent as my conversations with other people. So why might we want AI assistants to understand it, other than this sort of superficially more fluid conversation?
Matt - I think I want to respond to that by reflecting on the fact that you don't ever have a conversation with Alexa, right? You have kind of pairwise question-and-answer exchanges with some little follow-up in some specific context. But the way we interact with an Alexa device is nothing like a human conversation. There's no back and forth like we're having now. The promise has been that we should be able to engage with our devices naturally. So we're trying to pioneer this more human-centred way of communicating with our machines, where humans don't adjust our language to the abilities of the machine like we usually do now, but the machines improve their ability to understand human discourse.
James - And when it comes to the challenges with helping artificial intelligence to pick up on sarcasm, and I'm coming to you here Shekhar, how can you work to overcome them?
Shekhar - With AI, you need good data. It's not always explicit sarcasm. Even the tone may not change at certain moments in time. You have to take into account the context: what was spoken earlier, what was spoken later. You can also see the speaker's facial expressions.
James - So the question becomes, with all these different modalities to consider, where do you look to find training data which might be able to synthesise all these different signposting effects into your model for it to learn from?
Shekhar - The most common datasets which we are using are TV shows, sitcoms, which have these videos, which also have the audio modality, and the text is usually transcribed. Labelling is done by human annotators, who listen to the audio and then mark each clip as either sarcastic or non-sarcastic. These datasets have further rich information: it may be on sentiment, or on other specific aspects like emotions.
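To make the kind of annotation Shekhar describes concrete, a single training example in such a multimodal dataset might look something like the sketch below. The field names and values here are purely illustrative assumptions, not the schema of any actual published corpus.

```python
# Illustrative sketch of one annotated utterance in a multimodal
# sarcasm dataset. All field names and values are hypothetical.
example = {
    "utterance": "Oh, what a terrific party this is.",
    "context": ["Preceding turns of dialogue would go here."],
    "audio_clip": "clip_0042.wav",   # tone-of-voice modality
    "video_clip": "clip_0042.mp4",   # facial-expression modality
    "label": "sarcastic",            # assigned by human annotators
    "sentiment": "negative",         # extra rich annotation
    "emotion": "frustration",        # extra rich annotation
}

# The annotation task is a binary decision per clip.
print(example["label"] in {"sarcastic", "non-sarcastic"})  # True
```

The point is that each clip carries not just a sarcastic/non-sarcastic label but the surrounding dialogue context and pointers into the audio and video streams, which is what makes multimodal training possible.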
James - Interesting. So would it be fair to say that the strength of your model comes not necessarily from the vastness of the dataset, but from the specificity and the level of detail you've gone into in annotating the training data you have used? And that's what has really supercharged this particular model?
Shekhar - Fusion of information from different modalities: text, audio, and video. That's the key, having a good fusion backend which can extract the best information from the right feature representations of each modality.
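The fusion idea Shekhar mentions can be sketched in its simplest form as "late fusion": each modality is first reduced to its own feature vector, and the vectors are then concatenated into a single representation for a downstream classifier. This is a minimal illustration under that assumption, not the team's actual architecture; all names and values are hypothetical placeholders.

```python
def fuse_modalities(text_feat, audio_feat, video_feat):
    """Simple late fusion: concatenate per-modality feature
    vectors into one representation for a classifier."""
    return text_feat + audio_feat + video_feat

# Hypothetical per-modality feature vectors (placeholder values)
text_feat = [0.2, 0.7, 0.1]    # e.g. sentence-embedding features
audio_feat = [0.9, 0.3]        # e.g. prosody features (pitch, energy)
video_feat = [0.5, 0.4, 0.8]   # e.g. facial-expression features

fused = fuse_modalities(text_feat, audio_feat, video_feat)
print(len(fused))  # 8: one combined vector for all three modalities
```

Real systems typically learn the per-modality feature extractors and the fusion step jointly, but the design choice is the same: combine complementary signals, because the sarcasm may live in the tone or the face rather than in the words.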
James - And just a quick point, can I ask for some examples of the types of TV shows? I know they weren't the only things you used as training data, but what TV shows are we talking about here?
Matt - The TV sitcoms we used were Friends, Big Bang Theory, Silicon Valley, Golden Girls, and some others.
James - I mean, thinking about those sitcoms you're talking about and the uses of sarcasm in them, I can think of Chandler Bing from Friends most prominently, for example. These are some of the clearest and most obvious examples of sarcasm perhaps ever spoken in English. So are you confident that the model will hold up if the irony is perhaps a bit more subtle?
Matt - No, and I think that's a great point. So the headline is that sarcasm detection works, but the fact is this is sarcasm performed by professional actors and written by professional writers for comedic effect, right? In contrast, most of the sarcasm that we use has none of those features. I'm not an actor, and what I say sarcastically off the cuff may or may not even be meant to be interpreted as sarcastic, and the ground truth is harder to derive. Nonetheless, the fact that the dataset exists and is annotated in a scientifically rigorous way means that it provides really good and useful training data that we could use as a first step. And indeed more work will be required to do this reliably, on the fly, for fluent spoken sarcasm by you and me. But that's something that's on our radar already. First, we want to continue working on this and look at which aspects of the multimodal approach are the most meaningful for producing accurate sarcasm readings. In other words, how little data is enough? And I think once that picture starts to come together, we'll proceed to expand our research to encompass non-scripted fluent sarcasm uttered by ordinary people.
James - Notwithstanding those caveats, what other sorts of encoded meanings that humans use in their speech, besides sarcasm, might this training method be able to decipher?
Matt - I think that's a really fun point, and one of the reasons that we're all very excited about the research is that, since this works, it's a proof of concept for how we could expand to new avenues, including, for example, humour and irony, exaggeration, politeness and others, many of which are really missing from our human-device interactions. And so far we've only been speaking about recognition, the ability of the devices to recognise our intended meaning as speakers. But we can also imagine working on synthesis, in other words, the ability of the devices to produce, for example, sarcastic, humorous, ironic text that can be understood as such by human listeners. Taken together, this workflow allows us to work on more natural human-machine interaction, but also to look at very small languages and very small datasets. This is the kind of proof of concept of working with very small data compared to other approaches, and that's also on our mind. You know, our faculty is located in the north of the Netherlands, where there's a small, under-resourced language called Frisian. And we intend to use some of these techniques to help improve speech recognition and speech synthesis for Frisian as well.