This article returns to the thread of the last few months by looking at how robots can measure our emotions and body language.
My aunt, a Tennessee tobacco grower, used to remind me that God gave me two ears and one mouth for a reason. What she meant is that a good conversationalist is not so much someone with the ability to talk as someone with the ability to listen.
Robots can take a cue from my aunt.
As I’ve detailed in previous articles, a large share of human communication is body language. Approximately sixty percent of the information we communicate in a conversation can be transcribed to text, like what you are reading here. The other forty percent is what we’ve been grouping together as “Body Language.” This amounts to how a person moves, appears, and sounds.
As roboticists and designers, our ability to build connected systems that listen to what is said and how it is said is probably the single most important aspect of our designs, at least if the system is interacting with a human. Input is critical to human-machine interaction, whether it’s a command line, word, or gesture. Robots need to pay attention to the words, movements, and sounds humans make to better understand what is being communicated.
It’s commonly understood that Natural Language Processing breaks down into 1) Natural Language Generation, and 2) Understanding. This is as true for textual language as it is for body language.
| | Understanding | Generation |
| Textual language | Semantic Preparation & Lookups | Preparation / Delivery |
| Body language | Video and Audio analysis | Animation, Voice output |
A robot can process words. It can also process gesture, posture, and tone of voice.
Psychologist Skip Rizzo is a researcher who specializes in the design, development and evaluation of virtual reality systems that are used in clinical assessment, treatment, and rehabilitation. His work wraps around the disparate worlds of psychology, cognition, and motor function in both humans and avatars. Rizzo received the American Psychological Association’s 2010 Award for Outstanding Contributions to the Treatment of Trauma and he’s the associate director for medical virtual reality at the USC Institute for Creative Technologies, in Los Angeles. These days he uses his background in psychology to bridge the worlds of video game avatars and American veterans suffering from post-traumatic stress disorder (PTSD).
I met Dr. Rizzo a few months ago in his office in Los Angeles, and we spent some time comparing notes on how to measure, interpret, and analyze both words and body language.
Skip discussed a project named Ellie – a virtual assistant avatar that patients interact with so that Ellie can build assessment models. The system – a software robot – was designed to interview patients and be able to build psychological models of the conversant.
Ellie was started two years ago when Rizzo began working with computer scientist Louis-Philippe Morency.
Funded by DARPA within a larger project called Detection and Computational Analysis of Psychological Signals, Ellie was designed to detect people in emotional distress who might be at risk for suicide. Aside from Morency and Rizzo, some participants in the project focused on the language programming, some on the visual appearance of the avatar, and some on the training of psychological cues. The crew also used a range of other means to build assessment information.
The team spent months polishing every element of Ellie’s presentation and interaction with patients, experimenting with a range of different personalities, outfits and vocal approaches including a well-timed “uh-huh” that reflects a common function of human conversation.
Under the wide screen where Ellie’s image sits, there are three devices. A video camera tracks facial expressions of the person sitting opposite. A movement sensor — Microsoft Kinect — tracks the person’s gestures, such as fidgeting or other unconscious gestures. A microphone records vocal prosody and tone of voice.
“How do you measure people?” I ask Rizzo.
He puts his feet on his desk, leans back, and crosses his hands across his stomach. “Using a webcam and a good microphone we were able to track, capture, and pick up behavioral symbols. From there we were able to make inferences from them as to whether the person was more distressed than their pure language might indicate. We were able to collect this data in face-to-face interviews with known groups of distressed people – veterans, in particular.”
“Now we have an AI version that runs completely on its own. It will try to develop a rapport, dig deeper, and try to end on a positive note. Some of the questions are projective questions like, ‘In the last few months are there things you wish you had done differently,’ or, ‘Are there things you’ve done lately that you regret,’ or ‘when was the last time you felt really happy.’” These questions were designed to be open to interpretation with no real right or wrong; a verbal Rorschach test.
“And what are the cues that build redundant data, that give you confirmation on how the person is feeling?” I ask.
“Oftentimes you get an answer that doesn’t betray much and you have to have a follow-up question… But when you look at how they said it, things like a vertical gaze, fidgeting, a delayed response, space between words, pitch, variation or lack thereof – that’s the stuff that’s great to get.”
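To make a few of those cues concrete, here is a minimal sketch of how a delayed response, the space between words, and pitch variation could be computed from a time-aligned transcript. This is not Ellie's actual code: the `Word` record, its per-word `pitch_hz` field, and the `prosody_cues` function are hypothetical names invented for illustration, and the sketch assumes some upstream speech tool has already produced word timings and pitch estimates.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Word:
    """One spoken word with timing and pitch (hypothetical input format)."""
    text: str
    start: float     # seconds from the start of the recording
    end: float       # seconds from the start of the recording
    pitch_hz: float  # assumed average pitch of the word, in hertz


def prosody_cues(question_end: float, words: list[Word]) -> dict[str, float]:
    """Compute three timing and pitch cues for a single answer."""
    if not words:
        raise ValueError("answer contains no words")
    # Silence between the end of one word and the start of the next.
    gaps = [nxt.start - cur.end for cur, nxt in zip(words, words[1:])]
    return {
        # How long the speaker waited before starting to answer.
        "response_delay": words[0].start - question_end,
        # Average silence between consecutive words.
        "mean_pause": mean(gaps) if gaps else 0.0,
        # Spread of pitch across the answer; near zero means a flat voice.
        "pitch_variation": pstdev([w.pitch_hz for w in words]),
    }
```

Numbers like these say nothing on their own. In a setup like the one Rizzo describes, they would only become meaningful once compared against interviews with known groups.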
The three primary points of data that Ellie collects are facial expression, tone of voice, and gesture – the same criteria humans use to assess other humans. There are a total of twelve vectors that offer good validity data. In some cases the interviews go for 20 minutes, usually not more than 40. The ultimate goal, on the application side, is to build a kiosk, a bit like a confessional, in which Ellie privately interviews soldiers; information from this interview can then be used by a company unit to evaluate its staff before going into battle, or to determine if a person is suitable for a particular role.
And as easily as making a PowerPoint presentation, they can start to make training cases, or the National Board of Medical Examiners can make cases for certification purposes. Rizzo’s team has done everything from exposure therapy (using simulations of warzones) that clinicians can use to deliver evidence-based care to a virtual classroom where kids with ADHD can learn how to concentrate and control their unconscious behavior. So the work has broad applications. The psychological and cognitive use of these software robots – either as patients for clinical training or as agents to help people access information and guide them in a private way towards health care – will surely change the face of not only the defense department, but the medical industry, education, and a host of other industries.
Somatic analysis, as I call it, isn’t just about capturing the aggregate number of smiles and vertical gazes… it’s also about driving the behavior of the system. Ellie can tell if you’re hesitant to reply and can make a prediction about that. So the chain of questions that she may ask is a real-time sympathetic feedback loop. If you interact with Ellie ten times you can see progression and change in her responses. She becomes more sensitive to the methods and systems of interaction.
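As a rough illustration of that kind of loop, and only an illustration, the sketch below keeps a running estimate of how hesitant the speaker seems and uses it to pick the next move. The `FeedbackLoop` class, the 0-to-1 hesitancy score, the smoothing factor, and the threshold are all assumptions made up for this example; nothing here describes how Ellie is actually built.

```python
class FeedbackLoop:
    """Toy turn-by-turn loop: smooth a hesitancy score and choose the next move."""

    def __init__(self, smoothing: float = 0.5, threshold: float = 0.6) -> None:
        self.smoothing = smoothing  # weight given to the newest observation
        self.threshold = threshold  # running level that triggers a follow-up
        self.hesitancy = 0.0        # running estimate, starts out relaxed

    def observe(self, hesitancy_score: float) -> str:
        """Update the running estimate with one answer and pick the next move."""
        if not 0.0 <= hesitancy_score <= 1.0:
            raise ValueError("hesitancy_score must be between 0 and 1")
        # Exponential moving average: recent answers count more than old ones.
        self.hesitancy = (
            self.smoothing * hesitancy_score
            + (1.0 - self.smoothing) * self.hesitancy
        )
        if self.hesitancy >= self.threshold:
            return "gentle_follow_up"
        return "next_question"
```

Because the estimate is smoothed, a single hesitant answer does not change the course of the interview, but two or three in a row do.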
This may represent an important trend in robotics design. These methods of semantic and somatic analysis are the eyes and ears that proper robotic interfaces need to have. And the trends are being reflected in the software industry. For example, Apple is beefing up its personal assistant capabilities with its $40m purchase of Cue, a company that integrates outside real-time data, similar to Google Now, Nuance’s Wintermute, or http://don.na.
This research is what we’re focused on today at Geppetto Labs (and recently at Figaro Avatars, as well). Our primary focus has been to make sure that our systems are listening to our users, able to accurately measure their emotions, and able to reflect back the emotions that are most appropriate for the task at hand.
If, for example, someone is speaking with a system to provide medical assistance, they need to feel comfortable and confident that the system is helping them. If we are able to determine that they do not feel comfortable, then we need to be able to quickly change the interaction so that we are addressing what is making them feel uncomfortable. What this means is that we measure the affect values of the words they are using and we use that to then build a goal-driven interaction that is intended to create a sympathetic feedback loop.
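A stripped-down sketch of the idea looks something like the following. The word list and its valence scores are invented for illustration (a real system would rely on a validated affect lexicon and far more context), and `mean_valence` and `choose_response_style` are hypothetical names, not functions from any product.

```python
from __future__ import annotations

import re

# Hypothetical valence scores in [-1, 1], made up for this example.
TOY_VALENCE = {
    "comfortable": 0.8,
    "confident": 0.7,
    "fine": 0.3,
    "confused": -0.5,
    "worried": -0.6,
    "scared": -0.8,
}


def mean_valence(utterance: str, lexicon: dict[str, float] = TOY_VALENCE) -> float | None:
    """Average the valence of the words we recognize; None if we recognize none."""
    words = re.findall(r"[a-z']+", utterance.lower())
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else None


def choose_response_style(utterance: str, comfort_threshold: float = 0.0) -> str:
    """Pick how the system should respond, given the affect of the user's words."""
    valence = mean_valence(utterance)
    if valence is None:
        # No affect-bearing words were found, so ask rather than guess.
        return "ask_how_they_feel"
    return "continue" if valence >= comfort_threshold else "reassure"
```

The goal here is the user's comfort: each utterance is scored, and the system changes course as soon as the score drops below the threshold.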
One group working on the Ellie project managed to gain access to the Facebook pages of people who had committed suicide, and did a complete linguistic analysis of what they said and how they said it, and then used the data to ask if there were things that could be spotted in advance to indicate if that person was at risk. They measured semantics, sentiment, syntax, affect and other vectors to build linguistic models that mapped to emotional state. Now that their DARPA funding has ended, Rizzo supposes that Facebook may be implementing some of these algorithms on their site. It’s not that people in white coats will come pounding on someone’s door if they say the magic words, but he observes that the types of ads and messages that appear on the pages of these people are pushed more towards mental health.
This is how all humans, or almost all humans, interact. We each talk with one another in a manner that hopefully provides mutual comfort and encouragement – an interaction that is designed to create closeness, confidence, and trust.
After all, it’s been said that most important debates around things like politics and religion have little to do with the words and everything to do with the emotions. Affective computing is one of the most important revolutions in user-interface design because emotion is one of the most important human motivators. And work like Rizzo’s is paving the way to understand this better.
As my aunt said, “You have two ears and one mouth for a reason.”