In-corporating body language into NLP (or, More notes on the design of automated body language)
This article discusses how body language is a part of natural language, personality, and NLP design. The article covers various methods for approaching this problem and makes recommendations for the real-time generation of animation to accompany natural language for avatars and robots.
It’s hard to communicate with words. Some researchers claim that almost half of our communication relies on things that aren’t words: body language, tone of voice, and other signals that text simply doesn’t convey. This includes prosody (tone, pitch and speed of words), facial expression, hand gesture, stance and posture. This probably explains why about 40% of emails are misunderstood. As designers of robots (or avatars) we need to consider these statistics and how to integrate body language into natural language communication. Therefore Geppetto Labs has built a platform to automatically generate body language and coordinate it with what a robot (or avatar) is saying.
Most NLP systems today, be they Siri or Watson, amount to conducting chat via the thin pipe of a text interface. Siri doesn’t have a lot of choice in the matter since Apple had to simplify the complexity of communication, but this text interface reduced the communication itself. If you think that Natural Language Processing is about only text, then step away from the computer, go to a café or bar, and watch people interact for half an hour.
Videos do an excellent job of conveying the importance of body language. A great video to watch is The History Channel’s Secrets of Body Language. This documentary looks at politicians, cops, athletes, and others, interviewing experts in body language to decode everyone from Richard Nixon to Marion Jones. Gesture, expression, and tone of voice are all treated as valuable and important data channels.
This is why face-to-face meetings are so much more productive. Each party can better understand the other because there is a higher throughput of communication. In a lovers’ relationship, or in a family’s relationships, body language is even more important than in a business meeting. Consider the fact that it’s the most intimate relationships (between lovers, primarily, but also between family members, close friends, and others) that involve the most touching. These are also the relationships that rely the most on body language, because body language actually defines the proximal closeness and intimacy.
So if we want people to engage emotionally with robots, or avatars (or any other kind of character that is rigged up to an NLP system), we need to consider using body language as part of that system. We humans are hardwired that way.
At Geppetto Labs we have begun considering Body Language Processing as a subset of Natural Language Processing. So just as Natural Language Processing has NLU (understanding) and NLG (generation), we can consider Body Language to have BLU and BLG. I’ll be focusing on the generation of it, but others, such as Skip Rizzo and Noldus Information Technology, are also looking at the understanding of body language and facial expressions.
Generating body language requires coordination with the textual components of Natural Language Processing. A gesture or animation has to have the same duration of time, happen at the same moment, and include the same emotional content, or affect, as the message conveyed. “Hi” should, of course, be accompanied by a gesture that is about one second long — a friendly-looking signifier that’s commonly understood. Raising the hand and wagging it back and forth usually gets the job done. But building this can be tricky. It gets more complicated when there is a sentence like this one that doesn’t have clear emotional content, isn’t the kind of thing you hear as often as “Hi,” and is long enough that the animation needs to be at least ten seconds long.
At Geppetto Labs we’ve developed the ACTR platform in order to accomplish this. The core process, at least as it relates to text, is to generate body language (as opposed to voice output) as follows:
First, we take the bare NL text and determine three variables: Duration (timing), Affect (emotion), and Signifiers (specific gestures):
1) Duration, or timing. How long is the sound or string of text we’re dealing with? This is the easiest to calculate directly from the text. Most spoken conversation ranges between 150 and 175 words per minute, but that can speed up or slow down depending on the emotion of the speaker and other factors. But let’s call it 150 words per minute. A “word” is calculated in these kinds of standards as five characters, which is also five bytes for plain English text in ASCII or UTF-8. So that means that most of us speak at around 750 bytes per minute. Now if we back this out it means that around 12 bytes (750 / 60 = 12.5) should leave the system per second, and this is then used to calculate the duration of a given animation. We’ll call the resulting number of seconds, an integer between one and 150, a “duration tag.”
2) Affect, or emotion. What is the emotional value of that source string of text? This is the second factor we need to know in order to calculate an animation, and it’s harder than just measuring the letters in a line: it requires real-time sentiment analysis, a pre-built library that identifies the emotional content of a word, or both. One solution is WordNet-Affect. Words in WordNet-Affect are derived from Princeton’s fantastic WordNet project and have been flagged with particular meanings that indicate a range of values, most of which relate to what kind of psychosomatic reaction that word might cause or what kind of state it might indicate. Some simple examples would be happiness, fear, cold, etc. There’s a ton of really sticky material in this labyrinth of language called “affect,” and the ways that words link to one another make it all even stickier. But for this explanation, let’s say that we can take a given word and that word will fall into one of nine emotion buckets. So we give it a value from one to nine. Fear is a one. Happiness is a nine. If we then take the average affect of the words in the text string in question (again, speaking very simply) and round it, we end up with a number that represents the emotion of that sentence. We’ll call this integer between one and nine an “affect tag.”
(Before we go on I want to take a break because we now have enough to make an animation match our sentence.
“How in the world do we build that?” is an eight-word sentence, so we know the duration would be about three seconds. The affect is harder to measure, but for this example let’s say that it ends up being a value of 5. So we have Duration=3, Affect=5. These two bits of information, alone, are enough to calculate a rough animation, but first we need to build a small bucket of animations. They are probably keyframe animations, because we want to interpolate between them so that they form a chain. We make them of various durations (1 second, 2 seconds, 3 seconds, etc.) so that if we want a three-second chain we can combine 1-second and 2-second animations, or, if we want to avoid replaying the same chain, we can reverse the order of these links and play the 2-second and then the 1-second animation. And we make sure that we have these various animation links ready in separate buckets, one for each affect value. So if we get a Duration=3 and Affect=5 we go into the bucket labeled Affect #5 and dig up the animation links that add up to three seconds.
The longer the duration, the trickier it gets. If you have a twelve-second animation you might then have to chain together that two-second animation six times, or your one-second animation twelve times, to get the proper duration. Does that make sense?
No. I hope that at this point you’ve stopped and said, “Wait, no, that would be really dumb. To play an animation twelve times would just look like the character is convulsing. That’s bad body language, Mark!”
A spoonful of art and design can help. When you are combining the animation links to build a chain of proper duration, you need to avoid spasms and try to use the longest possible animation in your bucket. Then you frame it outwards with smaller animations to build the proper duration sum. If you want a twelve-second animation, then try using a ten-second link with a two-second link at the front, or with a one-second link on each end. That will avoid the spasms.
We’ve found that Fibonacci sequences work great for this because they have huge flexibility and, when interpolated together, the joins are nearly invisible. Of the approaches we have tried, this one is the simplest and most flexible, and it looks the best. So this means that you’d need to build a relatively large number of animation links whose durations in seconds follow this sequence: 1, 1, 2, 3, 5, 8, 13, 21, and so on. And you’d do that for each affect “bucket.”
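Here is one way to picture that chain-building step in Python. It is only a sketch, not ACTR's actual code: the function name build_chain and the list of link lengths are made up for illustration, and it assumes durations are whole seconds. It greedily picks the longest link that still fits, then places the longest link in the middle and frames it with the shorter ones.

```python
# Hypothetical link lengths (in seconds) assumed to exist in every affect bucket.
LINK_LENGTHS = [1, 2, 3, 5, 8, 13, 21]


def build_chain(duration, lengths=LINK_LENGTHS):
    """Return link lengths, in play order, that add up to `duration` seconds."""
    if duration < 1:
        raise ValueError("duration must be at least one second")

    # Greedy pick: always take the longest link that still fits.
    # Because a 1-second link exists, the remainder always reaches zero.
    picked = []
    remaining = duration
    for length in sorted(lengths, reverse=True):
        while length <= remaining:
            picked.append(length)
            remaining -= length

    # Frame outwards: the longest link stays in the middle and the
    # shorter links alternate between the front and the back.
    chain = []
    for i, length in enumerate(picked):
        if i % 2 == 0:
            chain.append(length)
        else:
            chain.insert(0, length)
    return chain


print(build_chain(12))  # [3, 8, 1]
```

With these assumed lengths, a twelve-second request comes back as a three-second link, then an eight-second link, then a one-second link, so nothing has to repeat. A link only repeats when the request is longer than the bucket can cover with distinct links (42 seconds or more here).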
Ok, so far so good. The next step is to play this animation and, as your character moves, watch it and ask yourself if that’s the kind of thing a person would do while saying that particular phrase.
Now back to our three variables…)
3) Signifiers, or specific animations. Is there a common gesture that normally accompanies the text? The wave of a hand that goes with the word “Hi” isn’t normally used in conversation as much as, say, nodding, or showing our palms when we speak. Most of our body language is unspecified. So specific signifiers need to be manually wired up to particular output phrases from the system. This is a hand-authored component that requires an author to flag specific lines that require specific gestures. We are developing methods for automating this, but for now it is necessary to build a few specific gestures that are used in unique moments of conversation.
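To make the three variables concrete, here is a rough Python sketch of how they could be pulled from a line of text. Everything in it is an assumption for illustration: the tiny affect lexicon is invented and merely stands in for something like WordNet-Affect (it is not that project's data or API), the signifier table is a pretend hand-authored list, and the function names are hypothetical.

```python
import re

# 150 words per minute x 5 characters per word / 60 seconds = 12.5 chars/s.
CHARS_PER_SECOND = 150 * 5 / 60

# Invented stand-in for an affect lexicon (1 = fear ... 9 = happiness).
AFFECT_LEXICON = {"afraid": 1, "terrible": 2, "fine": 6, "great": 8, "happy": 9}
NEUTRAL_AFFECT = 5  # assumed default when no word carries a score

# Pretend hand-authored table: phrases an author flagged for a specific gesture.
SIGNIFIERS = {"hi": "wave", "yes": "nod"}


def round_half_up(value):
    """Round a non-negative number to the nearest integer (0.5 rounds up)."""
    return int(value + 0.5)


def duration_tag(text):
    """Estimated speaking time in whole seconds, never less than one."""
    return max(1, round_half_up(len(text) / CHARS_PER_SECOND))


def affect_tag(text):
    """Rounded average score of the words found in the lexicon."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [AFFECT_LEXICON[w] for w in words if w in AFFECT_LEXICON]
    if not scores:
        return NEUTRAL_AFFECT
    return round_half_up(sum(scores) / len(scores))


def signifier_tag(text):
    """Name of the flagged gesture for this exact phrase, or None."""
    return SIGNIFIERS.get(text.lower().strip(" .,!?"))


def tag_text(text):
    return {
        "duration": duration_tag(text),
        "affect": affect_tag(text),
        "signifier": signifier_tag(text),
    }


print(tag_text("How in the world do we build that?"))
print(tag_text("Hi"))
```

The example sentence from earlier is 34 characters long, so it comes out at Duration=3, and since none of its words are in the toy lexicon it falls back to Affect=5 with no signifier. "Hi" gets Duration=1 and the "wave" signifier.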
Once all the variables are determined, they get packed together and our ACTR platform fires off a control file in XML that drives animations on the client side. This keeps the performance hit on the server low, allows a nearly infinite range of movements for any connected system, and provides a robot (or avatar) with the ability to gesture as it speaks.
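As a rough illustration of what such a control file might look like, here is a short Python sketch using the standard library's xml.etree.ElementTree. The actual ACTR file format isn't shown in this article, so the element and attribute names below (animation, signifier, link, seconds) and the function name build_control_xml are purely hypothetical.

```python
import xml.etree.ElementTree as ET


def build_control_xml(duration, affect, chain, signifier=None):
    """Pack the tags and the chain of link lengths into one XML string.

    The schema here is invented for illustration only.
    """
    root = ET.Element("animation", duration=str(duration), affect=str(affect))
    if signifier is not None:
        # A flagged gesture the client should play for this line.
        ET.SubElement(root, "signifier", name=signifier)
    for seconds in chain:
        # One element per animation link, in play order.
        ET.SubElement(root, "link", seconds=str(seconds))
    return ET.tostring(root, encoding="unicode")


print(build_control_xml(12, 5, [3, 8, 1]))
```

The point of the design is that the server only sends a few small numbers and names. The client holds the animation links themselves and does the interpolation, which is what keeps the server's workload low.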
Though this implementation is new, these concepts have been knocked about since the early 1600s. In 1605 Francis Bacon, in The Advancement of Learning, looked at gestures as a frame around spoken communications. John Bulwer’s Chirologia: or the Natural Language of the Hand (1644) and Gilbert Austin’s Chironomia (1806) examined the details of hand gestures in speech. Even Charles Darwin dug into it. Of course Freud and Jung looked into it, as did Julius Fast, an American author who first garnered some major interest in the topic in the early seventies with his book Body Language. More recently, there’s the work of Paul Ekman, who identified some three thousand cross-cultural facial expressions. Other people who have looked into body language include Louis Gottschalk, Erik Erikson, Charles Osgood, Otto Rank, Albert Bandura, Gordon Allport, George Kelly, Snygg and Combs, Maslow, Rogers, and Jean Piaget.
But it seems overlooked in most robotics design circles and certainly in most labs that are working on NLP. Some of the most talented natural language engineers are some of the least talented communicators. They understand the language of code better than the code of language. There is a terminal amputation that cuts the text of language from the language of the body. That’s because important elements of natural language, like body language, are easily overlooked if we focus too much on the code and not enough on the people.
Because ultimately, that’s what we roboticists are designing: a kind of people.