This article looks at how the robotics industry of today is following in the footsteps of the personal computer industry of yesterday, and why Natural Language Processing, like the Graphical User Interface, plays a key role in this industry-wide evolution.
As Bill Gates knows, the robotics industry of today is copying the steps of the home computer industry from the ’80s. Those big, clunky, professional, industrial robotics systems are getting small, slick, personal, and homey. It’s why roboticists around the world are trying to figure out how to make robots “user friendly.” But there’s at least one simple solution.
If you and I were to sit down and have a talk at, say, a coffee shop, we’d be doing what may be the single most important thing robotics engineers can learn from: talking.
Back in the early 1980s, Apple’s Macintosh computer was a gummy old crankshaft of a system that took a couple of minutes to test memory, initialize the OS, and load the Finder. One day, according to legend, Steve Jobs approached Larry Kenyon, the engineer heading system development, and told him it booted too slowly.
Jobs said, “How many people are going to be using the Macintosh per day? A million? Five million?”
Poor Mr. Kenyon, a man working with an eight-megahertz CPU and 128K of RAM, looked at him and agreed.
Jobs continued, “If you shave 10 seconds off of that boot time and multiply that by five million users, you have fifty million seconds, every single day. Over a year, that’s dozens of lifetimes. So if you make it boot ten seconds faster, you’ve saved a dozen lives. That’s really worth it, don’t you think?”
These days, the robotics industry is in the same boat. If we are able to improve system speed we can save lifetimes. While most industrial manufacturing robots need some typing and command-line editing, even more time-saving benefits can come to home systems for health care, service, and entertainment, like Aldebaran’s NAO models. These robots are now programmed and controlled with some voice input, some 3D model manipulation, some driver commands, and some API text input. By consolidating these things with Natural Language Processing (NLP) and voice input, we can save heartbeats.
Consider how long input and output take with text. Most of us type about 40 words per minute and read about 250. Those are the input and output speeds of written language.
But most of us talk at about 150 words per minute and comfortably listen at about the same speed. Those are the input and output speeds of spoken language.
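To make the comparison concrete, here is a small back-of-the-envelope sketch in Python. It uses only the four rates quoted above; the function names and the 20-word command and reply are made up for illustration.

```python
# Words-per-minute figures quoted in the article.
RATES_WPM = {"type": 40, "read": 250, "speak": 150, "listen": 150}


def seconds_for(words: int, mode: str) -> float:
    """Seconds needed to move `words` words through one channel."""
    return words / RATES_WPM[mode] * 60


def round_trip(words_in: int, words_out: int, spoken: bool) -> float:
    """Seconds to give a command and then take in the machine's reply."""
    if spoken:
        return seconds_for(words_in, "speak") + seconds_for(words_out, "listen")
    return seconds_for(words_in, "type") + seconds_for(words_out, "read")


if __name__ == "__main__":
    # Illustrative exchange: a 20-word command and a 20-word reply.
    print(f"typed and read:   {round_trip(20, 20, spoken=False):.1f} s")
    print(f"spoken and heard: {round_trip(20, 20, spoken=True):.1f} s")
```

For that exchange the text round trip comes to roughly 35 seconds and the spoken one to 16. By these figures the saving is all on the input side, since reading is faster than listening, so the advantage is largest when commands and replies are short.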
If we want to save time we should be talking with our computers, be they desk machines, hand-held mobile devices, or robots. It just doesn’t make sense for us to type to them any more, especially if they have a small screen. Most robots, after all, don’t have a keyboard and monitor. Voice interface and Natural Language Processing provide a deep and simple interface that also saves time.
But the robotics industry has a much more interesting, and far more difficult, problem than saving time. What we really need to do is improve emotional engagement with the system.
Apple has always used a human face on their computers because users need to relate – emotionally – to their machines. Apple knew that saving time was important but, more critically, it also realized that the Mac had to be accessible, useful, and understandable. It had to be emotionally engaging. These ideas were central to the design of the most successful home computer ever made. Whether it was the metaphors or the graphics, Apple consistently bent their back to the task of making computers “User Friendly.” We need to do the same with robots.
A couple of years ago I had the pleasure of meeting Dr. Hiro Hirukawa, the director of the Intelligent Systems Research Institute (ISRI), the group dedicated to robotics research at AIST, a Japanese national research center.
In the early 1990s ISRI’s non-humanoid robots (armatures, rolling carts, automated cameras, etc.) were functioning quite profitably in factories and warehouses throughout Japan. They were bolting bolts onto cars, guarding hallways in buildings, stacking boxes in storage centers, and so forth. But everyone expected to see androids, humanoid-shaped robots, not rolling boxes. Especially the researchers’ kids.
Dr. Hirukawa explained to me, “Our question was, ‘What good is a humanoid?’” After all, what could be a worse design for a robot? Put the center of gravity up high on a system that is barely able to balance, give it two little pegs to stumble about on, then attach a gripping system which, because it can’t be retracted into the body, has to be countered against whenever asymmetric five-tentacled pincers are used. Oh, and give it a thing called a “head”, which has no function whatsoever since any sensory apparatus can just as easily be put in the feet or stomach. What does a robot need a head for, anyway? Or a face, for that matter?
It appeared as if an android was a design aberration, an anomaly in which the function followed the form. But it turns out this isn’t the case. The form still follows the function if you recognize that the function of an android is to engage with people at an emotional level. A human-like shape simply supports this activity.
Dr. Hirukawa built the first android so that people could emotionally identify with the machine. His work with Paro (that seal-shaped robot for elderly healthcare) led him and his team to realize that the degree to which we can emotionally engage with robots will be proportional to the quality and quantity of their use.
Whether it is the Uncanny Valley hypothesis or the visual design of a desktop metaphor, computer science and robotics design cannot overlook the importance of emotional engagement. Give the robot a human shape, and people will like it more. Good design includes an emotional relationship. Both Jobs and Hirukawa believed this, acted on it, and made substantial inventions as a result.
Unlike Jobs, Hirukawa did not implement, on the face or chest of the android, a GUI. There is no keyboard on an android (or most other computers, for that matter), there is no screen, and there is no mouse. It is designed for natural language.
Let’s get back to talking, and that espresso. If you and I are sitting across from each other in a café, discussing, say, systems design, and there are a couple of cups of coffee on the table, two keyboards, two monitors, and we’re typing to each other via terminal applications, then there’s something going horribly wrong between us geeks. If we’re occupying the same time-space continuum of the café, it’s most effective to be talking. And it’s more fun, too.
We’ve got this great, old, delicate tool called natural language, and it’s probably the most powerful technology humans have ever invented. We should put it to use when we can.
NLP increases emotional engagement, broadens functionality and increases operational depth. Conversational interfaces allow debugging, redundancy checking, error modification, self-correction, other-correction, and a host of other system-level functionalities that we do every day when we ask things like, “What?” or “Did that make sense?” or “Do you follow?” or “Do you really need another espresso?”
On a functional level, NLP does for robotics what the GUI did for home computers. It allows non-technical and non-professional users, both young and old, to engage. In terms of where humans could interface with robots, think: home, school, and hospital. Seen in this light, NLP is a presentation-level system component that allows easy interface to other applications. Just like the GUI.
Apple’s Siri, IBM’s Watson, and Google Now have already shown some of the potential.
You and I are sitting in the café, and a robot is nearby. I wave it over and ask it for some milk with my coffee. It comes back a few seconds later and puts a little pitcher on the table. It turns out that processing the linguistic request is an easier problem to solve than putting the pitcher on the table.
Here’s how it’s done.
First, a voice interface module records the sound of “I’d like some milk, please,” and sends it to a server. That server ploughs through a pile of stored sounds and picks out the ones that match most closely. This can get tricky, since background noise, my accent, my age, my gender, and the speed at which I speak all affect what the system can parse. The process requires recording a big batch of possible words for all the above variables (which Nuance has already done, or you can build your own with Sphinx, or you can use some of Google’s data). That big pile of voice recordings then gets checked and maybe improved, and then the system basically decides that the words it heard are equal to a collection of text strings (such as “I’d” and “like” and “some” and “milk” and “please”).
Next, that string of text gets analyzed. Now that we’ve derived the words from the sound, we have to derive the meaning from the words. This involves lexical preparation, match lookup, and testing. In short, the system has to grammatically parse the phrase, find the key ideas, and match them to some existing or evolving set of data that it already owns. These days Siri and Watson and other systems do this in very similar ways: mostly hand-scripted top-down design (as opposed to bottom-up, which is learned).
A tighter lexical range – a tighter context – makes for a better system. We know, for example, that it is not a good idea to ask a barista for an Apple Macintosh. Same with the café robot. That robot will be waiting for words within a limited range of meanings (coffee, tea, milk, etcetera), and if the Natural Language Processing (NLP) system is properly tuned and tested, then the requests that people make will be included in the system’s knowledgebase. This is what allows the system, especially in hand-scripted top-down designs, to turn text into meaning. So well-designed NLP systems should, at least in the coming decade, work within tight social contexts if we want them to perform as well as possible.
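As a rough illustration of why a tight context helps, here is a minimal Python sketch of a hand-scripted, top-down matcher. Everything in it is an assumption made for the example: the intent names, the trigger words, and the tiny lexicon are invented, and a real knowledgebase would be far larger and would involve actual grammatical parsing.

```python
import re

# Hypothetical café knowledgebase: each intent lists the words that trigger it.
CAFE_LEXICON = {
    "bring_milk": {"milk", "cream"},
    "bring_coffee": {"coffee", "espresso"},
    "bring_tea": {"tea"},
}


def tokenize(text: str) -> list[str]:
    """Lowercase the utterance and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def match_intent(text: str, lexicon=CAFE_LEXICON):
    """Return the intent whose trigger words overlap the utterance most.

    Returns None when nothing in the utterance falls inside the lexicon,
    which is how an out-of-context request gets rejected.
    """
    words = set(tokenize(text))
    best, best_hits = None, 0
    for intent, triggers in lexicon.items():
        hits = len(words & triggers)
        if hits > best_hits:
            best, best_hits = intent, hits
    return best


if __name__ == "__main__":
    print(match_intent("I'd like some milk, please"))
    print(match_intent("One Apple Macintosh, please"))
```

The request for milk lands on an intent because "milk" is in the lexicon, while the request for a Macintosh returns nothing at all. The smaller the range of meanings, the fewer ways there are for the match to go wrong.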
Anyway, at this point the café robot has heard and understood the request for milk and needs to be able to bring it over with a smile and a curtsy. We shall not address such difficult problems as the system drivers for smiles, curtsies, or carrying milk in this article.
An NLP interface doesn’t need to happen in the coffee shop. Maybe I’m driving and I want my car to find a café nearby. An example would be, “Find coffee in San Francisco,” which would create an environmental context by loading regional (city / state / country) dictionaries and business category dictionaries into an NLP unit. This would produce a system output that locates all business categories supplying coffee in the area defined by “San Francisco” as the location. It’s the same problem, different scale. We built a thing like that this year, in fact.
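Here is a minimal sketch of that idea in Python, assuming two tiny hand-built dictionaries. The dictionary contents, the function names, and the output shape are all hypothetical and are not the system mentioned above; they only show how a location dictionary and a business-category dictionary together set the context for a query.

```python
import re

# Hypothetical regional dictionary (city / state / country).
REGION_DICT = {
    "san francisco": {"city": "San Francisco", "state": "CA", "country": "US"},
    "portland": {"city": "Portland", "state": "OR", "country": "US"},
}

# Hypothetical business-category dictionary: query term -> categories.
CATEGORY_DICT = {
    "coffee": ["cafe", "coffee shop", "roastery"],
    "tea": ["tea house", "cafe"],
}


def contains_phrase(text: str, phrase: str) -> bool:
    """True if `phrase` appears in `text` as whole words."""
    return re.search(r"\b" + re.escape(phrase) + r"\b", text) is not None


def parse_query(text: str) -> dict:
    """Resolve a spoken query into a location and a list of categories."""
    lowered = text.lower()
    location = None
    for name, info in REGION_DICT.items():
        if contains_phrase(lowered, name):
            location = info
            break
    categories = []
    for term, cats in CATEGORY_DICT.items():
        if contains_phrase(lowered, term):
            for cat in cats:
                if cat not in categories:
                    categories.append(cat)
    return {"location": location, "categories": categories}


if __name__ == "__main__":
    print(parse_query("Find coffee in San Francisco"))
```

The result pairs the San Francisco entry with the categories that supply coffee, which is the input a business lookup would then need. Matching on whole words keeps "tea" from firing on a word like "steakhouse".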
But regardless of the application, geography, or coffee bean, the core process is roughly the same:
- Record the voice input
- Link the voice to words
- Link the words to meaning
- Link the meaning to response
- Check it and see if it’s making sense
- Link the response to the action, data, or voice output
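The steps above can be sketched as a chain of small functions. This is only a skeleton under stated assumptions: the recording and speech-recognition stages are hard-coded stubs (a real system would call a microphone and a recognizer there), and the lookup tables and function names are invented for the example.

```python
from typing import Callable, Optional

# Hypothetical lookup tables for the café robot.
MEANINGS = {"milk": "request_milk", "coffee": "request_coffee"}
RESPONSES = {"request_milk": "fetch_milk_pitcher", "request_coffee": "fetch_coffee"}


def record_voice() -> bytes:
    """Stub: stands in for a microphone capture."""
    return b"<audio: I'd like some milk, please>"


def voice_to_words(audio: bytes) -> list[str]:
    """Stub: stands in for a speech recognizer; the result is hard-coded."""
    return ["i'd", "like", "some", "milk", "please"]


def words_to_meaning(words: list[str]) -> Optional[str]:
    """Link words to meaning: first word found in the lexicon wins."""
    for word in words:
        if word in MEANINGS:
            return MEANINGS[word]
    return None


def meaning_to_response(meaning: Optional[str]) -> Optional[str]:
    """Link the meaning to a planned response."""
    return RESPONSES.get(meaning)


def makes_sense(meaning: Optional[str], response: Optional[str]) -> bool:
    """Sanity check before acting."""
    return meaning is not None and response is not None


def run_pipeline(
    listen: Callable[[], bytes] = record_voice,
    recognize: Callable[[bytes], list[str]] = voice_to_words,
) -> str:
    audio = listen()
    words = recognize(audio)
    meaning = words_to_meaning(words)
    response = meaning_to_response(meaning)
    if not makes_sense(meaning, response):
        # Link the response to voice output: ask the speaker to try again.
        return "say: Sorry, could you repeat that?"
    # Link the response to an action.
    return "do: " + response


if __name__ == "__main__":
    print(run_pipeline())
```

Passing the listening and recognition stages in as parameters is a design choice made for the sketch: it lets the rest of the chain be exercised without any audio hardware.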
Just as the home computer industry of the 1980s went from a command-line user interface to a GUI, the home robotics industry of today is transitioning from a GUI to a natural language interface.
The personal robotics industry is moving in the same direction that the personal computing industry of the 1980s did. Voice input is replacing keyboards, NLP is replacing GUIs, and psychologists now play the role that graphic designers did, back when information architecture and user experience were new terms for the web design crew. Robotic systems that use natural language will become faster, simpler, easier to understand, more effective, and available to more people. The main reason is that robots do not generally come equipped with a keyboard and monitor. And as users of systems as prevalent as iOS and Android become accustomed to voice interface, it will soon be an expected feature.
If I were to ask that little coffee robot for milk, and it were to reply, “We have soy milk, non-fat, half-and-half, goat’s milk, or this week’s special, lukewarm yak milk,” I might be a bit surprised, but I wouldn’t be impressed. After all, it would just be playing a recording.
But I would be impressed if I said, “Whatever,” and it took off without saying another word. After all, sometimes less is more.
Natural language just works this way. Always has, always will.