Some new results from the NGV Team at the University of Michigan describe different approaches to perception (detecting obstacles on the road) and localization (figuring out precisely where you are). Ford helped fund some of the research, so it issued press releases and the work got some media coverage. Here’s a look at what they propose.
Many hope to be able to solve robotics (and thus car) problems with just cameras. LIDAR is going to become cheap, but it isn’t yet, and cameras are much cheaper. I outline many of the trade-offs between the systems in my article on cameras vs lasers. Everybody hopes for a breakthrough in computer vision that makes vision systems reliable enough for safe operation.
The Michigan lab’s approach is an unusual machine vision one. They map the road in advance, in 3D and visible light, using a mapping car equipped with lots of expensive LIDAR and other sensors. From that they build a 3D representation of the road similar to what you need for a video game engine, and with the use of GPUs they can render a 2D image of what a camera should see from any given point.
The car goes out into the world and its actual camera delivers a 2D frame of what it sees. Their system then compares that frame with generated 2D images of what the camera should see until it finds the closest match. Effectively, it’s as if you looked out a window, then went into a video game and wandered around looking for a spot whose view matched the one out that window; once you find it, you know where the window is.
Of course it is not really “wandering”; they develop efficient search algorithms to quickly find the location that looks most like the real-world image. We’ve all seen video game images and know they only approximate the real world, so nothing will be an exact match, but if the system is good enough, there will be a “most similar” match that also agrees with what other sensors, like your GPS and your odometer/dead-reckoning system, tell you about where you probably are.
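To make the idea concrete, here’s a toy sketch in Python of what search-by-rendering might look like. This is my own illustration, not the team’s code: the render_view() function (which would rasterize the prior 3D map from a candidate camera pose) and the similarity score are placeholders, and a real system would use a far smarter search than this brute-force grid.

```python
# Illustrative sketch of localization-by-rendering; not the Michigan team's actual code.
# Assumes a hypothetical render_view(map_model, pose) that rasterizes the prior 3D map
# into the 2D image a camera at that pose should see.
import numpy as np

def similarity(real_img, synth_img):
    """Score how alike two grayscale images are (higher = more alike).
    Normalized cross-correlation is used here purely as a placeholder metric."""
    a = real_img - real_img.mean()
    b = synth_img - synth_img.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def localize(camera_frame, map_model, prior_pose, render_view,
             search_radius_m=2.0, step_m=0.25,
             yaw_steps=np.radians([-2, -1, 0, 1, 2])):
    """Search candidate poses near the GPS/odometry prior and return the pose
    whose rendered view best matches the live camera frame."""
    best_pose, best_score = prior_pose, -np.inf
    offsets = np.arange(-search_radius_m, search_radius_m + step_m, step_m)
    for dx in offsets:                      # lateral offset from the prior
        for dy in offsets:                  # longitudinal offset from the prior
            for dyaw in yaw_steps:          # heading offset from the prior
                candidate = (prior_pose[0] + dx, prior_pose[1] + dy, prior_pose[2] + dyaw)
                synth = render_view(map_model, candidate)   # what the camera *should* see here
                score = similarity(camera_frame, synth)
                if score > best_score:
                    best_pose, best_score = candidate, score
    return best_pose, best_score
```

A real implementation would also fold the GPS and dead-reckoning estimates in more formally (for example, as a probabilistic filter) rather than simply centering a grid search on them.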
Localization with cameras has been done before, but this is a new approach taking advantage of new generations of GPUs, so it’s interesting. The big challenge is simulating the lighting, because the real world is full of varied lighting, high dynamic range, and shadows. The human visual system has no problem understanding a stripe on the road as it passes through the shadow of a tree, but computer systems have a pretty tough time with that. Sun shadows can be modeled well with GPUs, but shadows from things like the moving limbs of trees can’t be simulated, nor can the shadows of other vehicles and road users. At night, light and shadows come from car headlights and urban lights. The team is optimistic about how well they will handle these problems.
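One standard way to make the image comparison less brittle under changing light is to score matches with an information-theoretic measure such as normalized mutual information, which cares about the statistical relationship between the two images rather than their absolute brightness. Here’s a rough sketch of such a score; I’m not claiming this is exactly what the Michigan system computes.

```python
# Rough sketch: comparing a camera frame to a rendered view with normalized
# mutual information (NMI). Because NMI depends on the joint statistics of
# pixel intensities rather than their absolute values, a globally brighter or
# darker scene can still score as a good match. Illustrative only.
import numpy as np

def normalized_mutual_information(img_a, img_b, bins=32):
    """NMI = (H(A) + H(B)) / H(A, B) for two grayscale images of equal size."""
    joint_hist, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    joint_p = joint_hist / joint_hist.sum()
    pa = joint_p.sum(axis=1)            # marginal distribution of img_a intensities
    pb = joint_p.sum(axis=0)            # marginal distribution of img_b intensities

    def entropy(p):
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    h_ab = entropy(joint_p.ravel())
    return (entropy(pa) + entropy(pb)) / h_ab if h_ab > 0 else 0.0
```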
The much larger challenge is object perception. Once you have a simulation of what the camera should see, you can notice when things are present that are not in the prediction, like another car or pedestrian, or a new road sign. (Right now their system is mostly looking at the ground.) Once you identify the new region, you can attempt to classify it using computer vision techniques, and also by watching how it moves against the expected background.
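As a toy illustration of that “compare against the prediction” step, here’s a sketch that flags blobs of pixels where the live frame disagrees with the rendered expectation. The thresholds, the connected-components call, and everything else here are my own illustrative choices, not the team’s pipeline.

```python
# Illustrative sketch of perception-by-prediction: flag pixels where the live
# camera frame disagrees strongly with the rendered expectation, then keep
# connected blobs big enough to matter. Not the team's actual pipeline.
import numpy as np
from scipy import ndimage

def unexpected_regions(camera_frame, predicted_frame, diff_thresh=0.25, min_pixels=200):
    """Return bounding boxes of regions where reality differs from the map's prediction."""
    diff = np.abs(camera_frame.astype(float) - predicted_frame.astype(float))
    diff /= diff.max() if diff.max() > 0 else 1.0        # normalize difference to [0, 1]
    mask = diff > diff_thresh                            # pixels that don't match the map
    labels, n = ndimage.label(mask)                      # group mismatches into connected blobs
    regions = []
    for i in range(1, n + 1):
        blob = labels == i
        if blob.sum() >= min_pixels:                     # ignore tiny speckle
            ys, xs = np.nonzero(blob)
            regions.append((xs.min(), ys.min(), xs.max(), ys.max()))
    return regions   # candidate obstacles to classify, or to track against the expected background
```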
This is where it gets challenging, because the bar is very high. To be used for driving, it must effectively always work. Even if you miss only 1 pedestrian in a million, you have a real problem, because a billion drivers encounter billions of pedestrians every day. This is why people love LIDAR: if something (other than a mirror or sheet of glass) sufficiently large is sufficiently close to you, you’re going to get laser returns from it, and not from what’s behind it. It delivers the reliability numbers that are needed.
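To put that bar in perspective, here is the back-of-envelope arithmetic, with loose order-of-magnitude assumptions of my own for the encounter counts:

```python
# Back-of-envelope on why "rarely fails" isn't good enough.
# All numbers are rough illustrative assumptions, not measurements.
drivers = 1e9                  # order-of-magnitude count of drivers worldwide
encounters_per_driver = 10     # assumed pedestrian/obstacle encounters per driver per day
miss_rate = 1e-6               # "miss 1 in a million" detection failures

encounters_per_day = drivers * encounters_per_driver     # ~1e10 encounters per day
missed_per_day = encounters_per_day * miss_rate
print(f"~{missed_per_day:,.0f} missed detections per day")  # ~10,000 per day
```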
The challenge of vision systems is to meet that reliability goal.
This work is interesting because it does a lot without relying on AI “computer vision” techniques. It is not trying to look at a picture and recognize a person. Humans are able to look at 2D pictures with bizarre lighting and still tell you not just what the things in the picture are, but often how far away they are and what they are doing. While we can be fooled by a 2D image, once you have a moving, dynamic world, humans are generally reliable enough at spotting other things on the road. (Though of course, with 1.2 million dead each year, and probably 50 million or more accidents, the majority because somebody was “not looking,” we are far from perfect.)
Some day, computer vision will be as good at recognizing and understanding the world as people are — and in fact surpass us. There are fields (like identifying traffic signs from photos) where they already surpass us. For those not willing to wait until that day, new techniques in perception that don’t require full object understanding are always interesting.
I should also point out that while lowering cost is of course a worthwhile goal, it is a false goal at this time. Today, maximal safety is the overriding goal, and as such, nobody will actually release a vehicle to consumers without LIDAR just to save its estimated 2017 cost, which will be sub-$500. Only later, when cameras get so good that they can completely replace LIDAR’s safety capabilities for less money, would people release such a system to save cost. On the other hand, improving cameras to be used together with LIDAR is a real goal: superior safety, not lower cost.
A version of this article originally appeared on robocars.com. Want to learn more about the University of Michigan research cited in this article? Check out the research paper by Ryan Wolcott and Ryan Eustice (Best Paper Award at IROS 2014!), as well as Wolcott’s IROS 2014 Webcam research pitch: