Computers That Recognize What They See

It is easy to take the sense of sight for granted. We simply open our eyes and see. But, in fact, the process of capturing the light that falls onto the retina of each eye and transforming it into an understanding of the physical world in front of us is nothing short of amazing.
Typically, when we walk into a room and look around, we identify virtually every object in our field of vision within just a few hundred milliseconds, regardless of the lighting or our angle of sight. Yet, despite decades of advancements in sensor technology and artificial intelligence, the visual ability that comes so easily to any sighted human is still well beyond the capability of computers.
The hurdle has not been in designing computers to see, that is, to capture light and translate the photons into an electronic pattern. Any $20 webcam can do that. The challenge has been in creating computers that can recognize and understand what it is that they are seeing. In fact, the processing that we do so quickly is an incredible feat for a machine and requires an enormous amount of computing compressed into a tiny sliver of time.
Since computers excel at pattern recognition and iterative processing, scientists once believed that machines were perfectly suited to this task. But the fit has never materialized, because there is much more to the task than simply looking for patterns in the pixels.
For example, beyond simply crunching numbers, machine vision requires the computer to make sense of ambiguous visual data about objects, while separating any movement of those objects from the movement of the observer, or the movement of the other items in the room.
Another problem is the enormous degree of variation in images. That variation has become the Achilles' heel of every optical recognition algorithm. Why? Because a computer algorithm looks at an object as a pixel outline. When the object or the observer moves even just a bit, the computer code "sees" it as a totally new thing.
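To make that brittleness concrete, here is a minimal sketch in Python. It compares a crude binary silhouette to a slightly rotated copy of itself using a naive pixel-overlap score; the shape, sizes, and threshold are invented purely for illustration and do not reflect any particular recognition system.

```python
# A toy demonstration of why pixel-level matching is brittle:
# a small rotation of the same shape sharply lowers a naive overlap score.
import numpy as np
from scipy.ndimage import rotate

# A crude L-shaped binary silhouette standing in for an object outline.
template = np.zeros((60, 60))
template[10:50, 10:20] = 1
template[40:50, 10:45] = 1

# The same object "seen" from a slightly different angle.
rotated = rotate(template, angle=15, reshape=False, order=0)

def overlap_score(a, b):
    """Fraction of pixels that still line up (intersection over union)."""
    inter = np.logical_and(a > 0.5, b > 0.5).sum()
    union = np.logical_or(a > 0.5, b > 0.5).sum()
    return inter / union

print(overlap_score(template, template))  # 1.0, a "perfect" match
print(overlap_score(template, rotated))   # noticeably lower, despite the same object
```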
A human can recognize a desktop keyboard, for instance, at any angle and in virtually any light. We also recognize other versions of keyboards, such as the ones on smartphones and laptops. For a computer, on the other hand, recognizing a phone's keypad as a type of keyboard is tough enough. Turn the phone to a side view, and the new angle will invariably stump it.
This is a significant challenge, because for computers to be useful as stand-alone visual tools or as part of a robotic system, they'll need to recognize objects in a wide variety of lighting conditions, and from the many different angles they will encounter in the real world.
That's not the only challenge. Current algorithms operate by statistically matching the outlines of an object to ones stored in a database. Yet there is still no way to make sure that a computer focuses specifically on what it needs to measure when trying to identify an object. For instance, when viewing a picture of a car with a mountain as a backdrop, the computer may zero in on the mountain when it is the car we would like identified.
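As a rough illustration of that matching step, the following sketch reduces an image to a histogram of edge orientations and labels it with the closest stored signature. The features, images, and distance measure are assumptions made for illustration; they are not taken from any of the systems cited in this article.

```python
# A toy version of "statistical outline matching": summarize an image as a
# histogram of edge orientations, then pick the nearest stored example.
import numpy as np

def edge_orientation_histogram(image, bins=8):
    """Histogram of gradient directions, a crude stand-in for an outline signature."""
    gy, gx = np.gradient(image.astype(float))
    angles = np.arctan2(gy, gx).ravel()
    magnitudes = np.hypot(gx, gy).ravel()
    hist, _ = np.histogram(angles, bins=bins, range=(-np.pi, np.pi), weights=magnitudes)
    return hist / (hist.sum() + 1e-9)

def nearest_label(query_image, database):
    """Return the label whose stored signature is closest to the query's signature."""
    q = edge_orientation_histogram(query_image)
    return min(database, key=lambda label: np.abs(q - database[label]).sum())

# Two crude synthetic "objects": a vertical bar and a diagonal ramp.
bar = np.zeros((32, 32))
bar[:, 14:18] = 1.0
ramp = np.fromfunction(lambda i, j: (i + j) / 64.0, (32, 32))

database = {"bar": edge_orientation_histogram(bar),
            "ramp": edge_orientation_histogram(ramp)}
print(nearest_label(bar, database))  # "bar"
```

Note that nothing in this scheme tells the computer which part of the picture matters, which is exactly the car-versus-mountain problem described above.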
Another difficult task for computers is perceiving in 3-D, which is something we do effortlessly. We accomplish this by separating an object into segments, such as seeing a hand as a palm and fingers. This way, it does not matter how these segments move relative to each other; we can still see that it's a hand.
Existing computer models can only do this if they know ahead of time how many segments an object has. Needing this type of prior information limits the usefulness of the system for general-purpose applications.
That brings up an additional challenge that causes computers to fail when applying pattern-matching techniques outside a limited context, like a factory floor. Most methods focus on similarities in shapes, colors, and composition. This can be successful for making exact or very close matches within a limited domain.
But those same methods are typically rendered useless when faced with matching objects in different domains. For example, different domains often involve seemingly minor factors, such as:
• Images collected at different seasons of the year.
• Images viewed under different lighting conditions.
• Images observed in different types of media, such as photographs, color paintings, or black-and-white sketches.
As noted earlier, these domain differences seldom cause problems for human observers. But so far, computers are not capable of comparing similar objects in different domains.
Fortunately, technological advances are slowly addressing these challenges, moving computers closer to the point where robotic systems will be able to "look at" the world around them and know what's going on.
One recent advance is a neural network that mimics key visual functions of the human brain.1 Neuroscientists and cognitive scientists studied the visual systems of advanced mammals, primates, and humans. Their findings were then incorporated into neural networks and used to refine mobile robots that were created by a team of computer scientists and roboticists. The goal of this collaboration was to devise a system that a robot could use to move quickly and safely through a crowded room.
The scientists tracked how different areas of the human brain interacted as people performed tasks, hoping to understand how our visual system scans our environment, detecting objects and distinguishing between the movement of objects and our own movement. They used this as a model to build a network with three levels that mimic the brain's primary, mid-level, and higher-level visual subsystems.
Early results have been promising, with a robot making very controlled and deliberate movements toward specific objects, all directed by the neural network rather than pre-programming.
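For readers curious about what a three-level processing hierarchy might look like in code, here is a minimal, untrained sketch. The layer sizes, random weights, and labels are pure assumptions made for illustration; the research system described above is far more sophisticated and is trained rather than randomly initialized.

```python
# A toy three-stage feed-forward pipeline loosely echoing the primary,
# mid-level, and higher-level visual subsystems described above.
import numpy as np

rng = np.random.default_rng(0)

def stage(x, weights):
    """One processing stage: a linear combination followed by a nonlinearity."""
    return np.maximum(0.0, weights @ x)

w_primary = rng.normal(size=(256, 1024))  # "primary": edge-like responses from pixels
w_mid     = rng.normal(size=(64, 256))    # "mid-level": contours and motion cues
w_high    = rng.normal(size=(8, 64))      # "higher-level": object- and scene-level signals

pixels = rng.random(1024)                 # a flattened 32x32 camera frame
response = stage(stage(stage(pixels, w_primary), w_mid), w_high)
print(response.shape)                     # (8,), e.g. candidate steering signals
```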
Another research team is working on a new approach designed to help robots identify objects, which they can then manipulate.2 The core of this research involves "machine learning," where a computer is programmed to observe events and generalize based on commonalities in what it sees.
For example, it might start by observing a wide variety of cups to determine what they have in common. It is subsequently able to identify new cups based on the characteristics they share with the cups it has already observed.
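The following sketch captures the spirit of that generalization step with toy data. The features and numbers are invented, and a real machine-learning system would learn far richer representations on its own.

```python
# A toy "observe many cups, then generalize" classifier: each object is reduced
# to a few features, and new objects take the label of the nearest class centroid.
import numpy as np

# Features: [height/width ratio, has_handle (0 or 1), open_top (0 or 1)]
cups     = np.array([[1.2, 1, 1], [1.0, 1, 1], [1.4, 0, 1], [1.1, 1, 1]])
non_cups = np.array([[0.2, 0, 0], [3.0, 0, 0], [0.9, 0, 0]])

centroids = {"cup": cups.mean(axis=0), "not a cup": non_cups.mean(axis=0)}

def classify(obj):
    """Assign the label of the nearest class centroid."""
    return min(centroids, key=lambda label: np.linalg.norm(obj - centroids[label]))

print(classify(np.array([1.3, 1, 1])))  # a never-seen mug: "cup"
print(classify(np.array([0.3, 0, 0])))  # a plate-like object: "not a cup"
```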
Another feature of the machine learning system is the ability to discern the presence of objects through their contextual relationships.3 Using this capability, a computer might rapidly and reliably identify a keyboard, regardless of the lighting and angle of vision, by recognizing that the keyboard is located under a computer monitor.
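A hypothetical sketch of how such context could reinforce a weak detection appears below. The labels, scores, and boost value are invented for illustration and are not drawn from the cited research.

```python
# A toy context rule: a borderline "keyboard" hypothesis becomes convincing
# once a monitor has been confidently detected directly above it.
def contextual_score(object_score, context_detections, boost=0.3):
    """Raise a raw detector score when an expected contextual cue is present."""
    monitor_above = any(
        det["label"] == "monitor" and det["position"] == "above"
        for det in context_detections
    )
    return min(1.0, object_score + boost) if monitor_above else object_score

scene = [{"label": "monitor", "position": "above", "score": 0.92}]
print(contextual_score(0.55, scene))  # 0.85: weak evidence, strengthened by context
print(contextual_score(0.55, []))     # 0.55: no supporting context
```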
To address the difficulties computers have in perceiving three-dimensional shapes, two novel techniques have been developed. They are called "heat mapping" and "heat distribution."
Both techniques use principles borrowed from physics and involve mathematical equations that relate to how heat diffuses over surfaces. These techniques capitalize on the fact that heat captures the precise contours of a shape when it diffuses over it.
For example, when a computer employs heat mapping, it breaks the object it "sees" into a mesh of triangles, and then calculates the flow of heat over the object. These calculations generate a signature called a histogram, which enables the computer to recognize the object regardless of its configuration.
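To give a feel for the computation, the sketch below simplifies the triangle mesh to a small vertex graph, diffuses heat using the graph Laplacian, and histograms how much heat each vertex retains. It is a loose analogy to the published technique, with every detail assumed for illustration.

```python
# A toy heat-diffusion signature: diffuse heat for a fixed time over a graph
# standing in for the mesh, then histogram the heat retained at each vertex.
import numpy as np

def heat_signature_histogram(adjacency, t=0.5, bins=10):
    """Histogram of the heat remaining at each vertex after diffusing for time t."""
    degree = np.diag(adjacency.sum(axis=1))
    laplacian = degree - adjacency
    eigvals, eigvecs = np.linalg.eigh(laplacian)
    # Diagonal of the heat kernel exp(-t * L): heat retained at each vertex.
    retained = (eigvecs ** 2) @ np.exp(-t * eigvals)
    hist, _ = np.histogram(retained, bins=bins, range=(0.0, 1.0))
    return hist / hist.sum()

# A tiny four-vertex "shape" (a square of connected vertices) as a stand-in mesh.
square = np.array([[0, 1, 0, 1],
                   [1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [1, 0, 1, 0]], dtype=float)
print(heat_signature_histogram(square))
```

Because the signature depends on how heat spreads over the surface rather than on a fixed pixel outline, it changes very little when the shape is posed differently, which is the property that makes it useful for recognition regardless of configuration.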
In light of this trend, we offer the following three forecasts:
First, over the next 5 to 10 years, the slow but steady progress in improving computer vision will lead to a wide range of practical applications.
One of the earliest and most common uses will be to simplify searches for images. Finding information about an object based on its appearance rather than a textual description will become as easy as snapping a picture and waiting for the results. Facial recognition software will also enable someone to take a picture of a stranger on the street, match his or her face against a database of images, and even be led to that person's profile on Facebook.
Second, improved machine vision will enhance robots' ability to navigate, which will increase their value for use in dangerous and hard-to-reach locations.
This improved perceptual ability will also make them useful and reliable in assisted-living situations where they will be used to perform basic tasks for the disabled or elderly.
Third, within 15 years, three-dimensional machine recognition will open the doors for unique and beneficial applications.
3D search engines will aid in finding mechanical parts in online databases. Medical imaging will be greatly enhanced in 3D, as will multimedia gaming, the use of military drones, and animating characters for movies. This 3D image recognition capability will also help in science and engineering applications where recognizing patterns is required.
References List:
1. To access an article about designing a robotic vision system that mimics key functions of the human brain, visit the ICT Results website at: http://cordis.europa.eu
2. To access information about machine learning, visit the Purdue University website at: http://www.purdue.edu
3. For more information about designs that help robots identify objects, visit the Cornell University website at: http://www.news.cornell.edu