The ability for machines to “visually understand the world around them” is the final piece of the puzzle needed for true artificial intelligence, according to a researcher at one of the Australian Research Council's centres of excellence.

“For an artificially intelligent agent or a robot to be able to operate in its environment, it needs to be able to do three things,” said Stephen Gould, chief investigator at the ARC Centre of Excellence for Robotic Vision.

“The first thing is to be able to see, the second thing is to be able to think, and the third thing is to be able to act.”

Gould said that we’ve long possessed the algorithms needed for AI to think, reason and make decisions.

“And the machines that they’re reasoning on are getting faster thanks to Moore’s Law, and we now have distributed computing and specialised architecture that can run these algorithms,” he added.

He also said that today's robotic actuators allow for very precise movements, and “are used in factories, in your own vehicles, and are essentially based on mature technologies.”

“The piece that’s missing is the perception piece – the ability to get machines to visually understand the world around them.

“If we want our future robots to be able to work with humans, they need to be able to understand and perceive the world the way we do.”

Deep learning has caused an ‘AI Spring’

Gould said that the AI research community is currently enjoying an ‘AI Spring’, where innovation is occurring at a rapid pace.

“We’ve been in an AI winter for a very long time, where artificial intelligence essentially stagnated, and there wasn’t very much progress going on,” he explained.

He said the tide started to shift in 2012, when a combination of three things occurred:

  1. Access to massive amounts of data. “Researchers from Stanford University released what was called ImageNet, which was a database of 14 million images that had all been tagged, allowing other researchers to train their algorithms better.”
  2. Better hardware. “We got hardware that was capable of processing these 14 million images, and could run our algorithms at very high speeds.”
  3. Deep learning. “There were improvements to an old technology formerly called artificial neural nets, which has now been relabelled as deep learning.”

The convergence of these factors led to the ImageNet Large Scale Visual Recognition Challenge, a competition launched by the research community in which a machine had to correctly classify an image as one of a thousand different object categories, according to Gould.

“In 2012, we saw a sudden jump in the accuracy of these systems and that’s through the use of the deep learning models,” he explained.

“Since then, performance has steadily improved, and now the ability for machines to recognise one of these thousand different objects exceeds that of a human.”
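To make that recognition task concrete, here is a minimal sketch of ImageNet-style classification using an off-the-shelf pretrained network. The choice of model (ResNet-50 via torchvision) is an illustrative assumption, not the specific system Gould describes; the point is simply that the network maps a single image to a score over the same thousand object categories used in the challenge.

```python
# Minimal sketch (assumes PyTorch and torchvision are installed, plus internet
# access to download the pretrained weights). Illustrates the ILSVRC task:
# assign one of 1,000 ImageNet labels to a single image.
import torch
from torchvision import models
from torchvision.models import ResNet50_Weights

weights = ResNet50_Weights.IMAGENET1K_V2          # ImageNet-pretrained weights
model = models.resnet50(weights=weights).eval()   # a standard deep CNN classifier
preprocess = weights.transforms()                 # resize, crop and normalise as at training time

# In practice this would be a real photograph loaded with PIL; a random tensor
# stands in here so the sketch runs without any external file.
image = torch.rand(3, 500, 375)
batch = preprocess(image).unsqueeze(0)            # shape: [1, 3, 224, 224]

with torch.no_grad():
    logits = model(batch)                         # one score per ImageNet class
probs = logits.softmax(dim=1)
top_prob, top_idx = probs.max(dim=1)
print(weights.meta["categories"][int(top_idx)], float(top_prob))
```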

These advances, Gould said, have driven progress in object detection, semantic segmentation and depth perception within static images, none of which would have been possible without deep learning.
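As an illustration of what that per-pixel understanding looks like in practice, the following hedged sketch runs a pretrained semantic segmentation model (again an assumed, off-the-shelf torchvision model rather than the Centre's own work), which assigns a class label to every pixel instead of a single label for the whole image.

```python
# Hedged illustration of semantic segmentation: label every pixel, not just
# the image. Assumes torchvision is available; the model and weight names are
# one common choice, used here purely for demonstration.
import torch
from torchvision.models.segmentation import fcn_resnet50, FCN_ResNet50_Weights

weights = FCN_ResNet50_Weights.DEFAULT
model = fcn_resnet50(weights=weights).eval()
preprocess = weights.transforms()

image = torch.rand(3, 480, 640)                   # stand-in for a real photo
batch = preprocess(image).unsqueeze(0)

with torch.no_grad():
    out = model(batch)["out"]                     # shape: [1, num_classes, H, W]
per_pixel_class = out.argmax(dim=1)               # one class label per pixel
print(per_pixel_class.shape, weights.meta["categories"][:5])
```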

As further progress is made, Gould expects these advances to carry over from static images to video, where machines will eventually be able to understand actions and intent, bringing real-time human-AI interaction a step closer.