Important Findings that Inspire the Field of Computer Vision
“When scientists back in the 1950s met to talk about artificial intelligence, they thought that teaching a computer to play chess would be very difficult, but teaching a computer to see would be easy,” says David Knill, professor of brain and cognitive sciences.
The Building Blocks of Computer Vision
What distinguished computer vision from the already existing field of digital image processing (Rosenfeld and Pfaltz 1966; Rosenfeld and Kak 1976) was a desire to recover the three-dimensional structure of the world from images and to use this as a stepping stone towards full scene understanding.
Picking up from the last story on Hubel and Wiesel’s experiments: David Marr, a neuroscientist at MIT, wrote a book called Vision (published in 1982), a book which I consider to be the holy grail of biologically inspired computer vision research. Marr introduced and brought together different approaches, made testable predictions, provided a framework for addressing neuroscientific questions, and inspired a generation of young scientists (like myself) to study the brain and its computation.
Building on the ideas of Hubel and Wiesel (who, as we know, discovered that visual processing does not begin with holistic objects, but with information processed in layers of increasing computational complexity), Marr gave us another valuable insight: the visual system is hierarchical. He postulated that its crucial function is to create 3D representations of the environment that we can interact with, based on the information provided by the retina.
“We mistakenly think of human vision like a camera,” says Knill. “We have this metaphor of an image being cast on the retina and we tend to think of vision as capturing images and sending them to the brain, like a video camera recording to a digital tape.”
The book also proposed a computational paradigm for studying the biological visual system and introduced the notion of three distinct levels of analysis for information-processing systems: the computational theory level, the representation and algorithm level, and the implementation level, which I shall explain more explicitly in a future story. In short, the computational theory level addresses the “what” (what is the goal of the computation?), the algorithmic level describes the representations and processes by which the system solves the problem to achieve that goal, and the implementation level refers to the physical instantiation of those representations and processes (e.g. how a certain task is carried out in neurons).
This three-level framework is very important because it lets us reason about a macroscopic representation of the visual system rather than looking only at a microscopic entity such as an individual neuron. The “grandmother” neuron is one of the examples he introduced in his book. Indeed, you may see this particular neuron fire whenever the subject sees his or her grandmother, but if you focus only on this one neuron you will never know why it fires in the first place, nor what led to the recognition of the grandmother. He also uses the “bird” example to make the same point: it is very unlikely that you will work out the function of a feather, regardless of how much you know about the physiology and attributes of an individual feather. One must first establish a computational theory (in this case, the study of flight) in order to relate the physiology of the feather to its actual function.
Noam Chomsky, the famed linguist, philosopher, and cognitive scientist, also asserts that to study the brain properly one should follow David Marr’s approach: first ask what task the brain is performing before diving down into individual neurons and studying their attributes, such as synaptic strength. He brings forth the insight that one should first look for the unit of computation (such as reading or writing) rather than focusing solely on a single entity, which can get very complex.
“Vision can be understood as an information processing task which converts a numerical image representation into a symbolic shape-oriented representation” - David Marr, 1982
One of Marr’s central and best-known contributions was made at the level of representation and algorithm, where he established a representational framework for vision (Figure 1) that emphasises the task of deriving shape information from images. He proposed that the visual system generates a sequence of increasingly symbolic representations of a scene, progressing from a ‘primal sketch’ of the retinal image, through a ‘2½D sketch’, to simplified three-dimensional models of objects.
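To make the first stage concrete, a very crude ‘primal sketch’ can be approximated by marking intensity discontinuities (edges) in an image. The sketch below is my own minimal illustration using finite-difference gradients, not Marr’s actual algorithm; the threshold value is an arbitrary illustrative choice.

```python
import numpy as np

def primal_sketch(image, threshold=0.2):
    """Crude 'primal sketch': flag intensity discontinuities (edges)
    using finite-difference gradients, in the spirit of Marr's first
    representational stage. `threshold` is relative to the strongest edge."""
    gy, gx = np.gradient(image.astype(float))
    magnitude = np.hypot(gx, gy)
    if magnitude.max() > 0:
        magnitude = magnitude / magnitude.max()
    return magnitude > threshold

# A toy image: a bright square on a dark background.
img = np.zeros((8, 8))
img[2:6, 2:6] = 1.0
edges = primal_sketch(img)
```

Only the boundary of the square is flagged; the flat interior and background produce no response, which is exactly the point of a sketch that keeps discontinuities and discards uniform regions.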
Here, the intensities perceived by any visual system are a function of four main factors:
1. the geometry (meaning shape and relative placement);
2. the reflectance and absorption properties of the visible surfaces (physical properties);
3. the illumination (light sources); and
4. the camera (viewpoint, optics).
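How these four factors combine can be sketched with a minimal Lambertian shading model (my own illustrative example, not a formula from Marr’s book): the intensity at a surface point depends on its normal (geometry), its albedo (reflectance), and the light direction (illumination), while for a Lambertian surface the viewpoint affects only which points are visible.

```python
import numpy as np

def lambertian_intensity(normal, albedo, light_dir):
    """Minimal Lambertian shading: intensity = albedo * max(0, n.l).
    Geometry enters via `normal`, surface physics via `albedo`,
    illumination via `light_dir`; a Lambertian surface looks equally
    bright from all viewing directions."""
    n = np.asarray(normal, float)
    l = np.asarray(light_dir, float)
    n = n / np.linalg.norm(n)
    l = l / np.linalg.norm(l)
    return albedo * max(0.0, float(np.dot(n, l)))

# A surface facing the light is brightest; one at 90 degrees gets none.
facing = lambertian_intensity([0, 0, 1], 0.8, [0, 0, 1])
grazing = lambertian_intensity([0, 0, 1], 0.8, [1, 0, 0])
```

This also hints at why inverting the process is hard: many combinations of geometry, reflectance, and illumination can produce the same observed intensity.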
Although David Marr’s work was groundbreaking at the time, it was very abstract and high-level: it did not include any kind of mathematical model that could be used in an artificial visual system, nor did it mention any type of learning process. Nevertheless, Marr’s philosophy is still as good a guide to framing and solving problems in the field of computer vision today as it was 25 years ago.
Marr’s theories were later refined by Stephen Palmer, who proposed Palmer’s model of visual perception. The model consists of four stages, with the degree of abstraction increasing at each stage:
- Image-based Processing (low level)
- Surface-based Processing
- Object-based Processing
- Category-based Processing (high level)
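One way to picture these four stages is as successive transformations, each producing a more abstract representation than the last. The sketch below is purely illustrative; the function names and toy data are my own placeholders, not an established API or Palmer’s formalism.

```python
# Hypothetical sketch of Palmer's four-stage model; each stage takes
# the previous representation and raises the level of abstraction.

def image_based(pixels):
    # Low level: extract image features such as edges from raw pixels.
    return {"stage": "image", "features": f"edges({pixels})"}

def surface_based(rep):
    # Recover visible-surface properties: orientation, depth, discontinuities.
    return {"stage": "surface", "surfaces": rep["features"]}

def object_based(rep):
    # Group surfaces into discrete 3D objects, filling in occluded parts.
    return {"stage": "object", "objects": rep["surfaces"]}

def category_based(rep):
    # High level: assign objects to categories and relations.
    return {"stage": "category", "categories": rep["objects"]}

def palmer_pipeline(pixels):
    """Compose the four stages in order, low level to high level."""
    rep = pixels
    for stage in (image_based, surface_based, object_based, category_based):
        rep = stage(rep)
    return rep

result = palmer_pipeline("raw_image")
```

The design point is simply that each stage consumes only the output of the stage below it, which is what makes the model hierarchical.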
(Images below are taken from Szeliski’s book ‘Computer Vision: Algorithms and Applications’)
At the surface-based stage we get what Marr referred to as the ‘2½D sketch’, where the two-dimensional image is transitioned toward a three-dimensional representation. Here, the visible surfaces of objects are analysed in terms of spatial properties such as orientation, discontinuities, and depth.
At the object-based stage, a full three-dimensional representation is created in which occluded (hidden) parts are inferred and filled in.
The final, category-based stage sorts the extracted objects into groups according to their spatial properties, appearance, and other relations (e.g. a cup of a given size on a table).