How to Learn Computer Vision Well

Kick off the Game!

For a master’s student with a machine learning background, it’s natural to look toward neighboring domains such as computer vision (CV), natural language processing (NLP), and speech recognition. These three domains are often treated as the foundation of general AI, and large companies like Google, Facebook, and Microsoft invest heavily in them. Because I have always dreamed of building my own intelligent robot, and CV plays a crucial role in robotics, I began my computer vision research journey without hesitation.

What is Computer Vision?

According to Wikipedia, computer vision is an interdisciplinary field that deals with how computers can be made to gain high-level understanding from digital images or videos. From an engineering perspective, it seeks to automate tasks that the human visual system can do. We can understand the field and its trends better through its history, especially the philosophy of its founders.

The top prize in computer vision is the Marr Prize, named in memory of David Marr. Marr was a pioneer of computational neuroscience, and his work contributed to cognitive science and helped people understand computer vision. In his view, computer vision has three levels: model (representation), algorithm, and implementation. Current deep learning approaches mix these three levels together, so it’s hard to explain why a model works or fails, and its performance is hard to analyze theoretically. Most importantly, Marr defined computer vision as a continual computational process, like a robot equipped with vision. Computer vision is task driven: for different tasks, the same input can yield different results.

The next pioneer is King-Sun Fu, who is regarded as the founder of pattern recognition. Likewise, the King-Sun Fu Prize honors people who have made outstanding contributions to that field. Fu introduced the syntactic approach to pattern recognition. An image, too, follows specific syntactic rules, and Fu’s work provided a way to understand images systematically.

As we can see from its pioneers, computer vision is an interdisciplinary field related to cognitive science, pattern recognition, and more. Even though these disciplines are less influential than before amid the deep learning boom, they still inspire innovation in computer vision to some extent.

Computer Vision is Hard

Why is vision so difficult? In part, it’s because vision is an inverse problem, in which we seek to recover some unknowns given insufficient information to fully specify the solution. Humans and animals do this effortlessly, thanks to an advanced ability to process vague information, while computers remain error-prone.
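
To see why the information is insufficient, here is a minimal sketch assuming an idealized pinhole camera. The `project` function and the sample points are hypothetical, chosen only for illustration. Two different 3D points on the same viewing ray land on the same image coordinates, so the image alone cannot tell us which one was there.

```python
def project(point, focal_length=1.0):
    """Project a 3D point (x, y, z) onto the image plane of an ideal pinhole camera."""
    x, y, z = point
    return (focal_length * x / z, focal_length * y / z)


# Two distinct points on the same viewing ray: the second is twice as far away.
near = (1.0, 2.0, 4.0)
far = (2.0, 4.0, 8.0)

# Both projections are identical, so depth cannot be recovered from one image.
print(project(near), project(far))
```

Going from 3D to 2D is easy; going back requires extra assumptions or additional measurements, which is what makes the problem an inverse one.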

With current deep learning approaches, we can extract deeper representations, but doing so requires large-scale datasets. Our AI system is like a diligent student who works through a huge number of exercises to pass an exam. A real learner, such as a human, can tell cats from dogs after seeing them a few times, not thousands of times. Analysis-by-synthesis may be the key to this difference: a human can synthesize an object in the mind even after seeing it only once. In this way, humans generalize far better. All in all, state-of-the-art deep learning approaches still need a lot of improvement compared to humans.

But How?

I’m not a master of computer vision; the following tips are borrowed from masters including Song-Chun Zhu, Gang Hua, and Richard Szeliski.

  • Learn from its development history.

    Some ideas that appear in current papers actually come from older ones; because computing power was limited at the time, they could not be used in practice. Some terminology has changed just to attract more people.

  • Focus on three high-level approaches: Scientific, Statistical, Engineering.

    Because real-world imagery is so complex, some algorithms or models stop working under new imaging conditions. Ideally, you should make the mathematics more tractable and give a scientific explanation of the imaging conditions under which a method works and those under which it fails. From the statistical view, you should use probabilistic models to quantify the prior likelihood of your unknowns and the noisy measurement processes that produce the input images, then infer the best possible estimates. From the engineering view, you should develop techniques and the know-how to make them work well in practice.

  • Learn to distinguish what is good research.

    In computer vision, good research should solve real problems. Better research poses valuable problems or directions; this is usually the first paper on a topic. Distinguished research provides a creative, influential approach to solving a valuable problem; some of these are called the best paper. The hardest of all is the last paper, which solves the problem completely. These are three types of good research.

  • Deep learning is not everything.

    Deep learning is not everything for computer vision. You should learn mathematical tools and other algorithms. Because this is an interdisciplinary field, signal processing, optimization theory, and Bayesian statistics are required as well. Only in this way can you develop the ability to think formally.
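
To make the statistical approach from the tips above a little more concrete, here is a minimal sketch of Bayesian estimation for a single unknown, such as the true intensity of one pixel. It assumes a Gaussian prior on the unknown and independent Gaussian measurement noise; the function name `map_estimate` and all the numbers are hypothetical, chosen only for illustration. Under these assumptions the posterior is also Gaussian, so the best estimate has a closed form.

```python
def map_estimate(measurements, prior_mean, prior_var, noise_var):
    """Return the posterior mean and variance of a scalar unknown.

    Assumes a Gaussian prior N(prior_mean, prior_var) on the unknown and
    measurements equal to the unknown plus independent N(0, noise_var) noise.
    For a Gaussian posterior, the mean is also the MAP estimate.
    """
    n = len(measurements)
    # Precisions (inverse variances) of the prior and the data add up.
    posterior_precision = 1.0 / prior_var + n / noise_var
    # Precision-weighted combination of the prior mean and the measurements.
    weighted_sum = prior_mean / prior_var + sum(measurements) / noise_var
    return weighted_sum / posterior_precision, 1.0 / posterior_precision


# Hypothetical noisy readings of one pixel, with a prior centered on mid-gray.
readings = [0.62, 0.58, 0.65, 0.60]
estimate, variance = map_estimate(readings, prior_mean=0.5, prior_var=0.04, noise_var=0.01)
print(estimate, variance)
```

The estimate sits between the prior mean and the average of the readings, and it moves toward the data as more measurements arrive. Real vision problems use far richer priors and noise models, but the structure is the same: prior, measurement model, then inference.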