
What are the seminal papers on computer vision?

Computer vision has been shaped by several foundational papers that introduced key techniques still widely used today. Three of the most influential lines of work are convolutional neural networks (CNNs), region-based object detection, and the application of transformers to visual tasks. The papers behind them established core methods for image classification, object detection, and generative modeling, forming the backbone of modern computer vision systems.

The 2012 paper “ImageNet Classification with Deep Convolutional Neural Networks” by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton demonstrated the effectiveness of CNNs for large-scale image recognition. Their model, AlexNet, achieved a significant reduction in error rates on the ImageNet benchmark compared to traditional methods. Key innovations included using ReLU activation functions for faster training, dropout layers to reduce overfitting, and leveraging GPUs for efficient computation. This work popularized CNNs and spurred rapid adoption across the field. Earlier foundational work, like Yann LeCun’s 1998 paper “Gradient-Based Learning Applied to Document Recognition” on LeNet-5, had already introduced CNNs for digit recognition, but AlexNet’s success on a larger dataset solidified their importance.
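The two regularization and speed tricks named above can be sketched in a few lines. This is an illustrative toy version, not the AlexNet implementation; note that it uses the now-common "inverted dropout" scaling, whereas the original paper instead halved activations at test time (the two are mathematically equivalent in expectation).

```python
import random

def relu(x):
    """ReLU activation popularized by AlexNet: max(0, v) per unit.
    Unlike sigmoid or tanh, it does not saturate for positive inputs,
    which made training deep networks markedly faster."""
    return [max(0.0, v) for v in x]

def dropout(x, p=0.5, training=True, rng=random):
    """Dropout: during training, zero each unit with probability p.
    This sketch uses inverted dropout (survivors scaled by 1/(1-p)),
    so no rescaling is needed at test time; AlexNet's original
    formulation instead scaled outputs at test time."""
    if not training:
        return list(x)
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in x]
```

For example, `relu([-1.0, 2.0])` yields `[0.0, 2.0]`, and with `training=False` the dropout layer passes activations through unchanged.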

For object detection, the 2014 paper “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation” by Ross Girshick et al. introduced R-CNN, which combined region proposals with CNNs. Instead of applying a CNN to an entire image, R-CNN extracted regions of interest and processed them individually. While slow, this approach significantly improved detection accuracy. Follow-up work like Fast R-CNN (2015) and Faster R-CNN (2015) optimized the pipeline by sharing computations across regions. Another milestone was the 2016 “You Only Look Once” (YOLO) paper by Joseph Redmon et al., which reframed detection as a single regression problem, enabling real-time performance. These works established the two dominant paradigms in object detection: region-based methods and single-shot detectors.
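A building block shared by both paradigms is intersection-over-union (IoU), the overlap score detectors use to match predicted boxes against ground truth and to suppress duplicate detections. A minimal sketch, assuming axis-aligned boxes in `(x1, y1, x2, y2)` form:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2).
    Used throughout detection pipelines (R-CNN-style and YOLO-style alike)
    to score box overlap for matching and non-maximum suppression."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Corners of the intersection rectangle (may be empty).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For instance, two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, giving `iou((0, 0, 2, 2), (1, 1, 3, 3)) == 1/7`.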

Recent advances include applying transformers to vision tasks. The 2020 paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” by Dosovitskiy et al. demonstrated that vision transformers (ViTs) could match or exceed CNN performance by splitting images into patches and processing them as sequences. Meanwhile, generative models like GANs, introduced in Ian Goodfellow’s 2014 paper “Generative Adversarial Networks”, enabled realistic image synthesis. For example, StyleGAN (2019) by Karras et al. refined GANs to produce high-quality, customizable images. These papers expanded the scope of computer vision beyond classification and detection, enabling tasks like image generation and cross-modal learning (e.g., combining text and images in models like CLIP).
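The "image as a sequence of words" idea in the ViT paper reduces to a simple preprocessing step: cut the image into non-overlapping 16x16 squares and flatten each into a vector, so a standard transformer can treat the patches as tokens. A minimal single-channel sketch (the real model also applies a learned linear projection and adds position embeddings, which are omitted here):

```python
def image_to_patches(image, patch=16):
    """Split an H x W single-channel image (list of rows) into
    non-overlapping patch x patch squares, each flattened into a
    vector -- the "16x16 words" a vision transformer tokenizes.
    Assumes H and W are multiples of the patch size."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            flat = [image[top + r][left + c]
                    for r in range(patch) for c in range(patch)]
            patches.append(flat)
    return patches
```

With the paper's defaults, a 224x224 image and 16x16 patches yield a sequence of 196 tokens; a toy 4x4 image with 2x2 patches yields 4 tokens of length 4 each.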
