How does the SIFT method for image feature extraction work?

The Scale-Invariant Feature Transform (SIFT) method identifies and describes distinctive local features in images, enabling robust matching across different scales, rotations, and lighting conditions. It works in four main stages: scale-space extrema detection, keypoint localization, orientation assignment, and descriptor generation.

First, SIFT builds a scale-space pyramid by repeatedly blurring and downsampling the image to detect features at multiple scales. This is done using Gaussian filters with increasing standard deviations (σ), creating blurred versions of the image. The Difference of Gaussians (DoG) is then computed by subtracting adjacent blurred images, highlighting regions with significant intensity changes (like edges or corners). Keypoints are identified as local maxima/minima in the DoG pyramid across scales. These candidate points are refined by rejecting low-contrast points (using a threshold on DoG response) and eliminating edge-like features (using a Hessian matrix to assess curvature).
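
The following sketch illustrates one octave of this first stage, assuming OpenCV and NumPy are available. The input file name, the number of blur levels, and the exact σ schedule are illustrative choices (σ = 1.6 and roughly three intervals per octave are common defaults), and the extremum check is written naively for clarity rather than speed; real SIFT additionally rejects low-contrast extrema and edge responses via the Hessian test described above.

```python
import cv2
import numpy as np

# Load a grayscale image as float32 (file name is a placeholder).
img = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)

num_scales = 5
base_sigma = 1.6                  # base blur, a common default
k = 2 ** (1.0 / 3.0)              # ratio between successive blur levels

# Build progressively blurred copies of the image (one pyramid octave).
blurred = [cv2.GaussianBlur(img, (0, 0), base_sigma * k**i)
           for i in range(num_scales)]

# Difference of Gaussians: subtract adjacent blurred images.
dog = np.stack([blurred[i + 1] - blurred[i]
                for i in range(num_scales - 1)])

def is_extremum(s, y, x):
    """Candidate keypoint test for an interior pixel: it must be the
    max or min among its 26 neighbors (8 in its own DoG level plus
    9 each in the levels above and below)."""
    patch = dog[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]
    center = dog[s, y, x]
    return center == patch.max() or center == patch.min()
```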

Next, each keypoint is assigned a dominant orientation based on local image gradients, making the feature rotation-invariant. Specifically, a histogram of gradient directions (weighted by magnitude) within a region around the keypoint is computed, and the peak orientation becomes the keypoint’s reference angle. Finally, a descriptor is created by dividing the keypoint’s neighborhood into 4x4 subregions. In each subregion, gradient magnitudes and orientations are aggregated into an 8-bin histogram, resulting in a 128-dimensional vector (4x4x8). This vector is normalized to reduce sensitivity to lighting changes, ensuring features remain consistent under varying illumination.
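
A simplified sketch of both steps is shown below. It assumes `mag` and `ang` are precomputed 16x16 arrays of gradient magnitude and orientation (in degrees, already measured relative to the dominant orientation) around a keypoint; real SIFT additionally applies Gaussian weighting and trilinear interpolation between histogram bins, which this sketch omits.

```python
import numpy as np

def dominant_orientation(mag, ang):
    # 36-bin histogram of gradient directions, weighted by magnitude;
    # the peak bin becomes the keypoint's reference angle.
    hist, _ = np.histogram(ang, bins=36, range=(0, 360), weights=mag)
    return np.argmax(hist) * 10.0   # bin index -> angle in degrees

def sift_descriptor(mag, ang):
    desc = []
    for gy in range(4):             # 4x4 grid of subregions
        for gx in range(4):
            m = mag[gy*4:(gy+1)*4, gx*4:(gx+1)*4]
            a = ang[gy*4:(gy+1)*4, gx*4:(gx+1)*4]
            # Each subregion contributes an 8-bin orientation histogram.
            hist, _ = np.histogram(a, bins=8, range=(0, 360), weights=m)
            desc.extend(hist)
    desc = np.asarray(desc, dtype=np.float32)   # 4*4*8 = 128 dimensions
    # Normalize, clip large values, and renormalize to reduce
    # sensitivity to illumination changes.
    desc /= np.linalg.norm(desc) + 1e-7
    desc = np.clip(desc, 0, 0.2)
    desc /= np.linalg.norm(desc) + 1e-7
    return desc
```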

SIFT’s strength lies in its ability to handle scale, rotation, and partial illumination changes. For instance, when stitching panoramic photos, SIFT can match features between images even if one is zoomed in or rotated. Developers often use it in applications like object recognition, 3D reconstruction, or SLAM (Simultaneous Localization and Mapping). While computationally intensive, its robustness makes it a foundational technique, though newer methods like SURF or ORB optimize for speed. OpenCV’s SIFT_create() provides an accessible implementation, allowing developers to extract and match features with minimal code.
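
For example, a minimal end-to-end pipeline with OpenCV might look like this (the file names and the 0.75 ratio threshold are illustrative):

```python
import cv2

# Load two overlapping views in grayscale (placeholder file names).
img1 = cv2.imread("scene1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("scene2.jpg", cv2.IMREAD_GRAYSCALE)

# Detect keypoints and compute 128-dimensional descriptors.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Brute-force matching with Lowe's ratio test: keep a match only if
# its best candidate is clearly better than the second-best, which
# filters out ambiguous correspondences.
bf = cv2.BFMatcher()
matches = bf.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

print(f"{len(kp1)} / {len(kp2)} keypoints, {len(good)} good matches")
```

The surviving matches can then feed downstream steps such as homography estimation for panorama stitching or pose estimation in SLAM.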
