Services like Shazam perform audio matching and search by creating a unique “fingerprint” of an audio clip and comparing it against a database of precomputed fingerprints. This process involves three main stages: generating a compact representation of the audio, efficiently indexing fingerprints, and performing fast similarity searches. The goal is to identify a match even when the input audio is noisy, truncated, or recorded in suboptimal conditions.
First, the raw audio signal is converted into a spectrogram using a technique such as the Short-Time Fourier Transform (STFT), which reveals the frequency content of the audio over time. Key points in the spectrogram, such as local maxima in specific frequency bands, are identified as “landmarks.” For example, a peak at 1 kHz occurring at 10 seconds might be paired with another peak at 2 kHz occurring 2 seconds later. These landmark pairs are converted into hashes: compact numeric values that encode the time and frequency relationship between the peaks. This hashing step reduces the audio to a set of unique identifiers while tolerating minor distortions.
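The sketch below illustrates this stage in Python using SciPy’s STFT and a maximum filter to pick spectral peaks. The window sizes, peak threshold, fan-out, and the `fingerprint()` helper name are illustrative assumptions, not Shazam’s actual parameters:

```python
import numpy as np
from scipy.signal import stft
from scipy.ndimage import maximum_filter

def fingerprint(samples, sr=11025, fan_out=5):
    """Return a list of (hash, frame_time) pairs for one audio clip."""
    # 1. Spectrogram via STFT (magnitude only).
    _, _, spec = stft(samples, fs=sr, nperseg=1024, noverlap=512)
    mag = np.abs(spec)

    # 2. Landmarks: local maxima that dominate their neighborhood
    #    and exceed a simple energy threshold (heuristic).
    local_max = maximum_filter(mag, size=(20, 20)) == mag
    peaks = np.argwhere(local_max & (mag > mag.mean() * 5))  # rows: (freq_bin, frame)
    peaks = peaks[np.argsort(peaks[:, 1])]                   # sort by time

    # 3. Hash pairs of peaks: each anchor is paired with a few
    #    later peaks in its "target zone".
    hashes = []
    for i, (f1, t1) in enumerate(peaks):
        for f2, t2 in peaks[i + 1 : i + 1 + fan_out]:
            dt = t2 - t1
            if 0 < dt <= 200:  # only pair peaks that are close in time
                h = hash((int(f1), int(f2), int(dt)))  # compact identifier
                hashes.append((h, int(t1)))
    return hashes
```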
Next, the system relies on a database optimized for fast lookups. Each stored song is preprocessed to generate its fingerprint, which is indexed using an inverted index or hash table. For instance, a hash value like “A3F9” might map to a list of songs and timestamps where that hash occurs. When a user queries a clip, Shazam extracts its hashes and searches the database for matching entries. To handle scale, distributed systems or optimized data structures like suffix trees are often used. Indexing also accounts for variations—such as tempo changes—by allowing flexible time alignment between hashes during matching.
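A toy in-memory version of that inverted index is sketched below, assuming the `fingerprint()` helper from the previous snippet. The `add_song()` and `lookup()` names are hypothetical; a production system would shard this table across machines or back it with a dedicated key-value or vector store rather than a Python dict:

```python
from collections import defaultdict

# Inverted index: fingerprint hash -> list of (song_id, time_in_song).
index = defaultdict(list)

def add_song(song_id, samples, sr=11025):
    """Preprocess a stored song and index every (hash, time) pair."""
    for h, t in fingerprint(samples, sr):
        index[h].append((song_id, t))

def lookup(query_samples, sr=11025):
    """Return raw candidates: (song_id, stored_time, query_time) per hash hit."""
    candidates = []
    for h, t_query in fingerprint(query_samples, sr):
        for song_id, t_song in index.get(h, []):
            candidates.append((song_id, t_song, t_query))
    return candidates
```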
Finally, the matching algorithm identifies the most likely candidate by analyzing temporal consistency. If multiple hashes from the query clip align in time with a stored song (e.g., hashes from the query occur 5 seconds apart, matching a 5-second gap in the stored song), confidence in the match increases. Statistical methods, such as counting overlapping hashes within a time window, help filter false positives. For example, a 10-second clip might produce 100 hashes, and a match is confirmed if 70 align with a single song’s timeline. This approach balances speed and accuracy, enabling real-time identification even with partial or degraded audio inputs.
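One common way to implement this temporal-consistency check is an offset histogram: if the clip really comes from a stored song, the difference between stored time and query time is roughly constant across many hashes, so the matching song collects many “votes” at a single offset. The `min_votes` threshold below is an illustrative assumption:

```python
from collections import Counter

def best_match(candidates, min_votes=20):
    """Score candidates by how many hashes agree on one time offset."""
    votes = Counter((song_id, t_song - t_query)
                    for song_id, t_song, t_query in candidates)
    if not votes:
        return None
    (song_id, offset), count = votes.most_common(1)[0]
    return (song_id, offset, count) if count >= min_votes else None
```

Chaining `lookup()` and `best_match()` gives the end-to-end query path sketched across these three stages.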