Anomaly detection research relies on several well-known datasets that cover different domains and data types. Commonly used datasets include the KDD Cup 1999, NSL-KDD, UCI Machine Learning Repository datasets (e.g., Thyroid, Shuttle), MNIST for image-based anomalies, credit card fraud datasets from Kaggle, and time-series datasets like Numenta Anomaly Benchmark (NAB) and Yahoo’s Webscope S5. These datasets vary in complexity, size, and application areas, making them suitable for testing algorithms in scenarios like network intrusion detection, fraud detection, system health monitoring, and image recognition.
For network security research, the KDD Cup 1999 dataset and its cleaned-up successor, NSL-KDD, which removes redundant records, are widely used to benchmark the detection of malicious connections in network traffic. Though criticized for being outdated, KDD Cup 1999 remains a benchmark due to its structured features (e.g., protocol type, connection duration) and labeled attack types. The UCI Thyroid dataset is popular in medical anomaly detection, where the goal is to identify rare thyroid disease cases from patient metrics. For industrial systems, the UCI Shuttle dataset, which records sensor readings from NASA space shuttle missions, is used to detect operational anomalies. Image-based anomaly detection often uses MNIST, where one digit class (e.g., “0”) is treated as normal and the remaining digits are treated as outliers. Kaggle’s Credit Card Fraud Detection dataset provides real-world transaction data with extreme class imbalance (492 fraudulent transactions out of 284,807, or about 0.17%), reflecting the practical challenges of fraud detection.
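The one-class-as-normal setup described for MNIST can be sketched in a few lines. This is a minimal illustration, not a standard benchmark protocol: the helper name `make_one_vs_rest`, the 5% anomaly ratio, and the toy label list (standing in for real digit labels) are all assumptions made for the example.

```python
import random

def make_one_vs_rest(labels, normal_class, anomaly_ratio, seed=0):
    """Keep every sample of `normal_class` and subsample the other classes
    so that anomalies make up roughly `anomaly_ratio` of the result.
    Hypothetical helper written for this example."""
    rng = random.Random(seed)
    normal_idx = [i for i, y in enumerate(labels) if y == normal_class]
    other_idx = [i for i, y in enumerate(labels) if y != normal_class]
    # Solve n_anom / (n_normal + n_anom) = anomaly_ratio for n_anom.
    n_anom = round(anomaly_ratio * len(normal_idx) / (1.0 - anomaly_ratio))
    n_anom = min(n_anom, len(other_idx))
    anom_idx = rng.sample(other_idx, n_anom)
    indices = normal_idx + anom_idx
    is_anomaly = [0] * len(normal_idx) + [1] * len(anom_idx)
    return indices, is_anomaly

# Toy stand-in for digit labels: 1,000 samples for each digit 0-9.
labels = [digit for digit in range(10) for _ in range(1000)]
indices, is_anomaly = make_one_vs_rest(labels, normal_class=0, anomaly_ratio=0.05)
```

With real data, `indices` would select the matching rows of the image array, and `is_anomaly` would serve as the ground-truth labels for evaluation.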
Time-series anomaly detection often employs datasets like NAB, which includes labeled anomalies in server metrics or temperature readings, and Yahoo’s Webscope S5, containing synthetic and real-world time-series data with point and contextual anomalies. Researchers also use synthetic datasets (e.g., generated with Gaussian mixtures or autoencoders) when real data is scarce or lacks diversity. When choosing a dataset, developers should consider the data type (tabular, image, time-series), anomaly ratio (e.g., 1% fraud cases), and domain relevance. For example, testing a fraud detection model on credit card data (tabular) would be more practical than using MNIST (images). Datasets with clear ground-truth labels and documented anomaly types help validate model performance effectively.
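As a rough illustration of a synthetic dataset with a controlled anomaly ratio, the sketch below draws normal points from a two-component Gaussian mixture and scatters anomalies uniformly over a wider box. The function name `make_synthetic`, the cluster centers, and the box bounds are assumptions chosen for the example, not values from any published benchmark.

```python
import random

def make_synthetic(n_samples, anomaly_ratio, seed=0):
    """Return (points, labels) where labels are 0 for normal, 1 for anomaly.
    Hypothetical generator written for this example."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (5.0, 5.0)]  # assumed centers of the normal clusters
    n_anom = round(n_samples * anomaly_ratio)
    points, labels = [], []
    for _ in range(n_samples - n_anom):
        # Normal point: pick a mixture component, then add unit Gaussian noise.
        cx, cy = rng.choice(centers)
        points.append((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)))
        labels.append(0)
    for _ in range(n_anom):
        # Anomaly: uniform over a wide box; a few may land near a cluster.
        points.append((rng.uniform(-10.0, 15.0), rng.uniform(-10.0, 15.0)))
        labels.append(1)
    # Shuffle so anomalies are not grouped at the end.
    order = list(range(n_samples))
    rng.shuffle(order)
    return [points[i] for i in order], [labels[i] for i in order]

X, y = make_synthetic(n_samples=10_000, anomaly_ratio=0.01)
print(f"anomaly ratio: {sum(y) / len(y):.3f}")  # prints "anomaly ratio: 0.010"
```

At a 1% anomaly ratio, a detector that labels every point as normal would still reach 99% accuracy, so metrics such as precision and recall are usually more informative on data like this.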