To design an ETL process that handles both batch and streaming data, start by creating separate but integrated pipelines for each data type while maintaining a unified storage layer. Batch processing handles large, static datasets at scheduled intervals (e.g., daily sales reports), while streaming processes real-time data continuously (e.g., IoT sensor feeds). Use a hybrid architecture like the “lambda architecture,” which combines batch and streaming layers, or a modern alternative like the “kappa architecture,” which uses a single stream-processing engine for both. The key is to ensure both pipelines write to a common storage system (e.g., a data lake or lakehouse) where results can be merged for querying. For example, batch jobs might calculate daily aggregates, while streaming handles minute-level metrics, with both stored in Apache Iceberg tables for consistency.
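Below is a minimal PySpark sketch of that split, assuming a hypothetical sales feed: a batch job computes daily aggregates from files, a streaming job computes minute-level aggregates from Kafka, and both write under the same lake prefix (plain Parquet here for brevity; an Iceberg or Delta table would play the same unifying role). The paths, broker address, topic name, and schema are illustrative assumptions, not part of the original design.

```python
# Sketch: batch and streaming pipelines sharing one storage layer.
# Requires the spark-sql-kafka connector package for the Kafka source.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("hybrid-etl").getOrCreate()

# Hypothetical event schema shared by both pipelines.
event_schema = StructType([
    StructField("store_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Batch layer: scheduled job over yesterday's files, producing daily aggregates.
batch_daily = (
    spark.read.schema(event_schema)
    .json("s3://my-lake/raw/sales/2024-06-01/")          # hypothetical path
    .groupBy("store_id")
    .agg(F.sum("amount").alias("daily_revenue"))
)
batch_daily.write.mode("overwrite").parquet("s3://my-lake/curated/daily_revenue/")

# Streaming layer: continuous job over a Kafka topic, producing minute-level metrics.
stream_minutely = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")    # hypothetical broker
    .option("subscribe", "sales-events")                 # hypothetical topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
    .withWatermark("event_time", "10 minutes")           # tolerate late events
    .groupBy(F.window("event_time", "1 minute"), "store_id")
    .agg(F.sum("amount").alias("minute_revenue"))
)
query = (
    stream_minutely.writeStream.outputMode("append")
    .format("parquet")
    .option("path", "s3://my-lake/curated/minute_revenue/")
    .option("checkpointLocation", "s3://my-lake/checkpoints/minute_revenue/")
    .start()
)
```

Both outputs land under `s3://my-lake/curated/`, so a query engine that reads the lake sees the daily and minute-level views side by side.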
Choose tools that support batch and streaming natively. Apache Spark (with Structured Streaming) or Apache Flink can process both data types using similar APIs, reducing code duplication. For ingestion, use Kafka for streaming and tools like Sqoop or AWS Glue for batch. Store raw data in cloud storage (e.g., S3) or a distributed file system (HDFS), and use a metastore (e.g., Hive) to unify metadata. Implement idempotent writes and exactly-once processing in streaming (via Kafka transactions or Flink checkpoints) to avoid duplicates. For transformations, create reusable modules (e.g., shared Python libraries or SQL templates) that apply consistent business rules. For example, a user-session calculation written once as a shared function can be registered as a UDF in both a daily Spark batch job and a Flink streaming job, keeping the logic aligned across pipelines.
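A short sketch of that reuse pattern with Spark's unified DataFrame API: one transformation function (hypothetical names and paths) applied unchanged to a batch read and a streaming read.

```python
# Sketch: a single business-rule transformation shared by batch and streaming jobs.
from pyspark.sql import DataFrame, SparkSession, functions as F

def apply_order_rules(orders: DataFrame) -> DataFrame:
    """Business rules defined once and reused by both pipelines."""
    return (
        orders
        .withColumn("amount_usd", F.col("amount") * F.col("fx_rate"))
        .filter(F.col("amount_usd") > 0)
    )

spark = SparkSession.builder.appName("shared-rules").getOrCreate()

# Batch: scheduled daily job over historical files.
raw_batch = spark.read.parquet("s3://my-lake/raw/orders/")      # hypothetical path
daily_out = apply_order_rules(raw_batch)

# Streaming: the same function over a file-based stream
# (streaming file sources require an explicit schema).
raw_stream = (
    spark.readStream
    .schema(raw_batch.schema)
    .parquet("s3://my-lake/landing/orders/")                    # hypothetical path
)
live_out = apply_order_rules(raw_stream)
```

Because Structured Streaming reuses the batch DataFrame API, the same function works against both inputs; with Flink, the equivalent approach is packaging the logic as a shared library or SQL template that both jobs import.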
In practice, a retail company might use Kafka to ingest real-time website clicks and AWS Glue to pull daily order data from a SQL database. A Flink streaming job enriches clicks with user data, while a nightly Spark batch job aggregates orders. Both outputs land in Delta Lake tables, partitioned by date. A serving layer like Trino queries both datasets, combining real-time clickstreams with historical orders for analytics. Orchestration tools like Airflow can trigger batch jobs and monitor streaming pipelines via Kubernetes operators. To handle late-arriving data, streaming pipelines use event-time windows with watermarks, while batch jobs backfill corrections. This approach balances low-latency insights with accurate historical reporting, using shared infrastructure to reduce complexity.
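The orchestration piece might look like the following Airflow sketch, assuming Airflow 2.x with the Apache Spark provider installed; the DAG id, job script path, connection id, and the health-check body are hypothetical placeholders.

```python
# Sketch: Airflow DAG that triggers the nightly Spark batch job and runs a
# lightweight health check on the streaming pipeline.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator


def check_streaming_lag(**_):
    # Placeholder: in practice, query Kafka consumer-group lag or the Flink
    # REST API and raise an exception if the streaming job has fallen behind.
    pass


with DAG(
    dag_id="nightly_order_aggregation",
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",   # 02:00 daily; older Airflow versions use schedule_interval
    catchup=False,
) as dag:
    aggregate_orders = SparkSubmitOperator(
        task_id="aggregate_orders",
        application="/opt/jobs/aggregate_orders.py",   # hypothetical job script
        conn_id="spark_default",
    )
    streaming_health = PythonOperator(
        task_id="check_streaming_lag",
        python_callable=check_streaming_lag,
    )
    aggregate_orders >> streaming_health
```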