Streaming systems handle late-arriving data by using event-time processing, watermarks, and configurable windowing strategies. These systems prioritize processing data based on when events actually occurred (event time) rather than when they arrive (processing time). To manage delays, they set thresholds for how long to wait for late data and update results incrementally. This ensures accurate outcomes even when data arrives out of order or after initial computations.
One core mechanism is the use of watermarks, which define the latest event time the system expects to see for a given window. For example, Apache Flink allows developers to set watermarks that tolerate a delay (e.g., 5 seconds). Any data arriving after the watermark passes the window’s end time is considered late. Systems often combine this with allowed lateness configurations, which keep windows open for an additional period (e.g., 10 minutes) to incorporate late data. During this time, results are recomputed and emitted as updates. For instance, a user activity event delayed due to network issues could still be added to a hourly session window if it arrives within the allowed lateness period. Side outputs (or “late data handlers”) are another tool, enabling developers to route late records to separate pipelines for logging or reprocessing.
Windowing strategies also play a role. Sliding or tumbling windows can be paired with triggers that fire when the watermark passes the window’s end or when late data arrives. For example, Google Dataflow uses triggers to emit partial results early and refine them as late data comes in. State management is critical here: systems must retain window state until the allowed lateness period expires. Frameworks like Kafka Streams handle this by maintaining state stores with configurable retention policies. Developers must balance latency and accuracy—longer lateness periods improve completeness but increase resource usage. By tuning watermarks, lateness thresholds, and state retention, streaming systems ensure reliable handling of real-world data delays while providing timely insights.
Zilliz Cloud is a managed vector database built on Milvus perfect for building GenAI applications.
Try FreeLike the article? Spread the word