How do watermarking techniques work in stream processing?

Watermarking in stream processing is a mechanism to track event time progress and handle late data in systems where events arrive out of order. Event time refers to the timestamp when an event actually occurred, not when it was processed. A watermark marks the point in event time up to which the system expects all events to have arrived. For example, once a watermark reaches timestamp T, the system assumes no events with timestamps earlier than T will arrive afterward. This allows the processing engine to finalize computations (like window aggregations) for events up to T. The assumption is a heuristic, not a guarantee: stragglers with timestamps earlier than T can still show up, and the system treats them as late data. Watermarks are essential for balancing latency and accuracy in real-time systems.
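To make the idea concrete, here is a minimal sketch of one common heuristic: the watermark trails the largest timestamp seen so far by a fixed allowed delay. The class name `BoundedOutOfOrdernessWatermark` and its methods are hypothetical names chosen for illustration, not any framework's API, and timestamps are plain seconds.

```python
from dataclasses import dataclass


@dataclass
class BoundedOutOfOrdernessWatermark:
    """Tracks event-time progress as (max timestamp seen) - (allowed delay)."""

    max_delay: float                      # seconds of out-of-orderness tolerated
    max_timestamp: float = float("-inf")  # largest event timestamp seen so far

    def observe(self, event_timestamp: float) -> None:
        # Only a newer timestamp advances the watermark; it never moves backward.
        self.max_timestamp = max(self.max_timestamp, event_timestamp)

    @property
    def current(self) -> float:
        return self.max_timestamp - self.max_delay

    def is_late(self, event_timestamp: float) -> bool:
        # An event is late if the watermark has already passed its timestamp.
        return event_timestamp < self.current


wm = BoundedOutOfOrdernessWatermark(max_delay=10.0)
for ts in [100.0, 105.0, 103.0, 120.0]:   # 103.0 arrives out of order
    wm.observe(ts)

print(wm.current)         # 110.0
print(wm.is_late(108.0))  # True
print(wm.is_late(115.0))  # False
```

The out-of-order event at 103.0 does not pull the watermark back, which is what lets downstream operators treat it as a monotonic clock.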

A practical example involves processing user clickstream data. Suppose events arrive with timestamps indicating when a user clicked a button, but network delays cause some events to arrive late. Without watermarks, the system might wait indefinitely for all events, delaying results. With watermarks, the system could set a threshold (e.g., “wait 10 seconds for late data”). If the latest event timestamp seen is 3:00:00 PM, the watermark is 2:59:50 PM. A 1-minute window ending at 3:00 PM fires once the watermark reaches 3:00 PM, that is, once an event timestamped 3:00:10 PM or later has been seen. The system then emits the aggregation for that window, even if a few clicks belonging to it arrive afterward. Those late events can be dropped or handled separately, such as being routed to a side output for special processing.
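The clickstream scenario can be simulated in a few lines. This is a simplified sketch rather than how any particular engine implements windows: timestamps are integer seconds, `process_clicks` is a hypothetical helper, and a plain list stands in for the side output.

```python
from collections import defaultdict

WINDOW_SIZE = 60   # one-minute tumbling windows, in seconds
MAX_DELAY = 10     # tolerated out-of-orderness, in seconds


def process_clicks(timestamps):
    """Count clicks per window; return (emitted results, late events)."""
    open_windows = defaultdict(int)   # window start -> running click count
    results = []                      # (window_start, count), in firing order
    late = []                         # stand-in for a side output
    max_ts = float("-inf")

    for ts in timestamps:
        window_start = ts - ts % WINDOW_SIZE
        watermark = max_ts - MAX_DELAY
        if window_start + WINDOW_SIZE <= watermark:
            # The window this event belongs to has already fired.
            late.append(ts)
            continue
        open_windows[window_start] += 1
        max_ts = max(max_ts, ts)
        watermark = max_ts - MAX_DELAY
        # Fire every window whose end the watermark has reached.
        for start in sorted(open_windows):
            if start + WINDOW_SIZE <= watermark:
                results.append((start, open_windows.pop(start)))
    return results, late


clicks = [5, 42, 58, 63, 55, 71, 59, 130]
results, late = process_clicks(clicks)
print(results)  # [(0, 4), (60, 2)]
print(late)     # [59]
```

The click at 55 arrives out of order but is still counted, because the watermark (53) has not yet reached the end of its window. The click at 59 arrives after the event at 71 pushed the watermark to 61, so its window has already fired and it goes to the late list.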

Implementing watermarks requires configuring policies for generating them. In frameworks like Apache Flink, developers can choose between periodic watermarks (emitted at fixed intervals) or punctuated watermarks (triggered by specific events in the stream). For instance, Flink’s BoundedOutOfOrdernessTimestampExtractor (superseded in newer Flink releases by WatermarkStrategy.forBoundedOutOfOrderness) keeps the watermark a fixed interval (e.g., 5 seconds) behind the highest timestamp seen so far to tolerate out-of-order events. Developers must balance the watermark delay: too short a delay causes valid late data to be dropped or excluded from results, while too long a delay increases processing latency. Additionally, idle data sources (e.g., a sensor temporarily offline) can stall watermarks, because an operator reading several inputs can advance its watermark only as far as the slowest input. Some systems therefore use heartbeat or idleness-timeout mechanisms to advance watermarks even during inactivity; Flink, for example, offers WatermarkStrategy.withIdleness. Properly tuning watermarks ensures timely results while minimizing inaccuracies from late-arriving data.
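The idle-source problem is easier to see with a small model. The sketch below assumes the combined watermark is the minimum over per-source watermarks and that a source is ignored once it has been silent for longer than a timeout. `MultiSourceWatermark`, its parameters, and the explicit `now` argument (processing time in seconds) are all hypothetical and chosen for illustration; real engines differ in the details.

```python
class MultiSourceWatermark:
    """Combines per-source watermarks, skipping sources that have gone idle."""

    def __init__(self, sources, max_delay, idle_timeout):
        self.max_delay = max_delay          # out-of-orderness allowance (event time)
        self.idle_timeout = idle_timeout    # silence before a source counts as idle
        self.max_ts = {s: float("-inf") for s in sources}
        self.last_seen = {s: 0.0 for s in sources}  # processing time of last event
        self.emitted = float("-inf")        # last watermark handed downstream

    def observe(self, source, event_ts, now):
        self.max_ts[source] = max(self.max_ts[source], event_ts)
        self.last_seen[source] = now

    def current(self, now):
        active = [s for s in self.max_ts
                  if now - self.last_seen[s] < self.idle_timeout]
        if active:
            # The slowest active source bounds overall progress.
            candidate = min(self.max_ts[s] for s in active) - self.max_delay
            # Never let the emitted watermark move backward.
            self.emitted = max(self.emitted, candidate)
        return self.emitted


wm = MultiSourceWatermark(["a", "b"], max_delay=5, idle_timeout=30)
wm.observe("a", event_ts=100, now=0)
wm.observe("b", event_ts=90, now=0)
print(wm.current(now=1))    # 85: held back by source "b"
wm.observe("a", event_ts=200, now=40)
print(wm.current(now=40))   # 195: "b" is idle, so it no longer holds things back
wm.observe("b", event_ts=120, now=45)
print(wm.current(now=45))   # 195: the watermark never moves backward
```

One consequence of this design is visible in the last step: when source "b" resumes with older timestamps, its events may already be behind the watermark and would be treated as late. That is the trade-off an idleness timeout makes in exchange for not stalling the whole pipeline.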
