Apache Flume is a distributed system for efficiently collecting, aggregating, and moving large volumes of streaming data from many sources to a centralized data store. Its architecture is built around agents, independent JVM processes that each move data through three core components: sources (data ingestion), channels (intermediate buffering), and sinks (data delivery). Because the channel sits between source and sink, producers are decoupled from consumers, so a source can keep accepting events while a sink is slow or temporarily unavailable, until the channel fills. This buffering, together with the ability to run many agents in parallel, is the basis of Flume's fault tolerance and scalability.
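The decoupling role of the channel can be illustrated with a small model. The following is a minimal sketch in plain Python, not Flume code: the Channel class, run_source, run_sink, and the capacity value are all hypothetical names chosen for illustration, and a real agent is configured rather than programmed this way.

```python
from collections import deque


class Channel:
    """Bounded in-memory buffer standing in for a Flume channel (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._events = deque()

    def put(self, event):
        # A full channel rejects the event; the source must back off or retry.
        if len(self._events) >= self.capacity:
            return False
        self._events.append(event)
        return True

    def take(self):
        # Returns None when there is nothing to deliver.
        return self._events.popleft() if self._events else None

    def __len__(self):
        return len(self._events)


def run_source(lines, channel):
    """Ingest lines into the channel; return the ones that did not fit."""
    return [line for line in lines if not channel.put(line)]


def run_sink(channel, store, batch_size):
    """Deliver up to batch_size events from the channel to the store."""
    for _ in range(batch_size):
        event = channel.take()
        if event is None:
            break
        store.append(event)


channel = Channel(capacity=3)
store = []  # stands in for a centralized store such as HDFS

# The source runs while the sink is idle: events accumulate in the channel.
rejected = run_source(["e1", "e2", "e3", "e4"], channel)
# The sink catches up later, without the source being involved.
run_sink(channel, store, batch_size=2)
```

In this model the source and sink never call each other. The only shared object is the channel, and its capacity determines how long ingestion can continue while delivery is stalled.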
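A Flume agent is configured with a properties file that declares its sources, channels, and sinks and binds them together. For example, a source could follow a web server's log file (an exec source running tail -F) or receive events over HTTP (the HTTP source). The exec source cannot guarantee delivery if the agent or channel fails, so the spooling directory or Taildir source is the usual choice when that matters. The source places each ingested event into one or more channels, which act as buffers. Channels can be memory-based (fast, but contents are lost if the agent process dies) or file-based (slower, but events survive a restart). A sink then takes events from the channel and forwards them to a destination such as HDFS, HBase, Kafka, or another Flume agent. For reliability, both sides of the channel are transactional: a source's put is committed only once the channel has accepted the events, and a sink's take is committed, removing the events from the channel, only after the destination confirms delivery. If a sink fails to write to HDFS, for instance, the transaction is rolled back and the events stay in the channel for a later retry. These guarantees are only as strong as the channel: a file channel protects against agent crashes, whereas a memory channel does not. Because failed batches are retried, delivery is at-least-once and duplicates are possible.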
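The sink-side behavior, where an event leaves the channel only after delivery is confirmed, can be sketched as a take that is staged until commit. This is a simplified model in plain Python under stated assumptions, not Flume's actual transaction API: the Channel class, drain_once, and flaky_hdfs_write are hypothetical, and the simulated outage stands in for a failed HDFS write.

```python
from collections import deque


class Channel:
    """Toy channel whose takes are staged until commit (illustrative only)."""

    def __init__(self):
        self._queue = deque()
        self._in_flight = []

    def put(self, event):
        self._queue.append(event)

    def take(self):
        # Stage the event: it leaves the queue but is not yet discarded.
        if not self._queue:
            return None
        event = self._queue.popleft()
        self._in_flight.append(event)
        return event

    def commit(self):
        # Delivery confirmed: the staged events can be dropped for good.
        self._in_flight.clear()

    def rollback(self):
        # Delivery failed: return staged events to the head, in original order.
        self._queue.extendleft(reversed(self._in_flight))
        self._in_flight.clear()

    def __len__(self):
        return len(self._queue)


def drain_once(channel, deliver, batch_size=2):
    """Run one sink transaction; return True if the batch was committed."""
    batch = []
    try:
        for _ in range(batch_size):
            event = channel.take()
            if event is None:
                break
            batch.append(event)
        deliver(batch)
        channel.commit()
        return True
    except IOError:
        channel.rollback()
        return False


delivered = []
attempts = {"count": 0}


def flaky_hdfs_write(batch):
    # Hypothetical destination: fails on the first call, succeeds afterwards.
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise IOError("simulated HDFS outage")
    delivered.extend(batch)


channel = Channel()
for event in ["a", "b", "c"]:
    channel.put(event)

first = drain_once(channel, flaky_hdfs_write)   # fails, batch is rolled back
remaining_after_failure = len(channel)          # nothing was lost
second = drain_once(channel, flaky_hdfs_write)  # retry delivers the same batch
```

The failed attempt leaves all three events in the channel, and the retry delivers the same two-event batch in its original order. If the destination had accepted the write but the confirmation had been lost, the retry would deliver the batch twice, which is why this style of retry gives at-least-once rather than exactly-once delivery.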
Developers can extend Flume with custom sources, sinks, and interceptors, which inspect, modify, or drop events between the source and the channel. A common deployment is tiered: edge agents collect logs on individual servers and forward them to aggregation agents, which consolidate the data before writing it to HDFS. Flume also supports fan-out through channel selectors: a replicating selector copies every event to all of a source's channels, while a multiplexing selector routes each event to particular channels based on a header value. Each channel then feeds its own sink. To scale, agents can be deployed across many servers, and channel capacity and transaction batch sizes can be tuned to the required throughput. Although Flume is most often used within the Hadoop ecosystem, it also integrates with Kafka, through a Kafka source, channel, and sink, for stream processing pipelines. Monitoring is available through JMX or a built-in HTTP endpoint that reports metrics as JSON, exposing counters such as channel fill level and sink delivery successes and failures.
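Header-based routing can be sketched by combining an interceptor-like step that sets a header with a selector that maps header values to channels. The code below is an illustrative model in plain Python rather than Flume's Java extension interfaces. The function names, the severity header, the channel names, and the routing table are all assumptions made for the example.

```python
def tag_severity(event):
    """Interceptor-like step: derive a 'severity' header from the event body."""
    headers = dict(event["headers"])  # copy, so the input event is not mutated
    headers["severity"] = "error" if "ERROR" in event["body"] else "info"
    return {"headers": headers, "body": event["body"]}


def select_channels(event, header, mapping, default):
    """Multiplexing-style selection: choose channel names from a header value."""
    return mapping.get(event["headers"].get(header), default)


# Hypothetical routing table: errors go to two channels, everything else to one.
mapping = {"error": ["alerts", "archive"], "info": ["archive"]}
channels = {"alerts": [], "archive": [], "misc": []}

raw_events = [
    {"headers": {}, "body": "ERROR disk full"},
    {"headers": {}, "body": "GET /index.html 200"},
]

for raw in raw_events:
    event = tag_severity(raw)
    # Events whose header value has no mapping fall through to the default.
    for name in select_channels(event, "severity", mapping, default=["misc"]):
        channels[name].append(event["body"])
```

After the loop, the error event sits in both the alerts and archive channels, while the access-log line is only in archive. In a real agent the equivalent mapping is declared in the properties file rather than written as code.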