If you’ve ever tried to collect streaming logs from multiple servers and push them to a centralized data store, you know it’s no easy task. Apache Flume exists to make that process smoother. Lightweight, flexible, and purpose-built for ingesting high volumes of event data, Flume acts as the quiet but efficient middleman between your data sources and destinations like HDFS or HBase.
But like any tool, Flume only performs as well as it has been set up. That means the installation, initial configuration, and basic setup can't be treated as an afterthought. Here's how to get it right from the beginning.
Before installing anything, it's good to know what you're working with. Flume is designed around the idea of agents—tiny data pipelines composed of sources, channels, and sinks. Sources collect the data, channels temporarily hold it, and sinks deliver it to the final destination.
Each agent runs independently, which means you can scale horizontally with ease. You could have a simple agent pushing data from a log file to HDFS or a network of agents feeding data from various locations into a central collector. This modular structure is what makes Flume so versatile, and why getting the configuration right matters so much.
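In configuration terms, every agent follows the same three-part shape; the rest of this walkthrough fills in a concrete version of this skeleton (all names here are placeholders):

# The skeleton of any Flume agent definition
<agent-name>.sources = <source>
<agent-name>.channels = <channel>
<agent-name>.sinks = <sink>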
Apache Flume requires Java. Not just any version—Flume generally plays well with Java 8 or 11. To check what you’ve got, run:
java -version
If nothing shows up or the version is off, go ahead and install the right version. Most setups can use:
sudo apt-get install openjdk-11-jdk
or
sudo yum install java-11-openjdk
Go to the official Apache Flume download page and choose the latest stable version. Once downloaded, extract the archive to a suitable location.
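If you prefer the command line, you can fetch the tarball directly (assuming the standard Apache mirror layout; substitute the real version for x.y.z):

wget https://downloads.apache.org/flume/x.y.z/apache-flume-x.y.z-bin.tar.gz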
tar -xzf apache-flume-x.y.z-bin.tar.gz
sudo mv apache-flume-x.y.z-bin /usr/local/flume
To make Flume available from anywhere in your terminal, add it to your PATH. You’ll also want to specify the Java home.
In your ~/.bashrc or ~/.bash_profile (point JAVA_HOME at wherever your JVM actually lives; on Debian/Ubuntu the path is typically /usr/lib/jvm/java-11-openjdk-amd64):
export FLUME_HOME=/usr/local/flume
export PATH=$PATH:$FLUME_HOME/bin
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk
Then reload the file:
source ~/.bashrc
Flume doesn’t require a traditional "install" step. Once it’s unpacked and Java is set up, it’s ready. You can verify things with:
flume-ng version
You should see the Flume version and information about the dependencies. If it runs without throwing errors, you're good to move on.
This is where Flume starts to flex. A configuration file in Flume is just a .conf file written in a simple key-value format. It defines how your agents behave—what source they listen to, how they buffer data, and where they send it.
Let’s walk through a basic example.
First, name your agent and declare its components:
agent1.sources = src
agent1.channels = ch
agent1.sinks = sink
Let’s say we’re tailing a file:
agent1.sources.src.type = exec
agent1.sources.src.command = tail -F /var/log/syslog
This source will execute a shell command and stream the output to Flume.
Here, we’ll use a memory channel for simplicity:
agent1.channels.ch.type = memory
agent1.channels.ch.capacity = 1000
agent1.channels.ch.transactionCapacity = 100
This keeps things fast but isn't ideal for production where durability matters. In those cases, use a file channel.
And finally, let’s write the logs to HDFS:
agent1.sinks.sink.type = hdfs
agent1.sinks.sink.hdfs.path = hdfs://namenode:8020/user/flume/logs/
agent1.sinks.sink.hdfs.fileType = DataStream
agent1.sinks.sink.hdfs.writeFormat = Text
agent1.sinks.sink.hdfs.batchSize = 100
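Depending on your retention needs, you may also want to control how the sink rolls files in HDFS. These are standard HDFS sink properties; the values below are purely illustrative:

# Roll a new file every 5 minutes or at ~128 MB, whichever comes first;
# rollCount = 0 disables event-count-based rolling
agent1.sinks.sink.hdfs.rollInterval = 300
agent1.sinks.sink.hdfs.rollSize = 134217728
agent1.sinks.sink.hdfs.rollCount = 0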
Now bind the source and sink to the channel. Note that the source property is plural (channels) while the sink's is singular (channel):
agent1.sources.src.channels = ch
agent1.sinks.sink.channel = ch
Save this configuration to a file—simple-agent.conf.
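For reference, here is the complete simple-agent.conf assembled from the snippets above:

# Component names
agent1.sources = src
agent1.channels = ch
agent1.sinks = sink

# Source: stream new lines from syslog
agent1.sources.src.type = exec
agent1.sources.src.command = tail -F /var/log/syslog

# Channel: in-memory buffer
agent1.channels.ch.type = memory
agent1.channels.ch.capacity = 1000
agent1.channels.ch.transactionCapacity = 100

# Sink: write events to HDFS
agent1.sinks.sink.type = hdfs
agent1.sinks.sink.hdfs.path = hdfs://namenode:8020/user/flume/logs/
agent1.sinks.sink.hdfs.fileType = DataStream
agent1.sinks.sink.hdfs.writeFormat = Text
agent1.sinks.sink.hdfs.batchSize = 100

# Wire source and sink to the channel
agent1.sources.src.channels = ch
agent1.sinks.sink.channel = ch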
Once your configuration is ready, you can run the agent directly from the command line:
flume-ng agent --conf $FLUME_HOME/conf --conf-file simple-agent.conf --name agent1 -Dflume.root.logger=INFO,console
This command tells Flume which agent to start and where to find its config. The -Dflume.root.logger part just makes the logs print to your screen, which is useful during debugging.
If everything is configured correctly, you’ll see log output confirming that the source is running, the channel is initialized, and the sink is writing to HDFS. If something’s off, Flume will usually point you to the line in the config file that’s causing trouble.
Flume offers more than just local file tailing and HDFS sinks. You can chain agents together, use Avro sources for cross-agent communication, or configure failover paths to ensure nothing gets lost.
Here’s a quick overview of what’s possible:
You can define multiple sinks and group them under a failover strategy:

agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = sink1 sink2
agent1.sinkgroups.g1.processor.type = failover
agent1.sinkgroups.g1.processor.priority.sink1 = 1
agent1.sinkgroups.g1.processor.priority.sink2 = 2

Here, sink1 and sink2 must be defined like any other sink elsewhere in the config. Flume sends events to whichever active sink has the higher priority value and fails over to the next one when it becomes unavailable.
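For chaining agents across machines, the usual pattern is an Avro sink on the upstream agent pointing at an Avro source on a downstream collector. Here's a minimal sketch, where collector-host, port 4141, and the agent name collector are placeholder values:

# Upstream agent: forward events to the collector over Avro
agent1.sinks.sink.type = avro
agent1.sinks.sink.hostname = collector-host
agent1.sinks.sink.port = 4141

# Downstream collector agent: accept Avro events from upstream agents
collector.sources.src.type = avro
collector.sources.src.bind = 0.0.0.0
collector.sources.src.port = 4141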
You can also have multiple sources writing into multiple channels if you want to balance traffic. Just make sure every source-channel and channel-sink mapping is explicitly defined.
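As a quick sketch, here's a source replicating every event into two channels (ch1 and ch2 are hypothetical channel names, and replicating happens to be Flume's default selector):

agent1.sources.src.channels = ch1 ch2
agent1.sources.src.selector.type = replicating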
If reliability is a priority, switch your channel type from memory to file. It slows things down but guards against data loss during outages.
agent1.channels.ch.type = file
agent1.channels.ch.checkpointDir = /var/lib/flume/checkpoint
agent1.channels.ch.dataDirs = /var/lib/flume/data
Setting up Apache Flume may seem a little verbose at first, but once the pieces click together, it’s a straightforward and effective way to collect and move massive amounts of log data. Whether you're tailing local files, piping events over the network, or writing to distributed stores, Flume handles the heavy lifting so you don’t have to.
Once your agents are properly installed and configured, the rest becomes a matter of scale. And that’s where Flume shines best—quietly doing its job, one event at a time.