Apache Flume Agent – Flume Hadoop | Spool directory to logger

In this tutorial, we will set up an Apache Flume agent that monitors a spooling directory.

Whenever a log file is added to the directory, the agent reads it and pushes its content to the console log.

Here the source is the spooling directory and the sink is the console log. The Apache Flume agent sits between the two.

Configuration of Apache Flume Agent

  1. Set up all the options in a “.properties” file.
  2. Use the command line to start a Flume agent.
  3. The command to start a Flume agent is shown below. “ng” stands for next generation.
$ flume-ng agent
  4. Set up the “.properties” file as below.

spool-to-log.properties

# The agent name (agent1) is passed on the command line with --name.
agent1.sources = datasource1
agent1.sinks = datastore1
agent1.channels = ch1

# A source can write to several channels ("channels"); a sink reads from exactly one ("channel").
agent1.sources.datasource1.channels = ch1
agent1.sinks.datastore1.channel = ch1

agent1.sources.datasource1.type = spooldir
agent1.sources.datasource1.spoolDir = /usr/kmayank/spooldir

agent1.sinks.datastore1.type = logger
agent1.channels.ch1.type = file
  5. Run the Flume agent using the command below:
$ flume-ng agent \
  --conf-file spool-to-log.properties \
  --name agent1 \
  -Dflume.root.logger=INFO,console

The last option, “-Dflume.root.logger”, is optional. It sets the level and destination of the agent's logging output. Here, messages at INFO level and above are printed to the console. The logger sink writes events at INFO level, so a stricter level such as WARN would hide the event data.

  6. Place a file into the spooling directory. After a few seconds, the file is renamed with a “.COMPLETED” suffix and its data is printed to the console.

It is important to note that any file you place in the spooling directory must be a text file, must have a unique name, and must be immutable, that is, it must not be modified after it lands in the directory.
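One way to meet the unique-name and immutability requirements is to finish writing the file somewhere else and then move it into the spooling directory in a single step. The following is only a minimal sketch: the function name `deliver_to_spool`, the separate staging directory, and the timestamp-plus-PID naming scheme are all assumptions made for illustration, and the staging directory is assumed to be on the same filesystem as the spooling directory so that `mv` is a plain rename.

```shell
#!/bin/sh
# Hypothetical helper (not part of Flume): deliver_to_spool SRC STAGING_DIR SPOOL_DIR
# Copies SRC into a staging directory first, then moves the finished copy into
# the spooling directory, so the agent never sees a half-written file.
deliver_to_spool() {
    src="$1"
    staging_dir="$2"
    spool_dir="$3"

    # Assumed naming scheme: original name + timestamp + shell PID.
    name="$(basename "$src").$(date +%Y%m%d%H%M%S).$$"

    # Refuse to reuse a name that is pending or already processed.
    if [ -e "$spool_dir/$name" ] || [ -e "$spool_dir/$name.COMPLETED" ]; then
        echo "name already used: $name" >&2
        return 1
    fi

    # Write the full content outside the spooling directory.
    cp "$src" "$staging_dir/$name" || return 1

    # Move the complete file in one step; on the same filesystem this is a rename.
    mv "$staging_dir/$name" "$spool_dir/$name" || return 1

    # Print the final path so callers can log it.
    printf '%s\n' "$spool_dir/$name"
}
```

For example, `deliver_to_spool /var/log/app.log /usr/kmayank/staging /usr/kmayank/spooldir` would drop a uniquely named copy of `app.log` into the spooling directory used in this tutorial; the staging path here is made up.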

To read files in any other format, you have to write a custom deserializer in Java and plug it into your properties file.

Flume Events in Apache Flume Agent

The data inside the Apache Flume agent is represented as Flume events. An event is the basic unit of communication between source, channel, and sink.

A Flume event is a discrete object that represents one record of data that needs to be transported from source to sink.

The source reads the incoming data and turns it into Flume events, sends those events to the channel, and the sink then reads the events from the channel.

An Apache Flume event has two parts: headers and a body. The headers are key-value pairs of metadata that control how the event is processed or routed to a channel or sink.

The body is the actual data, also called the event payload. Each record is therefore represented by one Flume event.

For example, with the spooling directory source's default line-based deserializer, each line of a file in the spooling directory becomes one event. The event body is represented by a byte array. When a source writes data to the channel, it can write one event or a batch of events at a time.
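To get a feel for how a file maps to events before handing it to the agent, a quick check like the one below can help. It is a rough sketch that assumes the spooling directory source is using its default `LINE` deserializer, where each line of the file becomes one event, and it takes 2048 characters as the line length limit, which is the documented default of `deserializer.maxLineLength` as far as I know. The function names `count_events` and `count_long_lines` are made up for this example.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Flume).

# Number of lines in the file, which is the number of events expected
# under the assumption of one event per line.
count_events() {
    awk 'END { print NR }' "$1"
}

# Number of lines longer than LIMIT characters (default 2048, assumed to match
# deserializer.maxLineLength). Such lines would not fit into a single event.
count_long_lines() {
    file="$1"
    limit="${2:-2048}"
    awk -v limit="$limit" 'length($0) > limit { n++ } END { print n + 0 }' "$file"
}
```

If `count_long_lines` reports anything other than 0, it may be worth raising the limit in the properties file or reformatting the input before placing it in the spooling directory.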

If an event body is too large for the channel to hold, Apache Flume won't be able to transport that event. For example, if you want to move a very large file to, say, HDFS, a direct copy is a better option than Flume.
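The choice between Flume and a direct copy could be scripted with a simple size check. This is only a sketch: the function name `choose_transport` and the byte threshold are assumptions for illustration, and the right threshold depends on your channel configuration and available memory or disk.

```shell
#!/bin/sh
# Hypothetical helper (not part of Flume): choose_transport FILE MAX_BYTES
# Prints "direct" when FILE is larger than MAX_BYTES, otherwise "flume".
choose_transport() {
    file="$1"
    max_bytes="$2"

    # File size in bytes; tr strips the padding some wc implementations add.
    size="$(wc -c < "$file" | tr -d ' ')"

    if [ "$size" -gt "$max_bytes" ]; then
        # Too large for the agent: copy it straight to the destination instead,
        # for example with "hdfs dfs -put" when the target is HDFS.
        echo "direct"
    else
        # Small enough: place it in the spooling directory for the agent.
        echo "flume"
    fi
}
```

A caller could then branch on the result, moving the file into the spooling directory when the answer is `flume` and running the direct copy otherwise.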