Flume Hadoop Agent – Spool directory to HDFS

In the previous post, we saw how to write data from a spooling directory to the console log. In this post, we will learn how a Flume Hadoop agent can write data from a spooling directory to HDFS.

To write data from the spooling directory to HDFS, we only need to edit the properties file and change the sink configuration to HDFS.

spool-to-hdfs.properties

agent1.sources = datasource1
agent1.sinks = datastore1
agent1.channels = ch1

agent1.sources.datasource1.channels = ch1
agent1.sinks.datastore1.channel = ch1

agent1.sources.datasource1.type = spooldir
agent1.sources.datasource1.spoolDir = /usr/kmayank/spooldir
agent1.sinks.datastore1.type = hdfs
agent1.sinks.datastore1.hdfs.path = /temp/flume
agent1.sinks.datastore1.hdfs.filePrefix = events
agent1.sinks.datastore1.hdfs.fileSuffix = .log
agent1.sinks.datastore1.hdfs.inUsePrefix = _
agent1.sinks.datastore1.hdfs.fileType = DataStream

agent1.channels.ch1.type = file

With the above properties file, the files written to HDFS will carry the "events" prefix and the ".log" suffix, while files that are still being written to carry the "_" in-use prefix.

How many of the events read from the source go into one HDFS file?

The files in HDFS are rolled over every 30 seconds by default. You can change the interval by setting the "rollInterval" property, whose value is specified in seconds. You can also roll files over by event count ("rollCount") or by cumulative file size ("rollSize").
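As an illustration, the sink section above could be extended with roll settings along the following lines (the values are only examples, not recommendations):

# Roll a new HDFS file every 10 minutes, after 10,000 events,
# or once the file reaches about 128 MB, whichever happens first.
# Setting a property to 0 disables that particular roll trigger.
agent1.sinks.datastore1.hdfs.rollInterval = 600
agent1.sinks.datastore1.hdfs.rollCount = 10000
agent1.sinks.datastore1.hdfs.rollSize = 134217728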

The "fileType" property can take three values: SequenceFile, DataStream (text file), or CompressedStream.

The default is SequenceFile, which is a binary format. The event body is a byte array, and that byte array is written to the sequence file.

It is expected that whoever reads the sequence file knows how to deserialize the binary data back into objects.

DataStream writes the events uncompressed, for example as plain text, whereas CompressedStream compresses the output with a codec such as gzip or bzip2.
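If compressed output were needed, the sink could be configured roughly as follows; "hdfs.codeC" names the compression codec, and gzip here is just an example:

# Write compressed output instead of plain text
agent1.sinks.datastore1.hdfs.fileType = CompressedStream
agent1.sinks.datastore1.hdfs.codeC = gzip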

Now, start the Flume agent using the command below:

flume-ng agent \
--conf-file spool-to-hdfs.properties \
--name agent1 \
-Dflume.root.logger=WARN,console

Once the Flume agent is running, start putting files into the spooling directory. This will trigger the agent to process them.

Once you see that the files in the spooling directory have been renamed with the ".COMPLETED" suffix, go to HDFS and check whether the files have arrived. Use the command below to list the files in the HDFS directory.

hadoop fs -ls /temp/flume

Use the ‘cat’ command to print the content of the file.

hadoop fs -cat /temp/flume/<file_name>

Apache Flume Agent – Flume Hadoop | Spool directory to logger

In this tutorial, we will learn to set up an Apache Flume agent that monitors a spooling directory.

Whenever log files are added to the directory, the agent reads them and pushes their content to the console log.

Here the source is the spooling directory and the sink is the console log; the Apache Flume agent sits between the two.

[Figure: Apache Flume Agent]

Configuration of Apache Flume Agent

  1. We need to set up all the options in a ".properties" file.
  2. Use the command line to start a Flume agent.
  3. The command to start a Flume agent is shown below ("ng" stands for next generation):
flume-ng agent
  4. Set up the ".properties" file as shown below.

spool-to-log.properties

agent1.sources = datasource1
agent1.sinks = datastore1
agent1.channels = ch1

agent1.sources.datasource1.channels = ch1
agent1.sinks.datastore1.channel = ch1

agent1.sources.datasource1.type = spooldir
agent1.sources.datasource1.spoolDir = /usr/kmayank/spooldir

agent1.sinks.datastore1.type = logger
agent1.channels.ch1.type = file
  5. Run the Flume agent using the command below:
flume-ng agent \
--conf-file spool-to-log.properties \
--name agent1 \
-Dflume.root.logger=WARN,console

The last option, -Dflume.root.logger, is optional. It controls what logging appears on the screen; here, only warnings (and more severe messages) are printed to the console. Note that the logger sink writes events at the INFO level, so to actually see the event data on the console you would typically use INFO,console (or DEBUG,console) instead.

  6. Place a file into the spooling directory. After a few seconds you will notice that the file name gets the ".COMPLETED" suffix and the data in the file is printed to the screen/console.

It is important to note that any file you place in the spooling directory should be a text file, unique in name, and immutable (not modified after it has been placed there).

To read files in any other format, you have to write a custom deserializer in Java and plug it into your properties file.
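As a sketch, plugging a deserializer into the spooling directory source would look roughly like the line below; the class name is a hypothetical example, not a real implementation:

# Default is LINE, which turns each line of text into one event.
# A custom deserializer is referenced by the fully qualified class name
# of its builder (the class below is hypothetical).
agent1.sources.datasource1.deserializer = com.example.flume.MyFormatDeserializer$Builder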

Flume Events in Apache Flume Agent

The data inside an Apache Flume agent is represented as Flume events. The event is the base unit of communication between source, channel, and sink.

A Flume event is a discrete object that represents one record of data that needs to be transported from source to sink.

The source reads data in the form of Flume events and sends it to the channel, and the sink then reads the data from the channel in the form of events.

An Apache Flume event consists of a header and a body, together representing one record of data. The header holds key-value metadata that can be used to decide how to process the record or route it to a channel or sink.

The body is the actual data. Therefore, each record is represented by one Flume event.

For example, with the default deserializer each line of a file in the spooling directory becomes one event. The event body is represented as a byte array. When the source writes this data to the channel, it may do so as a single event or as multiple events.

If the event body exceeds the channel's capacity, Flume will not be able to transport that event. For example, if you want to move a very large file into HDFS, a direct copy would be a better option than using Flume.
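To make the role of event headers concrete, here is a minimal sketch of header-based routing using Flume's multiplexing channel selector. It assumes the agent declares two channels, ch1 and ch2, and the "datacenter" header and its values are made up for illustration:

# Route events to ch1 or ch2 based on the value of the "datacenter" header
agent1.sources.datasource1.channels = ch1 ch2
agent1.sources.datasource1.selector.type = multiplexing
agent1.sources.datasource1.selector.header = datacenter
agent1.sources.datasource1.selector.mapping.east = ch1
agent1.sources.datasource1.selector.mapping.west = ch2
agent1.sources.datasource1.selector.default = ch1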

Flume Installation – Apache Flume Agent

Flume installation is a very simple process. Go through the following steps to install and configure Apache Flume.

Steps for Flume Installation

  1. Download Apache Flume from the Apache webpage (Apache Flume download URL).
  2. Extract the downloaded archive and point to the Flume folder in your bash profile. The bash profile entry makes sure you can start the Flume agent from any directory. For example: export FLUME_HOME=$HOME/apache-flume-1.6.0-bin
  3. Append the PATH variable with FLUME_HOME/bin, for example: export PATH=$PATH:$FLUME_HOME/bin (see the combined sketch after this list).
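Putting the two exports together, a minimal bash profile snippet might look like the following; the install path and Flume version are assumptions, so adjust them to match your download:

# ~/.bash_profile (assumed install location and version)
export FLUME_HOME=$HOME/apache-flume-1.6.0-bin
export PATH=$PATH:$FLUME_HOME/bin

# Verify the installation from any directory
flume-ng version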

Flume Agent

A Flume agent is an independent Java daemon (JVM process) that receives events (data) from an external source and forwards them to the next destination. The next destination can be another agent or a sink. A Flume agent can connect any number of sources to any number of data stores.

Let’s understand this with an example:

[Figure: Apache Flume Agent]

Suppose you have two data sources, say DS1 and DS2, and you want to write DS1's data into HDFS and DS2's data into Cassandra.

For this scenario, one Flume agent is enough to do the job. A Flume agent performs hop-by-hop operations, and writing the data of DS1 and DS2 to HDFS and Cassandra is one complete hop.

Now suppose there is another hop for the data, i.e. the data written to HDFS is read by some other application and finally needs to go to yet another datastore, say Hive.

Here, two Flume agents are required since there are two hops of data: one agent moves the DS1 and DS2 data to HDFS and Cassandra respectively, and the other moves data from HDFS to Hive.
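As a rough sketch of the first hop, a single agent could declare two independent source/channel/sink pipelines as shown below. The spooling directory paths are placeholders, memory channels are used purely for brevity, and since Flume does not ship a Cassandra sink, the Cassandra sink type is a hypothetical custom class:

# One agent, two independent pipelines (first hop)
agent1.sources = ds1 ds2
agent1.channels = ch1 ch2
agent1.sinks = hdfsSink cassandraSink

agent1.sources.ds1.channels = ch1
agent1.sources.ds2.channels = ch2
agent1.sinks.hdfsSink.channel = ch1
agent1.sinks.cassandraSink.channel = ch2

agent1.sources.ds1.type = spooldir
agent1.sources.ds1.spoolDir = /usr/kmayank/spooldir-ds1
agent1.sources.ds2.type = spooldir
agent1.sources.ds2.spoolDir = /usr/kmayank/spooldir-ds2

agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = /temp/flume/ds1

# Hypothetical custom sink class; Flume has no built-in Cassandra sink.
agent1.sinks.cassandraSink.type = com.example.flume.CassandraSink

agent1.channels.ch1.type = memory
agent1.channels.ch2.type = memory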

There are three basic components of a flume agent:

  1. The first component receives the data. This component is called the source.
  2. The second component buffers the data. This component is called the channel.
  3. The third component writes the data out. This component is called the sink.

A Flume agent is set up using a configuration file, and within that configuration file we describe the sources and the format in which data arrives at each source.

It is important to configure the channel capacity according to the rates of the source and the sink. There are many types of channels, but two are most commonly used.

The first is the in-memory channel, which buffers the data in the memory of the machine where the Flume agent is running. The in-memory channel basically acts like an in-memory queue.

The source writes to the tail of the queue and the sink reads from the head of the queue.

But there are issues with the memory channel. The primary issue is that its capacity is constrained by the amount of memory the system has, and in case of a crash the memory channel is not persistent.

Therefore, all the data present in the buffer might be lost. File channels are preferable in this respect because they give you fault tolerance and non-lossy behaviour, i.e. a guarantee of no data loss.

Since the data is buffered on disk for file channels, you can have a larger buffer capacity as per your requirements. Channels are continuously polled by sink components, which write the data to the endpoints.
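For reference, channel capacity is configured per channel in the properties file. A minimal sketch contrasting the two channel types might look like this (pick one of the two blocks; the capacity values and directories are illustrative, not recommendations):

# Option 1 - memory channel: fast, but bounded by RAM and lost on a crash
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000
agent1.channels.ch1.transactionCapacity = 100

# Option 2 - file channel: buffered on disk, survives agent restarts
agent1.channels.ch1.type = file
agent1.channels.ch1.capacity = 1000000
agent1.channels.ch1.checkpointDir = /usr/kmayank/flume/checkpoint
agent1.channels.ch1.dataDirs = /usr/kmayank/flume/data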

Multiple sources can write to a single channel, and one source can also write to multiple channels, i.e. there is a many-to-many relationship between sources and channels.

However, the channel-to-sink relationship is one-to-one. The channel does not delete data immediately after it is written to the sink; it waits for an acknowledgment from the sink before deleting anything from the channel.
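As an example of one source writing to multiple channels, Flume's replicating channel selector (the default) copies every event to each listed channel; ch1 and ch2 are assumed to be declared elsewhere in the same file:

# Fan out: copy every event from the source to both channels
agent1.sources.datasource1.channels = ch1 ch2
agent1.sources.datasource1.selector.type = replicating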