Flume Interview Questions

[SCENARIO-BASED] Apache Flume Interview Questions and Answers

Last updated on 22nd Sep 2022

About author

Sanjay (Sr Big Data DevOps Engineer )

Sanjay has high expertise in his industry domain, with 7+ years of experience. He has also been a technical blog writer for the past 4 years, providing informative knowledge for job seekers.


1. What is Apache Flume?

Ans:

Apache Flume is used to efficiently and reliably collect, aggregate, and transfer large amounts of data from one or more sources to a centralized data store. It can ingest any kind of data, including log data, event data, network data, social-media-generated data, email messages, message queues, etc., since data sources are customizable in Flume.

2. What are the basic features of Flume?

Ans:

A data collection service for Hadoop: using Flume, you can get data from multiple servers into Hadoop immediately.

For distributed systems: along with log files, Flume is also used to import large volumes of event data produced by social networking sites like Facebook and Twitter, and e-commerce websites like Amazon and Flipkart.

Open source: Flume is open-source software; it does not need any license key for activation.

Scalable: Flume can be scaled horizontally.

3. What are some applications of Flume?

Ans:

Assume a web application needs to analyze customer behavior based on current activity. This is where Flume comes in handy: it extracts the data and moves it to Hadoop for analysis. Flume is used to move the log data generated by application servers into HDFS at high speed.

4. What is Agent?

Ans:

An agent is a process that hosts Flume components such as sources, channels, and sinks, and thus has the ability to receive, store, and forward events to their destination.

5. What is a channel?

Ans:

A channel stores events: events are delivered to the channel by sources operating within the agent. An event stays in a channel until a sink removes it for further transport.

6. Does Apache Flume provide support for third-party plug-ins?

Ans:

Yes. Apache Flume has a plug-in-based architecture: it can load data from external sources and transfer it to external destinations, which is why most data analysts use it.

7. What is FlumeNG?

Ans:

FlumeNG is a real-time loader for streaming data into Hadoop; it saves the data in HDFS and HBase. If we want to get started, we should use FlumeNG, which improves on the original Flume.

8. How do you handle agent failures?

Ans:

If a Flume agent goes down, then all flows hosted on that agent are aborted. Once the agent is restarted, the flows will resume. If the channel is set up as an in-memory channel, then all events that were held in the channel when the agent went down are lost. But channels set up as file channels or other stable channels will continue processing events where they left off.

9. Can Flume distribute data to multiple destinations?

Ans:

Yes. It supports multiplexing flows: an event flows from one source to multiple channels and multiple destinations. This is achieved by defining a flow multiplexer.

10. What is a Flume?

Ans:

Flume is a reliable, distributed service for the collection and aggregation of large amounts of streaming data into HDFS. Most big data analysts use Apache Flume to push data from different sources like Twitter, Facebook, and LinkedIn into Hadoop, Storm, Solr, Kafka, and Spark.

11. Why are we using Flume?

Ans:

Most often, Hadoop developers use this tool to get log data from social media sites. It was developed by Cloudera for aggregating and moving very large amounts of data. The primary use is to collect log files from various sources and asynchronously persist them in the Hadoop cluster.

12. What is a Flume Agent?

Ans:

A Flume agent is a JVM process that holds the Flume core components (source, channel, sink) through which events flow from an external source, like a web server, to a destination, like HDFS. An agent is the heart of Apache Flume.

13. What are the Flume Core components?

Ans:

  • Source, channel, and sink are the core components of Apache Flume.
  • When a Flume source receives an event from an external source, it saves the event in one or more channels.
  • A Flume channel temporarily stores and maintains the event until it is consumed by the Flume sink; it acts like a Flume repository.
  • The Flume sink removes the event from the channel and puts it into an external repository like HDFS, or passes it on to the next Flume agent.
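
As a minimal sketch of how these components are wired together (the agent name a1, the component names, and the netcat/logger types below are illustrative choices, not required values):

  # name the components of agent a1
  a1.sources = r1
  a1.channels = c1
  a1.sinks = k1
  # a netcat source listening on a local port
  a1.sources.r1.type = netcat
  a1.sources.r1.bind = localhost
  a1.sources.r1.port = 44444
  # an in-memory channel buffering events between source and sink
  a1.channels.c1.type = memory
  # a logger sink that prints events, handy for testing
  a1.sinks.k1.type = logger
  # attach the source and the sink to the channel
  a1.sources.r1.channels = c1
  a1.sinks.k1.channel = c1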

14. Can Flume provide 100% reliability to the data flow?

Ans:

Yes, it offers end-to-end reliability of the flow. By default, Flume uses a transactional approach to data flow: sources and sinks are encapsulated in transactions provided by the channels, and these channels are responsible for passing events reliably from end to end of the flow. So it provides 100% reliability to the data flow.

15. Can you explain the configuration file?

Ans:

The agent configuration is stored in a local configuration file. It comprises each agent's source, sink, and channel information. Each core component, such as a source, sink, or channel, has a name, a type, and a set of properties.

For example, an Avro source requires a hostname and a port number to receive data from an external client. A memory channel should have a maximum queue size in the form of its capacity. An HDFS sink needs the file system URI, a path to create files under, the frequency of file rotation, and more, as sketched below.
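
A hedged sketch of such properties (the bind address, port, and HDFS path below are placeholder values):

  # Avro source: hostname and port to receive data on
  a1.sources.r1.type = avro
  a1.sources.r1.bind = 0.0.0.0
  a1.sources.r1.port = 4141
  # memory channel: maximum queue size set via capacity
  a1.channels.c1.type = memory
  a1.channels.c1.capacity = 10000
  # HDFS sink: file system URI/path and rotation frequency
  a1.sinks.k1.type = hdfs
  a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events
  a1.sinks.k1.hdfs.rollInterval = 30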

16. What are the complicated steps in Flume configuration?

Ans:

Flume processes streaming data, so once started, there is no stop/end to the process. It asynchronously flows data from the source to HDFS via the agent. First of all, the agent should know how the individual components are connected in order to load the data, so the configuration is the trigger that starts loading streaming data. For example, the consumer key, consumer secret, access token, and access token secret are the key factors for downloading data from Twitter.

17. What are the important steps in configuration?

Ans:

    1. Each Source must have at least one channel.
    2. Each Sink must have exactly one channel.
    3. Each Component must have a specific type.

18. Can you explain consolidation in Flume?

Ans:

The beauty of Flume is consolidation: it collects data from various sources, even from different Flume agents. A consolidating Flume source can collect the data flowing from the different sources, pass it through its channel and sink, and finally send this data to HDFS or another target destination.

19. Can Flume distribute data to multiple destinations?

Ans:

Yes, it supports multiplexing flows: an event flows from one source to multiple channels and multiple destinations. This is achieved by defining a flow multiplexer. For example, data can be replicated so that one sink writes to HDFS while another sink sends to a destination whose input is another agent, as sketched below.
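
A sketch of such a fan-out (the component names, hostname, and paths are illustrative): one source replicates events into two channels, with one sink writing to HDFS and the other forwarding to another agent over Avro.

  a1.sources = r1
  a1.channels = c1 c2
  a1.sinks = k1 k2
  # replicate every event into both channels
  a1.sources.r1.selector.type = replicating
  a1.sources.r1.channels = c1 c2
  # first destination: HDFS
  a1.sinks.k1.type = hdfs
  a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
  a1.sinks.k1.channel = c1
  # second destination: the Avro source of another agent
  a1.sinks.k2.type = avro
  a1.sinks.k2.hostname = agent2.example.com
  a1.sinks.k2.port = 4141
  a1.sinks.k2.channel = c2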

20. Does an agent communicate with other agents?

Ans:

No, every agent runs independently. Flume can easily scale horizontally, and as a result, there is no single point of failure.

21. What are interceptors?

Ans:

Interceptors are used to filter events between the source and channel, or between the channel and sink. They can filter out unnecessary or targeted log events. Depending on the requirements, you can use any number of interceptors.

22. What are channel selectors?

Ans:

Channel selectors control and separate events and allocate them to a specific channel. The default is the replicating channel selector, which replicates the data into multiple or all channels.

Multiplexing channel selectors are used to separate and route data based on the event's header information; based on the sink's destination, the event is routed into the specific channel, as in the sketch below.
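
An illustrative sketch (the header name state and its values are made-up examples): a multiplexing selector routes each event to a channel based on a header value.

  a1.sources.r1.selector.type = multiplexing
  a1.sources.r1.selector.header = state
  # events with state=CA go to c1, state=NY to c2
  a1.sources.r1.selector.mapping.CA = c1
  a1.sources.r1.selector.mapping.NY = c2
  # anything else falls back to the default channel
  a1.sources.r1.selector.default = c1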

23. What are sink processors?

Ans:

Sink processors are the mechanism by which you can create failover handling and load balancing across a group of sinks, as in the sketch below.
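
A minimal sketch of a failover sink processor over a sink group (the group and sink names and the numeric values are examples; a load-balancing processor uses type load_balance instead):

  a1.sinkgroups = g1
  a1.sinkgroups.g1.sinks = k1 k2
  a1.sinkgroups.g1.processor.type = failover
  # higher-priority sinks are tried first
  a1.sinkgroups.g1.processor.priority.k1 = 10
  a1.sinkgroups.g1.processor.priority.k2 = 5
  # milliseconds a failed sink is penalized before retry
  a1.sinkgroups.g1.processor.maxpenalty = 10000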

24. Which is the reliable channel in Flume to ensure that there is no data loss?

Ans:

The FILE channel is the most reliable of the three channels: JDBC, FILE, and MEMORY.

25. How can Flume be used with HBase?

Ans:

Apache Flume can be used with HBase using one of the two HBase sinks:

  • HBaseSink (org.apache.flume.sink.hbase.HBaseSink) supports secure HBase clusters and the novel HBase IPC introduced in HBase 0.96.
  • AsyncHBaseSink (org.apache.flume.sink.hbase.AsyncHBaseSink) has better performance than HBaseSink because it can make non-blocking calls to HBase.
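
A minimal sketch of an HBaseSink configuration (the table, column family, and channel names are placeholders; the async variant would use type asynchbase):

  a1.sinks.k1.type = hbase
  a1.sinks.k1.table = flume_events
  a1.sinks.k1.columnFamily = cf
  # serializer that converts Flume events into HBase puts
  a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
  a1.sinks.k1.channel = c1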

26. Is it possible to leverage real-time analysis on big data collected by Flume directly? If yes, then explain how.

Ans:

Data from Flume can be extracted, transformed, and loaded in real time into Apache Solr servers by using MorphlineSolrSink.

27. Explain the various channel types in Flume. Which channel type is faster?

Ans:

MEMORY Channel: events are read from the source into memory and passed to the sink.

JDBC Channel: the JDBC channel saves events in an embedded Derby database.

FILE Channel: the file channel writes contents to a file on the file system after reading an event from the source. The file is deleted only after the contents are successfully delivered to the sink.

The MEMORY channel is the fastest of the three, but it carries a risk of data loss. The channel you choose depends entirely on the nature of the big data application and the value of each event.

28. Explain replicating and multiplexing selectors in Flume.

Ans:

Channel selectors are used to handle multiple channels. Based on the Flume header value, an event can be written to just a single channel or to multiple channels. If a channel selector is not specified for a source, then by default it is the replicating selector; using the replicating selector, the same event is written to all channels in the source's channels list. The multiplexing channel selector is used when the application has to send different events to different channels.

29. What is the use of Apache Flume?

Ans:

Apache Flume is used to fetch log or streaming data from various social media sources and asynchronously persist it in a Hadoop cluster for further analysis.

30. Differentiate between FileSink and FileRollSink.

Ans:

The major difference between the HDFS FileSink and FileRollSink is that the HDFS File Sink writes events into the Hadoop Distributed File System (HDFS), whereas the File Roll Sink saves the events to the local file system.

31. What is an Apache Flume Source?

Ans:

An Apache Flume source is used to receive events from external sources, like a web server, and put them into one or more channels.

32. How can multi-hop agents be set up in Flume?

Ans:

The Avro RPC bridge mechanism is used to set up a multi-hop agent in Apache Flume.
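
A sketch of the two hops (the hostnames and the port are placeholders): the first agent's Avro sink points at the second agent's Avro source.

  # agent1: Avro sink forwarding to the next hop
  agent1.sinks.k1.type = avro
  agent1.sinks.k1.hostname = host2.example.com
  agent1.sinks.k1.port = 4141
  agent1.sinks.k1.channel = c1
  # agent2: Avro source receiving events from agent1
  agent2.sources.r1.type = avro
  agent2.sources.r1.bind = 0.0.0.0
  agent2.sources.r1.port = 4141
  agent2.sources.r1.channels = c1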

33. What is FlumeNG?

Ans:

FlumeNG is a real-time loader for streaming data into Hadoop; it saves data in HDFS and HBase. If you want to get started, use FlumeNG, which improves on the original Flume.

34. What is the difference between Apache Flume and Apache Kafka?

Ans:

Apache Flume uses sinks to push messages to the destination, whereas with Kafka you use the Kafka consumer API to consume messages from the Kafka broker.

35. What are the data extraction tools in Hadoop?

Ans:

Sqoop can be used to transfer data between an RDBMS and HDFS. Flume can be used to extract streaming data from social media, weblogs, etc., and save it to HDFS.

36. Tell any two features of Flume.

Ans:

  • Flume efficiently collects, aggregates, and moves large amounts of log data from many different sources to centralized data stores.
  • Flume is not restricted to log data aggregation; it can transport large quantities of event data, including but not limited to network traffic data, social-media-generated data, and email messages.

37. What is Apache Flume event batching?

Ans:

Apache Flume batches events. The batch size is the maximum number of events that a sink or client takes from a channel in a single transaction. If the batch size is small, then throughput is decreased, but in case of failure there is less duplication; if the batch size is big, then throughput is high, but in case of failure there is more duplication.
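
For instance, on an HDFS sink the batch size is set with the hdfs.batchSize property (the value below is illustrative):

  a1.sinks.k1.type = hdfs
  # events written to HDFS per transaction; a larger value raises
  # throughput but means more duplicated events after a failure
  a1.sinks.k1.hdfs.batchSize = 1000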

38. What is Apache Flume fan-out data flow?

Ans:

Apache Flume offers the facility to send events from one source to multiple channels. There are two types of fan-out: the first is called replicating and the second is called multiplexing. In the case of replicating, an event is forwarded to all the channels; in the case of multiplexing, events are forwarded only to selected channels.

39. What is the Apache Flume topology design?

Ans:

In Apache Flume, the first step is to check all the sources and destination sinks for the data; after that, check whether you need aggregation or rerouting of events. If you are collecting data from many data sources, then aggregation and rerouting are needed to direct those events to the various locations.

40. Where is the agent configuration stored in Flume?

Ans:

The Flume agent configuration is saved in a local configuration file. This is a text file that follows the Java properties file format. Configurations for one or more agents can be specified in the same configuration file.

41. How is HBaseSink different from AsyncHBaseSink?

Ans:

Apache Flume's HBaseSink and AsyncHBaseSink are both used to send events to HBase. In the case of HBaseSink, the HTable API is used to send data to HBase; in the case of AsyncHBaseSink, the asynchbase API is used to send the stream data to HBase, and any failure is handled by callbacks.

42. What is an Avro source in Apache Flume?

Ans:

The Avro source listens on an Avro port and receives events from external Avro client streams. When it is paired with the built-in Avro sink on another Flume agent, it can create tiered collection topologies.

43. What is multi-hop data flow?

Ans:

In multi-hop flow, a user can build flows where events travel through multiple agents before reaching the final destination.

44. What are the tools used in big data?

Ans:

  • Hadoop
  • Hive
  • Pig
  • Flume
  • Mahout
  • Sqoop

45. What types of channel selectors are supported by Flume?

Ans:

    1. Replicating Channel Selector
    2. Multiplexing Channel Selector

46. Does Flume provide 100% reliability to the data flow?

Ans:

Yes, Apache Flume offers end-to-end reliability because of its transactional approach to data flow.

47. Why Flume?

Ans:

  • Collecting readings from arrays of sensors.
  • Collecting impressions from custom apps for an ad network.
  • Collecting readings from network devices in order to monitor their performance.
  • Flume is targeted at preserving reliability, scalability, manageability, and extensibility while it serves a maximum number of clients with higher QoS.

48. What is a Flume event?

Ans:

A unit of data with a set of string attributes is called a Flume event. An external source, like a web server, sends events to the source; internally, Flume has built-in functionality to understand the source format. Every log file is considered an event. Every event has header and value sections: the header holds header information, and the value is the payload assigned to that header.

49. What are the types of interceptors supported by Apache Flume?

Ans:

    1. Timestamp Interceptor
    2. Host Interceptor
    3. Static Interceptor
    4. Remove Header Interceptor
    5. UUID Interceptor
    6. Morphline Interceptor
    7. Search and Replace Interceptor
    8. Regex Filtering Interceptor

50. What are the similarities and differences between Apache Flume and Apache Kafka?

Ans:

Flume pushes messages to their destination via its sinks. With Kafka, you need to consume messages from a Kafka broker using the Kafka consumer API.

51. Explain reliability and failure handling in Apache Flume.

Ans:

Flume NG uses channel-based transactions to guarantee reliable message delivery. When a message moves from one agent to another, two transactions are started: one on the agent that delivers the event and the other on the agent that receives it. In order for the sending agent to commit its transaction, it must receive a success indication from the receiving agent, and the receiving agent only returns a success indication if its own transaction commits properly first. This ensures guaranteed delivery semantics between the hops that the flow makes.

52. What are Apache Flume interceptors?

Ans:

Apache Flume interceptors are used to filter events between the source and channel, or between the channel and sink. They can change or drop events based on conditions defined by developers.

53. Does Flume give 100 percent reliability to the data flow?

Ans:

Flume generally provides end-to-end reliability of the flow, and it uses a transactional approach to data flow by default.

In addition, sources and sinks are encapsulated in transactions provided by the channels, and these channels are responsible for passing events reliably from end to end of the flow. Hence, it offers 100 percent reliability to the data flow.

54. What is an agent?

Ans:

In Apache Flume, an independent daemon process (JVM) is what we call an agent. First, it receives events from clients or other agents; afterwards, it forwards them to its next destination, which is a sink or another agent. Note that Flume can have more than one agent.

55. Is it possible to leverage real-time analysis of the big data collected by Flume directly? If yes, then explain how.

Ans:

By using MorphlineSolrSink, you can extract, transform, and load data from Flume in real time into Apache Solr servers.

56. How do you check the integrity of a file channel?

Ans:

The Flume platform offers a File Channel Integrity tool, which verifies the integrity of individual events in the file channel and removes corrupted events.
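
Per the Flume user guide, the tool can be run roughly as follows (the conf and data directory paths are placeholders):

  bin/flume-ng tool --conf ./conf FCINTEGRITYTOOL -l ./datadir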

57. What do you mean by the Apache web server?

Ans:

The Apache web server is an open-source HTTP web server, and it is used for hosting websites.

58. How do you check the Apache version?

Ans:

Use the command httpd -v.

59. Which user does Apache run as, and how do you check the location of the config file?

Ans:

Apache runs as the nobody user, and the config file's location is /etc/httpd/conf/httpd.conf.

60. What are the HTTP and HTTPS ports of Apache?

Ans:

In Apache, the port for HTTP is 80, and for HTTPS it is 443.

61. How do you install the Apache server on a Linux machine?

Ans:

Use the following commands for CentOS and Debian, respectively:

CentOS: yum install httpd

Debian: apt-get install apache2

62. Where are the configuration directories of an Apache web server?

Ans:

cd /etc/httpd and type ls -l.

63. Can you install two Apache web servers on one single machine?

Ans:

Yes, you can install two Apache web servers on one machine, but you have to define two different ports for them.

64. What does DocumentRoot refer to in Apache?

Ans:

It refers to the location of the web files stored on the server.

65. What do you mean by an Alias Directive?

Ans:

The Alias directive is responsible for mapping resources in the file system.

66. What is meant by DirectoryIndex?

Ans:

It is the first file that the Apache server looks for when any request comes in from the domain.

67. Which is the reliable channel in Flume to ensure that there is no data loss?

Ans:

The FILE channel is the most reliable channel.

68. What is meant by a virtual host in Apache?

Ans:

The virtual host section contains information about the website name, directory index, server admin email, and the location of the error logs.

69. Explain the difference between Location and Directory.

Ans:

  • Location: used for setting elements related to a URL.
  • Directory: refers to a location in the file system of the server.

70. What is meant by Apache virtual hosting?

Ans:

Hosting multiple websites on a single web server is known as Apache virtual hosting.

There are 2 types of virtual hosting:

  • Name-Based Virtual Hosting.
  • IP Based Virtual Hosting.
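
As an illustrative sketch, a name-based virtual host block in httpd.conf (the domain and paths are placeholders); repeating such blocks with different ServerName values hosts multiple sites on one IP:

  <VirtualHost *:80>
      ServerName www.example.com
      DocumentRoot /var/www/example
      ErrorLog logs/example_error.log
  </VirtualHost>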

71. What is meant by MPM in Apache?

Ans:

In Apache, MPM stands for Multi-Processing Modules.

72. What is meant by mod_perl and mod_php?

Ans:

  • mod_perl is used to enhance the performance of Perl scripts.
  • mod_php is used to enhance the performance of PHP scripts.

73. What is meant by mod_evasive?

Ans:

It is a module that helps the web server prevent web attacks.

Example: DDoS.

74. What is meant by LogLevel debug in the httpd.conf file?

Ans:

With LogLevel debug's help, you can find more information in the error logs, which is used to solve problems.

75. How do you start and stop the Apache web server?

Ans:

Inside the Apache instance location there is a bin folder, and inside the bin folder there is an executable script. You can also use the below commands in the bin folder from a terminal:

  • To start: ./apachectl start
  • To stop: ./apachectl stop

76. What is the command to change the default Listen port?

Ans:

You can give a directive like this: Listen 9.126.8.139:8000. This directive modifies the default Listen port, making the listening port 8000.

77. What is the Flume config file?

Ans:

The Flume agent configuration file, flume.conf, resembles a Java properties file with hierarchical property settings. The filename flume.conf is not fixed; you can give it any name, but you must use the same name when starting the agent with the flume-ng command.
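
For example, if the file is named example.conf and defines an agent a1, the agent could be started like this (the file and agent names are illustrative):

  bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console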

78. What are the log levels of Apache?

Ans:

The log levels are: debug, info, notice, warn, error, crit, alert, emerg.

79. How do you kill an Apache process?

Ans:

You can use the below command:

  • kill <PID>

80. What is meant by error codes 200, 403, and 503?

Ans:

  • 200 – The request succeeded; the server is OK.
  • 403 – Access to the requested file is restricted (forbidden).
  • 503 – The server is busy (service unavailable).

81. How do you check httpd.conf consistency?

Ans:

By giving the following command:

  • httpd -t

82. How do you enable PHP scripts on the server?

Ans:

  • First, install mod_php.
  • Second, add the directive: AddHandler application/x-httpd-php .phtml .php

83. Does Flume have a plugin-based architecture?

Ans:

Yes, Flume has a 100% plugin-based architecture: it can load and ship data from external sources to external destinations that are separate from Flume. That is why most big data analysts use this tool for streaming data.

84. What is a Flume source?

Ans:

A Flume source is the component of the Flume agent which consumes data (events) from data generators like a web server and delivers it to one or more channels. The data generator sends the data (events) to Flume in a format recognized by the target Flume source.

85. What is the use of Flume in big data?

Ans:

Flume is a reliable, distributed service for the collection and aggregation of huge amounts of streaming data into HDFS. Most big data analysts use Apache Flume to push data from various sources like Twitter, Facebook, and LinkedIn into Hadoop, Storm, Solr, Kafka, and Spark.

86. What is the external source in Flume?

Ans:

The external source sends events to Flume in a format that is recognized by the target Flume source. For example, an Avro Flume source can be used to receive Avro events from Avro clients or from other Flume agents in the flow that send events from an Avro sink.

87. What are the components of a Flume agent?

Ans:

A Flume agent contains three main components: source, channel, and sink. A source is the component of an agent which receives data from the data generators and transfers it to one or more channels in the form of Flume events.

88. What is the use of Flume with HDFS?

Ans:

Flume is a framework used to move log data into HDFS. Generally, events and log data are generated by log servers, and these servers have Flume agents running on them that receive the data from the data generators. The data from these agents is then collected by an intermediate node known as a collector.

89. How do you view the data sent by Flume to HDFS?

Ans:

In the Hue File Browser, open the /user/cloudera/flume/events directory. There will be a file named FlumeData with a serial number as the file extension. Click the file name link to view the data sent by Flume to HDFS.
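
Alternatively, from the command line (the serial-number suffix below is a made-up example):

  hdfs dfs -ls /user/cloudera/flume/events
  hdfs dfs -cat /user/cloudera/flume/events/FlumeData.1431102692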

90. How do you send streaming data to HDFS?

Ans:

Tools are available to send streaming data to HDFS. To transfer streaming data (log files, events, etc.) from various sources to HDFS, several tools are available; a very famous tool in use is Scribe, which aggregates and streams log data.

91. What is Flume in Hadoop HDFS?

Ans:

Flume is another top-level project from the Apache Software Foundation, developed to provide continuous data ingestion into Hadoop HDFS. The data can be of any type, but Flume is mainly well-suited to handling log data, such as the log data from web servers.

92. How do you check if sequence data has been loaded into HDFS?

Ans:

To check if the sequence data has been loaded into HDFS, access the URL http://master:50070. The above steps demonstrate a single agent; the typical use case of Flume is to gather system logs from many web servers.

93. How do you implement an HDFS sink in Flume?

Ans:

The HDFS sink needs the file system URI, a path for creating files, and so on. All such component attributes must be set in the properties file of the hosting Flume agent. Finally, wire the pieces together: the Flume agent must know which individual components to load and how their connectivity constitutes the flow.

94. How do you send streaming data to HDFS?

Ans:

Tools are available to send streaming data to HDFS. To transfer streaming data (log files, events, etc.) from different sources to HDFS, several tools are available; a very famous tool we use is Scribe, which aggregates and streams log data.

95. What is the difference between HDFS and streaming?

Ans:

Streaming just implies that the system can provide a constant bitrate above a certain threshold when transferring data, as opposed to having the data come in bursts or waves. If HDFS is laid out for streaming, it will probably still support seeks, with a bit of overhead needed to cache the data for a constant stream.

96. Does HDFS support seek?

Ans:

If HDFS is laid out for streaming, it will probably still support seeks, with a bit of overhead needed to cache the data for a constant stream. Of course, depending on the system and network load, seeks might take a bit longer. HDFS stores data in large blocks, such as 64 MB.

97. How do you send data from agent_demo to HDFS?

Ans:

The agent_demo agent reads data from an external Avro client and then sends the data to HDFS through a memory channel. The config file weblogs.config makes the data flow from avro_src_1 to hdfs_cluster_1 through the memory channel mem_channel_1, as sketched below.
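
A hedged sketch of that weblogs.config, using the component names from the answer (the bind address, port, and HDFS path are placeholders):

  agent_demo.sources = avro_src_1
  agent_demo.channels = mem_channel_1
  agent_demo.sinks = hdfs_cluster_1
  # Avro source receiving data from the external Avro client
  agent_demo.sources.avro_src_1.type = avro
  agent_demo.sources.avro_src_1.bind = 0.0.0.0
  agent_demo.sources.avro_src_1.port = 4141
  agent_demo.sources.avro_src_1.channels = mem_channel_1
  # memory channel between source and sink
  agent_demo.channels.mem_channel_1.type = memory
  # HDFS sink writing the web log events
  agent_demo.sinks.hdfs_cluster_1.type = hdfs
  agent_demo.sinks.hdfs_cluster_1.hdfs.path = hdfs://namenode/flume/weblogs
  agent_demo.sinks.hdfs_cluster_1.channel = mem_channel_1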

98. How is data streamed off a hard drive?

Ans:

The data is "streamed" off the hard drive by maintaining the maximum I/O rate that the drive can sustain for these big blocks of data. HDFS is built around the idea that the most efficient data-processing pattern is the write-once, read-many-times pattern.

99. What is the use of a Flume sink?

Ans:

It is used for storing data into a centralized store like HDFS, HBase, etc. A sink consumes events from a Flume channel and pushes them on to the central repository; the component that removes events from a Flume agent and writes them to another Flume agent, some other system, or a data store is called a sink.

100. Which Flume channel is the most reliable channel to ensure no data loss?

Ans:

The Apache file channel is the most reliable and durable channel because it saves events on disk; in the case of a system failure, a JVM crash, or a system reboot, the events that were not yet transferred will be delivered when Flume is restarted.
