MapReduce Interview Questions and Answers

Last updated on 04th Oct 2020

About author

Sanjay (Sr Big Data DevOps Engineer )

Highly experienced in his industry domain, with 7+ years of experience. He has also been a technical blog writer for the past 4 years, sharing informative content for job seekers.

If you are into big data, you already know about the popularity of MapReduce. There is a huge demand for MapReduce professionals in the market. Whether you are a beginner or applying for a new position, going through the most popular MapReduce interview questions and answers can help you prepare for the MapReduce interview. So, without any delay, let’s jump into the questions.

1. What are the main components of MapReduce?

Ans:

The three main components of MapReduce are:

  • Main Driver Class: The Main Driver Class provides the job configuration parameters.
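  • Mapper Class: This class provides the map() method, which turns input records into intermediate key/value pairs.
  • Reducer Class: This class provides the reduce() method, which aggregates the intermediate values collected for each key into the final output.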

2. What are the configuration parameters required to be specified in MapReduce?

Ans:

The required configuration parameters that need to be specified are:

  • The job’s input and output location in HDFS
  • The input and output format
  • The classes containing the map and reduce functions
  • The .JAR file for driver, mapper, and reducer classes.

3. Define shuffling in MapReduce?

Ans:

Shuffling is the process of transferring data from the Mapper to the Reducer. It takes place between the map phase and the reduce phase of the framework.

4. What is meant by HDFS?

Ans:

HDFS stands for Hadoop Distributed File System. It is one of the most critical components in Hadoop architecture and is responsible for data storage.

5. What do you mean by a heartbeat in HDFS?

Ans:

Heartbeat is the signal sent by the datanode to the namenode to indicate that it’s alive. It is used to detect failures and ensure that the link between the two nodes is intact.

6. Discuss the main components of MapReduce job?

Ans:

There are three main components of a MapReduce job which are as follows:

  • Main Driver Class: It provides the parameters needed for job configuration.
  • Mapper Class: The mapper class provides the map() method. It extends the org.apache.hadoop.mapreduce.Mapper class.
  • Reducer Class: The reducer class provides the reduce() method. It extends the org.apache.hadoop.mapreduce.Reducer class.

7. What are the main configuration parameters specified in MapReduce?

Ans:

To work properly, MapReduce needs some configuration parameters to be set correctly. Without them set correctly, the map and reduce jobs will not function properly. The configuration parameters that need to be set correctly are as follows:

  • Job’s input location in HDFS.
  • Job’s output location in HDFS.
  • Input and Output format.
  • Classes that contain the map and reduce functions.
  • Last, but not the least, .jar file for reducer, mapper and driver classes.

8. Explain the basic parameters of mapper and reducer function?

Ans:

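Taking the classic word-count job as an example, the basic parameters of the mapper function are as below:

  • Input – LongWritable and Text (the key and the value).
  • Intermediate Output – Text and IntWritable.

Also, the basic parameters of the reducer function are:

  • Intermediate Input – Text and an iterable of IntWritable
  • Final Output – Text and IntWritable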

9. What is MapReduce?

Ans:

Referred to as the core of Hadoop, MapReduce is a programming framework for processing large sets of data, or big data, across thousands of servers in a Hadoop cluster. The concept of MapReduce is similar to other cluster scale-out data processing systems. The term MapReduce refers to the two important processes that a Hadoop program performs.

  • First is the map() job, which converts one set of data into another by breaking individual elements down into key/value pairs (tuples).
  • Then the reduce() job comes into play: the output from the map, i.e. the tuples, serves as its input and is combined into a smaller set of tuples. As the name suggests, the map job always occurs before the reduce job.

10. What is Partitioner and its usage?

Ans:

Partitioner is the phase that controls how the keys of the intermediate map output are partitioned, by default using a hash function. Partitioning determines which reducer each key-value pair of the map output is sent to. The number of partitions equals the number of reduce tasks for the job.

HashPartitioner is the default class in Hadoop. It implements the following function:

int getPartition(K key, V value, int numReduceTasks)

The function returns the partition number for the key, where numReduceTasks is the fixed number of reducers.
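The hash-then-modulo idea behind the default partitioner can be sketched in a few lines of Python. This is only an illustration and not Hadoop code: zlib.crc32 stands in for Java’s hashCode(), and the function names are hypothetical.

```python
import zlib


def get_partition(key: str, num_reduce_tasks: int) -> int:
    """Return the index of the reducer that should receive this key."""
    # zlib.crc32 is an assumed stand-in for Java's key.hashCode(); unlike
    # Python's built-in hash() for strings, it is stable across runs.
    key_hash = zlib.crc32(key.encode("utf-8"))
    # Mask to a non-negative 31-bit value, then take the remainder.
    return (key_hash & 0x7FFFFFFF) % num_reduce_tasks


def partition_pairs(pairs, num_reduce_tasks):
    """Distribute (key, value) pairs into one bucket per reducer."""
    partitions = [[] for _ in range(num_reduce_tasks)]
    for key, value in pairs:
        partitions[get_partition(key, num_reduce_tasks)].append((key, value))
    return partitions
```

Because the partition depends only on the key, every pair with the same key lands on the same reducer, which is what makes per-key aggregation possible.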

11. What is Identity Mapper and Chain Mapper?

Ans:

Identity Mapper is the default Mapper class provided by Hadoop. When no other Mapper class is defined, IdentityMapper is executed. It only writes the input data to the output and does not perform any computation on it.

The class name is org.apache.hadoop.mapred.lib.IdentityMapper.

Chain Mapper lets a set of simple Mapper classes run as a chain within a single map task. The output of the first mapper becomes the input of the second mapper, the second mapper’s output becomes the input of the third, and so on until the last mapper.

The class name is org.apache.hadoop.mapreduce.lib.chain.ChainMapper.

12. What main configuration parameters are specified in MapReduce?

Ans:

The MapReduce programmers need to specify following configuration parameters to perform the map and reduce jobs:

  • The input location of the job in HDFS.
  • The output location of the job in HDFS.
  • The input’s and output’s format.
  • The classes containing map and reduce functions, respectively.
  • The .jar file for mapper, reducer and driver classes

13. Name Job control options specified by MapReduce?

Ans:

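Since the framework supports chained operations, in which the output of one job serves as the input of another, job controls are needed to govern these complex operations. The main job control options are:

  • Job.submit(): submits the job to the cluster and returns immediately.
  • Job.waitForCompletion(boolean): submits the job to the cluster and waits for it to complete.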

14. What is the difference between HDFS block and InputSplit?

Ans:

An HDFS block splits data into physical divisions while InputSplit in MapReduce splits input files logically.

InputSplit controls the number of mappers, and the split size can be defined by the user. The HDFS block size, on the other hand, is a cluster-level setting: the default is 64 MB in Hadoop 1 (128 MB in Hadoop 2 and later), so with 64 MB blocks 1 GB of data is stored as 1 GB / 64 MB = 16 blocks. If the user does not define the input split size, it defaults to the HDFS block size.

15. What is Text Input Format?

Ans:

TextInputFormat is the default InputFormat for plain text files. Files with a .gz extension are also read, but a gzip-compressed file cannot be split, so it is processed by a single mapper. In TextInputFormat, files are broken into lines: the key is the byte offset of the line within the file and the value is the text of the line. Programmers can also write their own InputFormat.

16. What is JobTracker?

Ans:

JobTracker is a Hadoop service used for processing MapReduce jobs in the cluster. It submits jobs to specific nodes that hold the data and tracks them. Only one JobTracker runs on a Hadoop cluster, in its own JVM process. If the JobTracker goes down, all jobs halt.

17. Explain job scheduling through JobTracker?

Ans:

JobTracker communicates with the NameNode to identify the data location and submits the work to TaskTracker nodes. The TaskTracker plays a major role, as it notifies the JobTracker of any task failure. It also sends heartbeat messages that reassure the JobTracker it is still alive. The JobTracker then decides what to do: it may resubmit the task elsewhere, mark a specific record as unreliable, or blacklist the TaskTracker.

18. What is SequenceFileInputFormat?

Ans:

SequenceFileInputFormat is the input format for reading sequence files, a binary file format that can be compressed. It extends FileInputFormat. Sequence files are commonly used to pass data between MapReduce jobs, from the output of one job to the input of another.

19. What is RecordReader in a Map Reduce?

Ans:

RecordReader reads key/value pairs from the InputSplit, converting its byte-oriented view of the input into the record-oriented view presented to the Mapper.

20. Define Writable data types in MapReduce?

Ans:

Hadoop reads and writes data in serialized form through the Writable interface. The Writable interface has several implementing classes, such as Text (for String data), IntWritable, LongWritable, FloatWritable and BooleanWritable. Users are free to define their own Writable classes as well.

21. What is a “map” in Hadoop?

Ans:

In Hadoop, a map is the first processing phase of a MapReduce job. A map reads data from an input location and outputs key-value pairs according to the input type.

22. What is a “reducer” in Hadoop?

Ans:

In Hadoop, a reducer collects the output generated by the mapper, processes it, and creates a final output of its own.

23. List the different configuration parameters that are needed to do the job of MapReduce framework

Ans:

The user has to specify the following parameters:

1. The input location of the job’s data in the file system

2. The output location of the job’s data in the file system

3. The format of the input

4. The format of the output

5. The class that contains the mapper function

6. The class that contains the reducer function

7. The JAR file that contains the mapper and reducer classes

24. How can the reducers communicate?

Ans:

Different reducers can’t communicate with each other. They work in isolation.

25. Define Text Input Format?

Ans:

TextInputFormat is the default input format for text files. It breaks the files into lines. The value is the text of the line and the key is the position (byte offset) of that line in the file.

26. Splitting 100 lines of input in the form of a single split in MapReduce. Is this possible?

Ans:

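Yes. Treating a fixed number of input lines, such as 100, as a single split is possible with the NLineInputFormat class, which places N lines of input in each split. The number of lines per split is set with the mapreduce.input.lineinputformat.linespermap property.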

27. What are the differences between a reducer and a combiner?

Ans:

A combiner performs the reduce step locally, on the output of a single map task, before that output is sent across the network. Like a reducer it aggregates values by key, but its output becomes the reducer’s input rather than the final result. It is mainly used for network optimization when the map tasks produce a large number of output records. A combiner also differs from a reducer in several ways. A reducer has no restriction on its types, whereas a combiner’s input and output key and value types must both match the output types of the mapper. A combiner can only be used when the function is commutative and associative, because it operates on subsets of the keys and values. A combiner gets its input from a single mapper, whereas a reducer gets its input from many mappers.

28. In MapReduce, when is the best time to use a combiner?

Ans:

A combiner increases the efficiency of a MapReduce job. It aggregates the data locally and so reduces the volume of data transferred to the reducers. The reducer code can be reused as the combiner when the function is commutative and associative.
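The effect of local aggregation can be illustrated with a small word-count sketch in plain Python. It is not Hadoop code; map_words and combine are hypothetical names chosen for this example.

```python
from collections import defaultdict


def map_words(line):
    """Emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Sum the counts per key on the mapper side, before the shuffle."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


mapper_output = map_words("to be or not to be")  # six records
combined = combine(mapper_output)                # four records to transfer
```

Here the mapper emits six records but only four leave the node after combining. Summation is commutative and associative, which is why the same logic could also serve as the reducer.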

29. What is the meaning of the term heartbeat which is used on HDFS?

Ans:

A heartbeat is the signal used in HDFS and passed between two types of nodes: from the DataNodes to the NameNode. The same mechanism is used in MapReduce between the TaskTracker and the JobTracker. If the heartbeat stops arriving, the NameNode or JobTracker concludes that there is a problem with the node or tracker that should have sent it.

30. Differentiate between MapReduce and PIG.

Ans:

Pig is a data flow language that manages the flow of data from one source to another. It also handles storing the data and can help with compressing it. Pig rearranges the steps for faster and better processing. Pig scripts are translated into MapReduce jobs, so the output of those jobs is managed by Pig. Some operations that would otherwise need hand-written MapReduce code are built into Pig. These include grouping, ordering and counting data.

MapReduce is a framework for developers to write code in. It is a data processing paradigm that separates the concerns of two types of developers: those who write the application and those who scale it.

31. Can you tell us about the distributed cache in MapReduce?

Ans:

A distributed cache is a service offered by the MapReduce framework to cache files such as text, jars, etc., needed by applications.

32. What do you mean by a combiner?

Ans:

Combiner is an optional class that accepts input from the Map class and passes the output key-value pairs to the Reducer class. It is used to increase the efficiency of the MapReduce program. However, the execution of the combiner is not guaranteed.

33. How would you split data into Hadoop?

Ans:

Splits are created with the help of the InputFormat. Once the splits are created, the number of mappers is decided based on the total number of splits. The splits are created according to the programming logic defined within the getSplits() method of InputFormat, and it is not bound to the HDFS block size.

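When the job requests a number of map tasks, the target split size is calculated with the following formula. The result is still bounded by the configured minimum split size.

Split size = input file size / number of map tasks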

34. What is distributed Cache in MapReduce Framework?

Ans:

Distributed cache is an important part of the MapReduce framework. It is used to cache files across operations during execution and ensures that tasks are performed faster. The framework uses the distributed cache to store important files that are frequently required to execute tasks at a particular node.

35. Why Are The Compute Nodes And The Storage Nodes The Same?

Ans:

Compute nodes process the data and storage nodes store it. By default, the Hadoop framework tries to minimize network traffic, and to achieve that goal it follows the data locality concept. The compute code executes where the data is stored, so the data node and the compute node are the same.

36. When Do We Use Partitions?

Ans:

By default, Hive reads the entire dataset even when the application needs only a slice of the data. This is a bottleneck for MapReduce jobs. So Hive offers a special option called partitions. When you create a table, Hive partitions it based on the columns you specify.

37. What Are The Important Steps When You Are Partitioning A Table?

Ans:

  • Don’t over-partition the data into too many small partitions; it is overhead for the NameNode.
  • With dynamic partitioning in strict mode, at least one static partition must exist. To make every partition dynamic, enable dynamic partitioning and set the mode to nonstrict with the following commands.

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

  • First load the data into a non-partitioned table, then load it from there into the partitioned table. With dynamic partitioning it is not possible to load data directly from a local file into the partitioned table.

INSERT OVERWRITE TABLE table_name PARTITION(year) SELECT * FROM non_partitioned_table;

38. Can You Elaborate On The MapReduce Job Architecture?

Ans:

  • First, the Hadoop programmer submits the MapReduce program to the JobClient.
  • The JobClient requests a job ID from the JobTracker. The JobTracker provides a unique job ID in the form Job_HadoopStartedtime_00001.
  • Once the JobClient receives the job ID, it copies the job resources (job.xml, job.jar) to the file system (HDFS) and submits the job to the JobTracker. The JobTracker initializes the job and schedules it.
  • Based on the configuration, the input is divided into input splits, and the split information is also written to HDFS. The TaskTracker retrieves the job resources from HDFS and launches a child JVM. The map and reduce tasks run in this child JVM and report their status to the JobTracker.

39. Why Does The TaskTracker Launch A Child JVM?

Ans:

Hadoop developers often submit jobs that are wrong or contain bugs by mistake. If the TaskTracker ran tasks in its own JVM, such a task could bring down the main JVM and affect other tasks. With a child JVM, if a task misbehaves or damages resources, the TaskTracker kills that child JVM and retries the task in a new child JVM.

40. Why Do The JobClient And JobTracker Submit Job Resources To The File System?

Ans:

Data locality. Moving computation is cheaper than moving data. The logic (computation) is in the JAR file and the splits, while the data is on the DataNodes of the file system. So the resources are copied to where the data is available.

41. How Many Mappers And Reducers Can Run?

Ans:

By default, Hadoop can run 2 mappers and 2 reducers on one DataNode, because each node has 2 map slots and 2 reduce slots. These defaults can be changed in mapred-site.xml in the conf directory.

42. What Is InputSplit?

Ans:

The chunk of data processed by a single mapper is called an InputSplit. In other words, it is the logical chunk of data that a single mapper processes. By default, InputSplit size = block size.

43. How To Configure The Split Value?

Ans:

By default the block size is 64 MB, but to process the data, the framework divides it into splits. Hadoop architects use either of these formulas to work out the split size. They give the same result as long as min_split_size does not exceed max_split_size.

  • split size = min(max_split_size, max(block_size, min_split_size));
  • split size = max(min_split_size, min(block_size, max_split_size));

By default, split size = block size.
The number of splits always equals the number of mappers.
Applying the above formulas with block_size = 64 MB, min_split_size = 512 KB and max_split_size = 10 GB (an assumed value; it depends on the environment and may be 1 GB or 10 GB):

  • split size = min(10 GB, max(64 MB, 512 KB)) = min(10 GB, 64 MB) = 64 MB
  • split size = max(512 KB, min(64 MB, 10 GB)) = max(512 KB, 64 MB) = 64 MB
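The second formula can be checked with a short Python sketch. The function name and the constants are assumptions made for this illustration; the 10 GB maximum is the same assumed value used in the worked example.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def compute_split_size(block_size, min_split_size, max_split_size):
    """split size = max(min_split_size, min(block_size, max_split_size))."""
    return max(min_split_size, min(block_size, max_split_size))


# Values from the worked example: the split size equals the block size.
default_split = compute_split_size(64 * MB, 512 * KB, 10 * GB)

# Raising the minimum above the block size produces larger splits,
# and lowering the maximum below the block size produces smaller ones.
larger_split = compute_split_size(64 * MB, 128 * MB, 10 * GB)
smaller_split = compute_split_size(64 * MB, 512 * KB, 32 * MB)
```

This shows the practical use of the two bounds: the minimum is the knob for fewer, larger splits, and the maximum is the knob for more, smaller splits.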

44. How Much RAM Is Required To Process 64 MB Of Data?

Ans:

Let’s assume a 64 MB block size and that the system runs 2 mappers and 2 reducers, so 64 × 4 = 256 MB of memory. The OS takes at least 30% extra, about 77 MB, so at least 256 + 77 = 333 MB of RAM is required to process a chunk of data. Unstructured data needs more memory than this to process.

45. What Is The Difference Between Block And Split?

Ans:

  • Block: the chunk of data as it is physically stored on disk is called a block.
  • Split: the chunk of data that is processed together by one mapper is called a split.

46. Why Does The Hadoop Framework Read A File In Parallel And Not Sequentially?

Ans:

Hadoop reads data in parallel to retrieve it faster. Writes, however, are done in sequence and not in parallel, because with parallel writes one node’s data could be overwritten by another’s. Parallel processes are independent, with no relation between the two nodes, so if the data were written in parallel, a block would not know where the next chunk of the data is. For example, when 100 MB of data is written, it becomes one block of 64 MB and another of 36 MB; if they were written in parallel, the first block would not know where the remaining data is. So Hadoop reads in parallel and writes sequentially.

47. If I Change The Block Size From 64 To 128, What Happens?

Ans:

Changing the block size does not affect existing data. After the change, every new file is chunked into 128 MB blocks. This means old data stays in 64 MB blocks, while new data is stored in 128 MB blocks.

48. What Is isSplitable()?

Ans:

By default it returns true. It is the InputFormat method that decides whether the input data can be split. For unstructured data it is not advisable to split the data, so the entire file is processed as one split. To do this, override isSplitable() to return false.

49. What Are The Maximum And Minimum Block Sizes That Hadoop Allows?

Ans:

Minimum: 512 bytes. This is the block size of the local OS file system, and the block size cannot be set lower than that.

Maximum: Depends on the environment. There is no upper bound.

50. What Are The Job Resource Files?

Ans:

job.xml and job.jar are the core resources needed to process the job. The JobClient copies these resources to HDFS.

51. What Does A MapReduce Job Consist Of?

Ans:

A MapReduce job is a unit of work that the client wants performed. It consists of the input data, the MapReduce program in a JAR file and the configuration settings in XML files. Hadoop runs the job by dividing it into different tasks with the help of the JobTracker.

52. What Is The Data Locality?

Ans:

Data locality means processing the data where it resides: the computation runs on the node where the data is available. “Moving computation is cheaper than moving data”, and following data locality achieves this goal. It is possible when the data is splittable, which is true by default.

53. What Is Speculative Execution?

Ans:

Hadoop runs on commodity hardware, so systems may fail or run low on memory, and a task on such a system fails or slows down with it. Speculative execution is a performance optimization technique for this. The same computation is launched on multiple systems, and the result is taken from whichever system finishes first. By default it is enabled. Even if one system crashes, it is not a problem, because the framework uses the result from another system.
E.g.: the same logic is launched on systems A, B, C and D.
Systems A, B, C and D would take 10, 8, 9 and 12 minutes respectively, running simultaneously. System B finishes first, so its result is used, and the framework takes care of killing the processes on the other systems.

54. When Do We Need A Reducer?

Ans:

A reducer is needed only when sort and shuffle are required; otherwise there is no need for partitioning either. A filter, for example, needs no sort and shuffle, so it can be done without a reducer (a map-only job).

55. What Is Chain Mapper?

Ans:

ChainMapper is a special class that runs a set of mapper classes in a chain within a single map task. The output of one mapper acts as the input of the next, and in this way any number of mappers can be connected in a chain.
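The chaining idea can be sketched in plain Python as a list of mapper functions applied one after another. This is not the Hadoop API; run_chain and the three example mappers are hypothetical names used only for illustration.

```python
def run_chain(mappers, key, value):
    """Feed each mapper's output records to the next mapper in the chain."""
    records = [(key, value)]
    for mapper in mappers:
        next_records = []
        for k, v in records:
            # A mapper may emit zero, one or many records per input record.
            next_records.extend(mapper(k, v))
        records = next_records
    return records


def tokenize(key, line):
    return [(word, 1) for word in line.split()]


def lowercase(key, value):
    return [(key.lower(), value)]


def drop_short(key, value):
    return [(key, value)] if len(key) > 2 else []


result = run_chain([tokenize, lowercase, drop_short], 0, "The cat is ON")
```

All three steps run inside what would be a single map task, so no intermediate data has to be written out between them.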

56. How To Do Value-Level Comparison?

Ans:

Hadoop sorts and compares at the key level only, not at the value level. To compare values, make the value part of a composite key, a technique usually called secondary sort.
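A common workaround for key-only comparison is to build a composite key from the natural key and the value, so that sorting the keys also orders the values. The Python sketch below only simulates that idea; in a real job it would also need a custom partitioner and grouping comparator, which are not shown. The function name and sample data are hypothetical.

```python
from itertools import groupby


def secondary_sort(records):
    """Order the values of each key by sorting on a (key, value) composite."""
    # Sorting tuples compares the natural key first and the value second.
    composite_keys = sorted((key, value) for key, value in records)
    # Group on the natural key only, as a grouping comparator would.
    return [
        (key, [value for _, value in group])
        for key, group in groupby(composite_keys, key=lambda kv: kv[0])
    ]


readings = [("sf", 22), ("la", 16), ("sf", 32), ("la", 2)]
ordered = secondary_sort(readings)
```

Each key now arrives with its values already in ascending order, without the reducer having to buffer and sort them itself.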

57. What Are The Setup And Cleanup Methods?

Ans:

If you don’t know where the processing of a split starts and ends, some problems are very difficult to solve. The setup and cleanup methods resolve this. With N blocks there are N splits, and by default one mapper is called for each split. Each mapper calls setup once at the start and cleanup once at the end, while map is called once per record (line). Setup initializes the resources the task needs.
The purpose of cleanup is to close those resources. Map processes the data. Once the last map call is completed, cleanup is called. The same methods are available in the reducer as well. Use them when you need to compare one key-value pair with another at the record level. A resource is opened once, used many times and closed once, which improves performance and saves a lot of network traffic during processing.
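The open-once, process-many, close-once lifecycle can be sketched in Python. The class below only mimics the order of calls; it is not the Hadoop Mapper class, and the lookup table is a hypothetical resource invented for the example.

```python
class LookupMapper:
    """Toy mapper that records the order in which its methods are called."""

    def __init__(self):
        self.events = []
        self.lookup = None

    def setup(self):
        # Called once per task: open the (hypothetical) shared resource.
        self.lookup = {"a": "alpha"}
        self.events.append("setup")

    def map(self, key, value):
        # Called once per record: reuse the resource opened in setup().
        self.events.append("map")
        return (value, self.lookup.get(value, "unknown"))

    def cleanup(self):
        # Called once per task: release the resource.
        self.lookup = None
        self.events.append("cleanup")


def run_task(mapper, records):
    """Drive one task: setup, then map per record, then cleanup."""
    mapper.setup()
    try:
        return [mapper.map(key, value) for key, value in records]
    finally:
        mapper.cleanup()
```

Running run_task(LookupMapper(), [(0, "a"), (2, "b")]) calls setup once, map twice and cleanup once, in that order.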

58. How Many Slots Are Allocated For Tasks?

Ans:

By default each node has 2 slots for map tasks and 2 slots for reduce tasks. So each node has 4 slots to process the data.

59. Why Does The TaskTracker Launch A Child JVM To Do A Task? Why Not Use The Existing JVM?

Ans:

Sometimes child threads corrupt parent threads. This means a programmer’s mistake could disrupt the entire MapReduce task. So the TaskTracker launches a child JVM to process each individual map or reduce task. If the TaskTracker used its existing JVM, a faulty task might damage the main JVM. If a bug occurs, the TaskTracker kills the child process and launches another child JVM to do the same task. Usually a failed task is attempted up to 4 times.

60. What Are The Main Components Of Mapreduce Job?

Ans:

  • Main Driver Class: providing job configuration parameters
  • Mapper Class: must extend org.apache.hadoop.mapreduce.Mapper class and performs execution of map() method
  • Reducer Class: must extend org.apache.hadoop.mapreduce.Reducer class

61. What Is Identity Mapper?

Ans:

Identity Mapper is the default Mapper class provided by Hadoop. When no other Mapper class is defined, IdentityMapper is executed. It only writes the input data to the output and does not perform any computation on it. The class name is org.apache.hadoop.mapred.lib.IdentityMapper.

62. What is Streaming?

Ans:

Streaming is a feature of the Hadoop framework that allows us to write MapReduce programs in any programming language that can read standard input and write standard output. It could be Perl, Python or Ruby, and need not be Java. However, customization of MapReduce itself can only be done in Java and not in any other programming language.

63. What is the difference between an HDFS Block and Input Split?

Ans:

HDFS Block is the physical division of the data and Input Split is the logical division of the data.

64. What happens in TextInputFormat?

Ans:

In TextInputFormat, each line in the text file is a record. The key is the byte offset of the line and the value is the content of the line. For instance, key: LongWritable, value: Text.
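The (byte offset, line) records can be illustrated with a small Python generator. It is only a sketch of the record layout, not the Hadoop RecordReader, and read_records is a hypothetical name.

```python
def read_records(data: bytes):
    """Yield (byte_offset, line_text) pairs, one per line of the input."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # The key is the offset of the first byte of the line; the value is
        # the line content without its trailing newline characters.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)


records = list(read_records(b"hello\nbig data\nend"))
```

The keys are 0, 6 and 15 here: each one is the number of bytes that precede the line, newline characters included.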

65. Can a MapReduce program be written in any language other than Java?

Ans:

Yes, MapReduce programs can be written in many programming languages: Java, R, C++ and scripting languages (Python, PHP). Any language able to read from stdin, write to stdout and parse tab and newline characters should work. Hadoop Streaming (a Hadoop utility) allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.
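As an illustration of the tab-and-newline convention, here is a word-count mapper and reducer written as plain Python functions over lines of text. The function names are hypothetical; in an actual streaming job each would live in its own script, read sys.stdin and print its results.

```python
from itertools import groupby


def mapper(lines):
    """Turn each input line into 'word<TAB>1' output lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same word."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


mapped = list(mapper(["a b a\n"]))
# The framework sorts map output by key before it reaches the reducer;
# sorted() stands in for that step here.
reduced = list(reducer(sorted(mapped)))
```

The reducer relies on its input being sorted by key, which is why it only needs to compare each line with the one before it.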

66. What does a Mapper do?

Ans:

The Mapper runs in the first phase of a job, the Map phase, and processes a map task. It reads key/value pairs and emits key/value pairs. Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records. A given input pair may map to zero or many output pairs.

67. What if the JobTracker machine is down?

Ans:

In Hadoop 1.0, the JobTracker is a single point of failure, which means that if the JobTracker fails, all jobs must restart and the overall execution flow is interrupted. Due to this limitation, in Hadoop 2.0 the JobTracker concept is replaced by YARN. In YARN, the terms JobTracker and TaskTracker have disappeared. YARN splits the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate daemons (components):

  • ResourceManager (global)
  • ApplicationMaster (per application)
  • NodeManager (node specific)

68. What happens when a datanode fails?

Ans:

When a datanode fails:

  • Jobtracker and namenode detect the failure
  • On the failed node all tasks are re-scheduled

69. What are the three modes in which Hadoop can be run?

Ans:

The three modes in which Hadoop can be run are:

  1. Pseudo-distributed mode
  2. Standalone (local) mode
  3. Fully distributed mode

70. What does the text input format do?

Ans:

The text input format creates, for each line, a key that is the byte offset of the line within the file. The value is the whole line of text. The mapper receives the value as a Text parameter and the key as a LongWritable parameter.

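71. How many InputSplits are made by the Hadoop framework?

Ans:

Assuming a 64 MB block size and three input files of 64 KB, 65 MB and 127 MB, and counting one split per block, Hadoop will make 5 splits:

  • 1 split for the 64 KB file
  • 2 splits for the 65 MB file
  • 2 splits for the 127 MB file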
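The count of 5 follows from counting one split per 64 MB block, which the sketch below reproduces. As far as I understand FileInputFormat, it also lets the last split run up to 10% over the split size (a slop factor of 1.1), so the second function models that behaviour as an assumption; the function names are hypothetical.

```python
import math

KB = 1024
MB = 1024 * KB
SPLIT_SLOP = 1.1  # assumed 10% tolerance for the final split


def count_splits_by_block(file_size, block_size):
    """One split per block, rounding a partial block up."""
    return math.ceil(file_size / block_size)


def count_splits_with_slop(file_size, split_size):
    """Cut full-size splits while more than 1.1 split sizes remain."""
    splits = 0
    remaining = file_size
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    if remaining > 0:
        splits += 1  # the tail becomes the last split
    return splits


files = [64 * KB, 65 * MB, 127 * MB]
by_block = [count_splits_by_block(size, 64 * MB) for size in files]
with_slop = [count_splits_with_slop(size, 64 * MB) for size in files]
```

Under the simple rule the counts are 1, 2 and 2. Under the slop assumption the 65 MB file fits in a single split, so an interviewer may accept either answer if the reasoning is stated.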

72. Mention what is distributed cache in Hadoop?

Ans:

Distributed cache in Hadoop is a facility provided by MapReduce framework.  At the time of execution of the job, it is used to cache file.  The Framework copies the necessary files to the slave node before the execution of any task at that node.

73. How does the Hadoop classpath play a vital role in stopping or starting Hadoop daemons?

Ans:

The classpath consists of a list of directories containing the JAR files needed to stop or start the daemons.

74. Compare MapReduce and Spark?

Ans:

Criteria | MapReduce | Spark
Processing speed | Good | Exceptional
Standalone mode | Needs Hadoop | Can work independently
Ease of use | Needs extensive Java programming | APIs for Python, Java, & Scala
Versatility | Not optimized for real-time & machine learning applications | Real-time & machine learning applications

75. Illustrate a simple example of the working of MapReduce.

Ans:

Let’s take a simple example to understand how MapReduce works. In real projects and applications this will be more elaborate and complex, as the data we deal with in Hadoop and MapReduce is extensive and massive.

Assume you have five files, and each file consists of key/value pairs in two columns: a city name and a temperature recorded there. Here, the name of the city is the key and the temperature is the value.

It is important to note that each file may contain data for the same city multiple times. Now, out of this data, we need to calculate the maximum temperature for each city across these five files. The MapReduce framework will divide the work into five map tasks; each map task processes one of the five files and returns the maximum temperature for each city in that file.

(San Francisco, 22)(Los Angeles, 16)(Vancouver, 30)(London, 25)

Similarly, each mapper does the same for the other four files and produces intermediate results, for instance like those below.

(San Francisco, 32)(Los Angeles, 2)(Vancouver, 8)(London, 27)

(San Francisco, 29)(Los Angeles, 19)(Vancouver, 28)(London, 12)

(San Francisco, 18)(Los Angeles, 24)(Vancouver, 36)(London, 10)

(San Francisco, 30)(Los Angeles, 11)(Vancouver, 12)(London, 5)

These intermediate results are then passed to the reduce job, where the input from all files is combined to output a single value for each city. The final results here would be:

(San Francisco, 32)(Los Angeles, 24)(Vancouver, 36)(London, 27)
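The reduce step of this example can be reproduced with a short Python simulation that uses the five intermediate results listed above. It is a sketch of the logic only, not Hadoop code, and reduce_max is a hypothetical name.

```python
from collections import defaultdict

# The five intermediate results produced by the five map tasks.
mapper_outputs = [
    [("San Francisco", 22), ("Los Angeles", 16), ("Vancouver", 30), ("London", 25)],
    [("San Francisco", 32), ("Los Angeles", 2), ("Vancouver", 8), ("London", 27)],
    [("San Francisco", 29), ("Los Angeles", 19), ("Vancouver", 28), ("London", 12)],
    [("San Francisco", 18), ("Los Angeles", 24), ("Vancouver", 36), ("London", 10)],
    [("San Francisco", 30), ("Los Angeles", 11), ("Vancouver", 12), ("London", 5)],
]


def reduce_max(outputs):
    """Group the temperatures by city, then keep the maximum per city."""
    grouped = defaultdict(list)
    for output in outputs:
        for city, temperature in output:
            grouped[city].append(temperature)
    return {city: max(temps) for city, temps in grouped.items()}


final_results = reduce_max(mapper_outputs)
```

Taking the maximum is commutative and associative, which is why each mapper can safely return a per-file maximum before the reduce step.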

76. What is Shuffling and Sorting in MapReduce?

Ans:

Shuffling and Sorting are two major processes that operate together between the map and reduce phases.

The process of transferring data from the Mapper to the Reducer is Shuffling. It is mandatory for the reducers, because the shuffled data serves as the input for the reduce tasks.

In MapReduce, the output key-value pairs between the map and reduce phases (after the mapper) are automatically sorted before moving to the Reducer. This feature is helpful in programs where you need sorting at some stages. It also saves the programmer’s overall time.
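What the reducer sees after shuffling and sorting can be sketched in Python: the outputs of all mappers are merged, sorted by key and grouped so that each key arrives once with all of its values. The function name is hypothetical and the code only simulates the behaviour.

```python
from itertools import groupby


def shuffle_and_sort(map_outputs):
    """Merge mapper outputs, sort by key and group the values per key."""
    merged = [pair for output in map_outputs for pair in output]
    # Sort on the key only; list.sort is stable, so values keep their order.
    merged.sort(key=lambda kv: kv[0])
    return [
        (key, [value for _, value in group])
        for key, group in groupby(merged, key=lambda kv: kv[0])
    ]


reducer_input = shuffle_and_sort([
    [("b", 1), ("a", 1)],  # output of the first mapper
    [("a", 1), ("c", 1)],  # output of the second mapper
])
```

The reducer is then called once per key, in key order, with the grouped values.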

77. What is InputFormat in Hadoop?

Ans:

Another important feature in MapReduce programming, InputFormat defines the input specifications for a job. It performs the following functions:

  • Validates the input-specification of job.
  • Splits the input file(s) into logical instances called InputSplits. Each of these splits is then assigned to an individual Mapper.
  • Provides implementation of RecordReader to extract input records from the above instances for further Mapper processing

78. Explain job scheduling through JobTracker.

Ans:

JobTracker communicates with the NameNode to identify the data location and submits the work to TaskTracker nodes. The TaskTracker plays a major role, as it notifies the JobTracker of any task failure. It also sends heartbeat messages that reassure the JobTracker it is still alive. The JobTracker then decides what to do: it may resubmit the task elsewhere, mark a specific record as unreliable, or blacklist the TaskTracker.

79. How to set mappers and reducers for Hadoop jobs?

Ans:

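Users can set the number of mappers and reducers on the JobConf object. The number of mappers is only a hint to the framework, since the actual number depends on the number of input splits.

  • job.setNumMapTasks(int)
  • job.setNumReduceTasks(int)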

80. Explain JobConf in MapReduce?

Ans:

JobConf is the primary interface for describing a map-reduce job to the Hadoop framework for execution. JobConf specifies the Mapper, Combiner, Partitioner, Reducer, InputFormat and OutputFormat implementations, and other advanced job facets such as Comparators.

81. What is a MapReduce Combiner?

Ans:

Also known as a semi-reducer, the Combiner is an optional class that combines the map output records that share the same key. The main function of a combiner is to accept input from the Map class and pass the combined key-value pairs to the Reducer class.

82. What is OutputCommitter?

Ans:

OutputCommitter describes the commit of task output for a MapReduce job. FileOutputCommitter is the default OutputCommitter class in MapReduce. It performs the following operations:

  • Creates a temporary output directory for the job during initialization.
  • Cleans up the job, i.e. removes the temporary output directory after job completion.
  • Sets up the task’s temporary output.
  • Identifies whether a task needs a commit, and applies the commit if required.
  • JobSetup, JobCleanup and TaskCleanup are important tasks during output commit.

83. Is it possible to search for files using wildcards?

Ans:

Yes, it is possible to search for files using wildcards.
