HDFS Interview Questions and Answers

Last updated on 5th Oct 2020

About author

Krishnakumar (Lead Engineer - Director Level )

He has deep expertise in his industry domain, with 10+ years of experience, and has been writing technical blogs for the past 4 years to share useful knowledge with job seekers.


The Hadoop Distributed File System (HDFS) stores very large datasets. As the most important component of the Hadoop architecture, it is also one of the most important interview topics. In this blog, we provide 50+ Hadoop HDFS interview questions and answers, framed by our company experts who provide training in Hadoop and other Big Data frameworks.

1.RDBMS vs Hadoop


Data volume – An RDBMS cannot store and process very large amounts of data. Hadoop works better at scale: it can easily store and process far larger volumes than an RDBMS.

Throughput – An RDBMS fails to achieve high throughput; Hadoop achieves high throughput.

Data variety – In an RDBMS the schema of the data is known, and it handles only structured data. Hadoop stores any kind of data: structured, unstructured, or semi-structured.

Data processing – An RDBMS supports OLTP (Online Transaction Processing); Hadoop supports OLAP (Online Analytical Processing).

Read/write speed – Reads are fast in an RDBMS because the schema of the data is already known. Writes are fast in Hadoop because no schema validation happens during an HDFS write.

Schema on read vs. write – An RDBMS follows a schema-on-write policy; Hadoop follows a schema-on-read policy.

Cost – An RDBMS is licensed software; Hadoop is a free and open-source framework.

2. Explain Big data and its characteristics?


Big Data refers to a large amount of data that exceeds the processing capacity of conventional database systems and requires a special parallel processing mechanism. This data can be either structured or unstructured data.

Characteristics of Big Data:

Volume – The amount of data, which is growing at an exponential rate, i.e., from gigabytes to petabytes, exabytes, and beyond.

Velocity – Velocity refers to the rate at which data is generated, modified, and processed. At present, Social media is a major contributor to the velocity of growing data.

Variety – It refers to data coming from a variety of sources like audios, videos, CSV, etc. It can be either structured, unstructured, or semi-structured.

Veracity – Veracity refers to imprecise or uncertain data.

Value – This is the most important element of big data: the worth that can be extracted from it. Raw data is useful to an organization only when it can be accessed and turned into quality analytics.

3.What is Hadoop and list its components?


Hadoop is an open-source framework used to store large datasets and run applications across clusters of commodity hardware.

It offers extensive storage for any type of data and can handle endless parallel tasks.

Core components of Hadoop:

Storage unit– HDFS (DataNode, NameNode)

Processing framework– YARN (NodeManager, ResourceManager)

4. What is YARN and explain its components?


Yet Another Resource Negotiator (YARN) is one of the core components of Hadoop and is responsible for managing resources for the various applications operating in a Hadoop cluster, and also schedules tasks on different cluster nodes.

YARN components:

Resource Manager – It runs as the master daemon and controls resource allocation in the cluster.

Node Manager – It runs as a slave daemon on each node and is responsible for executing the tasks on every single DataNode.

Application Master – It maintains the user job lifecycle and resource requirements of individual applications. It operates along with the Node Manager and controls the execution of tasks.

Container – It is a combination of resources such as Network, HDD, RAM, CPU, etc., on a single node.

5. What is the difference between a regular file system and HDFS?


Regular file system – Small block size (typically 512 bytes); reading a large file requires multiple disk seeks.

HDFS – Large block size (on the order of 128 MB); data is read sequentially after a single seek.

6. What are the Hadoop daemons and explain their roles in a Hadoop cluster?


Generally, the daemon is nothing but a process that runs in the background. Hadoop has five such daemons. They are: 

NameNode – It is the master node, responsible for storing the metadata of all the directories and files.

DataNode – It is the slave node, responsible for storing the actual data.

Secondary NameNode – It periodically merges the NameNode's edit log into the FsImage checkpoint. It is a helper to the NameNode, not a hot backup.

JobTracker – It is used for creating and running jobs. It runs on the master node and allocates jobs to the TaskTrackers.

TaskTracker – It runs on the DataNodes. It executes the tasks and reports their status to the JobTracker.

7.What is Avro Serialization in Hadoop?


  • The process of translating the state of objects or data structures into binary or textual form is called Avro serialization. Avro schemas are language-independent and written in JSON.
  • It provides AvroMapper and AvroReducer for running MapReduce programs.

8.How can you skip the bad records in Hadoop?


Hadoop provides the SkipBadRecords class for skipping bad records while processing map inputs.

9.Explain HDFS and its components?


  • HDFS (Hadoop Distributed File System) is the primary data storage unit of Hadoop. 
  • It stores various types of data as blocks in a distributed environment and follows master and slave topology.

HDFS components:

NameNode – It is the master node and is responsible for maintaining the metadata information for the blocks of data stored in HDFS. It manages all the DataNodes.

 Ex: replication factors, block location, etc.

DataNode – It is the slave node and responsible for storing data in the HDFS.

10.What are the features of HDFS?


  • Supports storage of very large datasets
  • Write once read many access model
  • Streaming data access
  • Replication using commodity hardware
  • HDFS is highly Fault Tolerant
  • Distributed Storage

11.What is the HDFS block size?


By default, the HDFS block size is 128MB for Hadoop 2.x.

12. What is the default replication factor?


  • The replication factor is the number of copies of each block that HDFS stores across the cluster.
  • The default replication factor is 3.
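The two defaults above combine into simple arithmetic that interviewers often follow up with: how many blocks a file occupies, and how much raw storage it consumes once every block is replicated. A minimal Python sketch of the calculation (illustrative only, not Hadoop code):

```python
# Illustrative sketch (not Hadoop code): how a 500 MB file is stored in HDFS
# with the default 128 MB block size and replication factor of 3.
import math

BLOCK_SIZE_MB = 128   # HDFS default block size in Hadoop 2.x
REPLICATION = 3       # default replication factor

def storage_layout(file_size_mb):
    """Return (number of blocks, raw storage consumed in MB)."""
    num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # Every block is replicated, so raw storage = file size * replication factor.
    raw_storage = file_size_mb * REPLICATION
    return num_blocks, raw_storage

blocks, raw = storage_layout(500)
print(blocks, raw)  # 4 blocks, 1500 MB of raw storage
```

Note that HDFS blocks are not padded: the fourth block of a 500 MB file holds only 116 MB, so the block size is an upper bound on per-block storage, not a fixed allocation.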

13. List the various HDFS Commands?


The various HDFS commands are listed below:

  • version
  • mkdir
  • ls
  • put
  • copyFromLocal
  • get
  • copyToLocal
  • cat
  • mv
  • cp

14. Compare HDFS (Hadoop Distributed File System) and NAS (Network Attached Storage)? 


HDFS – A distributed file system that stores data on commodity hardware, which makes it cost-effective; it is designed to work with the MapReduce paradigm.

NAS – A file-level computer data storage server connected to a computer network, providing access to a heterogeneous group of clients; it is a high-end, high-cost device and is not suitable for MapReduce.

15. What are the limitations of Hadoop 1.0?


  • NameNode: No Horizontal Scalability and No High Availability
  • Job Tracker: Overburdened.
  • MRv1: It can only understand Map and Reduce tasks

16.How to commission (adding) the nodes in the Hadoop cluster?


  • Update the network addresses in the dfs.include and mapred.include files.
  • Refresh the NameNode: hadoop dfsadmin -refreshNodes
  • Refresh the JobTracker: hadoop mradmin -refreshNodes
  • Update the slaves file.
  • Start the DataNode and NodeManager on the added node.

17. How to decommission (removing) the nodes in the Hadoop cluster?


  • Update the network addresses in the dfs.exclude and mapred.exclude files.
  • Refresh the NameNode: hadoop dfsadmin -refreshNodes
  • Refresh the JobTracker: hadoop mradmin -refreshNodes
  • Cross-check the Web UI; it will show "Decommissioning in Progress".
  • Remove the nodes from the include file, then run: hadoop dfsadmin -refreshNodes and hadoop mradmin -refreshNodes.
  • Remove the nodes from the slaves file.

18.Compare Hadoop 1.x and Hadoop 2.x


NameNode – In Hadoop 1.x, the NameNode is a single point of failure; in Hadoop 2.x, there are both active and passive NameNodes.

Processing – Hadoop 1.x uses MRv1 (JobTracker & TaskTracker); Hadoop 2.x uses MRv2/YARN (ResourceManager & NodeManager).

19. What is the difference between active and passive NameNodes?


  • Active NameNode works and runs in the cluster.
  • Passive NameNode has similar data as active NameNode and replaces it when it fails.

20. How will you resolve the NameNode failure issue?


The following steps need to be executed to resolve the NameNode issue and make Hadoop cluster up and running:

  • Use the FsImage (file system metadata replica) to start a new NameNode. 
  • Now, configure the DataNodes and clients so that they acknowledge the newly started NameNode.
  • The new NameNode will start serving the client once it has completed loading the last checkpoint FsImage and enough block reports from the DataNodes.

21. What is a Checkpoint Node in Hadoop?


Checkpoint Node is the new implementation of secondary NameNode in Hadoop.  It periodically creates the checkpoints of filesystem metadata by merging the edits log file with FsImage file.

22.List the different types of Hadoop schedulers.


  • Hadoop FIFO scheduler
  • Hadoop Fair Scheduler
  • Hadoop Capacity Scheduler

23.How to keep an HDFS cluster balanced?


It is not possible to prevent a cluster from becoming unbalanced. To bring the DataNodes back within a certain balance threshold, use the Balancer tool, which iteratively evens out the block data distribution across the cluster.

24.What is DistCp?


  • DistCp is the tool used to copy large amounts of data to and from Hadoop file systems in parallel.
  • It uses MapReduce to effect its distribution, reporting, recovery,  and error handling.

25. What is HDFS Federation?


  • HDFS Federation enhances the present HDFS architecture through a clear separation of namespace and storage by enabling a generic block storage layer. 
  • It provides multiple namespaces in the cluster to improve scalability and isolation. 

26.What is HDFS High Availability?


HDFS High availability is introduced in Hadoop 2.0. It means providing support for multiple NameNodes to the Hadoop architecture.

27.What is rack-aware replica placement policy?


  • Rack Awareness is the algorithm the NameNode uses to reduce network traffic while reading/writing HDFS files in a Hadoop cluster.
  • The NameNode chooses DataNodes on the same rack or a nearby rack to serve read/write requests. This concept of choosing closer DataNodes based on rack information is called Rack Awareness.
  • With a replication factor of 3 for data blocks on HDFS, for every block of data two copies are stored on one rack, while the third copy is stored on a different rack. This rule is called the Replica Placement Policy.
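The policy above can be sketched in a few lines of Python. This is an illustration of the placement rule only, not Hadoop's actual implementation; the topology, node names, and rack names are hypothetical:

```python
# Illustrative sketch of the default replica placement policy (replication = 3):
# first replica on the writer's node, and the second and third replicas on two
# different nodes of one remote rack. The cluster topology is a made-up example.
import random

def place_replicas(topology, writer_node):
    """topology: dict mapping rack name -> list of node names."""
    rack_of = {n: r for r, nodes in topology.items() for n in nodes}
    first = writer_node
    # Pick a remote rack for the second and third replicas.
    remote_racks = [r for r in topology if r != rack_of[first]]
    remote = random.choice(remote_racks)
    second, third = random.sample(topology[remote], 2)
    return [first, second, third]

topology = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5", "n6"]}
print(place_replicas(topology, "n1"))  # e.g. ['n1', 'n5', 'n4']
```

Losing a whole rack therefore costs at most two of the three replicas, while reads can still usually be served from the local rack.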

28. What is the main purpose of Hadoop fsck command?


Hadoop fsck command is used for checking the HDFS file system. 

There are different arguments that can be passed with this command to emit different results.

hadoop fsck / -files: Displays all the files in HDFS while checking.

hadoop fsck / -files -blocks: Displays all the blocks of the files while checking.

hadoop fsck / -files -blocks -locations: Displays the locations of all file blocks while checking.

hadoop fsck / -files -blocks -locations -racks: Displays the network topology of the DataNode locations.

hadoop fsck -delete: Deletes the corrupted files in HDFS.

hadoop fsck -move: Moves the corrupted files to the /lost+found directory.

29. What is the purpose of a DataNode block scanner?


  • The purpose of the DataNode block scanner is to periodically verify all the blocks stored on the DataNode.
  • If a bad block is detected, it is reported so that it can be fixed (re-replicated from a good copy) before any client reads it.

30. What is the purpose of dfsadmin tool?


  • The dfsadmin tool is used for examining the status of the HDFS cluster.
  • The dfsadmin -report command produces useful basic statistics about the cluster, such as DataNode and NameNode status, configured disk capacity, etc.
  • It performs all the administrative tasks on HDFS.

31. What is the command used for printing the topology?


hdfs dfsadmin -printTopology is used for printing the topology. It displays the tree of racks and the DataNodes attached to those racks.

32. What is RAID?


RAID (redundant array of independent disks) is a data storage virtualization technology used for improving performance and data redundancy by combining multiple disk drives into a single entity.

33. Does Hadoop require RAID?


  • On DataNodes, RAID is not necessary, as redundancy is achieved by replication between the nodes.
  • RAID is recommended for the NameNode's disks.

34. List the various site-specific configuration files available in Hadoop?


  • conf/hadoop-env.sh
  • conf/yarn-site.xml
  • conf/yarn-env.sh
  • conf/mapred-site.xml
  • conf/hdfs-site.xml
  • conf/core-site.xml

35. What is the main functionality of NameNode?


It is mainly responsible for:

Namespace – Manages metadata of HDFS.

Block Management – Processes and manages the block reports and its location.


36.Which command is used to format the NameNode?


  • $ hdfs namenode -format

37. How a client application interacts with the NameNode?


  • Client applications use the Hadoop HDFS API to interact with the NameNode whenever they need to copy/move/add/locate/delete a file.
  • The NameNode responds to successful requests with a list of the relevant DataNode servers where the data resides.
  • The client can then talk directly to a DataNode once the NameNode has provided the location of the data.

38.What is MapReduce and list its features?


MapReduce is a programming model used for processing and generating large datasets on the clusters with parallel and distributed algorithms.

The syntax for running a MapReduce program is:

  • hadoop jar <jar_file> /input_path /output_path

39.What are the features of MapReduce?


  • Automatic parallelization and distribution.
  • Built-in fault-tolerance and redundancy are available.
  • MapReduce Programming model is language independent
  • Distributed programming complexity is hidden
  • Enable data local processing
  • Manages all the inter-process communication

40.What does the MapReduce framework consist of?


MapReduce framework is used to write applications for processing large data in parallel on large clusters of commodity hardware.

ResourceManager (RM)

  • Global resource scheduler
  • One master RM per cluster

NodeManager (NM)

  • One slave NM per cluster node

Containers

  • The RM creates containers upon request by the AM
  • An application runs in one or more containers

ApplicationMaster (AM)

  • One AM per application
  • Runs inside a container

41. What are the two main components of ResourceManager?


  • Scheduler

It allocates the resources (containers) to various running applications based on resource availability and configured shared policy.

  • ApplicationManager

It is mainly responsible for managing a collection of submitted applications

42. What is a Hadoop counter?


Hadoop Counters measures the progress or tracks the number of operations that occur within a MapReduce job. Counters are useful for collecting statistics about MapReduce job for application-level or quality control.

43. What are the main configuration parameters for a MapReduce application?


The job configuration requires the following:

  • Job’s input and output locations in the distributed file system
  • The input format of data
  • The output format of data
  • Class containing the map function and reduce function
  • JAR file containing the reducer, driver, and mapper classes

44. What are the steps involved to submit a Hadoop job?


Steps involved in Hadoop job submission:

  • The Hadoop job client submits the job JAR/executable and configuration to the ResourceManager.
  • The ResourceManager then distributes the software/configuration to the slaves.
  • The ResourceManager then schedules the tasks and monitors them.
  • Finally, job status and diagnostic information are provided to the client.

45. How does MapReduce framework view its input internally?


It views the input data set as a set of key-value pairs and processes the map tasks in a completely parallel manner.

46. What are the basic parameters of Mapper?


The basic parameters of a Mapper are listed below (as in the classic word-count example):

  • Input key/value: LongWritable and Text
  • Output key/value: Text and IntWritable

47. What are Writables and explain its importance in Hadoop?


  • Writables are interfaces in Hadoop. They act as wrapper classes for almost all of Java's primitive data types.
  • A Writable is a serializable object that implements a simple, efficient serialization protocol based on DataInput and DataOutput.
  • Writables are used for creating serialized data types in Hadoop.
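As a concrete illustration of what a Writable's write()/readFields() pair does, the sketch below mimics IntWritable, which serializes its value as 4 big-endian bytes. This is a Python analogy only, with struct standing in for Java's DataOutput/DataInput:

```python
# Illustrative Python sketch of a Writable: Hadoop's IntWritable writes its
# value as 4 big-endian bytes via DataOutput and reads them back via DataInput.
import struct

class IntWritableSketch:
    def __init__(self, value=0):
        self.value = value

    def write(self, out):            # analogous to Writable.write(DataOutput)
        out += struct.pack(">i", self.value)
        return out

    @classmethod
    def read_fields(cls, data):      # analogous to Writable.readFields(DataInput)
        (value,) = struct.unpack(">i", data[:4])
        return cls(value)

buf = IntWritableSketch(42).write(bytearray())
print(len(buf))                                         # 4
print(IntWritableSketch.read_fields(bytes(buf)).value)  # 42
```

The point of the exercise: the serialized form is compact (4 bytes, no type metadata), which matters when billions of key-value pairs cross the network during the shuffle.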

48. Why comparison of types is important for MapReduce?


  • It is important for MapReduce as in the sorting phase the keys are compared with one another.
  • For comparison of types, the WritableComparable interface is implemented.

49. What is “speculative execution” in Hadoop?


In Apache Hadoop, rather than trying to diagnose and fix slow-running tasks, the master node redundantly launches another instance of the same task on another node as a backup (the backup task is called a speculative task). This process is called speculative execution.

50.What are the methods used for restarting the NameNode in Hadoop?


The methods used for restarting the NameNodes are the following:

You can use the below command to stop the NameNode individually:

  • /sbin/hadoop-daemon.sh stop namenode

Then start the NameNode using:

  • /sbin/hadoop-daemon.sh start namenode

Alternatively, use these commands to stop all the daemons first and then start them all again:

  • /sbin/stop-all.sh
  • /sbin/start-all.sh

51. What is the difference between an “HDFS Block” and “MapReduce Input Split”?


  • An HDFS block is the physical division of the data on disk: the minimum amount of data that can be read or written. A MapReduce InputSplit is the logical division of the data, created by the InputFormat specified in the MapReduce job configuration.
  • HDFS divides data into blocks, whereas MapReduce divides data into input splits and assigns them to mapper functions.

52.What are the different modes in which Hadoop can run?


Standalone Mode(local mode) – This is the default mode where Hadoop is configured to run. In this mode, all the components of Hadoop such as DataNode, NameNode, etc., run as a single Java process and useful for debugging.

Pseudo-Distributed Mode (Single-Node Cluster) – Hadoop runs on a single node. In this mode each Hadoop daemon runs in its own separate Java process, whereas in local mode all the daemons run inside a single Java process.

Fully distributed mode (or multiple node cluster) – All the daemons are executed in separate nodes building into a multi-node cluster in the fully-distributed mode. 

53. Why can't aggregation be performed on the mapper side?


  • We cannot perform aggregation in the mapper because it requires sorting of data, which occurs only on the reducer side.
  • For aggregation, we need the output from all the mapper functions, which is not possible during the map phase as map tasks will be running in different nodes, where data blocks are present.

54. What is the importance of “RecordReader” in Hadoop?


  • RecordReader in Hadoop uses the data from the InputSplit as input and converts it into Key-value pairs for Mapper.
  • The MapReduce framework creates RecordReader instances through the InputFormat.

55.What is the purpose of Distributed Cache in a MapReduce Framework?


  • The Purpose of Distributed Cache in the MapReduce framework is to cache files when needed by the applications. It caches read-only text files, jar files, archives, etc. 
  • When you have cached a file for a job, the Hadoop framework will make it available to each and every data node where map/reduces tasks are operating.

56.How do reducers communicate with each other in Hadoop?


Reducers always run in isolation; the Hadoop MapReduce programming paradigm never allows them to communicate with each other.

57. What is Identity Mapper?


  • Identity Mapper is a default Mapper class which automatically works when no Mapper is specified in the MapReduce driver class.
  • It implements mapping inputs directly into the output.
  • IdentityMapper.class is used as a default value when JobConf.setMapperClass is not set.

58. What are the phases of MapReduce Reducer?


The MapReduce reducer has three phases: 

Shuffle phase – In this phase, the sorted output from a mapper is an input to the Reducer. This framework will fetch the relevant partition of the output of all the mappers by using HTTP.

Sort phase – In this phase, the input from various mappers is sorted based on related keys. This framework groups reducer inputs by keys. Shuffle and sort phases occur concurrently.

Reduce phase – In this phase, reduce task aggregates the key-value pairs after shuffling and sorting phases. The OutputCollector.collect() method, writes the output of the reduce task to the Filesystem.

59. What is the purpose of MapReduce Partitioner in Hadoop?


The MapReduce Partitioner manages the partitioning of the keys of the intermediate mapper output. It ensures that all values for a single key go to the same reducer, while distributing the keys evenly over the reducers.

60.How will you write a custom partitioner for a Hadoop MapReduce job?


  • Build a new class that extends the Partitioner class.
  • Override the getPartition method inside it.
  • Set the custom partitioner on the job, either through a config file or by using the setPartitionerClass method.
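To make the idea concrete, here is a Python sketch of the partitioning logic. Hadoop's default HashPartitioner effectively computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks; the custom rule below, which isolates a hypothetical "hot" key on its own reducer, is an invented example of the kind of override getPartition allows:

```python
# Illustrative sketch of partitioning logic, not Hadoop code.
def default_partition(key, num_reducers):
    # Mirrors HashPartitioner: non-negative hash modulo the reducer count.
    return (hash(key) & 0x7FFFFFFF) % num_reducers

def custom_partition(key, num_reducers):
    # Hypothetical custom rule: isolate a known hot key on the last reducer,
    # and hash everything else over the remaining reducers.
    if key == "hot-key":
        return num_reducers - 1
    return (hash(key) & 0x7FFFFFFF) % (num_reducers - 1)

print(custom_partition("hot-key", 4))           # 3
print(0 <= custom_partition("other", 4) <= 2)   # True
```

Custom partitioners like this are typically used to fight data skew, where one heavy key would otherwise overload a single reducer shared with other keys.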

61. What is a Combiner?


A Combiner is a semi-reducer that executes the local reduce task. It receives inputs from the Map class and passes the output key-value pairs to the reducer class.
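A pure-Python word-count sketch (illustrative only, not Hadoop code) shows where the combiner sits: it performs a local reduce on each mapper's output before the shuffle, so less data crosses the network:

```python
# Word count with a combiner step, sketched in plain Python.
from collections import defaultdict

def mapper(line):
    return [(word, 1) for word in line.split()]

def combine(pairs):                  # local reduce on one mapper's output
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())

def reducer(shuffled):               # final aggregation after the shuffle
    totals = defaultdict(int)
    for word, count in shuffled:
        totals[word] += count
    return dict(totals)

splits = ["big data big", "data big"]
combined = [pair for split in splits for pair in combine(mapper(split))]
print(len(combined))      # 4 pairs cross the "network" instead of 5
print(reducer(combined))  # {'big': 3, 'data': 2}
```

Note that a combiner is only safe for operations like sum or max, where applying the reduce function locally first does not change the final result.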

62. What is the use of SequenceFileInputFormat in Hadoop?


SequenceFileInputFormat is the input format used for reading sequence files. It is a compressed binary file format optimized for passing data between the output of one MapReduce job and the input of another MapReduce job.

63. What is Apache Pig?


  • Apache Pig is a high-level scripting language used for creating programs to run on Apache Hadoop. 
  • The language used in this platform is called Pig Latin. 
  • It executes Hadoop jobs on engines such as MapReduce, Apache Tez, or Apache Spark.

64. What are the benefits of Apache Pig over MapReduce?


  • Pig Latin is a high-level scripting language while MapReduce is a low-level data processing paradigm.
  • Without much complex Java implementations in MapReduce, programmers can perform the same implementations very easily using Pig Latin.
  • Apache Pig decreases the length of the code by approx 20 times (according to Yahoo). Hence, this reduces development time by almost 16 times.
  • Pig offers various built-in operators for data operations like filters, joins, sorting, ordering, etc., while to perform these same functions in MapReduce is an enormous task.

65.What are the Hadoop Pig data types?


Hadoop Pig runs both atomic data types and complex data types.

Atomic data types: These are the basic data types which are used in all the languages like int, string, float, long, etc.

Complex Data Types: These are Bag, Map, and Tuple.

66.List the various relational operators used in “Pig Latin”?


  • LOAD
  • FILTER
  • FOREACH
  • GROUP
  • JOIN
  • ORDER BY
  • DISTINCT
  • UNION
  • SPLIT
  • LIMIT

67. What is Apache Hive?


Apache Hive offers database query interface to Apache Hadoop. It reads, writes, and manages large datasets that are residing in distributed storage and queries through SQL syntax.

68. Where do Hive stores table data in HDFS?


/user/hive/warehouse is the default location where Hive stores table data in HDFS.

69. Can the default “Hive Metastore” be used by multiple users (processes) at the same time?


By default, Hive Metastore uses Derby database. So, it is not possible for multiple users or processes to access it at the same time.

70. What is a SerDe?


SerDe is a combination of Serializer and Deserializer. It tells Hive how a record should be processed, i.e., how to deserialize rows when reading from a table and serialize them when writing to it.

71. What are the differences between Hive and RDBMS?


Hive – Schema on read; batch-processing jobs; data stored on HDFS; queries processed using MapReduce.

RDBMS – Schema on write; real-time jobs; data stored in the database's internal structure; queries processed by the database engine.

72.What is an Apache HBase?


Apache HBase is a multidimensional, column-oriented key-value datastore that runs on top of HDFS (Hadoop Distributed File System). It is designed to provide high table-update rates and a fault-tolerant way to store large collections of sparse data sets.

73. What are the various components of Apache HBase?


Region Server: These are the worker nodes that handle read, write, update, and delete requests from clients. The RegionServer process runs on every node of the Hadoop cluster.

HMaster: It monitors and manages the Region Server in the Hadoop cluster for load balancing.

ZooKeeper: HBase employs ZooKeeper for coordination in its distributed environment. It keeps track of every region server present in the HBase cluster.

74.What is WAL in HBase?


  • The Write-Ahead Log (WAL) is a file that records all changes to data in HBase before they are applied. It is used for recovering data sets.
  • The WAL ensures all the changes to the data can be replayed when a RegionServer crashes or becomes unavailable.

75.What are the differences between the Relational database and HBase?


Relational database – A row-oriented, schema-based datastore; suitable for structured data; supports referential integrity; uses thin tables; records are accessed from tables using SQL queries.

HBase – A column-oriented datastore; its schema is more flexible and less restrictive; suitable for both structured and unstructured data; does not support referential integrity; uses sparsely populated tables; data is accessed from HBase tables using APIs and MapReduce.

76.What is Apache Spark?


Apache Spark is an open-source framework used for real-time data analytics in a distributed computing environment. It is a data processing engine which provides faster analytics than Hadoop MapReduce.

77.Can we build “Spark” with any particular Hadoop version?


Yes, we can build “Spark” for any specific Hadoop version.

78.What is RDD?


RDD (Resilient Distributed Dataset) is the fundamental data structure of Spark. It is a distributed collection of objects; each dataset in an RDD is divided into logical partitions, which are computed on several nodes of the cluster.

79.What is Apache ZooKeeper?


Apache ZooKeeper is a centralized service used for managing various operations in a distributed environment. It maintains configuration data, performs synchronization, naming, and grouping.

80. What is Apache Oozie?


Apache Oozie is a scheduler which controls the workflow of Hadoop jobs.

There are two kinds of Oozie jobs:

Oozie Workflow – It is a collection of actions sequenced in a control-dependency DAG (Directed Acyclic Graph) for execution.

Oozie Coordinator – If you want to trigger workflows based on the availability of data or on time, you can use the Oozie Coordinator engine.

81. What is the Secondary NameNode?


Hadoop metadata is stored in the NameNode's main memory and on disk. Two files are mainly used for this purpose:

  • Editlogs
  • Fsimage

Any updates done to HDFS are recorded in the edit logs. As the number of entries increases, the edit log grows, whereas the fsimage file stays the same size. When the server is restarted, the contents of the edit logs are merged into the fsimage file, which is then loaded into main memory; this is time-consuming. The larger the edit log, the longer it takes to merge into the fsimage, causing extended downtime.

To avoid such prolonged downtime, a helper node for the NameNode, known as the Secondary NameNode, is used. It periodically merges the contents of the edit logs into the fsimage and copies the new fsimage file back to the NameNode.
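The checkpoint process can be sketched with simplified stand-in data structures: treat the fsimage as a snapshot of the namespace and the edit log as a list of changes to replay (illustrative only; the real fsimage and edit-log formats are binary and far richer):

```python
# Illustrative sketch of the Secondary NameNode's checkpoint: replay the
# edit log into the last fsimage snapshot so the edit log can be truncated.
def checkpoint(fsimage, edit_log):
    new_image = dict(fsimage)        # start from the last snapshot
    for op, path in edit_log:        # replay every logged change
        if op == "create":
            new_image[path] = "file"
        elif op == "delete":
            new_image.pop(path, None)
    return new_image, []             # new fsimage, truncated edit log

fsimage = {"/data/a.txt": "file"}
edits = [("create", "/data/b.txt"), ("delete", "/data/a.txt")]
fsimage, edits = checkpoint(fsimage, edits)
print(fsimage)  # {'/data/b.txt': 'file'}
print(edits)    # []
```

Because checkpoints happen periodically, a restarting NameNode only has to replay the short edit log accumulated since the last checkpoint, not the cluster's entire history.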

82. How does NameNode handle DataNode failure?


The HDFS architecture is designed so that every DataNode periodically sends a heartbeat to the NameNode to confirm that it is working. When the NameNode stops receiving heartbeats from a particular DataNode, it marks that DataNode as dead or non-functional and re-replicates all of its data blocks to other active DataNodes that already hold copies of them.

Moving forward, here we cover a few advanced HDFS interview questions for Hadoop developers, along with the frequently asked ones.

83. How data/file read operation is performed in HDFS?


The HDFS NameNode holds all the file metadata and the actual locations of the blocks on the slave nodes. The following steps are followed in the read operation of a file:

  • When a file needs to be read, the file information is retrieved from the NameNode by a DistributedFileSystem instance.
  • The NameNode checks whether that particular file exists and whether the user has access to it.
  • Once these criteria are met, the NameNode provides a token to the client, which it uses to authenticate to the DataNodes that hold the file.
  • The NameNode provides the details of all the blocks of the file and the DataNodes that hold them.
  • The DataNodes are then sorted by their proximity to the client.
  • DistributedFileSystem returns an input stream to the client called FSDataInputStream, from which the client can read the data.
  • FSDataInputStream is a wrapper around DFSInputStream, which manages the NameNode and DataNode I/O.
  • As the client calls read() on the stream, DFSInputStream connects to the closest DataNode holding the first block, and data is returned to the client via the stream. read() is called repeatedly until the end of the first block is reached.
  • Once the first block is completely read, the connection to that DataNode is closed.
  • DFSInputStream then connects to the next suitable DataNode for the next block, and this continues until the file is completely read.
  • Once the entire file has been read, the client calls close() on FSDataInputStream to close the connection.

84. Is concurrent write into HDFS file possible?


No, HDFS does not allow concurrent writes. When one client is granted permission by the NameNode to write to a DataNode block, that block is locked until the write operation finishes, so no other client can write to the same block.

85. What are the challenges in existing HDFS architecture?


The existing HDFS architecture consists of only one NameNode, which holds the single namespace, and multiple DataNodes that hold the actual data. This architecture works well for limited cluster sizes; however, if we try to increase the cluster size, we run into a few challenges.

  • As the Namespace and Blocks are tightly coupled, other services cannot easily utilize the storage capacity of Blocks efficiently.
  • With a single NameNode, adding more DataNodes to the cluster creates a huge amount of metadata. DataNodes can be scaled horizontally, but the NameNode cannot be scaled in the same manner. This is the namespace scalability issue.
  • The current HDFS file system also has a throughput limitation, because a single NameNode supports only about 60,000 concurrent tasks.
  • We cannot get isolated namespace for a single application as HDFS deployments happen on a multi-tenant environment and multiple applications or organizations share a single cluster.

86. What is the size of the biggest Hadoop cluster company X operates?


Asking this question helps a Hadoop job seeker understand the Hadoop maturity curve at a company. Based on the interviewer's answer, a candidate can judge how much the organization invests in Hadoop and how willing it is to buy big data products from various vendors. The candidate can also get an idea of the company's hiring needs from its Hadoop infrastructure.

87. For what kind of big data problems, did the organization choose to use Hadoop?


Asking this question shows the candidate's keen interest in understanding the reasons for the Hadoop implementation from a business perspective. It gives the interviewer the impression that the candidate is interested not merely in the Hadoop developer job role but also in the growth of the company.

88. Based on the answer to question no 1, the candidate can ask the interviewer why the Hadoop infrastructure is configured in that particular way, why the company chose the selected big data tools, and how workloads are constructed in the Hadoop environment.


Asking this question gives the impression that you are interested not just in maintaining the big data system and developing products around it, but also in how the infrastructure can be improved to support business growth and reduce costs.

89. What kind of data does the organization work with, or what HDFS file formats does the company use?


The question gives the candidate an idea of the kind of big data he or she will be handling if selected for the Hadoop developer job role and, based on that data, the kind of analysis they will be required to perform.

90. What is the most complex problem the company is trying to solve using Apache Hadoop?


Asking this question helps the candidate learn more about the upcoming projects he or she might have to work on and the challenges around them. Knowing this beforehand helps the interviewee prepare in his or her areas of weakness.

91. Will I get an opportunity to attend big data conferences? Will the organization cover any costs involved in taking advanced Hadoop or big data certifications?


This is a very important question to ask the interviewer. It helps a candidate understand whether the prospective hiring manager is interested in and supportive of the professional development of employees.

92. How to control the number of reducers in a map reduce program?


A. By default, one reducer is spawned for every 1 GB of input data. You can override this default by calling:

  • job.setNumReduceTasks(int n)

This sets the number of reducers to the integer you pass as the parameter.
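To make the default concrete, here is a small, self-contained sketch of the arithmetic stated above (one reducer per started GB of input, with an explicit setting taking precedence). The `defaultReducers` and `effectiveReducers` helpers are invented for illustration; the real override is simply the `job.setNumReduceTasks(n)` call shown above.

```java
// Models the stated default (1 reducer per GB of input) and the explicit
// override. These helpers are illustrative only; in a real job you would
// call job.setNumReduceTasks(n) on the org.apache.hadoop.mapreduce.Job.
public class ReducerCount {

    static final long ONE_GB = 1024L * 1024 * 1024;

    // Default heuristic: one reducer per started GB, at least one reducer.
    static int defaultReducers(long inputBytes) {
        return (int) Math.max(1, (inputBytes + ONE_GB - 1) / ONE_GB);
    }

    // An explicit setting wins, mirroring job.setNumReduceTasks(n).
    static int effectiveReducers(long inputBytes, Integer explicit) {
        return explicit != null ? explicit : defaultReducers(inputBytes);
    }

    public static void main(String[] args) {
        System.out.println(defaultReducers(5 * ONE_GB));      // 5
        System.out.println(effectiveReducers(5 * ONE_GB, 2)); // 2
    }
}
```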

93. How does Hadoop know how many mappers have to be started?


A. The number of mappers equals the number of input splits.

Number of input splits (for a single file) = Ceil(file size / input split size)

For example, if you have 1 GB of data and the input split size is 128 MB, then 1024 / 128 gives you 8, so 8 mappers will be started.

By default, the input split size equals the block size, so the number of input splits equals the number of blocks. In that case, the number of mappers equals the number of blocks.
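The split arithmetic above can be written as a one-line ceiling division. The `numMappers` helper below is illustrative (Hadoop computes splits internally in its InputFormat), but the math matches the worked example: 1 GB of input with a 128 MB split size yields 8 mappers.

```java
// Runnable version of the split arithmetic: the number of mappers equals
// the number of input splits, i.e. ceil(fileSize / splitSize). With the
// default split size equal to the block size, this is the block count.
public class MapperCount {

    static long numMappers(long fileSizeBytes, long splitSizeBytes) {
        // integer ceiling division: ceil(fileSize / splitSize)
        return (fileSizeBytes + splitSizeBytes - 1) / splitSizeBytes;
    }

    public static void main(String[] args) {
        long oneGB = 1024L * 1024 * 1024;
        long split = 128L * 1024 * 1024;  // 128 MB, the common default
        System.out.println(numMappers(oneGB, split)); // 8
    }
}
```

Note that a file slightly larger than a multiple of the split size still gets one extra mapper for the remainder, which is why the ceiling (not the floor) is used.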


94. How can you control block size and replication factor at file level?


A. You can change the block size, the replication factor, and many other settings at the cluster level by setting properties in the configuration files:

  • core-site.xml
  • hdfs-site.xml
  • mapred-site.xml
  • yarn-site.xml

If you want to upload a file into HDFS with a specific block size and a specific replication factor, you can do so by providing the property and its value while writing the file into HDFS.

Changing the block size (the first form uses the older, deprecated property name; the second uses the current one):

  • hadoop fs -Ddfs.block.size=1048576 -put file.txt /user/acadgild
  • hadoop fs -Ddfs.blocksize=1048576 -put file.txt /user/acadgild

Similarly, the replication factor can be set per file:

  • hadoop fs -Ddfs.replication=2 -put file.txt /user/acadgild
