What is HDFS? Hadoop Distributed File System | A Complete Guide [ OverView ]

Last updated on 31st Oct 2022, Articles, Blog

About author

Smita Jhingran (Big Data Engineer )

Smita Jhingran provides in-depth presentations on various big data technologies. She specializes in Docker, Hadoop, Microservices, MiNiFi, Cloudera, Commvault, and BI tools with 5+ years of experience.

    • In this article you will learn:
    • 1.What exactly is HDFS?
    • 2.Architecture of HDFS, along with NameNodes and DataNodes
    • 3.Characteristics of the HDFS
    • 4.Read/Write Architecture of the HDFS File System
    • 5.Architecture for writing to HDFS
    • 6.What are some of the advantages of using HDFS?
    • 7.Conclusion

What exactly is HDFS?

The abbreviation HDFS refers to the Hadoop Distributed File System. It is a fundamental part of the Hadoop architecture, and it allows multiple files to be stored and retrieved concurrently.

The Hadoop Distributed File System (HDFS) is the key component of the Hadoop architecture responsible for data storage. Hadoop's data is actually distributed across a number of different machines, which both reduces cost and improves the system's overall reliability.

In the next part of the post, we will go through the HDFS architecture in more depth.

HDFS is a block-structured file system: every file is broken up into fixed-size blocks.

Each block is stored on one or more nodes in the cluster, with replicas kept on different nodes.

HDFS uses an architecture that is based on a master/slave relationship.

In this design, the NameNode serves the role of the master node, while all of the DataNodes act as slaves (workers) that report to it.
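To make the block-based layout concrete, here is a minimal Python sketch (not actual HDFS code) of how a file's contents might be split into fixed-size blocks. The 128 MB default block size is scaled down to 128 bytes purely for illustration.

```python
# Hypothetical sketch of HDFS-style block splitting (not the real HDFS implementation).

def split_into_blocks(data: bytes, block_size: int) -> list:
    """Split a byte string into fixed-size blocks; the last block may be smaller."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# A toy "file" of 300 bytes with a 128-byte block size (HDFS defaults to 128 MB).
file_data = bytes(300)
blocks = split_into_blocks(file_data, 128)
print([len(b) for b in blocks])  # → [128, 128, 44]
```

Note that only the final block is allowed to be smaller than the block size; every other block is exactly one block-size long, which is what makes the NameNode's bookkeeping simple.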

Architecture of HDFS, along with NameNodes and DataNodes

The HDFS file system is designed with a primary/secondary (master/slave) architecture. The NameNode is the principal server within the HDFS cluster: as the core component of the Hadoop Distributed File System, it maintains and manages the file system namespace and regulates client access to files, granting the appropriate access rights. The DataNodes, in turn, are in charge of managing the storage attached to the nodes they run on.

HDFS makes it possible for user data to be saved in files and provides a namespace for the underlying file system. A file is partitioned into one or more blocks, and each block is stored on its own DataNode. Operations on the file system namespace, such as opening, closing, and renaming files and directories, are handled by the NameNode, which also governs the mapping of blocks to DataNodes. Requests to read and write data are fulfilled by the DataNodes on behalf of the file system's clients. In addition, the DataNodes carry out block creation, deletion, and replication whenever the NameNode instructs them to do so.

HDFS is capable of supporting the conventional hierarchical arrangement of files. The creation of directories and the subsequent storage of files inside those directories may be performed by either an application or a user. The namespace structure of the file system is similar to that of the vast majority of other file systems in that a user has the ability to create, delete, rename, and transfer files from one directory to another.

Any time the file system namespace or any of its properties is modified, the NameNode logs the event. An application can specify how many copies of a file HDFS should keep. The number of copies of a file is called its replication factor, and this metadata is stored by the NameNode.
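As a rough illustration of this division of labour, the following Python sketch (with invented class and method names, not the real HDFS API) models a NameNode that keeps only metadata — the mapping from each block to the DataNodes holding its replicas — while the actual bytes would live on the DataNodes.

```python
import itertools

# Toy model of NameNode metadata: illustrative only, not how HDFS is implemented.

class ToyNameNode:
    def __init__(self, datanodes, replication_factor=3):
        self.datanodes = list(datanodes)     # names of available DataNodes
        self.replication = replication_factor
        self.block_map = {}                  # block id -> list of DataNode names
        self._rr = itertools.cycle(self.datanodes)

    def allocate_block(self, block_id):
        """Assign `replication` distinct DataNodes to a new block (round-robin)."""
        targets = []
        while len(targets) < self.replication:
            node = next(self._rr)
            if node not in targets:
                targets.append(node)
        self.block_map[block_id] = targets
        return targets

namenode = ToyNameNode(["dn1", "dn2", "dn3", "dn4"], replication_factor=3)
print(namenode.allocate_block("example.txt_blk_0"))  # → ['dn1', 'dn2', 'dn3']
```

The real NameNode uses a rack-aware placement policy rather than simple round-robin, but the essential point is the same: the NameNode decides *where* replicas go and remembers the mapping, while the DataNodes store the data.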

Characteristics of the HDFS

There are a number of aspects that contribute to HDFS’s exceptional utility, including the following:

Data replication: This protects the data from being lost while also ensuring that it is constantly accessible. For instance, if a node crashes or there is a hardware failure, replicated data can be fetched from another location within the cluster, so that processing can continue even while the data is being recovered.

Fault tolerance and high reliability: HDFS's capacity to replicate file blocks and store them across the nodes of a large cluster ensures both fault tolerance and reliability.

High availability: Because the data is replicated across nodes, it remains accessible even in the event of a DataNode outage.

Scalability: Due to the fact that HDFS stores data on numerous nodes within the cluster, a cluster may expand to hundreds of nodes in order to accommodate growing demand.

High throughput: Because HDFS stores data in a distributed way, the data can be processed in parallel on a cluster of interconnected nodes. This reduces processing time, and data locality (see the next bullet) makes high throughput possible.

Data locality: Instead of moving the data to the location of the processing unit, HDFS enables the computation to take place on the DataNodes themselves, where the data is stored. This strategy reduces network congestion and increases a system's overall throughput by shrinking the distance between the data and the computation.
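The data locality idea can be sketched in a few lines of Python. This is a hypothetical scheduler, not Hadoop's actual one: given the (invented) block-location table, it prefers to run a task on a free node that already holds a replica, and falls back to shipping the data only when no such node is free.

```python
# Illustrative sketch of data-locality scheduling (not Hadoop's real scheduler).

block_locations = {                 # block id -> DataNodes holding a replica
    "blk_0": ["dn1", "dn2", "dn3"],
    "blk_1": ["dn2", "dn3", "dn4"],
}

def schedule_task(block_id, free_nodes):
    """Prefer a free node that already stores the block (node-local execution);
    otherwise fall back to any free node and move the data over the network."""
    for node in block_locations[block_id]:
        if node in free_nodes:
            return node, "node-local"
    return next(iter(free_nodes)), "remote"

print(schedule_task("blk_0", {"dn2", "dn4"}))  # → ('dn2', 'node-local')
```

In the first case the computation runs where the data already lives, so no block needs to cross the network; only the much smaller program is moved.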

Read/Write Architecture of the HDFS File System

In this section we will have a quick discussion of the read and write procedures. HDFS adheres to the "write once, read many" model: once a file has been written to HDFS, its contents cannot be changed, but the file can be read any number of times. With that established, let's look at an example that illustrates how a file is written into HDFS.
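The "write once, read many" rule can be captured with a toy Python store. This is an illustration of the semantics only (the class and its methods are invented, not an HDFS API), and note that later HDFS versions do allow appending to a file, just not rewriting existing data.

```python
# Toy write-once/read-many (WORM) store illustrating HDFS's immutability rule.

class WormStore:
    def __init__(self):
        self._files = {}

    def write(self, name, data):
        """Create a file; overwriting an existing file is forbidden."""
        if name in self._files:
            raise PermissionError(f"{name}: existing files cannot be modified")
        self._files[name] = data

    def read(self, name):
        return self._files[name]    # reads may happen any number of times

store = WormStore()
store.write("example.txt", b"hello")
print(store.read("example.txt"))    # → b'hello'
```

Attempting a second `write` to `example.txt` raises an error, while repeated reads always succeed — exactly the access pattern HDFS is optimised for.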

Architecture for writing to HDFS

Let's consider a scenario in which a user needs to save a 248 MB file into HDFS.

  • The name of the text file is “example.txt.”
  • The size of the file is 248 megabytes.
  • Data blocks will therefore be generated to store the 248 MB file.
  • With the default configuration, the block size is 128 MB, so exactly two data blocks are required to hold this file.
  • Block A holds the first 128 MB.
  • The remainder, which is 120 MB, will be stored in Block B.
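The arithmetic above can be checked with a short sketch:

```python
# How a 248 MB file is split under HDFS's default 128 MB block size.

MB = 1024 * 1024
file_size = 248 * MB
block_size = 128 * MB

full_blocks, remainder = divmod(file_size, block_size)
sizes_mb = [128] * full_blocks + ([remainder // MB] if remainder else [])
print(sizes_mb)  # → [128, 120]
```

So Block A is a full 128 MB block, and Block B holds the remaining 120 MB.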

What are some of the advantages of using HDFS?

The use of HDFS comes with a total of five significant benefits, which are as follows:

Cost efficiency: The DataNodes store the data on commodity hardware that is readily available and very affordable, which helps keep storage costs down. In addition, there are no license fees, since HDFS is open source.

Large data set storage: HDFS can store data of any format and any size, ranging from gigabytes all the way up to petabytes, including both structured and unstructured data.

Rapid recovery after a malfunction in the hardware: HDFS was developed to identify errors and restore itself automatically in the event of a malfunction.

Portability: HDFS is portable across hardware platforms and is compatible with a variety of operating systems, including Windows, Linux, and macOS.

Streaming data access: The HDFS file system was designed to have a high data throughput, making it ideal for providing access to streaming data.

Conclusion:

HDFS is a distributed file system that manages massive data volumes using commodity hardware. With this approach, a single Apache Hadoop cluster can be scaled to hundreds or even thousands of nodes. HDFS is one of the most important components of Apache Hadoop, along with MapReduce and YARN. It should not be confused with, or substituted by, Apache HBase: HBase is a column-oriented non-relational database management system that sits on top of HDFS and can better handle real-time data demands with its in-memory processing engine. HDFS is put to use in a wide variety of fields and has impressive properties.
