Essential Concepts of Big Data and Hadoop
Last updated on 01st Oct 2020, Artciles, Blog
What is Hadoop?
Hadoop (the full proper name is ApacheTM Hadoop®) is an open-source framework that was created to make it easier to work with big data. It provides a method to access data that is distributed among multiple clustered computers, process the data, and manage resources across the computing and network resources that are involved. “Hadoop” commonly refers to the core technology that consists of the four main components described below, but is also frequently used in reference to the entire ecosystem of supporting technologies and applications.
“Hadoop” also is often used interchangeably with “big data,” but it shouldn’t be. Hadoop is a framework for working with big data. It is part of the big data ecosystem, which consists of much more than Hadoop itself.
Hadoop is a distributed framework that makes it easier to process large data sets that reside in clusters of computers. Because it is a framework, Hadoop is not a single technology or product. Instead, Hadoop is made up of four core modules that are supported by a large ecosystem of supporting technologies and products. The modules are:
Subscribe For Free Demo[contact-form-7 404 "Not Found"]
- Hadoop Distributed File System (HDFSTM) –Provides access to application data. Hadoop can also work with other file systems, including FTP, Amazon S3 and Windows Azure Storage Blobs (WASB), among others.
- Hadoop YARN – Provides the framework to schedule jobs and manage resources across the cluster that holds the data
- Hadoop MapReduce – A YARN-based parallel processing system for large data sets.
- Hadoop Common – A set of utilities that supports the three other core modules.
Some of the well-known Hadoop ecosystem components include Oozie, Spark, Sqoop, Hive and Pig.
What Hadoop isn’t
- Hadoop is not “big data” – the terms are sometimes used interchangeably, but they shouldn’t be. Hadoop is a framework for processing big data.
- Hadoop is not an operating system (OS) or packaged software application.
- Hadoop is not a brand name. It is an open source project, although “Hadoop” may be used as part of registered brand names.
What’s with the name?
Hadoop was originally developed by Doug Cutting and Mike Cafarella. According to lore, Cutting named the software after his son’s toy elephant. An image of an elephant remains the symbol for Hadoop.
Core elements of Hadoop
There are four basic elements to Hadoop: HDFS; MapReduce; YARN; Common.
Hadoop works across clusters of commodity servers. Therefore there needs to be a way to coordinate activity across the hardware. Hadoop can work with any distributed file system, however the Hadoop Distributed File System is the primary means for doing so and is the heart of Hadoop technology. HDFS manages how data files are divided and stored across the cluster. Data is divided into blocks, and each server in the cluster contains data from different blocks. There is also some built-in redundancy.
It would be nice if YARN could be thought of as the string that holds everything together, but in an environment where terms like Oozie, tuple and Sqoop are common, of course it’s not that simple. YARN is an acronym for Yet Another Resource Negotiator. As the full name implies, YARN helps manage resources across the cluster environment. It breaks up resource management, job scheduling, and job management tasks into separate daemons. Key elements include the ResourceManager (RM), the NodeManager (NM) and the ApplicationMaster (AM).
Think of the ResourceManager as the final authority for assigning resources for all the applications in the system. The NodeManagers are agents that manage resources (e.g. CPU, memory, network, etc.) on each machine. NodeManagers report to the ResourceManager. ApplicationMaster serves as a library that sits between the two. It negotiates resources with ResourceManager and works with one or more NodeManagers to execute tasks for which resources were allocated.
MapReduce provides a method for parallel processing on distributed servers. Before processing data, MapReduce converts that large block into smaller data sets called tuples. Tuples, in turn, can be organized and processed according to their key-value pairs. When MapReduce processing is complete, HDFS takes over and manages storage and distribution for the output. The shorthand version of MapReduce is that it breaks big data blocks into smaller chunks that are easier to work with.
The “Map” in MapReduce refers to the Map Tasks function. Map Tasks is the process of formatting data into key-value pairs and assigning them to nodes for the “Reduce” function, which is executed by Reduce Tasks, where data is reduced to tuples. Both Map Tasks and Reduce Tasks use worker nodes to carry out their functions.
JobTracker is a component of the MapReduce engine that manages how client applications submit MapReduce jobs. It distributes work to TaskTracker nodes. TaskTracker attempts to assign processing as close to where the data resides as possible.
Note that MapReduce is not the only way to manage parallel processing in the Hadoop environment.
Common, which is also known as Hadoop Core, is a set of utilities that support the other Hadoop components. Common is intended to give the Hadoop framework ways to manage typical (common) hardware failures.
What is Oozie?
Oozie is the workflow scheduler that was developed as part of the Apache Hadoop project. It manages how workflows start and execute, and also controls the execution path. Oozie is a server-based Java web application that uses workflow definitions written in hPDL, which is an XML Process Definition Language similar to JBOSS JBPM jPDL. Oozie only supports specific workflow types, so other workload schedulers are commonly used instead of, or in addition to, Oozie in Hadoop environments.
How is Hadoop related to big data?
Big data is becoming a catchall phrase, while Hadoop refers to a specific technology framework. Hadoop is a gateway that makes it possible to work with big data, or more specifically, large data sets that reside in a distributed environment. One way to define big data is data that is too big to be processed by relational database management systems (RDBMS). Hadoop helps overcome RDBMS limitations, so big data can be processed.
Perhaps a more important question than How is Hadoop related to big data? How does Hadoop relate to other big data technologies? The relationships between the core Hadoop modules and the technologies and solutions that complement and compete with them are covered in more depth in the Hadoop Ecosystem section of this guide.
What does Hadoop replace?
In many cases Hadoop hasn’t replaced anything, because the things it is being used for simply were not done before (especially with unstructured data) because of computing system limitations. Relational databases and distributed file systems each do parts of what Hadoop can do, but operate on a much smaller scale. Again, a more instructive question is Which elements of Hadoop can be replaced or enhanced by other technologies and products in the ecosystem?
Who uses Hadoop? Why?
Any list of how Hadoop is being used and the organizations that are using it would become out of date in the time it takes to save the file. The Apache Hadoop organization maintains a list of Hadoop users that is extensive. Some of the prominent users listed include Amazon, EBay, Facebook, Google, IBM, LinkedIn, the New York Times, Rackspace and Yahoo.
To generalize, Hadoop has found a home in industries and organizations characterized by having large data sets, time-sensitive data, and data that could provide insight to improve performance or revenue. More specifically, the financial services, telecommunications, utilities/energy, and retail industries have been early Hadoop adopters and innovators, along with some government and other public sector organizations.
Examples of Hadoop
Here are five examples of Hadoop use cases:
- Financial services companies use analytics to assess risk, build investment models, and create trading algorithms; Hadoop has been used to help build and run those applications.
- Retailers use it to help analyze structured and unstructured data to better understand and serve their customers.
- In the asset-intensive energy industry Hadoop-powered analytics are used for predictive maintenance, with input from Internet of Things (IoT) devices feeding data into big data programs.
- Telecommunications companies can adapt all the aforementioned use cases. For example, they can use Hadoop-powered analytics to execute predictive maintenance on their infrastructure. Big data analytics can also plan efficient network paths and recommend optimal locations for new cell towers or other network expansion. To support customer-facing operations telcos can analyze customer behavior and billing statements to inform new service offerings.
- There are numerous public sector programs, ranging from anticipating and preventing disease outbreaks to crunching numbers to catch tax cheats.
Hadoop is used in these and other big data programs because it is effective, scalable, and is well supported by large vendor and user communities. Hadoop is a de facto standard in big data.
Are you looking training with Right Jobs?Contact Us
- Hadoop Tutorial
- Hadoop Interview Questions and Answers
- Hadoop Ecosystem
- How to Become a Hadoop Developer?
- What Are the Skills Needed to Learn Hadoop?
- What is Dimension Reduction? | Know the techniques
- Difference between Data Lake vs Data Warehouse: A Complete Guide For Beginners with Best Practices
- What is Dimension Reduction? | Know the techniques
- What does the Yield keyword do and How to use Yield in python ? [ OverView ]
- Agile Sprint Planning | Everything You Need to Know