Data Engineer Interview Questions and Answers

Last updated on 24th Oct 2020, Blog, Interview Question

About author

Vijaykumar (Lead Data Engineer - Director Level)

He is a proficient technical expert in his industry domain with 8+ years of experience. He is also dedicated to imparting knowledge to freshers, and he shares this blog with us.

These Data Engineer interview questions have been designed to acquaint you with the nature of the questions you may encounter during a Data Engineer interview. In my experience, good interviewers rarely plan to ask any particular question; they normally start with a basic concept of the subject and continue based on the discussion and your answers. We are going to cover the top 100 Data Engineer interview questions along with their detailed answers, including scenario-based questions, questions for freshers, and questions and answers for experienced candidates.

1. Explain Data Engineering?

Ans:

Data engineering is a discipline within big data that focuses on the practical application of data collection and analysis. The data generated from various sources is just raw data; data engineering helps to convert this raw data into useful information.

2. What is Data Modelling?

Ans:

Data modeling is the method of documenting a complex software design as a diagram so that anyone can easily understand it. It is a conceptual representation of data objects, the associations between them, and the rules that govern them.

3. List various types of design schemas in Data Modelling?

Ans:

There are mainly two types of schemas in data modeling: 1) Star schema and 2) Snowflake schema.

4. Distinguish between structured and unstructured data?

Ans:

The following table shows the differences between structured and unstructured data:

Parameter | Structured Data | Unstructured Data
Storage | DBMS | Unmanaged file structures
Standard | ADO.net, ODBC, and SQL | SMTP, XML, CSV, and SMS
Integration Tool | ETL (Extract, Transform, Load) | Manual data entry or batch processing that includes codes
Scaling | Schema scaling is difficult | Scaling is very easy

5. Explain all components of a Hadoop application?

Ans:

Following are the components of Hadoop application:

  • Hadoop Common: It is a common set of utilities and libraries that are used by Hadoop.
  • HDFS: This is the file system in which the Hadoop data is stored. It is a distributed file system with high bandwidth.
  • Hadoop MapReduce: It is based on the MapReduce algorithm and provides large-scale data processing.
  • Hadoop YARN: It is used for resource management within the Hadoop cluster. It can also be used for scheduling users' tasks.

6. What is NameNode?

Ans:

It is the centerpiece of HDFS. It stores the metadata of HDFS and tracks the files across the cluster. The actual data is not stored here; it is stored on the DataNodes.

7. Define Hadoop streaming?

Ans:

It is a utility that allows you to create Map and Reduce jobs and submit them to a specific cluster.

8. What is the full form of HDFS?

Ans:

HDFS stands for Hadoop Distributed File System.

9. Define Block and Block Scanner in HDFS?

Ans:

Blocks are the smallest unit of a data file. Hadoop automatically splits huge files into these small pieces. The Block Scanner verifies the list of blocks that are present on a DataNode.

10. Name two messages that NameNode gets from DataNode?

Ans:

There are two messages which NameNode gets from DataNode. They are

  • Block report
  • Heartbeat.

11. What is data engineering to you?

Ans:

  • Data engineering refers to a role within the field of big data that deals with data architecture and infrastructure. The data is generated by many kinds of sources, from internal databases to external data sets, and it has to be transformed, profiled and cleansed to meet business needs.
  • Raw data that is collected but never used is sometimes called dark data. The practice of working with the data and making it accessible to the employees who need it to inform their decisions is called data engineering.

12. Discuss the four V’s of Big data?

Ans:

They include the following:

  • Velocity: the analysis of streaming data. Big data is generated continuously over time, and velocity refers to the rate at which it is generated.
  • Variety: the different forms that big data takes. It can come as log files, voice recordings, images, and media files.
  • Volume: the scale of the data. The scale may be defined in terms of the number of users, the size of the data, the number of tables and the number of records.
  • Veracity: the certainty or uncertainty of the data, that is, how sure you can be about its accuracy.

13. What is the difference between unstructured and structured data?

Ans:

  • Structured data can be stored in a traditional database system such as MS Access, SQL Server or Oracle. It is stored in rows and columns, and it can be easily defined according to a data model. Many online application transactions are structured data. Unstructured data, by contrast, cannot be stored in terms of rows and columns.
  • Therefore, unstructured data cannot be stored in a traditional database system. It usually varies in size and content. Examples include tweets, Facebook likes and Google search items, and some Internet of Things data is also unstructured. It is hard to fit unstructured data into a predefined data model. Software that supports unstructured data includes MongoDB and Hadoop.

14. Describe a data disaster recovery situation and your particular role?

Ans:

  • You are required to complete the daily tasks assigned to you, but hiring managers are also looking for data engineers who can contribute in emergency situations as well as to the overall success of product decision-making. When data is not accessible, it may have damaging effects on the operations of the company.
  • Companies need to make certain they have the appropriate resources ready to deal with a failure if it happens. A lot of the time, it becomes an all-hands-on-deck situation.

15. Explain the responsibilities of a data engineer?

Ans:

The task of the data engineer is to handle the data stewardship within the company.

  • They also handle and maintain the source systems of the data and the staging areas.
  • They simplify data cleansing and improve the de-duplication and subsequent building of data.
  • They provide and execute ELT and data transformation.
  • They build ad-hoc data queries and handle data extraction.

16. What are the main forms of design schemas when it comes to data modeling?

Ans:

There are two types of schemas when it comes to data modeling:

  • Star schema: this schema has two kinds of tables, a fact table and dimension tables, with the dimension tables connected to the fact table. The foreign keys in the fact table reference the primary keys of the dimension tables.
  • Snowflake schema: here the level of normalization is increased. The fact table is the same as in the star schema, but the dimension tables are split into several layers. Because of these layers the diagram looks like a snowflake, hence the name.

17. Do you have any experience when it comes to data modeling?

Ans:

  • You may say that you have worked on a project for a healthcare or insurance client that used ETL tools such as Informatica, Talend and Pentaho. These tools transformed and processed the data fetched from a MySQL/RDS/SQL database and sent the information to vendors that would help increase revenue.
  • You might then illustrate the high-level architecture of the data model, which entails primary keys, attributes, relationship constraints and so on.

18. What is one of the hardest aspects of being a data engineer?

Ans:

  • You may be tempted to give an indirect answer to this question for fear of highlighting a weakness. Understand that this is one of those questions that doesn’t have a perfect answer.
  • Instead, try to identify something you had a hard time with and explain how you dealt with it.

19. Illustrate a moment when you found a new use for existing data and it had a positive effect on the company?

Ans:

  • As a data engineer, I will most likely have a better perspective on and understanding of the data within the company. If certain departments are looking to garner insights from a product, sales or marketing effort, I can help them understand the data better.
  • To add the most value to the company’s strategies, it is useful to know the initiatives of each department. That gives me, the data engineer, a greater chance of providing valuable insights from the data.

20. What are the fields or languages that you need to learn in order to become a data engineer?

Ans:

  • Mathematics such as probability and linear algebra
  • Statistics like regression and trend analysis
  • R and SAS learning techniques
  • SQL databases, Hive QL
  • Python
  • Machine learning techniques

21. What are the essential qualities of a data engineer?

Ans:

  • “A successful data engineer needs to know how to architect distributed systems and data stores, create dependable pipelines and combine data sources effectively. Data engineers also need to be able to collaborate with team members and colleagues from other departments.
  • To accomplish all of these tasks, a data engineer needs strong math and computing skills, critical thinking and problem solving skills and communication and leadership capabilities.”

22. Which frameworks and applications are critical for data engineers?

Ans:

“Data engineers have to be proficient in SQL, Amazon Web Services, Hadoop and Python. I am fluent with all of these frameworks and I am also familiar with Tableau, Java, Hive and Apache Spark. I embrace every opportunity to learn new frameworks.”

23. Can you explain the design schemas relevant to data modeling?

Ans:

“Data modeling involves two schemas, star and snowflake. Star schema includes dimension tables that are connected to a fact table. Snowflake schema includes a similar fact table and dimension tables with snowflake-like layers.”

24. Do you consider yourself database- or pipeline-centric?

Ans:

“Because I usually opt to work for smaller companies, I am a generalist who is equally comfortable with a database or pipeline focus. Since I specialize in both components, I have comprehensive knowledge of distributed systems and data warehouses.”

25. What is the biggest professional challenge you have overcome as a data engineer?

Ans:

“Last year, I served as the lead data engineer for a project that had insufficient internal support. As a result, my portion of the project fell behind schedule and I risked disciplinary measures. After my team missed the first deadline, I took the initiative to meet with the project manager and proposed possible solutions. Based on my suggestions, the company assigned additional personnel to my team and we were able to complete the project successfully within the original timeline.”

26. As a data engineer, how would you prepare to develop a new product?

Ans:

“As a lead data engineer, I would request an outline of the entire project so I can understand the complete scope and the particular requirements. Once I know what the stakeholders want and why, I would sketch some scenarios that might arise. Then I would use my understanding to begin developing data tables with the appropriate level of granularity.”

27. List out various XML configuration files in Hadoop?

Ans:

There are four main XML configuration files in Hadoop:

  • mapred-site.xml
  • core-site.xml
  • hdfs-site.xml
  • yarn-site.xml

28. What are four V’s of big data?

Ans:

Four V’s of big data are:

  • Velocity
  • Variety
  • Volume
  • Veracity

29. Explain the features of Hadoop?

Ans:

Important features of Hadoop are:

  • It is an open-source framework that is available free of charge.
  • Hadoop is compatible with many types of hardware, and it is easy to add new hardware as nodes.
  • Hadoop supports faster distributed processing of data.
  • It stores the data in the cluster, independently of the rest of the operations.
  • Hadoop creates 3 replicas of each block on different nodes.

30. Explain the main methods of Reducer?

Ans:

  • setup(): It is called once at the start of the task and is used for configuring parameters like the size of the input data and the distributed cache.
  • cleanup(): It is called once at the end of the task and is used to clean up temporary files.
  • reduce(): It is the heart of the reducer and is called once per key with the values associated with that key.

31. What is the abbreviation of COSHH?

Ans:

COSHH stands for Classification and Optimization based Scheduler for Heterogeneous Hadoop systems.

32. Explain Star Schema?

Ans:

Star Schema or Star Join Schema is the simplest type of Data Warehouse schema. It is known as a star schema because its structure resembles a star. In the star schema, the center of the star has one fact table, which is connected to multiple associated dimension tables. This schema is used for querying large data sets.
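As a rough illustration of the fact-and-dimension layout described above, here is a minimal sketch using Python's built-in sqlite3 module. The table names (fact_sales, dim_product, dim_store) and the sample rows are hypothetical and chosen only for this example.

```python
import sqlite3

# In-memory database; all table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
    -- The fact table sits at the center and references each dimension.
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id   INTEGER REFERENCES dim_store(store_id),
        amount     REAL
    );
    INSERT INTO dim_product VALUES (1, 'USB'), (2, 'Mouse');
    INSERT INTO dim_store   VALUES (1, 'Berlin'), (2, 'Paris');
    INSERT INTO fact_sales  VALUES (1, 1, 1, 10.0), (2, 1, 2, 12.0), (3, 2, 1, 20.0);
""")

# A typical star query: join the fact table to its dimensions and aggregate.
star_rows = conn.execute("""
    SELECT p.name, s.city, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_store   s ON s.store_id   = f.store_id
    GROUP BY p.name, s.city
    ORDER BY p.name, s.city
""").fetchall()
```

Every query reaches a dimension with a single join from the fact table, which is what keeps star-schema queries simple.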

33. Explain FSCK?

Ans:

File System Check, or FSCK, is a command used by HDFS. The FSCK command is used to check for inconsistencies and problems in files.

34. Explain Snowflake Schema?

Ans:

A Snowflake Schema is an extension of a Star Schema that adds additional dimension tables. It is so called because its diagram looks like a snowflake. The dimension tables are normalized, which splits the data into additional tables.

35. Distinguish between Star and Snowflake Schema?

Ans:

Star Schema | Snowflake Schema
Dimension hierarchies are stored in the dimension table. | Each hierarchy is stored in separate tables.
Chances of data redundancy are high. | Chances of data redundancy are low.
It has a very simple DB design. | It has a complex DB design.
Provides a faster way for cube processing. | Cube processing is slow due to the complex joins.

36. Explain Hadoop distributed file system?

Ans:

Hadoop works with scalable distributed file systems like S3, HFTP FS, FS, and HDFS. The Hadoop Distributed File System is modeled on the Google File System. It is designed so that it can easily run on a large cluster of computers.

37. Explain the main responsibilities of a data engineer?

Ans:

Data engineers have many responsibilities. They manage the source systems of data. They simplify complex data structures and prevent the duplication of data. Many times they also provide ELT and data transformation.

38. What is the full form of YARN?

Ans:

The full form of YARN is Yet Another Resource Negotiator.

39. List various modes in Hadoop

Ans:

Modes in Hadoop are

  • Standalone mode
  • Pseudo distributed mode
  • Fully distributed mode.

40. How to achieve security in Hadoop?

Ans:

Perform the following steps to achieve security in Hadoop:

  • The first step is to secure the authentication of the client to the server, which then provides a time-stamped TGT (ticket-granting ticket) to the client.
  • In the second step, the client uses the received TGT to request a service ticket from the TGS (ticket-granting server).
  • In the last step, the client uses the service ticket to authenticate itself to a specific server.

41. What Are Relational vs Non-Relational Databases?

Ans:

  • A relational database is one where data is stored in the form of tables. Each table has a schema, which defines the columns and types a record is required to have. Each table must have at least one primary key that uniquely identifies each record. In other words, there are no duplicate rows in a table. Moreover, each table can be related to other tables using foreign keys.
  • Non-relational databases tackle things in a different way. They are inherently schema-less, which means that records can be saved with different schemas and with a different, nested structure. Records can still have primary keys, but a change in the schema is done on an entry-by-entry basis.
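The contrast can be sketched in a few lines of Python. The relational half uses the built-in sqlite3 module; the non-relational half uses plain dictionaries as a stand-in for documents in a document store. The table names, field names and sample values are all hypothetical.

```python
import sqlite3

# Relational side: a fixed schema with a primary key and a foreign key.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, firstname TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE bought_item (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        title       TEXT NOT NULL
    )
""")
conn.execute("INSERT INTO customer VALUES (1, 'Bob')")
conn.execute("INSERT INTO bought_item VALUES (1, 1, 'USB')")

# A second row with the same primary key is rejected by the database.
try:
    conn.execute("INSERT INTO customer VALUES (1, 'Alice')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Non-relational side: each record may carry different, nested fields.
documents = [
    {"_id": 1, "firstname": "Bob", "boughtitems": [{"title": "USB"}]},
    {"_id": 2, "firstname": "Alice", "email": "alice@example.com"},
]
field_sets = [sorted(doc) for doc in documents]
```

The two documents keep their own field sets without any schema change, whereas adding an email column on the relational side would require altering the table.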

42. What are SQL Aggregation Functions?

Ans:

Aggregation functions are those that perform a mathematical operation on a result set. Some examples include AVG, COUNT, MIN, MAX, and SUM. Often, you’ll need GROUP BY and HAVING clauses to complement these aggregations. 
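A small sketch of these functions in action, again using Python's built-in sqlite3 module. The orders table and its rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")  # hypothetical table
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Bob", 10.0), ("Bob", 30.0), ("Alice", 5.0), ("Carol", 50.0)],
)

# GROUP BY forms one group per customer; HAVING filters whole groups
# after aggregation (WHERE would filter individual rows before it).
agg_rows = conn.execute("""
    SELECT customer, COUNT(*), SUM(amount), AVG(amount), MIN(amount), MAX(amount)
    FROM orders
    GROUP BY customer
    HAVING SUM(amount) > 20
    ORDER BY customer
""").fetchall()
```

Alice is dropped by the HAVING clause because her total is below the threshold, while Bob and Carol each produce one summary row.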

43. How to Speed Up SQL Queries?

Ans:

Speed depends on various factors, but is mostly affected by how many of each of the following are present:

  • Joins
  • Aggregations
  • Traversals
  • Records

44. How to Debug SQL Queries?

Ans:

Most databases include an EXPLAIN-style command that describes the steps the database takes to execute a query. For SQLite, you can enable this functionality by adding EXPLAIN QUERY PLAN in front of a SELECT statement.
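For instance, the following sketch runs EXPLAIN QUERY PLAN through Python's sqlite3 module before and after adding an index. The customer table and the index name are assumptions for illustration, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, firstname TEXT)")


def plan(sql):
    """Return the human-readable detail column of each query plan step."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


query = "SELECT * FROM customer WHERE firstname = 'Bob'"
plan_before = plan(query)  # without an index, SQLite scans the whole table
conn.execute("CREATE INDEX idx_customer_firstname ON customer(firstname)")
plan_after = plan(query)   # with the index, the plan refers to it by name
```

Comparing the two plans is a quick way to confirm that a query actually uses the index you created for it.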

45. How to Query Data With MongoDB?

Ans:

Let’s try to replicate the BoughtItem table first, as you did in SQL. To do this, you must append a new field to a customer. MongoDB’s documentation specifies that the $set operator can be used to update a record without having to write all the existing fields:

    # `customers` is assumed to be an existing pymongo collection.
    # Just add "boughtitems" to the customer where the firstname is Bob
    bob = customers.update_many(
        {"firstname": "Bob"},
        {"$set": {"boughtitems": [
            {"title": "USB",
             "price": 10.2,
             "currency": "EUR",
             "notes": "Customer wants it delivered via FedEx",
             "original_item_id": 1}
        ]}},
    )

46. Differentiate NoSQL vs SQL?

Ans:

  • SQL databases, particularly PostgreSQL, have also released a feature that allows queryable JSON data to be inserted as part of a record. While this can combine the best of both worlds, speed may be of concern.
  • It’s faster to query unstructured data from a NoSQL database than it is to query JSON fields from a JSON-type column in PostgreSQL. You can always do a speed comparison test for a definitive answer.
  • Nonetheless, this feature might reduce the need for an additional database. Sometimes, pickled or serialized objects are stored in records in the form of binary types, and then de-serialized on read.

47. How to Use Cache Databases?

Ans:

  • If you’ve ever used dictionaries in Python, then Redis follows the same structure. It’s a key-value store, where you can SET and GET data just like a Python dict.
  • When a request comes in, you first check the cache database, then the main database. This way, you can prevent any unnecessary and repetitive requests from reaching the main database’s server. Since a cache database has a lower read time, you also benefit from a performance increase.
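The cache-first lookup described above can be sketched as follows. To keep the example self-contained, FakeCache is a hypothetical in-memory stand-in with get and set methods rather than a real Redis client, and main_db is a plain dictionary playing the role of the main database.

```python
class FakeCache:
    """In-memory stand-in for a key-value cache such as Redis (an assumption)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


main_db = {"user:1": "Bob", "user:2": "Alice"}  # hypothetical main database
stats = {"db_reads": 0}                          # counts trips to the main database


def fetch(key, cache):
    """Check the cache first; fall back to the main database on a miss."""
    value = cache.get(key)
    if value is None:
        stats["db_reads"] += 1
        value = main_db.get(key)
        if value is not None:
            cache.set(key, value)  # populate the cache for the next request
    return value


cache = FakeCache()
first = fetch("user:1", cache)   # miss: reads the main database
second = fetch("user:1", cache)  # hit: served from the cache
```

Only the first request reaches the main database; the repeat request is answered from the cache, which is where the performance gain comes from.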

48. What are ETL Challenges?

Ans:

There are several challenging concepts in ETL, including the following:

  • Big data
  • Stateful problems
  • Asynchronous workers
  • Type-matching

49. How to Design Patterns in Big Data?

Ans:

Imagine Amazon needs to create a recommender system to suggest suitable products to users. The data science team needs data and lots of it! They go to you, the data engineer, and ask you to create a separate staging database warehouse. That’s where they’ll clean up and transform the data.

50. What Are the Common Aspects of the ETL Process and Big Data Workflows?

Ans:

Both workflows follow the Producer-Consumer pattern. A worker (the Producer) produces data of some kind and outputs it to a pipeline. This pipeline can take many forms, including network messages and triggers. After the Producer outputs the data, the Consumer consumes and makes use of it. These workers typically work in an asynchronous manner and are executed in separate processes.
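A minimal sketch of the Producer-Consumer pattern, using Python's standard queue and threading modules. Threads and an in-process queue are used here only to keep the example short; they stand in for the separate processes and network pipelines mentioned above. The function names and sample records are hypothetical.

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream


def producer(pipeline, items):
    """Produce records and push them onto the pipeline."""
    for item in items:
        pipeline.put(item)
    pipeline.put(SENTINEL)


def consumer(pipeline, results):
    """Pull records off the pipeline and process them until the sentinel arrives."""
    while True:
        item = pipeline.get()
        if item is SENTINEL:
            break
        results.append(item.upper())


pipeline = queue.Queue()
results = []
workers = [
    threading.Thread(target=producer, args=(pipeline, ["extract", "transform", "load"])),
    threading.Thread(target=consumer, args=(pipeline, results)),
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
```

The producer and consumer never call each other directly; the queue decouples them, so either side can be scaled or replaced independently.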

51. What is Heartbeat in Hadoop?

Ans:

In Hadoop, NameNode and DataNode communicate with each other. Heartbeat is the signal sent by DataNode to NameNode on a regular basis to show its presence.

52. Distinguish between NAS and DAS in Hadoop?

Ans:

NAS | DAS
Storage capacity is 10^9 to 10^12 bytes. | Storage capacity is 10^9 bytes.
Management cost per GB is moderate. | Management cost per GB is high.
Transmits data using Ethernet or TCP/IP. | Transmits data using IDE/SCSI.

53. List important fields or languages used by data engineer?

Ans:

Here are a few fields or languages used by data engineer:

  • Probability as well as linear algebra
  • Machine learning
  • Trend analysis and regression
  • Hive QL and SQL databases

54. What is Big Data?

Ans:

It is a large amount of structured and unstructured data that cannot be easily processed by traditional data storage methods. Data engineers use Hadoop to manage big data.

55. What is FIFO scheduling?

Ans:

It is a Hadoop job scheduling algorithm. In FIFO scheduling, the JobTracker selects jobs from a work queue, oldest job first.

56. Mention default port numbers on which task tracker, NameNode, and job tracker run in Hadoop?

Ans:

Default port numbers on which task tracker, NameNode, and job tracker run in Hadoop are as follows:

  • Task tracker runs on 50060 port
  • NameNode runs on 50070 port
  • Job Tracker runs on 50030 port

57. How to disable Block Scanner on HDFS Data Node?

Ans:

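In order to disable the Block Scanner on an HDFS DataNode, set dfs.datanode.scan.period.hours to a negative value. Older documentation states 0, but in recent Hadoop releases 0 selects the default scan period instead.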

58. How to define the distance between two nodes in Hadoop?

Ans:

The distance between two nodes is equal to the sum of their distances to their closest common ancestor in the network topology. The method getDistance() is used to calculate the distance between two nodes.
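The calculation can be sketched in Python as follows. This is an illustrative reimplementation, not Hadoop's own getDistance(); it assumes each node is identified by a topology path of the form /datacenter/rack/node, and the paths used are made up.

```python
def get_distance(node_a, node_b):
    """Sum the hops from each node up to their closest common ancestor."""
    path_a = node_a.strip("/").split("/")
    path_b = node_b.strip("/").split("/")
    common = 0
    for part_a, part_b in zip(path_a, path_b):
        if part_a != part_b:
            break
        common += 1
    return (len(path_a) - common) + (len(path_b) - common)


same_node = get_distance("/d1/r1/n1", "/d1/r1/n1")             # 0
same_rack = get_distance("/d1/r1/n1", "/d1/r1/n2")             # 2
same_datacenter = get_distance("/d1/r1/n1", "/d1/r2/n3")       # 4
different_datacenter = get_distance("/d1/r1/n1", "/d2/r3/n4")  # 6
```

A smaller distance means the nodes are closer in the topology, which is why Hadoop prefers reading a block from a node on the same rack when one is available.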

59. Why use commodity hardware in Hadoop?

Ans:

Commodity hardware is easy to obtain and affordable. It is a system that is compatible with Windows, MS-DOS, or Linux.

60. Define replication factor in HDFS?

Ans:

The replication factor is the total number of replicas of each block of a file that are kept in the system.

61. What are some of the common issues that are faced by the data engineer?

Ans:

  • Real time integration and continuous integration
  • Storing a large amount of data is one issue, and extracting information from that data is another
  • Choosing the processors and the RAM configurations
  • Dealing with failure, and determining whether fault tolerance is in place
  • Deciding which tools can be used, and which of them will provide the best storage, efficiency, performance and results.

62. What are all the components of a Hadoop Application?

Ans:

The definition of a Hadoop application has varied over time, but in most cases there are four core components.

These include the following:

  • Hadoop Common: this is the common set of libraries and utilities that are used by Hadoop.
  • HDFS: this is the file system in which the Hadoop data is stored. It is a distributed file system with high bandwidth.
  • Hadoop MapReduce: this is based on the MapReduce algorithm and provides large-scale data processing.
  • Hadoop YARN: this is used for resource management within the Hadoop cluster. It may also be used for scheduling users' tasks.

63. What is the main concept behind the Apache Hadoop Framework?

Ans:

Apache Hadoop is built around the MapReduce concept. In this algorithm, Map and Reduce operations are used to process large data sets. The Map method filters and sorts the data, while the Reduce method summarizes it. The key properties of this approach are fault tolerance and scalability. In Apache Hadoop, these are achieved through multi-threading and an efficient implementation of MapReduce.
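To make the Map and Reduce roles concrete, here is a single-process Python sketch of the idea. It is not Hadoop code; the function names, the (region, amount) record layout and the sample data are all hypothetical.

```python
from collections import defaultdict


def map_phase(record):
    """Map: filter out invalid records and emit (region, amount) pairs."""
    region, amount = record
    if amount > 0:
        yield region, amount


def shuffle(pairs):
    """Shuffle: group every emitted value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: summarize all values that share one key."""
    return key, sum(values)


records = [("eu", 10), ("us", 5), ("eu", -1), ("us", 7), ("eu", 3)]
mapped = [pair for record in records for pair in map_phase(record)]
totals = dict(reduce_phase(key, values) for key, values in sorted(shuffle(mapped).items()))
```

Because each map call looks at one record and each reduce call looks at one key, both phases can be spread across many machines, which is what gives the model its scalability.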

64. What is the main difference between the NameNode, Backup Node and Checkpoint Node in HDFS?

Ans:

  • NameNode: this is the core of the HDFS file system and manages the metadata. That is to say, the file data itself is not stored on the NameNode; instead, it holds the directory tree of all the files present in HDFS on a Hadoop cluster. Two files are used for the namespace:
  • Edits file: this is a log of the changes made to the namespace since the last checkpoint.
  • Fsimage file: this file keeps track of the latest checkpoint of the namespace.
  • BackupNode: this node provides checkpointing functionality, and it also maintains an up-to-date in-memory copy of the file system namespace that is synchronized with the active NameNode.
  • Checkpoint Node: this node keeps track of the latest checkpoint in a directory that has the same structure as the NameNode’s directory. It creates checkpoints of the namespace at regular intervals by downloading the edits and fsimage files from the NameNode and merging them locally. The new image is then uploaded back to the active NameNode.

65. Explain how the analysis of Big data is helpful when it comes to the increase of business revenue?

Ans:

  • Big data has gained significance for a variety of businesses. It helps them differentiate themselves from others and, in doing so, increases the odds of revenue gains. Through predictive analytics, big data analytics allows businesses to customize their suggestions and recommendations. Similarly, it allows enterprises to launch new products based on the needs and preferences of their clients.
  • These factors enable businesses to earn more revenue, which is why companies frequently use big data analytics. Corporations may see a rise of 5 to 20 percent in revenue by implementing big data analytics. Popular organizations that use big data analytics to increase their revenue include Facebook, Twitter and even Bank of America.

66. What are the steps to be followed for deploying a Big Data solution?

Ans:

  • Data ingestion: the first step in deploying a big data solution is data ingestion, that is, the extraction of data from different sources. The data sources could be a CRM like SAP, Salesforce or Highrise, an RDBMS such as MySQL, log files, internal databases and much more. The data can be ingested through batch jobs or real-time streaming. The extracted data is then stored within HDFS.
  • Data storage: after the data ingestion, the next step is to store the extracted data. The data is preferably stored either in HDFS or in a NoSQL database. HDFS storage works especially well for sequential access, whereas HBase suits random read or write access.
  • Data processing: the concluding step in deploying a big data solution is data processing. The data is processed through one of the main processing frameworks, such as MapReduce or Pig.

67. What are some of the important features inside Hadoop?

Ans:

Hadoop supports the processing and storage of big data. It is a strong solution for handling big data challenges.

Some of the main features of Hadoop include the following:

  • Open source: Hadoop is an open-source framework, which is to say that it is available free of charge. Users are also allowed to alter the source code according to their requirements.
  • Distributed processing: Hadoop supports the distributed processing of data, which makes processing faster. The data within Hadoop HDFS is stored in a distributed fashion, and MapReduce is responsible for the parallel processing of that data.
  • Fault tolerance: Hadoop is highly fault tolerant. By default, it creates three replicas of each block on different nodes, and this number may be changed according to the requirements. If one of the nodes fails, the data can be recovered from another node. The detection of node failure and the recovery of data are done automatically.
  • Scalability: the other significant feature of Hadoop is scalability. It is compatible with many types of hardware, and it is easy to add new hardware as nodes.
  • Reliability: Hadoop stores the data in the cluster in a reliable manner, independently of all other operations. That means the data stored within the Hadoop environment is not affected by the failure of a machine.
  • High availability: the data stored within Hadoop remains accessible even after a hardware failure. In the event of hardware failure, the data can be accessed from another path.

68. What is the difference between a data architect and a data engineer?

Ans:

  • The data architect is the person who manages the data, particularly when dealing with a large number of varied data sources. A data architect has in-depth knowledge of how the database works, how the data relates to business problems and how changes would disturb the organization’s usage, and manipulates the data architecture accordingly. The main responsibility of the data architect is working on data warehousing and developing the data architecture or the enterprise data warehouse/hub.
  • The data engineer assists with the installation of data warehousing solutions, data modeling, and the development and testing of the database architecture.

69. Describe one time when you found a new use case for the present database, which had a positive effect on the enterprise?

Ans:

In the era of big data, a SQL-based RDBMS has the following limitations:

  • An RDBMS is a schema-oriented database, so it is better suited to structured data than to semi-structured or unstructured data.
  • It is not able to process unpredictable or unstructured data.
  • It is not horizontally scalable, which is to say that parallel execution and storage are not possible within SQL.
  • It suffers from performance issues once the number of users starts to increase.
  • Its main use case is online transaction processing.

70. Give a healthy description of what Hadoop Streaming is to you?

Ans:

The Hadoop distribution provides a Java utility known as Hadoop Streaming. With Hadoop Streaming, it is possible to create and run MapReduce tasks with any executable script. You can write executable scripts for both the Mapper and the Reducer functions, and these scripts are passed to Hadoop Streaming in a command. The Hadoop Streaming utility creates the Map and Reduce jobs and submits them to a particular cluster. It is also possible to monitor these jobs with the utility.
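As a rough sketch, the logic of a streaming mapper and reducer could look like the following in Python. In a real job each function would live in its own executable script that reads lines from standard input and writes tab-separated key-value lines to standard output; here they take and return lists of lines so the example can be run locally. The word-count task and the sample input are assumptions for illustration.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated "word<TAB>1" line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per key; the input must already be sorted by key."""
    parsed = (line.split("\t") for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


# Local dry run: sorted() stands in for the sort step between map and reduce.
mapped_lines = sorted(mapper(["spark hive spark", "hive spark"]))
reduced_lines = list(reducer(mapped_lines))
```

Because the contract is only lines in and lines out, the same two scripts could be written in any language that can read standard input.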

71. Can you tell me what the Block and Block Scanner in HDFS?

Ans:

In HDFS, a large file is broken into parts, and each part is stored as a separate Block. The default Block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x and later. The Block Scanner is a program that every DataNode in HDFS runs periodically to verify the checksum of each block stored on that DataNode. Its goal is to detect data corruption errors on the DataNode.
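
The splitting arithmetic can be sketched as follows. This is an illustration only: the function name is hypothetical, and the block size is a parameter because the default depends on the Hadoop version (128 MB is assumed here, 64 MB applies to older releases).

```python
MB = 1024 * 1024


def split_into_blocks(file_size_bytes, block_size_bytes=128 * MB):
    """Return the sizes of the blocks a file of the given size would be split into."""
    if file_size_bytes <= 0:
        return []
    full_blocks, remainder = divmod(file_size_bytes, block_size_bytes)
    sizes = [block_size_bytes] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds the leftover bytes
    return sizes


blocks = split_into_blocks(300 * MB)
print(len(blocks), [size // MB for size in blocks])  # 3 [128, 128, 44]
```

With a 64 MB block size, the same 300 MB file would occupy five blocks instead of three.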

72. What are the different modes in which Hadoop can run?

Ans:

  • Standalone Mode: by default, Hadoop runs in a local, non-distributed mode on a single node. This mode uses the local file system for input and output operations. It does not support HDFS, so it is used mainly for debugging. No custom configuration of the configuration files is required in this mode.
  • Pseudo Distributed Mode: in this mode, Hadoop also runs on a single node, as in Standalone mode, but each daemon runs in a separate Java process. Because all daemons run on one node, the same node acts as both the Master and the Slave node.
  • Fully Distributed Mode: in this mode, the daemons run on separate nodes and form a multi-node cluster. The Master and Slave roles run on different nodes.

73. How would you achieve security within Hadoop?

Ans:

Kerberos is a tool often used to achieve security within Hadoop. Accessing a service with Kerberos takes three steps, and each step involves a message exchange with a server.

  • Authentication: the client first authenticates itself to the authentication server, which then provides a time-stamped TGT (Ticket-Granting Ticket) to the client.
  • Authorization: the client uses the received TGT to request a service ticket from the ticket-granting server.
  • Service request: in the final step, the client uses the service ticket to authenticate itself to the server that provides the service.

74. What data is stored in an HDFS NameNode?

Ans:

The NameNode is the central node of an HDFS system. It does not store any of the data from MapReduce operations. Instead, it holds the metadata for the data stored on the HDFS DataNodes, including the directory tree of the files in the HDFS filesystem. Using this metadata, it manages the data stored across the different DataNodes.

75. Why did you study data engineering?

Ans:

This question aims to learn about your education, work experience, and background. Data engineering might have been a natural continuation of your Information Systems or Computer Science degree. Or maybe you have worked in a similar field, or you might be transitioning from an entirely different work area. However, don’t start storytelling. Briefly cover your educational background, then get to the point when you knew you wanted to be a data engineer, and then explain how you got here.

76. What is the toughest thing about being a data engineer according to you?

Ans:

“As a data engineer, I find it hard to fulfill the requests of all the departments in a company, because they often come up with conflicting demands. So, I often find it challenging to balance them accordingly.”

77. Tell us an incident where you were supposed to bring data together from various sources but faced unexpected issues and how did you resolve it?

Ans:

“In my previous franchise company, my team and I were supposed to collect data from various locations and systems. But one of the franchises changed its system without giving us any prior notice. This resulted in a handful of issues for data collection and processing.”

78. How is the job of a data engineer different from that of a data architect?

Ans:

“According to my experience, the difference between the roles of a data engineer and a data architect varies from company to company. Although they work very closely together, there are differences in their general responsibilities.”

79. What are Big Data’s four V’s?

Ans:

The four V’s of Big Data are:

  • The first V is Velocity, which refers to the rate at which Big Data is generated over time. It determines how quickly the data has to be collected and analyzed.
  • The second V is Variety, the various forms of Big Data, such as images, log files, media files, and voice recordings.
  • The third V is Volume, the amount of data. It could be measured in the number of users, the number of tables, the size of the data, or the number of records.
  • The fourth V is Veracity, which relates to the certainty or uncertainty of the data. In other terms, it decides how sure you can be about the accuracy of the data.

80. How is structured data different from unstructured data?

Ans:

The table below explains the differences:

Structured Data | Unstructured Data
It can be stored in MS Access, Oracle, SQL Server, and other similar traditional database systems. | It can’t be stored in a traditional database system.
It can be stored within different columns and rows. | It can’t be stored in rows and columns.
An example of structured data is online application transactions. | Examples of unstructured data are Tweets, Google searches, Facebook likes, etc.
It can be easily defined within the data model. | It can’t be defined according to the data model.
It comes with a fixed size and contents. | It comes in various sizes and contents.

81. Which ETL tools are you familiar with?

Ans:

  • Name all the ETL tools you have worked with. You can say, “I have worked with SAS Data Management, IBM InfoSphere, and SAP Data Services, but my preferred one is PowerCenter from Informatica. It is efficient, performs extremely well, and is flexible. In short, it has all the important properties of a good ETL tool: it runs business data operations smoothly and guarantees data access even when the business or its structure changes.”
  • Make sure you only talk about the tools you have actually worked with and like working with. Otherwise, it could tank your interview later.

82. Tell us about design schemas of data modeling?

Ans:

Data modeling comes with two types of design schemas.

They are explained as follows:

  • The first one is the Star schema, which has two kinds of tables: the fact table and the dimension table. Here, both the tables are connected. Star schema is the simplest data mart schema style and is the most widely used as well. It is named so because its structure resembles a star.
  • The second one is the Snowflake schema which is the extension of the star schema. It adds additional dimensions and is called a snowflake because its structure resembles that of a snowflake.

83. What is the difference between Star schema and Snowflake schema?

Ans:

The table below explains the differences:

Star Schema | Snowflake Schema
The dimension table contains the hierarchies for the dimensions. | There are separate tables for hierarchies.
Here dimension tables surround a fact table. | Dimension tables surround a fact table and then they are further surrounded by dimension tables.
A fact table and any dimension table are connected by just a single join. | To fetch the data, it requires many joins.
It comes with a simple DB design. | It has a complex DB design.
Works well even with denormalized queries and data structures. | Works only with the normalized data structure.
Data redundancy: high. | Data redundancy: very low.
The data for a dimension is contained in a single dimension table. | Data is split into different dimension tables.
Faster cube processing. | Complex joins slow cube processing.
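
The join difference can be illustrated with a small in-memory SQLite sketch. All table names, column names, and rows below are hypothetical. The star version keeps the category inside the product dimension, while the snowflake version moves it into its own table, so the same report needs one more join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Star schema: the product dimension keeps its category inline (denormalized).
    CREATE TABLE dim_product_star (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    -- Snowflake schema: the category hierarchy moves to its own table.
    CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_product_snow (product_id INTEGER PRIMARY KEY, name TEXT, category_id INTEGER);
    CREATE TABLE fact_sales (product_id INTEGER, amount REAL);

    INSERT INTO dim_product_star VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
    INSERT INTO dim_category VALUES (10, 'Electronics'), (20, 'Furniture');
    INSERT INTO dim_product_snow VALUES (1, 'Laptop', 10), (2, 'Desk', 20);
    INSERT INTO fact_sales VALUES (1, 900.0), (1, 1100.0), (2, 250.0);
""")

# Star schema: one join from the fact table to the dimension table.
star_query = """
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product_star p ON p.product_id = f.product_id
    GROUP BY p.category ORDER BY p.category
"""
# Snowflake schema: a second join is needed to reach the category name.
snowflake_query = """
    SELECT c.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product_snow p ON p.product_id = f.product_id
    JOIN dim_category c ON c.category_id = p.category_id
    GROUP BY c.category ORDER BY c.category
"""
star_rows = conn.execute(star_query).fetchall()
snowflake_rows = conn.execute(snowflake_query).fetchall()
print(star_rows)
print(snowflake_rows)
```

Both queries return the same totals per category. The trade-off is that the star version repeats the category text in every product row, while the snowflake version stores it once.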

84. What is the difference between Data warehouse and Operational database?

Ans:

The table below explains the differences:

Data Warehouse | Operational Database
It is designed to support high-volume analytical processing. | It is designed to support high-volume transaction processing.
Historical data affects a data warehouse. | Current data affects the operational database.
New, non-volatile data is added regularly but is rarely changed afterwards. | Data is updated regularly as the need arises.
It is designed for analyzing business measures by attributes, subject areas, and categories. | It is designed for real-time processing and business dealings.
Optimized for heavy loads and complex queries that access many rows in every table. | Optimized for a simple, single set of transactions, like retrieving or adding one row at a time in a table.
It is full of valid and consistent information and doesn’t need any real-time validation. | It is built to validate incoming information and uses validation data tables.
Supports only a handful of concurrent clients compared with OLTP. | Supports many concurrent clients.
Its systems are mainly subject-oriented. | Its systems are mainly process-oriented.
Data out. | Data in.
A huge amount of data can be accessed. | A limited amount of data can be accessed.
Created for OLAP, Online Analytical Processing. | Created for OLTP, Online Transaction Processing.

85. Point out the difference between OLTP and OLAP?

Ans:

The table below explains the differences:

OLTP | OLAP
Used to manage operational data. | Used to manage informational data.
Clients, clerks, and IT professionals use it. | Managers, analysts, executives, and other knowledge workers use it.
It is customer-oriented. | It is market-oriented.
It manages current, highly detailed data that is used for day-to-day decisions. | It manages a huge amount of historical data. It also provides facilities for aggregation and summarization, and stores data at different levels of granularity, which makes the data easier to use in decision making.
Database size ranges from about 100 MB to GBs. | Database size ranges from about 100 GB to TBs.
It uses an ER (entity-relationship) data model along with a database design that is application-oriented. | It uses either a snowflake or star model along with a database design that is subject-oriented.
The volume of data is not very large. | It has a large volume of data.
The access mode is read/write. | The access mode is mostly read.
Completely normalized. | Partially normalized.
Its processing speed is very fast. | Its processing speed depends on the number of files it contains, complex queries, and batch data refresh.

86. Explain the main concept behind the Framework of Apache Hadoop?

Ans:

It is based on the MapReduce algorithm. To process a huge data set, this algorithm uses Map and Reduce operations: Map filters and sorts the data, while Reduce summarizes it. Scalability and fault tolerance are the key points of this concept. Apache Hadoop achieves them by implementing MapReduce efficiently and running it in parallel across the nodes of a cluster.

87. What data is stored in NameNode?

Ans:

The NameNode stores the metadata for HDFS, such as block information and namespace information.

88. What do you mean by Rack Awareness?

Ans:

In a Hadoop cluster, the NameNode reduces network traffic by serving each read or write request from DataNodes that are on the same rack as the request, or on the nearest rack. To do this, the NameNode maintains the rack ID of each DataNode. This concept is called Rack Awareness in Hadoop.
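
The idea of preferring the nearest replica can be sketched as below. The function names and locations are hypothetical, and the distance values (0 for the same node, 2 for the same rack, 4 for a different rack) are an assumed convention used only for illustration, not a statement of Hadoop's internal implementation.

```python
def network_distance(client, replica):
    """Distance between two (rack_id, node_name) locations.

    Assumed convention: 0 for the same node, 2 for the same rack,
    4 for different racks.
    """
    client_rack, client_node = client
    replica_rack, replica_node = replica
    if client_rack == replica_rack:
        return 0 if client_node == replica_node else 2
    return 4


def closest_replica(client, replicas):
    """Pick the replica location with the smallest distance to the client."""
    return min(replicas, key=lambda replica: network_distance(client, replica))


client = ("rack1", "node3")
replicas = [("rack2", "node7"), ("rack1", "node5"), ("rack2", "node8")]
print(closest_replica(client, replicas))  # ('rack1', 'node5')
```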

89. What are the functions of Secondary NameNode?

Ans:

Following are the functions of Secondary NameNode:

  • FsImage: It stores a copy of the EditLog and the FsImage file.
  • NameNode crash: If the NameNode crashes, then the Secondary NameNode’s FsImage can be used to recreate the NameNode.
  • Checkpoint: The Secondary NameNode periodically merges the EditLog into the FsImage to create a checkpoint, which keeps the EditLog from growing without limit.
  • Update: It automatically updates the EditLog and the FsImage file, which keeps the FsImage file on the Secondary NameNode up to date.

90. What happens when NameNode is down, and the user submits a new job?

Ans:

The NameNode is the single point of failure in Hadoop (in a cluster without a standby NameNode), so the user cannot submit or execute a new job while it is down. If the NameNode goes down, running jobs may fail, and the user needs to wait for the NameNode to restart before running any job.

91. What are the basic phases of reducer in Hadoop?

Ans:

 There are three basic phases of a reducer in Hadoop:

  • Shuffle: Here, Reducer copies the output from Mapper.
  • Sort: Here, Hadoop sorts the input to the Reducer by key, so that all values with the same key are grouped together.
  • Reduce: In this phase, output values associated with a key are reduced to consolidate the data into the final output.
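
The three phases can be imitated in plain Python as a rough sketch. The mapper outputs below are hypothetical sample data, and the code only mirrors the order of the phases on one machine; in Hadoop the same steps happen across the network.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical output of two mappers: (key, value) pairs.
mapper_outputs = [
    [("error", 1), ("info", 1), ("error", 1)],
    [("info", 1), ("warn", 1), ("error", 1)],
]

# Shuffle: the reducer copies the pairs produced by every mapper.
shuffled = [pair for output in mapper_outputs for pair in output]

# Sort: pairs are ordered by key so equal keys sit next to each other.
sorted_pairs = sorted(shuffled, key=itemgetter(0))

# Reduce: the values of each key are combined into one output record.
reduced = {
    key: sum(value for _, value in group)
    for key, group in groupby(sorted_pairs, key=itemgetter(0))
}
print(reduced)  # {'error': 3, 'info': 2, 'warn': 1}
```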

92. Why does Hadoop use the Context object?

Ans:

The Hadoop framework uses the Context object with the Mapper class to interact with the rest of the system. The Context object receives the system configuration details and the job in its constructor. We use the Context object to pass information in the setup(), cleanup(), and map() methods. This object makes vital information available during the map operations.

93. Define Combiner in Hadoop?

Ans:

It is an optional step between Map and Reduce. The Combiner takes the output of the Map function, creates key-value pairs, and submits them to the Hadoop Reducer. Its task is to summarize the Map output into summary records with an identical key.
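
A minimal sketch of that local summarization, using hypothetical function names and a sample line: the Map step emits one pair per word, and the Combiner merges pairs with an identical key before anything would be sent to the Reducer.

```python
from collections import Counter


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner step: sum the values per key locally, on the mapper side."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


map_output = map_words("to be or not to be")
combined = combine(map_output)
print(len(map_output), len(combined))  # 6 4
print(combined)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

Here six Map records shrink to four, which is the saving a Combiner aims for: fewer records to transfer to the Reducer.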

94. What is the default replication factor in HDFS, and what does it indicate?

Ans:

The default replication factor in HDFS is three. It indicates that there will be three replicas of each block of data.

95. What do you mean by Data Locality in Hadoop?

Ans:

In a Big Data system, the size of data is huge, and that is why it does not make sense to move data across the network. Instead, Hadoop tries to move the computation closer to the data. This way, the data remains local to the location where it is stored.

96. Define Balancer in HDFS?

Ans:

In HDFS, the balancer is an administrative tool used by admin staff to rebalance data across DataNodes. It moves blocks from overutilized to underutilized nodes.
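
The rebalancing idea can be shown with a simplified simulation. This is not the actual balancer algorithm: the function name, the fixed block size, the equal node capacities, and the `threshold_pct` parameter are all assumptions made for illustration. The sketch keeps moving one block from the fullest node to the emptiest node until every node is within the threshold of the average utilization.

```python
def rebalance(used_gb, capacity_gb, threshold_pct=10.0, block_gb=1):
    """Simulate moving blocks until each node is within threshold_pct of the average.

    used_gb maps a node name to its used space; every node is assumed to have
    the same capacity_gb. Returns the new usage and the number of block moves.
    """
    used = dict(used_gb)
    average = 100.0 * sum(used.values()) / (capacity_gb * len(used))
    moves = 0
    while True:
        utilization = {node: 100.0 * gb / capacity_gb for node, gb in used.items()}
        source = max(utilization, key=utilization.get)  # most utilized node
        target = min(utilization, key=utilization.get)  # least utilized node
        balanced = (utilization[source] - average <= threshold_pct
                    and average - utilization[target] <= threshold_pct)
        # Stop when balanced, or when one more move could no longer narrow the gap.
        if balanced or used[source] - used[target] <= block_gb:
            return used, moves
        used[source] -= block_gb
        used[target] += block_gb
        moves += 1


result, moves = rebalance({"dn1": 90, "dn2": 50, "dn3": 10}, capacity_gb=100)
print(result, moves)  # {'dn1': 60, 'dn2': 50, 'dn3': 40} 30
```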

97. Explain Safe mode in HDFS?

Ans:

It is a read-only mode of the NameNode in a cluster. Initially, the NameNode starts in Safemode, and while it is in Safemode it prevents writing to the file system. During this time, it collects data and statistics from all the DataNodes.

98. What is Metastore in Hive?

Ans:

It stores schema as well as the Hive table location.

Hive table definitions, mappings, and metadata are stored in the Metastore. The Metastore can be kept in any RDBMS supported by JPOX.

99. List components available in Hive data model?

Ans:

There are the following components in the Hive data model:

  • Tables
  • Partitions
  • Buckets
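
How the three components narrow down where a row lives can be sketched as follows. The function name, table, and values are hypothetical. The `column=value` directory form follows Hive's partition naming, while the modulo on an integer key is a simplification of hash-based bucketing used here only for illustration.

```python
def locate_row(table, partition_column, partition_value, bucket_key, num_buckets):
    """Return the partition directory and bucket index a row would fall into.

    Assumes an integer bucket key and a plain modulo in place of a real hash.
    """
    partition_dir = f"{table}/{partition_column}={partition_value}"
    bucket_index = bucket_key % num_buckets
    return partition_dir, bucket_index


print(locate_row("sales", "country", "IN", 1234, 4))  # ('sales/country=IN', 2)
```

The table picks the top-level directory, the partition picks a subdirectory by column value, and the bucket picks one of a fixed number of files inside that partition.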

100. How can you increase the business revenue by analyzing Big Data?

Ans:

Big data analysis is a vital part of business, since it helps companies differentiate themselves from one another and increase their revenue. Big data analytics offers customized suggestions and recommendations to businesses through predictive analysis. It also helps businesses launch new products based on the preferences and needs of their customers. This helps businesses earn significantly more, approximately 5-20% more. Companies like Bank of America, LinkedIn, Twitter, Walmart, Facebook, etc. use Big Data Analysis to increase their revenue.
