Apache Spark Interview Questions and Answers
Last updated on 25th Sep 2020, Blog, Interview Question
1. List the languages supported by Apache Spark for developing any big data applications.
Apache Spark supports Java, Scala, Python, and R for developing big data applications.
2. Is there an option for a user to use Spark to access and investigate any external data stored in Cassandra databases?
Yes, it is possible to use Spark Cassandra Connector to analyze and access external data stored in Cassandra databases.
3. Can a user use Apache Spark on Apache Mesos?
Apache Spark can be executed on the hardware clusters managed by Apache Mesos. This is also one of the features which make the Apache Spark quite popular.
4. Mention something about all the different cluster managers available in Apache Spark.
The three cluster managers supported in Apache Spark are:
1. YARN – It is mainly responsible for resource management in Hadoop.
2. Standalone deployments – A simple cluster manager included with Spark. It is well suited to new deployments and is very easy to set up.
3. Apache Mesos – This has rich resource scheduling capabilities and is well suited to running Spark alongside other applications. It is especially advantageous when many users run interactive shells, mainly because it scales down the CPU allocation between commands. It is a general-purpose cluster manager that can also run Hadoop MapReduce and other applications.
5. Can Spark be connected to Apache Mesos by a user?
Yes, a user can connect Spark to Apache Mesos in either of two ways:
- 1. Configure the Spark driver program to connect to Mesos, and place the Spark binary package in a location accessible by Mesos.
- 2. Alternatively, install Apache Spark in the same location on the Mesos nodes and configure the property ‘spark.mesos.executor.home’ to point to the location where it has been installed.
6. Can the user minimize data transfers when working with Spark? If yes, how?
Yes. Minimizing data transfers and avoiding shuffles helps the user write Spark programs that run fast and reliably. The ways in which data transfers can be reduced when working with Apache Spark are:
- Using Broadcast Variables – Broadcast variables are designed to enhance the efficiency of joins between small and large RDDs.
- Using Accumulators – Accumulators help the user update the values of variables in parallel while the program is executing.
7. Does a user need broadcast variables while working with Apache Spark? If yes, why?
Broadcast variables are read-only variables that are cached in memory on every machine. Using them eliminates the need to send a copy of a variable with every task, so data can be processed faster. Broadcast variables also make it possible to keep a lookup table in memory, which gives more efficient retrieval than an RDD lookup().
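The idea of an in-memory lookup table shared by every task can be sketched in plain Python. This is only a model of the concept, not the Spark API: the names `country_names`, `orders_partitions`, and `join_partition` are hypothetical, and the nested lists merely stand in for partitions that would live on different machines.

```python
# Plain-Python model of joining a large dataset against a small,
# read-only lookup table (the role a broadcast variable plays).
# All names and sample data here are hypothetical.

# The small table that would be broadcast once to every machine.
country_names = {"IN": "India", "US": "United States", "DE": "Germany"}

# The large dataset, already split into partitions.
orders_partitions = [
    [("o1", "IN"), ("o2", "US")],
    [("o3", "DE"), ("o4", "IN")],
]

def join_partition(partition, lookup):
    """Resolve each order's country code using the local lookup table."""
    return [(order_id, lookup[code]) for order_id, code in partition]

# Every partition reads the same lookup table, so the large dataset
# never has to be regrouped or moved to perform the join.
joined = [
    row
    for partition in orders_partitions
    for row in join_partition(partition, country_names)
]
print(joined)
```

Because the lookup table is small and read-only, each partition can be processed independently, which is why this pattern avoids a shuffle of the large dataset.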
8. Can a user run Spark and Mesos alongside Hadoop?
Yes, it is possible to run Spark and Mesos with Hadoop by launching each of them as a separate service on the machines. Apache Mesos acts as a unified scheduler that assigns tasks to either Spark or Hadoop.
9. Do you know anything about the lineage graph?
Every RDD in Spark depends on one or more other RDDs. The representation of these dependencies between RDDs is called the lineage graph. Spark uses the information in the lineage graph to compute each RDD on demand, so that whenever part of a persistent RDD is lost, the lost data can be recomputed from its parents.
10. Can a user trigger automatic clean-ups in Spark in order to handle accumulated metadata?
Yes, the user can trigger automatic clean-ups by setting the parameter ‘spark.cleaner.ttl’. Alternatively, the user can achieve the same by dividing the long-running jobs into different batches and writing all intermediary results to the disk.
11. What do you know about the major libraries that constitute the Spark Ecosystem?
The following are the major libraries that make up a bulk of the Spark Ecosystem:
- Spark Streaming – This library is generally used to process real-time streaming data.
- Spark MLlib – This is the machine learning library in Spark and is commonly used for learning algorithms like clustering, regression, classification, etc.
- Spark SQL – This library helps to execute SQL-like queries on Spark data using standard visualization or BI tools.
- Spark GraphX – This is the Spark API for graph parallel computations along with basic operators like joinVertices, subgraph, aggregateMessages, etc.
12. Mention the benefits of using Spark with Apache Mesos.
Spark, when used with Apache Mesos, provides scalable partitioning among various Spark instances and dynamic partitioning between Spark and other big data frameworks.
13. Mention the importance of Sliding Window operation.
In networking, a sliding window controls the transmission of data packets between computers. In Spark Streaming the term means something different: the library provides windowed computations in which transformations on RDDs are applied over a sliding window of data. Whenever the window slides, all the RDDs that fall within the window are combined and operated upon to produce the RDDs of the windowed DStream.
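A windowed computation can be sketched in plain Python as follows. This is a model of the idea rather than the Spark Streaming API: each inner list stands in for the RDD of one batch, and `windowed_sums`, `window_length`, and `slide_interval` are hypothetical names, with both sizes measured in batches.

```python
def windowed_sums(batches, window_length, slide_interval):
    """Sum every window of `window_length` batches,
    advancing `slide_interval` batches at a time."""
    results = []
    # `end` is the exclusive index of the last batch in the current window.
    for end in range(window_length, len(batches) + 1, slide_interval):
        window = batches[end - window_length:end]
        # Combine all batches in the window, then operate on them.
        results.append(sum(sum(batch) for batch in window))
    return results

# Five batches of a stream; each inner list models one batch.
batches = [[1, 2], [3], [4, 5], [6], [7, 8]]
sums = windowed_sums(batches, window_length=3, slide_interval=2)
print(sums)
```

With a window of three batches that slides by two, the third batch is counted in both windows, which is the overlap that distinguishes a sliding window from plain batching.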
14. What do you know about a DStream?
A Discretized Stream, generally known as a DStream, is a sequence of Resilient Distributed Datasets (RDDs) that represent a stream of data. DStreams can be created from various sources like Apache Kafka, HDFS, and Apache Flume.
All kinds of DStreams support two types of operations –
- Output operations that write data to an external system
- Transformations that produce a new DStream
15. While running Spark applications, is it necessary for the user to install Spark on all the nodes of a YARN cluster?
One of the most striking features of Spark is that it does not need to be installed on every node when running a job under YARN or Mesos. This is because Spark can execute on top of YARN or Mesos clusters without requiring any change to the cluster.
16. Tell us something about the Catalyst framework.
Catalyst is the optimization framework in Spark SQL. It allows Spark to automatically transform SQL queries by applying optimizations, so that queries are processed faster.
17. Can you mention the companies that use Apache Spark in their respective production?
Some of the companies that use Apache Spark in production are:
- OpenTable
18. Do you know about the Spark library that allows reliable file sharing at memory speed across different cluster frameworks?
Tachyon, since renamed Alluxio, allows reliable file sharing at memory speed across different cluster frameworks. It is a separate project that Spark can use, rather than a library bundled with Spark.
19. Do you know anything about BlinkDB? Why is it used?
BlinkDB is a query engine for executing interactive SQL queries on large volumes of data. It returns query results annotated with meaningful error bars, which helps users balance query accuracy against response time.
20. Differentiate between Hadoop and Spark with respect to ease of use.
Hadoop MapReduce requires users to program in Java, which is difficult. Pig and Hive were developed to make this considerably easier, but learning the syntax of Pig and Hive takes a lot of time. Spark has interactive APIs for different languages like Java, Python, or Scala and also includes Spark SQL. This makes it comparatively easier to use than Hadoop.
21. Mention the common mistakes that developers usually commit when running Spark applications.
The common mistakes that developers usually commit when running Spark applications:
- 1. Hitting the web service several times by using multiple clusters.
- 2. Running everything on the local node instead of distributing it.
22. Mention the advantages of a Parquet file.
Parquet is a columnar file format that helps the user to –
- 1. Consume less space
- 2. Limit I/O operations
- 3. Fetch only the required columns
23. Mention the various data sources available in SparkSQL.
The various data sources available in SparkSQL are:
- 1. Parquet file
- 2. Hive tables
- 3. JSON Datasets
24. How can a user execute Spark using Hadoop?
Spark has its own cluster management for computation and mainly uses Hadoop for storage.
25. Mention the features of Apache Spark that make it so popular.
The features of Apache Spark that make it so popular are:
- Apache Spark provides advanced analytic options like graph algorithms, machine learning, streaming data, etc.
- Apache Spark offers good performance gains, as it can run an application in a Hadoop cluster up to ten times faster on disk and up to 100 times faster in memory.
- Apache Spark has built-in APIs in multiple languages like Java, Scala, Python and R.
26. Tell us something about the Pair RDD.
Spark provides special operations on RDDs whose elements are key/value pairs. Such RDDs are termed Pair RDDs. Pair RDDs allow users to access each key in parallel. They also have a reduceByKey() method that collects data based on each key and a join() method that combines different RDDs together, based on the elements which have the same key.
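What reduceByKey() and join() compute can be illustrated with a small plain-Python model. The functions `reduce_by_key` and `join` below are hypothetical stand-ins written for this sketch, not Spark's implementation, and they run on ordinary lists on a single machine.

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func):
    """Combine all values that share a key using `func`."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}

def join(left, right):
    """Inner join: pair up values from both sides that share a key."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_by_key.get(key, [])
    ]

# Hypothetical sample data: (key, value) pairs.
sales = [("apple", 3), ("pear", 1), ("apple", 4)]
prices = [("apple", 0.5), ("kiwi", 2.0)]

totals = reduce_by_key(sales, lambda a, b: a + b)
joined = join(sales, prices)
print(totals)
print(joined)
```

Keys that appear on only one side ("pear" and "kiwi" here) drop out of the join, which matches the behavior of an inner join.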
27. Between the Hadoop MapReduce and Apache Spark, which must be used for a project?
Choosing an application or development software depends on the given project scenario. Spark uses memory instead of network and disk I/O. However, Spark uses a large amount of RAM and requires a dedicated machine to produce effective results. So the decision to use Hadoop or Spark depends heavily on the requirements of the project and the budget of the organization.
28. Mention the different types of transformations on DStreams.
The different types of transformations on DStreams are:
1. Stateful Transformations – Processing of the batch depends on the intermediary results of the previous batch.
Examples – Transformations that depend on sliding windows.
2. Stateless Transformations – Processing of the batch does not depend on the output of the previous batch.
Examples – map(), reduceByKey(), filter().
29. Explain the popular use cases of Apache Spark.
Apache Spark is mainly used for:
- 1. Iterative machine learning.
- 2. Interactive data analytics and processing.
- 3. Stream processing
- 4. Sensor data processing
30. Can Apache Spark be used for Reinforcement learning?
Apache Spark is not well suited to reinforcement learning. It works well for simpler machine learning algorithms like clustering, regression, and classification.
31. What do you know about the Spark Core?
Spark Core is the base engine of Spark. It has all the basic functionalities of Spark, such as interacting with storage systems, memory management, fault recovery, scheduling tasks, etc.
32. Can the user remove the elements with a key present in another RDD?
The user can remove the elements with a key present in another RDD by using the subtractByKey() function.
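The effect of this operation can be shown with a short plain-Python model. The function `subtract_by_key` and the sample data are hypothetical and only illustrate the semantics on ordinary lists; they are not Spark code.

```python
def subtract_by_key(pairs, other):
    """Keep the pairs whose key does not appear in `other`."""
    other_keys = {key for key, _ in other}
    return [(key, value) for key, value in pairs if key not in other_keys]

# Hypothetical sample data: (key, value) pairs.
users = [("u1", "Asha"), ("u2", "Ben"), ("u3", "Chen")]
blocked = [("u2", "spam")]

remaining = subtract_by_key(users, blocked)
print(remaining)
```

Only the keys of the second dataset matter; its values are ignored.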
33. Differentiate between persist() and cache() methods.
The persist() method allows the user to specify the storage level, whereas the cache() method uses the default storage level.
34. Tell us about the various levels of persistence in Apache Spark.
Apache Spark automatically persists the intermediary data from a number of shuffle operations. However, it is often recommended that users call the persist() method on an RDD that will be reused. Spark has various persistence levels to store RDDs on disk, in memory, or as a combination of both, with different replication levels.
The various storage/persistence levels in Spark are –
- 1. OFF_HEAP
- 2. MEMORY_ONLY
- 3. MEMORY_AND_DISK
- 4. MEMORY_ONLY_SER
- 5. MEMORY_AND_DISK_SER
- 6. DISK_ONLY
35. How has Spark been designed to handle monitoring and logging in Standalone mode?
In standalone mode, Spark provides a web-based user interface for monitoring the cluster that shows both cluster and job statistics. The log output for each job is written to the working directory of the slave nodes.
36. Can the Apache Spark provide checkpointing to the user?
Apache Spark uses lineage graphs to recover RDDs after a failure. However, this is time-consuming if the RDDs have long lineage chains. Spark provides an API for checkpointing, i.e. a REPLICATE flag to persist. The user decides which data to checkpoint. Checkpoints are useful when the lineage graphs are long and have wide dependencies.
37. How can the user launch Spark jobs within Hadoop MapReduce?
By using the SIMR (Spark in MapReduce) users are able to execute any Spark job inside MapReduce without using any admin rights.
38. How is Spark able to use Akka?
Spark has been designed to use Akka for scheduling. All the workers request a task from the master after registering themselves, and the master simply assigns the task. Here, Spark uses Akka for messaging between the workers and masters.
39. How is the user able to achieve high availability in Apache Spark?
The user is able to achieve high availability in Apache Spark by applying the given methods:
- By implementing single-node recovery with the local file system.
- By using standby masters with Apache ZooKeeper.
40. How does Apache Spark achieve fault tolerance?
The data storage model in Apache Spark is based on RDDs. The RDDs help achieve fault tolerance through lineage graphs. The RDD has been designed to always store information on how to build from other datasets. If any partition of an RDD is lost due to failure, the lineage helps to build only that particular lost partition.
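Rebuilding a single lost partition from lineage can be modeled in a few lines of plain Python. The `Dataset` class below is a hypothetical toy written for this sketch, not Spark's RDD: it records only a parent and the function used to derive each element, which is enough to show the recovery step.

```python
class Dataset:
    """Toy dataset that remembers how it was derived from its parent."""

    def __init__(self, partitions, parent=None, func=None):
        self.partitions = partitions  # list of lists; None marks a lost partition
        self.parent = parent          # lineage: the dataset this one came from
        self.func = func              # lineage: how each element was derived

    def map(self, func):
        """Derive a child dataset and record the lineage needed to rebuild it."""
        derived = [[func(x) for x in part] for part in self.partitions]
        return Dataset(derived, parent=self, func=func)

    def recover(self, index):
        """Rebuild only partition `index` by replaying `func` on the parent."""
        source = self.parent.partitions[index]
        self.partitions[index] = [self.func(x) for x in source]
        return self.partitions[index]

base = Dataset([[1, 2], [3, 4], [5, 6]])
squared = base.map(lambda x: x * x)

squared.partitions[1] = None      # simulate losing one partition
recovered = squared.recover(1)    # recompute just that partition
print(recovered)
```

Only the lost partition is recomputed; the other two partitions of `squared` are left untouched.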
41. Explain the core components of a distributed Spark application
The core components of any distributed Spark application are as follows:
- 1. Executor – It consists of the worker processes that run the individual tasks of a Spark job.
- 2. Driver- This consists of the process that runs the main() method of the program to create RDDs and perform transformations and actions on them.
- 3. Cluster Manager- This is a pluggable component in Spark which is used to launch Executors and Drivers. The cluster manager allows Spark to run with external managers like Apache Mesos or YARN in the background.
42. Do you know anything about Lazy Evaluation?
When Spark is instructed to operate on a given dataset, it records the instructions but does nothing with them until the user asks for the final result.
When a transformation like the method map() is called on an RDD, Spark does not perform the operation immediately. All transformations in Spark are not evaluated until the user has to perform an action. This helps to optimize the overall data processing workflow.
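Python's built-in map() is itself lazy, so it makes a convenient stand-in for showing the behavior. The sketch below is plain Python, not Spark; `traced_double` and `log` are hypothetical names used only to record when the function actually runs.

```python
log = []

def traced_double(x):
    """Double x and record that the function really ran."""
    log.append(x)
    return x * 2

numbers = [1, 2, 3]

# "Transformation": builds a lazy map object, but runs nothing yet.
doubled = map(traced_double, numbers)
calls_before_action = len(log)

# "Action": asking for the results forces the evaluation.
result = list(doubled)
calls_after_action = len(log)

print(calls_before_action, calls_after_action, result)
```

No call to `traced_double` happens until the results are requested, which mirrors how a Spark transformation waits for an action.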
43. What do you know about a worker node?
A worker node is a node that can run the Spark application code in a cluster. A worker node can have more than one worker process, which is configured by setting the SPARK_WORKER_INSTANCES property in the spark-env.sh file. Only one worker process is started if the SPARK_WORKER_INSTANCES property is not defined.
44. Tell us something about SchemaRDD.
An RDD that comprises row objects (wrappers around the basic string or integer arrays) with schema information about the type of data in each column is known as a SchemaRDD.
45. Mention the disadvantages of using Apache Spark over Hadoop MapReduce.
Apache Spark does not perform very well for compute-intensive jobs and consumes a large amount of system resources. Its in-memory capability can be a major barrier to cost-efficient processing of big data. Spark also does not have its own file management system, so it needs to be integrated with a cloud-based data platform or with Apache Hadoop.
46. Does the user need to install Spark on all the nodes of a YARN cluster while running Apache Spark on YARN?
No, it is not necessary for the user to install Spark on all the nodes of a YARN cluster while running Apache Spark on YARN because Apache Spark runs on top of YARN.
47. Do you know anything about the Executor Memory in a Spark application?
Every Spark application has the same fixed heap size and a fixed number of cores for each of its executors. The heap size is known as the Spark executor memory, which is controlled with the spark.executor.memory property or the --executor-memory flag. Every Spark application has one executor on each worker node. The executor memory is a measure of how much memory of the worker node the application uses.
48. What has the Spark Engine been designed to accomplish?
The Spark engine has been designed to schedule, distribute, and monitor data applications across the Spark cluster.
49. Apache Spark is good at low-latency workloads like graph processing and machine learning. Elaborate on the reasons behind this.
Apache Spark stores data in memory for faster model building and training. Machine learning algorithms require multiple iterations to produce an optimal model, and graph algorithms similarly traverse all the nodes and edges. Workloads that need multiple iterations over the same data gain performance when that data stays in memory. Less disk access and controlled network traffic make a significant difference when there is a lot of data to be processed.
50. Does the user have to start Hadoop to run any Apache Spark Application?
No, starting Hadoop is not mandatory for running a Spark application. Spark has no storage layer of its own, so it commonly uses Hadoop HDFS, but data can also be stored in a local file system, loaded from there, and processed.
51. Mention the default level of parallelism in Apache Spark.
If the user does not explicitly state the level of parallelism, then the number of partitions is considered the default level of parallelism in Apache Spark.
52. Do you know anything about the common workflow of a Spark program?
Yes, the following is the common workflow of the Spark Program:
- The first step in a Spark program is the creation of the input RDDs from external data.
- Next, RDD transformations like filter() are used to create new transformed RDDs based on the business logic.
- The persist() method is used for any intermediate RDDs which might have to be reused in the future.
- Finally, RDD actions like first() and count() are launched to begin parallel computation, which Spark then optimizes and executes.
53. In a Spark program, how can the user identify whether a given operation is a Transformation or Action?
One can identify the operation based on the return type –
- The operation is an Action if the return type is anything other than RDD.
- The operation is a Transformation if the return type is itself an RDD.
54. What is a common mistake any Apache Spark developer usually makes while working with Spark?
Some of the common mistakes that Apache Spark developers make while working with Spark are:
- 1. Not maintaining the required size of shuffle blocks.
- 2. Mismanaging directed acyclic graphs (DAGs).
55. Differentiate between Spark SQL and Hive.
The following are the differences between Spark SQL and Hive:
- 1. Any Hive query can easily be executed in Spark SQL but vice-versa is not true.
- 2. Spark SQL is faster than Hive.
- 3. It is not compulsory to create a metastore in Spark SQL but it is compulsory to create a Hive metastore.
- 4. Spark SQL is a library while Hive is a framework.
- 5. Spark SQL automatically deduces the schema while in Hive, the schema needs to be explicitly declared.
56. Mention the sources from where Spark streaming component can process real-time data.
The Spark Streaming component commonly processes real-time data from Apache Flume, Apache Kafka, and Amazon Kinesis.
57. What are the companies that are currently using Spark Streaming?
Uber, Netflix, Pinterest are some of the companies that are currently making use of Spark Streaming.
58. What is the bottom layer of abstraction in the Spark Streaming API?
DStream is the bottom layer of abstraction in the Spark Streaming API.
59. Do you know anything about receivers in Spark Streaming?
Receivers are special entities in Spark Streaming that consume data from various data sources and move it into Apache Spark. Receivers are usually created by streaming contexts as long-running tasks on various executors and are scheduled in a round-robin manner, with each receiver taking a single core.
60. How is the user supposed to calculate the number of executors required to do real-time processing using Apache Spark? What factors need to be considered for deciding on the number of nodes for real-time processing?
The number of nodes can be calculated by benchmarking the hardware. While doing so, one must also consider factors such as optimal throughput (network speed), memory usage, the execution framework being used (YARN, Standalone, or Mesos), and the other jobs that run within that framework along with Spark.
61. Differentiate between Spark Transform in DStream and Map.
The transform() function in Spark Streaming allows developers to apply Apache Spark transformations to the underlying RDDs of the stream.
The map() function performs an element-to-element transformation and can be implemented using the transform() function. The map() method works on the elements of a DStream, while the transform() method allows developers to work with the RDDs of the DStream. In short, map() is an elementary transformation whereas transform() is an RDD transformation.
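The difference between the two levels can be seen in a plain-Python model. The `MiniDStream` class below is a hypothetical toy, not the Spark Streaming API: each inner list stands in for the RDD of one batch.

```python
class MiniDStream:
    """Toy stream: a list of batches, each batch modeling one RDD."""

    def __init__(self, batches):
        self.batches = batches

    def map(self, func):
        """Apply `func` to every element of every batch."""
        return MiniDStream([[func(x) for x in batch] for batch in self.batches])

    def transform(self, batch_func):
        """Apply `batch_func` to each whole batch."""
        return MiniDStream([batch_func(batch) for batch in self.batches])

stream = MiniDStream([[3, 1, 2], [9, 7]])

# Element-level work: map() sees one element at a time.
incremented = stream.map(lambda x: x + 1)

# Batch-level work: sorting needs the whole batch, so it uses transform().
sorted_batches = stream.transform(sorted)

# map() can be expressed through transform(), but not the other way round.
via_transform = stream.transform(lambda batch: [x + 1 for x in batch])

print(incremented.batches, sorted_batches.batches)
```

Sorting is a useful test case because it cannot be written as an element-to-element function, which is exactly the kind of operation transform() exists for.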
62. What is Apache Spark?
Apache Spark is a fast, easy-to-use, and flexible data processing framework. It has an advanced execution engine supporting cyclic data flow and in-memory computing. Spark can run on Hadoop, standalone, or in the cloud, and can access diverse data sources including HDFS, HBase, Cassandra, and others.
63. Explain Caching in Spark Streaming.
DStreams allow developers to cache/ persist the stream’s data in memory. This is useful if the data in the DStream will be computed multiple times. This can be done using the persist() method on a DStream. For input streams that receive data over the network (such as Kafka, Flume, Sockets, etc.), the default persistence level is set to replicate the data to two nodes for fault-tolerance.
64. Do you know anything about RDD?
RDD is the acronym for Resilient Distributed Datasets. It is a fault-tolerant collection of elements that can be operated on in parallel. The partitioned data in an RDD is immutable and distributed. There are primarily two types of RDD:
- Parallelized Collections: Existing collections from the driver program, distributed so that they can be operated on in parallel.
- Hadoop datasets: They perform a function on each file record in HDFS or another storage system.
65. Define Partitions.
A partition is a smaller, logical division of data, similar to a ‘split’ in MapReduce. Partitioning is the process of deriving logical units of data in order to speed up processing. Everything in Spark is a partitioned RDD.
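How data might be divided into partitions and processed piece by piece can be sketched in plain Python. The `partition` helper below is hypothetical and uses simple contiguous chunks; it is not how Spark itself chooses partition boundaries.

```python
def partition(data, num_partitions):
    """Split `data` into `num_partitions` contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        # The first `extra` chunks each take one additional element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

parts = partition(list(range(10)), 3)

# Each partition can be processed independently, then the results combined.
partial_sums = [sum(p) for p in parts]
total = sum(partial_sums)
print(parts, partial_sums, total)
```

Because each partial sum depends on only one chunk, the per-partition work could run on different machines before the small results are combined.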
66. What operations does the RDD support?
An RDD is a distributed collection of objects. Each RDD is divided into multiple partitions, and each partition can reside in memory or be stored on the disk of a different machine in the cluster. RDDs are immutable data structures, i.e., they can only be read. The user can’t change the original RDD, but can transform it into a different RDD with all the required changes.
RDDs support two types of operations:
- 1. Transformations – Transformations create a new RDD from an existing RDD like map() and reduceByKey(). Transformations are executed on demand.
- 2. Actions – Actions return final results of RDD computations. Actions trigger execution using a lineage graph to load data into the original RDD, execute all intermediate transformations and return final results to the Driver program or write results to the file system.
67. What do you understand by Transformations in Spark?
Transformations are functions applied to an RDD that result in another RDD. A transformation does not execute until an action occurs. The map() and filter() methods are examples of transformations. The map() method applies the function passed to it to each element of the RDD and results in another RDD. The filter() method creates a new RDD by selecting the elements of the current RDD that pass the function argument.
68. Elaborate on the concept of Actions.
An action in Spark brings data from the RDD back to the local machine. Executing an action triggers all the previously created transformations. The reduce() method is an action that applies the function passed to it again and again until only one value is left. The take(n) action returns the first n values from the RDD to the local node.
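The split between transformations and actions can be modeled with a small plain-Python class. `MiniRDD` is a hypothetical toy written for this sketch, not Spark's RDD, and for simplicity it evaluates eagerly; it illustrates only the return types and results, not lazy execution.

```python
from functools import reduce as fold

class MiniRDD:
    """Toy, eagerly evaluated stand-in for an RDD."""

    def __init__(self, items):
        self._items = list(items)

    # Transformations: each returns a new MiniRDD.
    def map(self, func):
        return MiniRDD(func(x) for x in self._items)

    def filter(self, predicate):
        return MiniRDD(x for x in self._items if predicate(x))

    # Actions: each returns an ordinary Python value.
    def reduce(self, func):
        return fold(func, self._items)

    def take(self, n):
        return self._items[:n]

rdd = MiniRDD([1, 2, 3, 4, 5])

evens_squared = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
total = evens_squared.reduce(lambda a, b: a + b)  # repeatedly combines until one value remains
first_two = rdd.take(2)                           # only the first n values, not all of them

print(total, first_two)
```

The return type tells the two kinds of operation apart: `evens_squared` is still a `MiniRDD`, while `total` and `first_two` are plain values.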
69. Mention the functions of SparkCore.
The SparkCore acts as the base engine and performs a number of functions such as:
- Memory management
- Job scheduling
- Monitoring jobs
- Interaction with storage systems
70. What do you know about RDD Lineage?
Spark does not support data replication in memory. If any data is lost, it is automatically rebuilt using RDD lineage. An RDD always remembers how it was built from other datasets, and this lineage is what allows lost data partitions to be reconstructed.
71. What do you know about Spark Driver?
Spark Driver is the program that runs on the master node of the cluster and declares transformations and actions on RDDs. The driver creates the SparkContext, which is connected to a given Spark Master. The driver also delivers the RDD graphs to the Spark Master, where the standalone cluster manager runs.
72. What do you know about Hive on Spark?
Hive contains significant support for Apache Spark. Hive execution is configured to use Spark through the following settings:
- hive> set spark.home=/location/to/sparkHome;
- hive> set hive.execution.engine=spark;
Hive on Spark supports Spark on YARN mode by default.
73. Name some of the commonly-used Spark Ecosystems.
These are some of the commonly-used Spark Ecosystems:
- 1. Spark SQL (Shark)- for developers.
- 2. SparkR to promote R Programming in Spark engine.
- 3. GraphX for generating and computing graphs.
- 4. MLlib (Machine Learning Algorithms).
- 5. Spark Streaming for processing live data streams.
74. Do you know anything about Spark Streaming?
Spark Streaming is an extension to the Spark API that allows processing of live data streams. Data is ingested from different sources like Flume and HDFS, processed, and finally pushed out to file systems, live dashboards, and databases. This is similar to batch processing, as the input stream is divided into small batches.
75. Tell us something about GraphX.
Spark uses the tool, GraphX for graph processing and to build and transform interactive graphs. The GraphX component enables programmers to study structured data at scale.
76. Why is the MLlib required?
MLlib is the scalable machine learning library provided within Spark. It makes machine learning easy and scalable with common learning algorithms and use cases like clustering, regression, filtering, dimensionality reduction, and the like.
77. What do you know about Spark SQL?
Spark SQL, the successor to Shark, is a module introduced in Spark to work with structured data and perform structured data processing. Through it, Spark executes relational SQL queries on the data. Its core supports a special kind of RDD called the SchemaRDD. A SchemaRDD is composed of row objects and schema objects defining the data type of each column in the row, and is similar to a table in a relational database.
78. Tell us something about a Parquet file.
Parquet is a columnar file format supported by many other data processing systems. Spark SQL performs both read and write operations on Parquet files.
79. What are the file systems supported by Spark?
The following are the file systems supported by Spark:
- 1. Hadoop Distributed File System (HDFS).
- 2. Local File system.
- 3. S3
80. Elaborate on YARN.
As in Hadoop, YARN is one of the key features in Spark. It provides a central resource management platform to deliver scalable operations across the cluster. Running Spark on YARN requires a binary distribution of Spark that is built with YARN support.
81. List the functions of Spark SQL.
The Spark SQL is capable of accomplishing the following functions:
- Loading data from a variety of structured sources.
- Providing integration between SQL and regular Python/Java/Scala code, along with the ability to join RDDs and SQL tables, and expose custom functions in SQL.
- Querying data using SQL statements, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC).
82. Mention the benefits of using Spark as compared to MapReduce.
Spark has a number of advantages as compared to MapReduce:
- 1. Spark implements the processing around 10-100x faster than Hadoop MapReduce due to the availability of in-memory processing. MapReduce makes use of persistent storage for any of the data processing tasks.
- 2. Spark provides in-built libraries to perform multiple tasks from the same core, such as batch processing, streaming, machine learning, and interactive SQL queries. Hadoop MapReduce, however, only supports batch processing.
- 3. Hadoop MapReduce is highly disk-dependent while Spark promotes caching and in-memory data storage.
- 4. Spark is capable of performing iterative computation while there is no iterative computing implemented by Hadoop.
83. Is there any benefit of learning Hadoop MapReduce?
Yes, a user should learn Hadoop MapReduce. It is a paradigm used by many big data tools, including Spark, and becomes extremely relevant as data grows bigger and bigger. Tools like Pig and Hive convert their queries into MapReduce phases to optimize them better.
84. What do you know about the Spark Executor?
When a user connects the SparkContext to a cluster manager, it acquires executors on nodes in the cluster. Executors are Spark processes that run computations and store data on the worker nodes. The final tasks from the SparkContext are transferred to the executors for execution.
85. What do you understand by Lazy Evaluation?
Spark is intelligent in the manner in which it operates on data. When you tell Spark to operate on a given dataset, it records the instructions but does nothing unless asked for the final result. When a transformation like map() is called on an RDD, the operation is not performed immediately. Transformations in Spark are not evaluated until you perform an action. This helps optimize the overall data processing workflow.
86. Elaborate on the worker node.
The Worker node refers to any node that can run the application code in a cluster.
87. Do you know anything about the PageRank?
PageRank measures the importance of each vertex in a graph. It is one of the algorithms provided by GraphX, Spark's graph processing library.
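As a rough illustration of how such a score can be computed, here is a minimal plain-Python sketch of the textbook power-iteration form of PageRank. It does not use GraphX and is not meant to mirror its implementation. The damping factor of 0.85, the fixed iteration count, and the tiny example graph are all assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration over a list of (src, dst) edges."""
    vertices = sorted({v for edge in edges for v in edge})
    n = len(vertices)
    out_links = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)

    ranks = {v: 1.0 / n for v in vertices}  # start from a uniform distribution
    for _ in range(iterations):
        # Rank held by vertices without outgoing edges is shared by all vertices.
        dangling = sum(ranks[v] for v in vertices if not out_links[v])
        new_ranks = {
            v: (1.0 - damping) / n + damping * dangling / n for v in vertices
        }
        # Each vertex passes its rank in equal shares along its outgoing edges.
        for src in vertices:
            targets = out_links[src]
            for dst in targets:
                new_ranks[dst] += damping * ranks[src] / len(targets)
        ranks = new_ranks
    return ranks


edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
ranks = pagerank(edges)
for vertex in sorted(ranks):
    print(vertex, round(ranks[vertex], 3))
```

In this example graph, vertex `c` receives links from both `a` and `b`, so it ends up with a higher score than `b`, which is linked only from `a`.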
88. Does the user need to install Spark on all nodes of a YARN cluster while running Spark on YARN?
No, this is not required, because Spark runs on top of YARN.
89. Tell us about some of the demerits of using Spark.
Spark utilizes more storage space than Hadoop MapReduce, which can cause problems in some cases. Developers need to be careful when running their applications in Spark and should make sure that the work is evenly distributed across the nodes of the cluster.
90. How can a user create RDD?
Spark provides two methods to create RDD:
- By parallelizing a collection in your Driver program. This makes use of SparkContext’s ‘parallelize’ method:
val IntellipaatData = Array(2,4,6,8,10)
val distIntellipaatData = sc.parallelize(IntellipaatData)
- By loading an external dataset from external storage like HDFS or a shared file system.
91. Elaborate on the concept of Pair RDD.
Apache Spark defines the PairRDDFunctions class as:
class PairRDDFunctions[K, V] extends Logging with HadoopMapReduceUtil with Serializable
A number of special operations can be performed on RDDs of key/value pairs, and such RDDs are called Pair RDDs. Pair RDDs let users operate on each key in parallel. They have a reduceByKey() method that aggregates the values for each key, and a join() method that combines different RDDs based on the elements having the same key.
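To make the two operations concrete, here is a small plain-Python sketch of what they compute. It is not Spark code and runs on ordinary lists: `reduce_by_key` and `join` are hypothetical helpers written for this example, and the sample data is made up.

```python
from collections import defaultdict


def reduce_by_key(pairs, fn):
    """Combine all values that share a key using fn, like reduceByKey()."""
    acc = {}
    for key, value in pairs:
        acc[key] = fn(acc[key], value) if key in acc else value
    return sorted(acc.items())


def join(left, right):
    """Inner join of two key/value lists on their keys, like join()."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_by_key.get(key, [])
    ]


sales = [("apple", 3), ("pear", 1), ("apple", 2)]
prices = [("apple", 0.5), ("pear", 0.75)]

totals = reduce_by_key(sales, lambda a, b: a + b)
joined = join(totals, prices)

print(totals)
print(joined)
```

In Spark the same logic runs per partition and across nodes. The sketch only shows the result each operation is expected to produce.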
92.Which is the most popular language supported by Apache Spark?
Scala and Python have interactive shells for Spark. The Scala shell can be accessed through ./bin/spark-shell and the Python shell through ./bin/pyspark. Scala is the most used language among them because Spark is written in Scala.
93. Explain the concept of Resilient Distributed Dataset (RDD).
RDD stands for Resilient Distributed Dataset. It is a fault-tolerant collection of elements that can be operated on in parallel. There are two types of RDD:
- Parallelized Collections: RDDs created from an existing collection in the driver program, so that its elements can be processed in parallel.
- Hadoop Datasets: RDDs that perform functions on each file record in HDFS or other storage systems.
RDDs are partitions of data held in memory and distributed across many nodes. They are lazily evaluated, which helps Spark operate faster.
94. Compare Hadoop and Spark.
We will compare Hadoop MapReduce and Spark based on the following aspects:
|Feature Criteria|Apache Spark|Hadoop|
|---|---|---|
|Speed|Up to 100 times faster than Hadoop MapReduce|Decent speed|
|Processing|Real-time & batch processing|Batch processing only|
|Difficulty|Easy because of high-level modules|Tough to learn|
|Recovery|Allows recovery of partitions|Fault-tolerant|
|Interactivity|Has interactive modes|No interactive mode except Pig & Hive|
95. What do you know about Executor Memory in any Spark application?
Within a Spark application, every executor has the same fixed heap size and the same fixed number of cores. This heap size is what is referred to as the Spark executor memory, and it is controlled with the spark.executor.memory property or the --executor-memory flag. An application typically has an executor on each worker node it uses. The executor memory is therefore a measure of how much of a worker node's memory the application will utilize.
96. When running Spark applications, is it necessary to install Spark on all the nodes of the YARN cluster?
Spark need not be installed on every node when running a job under YARN or Mesos, because Spark can execute on top of YARN or Mesos clusters without requiring any change to the cluster.
97. How can Streaming be implemented in Spark?
The Spark Streaming component of the Spark Ecosystem is used for processing real-time streaming data. It enables a high-throughput and fault-tolerant stream processing of live data streams.
The basic stream unit is the DStream, which is a series of RDDs (Resilient Distributed Datasets) used to process the real-time data. Data is obtained from sources like Flume and HDFS, processed as a stream, and pushed out to file systems, live dashboards, and databases. The processing works much like batch processing, because the input stream is divided into small batches.
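The batching idea can be sketched in plain Python. This is a toy model and not the Spark Streaming API: `micro_batches` and `word_count` are hypothetical helpers, and the timestamps, the one-second batch interval, and the sample lines are assumptions made for illustration.

```python
def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length time windows.

    Windows that receive no events are skipped in this simplified model.
    """
    batches = {}
    for timestamp, value in events:
        index = int(timestamp // batch_interval)
        batches.setdefault(index, []).append(value)
    return [batches[i] for i in sorted(batches)]


def word_count(batch):
    """Ordinary batch logic, applied unchanged to each small batch."""
    counts = {}
    for line in batch:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


events = [
    (0.2, "spark streaming"),
    (0.9, "spark"),
    (1.4, "hdfs flume"),
    (2.1, "spark"),
]

results = [word_count(batch) for batch in micro_batches(events, batch_interval=1.0)]
for counts in results:
    print(counts)
```

Each small batch is processed with the same function a batch job would use, which is why the streaming model resembles batch processing.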
98. Mention the functions provided within the SparkCore.
Spark Core is the base engine for large-scale parallel and distributed data processing. The core is the distributed execution engine and the Java, Scala, and Python APIs offer a platform for distributed ETL application development.
SparkCore performs various important functions like:
- 1. Memory management
- 2. Monitoring jobs
- 3. Fault-tolerance
- 4. Job scheduling
- 5. Interaction with storage systems
There are also additional libraries built atop the core that support diverse workloads such as streaming, SQL, and machine learning. These libraries rely on Spark Core for:
- 1. Memory management and fault recovery
- 2. Scheduling, distributing and monitoring jobs on a cluster
- 3. Interacting with storage systems