PySpark Interview Questions and Answers

Last updated on 25th Sep 2020

About author

Gupta (Senior Data Engineer )

He works in the corresponding technical domain with 9+ years of experience. He has also been a technology writer for the past 5 years and shares these informative blogs with us.


PySpark is the Python API for Apache Spark, one of the most popular distributed, general-purpose cluster-computing frameworks. The open-source tool offers an interface for programming an entire computer cluster with implicit data parallelism and fault-tolerance features.

Here we have compiled a list of the top PySpark interview questions. These will help you gauge your Apache Spark preparation for cracking that upcoming interview. Do you think you can get the answers right? Well, you’ll only know once you’ve gone through it!

1. Please explain the sparse vector in Spark.

Ans:

A sparse vector is used for storing non-zero entries for saving space. It has two parallel arrays:

  1. One for indices
  2. The other for values

An example of a sparse vector is as follows:

  • Vectors.sparse(7, Array(0,1,2,3,4,5,6), Array(1650d, 50000d, 800d, 3.0, 3.0, 2009, 95054))
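The same sparse vector can be built in PySpark. A minimal sketch using pyspark.ml.linalg; the indices and values simply mirror the example above:

  from pyspark.ml.linalg import Vectors

  # Sparse vector of size 7: only the listed indices hold values
  sv = Vectors.sparse(7, [0, 1, 2, 3, 4, 5, 6],
                      [1650.0, 50000.0, 800.0, 3.0, 3.0, 2009.0, 95054.0])
  print(sv)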

2. How will you connect Apache Spark with Apache Mesos?

Ans:

Step by step procedure for connecting Apache Spark with Apache Mesos is:

  1. Configure the Spark driver program to connect with Apache Mesos
  2. Put the Spark binary package in a location accessible by Mesos
  3. Install Apache Spark in the same location as that of the Apache Mesos
  4. Configure the spark.mesos.executor.home property for pointing to the location where Apache Spark is installed
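As a rough sketch, the same connection can be expressed in PySpark configuration. The Mesos master URL and install path below are placeholders, not values from this article:

  from pyspark.sql import SparkSession

  spark = (SparkSession.builder
           .master("mesos://mesos-master.example.com:5050")      # assumed Mesos master URL
           .config("spark.mesos.executor.home", "/opt/spark")    # assumed Spark install location
           .appName("mesos-connection-sketch")
           .getOrCreate())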

3. Can you explain how to minimize data transfers while working with Spark?

Ans:

Minimizing data transfers as well as avoiding shuffling helps in writing Spark programs capable of running reliably and fast. Several ways for minimizing data transfers while working with Apache Spark are:

  • Avoiding – ByKey operations, repartition, and other operations responsible for triggering shuffles
  • Using Accumulators – Accumulators provide a way for updating the values of variables while executing the same in parallel
  • Using Broadcast Variables – A broadcast variable helps in enhancing the efficiency of joins between small and large RDDs

4. What are broadcast variables in Apache Spark? Why do we need them?

Ans:

Rather than shipping a copy of a variable with tasks, a broadcast variable helps in keeping a read-only cached version of the variable on each machine.

Broadcast variables are also used to provide every node with a copy of a large input dataset. Apache Spark tries to distribute broadcast variables by using effectual broadcast algorithms for reducing communication costs.

Using broadcast variables eradicates the need of shipping copies of a variable for each task. Hence, data can be processed quickly. Compared to an RDD lookup(), broadcast variables assist in storing a lookup table inside the memory that enhances retrieval efficiency.
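A minimal PySpark sketch of a broadcast variable used as an in-memory lookup table (the sample dictionary is purely illustrative):

  from pyspark.sql import SparkSession

  sc = SparkSession.builder.appName("broadcast-demo").getOrCreate().sparkContext

  # The lookup table is shipped once per executor rather than once per task
  country_names = sc.broadcast({"US": "United States", "IN": "India"})

  codes = sc.parallelize(["US", "IN", "US"])
  print(codes.map(lambda c: country_names.value.get(c, "Unknown")).collect())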

5. Please provide an explanation on DStream in Spark.

Ans:

DStream is a contraction for Discretized Stream. It is the basic abstraction offered by Spark Streaming and is a continuous stream of data. DStream is received from either a processed data stream generated by transforming the input stream or directly from a data source.

A DStream is represented by a continuous series of RDDs, where each RDD contains data from a certain interval. An operation applied to a DStream is analogous to applying the same operation on the underlying RDDs. A DStream has two operations:

  1. Output operations responsible for writing data to an external system
  2. Transformations resulting in the production of a new DStream

It is possible to create DStream from various sources, including Apache Kafka, Apache Flume, and HDFS. Also, Spark Streaming provides support for several DStream transformations.
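A minimal PySpark Streaming sketch that builds a DStream from a socket source; the host, port, and 5-second batch interval are assumptions for illustration:

  from pyspark import SparkContext
  from pyspark.streaming import StreamingContext

  sc = SparkContext("local[2]", "dstream-demo")
  ssc = StreamingContext(sc, batchDuration=5)            # 5-second micro-batches

  lines = ssc.socketTextStream("localhost", 9999)        # input DStream (assumed source)
  counts = (lines.flatMap(lambda line: line.split())
                 .map(lambda w: (w, 1))
                 .reduceByKey(lambda a, b: a + b))       # transformation -> new DStream
  counts.pprint()                                        # output operation

  ssc.start()
  ssc.awaitTermination()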

6. Does Apache Spark provide checkpoints?

Ans:

Yes, Apache Spark provides checkpoints. They allow a program to run around the clock and make it resilient towards failures not related to application logic. Lineage graphs are used for recovering RDDs from a failure.

Apache Spark comes with an API for adding and managing checkpoints. The user then decides which data to checkpoint. Checkpoints are preferred over lineage graphs when the latter are long and have wide dependencies.
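A minimal sketch of RDD checkpointing in PySpark; the checkpoint directory is an assumption (in production it would typically be a fault-tolerant store such as HDFS):

  from pyspark.sql import SparkSession

  sc = SparkSession.builder.appName("checkpoint-demo").getOrCreate().sparkContext

  sc.setCheckpointDir("/tmp/spark-checkpoints")    # assumed directory
  rdd = sc.parallelize(range(1000)).map(lambda x: x * 2)
  rdd.checkpoint()                                 # mark the RDD for checkpointing
  rdd.count()                                      # the action triggers the actual checkpoint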

7. What are the different levels of persistence in Spark?

Ans:

Although the intermediary data from different shuffle operations automatically persists in Spark, it is recommended to use the persist() method on the RDD if the data is to be reused.

Apache Spark features several persistence levels for storing the RDDs on disk, memory, or a combination of the two with distinct replication levels. These various persistence levels are:

  • DISK_ONLY – Stores the RDD partitions only on the disk.
  • MEMORY_AND_DISK – Stores RDD as deserialized Java objects in the JVM. In case the RDD isn’t able to fit in the memory, additional partitions are stored on the disk. These are read from here each time the requirement arises.
  • MEMORY_ONLY_SER – Stores RDD as serialized Java objects with one-byte array per partition.
  • MEMORY_AND_DISK_SER – Identical to MEMORY_ONLY_SER with the exception of storing partitions not able to fit in the memory to the disk in place of recomputing them on the fly when required.
  • MEMORY_ONLY – The default level, it stores the RDD as deserialized Java objects in the JVM. In case the RDD isn’t able to fit in the memory available, some partitions won’t be cached, resulting in recomputing the same on the fly every time they are required.
  • OFF_HEAP – Works like MEMORY_ONLY_SER but stores the data in off-heap memory.
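For example, a level can be chosen explicitly when persisting an RDD in PySpark (a minimal sketch):

  from pyspark import StorageLevel
  from pyspark.sql import SparkSession

  sc = SparkSession.builder.appName("persist-demo").getOrCreate().sparkContext

  rdd = sc.parallelize(range(100000))
  rdd.persist(StorageLevel.MEMORY_AND_DISK)   # spill partitions to disk if they do not fit in memory
  rdd.count()                                 # the first action materializes and caches the RDD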

8. Can you list down the limitations of using Apache Spark?

Ans:

  1. It doesn’t have a built-in file management system. Hence, it needs to be integrated with other platforms like Hadoop to benefit from a file management system
  2. Higher latency and, consequently, lower throughput
  3. No support for true real-time stream processing. The live data stream is partitioned into batches in Apache Spark, and after processing these are again converted into batches. Hence, Spark Streaming is micro-batch processing and not truly real-time data processing
  4. A smaller number of algorithms available
  5. Spark Streaming doesn’t support record-based window criteria
  6. The work needs to be distributed over multiple clusters instead of running everything on a single node
  7. Its ‘in-memory’ processing becomes a bottleneck when using Apache Spark for cost-efficient processing of big data

9. Define Apache Spark?

Ans:

Apache Spark is an easy-to-use, highly flexible and fast processing framework with an advanced engine that supports cyclic data flow and in-memory computing. It can run standalone, in the cloud, or on Hadoop, providing access to varied data sources like Cassandra, HDFS, HBase, and various others.

10. What is the main purpose of the Spark Engine?

Ans:

The main purpose of the Spark Engine is to schedule, monitor, and distribute the data application across the cluster.

11. Define Partitions in Apache Spark?

Ans:

Partitions in Apache Spark split the data into smaller, relevant, and more logical divisions, much like splits in MapReduce. Partitioning is the process of deriving logical units of data so that data can be processed at a speedy pace. In Apache Spark, data is partitioned within Resilient Distributed Datasets (RDDs).

12. What are the main operations of RDD?

Ans:

There are two main operations of RDD, which include:

  1. Transformations
  2. Actions

13. Define Transformations in Spark?

Ans:

Transformations are functions applied to an RDD that create another RDD. A transformation does not occur until an action takes place. Examples of transformations are map() and filter().

14. What is the function of map()?

Ans:

The map() function iterates over every line (element) in the RDD and then transforms them into a new RDD by applying a function to each element.

15. What is the function of filter()?

Ans:

The filter() function develops a new RDD by selecting those elements of the existing RDD that pass the function argument (predicate).
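A short illustration of both transformations in PySpark:

  from pyspark.sql import SparkSession

  sc = SparkSession.builder.appName("transformation-demo").getOrCreate().sparkContext

  rdd = sc.parallelize([1, 2, 3, 4, 5])
  squared = rdd.map(lambda x: x * x)          # map(): element-to-element transform -> new RDD
  evens = rdd.filter(lambda x: x % 2 == 0)    # filter(): keeps only elements passing the predicate
  print(squared.collect())                    # [1, 4, 9, 16, 25]
  print(evens.collect())                      # [2, 4]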

16. What are the Actions in Spark?

Ans:

Actions in Spark help in bringing the data from an RDD back to the local machine. They include various RDD operations that return non-RDD values. The actions in Spark include functions such as reduce() and take().

17. What is the difference between the reduce() and take() functions?

Ans:

The reduce() function is an action that is applied repeatedly until only one value is left at the end, while the take() function is an action that brings the requested number of values from an RDD to the local node.
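For instance (a minimal PySpark sketch):

  from pyspark.sql import SparkSession

  sc = SparkSession.builder.appName("action-demo").getOrCreate().sparkContext

  rdd = sc.parallelize([1, 2, 3, 4, 5])
  print(rdd.reduce(lambda a, b: a + b))   # 15: values are combined repeatedly until one is left
  print(rdd.take(3))                      # [1, 2, 3]: the first three elements come to the driver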

18. What are the similarities and differences between coalesce() and repartition() in Spark?

Ans:

The similarity is that both coalesce() and repartition() are used to modify the number of partitions in an RDD. The difference between them is that repartition() internally uses coalesce() with shuffling enabled. This helps repartition() give results in a specific number of partitions, with the whole data getting distributed by the application of various kinds of hash partitioners.

19. Define YARN in Spark?

Ans:

YARN in Spark acts as a central resource management platform that helps in delivering scalable operations throughout the cluster and performs the function of a distributed container manager.

20. What is Shark?

Ans:

Most data users know only SQL and are not good at programming. Shark is a tool developed for people who are from a database background – to access Scala MLlib capabilities through a Hive-like SQL interface. The Shark tool helps data users run Hive on Spark – offering compatibility with the Hive metastore, queries, and data.

21. List some use cases where Spark outperforms Hadoop in processing.

Ans:

Sensor Data Processing – Apache Spark’s ‘in-memory computing’ works best here, as data is retrieved and combined from different sources. Spark is preferred over Hadoop for real-time querying of data.

Stream Processing – For processing logs and detecting frauds in live streams for alerts, Apache Spark is the best solution.

22. What is a Sparse Vector?

Ans:

A sparse vector has two parallel arrays: one for indices and the other for values. These vectors are used for storing non-zero entries to save space.

23. What is RDD?

Ans:

RDDs (Resilient Distributed Datasets) are the basic abstraction in Apache Spark that represent the data coming into the system in object format. RDDs are used for in-memory computations on large clusters in a fault-tolerant manner. RDDs are read-only, partitioned collections of records that are:

  • Immutable – RDDs cannot be altered.
  • Resilient – If a node holding the partition fails, the other node takes over the data.

24. Explain about transformations and actions in the context of RDDs.

Ans:

Transformations are functions executed on demand, to produce a new RDD. All transformations are followed by actions. Some examples of transformations include map, filter and reduceByKey.

Actions are the results of RDD computations or transformations. After an action is performed, the data from RDD moves back to the local machine. Some examples of actions include reduce, collect, first, and take.

25. What are the languages supported by Apache Spark for developing big data applications?

Ans:

  1. Scala
  2. Java
  3. Python
  4. R
  5. Clojure

26. Can you use Spark to access and analyse data stored in Cassandra databases?

Ans:

Yes, it is possible if you use the Spark Cassandra Connector.
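A hedged sketch of how this typically looks from PySpark; the connector package coordinates, host, keyspace, and table names below are assumptions for illustration, not values from this article:

  from pyspark.sql import SparkSession

  spark = (SparkSession.builder
           .appName("cassandra-read-sketch")
           # hypothetical connector version; match it to your Spark/Scala versions
           .config("spark.jars.packages", "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1")
           .config("spark.cassandra.connection.host", "127.0.0.1")
           .getOrCreate())

  df = (spark.read.format("org.apache.spark.sql.cassandra")
        .options(table="users", keyspace="test")       # assumed keyspace and table
        .load())
  df.show()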

27. Is it possible to run Apache Spark on Apache Mesos?

Ans:

Yes, Apache Spark can be run on the hardware clusters managed by Mesos.

28.  Explain about the different cluster managers in Apache Spark

Ans:

The 3 different cluster managers supported in Apache Spark are:

  • YARN
  • Apache Mesos – Has rich resource scheduling capabilities and is well suited to run Spark along with other applications. It is advantageous when several users run interactive shells because it scales down the CPU allocation between commands.
  • Standalone deployments – Well suited for new deployments which only run Spark and are easy to set up.

29. How can Spark be connected to Apache Mesos?

Ans:

To connect Spark with Mesos:

Configure the spark driver program to connect to Mesos. Spark binary package should be in a location accessible by Mesos. (or)

Install Apache Spark in the same location as that of Apache Mesos and configure the property ‘spark.mesos.executor.home’ to point to the location where it is installed.

30. Why is there a need for broadcast variables when working with Apache Spark?

Ans:

These are read only variables, present in-memory cache on every machine. When working with Spark, usage of broadcast variables eliminates the necessity to ship copies of a variable for every task, so data can be processed faster. Broadcast variables help in storing a lookup table inside the memory which enhances the retrieval efficiency when compared to an RDD lookup ().

31. Is it possible to run Spark and Mesos along with Hadoop?

Ans:

Yes, it is possible to run Spark and Mesos with Hadoop by launching each of these as a separate service on the machines. Mesos acts as a unified scheduler that assigns tasks to either Spark or Hadoop.

32. What is lineage graph?

Ans:

The RDDs in Spark depend on one or more other RDDs. The representation of dependencies between RDDs is known as the lineage graph. Lineage graph information is used to compute each RDD on demand, so that whenever a part of a persistent RDD is lost, the lost data can be recovered using the lineage graph information.

33. Explain the key features of Spark.

Ans:

  • Apache Spark allows integrating with Hadoop.
  • It has an interactive language shell for Scala (the language in which Spark is written).
  • Spark consists of RDDs (Resilient Distributed Datasets), which can be cached across the computing nodes in a cluster.
  • Apache Spark supports multiple analytic tools that are used for interactive query analysis, real-time analysis, and graph processing

34. Define the functions of Spark Core.

Ans:

Serving as the base engine, Spark Core performs various important functions like memory management, monitoring jobs, providing fault-tolerance, job scheduling, and interaction with storage systems.

35.  What is Spark Driver?

Ans:

Spark driver is the program that runs on the master node of a machine and declares transformations and actions on data RDDs. In simple terms, a driver in Spark creates SparkContext, connected to a given Spark Master. It also delivers RDD graphs to Master, where the standalone Cluster Manager runs.

36. What is Hive on Spark?

Ans:

Hive contains significant support for Apache Spark, wherein Hive execution is configured to Spark:

  • hive> set spark.home=/location/to/sparkHome;
  • hive> set hive.execution.engine=spark;

Hive supports Spark on YARN mode by default.

37. Name the commonly used Spark Ecosystems.

Ans:

  • Spark SQL (Shark) for developers
  • Spark Streaming for processing live data streams
  • GraphX for generating and computing graphs
  • MLlib (Machine Learning Algorithms)
  • SparkR to promote R programming in the Spark engine

38. Define Spark Streaming.

Ans:

Spark Streaming is an extension of the Spark API that allows stream processing of live data streams.

39. How Spark handles monitoring and logging in Standalone mode?

Ans:

Spark has a web-based user interface for monitoring the cluster in standalone mode that shows the cluster and job statistics. The log output for each job is written to the work directory of the slave nodes.

40. Does Apache Spark provide check pointing?

Ans:

Lineage graphs are always useful to recover RDDs from a failure, but this is generally time-consuming if the RDDs have long lineage chains. Spark has an API for checkpointing, i.e. a REPLICATE flag to persist. However, the decision on which data to checkpoint is made by the user. Checkpoints are useful when the lineage graphs are long and have wide dependencies.

41. How can you launch Spark jobs inside Hadoop MapReduce?

Ans:

Using SIMR (Spark in MapReduce), users can run any Spark job inside MapReduce without requiring any admin rights.

42. How Spark uses Akka?

Ans:

Spark uses Akka basically for scheduling. All the workers request a task from the master after registering, and the master just assigns the task. Here, Spark uses Akka for messaging between the workers and masters.

43. How can you achieve high availability in Apache Spark?

Ans:

  • Implementing single node recovery with local file system
  • Using StandBy Masters with Apache ZooKeeper.

44. Hadoop uses replication to achieve fault tolerance. How is this achieved in Apache Spark?

Ans:

The data storage model in Apache Spark is based on RDDs. RDDs help achieve fault tolerance through lineage. An RDD always has the information on how to build from other datasets. If any partition of an RDD is lost due to failure, lineage helps rebuild only that particular lost partition.

45. Explain about the core components of a distributed Spark application

Ans:

  • Driver – The process that runs the main() method of the program to create RDDs and perform transformations and actions on them.
  • Executor – The worker processes that run the individual tasks of a Spark job.
  • Cluster Manager – A pluggable component in Spark, to launch Executors and Drivers. The cluster manager allows Spark to run on top of other external managers like Apache Mesos or YARN.

46. What do you understand by Lazy Evaluation?

Ans:

Spark is intelligent in the manner in which it operates on data. When you tell Spark to operate on a given dataset, it heeds the instructions and makes a note of it, so that it does not forget – but it does nothing unless asked for the final result. When a transformation like map() is called on an RDD, the operation is not performed immediately. Transformations in Spark are not evaluated until you perform an action. This helps optimize the overall data processing workflow.
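A small PySpark illustration of lazy evaluation:

  from pyspark.sql import SparkSession

  sc = SparkSession.builder.appName("lazy-demo").getOrCreate().sparkContext

  rdd = sc.parallelize(range(1000000))
  doubled = rdd.map(lambda x: x * 2)        # nothing executes yet
  big = doubled.filter(lambda x: x > 10)    # still nothing executes
  print(big.take(5))                        # the action triggers the whole pipeline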

47. Define a worker node.

Ans:

A node that can run the Spark application code in a cluster can be called a worker node. A worker node can have more than one worker, which is configured by setting the SPARK_WORKER_INSTANCES property in the spark-env.sh file. Only one worker is started if the SPARK_WORKER_INSTANCES property is not defined.

48. What do you understand by SchemaRDD?

Ans:

An RDD that consists of row objects (wrappers around basic string or integer arrays) with schema information about the type of data in each column.

49. What is the default level of parallelism in apache spark?

Ans:

If the user does not explicitly specify it, then the number of partitions is considered the default level of parallelism in Apache Spark.

50. Explain about the common workflow of a Spark program

Ans:

  • The foremost step in a Spark program involves creating input RDDs from external data.
  • Use various RDD transformations like filter() to create new transformed RDDs based on the business logic.
  • persist() any intermediate RDDs which might have to be reused in the future.
  • Launch various RDD actions like first() and count() to begin parallel computation, which will then be optimized and executed by Spark (see the sketch after this list).
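A compact PySpark sketch of that workflow; the input path is a placeholder:

  from pyspark.sql import SparkSession

  sc = SparkSession.builder.appName("workflow-sketch").getOrCreate().sparkContext

  lines = sc.textFile("hdfs:///data/app.log")            # 1. input RDD from external data (assumed path)
  errors = lines.filter(lambda line: "ERROR" in line)    # 2. transformation encoding the business logic
  errors.persist()                                       # 3. persist an intermediate RDD that is reused
  print(errors.count())                                  # 4. actions launch the parallel computation
  print(errors.first())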

51. In a given Spark program, how will you identify whether a given operation is a Transformation or an Action?

Ans:

One can identify the operation based on the return type:

  1. The operation is an action if the return type is anything other than an RDD.
  2. The operation is a transformation if the return type is an RDD.

52. What, according to you, is a common mistake Apache Spark developers make when using Spark?

Ans:

  • Failing to maintain the required size of shuffle blocks.
  • Spark developers often make mistakes with managing directed acyclic graphs (DAGs).

53. How will you calculate the number of executors required to do real-time processing using Apache Spark? What factors need to be considered for deciding on the number of nodes for real-time processing?

Ans:

The number of nodes can be decided by benchmarking the hardware and considering multiple factors such as optimal throughput (network speed), memory usage, the execution frameworks being used (YARN, Standalone or Mesos) and considering the other jobs that are running within those execution frameworks along with spark.

54. What is the difference between transform in a DStream and map?

Ans:

The transform function in Spark Streaming allows developers to use Apache Spark transformations on the underlying RDDs of the stream. The map function is used for an element-to-element transform and can be implemented using transform. Ideally, map works on the elements of a DStream, while transform allows developers to work with the RDDs of the DStream. map is an elementary transformation, whereas transform is an RDD transformation.

55. Is there any API available for implementing graphs in Spark?

Ans:

GraphX is the API used for implementing graphs and graph-parallel computing in Apache Spark. It extends the Spark RDD with a Resilient Distributed Property Graph. It is a directed multi-graph that can have several edges in parallel.

Each edge and vertex of the Resilient Distributed Property Graph has user-defined properties associated with it. The parallel edges allow for multiple relationships between the same vertices.

In order to support graph computation, GraphX exposes a set of fundamental operators, such as joinVertices, mapReduceTriplets, and subgraph, and an optimized variant of the Pregel API.

The GraphX component also includes an increasing collection of graph algorithms and builders for simplifying graph analytics tasks.

56. Tell us how will you implement SQL in Spark?

Ans:

The Spark SQL module helps in integrating relational processing with Spark’s functional programming API. It supports querying data via SQL or HiveQL (Hive Query Language).

Also, Spark SQL supports a wide range of data sources and allows for weaving SQL queries with code transformations. The DataFrame API, Data Source API, Interpreter & Optimizer, and SQL Service are the four libraries contained by Spark SQL.
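A minimal PySpark SQL sketch; the DataFrame contents are made up for illustration:

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("sql-demo").getOrCreate()

  people = spark.createDataFrame([("Alice", 31), ("Bob", 17)], ["name", "age"])
  people.createOrReplaceTempView("people")        # expose the DataFrame to SQL
  spark.sql("SELECT name FROM people WHERE age >= 18").show()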

57. What do you understand by the Parquet file?

Ans:

Parquet is a columnar format that is supported by several data processing systems. With it, Spark SQL performs both read as well as write operations. Having columnar storage has the following advantages:

  • Able to fetch specific columns for access
  • Consumes less space
  • Follows type-specific encoding
  • Limited I/O operations
  • Offers better-summarized data
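Reading and writing Parquet from Spark SQL looks like this (a minimal sketch; the output path and data are placeholders):

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

  df = spark.createDataFrame([("Alice", 31), ("Bob", 17)], ["name", "age"])
  df.write.mode("overwrite").parquet("/tmp/people.parquet")          # columnar write
  spark.read.parquet("/tmp/people.parquet").select("name").show()    # only the needed column is read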

58. Define PageRank in Spark? Give an example?

Ans:

PageRank in Spark is an algorithm in GraphX which measures each vertex in the graph. For example, if a person on Facebook, Instagram, or any other social media platform has a huge number of followers, then his/her page will be ranked higher.

59. What is Sliding Window in Spark? Give an example?

Ans:

A Sliding Window in Spark is used to specify each batch of Spark Streaming that has to be processed. For example, you can specifically set the batch intervals and the number of batches that you want to process through Spark Streaming.

60. What are the benefits of Sliding Window operations?

Ans:

Sliding Window operations have the following benefits:

  • It helps in controlling the transfer of data packets between different computer networks.
  • It combines the RDDs that fall within the particular window and operates upon them to create new RDDs of the windowed DStream.
  • It offers windowed computations to support the process of transformation of RDDs using the Spark Streaming Library.
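A hedged sketch of a windowed computation with the DStream API; the 30-second window length and 10-second slide interval are illustrative and must be multiples of the batch interval:

  from pyspark import SparkContext
  from pyspark.streaming import StreamingContext

  sc = SparkContext("local[2]", "window-demo")
  ssc = StreamingContext(sc, 5)                              # 5-second batches

  pairs = (ssc.socketTextStream("localhost", 9999)           # assumed source
              .flatMap(lambda line: line.split())
              .map(lambda w: (w, 1)))

  # Counts over the last 30 seconds, recomputed every 10 seconds
  windowed = pairs.reduceByKeyAndWindow(lambda a, b: a + b, None, 30, 10)
  windowed.pprint()

  ssc.start()
  ssc.awaitTermination()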

61. Why do we need the master driver in Spark?

Ans:

The master driver is the central point and the entry point of the Spark shell (which supports Scala, Python, and R). Below is the sequential process which the driver follows to execute a Spark job.

  1. The driver runs the main() function of the application, which creates the SparkContext.
  2. The driver program that runs on the master node of the Spark cluster schedules the job execution.
  3. It translates the RDDs into the execution graph and splits the graph into multiple stages.
  4. The driver stores the metadata about all the Resilient Distributed Datasets and their partitions.

62. What happens when a Spark Job is submitted?

Ans:

Below are the steps a Spark job follows once it is submitted:

  • A standalone application starts and instantiates a SparkContext instance and it is only then when you can call the application a driver.
  • The driver program asks for resources to the cluster manager to launch executors.
  • The cluster manager launches executors.
  • The driver process runs through the user application. 
  • Depending on the actions and transformations over RDDs, tasks are sent to executors.
  • Executors run the tasks and save the results.
  • If any worker crashes, its tasks will be sent to different executors to be processed again.
  • The driver implicitly converts the code containing transformations and actions into a logical directed acyclic graph (DAG).

Spark automatically deals with failed or slow machines by re-executing failed or slow tasks. For example, if the node running a partition of a map() operation crashes, Spark will rerun it on another node; and even if the node does not crash but is simply much slower than other nodes, Spark can preemptively launch a “speculative” copy of the task on another node and take its result if that finishes.

  1. The driver program converts a user application into smaller execution units known as tasks, which are also grouped into stages.
  2. Tasks are then executed by the executors, i.e. the worker processes which run individual tasks.

The complete process can be tracked via the cluster manager’s user interface. The driver exposes the information about the running Spark application through a Web UI at port 4040.

63. What are the other notable features of an RDD, and what are the ways to create an RDD?

Ans:

  • In-Memory: the ability to perform operations in primary memory, not on disk.
  • Immutable or Read-Only: emphasis on creating an immutable data set.
  • Lazy evaluated: Spark computes the records when an action is going to be performed, not at the transformation level.
  • Cacheable: we can cache the records for faster processing.
  • Parallel: Spark has the ability to parallelize operations on the data saved in an RDD.
  • Partitioned records: Spark has the ability to partition the records; by default it supports partitions of 128 MB.

An RDD can be created by:

  • Parallelizing an existing collection in your driver program.
  • Referencing a dataset in an external storage system, such as a shared file system, HDFS, or HBase.

64. Explain Classification Algorithm.

Ans:

One common type of supervised learning is classification. Classification is the act of training an algorithm to predict a dependent variable that is categorical (belonging to a discrete, finite set of values). The most common case is binary classification, where our resulting model will make a prediction that a given item belongs to one of two groups.

The canonical example is classifying email spam. Using a set of historical emails that are organized into groups of spam emails and not spam emails, we train an algorithm to analyze the words in, and any number of properties of, the historical emails and make predictions about them. Once we are satisfied with the algorithm’s performance, we use that model to make predictions about future emails the model has never seen before.

When we classify items into more than just two categories, we call this multiclass classification. For example, we may have four different categories of an email (as opposed to the two categories in the previous paragraph): spam, personal, work-related, and other.
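A hedged PySpark MLlib sketch of binary classification with logistic regression; the tiny, already-featurized training rows are made up purely for illustration:

  from pyspark.ml.classification import LogisticRegression
  from pyspark.ml.linalg import Vectors
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("classification-sketch").getOrCreate()

  # label 1.0 = spam, 0.0 = not spam (hypothetical feature vectors)
  train = spark.createDataFrame([
      (1.0, Vectors.dense([0.0, 1.1, 0.1])),
      (0.0, Vectors.dense([2.0, 1.0, -1.0])),
      (0.0, Vectors.dense([2.0, 1.3, 1.0])),
      (1.0, Vectors.dense([0.0, 1.2, -0.5])),
  ], ["label", "features"])

  model = LogisticRegression(maxIter=10).fit(train)
  model.transform(train).select("label", "prediction").show()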

65. What is a graph algorithm? Explain one of the graph algorithms.

Ans:

A graph is nothing but just a logical representation of data. Graph theory provides numerous algorithms for analyzing data in this format, and GraphFrames allows us to leverage many algorithms out of the box.

PageRank

One of the most prolific graph algorithms is PageRank. Larry Page, a co-founder of Google, created PageRank as a research project for how to rank web pages.

PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites. In very short form, PageRank is a ‘vote’ by all the other pages on the web about how important a page is. A link to a page counts as a vote of support for that page; if there is no link, then there is no support.

66. What is coalesce in Spark? How is it different from repartition?

Ans:

Coalesce in Spark provides a way to reduce the number of partitions in an RDD or data frame. It works on existing partitions instead of creating new partitions thereby reducing the amount of data that are shuffled.

A good use for coalesce is when data in an RDD has been filtered out. As a result of filtering, some of the partitions in the RDD may now be empty or have less data. Coalesce will help reduce the number of partitions, thereby helping optimize any further operations on the RDD. Note that coalesce cannot be used to increase the number of partitions.
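For instance (a minimal sketch using the DataFrame API):

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("coalesce-demo").getOrCreate()

  df = spark.range(1000000).repartition(200)     # many partitions
  filtered = df.filter("id % 100 = 0")           # most rows dropped, partitions are now sparse
  compact = filtered.coalesce(10)                # merge partitions without a full shuffle
  print(compact.rdd.getNumPartitions())          # 10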

67. What is shuffling in Spark? When does it occur? What are the various ways by which shuffling of data can be minimized?

Ans:

Shuffling is the process of redistributing data across partitions that may cause data movement across executors.

By default, shuffling doesn’t change the number of partitions, but their content. There are many different operations that require shuffling of data, for instance, join between two tables or byKey operations such as GroupByKey or ReduceByKey.

Shuffling is a costly operation as it involves the movement of data across executors and care must be taken to minimize it. This can be done using optimized grouping operation such as using reduceByKey instead of groupByKey. While groupByKey shuffles all the data, reduceByKey shuffles only the results of aggregations of each partition and hence is more optimized than groupByKey.

When joining two tables, opt to use the same partitioner on both tables. This would store values having the same key in the same chunk/partition. This way, Spark would not have to go through the entire second table for each partition of the first table, hence reducing the shuffling of data.

Another optimization is to use broadcast join when joining a large table with a smaller one. This would broadcast a smaller table’s data to all the executors hence reducing shuffling of data.
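Two of those techniques sketched in PySpark; the data and table sizes are illustrative:

  from pyspark.sql import SparkSession
  from pyspark.sql.functions import broadcast

  spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()
  sc = spark.sparkContext

  # reduceByKey pre-aggregates inside each partition before shuffling
  pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
  print(pairs.reduceByKey(lambda x, y: x + y).collect())

  # A broadcast join ships the small table to every executor instead of shuffling the large one
  large_df = spark.range(1000000).withColumnRenamed("id", "key")
  small_df = spark.createDataFrame([(0, "zero"), (1, "one")], ["key", "label"])
  large_df.join(broadcast(small_df), "key").show()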

68. What are the output Modes in Structured Streaming?

Ans:

There are three modes supported by Structured Streaming. Let’s look at each of them:

  1. Append mode
  2. Complete mode
  3. Update mode

Append mode

Append mode is the default behavior and the simplest to understand. When new rows are added to the result table, they will be output to the sink based on the trigger (explained next) that you specify. This mode ensures that each row is output once (and only once), assuming that you have a fault-tolerant sink. When you use append mode with event time and watermarks, only the final results will be output to the sink.

Complete mode

Complete mode will output the entire state of the result table to your output sink. This is useful when you are working with some stateful data for which all rows are expected to change over time, or when the sink you are writing to does not support row-level updates. Think of it as the state of the stream at the time the previous batch ran.

Update mode

Update mode is like complete mode except that only the rows that are different from the previous write are written out to the sink. Naturally, your sink must support row-level updates to support this mode. If the query doesn’t contain aggregations, this is equivalent to append mode.
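A minimal Structured Streaming sketch showing where the output mode is set; the socket source on localhost:9999 is an assumption for illustration:

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("output-mode-demo").getOrCreate()

  lines = (spark.readStream.format("socket")
           .option("host", "localhost").option("port", 9999).load())

  counts = lines.groupBy("value").count()        # stateful aggregation on the stream

  query = (counts.writeStream
           .outputMode("complete")               # "append", "complete", or "update"
           .format("console")
           .start())
  query.awaitTermination()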

69. What are Event Time and Stateful processing in Streaming?

Ans:

Event Time:

At a higher level, in stream-processing systems, there are effectively two relevant times for each event: the time at which it actually occurred (event time) and the time that it was processed or reached the stream-processing system (processing time).

Event time:

Event time is the time that is embedded in the data itself. It is most often, though not required to be, the time that an event actually occurs. This is important to use because it provides a more robust way of comparing events against one another. The challenge here is that event data can be late or out of order. This means that the stream processing system must be able to handle out-of-order or late data.

Processing time:

Processing time is the time at which the stream-processing system actually receives data. This is usually less important than event time because when the data is processed is largely an implementation detail. Processing time can never be out of order because it is a property of the streaming system at a certain time.

Stateful Processing:

Stateful processing is only necessary when you need to use or update intermediate information (state) over longer periods of time (in either a micro-batch or a record-at-a-time approach). This can happen when you are using event time or when you are performing aggregation on a key, whether that involves event time or not.

For the most part, when we are performing stateful operations, Spark handles all of this complexity for us. For example, when you specify a grouping, Structured Streaming maintains and updates the information for you. You simply specify the logic. When performing a stateful operation, Spark stores the intermediate information in a state store. Spark’s current state store implementation is an in-memory state store that is made fault tolerant by storing intermediate state to the checkpoint directory.

70. What is Graph Analytics in Spark?

Ans:

Graphs are data structures composed of nodes, or vertices, which are arbitrary objects, and edges that define relationships between these nodes. Graph analytics is the process of analysing these relationships. An example graph might be your friend group. In the context of graph analytics, each vertex or node would represent a person, and each edge would represent a relationship.

Graphs are a natural way of describing relationships and many different problem sets, and Spark provides several ways of working in this analytics paradigm. Some business use cases could be detecting credit card fraud, motif finding, determining the importance of papers in bibliographic networks (i.e., which papers are most referenced), and ranking web pages, as Google famously used the PageRank algorithm to do.

Spark has long contained an RDD-based library for performing graph processing: GraphX. This provided a very low-level interface that was extremely powerful, but just like RDDs, wasn’t easy to use or optimize. GraphX remains a core part of Spark. Companies continue to build production applications on top of it, and it still sees some minor feature development. The GraphX API is well documented simply because it hasn’t changed much since its creation. However, some of the developers of Spark (including some of the original authors of GraphX) have recently created a next-generation graph analytics library on Spark: GraphFrames. GraphFrames extends GraphX to provide a DataFrame API and support for Spark’s different language bindings so that users of Python can take advantage of the scalability of the tool.
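A hedged GraphFrames sketch; it requires the external graphframes package to be installed, and the toy vertices and edges are made up for illustration:

  from pyspark.sql import SparkSession
  from graphframes import GraphFrame          # external package, not bundled with Spark

  spark = SparkSession.builder.appName("graphframes-sketch").getOrCreate()

  vertices = spark.createDataFrame(
      [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
  edges = spark.createDataFrame(
      [("a", "b", "friend"), ("b", "c", "follow")], ["src", "dst", "relationship"])

  g = GraphFrame(vertices, edges)
  ranks = g.pageRank(resetProbability=0.15, maxIter=10)   # PageRank over the graph
  ranks.vertices.select("id", "pagerank").show()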

71. Illustrate some demerits of using Spark.

Ans:

Since Spark utilizes more storage space compared to Hadoop and MapReduce, there may arise certain problems. Developers need to be careful while running their applications in Spark. Instead of running everything on a single node, the work must be distributed over multiple clusters.

72. What are accumulators in Apache spark?

Ans:

Accumulators are write-only variables (from the workers’ point of view) that are initialized once and sent to the workers. On the basis of the logic written, the workers update them and send them back to the driver, which processes them on the basis of that logic. Only the driver has the ability to access an accumulator’s value.
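A minimal PySpark accumulator sketch:

  from pyspark.sql import SparkSession

  sc = SparkSession.builder.appName("accumulator-demo").getOrCreate().sparkContext

  bad_records = sc.accumulator(0)            # workers can only add to it

  def check(line):
      if "ERROR" in line:
          bad_records.add(1)

  sc.parallelize(["ok", "ERROR x", "ok", "ERROR y"]).foreach(check)
  print(bad_records.value)                   # only the driver reads the value: 2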

73.  Name the languages which are supported by Apache Spark and which one is most popular?

Ans:

Apache Spark supports the languages Java, Python, Scala, and R. Among them, Scala and Python have interactive shells for Apache Spark; the Scala shell can be accessed through ./bin/spark-shell and the Python shell through ./bin/pyspark. Among them, Scala is the most popular because Apache Spark is written in Scala.

74. If MapReduce is inferior to Spark, then is there any benefit of learning it?

Ans:

Apache Spark is far better than MapReduce, but learning MapReduce is still essential. MapReduce is a paradigm which is used even by big data tools like Spark. When the data is very large and keeps growing, MapReduce remains very relevant. Data tools like Pig and Hive convert their queries into MapReduce in order to optimize them properly.

75. What are the different output methods to get results?

Ans:

  1. collect()
  2. show()
  3. take()
  4. foreach(println)

76. Do you need to install Spark on all nodes of a YARN cluster? Why?

Ans:

No, because Spark runs on top of YARN.
