Sqoop Interview Questions and Answers
Last updated on 24th Oct 2020
If you're looking for Sqoop interview questions for experienced candidates or freshers, you are at the right place. There are a lot of opportunities at many reputed companies around the world. According to research, Hadoop has a market share of about 21.5%, so you still have the opportunity to move ahead in your career in Hadoop development. ACTE offers Advanced Sqoop Interview Questions 2020 to help you crack your interview and build your dream career as a Hadoop developer.
1. Mention the best features of Apache Sqoop.
Apache Sqoop is a tool in the Hadoop ecosystem that offers several advantages, such as:
- 1. Parallel import/export
- 2. Connectors for all major RDBMS Databases
- 3. Import results of SQL query
- 4. Incremental Load
- 5. Full Load
- 6. Kerberos Security Integration
- 7. Load data directly into Hive / HBase
- 8. Compression
- 9. Support for Accumulo
2. What is Sqoop Import? Explain its purpose.
When we need to import tables from an RDBMS into HDFS, we use the Sqoop import tool. It imports individual tables from the RDBMS to HDFS, and each row of a table becomes a record in HDFS. In text files, all records are stored as text data, whereas in Avro and SequenceFiles all records are stored as binary data.
3. What is the default file format to import data using Apache Sqoop?
Sqoop allows data to be imported using two file formats:
Delimited Text File Format:
This is the default file format for importing data with Sqoop. It can also be specified explicitly by passing the --as-textfile argument to the import command. Passing this argument produces a string-based representation of all the records in the output files, with delimiter characters between columns and rows.
Sequence File Format:
SequenceFile is a binary file format in which records are stored in custom record-specific data types. Sqoop automatically generates these data types and exposes them as Java classes.
4. How can I import large objects (BLOB and CLOB objects) in Apache Sqoop?
Direct-mode import of BLOB and CLOB large objects is not supported by the Apache Sqoop import command. To import large objects in Sqoop, JDBC-based imports have to be used, without the --direct argument to the import utility.
5. How can you execute a free-form SQL query in Sqoop to import the rows in a sequential manner?
By using the -m 1 option in the Sqoop import command. It creates only one MapReduce task, which then imports the rows serially.
6. Does Apache Sqoop have a default database?
Yes, MySQL is the default database.
7. How will you list all the columns of a table using Apache Sqoop?
There is no direct command such as sqoop-list-columns to list all the columns. Indirectly, we can achieve this by querying the database metadata for the columns of the desired table and redirecting the result to a file, which can then be viewed manually and contains the column names of that table.
- sqoop import -m 1 --connect 'jdbc:sqlserver://nameofserver;database=nameofdatabase;password=mypassword' --query "SELECT column_name, DATA_TYPE FROM INFORMATION_SCHEMA.Columns WHERE table_name='mytableofinterest' AND \$CONDITIONS" --target-dir 'mytableofinterest_column_name'
8. If the source data gets updated every now and then, how will you synchronize the data in HDFS that is imported by Sqoop?
By using the incremental parameter with data import we can synchronize the data. The incremental parameter can be used with one of two options:
- append: Use incremental import with the append option when the table is continuously updated with new rows and increasing row id values. The column to be checked is specified using --check-column, and only rows whose value in that column is greater than the last imported value are inserted as new rows.
- lastmodified: In this kind of incremental import, the source has a date column which is checked. Any records that have been updated after the last import, based on the last-modified column in the source, have their values updated.
9. Name a few import control commands. How can Sqoop handle large objects?
To import RDBMS data, we use import control arguments such as:
--append: Append data to an existing dataset in HDFS.
--columns: Columns to import from the table.
--where: WHERE clause to use during import.
The common large objects are BLOB and CLOB. If an object is smaller than 16 MB, it is stored inline with the rest of the data and is materialized in memory for processing. Bigger objects are stored in files in a subdirectory named _lobs. If we set the inline LOB limit to zero (0), all large objects are stored in external storage.
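The inline-versus-external rule described above can be sketched in a few lines of Python. This is only an illustration of the decision, not Sqoop's code: the function name `lob_placement` is hypothetical, and treating an object of exactly 16 MB as external is an assumption.

```python
# Default inline limit described in the answer above: 16 MB.
DEFAULT_INLINE_LOB_LIMIT = 16 * 1024 * 1024


def lob_placement(size_bytes, inline_limit=DEFAULT_INLINE_LOB_LIMIT):
    """Return 'inline' or 'external' for a large object of the given size.

    Hypothetical helper: a limit of 0 sends every large object to
    external storage, as the answer above states.
    """
    if inline_limit > 0 and size_bytes < inline_limit:
        return "inline"
    return "external"


print(lob_placement(1024))                   # small object
print(lob_placement(64 * 1024 * 1024))       # larger than the limit
print(lob_placement(1024, inline_limit=0))   # limit set to zero
```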
10. How can we import data from a particular row or column? What are the destination types allowed in the Sqoop import command?
Sqoop allows data to be imported and exported on the basis of a WHERE clause, and the columns to import can be restricted as well. The syntax is:
- --columns <col1,col2……> --where <where clause>
- sqoop import --connect jdbc:mysql://db.one.com/corp --table INTELLIPAAT_EMP --where "start_date > '2016-07-20'"
- sqoop eval --connect jdbc:mysql://db.test.com/corp --query "SELECT * FROM intellipaat_emp LIMIT 20"
- sqoop import --connect jdbc:mysql://localhost/database --username root --password aaaaa --columns "name,emp_id,jobtitle"
Sqoop supports importing data into the following services:
- 1. HDFS
- 2. Hive
- 3. Hbase
- 4. Hcatalog
- 5. Accumulo
11. When to use --target-dir and when to use --warehouse-dir while importing data?
We use --target-dir to specify a particular directory in HDFS, whereas we use --warehouse-dir to specify the parent directory of all the Sqoop jobs. In the latter case, Sqoop creates a directory with the same name as the table under the parent directory.
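As a rough illustration of the difference, the following Python sketch computes where the imported files would land for each option. The function `import_directory` is hypothetical, and the fallback for the case where neither option is given is a simplification.

```python
import posixpath


def import_directory(table, target_dir=None, warehouse_dir=None):
    """Return the HDFS directory an import of `table` would write to.

    Hypothetical helper that mirrors the rule described above.
    """
    if target_dir and warehouse_dir:
        # The two options are alternatives; only one may be given.
        raise ValueError("use either target_dir or warehouse_dir, not both")
    if target_dir:
        # --target-dir names the exact output directory.
        return target_dir
    if warehouse_dir:
        # --warehouse-dir is a parent; a directory named after the table goes under it.
        return posixpath.join(warehouse_dir, table)
    # Simplification: with neither option, use a directory named after the table.
    return table


print(import_directory("EMPLOYEES", target_dir="/data/emp"))
print(import_directory("EMPLOYEES", warehouse_dir="/warehouse"))
```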
12. What is the process to perform an incremental data load in Sqoop?
In Sqoop, an incremental data load synchronizes the modified or updated data (often referred to as delta data) from the RDBMS to Hadoop. The delta data is picked up through the incremental load options of the import command.
We perform the incremental load with the Sqoop import command, and the data can also be loaded into Hive without overwriting what is already there. The attributes that need to be specified during an incremental load are:
- 1. Mode (--incremental): It shows how Sqoop will determine which rows are new. Its value is either append or lastmodified.
- 2. Col (--check-column): It specifies the column that should be examined to find the rows to be imported.
- 3. Value (--last-value): It denotes the maximum value of the check column from the previous import operation.
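The interplay of the three attributes can be sketched with a small in-memory simulation. This is not how Sqoop is implemented: the function `incremental_import`, the `key` parameter and the sample rows are all hypothetical, and the lastmodified branch assumes that updated rows replace older copies with the same key.

```python
def incremental_import(target, source, mode, check_column, last_value, key="id"):
    """Simulate one incremental load; return (new_target, new_last_value)."""
    # Only rows whose check column exceeds the previous last value are picked up.
    delta = [row for row in source if row[check_column] > last_value]
    if mode == "append":
        # New rows are simply added to what was imported before.
        result = target + delta
    elif mode == "lastmodified":
        # Assumption: updated rows replace older copies that share the same key.
        changed = {row[key] for row in delta}
        result = [row for row in target if row[key] not in changed] + delta
    else:
        raise ValueError("mode must be 'append' or 'lastmodified'")
    # The maximum check-column value seen becomes the next run's last value.
    new_last = max((row[check_column] for row in delta), default=last_value)
    return result, new_last


imported = [
    {"id": 1, "name": "a", "updated": "2020-01-01"},
    {"id": 2, "name": "b", "updated": "2020-01-01"},
]
source = [
    {"id": 1, "name": "a", "updated": "2020-01-01"},
    {"id": 2, "name": "B", "updated": "2020-02-01"},
    {"id": 3, "name": "c", "updated": "2020-02-02"},
]

appended, last_id = incremental_import(imported, source, "append", "id", 2)
merged, last_ts = incremental_import(imported, source, "lastmodified", "updated", "2020-01-01")
print(len(appended), last_id)
print([row["name"] for row in merged], last_ts)
```

Storing the returned last value and passing it to the next run is what keeps each load limited to the delta.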
13. What is the significance of using the --compression-codec parameter?
We use the --compression-codec parameter to get the output files of a Sqoop import in compression formats other than .gz, such as .bz2.
14. Can free-form SQL queries be used with the Sqoop import command? If yes, then how can they be used?
In Sqoop, we can use SQL queries with the import command. We use the import command with the -e or --query option to execute free-form SQL queries. Note that the --target-dir value must be specified when using the -e or --query option with the import command.
15. What is the importance of eval tools?
Sqoop eval helps to run sample SQL queries against a database and preview the results on the console. It also helps to know what data we can import, or whether the desired data has been imported.
16. How can you import only a subset of rows from a table?
In the sqoop import statement, by using the WHERE clause we can import only a subset of rows.
17. What are the limitations of importing RDBMS tables into Hcatalog directly?
By using the --hcatalog-database option together with --hcatalog-table, we can import RDBMS tables into HCatalog directly. The limitation is that this does not support several arguments, such as --as-avrodatafile, --direct, --as-sequencefile, --target-dir and --export-dir.
18. What is the advantage of using --password-file rather than the -P option while preventing the display of the password in the sqoop import statement?
The --password-file option can be used inside a Sqoop script, whereas the -P option reads the password from standard input, which prevents automation.
19. What do you mean by Free Form Import in Sqoop?
Sqoop can import data from a relational database using any SQL query, rather than only using the table and column name parameters.
20. What is the role of JDBC drivers in Sqoop?
Sqoop needs a connector to connect to different relational databases. Almost every database vendor makes this connector available as a JDBC driver that is specific to that database. Hence, Sqoop needs the JDBC driver of each database it has to interact with.
21. Is JDBC driver enough to connect sqoop to the databases?
No. To connect to a database, Sqoop needs both the JDBC driver and a connector.
22. What is InputSplit in Hadoop?
When a Hadoop job runs, it splits the input files into chunks and assigns each chunk to a mapper to process. Each such chunk is called an InputSplit.
23. What is the work of Export in Hadoop sqoop?
The export tool transfers data from HDFS to an RDBMS.
24. Use of Codegen command in Hadoop sqoop?
The codegen command generates code to interact with database records.
25. Use of Help command in Hadoop sqoop?
The help command lists the available commands.
26. How can you schedule a job using Oozie?
Oozie has in-built Sqoop actions, inside which we can mention the Sqoop commands to be executed.
27. What is the importance of the --split-by clause in running parallel import tasks in sqoop?
The --split-by clause names the column whose values are used to divide the data into groups of records. These groups of records are then read in parallel by the MapReduce tasks.
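To make the idea of dividing data by a split column concrete, here is a simplified Python sketch that cuts an integer column's value range into contiguous sub-ranges, one per map task. It is an illustration under assumptions (integer column, evenly sized ranges), not Sqoop's splitter; the function name `split_ranges` is hypothetical.

```python
def split_ranges(min_val, max_val, num_mappers):
    """Divide the inclusive range [min_val, max_val] into contiguous sub-ranges.

    Returns a list of (low, high) pairs, at most one per mapper.
    """
    if num_mappers < 1 or max_val < min_val:
        raise ValueError("need num_mappers >= 1 and max_val >= min_val")
    total = max_val - min_val + 1
    parts = min(num_mappers, total)          # never more ranges than values
    base, extra = divmod(total, parts)
    ranges = []
    low = min_val
    for i in range(parts):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first ranges
        high = low + size - 1
        ranges.append((low, high))
        low = high + 1
    return ranges


# Each pair corresponds to a condition such as: id >= low AND id <= high
print(split_ranges(0, 999, 4))
print(split_ranges(1, 10, 3))
```

If the split column's values are unevenly distributed, equal value ranges do not give equal row counts, which is why the choice of split column matters.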
28. What is a sqoop metastore?
The Sqoop metastore is a tool that hosts a shared metadata repository. Multiple users and/or remote users can define and execute saved jobs (created with sqoop job) defined in this metastore.
Clients must be configured to connect to the metastore, either in sqoop-site.xml or with the --meta-connect argument.
29. What is the purpose of sqoop-merge?
The merge tool combines two datasets where entries in one dataset should overwrite entries of an older dataset preserving only the newest version of the records between both the data sets.
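The overwrite rule can be sketched in Python as a merge of two lists of records on a key column. The function `merge_datasets` and the sample records are hypothetical; the sketch only shows the "newest version wins" logic, not the MapReduce job the merge tool runs.

```python
def merge_datasets(older, newer, merge_key):
    """Combine two datasets; for a shared key the record from `newer` wins."""
    merged = {row[merge_key]: row for row in older}
    # Records from the newer dataset overwrite older ones with the same key.
    merged.update({row[merge_key]: row for row in newer})
    return sorted(merged.values(), key=lambda row: row[merge_key])


older = [{"id": 1, "city": "Pune"}, {"id": 2, "city": "Delhi"}]
newer = [{"id": 2, "city": "New Delhi"}, {"id": 3, "city": "Chennai"}]
for row in merge_datasets(older, newer, "id"):
    print(row)
```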
30. How can you see the list of stored jobs in sqoop metastore?
- sqoop job --list
31. Which database does the sqoop metastore run on?
Running sqoop-metastore launches a shared HSQLDB database instance on the current machine.
32. Where can the metastore database be hosted?
Anywhere, it means we can host a metastore database within or outside of the Hadoop cluster.
33. Give the sqoop command to see the content of the job named myjob?
- sqoop job --show myjob
34. How can you control the mapping between SQL data types and Java types?
We can configure the mapping by using the --map-column-java property.
- $ sqoop import … --map-column-java id=String,value=Integer
35. Is it possible to add a parameter while running a saved job?
Yes, by using the --exec option we can add an argument to a saved job at runtime.
- sqoop job --exec jobname -- --newparameter
36. What is the usefulness of the options file in sqoop?
The options file lets us specify command-line values in a file and reuse them in Sqoop commands.
For example, the --connect parameter's value and the --username value can be stored in a file and used again and again with different Sqoop commands.
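As an illustration, the following Python sketch renders the text of such an options file, with the tool name first and then each option and each value on its own line. The helper `render_options_file`, the connection string and the user name are hypothetical placeholders.

```python
def render_options_file(tool, options):
    """Return options-file text: one token per line, tool name first."""
    lines = [tool]
    for flag, value in options:
        lines.append(flag)
        if value is not None:       # flags without a value take a single line
            lines.append(value)
    return "\n".join(lines) + "\n"


text = render_options_file(
    "import",
    [("--connect", "jdbc:mysql://localhost/db"), ("--username", "foo")],
)
print(text, end="")
```

Saved as, say, import.txt, the file would be passed with the --options-file argument, and the remaining arguments (for example --table) are given on the command line as usual.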
37. How can you avoid importing tables one-by-one when importing a large number of tables from a database?
Use the command:
- sqoop import-all-tables --exclude-tables table1,table2 ..
This imports all the tables except the ones mentioned in the --exclude-tables clause.
38. How can you control the number of mappers used by the sqoop command?
To control the number of mappers executed by a Sqoop command we use the --num-mappers parameter. We should start with a small number of map tasks and then scale up gradually, as choosing a high number of mappers initially may slow down performance on the database side.
39. What is the default extension of the files produced from a sqoop import using the --compress parameter?
.gz
40. What is the disadvantage of using the --direct parameter for faster data load by sqoop?
The native utilities used by databases to support faster load do not work for binary data formats like SequenceFile.
41. How will you update the rows that are already exported?
To update existing rows we can use the --update-key parameter. It takes a comma-separated list of columns that uniquely identifies a row. All of these columns are used in the WHERE clause of the generated UPDATE query, and all other table columns are used in the SET part of the query.
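The way the key columns and the remaining columns are divided between the WHERE and SET parts can be sketched as follows. This only illustrates the shape of the statement; the function `build_update_statement` and the table and column names are hypothetical, and the exact SQL text Sqoop generates may differ.

```python
def build_update_statement(table, columns, update_key):
    """Build a parameterized UPDATE: key columns in WHERE, the rest in SET."""
    keys = [col.strip() for col in update_key.split(",")]
    missing = [key for key in keys if key not in columns]
    if missing:
        raise ValueError(f"update key columns not in table: {missing}")
    set_columns = [col for col in columns if col not in keys]
    set_clause = ", ".join(f"{col} = ?" for col in set_columns)
    where_clause = " AND ".join(f"{key} = ?" for key in keys)
    return f"UPDATE {table} SET {set_clause} WHERE {where_clause}"


print(build_update_statement("employees", ["id", "dept", "name", "salary"], "id,dept"))
```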
42. What are the basic commands in Apache Sqoop and its uses?
The basic commands of Apache Sqoop and their uses are:
- 1. codegen: generates code to interact with database records
- 2. create-hive-table: imports a table definition into Hive
- 3. eval: evaluates a SQL statement and displays the results
- 4. export: exports an HDFS directory into a database table
- 5. help: lists the available commands
- 6. import: imports a table from a database to HDFS
- 7. import-all-tables: imports tables from a database to HDFS
- 8. list-databases: lists the available databases on a server
- 9. list-tables: lists the tables in a database
- 10. version: displays the version information
43. How did the Sqoop word come about? Sqoop is which type of tool and the main use of sqoop?
Sqoop word came from SQL+HADOOP=SQOOP.
Basically, it is a data transfer tool. We use Sqoop to import and export a large amount of data from RDBMS to HDFS and vice versa.
44. What is Sqoop Validation?
Validation means validating the data copied, for either import or export, by comparing the row counts from the source and the target after the copy. We use this option to compare the row counts between the source and the target just after data is imported into HDFS. If rows are deleted or added during the import, Sqoop tracks this change and updates the log file.
45. What is the Purpose to Validate in Sqoop?
The main purpose of validation in Sqoop is to validate the data copied, for either a Sqoop import or export, by comparing the row counts from the source and the target after the copy.
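The row-count comparison can be sketched with Python's built-in sqlite3 module standing in for the source database. Everything here is an assumption for illustration: the function `validate_copy`, the table `emp`, and the list of text records that plays the role of the imported files.

```python
import sqlite3


def validate_copy(conn, table, imported_records):
    """Compare the source table's row count with the number of copied records."""
    source_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    target_count = len(imported_records)
    return {
        "source": source_count,
        "target": target_count,
        "passed": source_count == target_count,
    }


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO emp VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])

print(validate_copy(conn, "emp", ["1,a", "2,b", "3,c"]))   # counts match
print(validate_copy(conn, "emp", ["1,a", "2,b"]))          # one record missing
```

A matching count does not prove that the contents are identical; it only catches rows that were lost or added during the copy.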
46. What is Sqoop Job?
A Sqoop job is a saved job that remembers the parameters used to specify it, so that it can be re-executed. If a saved job is configured to perform an incremental import, state regarding the most recently imported rows is updated in the saved job. This allows the job to continually import only the newest rows.
47. What is Sqoop Import Mainframe Tool and its Purpose?
The Sqoop import-mainframe tool imports all sequential datasets in a partitioned dataset (PDS) on a mainframe to HDFS. A PDS is akin to a directory on open systems. The records in a dataset can contain only character data, and each record is stored as a single text field containing the entire record.
48. What is the purpose of Sqoop List Tables?
The main purpose of sqoop-list-tables is to list the tables present in a database.
49. Difference Between Apache Sqoop vs Flume.
Let's discuss the differences on the basis of features.
| Features | Apache Sqoop | Apache Flume |
| Data Flow | Sqoop works with any type of relational database system (RDBMS) that has basic JDBC connectivity. Sqoop can also import data from NoSQL databases like MongoDB and Cassandra, and it allows data transfer to Apache Hive or HDFS. | Flume works with streaming data sources that are generated continuously in Hadoop environments, such as log files. |
| Type of Loading | Sqoop load is not driven by events. | Data loading is completely event-driven. |
| When to use | It is an ideal fit if the data is available in Teradata, Oracle, MySQL, PostgreSQL or any other JDBC-compatible database. | It is the best choice when we move bulk streaming data from sources like JMS or spooling directories. |
| Link to HDFS | For importing data in Apache Sqoop, HDFS is the destination. | In Apache Flume, data generally flows to HDFS through channels. |
| Architecture | It has a connector-based architecture, which means the connectors know how to connect to the various data sources and fetch data accordingly. | It has an agent-based architecture, which means the code written in Flume is called an agent, and it is responsible for fetching the data. |
50. When to use --target-dir and when to use --warehouse-dir while importing data?
We use --target-dir to specify a particular directory in HDFS, whereas we use --warehouse-dir to specify the parent directory of all the Sqoop jobs. In the latter case, Sqoop creates a directory with the same name as the table under the parent directory.
51. What Is The Role Of A JDBC Driver In A Sqoop Set Up?
To connect to different relational databases, Sqoop needs a connector. Almost every DB vendor makes this connector available as a JDBC driver which is specific to that DB. So Sqoop needs the JDBC driver of each of the databases it needs to interact with.
52. How Can You Import Only A Subset Of Rows From A Table?
By using the WHERE clause in the sqoop import statement we can import only a subset of rows.
53. How Can We Import A Subset Of Rows From A Table Without Using The Where Clause?
We can run a filtering query on the database and save the result to a temporary table in the database. Then we use the sqoop import command without the where clause.
54. What Is The Advantage Of Using --password-file Rather Than The -P Option While Preventing The Display Of Password In The Sqoop Import Statement?
The --password-file option can be used inside a Sqoop script, while the -P option reads from standard input, preventing automation.
55. What is the significance of using the --compression-codec parameter?
We use the --compression-codec parameter to get the output files of a Sqoop import in compression formats other than .gz, such as .bz2.
56. What Is The Disadvantage Of Using The --direct Parameter For Faster Data Load By Sqoop?
The native utilities used by databases to support faster load do not work for binary data formats like SequenceFile.
57. When The Source Data Keeps Getting Updated Frequently, What Is The Approach To Keep It In Sync With The Data In HDFS Imported By Sqoop?
Sqoop offers two approaches.
- 1. Use the --incremental parameter with the append option, where the values of a check column are examined and only rows with values greater than the last imported value are imported as new rows.
- 2. Use the --incremental parameter with the lastmodified option, where a date column in the source is checked for records which have been updated after the last import.
58. How Do You Fetch Data Which Is The Result Of Join Between Two Tables? How Can We Slice The Data To Be Imported To Multiple Parallel Tasks?
Using the --split-by parameter we specify the column name based on which Sqoop will divide the data to be imported into multiple chunks to be run in parallel.
59. How Can You Choose A Name For The MapReduce Job Which Is Created On Submitting A Free-form Query Import?
By using the --mapreduce-job-name parameter. Below is an example of the command.
- sqoop import \
- --connect jdbc:mysql://mysql.example.com/sqoop \
- --username sqoop \
- --password sqoop \
- --query 'SELECT normcities.id FROM normcities JOIN countries USING(country_id) WHERE $CONDITIONS' \
- --split-by id \
- --target-dir cities \
- --mapreduce-job-name normcities
60. Before Starting The Data Transfer Using MapReduce Job, Sqoop Takes A Long Time To Retrieve The Minimum And Maximum Values Of Columns Mentioned In --split-by Parameter. How Can We Make It Efficient?
We can use the --boundary-query parameter, in which we specify the query that returns the min and max values for the column on which the split into multiple MapReduce tasks is based. This makes it faster because the query inside the --boundary-query parameter is executed first, and the job is ready with the information on how many MapReduce tasks to create before executing the main query.
61. What Is The Difference Between The Parameters sqoop.export.records.per.statement and sqoop.export.statements.per.transaction?
| sqoop.export.records.per.statement | sqoop.export.statements.per.transaction |
| It specifies the number of records that will be used in each insert statement. | It specifies how many insert statements are executed in each transaction before it is committed. |
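The effect of the two parameters together can be worked out with a little arithmetic, sketched below in Python. The function `plan_export` is hypothetical, and the sketch assumes a single export task processing all rows.

```python
def plan_export(num_rows, records_per_statement, statements_per_transaction):
    """Estimate statement and transaction counts for exporting num_rows rows."""
    if num_rows < 0 or records_per_statement < 1 or statements_per_transaction < 1:
        raise ValueError("row count must be >= 0 and both parameters >= 1")
    # Ceiling division: a final, partly filled statement or transaction still counts.
    statements = -(-num_rows // records_per_statement)
    transactions = -(-statements // statements_per_transaction)
    return {
        "insert_statements": statements,
        "transactions": transactions,
        "max_rows_per_commit": records_per_statement * statements_per_transaction,
    }


print(plan_export(1000, 10, 10))
print(plan_export(1050, 100, 10))
```

Both values are passed as -D properties on the sqoop export command line, for example -Dsqoop.export.records.per.statement=10.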
62. How Will You Implement All-or-nothing Load Using Sqoop?
Using the --staging-table option, we first load the data into a staging table and then move it to the final target table only if the staging load is successful.
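The all-or-nothing pattern can be sketched with Python's built-in sqlite3 module standing in for the target database. This is a simplified illustration and not Sqoop's implementation: the tables `staging` and `target`, their columns and the function `export_with_staging` are hypothetical.

```python
import sqlite3


def export_with_staging(conn, rows):
    """Load rows into staging, then move them to target only if every row loaded."""
    try:
        conn.execute("DELETE FROM staging")   # start from an empty staging table
        conn.executemany("INSERT INTO staging (id, name) VALUES (?, ?)", rows)
        # Staging load succeeded, so move everything to the final table.
        conn.execute("INSERT INTO target SELECT id, name FROM staging")
        conn.execute("DELETE FROM staging")
        conn.commit()
        return True
    except sqlite3.Error:
        conn.rollback()                       # the target table is left untouched
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

first = export_with_staging(conn, [(1, "a"), (2, "b")])    # clean batch
second = export_with_staging(conn, [(3, "c"), (4, None)])  # second row violates NOT NULL
count = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
print(first, second, count)
```

The bad batch leaves the target exactly as it was, which is the point of loading through a staging table.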
63. How Do You Clear The Data In A Staging Table Before Loading It By Sqoop?
By specifying the --clear-staging-table option we can clear the staging table before it is loaded. This can be done again and again until we get proper data in staging.
64. How Can You Sync An Exported Table With HDFS Data In Which Some Rows Are Deleted?
Truncate the target table and load it again.
65. How Can You Export Only A Subset Of Columns To A Relational Table Using Sqoop?
By using the --columns parameter, in which we mention the required column names as a comma-separated list of values.
66. How Can We Load To A Column In A Relational Table Which Is Not Null But The Incoming Value From HDFS Has A Null Value?
By using the --input-null-string parameter we can specify a default value, and that will allow the row to be inserted into the target table.
67. How Can You Schedule A Sqoop Job Using Oozie?
Oozie has in-built Sqoop actions, inside which we can mention the Sqoop commands to be executed.
68. Sqoop Imported A Table Successfully To HBase But It Is Found That The Number Of Rows Is Fewer Than Expected. What Can Be The Cause?
Some of the imported records might have null values in all the columns. As HBase does not allow all null values in a row, those rows get dropped.
69. Give A Sqoop Command To Show All The Databases In A MySQL Server.
- $ sqoop list-databases --connect jdbc:mysql://database.example.com/
70. How Can You Force Sqoop To Execute A Free Form SQL Query Only Once And Import The Rows Serially?
By using the -m 1 clause in the import command, Sqoop creates only one MapReduce task, which imports the rows sequentially.
71. In A Sqoop Import Command You Have Mentioned To Run 8 Parallel Mapreduce Task But Sqoop Runs Only 4. What Can Be The Reason?
The Mapreduce cluster is configured to run 4 parallel tasks. So the sqoop command must have a number of parallel tasks less or equal to that of the MapReduce cluster.
72. What Is The Importance Of The --split-by Clause In Running Parallel Import Tasks In Sqoop?
The --split-by clause names the column whose values are used to divide the data into groups of records. These groups of records are read in parallel by the MapReduce tasks.
73. What Does This Sqoop Command Achieve? $ sqoop import --connect <connect-str> --table foo --target-dir /dest
It imports the table foo from a database into files in the HDFS directory /dest.
74. What Happens When A Table Is Imported Into An HDFS Directory Which Already Exists Using The --append Parameter?
Using the --append argument, Sqoop will import data to a temporary directory and then rename the files into the normal target directory in a manner that does not conflict with existing file names in that directory.
75. How To Import Only The Updated Rows From A Table Into HDFS Using Sqoop Assuming The Source Has Last Update Timestamp Details For Each Row?
By using the lastmodified mode. Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported.
76. What Are The Two File Formats Supported By Sqoop For Import?
- 1. Delimited text
- 2. Sequence Files
77. Give A Sqoop Command To Import The Columns employee_id, first_name, last_name From The MySQL Table Employee.
- $ sqoop import --connect jdbc:mysql://host/dbname --table EMPLOYEES --columns "employee_id,first_name,last_name"
78. Give A Sqoop Command To Run Only 8 MapReduce Tasks In Parallel.
- $ sqoop import --connect jdbc:mysql://host/dbname --table table_name -m 8
79. What Does The Following Query Do?
$ sqoop import --connect jdbc:mysql://host/dbname --table EMPLOYEES --where "start_date > '2017-03-31'"
It imports the employees who have joined after 31-Mar-2017.
80. Give A Sqoop Command To Import All The Records From The Employee Table Divided Into Groups Of Records By The Values In The Column Department_id.
- $ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES --split-by dept_id
81. What Does The Following Query Do?
$ sqoop import --connect jdbc:mysql://db.foo.com/somedb --table sometable --where "id > 100000" --target-dir /incremental_dataset --append
It performs an incremental import of new data, after having already imported the first 100,000 rows of a table.
82. Give A Sqoop Command To Import Data From All Tables In The MySQL DB DB1.
- sqoop import-all-tables --connect jdbc:mysql://host/DB1
83. Give A Command To Execute A Stored Procedure Named proc1 Which Exports Data To A MySQL DB Named DB1 From An HDFS Directory Named Dir1.
- $ sqoop export --connect jdbc:mysql://host/DB1 --call proc1 --export-dir /Dir1
84. Which Database Does The Sqoop Metastore Run On?
Running sqoop-metastore launches a shared HSQLDB database instance on the current machine.