Apache pig tutorials LEARNOVITA

What is Apache Pig ? : A Definitive Guide | Everything You Need to Know [ OverView ]

Last updated on 03rd Nov 2022, Artciles, Blog

About author

Yamni (Apache Maven Engineer )

Yamni has 5+ years of experience in the field of Apache Maven Engineer. Her project remains a healthy top-level project of the Apache Foundation as AWS Athena, CSV, JSON, ORC, Apache Parquet, and Avro. She has skills with PostgreSQL RDS, DynamoDB, MongoDB, QLDB, Atlas AWS, and Elastic Beanstalk PaaS.

(5.0) | 19845 Ratings 2227
    • In this article you will learn:
    • 1.What is a Pig in Hadoop?
    • 2.Components of Pig.
    • 3.How Pig Works and Stages of a Pig Operations.
    • 4.Salient Features of Pig.
    • 5.Data Model in Pig.
    • 6.Pig Execution Modes.
    • 7.Pig Interactive Modes.
    • 8.Pig Commands.
    • 9.Conclusion.

What is a Pig in Hadoop?

Pig is a scripting platform that runs on Hadoop clusters designed to process and analyse a large datasets. Pig is an extensible, self-optimizing, and also easily programmed.Programmers can use a Pig to write a data transformations without knowing Java. Pig uses both the structured and unstructured data as input to perform analytics and uses HDFS to store results.

Pig – Example:

Yahoo scientists use a grid tools to scan through a petabytes of data. Many of them write a scripts to test a theory or gain deeper insights; however, in data factory, data may not be in standardized state. This makes Pig a good option as it supports the data with partial or unknown schemas and also as semi or unstructured data.

Components of Pig:

There arethe two major components of a Pig:

  • Pig Latin script language.
  • A runtime engine.

Pig Latin script language:

The Pig Latin script is the procedural data flow language. It contains syntax and commands that can be applied to an implement business logic. Examples of a Pig Latin are LOAD and STORE.

A runtime engine:

The runtime engine is compiler that produces a sequences of a MapReduce programs. It uses HDFS to save and retrieve data. It is also used to interact with a Hadoop system (HDFS and MapReduce).A runtime engine parses, validates, and compiles script operations into a v sequence of MapReduce jobs.

Apache Pig

How Pig Works and Stages of a Pig Operations:

Pig operations can be explained in a following three stages:

Stage 1: Load a data and write a Pig script.

In this stage, data is be loaded and Pig script is written.

  • A = LOAD ‘myfile’
  • AS (x, y, z);
  • B = FILTER A by a x > 0
  • C = GROUP B BY x;
  • D = FOREACH A GENERATE
  • x, COUNT(B);
  • STORE D INTO ‘output’;

Stage 2: Pig Operations.

In a second stage, Pig execution engine Parses and also checks script. If it passes script optimized and logical and physical plan is generated for execution.The job is be submitted to Hadoop as a job explained as a MapReduce Task. Pig Monitors a status of job using a Hadoop API and reports to a client.

Stage 3: Execution of plan

In a final stage, results are dumped on section or stored in a HDFS depending on user command.

Salient Features of Pig:

Developers and analysts like to use a Pig as it offers more features. Some of features are as follows:

  • Provision for a step-by-step procedural control and the ability to an operate directly over files.
  • Schemas that, though the optional, can be assigned dynamically.
  • Support to the User Defined Functions, or UDFs, and to different data types.

Data Model in Pig:

Atom: It is the simple atomic value like int, long, double, or string.

Tuple: It is the sequence of fields that can be of any data type.

Bag: It is the collection of tuples of potentially varying structures and can contain the duplicates.

Map: It is the associative array.

The key must be a char array, but a value can be of any type. By default, Pig treats undeclared fields as byte arrays, which are collections of an uninterpreted bytes. Pig can infer field’s type based on a use of operators that expect a certain type of field. It can also use User Defined Functions or UFDs, with known or explicitly set the return type. Furthermore, it can infer field type based on a schema information provided by a LOAD function or explicitly declared using a AS clause.

Apache Pig in Hadoop

Nested Data Model:

  • Pig Latin has fully-nestable data model with the Atomic values, Tuples, Bags or lists, and Maps. This implies a one data type can be nested within a another.
  • The advantage is that this is the more natural to programmers than flat Tuples. Also, it avoids the expensive joins. Now will look into various execution modes pig works in.

Pig Execution Modes:

Pig works in a two execution modes: Local and MapReduce:

Local mode:

In a local mode, Pig engine takes input from the Linux file system and the output is stored in same file system. Pig Execution local mode is an explained below.

MapReduce mode:

In MapReduce mode, Pig engine directly interacts and also executes in HDFS and MapReduce.

Pig Interactive Modes:

The two modes in which Pig Latin program can be written Interactive and Batch.

Interactive mode:

In Interactive mode means coding and executing a script, line by line.

Batch mode:

Batch mode, all scripts are coded in a file with the extension .pig and a file is directly executed.

Pig Commands:

Given below in a table are some frequently used a Pig Commands:

    CommandFunction
    Load Reads data from a system
    Store Writes data to the file system
    Foreach Applies expressions to every record and outputs one or more records
    Filter Applies predicate and removes a records that do not return true
    Group/cogroup Collects a records with a same key from one or more inputs
    join Joins two or more inputs based on the key
    order Sorts records based on the key
    distinct Removes a duplicate records
    union Merges a data sets
    split Splits a data into two or more sets based on a filter conditions
    stream Sends all the records through a user-provided binary
    dump Writes a output to stdout
    limit Limits number of records

Conclusion:

Pig is a technology that builds a bridge between the Hadoop, Hive and other data management technologies which can be further used to eradicate a problems related to management of big or any size of a data. MapReduce has gone out of style because it is way simpler to write code in Pig and also Apache Spark is partly responsible. On top of it, it is high effective also as we have seen that 10 lines of a MapReduce code is equivalent to the single line of Pig code.

Are you looking training with Right Jobs?

Contact Us

Popular Courses