Pig vs. Java MapReduce: what to know

Oday ouyay antway otay itewray ApReduceMay obsjay? Everway iedtray usingway Igpay Atinlay?

Any guesses on what that crazy Welsh-looking text above is? Can you decipher what I asked? If you guessed the made-up language known as Pig Latin, you guessed correctly. I asked whether you want to write MapReduce jobs and whether you have ever tried using Pig Latin. However, the Pig Latin I'm talking about isn't made up, and it will help you write MapReduce jobs without using Java. But before we jump into exactly what Pig Latin is, let's talk about getting started in the Hadoop stack and MapReduce.

Writing MapReduce jobs: getting started

When I first started out in the Hadoop stack, I didn't know where to start. I knew I wanted to begin writing MapReduce jobs, but I didn't know how. I had some Java experience, but it was from years ago. And since Java is Hadoop's native language, I was worried about writing my first MapReduce job.

Sound familiar? You may be asking yourself questions like:

How long will it take to get my first program written?

What do you mean I need to use Maven to import the libraries?

How do I extend Mapper or Reducer? (I have no idea!)

A lot of these questions can seem intimidating if you're not familiar with Java, but guess what? For my first MapReduce job, I didn't have to learn any of these things. In fact, I only needed to be familiar with basic SQL. That's possible when you use Pig Latin instead of Java.

What is Pig Latin, the programming language?

Pig Latin is Pig's scripting language. It lets developers sort, join, parse, transform and calculate unstructured and semi-structured data in MapReduce, all while using a language that resembles SQL rather than Java.

If Pig Latin is Pig's language, what is Pig?

Pig is an application that runs on top of Hadoop, using MapReduce or Tez as its execution engine on a YARN cluster. Pig is written in Java and compiles Pig Latin scripts into MapReduce jobs. Think of Pig as a compiler that takes Pig Latin scripts and transforms them into the MapReduce jobs you would otherwise have to write in Java yourself.

Pig vs. Java MapReduce

Pig vs. Java MapReduce: word count comparison

Now that we know how Pig works, let's take a look at a comparison of a simple word count application written in both Java and Pig Latin.

Example: Java MapReduce word count

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in each input line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the 1s emitted for each word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = new Job(conf, "wordcount"); // newer Hadoop releases prefer Job.getInstance(conf, "wordcount")

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));  // args[0]: input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // args[1]: output path

        job.waitForCompletion(true);
    }

}

Example from: http://wiki.apache.org/hadoop/WordCount

As we can see in the Java example, the program is around 65 lines with the blank lines included. At the very beginning, we need 10 import statements just to begin writing our MapReduce job. We also have to extend the Mapper and Reducer classes. Once the program is written, we still have to compile it and deploy it.
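If the Mapper and Reducer classes feel abstract, it can help to see the same data flow without any Hadoop classes at all. The following is only a rough single-JVM sketch of the map, shuffle and reduce steps that the job above performs; the class name LocalWordCount and its method names are made up for illustration and are not part of Hadoop.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Illustration only: no Hadoop classes, everything runs in one JVM.
class LocalWordCount {

    // "Map" step: emit a (word, 1) pair for every token in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    // "Shuffle" step: collect all values that share a key.
    // In a real job, Hadoop does this for us between map and reduce.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // "Reduce" step: sum the values collected for one key.
    static int reduce(Iterable<Integer> values) {
        int sum = 0;
        for (int val : values) {
            sum += val;
        }
        return sum;
    }

    // Wire the three steps together for a small in-memory input.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        shuffle(pairs).forEach((word, ones) -> counts.put(word, reduce(ones)));
        return counts;
    }

    public static void main(String[] args) {
        // prints {java=1, latin=1, pig=3}
        System.out.println(wordCount(Arrays.asList("pig latin pig", "java pig")));
    }
}
```

The real job differs in the parts that matter at scale: the map and reduce calls run on many machines, and the shuffle moves data across the network. The sketch only shows what each step is responsible for.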

Example: Pig Latin script

input_lines = LOAD '/tmp/word.txt' AS (line:chararray);
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/results.txt';

Example from: pig.apache.org
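For readers coming from Java, here is a rough in-memory sketch of what each statement in the Pig Latin script does, written with Java streams. It is not how Pig executes the script, and it makes two simplifying assumptions: tokens are split on whitespace only (Pig's TOKENIZE also splits on a few other characters), and words with equal counts are kept in alphabetical order (Pig makes no such promise). The class name PigStepsInJava is made up for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustration only: mirrors the Pig Latin statements on a small in-memory list.
class PigStepsInJava {

    static Map<String, Long> orderedWordCount(List<String> inputLines) {
        Map<String, Long> wordCount = inputLines.stream()
                // words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
                // (assumption: whitespace-only splitting)
                .flatMap(line -> Arrays.stream(line.trim().split("\\s+")))
                // filtered_words = FILTER words BY word MATCHES '\\w+';
                .filter(word -> word.matches("\\w+"))
                // word_groups = GROUP filtered_words BY word;
                // word_count = FOREACH word_groups GENERATE COUNT(filtered_words) ...;
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));

        // ordered_word_count = ORDER word_count BY count DESC;
        // The sort is stable, so ties keep the TreeMap's alphabetical order.
        Map<String, Long> ordered = new LinkedHashMap<>();
        wordCount.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                .forEach(entry -> ordered.put(entry.getKey(), entry.getValue()));
        return ordered;
    }

    public static void main(String[] args) {
        // "!!" is dropped by the \w+ filter; prints {pig=3, java=1, latin=1}
        System.out.println(orderedWordCount(Arrays.asList("pig latin pig", "java pig !!")));
    }
}
```

Each stream operation lines up with one Pig statement, which is a fair hint at why the Pig script reads so compactly: Pig Latin is built around exactly these dataset-level operations.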

Pig vs. Java MapReduce: positives of using Pig instead of Java

First of all, looking at the Pig Latin script, we notice that it's only seven lines of code. If nothing else were different, we could say it's faster to write in Pig Latin just because of the number of lines. Next, we see that we don't have to import any libraries. Lastly, notice how much easier it is to read for someone with a little SQL background. Look at the keywords GROUP ... BY and ORDER ... BY, which mirror GROUP BY and ORDER BY in SQL.

Pig vs. Java MapReduce: drawbacks of using Pig instead of Java

Java is a first-class language in Hadoop and will always give the developer more options. However, Pig is written in Java and allows developers to write user-defined functions (UDFs) in Java that leverage Java libraries. So, we can call Pig Latin a second-class language in MapReduce, while other languages like Python, Bash and C# are third-class.

Pig vs. Java MapReduce: top takeaways

  • Pig is an application that runs on top of MapReduce and abstracts Java MapReduce jobs away from developers.
  • The Pig Latin script uses far fewer lines of code than the Java MapReduce program.
  • The Pig Latin script is easier to read for someone without a Java background.
  • MapReduce jobs can be written in Pig Latin.
  • Java is a great and powerful language, but it has a steeper learning curve than something like Pig Latin. Therefore, using a higher-level language like Pig Latin enables many more developers and analysts to write MapReduce jobs.

The more developers and analysts who can write MapReduce jobs, the more creative our applications become.

Ywhay otnay eginbay earninglay Igpay Atinlay odaytay?

Translation: Why not begin learning Pig Latin today?


Contributor

Thomas Henson is a Senior Software Engineer and Certified ScrumMaster. He has been involved in many projects from building web applications to setting up Hadoop clusters. Thomas’s specialization is with Hortonworks Data Platform and Agile Software Development. Thomas is a proud alumnus of the University of North Alabama where he received his BBA - Computer Information System, and his MBA - Information Systems. He currently resides in north Alabama with his wife and daughter, where he hacks away at running.