Pig vs. Java MapReduce: what to know
Any guesses on what that crazy Welsh-looking text above is? Can you decipher what I asked? If you guessed the made-up language known as Pig Latin, you guessed correctly. I asked whether you want to write MapReduce jobs and have ever tried using Pig Latin. However, the Pig Latin I'm talking about isn't made up, and it will help you write MapReduce jobs without using Java. But before we jump into exactly what Pig Latin is, let's talk about getting started in the Hadoop stack and MapReduce.
Writing MapReduce jobs: getting started
When I first started out in the Hadoop stack, I didn't know where to start. I knew I wanted to begin writing MapReduce jobs, but I didn't know how. I had some Java experience, but it was from years ago. And since Java is Hadoop's native language, I was worried about writing my first MapReduce job.
Sound familiar? You may be asking yourself questions like:
- How long will it take to get my first program written?
- What do you mean I need to use Maven to import the libraries?
- How do I extend Mapper or Reducer? (I have no idea!)
A lot of these questions can seem intimidating if you're not familiar with Java, but guess what? For my first MapReduce job, I didn't have to learn any of these things. In fact, I only needed to be familiar with basic SQL. That's possible when you use Pig Latin instead of Java.
What is Pig Latin, the programming language?
Pig Latin is Pig's language. It lets developers sort, join, parse, transform and calculate unstructured and semi-structured data in MapReduce, all in a language that resembles SQL rather than Java.
If Pig Latin is Pig's language, what is Pig?
Pig is an application that works on top of MapReduce, YARN or Tez. Pig is written in Java and compiles Pig Latin scripts into MapReduce jobs. Think of Pig as a compiler that takes Pig Latin scripts and turns them into the MapReduce jobs you would otherwise have to write in Java.
Pig vs. Java MapReduce: word count comparison
Now that we know how Pig works, let's take a look at a comparison of a simple word count application written in both Java and Pig Latin.
Example: Java MapReduce word count
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in each input line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts collected for each word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: args[0] is the input path, args[1] is the output path.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = new Job(conf, "wordcount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}
Example from: http://wiki.apache.org/hadoop/WordCount
As we can see in the Java example, the program is around 65 lines with blank lines included. At the very beginning of the program, we have to write 10 import statements just to begin writing our MapReduce job. We also have to extend the Mapper and Reducer classes. Once we have our program, we still have to compile it and deploy it.
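To make the roles of the Mapper and Reducer classes a little more concrete, here is a minimal sketch of the same map, shuffle and reduce flow in plain Java, with no Hadoop involved. The class name MapReduceSketch and its methods are hypothetical names invented for this illustration; a real job distributes these steps across a cluster, which this sketch does not attempt.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical, Hadoop-free illustration of map -> shuffle -> reduce.
class MapReduceSketch {

    // "Map" step: emit a (word, 1) pair for every token in one line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    // "Shuffle" step: group the emitted values by key.
    // In a real job the framework does this between the map and reduce phases.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // "Reduce" step: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int val : values) {
            sum += val;
        }
        return sum;
    }

    // Runs the three steps in order over a small in-memory input.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        shuffle(pairs).forEach((word, ones) -> counts.put(word, reduce(ones)));
        return counts;
    }

    public static void main(String[] args) {
        // prints {java=1, latin=1, pig=3}
        System.out.println(wordCount(Arrays.asList("pig latin pig", "java pig")));
    }
}
```

The map and reduce methods here correspond to the two methods we had to write in the Hadoop version; everything else in that program is setup around them.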
Example: Pig Latin script
input_lines = LOAD '/tmp/word.txt' AS (line:chararray);
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/results.txt';
Example from: pig.apache.org
Pig vs. Java MapReduce: positives of using Pig instead of Java
First of all, looking at the Pig Latin script, we notice that it's only seven lines of code. If nothing else were different, we could say it's faster to write in Pig Latin just because of the number of lines. Next, we see that we don't have to import any libraries. Lastly, we can focus on how much easier it is to read for someone with a little SQL background. Look at some of the keywords we have already seen in SQL: Group By and Order By.
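For readers coming from the Java side, the sketch below walks through the same steps as the Pig Latin script (tokenize, filter, group, count, order) using plain Java streams on an in-memory list. It is only an illustration of the data flow, not how Pig executes the script. The class name PigStepsSketch is hypothetical, and splitting on whitespace is a simplifying assumption standing in for Pig's TOKENIZE.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration: each stream stage mirrors one Pig Latin statement.
class PigStepsSketch {

    static List<Map.Entry<String, Long>> orderedWordCount(List<String> inputLines) {
        Map<String, Long> wordCount = inputLines.stream()
            // FOREACH ... GENERATE FLATTEN(TOKENIZE(line)): one word per record
            // (assumption: whitespace split approximates TOKENIZE)
            .flatMap(line -> Arrays.stream(line.split("\\s+")))
            // FILTER words BY word MATCHES '\\w+'
            .filter(word -> word.matches("\\w+"))
            // GROUP ... BY word, then COUNT each group
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        // ORDER word_count BY count DESC
        // (ties are broken alphabetically here so the output is stable)
        return wordCount.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Long>comparingByKey()))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [pig=3, java=1, latin=1]
        System.out.println(orderedWordCount(Arrays.asList("pig latin pig", "java pig #1")));
    }
}
```

Even in this compact form, the Java version needs imports, generics and a comparator, while the Pig Latin script states each step in a single SQL-like line.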
Pig vs. Java MapReduce: drawbacks of using Pig instead of Java
Java is a first-class language in Hadoop and will always give the developer more options. However, Pig is written in Java and allows developers to write user-defined functions (UDFs) in Java that leverage Java libraries. So we can call Pig Latin a second-class language in MapReduce, with other languages like Python, Bash and C# being third-class.
Pig vs. Java MapReduce: top takeaways
- Pig is an application that runs on top of MapReduce and abstracts Java MapReduce jobs away from developers.
- The Pig Latin script uses far fewer lines of code than the Java MapReduce program.
- The Pig Latin script is easier to read for someone without a Java background.
- MapReduce jobs can be written in Pig Latin.
- Java is a great and powerful language, but it has a steeper learning curve than something like Pig Latin. Therefore, using a higher-level language like Pig Latin enables many more developers and analysts to write MapReduce jobs.
The more developers and analysts who can write MapReduce jobs, the more creative our applications become.
Ywhay otnay eginbay earninglay Igpay Atinlay odaytay?
Translation: Why not begin learning Pig Latin today?