Pig is an open source engine for executing parallelized data transformations which run on Hadoop. This course shows you how Pig can help you work on incomplete data with an inconsistent schema, or perhaps no schema at all.
Pig is open source software that is part of the Hadoop ecosystem of technologies. Pig is great at working with data that lies beyond traditional data warehouses. It deals well with missing, incomplete, and inconsistent data that has no schema. In this course, Data Transformations with Apache Pig, you'll learn about data transformations with Apache Pig. First, you'll start with the very basics: getting Pig installed and getting started with the Grunt shell. Next, you'll discover how to load data into relations in Pig and store transformed results to files via the load and store commands. Then, you'll work on a real-world dataset where you analyze accidents in NYC using collision data from the City of New York. Finally, you'll explore advanced constructs such as the nested foreach and get a brief glimpse into the world of MapReduce, including how easy it is to implement this construct in Pig. By the end of this course, you'll have a better understanding of data transformations with Apache Pig.
A problem solver at heart, Janani has a master's degree from Stanford and worked for 7+ years at Google. She was one of the original engineers on Google Docs and holds 4 patents for its real-time collaborative editing framework.
Course Overview
Hi, my name is Janani Ravi, and welcome to this course on performing data transformations using Apache Pig. I'll introduce myself first. I have a master's degree in electrical engineering from Stanford, and have worked at companies such as Microsoft, Google, and Flipkart. At Google I was one of the first engineers working on real-time collaborative editing in Google Docs, and I hold four patents for its underlying technologies. I currently work on my own startup, Loonycorn, a studio for high-quality video content. Pig is an open source engine, which is part of the Hadoop ecosystem of technologies. Pig is great at working with data that lies beyond traditional databases or data warehouses. Pig can deal well with missing, incomplete, or inconsistent data that has no schema. Pig has its own language for expressing data manipulations, which is Pig Latin. This course starts from the very basics. It gives you an overview of Pig, shows you how to get Pig installed and running on your system, and gets you started working with the Grunt shell. You will see how you can load data into relations in Pig and store transformed relations to files via the load and store commands. The main focus of the course is on how the data can be transformed to make it more useful for analysis. This course will cover the foreach generate command, along with a range of evaluation and filter functions. You'll also work on a real-world dataset, where you analyze accidents in New York City using collision data from the City of New York website. And finally, we'll cover advanced constructs such as the nested foreach, and also get a brief glimpse into the world of MapReduce, the parallel programming model which powers Hadoop.
Working with Basic Data Transformations
In this module, we'll focus on data transformations applied to relations in Pig, starting with the very basic ones. We had a brief introduction to the foreach command in the previous module. In this module, we'll see how we can use it with column names, as well as with column indices. In addition to the load and store functions, which we are familiar with and which we'll see in a little more detail, we'll see new categories of functions that we can work with in Pig: the evaluation (eval) and filter functions. We'll also work with an assorted list of other commands that you might find useful, such as distinct, order by (for sorting), limit, and split. Let's formally define the foreach and generate commands, which work on individual records in a relation. A relation can be conceptualized as a table in a traditional database. The column headers correspond to the names of fields in a relation, that is, if the names are available. Every row in a database table corresponds to a tuple in a Pig relation. Remember, though, that a traditional database has a very strict schema and doesn't deal well with unstructured data. Unstructured data is what Pig works well with. The foreach command iterates over every record, that is, every tuple, in a relation; it can also be applied to the tuples within an inner bag of a relation. You can then apply functions to individual fields within the records and project the fields that you're interested in. You can select specified fields, use them within expressions, and perform other formatting operations to get the data into the resulting relation in exactly the format you envision. The result of a foreach generate statement is a relation, which is assigned to a variable (an alias, in Pig's terms).
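To make the foreach generate idea concrete, here is a minimal sketch in Python rather than Pig Latin, so that it can be run without a Hadoop setup. It models a relation as a list of tuples and applies a set of expressions to every tuple. The `foreach_generate` helper, the `collisions` sample rows, and the field layout (date, borough, injured) are all hypothetical and are not taken from the course's dataset. The comments show roughly what the corresponding Pig Latin statements might look like.

```python
from typing import Any, Callable, List, Tuple

Record = Tuple[Any, ...]   # a Pig tuple, modeled as a Python tuple
Relation = List[Record]    # a Pig relation (a bag of tuples), modeled as a list

# Hypothetical sample rows: (date, borough, injured).
# The None stands in for a missing field, which Pig would treat as null.
collisions: Relation = [
    ("2016-01-01", "BROOKLYN", 2),
    ("2016-01-01", "QUEENS", 0),
    ("2016-01-02", None, 1),
]


def foreach_generate(relation: Relation,
                     *exprs: Callable[[Record], Any]) -> Relation:
    """Evaluate every expression against every tuple.

    Emits exactly one output tuple per input tuple, with one field
    per expression, which is the basic shape of FOREACH ... GENERATE.
    """
    return [tuple(expr(rec) for expr in exprs) for rec in relation]


# Projection by column index.
# Roughly: projected = FOREACH collisions GENERATE $1, $2;
projected = foreach_generate(collisions, lambda r: r[1], lambda r: r[2])


def lower_or_null(value: Any) -> Any:
    """Lower-case a string, passing a missing value through unchanged."""
    return value.lower() if value is not None else None


# Projection plus a function applied to one field.
# Roughly: formatted = FOREACH collisions GENERATE date, LOWER(borough);
formatted = foreach_generate(collisions,
                             lambda r: r[0],
                             lambda r: lower_or_null(r[1]))

for row in formatted:
    print(row)
```

The result of each call is a new list bound to a new name, which mirrors the way a foreach generate statement produces a new relation rather than modifying the one it iterates over.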