Over the last two years, I have been working on a program that makes it possible for novices to work with big data analytics.
Red Sqirl is a web-based big data application that simplifies the analysis of large data sets. With Red Sqirl, you can quickly access the power of the Hadoop ecosystem, analyzing massive amounts of data rapidly and cost-effectively. It is an open platform that users can extend, and it simplifies working with the Hadoop ecosystem (Hadoop, Hive, Pig, HBase, Oozie, etc.) because you no longer have to master each of those underlying technologies. This makes Red Sqirl well suited to improving the productivity of data scientists.
Red Sqirl accesses your data through third-party processing and storage engines, and organizes access to those engines into packages. For example, the Red Sqirl Pig package gives you access to Apache Pig. Red Sqirl also groups reusable pieces of analysis into models.
Furthermore, Red Sqirl is open source and free.
You can learn more at www.redsqirl.com
Additionally, a video version of this tutorial is available: https://youtu.be/LL6adYq4YL4
The idea here is to give a glimpse of Red Sqirl's abilities.
Red Sqirl is a platform that you can install directly on top of a Hadoop cluster. For this tutorial, you will need the Red Sqirl Docker image.
At this point we’ll assume you already know what Red Sqirl is, and that you’ve installed it successfully.
If you want to follow along with this tutorial step by step in your local environment, you’ll need to know where the Red Sqirl installation folder is (/opt/redsqirl-2.6.0-0.12/) and have the Red Sqirl Pig package already installed.
To sign in, we’ll just use the OS username and password that you used to install Red Sqirl.
If you’re signing in for the first time, you’ll be prompted with a window about updating your footer menu. Click OK.
So this is our Red Sqirl interface. As you can see, it’s made up of three main sections.
The flow chart canvas:
The flow chart canvas is where we create a data analysis workflow. It also contains the actions that we can use for our analyses. We’ll find them at the bottom, in the canvas footer.
The remote file system:
This is where we can connect to any SSH server. For instance, you can connect to localhost, which corresponds to the server that Red Sqirl is installed on.
Click on the remote file system tab and then click on the plus button. Then fill out the form as necessary. Now we can see our remote file system.
The help tab is made up of:
These are the three main parts of the Red Sqirl interface.
Inside the Red Sqirl installation folder we’ve already given you some data for this tutorial. So what we need to do now is copy this data into our HDFS (Hadoop Distributed File System).
To do this we’ll first go back to the Remote File System.
There are a few different files in this folder. For this tutorial we’ll be using the file named
getting_started.txt. Now we can click on
tutorial_data and click on the
Go back to the flowchart canvas section.
To accomplish these tasks, we first need to change the footer. The action footer is the little frame on the bottom left of the FlowChart tab.
You should see a new footer tab called “extraPig” with “Pig Audit” inside. To remove this new menu, you need to do the following.
Let’s analyze the data.
pig_tutorial_data.mrtxt. If you cannot find it, refresh the view by clicking on the search button. Click OK.
Let's rename the fields:
pig_tutorial. By default, it is saved in the redsqirl-save HDFS directory and the file will have the extension ‘.rs’. Click OK to save.
How do we view the modified data?
We can now see that the arcs around the source action icon have changed.
The arcs around the icons give information about the status of that action.
To check what the arcs mean, just click on the legend on the top left of the canvas. To hide the legend, just click it again.
Now we’re ready to start processing data.
The Pig Aggregator is an action that lets you apply aggregation functions (AVG, MAX, SUM, and so on) to selected columns, as you would in an SQL statement. The action groups the rows by the selected attributes. If no attributes are selected, it groups by all, which is the default.
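To make the grouping behaviour concrete, here is a minimal Python sketch of what an aggregation like this computes. It is not Red Sqirl or Pig code. The sample rows, the subscriber_id field, and the aggregate_sum helper are made up for illustration; only the offpeak_voice and peak_voice field names come from this tutorial.

```python
from collections import defaultdict

# Hypothetical sample rows: the values and "subscriber_id" are invented;
# the two voice fields are named after the ones used in this tutorial.
rows = [
    {"subscriber_id": "a", "offpeak_voice": 10, "peak_voice": 5},
    {"subscriber_id": "a", "offpeak_voice": 2, "peak_voice": 3},
    {"subscriber_id": "b", "offpeak_voice": 7, "peak_voice": 1},
]


def aggregate_sum(rows, group_by=None):
    """Compute SUM(offpeak_voice + peak_voice) per group.

    If group_by is None, every row falls into a single group named "all",
    mirroring the default of grouping by all when no attribute is selected.
    """
    totals = defaultdict(int)
    for row in rows:
        key = row[group_by] if group_by else "all"
        totals[key] += row["offpeak_voice"] + row["peak_voice"]
    return dict(totals)


per_subscriber = aggregate_sum(rows, group_by="subscriber_id")
overall = aggregate_sum(rows)

print(per_subscriber)
print(overall)
```

Selecting an attribute gives one output row per distinct value of that attribute, while selecting none collapses the whole dataset into a single row.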
In the Pig footer menu, we can select the aggregator action.
We can also sort the rows. At the top of the table, click the “+” symbol to add a new row to the table. One thing to note: the check box on each row is only used for sorting and deleting; we don’t need to have each row ticked in order to continue.
Click on the pen in the Operation field of the new row, click the “SUM()” function, and add the parameters “communication.offpeak_voice” and “communication.peak_voice”. In between the parameters, add a “+” symbol so that the operation reads “SUM(communication.offpeak_voice + communication.peak_voice)”.
In the menu, click Project and then click “Save & Run”.
This will take some time to run because Pig runs its jobs on MapReduce.
And that’s it!
As soon as the process is finished, we can see the results. Go to the Pig Aggregator Action, go into options, and click on “data output.” We can see the path at the top and we can download our results as a CSV.
To use the datasets together, we need to perform a join on them.
Now we want to add a condition to see which subscribers have more total voice calls than the average across the entire dataset. The easiest way would be to add the condition in the Join action, but we will create a new Pig Select action for demonstration purposes.
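One way to picture the join followed by the select condition is the small Python sketch below. It is an illustration only, not what Red Sqirl generates. The totals, the subscriber IDs, and the column names total_voice and avg_total_voice are assumptions. The high_voice variable simply borrows the name of the action used in this tutorial.

```python
from statistics import mean

# Hypothetical per-subscriber voice totals (invented values).
totals = {"a": 20, "b": 8, "c": 14}

# Average over the entire dataset: a single value.
average_total = mean(totals.values())

# "Join" step: attach the dataset-wide average to every subscriber row,
# so both values sit side by side and can be compared.
joined = [
    {"subscriber_id": sid, "total_voice": total, "avg_total_voice": average_total}
    for sid, total in totals.items()
]

# "Select" step: keep only subscribers whose total exceeds the average.
high_voice = [
    row["subscriber_id"]
    for row in joined
    if row["total_voice"] > row["avg_total_voice"]
]

print(high_voice)
```

Keeping the condition in a separate step, as the tutorial does, means the joined intermediate result stays available for inspection before any rows are filtered out.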
We would now like to store the “nl_vs_total” intermediate result before running the workflow. To save the result:
You can now “Save & Run” and see your results in the Data Output of “nl_vs_total” and “high_voice”.
To see the results, hover over the action “nl_vs_total” or “high_voice”, then go to Options > Data Output. You can close the window by clicking “Cancel” or “OK”.
Once you are happy with the result you can clean all the data generated by this workflow by clicking on “Select All” and then “Clean Actions” in the “Edit” top menu.
The final canvas should look something like this:
Red Sqirl is a program that simplifies and facilitates data analysis. Analysis methods in Red Sqirl are efficient; they are quickly created, easily shared, and easily reused.
In this tutorial, we