We'd all love the ability to predict the future, regardless of our life experiences, occupation, and ethics. The idea of knowing something before it happens, and thus knowing something before everyone else, can put us at a major advantage in almost every competitive situation.
That brings us to predictive data analytics.
As an example, consider bank loans. With predictive analytics we can build a model that predicts whether a person would be given a loan if they applied for one.
To build our predictive model, we'll be using Red Sqirl.
Red Sqirl is a web-based big data application that simplifies data analysis using a drag-and-drop interface. With Red Sqirl you can easily access the power of the Hadoop ecosystem and its packages for different Hadoop technologies.
Using data taken from the Machine Learning Repository (located here) and Red Sqirl, we will build a predictive model using a Spark Decision Tree to predict whether an applicant would be rated "good" (1) or "bad" (2) for a loan if they applied to this bank.
To use Spark in Red Sqirl you'll need the following Python libraries on all data nodes:
More information about this is available here:
There are tutorials for starting with Red Sqirl here.
Video tutorial:
Written tutorials:
Once you're somewhat familiar with the Red Sqirl platform, we can start our Decision Tree Model tutorial.
Our goal is to predict if a bank will classify a person as “good” (score = 1) or “bad” (score = 2) using their data. Our prediction will help us determine if they should receive a loan.
The data used in the model are related to demographics, credit aim, history, account status, and many other barometers which allow decision makers in the bank to grant or withhold credit.
More information about the dataset can be found in the UCI archive: https://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data) (If the link does not work, search for "UCI German Credit Data".)
We’ll use the version of the data with categorical variables, the file named german.data.
You can simply download the data and follow along.
Before starting in Red Sqirl, we need to add an extra column, CustId, to the data. Its values are strings starting with GD001.
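The CustId step can be done in any tool you like. As a minimal sketch, assuming the raw german.data records are space-separated and that the new column goes first, a few lines of Python could do it. The function name `add_customer_ids`, the column position, and the sample records are all illustrative assumptions, not part of Red Sqirl.

```python
def add_customer_ids(rows, prefix="GD", width=3):
    """Prepend an ID such as GD001 to each non-empty record.

    Assumption: records are space-separated and the ID goes in the
    first column. Adjust the separator to match your own file.
    """
    records = [row.strip() for row in rows if row.strip()]
    return [
        f"{prefix}{i:0{width}d} {record}"
        for i, record in enumerate(records, start=1)
    ]


if __name__ == "__main__":
    # Shortened, made-up records used only to show the shape of the output.
    sample = ["A11 6 A34 1", "A12 48 A32 2"]
    for line in add_customer_ids(sample):
        print(line)
```

To apply this to the real file, read the lines of german.data, pass them to the function, and write the result to the .txt file you will upload.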
To start, create a german_credit.mrtxt folder in your Hadoop File System. Then download the data as a .txt file and place it in the german_credit.mrtxt folder.
Hover the mouse cursor over the source action to see some of its configuration details.
Next, drop a new Pig Select action icon onto the canvas.
Now we will save our model. Go to File -> Save as -> call it “ourmodel” -> click OK.
Then drop another Pig Select action icon onto the canvas.
Now drop another Pig Select action icon onto the canvas.
We are changing the output type to buffered to see intermediate results here. The arcs will change colour to blue.
You can also use Sample Pig to create the training and prediction datasets instead of creating the “prep” action with RANDOM() followed by the Pig Select actions.
You can link these actions to Spark Decision Tree as described in the next step.
We’ll build a confusion matrix that counts the cases our model classified correctly and incorrectly. Then we'll calculate accuracy, precision, and recall to measure the model's performance.
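To make these measures concrete, here is a minimal Python sketch of the same arithmetic outside Red Sqirl. It assumes "good" (1) is treated as the positive class; the function names and the label lists are hypothetical and are not results from the model.

```python
from collections import Counter

GOOD, BAD = 1, 2  # class labels used in this tutorial


def confusion_counts(actual, predicted):
    """Count each (actual, predicted) pair, i.e. the confusion matrix cells."""
    return Counter(zip(actual, predicted))


def evaluate(actual, predicted, positive=GOOD):
    """Return accuracy, precision and recall for the chosen positive class."""
    counts = confusion_counts(actual, predicted)
    total = sum(counts.values())
    correct = sum(n for (a, p), n in counts.items() if a == p)
    tp = counts[(positive, positive)]
    fp = sum(n for (a, p), n in counts.items() if p == positive and a != positive)
    fn = sum(n for (a, p), n in counts.items() if a == positive and p != positive)
    return {
        "accuracy": correct / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


if __name__ == "__main__":
    # Made-up labels for illustration only.
    actual = [1, 1, 1, 2, 2, 1, 2, 1]
    predicted = [1, 1, 2, 2, 1, 1, 2, 1]
    print(evaluate(actual, predicted))
```

Accuracy is the share of all cases on the matrix diagonal, precision is the share of predicted "good" cases that really are "good", and recall is the share of actual "good" cases the model found.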
In this confusion matrix, the "correct" cells are the following:
The "error" cells are:
Drop a Pig Aggregate action icon onto the canvas.
Now we’ll run our decision tree model by going to Project -> Save and run.
We counted the correctly and incorrectly classified cases in the last Pig Aggregate action, and the Pig Select action gave us the measures for evaluating the model's performance:
You may see different values here, depending on the sizes of your prediction and training datasets. Because we use RANDOM(), the number of cases that the conditions >= 0.3 and < 0.3 place in training and prediction may vary.
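The effect of the random threshold can be sketched in plain Python. This is an illustration of the idea, not the Pig code Red Sqirl generates. It assumes that rows with a random value >= 0.3 go to training and the rest go to prediction, and the function name `split_by_random` is hypothetical.

```python
import random


def split_by_random(rows, threshold=0.3, seed=None):
    """Assign each row to training or prediction using a random draw.

    Assumption: draws >= threshold go to training, draws < threshold
    go to prediction.
    """
    rng = random.Random(seed)
    training, prediction = [], []
    for row in rows:
        if rng.random() >= threshold:
            training.append(row)
        else:
            prediction.append(row)
    return training, prediction


if __name__ == "__main__":
    rows = list(range(1000))
    for seed in (1, 2, 3):
        training, prediction = split_by_random(rows, seed=seed)
        # Sizes hover around 700 / 300 but differ from run to run.
        print(seed, len(training), len(prediction))
```

Because each row is assigned independently, the split is only approximately 70/30, which is why the counts and the resulting performance measures change between runs.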
To determine how well this model performs, ask and answer these questions:
Now that the model is done and you have looked at its output and performance, you can try to predict whether you would get a loan from the bank, using the decision tree model to evaluate you as a “good” or “bad” customer.
All you need to do is add new lines to the prediction set with your data (or data from your family, friends, or colleagues) and run the model to answer the question: would you (or your loved ones) get a loan in the given circumstances?
When you fill in your data, leave the Cost field empty as this is the field value you are trying to predict.
You can also build a Decision Tree Model with the numerical version of these data, or a logistic regression model or an SVM, in Red Sqirl to predict whether you will get a loan, and then compare the performance of the different models and input types.
Enjoy!