Everything you need to know about machine learning: part 2

This is the second of a three-part series. Part 1 covers the basics of machine learning, while this article takes a more in-depth look at Microsoft Azure Machine Learning and how to access it via Web services. The third part will go through some examples.

In the first part of this series on Microsoft Azure Machine Learning (MAML), I laid out the basics of machine learning and introduced some terminology. I also showed a high-level example that takes you from your starting data all the way to testing the model. In Part 2, we'll go through this process again with a practical example based on the Titanic data set. At the end, we'll wrap it up by supplying some new input values and making a prediction from them.

Since MAML is all about cloud computing on Microsoft Azure’s public cloud, I'm going to show you how to use Web services to supply the input data and get the prediction returned. In the next section, I'll jump right into a hands-on example using the terminology introduced in Part 1, so it might be a good idea to go back and re-read the first part as a quick refresher.

Hands-on example

Microsoft provides an online, Web-based interface for working with MAML, named Azure Machine Learning Studio. I’m not going to get into the finer details of Azure ML Studio, like how to drag and drop and link things together, but in Part 3 of this series, you'll get a video that walks you through how I’ve assembled all of the modules or blocks together. If you need additional guidance, head here and enter “Azure Machine Learning” in the search box, and additional information will be available.

Working with the Kaggle Titanic data is an example of a supervised machine learning scenario: I have data in which I define a label and a set of features. In this case, the features I'll use to create my model are “passenger class,” “sex,” “age” and “fare,” and the label is whether the particular person survived or not, which is given by “survived.” This is what the workflow for this example can look like in MAML Studio:

[Image: maml 01]

If we go block-by-block starting from the top, briefly:
  • Add our data set.
  • Select certain features (because such things as the name may not be relevant to training a model).
  • Clean our data (some passengers had an empty age, so I filled in empty values with the median age).
  • Split the data (I chose to use 80 percent of my data to train my model, and the remaining 20 percent to test).
  • Set up the model and train it with the 80 percent of the original data, initializing it as a “Multiclass Decision Forest,” which is a classification-type algorithm.
  • Score the model.
  • Evaluate the model.
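To make the select, clean and split steps above concrete, here is a minimal sketch in plain Python. It is not what MAML Studio runs internally: the ten rows are made up for illustration (they are not real Kaggle records), the column and function names are my own, and shuffling the rows before the split is an assumption. The training, scoring and evaluation blocks are left out.

```python
import random
from statistics import median

# Hypothetical rows for illustration only, not real Kaggle Titanic records.
COLUMNS = ["name", "pclass", "sex", "age", "fare", "survived"]
RAW_ROWS = [
    ("P1", 1, "female", 35.0, 75.0, 1),
    ("P2", 3, "male", 22.0, 7.25, 0),
    ("P3", 2, "female", None, 13.0, 1),
    ("P4", 3, "male", 54.0, 8.05, 0),
    ("P5", 1, "male", 2.0, 51.0, 1),
    ("P6", 3, "female", 27.0, 11.1, 1),
    ("P7", 2, "male", None, 26.0, 0),
    ("P8", 3, "male", 14.0, 7.9, 0),
    ("P9", 1, "female", 40.0, 80.0, 1),
    ("P10", 2, "male", 58.0, 10.5, 0),
]
passengers = [dict(zip(COLUMNS, row)) for row in RAW_ROWS]

FEATURES = ["pclass", "sex", "age", "fare"]
LABEL = "survived"


def select_columns(rows, columns):
    """Keep only the listed columns (this drops 'name', for example)."""
    return [{c: row[c] for c in columns} for row in rows]


def fill_missing_with_median(rows, column):
    """Replace missing (None) values in a column with the median of the known values."""
    fill = median(row[column] for row in rows if row[column] is not None)
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]


def split_rows(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows, then cut it into training and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


selected = select_columns(passengers, FEATURES + [LABEL])
cleaned = fill_missing_with_median(selected, "age")
train, test = split_rows(cleaned)
print(len(train), "training rows,", len(test), "test rows")
```

With these ten rows, the two passengers with no age get the median of the eight known ages, and the split leaves eight rows for training and two for testing.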
Once I run this, I can visualize how well my model performed against the 20 percent test data, which is presented in the form of a “confusion matrix.”

[Image: maml 02]

Using my trained model, I predicted non-survivors (“0”) with a 90.1 percent success rate, and I predicted survivors (“1”) with a 67.2 percent success rate. So, my model does pretty well with predicting non-survivors, but just so-so with survivors. This raises the question: is there a way to improve on the success rate of my predictions?
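For reference, per-class success rates like the ones read off the confusion matrix can be computed by hand. The sketch below assumes that each rate is the share of passengers of a given actual class that the model labeled correctly. The label lists are hypothetical and are not the actual test data from my experiment.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) pair of labels occurs."""
    return Counter(zip(actual, predicted))


def per_class_rate(actual, predicted):
    """For each actual class, the fraction of its rows that were predicted correctly."""
    counts = confusion_counts(actual, predicted)
    totals = Counter(actual)
    return {label: counts[(label, label)] / totals[label] for label in sorted(totals)}


# Hypothetical labels: 10 actual non-survivors (0) and 6 actual survivors (1).
actual = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

rates = per_class_rate(actual, predicted)
for label, rate in rates.items():
    print(f"class {label}: {rate:.1%} predicted correctly")
```

In this made-up example, 9 of the 10 non-survivors and 4 of the 6 survivors are predicted correctly.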

Getting a better prediction

I used an algorithm named “Multiclass Decision Forest,” but was that the best choice? Remembering that I’m dealing with a classification-type problem, MAML Studio provides another algorithm named “Multiclass Neural Network.” This part of MAML is awesome; with just a few clicks, I can add another model to my experiment. I’m just showing the important part of the workflow here, where I’ve added the new algorithm and joined all of the blocks together, and I’m using the same training and test data:

[Image: maml 03]

So now, I’ve initialized, trained and scored both models, and I can evaluate them. Not only that, but I can combine the evaluation into one single view:

[Image: maml 04]

The “confusion matrix” to the left is the same as in my original experiment, and the one to the right is my new matrix based on the algorithm I added. If you compare both matrices, you might notice that my second model was more accurate in predicting non-survivors (now 97.3 percent instead of 90.1 percent), but it was worse when predicting survivors (now 55.2 percent instead of 67.2 percent). It’s quite possible that another classification algorithm would provide better predictions, and there’s no shortage of possibilities, as MAML currently provides 14 different algorithms. There’s a lot of experimenting that can be done here to try to find the best model in a particular scenario.

In the final part of this series, I’ll talk more about algorithm choices. You need to pick the right type of algorithm, because some are likely better suited to a classification task and others to a clustering task (MAML Studio can help guide you with its samples). Here’s a snapshot of three current scenarios:

[Image: maml 05]

For example, sample five would likely fit well with what I've done with the Kaggle Titanic data experiment. Now, I need to talk about how to access the Web service remotely, because that’s what this is all about: think “Data Scientist as a Service.”
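One rough way to put the two models side by side is to average each model's per-class success rates. The sketch below uses only the four percentages quoted above. The unweighted average is my own choice for illustration, not a figure reported by MAML Studio, and it is not the same as overall accuracy, which would also depend on how many survivors and non-survivors are in the test data.

```python
# Per-class success rates (percent) quoted in the article, keyed by class label.
results = {
    "Multiclass Decision Forest": {0: 90.1, 1: 67.2},
    "Multiclass Neural Network": {0: 97.3, 1: 55.2},
}


def mean_per_class_rate(rates):
    """Unweighted average of the per-class rates (ignores class sizes)."""
    return sum(rates.values()) / len(rates)


summary = {name: round(mean_per_class_rate(r), 2) for name, r in results.items()}
best = max(summary, key=summary.get)

for name, score in summary.items():
    print(f"{name}: {score}")
print("Higher average per-class rate:", best)
```

By this particular measure the decision forest comes out slightly ahead (78.65 against 76.25), but a different measure, or a different cost attached to each kind of mistake, could reverse that ranking.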

Using the Web service

The above runs us through an example that eventually provides us with a model we can use to make predictions. The point of having MAML in the cloud is that you want to be able to interact with it easily. You can publish your model by marking the appropriate points in your MAML workflow as the published input and output (see the video in Part 3 of this series, the MAML Studio documentation or Channel9 videos for more details). Then, you can call the Web service from R, Python or C#, with sample code provided in MAML Studio to help you. Below, I’m showing an example where I'm using Windows PowerShell to access the Web service:

[Image: maml 06]

Notice that the important features I passed included passenger class: 1, sex: F, age: 35 and fare: 75. My model is telling me it predicts that this person should have survived (the “1” at the end). If you’re interested, you can find my sample code here.

This sets up my Web service in a staging environment that doesn’t guarantee any kind of uptime. To get a published service level, I need to “productionize” my Web service, which we'll discuss next. (At the time of writing, I couldn't find anything officially stating what service level Microsoft guaranteed for this service.)
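The same kind of call can be sketched in Python. This is not my PowerShell sample, and several details are assumptions: the URL and API key are placeholders, the JSON layout of the payload is invented for illustration, and sending the key as a bearer token is assumed. The sample code shown in MAML Studio for your own service is the authoritative source for all three. The sketch only builds the request; the lines that would send it are commented out.

```python
import json
import urllib.request

# Hypothetical placeholders: use the URL and key shown for your own published service.
SERVICE_URL = "https://example.invalid/score"
API_KEY = "YOUR-API-KEY"


def build_scoring_request(url, api_key, features):
    """Build (but do not send) a POST request carrying one passenger's features."""
    # Assumed payload layout; check the sample code in MAML Studio for the real one.
    body = json.dumps({"features": features}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Assumption: the API key is passed as a bearer token.
        "Authorization": "Bearer " + api_key,
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# The same input values used in the PowerShell example.
passenger = {"pclass": 1, "sex": "F", "age": 35, "fare": 75}
request = build_scoring_request(SERVICE_URL, API_KEY, passenger)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To call a live service, swap in a real URL and key, then uncomment:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```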

Making money from my model

Microsoft provides the Azure Marketplace as a sort of Azure application store where you can make your work public, and even charge others to use your service. By default, you need an API key to access the Web service mentioned in the previous section. If you really want to put your model in production, without having to give out an API key and with the option of charging people to use it, you would publish your Web service to the Azure Marketplace, which would provide an OData endpoint. You can see existing, published Web services specifically related to ML here.

Up next

This wraps up Part 2 of our machine learning series and has hopefully made you more confident in using MAML. In Part 3, I’ll provide a few more examples, possibly with a video that walks you through what I’ve done to create the Kaggle Titanic experiment and close the loop on a few more items that I’ve mentioned in this series so far.

Contributor

Marco Shaw

Marco Shaw is an IT consultant working in Canada. He has been working in the IT industry for over 12 years. He was awarded the Microsoft MVP award for his contributions to the Windows PowerShell community for 5 consecutive years (2007-2011). He has co-authored a book on Windows PowerShell, contributed to Microsoft Press and Microsoft TechNet magazine, and also contributed chapters for other books such as Microsoft System Center Operations Manager and Microsoft SQL Server. He has spoken at Microsoft TechDays in Canada and at TechMentor in the United States. He currently holds the GIAC GSEC and RHCE certifications, and is actively working on others.