Everything you need to know about machine learning: part 1

This is the first of a three-part series. This article covers the basics of machine learning. The second part will dive deeper into Microsoft Azure Machine Learning and how to access it via web services. Finally, the third part will go through some actual examples.

Data science is one of the hottest jobs these days, and it's no surprise given the wealth of information it provides us. Data science involves using and manipulating data to gain useful insight or knowledge. The kind of information we get from data science can be applied to several vital areas like fraud detection, sales forecasting, and image and language recognition, to name just a few.

But despite its growing popularity as a career choice, being a data scientist isn't easy. It's a role that draws from various disciplines including mathematics, statistics and programming, skills that can take years to master. And because the amount of data, both structured and unstructured, is growing beyond what most of us can handle manually, we now have something called machine learning, which lets us process this data using computer systems.

What exactly is machine learning?

Machine learning (ML) is a subfield of Artificial Intelligence (AI). Instead of writing explicit code to find and exploit patterns in data, we simply supply the data and let the computer system find those patterns for us.

Of course, not everything in data science or ML requires a master's or PhD, and making these techniques accessible is the goal of Microsoft Azure Machine Learning (MAML).

For example, to look for patterns or make predictions, it can help to approach a problem by creating a representation (we'll refer to this later as the "model") in a two-dimensional (X, Y) or three-dimensional (X, Y, Z) space. Unlike humans, computers are not limited when it comes to handling larger and larger amounts of data.

How can a computer learn from data?

First, it's important to know that there are different ways a computer can learn from data. There are two major learning scenarios for ML: supervised learning and unsupervised learning. In supervised learning, you already know some of the data's characteristics. For example, you may have millions of pictures labeled as cats or dogs, and you feed this data into your system so that it can determine whether future pictures it processes are cats or dogs.

But with unsupervised learning, you're feeding a lot of data to your system so that it can determine whether or not there's a pattern. Once the data has been processed, you can use what you've learned to make predictions on new data.
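
To make the difference concrete, here's a minimal sketch (mine, not from MAML) using Python and scikit-learn, with made-up numbers standing in for the pictures: the supervised classifier is given labels along with the data, while the clustering algorithm has to find groupings on its own.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = [[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [7.9, 8.1]]  # made-up feature vectors
y = ["cat", "cat", "dog", "dog"]                      # labels we already know

# Supervised: the model learns from X *and* the known labels y.
supervised = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(supervised.predict([[1.05, 1.0]]))              # -> ['cat']

# Unsupervised: the model only sees X and looks for structure on its own.
unsupervised = KMeans(n_clusters=2, n_init=10).fit(X)
print(unsupervised.labels_)                           # e.g. [0 0 1 1]: two groups found
```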

These two types of learning are good enough to get a basic understanding of ML. However, there are several other types that exist.

Other types of learning problems

It's pretty easy for a computer to ingest as much data as you can provide it (Hadoop, for one, is a good tool to store large amounts of data), but once you've fed this data to the system, you still need to guide it on what sort of problem or task you want it to handle.

Here are three major types of learning problems or tasks that MAML attempts to cover:

  • Classification: You want the computer to help you make predictions that typically fall into a true/false or positive/negative result. You can have more than just two possibilities here, but let's keep things simple for now.
  • Clustering: You're looking to determine whether there are groups of objects that share similar characteristics. For example, in a social network the results could be used to identify communities: groups of people centered on a particular person, a product or shared interests.
  • Regression: You're typically trying to predict a real value. An example could be trying to predict the sale price of a house based on characteristics such as city, age and size.
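
As a rough illustration only, here's how those three task types might map onto standard scikit-learn estimators in plain Python; MAML offers its own set of algorithms, so these class names are just stand-ins for the idea:

```python
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import KMeans

classification = LogisticRegression()   # predict a category, e.g. true/false
clustering = KMeans(n_clusters=3)       # find groups of similar objects
regression = LinearRegression()         # predict a real value, e.g. a sale price
```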

OK, now let's run through an example of machine learning at a very high level.

An example of machine learning

It's important to know that data science is often about experimentation. That's likely why Microsoft named your canvas in MAML "Experiments." In the final part of this series, I'll use data from www.kaggle.com to get into more detail on this. Using the Titanic example, we'll make predictions on whether a person survived the disaster or not. But first, let's walk through each step here, from the data to making predictions.

First, make sure your data is clean

When you have huge amounts of data, it won't always be clean; partially incomplete fields are one common issue. There are several approaches you can take here, depending on the data (it's important to understand what you're working with at a high level). For example, you can drop entire rows that have an empty value in some column, or you can fill empty fields with a value based on the remaining data, such as the column's median.
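
If you were doing this step in plain Python rather than in MAML, it might look something like this minimal pandas sketch; the table and column names are made up:

```python
import pandas as pd

df = pd.DataFrame({"age": [22, None, 35, 28], "fare": [7.25, 71.3, None, 8.05]})

dropped = df.dropna()                              # option 1: drop any row with an empty value
filled = df.fillna(df.median(numeric_only=True))   # option 2: fill gaps with each column's median

print(filled)
```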

Next, identify your features and labels

Not all data is meaningful, especially when it comes to making predictions. For example, your data may contain fields like first and last names, but that doesn't mean they're valuable in every case. When dealing with data about people, things like age, sex and height can be more useful than names.

Since you should already have a basic understanding of your data, you'll typically know the column headers. Once you've reviewed all of the column headers, it should be relatively easy to pick out the most useful information for your task. The columns you identify as useful are known as features.

In our upcoming Titanic example (stay tuned for that in the next part of this series) we'll try to predict survival. This will be our label. If you're still unsure about features and labels, don't sweat it – this will all become clearer in the next post.
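
As a rough sketch of the idea in plain Python with pandas (the rows and column names here are made up, and the real Titanic dataset's columns may differ), separating features from the label can be as simple as:

```python
import pandas as pd

df = pd.DataFrame({
    "name":     ["Smith, Mr. John", "Doe, Ms. Jane"],
    "sex":      ["male", "female"],
    "age":      [40, 28],
    "survived": [0, 1],
})

features = df[["sex", "age"]]   # the columns we think help predict the outcome
label = df["survived"]          # the value we want the model to predict
```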

Now, set up a model

Now that you have clean data, and you're clear on features and labels, the next step is to choose a model. A model is a high-level representation of your features and labels. As a simple example, we can plot house sale prices against house size. Typically, we'd expect something like this:

[Chart: house sale prices plotted against house size]

In this case, our feature is the house size and our label is the house price. This is a regression type problem. We're hoping to build a model based on the house size to help us predict price. If you look at the graph closely, you can almost draw an inclined straight line through the data points. In this example, our simplified model is “house size in 1000s of square feet x 1 = house price in 1000s of $.”

This allows us to make a prediction: How much would a 3,500 square foot house normally cost? Using this model, it would be $3,500. (That's a good deal even in my small hometown!)
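
Here's a minimal sketch of fitting that straight line with scikit-learn's LinearRegression, using made-up points that roughly follow the "y = x" relationship above (both axes in thousands):

```python
from sklearn.linear_model import LinearRegression

sizes = [[1.0], [1.5], [2.0], [2.5], [3.0]]   # house size, in 1000s of square feet
prices = [1.0, 1.6, 1.9, 2.6, 3.0]            # sale price, in 1000s of dollars

model = LinearRegression().fit(sizes, prices)
print(model.predict([[3.5]]))                 # roughly 3.5, i.e. about $3,500
```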

In the case of the Titanic data, we'll work with a classification model because our goal is to predict whether someone on the Titanic survived or not. So, we pick an algorithm best suited to classification (using an algorithm that typically deals with regression might produce an incorrect prediction). Picking the right algorithm can involve some trial and error, and there's no reason you can't try multiple algorithms and compare the results (more on that later).
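
As a rough illustration of comparing algorithms, here's a minimal sketch (mine, with made-up rows, and using cross-validation rather than the single split described in the next section) that scores two standard scikit-learn classifiers on the same data:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X = [[3, 22], [1, 38], [3, 26], [1, 35], [2, 28], [3, 40], [1, 54], [2, 19]]  # e.g. [class, age]
y = [0, 1, 1, 1, 0, 0, 1, 0]                                                  # 1 = survived

for candidate in (LogisticRegression(max_iter=1000), DecisionTreeClassifier()):
    scores = cross_val_score(candidate, X, y, cv=2)   # score each algorithm the same way
    print(type(candidate).__name__, scores.mean())
```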

Once you've chosen your algorithm, you'll start the process of developing that model with your data.

Training and testing the model

A typical approach is to take the available data and split it up into “training” and “testing.” There's no general rule for splitting data, but let's use an 80/20 split where 80 percent of the initial data is used to train the model, and the remaining 20 percent is used to test it. At this point, we provide 80 percent of the available data to our model or algorithm so it can learn based on what we've defined as the features.
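
In plain Python with scikit-learn, that split-and-train step might look like this minimal sketch; the rows are made up and only stand in for real Titanic data:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Made-up Titanic-style rows: [passenger class, age]; 1 = survived, 0 = did not.
X = [[3, 22], [1, 38], [3, 26], [1, 35], [2, 28], [3, 40], [1, 54], [2, 19], [3, 31], [1, 45]]
y = [0, 1, 1, 1, 0, 0, 0, 1, 0, 1]

# 80 percent of the rows go to training; 20 percent are held back for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # learn from the training portion only
```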

Once you've trained your model by processing all of the available training data, you should test the model on data it hasn't seen. This means using the test data's features in an attempt to predict the label. This part of testing is important: our Titanic test data actually shows whether each person survived or not, so we can use the features of our test entries to make predictions and then compare each prediction with the actual value.
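
A minimal sketch of that comparison, assuming we already have the model's predictions for the held-back rows (the arrays here are made up):

```python
from sklearn.metrics import accuracy_score

y_test = [1, 0, 0, 1, 0]        # what actually happened to the test passengers
predictions = [1, 0, 1, 1, 0]   # what the trained model predicted for the same rows

print(accuracy_score(y_test, predictions))   # fraction guessed correctly: 0.8
```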

At this point, we've tested our model, and if we're confident in the results, it's ready to make future predictions.

So, what's next?

In the next part of this series, we'll dig deeper into how to work with MAML via a web browser and how to access it remotely via web services.

Contributor

Marco Shaw

Marco Shaw is an IT consultant working in Canada. He has been working in the IT industry for over 12 years. He was awarded the Microsoft MVP award for his contributions to the Windows PowerShell community for 5 consecutive years (2007-2011). He has co-authored a book on Windows PowerShell, contributed to Microsoft Press and Microsoft TechNet magazine, and also contributed chapters for other books such as Microsoft System Center Operations Manager and Microsoft SQL Server. He has spoken at Microsoft TechDays in Canada and at TechMentor in the United States. He currently holds the GIAC GSEC and RHCE certifications, and is actively working on others.