Image Classification with PyTorch

Apr 1, 2020 • 19 Minute Read


PyTorch has revolutionized the approach to computer vision and NLP problems. It's a dynamic deep-learning framework that builds its computation graph as the code runs, which makes it easy to learn, use, and debug.

In this guide, we will build an image classification model from start to finish, beginning with exploratory data analysis (EDA), which will help you understand the shape of an image and the distribution of classes. You'll learn to prepare data for optimum modeling results and then build a convolutional neural network (CNN) that will classify images according to whether they contain a cactus or not.

This guide uses the aerial cactus dataset from an ongoing Kaggle competition. Instead of the black-and-white images of MNIST, this dataset contains RGB images with three channels, which makes it a good fit for beginners who want to explore and play with CNNs. It's also a chance to classify something other than cats and dogs.

Importing Library and Data

To begin, import the torch and torchvision frameworks and their libraries with numpy, pandas, and sklearn. Libraries and functions used in the code below include:

  • transforms, for basic image transformations
  • torch.nn.functional, which contains useful activation functions
  • Dataset and DataLoader, PyTorch's data loading utilities

import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms

from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split

%matplotlib inline
import os

# Place the files in your IDE working directory.
labels = pd.read_csv(r'/aerialcactus/train.csv')
submission = pd.read_csv(r'/aerialcactus/sample_submission.csv')

train_path = r'/aerialcactus/train/train/'
test_path = r'/aerialcactus/test/test/'

# groupby sorts its keys, so the slice for 0 (no cactus) comes before the slice for 1 (cactus)
label = 'Hasn\'t Cactus', 'Has Cactus'
plt.figure(figsize=(8, 8))
plt.pie(labels.groupby('has_cactus').size(), labels=label, autopct='%1.1f%%', shadow=True, startangle=90)

As the pie chart shows, the data is skewed toward one class. Imbalanced data can bias a model toward the majority class, but this dataset is large enough for a CNN to learn both classes, so this guide does not apply any resampling or augmentation.

Image Pre-processing

Images in a dataset do not usually have the same pixel intensity and dimensions. In this section, you will pre-process the dataset by standardizing the pixel values. You will also transform the raw images into tensors so that the model can process them.
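To make the standardization step concrete, here is a minimal sketch in plain Python. It assumes pixel values already scaled to the range [0, 1] and reuses the per-channel mean and standard deviation that appear in the imshow helper in this guide. The function names normalize and denormalize are illustrative and not part of torchvision.

```python
# Per-channel statistics, taken from the imshow helper in this guide.
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize(pixel):
    """Standardize one RGB pixel whose channels are scaled to [0, 1]."""
    return [(value - m) / s for value, m, s in zip(pixel, MEAN, STD)]


def denormalize(pixel):
    """Invert normalize() so the pixel can be displayed again."""
    return [value * s + m for value, m, s in zip(pixel, MEAN, STD)]


# A hypothetical pixel given as 8-bit channel values, scaled to [0, 1] first.
raw_pixel = [128 / 255, 64 / 255, 255 / 255]
standardized = normalize(raw_pixel)
restored = denormalize(standardized)

print(standardized)
print(restored)
```

Standardizing and then reversing the operation returns the original pixel, which is why a display helper has to undo the standardization before plotting.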

import matplotlib.image as img

# Show five images that contain a cactus
fig, ax = plt.subplots(1, 5, figsize=(15, 3))
for i, idx in enumerate(labels[labels['has_cactus'] == 1]['id'][-5:]):
    path = os.path.join(train_path, idx)
    ax[i].imshow(img.imread(path))

# Show five images that do not contain a cactus
fig, ax = plt.subplots(1, 5, figsize=(15, 3))
for i, idx in enumerate(labels[labels['has_cactus'] == 0]['id'][:5]):
    path = os.path.join(train_path, idx)
    ax[i].imshow(img.imread(path))

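Raw pixel values rarely give the desired results, so the images are standardized with a defined mean and standard deviation. The imshow helper below reverses that standardization when normalize=True so that a standardized image can be displayed, and the CactiDataset class loads each image together with its label.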

import numpy as np
import matplotlib.pyplot as plt

def imshow(image, ax=None, title=None, normalize=True):
    """Display an image tensor of shape (channels, height, width)."""
    if ax is None:
        fig, ax = plt.subplots()
    # Matplotlib expects (height, width, channels)
    image = image.numpy().transpose((1, 2, 0))

    if normalize:
        # Undo the standardization so the values fall back into [0, 1]
        mean = np.array([0.485, 0.456, 0.406])
        std = np.array([0.229, 0.224, 0.225])
        image = std * image + mean
        image = np.clip(image, 0, 1)

    ax.imshow(image)
    ax.tick_params(axis='both', length=0)

    return ax


class CactiDataset(Dataset):
    def __init__(self, data, path, transform=None):
        super().__init__()
        # Each row holds an image file name and its label
        self.data = data.values
        self.path = path
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        img_name, label = self.data[index]
        img_path = os.path.join(self.path, img_name)
        image = img.imread(img_path)
        if self.transform is not None:
            image = self.transform(image)
        return image, label


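# The source listing is cut off after ToPILImage(). ToTensor() and Normalize(), with the
# mean and std from the imshow helper, are an assumed completion based on the text above.
train_transform = transforms.Compose([transforms.ToPILImage(), transforms.ToTensor(),
                                      transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

test_transform = transforms.Compose([transforms.ToPILImage(), transforms.ToTensor(),
                                     transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

valid_transform = transforms.Compose([transforms.ToPILImage(), transforms.ToTensor(),
                                      transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])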

Splitting the Dataset

How well the model can learn depends on the variety and volume of the data. We need to divide our data into a training set and a validation set using train_test_split.

Training dataset: The model learns from the examples in this dataset, which are used to fit the model's parameters (its weights and biases).

Validation dataset: The examples in the validation dataset are used to tune hyperparameters, such as the learning rate and the number of epochs. A validation set helps you detect overfitting: it is a checkpoint that shows whether the model generalizes beyond the training dataset.

Test dataset: This dataset is used for the final evaluation of the model, measuring how well it has learned to predict the desired output. It contains unseen, real-life data.

train, valid = train_test_split(labels, stratify=labels.has_cactus, test_size=0.2)

train_data = CactiDataset(train, train_path, train_transform)
valid_data = CactiDataset(valid, train_path, valid_transform)
test_data = CactiDataset(submission, test_path, test_transform)
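The stratify argument keeps the class ratio the same in both splits, which matters when the classes are imbalanced. The sketch below illustrates the idea in plain Python. It is a simplified stand-in, not scikit-learn's implementation, and the function name stratified_split and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_indices, valid_indices) with matching class ratios."""
    rng = random.Random(seed)

    # Group the sample indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, valid_idx = [], []
    for indices in by_class.values():
        # Shuffle within each class, then cut off the validation share.
        rng.shuffle(indices)
        n_valid = round(len(indices) * test_size)
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return train_idx, valid_idx


# Hypothetical labels: 75 images with a cactus and 25 without.
toy_labels = [1] * 75 + [0] * 25
train_idx, valid_idx = stratified_split(toy_labels)
valid_share = sum(toy_labels[i] for i in valid_idx) / len(valid_idx)

print(len(train_idx), len(valid_idx), valid_share)  # 80 20 0.75
```

The validation split contains the same 75% share of cactus images as the full toy dataset.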

Define the values of hyperparameters.

      # Hyper parameters

num_epochs = 35
num_classes = 2
batch_size = 25
learning_rate = 0.001

Whenever you create a batch of images, it lives on the CPU by default. The function torch.cuda.is_available() checks whether a GPU is present. If CUDA is available, torch.device('cuda:0') selects the first GPU, and calling .to(device) on a tensor or a model moves it there for computation.

      # CPU or GPU

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

The device will be the first CUDA GPU when one is available, which makes the calculations faster. If your system only has a CPU, no problem: the code falls back to the CPU, or you can use Google Colab, which provides a free GPU.

In the code below, DataLoader combines a dataset and a sampler and provides an iterable over the given dataset. The dataset argument indicates which dataset to load from. For details, read the PyTorch DataLoader documentation.

      train_loader = DataLoader(dataset = train_data, batch_size = batch_size, shuffle=True, num_workers=0)
valid_loader = DataLoader(dataset = valid_data, batch_size = batch_size, shuffle=False, num_workers=0)
test_loader = DataLoader(dataset = test_data, batch_size = batch_size, shuffle=False, num_workers=0)
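To illustrate what a data loader does with batch_size and shuffle, here is a simplified sketch in plain Python. It only mimics the batching behavior and is not how PyTorch implements DataLoader; the function name iterate_batches and the toy dataset are hypothetical.

```python
import math
import random


def iterate_batches(dataset, batch_size, shuffle=False, seed=0):
    """Yield lists of samples, mimicking the batching role of a data loader."""
    order = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [dataset[i] for i in order[start:start + batch_size]]


# Hypothetical dataset of 60 (file name, label) pairs.
toy_dataset = [(f"image_{i}.jpg", i % 2) for i in range(60)]
batches = list(iterate_batches(toy_dataset, batch_size=25, shuffle=True))
expected_batches = math.ceil(len(toy_dataset) / 25)

print([len(batch) for batch in batches])  # [25, 25, 10]
```

The last batch is smaller because 60 is not a multiple of 25, which is why the training loop in this guide weights each batch loss by the batch size before averaging.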
      trainimages, trainlabels = next(iter(train_loader))

fig, axes = plt.subplots(figsize=(12, 12), ncols=5)
print('training images')
for i in range(5):
    axe1 = axes[i] 
    imshow(trainimages[i], ax=axe1, normalize=False)


The next step is to build a CNN model that learns from the pre-processed training dataset.

Designing a Convolution Neural Network (CNN)

When you try to recognize objects in an image, you notice features like color, shape, and size. A CNN uses the same technique. It has two main parts: the convolution and pooling layers, where the model extracts the features in the image, and the fully connected (FC) layers, where classification takes place.

Convolution Layer

Mathematically, convolution is an operation performed on two functions to produce a third function. Convolution is used in speech processing (1 dimension), image processing (2 dimensions), and video processing (3 dimensions). In a convolution layer, each filter is as deep as its input, with one kernel slice per input channel, and it slides across the image to produce a feature map.

The convolutional layer’s output shape is affected by the choice of kernel size, input dimensions, padding, and strides (number of pixels by which the window moves).

In this model, a 3x3 kernel size is used. In the first layer, each filter spans the three RGB channels, so it has 3 x 3 x 3 = 27 weights and 1 bias.

This is what happens behind the CNN.

Similarly, carry out the calculation of layer 2.
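The calculation for both layers can be sketched in a few lines of Python. The layer sizes come from the CNN class defined later in this guide. The 32x32 input size is an assumption: it is the size that makes the flattened output match the 720 inputs of fc1. The helper names are illustrative.

```python
def conv_output_size(size, kernel, padding=0, stride=1):
    """Spatial size after a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_parameters(in_channels, out_channels, kernel):
    """Number of learnable values: weights plus one bias per filter."""
    weights_per_filter = in_channels * kernel * kernel
    return out_channels * (weights_per_filter + 1)


size = 32  # assumed input width and height
size = conv_output_size(size, kernel=3)            # conv1    -> 30
size = conv_output_size(size, kernel=2, stride=2)  # max pool -> 15
size = conv_output_size(size, kernel=3)            # conv2    -> 13
size = conv_output_size(size, kernel=2, stride=2)  # max pool -> 6
flattened = 20 * size * size                       # 20 channels of 6 x 6

print(conv_parameters(3, 10, 3))   # 280  = 10 filters x (27 weights + 1 bias)
print(conv_parameters(10, 20, 3))  # 1820 = 20 filters x (90 weights + 1 bias)
print(flattened)                   # 720
```

Each filter in layer 2 spans the 10 channels produced by layer 1, so it has 10 x 3 x 3 = 90 weights and 1 bias.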

Pooling Layer

A drawback of a convolution feature map is that it records the exact position of features. Even a small shift in the position of a feature in the input produces a different feature map. This problem is solved by downsampling the feature map, which gives a lower-resolution version of the input with the important features intact. In this model, max pooling is used. It takes the maximum value of each patch of the feature map.
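As an illustration, the sketch below applies 2x2 max pooling with a stride of 2 to a small, made-up feature map in plain Python. It shows the operation only; the model itself uses F.max_pool2d.

```python
def max_pool_2x2(feature_map):
    """Apply 2x2 max pooling with stride 2 to a 2D list of numbers."""
    pooled = []
    for row in range(0, len(feature_map) - 1, 2):
        pooled_row = []
        for col in range(0, len(feature_map[0]) - 1, 2):
            # Collect the four values of the current patch.
            patch = [feature_map[row][col], feature_map[row][col + 1],
                     feature_map[row + 1][col], feature_map[row + 1][col + 1]]
            pooled_row.append(max(patch))
        pooled.append(pooled_row)
    return pooled


# A made-up 4x4 feature map.
feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 2],
    [7, 2, 9, 4],
    [1, 0, 3, 5],
]

print(max_pool_2x2(feature_map))  # [[6, 2], [7, 9]]
```

The output is half the width and height of the input, and each value is the strongest response in its patch, so a feature that moves by one pixel inside a patch leaves the result unchanged.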

Some brief notes about important parameters of __init__ model and forward are stated below:

Activation Layer

During forward propagation, an activation function is applied to the output of each layer. The activation function introduces non-linearity. Without it, a neural network collapses into a linear model no matter how many layers it has, so the activation function cannot be left out. This model uses the ReLU activation function, F.relu.
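A small numeric sketch shows why the activation function matters. The weights and biases below are made up for illustration, and each "layer" has a single input and output to keep the arithmetic easy to follow.

```python
def relu(x):
    """Rectified linear unit: negative inputs become zero."""
    return max(0.0, x)


def linear(x, weight, bias):
    return weight * x + bias


def two_layers(x, activation=None):
    """Two stacked linear layers with an optional activation in between."""
    hidden = linear(x, weight=2.0, bias=-1.0)
    if activation is not None:
        hidden = activation(hidden)
    return linear(hidden, weight=-3.0, bias=0.5)


def collapsed(x):
    """Single linear layer equal to the two layers without activation:
    -3 * (2x - 1) + 0.5 = -6x + 3.5"""
    return linear(x, weight=-6.0, bias=3.5)


for x in (-1.0, 0.0, 2.0):
    print(x, two_layers(x), collapsed(x), two_layers(x, activation=relu))
```

Without an activation, the two layers give exactly the same output as one linear layer for every input. With ReLU in between, the outputs differ for inputs that drive the hidden value below zero, which is what lets deeper networks model non-linear relationships.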

Putting it All Together...

      epochs = 35
batch_size = 25
learning_rate = 0.001
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=3)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=3)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(720, 1024)
        self.fc2 = nn.Linear(1024, 2)

    def forward(self, x):
        # convolution -> max pooling -> ReLU, twice
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        # flatten to (batch_size, 720) for the fully connected layers
        x = x.view(x.shape[0], -1)
        x = F.relu(self.fc1(x))
        # dropout is active only while the model is in training mode
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return x

Create a complete CNN.

      model = CNN()


There are different types of losses implemented in machine learning. In this guide, cross-entropy loss is used. In this context, it is also known as log loss, because it is the negative log of the likelihood. For a true label y and a predicted probability p, the loss is -[y log(p) + (1 - y) log(1 - p)].

The best thing about this function is that if the true label is 0, the first term goes away, and if the true label is 1, the second term drops. The loss therefore measures how far the predicted probability is from the correct label, which gives you an estimate of where your model goes wrong while predicting the label. The parameters are adjusted during training to minimize the loss.
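The two-class form of the loss can be sketched in plain Python as follows. This is the textbook formula for a single example, not the code behind nn.CrossEntropyLoss, which takes raw scores for each class and applies softmax internally. The function name binary_cross_entropy is illustrative.

```python
import math


def binary_cross_entropy(label, probability):
    """Loss for one example: label is 0 or 1, probability is P(label == 1)."""
    return -(label * math.log(probability)
             + (1 - label) * math.log(1 - probability))


# Confident and correct: small loss.
print(round(binary_cross_entropy(1, 0.9), 4))  # 0.1054
# Confident and wrong: large loss.
print(round(binary_cross_entropy(1, 0.1), 4))  # 2.3026
# True label 0: only the second term contributes.
print(round(binary_cross_entropy(0, 0.1), 4))  # 0.1054
```

The loss grows quickly as the predicted probability moves away from the true label, so confident mistakes are penalized the most.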


Select one of the optimization algorithms available in the torch.optim package. Most of them build on gradient descent: the model is optimized by changing its parameters, the weights and biases, in the direction that lowers the loss. The learning rate decides how big each step is. Gradient descent works as follows:

  1. Calculate what a small change in each weight would do to the loss function (selecting the direction to reach minima).
  2. Adjust each weight based on its gradient (i.e., take a small step in the determined direction).
  3. Keep doing steps 1 and 2 until the loss function gets as low as possible.
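The three steps above can be sketched on a toy problem with a single weight. The loss function (w - 3)^2, the starting point, and the number of steps are made up for illustration; a real model has many weights and gets its gradients from loss.backward().

```python
def loss(weight):
    """Toy loss with its minimum at weight = 3."""
    return (weight - 3.0) ** 2


def gradient(weight):
    """Derivative of the toy loss with respect to the weight (step 1)."""
    return 2.0 * (weight - 3.0)


weight = 0.0
learning_rate = 0.1

for step in range(100):
    # Step 2: move the weight a small step against the gradient.
    weight -= learning_rate * gradient(weight)
    # Step 3: repeat until the loss stops improving (here: a fixed number of steps).

print(round(weight, 4), loss(weight))
```

With a learning rate of 0.1, each step removes 20% of the remaining distance to the minimum, so the weight converges to 3. A much larger learning rate would overshoot the minimum instead.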

Here, adaptive moment estimation (Adam) is used as the optimizer. It combines the per-parameter adaptive learning rates of RMSprop with the momentum used in stochastic gradient descent.

The loss function and the optimizer go hand in hand. The loss function measures whether the model is moving in the correct direction and making progress, whereas the optimizer updates the model so that it delivers more accurate results.

model = CNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# keep track of the average loss per epoch
train_losses = []
valid_losses = []

for epoch in range(1, num_epochs + 1):
    # running totals for this epoch
    train_loss = 0.0
    valid_loss = 0.0

    # train the model (enables dropout)
    model.train()
    for data, target in train_loader:
        # move tensors to the selected device (GPU if available)
        data = data.to(device)
        target = target.to(device)
        # clear the gradients of all optimized variables
        optimizer.zero_grad()
        # forward pass: compute predicted outputs by passing inputs to the model
        output = model(data)
        # calculate the batch loss
        loss = criterion(output, target)
        # backward pass: compute gradient of the loss with respect to model parameters
        loss.backward()
        # perform a single optimization step (parameter update)
        optimizer.step()
        # update training loss, weighted by the batch size
        train_loss += loss.item() * data.size(0)

    # validate the model (disables dropout, no gradients needed)
    model.eval()
    with torch.no_grad():
        for data, target in valid_loader:
            data = data.to(device)
            target = target.to(device)
            output = model(data)
            loss = criterion(output, target)
            # update validation loss, weighted by the batch size
            valid_loss += loss.item() * data.size(0)

    # calculate average losses and store them for plotting
    train_loss = train_loss / len(train_loader.sampler)
    valid_loss = valid_loss / len(valid_loader.sampler)
    train_losses.append(train_loss)
    valid_losses.append(valid_loss)

    # print training/validation statistics
    print('Epoch: {} \tTraining Loss: {:.6f} \tValidation Loss: {:.6f}'.format(
        epoch, train_loss, valid_loss))
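# test the model on the validation set (the test images have no true labels)
model.eval()  # disables dropout
with torch.no_grad():
    correct = 0
    total = 0
    for images, targets in valid_loader:
        images = images.to(device)
        targets = targets.to(device)
        outputs = model(images)
        # the predicted class is the index of the highest score
        _, predicted = torch.max(outputs, 1)
        total += targets.size(0)
        correct += (predicted == targets).sum().item()
    print('Validation accuracy of the model: {} %'.format(100 * correct / total))

# save the trained weights
torch.save(model.state_dict(), 'model.ckpt')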
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

plt.plot(train_losses, label='Training loss')
plt.plot(valid_losses, label='Validation loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()


Take a deep breath! A CNN-based image classifier is ready, and it gives 98.9% accuracy on the validation set. As per the graph above, training and validation loss decrease as the epochs increase. The losses stay in line with each other, which suggests that the model is neither underfitting nor overfitting.

Data preparation is the most important and time-intensive process in data science. It is a great skill to know how to play around with data in the initial stage. Getting to know your data is what makes a good data scientist. This guide is not a one-stop reference for pre-processing, but it gives you a brief overview.

You also learned about the layers involved in designing a CNN model and the roles of the loss function and the optimizer.

Building your own neural network is a cumbersome task, and that's why transfer learning (taking knowledge from one situation and applying it to another) is used a lot these days. Nevertheless, it is always good to have foundational knowledge.