Skip to content

Contact sales

By filling out this form and clicking submit, you acknowledge our privacy policy.

Time Profiling a Neural Network Model

Mar 13, 2020 • 17 Minute Read


You can control what you can measure. Profiling is a way to measure how code behaves in relation to the resources it uses. Time profiling provides information about how much time each line took to execute, how much time each module took for execution, and how many recursive calls were made. Memory profiling provides information about the amount of memory required for execution. By profiling your code, you can understand if there are any bottlenecks or memory leaks. This provides enough evidence for you to detect these incidents before they turn into problems.

This guide covers creating dataset objects and batches of data, feeding them to a machine learning (ML) model, and profiling the code using TensorFlow. It will use the CIFAR 10 dataset, which contains 50,000 images with labels for training and 10,000 images with labels for testing.

Design and Construct the Data Pipeline

A common procedure to manage data from one or multiple sources into a target system includes three steps: extract, transform, and load (ETL).

  • Extract raw data from multiple sources, including local disks or distributed systems
  • Transform data to parser, pre-process, augment, shuffle and batch it.
  • Load the transformed data into GPUs or TPUs to run ML model training and inference.

The following image illustrates the image classification data pipeline.

Creating TF Records

Once you download the images, you will see 5 batches of training images and 1 batch of test images, as shown in this screenshot.

One data batch is a 10000 x 3072 matrix expressed in a numpy array. It denotes 10,000 image samples that are each 32 x 32 pixels in size. Reshape the row vector as shown below, expected by Tensorflow as (width, height, num_channel). This can be consumed by the Tensorflow dataset API.

Create the Datasets Using In-memory Objects

We have used the API to create datasets from in-memory tensors. This API constructs a dataset by slicing the first dimension of the tensor into features and labels.

      train_set =, y_train)).map(data_aug,
  test_set =, y_test)).batch(BATCH_SIZE)

Transform the Datasets

Once the dataset objects have been created, they need to to be transformed for ML model consumption. Transformation includes data rearrangement, data cleanup, and data standardization and normalization. Standardize the features (in this case, pixel values) so that the mean is equal to zero and the standard deviation is 1. This makes our features lie on a Gaussian distribution. Data normalization rescales features to a specific range from 0 to the mean of the features.

      train_mean = np.mean(x_train, axis=(0,1,2))
train_std = np.std(x_train, axis=(0,1,2))
normalize = lambda x: ((x - train_mean) / train_std).astype('float32') 
# Add Padding
x_train = normalize(pad4(x_train))
x_test = normalize(x_test)

Next, augment the data for the 32 x 32 pixel size.

      data_aug = lambda x, y: (tf.image.random_flip_left_right(tf.image.random_crop(x, [32, 32, 3])), y)

The data is transformed online using This is very similar to pandas's Dataframe.apply(). This API takes a function and applies it to each element in the dataset. It takes a tf.Tensor object and returns a newly transformed tf.Tensor object.

Prefetch Data

The prefetch() API collects the next batch that will be loaded into the input pipeline while the current batch is being served by the ML model.

Load the Dataset to the Model

Feed the created dataset objects into the tf.keras-based model. The layers are defined using a tf.keras model sub-classing APIS. It is recommended to use tf.keras functional APIs to define neural network layers.

      # Using Kaiming He Initialization for kernel nitialization
def init_kaiming_he(shape, dtype=tf.float32, partition_info=None):
  fan =[:-1])
  bound = 1 / math.sqrt(fan)
  return tf.random.uniform(shape, minval=-bound, maxval=bound, dtype=dtype)

Now, create a convolutional neural network, followed by batch normalization and activation.

      # Creating model using subclassing API. 
# Same can be achieved using Functional API or Sequential API
class ConvBN(tf.keras.Model):
  def __init__(self, c_out):
    self.conv = tf.keras.layers.Conv2D(filters=c_out, kernel_size=3, padding="SAME", kernel_initializer=init_kaiming_he, use_bias=False) = tf.keras.layers.BatchNormalization(momentum=0.9, epsilon=1e-5)

  def call(self, inputs):
    return tf.nn.relu(

Now a Resnet block is constructed.

      class ResBlk(tf.keras.Model):
  def __init__(self, c_out, pool, res = False):
    self.conv_bn = ConvBN(c_out)
    self.pool = pool
    self.res = res
    if self.res:
      self.res1 = ConvBN(c_out)
      self.res2 = ConvBN(c_out)
  def call(self, inputs):
    h = self.pool(self.conv_bn(inputs))
    if self.res:
      h = h + self.res2(self.res1(h))
    return h Creating model using subclassing API. 
# Same can be achieved using Functional API or Sequential API
class ConvBN(tf.keras.Model):
  def __init__(self, c_out):
    self.conv = tf.keras.layers.Conv2D(filters=c_out, kernel_size=3, padding="SAME", kernel_initializer=init_kaiming_he, use_bias=False) = tf.keras.layers.BatchNormalization(momentum=0.9, epsilon=1e-5)

  def call(self, inputs):
    return tf.nn.relu(
      class CIFAR10AutoGrad(tf.keras.Model):
  def __init__(self, c=64, weight=0.125):
    self.maxpool = tf.keras.layers.MaxPooling2D()
    self.init_conv_bn = ConvBN(c)
    self.blk1 = ResBlk(c*3, self.maxpool, res = True)
    self.blk2 = ResBlk(c*6, self.maxpool)
    self.blk3 = ResBlk(c*9, self.maxpool, res = True)
    self.pool = tf.keras.layers.GlobalMaxPool2D()
    self.linear = tf.keras.layers.Dense(10, kernel_initializer=init_kaiming_he, use_bias=False)
    self.weight = weight
# Building the model
  def call(self, x, y):
    h = self.pool(self.maxpool(self.blk3(self.blk2(self.blk1(self.init_conv_bn(x))))))
    h = self.linear(h) * self.weight
    ce = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=h, labels=y) 
    loss = tf.reduce_sum(ce)
    correct = tf.reduce_sum(tf.cast(tf.math.equal(tf.argmax(h, axis = 1), y), tf.float32))
    return loss, correct

Model Training with Custom Logic

To compute and record the gradient of an operation with respect to its input variables, tf.GradientTape is used. All forward operations get recorded on tape, and to compute the gradient of those operations, the tape is played backward and discarded. To train a model call with tf.GradientTape(), call the forward pass on the input tensor inside the tf.GradientTape context manager and then compute the loss function.

      t = time.time()

def fit():
  # Create batches for test data
  test_set =, y_test)).batch(BATCH_SIZE)
  train_set =, y_train)).map(data_aug,
  #summary_writer = tf.summary.create_file_writer('./log/{}'.format("%Y-%m-%d-%H-%M-%S")))
  train_summary_writer = tf.summary.create_file_writer('./logs/train')
  test_summary_writer = tf.summary.create_file_writer('./logs/test')    

  for epoch in range(EPOCHS):
    train_loss = test_loss = train_acc = test_acc = 0.0
    with train_summary_writer.as_default():
      tf.summary.trace_on(graph=True, profiler=True) # To start capturing profiler in Tensorboard
      for (x, y) in tqdm(train_set):# Iterate over the batches of train dataset objects .
        with tf.GradientTape() as tape:
          #logits,actual  value for this minibatch
          loss, correct = model(x, y)
          var = model.trainable_variables
          grads = tape.gradient(loss, var)
          opt.apply_gradients(zip(grads, var))
        # Add extra losses created during this forward pass
        train_loss += loss.numpy()
        train_acc += correct.numpy()

      tf.summary.scalar('train loss', train_loss/len_train, step=epoch)
      tf.summary.scalar('train acc', (train_acc/len_train)*100, step=epoch)
      tf.summary.trace_export(name="Train", step=epoch,profiler_outdir='./logs/train/trace') # Close Profiling when we do export
    with test_summary_writer.as_default(): 
      tf.summary.trace_on(graph=True, profiler=True)
      for (x, y) in test_set:#  Iterate over the batches of train dataset objects .

        loss, correct = model(x, y)
        test_loss += loss.numpy()
        test_acc += correct.numpy()         
      tf.summary.scalar('test loss', test_loss/len_test, step=epoch)
      tf.summary.scalar('test acc', (test_acc/len_test)*100, step=epoch) 
      tf.summary.trace_export(name="Test", step=epoch,profiler_outdir='./logs/test/trace')  

      template = 'Epoch {}, lr:{:.3f},Train Loss: {:.3f},Train Accuracy: {:.3f}, Test Loss: {:.3f}, Test Accuracy: {:.3f},Time Taken: {:.2f}'
      print (template.format(epoch+1,lr_schedule(epoch+1),train_loss/len_train,(train_acc/len_train)*100,test_loss/len_test,(test_acc/len_test)*100,time.time() - t))

Using TensorBoard for Custom Training

  1. Create stateful metrics that accumulate values from training and can be logged at any point during the training process. This example uses train_acc, train_loss, test_acc and test_loss.
  2. Define the train and test functions that compute the metrics and accumulate them.
  3. Log the metrics using tf.summary.scalar() on the metrics result.

The above code creates a file to store summary of loss and accuracy metrics. We need to enable it using the below API.

      tf.summary.trace_on(graph=True, profiler=True)

The summary can be exported using the following.

      tf.summary.trace_export(name="Train", step=epoch,profiler_outdir='./logs/train')

In the same way, call the summary writer for the test or validation data. When you call tf.summary.trace_export(), tf.summary.trace_on() will be closed.

Visualizing Tensorboard

      %tensorboard --logdir logs/

Time Profiling using CProfile

Tensorboard skips the first iteration, which builds the graph. The profiler tab in tensorboard shows the time taken by each resource for execution. To view a custom visualization, use CProfile to compute the time taken by each functional call.

      if __name__ == '__main__':
   profile.runctx('print (fit())', globals(),locals(),"./logs/")

The saved profiler logs can be read using the stats module.

      from pstats import Stats
from io import StringIO

# print stats to a string
stats = Stats("./logs/",stream=result)


# chop the string into a csv-like buffer
result='\n'.join([', '.join(line.rstrip().split(None,6)) for line in result.split('\n')])

# save it to disk

The cProfile output is divided into five columns:

  1. ncalls: The number of times the function was called.
  2. tottime: The total time spent in the function without taking into account the calls to other functions.
  3. cumtime: The time in the function including other function calls.
  4. percall: The time spent for a single call of the function. This can be obtained by dividing the total or cumulative time by the number of calls.
  5. filename:lineno: The filename and corresponding line numbers. This information is not available when calling C extensions modules.

After converting profiler logs into csv format, you can do analysis as with any other pandas data frame.

      df = pd.read_csv('./logs/cifar10.csv',skipinitialspace=True)
# Display Column names
# Rename pandas col
df = df.drop(['filename:lineno(function)'],axis=1)
df=df[:-1] # deleting last row as part of data cleaning
df['total_num_calls']=df.index # Resetting index
df['total_num_calls']=df['total_num_calls'].str.split("/",n=1,expand=True) # Extracting total calls #from total num of calls / number of recursion
# Change data type
convert_dict = {'module_name': str, 
                'total_num_calls': int,
                'tottime':float   ,
                'ncalls' :float,
df = df.astype(convert_dict)

Derive the total time by multiplying the time taken per call by the total number of calls.

      df['total_time']=df['percall'] * df['total_num_calls']
# Converting to seconds
df['total_time'] = df['total_time'] / 1000.0
      # Create extra copy for safety 

Extract only the module name.

      df1['module_name']= df['module_name'].str.extract('(\\')
# dropping if any NaN exist

# Remove extra white Space

Create a dataframe that groups all modules with the same name and adds their total time.

# Adding same modules total time
df2 = df1.groupby(["module_name"])["total_time"].sum()
df2 = pd.DataFrame(df2)
df2 = df2[df2['total_time']>1]
df2 = df2.sort_values(by='total_time',ascending=False)

This code has filtered the values greater than one second and sorted and plot top 20 modules that consume the most time.

      %matplotlib inline
# Plotting
top20 = df2.head(20)
g=sns.catplot(x='module_name', y='total_time', data=top20, kind='bar')
g.set_xlabels("function called")
g.set_ylabels("Total_Time in seconds")
plt.title('Total Time Taken in seconds', weight='bold', fontsize=16)

To see the time taken by the modules' convolution, batch normalization, and gradient tape we used while creating a model, use a filter to visualize the selected modules.

      df3 =df2[df2['module_name'].str.contains("grad") | 
        df2['module_name'].str.contains('normal') |
        df2['module_name'].str.contains('tape') |
        df2['module_name'].str.contains('array') |
        df2['module_name'].str.contains('tensor') |
        df2['module_name'].str.contains('deprecat') |
        df2['module_name'].str.contains('iter') |
        df2['module_name'].str.contains('pool') |
        df2['module_name'].str.contains('layer') |
        df2['module_name'].str.contains('ops') |
        df2['module_name'].str.contains('custom') ]

Memory Profiling

Memory profiler summarizes the memory used in the process.

      !pip install memory_profiler
      !pip install psutil # optional dependency for memory profiler to execute fast
      %load_ext  memory_profiler
      %memit fit() # To know how much memory script uses overall.

This code took peak memory: 6912.27 MiB, increment: 1144.72 MiB. (1 MiB [mebibyte] is equivalent to 1,048,576 bytes. It is different from 1 MB [megabyte], which is equivalent to 1,000,000 bytes.)


This guide described custom profiling of Tensorflow models. Using a profiler on your code, you can find memory leaks and bottlenecks, enabling you to optimize the code.