You can only control what you can measure. Profiling is a way to measure how code behaves in relation to the resources it uses. Time profiling reports how much time each line and each module took to execute and how many recursive calls were made. Memory profiling reports how much memory the code requires. By profiling your code, you can find bottlenecks and memory leaks, giving you the evidence to catch these issues before they turn into problems.
This guide covers creating dataset objects and batches of data, feeding them to a machine learning (ML) model, and profiling the code using TensorFlow. It uses the CIFAR-10 dataset, which contains 50,000 labeled images for training and 10,000 labeled images for testing.
A common procedure for moving data from one or more sources into a target system includes three steps: extract, transform, and load (ETL).
The following image illustrates the image classification data pipeline.
Once you download the images, you will see five batches of training images and one batch of test images, as shown in this screenshot.
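As a minimal sketch of that download step (assuming the official cifar-10-python.tar.gz archive from the CIFAR-10 page; the original guide does not show this code):

import pickle
import tarfile
import urllib.request

# Fetch and unpack the official python-version archive
url = "https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"
urllib.request.urlretrieve(url, "cifar-10-python.tar.gz")
with tarfile.open("cifar-10-python.tar.gz") as tar:
    tar.extractall()

def unpickle(path):
    # Each batch file is a pickled dict with b'data' and b'labels' keys
    with open(path, "rb") as f:
        return pickle.load(f, encoding="bytes")

batch = unpickle("cifar-10-batches-py/data_batch_1")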
One data batch is a 10,000 x 3,072 matrix expressed as a NumPy array, denoting 10,000 image samples that are each 32 x 32 pixels with three color channels (32 x 32 x 3 = 3,072). Reshape each row vector into the (width, height, num_channels) layout that TensorFlow expects, as shown below, so it can be consumed by the TensorFlow Dataset API.
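A minimal sketch of that reshape (assuming batch is one unpickled CIFAR-10 data batch, as above, and NumPy imported as np):

raw = batch[b'data']  # shape (10000, 3072); each row stores 1024 R, then 1024 G, then 1024 B values
images = raw.reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)  # now (10000, 32, 32, 3)
labels = np.asarray(batch[b'labels'])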
We have used the tf.data.Dataset.from_tensor_slices() API to create datasets from in-memory tensors. This API constructs a dataset by slicing the tensors along their first dimension, pairing features with labels.
train_set = tf.data.Dataset.from_tensor_slices((x_train, y_train)).map(data_aug, num_parallel_calls=tf.data.experimental.AUTOTUNE).shuffle(len_train).batch(BATCH_SIZE).prefetch(1)
test_set = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(BATCH_SIZE)
Once the dataset objects have been created, they need to be transformed for ML model consumption. Transformation includes data rearrangement, data cleanup, and data standardization and normalization. Standardize the features (in this case, pixel values) so that the mean is equal to zero and the standard deviation is one, putting all features on a common scale. Data normalization, by contrast, rescales features to a specific range, typically 0 to 1.
import numpy as np

train_mean = np.mean(x_train, axis=(0, 1, 2))
train_std = np.std(x_train, axis=(0, 1, 2))
normalize = lambda x: ((x - train_mean) / train_std).astype('float32')
# Add padding: pad4 pads each image by 4 pixels on every side before random cropping
# (an assumed helper; the original guide does not show its definition)
pad4 = lambda x: np.pad(x, [(0, 0), (4, 4), (4, 4), (0, 0)], mode='reflect')
x_train = normalize(pad4(x_train))
x_test = normalize(x_test)
Next, augment the data by randomly cropping images back to 32 x 32 pixels and randomly flipping them left to right.
data_aug = lambda x, y: (tf.image.random_flip_left_right(tf.image.random_crop(x, [32, 32, 3])), y)
The data is transformed online using tf.data.Dataset.map(), which is very similar to pandas' DataFrame.apply(). This API takes a function and applies it to each element in the dataset: it takes a tf.Tensor object and returns a newly transformed tf.Tensor object.
The prefetch() API collects the next batch that will be loaded into the input pipeline while the current batch is being served to the ML model.
Feed the created dataset objects into a tf.keras-based model. Here the layers are defined using the tf.keras model subclassing API, although using the tf.keras functional API to define neural network layers is generally recommended.
import math

# Using Kaiming He initialization for the kernels
def init_kaiming_he(shape, dtype=tf.float32, partition_info=None):
    fan = np.prod(shape[:-1])
    bound = 1 / math.sqrt(fan)
    return tf.random.uniform(shape, minval=-bound, maxval=bound, dtype=dtype)
Now, create a convolutional layer followed by batch normalization and a ReLU activation.
# Creating the model using the subclassing API.
# The same can be achieved using the functional or sequential API.
class ConvBN(tf.keras.Model):
    def __init__(self, c_out):
        super().__init__()
        self.conv = tf.keras.layers.Conv2D(filters=c_out, kernel_size=3, padding="SAME", kernel_initializer=init_kaiming_he, use_bias=False)
        self.bn = tf.keras.layers.BatchNormalization(momentum=0.9, epsilon=1e-5)

    def call(self, inputs):
        return tf.nn.relu(self.bn(self.conv(inputs)))
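For comparison, the same convolution, batch normalization, and ReLU unit written with the recommended functional API would look roughly like this (a sketch, not part of the original model):

def conv_bn_functional(c_out, input_shape=(32, 32, 3)):
    inputs = tf.keras.Input(shape=input_shape)
    x = tf.keras.layers.Conv2D(filters=c_out, kernel_size=3, padding="SAME", kernel_initializer=init_kaiming_he, use_bias=False)(inputs)
    x = tf.keras.layers.BatchNormalization(momentum=0.9, epsilon=1e-5)(x)
    outputs = tf.keras.layers.ReLU()(x)
    return tf.keras.Model(inputs=inputs, outputs=outputs)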
Next, construct a ResNet-style block.
class ResBlk(tf.keras.Model):
    def __init__(self, c_out, pool, res=False):
        super().__init__()
        self.conv_bn = ConvBN(c_out)
        self.pool = pool
        self.res = res
        if self.res:
            self.res1 = ConvBN(c_out)
            self.res2 = ConvBN(c_out)

    def call(self, inputs):
        h = self.pool(self.conv_bn(inputs))
        if self.res:
            h = h + self.res2(self.res1(h))
        return h
class CIFAR10AutoGrad(tf.keras.Model):
    def __init__(self, c=64, weight=0.125):
        super().__init__()
        self.maxpool = tf.keras.layers.MaxPooling2D()
        self.init_conv_bn = ConvBN(c)
        self.blk1 = ResBlk(c*3, self.maxpool, res=True)
        self.blk2 = ResBlk(c*6, self.maxpool)
        self.blk3 = ResBlk(c*9, self.maxpool, res=True)
        self.pool = tf.keras.layers.GlobalMaxPool2D()
        self.linear = tf.keras.layers.Dense(10, kernel_initializer=init_kaiming_he, use_bias=False)
        self.weight = weight

    # Building the model: the forward pass returns the loss and the number of correct predictions
    def call(self, x, y):
        h = self.pool(self.maxpool(self.blk3(self.blk2(self.blk1(self.init_conv_bn(x))))))
        h = self.linear(h) * self.weight
        ce = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=h, labels=y)
        loss = tf.reduce_sum(ce)
        correct = tf.reduce_sum(tf.cast(tf.math.equal(tf.argmax(h, axis=1), y), tf.float32))
        return loss, correct
To compute and record the gradient of an operation with respect to its input variables, tf.GradientTape is used: all forward operations get recorded on the tape, and to compute their gradients, the tape is played backward and then discarded. To train the model, run the forward pass on the input tensor inside the tf.GradientTape context manager and then compute the loss function.
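The fit() function below assumes a few names are already defined. A minimal setup sketch (the hyperparameter values here are illustrative, not from the original guide):

import time
from tqdm import tqdm

EPOCHS = 24          # illustrative value
BATCH_SIZE = 512     # illustrative value
len_train, len_test = len(x_train), len(x_test)
model = CIFAR10AutoGrad()
global_step = tf.Variable(0, dtype=tf.int64)
lr_schedule = lambda epoch: 0.4 * (1 - epoch / (EPOCHS + 1))  # illustrative schedule, used only for logging
opt = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)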
t = time.time()

def fit():
    # Create batches for the test and train data
    test_set = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(BATCH_SIZE)
    train_set = tf.data.Dataset.from_tensor_slices((x_train, y_train)).map(data_aug, num_parallel_calls=tf.data.experimental.AUTOTUNE).shuffle(len_train).batch(BATCH_SIZE).prefetch(1)
    train_summary_writer = tf.summary.create_file_writer('./logs/train')
    test_summary_writer = tf.summary.create_file_writer('./logs/test')

    for epoch in range(EPOCHS):
        train_loss = test_loss = train_acc = test_acc = 0.0
        with train_summary_writer.as_default():
            tf.summary.trace_on(graph=True, profiler=True)  # Start capturing the profiler for TensorBoard
            tf.keras.backend.set_learning_phase(1)
            for (x, y) in tqdm(train_set):  # Iterate over the batches of the train dataset
                with tf.GradientTape() as tape:
                    # Compute loss and correct predictions for this minibatch
                    loss, correct = model(x, y)
                var = model.trainable_variables
                grads = tape.gradient(loss, var)
                opt.apply_gradients(zip(grads, var))
                global_step.assign_add(1)
                # Accumulate the losses and correct predictions from this forward pass
                train_loss += loss.numpy()
                train_acc += correct.numpy()

            tf.summary.scalar('train loss', train_loss/len_train, step=epoch)
            tf.summary.scalar('train acc', (train_acc/len_train)*100, step=epoch)
            tf.summary.trace_export(name="Train", step=epoch, profiler_outdir='./logs/train/trace')  # Profiling stops on export
        tf.keras.backend.set_learning_phase(0)

        with test_summary_writer.as_default():
            tf.summary.trace_on(graph=True, profiler=True)
            for (x, y) in test_set:  # Iterate over the batches of the test dataset
                loss, correct = model(x, y)
                test_loss += loss.numpy()
                test_acc += correct.numpy()
            tf.summary.scalar('test loss', test_loss/len_test, step=epoch)
            tf.summary.scalar('test acc', (test_acc/len_test)*100, step=epoch)
            tf.summary.trace_export(name="Test", step=epoch, profiler_outdir='./logs/test/trace')

        template = 'Epoch {}, lr: {:.3f}, Train Loss: {:.3f}, Train Accuracy: {:.3f}, Test Loss: {:.3f}, Test Accuracy: {:.3f}, Time Taken: {:.2f}'
        print(template.format(epoch+1, lr_schedule(epoch+1), train_loss/len_train, (train_acc/len_train)*100, test_loss/len_test, (test_acc/len_test)*100, time.time() - t))
Within each summary writer context, call tf.summary.scalar() on the metrics result.

tf.summary.create_file_writer('./logs/train')

The above code creates a file to store the summary of the loss and accuracy metrics. Tracing needs to be enabled using the API below.
tf.summary.trace_on(graph=True, profiler=True)
The summary can be exported using the following.
tf.summary.trace_export(name="Train", step=epoch, profiler_outdir='./logs/train')
In the same way, call the summary writer for the test or validation data. When you call tf.summary.trace_export(), the trace started by tf.summary.trace_on() is stopped. Launch TensorBoard to view the metrics and traces.
%load_ext tensorboard
%tensorboard --logdir logs/
TensorBoard skips the first iteration, which builds the graph. The Profiler tab in TensorBoard shows the time taken by each resource during execution. To build a custom visualization instead, use cProfile to compute the time taken by each function call.
import cProfile as profile

if __name__ == '__main__':
    profile.runctx('print(fit())', globals(), locals(), "./logs/cifar10.prof")
The saved profiler logs can be read using the pstats module.
from pstats import Stats
from io import StringIO

# Print the stats to a string
result = StringIO()
stats = Stats("./logs/cifar10.prof", stream=result)
stats.strip_dirs()
stats.sort_stats('cumulative')
stats.print_stats()

result = result.getvalue()

# Chop the string into a CSV-like buffer
result = 'ncalls' + result.split('ncalls')[-1]
result = '\n'.join([', '.join(line.rstrip().split(None, 6)) for line in result.split('\n')])
print(result)

# Save it to disk
f = open('./logs/cifar10.csv', 'w')
f.write(result)
f.close()
The cProfile output is divided into five columns:
1. ncalls: The number of times the function was called.
2. tottime: The total time spent in the function, not counting calls to other functions.
3. cumtime: The time spent in the function, including other function calls.
4. percall: The time spent in a single call of the function, obtained by dividing the total or cumulative time by the number of calls.
5. filename:lineno: The filename and corresponding line numbers. This information is not available when calling C extension modules.
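For a quick look at these columns without the CSV detour, pstats can also print just the top entries directly; a small sketch:

from pstats import Stats

stats = Stats("./logs/cifar10.prof")
stats.strip_dirs().sort_stats("tottime").print_stats(10)  # top 10 functions by tottime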
After converting the profiler logs into CSV format, you can analyze them like any other pandas DataFrame.
import pandas as pd

df = pd.read_csv('./logs/cifar10.csv', skipinitialspace=True)
# Display the column names
df.columns
# Rename the pandas column
df.rename(columns={'percall.1': 'module_name'}, inplace=True)
df = df.drop(['filename:lineno(function)'], axis=1)
df = df[:-1]  # Deleting the last row as part of data cleaning
df['total_num_calls'] = ""
df['total_num_calls'] = df.index  # Resetting the index
# Extracting the total calls from "total number of calls / number of recursions"
df['total_num_calls'] = df['total_num_calls'].astype(str).str.split("/", n=1).str[0]
# Change the data types
convert_dict = {'module_name': str,
                'total_num_calls': int,
                'tottime': float,
                'cumtime': float,
                'ncalls': float,
                'percall': float
                }
df = df.astype(convert_dict)
Derive the total time by multiplying the time taken per call by the total number of calls.
df['total_time'] = df['percall'] * df['total_num_calls']
df['total_time'] = df['total_time'].apply(np.ceil)
# Converting to seconds
df['total_time'] = df['total_time'] / 1000.0
# Create an extra copy for safety
df1 = df.copy()
Extract only the module name.
df1['module_name'] = df['module_name'].str.extract('(\\w+\\.py)')
# Dropping rows where NaN exists
df1 = df1.dropna(subset=['module_name'])

# Remove extra white space
df1['module_name'] = df1['module_name'].str.strip()
Create a DataFrame that groups all modules with the same name and sums their total time.
# Summing the total time of identical modules
df2 = df1.groupby(["module_name"])["total_time"].sum()
df2 = pd.DataFrame(df2)
df2 = df2.reset_index()
df2 = df2[df2['total_time'] > 1]
df2 = df2.sort_values(by='total_time', ascending=False)
df2
This code keeps only the modules that consume more than one second and sorts them in descending order. Next, plot the top 20 modules that consume the most time.
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# Plotting the 20 most expensive modules
top20 = df2.head(20)
g = sns.catplot(x='module_name', y='total_time', data=top20, kind='bar')
g.set_xticklabels(rotation=90)
g.set_xlabels("function called")
g.set_ylabels("Total_Time in seconds")
plt.title('Total Time Taken in seconds', weight='bold', fontsize=16)
plt.show()
To see the time taken by the convolution, batch normalization, and gradient tape modules used while creating the model, filter the DataFrame down to the selected modules.
# Keep only the modules whose name matches any of these patterns
pattern = ("grad|conv|backprop|normal|tape|array|tensor|"
           "deprecat|iter|pool|layer|ops|custom")
df3 = df2[df2['module_name'].str.contains(pattern)]
Plot it as before. The entire code can be viewed here.
The memory profiler summarizes the memory used by the process.
!pip install memory_profiler
!pip install psutil  # Optional dependency that makes memory_profiler run faster
%load_ext memory_profiler
%memit fit()  # Measure how much memory the script uses overall
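%memit reports a single peak figure. For line-by-line memory usage, memory_profiler also provides the %mprun magic, which profiles a function imported from a file; a minimal sketch (mem_demo.py and allocate are hypothetical names):

# Contents of mem_demo.py (a hypothetical helper file)
def allocate():
    big = [0] * 10_000_000  # allocate a large list
    del big                 # release it again

Then, in the notebook:

from mem_demo import allocate
%mprun -f allocate allocate()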
This guide described custom profiling of TensorFlow models. By running a profiler on your code, you can find memory leaks and bottlenecks, enabling you to optimize the code.