When training neural networks with the Keras API, we care about the data types and computation types, since they affect both convergence (numeric stability) and performance (memory footprint and computation efficiency). There are multiple "knobs" we can turn to change these types:

- the `dtype` of the input tensors, or an explicit `tf.cast` on them;
- the `dtype` of the Keras layer, which defines the data type of the layer's computations and weights;
- the environment variable `TF_FP16_CONV_USE_FP32_COMPUTE`, which controls the computation data type;
- `mixed_precision.set_global_policy('mixed_float16')`.

This may seem confusing: it is not obvious how the different settings interact, or what actual weight/output data type and computation data type to expect.
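To make the knobs concrete, here is a minimal sketch showing where each setting goes (the `Dense` layer and tensor shapes are arbitrary choices for illustration; note that `TF_FP16_CONV_USE_FP32_COMPUTE` specifically targets fp16 convolutions, so it is shown only for completeness here):

```python
import os
# Knob 3: must be set before TensorFlow initializes; it controls whether
# fp16 convolutions accumulate in fp32 (it does not affect this Dense layer).
os.environ['TF_FP16_CONV_USE_FP32_COMPUTE'] = '1'

import tensorflow as tf
from tensorflow.keras import mixed_precision

# Knob 4: the global policy ('mixed_float16' = fp32 weights, fp16 compute).
mixed_precision.set_global_policy('mixed_float16')

# Knob 1: the dtype of the input tensor, or an explicit tf.cast.
x = tf.random.normal((2, 8), dtype=tf.float32)
x = tf.cast(x, tf.float16)

# Knob 2: the per-layer dtype, which overrides the global policy.
dense = tf.keras.layers.Dense(4, dtype='float16')

y = dense(x)
print(dense.kernel.dtype)  # storage dtype of the weights
print(y.dtype)             # dtype of the layer output
```

Because the per-layer `dtype='float16'` overrides the global policy, this corresponds to the `fp16` layer rows of the table below rather than the `mixed` rows.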
Therefore, I swept through all possible combinations of these settings (fp16 or fp32). The table below summarizes the resulting weight/output data types and computation data types for each combination; I hope the examples help.
| Layer | Input | TF_FP16_CONV_USE_FP32_COMPUTE | Weight | Computation | Output |
|---|---|---|---|---|---|
| fp32 | fp32 | 1 | fp32 | fp32 | fp32 |
| fp32 | fp32 | 0 | fp32 | fp32 | fp32 |
| fp32 | fp16(⇒fp32) | 1 | fp32 | fp32 | fp32 |
| fp32 | fp16(⇒fp32) | 0 | fp32 | fp32 | fp32 |
| fp16 | fp32(⇒fp16) | 1 | fp16 | fp32 | fp16 |
| fp16 | fp32(⇒fp16) | 0 | fp16 | fp16 | fp16 |
| fp16 | fp16 | 1 | fp16 | fp32 | fp16 |
| fp16 | fp16 | 0 | fp16 | fp16 | fp16 |
| mixed | fp32(⇒fp16) | 1 | fp32(⇒fp16) | fp32 | fp16 |
| mixed | fp32(⇒fp16) | 0 | fp32(⇒fp16) | fp16 | fp16 |
| mixed | fp16 | 1 | fp32(⇒fp16) | fp32 | fp16 |
| mixed | fp16 | 0 | fp32(⇒fp16) | fp16 | fp16 |
A(⇒B) means the data is stored in dtype A but is automatically cast to B before use.
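The `mixed` rows can be checked directly. Here is a minimal sketch (the layer size and input shape are arbitrary) confirming that under `mixed_float16` the variables stay in fp32 while the layer computes in, and outputs, fp16:

```python
import tensorflow as tf
from tensorflow.keras import mixed_precision

mixed_precision.set_global_policy('mixed_float16')

layer = tf.keras.layers.Dense(4)
y = layer(tf.ones((2, 8), dtype=tf.float32))  # fp32 input is cast to fp16

print(layer.variable_dtype)  # storage dtype of the weights
print(layer.compute_dtype)   # dtype the layer computes in
print(y.dtype)               # dtype of the output
```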
Basically, we can summarize the behavior with four rules:

1. The layer's dtype determines the weight storage type; under the mixed policy, weights are stored in fp32 but automatically cast to fp16 for computation.
2. Inputs are automatically cast to the layer's compute dtype (fp32 for fp32 layers; fp16 for fp16 and mixed layers).
3. When the compute dtype is fp16, `TF_FP16_CONV_USE_FP32_COMPUTE=1` forces the actual computation to run in fp32; it has no effect when the compute dtype is already fp32.
4. The output dtype always matches the layer's compute dtype.