CUDA Tips: nvcc's -code, -arch, -gencode

Introduction

People are often confused by the -code, -arch, and -gencode options when compiling their CUDA code. Although the official guidance explains the differences between them, it is easy to miss the important details buried in the document. This post summarizes the rules for using these options and their compatibility with each other.

Compiler Options:

  • -arch: specifies the virtual compute architecture that the PTX code should be generated against. The valid format is -arch=compute_XY.
  • -code: specifies the real SM architecture that the SASS code should be generated against and included in the binary. The valid format is -code=sm_XY.
  • -code: can also specify which PTX code should be included in the binary for forward compatibility. The valid format is -code=compute_XY.
  • -gencode: combines -arch and -code. The valid format is -gencode=arch=compute_XY,code=sm_XY. A minimal example command is shown after the figure.
  • The relationship between CUDA, PTX, and SASS code is summarized in the following figure.

[Figure: nvcc options — relationship between CUDA, PTX, and SASS code]
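
To make this concrete, here is a minimal sketch of how these options appear on an nvcc command line. The source file name vector_add.cu is just a placeholder; for a single target, the -arch/-code form and the -gencode form produce the same result.

    # Generate compute_80 PTX (virtual architecture) and compile it down to
    # sm_80 SASS (real architecture) in a single command.
    nvcc -gencode=arch=compute_80,code=sm_80 -o vector_add vector_add.cu

    # Equivalent shorthand using -arch and -code directly.
    nvcc -arch=compute_80 -code=sm_80 -o vector_add vector_add.cu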

Compatibility:

  • Within the same major version, -arch=compute_XY is compatible with -code=sm_XZ when Y≤Z, i.e., the virtual architecture must not be newer than the real architecture. Examples:
    • -gencode arch=compute_100,code=sm_103 => OK
    • -gencode arch=compute_103,code=sm_100 => nvcc fatal : Incompatible code generation
  • Across major versions, -arch=compute_XY is compatible with -code=sm_ZW when X≤Z. Examples (a quick way to verify these rules is sketched after this list):
    • -gencode arch=compute_90,code=sm_100 => OK
    • -gencode arch=compute_100,code=sm_90 => nvcc fatal : Incompatible code generation
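
A quick way to verify these rules yourself is to compile a trivial file with both a compatible and an incompatible combination. Here test.cu is a hypothetical, essentially empty source file, and the architectures used must be supported by your CUDA toolkit version.

    # Compatible: the virtual architecture (compute_90) is not newer than
    # the real architecture (sm_100), so this compiles fine.
    nvcc -gencode arch=compute_90,code=sm_100 -c test.cu -o test_ok.o

    # Incompatible: the virtual architecture (compute_100) is newer than
    # the real architecture (sm_90); nvcc aborts with
    # "nvcc fatal : Incompatible code generation".
    nvcc -gencode arch=compute_100,code=sm_90 -c test.cu -o test_bad.o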

Compiled Results:

  • cubin: contains device binary code for a single architecture.
  • fatbin: may contain multiple PTX and cubin files.
    • Note that the executable compiled by nvcc is not itself a cubin/fatbin, but it embeds cubin/fatbin files.
  • cuobjdump is a command-line tool that can be used to examine or disassemble cubin/fatbin files or host executables (a few more of its options are sketched after this list).
    • Basic usage:
        cuobjdump <executable>
      
    • A handy pipeline of commands to condense the output for brevity:
        # Keep only the "Fatbin ..." and "arch = ..." lines, join each
        # consecutive pair onto one line, then count the unique combinations.
        cuobjdump <executable> \
          | grep '\(Fatbin\|arch =\)' \
          | awk 'NR % 2 == 1 { o=$0 ; next } { print o " " $0 }' \
          | sort | uniq -c

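Besides the pipeline above, cuobjdump also has options to list or dump the embedded images directly. The flags below are standard cuobjdump options, though exact availability may depend on your toolkit version.

    # List the embedded cubin and PTX images without dumping their contents.
    cuobjdump -lelf <executable>
    cuobjdump -lptx <executable>

    # Dump the embedded SASS or PTX for inspection.
    cuobjdump -sass <executable>
    cuobjdump -ptx <executable>
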

Runtime Compatibility:

  • An executable generated with -code=sm_XY is only runnable on the X.Y architecture.
  • An executable generated with -code=compute_XY is runnable on an X'.Y' architecture via JIT compilation when X'.Y' ≥ X.Y (see the sketch after this list).
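
As a sketch of how the JIT path can be exercised: the build below embeds only compute_80 PTX (the file name app.cu is a placeholder), and the CUDA_FORCE_PTX_JIT environment variable, as I understand its documented behavior, forces the driver to ignore embedded SASS and JIT-compile from PTX, which is handy for testing forward compatibility.

    # Embed only PTX (no SASS): any GPU with architecture >= 8.0 can run
    # this after the driver JIT-compiles the compute_80 PTX at load time.
    nvcc -arch=compute_80 -code=compute_80 -o app app.cu

    # Force JIT from PTX even when matching SASS is present, to test the
    # forward-compatibility path explicitly.
    CUDA_FORCE_PTX_JIT=1 ./app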

Use Cases:

  • Single SASS: -arch=compute_80 -code=sm_80
    • “I want the code to only run on sm80 GPUs.”
  • Single PTX + Single SASS: -arch=compute_80 -code=compute_80,sm_80
    • “I want the code to work on sm80 GPUs directly and be forward compatible with future GPUs by JIT-compiling the compute_80 PTX.”
  • Single PTX + Multiple SASS: -arch=compute_80 -code=compute_80,sm_80,sm_86
    • “I want the code to work on sm80 and sm86 GPUs directly and be forward compatible with future GPUs by JIT-compiling the compute_80 PTX.”
  • Multiple PTX + Single SASS:
    -gencode=arch=compute_80,code=sm_80 \
    -gencode=arch=compute_86,code=compute_86 \
    -gencode=arch=compute_89,code=compute_89
    
    • “I want the code to run directly on sm80 GPUs, use JIT compilation with compute_86 on sm86 GPUs, and JIT compile using compute_89 on other future GPUs”
  • Multiple PTX + Multiple SASS
    -gencode=arch=compute_80,code=sm_80 \
    -gencode=arch=compute_80,code=compute_80 \
    -gencode=arch=compute_89,code=sm_89 \
    -gencode=arch=compute_89,code=compute_89
    
    • “I want the code to run directly on sm80 and sm89 GPUs, use JIT compilation with compute_80 on sm86 GPUs, and JIT compilation with compute_89 on future GPUs.” (A complete build-and-inspect example follows this list.)
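
Tying a use case back to the inspection commands shown earlier, the sketch below builds the “Single PTX + Multiple SASS” variant and then checks which PTX/SASS images actually ended up in the fatbinary; saxpy.cu and saxpy are placeholder names.

    # Embed compute_80 PTX plus sm_80 and sm_86 SASS in one executable.
    nvcc -arch=compute_80 -code=compute_80,sm_80,sm_86 -o saxpy saxpy.cu

    # Summarize the embedded PTX/SASS images per architecture.
    cuobjdump saxpy \
      | grep '\(Fatbin\|arch =\)' \
      | awk 'NR % 2 == 1 { o=$0 ; next } { print o " " $0 }' \
      | sort | uniq -c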

Architecture and Family Specific Options

CUDA 12.9 introduces Family-Specific Architecture features (link), complementing the Architecture-Specific features introduced with NVIDIA Hopper. These allow developers to optimize code using architecture-specific features with different levels of portability:

  • compute_100a includes all available instructions for the architecture but offers no forward compatibility
  • compute_100f includes a subset of architecture-specific instructions that maintain compatibility within the same GPU family (same major version)
  • compute_100 provides maximum forward compatibility across GPU generations but supports only the most basic instruction set

This relationship is illustrated in the Venn diagram below, followed by example compile commands for the three variants:

[Figure: Venn diagram of architecture- and family-specific flags (compute_100 ⊂ compute_100f ⊂ compute_100a)]
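
As a rough sketch of the trade-off described above, the commands below compile the same hypothetical file kernel.cu against the three variants; the flag combinations mirror the compatibility table in the next section.

    # Architecture-specific: all instructions available on sm_100,
    # no forward compatibility beyond sm_100 itself.
    nvcc -gencode arch=compute_100a,code=sm_100a -c kernel.cu -o kernel_a.o

    # Family-specific: a subset of the architecture-specific instructions,
    # compatible within the same GPU family (same major version).
    nvcc -gencode arch=compute_100f,code=sm_100f -c kernel.cu -o kernel_f.o

    # Baseline: maximum forward compatibility, most basic instruction set.
    nvcc -gencode arch=compute_100,code=sm_100 -c kernel.cu -o kernel_base.o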

Compatibility

The following table shows the compatibility matrix between PTX and binary code generation options. While some combinations are included for completeness (like -gencode arch=compute_100,code=sm_100a) and may not be commonly used in practice, this overview helps illustrate the relationships between architecture and family-specific flags.

  • -arch=compute_X is compatible with -code=sm_X[a|f]*. Examples:
    • -gencode arch=compute_100,code=sm_100 => OK
    • -gencode arch=compute_100,code=sm_100f => OK (A)
    • -gencode arch=compute_100,code=sm_100a => OK
  • -arch=compute_Xa is compatible with -code=sm_Xa. Examples:
    • -gencode arch=compute_100a,code=sm_100 => nvcc fatal : Incompatible code generation
    • -gencode arch=compute_100a,code=sm_100f => nvcc fatal : Incompatible code generation
    • -gencode arch=compute_100a,code=sm_100a => OK
  • -arch=compute_Xf is compatible with -code=sm_X[a|f]*. Examples:
    • -gencode arch=compute_100f,code=sm_100 => OK (Equivalent with the above A)
    • -gencode arch=compute_100f,code=sm_100f => OK (Equivalent with the above A)
    • -gencode arch=compute_100f,code=sm_100a => OK
