CUDA Tips: nvcc's -code, -arch, -gencode

Introduction

People are often confused by the -code, -arch, and -gencode options when compiling their CUDA code. Although the official guidance explains the differences between them, it is easy to miss the important details buried in the document. This post summarizes the rules for using these options and their compatibility with each other.

Compiler Options:

  • -arch: specifies the virtual compute architecture that the PTX code should be generated against. The valid format is -arch=compute_XY.
  • -code: specifies the real SM architecture that the SASS code should be generated against and included in the binary. The valid format is -code=sm_XY.
  • -code: can also specify which PTX code should be included in the binary for forward compatibility. The valid format is -code=compute_XY.
  • -gencode: combines -arch and -code. The valid format is -gencode=arch=compute_XY,code=sm_XY. A minimal example command is shown after the figure.
  • The relationship between CUDA, PTX, and SASS code is summarized in the following figure.

[Figure: nvcc options — relationship between CUDA, PTX, and SASS code]
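
To make this concrete, here is a minimal sketch of how these options appear on an nvcc command line. The source file name vector_add.cu is just a placeholder; for a single target, the -arch/-code form and the -gencode form produce the same result.

    # Generate compute_80 PTX (virtual architecture) and compile it down to
    # sm_80 SASS (real architecture) in a single command.
    nvcc -gencode=arch=compute_80,code=sm_80 -o vector_add vector_add.cu

    # Equivalent shorthand using -arch and -code directly.
    nvcc -arch=compute_80 -code=sm_80 -o vector_add vector_add.cu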

Compatibility:

  • Within the same major version, -arch=compute_XY is compatible with -code=sm_XZ when Y≤Z, i.e., the virtual architecture must not be newer than the real architecture. Examples:
    • -gencode arch=compute_100,code=sm_103 => OK
    • -gencode arch=compute_103,code=sm_100 => nvcc fatal : Incompatible code generation
  • Across major versions, -arch=compute_XY is compatible with -code=sm_ZW when X≤Z. Examples (a quick way to verify these rules is sketched after this list):
    • -gencode arch=compute_90,code=sm_100 => OK
    • -gencode arch=compute_100,code=sm_90 => nvcc fatal : Incompatible code generation
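
A quick way to verify these rules yourself is to compile a trivial file with both a compatible and an incompatible combination. Here test.cu is a hypothetical, essentially empty source file, and the architectures used must be supported by your CUDA toolkit version.

    # Compatible: the virtual architecture (compute_90) is not newer than
    # the real architecture (sm_100), so this compiles fine.
    nvcc -gencode arch=compute_90,code=sm_100 -c test.cu -o test_ok.o

    # Incompatible: the virtual architecture (compute_100) is newer than
    # the real architecture (sm_90); nvcc aborts with
    # "nvcc fatal : Incompatible code generation".
    nvcc -gencode arch=compute_100,code=sm_90 -c test.cu -o test_bad.o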

Compiled Results:

  • cubin: contains device binary code for a single architecture.
  • fatbin: may contain multiple PTX and cubin files.
    • Note that the executable compiled by nvcc is not itself a cubin/fatbin, but it embeds cubin/fatbin files.
  • cuobjdump is a command-line tool that can be used to examine or disassemble cubin/fatbin files or host executables (a few more of its options are sketched after this list).
    • Basic usage:
        cuobjdump <executable>
      
    • A handy pipeline of commands to condense the output for brevity:
        # Keep only the "Fatbin ..." and "arch = ..." lines, join each
        # consecutive pair onto one line, then count the unique combinations.
        cuobjdump <executable> \
          | grep '\(Fatbin\|arch =\)' \
          | awk 'NR % 2 == 1 { o=$0 ; next } { print o " " $0 }' \
          | sort | uniq -c

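Besides the pipeline above, cuobjdump also has options to list or dump the embedded images directly. The flags below are standard cuobjdump options, though exact availability may depend on your toolkit version.

    # List the embedded cubin and PTX images without dumping their contents.
    cuobjdump -lelf <executable>
    cuobjdump -lptx <executable>

    # Dump the embedded SASS or PTX for inspection.
    cuobjdump -sass <executable>
    cuobjdump -ptx <executable>
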

Runtime Compatibility:

  • An executable generated with -code=sm_XY is only runnable on the X.Y architecture.
  • An executable generated with -code=compute_XY is runnable on an X'.Y' architecture via JIT compilation when X'.Y' ≥ X.Y (see the sketch after this list).
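
As a sketch of how the JIT path can be exercised: the build below embeds only compute_80 PTX (the file name app.cu is a placeholder), and the CUDA_FORCE_PTX_JIT environment variable, as I understand its documented behavior, forces the driver to ignore embedded SASS and JIT-compile from PTX, which is handy for testing forward compatibility.

    # Embed only PTX (no SASS): any GPU with architecture >= 8.0 can run
    # this after the driver JIT-compiles the compute_80 PTX at load time.
    nvcc -arch=compute_80 -code=compute_80 -o app app.cu

    # Force JIT from PTX even when matching SASS is present, to test the
    # forward-compatibility path explicitly.
    CUDA_FORCE_PTX_JIT=1 ./app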

Use Cases:

  • Single SASS: -arch=compute_80 -code=sm_80
    • “I want the code to only run on sm80 GPUs.”
  • Single PTX + Single SASS: -arch=compute_80 -code=compute_80,sm_80
    • “I want the code to work on sm80 GPUs directly and be forward compatible with future GPUs by JIT-compiling the compute_80 PTX.”
  • Single PTX + Multiple SASS: -arch=compute_80 -code=compute_80,sm_80,sm_86
    • “I want the code to work on sm80 and sm86 GPUs directly and be forward compatible with future GPUs by JIT-compiling the compute_80 PTX.”
  • Multiple PTX + Single SASS:
    -gencode=arch=compute_80,code=sm_80 \
    -gencode=arch=compute_86,code=compute_86 \
    -gencode=arch=compute_89,code=compute_89
    
    • “I want the code to run directly on sm80 GPUs, use JIT compilation with compute_86 on sm86 GPUs, and JIT compile using compute_89 on other future GPUs”
  • Multiple PTX + Multiple SASS
    -gencode=arch=compute_80,code=sm_80 \
    -gencode=arch=compute_80,code=compute_80 \
    -gencode=arch=compute_89,code=sm_89 \
    -gencode=arch=compute_89,code=compute_89
    
    • “I want the code to run directly on sm80 and sm89 GPUs, use JIT compilation with compute_80 on sm86 GPUs, and JIT compilation with compute_89 on future GPUs.” (A complete build-and-inspect example follows this list.)
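
Tying a use case back to the inspection commands shown earlier, the sketch below builds the “Single PTX + Multiple SASS” variant and then checks which PTX/SASS images actually ended up in the fatbinary; saxpy.cu and saxpy are placeholder names.

    # Embed compute_80 PTX plus sm_80 and sm_86 SASS in one executable.
    nvcc -arch=compute_80 -code=compute_80,sm_80,sm_86 -o saxpy saxpy.cu

    # Summarize the embedded PTX/SASS images per architecture.
    cuobjdump saxpy \
      | grep '\(Fatbin\|arch =\)' \
      | awk 'NR % 2 == 1 { o=$0 ; next } { print o " " $0 }' \
      | sort | uniq -c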

Architecture and Family Specific Options

CUDA 12.9 introduces Family-Specific Architecture features (link), complementing the Architecture-Specific features introduced with NVIDIA Hopper. These allow developers to optimize code using architecture-specific features with different levels of portability:

  • compute_100a includes all available instructions for the architecture but offers no forward compatibility
  • compute_100f includes a subset of architecture-specific instructions that maintain compatibility within the same GPU family (same major version)
  • compute_100 provides maximum forward compatibility across GPU generations but supports only the most basic instruction set

This relationship is illustrated in the Venn diagram below, followed by example compile commands for the three variants:

[Figure: Venn diagram of architecture- and family-specific flags (compute_100 ⊂ compute_100f ⊂ compute_100a)]
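
As a rough sketch of the trade-off described above, the commands below compile the same hypothetical file kernel.cu against the three variants; the flag combinations mirror the compatibility table in the next section.

    # Architecture-specific: all instructions available on sm_100,
    # no forward compatibility beyond sm_100 itself.
    nvcc -gencode arch=compute_100a,code=sm_100a -c kernel.cu -o kernel_a.o

    # Family-specific: a subset of the architecture-specific instructions,
    # compatible within the same GPU family (same major version).
    nvcc -gencode arch=compute_100f,code=sm_100f -c kernel.cu -o kernel_f.o

    # Baseline: maximum forward compatibility, most basic instruction set.
    nvcc -gencode arch=compute_100,code=sm_100 -c kernel.cu -o kernel_base.o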

Compatibility

The following table shows the compatibility matrix between PTX and binary code generation options. While some combinations are included for completeness (like -gencode arch=compute_100,code=sm_100a) and may not be commonly used in practice, this overview helps illustrate the relationships between architecture and family-specific flags.

  • -arch=compute_X is compatible with -code=sm_X[a|f]*. Examples:
    • -gencode arch=compute_100,code=sm_100 => OK
    • -gencode arch=compute_100,code=sm_100f => OK (A)
    • -gencode arch=compute_100,code=sm_100a => OK
  • -arch=compute_Xa is compatible with -code=sm_Xa. Examples:
    • -gencode arch=compute_100a,code=sm_100 => nvcc fatal : Incompatible code generation
    • -gencode arch=compute_100a,code=sm_100f => nvcc fatal : Incompatible code generation
    • -gencode arch=compute_100a,code=sm_100a => OK
  • -arch=compute_Xf is compatible with -code=sm_X[a|f]*. Examples:
    • -gencode arch=compute_100f,code=sm_100 => OK (Equivalent with the above A)
    • -gencode arch=compute_100f,code=sm_100f => OK (Equivalent with the above A)
    • -gencode arch=compute_100f,code=sm_100a => OK
