CUDA Tips: nvcc's -code, -arch, -gencode

Introduction

People are often confused by nvcc's -arch, -code, and -gencode options when compiling CUDA code. Although the official guidance explains the differences between them, it is easy to miss the important details buried in the document. This post summarizes the rules for using these options and how their values must relate to one another.

Compiler Options:

  • -arch: specifies the virtual compute architecture against which the PTX code is generated. The valid format is -arch=compute_XY.
  • -code: specifies the real sm architecture against which the SASS code is generated and embedded in the binary. The valid format is -code=sm_XY.
  • -code: can also specify which PTX code should be embedded in the binary for forward compatibility. The valid format is -code=compute_XY.
  • -gencode: combines -arch and -code. The valid format is -gencode=arch=compute_XY,code=sm_XY.
  • The relationship between CUDA, PTX, and SASS code is summarized in the following figure.

[Figure: nvcc options — from CUDA source to PTX (virtual architecture) to SASS (real architecture)]
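
For instance, a minimal invocation (the source file name vector_add.cu is just a placeholder) generates PTX for the compute_80 virtual architecture and then compiles it into SASS for the real sm_80 architecture:

    nvcc -arch=compute_80 -code=sm_80 vector_add.cu -o vector_add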

Compile-Time Compatibility:

  • -arch=compute_Xa is compatible with -code=sm_Xb only when a ≤ b; a virtual architecture newer than the real one is rejected (see the examples after this list).
  • -arch=compute_X* is incompatible with -code=sm_Y* when the major versions X and Y differ.
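
A quick illustration of both directions within one major architecture, assuming a placeholder source file kernel.cu:

    nvcc -arch=compute_80 -code=sm_86 kernel.cu -o kernel   # OK: compute_80 PTX compiled to sm_86 SASS
    nvcc -arch=compute_86 -code=sm_80 kernel.cu -o kernel   # error: the virtual architecture is newer than the real one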

Compiled Results:

  • cubin: contains device binary code for a single architecture.
  • fatbin: may contain multiple PTX and cubin files.
    • Note that the executable produced by nvcc is not itself a cubin/fatbin; rather, it embeds cubin/fatbin files.
  • cuobjdump is a command-line tool that can examine or disassemble cubin/fatbin files as well as host executables (a few more invocations are sketched after this list).
    • A handy pipeline to obtain a more concise summary of the embedded code:
        cuobjdump <executable>  | grep '\(Fatbin\|arch =\)' | awk 'NR % 2 == 1 { o=$0 ; next } { print o " " $0 }' | sort | uniq -c
      
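Beyond the pipeline above, a few cuobjdump invocations that are often useful (a.out is a placeholder executable):

    cuobjdump -lelf a.out   # list the embedded cubin (ELF) files
    cuobjdump -lptx a.out   # list the embedded PTX files
    cuobjdump -sass a.out   # disassemble the embedded SASS
    cuobjdump -ptx a.out    # dump the embedded PTX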

Runtime Compatibility:

  • An executable generated by -code=sm_Xa is runnable only on architecture X.b where b ≥ a; SASS is never portable across major architectures.
  • An executable generated by -code=compute_Xa is runnable on architecture X.b with JIT compilation when b ≥ a.
  • More generally, an executable generated by -code=compute_ab is runnable on architecture c.d with JIT compilation when c.d ≥ a.b (a JIT example follows this list).
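
As a sketch (kernel.cu is again a placeholder), the following build embeds only PTX, so the binary relies entirely on JIT compilation at load time:

    # Embed only compute_80 PTX; no SASS is included in the binary.
    nvcc -arch=compute_80 -code=compute_80 kernel.cu -o kernel
    # On any GPU of compute capability >= 8.0, the driver JIT-compiles the PTX when the binary loads.
    ./kernel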

Use Cases:

  • Single SASS: -arch=compute_80 -code=sm_80
  • Single PTX + Single SASS: -arch=compute_80 -code=compute_80,sm_80
  • Single PTX + Multiple SASS: -arch=compute_80 -code=compute_80,sm_80,sm_86
  • Multiple PTX + Single SASS:
    -gencode=arch=compute_80,code=sm_80 \
    -gencode=arch=compute_80,code=compute_80 \
    -gencode=arch=compute_86,code=compute_86
    
  • Multiple PTX + Multiple SASS:
    -gencode=arch=compute_80,code=sm_80 \
    -gencode=arch=compute_80,code=compute_80 \
    -gencode=arch=compute_86,code=sm_86 \
    -gencode=arch=compute_86,code=compute_86
    
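Putting the last use case together as a single complete command (kernel.cu is a placeholder), whose embedded code can then be inspected with the cuobjdump pipeline above:

    nvcc kernel.cu -o kernel \
      -gencode=arch=compute_80,code=sm_80 \
      -gencode=arch=compute_80,code=compute_80 \
      -gencode=arch=compute_86,code=sm_86 \
      -gencode=arch=compute_86,code=compute_86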
