In [1]:
# Some non-essential packages we'll be using for this notebook
using Unitful
using Humanize
using BenchmarkTools

Programming NVIDIA GPUs with CUDAnative.jl

by: Tim Besard (GitHub: maleadt)

Talk: how does it work, what does it enable.

Workshop: how do you use CUDAnative.jl, realistically.

Follow along:

CUDA C replacement:

  • low-level, kernel programming
  • need to understand parallel & GPU programming

High-level wrappers: e.g. GPUArrays.jl (see the sketch below)

Very usable, but restricted:

  • w.r.t. Julia: only a language subset
  • w.r.t. CUDA: limited functionality
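For contrast, a hedged sketch of the array-style code such wrappers aim for (the wrapper type SomeGPUArray is hypothetical, and the plain CuArray used later in this notebook does not support this):

# hypothetical wrapper-style code: element-wise work written as broadcasts,
# no hand-written kernel, no launch configuration
d_a = SomeGPUArray(rand(Float32, 1024))
d_b = SomeGPUArray(rand(Float32, 1024))
d_c = d_a .+ d_b          # executes on the GPU
c   = Array(d_c)          # copy the result back to the host

The rest of this material stays at the kernel level, where that machinery is under your control.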

Table of contents

  1. Requirements
  2. Installation
  3. Programming model
  4. Usage
  5. Performance & Optimization
  6. Debugging

1. Requirements

Operating system: Linux, or macOS (unsupported)

Julia: 0.6, or 0.7 (unsupported)

GPU: NVIDIA CUDA, sm >= 2.0

2. Installation

Dependencies: CUDAdrv.jl & LLVM.jl

LLVM.jl requires a source build of Julia:

$ git clone https://github.com/JuliaLang/julia.git
$ cd julia
$ git checkout v0.6.0
$ make
$ ./julia

Pkg.add("CUDAnative")

  • Julia 0.6, LLVM 3.9.1
  • Linux x64 (Arch Linux)
  • NVIDIA driver 375.39
  • CUDA 7.5
  • NVIDIA GeForce GTX Titan
  • Intel Core i7-3770k

3. Programming model

Single Program, Multiple Data

In [2]:
# emulate SPMD execution on the CPU: run the kernel once per "thread",
# exposing the current thread id through a global variable
function execute(kernel, threads, args...)
    global thread
    for thread in 1:threads
        kernel(args...)
    end
end;
In [3]:
function kernel(a, b, c)
    c[thread] = a[thread] + b[thread]
    return
end;
In [4]:
a = [1,2,3]
b = [4,5,6]
c = similar(a)

execute(kernel, length(a), a, b, c)

a+b == c
Out[4]:
true

Conceptually simple, but implementations are complex:

  • threads ∈ blocks ∈ grid
  • storage behavior and performance
  • thread communication & synchronization

Again, CUDAnative.jl does not obviate that complexity.
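As a plain host-side illustration (my sketch, not a CUDAnative API): the thread/block/grid hierarchy boils down to reconstructing a global index from per-block and per-thread counters, which is exactly what the real kernel in In [28] later does on the device.

# host-side sketch of the index arithmetic used inside kernels
global_index(blockidx, blockdim, threadidx) = (blockidx - 1) * blockdim + threadidx
global_index(2, 256, 1)    # == 257: the first thread of the second block of 256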

4. Usage

CUDAnative.jl takes care of:

  • intrinsics
  • compiler support
  • language integration

Other CUDA operations: manual use of e.g. CUDAdrv.jl

  • device management
  • memory operations
  • context handling
  • ...
In [5]:
using CUDAdrv, CUDAnative
In [6]:
dev = CuDevice(0)
Out[6]:
CuDevice(0): GeForce GTX TITAN
In [7]:
ctx = CuContext(dev)
# ...
destroy!(ctx)
In [8]:
CuContext(dev) do ctx
    # ...
end
In [9]:
ctx = CuContext(dev);

GPU arrays

In [10]:
?CuArray
search: CuArray CuDeviceArray

Out[10]:
CuArray{T}(dims)
CuArray{T,N}(dims)

Construct an uninitialized N-dimensional dense CUDA array with element type T, where N is determined from the length or number of dims. dims may be a tuple or a series of integer arguments corresponding to the lengths in each dimension. If the rank N is supplied explicitly as in Array{T,N}(dims), then it must match the length or number of dims.

Type aliases CuVector and CuMatrix are available for respectively 1 and 2-dimensional data.

CuArray{T}(src::Array{T})

Transfer a host array src to device, returning a CuArray.

In [11]:
d_a = CuArray{Int}(2)
Out[11]:
2-element Array{Int64,1}:
 4737786809106563072
 4719772410588954624
In [12]:
d_a[1] = 1
indexing not defined for CUDAdrv.CuArray{Int64,1}

Stacktrace:
 [1] setindex!(::CUDAdrv.CuArray{Int64,1}, ::Int64, ::Int64) at ./abstractarray.jl:966
In [13]:
a = Array(d_a)
Out[13]:
2-element Array{Int64,1}:
 4737786809106563072
 4719772410588954624

@cuda ... kernel(d_a)

Garbage collection

In [14]:
Mem.used() |> datasize
Out[14]:
"72.1 MB"
In [15]:
d_a = CuArray{Int}(10000000)
sizeof(d_a) |> datasize
Out[15]:
"80.0 MB"
In [16]:
Mem.used() |> datasize
Out[16]:
"152.2 MB"
In [17]:
finalize(d_a)
Mem.used() |> datasize
Out[17]:
"72.1 MB"

To avoid ERROR_OUT_OF_MEMORY: trigger gc() manually (sketch below)
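A minimal sketch of that pattern (assuming the large buffer is no longer needed): device memory is only released when the CuArray is finalized, so dropping the reference and forcing a collection makes room before the next big allocation.

d_a = nothing                    # drop the last reference to the large buffer
gc()                             # finalizers run and release the device memory
d_a = CuArray{Int}(10000000)     # a new large allocation now has room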

Basic example

Vector addition:

  • easy to parallelize
  • no communication, synchronization, ...
In [18]:
function kernel_vadd(a, b, c)
    i = threadIdx().x
    c[i] = a[i] + b[i]

    return
end;
In [19]:
len = 42
a = randn(Float32, len)
b = randn(Float32, len);
In [20]:
d_a = CuArray(a)
d_b = CuArray(b)
d_c = similar(d_a);
In [21]:
@cuda (1,len) kernel_vadd(d_a, d_b, d_c)  # asynchronous!

CUDA C equivalent: kernel_vadd<<<1,len>>>(d_a, d_b, d_c)

In [22]:
c = Array(d_c)                            # synchronizing
a+b ≈ c
Out[22]:
true

Scale it up

In [23]:
len = 2048
a = randn(Float32, len)
b = randn(Float32, len);
In [24]:
d_a = CuArray(a)
d_b = CuArray(b)
d_c = similar(d_a);
In [25]:
@cuda (1,len) kernel_vadd(d_a, d_b, d_c)
CUDA error: invalid argument (code #1, ERROR_INVALID_VALUE)

Stacktrace:
 [1] macro expansion at /home/tbesard/Projects/Julia-CUDA/CUDAdrv/src/base.jl:127 [inlined]
 [2] macro expansion at /home/tbesard/Projects/Julia-CUDA/CUDAdrv/src/execution.jl:84 [inlined]
 [3] _launch at /home/tbesard/Projects/Julia-CUDA/CUDAdrv/src/execution.jl:64 [inlined]
 [4] launch at /home/tbesard/Projects/Julia-CUDA/CUDAdrv/src/execution.jl:56 [inlined]
 [5] macro expansion at /home/tbesard/Projects/Julia-CUDA/CUDAdrv/src/execution.jl:172 [inlined]
 [6] _cudacall at /home/tbesard/Projects/Julia-CUDA/CUDAdrv/src/execution.jl:162 [inlined]
 [7] cudacall at /home/tbesard/Projects/Julia-CUDA/CUDAdrv/src/execution.jl:154 [inlined]
 [8] macro expansion at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/execution.jl:137 [inlined]
 [9] macro expansion at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/profile.jl:42 [inlined]
 [10] macro expansion at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/execution.jl:136 [inlined]
 [11] generated_cuda(::Tuple{Int64,Int64}, ::Int64, ::CUDAdrv.CuStream, ::#kernel_vadd, ::CUDAdrv.CuArray{Float32,1}, ::CUDAdrv.CuArray{Float32,1}, ::CUDAdrv.CuArray{Float32,1}) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/execution.jl:185
In [26]:
attribute(dev, CUDAdrv.MAX_THREADS_PER_BLOCK)
Out[26]:
1024
In [27]:
attribute.(dev, (CUDAdrv.MAX_BLOCK_DIM_X,
                 CUDAdrv.MAX_BLOCK_DIM_Y,
                 CUDAdrv.MAX_BLOCK_DIM_Z))
Out[27]:
(1024, 1024, 64)
In [28]:
function kernel_vadd(a, b, c)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    
    # guard: the last block may contain more threads than remaining elements
    if i <= length(c)
        c[i] = a[i] + b[i]
    end

    return
end;
In [29]:
threads = min(1024, len)
blocks = ceil(Int, len / threads)
@cuda (blocks,threads) kernel_vadd(d_a, d_b, d_c)
In [30]:
c = Array(d_c)
a+b ≈ c
Out[30]:
true
In [31]:
?CUDAdrv.CuDim3
Out[31]:
CuDim3(x)

CuDim3((x,))
CuDim3((x, y))
CuDim3((x, y, z))

A type used to specify dimensions, consisting of 3 integers for respectively the x, y and z dimension. Unspecified dimensions default to 1.

Often accepted as argument through the CuDim type alias, eg. in the case of cudacall, allowing to pass dimensions as a plain integer or tuple without having to construct an explicit CuDim3 object.
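A hedged sketch of what that flexibility looks like at the @cuda call site (the kernels and data below are hypothetical; the (griddim, blockdim) tuple form is the one used throughout this notebook):

# hypothetical kernels, only for illustrating the dimension syntax
@cuda (16, 256) kernel_1d(d_data)            # 16 blocks of 256 threads each
@cuda ((4,4), (16,16)) kernel_2d(d_image)    # a 4×4 grid of 16×16-thread blocks; z defaults to 1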

Reflection

CUDAnative defines: code_llvm, code_ptx, code_sass

Why? Custom toolchain

CUDAnative also defines:

  • @code_{lowered|typed|warntype}: defer to Base
  • @code_llvm, @code_ptx, @code_sass: defer to CUDAnative

Why?

  • decode @cuda (...) kernel(...)
  • type conversions: CuArray → CuDeviceArray
In [32]:
@code_llvm @cuda (1,len) kernel_vadd(d_a, d_b, d_c)
WARNING: both CUDAnative and Base export "@code_llvm"; uses of it in module Main must be qualified
UndefVarError: @code_llvm not defined
In [33]:
CUDAnative.@code_llvm @cuda (1,len) kernel_vadd(d_a, d_b, d_c)
; Function Attrs: nounwind
define void @julia_kernel_vadd_62142(%CuDeviceArray.4* nocapture readonly dereferenceable(16), %CuDeviceArray.4* nocapture readonly dereferenceable(16), %CuDeviceArray.4* nocapture readonly dereferenceable(16)) local_unnamed_addr #0 !dbg !5 {
top:
  %ptls_i8 = tail call i8* asm "movq %fs:0, $0;\0Aaddq $$-10888, $0", "=r,~{dirflag},~{fpsr},~{flags}"() #2
; Filename: /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/device/intrinsics/indexing.jl
; Source line: 18
  %3 = tail call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x(), !dbg !7
; Source line: 44
  %4 = tail call i32 @llvm.nvvm.read.ptx.sreg.ntid.x(), !dbg !13
; Source line: 10
  %5 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x(), !dbg !15
; Filename: In[28]
; Source line: 2
  %6 = add i32 %3, 1, !dbg !12
  %7 = zext i32 %6 to i64, !dbg !12
  %8 = add nsw i64 %7, -1, !dbg !12
  %9 = zext i32 %4 to i64, !dbg !12
  %10 = mul i64 %8, %9, !dbg !12
  %11 = add i32 %5, 1, !dbg !12
  %12 = zext i32 %11 to i64, !dbg !12
  %13 = add i64 %10, %12, !dbg !12
; Source line: 4
  %14 = getelementptr inbounds %CuDeviceArray.4, %CuDeviceArray.4* %2, i64 0, i32 0, i64 0, !dbg !19
  %15 = load i64, i64* %14, align 8, !dbg !19, !tbaa !20
  %16 = icmp sgt i64 %13, %15, !dbg !19
  br i1 %16, label %L48, label %if, !dbg !19

if:                                               ; preds = %top
; Source line: 5
  %17 = getelementptr inbounds %CuDeviceArray.4, %CuDeviceArray.4* %0, i64 0, i32 1, !dbg !23
  %18 = add i64 %13, -1, !dbg !23
  %19 = load float*, float** %17, align 8, !dbg !23, !tbaa !20
  %20 = getelementptr float, float* %19, i64 %18, !dbg !23
  %21 = load float, float* %20, align 8, !dbg !23, !tbaa !24
  %22 = getelementptr inbounds %CuDeviceArray.4, %CuDeviceArray.4* %1, i64 0, i32 1, !dbg !23
  %23 = load float*, float** %22, align 8, !dbg !23, !tbaa !20
  %24 = getelementptr float, float* %23, i64 %18, !dbg !23
  %25 = load float, float* %24, align 8, !dbg !23, !tbaa !24
  %26 = fadd float %21, %25, !dbg !23
  %27 = getelementptr inbounds %CuDeviceArray.4, %CuDeviceArray.4* %2, i64 0, i32 1, !dbg !23
  %28 = load float*, float** %27, align 8, !dbg !23, !tbaa !20
  %29 = getelementptr float, float* %28, i64 %18, !dbg !23
  store float %26, float* %29, align 8, !dbg !23, !tbaa !24
  br label %L48, !dbg !23

L48:                                              ; preds = %if, %top
; Source line: 8
  ret void, !dbg !26
}
In [34]:
@code_ptx kernel_vadd(d_a, d_b, d_c)
//
// Generated by LLVM NVPTX Back-End
//

.version 3.2
.target sm_35
.address_size 64

	.file	1 "./In[28]"
	// .globl	julia_kernel_vadd_62150

.visible .func julia_kernel_vadd_62150(
	.param .b64 julia_kernel_vadd_62150_param_0,
	.param .b64 julia_kernel_vadd_62150_param_1,
	.param .b64 julia_kernel_vadd_62150_param_2
)
{
	.reg .pred 	%p<2>;
	.reg .f32 	%f<4>;
	.reg .s32 	%r<7>;
	.reg .s64 	%rd<18>;

	ld.param.u64 	%rd4, [julia_kernel_vadd_62150_param_2];
	mov.u32	%r2, %ctaid.x;
	mov.u32	%r3, %ntid.x;
	mov.u32	%r4, %tid.x;
	.loc 1 2 0
	add.s32 	%r5, %r2, 1;
	cvt.u64.u32	%rd5, %r5;
	add.s64 	%rd6, %rd5, -1;
	cvt.u64.u32	%rd7, %r3;
	mul.lo.s64 	%rd8, %rd6, %rd7;
	add.s32 	%r6, %r4, 1;
	cvt.u64.u32	%rd9, %r6;
	add.s64 	%rd1, %rd8, %rd9;
	.loc 1 4 0
	ld.u64 	%rd10, [%rd4];
	setp.gt.s64	%p1, %rd1, %rd10;
	@%p1 bra 	LBB0_2;
	ld.param.u64 	%rd3, [julia_kernel_vadd_62150_param_1];
	ld.param.u64 	%rd2, [julia_kernel_vadd_62150_param_0];
	.loc 1 5 0
	ld.u64 	%rd11, [%rd2+8];
	shl.b64 	%rd12, %rd1, 2;
	add.s64 	%rd13, %rd11, %rd12;
	ld.f32 	%f1, [%rd13+-4];
	ld.u64 	%rd14, [%rd3+8];
	add.s64 	%rd15, %rd14, %rd12;
	ld.f32 	%f2, [%rd15+-4];
	add.f32 	%f3, %f1, %f2;
	ld.u64 	%rd16, [%rd4+8];
	add.s64 	%rd17, %rd16, %rd12;
	st.f32 	[%rd17+-4], %f3;
LBB0_2:
	.loc 1 8 0
	ret;
}


In [35]:
@code_sass kernel_vadd(d_a, d_b, d_c)
	code for sm_35
		Function : julia_kernel_vadd_62165
	.headerflags    @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)"
                                                                                   /* 0x08b010a010a01000 */
        /*0008*/                   MOV R1, c[0x0][0x44];                           /* 0x64c03c00089c0006 */
        /*0010*/                   S2R R0, SR_CTAID.X;                             /* 0x86400000129c0002 */
        /*0018*/                   MOV R6, c[0x0][0x150];                          /* 0x64c03c002a1c001a */
        /*0020*/                   IADD R0, R0, 0x1;                               /* 0xc0800000009c0001 */
        /*0028*/                   MOV R7, c[0x0][0x154];                          /* 0x64c03c002a9c001e */
        /*0030*/                   IADD R4.CC, R0, -0x1;                           /* 0xc88403ffff9c0011 */
        /*0038*/                   S2R R0, SR_TID.X;                               /* 0x86400000109c0002 */
                                                                                   /* 0x08b010b0b0118010 */
        /*0048*/                   LD.E.64 R2, [R6];                               /* 0xc5800000001c1808 */
        /*0050*/                   IADD.X R5, RZ, -0x1;                            /* 0xc88043ffff9ffc15 */
        /*0058*/                   IMUL.U32.U32 R9.CC, R4, c[0x0][0x28];           /* 0x61c40000051c1026 */
        /*0060*/                   IADD R0, R0, 0x1;                               /* 0xc0800000009c0001 */
        /*0068*/                   IMAD.U32.U32.HI.X R4.CC, R4, c[0x0][0x28], RZ;  /* 0x5217fc00051c1012 */
        /*0070*/                   IMAD.U32.U32.X R5, R5, c[0x0][0x28], R4;        /* 0x50101000051c1416 */
        /*0078*/                   IADD R4.CC, R9, R0;                             /* 0xe0840000001c2412 */
                                                                                   /* 0x08a0b0a000b0b010 */
        /*0088*/                   IADD.X R0, R5, RZ;                              /* 0xe08040007f9c1402 */
        /*0090*/                   ISUB RZ.CC, R4, R2;                             /* 0xe08c0000011c13fe */
        /*0098*/                   ISETP.GT.X.AND P0, PT, R0, R3, PT;              /* 0xdb485c00019c001e */
        /*00a0*/               @P0 BRA.U 0x168;                                    /* 0x120000006000023c */
        /*00a8*/              @!P0 MOV32I R5, 0x8;                                 /* 0x740000000423c016 */
        /*00b0*/              @!P0 IADD R10.CC, R5, c[0x0][0x140];                 /* 0x608400002820142a */
        /*00b8*/              @!P0 IADD.X R11, RZ, c[0x0][0x144];                  /* 0x6080400028a3fc2e */
                                                                                   /* 0x088c11909c81b010 */
        /*00c8*/              @!P0 IADD R14.CC, R5, c[0x0][0x148];                 /* 0x608400002920143a */
        /*00d0*/              @!P0 LD.E.64 R8, [R10];                              /* 0xc580000000202820 */
        /*00d8*/              @!P0 IADD.X R15, RZ, c[0x0][0x14c];                  /* 0x6080400029a3fc3e */
        /*00e0*/              @!P0 IADD R6.CC, R5, c[0x0][0x150];                  /* 0x608400002a20141a */
        /*00e8*/              @!P0 LD.E.64 R2, [R14];                              /* 0xc580000000203808 */
        /*00f0*/              @!P0 SHF.L.U64 R5, R4, 0x2, R0;                      /* 0xb7c0020001201015 */
        /*00f8*/              @!P0 IADD.X R7, RZ, c[0x0][0x154];                   /* 0x608040002aa3fc1e */
                                                                                   /* 0x08809c80b18080b0 */
        /*0108*/              @!P0 IMAD.U32.U32 R12.CC, R4, 0x4, R8;               /* 0xa004200002201031 */
        /*0110*/              @!P0 IADD.X R13, R9, R5;                             /* 0xe080400002a02436 */
        /*0118*/              @!P0 LD.E.64 R8, [R6];                               /* 0xc580000000201820 */
        /*0120*/              @!P0 IMAD.U32.U32 R10.CC, R4, 0x4, R2;               /* 0xa004080002201029 */
        /*0128*/              @!P0 IADD.X R11, R3, R5;                             /* 0xe080400002a00c2e */
        /*0130*/              @!P0 LD.E R3, [R12+-0x4];                            /* 0xc4fffffffe20300c */
        /*0138*/              @!P0 LD.E R0, [R10+-0x4];                            /* 0xc4fffffffe202800 */
                                                                                   /* 0x0800b81000a0a4b0 */
        /*0148*/              @!P0 IMAD.U32.U32 R4.CC, R4, 0x4, R8;                /* 0xa004200002201011 */
        /*0150*/              @!P0 IADD.X R5, R9, R5;                              /* 0xe080400002a02416 */
        /*0158*/              @!P0 FADD R0, R3, R0;                                /* 0xe2c0000000200c02 */
        /*0160*/              @!P0 ST.E [R4+-0x4], R0;                             /* 0xe4fffffffe201000 */
        /*0168*/                   MOV RZ, RZ;                                     /* 0xe4c03c007f9c03fe */
        /*0170*/                   EXIT;                                           /* 0x18000000001c003c */
        /*0178*/                   BRA 0x178;                                      /* 0x12007ffffc1c003c */
		........................................


Less basic example

Accumulate (an inclusive scan) down each column.

In [36]:
function cpu!(op::Function, data::Matrix{T}) where {T}
    for col in 1:size(data,2)
        accum = zero(T)
        for row in 1:size(data,1)
            accum = op(accum, data[row,col])
            data[row,col] = accum
        end
    end

    return
end
function cpu(op::Function, data::Matrix)
    data = copy(data)
    cpu!(op, data)
    return data
end
Out[36]:
cpu (generic function with 1 method)
In [37]:
const rows = 3
const cols = 4
a = collect(reshape(1:rows*cols, (rows, cols)))
Out[37]:
3×4 Array{Int64,2}:
 1  4  7  10
 2  5  8  11
 3  6  9  12
In [38]:
cpu(+, a)
Out[38]:
3×4 Array{Int64,2}:
 1   4   7  10
 3   9  15  21
 6  15  24  33
In [39]:
# naive GPU implementation
function gpu!(op::Function, data::CuDeviceMatrix)
    col = blockIdx().x
    row = threadIdx().x

    offset = 1
    while offset < row
        a = data[row,col]
        b = data[row - offset,col]
        sync_threads()

        data[row,col] = op(a,b)
        sync_threads()

        offset *= 2
    end

    return
end;
In [40]:
function gpu(op::Function, a::Matrix)
    d_a = CuArray(a)
    @cuda (size(a,2),size(a,1)) gpu!(op, d_a)
    Array(d_a)
end;
In [41]:
cpu(+, a) ≈ gpu(+, a)
Out[41]:
true
In [42]:
a = rand(Float32, 1000, 1000);
@benchmark cpu(+, a)
Out[42]:
BenchmarkTools.Trial: 
  memory estimate:  3.81 MiB
  allocs estimate:  2
  --------------
  minimum time:     1.065 ms (0.00% GC)
  median time:      1.081 ms (0.00% GC)
  mean time:        1.203 ms (9.32% GC)
  maximum time:     2.256 ms (29.94% GC)
  --------------
  samples:          4125
  evals/sample:     1
In [43]:
@benchmark gpu(+, a)
Out[43]:
BenchmarkTools.Trial: 
  memory estimate:  3.82 MiB
  allocs estimate:  31
  --------------
  minimum time:     1.639 ms (0.00% GC)
  median time:      1.655 ms (0.00% GC)
  mean time:        1.743 ms (4.84% GC)
  maximum time:     2.819 ms (23.53% GC)
  --------------
  samples:          2863
  evals/sample:     1

Benchmark the kernel time: CUDAdrv.@elapsed

In [44]:
function gpu_benchmark(op::Function, a::Matrix)
    d_a = CuArray(a)
    time = CUDAdrv.@elapsed begin
        @cuda (size(a,2),size(a,1)) gpu!(op, d_a)
    end    
    (time)u"s"
end;
In [45]:
gc()
minimum([gpu_benchmark(+,a) for _ in 1:1000]) |> u"ms"
Out[45]:
0.3496f0 ms

Dwarfed by data transfer time.
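A quick sketch to see that for yourself, reusing the 1000×1000 matrix a from above (the copy is timed with Base.@elapsed since it blocks the host, the launch with CUDAdrv.@elapsed; exact numbers depend on the GPU):

transfer = Base.@elapsed d_a = CuArray(a)                            # host → device copy
kernel   = CUDAdrv.@elapsed @cuda (size(a,2),size(a,1)) gpu!(+, d_a) # kernel only
(transfer)u"s" |> u"ms", (kernel)u"s" |> u"ms"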

Still, many possible kernel optimizations:

  • work-efficiency
  • shared memory cache
  • memory load optimizations
  • intra-warp communication without synchronization

Only going to discuss performance & optimization of Julia constructs!

5. Performance & Optimization

Launch performance

julia> using CUDAnative

julia> foo() = nothing

julia> Base.@elapsed @cuda (1,1) foo()
8.501839023 s
julia> Base.@elapsed @cuda (1,1) foo()
36.033 μs

As fast as statically compiled C!

julia> bar() = nothing
bar (generic function with 1 method)

julia> Base.@elapsed @cuda (1,1) bar()
105.534056 ms

Initial overhead: inferring & compiling infrastructure

Kernel performance

Rodinia benchmarks: CUDA C vs CUDAnative.jl

[Figure: Rodinia benchmark results, CUDA C vs CUDAnative.jl]

Performance is great, if you program carefully...

In [46]:
macro kernel_benchmark(ex)
    quote
        seconds = minimum([CUDAdrv.@elapsed $(esc(ex)) for _ in 1:1000])
        (seconds)u"s"
    end
end
Out[46]:
@kernel_benchmark (macro with 1 method)

Passing non-bits types

In [47]:
function kernel(a)
    a[threadIdx().x] = 0
    return
end;
In [48]:
len = 1000
d_a = CuArray{Int}(len);
In [49]:
(@kernel_benchmark @cuda (1,len) kernel(d_a)) |> u"μs"
Out[49]:
15.072f0 μs
In [50]:
function kernel(ptr, len)
    # rebuild the device array from a raw pointer + length: both are plain
    # bits values, so passing them to the kernel is much cheaper
    a = CuDeviceArray(len, ptr)
    a[threadIdx().x] = 0
    return
end;
In [51]:
(@kernel_benchmark @cuda (1,len) kernel(pointer(d_a), length(d_a))) |> u"μs"
Out[51]:
4.6080003f0 μs

Wide types

In [52]:
function dummy_kernel(a)
    i = threadIdx().x
    a[i] = CUDAnative.sqrt(a[i]*2.0)  # 2.0 is a Float64 literal: widens everything to double precision
    return
end;
In [53]:
CUDAnative.code_ptx(DevNull, dummy_kernel, Tuple{CuDeviceArray{Float32,2}})
.visible .func julia_dummy_kernel(...)
{
    .reg .f32   %f<3>;
    .reg .f64   %fd<4>;

    // a[i] * 2
    ld.f32       %f1, [%rd4+-4];
    cvt.f64.f32  %fd1, %f1;
    add.f64      %fd2, %fd1, %fd1;

    // a[i] = sqrt
    sqrt.rn.f64     %fd3, %fd2;
    cvt.rn.f32.f64  %f2, %fd3;
    st.f32          [%rd4+-4], %f2;

    ret;
}
  • extra register pressure → lower occupancy
  • slower operations (up to 64x on consumer GPUs)
In [54]:
function dummy_kernel(a)
    i = threadIdx().x
    a[i] = CUDAnative.sqrt(a[i] * eltype(a)(2))  # keep the literal in the array's element type
    return
end;
In [55]:
CUDAnative.code_ptx(DevNull, dummy_kernel, Tuple{CuDeviceArray{Float32,2}})
.visible .func julia_dummy_kernel(...)
{
    .reg .f32   %f<4>;

    // a[i] * 2
    ld.f32   %f1, [%rd4+-4];
    add.f32  %f2, %f1, %f1;

    /// a[i] = sqrt
    sqrt.rn.f32  %f3, %f2;
    st.f32      [%rd4+-4], %f3;

    ret;
}

Julia Base isn't always careful about integer widths (illustration below):

  • StepRanges promote small integers
  • use of literal numbers (e.g. the literal 1 in size(::AbstractArray))
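A plain-Julia illustration of the literal pitfall (no GPU needed, on a 64-bit machine): combining an Int32 with an untyped literal silently widens the arithmetic to 64 bits, which inside a kernel costs extra registers.

i = Int32(3)
typeof(i + 1)           # Int64: the literal 1 is an Int64 and wins the promotion
typeof(i + Int32(1))    # Int32: stays narrow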

Profiling

[Figure: NVVP timeline view]

[Figure: NVVP source-code view]

In [56]:
?CUDAdrv.@profile
Out[56]:
@profile ex

Run expressions while activating the CUDA profiler.

Note that this API is used to programmatically control the profiling granularity by allowing profiling to be done only on selective pieces of code. It does not perform any profiling on itself, you need external tools for that.

In [57]:
?CUDAnative.@profile
Out[57]:
@profile ex

Runs your expression ex while activating the CUDA profiler upon first kernel launch. This makes it easier to profile accurately, without the overhead of initial compilation, memory transfers, ...

Note that this API is used to programmatically control the profiling granularity by allowing profiling to be done only on selective pieces of code. It does not perform any profiling on itself, you need external tools for that.
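A hedged usage sketch combining the two: wrap only the launches of interest, so an external profiler started in a paused state (e.g. nvprof's --profile-from-start off mode) records a clean window without compilation or transfer overhead. The launch below reuses the vadd kernel and configuration from In [28]-[29].

CUDAnative.@profile begin
    @cuda (blocks,threads) kernel_vadd(d_a, d_b, d_c)
end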

6. Debugging

NVIDIA tools

$ cuda-memcheck julia examples/oob.jl
========= CUDA-MEMCHECK
========= Invalid __global__ write of size 4
=========     at 0x000000b8 in examples/oob.jl:18:julia_memset_61129
$ cuda-gdb --args julia examples/oob.jl
(cuda-gdb) set cuda break_on_launch application
(cuda-gdb) r
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0),
 thread (0,0,0), device 0, sm 13, warp 0, lane 0]

julia_memset_61129<<<(1,1,1),(11,1,1)>>> () at examples/oob.jl:18
18          a[i] = val

(cuda-gdb) backtrace
#0  julia_memset_61129<<<(1,1,1),(11,1,1)>>> () at examples/oob.jl:18

(cuda-gdb) l
16      function memset(a, val)
17          i = (blockIdx().x-1) * blockDim().x + threadIdx().x
18          a[i] = val
19          return nothing
20      end
(cuda-gdb) info locals
No locals.
(cuda-gdb) info args
No arguments.
(cuda-gdb) print val
No symbol "val" in current context.
(cuda-gdb) layout split
   ┌──examples/oob.jl───────────────────────────────────────────────────┐
   │16      function memset(a, val)                                     │
   │17          i = (blockIdx().x-1) * blockDim().x + threadIdx().x     │
  >│18          a[i] = val                                              │
   │19          return nothing                                          │
   │20      end                                                         │
   ┌────────────────────────────────────────────────────────────────────┐
   │0x3df57e8 <julia_memset_61129+40>     IADD R0, R0, 0x1              │
B+>│0x3df57f0 <julia_memset_61129+48>     IADD.X R5, RZ, c[0x0][0x144]  │
   │0x3df57f8 <julia_memset_61129+56>     IADD R6.CC, R0, -0x1          │
   └────────────────────────────────────────────────────────────────────┘
multi-thre Thread 0x7ffff In: julia_memset_61129  Line: 18  PC: 0x3df57f0
(cuda-gdb) next
(cuda-gdb) print $R5
$1 = 11

Unsupported language features

In [58]:
kernel_returns() = 1;
In [59]:
@code_sass kernel_returns()
kernel_returns() is not a valid kernel as it returns Int64

Stacktrace:
 [1] check_kernel(::Any, ::Any) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/jit.jl:273
 [2] #code_sass#62(::VersionNumber, ::Function, ::Base.PipeEndpoint, ::Any, ::Any) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/reflection.jl:83
 [3] code_sass(::Base.PipeEndpoint, ::Any, ::Any) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/reflection.jl:81
 [4] #code_sass#63 at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/reflection.jl:99 [inlined]
 [5] code_sass(::Any, ::Any) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/reflection.jl:99
 [6] code_sass_cputyped(::Function, ::Type{T} where T<:Tuple) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/reflection.jl:116
In [60]:
call_runtime() = Int[]
Out[60]:
call_runtime (generic function with 1 method)
In [61]:
CUDAnative.@code_typed call_runtime()
Out[61]:
1-element Array{Any,1}:
 CodeInfo(:(begin 
        return $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{Int64,1}, svec(Any, Int64), Array{Int64,1}, 0, 0, 0))
    end))=>Array{Int64,1}
In [62]:
CUDAnative.@code_llvm call_runtime()
error compiling call_runtime: emit_ccall for In[60]:1 requires the runtime language feature, which is disabled

Stacktrace:
 [1] _dump_function_linfo(::Core.MethodInstance, ::UInt64, ::Bool, ::Bool, ::Bool, ::Bool, ::Symbol, ::Bool, ::Base.CodegenParams) at ./reflection.jl:713
 [2] _dump_function(::ANY, ::ANY, ::Bool, ::Bool, ::Bool, ::Bool, ::Symbol, ::Bool, ::Base.CodegenParams) at ./reflection.jl:701
 [3] irgen(::Any, ::Any) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/jit.jl:73
 [4] #code_llvm#58(::Bool, ::Bool, ::VersionNumber, ::Function, ::Base.PipeEndpoint, ::Any, ::Any) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/reflection.jl:34
 [5] code_llvm(::Base.PipeEndpoint, ::Any, ::Any) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/reflection.jl:34
 [6] #code_llvm#59 at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/reflection.jl:44 [inlined]
 [7] code_llvm(::Any, ::Any) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/reflection.jl:44
 [8] code_llvm_cputyped(::Function, ::Type{T} where T<:Tuple) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/reflection.jl:116
In [63]:
function call_library(a)
    i = threadIdx().x
    a[i] = sin(a[i])    # Base.sin: compiles to a call through a host function pointer (see below)
    return
end
Out[63]:
call_library (generic function with 1 method)

This kills the driver...

julia> synchronize()
ERROR: CUDA error: unspecified launch failure (code #719, ERROR_LAUNCH_FAILED)

julia> ctx = CuContext(dev)
ERROR: CUDA error: unspecified launch failure (code #719, ERROR_LAUNCH_FAILED)
In [64]:
CUDAnative.code_llvm(DevNull, call_library, Tuple{CuDeviceMatrix{Float32}})
define void @julia_call_library_62847(...) {
  ...
  %9 = tail call float
       inttoptr (i64 139694085276592 to float (float)*)
       (float %8)
  ...
}
In [65]:
function call_library(a)
    i = threadIdx().x
    a[i] = CUDAnative.sin(a[i])    # maps onto the libdevice intrinsic __nv_sinf
    return
end
Out[65]:
call_library (generic function with 1 method)
In [66]:
CUDAnative.code_llvm(DevNull, call_library, Tuple{CuDeviceMatrix{Float32}})
define void @julia_call_library_62847(...) {
  ...
  %9 = tail call float @__nv_sinf(float %8), !dbg !19
  ...
}

Output modes

DEBUG=1 julia --compilecache=no examples/vadd.jl

Checking validity of CUDA/driver/current/libcuda.so
Checking validity of bundled library at julia/usr/lib/libLLVM-3.9.1.so
(Re)compiling kernel kernel_vadd(...) for device capability 3.5.0
intrinsics/indexing.jl:18: marked this readnone call a tail call candidate
intrinsics/indexing.jl:44: marked this readnone call a tail call candidate
intrinsics/indexing.jl:10: marked this readnone call a tail call candidate
JIT info log:
       ptxas info    : 0 bytes gmem
       ptxas info    : Compiling entry function 'kernel_vadd' for 'sm_35'
       ptxas info    : Function properties for kernel_vadd
       ptxas         .   0 bytes stack frame,
       ptxas         .   0 bytes spill stores,
       ptxas         .   0 bytes spill loads
       ptxas info    : Used 14 registers, 344 bytes cmem[0]

TRACE=1 julia --compilecache=no examples/vadd.jl

cuDriverGetVersion(version_ref=Base.RefValue{Int32}(...)) = 0
cuInit(flags=0) = 0
LLVMInstallFatalErrorHandler(Handler=Ptr{Void} 0x...) = nothing
LLVMGetGlobalContext() = Ptr{LLVM.API.LLVMOpaqueContext} 0x...
cuDriverGetVersion(version_ref=Base.RefValue{Int32}(...)) = 0
cuDeviceGet(handle_ref=Base.RefValue{Int32}(...), ordinal=0) = 0
cuCtxCreate_v2(handle_ref=Base.RefValue{Ptr{Void}}(...),
               flags=SCHED_AUTO, dev=CUDAdrv.CuDevice(0, 0)) = 0

...

Finalizing CuContext at 0x...
cuCtxDestroy_v2(ctx=CUDAdrv.CuContext(Ptr{Void} 0x..., true, true)) = 0
Invalidating CuContext at 0x...
Skipping finalizer for CuArray at 0x... because context is no longer valid
Skipping finalizer for CuArray at 0x... because context is no longer valid
Skipping finalizer for CuArray at 0x... because context is no longer valid
Skipping finalizer for CuModule at 0x... because context is no longer valid

That's it!

Talk on CUDAnative.jl (Wednesday):

  • GPU parallelism & optimization
  • CUDAnative.jl inner workings

Lightning talk on LLVM.jl (Thursday)