by: Tim Besard (GitHub: maleadt)
Talk: how does it work, what does it enable.
Workshop: how do you use CUDAnative.jl, realistically.
Follow along:
CUDA C replacement:
High-level wrappers: e.g. GPUArrays.jl
Very usable, but restricted:
Operating system: Linux, macOS (unsupported)
Julia: 0.6, 0.7 (unsupported)
GPU: NVIDIA CUDA, sm >= 2.0
Dependencies: CUDAdrv.jl & LLVM.jl
LLVM.jl: requires Julia source build:
$ git clone https://github.com/JuliaLang/julia.git
$ cd julia
$ git checkout v0.6.0
$ make
$ ./julia
Pkg.add("CUDAnative")
Single Program, Multiple Data
function execute(kernel, threads, args...)
    global thread
    for thread in 1:threads
        kernel(args...)
    end
end;
function kernel(a, b, c)
    c[thread] = a[thread] + b[thread]
    return
end;
a = [1,2,3]
b = [4,5,6]
c = similar(a)
execute(kernel, length(a), a, b, c)
a+b == c
true
Conceptually simple, but complex implementations:
Again, CUDAnative.jl does not obviate that complexity.
CUDAnative.jl takes care of:
Other CUDA operations: manual usage of e.g. CUDAdrv.jl
using CUDAdrv, CUDAnative
dev = CuDevice(0)
CuDevice(0): GeForce GTX TITAN
ctx = CuContext(dev)
# ...
destroy!(ctx)
CuContext(dev) do ctx
    # ...
end
ctx = CuContext(dev);
?CuArray
search: CuArray CuDeviceArray
CuArray{T}(dims)
CuArray{T,N}(dims)

Construct an uninitialized N-dimensional dense CUDA array with element type T, where N is determined from the length or number of dims. dims may be a tuple or a series of integer arguments corresponding to the lengths in each dimension. If the rank N is supplied explicitly as in Array{T,N}(dims), then it must match the length or number of dims.

Type aliases CuVector and CuMatrix are available for respectively 1- and 2-dimensional data.

CuArray{T}(src::Array{T})

Transfer a host array src to device, returning a CuArray.
d_a = CuArray{Int}(2)
2-element Array{Int64,1}: 4737786809106563072 4719772410588954624
d_a[1] = 1
indexing not defined for CUDAdrv.CuArray{Int64,1} Stacktrace: [1] setindex!(::CUDAdrv.CuArray{Int64,1}, ::Int64, ::Int64) at ./abstractarray.jl:966
a = Array(d_a)
2-element Array{Int64,1}: 4737786809106563072 4719772410588954624
@cuda ... kernel(d_a)
Mem.used() |> datasize
"72.1 MB"
d_a = CuArray{Int}(10000000)
sizeof(d_a) |> datasize
"80.0 MB"
Mem.used() |> datasize
"152.2 MB"
finalize(d_a)
Mem.used() |> datasize
"72.1 MB"
Avoid ERROR_OUT_OF_MEMORY: manual gc()
function kernel_vadd(a, b, c)
    i = threadIdx().x
    c[i] = a[i] + b[i]
    return
end;
len = 42
a = randn(Float32, len)
b = randn(Float32, len);
d_a = CuArray(a)
d_b = CuArray(b)
d_c = similar(d_a);
@cuda (1,len) kernel_vadd(d_a, d_b, d_c) # asynchronous!
Equivalent CUDA C: kernel_vadd<<<1,len>>>(d_a, d_b, d_c)
c = Array(d_c) # synchronizing
a+b ≈ c
true
len = 2048
a = randn(Float32, len)
b = randn(Float32, len);
d_a = CuArray(a)
d_b = CuArray(b)
d_c = similar(d_a);
@cuda (1,len) kernel_vadd(d_a, d_b, d_c)
CUDA error: invalid argument (code #1, ERROR_INVALID_VALUE) Stacktrace: [1] macro expansion at /home/tbesard/Projects/Julia-CUDA/CUDAdrv/src/base.jl:127 [inlined] [2] macro expansion at /home/tbesard/Projects/Julia-CUDA/CUDAdrv/src/execution.jl:84 [inlined] [3] _launch at /home/tbesard/Projects/Julia-CUDA/CUDAdrv/src/execution.jl:64 [inlined] [4] launch at /home/tbesard/Projects/Julia-CUDA/CUDAdrv/src/execution.jl:56 [inlined] [5] macro expansion at /home/tbesard/Projects/Julia-CUDA/CUDAdrv/src/execution.jl:172 [inlined] [6] _cudacall at /home/tbesard/Projects/Julia-CUDA/CUDAdrv/src/execution.jl:162 [inlined] [7] cudacall at /home/tbesard/Projects/Julia-CUDA/CUDAdrv/src/execution.jl:154 [inlined] [8] macro expansion at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/execution.jl:137 [inlined] [9] macro expansion at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/profile.jl:42 [inlined] [10] macro expansion at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/execution.jl:136 [inlined] [11] generated_cuda(::Tuple{Int64,Int64}, ::Int64, ::CUDAdrv.CuStream, ::#kernel_vadd, ::CUDAdrv.CuArray{Float32,1}, ::CUDAdrv.CuArray{Float32,1}, ::CUDAdrv.CuArray{Float32,1}) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/execution.jl:185
attribute(dev, CUDAdrv.MAX_THREADS_PER_BLOCK)
1024
attribute.(dev, (CUDAdrv.MAX_BLOCK_DIM_X,
                 CUDAdrv.MAX_BLOCK_DIM_Y,
                 CUDAdrv.MAX_BLOCK_DIM_Z))
(1024, 1024, 64)
function kernel_vadd(a, b, c)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    if i <= length(c)
        c[i] = a[i] + b[i]
    end
    return
end;
threads = min(1024, len)
blocks = ceil(Int, len / threads)
@cuda (blocks,threads) kernel_vadd(d_a, d_b, d_c)
c = Array(d_c)
a+b ≈ c
true
?CUDAdrv.CuDim3
CuDim3(x)
CuDim3((x,))
CuDim3((x, y))
CuDim3((x, y, z))

A type used to specify dimensions, consisting of 3 integers for respectively the x, y and z dimension. Unspecified dimensions default to 1.

Often accepted as argument through the CuDim type alias, e.g. in the case of cudacall, allowing to pass dimensions as a plain integer or tuple without having to construct an explicit CuDim3 object.
CUDAnative defines: code_llvm, code_ptx, code_sass
Why? Custom toolchain.
CUDAnative also defines:
@code_{lowered|typed|warntype}: defer to Base
@code_llvm, @code_ptx, @code_sass: defer to CUDAnative
Why? @cuda (...) kernel(...) converts arguments, e.g. CuArray → CuDeviceArray
@code_llvm @cuda (1,len) kernel_vadd(d_a, d_b, d_c)
WARNING: both CUDAnative and Base export "@code_llvm"; uses of it in module Main must be qualified
UndefVarError: @code_llvm not defined
CUDAnative.@code_llvm @cuda (1,len) kernel_vadd(d_a, d_b, d_c)
; Function Attrs: nounwind define void @julia_kernel_vadd_62142(%CuDeviceArray.4* nocapture readonly dereferenceable(16), %CuDeviceArray.4* nocapture readonly dereferenceable(16), %CuDeviceArray.4* nocapture readonly dereferenceable(16)) local_unnamed_addr #0 !dbg !5 { top: %ptls_i8 = tail call i8* asm "movq %fs:0, $0;\0Aaddq $$-10888, $0", "=r,~{dirflag},~{fpsr},~{flags}"() #2 ; Filename: /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/device/intrinsics/indexing.jl ; Source line: 18 %3 = tail call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x(), !dbg !7 ; Source line: 44 %4 = tail call i32 @llvm.nvvm.read.ptx.sreg.ntid.x(), !dbg !13 ; Source line: 10 %5 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x(), !dbg !15 ; Filename: In[28] ; Source line: 2 %6 = add i32 %3, 1, !dbg !12 %7 = zext i32 %6 to i64, !dbg !12 %8 = add nsw i64 %7, -1, !dbg !12 %9 = zext i32 %4 to i64, !dbg !12 %10 = mul i64 %8, %9, !dbg !12 %11 = add i32 %5, 1, !dbg !12 %12 = zext i32 %11 to i64, !dbg !12 %13 = add i64 %10, %12, !dbg !12 ; Source line: 4 %14 = getelementptr inbounds %CuDeviceArray.4, %CuDeviceArray.4* %2, i64 0, i32 0, i64 0, !dbg !19 %15 = load i64, i64* %14, align 8, !dbg !19, !tbaa !20 %16 = icmp sgt i64 %13, %15, !dbg !19 br i1 %16, label %L48, label %if, !dbg !19 if: ; preds = %top ; Source line: 5 %17 = getelementptr inbounds %CuDeviceArray.4, %CuDeviceArray.4* %0, i64 0, i32 1, !dbg !23 %18 = add i64 %13, -1, !dbg !23 %19 = load float*, float** %17, align 8, !dbg !23, !tbaa !20 %20 = getelementptr float, float* %19, i64 %18, !dbg !23 %21 = load float, float* %20, align 8, !dbg !23, !tbaa !24 %22 = getelementptr inbounds %CuDeviceArray.4, %CuDeviceArray.4* %1, i64 0, i32 1, !dbg !23 %23 = load float*, float** %22, align 8, !dbg !23, !tbaa !20 %24 = getelementptr float, float* %23, i64 %18, !dbg !23 %25 = load float, float* %24, align 8, !dbg !23, !tbaa !24 %26 = fadd float %21, %25, !dbg !23 %27 = getelementptr inbounds %CuDeviceArray.4, %CuDeviceArray.4* %2, i64 0, i32 1, !dbg !23 %28 = load float*, float** %27, align 8, !dbg !23, !tbaa !20 %29 = getelementptr float, float* %28, i64 %18, !dbg !23 store float %26, float* %29, align 8, !dbg !23, !tbaa !24 br label %L48, !dbg !23 L48: ; preds = %if, %top ; Source line: 8 ret void, !dbg !26 }
@code_ptx kernel_vadd(d_a, d_b, d_c)
// // Generated by LLVM NVPTX Back-End // .version 3.2 .target sm_35 .address_size 64 .file 1 "./In[28]" // .globl julia_kernel_vadd_62150 .visible .func julia_kernel_vadd_62150( .param .b64 julia_kernel_vadd_62150_param_0, .param .b64 julia_kernel_vadd_62150_param_1, .param .b64 julia_kernel_vadd_62150_param_2 ) { .reg .pred %p<2>; .reg .f32 %f<4>; .reg .s32 %r<7>; .reg .s64 %rd<18>; ld.param.u64 %rd4, [julia_kernel_vadd_62150_param_2]; mov.u32 %r2, %ctaid.x; mov.u32 %r3, %ntid.x; mov.u32 %r4, %tid.x; .loc 1 2 0 add.s32 %r5, %r2, 1; cvt.u64.u32 %rd5, %r5; add.s64 %rd6, %rd5, -1; cvt.u64.u32 %rd7, %r3; mul.lo.s64 %rd8, %rd6, %rd7; add.s32 %r6, %r4, 1; cvt.u64.u32 %rd9, %r6; add.s64 %rd1, %rd8, %rd9; .loc 1 4 0 ld.u64 %rd10, [%rd4]; setp.gt.s64 %p1, %rd1, %rd10; @%p1 bra LBB0_2; ld.param.u64 %rd3, [julia_kernel_vadd_62150_param_1]; ld.param.u64 %rd2, [julia_kernel_vadd_62150_param_0]; .loc 1 5 0 ld.u64 %rd11, [%rd2+8]; shl.b64 %rd12, %rd1, 2; add.s64 %rd13, %rd11, %rd12; ld.f32 %f1, [%rd13+-4]; ld.u64 %rd14, [%rd3+8]; add.s64 %rd15, %rd14, %rd12; ld.f32 %f2, [%rd15+-4]; add.f32 %f3, %f1, %f2; ld.u64 %rd16, [%rd4+8]; add.s64 %rd17, %rd16, %rd12; st.f32 [%rd17+-4], %f3; LBB0_2: .loc 1 8 0 ret; }
@code_sass kernel_vadd(d_a, d_b, d_c)
code for sm_35 Function : julia_kernel_vadd_62165 .headerflags @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)" /* 0x08b010a010a01000 */ /*0008*/ MOV R1, c[0x0][0x44]; /* 0x64c03c00089c0006 */ /*0010*/ S2R R0, SR_CTAID.X; /* 0x86400000129c0002 */ /*0018*/ MOV R6, c[0x0][0x150]; /* 0x64c03c002a1c001a */ /*0020*/ IADD R0, R0, 0x1; /* 0xc0800000009c0001 */ /*0028*/ MOV R7, c[0x0][0x154]; /* 0x64c03c002a9c001e */ /*0030*/ IADD R4.CC, R0, -0x1; /* 0xc88403ffff9c0011 */ /*0038*/ S2R R0, SR_TID.X; /* 0x86400000109c0002 */ /* 0x08b010b0b0118010 */ /*0048*/ LD.E.64 R2, [R6]; /* 0xc5800000001c1808 */ /*0050*/ IADD.X R5, RZ, -0x1; /* 0xc88043ffff9ffc15 */ /*0058*/ IMUL.U32.U32 R9.CC, R4, c[0x0][0x28]; /* 0x61c40000051c1026 */ /*0060*/ IADD R0, R0, 0x1; /* 0xc0800000009c0001 */ /*0068*/ IMAD.U32.U32.HI.X R4.CC, R4, c[0x0][0x28], RZ; /* 0x5217fc00051c1012 */ /*0070*/ IMAD.U32.U32.X R5, R5, c[0x0][0x28], R4; /* 0x50101000051c1416 */ /*0078*/ IADD R4.CC, R9, R0; /* 0xe0840000001c2412 */ /* 0x08a0b0a000b0b010 */ /*0088*/ IADD.X R0, R5, RZ; /* 0xe08040007f9c1402 */ /*0090*/ ISUB RZ.CC, R4, R2; /* 0xe08c0000011c13fe */ /*0098*/ ISETP.GT.X.AND P0, PT, R0, R3, PT; /* 0xdb485c00019c001e */ /*00a0*/ @P0 BRA.U 0x168; /* 0x120000006000023c */ /*00a8*/ @!P0 MOV32I R5, 0x8; /* 0x740000000423c016 */ /*00b0*/ @!P0 IADD R10.CC, R5, c[0x0][0x140]; /* 0x608400002820142a */ /*00b8*/ @!P0 IADD.X R11, RZ, c[0x0][0x144]; /* 0x6080400028a3fc2e */ /* 0x088c11909c81b010 */ /*00c8*/ @!P0 IADD R14.CC, R5, c[0x0][0x148]; /* 0x608400002920143a */ /*00d0*/ @!P0 LD.E.64 R8, [R10]; /* 0xc580000000202820 */ /*00d8*/ @!P0 IADD.X R15, RZ, c[0x0][0x14c]; /* 0x6080400029a3fc3e */ /*00e0*/ @!P0 IADD R6.CC, R5, c[0x0][0x150]; /* 0x608400002a20141a */ /*00e8*/ @!P0 LD.E.64 R2, [R14]; /* 0xc580000000203808 */ /*00f0*/ @!P0 SHF.L.U64 R5, R4, 0x2, R0; /* 0xb7c0020001201015 */ /*00f8*/ @!P0 IADD.X R7, RZ, c[0x0][0x154]; /* 0x608040002aa3fc1e */ /* 0x08809c80b18080b0 */ /*0108*/ @!P0 IMAD.U32.U32 R12.CC, R4, 0x4, R8; /* 0xa004200002201031 */ /*0110*/ @!P0 IADD.X R13, R9, R5; /* 0xe080400002a02436 */ /*0118*/ @!P0 LD.E.64 R8, [R6]; /* 0xc580000000201820 */ /*0120*/ @!P0 IMAD.U32.U32 R10.CC, R4, 0x4, R2; /* 0xa004080002201029 */ /*0128*/ @!P0 IADD.X R11, R3, R5; /* 0xe080400002a00c2e */ /*0130*/ @!P0 LD.E R3, [R12+-0x4]; /* 0xc4fffffffe20300c */ /*0138*/ @!P0 LD.E R0, [R10+-0x4]; /* 0xc4fffffffe202800 */ /* 0x0800b81000a0a4b0 */ /*0148*/ @!P0 IMAD.U32.U32 R4.CC, R4, 0x4, R8; /* 0xa004200002201011 */ /*0150*/ @!P0 IADD.X R5, R9, R5; /* 0xe080400002a02416 */ /*0158*/ @!P0 FADD R0, R3, R0; /* 0xe2c0000000200c02 */ /*0160*/ @!P0 ST.E [R4+-0x4], R0; /* 0xe4fffffffe201000 */ /*0168*/ MOV RZ, RZ; /* 0xe4c03c007f9c03fe */ /*0170*/ EXIT; /* 0x18000000001c003c */ /*0178*/ BRA 0x178; /* 0x12007ffffc1c003c */ ........................................
Reduce across columns.
const rows = 3
const cols = 4
a = collect(reshape(1:rows*cols, (rows, cols)))
3×4 Array{Int64,2}: 1 4 7 10 2 5 8 11 3 6 9 12
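The cpu reference used below is not included in these notes; a minimal sketch consistent with its output (a column-wise inclusive scan) could be:
# hypothetical CPU reference: inclusive scan down each column
function cpu(op::Function, a::Matrix)
    b = copy(a)
    for col in 1:size(b,2), row in 2:size(b,1)
        b[row,col] = op(b[row,col], b[row-1,col])
    end
    b
end;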
cpu(+, a)
3×4 Array{Int64,2}: 1 4 7 10 3 9 15 21 6 15 24 33
# naive GPU implementation
function gpu!(op::Function, data::CuDeviceMatrix)
    col = blockIdx().x     # one block per column
    row = threadIdx().x    # one thread per row

    # inclusive scan down the column, doubling the stride each step
    offset = 1
    while offset < row
        a = data[row,col]
        b = data[row - offset,col]
        sync_threads()

        data[row,col] = op(a,b)
        sync_threads()

        offset *= 2
    end
    return
end;
function gpu(op::Function, a::Matrix)
    d_a = CuArray(a)
    @cuda (size(a,2),size(a,1)) gpu!(op, d_a)
    Array(d_a)
end;
cpu(+, a) ≈ gpu(+, a)
true
a = rand(Float32, 1000, 1000);
@benchmark cpu(+, a)
BenchmarkTools.Trial: memory estimate: 3.81 MiB allocs estimate: 2 -------------- minimum time: 1.065 ms (0.00% GC) median time: 1.081 ms (0.00% GC) mean time: 1.203 ms (9.32% GC) maximum time: 2.256 ms (29.94% GC) -------------- samples: 4125 evals/sample: 1
@benchmark gpu(+, a)
BenchmarkTools.Trial: memory estimate: 3.82 MiB allocs estimate: 31 -------------- minimum time: 1.639 ms (0.00% GC) median time: 1.655 ms (0.00% GC) mean time: 1.743 ms (4.84% GC) maximum time: 2.819 ms (23.53% GC) -------------- samples: 2863 evals/sample: 1
Benchmark the kernel time: CUDAdrv.@elapsed
function gpu_benchmark(op::Function, a::Matrix)
    d_a = CuArray(a)
    time = CUDAdrv.@elapsed begin
        @cuda (size(a,2),size(a,1)) gpu!(op, d_a)
    end
    (time)u"s"
end;
gc()
minimum([gpu_benchmark(+,a) for _ in 1:1000]) |> u"ms"
0.3496f0 ms
Dwarfed by data transfer time.
Still, many possible kernel optimizations:
Only going to discuss performance & optimization of Julia constructs!
julia> using CUDAnative
julia> foo() = nothing
julia> Base.@elapsed @cuda (1,1) foo()
8.501839023 s
julia> Base.@elapsed @cuda (1,1) foo()
36.033 μs
As fast as statically compiled C!
julia> bar() = nothing
bar (generic function with 1 method)
julia> Base.@elapsed @cuda (1,1) bar()
105.534056 ms
Initial overhead: inferring & compiling infrastructure
Rodinia benchmarks: CUDA C vs CUDAnative.jl
Performance is great! If you program carefully....
function kernel(a)
    a[threadIdx().x] = 0
    return
end;
len = 1000
d_a = CuArray{Int}(len);
(@kernel_benchmark @cuda (1,len) kernel(d_a)) |> u"μs"
15.072f0 μs
function kernel(ptr, len)
    a = CuDeviceArray(len, ptr)
    a[threadIdx().x] = 0
    return
end;
(@kernel_benchmark @cuda (1,len) kernel(pointer(d_a), length(d_a))) |> u"μs"
4.6080003f0 μs
function dummy_kernel(a)
    i = threadIdx().x
    a[i] = CUDAnative.sqrt(a[i]*2.0)
    return
end;
CUDAnative.code_ptx(DevNull, dummy_kernel, Tuple{CuDeviceArray{Float32,2}})
.visible .func julia_dummy_kernel(...)
{
.reg .f32 %f<3>;
.reg .f64 %fd<4>;
// a[i] * 2
ld.f32 %f1, [%rd4+-4];
cvt.f64.f32 %fd1, %f1;
add.f64 %fd2, %fd1, %fd1;
// a[i] = sqrt
sqrt.rn.f64 %fd3, %fd2;
cvt.rn.f32.f64 %f2, %fd3;
st.f32 [%rd4+-4], %f2;
ret;
}
function dummy_kernel(a)
    i = threadIdx().x
    a[i] = CUDAnative.sqrt(a[i] * eltype(a)(2))
    return
end;
CUDAnative.code_ptx(DevNull, dummy_kernel, Tuple{CuDeviceArray{Float32,2}})
.visible .func julia_dummy_kernel(...)
{
.reg .f32 %f<4>;
// a[i] * 2
ld.f32 %f1, [%rd4+-4];
add.f32 %f2, %f1, %f1;
// a[i] = sqrt
sqrt.rn.f32 %f3, %f2;
st.f32 [%rd4+-4], %f3;
ret;
}
Julia Base doesn't always care wrt. Int:
StepRanges promote small integers
Int in size(::AbstractArray)
?CUDAdrv.@profile
@profile ex
Run expressions while activating the CUDA profiler.
Note that this API is used to programmatically control the profiling granularity by allowing profiling to be done only on selective pieces of code. It does not perform any profiling on itself, you need external tools for that.
?CUDAnative.@profile
@profile ex
Runs your expression ex while activating the CUDA profiler upon first kernel launch. This makes it easier to profile accurately, without the overhead of initial compilation, memory transfers, ...
Note that this API is used to programmatically control the profiling granularity by allowing profiling to be done only on selective pieces of code. It does not perform any profiling on itself, you need external tools for that.
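A possible workflow (a sketch, not from the original notes; assumes nvprof) is to keep the profiler off until the first kernel launch and wrap only the code of interest:
$ nvprof --profile-from-start off julia examples/vadd.jl
# inside the script: only this launch gets profiled
CUDAnative.@profile @cuda (blocks,threads) kernel_vadd(d_a, d_b, d_c)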
$ cuda-memcheck julia examples/oob.jl
========= CUDA-MEMCHECK
========= Invalid __global__ write of size 4
========= at 0x000000b8 in examples/oob.jl:18:julia_memset_61129
$ cuda-gdb --args julia examples/oob.jl
(cuda-gdb) set cuda break_on_launch application
(cuda-gdb) r
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0),
thread (0,0,0), device 0, sm 13, warp 0, lane 0]
julia_memset_61129<<<(1,1,1),(11,1,1)>>> () at examples/oob.jl:18
18 a[i] = val
(cuda-gdb) backtrace
#0 julia_memset_61129<<<(1,1,1),(11,1,1)>>> () at examples/oob.jl:18
(cuda-gdb) l
16 function memset(a, val)
17 i = (blockIdx().x-1) * blockDim().x + threadIdx().x
18 a[i] = val
19 return nothing
20 end
(cuda-gdb) info locals
No locals.
(cuda-gdb) info args
No arguments.
(cuda-gdb) print val
No symbol "val" in current context.
(cuda-gdb) layout split
┌──examples/oob.jl───────────────────────────────────────────────────┐
│16 function memset(a, val) │
│17 i = (blockIdx().x-1) * blockDim().x + threadIdx().x │
>│18 a[i] = val │
│19 return nothing │
│20 end │
┌────────────────────────────────────────────────────────────────────┐
│0x3df57e8 <julia_memset_61129+40> IADD R0, R0, 0x1 │
B+>│0x3df57f0 <julia_memset_61129+48> IADD.X R5, RZ, c[0x0][0x144] │
│0x3df57f8 <julia_memset_61129+56> IADD R6.CC, R0, -0x1 │
└────────────────────────────────────────────────────────────────────┘
multi-thre Thread 0x7ffff In: julia_memset_61129 Line: 18 PC: 0x3df57f0
(cuda-gdb) next
(cuda-gdb) print $R5
$1 = 11
kernel_returns() = 1;
@code_sass kernel_returns()
kernel_returns() is not a valid kernel as it returns Int64 Stacktrace: [1] check_kernel(::Any, ::Any) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/jit.jl:273 [2] #code_sass#62(::VersionNumber, ::Function, ::Base.PipeEndpoint, ::Any, ::Any) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/reflection.jl:83 [3] code_sass(::Base.PipeEndpoint, ::Any, ::Any) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/reflection.jl:81 [4] #code_sass#63 at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/reflection.jl:99 [inlined] [5] code_sass(::Any, ::Any) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/reflection.jl:99 [6] code_sass_cputyped(::Function, ::Type{T} where T<:Tuple) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/reflection.jl:116
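A kernel has to return nothing; a trivial, hypothetical fix is to end with a bare return:
kernel_returns_fixed() = (1; return nothing)   # should now pass the kernel check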
call_runtime() = Int[]
call_runtime (generic function with 1 method)
CUDAnative.@code_typed call_runtime()
1-element Array{Any,1}: CodeInfo(:(begin return $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{Int64,1}, svec(Any, Int64), Array{Int64,1}, 0, 0, 0)) end))=>Array{Int64,1}
CUDAnative.@code_llvm call_runtime()
error compiling call_runtime: emit_ccall for In[60]:1 requires the runtime language feature, which is disabled Stacktrace: [1] _dump_function_linfo(::Core.MethodInstance, ::UInt64, ::Bool, ::Bool, ::Bool, ::Bool, ::Symbol, ::Bool, ::Base.CodegenParams) at ./reflection.jl:713 [2] _dump_function(::ANY, ::ANY, ::Bool, ::Bool, ::Bool, ::Bool, ::Symbol, ::Bool, ::Base.CodegenParams) at ./reflection.jl:701 [3] irgen(::Any, ::Any) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/jit.jl:73 [4] #code_llvm#58(::Bool, ::Bool, ::VersionNumber, ::Function, ::Base.PipeEndpoint, ::Any, ::Any) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/reflection.jl:34 [5] code_llvm(::Base.PipeEndpoint, ::Any, ::Any) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/reflection.jl:34 [6] #code_llvm#59 at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/reflection.jl:44 [inlined] [7] code_llvm(::Any, ::Any) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/reflection.jl:44 [8] code_llvm_cputyped(::Function, ::Type{T} where T<:Tuple) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/reflection.jl:116
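Device code cannot call into the Julia runtime (e.g. to allocate an Array), so allocate on the host and pass the buffer in; a hypothetical sketch reusing the pointer-based kernel from above:
# pre-allocate on the host instead of allocating inside the kernel
d_tmp = CuArray{Int}(len)
@cuda (1,len) kernel(pointer(d_tmp), length(d_tmp))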
function call_library(a)
    i = threadIdx().x
    a[i] = sin(a[i])
    return
end
call_library (generic function with 1 method)
This kills the driver...
julia> synchronize()
ERROR: CUDA error: unspecified launch failure (code #719, ERROR_LAUNCH_FAILED)
julia> ctx = CuContext(dev)
ERROR: CUDA error: unspecified launch failure (code #719, ERROR_LAUNCH_FAILED)
CUDAnative.code_llvm(DevNull, call_library, Tuple{CuDeviceMatrix{Float32}})
define void @julia_call_library_62847(...) {
...
%9 = tail call float
inttoptr (i64 139694085276592 to float (float)*)
(float %8)
...
}
function call_library(a)
    i = threadIdx().x
    a[i] = CUDAnative.sin(a[i])
    return
end
call_library (generic function with 1 method)
CUDAnative.code_llvm(DevNull, call_library, Tuple{CuDeviceMatrix{Float32}})
define void @julia_call_library_62847(...) {
...
%9 = tail call float @__nv_sinf(float %8), !dbg !19
...
}
DEBUG=1 julia --compilecache=no examples/vadd.jl
Checking validity of CUDA/driver/current/libcuda.so
Checking validity of bundled library at julia/usr/lib/libLLVM-3.9.1.so
(Re)compiling kernel kernel_vadd(...) for device capability 3.5.0
intrinsics/indexing.jl:18: marked this readnone call a tail call candidate
intrinsics/indexing.jl:44: marked this readnone call a tail call candidate
intrinsics/indexing.jl:10: marked this readnone call a tail call candidate
JIT info log:
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'kernel_vadd' for 'sm_35'
ptxas info : Function properties for kernel_vadd
ptxas . 0 bytes stack frame,
ptxas . 0 bytes spill stores,
ptxas . 0 bytes spill loads
ptxas info : Used 14 registers, 344 bytes cmem[0]
TRACE=1 julia --compilecache=no examples/vadd.jl
cuDriverGetVersion(version_ref=Base.RefValue{Int32}(...)) = 0
cuInit(flags=0) = 0
LLVMInstallFatalErrorHandler(Handler=Ptr{Void} 0x...) = nothing
LLVMGetGlobalContext() = Ptr{LLVM.API.LLVMOpaqueContext} 0x...
cuDriverGetVersion(version_ref=Base.RefValue{Int32}(...)) = 0
cuDeviceGet(handle_ref=Base.RefValue{Int32}(...), ordinal=0) = 0
cuCtxCreate_v2(handle_ref=Base.RefValue{Ptr{Void}}(...),
flags=SCHED_AUTO, dev=CUDAdrv.CuDevice(0, 0)) = 0
...
Finalizing CuContext at 0x...
cuCtxDestroy_v2(ctx=CUDAdrv.CuContext(Ptr{Void} 0x..., true, true)) = 0
Invalidating CuContext at 0x...
Skipping finalizer for CuArray at 0x... because context is no longer valid
Skipping finalizer for CuArray at 0x... because context is no longer valid
Skipping finalizer for CuArray at 0x... because context is no longer valid
Skipping finalizer for CuModule at 0x... because context is no longer valid
Talk on CUDAnative.jl (Wednesday)
Lightning talk on LLVM.jl (Thursday)