by: Tim Besard (GitHub: maleadt)
Talk: how does it work, what does it enable.
Workshop: how do you use CUDAnative.jl, realistically.
Follow along:
CUDA C replacement:
High-level wrappers: e.g. GPUArrays.jl
Very usable, but restricted:
Operating system: Linux, macOS (unsupported)
Julia: 0.6, 0.7 (unsupported)
GPU: NVIDIA CUDA, sm >= 2.0
Dependencies: CUDAdrv.jl & LLVM.jl
LLVM.jl: requires Julia source build:
$ git clone https://github.com/JuliaLang/julia.git
$ cd julia
$ git checkout v0.6.0
$ make
$ ./julia
Pkg.add("CUDAnative")
Single Program, Multiple Data
function execute(kernel, threads, args...)
    global thread
    for thread in 1:threads
        kernel(args...)
    end
end;
function kernel(a, b, c)
    c[thread] = a[thread] + b[thread]
    return
end;
a = [1,2,3]
b = [4,5,6]
c = similar(a)
execute(kernel, length(a), a, b, c)
a+b == c
true
Conceptually simple, but complex implementations:
Again, CUDAnative.jl does not obviate that complexity.
CUDAnative.jl takes care of:
Other CUDA operations: manual usage of e.g. CUDAdrv.jl
using CUDAdrv, CUDAnative
dev = CuDevice(0)
CuDevice(0): GeForce GTX TITAN
ctx = CuContext(dev)
# ...
destroy!(ctx)
CuContext(dev) do ctx
    # ...
end
ctx = CuContext(dev);
?CuArray
search: CuArray CuDeviceArray
CuArray{T}(dims)
CuArray{T,N}(dims)

Construct an uninitialized N-dimensional dense CUDA array with element type T, where N is determined from the length or number of dims. dims may be a tuple or a series of integer arguments corresponding to the lengths in each dimension. If the rank N is supplied explicitly as in Array{T,N}(dims), then it must match the length or number of dims.

Type aliases CuVector and CuMatrix are available for respectively 1- and 2-dimensional data.

CuArray{T}(src::Array{T})

Transfer a host array src to device, returning a CuArray.
d_a = CuArray{Int}(2)
2-element Array{Int64,1}: 4737786809106563072 4719772410588954624
d_a[1] = 1
indexing not defined for CUDAdrv.CuArray{Int64,1} Stacktrace: [1] setindex!(::CUDAdrv.CuArray{Int64,1}, ::Int64, ::Int64) at ./abstractarray.jl:966
a = Array(d_a)
2-element Array{Int64,1}: 4737786809106563072 4719772410588954624
@cuda ... kernel(d_a)
Mem.used() |> datasize
"72.1 MB"
d_a = CuArray{Int}(10000000)
sizeof(d_a) |> datasize
"80.0 MB"
Mem.used() |> datasize
"152.2 MB"
finalize(d_a)
Mem.used() |> datasize
"72.1 MB"
Avoid ERROR_OUT_OF_MEMORY: manual gc()
function kernel_vadd(a, b, c)
    i = threadIdx().x
    c[i] = a[i] + b[i]
    return
end;
len = 42
a = randn(Float32, len)
b = randn(Float32, len);
d_a = CuArray(a)
d_b = CuArray(b)
d_c = similar(d_a);
@cuda (1,len) kernel_vadd(d_a, d_b, d_c) # asynchronous!
Equivalent CUDA C: kernel_vadd<<<1,len>>>(d_a, d_b, d_c)
c = Array(d_c) # synchronizing
a+b ≈ c
true
len = 2048
a = randn(Float32, len)
b = randn(Float32, len);
d_a = CuArray(a)
d_b = CuArray(b)
d_c = similar(d_a);
@cuda (1,len) kernel_vadd(d_a, d_b, d_c)
CUDA error: invalid argument (code #1, ERROR_INVALID_VALUE) Stacktrace: [1] macro expansion at /home/tbesard/Projects/Julia-CUDA/CUDAdrv/src/base.jl:127 [inlined] [2] macro expansion at /home/tbesard/Projects/Julia-CUDA/CUDAdrv/src/execution.jl:84 [inlined] [3] _launch at /home/tbesard/Projects/Julia-CUDA/CUDAdrv/src/execution.jl:64 [inlined] [4] launch at /home/tbesard/Projects/Julia-CUDA/CUDAdrv/src/execution.jl:56 [inlined] [5] macro expansion at /home/tbesard/Projects/Julia-CUDA/CUDAdrv/src/execution.jl:172 [inlined] [6] _cudacall at /home/tbesard/Projects/Julia-CUDA/CUDAdrv/src/execution.jl:162 [inlined] [7] cudacall at /home/tbesard/Projects/Julia-CUDA/CUDAdrv/src/execution.jl:154 [inlined] [8] macro expansion at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/execution.jl:137 [inlined] [9] macro expansion at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/profile.jl:42 [inlined] [10] macro expansion at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/execution.jl:136 [inlined] [11] generated_cuda(::Tuple{Int64,Int64}, ::Int64, ::CUDAdrv.CuStream, ::#kernel_vadd, ::CUDAdrv.CuArray{Float32,1}, ::CUDAdrv.CuArray{Float32,1}, ::CUDAdrv.CuArray{Float32,1}) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/execution.jl:185
attribute(dev, CUDAdrv.MAX_THREADS_PER_BLOCK)
1024
attribute.(dev, (CUDAdrv.MAX_BLOCK_DIM_X,
                 CUDAdrv.MAX_BLOCK_DIM_Y,
                 CUDAdrv.MAX_BLOCK_DIM_Z))
(1024, 1024, 64)
function kernel_vadd(a, b, c)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    if i <= length(c)
        c[i] = a[i] + b[i]
    end
    return
end;
threads = min(1024, len)
blocks = ceil(Int, len / threads)
@cuda (blocks,threads) kernel_vadd(d_a, d_b, d_c)
c = Array(d_c)
a+b ≈ c
true
?CUDAdrv.CuDim3
CuDim3(x)
CuDim3((x,))
CuDim3((x, y))
CuDim3((x, y, z))

A type used to specify dimensions, consisting of 3 integers for respectively the x, y and z dimension. Unspecified dimensions default to 1.

Often accepted as argument through the CuDim type alias, e.g. in the case of cudacall, allowing to pass dimensions as a plain integer or tuple without having to construct an explicit CuDim3 object.
CUDAnative defines: code_llvm, code_ptx, code_sass
Why? Custom toolchain.
CUDAnative also defines:
@code_{lowered|typed|warntype}: defer to Base
@code_llvm, @code_ptx, @code_sass: defer to CUDAnative
Why? @cuda (...) kernel(...) converts arguments, e.g. CuArray → CuDeviceArray
@code_llvm @cuda (1,len) kernel_vadd(d_a, d_b, d_c)
WARNING: both CUDAnative and Base export "@code_llvm"; uses of it in module Main must be qualified
UndefVarError: @code_llvm not defined
CUDAnative.@code_llvm @cuda (1,len) kernel_vadd(d_a, d_b, d_c)
; Function Attrs: nounwind define void @julia_kernel_vadd_62142(%CuDeviceArray.4* nocapture readonly dereferenceable(16), %CuDeviceArray.4* nocapture readonly dereferenceable(16), %CuDeviceArray.4* nocapture readonly dereferenceable(16)) local_unnamed_addr #0 !dbg !5 { top: %ptls_i8 = tail call i8* asm "movq %fs:0, $0;\0Aaddq $$-10888, $0", "=r,~{dirflag},~{fpsr},~{flags}"() #2 ; Filename: /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/device/intrinsics/indexing.jl ; Source line: 18 %3 = tail call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x(), !dbg !7 ; Source line: 44 %4 = tail call i32 @llvm.nvvm.read.ptx.sreg.ntid.x(), !dbg !13 ; Source line: 10 %5 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x(), !dbg !15 ; Filename: In[28] ; Source line: 2 %6 = add i32 %3, 1, !dbg !12 %7 = zext i32 %6 to i64, !dbg !12 %8 = add nsw i64 %7, -1, !dbg !12 %9 = zext i32 %4 to i64, !dbg !12 %10 = mul i64 %8, %9, !dbg !12 %11 = add i32 %5, 1, !dbg !12 %12 = zext i32 %11 to i64, !dbg !12 %13 = add i64 %10, %12, !dbg !12 ; Source line: 4 %14 = getelementptr inbounds %CuDeviceArray.4, %CuDeviceArray.4* %2, i64 0, i32 0, i64 0, !dbg !19 %15 = load i64, i64* %14, align 8, !dbg !19, !tbaa !20 %16 = icmp sgt i64 %13, %15, !dbg !19 br i1 %16, label %L48, label %if, !dbg !19 if: ; preds = %top ; Source line: 5 %17 = getelementptr inbounds %CuDeviceArray.4, %CuDeviceArray.4* %0, i64 0, i32 1, !dbg !23 %18 = add i64 %13, -1, !dbg !23 %19 = load float*, float** %17, align 8, !dbg !23, !tbaa !20 %20 = getelementptr float, float* %19, i64 %18, !dbg !23 %21 = load float, float* %20, align 8, !dbg !23, !tbaa !24 %22 = getelementptr inbounds %CuDeviceArray.4, %CuDeviceArray.4* %1, i64 0, i32 1, !dbg !23 %23 = load float*, float** %22, align 8, !dbg !23, !tbaa !20 %24 = getelementptr float, float* %23, i64 %18, !dbg !23 %25 = load float, float* %24, align 8, !dbg !23, !tbaa !24 %26 = fadd float %21, %25, !dbg !23 %27 = getelementptr inbounds %CuDeviceArray.4, %CuDeviceArray.4* %2, i64 0, i32 1, !dbg !23 %28 = load float*, float** %27, align 8, !dbg !23, !tbaa !20 %29 = getelementptr float, float* %28, i64 %18, !dbg !23 store float %26, float* %29, align 8, !dbg !23, !tbaa !24 br label %L48, !dbg !23 L48: ; preds = %if, %top ; Source line: 8 ret void, !dbg !26 }
@code_ptx kernel_vadd(d_a, d_b, d_c)
// // Generated by LLVM NVPTX Back-End // .version 3.2 .target sm_35 .address_size 64 .file 1 "./In[28]" // .globl julia_kernel_vadd_62150 .visible .func julia_kernel_vadd_62150( .param .b64 julia_kernel_vadd_62150_param_0, .param .b64 julia_kernel_vadd_62150_param_1, .param .b64 julia_kernel_vadd_62150_param_2 ) { .reg .pred %p<2>; .reg .f32 %f<4>; .reg .s32 %r<7>; .reg .s64 %rd<18>; ld.param.u64 %rd4, [julia_kernel_vadd_62150_param_2]; mov.u32 %r2, %ctaid.x; mov.u32 %r3, %ntid.x; mov.u32 %r4, %tid.x; .loc 1 2 0 add.s32 %r5, %r2, 1; cvt.u64.u32 %rd5, %r5; add.s64 %rd6, %rd5, -1; cvt.u64.u32 %rd7, %r3; mul.lo.s64 %rd8, %rd6, %rd7; add.s32 %r6, %r4, 1; cvt.u64.u32 %rd9, %r6; add.s64 %rd1, %rd8, %rd9; .loc 1 4 0 ld.u64 %rd10, [%rd4]; setp.gt.s64 %p1, %rd1, %rd10; @%p1 bra LBB0_2; ld.param.u64 %rd3, [julia_kernel_vadd_62150_param_1]; ld.param.u64 %rd2, [julia_kernel_vadd_62150_param_0]; .loc 1 5 0 ld.u64 %rd11, [%rd2+8]; shl.b64 %rd12, %rd1, 2; add.s64 %rd13, %rd11, %rd12; ld.f32 %f1, [%rd13+-4]; ld.u64 %rd14, [%rd3+8]; add.s64 %rd15, %rd14, %rd12; ld.f32 %f2, [%rd15+-4]; add.f32 %f3, %f1, %f2; ld.u64 %rd16, [%rd4+8]; add.s64 %rd17, %rd16, %rd12; st.f32 [%rd17+-4], %f3; LBB0_2: .loc 1 8 0 ret; }
@code_sass kernel_vadd(d_a, d_b, d_c)
code for sm_35 Function : julia_kernel_vadd_62165 .headerflags @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)" /* 0x08b010a010a01000 */ /*0008*/ MOV R1, c[0x0][0x44]; /* 0x64c03c00089c0006 */ /*0010*/ S2R R0, SR_CTAID.X; /* 0x86400000129c0002 */ /*0018*/ MOV R6, c[0x0][0x150]; /* 0x64c03c002a1c001a */ /*0020*/ IADD R0, R0, 0x1; /* 0xc0800000009c0001 */ /*0028*/ MOV R7, c[0x0][0x154]; /* 0x64c03c002a9c001e */ /*0030*/ IADD R4.CC, R0, -0x1; /* 0xc88403ffff9c0011 */ /*0038*/ S2R R0, SR_TID.X; /* 0x86400000109c0002 */ /* 0x08b010b0b0118010 */ /*0048*/ LD.E.64 R2, [R6]; /* 0xc5800000001c1808 */ /*0050*/ IADD.X R5, RZ, -0x1; /* 0xc88043ffff9ffc15 */ /*0058*/ IMUL.U32.U32 R9.CC, R4, c[0x0][0x28]; /* 0x61c40000051c1026 */ /*0060*/ IADD R0, R0, 0x1; /* 0xc0800000009c0001 */ /*0068*/ IMAD.U32.U32.HI.X R4.CC, R4, c[0x0][0x28], RZ; /* 0x5217fc00051c1012 */ /*0070*/ IMAD.U32.U32.X R5, R5, c[0x0][0x28], R4; /* 0x50101000051c1416 */ /*0078*/ IADD R4.CC, R9, R0; /* 0xe0840000001c2412 */ /* 0x08a0b0a000b0b010 */ /*0088*/ IADD.X R0, R5, RZ; /* 0xe08040007f9c1402 */ /*0090*/ ISUB RZ.CC, R4, R2; /* 0xe08c0000011c13fe */ /*0098*/ ISETP.GT.X.AND P0, PT, R0, R3, PT; /* 0xdb485c00019c001e */ /*00a0*/ @P0 BRA.U 0x168; /* 0x120000006000023c */ /*00a8*/ @!P0 MOV32I R5, 0x8; /* 0x740000000423c016 */ /*00b0*/ @!P0 IADD R10.CC, R5, c[0x0][0x140]; /* 0x608400002820142a */ /*00b8*/ @!P0 IADD.X R11, RZ, c[0x0][0x144]; /* 0x6080400028a3fc2e */ /* 0x088c11909c81b010 */ /*00c8*/ @!P0 IADD R14.CC, R5, c[0x0][0x148]; /* 0x608400002920143a */ /*00d0*/ @!P0 LD.E.64 R8, [R10]; /* 0xc580000000202820 */ /*00d8*/ @!P0 IADD.X R15, RZ, c[0x0][0x14c]; /* 0x6080400029a3fc3e */ /*00e0*/ @!P0 IADD R6.CC, R5, c[0x0][0x150]; /* 0x608400002a20141a */ /*00e8*/ @!P0 LD.E.64 R2, [R14]; /* 0xc580000000203808 */ /*00f0*/ @!P0 SHF.L.U64 R5, R4, 0x2, R0; /* 0xb7c0020001201015 */ /*00f8*/ @!P0 IADD.X R7, RZ, c[0x0][0x154]; /* 0x608040002aa3fc1e */ /* 0x08809c80b18080b0 */ /*0108*/ @!P0 IMAD.U32.U32 R12.CC, R4, 0x4, R8; /* 0xa004200002201031 */ /*0110*/ @!P0 IADD.X R13, R9, R5; /* 0xe080400002a02436 */ /*0118*/ @!P0 LD.E.64 R8, [R6]; /* 0xc580000000201820 */ /*0120*/ @!P0 IMAD.U32.U32 R10.CC, R4, 0x4, R2; /* 0xa004080002201029 */ /*0128*/ @!P0 IADD.X R11, R3, R5; /* 0xe080400002a00c2e */ /*0130*/ @!P0 LD.E R3, [R12+-0x4]; /* 0xc4fffffffe20300c */ /*0138*/ @!P0 LD.E R0, [R10+-0x4]; /* 0xc4fffffffe202800 */ /* 0x0800b81000a0a4b0 */ /*0148*/ @!P0 IMAD.U32.U32 R4.CC, R4, 0x4, R8; /* 0xa004200002201011 */ /*0150*/ @!P0 IADD.X R5, R9, R5; /* 0xe080400002a02416 */ /*0158*/ @!P0 FADD R0, R3, R0; /* 0xe2c0000000200c02 */ /*0160*/ @!P0 ST.E [R4+-0x4], R0; /* 0xe4fffffffe201000 */ /*0168*/ MOV RZ, RZ; /* 0xe4c03c007f9c03fe */ /*0170*/ EXIT; /* 0x18000000001c003c */ /*0178*/ BRA 0x178; /* 0x12007ffffc1c003c */ ........................................
Reduce across columns.
const rows = 3
const cols = 4
a = collect(reshape(1:rows*cols, (rows, cols)))
3×4 Array{Int64,2}: 1 4 7 10 2 5 8 11 3 6 9 12
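The cpu reference used below is not included in these notes; a minimal sketch consistent with its output (a column-wise inclusive scan) could be:
# hypothetical CPU reference: inclusive scan down each column
function cpu(op::Function, a::Matrix)
    b = copy(a)
    for col in 1:size(b,2), row in 2:size(b,1)
        b[row,col] = op(b[row,col], b[row-1,col])
    end
    b
end;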
cpu(+, a)
3×4 Array{Int64,2}: 1 4 7 10 3 9 15 21 6 15 24 33
# naive GPU implementation
function gpu!(op::Function, data::CuDeviceMatrix)
    col = blockIdx().x     # one block per column
    row = threadIdx().x    # one thread per row

    # inclusive scan down the column, doubling the stride each step
    offset = 1
    while offset < row
        a = data[row,col]
        b = data[row - offset,col]
        sync_threads()

        data[row,col] = op(a,b)
        sync_threads()

        offset *= 2
    end
    return
end;
function gpu(op::Function, a::Matrix)
    d_a = CuArray(a)
    @cuda (size(a,2),size(a,1)) gpu!(op, d_a)
    Array(d_a)
end;
cpu(+, a) ≈ gpu(+, a)
true
a = rand(Float32, 1000, 1000);
@benchmark cpu(+, a)
BenchmarkTools.Trial: memory estimate: 3.81 MiB allocs estimate: 2 -------------- minimum time: 1.065 ms (0.00% GC) median time: 1.081 ms (0.00% GC) mean time: 1.203 ms (9.32% GC) maximum time: 2.256 ms (29.94% GC) -------------- samples: 4125 evals/sample: 1
@benchmark gpu(+, a)
BenchmarkTools.Trial: memory estimate: 3.82 MiB allocs estimate: 31 -------------- minimum time: 1.639 ms (0.00% GC) median time: 1.655 ms (0.00% GC) mean time: 1.743 ms (4.84% GC) maximum time: 2.819 ms (23.53% GC) -------------- samples: 2863 evals/sample: 1
Benchmark the kernel time: CUDAdrv.@elapsed
function gpu_benchmark(op::Function, a::Matrix)
    d_a = CuArray(a)
    time = CUDAdrv.@elapsed begin
        @cuda (size(a,2),size(a,1)) gpu!(op, d_a)
    end
    (time)u"s"
end;
gc()
minimum([gpu_benchmark(+,a) for _ in 1:1000]) |> u"ms"
0.3496f0 ms
Dwarfed by data transfer time.
Still, many possible kernel optimizations:
Only going to discuss performance & optimization of Julia constructs!
julia> using CUDAnative
julia> foo() = nothing
julia> Base.@elapsed @cuda (1,1) foo()
8.501839023 s
julia> Base.@elapsed @cuda (1,1) foo()
36.033 μs
As fast as statically compiled C!
julia> bar() = nothing
bar (generic function with 1 method)
julia> Base.@elapsed @cuda (1,1) bar()
105.534056 ms
Initial overhead: inferring & compiling infrastructure
Rodinia benchmarks: CUDA C vs CUDAnative.jl
Performance is great! If you program carefully....
function kernel(a)
    a[threadIdx().x] = 0
    return
end;
len = 1000
d_a = CuArray{Int}(len);
(@kernel_benchmark @cuda (1,len) kernel(d_a)) |> u"μs"
15.072f0 μs
function kernel(ptr, len)
    a = CuDeviceArray(len, ptr)
    a[threadIdx().x] = 0
    return
end;
(@kernel_benchmark @cuda (1,len) kernel(pointer(d_a), length(d_a))) |> u"μs"
4.6080003f0 μs
function dummy_kernel(a)
    i = threadIdx().x
    a[i] = CUDAnative.sqrt(a[i]*2.0)
    return
end;
CUDAnative.code_ptx(DevNull, dummy_kernel, Tuple{CuDeviceArray{Float32,2}})
.visible .func julia_dummy_kernel(...)
{
.reg .f32 %f<3>;
.reg .f64 %fd<4>;
// a[i] * 2
ld.f32 %f1, [%rd4+-4];
cvt.f64.f32 %fd1, %f1;
add.f64 %fd2, %fd1, %fd1;
// a[i] = sqrt
sqrt.rn.f64 %fd3, %fd2;
cvt.rn.f32.f64 %f2, %fd3;
st.f32 [%rd4+-4], %f2;
ret;
}
function dummy_kernel(a)
    i = threadIdx().x
    a[i] = CUDAnative.sqrt(a[i] * eltype(a)(2))
    return
end;
CUDAnative.code_ptx(DevNull, dummy_kernel, Tuple{CuDeviceArray{Float32,2}})
.visible .func julia_dummy_kernel(...)
{
.reg .f32 %f<4>;
// a[i] * 2
ld.f32 %f1, [%rd4+-4];
add.f32 %f2, %f1, %f1;
// a[i] = sqrt
sqrt.rn.f32 %f3, %f2;
st.f32 [%rd4+-4], %f3;
ret;
}
Julia Base doesn't always care wrt. Int:
StepRanges promote small integers
Int in size(::AbstractArray)
?CUDAdrv.@profile
@profile ex
Run expressions while activating the CUDA profiler.
Note that this API is used to programmatically control the profiling granularity by allowing profiling to be done only on selective pieces of code. It does not perform any profiling on itself, you need external tools for that.
?CUDAnative.@profile
@profile ex
Runs your expression ex while activating the CUDA profiler upon first kernel launch. This makes it easier to profile accurately, without the overhead of initial compilation, memory transfers, ...
Note that this API is used to programmatically control the profiling granularity by allowing profiling to be done only on selective pieces of code. It does not perform any profiling on itself, you need external tools for that.
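A possible workflow (a sketch, not from the original notes; assumes nvprof) is to keep the profiler off until the first kernel launch and wrap only the code of interest:
$ nvprof --profile-from-start off julia examples/vadd.jl
# inside the script: only this launch gets profiled
CUDAnative.@profile @cuda (blocks,threads) kernel_vadd(d_a, d_b, d_c)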
$ cuda-memcheck julia examples/oob.jl
========= CUDA-MEMCHECK
========= Invalid __global__ write of size 4
========= at 0x000000b8 in examples/oob.jl:18:julia_memset_61129
$ cuda-gdb --args julia examples/oob.jl
(cuda-gdb) set cuda break_on_launch application
(cuda-gdb) r
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0),
thread (0,0,0), device 0, sm 13, warp 0, lane 0]
julia_memset_61129<<<(1,1,1),(11,1,1)>>> () at examples/oob.jl:18
18 a[i] = val
(cuda-gdb) backtrace
#0 julia_memset_61129<<<(1,1,1),(11,1,1)>>> () at examples/oob.jl:18
(cuda-gdb) l
16 function memset(a, val)
17 i = (blockIdx().x-1) * blockDim().x + threadIdx().x
18 a[i] = val
19 return nothing
20 end
(cuda-gdb) info locals
No locals.
(cuda-gdb) info args
No arguments.
(cuda-gdb) print val
No symbol "val" in current context.
(cuda-gdb) layout split
┌──examples/oob.jl───────────────────────────────────────────────────┐
│16 function memset(a, val) │
│17 i = (blockIdx().x-1) * blockDim().x + threadIdx().x │
>│18 a[i] = val │
│19 return nothing │
│20 end │
┌────────────────────────────────────────────────────────────────────┐
│0x3df57e8 <julia_memset_61129+40> IADD R0, R0, 0x1 │
B+>│0x3df57f0 <julia_memset_61129+48> IADD.X R5, RZ, c[0x0][0x144] │
│0x3df57f8 <julia_memset_61129+56> IADD R6.CC, R0, -0x1 │
└────────────────────────────────────────────────────────────────────┘
multi-thre Thread 0x7ffff In: julia_memset_61129 Line: 18 PC: 0x3df57f0
(cuda-gdb) next
(cuda-gdb) print $R5
$1 = 11
kernel_returns() = 1;
@code_sass kernel_returns()
kernel_returns() is not a valid kernel as it returns Int64 Stacktrace: [1] check_kernel(::Any, ::Any) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/jit.jl:273 [2] #code_sass#62(::VersionNumber, ::Function, ::Base.PipeEndpoint, ::Any, ::Any) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/reflection.jl:83 [3] code_sass(::Base.PipeEndpoint, ::Any, ::Any) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/reflection.jl:81 [4] #code_sass#63 at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/reflection.jl:99 [inlined] [5] code_sass(::Any, ::Any) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/reflection.jl:99 [6] code_sass_cputyped(::Function, ::Type{T} where T<:Tuple) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/reflection.jl:116
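A kernel has to return nothing; a trivial, hypothetical fix is to end with a bare return:
kernel_returns_fixed() = (1; return nothing)   # should now pass the kernel check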
call_runtime() = Int[]
call_runtime (generic function with 1 method)
CUDAnative.@code_typed call_runtime()
1-element Array{Any,1}: CodeInfo(:(begin return $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{Int64,1}, svec(Any, Int64), Array{Int64,1}, 0, 0, 0)) end))=>Array{Int64,1}
CUDAnative.@code_llvm call_runtime()
error compiling call_runtime: emit_ccall for In[60]:1 requires the runtime language feature, which is disabled Stacktrace: [1] _dump_function_linfo(::Core.MethodInstance, ::UInt64, ::Bool, ::Bool, ::Bool, ::Bool, ::Symbol, ::Bool, ::Base.CodegenParams) at ./reflection.jl:713 [2] _dump_function(::ANY, ::ANY, ::Bool, ::Bool, ::Bool, ::Bool, ::Symbol, ::Bool, ::Base.CodegenParams) at ./reflection.jl:701 [3] irgen(::Any, ::Any) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/jit.jl:73 [4] #code_llvm#58(::Bool, ::Bool, ::VersionNumber, ::Function, ::Base.PipeEndpoint, ::Any, ::Any) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/reflection.jl:34 [5] code_llvm(::Base.PipeEndpoint, ::Any, ::Any) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/reflection.jl:34 [6] #code_llvm#59 at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/reflection.jl:44 [inlined] [7] code_llvm(::Any, ::Any) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/reflection.jl:44 [8] code_llvm_cputyped(::Function, ::Type{T} where T<:Tuple) at /home/tbesard/Projects/Julia-CUDA/CUDAnative/src/reflection.jl:116
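Device code cannot call into the Julia runtime (e.g. to allocate an Array), so allocate on the host and pass the buffer in; a hypothetical sketch reusing the pointer-based kernel from above:
# pre-allocate on the host instead of allocating inside the kernel
d_tmp = CuArray{Int}(len)
@cuda (1,len) kernel(pointer(d_tmp), length(d_tmp))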
function call_library(a)
    i = threadIdx().x
    a[i] = sin(a[i])
    return
end
call_library (generic function with 1 method)
This kills the driver...
julia> synchronize()
ERROR: CUDA error: unspecified launch failure (code #719, ERROR_LAUNCH_FAILED)
julia> ctx = CuContext(dev)
ERROR: CUDA error: unspecified launch failure (code #719, ERROR_LAUNCH_FAILED)
CUDAnative.code_llvm(DevNull, call_library, Tuple{CuDeviceMatrix{Float32}})
define void @julia_call_library_62847(...) {
...
%9 = tail call float
inttoptr (i64 139694085276592 to float (float)*)
(float %8)
...
}
function call_library(a)
    i = threadIdx().x
    a[i] = CUDAnative.sin(a[i])
    return
end
call_library (generic function with 1 method)
CUDAnative.code_llvm(DevNull, call_library, Tuple{CuDeviceMatrix{Float32}})
define void @julia_call_library_62847(...) {
...
%9 = tail call float @__nv_sinf(float %8), !dbg !19
...
}
DEBUG=1 julia --compilecache=no examples/vadd.jl
Checking validity of CUDA/driver/current/libcuda.so
Checking validity of bundled library at julia/usr/lib/libLLVM-3.9.1.so
(Re)compiling kernel kernel_vadd(...) for device capability 3.5.0
intrinsics/indexing.jl:18: marked this readnone call a tail call candidate
intrinsics/indexing.jl:44: marked this readnone call a tail call candidate
intrinsics/indexing.jl:10: marked this readnone call a tail call candidate
JIT info log:
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'kernel_vadd' for 'sm_35'
ptxas info : Function properties for kernel_vadd
ptxas . 0 bytes stack frame,
ptxas . 0 bytes spill stores,
ptxas . 0 bytes spill loads
ptxas info : Used 14 registers, 344 bytes cmem[0]
TRACE=1 julia --compilecache=no examples/vadd.jl
cuDriverGetVersion(version_ref=Base.RefValue{Int32}(...)) = 0
cuInit(flags=0) = 0
LLVMInstallFatalErrorHandler(Handler=Ptr{Void} 0x...) = nothing
LLVMGetGlobalContext() = Ptr{LLVM.API.LLVMOpaqueContext} 0x...
cuDriverGetVersion(version_ref=Base.RefValue{Int32}(...)) = 0
cuDeviceGet(handle_ref=Base.RefValue{Int32}(...), ordinal=0) = 0
cuCtxCreate_v2(handle_ref=Base.RefValue{Ptr{Void}}(...),
flags=SCHED_AUTO, dev=CUDAdrv.CuDevice(0, 0)) = 0
...
Finalizing CuContext at 0x...
cuCtxDestroy_v2(ctx=CUDAdrv.CuContext(Ptr{Void} 0x..., true, true)) = 0
Invalidating CuContext at 0x...
Skipping finalizer for CuArray at 0x... because context is no longer valid
Skipping finalizer for CuArray at 0x... because context is no longer valid
Skipping finalizer for CuArray at 0x... because context is no longer valid
Skipping finalizer for CuModule at 0x... because context is no longer valid
Talk on CUDAnative.jl (Wednesday)
Lightning talk on LLVM.jl (Thursday)