Compiling Julia for NVIDIA GPUs

For the last few months, I have been working on CUDA support for the Julia language. It is now possible to write kernels in Julia and execute them on an NVIDIA GPU without much hassle, but there are still many limitations.

Note from 2018: Much has happened since 2015, and as a result this blog post has become pretty stale. Check out CUDAnative.jl for more details about the current state of affairs.

My work allows for code such as:

using CUDA

# define a kernel
@target ptx function kernel_vadd(a, b, c)
    i = blockId_x() + (threadId_x()-1) * numBlocks_x()
    c[i] = a[i] + b[i]

    return nothing
end

# set-up
dev = CuDevice(0)
ctx = CuContext(dev)
cgctx = CuCodegenContext(ctx, dev)

# create some data
dims = (3, 4)
a = round(rand(Float32, dims) * 100)
b = round(rand(Float32, dims) * 100)
c = Array(Float32, dims)

# execute!
len = prod(dims)
@cuda (len, 1) kernel_vadd(CuIn(a), CuIn(b), CuOut(c))

# verify
@show a+b == c

# tear-down

which is pretty neat I think :-)
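To see why launching `len` blocks of one thread each touches every element exactly once, the kernel's index arithmetic can be replayed on the CPU. This is a minimal sketch in plain Julia (1.x syntax) and needs no GPU. `kernel_index` is a hypothetical helper that does not exist in CUDA.jl, and the intrinsics are assumed to be 1-based, as the `-1` in the kernel suggests.

```julia
# Hypothetical CPU-side replay of the index computation in kernel_vadd.
# Assumption: blockId_x() and threadId_x() count from 1.
kernel_index(block_id, thread_id, num_blocks) =
    block_id + (thread_id - 1) * num_blocks

len = 12                          # prod((3, 4)) from the example above
num_blocks, num_threads = len, 1  # mirrors the (len, 1) launch configuration

# Enumerate every (block, thread) pair of the launch and collect the indices.
indices = [kernel_index(b, t, num_blocks)
           for t in 1:num_threads for b in 1:num_blocks]

# Every linear index in 1:len is visited exactly once.
println(sort(indices) == collect(1:len))
```

The same formula stays collision-free with more threads per block, since each additional thread is offset by a whole multiple of the block count.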

I’ll start by giving a quick description of the modifications. Jump to the bottom of this post for usage instructions.


Compiling Julia for GPUs requires support at multiple levels. I’ve tried to avoid touching too much of the core compiler; as a consequence, most functionality is part of the CUDA.jl package. This should make it easier to maintain and eventually merge the code.

All of the relevant repositories are hosted on my GitHub page, and contain README and TODO files. If you have any questions though, feel free to contact me.

Julia compiler

Using the NVPTX back-end of LLVM, I have modified the Julia compiler so that it can generate PTX assembly. A non-exhaustive list of modifications:

Most of the code churn comes from using an address-preserving bitcast, which is already being upstreamed thanks to Valentin Churavy.

CUDA.jl support package

Generating PTX assembly is only one part of the puzzle: hardware needs to be configured, code needs to be uploaded, etc. This functionality is exposed through the CUDA driver, which was already conveniently wrapped in the CUDA.jl package.

I have extended this package with functionality required for GPU code generation, and developed user-friendly wrappers which should make it easier to interact with PTX code:

The most significant part is obviously the @cuda macro, which allows for seamless execution of kernel functions on your GPU. The macro compiles the kernel you’re calling to PTX assembly, and generates code for interacting with the driver (creating a module, uploading code, managing arguments, etc.).

The argument management is also pretty interesting. Depending on the argument type, it generates type conversions and/or memory operations in order to mimic Julia’s pass-by-sharing convention. For example, if you pass an array to a kernel, @cuda will automatically upload and download it when required [1].
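The transfer logic can be illustrated with a CPU-only mock. The sketch below is not the CUDA.jl implementation: `MockIn`, `MockOut`, `MockInOut`, and `mock_launch` are hypothetical stand-ins written in plain Julia (1.x syntax), and the "device" is simply a separate copy of each array. It only shows how a wrapper type could decide which copies are needed around a kernel call.

```julia
# Hypothetical stand-ins for the CuIn / CuOut / CuInOut wrappers.
abstract type MockArg end
struct MockIn{T}    <: MockArg; data::T; end
struct MockOut{T}   <: MockArg; data::T; end
struct MockInOut{T} <: MockArg; data::T; end

# Which transfers does each wrapper require?
needs_upload(::MockIn)      = true
needs_upload(::MockOut)     = false
needs_upload(::MockInOut)   = true
needs_download(::MockIn)    = false
needs_download(::MockOut)   = true
needs_download(::MockInOut) = true

# Emulate a launch: "upload" copies into a device buffer,
# "download" copies the buffer back into the host array.
function mock_launch(kernel, args::MockArg...)
    device = map(args) do arg
        needs_upload(arg) ? copy(arg.data) : similar(arg.data)
    end
    kernel(device...)
    for (arg, buf) in zip(args, device)
        needs_download(arg) && copyto!(arg.data, buf)
    end
    return nothing
end

# A stand-in "kernel" that only ever sees the device buffers.
vadd!(a, b, c) = (c .= a .+ b; nothing)

a = Float32[1, 2, 3]
b = Float32[10, 20, 30]
c = zeros(Float32, 3)
mock_launch(vadd!, MockIn(a), MockIn(b), MockOut(c))
println(c)
```

Because the direction of each transfer is determined by the wrapper's type, the decision can be made by dispatch instead of by runtime checks.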

Most functionality of @cuda is built using staged functions, and thus only executes once, without a recurring runtime cost. This means that it should be possible to reach the same average performance as a traditional, precompiled CUDA application :-)
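For readers unfamiliar with staged functions: later Julia versions expose them as `@generated` functions. The following sketch has nothing to do with the actual `@cuda` implementation. It is a generic example in Julia 1.x syntax, with `unrolled_sum` as a hypothetical name, showing how the function body runs at compile time on the argument types and returns the expression that later calls execute.

```julia
# The body runs at compile time, once the argument types are known.
# Inside it, `t` refers to the tuple *type*, and N and T are known constants.
@generated function unrolled_sum(t::NTuple{N,T}) where {N,T}
    ex = :(zero(T))
    for i in 1:N
        # Build the expression zero(T) + t[1] + t[2] + ... + t[N].
        ex = :($ex + t[$i])
    end
    return ex   # this expression becomes the method body for this type
end

println(unrolled_sum((1.0f0, 2.0f0, 3.0f0)))
println(unrolled_sum((1, 2, 3, 4)))
```

The loop over `1:N` is not part of the compiled method; only the unrolled sum is, which is the sense in which the staging work has no recurring runtime cost.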

GPU Ocelot emulator

I have also forked GPU Ocelot, a research project providing a dynamic compilation framework (read: emulator) for CUDA hardware. I extended its API support and fixed certain bugs, so that it can now serve as a drop-in replacement for the CUDA driver library, fully compatible with CUDA.jl.

In practice, I used this emulator for everyday development on a system without an NVIDIA GPU, while testing happened on real hardware.


The code is far from production ready: it is not cross-platform (Linux only), several changes should be discussed with upstream, and only a restricted subset of the language is supported. Most notable shortcomings:

In short: unless you’re only using relatively simple kernels with non-complex data interactions, this code is not yet usable for you.


Even though all code is pretty functional and well-maintained, you need some basic development skills to put the pieces together. Don’t expect a polished product!

Julia compiler

Compile the modified compiler from source, using LLVM 3.5:

$ git clone
$ cd julia
$ make LLVM_VER=3.5.0

Optionally, make sure Julia is not broken (this does not include GPU tests):

$ make LLVM_VER=3.5.0 testall

Note: the compiler will require libdevice to link kernel binaries. This library is only part of recent CUDA toolkits (version 5.5 or greater). If you use an older CUDA release, you will need to get a hold of these files. Afterwards, you can point Julia to them using the NVVMIR_LIBRARY_DIR environment variable.

GPU Ocelot emulator

If you don’t have any suitable CUDA hardware, you can use GPU Ocelot:

$ git clone --recursive
$ cd gpuocelot
$ CUDA_BIN_PATH=/opt/cuda-5.0/bin \
  CUDA_LIB_PATH=/opt/cuda-5.0/lib \
  CUDA_INC_PATH=/opt/cuda-5.0/include \
  python2 --install -p $(realpath ../julia/usr)

Note: this probably will not build flawlessly. You’ll need at least the CUDA toolkit [2] (headers and tools, not the driver), gcc 4.6, scons, LLVM 3.5 and Boost. Check the README!

Now if you load CUDA.jl and it doesn’t find the CUDA driver library, it will look for the Ocelot library instead:

$ ./julia
  > using CUDA

CUDA.jl support package

Installing packages is easy [3] (just make sure you use the correct julia binary):

$ ./julia
  > Pkg.clone("")

Optionally, but recommended, test GPU support:

$ ./julia
  > Pkg.test("CUDA")

What now?

You tell me! I think this work can be a good start for future GPU endeavours in Julia land, even without most code being directly re-usable. For me at least it has been a very interesting project, but it’s in the hands of the community now.

  1. You can influence this behaviour using the CuIn, CuOut, and CuInOut wrapper types.

  2. GPU Ocelot is only compatible with CUDA 5.0 or older. This means you’ll need to get libdevice separately.

  3. If you don’t want to pollute your main package directory with this experimental stuff, redefine the JULIA_PKGDIR environment variable.