The shared write problem on GPU
We will write a GPU version of the axpy! function.
The main program
using NiLang, NiLang.AD
using CUDA
using KernelAbstractions
CUDA.allowscalar(true)  # the testing loop in `loss` below indexes a CuArray with scalars
So far, this example requires the patch https://github.com/JuliaGPU/KernelAbstractions.jl/pull/52.
@i @kernel function axpy_kernel(y!, α, x)
    # @invcheckoff turns off the reversibility checker,
    # because a GPU kernel cannot handle errors!
    @invcheckoff begin
        i ← @index(Global)
        y![i] += x[i] * α
        i → @index(Global)
    end
end
@i function cu_axpy!(y!::AbstractVector, α, x::AbstractVector)
    # launch on the CUDA device with workgroup size 256 over range length(y!)
    @launchkernel CUDADevice() 256 length(y!) axpy_kernel(y!, α, x)
end
@i function loss(out, y!, α, x)
    cu_axpy!(y!, α, x)
    # Note: the following loop does scalar indexing on a CuArray,
    # which is very slow; it is only for testing.
    for i = 1:length(y!)
        out += y![i]
    end
end
y! = rand(100)
x = rand(100)
cuy! = y! |> CuArray
cux = x |> CuArray
α = 0.4
Check the correctness of the results
using Test
cu_axpy!(cuy!, α, cux)
@test Array(cuy!) ≈ y! .+ α .* x
(~cu_axpy!)(cuy!, α, cux)  # ~cu_axpy! is the inverse program; it uncomputes y!
@test Array(cuy!) ≈ y!
Let's check the gradients
lsout = 0.0
@instr Grad(loss)(Val(1), lsout, cuy!, α, cux)  # Val(1) marks the loss argument
You will see the correct gradient vector [0.4, 0.4, 0.4, ...]:
grad.(cux)
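We can turn this observation into a test (this check is an addition for illustration; analytically ∂loss/∂xᵢ = α = 0.4):
@test Array(grad.(cux)) ≈ fill(0.4, length(cux))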
But you will see 0.0 for the gradient of α:
grad(α)
Why are some gradients incorrect?
In the above example, α is a scalar, and a scalar is not allowed to change inside a CUDA kernel, so its gradient field can never be updated.
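To see that this is specific to the GPU kernel, here is a minimal CPU contrast (cpu_loss, y2, x2, α2 and lsout2 are names introduced for illustration): on the CPU, @instr can rebind the scalar α in the calling scope, so its gradient is updated.
@i function cpu_loss(out, y!, α, x)
    cpu_axpy!(y!, α, x)
    for i = 1:length(y!)
        out += y![i]
    end
end
y2 = copy(y!); x2 = copy(x)
α2 = 0.4
lsout2 = 0.0
@instr Grad(cpu_loss)(Val(1), lsout2, y2, α2, x2)
grad(α2)  # nonzero: analytically Σᵢ xᵢ
So the fix must happen on the GPU side. What if we change α to a CuArray?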
This one works: using a vector for α
@i @kernel function axpy_kernel(y!, α, x)
    @invcheckoff begin
        i ← @index(Global)
        y![i] += x[i] * α[i]
        i → @index(Global)
    end
end
cuy! = y! |> CuArray
cux = x |> CuArray
cuβ = repeat([0.4], 100) |> CuArray
lsout = 0.0
@instr Grad(loss)(Val(1), lsout, cuy!, cuβ, cux)
You will see the correct answer:
grad.(cuβ)
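Again, a quick check (an addition for illustration; analytically ∂loss/∂αᵢ = xᵢ, since loss = Σᵢ (yᵢ + αᵢ xᵢ)):
@test Array(grad.(cuβ)) ≈ x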
This one has the shared write problem: using a 1-element vector for α, which is shared read in the forward pass.
@i @kernel function axpy_kernel(y!, α, x)
    @invcheckoff begin
        i ← @index(Global)
        # every thread reads α[1], so the reverse pass must write grad(α[1]) from every thread
        y![i] += x[i] * α[1]
        i → @index(Global)
    end
end
cuy! = y! |> CuArray
cux = x |> CuArray
lsout = 0.0
cuβ = [0.4] |> CuArray
Running the following will give you an error:
@instr Grad(loss)(Val(1), lsout, cuy!, cuβ, cux)
ERROR: a exception was thrown during kernel execution. Run Julia on debug level 2 for device stack traces.
This is because shared write is not allowed: in the reverse pass, every thread tries to accumulate its contribution into grad(α[1]) at the same time. We need someone clever enough to solve this problem for us.
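Until then, one practical workaround (a sketch, not an official NiLang API; it assumes the expanded-vector kernel reading α[i] is the active definition) is to expand the shared scalar and reduce the per-element gradients afterwards:
cuy! = y! |> CuArray
cux = x |> CuArray
cuβ_expanded = CUDA.fill(0.4, length(cuy!))  # one copy of α per element
lsout = 0.0
@instr Grad(loss)(Val(1), lsout, cuy!, cuβ_expanded, cux)
sum(grad.(cuβ_expanded))  # total gradient of the shared scalar
This trades memory for correctness, which matches the tradeoff summarized below.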
Conclusion
- Shared scalar: the gradient of a scalar will not be updated.
- Expanded vector: works properly, but costs more memory.
- Shared 1-element vector: error on shared write.
This page was generated using Literate.jl.