# The shared write problem on GPU

We will write a GPU version of `axpy!`

function.

## The main program

```
using NiLang, NiLang.AD
using CUDA
using KernelAbstractions
CUDA.allowscalar(true)
```

so far, this example requires patch: https://github.com/JuliaGPU/KernelAbstractions.jl/pull/52

```
@i @kernel function axpy_kernel(y!, α, x)
# invcheckoff to turn of `reversibility checker`
# GPU can not handle errors!
@invcheckoff begin
i ← @index(Global)
y![i] += x[i] * α
i → @index(Global)
end
end
@i function cu_axpy!(y!::AbstractVector, α, x::AbstractVector)
@launchkernel CUDADevice() 256 length(y!) axpy_kernel(y!, α, x)
end
@i function loss(out, y!, α, x)
cu_axpy!(y!, α, x)
# Note: the following code is stupid scalar operations on CuArray,
# They are only for testing.
for i=1:length(y!)
out += y![i]
end
end
y! = rand(100)
x = rand(100)
cuy! = y! |> CuArray
cux = x |> CuArray
α = 0.4
```

## Check the correctness of results

```
using Test
cu_axpy!(cuy!, α, cux)
@test Array(cuy!) ≈ y! .+ α .* x
(~cu_axpy!)(cuy!, α, cux)
@test Array(cuy!) ≈ y!
```

Let's check the gradients

```
lsout = 0.0
@instr Grad(loss)(Val(1), lsout, cuy!, α, cux)
```

you will see a correct vector `[0.4, 0.4, 0.4 ...]`

`grad.(cux)`

you will see `0.0`

.

`grad(α)`

## Why some gradients not correct?

In the above example, `α`

is a scalar, whereas a scalar is not allowed to change in a CUDA kernel. What if we change `α`

to a CuArray?

## This one works: using a vector of `α`

```
@i @kernel function axpy_kernel(y!, α, x)
@invcheckoff begin
i ← @index(Global)
y![i] += x[i] * α[i]
i → @index(Global)
end
end
cuy! = y! |> CuArray
cux = x |> CuArray
cuβ = repeat([0.4], 100) |> CuArray
lsout = 0.0
@instr Grad(loss)(Val(1), lsout, cuy!, cuβ, cux)
```

You will see correct answer

`grad.(cuβ)`

## This one has the shared write problem: using a vector of `α`

, but shared read.

```
@i @kernel function axpy_kernel(y!, α, x)
@invcheckoff begin
i ← @index(Global)
y![i] += x[i] * α[i]
i → @index(Global)
end
end
cuy! = y! |> CuArray
cux = x |> CuArray
cuβ = repeat([0.4], 100) |> CuArray
lsout = 0.0
cuβ = [0.4] |> CuArray
```

Run the following will give you a happy error

ERROR: a exception was thrown during kernel execution. Run Julia on debug level 2 for device stack traces.

`@instr Grad(loss)(Val(1), lsout, cuy!, cuβ, cux)`

Because, shared write is not allowed. We need someone clever enough to solve this problem for us.

## Conclusion

- Shared scalar: the gradient of a scalar will not be updated.
- Expanded vector: works properly, but costs more memory.
- Shared 1-element vector: error on shared write.

*This page was generated using Literate.jl.*