Bessel function
The Bessel function of the first kind of order $\nu$ can be computed from its Taylor expansion
\[ J_\nu(z) = \sum\limits_{k=0}^{\infty} \frac{(z/2)^\nu}{\Gamma(k+1)\Gamma(k+\nu+1)} \left(-\frac{z^2}{4}\right)^{k}\]
where $\Gamma(n) = (n-1)!$ is the Gamma function. Starting from $s_0 = \frac{(z/2)^\nu}{\nu!}$, one can compute the accumulated term iteratively as $s_k = -\frac{z^2}{4k(k+\nu)} s_{k-1}$.
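For reference, a plain (irreversible) Julia implementation of this recurrence might look like the sketch below; the function name besselj_taylor and its keyword atol are illustrative and not part of NiLang.

# A minimal irreversible sketch of the Taylor series above (integer ν assumed).
function besselj_taylor(ν::Integer, z; atol=1e-8)
    k = 0
    s = (z / 2)^ν / factorial(ν)    # the k = 0 term
    out = s
    while abs(s) > atol
        k += 1
        s *= -z^2 / (4k * (k + ν))  # s_k = -z^2/(4k(k+ν)) * s_{k-1}
        out += s
    end
    return out
end

The reversible NiLang program below follows the same recurrence.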
using NiLang, NiLang.AD
using ForwardDiff: Dual
Since we need logarithmic numbers to handle the sequential multiplication, let's first add a patch for converting a ULogarithmic number to a Dual number.
function Base.convert(::Type{Dual{T,V,N}}, x::ULogarithmic) where {T,V,N}
    Dual{T,V,N}(exp(x.log))
end
@i function ibesselj(y!::T, ν, z::T; atol=1e-8) where T
    if z == 0
        if ν == 0
            y! += 1
        end
    else
        @routine @invcheckoff begin
            k ← 0
            @ones ULogarithmic{T} lz halfz halfz_power_2 s
            @zeros T out_anc
            lz *= convert(z)
            halfz *= lz / 2
            halfz_power_2 *= halfz ^ 2
            # s *= (z/2)^ν / factorial(ν)
            s *= halfz ^ ν
            for i=1:ν
                s /= i
            end
            out_anc += convert(s)
            @from k==0 while s.log > -25 # up to precision e^-25
                k += 1
                # s *= 1 / k / (k+ν) * (z/2)^2
                s *= halfz_power_2 / (@const k*(k+ν))
                if k%2 == 0
                    out_anc += convert(s)
                else
                    out_anc -= convert(s)
                end
            end
        end
        y! += out_anc
        ~@routine
    end
end
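As a quick sanity check, calling the forward program returns the updated arguments as a tuple (the numeric value shown is approximate):

ibesselj(0.0, 2, 1.0)[1]   # ≈ 0.1149, i.e. J_2(1.0)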
To obtain gradients, one can call Grad(ibesselj):
y, x = 0.0, 1.0
Grad(ibesselj)(Val(1), y, 2, x)
(Val{1}(), GVar(0.0, 1.0), 2, GVar(1.0, 0.21024361588126536))
Here, Grad(ibesselj) is a callable instance of type Grad{typeof(ibesselj)}. The first parameter Val(1)
indicates that the first argument is the loss.
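The derivative itself can be read out of the returned GVar with the grad function; for example (reusing y and x from above):

_, _, _, xg = Grad(ibesselj)(Val(1), y, 2, x)
grad(xg)   # ≈ 0.2102, the derivative of J_2(z) at z = 1.0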
To obtain second-order gradients, one can feed dual numbers to this gradient function.
_, hxy, _, hxx = Grad(ibesselj)(Val(1), Dual(y, zero(y)), 2, Dual(x, one(x)))
println("The hessian d^2y/dx^2 is $(grad(hxx).partials[1])")
The hessian d^2y/dx^2 is 0.13446683891656322
Here, the gradient field is a Dual number; its partials field stores the derivative with respect to x. This is the Hessian we need.
CUDA programming
The AD in NiLang avoids most heap allocations, so it can execute on a GPU device. We suggest using KernelAbstractions, which provides compatibility between CPU and GPU. To execute the above function on a GPU, we need only 11 lines of code.
using CUDA, GPUArrays, KernelAbstractions
@i @kernel function bessel_kernel(out!, v, z)
    @invcheckoff i ← @index(Global)
    ibesselj(out![i], v, z[i])
    @invcheckoff i → @index(Global)
end
NiLang has macro support for KernelAbstractions, so it is possible to launch the kernel directly like this:
@i function befunc(out!, v::Integer, z)
    @launchkernel CUDADevice() 256 length(out!) bessel_kernel(out!, v, z)
end
It is equivalent to calling
bessel_kernel(CUDADevice(), 256)(out!, v, z; ndrange=length(out!))
but it executes the job eagerly for you, waiting for the kernel to finish. We will consider better support in the future.
Another difference is that it is reversible:
julia> @code_reverse @launchkernel CUDA() 256 length(out!) bessel_kernel(out!, v, z)
:(#= REPL[4]:1 =# @launchkernel CUDA() 256 length(out!) (~bessel_kernel)(out!, v, z))
To test this function, we first define the input array a and the output array out!:
a = CuArray(rand(128))
out! = CuArray(zeros(128))
We first run the forward program to populate out!. Then we wrap the output with a randomly initialized gradient field, pretending that the gradients come from a virtual loss function. We also need to initialize an empty gradient field for the elements of the input CUDA tensor a.
out! = befunc(out!, 2, a)[1]
out_g! = GVar.(out!, CuArray(randn(128)))
Call the inverse program; multiple dispatch will drive you to the goal.
(~befunc)(out_g!, 2, GVar.(a))
You will get CUDA arrays with GVar elements as output; their gradient fields are what you want. Cheers! Now you have an adjoint-mode differentiable CUDA kernel.
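Since NiLang functions return their updated arguments, the gradient with respect to a can be read off from the third returned array; a small sketch (variable names are illustrative):

_, _, a_with_grad = (~befunc)(out_g!, 2, GVar.(a))
grad.(a_with_grad)   # a CuArray holding the gradients with respect to a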
Benchmark
We benchmarked several automatic differentiation implementations of the first-kind Bessel function $J_2(1.0)$ and show the results below.
Package | Tangent/Adjoint | $T_{\rm min}$/ns | Space/KB |
---|---|---|---|
Julia | - | 22 | 0 |
NiLang | - | 59 | 0 |
ForwardDiff | Tangent | 35 | 0 |
Manual | Adjoint | 83 | 0 |
NiLang.AD | Adjoint | 213 | 0 |
NiLang.AD (GPU) | Adjoint | 1.4 | 0 |
Zygote | Adjoint | 31201 | 13.47 |
Tapenade | Adjoint | ? | ? |
Julia is the CPU time for running the irreversible forward program; it is the baseline of the benchmark. NiLang is the reversible implementation; it is about 2.7 times slower than its irreversible counterpart. Here, we have removed the reversibility check. ForwardDiff gives the best performance because it is designed for functions with a single input. It is even faster than the manually derived gradient
\[\frac{\partial J_{\nu}(z)}{\partial z} = \frac{J_{\nu-1}(z) - J_{\nu+1}(z)}{2}\]
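For comparison, this manual gradient can be evaluated directly with an existing Bessel implementation; a sketch using SpecialFunctions.jl (not the code used in the benchmark):

using SpecialFunctions: besselj

# Manually derived derivative of J_ν(z) with respect to z.
manual_grad(ν, z) = (besselj(ν - 1, z) - besselj(ν + 1, z)) / 2
manual_grad(2, 1.0)   # ≈ 0.2102, matching the reverse-mode result above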
NiLang.AD is the reversible differentiable programming implementation; its timing covers only the backward pass. The benchmark of its GPU version is estimated on an Nvidia Titan V by broadcasting the gradient function over a CUDA array of size $2^{17}$ and taking the average. The Zygote benchmark covers both the forward pass and the backward pass. Tapenade is not yet ready.
This page was generated using Literate.jl.