GiggleLiu - The Numeric Monster

Debugging note: HKUST-GZ HPC

Step 0: Log in (with Linux operating system)

Edit .ssh/config by adding the following lines

Host hpc
    User you-user-name
    HostName 10.120.1.2

To avoid repeated typing password, one could use ssh-copy-id command, read more.

Step 1: Install Julia

Edit .bashrc or .zshrc by adding the following two lines

export JULIAUP_SERVER=https://mirror.nju.edu.cn/julia-releases/
export JULIA_PKG_SERVER=http://cn-southeast.pkg.juliacn.com

Install Juliaup and Julia with

curl -fsSL https://install.julialang.org | sh

If the above command fails, please run again. Type julia to start a julia REPL, and then install the following packages by typing

using Revise, Yao, CuYao, OMEinsum

submit an MPI job to LSF

git clone https://github.com/CodingThrust/CodingClub.git
cd mpi
bsub < julia-helloworld-lsf.job

To know more about how does MPI works, please read the README.md in your current working folder.

Single-GPU benchmark

git clone https://github.com/TensorBFS/TensorNetworkBenchmarks.git
cd TensorNetworkBenchmarks/scripts
julia -e "using Pkg; Pkg.instantiate()"
bsub < julia-gpu-lsf.job
bsub < pytorch-gpu-lsf.job

Multi-GPU test

mkdir jcode
cd jcode
mkdir lab
cd lab
wget https://gist.githubusercontent.com/GiggleLiu/d5b66c9883f0c5df41a440589983ab99/raw/defd440d6379d85c93034b1ccf75b9650804992c/tensorcontract_multigpu.jl

To be continue.

Problems

Julia startup takes 6s

[jinguoliu@mgt02 ~]$ julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.9.2 (2023-07-05)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

Using a julia package takes too long

julia> @time using Yao
  3.322434 seconds (776.64 k allocations: 51.804 MiB, 2.56% gc time, 0.68% compilation time)

For comparison, the time on my laptop

julia> @time using Yao
  0.937178 seconds (772.86 k allocations: 51.008 MiB, 10.43% gc time, 1.82% compilation time)

CUDA.jl can not download CUDA driver

[jinguoliu@mgt02 gpu]$ cat 143542.err 
 Downloading artifact: LLVMExtra
 Downloading artifact: LLVMExtra
ERROR: LoadError: InitError: Unable to automatically download/install artifact 'LLVMExtra' from sources listed in '/hpc/users/HKUST-GZ/jinguoliu/.julia/packages/LLVMExtra_jll/5xA91/Artifacts.toml'.
Sources attempted:
- http://cn-southeast.pkg.juliacn.com/artifact/0fee7c1c15f87bbf886d08a1a90ce0e85c6d2453
    Error: RequestError: Connection timeout after 30031 ms while requesting http://cn-southeast.pkg.juliacn.com/artifact/0fee7c1c15f87bbf886d08a1a90ce0e85c6d2453
- https://github.com/JuliaBinaryWrappers/LLVMExtra_jll.jl/releases/download/LLVMExtra-v0.0.23+0/LLVMExtra.v0.0.23.x86_64-linux-gnu-cxx11-llvm_version+14.tar.gz
    Error: RequestError: HTTP/2 302 (Failed to connect to objects.githubusercontent.com port 443 after 28133 ms: Connection timed out) while requesting https://github.com/JuliaBinaryWrappers/LLVMExtra_jll.jl/releases/download/LLVMExtra-v0.0.23+0/LLVMExtra.v0.0.23.x86_64-linux-gnu-cxx11-llvm_version+14.tar.gz

Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] ensure_artifact_installed(name::String, meta::Dict{String, Any}, artifacts_toml::String; platform::Base.BinaryPlatforms.Platform, verbose::Bool, quiet_download::Bool, io::IOStream)
    @ Pkg.Artifacts ~/software/julia-1.9.0/share/julia/stdlib/v1.9/Pkg/src/Artifacts.jl:443
  [3] ensure_artifact_installed(name::String, artifacts_toml::String; platform::Base.BinaryPlatforms.Platform, pkg_uuid::Nothing, verbose::Bool, quiet_download::Bool, io::IOStream)
    @ Pkg.Artifacts ~/software/julia-1.9.0/share/julia/stdlib/v1.9/Pkg/src/Artifacts.jl:383
  [4] _artifact_str(__module__::Module, artifacts_toml::String, name::SubString{String}, path_tail::String, artifact_dict::Dict{String, Any}, hash::Base.SHA1, platform::Base.BinaryPlatforms.Platform, lazyartifacts::Any)
    @ Artifacts ~/software/julia-1.9.0/share/julia/stdlib/v1.9/Artifacts/src/Artifacts.jl:549
  [5] invokelatest(::Any, ::Any, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Base ./essentials.jl:816
  [6] invokelatest(::Any, ::Any, ::Vararg{Any})
    @ Base ./essentials.jl:813
  [7] macro expansion
    @ ~/software/julia-1.9.0/share/julia/stdlib/v1.9/Artifacts/src/Artifacts.jl:710 [inlined]
  [8] find_artifact_dir()
    @ LLVMExtra_jll ~/.julia/packages/JLLWrappers/QpMQW/src/wrapper_generators.jl:13
  [9] __init__()
    @ LLVMExtra_jll ~/.julia/packages/LLVMExtra_jll/5xA91/src/wrappers/x86_64-linux-gnu-cxx11-llvm_version+14.jl:7
 [10] register_restored_modules(sv::Core.SimpleVector, pkg::Base.PkgId, path::String)
    @ Base ./loading.jl:1074
 [11] _include_from_serialized(pkg::Base.PkgId, path::String, ocachepath::String, depmods::Vector{Any})
    @ Base ./loading.jl:1020
 [12] _tryrequire_from_serialized(pkg::Base.PkgId, path::String, ocachepath::String)
    @ Base ./loading.jl:1407
 [13] _require(pkg::Base.PkgId, env::String)
    @ Base ./loading.jl:1781
 [14] _require_prelocked(uuidkey::Base.PkgId, env::String)
    @ Base ./loading.jl:1625
 [15] macro expansion
    @ ./loading.jl:1613 [inlined]
 [16] macro expansion
    @ ./lock.jl:267 [inlined]
 [17] require(into::Module, mod::Symbol)
    @ Base ./loading.jl:1576
 [18] include
    @ ./Base.jl:457 [inlined]
 [19] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt128}}, source::String)
    @ Base ./loading.jl:2010
 [20] top-level scope
    @ stdin:2
during initialization of module LLVMExtra_jll
in expression starting at /hpc/users/HKUST-GZ/jinguoliu/.julia/packages/LLVM/Od0DH/src/LLVM.jl:1
in expression starting at stdin:2
ERROR: LoadError: Failed to precompile LLVM [929cbde3-209d-540e-8aea-75f648917ca0] to "/hpc/users/HKUST-GZ/jinguoliu/.julia/compiled/v1.9/LLVM/jl_J2OmI1".
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] compilecache(pkg::Base.PkgId, path::String, internal_stderr::IO, internal_stdout::IO, keep_loaded_modules::Bool)
    @ Base ./loading.jl:2260
  [3] compilecache
    @ ./loading.jl:2127 [inlined]
  [4] _require(pkg::Base.PkgId, env::String)
    @ Base ./loading.jl:1770
  [5] _require_prelocked(uuidkey::Base.PkgId, env::String)
    @ Base ./loading.jl:1625
  [6] macro expansion
    @ ./loading.jl:1613 [inlined]
  [7] macro expansion
    @ ./lock.jl:267 [inlined]
  [8] require(into::Module, mod::Symbol)
    @ Base ./loading.jl:1576
  [9] include
    @ ./Base.jl:457 [inlined]
 [10] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt128}}, source::String)
    @ Base ./loading.jl:2010
 [11] top-level scope
    @ stdin:2
in expression starting at /hpc/users/HKUST-GZ/jinguoliu/.julia/packages/GPUCompiler/YO8Uj/src/GPUCompiler.jl:1
in expression starting at stdin:2
ERROR: LoadError: Failed to precompile GPUCompiler [61eb1bfa-7361-4325-ad38-22787b887f55] to "/hpc/users/HKUST-GZ/jinguoliu/.julia/compiled/v1.9/GPUCompiler/jl_RakosM".
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] compilecache(pkg::Base.PkgId, path::String, internal_stderr::IO, internal_stdout::IO, keep_loaded_modules::Bool)
    @ Base ./loading.jl:2260
  [3] compilecache
    @ ./loading.jl:2127 [inlined]
  [4] _require(pkg::Base.PkgId, env::String)
    @ Base ./loading.jl:1770
  [5] _require_prelocked(uuidkey::Base.PkgId, env::String)
    @ Base ./loading.jl:1625
  [6] macro expansion
    @ ./loading.jl:1613 [inlined]
  [7] macro expansion
    @ ./lock.jl:267 [inlined]
  [8] require(into::Module, mod::Symbol)
    @ Base ./loading.jl:1576
  [9] include
    @ ./Base.jl:457 [inlined]
 [10] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt128}}, source::String)
    @ Base ./loading.jl:2010
 [11] top-level scope
    @ stdin:2
in expression starting at /hpc/users/HKUST-GZ/jinguoliu/.julia/packages/CUDA/tVtYo/src/CUDA.jl:1
in expression starting at stdin:2
ERROR: LoadError: Failed to precompile CUDA [052768ef-5323-5732-b1bb-66c8b64840ba] to "/hpc/users/HKUST-GZ/jinguoliu/.julia/compiled/v1.9/CUDA/jl_gGumDP".
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] compilecache(pkg::Base.PkgId, path::String, internal_stderr::IO, internal_stdout::IO, keep_loaded_modules::Bool)
    @ Base ./loading.jl:2260
  [3] compilecache
    @ ./loading.jl:2127 [inlined]
  [4] _require(pkg::Base.PkgId, env::String)
    @ Base ./loading.jl:1770
  [5] _require_prelocked(uuidkey::Base.PkgId, env::String)
    @ Base ./loading.jl:1625
  [6] macro expansion
    @ ./loading.jl:1613 [inlined]
  [7] macro expansion
    @ ./lock.jl:267 [inlined]
  [8] require(into::Module, mod::Symbol)
    @ Base ./loading.jl:1576
  [9] include
    @ ./Base.jl:457 [inlined]
 [10] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt128}}, source::String)
    @ Base ./loading.jl:2010
 [11] top-level scope
    @ stdin:2
in expression starting at /hpc/users/HKUST-GZ/jinguoliu/.julia/packages/CuYao/xfcjV/src/CuYao.jl:1
in expression starting at stdin:2
ERROR: LoadError: Failed to precompile CuYao [b48ca7a8-dd42-11e8-2b8e-1b7706800275] to "/hpc/users/HKUST-GZ/jinguoliu/.julia/compiled/v1.9/CuYao/jl_8vWrJn".
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] compilecache(pkg::Base.PkgId, path::String, internal_stderr::IO, internal_stdout::IO, keep_loaded_modules::Bool)
   @ Base ./loading.jl:2260
 [3] compilecache
   @ ./loading.jl:2127 [inlined]
 [4] _require(pkg::Base.PkgId, env::String)
   @ Base ./loading.jl:1770
 [5] _require_prelocked(uuidkey::Base.PkgId, env::String)
   @ Base ./loading.jl:1625
 [6] macro expansion
   @ ./loading.jl:1613 [inlined]
 [7] macro expansion
   @ ./lock.jl:267 [inlined]
 [8] require(into::Module, mod::Symbol)
   @ Base ./loading.jl:1576
in expression starting at /hpc/users/HKUST-GZ/jinguoliu/jcode/CodingClub/mpi/gpu/cuda.jl:1

Resolution

Configure CUDA runtime with.

julia> using CUDA

julia> CUDA.set_runtime_version!(v"11.4")

The version number should be consistent with the module load cuda-11.4 in your job file.

(NOTE: I am not sure about this resolution, please let me know if it does not work)

Juliaup installation command might fail due to the network problem

Now installing Juliaup
Error: Failed to run `run_command_add`.

Caused by:
    0: Failed to update versions db.
    1: Failed to download new version db from https://julialang-s3.julialang.org/juliaup/versiondb/versiondb-1.0.28-x86_64-unknown-linux-musl.json.
    2: Failed to download from url `https://julialang-s3.julialang.org/juliaup/versiondb/versiondb-1.0.28-x86_64-unknown-linux-musl.json`.
    3: error sending request for url (https://julialang-s3.julialang.org/juliaup/versiondb/versiondb-1.0.28-x86_64-unknown-linux-musl.json): operation timed out
    4: operation timed out

Resolution

Given that you have set the JULIAUP_SERVER environment variable correctly, try again.

CC BY-SA 4.0 GiggleLiu. Last modified: April 04, 2024. Website built with Franklin.jl and the Julia programming language.