Performance Benchmarks

A real-world performance comparison against the original Julia implementation.

Benchmark Methodology

All benchmarks compare against OMEinsumContractionOrders.jl, the original Julia implementation.

Test Environment:

  • Benchmarks run via Python bindings (PyO3)
  • Configuration: ntrials=1, niters=50-100
  • See benchmarks/ directory for scripts

Rust vs Julia (TreeSA)

Comparison of TreeSA optimizer performance:

| Problem | Tensors | Indices | Rust (ms) | Julia (ms) | tc (Rust) | tc (Julia) | Speedup |
|----------|---------|---------|-----------|------------|-----------|------------|---------|
| chain_10 | 10 | 11 | 18.5 | 31.6 | 23.10 | 23.10 | 1.7x |
| grid_4x4 | 16 | 24 | 88.0 | 150.7 | 9.18 | 9.26 | 1.7x |
| grid_5x5 | 25 | 40 | 155.4 | 297.7 | 10.96 | 10.96 | 1.9x |
| reg3_250 | 250 | 372 | 2,435 | 5,099 | 48.00 | 47.17 | 2.1x |

Results:

  • Rust is 1.7-2.1x faster than Julia for TreeSA optimization
  • Both implementations find solutions with comparable time complexity (tc)
  • Solution quality is nearly identical between implementations

Greedy Method Performance

GreedyMethod is extremely fast for quick optimization:

| Problem | Tensors | Time (ms) | tc | sc |
|----------|---------|-----------|-------|-------|
| chain_10 | 10 | 0.04 | 23.10 | 13.29 |
| grid_4x4 | 16 | 0.12 | 9.54 | 5.0 |
| grid_5x5 | 25 | 0.23 | 11.28 | 6.0 |
| reg3_250 | 250 | 7.5 | 69.00 | 47.0 |
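
For reference, here is a minimal sketch of running the greedy optimizer through the Python bindings. The problem definition (a hypothetical 10-tensor chain) and the assumption that GreedyMethod is importable from omeco alongside TreeSA are illustrative only; check the API reference for the exact input format.

import time
from omeco import optimize_code, GreedyMethod  # GreedyMethod export is assumed

# Hypothetical matrix chain: tensor i carries indices (i, i+1), all of size 2.
# The exact input format expected by optimize_code may differ; see the API docs.
ixs = [[i, i + 1] for i in range(10)]
out = [0, 10]
sizes = {i: 2 for i in range(11)}

start = time.perf_counter()
tree = optimize_code(ixs, out, sizes, GreedyMethod())
elapsed_ms = (time.perf_counter() - start) * 1e3

comp = tree.complexity(ixs, sizes)
print(f"Greedy took {elapsed_ms:.3f} ms: tc = {comp.tc:.2f}, sc = {comp.sc:.2f}")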

Key Observations:

  • Greedy is two to three orders of magnitude faster than TreeSA (roughly 300-700x on these problems)
  • Greedy gives good results for small problems (chain_10, grid_4x4, grid_5x5)
  • For large problems (reg3_250), TreeSA finds much better solutions:
    • Greedy: tc = 69.00
    • TreeSA: tc = 48.00
    • Improvement: 2^(69-48) = 2^21 ≈ 2 million times fewer FLOPs, i.e. roughly 2 million times faster execution

Python Overhead

Python bindings via PyO3 add minimal overhead:

TreeSA (grid_4x4, 100 iterations):

  • Pure Rust backend: ~88ms
  • Python call overhead: <1ms (~1%)

Greedy (reg3_250):

  • Pure optimization: ~7.5ms
  • Python overhead: <0.5ms (~6%)

Conclusion: Python overhead is negligible, especially for TreeSA optimization.

Algorithm Comparison

When to use each algorithm:

| Scenario | Recommended | Reason |
|----------|-------------|--------|
| Quick prototyping | Greedy | Sub-millisecond optimization |
| Production (< 50 tensors) | Greedy | Fast enough, good quality |
| Production (50-200 tensors) | TreeSA.fast() | Better quality, still reasonable time |
| Large networks (> 200 tensors) | TreeSA | Significant quality improvement |
| Time-critical | Greedy | Predictable fast performance |
| Execution-critical | TreeSA | Better contraction order = faster execution |
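
The table can also be expressed as a small helper. This is a hypothetical convenience function, not part of the omeco API; it assumes GreedyMethod is exported alongside TreeSA and that TreeSA() can be constructed with default settings.

from omeco import GreedyMethod, TreeSA  # assumed exports; check the API reference

def choose_optimizer(num_tensors: int, execution_critical: bool = False):
    """Pick an optimizer following the recommendations in the table above."""
    if execution_critical or num_tensors > 200:
        return TreeSA()          # assumed default constructor; best contraction quality
    if num_tensors > 50:
        return TreeSA.fast()     # good quality, still quick to optimize
    return GreedyMethod()        # sub-millisecond, fine for small networks

# Example: a 250-tensor network whose execution time matters
optimizer = choose_optimizer(num_tensors=250, execution_critical=True)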

Execution Time Savings

The time spent optimizing pays off during execution:

Example: reg3_250 network

  • Greedy optimization: 7.5ms
  • TreeSA optimization: 2,435ms (~2.4 seconds)
  • Extra optimization time: +2.4 seconds

But the execution improvement:

  • Greedy tc: 69.00 → ~2^69 FLOPs
  • TreeSA tc: 48.00 → ~2^48 FLOPs
  • Execution speedup: 2^21 = 2 million times faster

If the network is executed even once, the extra TreeSA optimization time pays for itself!
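
To make the trade-off concrete, here is a back-of-the-envelope calculation using the tc values and optimization times above. The assumed 10 GFLOP/s throughput is illustrative only (the same assumption as in the profiling example later on this page).

# Back-of-the-envelope totals for reg3_250, using the numbers on this page.
FLOPS_PER_SECOND = 1e10  # assumed 10 GFLOP/s CPU, purely illustrative

def total_seconds(opt_ms: float, tc: float) -> float:
    """Optimization time plus estimated execution time (2^tc FLOPs)."""
    return opt_ms / 1e3 + 2 ** tc / FLOPS_PER_SECOND

greedy = total_seconds(opt_ms=7.5, tc=69.00)     # ~5.9e10 s: execution dominates
treesa = total_seconds(opt_ms=2435.0, tc=48.00)  # ~2.8e4 s (~8 hours) incl. 2.4 s optimization

print(f"Greedy total: ~{greedy:.2e} s")
print(f"TreeSA total: ~{treesa:.2e} s")
print(f"FLOP ratio: 2^(69-48) = {2 ** (69 - 48):,}x")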

Hardware Recommendations

| Tensor Count | RAM | CPU | Recommendation |
|--------------|-------|-----------|------------------------------|
| < 20 | 4GB | Any | Greedy is sufficient |
| 20-100 | 8GB | 4+ cores | TreeSA.fast() for production |
| 100-500 | 16GB | 8+ cores | TreeSA with multiple trials |
| > 500 | 32GB+ | 16+ cores | TreeSA (may take minutes) |

Running Benchmarks

Reproduce the benchmarks yourself:

cd benchmarks

# Python (Rust via PyO3)
python benchmark_python.py

# Julia (original implementation)
julia --project=. benchmark_julia.jl

# Compare results
python -c "
import json
with open('results_rust_treesa.json') as f:
    rust = json.load(f)
with open('results_julia_treesa.json') as f:
    julia = json.load(f)

for problem in rust['results']:
    r = rust['results'][problem]['avg_ms']
    j = julia['results'][problem]['avg_ms']
    print(f'{problem}: Rust {r:.1f}ms, Julia {j:.1f}ms, Speedup {j/r:.2f}x')
"

Profiling Your Code

Python Profiling

import time
from omeco import optimize_code, TreeSA

# ixs, out, and sizes describe your einsum network (per-tensor indices, output
# indices, and index sizes); they are assumed to be defined earlier.

# Time optimization
start = time.time()
tree = optimize_code(ixs, out, sizes, TreeSA.fast())
opt_time = time.time() - start

# Check complexity
comp = tree.complexity(ixs, sizes)

print(f"Optimization: {opt_time:.3f}s")
print(f"Time complexity: 2^{comp.tc:.2f} = {2**comp.tc:.2e} FLOPs")
print(f"Space complexity: 2^{comp.sc:.2f} elements")

# Estimate execution time (assuming 10 GFLOP/s CPU)
execution_seconds = 2**comp.tc / 1e10
print(f"Estimated execution: {execution_seconds:.1f}s @ 10 GFLOP/s")

Rust Profiling

# CPU profiling with flamegraph
cargo install flamegraph
cargo flamegraph --example your_example

# Criterion benchmarks (if available)
cargo bench

Performance Tips

  1. Start with Greedy: Always try Greedy first to get a baseline
  2. Use TreeSA for production: The extra optimization time pays off
  3. Increase ntrials for critical code: More trials = better solutions
  4. Profile execution time: Verify that better tc actually improves runtime
  5. Consider memory: Use sc_target if memory is constrained
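
As a hedged sketch of tips 3 and 5, the snippet below increases the number of trials and sets a space-complexity target. The ntrials, niters, and sc_target keyword arguments mirror the options of the Julia TreeSA and are assumptions here; verify the exact parameter names in the omeco API reference.

import time
from omeco import optimize_code, TreeSA

# Hypothetical network; replace with your own ixs / out / sizes.
ixs = [[i, i + 1] for i in range(10)]
out = [0, 10]
sizes = {i: 2 for i in range(11)}

# Assumed keyword arguments: more trials cost time but usually improve quality,
# and sc_target caps the space complexity when memory is constrained.
optimizer = TreeSA(ntrials=4, niters=100, sc_target=30)

start = time.time()
tree = optimize_code(ixs, out, sizes, optimizer)
comp = tree.complexity(ixs, sizes)
print(f"Optimized in {time.time() - start:.2f}s: tc = {comp.tc:.2f}, sc = {comp.sc:.2f}")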

Next Steps