Performance Benchmarks
Real-world performance comparison with Julia implementation.
Benchmark Methodology
All benchmarks compare against OMEinsumContractionOrders.jl, the original Julia implementation.
Test Environment:
- Benchmarks run via Python bindings (PyO3)
- Configuration:
ntrials=1,niters=50-100 - See
benchmarks/directory for scripts
Rust vs Julia (TreeSA)
Comparison of TreeSA optimizer performance:
| Problem | Tensors | Indices | Rust (ms) | Julia (ms) | tc (Rust) | tc (Julia) | Speedup |
|---|---|---|---|---|---|---|---|
| chain_10 | 10 | 11 | 18.5 | 31.6 | 23.10 | 23.10 | 1.7x |
| grid_4x4 | 16 | 24 | 88.0 | 150.7 | 9.18 | 9.26 | 1.7x |
| grid_5x5 | 25 | 40 | 155.4 | 297.7 | 10.96 | 10.96 | 1.9x |
| reg3_250 | 250 | 372 | 2,435 | 5,099 | 48.00 | 47.17 | 2.1x |
Results:
- Rust is 1.7-2.1x faster than Julia for TreeSA optimization
- Both implementations find solutions with comparable time complexity (tc)
- Solution quality is nearly identical between implementations
Greedy Method Performance
GreedyMethod is extremely fast for quick optimization:
| Problem | Tensors | Time (ms) | tc | sc |
|---|---|---|---|---|
| chain_10 | 10 | 0.04 | 23.10 | 13.29 |
| grid_4x4 | 16 | 0.12 | 9.54 | 5.0 |
| grid_5x5 | 25 | 0.23 | 11.28 | 6.0 |
| reg3_250 | 250 | 7.5 | 69.00 | 47.0 |
Key Observations:
- Greedy is 100-300x faster than TreeSA
- Greedy gives good results for small problems (chain_10, grids)
- For large problems (reg3_250), TreeSA finds much better solutions:
- Greedy: tc = 69.00
- TreeSA: tc = 48.00
- Improvement: 2^(69-48) = 2^21 = 2 million times faster execution!
Python Overhead
Python bindings via PyO3 add minimal overhead:
TreeSA (grid_4x4, 100 iterations):
- Pure Rust backend: ~88ms
- Python call overhead: <1ms (~1%)
Greedy (reg3_250):
- Pure optimization: ~7.5ms
- Python overhead: <0.5ms (~6%)
Conclusion: Python overhead is negligible, especially for TreeSA optimization.
Algorithm Comparison
When to use each algorithm:
| Scenario | Recommended | Reason |
|---|---|---|
| Quick prototyping | Greedy | Sub-millisecond optimization |
| Production (< 50 tensors) | Greedy | Fast enough, good quality |
| Production (50-200 tensors) | TreeSA.fast() | Better quality, still reasonable time |
| Large networks (> 200 tensors) | TreeSA | Significant quality improvement |
| Time-critical | Greedy | Predictable fast performance |
| Execution-critical | TreeSA | Better contraction order = faster execution |
Execution Time Savings
The time spent optimizing pays off during execution:
Example: reg3_250 network
- Greedy optimization: 7.5ms
- TreeSA optimization: 2,435ms (~2.4 seconds)
- Extra optimization time: +2.4 seconds
But the execution improvement:
- Greedy tc: 69.00 → ~2^69 FLOPs
- TreeSA tc: 48.00 → ~2^48 FLOPs
- Execution speedup: 2^21 = 2 million times faster
If executing even once, TreeSA optimization is worth it!
Hardware Recommendations
| Tensor Count | RAM | CPU | Recommendation |
|---|---|---|---|
| < 20 | 4GB | Any | Greedy is sufficient |
| 20-100 | 8GB | 4+ cores | TreeSA.fast() for production |
| 100-500 | 16GB | 8+ cores | TreeSA with multiple trials |
| > 500 | 32GB+ | 16+ cores | TreeSA (may take minutes) |
Running Benchmarks
Reproduce the benchmarks yourself:
cd benchmarks
# Python (Rust via PyO3)
python benchmark_python.py
# Julia (original implementation)
julia --project=. benchmark_julia.jl
# Compare results
python -c "
import json
with open('results_rust_treesa.json') as f:
rust = json.load(f)
with open('results_julia_treesa.json') as f:
julia = json.load(f)
for problem in rust['results']:
r = rust['results'][problem]['avg_ms']
j = julia['results'][problem]['avg_ms']
print(f'{problem}: Rust {r:.1f}ms, Julia {j:.1f}ms, Speedup {j/r:.2f}x')
"
Profiling Your Code
Python Profiling
import time
from omeco import optimize_code, TreeSA
# Time optimization
start = time.time()
tree = optimize_code(ixs, out, sizes, TreeSA.fast())
opt_time = time.time() - start
# Check complexity
comp = tree.complexity(ixs, sizes)
print(f"Optimization: {opt_time:.3f}s")
print(f"Time complexity: 2^{comp.tc:.2f} = {2**comp.tc:.2e} FLOPs")
print(f"Space complexity: 2^{comp.sc:.2f} elements")
# Estimate execution time (assuming 10 GFLOP/s CPU)
execution_seconds = 2**comp.tc / 1e10
print(f"Estimated execution: {execution_seconds:.1f}s @ 10 GFLOP/s")
Rust Profiling
# CPU profiling with flamegraph
cargo install flamegraph
cargo flamegraph --example your_example
# Criterion benchmarks (if available)
cargo bench
Performance Tips
- Start with Greedy: Always try Greedy first to get a baseline
- Use TreeSA for production: The extra optimization time pays off
- Increase ntrials for critical code: More trials = better solutions
- Profile execution time: Verify that better tc actually improves runtime
- Consider memory: Use sc_target if memory is constrained
Next Steps
- Algorithm Comparison - Detailed algorithm trade-offs
- Quick Start - Get started optimizing
- Troubleshooting - Common performance issues