Contraction Order Problem
The Problem
When contracting multiple tensors, the order matters exponentially.
Simple Example: Three Matrices
Consider A[10×100] × B[100×20] × C[20×5]:
Option 1: Left-to-right (A×B)×C
Step 1: A[10×100] × B[100×20] = T[10×20]
Cost: 10 × 100 × 20 = 20,000 FLOPs
Memory: 200 elements
Step 2: T[10×20] × C[20×5] = D[10×5]
Cost: 10 × 20 × 5 = 1,000 FLOPs
Memory: 50 elements
Total: 21,000 FLOPs, peak memory 200 elements
Option 2: Right-to-left A×(B×C)
Step 1: B[100×20] × C[20×5] = T[100×5]
Cost: 100 × 20 × 5 = 10,000 FLOPs
Memory: 500 elements
Step 2: A[10×100] × T[100×5] = D[10×5]
Cost: 10 × 100 × 5 = 50,000 FLOPs
Memory: 50 elements
Total: 60,000 FLOPs, peak memory 500 elements
Result: Option 1 is 3x faster and uses 2.5x less memory!
Why It’s Hard
Finding the optimal contraction order is NP-complete:
- With N tensors, there are approximately (2N)!/(N+1)! possible orders
- For 10 tensors: ~17 million orders
- For 20 tensors: ~6 × 10²² orders (more than atoms in your body!)
Brute force is impossible for realistic problems.
Heuristic Approaches
Since optimal search is intractable, we use heuristics:
1. Greedy Algorithm
Idea: At each step, contract the pair with minimum cost.
Pros:
- Fast: O(n² log n)
- Deterministic
- Good for many practical cases
Cons:
- Can get stuck in local optima
- May miss better global solutions
2. Simulated Annealing (TreeSA)
Idea: Random tree mutations with temperature-based acceptance.
Pros:
- Explores more of the search space
- Often finds better solutions than greedy
- Configurable trade-off (time vs quality)
Cons:
- Slower than greedy
- Non-deterministic
- Requires parameter tuning
Real-World Impact
Quantum Circuit with 50 Qubits
- Random order: ~10⁵⁰ operations (impossible)
- Greedy order: ~10³⁰ operations (still impossible)
- Optimized order: ~10²⁰ operations (feasible on supercomputer)
100,000,000,000x speedup from optimization!
Neural Network Attention
# Naive: einsum("bqd,bkd,bkv->bqv", Q, K, V)
# Time: O(q²d + qkd + qkv)
# Optimized: Q @ K.T @ V
# Time: O(qkd + qkv)
# Speedup: ~10x for typical dimensions
Optimization Objectives
Different scenarios need different trade-offs:
| Scenario | Optimize | Accept | Example |
|---|---|---|---|
| GPU limited | Space (sc) | Higher time | Large models on 8GB GPU |
| CPU compute | Time (tc) | Higher space | Batch processing with 256GB RAM |
| Both | Balanced | sc_target = available memory | Most cases |
Practical Guidelines
-
Start with greedy: Fast baseline for most cases
-
Use TreeSA if:
- Greedy result is too slow/large
- You have time to optimize (minutes, not seconds)
- Problem is large and complex
-
Set sc_target to available memory:
# For 8GB GPU with float32 tensors sc_target = log2(8 * 1024**3 / 4) ≈ 31 -
Use slicing when memory is the bottleneck:
if complexity.sc > sc_target: sliced = slice_code(tree, ixs, sizes, TreeSASlicer.fast())
Next Steps
- Complexity Metrics - Understand tc, sc, rwc
- Greedy Method - Fast optimization
- TreeSA - Higher quality optimization