omeco
High-performance tensor network contraction order optimization in Rust.
What is omeco?
omeco (One More Einsum Contraction Order) is a library for optimizing tensor network contractions. When contracting multiple tensors together, the order of contractions can make an exponential difference in computational cost. omeco provides fast algorithms to find near-optimal contraction orders.
Ported from the Julia library OMEinsumContractionOrders.jl.
Why Optimize Contraction Order?
Consider contracting three matrices: A[10×100] × B[100×20] × C[20×5].
Different orders give vastly different costs:
| Order | Operations | Peak Memory |
|---|---|---|
| (A×B)×C | 20,000 + 1,000 = 21,000 | 200 elements |
| A×(B×C) | 10,000 + 5,000 = 15,000 | 500 elements |
A×(B×C) needs 1.4x fewer operations, while (A×B)×C keeps the intermediate 2.5x smaller: even three matrices force a time/space trade-off. For larger tensor networks, differences can be exponential.
Finding the optimal order is NP-complete, but heuristics find near-optimal solutions in seconds.
Key Features
- Fast Greedy Algorithm: O(n² log n) greedy method for quick optimization
- Simulated Annealing: TreeSA finds higher-quality solutions for complex networks
- Automatic Slicing: TreeSASlicer reduces memory by trading time for space
- Hardware-Aware: Configure for CPU or GPU with memory I/O optimization
- Python & Rust: Native Rust with Python bindings via PyO3
Example
Python:
from omeco import optimize_code
# Matrix chain: A[i,j] × B[j,k] × C[k,l]
ixs = [[0, 1], [1, 2], [2, 3]]
out = [0, 3]
sizes = {0: 10, 1: 100, 2: 20, 3: 5}
tree = optimize_code(ixs, out, sizes)
complexity = tree.complexity(ixs, sizes)
print(f"Time: 2^{complexity.tc:.2f}, Space: 2^{complexity.sc:.2f}")
Rust:
use omeco::{contraction_complexity, EinCode, GreedyMethod, optimize_code};
use std::collections::HashMap;

fn main() {
    let code = EinCode::new(vec![vec![0, 1], vec![1, 2], vec![2, 3]], vec![0, 3]);
    let sizes = HashMap::from([(0, 10), (1, 100), (2, 20), (3, 5)]);
    let tree = optimize_code(&code, &sizes, &GreedyMethod::default()).unwrap();
    let complexity = contraction_complexity(&tree, &code, &sizes);
    println!("Time: 2^{:.2}, Space: 2^{:.2}", complexity.tc, complexity.sc);
}
Use Cases
- Quantum Circuit Simulation: Optimizing tensor network contractions in quantum algorithms
- Neural Networks: Efficient einsum operations in deep learning
- Scientific Computing: Large-scale tensor computations in physics and chemistry
- Graph Algorithms: Computing graph polynomials and partition functions
Next Steps
- Installation - Install omeco for Python or Rust
- Quick Start - Your first tensor contraction optimization
- Concepts - Understand tensor networks and complexity metrics
Getting Started
This section covers installation and basic usage of omeco.
Overview
omeco is available as both a Rust crate and a Python package. Choose the language that best fits your workflow:
- Python: Quick prototyping, PyTorch/NumPy integration, Jupyter notebooks
- Rust: Maximum performance, embedded systems, library development
Crate Organization
| Package | Description |
|---|---|
| omeco | Core Rust library with optimization algorithms |
| omeco-python | Python bindings built with PyO3 |
System Requirements
For Python
- Python 3.8 or later
- pip or conda package manager
For Rust
- Rust 1.70 or later
- Cargo build system
Next Steps
- Installation - Install omeco for your platform
- Quick Start - Your first contraction optimization
Installation
Python
Via pip
The easiest way to install omeco for Python:
pip install omeco
Via conda (not yet available)
# Coming soon
conda install -c conda-forge omeco
From source
For development or the latest features:
git clone https://github.com/GiggleLiu/omeco.git
cd omeco/omeco-python
pip install maturin
maturin develop
Verify Installation
import omeco
print(omeco.__version__)
# Quick test
from omeco import optimize_code, GreedyMethod
tree = optimize_code([[0, 1], [1, 2]], [0, 2], {0: 2, 1: 3, 2: 2})
print(tree)
Expected output:
ab, bc -> ac
├─ tensor_0
└─ tensor_1
Rust
Via Cargo
Add to your Cargo.toml:
[dependencies]
omeco = "0.2"
Or use cargo add:
cargo add omeco
From source
git clone https://github.com/GiggleLiu/omeco.git
cd omeco
cargo build --release
Verify Installation
Create src/main.rs:
use omeco::{EinCode, GreedyMethod, optimize_code};
use std::collections::HashMap;
fn main() {
let code = EinCode::new(vec![vec![0, 1], vec![1, 2]], vec![0, 2]);
let sizes = HashMap::from([(0, 2), (1, 3), (2, 2)]);
let tree = optimize_code(&code, &sizes, &GreedyMethod::default()).unwrap();
println!("{} leaves, depth {}", tree.leaf_count(), tree.depth());
}
Run:
cargo run
Troubleshooting
Python: “No module named omeco”
- Ensure you’re using the correct Python environment
- Try `pip install --force-reinstall omeco`
- Check `pip list | grep omeco`
Rust: Compilation errors
- Update Rust: `rustup update`
- Check the minimum version: `rustc --version` (need 1.70+)
- Clean and rebuild: `cargo clean && cargo build`
Installation takes too long
- Python: Pre-built wheels are available for common platforms
- Rust: First build compiles dependencies, subsequent builds are fast
Next Steps
- Quick Start - Your first optimization
- Concepts - Understand tensor networks
Quick Start
This guide shows the basics of tensor contraction optimization.
Basic Contraction Optimization
Python
from omeco import optimize_code
# Matrix chain: A[i,j] × B[j,k] × C[k,l] → D[i,l]
ixs = [[0, 1], [1, 2], [2, 3]] # Input tensor indices
out = [0, 3] # Output indices
sizes = {0: 10, 1: 100, 2: 20, 3: 5} # Dimension sizes
# Optimize contraction order
tree = optimize_code(ixs, out, sizes)
# Compute complexity
complexity = tree.complexity(ixs, sizes)
print(f"Time complexity: 2^{complexity.tc:.2f} FLOPs")
print(f"Space complexity: 2^{complexity.sc:.2f} elements")
print(f"Actual FLOPs: {complexity.flops():,.0f}")
print(f"Peak memory: {complexity.peak_memory():,.0f} elements")
Output:
Time complexity: 2^14.29 FLOPs
Space complexity: 2^11.29 elements
Actual FLOPs: 21,000
Peak memory: 2,500 elements
Visualize the contraction tree:
print(tree)
ab, bd -> ad
├─ tensor_0
└─ bc, cd -> bd
   ├─ tensor_1
   └─ tensor_2
Rust
use omeco::{
EinCode, GreedyMethod, optimize_code
};
use std::collections::HashMap;
fn main() {
// Matrix chain: A[i,j] × B[j,k] × C[k,l] → D[i,l]
let code = EinCode::new(
vec![vec![0, 1], vec![1, 2], vec![2, 3]],
vec![0, 3]
);
let sizes = HashMap::from([
(0, 10),
(1, 100),
(2, 20),
(3, 5),
]);
// Optimize with default greedy method
let tree = optimize_code(&code, &sizes, &GreedyMethod::default())
.expect("Optimization failed");
// Compute complexity
let complexity = contraction_complexity(&tree, &code, &sizes);
println!("Time: 2^{:.2}, Space: 2^{:.2}",
complexity.tc, complexity.sc);
println!("Leaves: {}, Depth: {}",
tree.leaf_count(), tree.depth());
}
Understanding the Output
NestedEinsum Structure
The optimized contraction tree shows the order of operations:
ab, bd -> ad
├─ tensor_0
└─ bc, cd -> bd
   ├─ tensor_1
   └─ tensor_2
This means:
- First contract `tensor_1` (indices `bc`) with `tensor_2` (indices `cd`) → the result has indices `bd`
- Then contract `tensor_0` (indices `ab`) with that result → the final output has indices `ad`
Complexity Metrics
- tc (time complexity): log₂ of total FLOPs needed
- sc (space complexity): log₂ of peak memory usage
- rwc (read-write complexity): log₂ of memory I/O operations (for GPU optimization)
Lower values = better performance.
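These log-scale values convert back to absolute numbers with a power of two; for example, reusing the complexity object computed above:

flops = 2 ** complexity.tc          # total floating-point operations
peak_elements = 2 ** complexity.sc  # peak memory in tensor elements
peak_bytes = peak_elements * 8      # assuming float64 (8 bytes per element)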
Using Different Optimizers
Greedy Method (Fast)
from omeco import GreedyMethod
# Deterministic greedy (default)
tree = optimize_code(ixs, out, sizes, GreedyMethod())
# Stochastic greedy (explores more options)
tree = optimize_code(ixs, out, sizes, GreedyMethod(alpha=0.5, temperature=1.0))
TreeSA (Higher Quality)
from omeco import TreeSA, ScoreFunction
# Fast preset (good for most cases)
tree = optimize_code(ixs, out, sizes, TreeSA.fast())
# Custom configuration
score = ScoreFunction(sc_target=15.0)
optimizer = TreeSA(ntrials=10, niters=50, score=score)
tree = optimize_code(ixs, out, sizes, optimizer)
Working with the Tree
Convert to Dictionary
tree_dict = tree.to_dict()
# Traverse the tree
def traverse(node):
if "tensor_index" in node:
print(f"Leaf: tensor {node['tensor_index']}")
else:
print(f"Contraction: {node['eins']}")
for child in node["args"]:
traverse(child)
traverse(tree_dict)
Tree Properties
# Number of input tensors
num_tensors = tree.leaf_count()
# Depth of the tree
depth = tree.depth()
# Check if binary tree
is_binary = tree.is_binary()
# Get leaf indices in order
leaves = tree.leaf_indices()
Memory Reduction with Slicing
For large tensor networks that don’t fit in memory:
from omeco import slice_code, TreeSASlicer, ScoreFunction
# Optimize contraction first
tree = optimize_code(ixs, out, sizes)
# Check if memory is too large
complexity = tree.complexity(ixs, sizes)
if complexity.sc > 25.0: # More than 2^25 elements (~256MB for float64)
# Slice to reduce memory
score = ScoreFunction(sc_target=25.0)
slicer = TreeSASlicer.fast(score=score)
sliced = slice_code(tree, ixs, sizes, slicer)
# Check new complexity
sliced_comp = sliced.complexity(ixs, sizes)
print(f"Original space: 2^{complexity.sc:.2f}")
print(f"Sliced space: 2^{sliced_comp.sc:.2f}")
print(f"Sliced indices: {sliced.slicing()}")
Next Steps
- Concepts - Understand tensor networks and complexity
- Algorithms - Learn about GreedyMethod and TreeSA
- Score Function Configuration - Optimize for your hardware
- PyTorch Integration - Use with PyTorch tensors
Concepts
Understanding these core concepts will help you use omeco effectively.
Overview
Tensor network contraction is a fundamental operation in many scientific and machine learning applications. This section explains:
- What tensor networks are and why they matter
- Why contraction order makes an exponential difference
- How to interpret complexity metrics
Topics
- Tensor Networks - Einstein summation and tensor operations
- Contraction Order Problem - Why order matters
- Complexity Metrics - Understanding tc, sc, and rwc
Quick Summary
The Problem: Contracting N tensors can be done in many orders, with costs differing exponentially.
The Solution: Heuristic algorithms find near-optimal orders in polynomial time.
The Trade-off: Time vs quality - greedy is fast, simulated annealing is better.
Tensor Networks
What are Tensor Networks?
A tensor network is a collection of multidimensional arrays (tensors) connected by shared indices. Contracting a tensor network means computing their products and sums according to the Einstein summation convention.
Einstein Summation
The einsum notation specifies which indices to sum over:
einsum("ij,jk->ik", A, B)
This means:
- `A` has shape `[i, j]`
- `B` has shape `[j, k]`
- Sum over the shared index `j`
- The result has shape `[i, k]`
Equivalent to matrix multiplication: C[i,k] = Σⱼ A[i,j] * B[j,k]
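The equivalence is easy to check with NumPy:

import numpy as np

A = np.random.rand(3, 4)
B = np.random.rand(4, 5)
C = np.einsum("ij,jk->ik", A, B)
assert np.allclose(C, A @ B)  # einsum result matches matrix multiplication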
Common Patterns
Matrix Multiplication
# C = A × B
einsum("ij,jk->ik", A, B)
Batch Matrix Multiplication
# Multiple matrix multiplications at once
einsum("bij,bjk->bik", A, B)
Trace
# Sum of diagonal elements
einsum("ii->", A)
Outer Product
# C[i,j] = A[i] * B[j]
einsum("i,j->ij", a, b)
Tensor Contraction with Multiple Indices
# Contract over indices j and k
einsum("ijk,klj->il", A, B)
Tensor Networks in Practice
Quantum Circuit Simulation
Quantum gates are represented as tensors, and circuit evaluation becomes a tensor network contraction:
# 3-qubit quantum circuit
gates = [
    [[...]],  # Gate on qubits 0,1
    [[...]],  # Gate on qubits 1,2
    [[...]],  # Gate on qubits 0,2
]
# Each qubit index appears multiple times
# Contracting computes the circuit output
Neural Networks
Einsum operations appear in:
- Attention mechanisms: `einsum("bqd,bkd->bqk", Q, K)`
- Tensor decompositions: Tucker, CP, tensor train
- Graph neural networks
Scientific Computing
- Physics: Partition functions, path integrals
- Chemistry: Molecular orbital calculations
- Mathematics: Graph polynomials, combinatorial problems
Contraction Order Matters
Given tensors A, B, C to contract, we can do:
- (A × B) × C
- A × (B × C)
- (A × C) × B
These give the same result but have vastly different costs.
Example with shapes A[10,100], B[100,20], C[20,5]:
| Order | FLOPs | Peak Memory |
|---|---|---|
| (A×B)×C | 20,000 + 1,000 = 21,000 | 200 |
| A×(B×C) | 10,000 + 5,000 = 15,000 | 500 |
The second order needs 1.4x fewer FLOPs, while the first keeps the peak intermediate 2.5x smaller. Which one wins depends on your constraint, and the gap grows rapidly with network size.
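The per-step costs above follow one rule: a pairwise contraction costs the product of the sizes of every index involved. A small helper (illustrative only, not part of omeco's API) reproduces the numbers:

from math import prod

def pair_cost(ix_a, ix_b, sizes):
    # every index of either tensor appears in the contraction loop nest
    return prod(sizes[i] for i in set(ix_a) | set(ix_b))

sizes = {"i": 10, "j": 100, "k": 20, "l": 5}
print(pair_cost("ij", "jk", sizes))  # A×B: 10·100·20 = 20,000
print(pair_cost("jk", "kl", sizes))  # B×C: 100·20·5 = 10,000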
Representing Tensor Networks
As Lists of Index Lists
# Matrix chain A×B×C
ixs = [
[0, 1], # A has indices [i, j]
[1, 2], # B has indices [j, k]
[2, 3], # C has indices [k, l]
]
out = [0, 3] # Output has indices [i, l]
As Graphs
Each tensor is a node, shared indices are edges:
i --- A[i,j] --- j --- B[j,k] --- k --- C[k,l] --- l
Contracting removes edges and merges nodes.
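One way to see this graph programmatically is to map each index to the tensors that use it (again an illustrative helper, not omeco API):

from collections import defaultdict

def index_incidence(ixs):
    incidence = defaultdict(list)
    for tensor_id, ix in enumerate(ixs):
        for index in ix:
            incidence[index].append(tensor_id)
    return dict(incidence)

print(index_incidence([[0, 1], [1, 2], [2, 3]]))
# {0: [0], 1: [0, 1], 2: [1, 2], 3: [2]} : index 1 is the edge between tensors 0 and 1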
Next Steps
- Contraction Order Problem - Why optimization is needed
- Complexity Metrics - How to measure cost
- Quick Start - Optimize your first tensor network
Contraction Order Problem
The Problem
When contracting multiple tensors, the order matters exponentially.
Simple Example: Three Matrices
Consider A[10×100] × B[100×20] × C[20×5]:
Option 1: Left-to-right (A×B)×C
Step 1: A[10×100] × B[100×20] = T[10×20]
Cost: 10 × 100 × 20 = 20,000 FLOPs
Memory: 200 elements
Step 2: T[10×20] × C[20×5] = D[10×5]
Cost: 10 × 20 × 5 = 1,000 FLOPs
Memory: 50 elements
Total: 21,000 FLOPs, peak memory 200 elements
Option 2: Right-to-left A×(B×C)
Step 1: B[100×20] × C[20×5] = T[100×5]
Cost: 100 × 20 × 5 = 10,000 FLOPs
Memory: 500 elements
Step 2: A[10×100] × T[100×5] = D[10×5]
Cost: 10 × 100 × 5 = 5,000 FLOPs
Memory: 50 elements
Total: 15,000 FLOPs, peak memory 500 elements
Result: Option 2 needs 1.4x fewer FLOPs, while Option 1 uses 2.5x less peak memory. The contraction order determines both costs, and for larger networks the spread becomes exponential.
Why It’s Hard
Finding the optimal contraction order is NP-complete:
- With N tensors, there are approximately (2N)!/(N+1)! possible orders
- For 10 tensors: ~17 million orders
- For 20 tensors: ~6 × 10²² orders (roughly the number of stars in the observable universe)
Brute force is impossible for realistic problems.
Heuristic Approaches
Since optimal search is intractable, we use heuristics:
1. Greedy Algorithm
Idea: At each step, contract the pair with minimum cost.
Pros:
- Fast: O(n² log n)
- Deterministic
- Good for many practical cases
Cons:
- Can get stuck in local optima
- May miss better global solutions
2. Simulated Annealing (TreeSA)
Idea: Random tree mutations with temperature-based acceptance.
Pros:
- Explores more of the search space
- Often finds better solutions than greedy
- Configurable trade-off (time vs quality)
Cons:
- Slower than greedy
- Non-deterministic
- Requires parameter tuning
Real-World Impact
Quantum Circuit with 50 Qubits
- Random order: ~10⁵⁰ operations (impossible)
- Greedy order: ~10³⁰ operations (still impossible)
- Optimized order: ~10²⁰ operations (feasible on supercomputer)
10,000,000,000x speedup from optimization!
Neural Network Attention
# Naive: einsum("bqd,bkd,bkv->bqv", Q, K, V)
# Time: O(qkdv) per batch when contracted in a single step
# Optimized: Q @ K.T @ V
# Time: O(qkd + qkv)
# Speedup: ~10x for typical dimensions
Optimization Objectives
Different scenarios need different trade-offs:
| Scenario | Optimize | Accept | Example |
|---|---|---|---|
| GPU limited | Space (sc) | Higher time | Large models on 8GB GPU |
| CPU compute | Time (tc) | Higher space | Batch processing with 256GB RAM |
| Both | Balanced | sc_target = available memory | Most cases |
Practical Guidelines
1. Start with greedy: a fast baseline for most cases
2. Use TreeSA if:
   - The greedy result is too slow or too large
   - You have time to optimize (minutes, not seconds)
   - The problem is large and complex
3. Set sc_target to the available memory:
   # For an 8GB GPU with float32 tensors
   sc_target = log2(8 * 1024**3 / 4) ≈ 31
4. Use slicing when memory is the bottleneck:
   if complexity.sc > sc_target:
       sliced = slice_code(tree, ixs, sizes, TreeSASlicer.fast())
Next Steps
- Complexity Metrics - Understand tc, sc, rwc
- Greedy Method - Fast optimization
- TreeSA - Higher quality optimization
Complexity Metrics
omeco tracks three key metrics for tensor contractions.
The Three Metrics
Time Complexity (tc)
Definition: log₂ of total floating-point operations (FLOPs).
Interpretation:
- `tc = 20` means 2²⁰ = 1,048,576 FLOPs
- `tc = 30` means 2³⁰ ≈ 1 billion FLOPs
- Lower is better (less computation)
Example:
complexity = tree.complexity(ixs, sizes)
print(f"Time: 2^{complexity.tc:.2f} FLOPs")
print(f"Actual FLOPs: {complexity.flops():,.0f}")
Space Complexity (sc)
Definition: log₂ of peak memory usage in number of tensor elements.
Interpretation:
- `sc = 20` means 2²⁰ = 1,048,576 elements (8MB for float64, 4MB for float32)
- `sc = 30` means 2³⁰ ≈ 1 billion elements (8GB for float64, 4GB for float32)
- Lower is better (less memory)
Example:
complexity = tree.complexity(ixs, sizes)
print(f"Space: 2^{complexity.sc:.2f} elements")
print(f"Peak memory: {complexity.peak_memory():,.0f} elements")
# For float64 (8 bytes)
memory_gib = complexity.peak_memory() * 8 / 1024**3
print(f"Memory: {memory_gib:.2f} GiB")
Read-Write Complexity (rwc)
Definition: log₂ of total memory I/O operations.
Use Case: GPU optimization where memory bandwidth is the bottleneck.
Interpretation:
- On CPU: Usually ignored (rw_weight=0)
- On GPU: Critical! Compute-to-memory-bandwidth ratio is high (10-30x)
- Lower is better (less data movement)
Example:
# CPU optimization (ignore I/O)
score = ScoreFunction(tc_weight=1.0, sc_weight=1.0, rw_weight=0.0)
# GPU optimization (I/O matters! See GPU Optimization Guide)
score = ScoreFunction(tc_weight=1.0, sc_weight=1.0, rw_weight=10.0)
Why Logarithmic Scale?
Using log₂ makes the numbers manageable:
| Linear Scale | Log Scale | Human Readable |
|---|---|---|
| 1,024 | 10 | ~thousand |
| 1,048,576 | 20 | ~million |
| 1,073,741,824 | 30 | ~billion |
| 1,099,511,627,776 | 40 | ~trillion |
Advantages:
- Easy to compare: `tc=25` vs `tc=30` is a clear 32x difference
- Handles huge ranges: from thousands to quintillions
- Matches memory sizes: 2ⁿ bytes
Calculating Complexity
For a Single Contraction
Contracting A[i,j] with B[j,k] → C[i,k]:
Time: i × j × k FLOPs
Space: max(i×j, j×k, i×k) elements
For a Contraction Tree
Sum time over all contractions, take max space:
def compute_complexity(tree):
if tree.is_leaf():
return Complexity(tc=0, sc=log2(tensor_size))
# Recurse on children
left_comp = compute_complexity(tree.left)
right_comp = compute_complexity(tree.right)
# Contraction cost
contraction_tc = log2(flops_for_this_step)
contraction_sc = log2(intermediate_size)
# Combine
tc = log2sumexp2([left_comp.tc, right_comp.tc, contraction_tc])
sc = max(left_comp.sc, right_comp.sc, contraction_sc)
return Complexity(tc=tc, sc=sc)
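Here `log2sumexp2` adds quantities that are stored in log space; a minimal sketch:

import math

def log2sumexp2(xs):
    # computes log2(2^x1 + 2^x2 + ...) without overflowing for large exponents
    m = max(xs)
    return m + math.log2(sum(2 ** (x - m) for x in xs))

print(log2sumexp2([20.0, 20.0]))  # 21.0: doubling the work adds 1 to tc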
Practical Interpretation
Time Complexity
| tc | FLOPs | CPU Time (1 TFLOP/s) | GPU Time (10 TFLOP/s) |
|---|---|---|---|
| 20 | 1M | 0.001 ms | 0.0001 ms |
| 30 | 1B | 1 ms | 0.1 ms |
| 40 | 1T | 1 second | 0.1 second |
| 50 | 1P | 17 minutes | 1.7 minutes |
Space Complexity
| sc | Elements | float32 Memory | float64 Memory |
|---|---|---|---|
| 20 | 1M | 4 MB | 8 MB |
| 25 | 32M | 128 MB | 256 MB |
| 30 | 1B | 4 GB | 8 GB |
| 35 | 32B | 128 GB | 256 GB |
Optimization Goals
Minimize Time (Speed-Critical)
score = ScoreFunction(tc_weight=1.0, sc_weight=0.0, sc_target=float('inf'))
Use when:
- Memory is abundant
- Need fastest execution
- Real-time applications
Minimize Space (Memory-Limited)
score = ScoreFunction(tc_weight=0.0, sc_weight=1.0, sc_target=25.0)
Use when:
- Memory is constrained (e.g., 8GB GPU)
- Can afford longer computation
- Embedded systems
Balanced (Most Common)
score = ScoreFunction(tc_weight=1.0, sc_weight=1.0, sc_target=28.0)
Use when:
- Both time and space matter
- Setting sc_target to available memory
- General-purpose optimization
Example: Comparing Orders
# Greedy result
tree_greedy = optimize_code(ixs, out, sizes, GreedyMethod())
comp_greedy = tree_greedy.complexity(ixs, sizes)
# TreeSA result
tree_sa = optimize_code(ixs, out, sizes, TreeSA.fast())
comp_sa = tree_sa.complexity(ixs, sizes)
# Compare
print(f"Greedy - Time: 2^{comp_greedy.tc:.2f}, Space: 2^{comp_greedy.sc:.2f}")
print(f"TreeSA - Time: 2^{comp_sa.tc:.2f}, Space: 2^{comp_sa.sc:.2f}")
# Speedup
speedup = 2 ** (comp_greedy.tc - comp_sa.tc)
print(f"TreeSA is {speedup:.1f}x faster")
Next Steps
- Score Function Configuration - Optimize for your hardware
- GPU Optimization - Maximize GPU performance
- Performance Benchmarks - Real-world performance data
Algorithms
omeco provides two main optimization algorithms with different speed-quality trade-offs.
Algorithm Comparison
| Algorithm | Speed | Quality | Use Case |
|---|---|---|---|
| GreedyMethod | Fast (seconds) | Good | Quick optimization, large networks |
| TreeSA | Slower (minutes) | Better | High-quality solutions, important workloads |
Quick Guide
Use GreedyMethod when:
- You need results quickly
- Network has <100 tensors
- Greedy result is good enough
Use TreeSA when:
- You have time to optimize
- Need best possible solution
- Greedy result is too slow/large
- Working with complex tensor networks
Topics
- Greedy Method - Fast O(n² log n) optimization
- TreeSA - Simulated annealing for quality
- Algorithm Comparison - Detailed benchmarks
Next Steps
Choose an algorithm to learn more, or see the comparison for benchmarks.
Greedy Method
Fast greedy algorithm for tensor contraction order optimization.
How It Works
The greedy algorithm repeatedly contracts the pair of tensors with minimum cost until one tensor remains.
Algorithm:
while more than one tensor remains:
1. Consider all pairs of tensors
2. Compute cost of contracting each pair
3. Contract the pair with minimum cost
4. Replace the pair with their contraction result
Time Complexity: O(n² log n) where n is the number of tensors.
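A heavily simplified sketch of this loop (it ignores output indices and omeco's actual cost model, but shows the shape of the algorithm):

from math import prod

def greedy_order(ixs, sizes):
    tensors = [set(ix) for ix in ixs]
    order = []
    while len(tensors) > 1:
        # cost of a pair ~ product of the union of their index sizes
        pairs = [(a, b) for a in range(len(tensors)) for b in range(a + 1, len(tensors))]
        a, b = min(pairs, key=lambda p: prod(sizes[i] for i in tensors[p[0]] | tensors[p[1]]))
        order.append((a, b))  # positions refer to the current working list
        merged = tensors[a] ^ tensors[b]  # simplified result indices (symmetric difference)
        tensors = [t for i, t in enumerate(tensors) if i not in (a, b)] + [merged]
    return order

print(greedy_order([[0, 1], [1, 2], [2, 3]], {0: 100, 1: 10, 2: 20, 3: 5}))  # contracts B, C first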
Basic Usage
Python
from omeco import optimize_code, GreedyMethod
# Deterministic greedy (default optimizer)
tree = optimize_code(ixs, out, sizes)
# Or explicitly
tree = optimize_code(ixs, out, sizes, GreedyMethod())
Rust
use omeco::{EinCode, GreedyMethod, optimize_code};

let method = GreedyMethod::default();
let tree = optimize_code(&code, &sizes, &method)?;
Stochastic Variants
Add randomness to explore more solutions:
# alpha: controls randomness (0 = deterministic, 1 = fully random)
# temperature: softmax temperature for selection
method = GreedyMethod(alpha=0.5, temperature=1.0)
tree = optimize_code(ixs, out, sizes, method)
Parameters:
- `alpha=0.0` (default): always pick the minimum-cost pair (deterministic)
- `alpha=0.5`: a mix of greedy and random choices
- `alpha=1.0`: uniform random selection
- `temperature`: controls the selection distribution (higher = more random)
Performance Characteristics
Advantages:
- ⚡ Very fast: seconds for 100+ tensors
- 🎯 Deterministic by default (reproducible)
- 📈 Scales well to large networks
- 💡 Good baseline for most cases
Limitations:
- 🎲 Can get stuck in local optima
- 🔍 Myopic: only considers immediate cost
- 📊 May miss global optimal solution
Example: Matrix Chain
from omeco import optimize_code
# A[100×10] × B[10×20] × C[20×5]
ixs = [[0, 1], [1, 2], [2, 3]]
out = [0, 3]
sizes = {0: 100, 1: 10, 2: 20, 3: 5}
tree = optimize_code(ixs, out, sizes)
print(tree)
Output:
ab, bd -> ad
├─ tensor_0
└─ bc, cd -> bd
   ├─ tensor_1
   └─ tensor_2
This contracts B×C first (cost: 10×20×5 = 1,000), then A×(BC) (cost: 100×10×5 = 5,000).
Total: 6,000 FLOPs.
Alternative order (A×B)×C would cost: 100×10×20 + 100×20×5 = 30,000 FLOPs (5x worse!).
When to Use
✅ Use GreedyMethod when:
- You need quick results (prototyping, iteration)
- Network is straightforward (chains, grids)
- Memory/time constraints are relaxed
❌ Consider TreeSA instead when:
- Greedy result is too slow/large
- Network is complex (irregular graphs)
- You have time for better optimization
- Result will be used repeatedly
Tips
1. Start with the default: `GreedyMethod()` works for most cases.
2. Try stochastic runs for variety:
   # Run 10 times with randomness, pick the best
   best_tree = None
   best_complexity = float('inf')
   for _ in range(10):
       tree = optimize_code(ixs, out, sizes, GreedyMethod(alpha=0.3))
       complexity = tree.complexity(ixs, sizes)
       if complexity.tc < best_complexity:
           best_tree = tree
           best_complexity = complexity.tc
3. Combine with slicing if memory is tight:
   tree = optimize_code(ixs, out, sizes)
   if tree.complexity(ixs, sizes).sc > 25.0:
       sliced = slice_code(tree, ixs, sizes, TreeSASlicer.fast())
Next Steps
- TreeSA - For higher quality solutions
- Algorithm Comparison - Benchmark results
- Quick Start - Complete examples
TreeSA
Simulated annealing optimizer for high-quality tensor contraction orders.
How It Works
TreeSA uses simulated annealing to explore the space of contraction trees:
- Start with a random or greedy tree
- Repeatedly apply random mutations (subtree rotations, swaps)
- Accept better mutations, sometimes accept worse ones (to escape local optima)
- Gradually decrease “temperature” to converge
Result: Often finds significantly better orders than greedy.
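The accept/reject step is the standard Metropolis criterion. Conceptually, with β = 1/temperature as in the betas parameter described below:

import math, random

def accept(delta_score, beta):
    # always accept improvements; accept a worse tree with probability exp(-beta * delta)
    return delta_score <= 0 or random.random() < math.exp(-beta * delta_score)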
Basic Usage
Fast Preset (Recommended)
from omeco import optimize_code, TreeSA
# Quick optimization with sensible defaults
tree = optimize_code(ixs, out, sizes, TreeSA.fast())
Custom Configuration
from omeco import optimize_code, TreeSA, ScoreFunction
# More thorough search
score = ScoreFunction(sc_target=25.0)
optimizer = TreeSA(
ntrials=10, # Number of independent runs
niters=50, # Iterations per run
score=score # Custom scoring function
)
tree = optimize_code(ixs, out, sizes, optimizer)
Rust
use omeco::{TreeSA, ScoreFunction, optimize_code};

let score = ScoreFunction::default();
let optimizer = TreeSA::fast(score);
let tree = optimize_code(&code, &sizes, &optimizer)?;
Parameters
ntrials (default: 10)
Number of independent optimization runs. The best result is returned.
- Higher = better quality (more exploration)
- Lower = faster
- Typical range: 5-20
niters (default: 50)
Number of iterations per trial.
- Higher = better convergence
- Lower = faster
- Typical range: 20-100
betas (optional)
Temperature schedule for simulated annealing.
- Default: logarithmic schedule from 1 to 40
- Custom: `[β₁, β₂, ..., βₙ]` where β = 1/temperature
- Higher β = less randomness (exploitation)
- Lower β = more randomness (exploration)
score (optional)
Custom ScoreFunction for hardware-specific optimization.
# GPU optimization (see GPU Optimization Guide)
score = ScoreFunction(
tc_weight=1.0,
sc_weight=1.0,
rw_weight=10.0,
sc_target=30.0
)
optimizer = TreeSA(ntrials=5, niters=50, score=score)
Performance Characteristics
Advantages:
- 🏆 Higher quality solutions than greedy
- 🔍 Explores global search space
- ⚙️ Configurable time-quality trade-off
- 🎯 Often finds near-optimal solutions
Limitations:
- ⏱️ Slower than greedy (minutes vs seconds)
- 🎲 Non-deterministic (different runs give different results)
- 🔧 Requires parameter tuning for best results
Example Comparison
from omeco import optimize_code, GreedyMethod, TreeSA
# Same problem, different optimizers
ixs = [[0, 1, 2], [2, 3, 4], [4, 5, 6], [6, 7, 0]] # Cycle
out = [1, 3, 5, 7]
sizes = {i: 2 for i in range(8)}
# Greedy
tree_greedy = optimize_code(ixs, out, sizes, GreedyMethod())
comp_greedy = tree_greedy.complexity(ixs, sizes)
# TreeSA
tree_sa = optimize_code(ixs, out, sizes, TreeSA.fast())
comp_sa = tree_sa.complexity(ixs, sizes)
print(f"Greedy - tc: 2^{comp_greedy.tc:.2f}, sc: 2^{comp_greedy.sc:.2f}")
print(f"TreeSA - tc: 2^{comp_sa.tc:.2f}, sc: 2^{comp_sa.sc:.2f}")
# Speedup
speedup = 2 ** (comp_greedy.tc - comp_sa.tc)
print(f"TreeSA is {speedup:.1f}x faster")
Typical output:
Greedy - tc: 2^16.00, sc: 2^12.00
TreeSA - tc: 2^14.58, sc: 2^11.00
TreeSA is 2.7x faster
When to Use
✅ Use TreeSA when:
- Greedy result is too slow/uses too much memory
- Network is complex (cycles, irregular graphs)
- Result will be used many times (worth optimizing once)
- You have minutes to spare for optimization
❌ Stick with GreedyMethod when:
- Need results in seconds
- Network is simple (chains, small grids)
- Iterating/prototyping
- Greedy is already good enough
Optimization Workflow
# 1. Quick baseline with greedy
tree_greedy = optimize_code(ixs, out, sizes, GreedyMethod())
comp_greedy = tree_greedy.complexity(ixs, sizes)
print(f"Greedy: tc={comp_greedy.tc:.2f}, sc={comp_greedy.sc:.2f}")
# 2. If not satisfactory, try TreeSA
if comp_greedy.sc > 28.0: # Too much memory
score = ScoreFunction(sc_target=28.0, sc_weight=2.0)
tree_sa = optimize_code(ixs, out, sizes, TreeSA(ntrials=10, score=score))
comp_sa = tree_sa.complexity(ixs, sizes)
print(f"TreeSA: tc={comp_sa.tc:.2f}, sc={comp_sa.sc:.2f}")
Tuning Parameters
For Speed
# Fast TreeSA (2-3x slower than greedy)
TreeSA(ntrials=2, niters=20)
For Quality
# Thorough TreeSA (10-20x slower than greedy)
TreeSA(ntrials=20, niters=100)
For Memory-Constrained
# Prioritize space over time
score = ScoreFunction(tc_weight=0.1, sc_weight=10.0, sc_target=25.0)
TreeSA(ntrials=10, niters=50, score=score)
Advanced: Custom Temperature Schedule
# Custom annealing schedule
import numpy as np
# Slow cooling for thorough search
betas = np.logspace(0, 2, 50) # 1 to 100
optimizer = TreeSA(ntrials=5, betas=betas.tolist())
Next Steps
- Score Function Configuration - Optimize for your hardware
- Algorithm Comparison - Detailed benchmarks
- Slicing Strategy - Reduce memory further
Algorithm Comparison
Detailed comparison of GreedyMethod vs TreeSA.
Quick Summary
| Metric | GreedyMethod | TreeSA |
|---|---|---|
| Speed | Seconds | Minutes |
| Solution Quality | Good | Better |
| Deterministic | Yes (default) | No |
| Scalability | 100+ tensors | 50-100 tensors |
| Tuning Required | Minimal | Some |
Performance Benchmarks
Based on benchmark results from the Julia implementation (OMEinsumContractionOrders.jl):
3-Regular Graph (50 vertices, bond dim 2)
| Algorithm | Time Complexity (tc) | Space Complexity (sc) | Runtime |
|---|---|---|---|
| GreedyMethod | 2^34.1 | 2^11.4 | 0.05s |
| TreeSA (fast) | 2^33.8 | 2^10.9 | 2.1s |
| TreeSA (thorough) | 2^33.5 | 2^10.8 | 8.4s |
Result: TreeSA finds a solution that executes ~1.5x faster and uses ~1.5x less memory, at the cost of 40-170x longer optimization time.
Matrix Chain (20 matrices)
| Algorithm | Time Complexity | Space Complexity | Runtime |
|---|---|---|---|
| GreedyMethod | 2^18.2 | 2^10.5 | 0.01s |
| TreeSA | 2^17.9 | 2^10.3 | 0.3s |
Result: TreeSA's solution executes ~1.2x faster, using 30x more optimization time.
Speed vs Quality Trade-off
Quality
↑
│
TreeSA │
(thorough) │
│
TreeSA │
(fast) │
│
Greedy │
(stochastic)│
│
Greedy │
(default) │
│
└──────────────→ Speed
When Each Algorithm Shines
GreedyMethod Excels
Problem Type: Chains and trees
# Chain: A×B×C×D×E
ixs = [[0,1], [1,2], [2,3], [3,4]]
out = [0,4]
# Greedy finds optimal order in O(n²)
tree = optimize_code(ixs, out, sizes, GreedyMethod())
Characteristics:
- Sequential structure
- No cycles
- Clear optimal pairing strategy
TreeSA Excels
Problem Type: Cycles and complex graphs
# Cycle graph (much harder!)
ixs = [[0,1,2], [2,3,4], [4,5,6], [6,7,0]]
out = [1,3,5,7]
# TreeSA explores better orderings
tree = optimize_code(ixs, out, sizes, TreeSA.fast())
Characteristics:
- Irregular connectivity
- Hyperedges (indices in 3+ tensors)
- Multiple reasonable orderings
Real-World Example: Quantum Circuit
Problem: 30-qubit quantum circuit, 50 gates
# Greedy result
tree_greedy = optimize_code(circuit_ixs, out, sizes, GreedyMethod())
comp_greedy = contraction_complexity(tree_greedy, circuit_ixs, sizes)
# tc: 2^42.3, sc: 2^28.1, time: 0.2s
# TreeSA result
tree_sa = optimize_code(circuit_ixs, out, sizes, TreeSA(ntrials=10, niters=50))
comp_sa = contraction_complexity(tree_sa, circuit_ixs, sizes)
# tc: 2^40.1, sc: 2^26.8, time: 15s
# Improvement
speedup = 2 ** (comp_greedy.tc - comp_sa.tc) # 4.6x faster execution
memory_reduction = 2 ** (comp_greedy.sc - comp_sa.sc) # 2.5x less memory
Verdict: TreeSA took 75x longer to optimize but found a solution that’s 4.6x faster to execute. If you’ll run the circuit 100+ times, TreeSA pays for itself.
Optimization Time vs Problem Size
GreedyMethod Scaling
| Tensors | Optimization Time |
|---|---|
| 10 | <0.01s |
| 50 | 0.05s |
| 100 | 0.2s |
| 500 | 4s |
Complexity: O(n² log n)
TreeSA Scaling
| Tensors | TreeSA.fast() | TreeSA (default) |
|---|---|---|
| 10 | 0.1s | 0.5s |
| 50 | 2s | 10s |
| 100 | 15s | 60s |
| 200 | 90s | 300s |
Complexity: O(ntrials × niters × n²)
Decision Guide
Start
│
├─ Need results in <1 second? ──→ GreedyMethod
│
├─ Problem is a chain/tree? ──→ GreedyMethod
│
├─ Greedy result good enough? ──→ GreedyMethod
│
├─ Have >1 minute to optimize? ──→ TreeSA
│
├─ Complex graph structure? ──→ TreeSA
│
└─ Need best possible solution? ──→ TreeSA (thorough)
Hybrid Approach
Use both in sequence:
# 1. Quick baseline with greedy
tree_greedy = optimize_code(ixs, out, sizes, GreedyMethod())
comp_greedy = tree_greedy.complexity(ixs, sizes)
print(f"Greedy baseline: tc={comp_greedy.tc:.2f}")
# 2. If not good enough, refine with TreeSA
if comp_greedy.tc > 35.0: # Too slow
print("Refining with TreeSA...")
tree_sa = optimize_code(ixs, out, sizes, TreeSA.fast())
comp_sa = tree_sa.complexity(ixs, sizes)
print(f"TreeSA result: tc={comp_sa.tc:.2f}")
improvement = 2 ** (comp_greedy.tc - comp_sa.tc)
print(f"Improvement: {improvement:.1f}x speedup")
Summary Recommendations
| Scenario | Algorithm | Configuration |
|---|---|---|
| Quick prototyping | GreedyMethod | Default |
| Production (simple) | GreedyMethod | Default |
| Production (complex) | TreeSA | TreeSA.fast() |
| Critical optimization | TreeSA | ntrials=20, niters=100 |
| Memory-constrained | TreeSA + Slicing | Custom ScoreFunction |
Next Steps
- Greedy Method - Details on greedy algorithm
- TreeSA - Details on simulated annealing
- Score Function - Tune for your hardware
Score Function Configuration
Configure omeco’s optimization for different hardware and use cases.
What is the Score Function?
The ScoreFunction controls how the optimizer balances three objectives:
- tc (time complexity): Total FLOPs
- sc (space complexity): Peak memory usage
- rwc (read-write complexity): Memory I/O operations
Formula:
score = tc_weight × 2^tc + rw_weight × 2^rwc + sc_weight × max(0, 2^sc − 2^sc_target)
Lower score is better; the optimizer tries to minimize this score.
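Written out as plain Python (mirroring the documented formula and the documented defaults):

def score(tc, sc, rwc, tc_weight=1.0, sc_weight=1.0, rw_weight=0.0, sc_target=20.0):
    # defaults mirror ScoreFunction's documented defaults
    return (tc_weight * 2 ** tc
            + rw_weight * 2 ** rwc
            + sc_weight * max(0.0, 2 ** sc - 2 ** sc_target))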
Basic Usage
Default (Balanced)
from omeco import ScoreFunction
# Balanced optimization (works for most cases)
score = ScoreFunction()
# Equivalent to:
# ScoreFunction(tc_weight=1.0, sc_weight=1.0, rw_weight=0.0, sc_target=20.0)
Custom Weights
# Prioritize memory over time
score = ScoreFunction(
tc_weight=1.0, # Time matters
sc_weight=2.0, # Memory matters 2x more
rw_weight=0.0, # Ignore I/O (CPU)
sc_target=25.0 # Target 2^25 elements (~256MB for float64)
)
Using with Optimizers
from omeco import TreeSA, optimize_code
score = ScoreFunction(sc_target=28.0)
tree = optimize_code(ixs, out, sizes, TreeSA(score=score))
Hardware-Specific Configuration
CPU Optimization
Characteristics:
- Memory bandwidth is not the bottleneck
- Balance time and space
- Ignore read-write complexity
score = ScoreFunction(
tc_weight=1.0,
sc_weight=1.0,
rw_weight=0.0, # ← CPU: I/O is not a bottleneck
sc_target=28.0 # ~2GB (2^28 × 8 bytes for float64)
)
Calculate sc_target:
import math
# For 16GB RAM, reserve half for tensors
available_gb = 8
bytes_per_element = 8 # float64
sc_target = math.log2(available_gb * 1024**3 / bytes_per_element)
# sc_target ≈ 30.0 (8GB)
GPU Optimization
Note: GPU memory bandwidth is often the bottleneck, not compute.
score = ScoreFunction(
tc_weight=1.0,
sc_weight=1.0,
rw_weight=10.0, # See GPU Optimization Guide for tuning
sc_target=30.0 # ~4GB of float32 tensors (2^30 × 4 bytes), conservative for an 8GB GPU
)
For GPU optimization: See GPU Optimization Guide for rw_weight tuning methodology.
Calculate sc_target for GPU:
import math
# NVIDIA RTX 3090: 24GB VRAM
gpu_gb = 24
bytes_per_element = 4 # float32 (most common on GPU)
sc_target = math.log2(gpu_gb * 1024**3 / bytes_per_element)
# sc_target ≈ 32.5
# Be conservative (leave room for framework overhead)
sc_target = 32.0 # ~16GB usable
Decision Guide
Choose based on your bottleneck:
- Default (CPU, balanced): use `ScoreFunction()` with defaults
  - Works for most CPU scenarios
  - Balances time and memory
- GPU: see the GPU Optimization Guide for the complete rw_weight tuning methodology
- Memory constrained: increase `sc_weight` and lower `sc_target`
  - Example: `ScoreFunction(sc_weight=3.0, sc_target=25.0)`
  - Accepts slower execution to fit in memory
- Need slicing: see the Slicing Strategy Guide
  - Use when the contraction exceeds available memory
  - Trades time for space
Advanced Configuration
Memory-Constrained Optimization
When memory is the primary constraint:
# Heavily penalize exceeding target
score = ScoreFunction(
tc_weight=0.1, # Time is secondary
sc_weight=10.0, # Memory is critical
rw_weight=0.0,
sc_target=25.0 # Hard limit: 256MB
)
# Use with TreeSA for best results
optimizer = TreeSA(ntrials=10, niters=50, score=score)
tree = optimize_code(ixs, out, sizes, optimizer)
Speed-Critical Optimization
When execution speed is paramount:
# Minimize time complexity only
score = ScoreFunction(
tc_weight=1.0,
sc_weight=0.0, # Ignore memory
rw_weight=0.0,
sc_target=float('inf') # No memory limit
)
Hybrid CPU+GPU
For heterogeneous systems:
# Moderate I/O penalty (between CPU and GPU)
score = ScoreFunction(
tc_weight=1.0,
sc_weight=1.0,
rw_weight=10.0, # Moderate I/O cost
sc_target=29.0 # 4GB limit
)
sc_target Reference
| Memory | float32 | float64 | sc_target |
|---|---|---|---|
| 256 MB | 64M elements | 32M elements | 25.0-26.0 |
| 1 GB | 256M | 128M | 27.0-28.0 |
| 4 GB | 1B | 512M | 29.0-30.0 |
| 8 GB | 2B | 1B | 30.0-31.0 |
| 16 GB | 4B | 2B | 31.0-32.0 |
| 32 GB | 8B | 4B | 32.0-33.0 |
Tuning Workflow
1. Start with defaults:
   tree = optimize_code(ixs, out, sizes)
   complexity = tree.complexity(ixs, sizes)
   print(f"tc: {complexity.tc:.2f}, sc: {complexity.sc:.2f}")
2. Identify the bottleneck:
   - If `sc` is too high → increase `sc_weight` or lower `sc_target`
   - If `tc` is too high → try TreeSA with the default score
   - If running on GPU → see the GPU Optimization Guide
3. Adjust and re-optimize:
   score = ScoreFunction(sc_weight=2.0, sc_target=28.0)
   tree = optimize_code(ixs, out, sizes, TreeSA(score=score))
4. Verify the improvement:
   new_complexity = tree.complexity(ixs, sizes)
   print(f"New tc: {new_complexity.tc:.2f}, sc: {new_complexity.sc:.2f}")
Examples
See examples/score_function_examples.py for complete examples:
- CPU optimization
- GPU optimization (experimental rw_weight tuning)
- Memory-limited environments
- Dynamic sc_target calculation
Next Steps
- GPU Optimization - Maximize GPU performance
- Slicing Strategy - Reduce memory with slicing
- Complexity Metrics - Understand tc, sc, rwc
GPU Optimization
Maximize performance when running tensor contractions on GPUs.
The GPU Bottleneck
GPUs have 10-100x higher compute-to-memory-bandwidth ratio than CPUs. Memory I/O is the bottleneck.
Configuration
from omeco import ScoreFunction, TreeSA, optimize_code
# GPU optimization
score = ScoreFunction(
tc_weight=1.0,
sc_weight=1.0,
rw_weight=10.0, # Tune based on profiling (see below)
sc_target=30.0 # Your GPU memory limit
)
tree = optimize_code(ixs, out, sizes, TreeSA(score=score))
Tuning rw_weight:
There is no established best value; tune empirically:
- Start with `rw_weight=1.0`
- Try `10.0`, then `20.0` or higher
- Profile actual GPU execution time
- Adjust based on the results
Range: 0.1 to >20 depending on GPU model, tensor sizes, and precision.
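A simple sweep makes the tuning loop concrete; here profile_gpu_run stands in for your own timing harness (a hypothetical helper), and ixs/out/sizes are your network definition:

from omeco import ScoreFunction, TreeSA, optimize_code

for rw in [0.1, 1.0, 10.0, 20.0]:
    score = ScoreFunction(tc_weight=1.0, sc_weight=1.0, rw_weight=rw, sc_target=32.0)
    tree = optimize_code(ixs, out, sizes, TreeSA(score=score))
    print(rw, profile_gpu_run(tree))  # keep the rw_weight with the fastest measured run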
Calculate sc_target for Your GPU
import math
# NVIDIA A100: 40GB
gpu_gb = 40
bytes_per_element = 4 # float32
# Maximum usable memory (leave ~20% for overhead)
usable_gb = gpu_gb * 0.8
sc_target = math.log2(usable_gb * 1024**3 / bytes_per_element)
# sc_target ≈ 32.0
score = ScoreFunction(
tc_weight=1.0,
sc_weight=1.0,
rw_weight=10.0,
sc_target=sc_target
)
GPU Memory Reference
| GPU | VRAM | sc_target (float32) |
|---|---|---|
| RTX 3060 | 12GB | 31.0 |
| RTX 3090 | 24GB | 32.2 |
| A100 (40GB) | 40GB | 32.9 |
| A100 (80GB) | 80GB | 33.9 |
| H100 | 80GB | 33.9 |
When It Matters
- High impact: complex networks (quantum circuits, GNNs, irregular contractions)
- Low impact: simple chains, matrix multiplications, small networks (<10 tensors)
Common Mistakes
- ❌ Using `rw_weight=0.0` (the CPU default) on GPU
- ❌ Wrong `sc_target` (e.g., using CPU memory size for a GPU)
- ❌ Using GreedyMethod instead of TreeSA (greedy doesn't optimize for I/O)
Checklist
- Calculate `sc_target` from GPU memory (float32: 4 bytes/element, reserve 20% for overhead)
- Set `rw_weight` (start with 1.0, try 10.0 and 20.0, profile)
- Use the TreeSA optimizer (not greedy)
Next Steps
- Score Function Configuration - Full parameter guide
- Slicing Strategy - Reduce memory further
- PyTorch Integration - Use with PyTorch
Slicing Strategy
Reduce memory usage by slicing tensor indices, trading time for space.
What is Slicing?
Slicing breaks a tensor contraction into smaller chunks by looping over one or more indices.
Example:
# Original: C[i,k] = Σⱼ A[i,j] * B[j,k]
# Too large: j has size 10000
# Sliced: loop over j, one value at a time
import numpy as np
C = np.zeros((i, k))
for j_val in range(10000):
    C += np.outer(A[:, j_val], B[j_val, :])  # accumulate one rank-1 slice
Trade-off:
- ✅ Memory reduced: Only store one slice at a time
- ❌ Time increased: Repeat computation for each slice
When to Use Slicing
Use slicing when:
- Contraction exceeds available memory
sc(space complexity) is too high- Can’t reduce memory by changing contraction order alone
Decision flowchart:
Optimize contraction order
↓
Check space complexity
↓
sc > sc_target?
├─ No → Done
└─ Yes → Apply slicing
Basic Usage
Automatic Slicing
from omeco import optimize_code, slice_code, TreeSASlicer, ScoreFunction
# 1. Optimize contraction order
tree = optimize_code(ixs, out, sizes)
# 2. Check if memory is too large
complexity = tree.complexity(ixs, sizes)
if complexity.sc > 28.0: # More than 2^28 elements (~2GB for float64)
# 3. Slice to reduce memory
score = ScoreFunction(sc_target=28.0)
slicer = TreeSASlicer.fast(score=score)
sliced = slice_code(tree, ixs, sizes, slicer)
# 4. Verify reduction
sliced_comp = sliced.complexity(ixs, sizes)
print(f"Original: 2^{complexity.sc:.2f} elements")
print(f"Sliced: 2^{sliced_comp.sc:.2f} elements")
print(f"Sliced indices: {sliced.slicing()}")
Manual Slicing
Specify which indices to slice:
# Force slicing on specific indices
slicer = TreeSASlicer.fast(fixed_slices=[1, 2])
sliced = slice_code(tree, ixs, sizes, slicer)
# Verify that indices 1 and 2 are sliced
assert 1 in sliced.slicing()
assert 2 in sliced.slicing()
TreeSASlicer Configuration
Fast Preset
# Quick slicing optimization
slicer = TreeSASlicer.fast(score=ScoreFunction(sc_target=25.0))
sliced = slice_code(tree, ixs, sizes, slicer)
Custom Configuration
from omeco import TreeSASlicer, ScoreFunction
score = ScoreFunction(
tc_weight=0.1, # Accept higher time
sc_weight=10.0, # Prioritize space
sc_target=25.0 # Target 256MB
)
slicer = TreeSASlicer(
ntrials=10, # Number of optimization runs
niters=50, # Iterations per run
fixed_slices=[1], # Force slicing on index 1
score=score
)
sliced = slice_code(tree, ixs, sizes, slicer)
Understanding the Trade-off
Example: Large Matrix Multiplication
# A[10000 × 10000] × B[10000 × 10000]
ixs = [[0, 1], [1, 2]]
out = [0, 2]
sizes = {0: 10000, 1: 10000, 2: 10000}
# Without slicing
tree = optimize_code(ixs, out, sizes)
comp = tree.complexity(ixs, sizes)
# tc ≈ 39.9 (10^12 FLOPs)
# sc ≈ 26.6 (100M elements = 800MB for float64)
# With slicing on index 1 (middle dimension)
slicer = TreeSASlicer.fast(fixed_slices=[1])
sliced = slice_code(tree, ixs, sizes, slicer)
sliced_comp = sliced.complexity(ixs, sizes)
# tc ≈ 39.9 (same total FLOPs, distributed over slices)
# sc ≈ 23.3 (10M elements = 80MB for float64)
# Result: 10x memory reduction, same total time (but distributed)
Choosing Which Indices to Slice
Automatic Selection
TreeSASlicer automatically chooses which indices to slice:
# Let the optimizer decide
slicer = TreeSASlicer.fast(score=ScoreFunction(sc_target=25.0))
sliced = slice_code(tree, ixs, sizes, slicer)
print(f"Automatically sliced indices: {sliced.slicing()}")
Algorithm: Tries slicing different indices and picks the combination that best satisfies sc_target while minimizing tc increase.
Manual Selection Guidelines
Good candidates for slicing:
- Indices with large dimensions
- Indices that appear in many tensors (high degree)
- Indices not in the output (no outer loop needed)
Poor candidates:
- Output indices (can’t slice without changing result structure)
- Small dimensions (little memory benefit)
- Indices that would cause many small intermediate tensors
Advanced: Slicing with GPU
For GPU workloads, consider both memory and I/O:
# GPU slicing: balance memory and I/O
score = ScoreFunction(
tc_weight=1.0,
sc_weight=2.0, # Space is critical on GPU
rw_weight=0.1, # Experimental: tune empirically
sc_target=30.0 # 8GB GPU limit
)
slicer = TreeSASlicer(
ntrials=10,
niters=50,
score=score
)
sliced = slice_code(tree, ixs, sizes, slicer)
Note: Slicing may increase rwc (read-write complexity) because data is loaded multiple times. The optimizer balances this with the rw_weight parameter.
Example Workflow
from omeco import optimize_code, slice_code
from omeco import TreeSA, TreeSASlicer, ScoreFunction
# Problem setup
ixs = [[0,1,2], [2,3,4], [4,5,6], [6,7,0]] # Cycle graph
out = [1,3,5,7]
sizes = {i: 32 for i in range(8)}
# Step 1: Optimize contraction order
tree = optimize_code(ixs, out, sizes, TreeSA.fast())
comp = tree.complexity(ixs, sizes)
print(f"Optimized: sc = 2^{comp.sc:.2f}")
# Step 2: Check if slicing needed
if comp.sc > 20.0: # More than 2^20 elements (~8MB for float64)
print("Applying slicing...")
# Step 3: Slice
slicer = TreeSASlicer.fast(score=ScoreFunction(sc_target=18.0))
sliced = slice_code(tree, ixs, sizes, slicer)
# Step 4: Verify
sliced_comp = sliced.complexity(ixs, sizes)
print(f"Sliced: sc = 2^{sliced_comp.sc:.2f}")
print(f"Sliced indices: {sliced.slicing()}")
# Calculate overhead
time_overhead = 2 ** (sliced_comp.tc - comp.tc)
memory_reduction = 2 ** (comp.sc - sliced_comp.sc)
print(f"Time overhead: {time_overhead:.1f}x")
print(f"Memory reduction: {memory_reduction:.1f}x")
Multiple Slices
Slice multiple indices for extreme memory reduction:
# Very tight memory constraint
score = ScoreFunction(
tc_weight=0.01, # Accept much higher time
sc_weight=100.0, # Memory is critical
sc_target=15.0 # Only 2^15 ≈ 32K elements (~256KB for float64) available!
)
slicer = TreeSASlicer(ntrials=20, niters=100, score=score)
sliced = slice_code(tree, ixs, sizes, slicer)
# May slice multiple indices
print(f"Sliced {len(sliced.slicing())} indices: {sliced.slicing()}")
Performance Impact
| Scenario | Memory Reduction | Time Overhead | When to Use |
|---|---|---|---|
| Slice 1 index | 2-10x | 1-2x | Moderate memory constraint |
| Slice 2 indices | 10-100x | 2-5x | Severe memory constraint |
| Slice 3+ indices | 100-1000x | 5-20x | Extreme memory constraint |
Tips
1. Optimize the order first, slice second:
   # Good workflow
   tree = optimize_code(ixs, out, sizes, TreeSA.fast())
   sliced = slice_code(tree, ixs, sizes, TreeSASlicer.fast())
2. Set a realistic sc_target:
   # Target 80% of available memory (float64)
   available_gb = 8
   sc_target = math.log2(available_gb * 0.8 * 1024**3 / 8)
3. Profile actual memory usage: theoretical and actual memory may differ, so always test with real data.
4. Consider the time-memory trade-off: for a one-time computation, high memory may be acceptable; for repeated computation, slice to fit memory.
Next Steps
- Score Function Configuration - Tune slicing parameters
- GPU Optimization - GPU-specific slicing
- Complexity Metrics - Understand sc and tc
PyTorch Integration
Use omeco to optimize tensor contractions in PyTorch.
Overview
omeco optimizes the contraction order, then you execute with PyTorch tensors.
Workflow:
- Define network structure (indices and sizes)
- Optimize with omeco → get contraction tree
- Execute tree with PyTorch tensors
Basic Example
import torch
from omeco import optimize_code
# Define tensor network structure
ixs = [[0, 1], [1, 2], [2, 3]] # Matrix chain
out = [0, 3]
sizes = {0: 100, 1: 200, 2: 50, 3: 100}
# Optimize contraction order
tree = optimize_code(ixs, out, sizes)
print(f"Optimized tree:\n{tree}")
# Create PyTorch tensors
tensors = [
torch.randn(100, 200), # A[0,1]
torch.randn(200, 50), # B[1,2]
torch.randn(50, 100), # C[2,3]
]
# Execute contraction (see full example below)
result = execute_tree(tree.to_dict(), tensors)
Complete Example
See examples/pytorch_tensor_network_example.py for a full working example with:
- Tree traversal and execution
- Verification against direct einsum
- 4×4 grid tensor network
- Memory reduction via slicing
Converting Tree to Dict
tree_dict = tree.to_dict()
# Dictionary structure:
# Leaf: {"tensor_index": int}
# Node: {"args": [children], "eins": {"ixs": [[int]], "iy": [int]}}
Executing the Tree
def execute_tree(tree_dict, tensors):
"""Recursively execute contraction tree."""
if "tensor_index" in tree_dict:
# Leaf node - return tensor
return tensors[tree_dict["tensor_index"]]
# Internal node - execute children then contract
args = [execute_tree(child, tensors) for child in tree_dict["args"]]
eins = tree_dict["eins"]
# Build einsum string
ixs_str = [indices_to_str(ix) for ix in eins["ixs"]]
iy_str = indices_to_str(eins["iy"])
einsum_eq = f"{','.join(ixs_str)}->{iy_str}"
# Execute with PyTorch
return torch.einsum(einsum_eq, *args)
def indices_to_str(indices):
"""Convert integer indices to letters."""
return ''.join(chr(ord('a') + i) for i in indices)
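A quick correctness check against a direct einsum, using the tensors and tree defined above:

ref = torch.einsum("ab,bc,cd->ad", *tensors)
out = execute_tree(tree.to_dict(), tensors)
print(torch.allclose(ref, out))  # True (up to floating-point tolerance)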
GPU Execution
# Move tensors to GPU
tensors_gpu = [t.cuda() for t in tensors]
# Execute on GPU
result = execute_tree(tree.to_dict(), tensors_gpu)
With Autograd
# Tensors with gradients
tensors = [torch.randn(100, 200, requires_grad=True) for _ in range(3)]
# Execute
result = execute_tree(tree.to_dict(), tensors)
# Backward pass
result.sum().backward()
# Gradients available
print(tensors[0].grad)
Next Steps
- Score Function - Optimize for your hardware
- GPU Optimization - Maximize GPU performance
- Examples - Complete code
Performance Benchmarks
Real-world performance comparison with Julia implementation.
Benchmark Methodology
All benchmarks compare against OMEinsumContractionOrders.jl, the original Julia implementation.
Test Environment:
- Benchmarks run via Python bindings (PyO3)
- Configuration:
ntrials=1,niters=50-100 - See
benchmarks/directory for scripts
Rust vs Julia (TreeSA)
Comparison of TreeSA optimizer performance:
| Problem | Tensors | Indices | Rust (ms) | Julia (ms) | tc (Rust) | tc (Julia) | Speedup |
|---|---|---|---|---|---|---|---|
| chain_10 | 10 | 11 | 18.5 | 31.6 | 23.10 | 23.10 | 1.7x |
| grid_4x4 | 16 | 24 | 88.0 | 150.7 | 9.18 | 9.26 | 1.7x |
| grid_5x5 | 25 | 40 | 155.4 | 297.7 | 10.96 | 10.96 | 1.9x |
| reg3_250 | 250 | 372 | 2,435 | 5,099 | 48.00 | 47.17 | 2.1x |
Results:
- Rust is 1.7-2.1x faster than Julia for TreeSA optimization
- Both implementations find solutions with comparable time complexity (tc)
- Solution quality is nearly identical between implementations
Greedy Method Performance
GreedyMethod is extremely fast for quick optimization:
| Problem | Tensors | Time (ms) | tc | sc |
|---|---|---|---|---|
| chain_10 | 10 | 0.04 | 23.10 | 13.29 |
| grid_4x4 | 16 | 0.12 | 9.54 | 5.0 |
| grid_5x5 | 25 | 0.23 | 11.28 | 6.0 |
| reg3_250 | 250 | 7.5 | 69.00 | 47.0 |
Key Observations:
- Greedy is 100-300x faster than TreeSA
- Greedy gives good results for small problems (chain_10, grids)
- For large problems (reg3_250), TreeSA finds much better solutions:
- Greedy: tc = 69.00
- TreeSA: tc = 48.00
- Improvement: 2^(69-48) = 2^21 = 2 million times faster execution!
Python Overhead
Python bindings via PyO3 add minimal overhead:
TreeSA (grid_4x4, 100 iterations):
- Pure Rust backend: ~88ms
- Python call overhead: <1ms (~1%)
Greedy (reg3_250):
- Pure optimization: ~7.5ms
- Python overhead: <0.5ms (~6%)
Conclusion: Python overhead is negligible, especially for TreeSA optimization.
Algorithm Comparison
When to use each algorithm:
| Scenario | Recommended | Reason |
|---|---|---|
| Quick prototyping | Greedy | Sub-millisecond optimization |
| Production (< 50 tensors) | Greedy | Fast enough, good quality |
| Production (50-200 tensors) | TreeSA.fast() | Better quality, still reasonable time |
| Large networks (> 200 tensors) | TreeSA | Significant quality improvement |
| Time-critical | Greedy | Predictable fast performance |
| Execution-critical | TreeSA | Better contraction order = faster execution |
Execution Time Savings
The time spent optimizing pays off during execution:
Example: reg3_250 network
- Greedy optimization: 7.5ms
- TreeSA optimization: 2,435ms (~2.4 seconds)
- Extra optimization time: +2.4 seconds
But the execution improvement:
- Greedy tc: 69.00 → ~2^69 FLOPs
- TreeSA tc: 48.00 → ~2^48 FLOPs
- Execution speedup: 2^21 = 2 million times faster
If executing even once, TreeSA optimization is worth it!
Hardware Recommendations
| Tensor Count | RAM | CPU | Recommendation |
|---|---|---|---|
| < 20 | 4GB | Any | Greedy is sufficient |
| 20-100 | 8GB | 4+ cores | TreeSA.fast() for production |
| 100-500 | 16GB | 8+ cores | TreeSA with multiple trials |
| > 500 | 32GB+ | 16+ cores | TreeSA (may take minutes) |
Running Benchmarks
Reproduce the benchmarks yourself:
cd benchmarks
# Python (Rust via PyO3)
python benchmark_python.py
# Julia (original implementation)
julia --project=. benchmark_julia.jl
# Compare results
python -c "
import json
with open('results_rust_treesa.json') as f:
rust = json.load(f)
with open('results_julia_treesa.json') as f:
julia = json.load(f)
for problem in rust['results']:
r = rust['results'][problem]['avg_ms']
j = julia['results'][problem]['avg_ms']
print(f'{problem}: Rust {r:.1f}ms, Julia {j:.1f}ms, Speedup {j/r:.2f}x')
"
Profiling Your Code
Python Profiling
import time
from omeco import optimize_code, TreeSA
# Time optimization
start = time.time()
tree = optimize_code(ixs, out, sizes, TreeSA.fast())
opt_time = time.time() - start
# Check complexity
comp = tree.complexity(ixs, sizes)
print(f"Optimization: {opt_time:.3f}s")
print(f"Time complexity: 2^{comp.tc:.2f} = {2**comp.tc:.2e} FLOPs")
print(f"Space complexity: 2^{comp.sc:.2f} elements")
# Estimate execution time (assuming 10 GFLOP/s CPU)
execution_seconds = 2**comp.tc / 1e10
print(f"Estimated execution: {execution_seconds:.1f}s @ 10 GFLOP/s")
Rust Profiling
# CPU profiling with flamegraph
cargo install flamegraph
cargo flamegraph --example your_example
# Criterion benchmarks (if available)
cargo bench
Performance Tips
- Start with Greedy: Always try Greedy first to get a baseline
- Use TreeSA for production: The extra optimization time pays off
- Increase ntrials for critical code: More trials = better solutions
- Profile execution time: Verify that better tc actually improves runtime
- Consider memory: Use sc_target if memory is constrained
Next Steps
- Algorithm Comparison - Detailed algorithm trade-offs
- Quick Start - Get started optimizing
- Troubleshooting - Common performance issues
Troubleshooting
Common issues and solutions when using omeco.
Installation Issues
Python: pip install fails
Problem: pip install omeco fails with compilation errors.
Solutions:
1. Ensure the Rust toolchain is installed:
   curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
   rustup update
2. Update pip and setuptools:
   pip install --upgrade pip setuptools wheel
3. Install from source:
   git clone https://github.com/GiggleLiu/omeco
   cd omeco/omeco-python
   pip install maturin
   maturin develop --release
Rust: Dependency resolution errors
Problem: Cargo fails to resolve dependencies.
Solution: Update Cargo.lock and rebuild:
cargo clean
cargo update
cargo build --release
Runtime Errors
ImportError: No module named ‘omeco’
Problem: Python cannot find the omeco module after installation.
Diagnosis:
import sys
print(sys.path) # Check if package is in path
Solutions:
1. Verify the installation:
   pip show omeco
2. Check the virtual environment:
   which python  # Ensure you're using the right Python
3. Reinstall in development mode:
   cd omeco-python
   maturin develop --release
Memory Errors (OOM)
Problem: Program crashes with “Out of Memory” error during optimization.
Cause: Space complexity (sc) exceeds available memory.
Solutions:
1. Check the space complexity:
   from omeco import optimize_code
   tree = optimize_code(ixs, out, sizes)
   comp = tree.complexity(ixs, sizes)
   print(f"Space complexity: 2^{comp.sc:.2f} elements")
   # Calculate memory usage
   bytes_per_element = 8  # float64
   memory_gib = (2 ** comp.sc) * bytes_per_element / 1024**3
   print(f"Estimated memory: {memory_gib:.2f} GiB")
2. Use slicing to reduce memory:
   from omeco import slice_code, TreeSASlicer, ScoreFunction
   import math
   # Set the target memory (e.g., 8GB = sc_target ≈ 30.0 for float64)
   available_gb = 8
   sc_target = math.log2(available_gb * 1024**3 / 8)
   score = ScoreFunction(sc_target=sc_target, sc_weight=2.0)
   slicer = TreeSASlicer.fast(score=score)
   sliced = slice_code(tree, ixs, sizes, slicer)
3. Optimize with memory constraints from the start:
   from omeco import TreeSA, ScoreFunction
   score = ScoreFunction(sc_target=28.0, sc_weight=3.0)
   tree = optimize_code(ixs, out, sizes, TreeSA(score=score))
TypeError: Invalid index type
Problem: Indices must be integers, but got strings or floats.
Cause: Einstein indices in ixs and out must be integers (or strings in some contexts).
Solution: Convert indices to integers:
# Wrong
ixs = [["a", "b"], ["b", "c"]]
# Correct
ixs = [[0, 1], [1, 2]]
out = [0, 2]
sizes = {0: 10, 1: 20, 2: 10}
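If your network is naturally labeled with strings, a small relabeling pass (a sketch, not part of the library) produces the integer form the Python bindings expect:
# Map arbitrary hashable labels to consecutive integers
labels = {}
def to_int(label):
    return labels.setdefault(label, len(labels))

ixs_str = [["a", "b"], ["b", "c"]]
out_str = ["a", "c"]
sizes_str = {"a": 10, "b": 20, "c": 10}

ixs = [[to_int(l) for l in ix] for ix in ixs_str]
out = [to_int(l) for l in out_str]
sizes = {to_int(l): d for l, d in sizes_str.items()}
# ixs == [[0, 1], [1, 2]], out == [0, 2], sizes == {0: 10, 1: 20, 2: 10}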
Performance Issues
Optimization is too slow
Problem: TreeSA takes too long to find a good contraction order.
Solutions:
1. Use the faster preset:
   from omeco import TreeSA
   # Instead of the default TreeSA()
   tree = optimize_code(ixs, out, sizes, TreeSA.fast())
2. Try GreedyMethod first:
   from omeco import optimize_code, GreedyMethod
   # Much faster, good baseline
   tree = optimize_code(ixs, out, sizes)
3. Reduce TreeSA iterations:
   from omeco import TreeSA
   optimizer = TreeSA(ntrials=5, niters=20)  # Faster but lower quality
   tree = optimize_code(ixs, out, sizes, optimizer)
GPU performance is poor
Problem: Execution on GPU is slower than expected.
Diagnosis: Check if optimization used correct GPU settings:
comp = tree.complexity(ixs, sizes)
print(f"Read-write complexity: 2^{comp.rwc:.2f}")
If rwc is high, the contraction does excessive memory I/O.
Solution: Re-optimize with GPU-specific score function:
from omeco import ScoreFunction, TreeSA
# GPU optimization (see GPU Optimization Guide for rw_weight tuning)
score = ScoreFunction(
tc_weight=1.0,
sc_weight=1.0,
rw_weight=10.0,
sc_target=30.0
)
tree = optimize_code(ixs, out, sizes, TreeSA(score=score))
See GPU Optimization Guide for details.
Results differ from expected
Problem: Optimized contraction order gives different complexity than expected.
Diagnosis:
1. Print the contraction tree:
   tree = optimize_code(ixs, out, sizes)
   print(tree)  # Shows ASCII tree structure
2. Check complexity metrics:
   comp = tree.complexity(ixs, sizes)
   print(f"tc: {comp.tc:.2f}, sc: {comp.sc:.2f}, rwc: {comp.rwc:.2f}")
3. Verify the input:
   print(f"ixs: {ixs}")
   print(f"out: {out}")
   print(f"sizes: {sizes}")
   # Check all indices are covered
   all_indices = set()
   for ix in ixs:
       all_indices.update(ix)
   print(f"All indices: {all_indices}")
   print(f"Missing sizes: {all_indices - set(sizes.keys())}")
Common causes (the helper below checks the first two):
- Missing indices in the `sizes` dictionary
- Wrong output indices in `out`
- Mismatched dimensions between `ixs` and `sizes`
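A small validation helper (hypothetical, not part of omeco) catches the first two causes before you optimize:
def validate_inputs(ixs, out, sizes):
    # Collect every index that appears in any tensor
    all_indices = set()
    for ix in ixs:
        all_indices.update(ix)
    missing = all_indices - set(sizes)
    assert not missing, f"Missing sizes for indices: {missing}"
    stray = set(out) - all_indices
    assert not stray, f"Output indices not in any tensor: {stray}"

validate_inputs([[0, 1], [1, 2]], [0, 2], {0: 10, 1: 20, 2: 10})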
Algorithm-Specific Issues
GreedyMethod gives poor results
Problem: Greedy optimization produces sub-optimal contraction order.
Expected behavior: Greedy is fast but not optimal; its locally optimal choices can miss a globally better order.
Solution: Use TreeSA for better quality:
from omeco import TreeSA
tree = optimize_code(ixs, out, sizes, TreeSA.fast())
Comparison:
from omeco import optimize_code, GreedyMethod, TreeSA
# Greedy (fast, lower quality)
greedy_tree = optimize_code(ixs, out, sizes)
greedy_comp = greedy_tree.complexity(ixs, sizes)
# TreeSA (slower, higher quality)
sa_tree = optimize_code(ixs, out, sizes, TreeSA.fast())
sa_comp = sa_tree.complexity(ixs, sizes)
print(f"Greedy tc: {greedy_comp.tc:.2f}")
print(f"TreeSA tc: {sa_comp.tc:.2f}")
print(f"Improvement: {2**(greedy_comp.tc - sa_comp.tc):.1f}x faster")
TreeSA gets stuck in local minimum
Problem: Multiple runs of TreeSA give inconsistent results.
Cause: Simulated annealing is stochastic.
Solutions:
1. Increase trials:
   optimizer = TreeSA(ntrials=20, niters=50)  # More exploration
   tree = optimize_code(ixs, out, sizes, optimizer)
2. Use best-of-many:
   from omeco import TreeSA
   best_tree = None
   best_tc = float('inf')
   for _ in range(10):
       tree = optimize_code(ixs, out, sizes, TreeSA.fast())
       comp = tree.complexity(ixs, sizes)
       if comp.tc < best_tc:
           best_tc = comp.tc
           best_tree = tree
   print(f"Best tc: {best_tc:.2f}")
Integration Issues
PyTorch: Contraction gives wrong results
Problem: Executing the contraction tree with PyTorch tensors produces incorrect output.
Diagnosis:
1. Verify against a direct einsum:
   import torch
   from omeco import optimize_code
   # Direct einsum (slow but correct)
   result_ref = torch.einsum("ab,bc,cd->ad", A, B, C)
   # Optimized tree
   tree = optimize_code([[0, 1], [1, 2], [2, 3]], [0, 3], {0: 10, 1: 20, 2: 30, 3: 10})
   result_tree = execute_tree(tree.to_dict(), [A, B, C])
   # Compare
   print(f"Match: {torch.allclose(result_ref, result_tree)}")
   print(f"Max error: {(result_ref - result_tree).abs().max()}")
2. Check tensor shapes:
   for i, t in enumerate([A, B, C]):
       print(f"Tensor {i}: {t.shape}")
Common causes:
- Mismatched tensor shapes and `sizes` dictionary
- Wrong index mapping in `execute_tree`
- Incorrect einsum string construction
See PyTorch Integration Guide for complete example.
Slicing: Incorrect memory calculation
Problem: Slicing doesn't reduce memory as expected.
Diagnosis:
from omeco import slice_code
# Before slicing
comp_original = tree.complexity(ixs, sizes)
# After slicing
sliced = slice_code(tree, ixs, sizes, slicer)
comp_sliced = sliced.complexity(ixs, sizes)
print(f"Original sc: 2^{comp_original.sc:.2f}")
print(f"Sliced sc: 2^{comp_sliced.sc:.2f}")
print(f"Reduction: {2**(comp_original.sc - comp_sliced.sc):.1f}x")
print(f"Sliced indices: {sliced.slicing()}")
Solutions:
1. Increase sc_weight:
   score = ScoreFunction(sc_weight=5.0, sc_target=25.0)
   slicer = TreeSASlicer.fast(score=score)
2. Force slicing of specific indices:
   slicer = TreeSASlicer.fast(fixed_slices=[1, 2])
   sliced = slice_code(tree, ixs, sizes, slicer)
Debugging Tips
Enable verbose output
omeco has no built-in verbose mode, but you can inspect intermediate results with print statements:
# Check optimization result
tree = optimize_code(ixs, out, sizes)
print(f"Tree:\n{tree}")
# Check complexity
comp = tree.complexity(ixs, sizes)
print(f"tc: 2^{comp.tc:.2f}, sc: 2^{comp.sc:.2f}, rwc: 2^{comp.rwc:.2f}")
# Check slicing
sliced = slice_code(tree, ixs, sizes, slicer)
print(f"Sliced indices: {sliced.slicing()}")
Compare with Julia implementation
If you suspect a bug, compare with the original Julia implementation:
using OMEinsumContractionOrders
ixs = [[1, 2], [2, 3], [3, 1]]
out = [1]
sizes = Dict(1 => 10, 2 => 20, 3 => 10)
# Greedy
tree = optimize_code(ixs, out, sizes, GreedyMethod())
println(tree)
# TreeSA
tree = optimize_code(ixs, out, sizes, TreeSA())
println(tree)
Minimal reproducible example
When reporting issues, provide:
from omeco import optimize_code
# Minimal example
ixs = [[0, 1], [1, 2]]
out = [0, 2]
sizes = {0: 10, 1: 20, 2: 10}
tree = optimize_code(ixs, out, sizes)
print(f"Tree:\n{tree}")
comp = tree.complexity(ixs, sizes)
print(f"Complexity: tc={comp.tc:.2f}, sc={comp.sc:.2f}")
# What was expected vs what you got
print(f"Expected tc: ~13.0")
print(f"Got tc: {comp.tc:.2f}")
FAQ
Q: How do I use different optimizers?
A: optimize_code is the unified function that accepts different optimizer types:
from omeco import optimize_code, GreedyMethod, TreeSA
# Use default optimizer (GreedyMethod)
tree1 = optimize_code(ixs, out, sizes)
# Explicitly specify GreedyMethod
tree2 = optimize_code(ixs, out, sizes, GreedyMethod())
# Use TreeSA optimizer
tree3 = optimize_code(ixs, out, sizes, TreeSA.fast())
Q: How do I choose between Greedy and TreeSA?
A:
- GreedyMethod: Fast (milliseconds to seconds), good for prototyping
- TreeSA: Slower (seconds to minutes), better quality, use for production
Rule of thumb (see the sketch after this list):
- < 50 tensors → Greedy is fine
- 50-200 tensors → TreeSA.fast()
- > 200 tensors → Start with Greedy, then try TreeSA if needed
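One way to encode this rule of thumb in code (a sketch; the thresholds are rough guidance, not library defaults):
from omeco import optimize_code, GreedyMethod, TreeSA

def choose_optimizer(num_tensors):
    if num_tensors < 50:
        return GreedyMethod()
    # For very large networks, compare against a Greedy baseline first
    return TreeSA.fast()

tree = optimize_code(ixs, out, sizes, choose_optimizer(len(ixs)))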
Q: What does sc_target mean?
A: sc_target is the log₂ of the target memory limit in elements.
Calculate it:
import math
# For 4GB memory, float64 (8 bytes)
memory_gb = 4
bytes_per_element = 8
sc_target = math.log2(memory_gb * 1024**3 / bytes_per_element)
# sc_target ≈ 29.0
# For 8GB memory, float32 (4 bytes)
memory_gb = 8
bytes_per_element = 4
sc_target = math.log2(memory_gb * 1024**3 / bytes_per_element)
# sc_target ≈ 31.0
Q: Why is GPU optimization different?
A: GPUs have limited memory bandwidth relative to compute (10-100x higher compute-to-memory-bandwidth ratio than CPUs).
- CPU: Memory I/O is fast → `rw_weight=0.0`
- GPU: Use `rw_weight` to penalize memory I/O
See GPU Optimization Guide for complete tuning methodology.
Q: Can I save and load optimized trees?
A: Yes, use to_dict() and JSON:
import json
from omeco import optimize_code
# Optimize
tree = optimize_code(ixs, out, sizes)
# Save
tree_dict = tree.to_dict()
with open("tree.json", "w") as f:
json.dump(tree_dict, f)
# Load
with open("tree.json", "r") as f:
tree_dict = json.load(f)
# Use dict directly or reconstruct (implementation dependent)
Q: What if my tensor network has repeated indices?
A: Repeated indices (traces) are supported:
# Trace: C[i,j] = Σₖ A[i,k,k] * B[k,j]
ixs = [[0, 1, 1], [1, 2]] # Note: index 1 appears twice in first tensor
out = [0, 2]
sizes = {0: 10, 1: 20, 2: 10}
tree = optimize_code(ixs, out, sizes)
Still Having Issues?
- Check the documentation
- Search existing issues
- Report a bug:
  - Include a minimal reproducible example
  - Include Python/Rust version, OS, and omeco version
  - Show expected vs actual behavior
- Ask for help:
  - GitHub Discussions
  - Include context and what you've tried
API Reference
Quick reference for omeco's Python and Rust APIs.
Python API
Full Python API documentation is available in the package docstrings. This page provides a quick overview.
Optimization Functions
optimize_code
def optimize_code(
ixs: List[List[int]],
out: List[int],
sizes: Dict[int, int],
optimizer: Optional[Optimizer] = None
) -> NestedEinsum
Optimize tensor contraction order.
Parameters:
- `ixs`: List of index lists for each tensor (e.g., `[[0, 1], [1, 2]]`)
- `out`: Output indices (e.g., `[0, 2]`)
- `sizes`: Dictionary mapping indices to dimensions (e.g., `{0: 10, 1: 20, 2: 10}`)
- `optimizer`: Optimizer instance (default: `GreedyMethod()`)
Returns: NestedEinsum representing the optimized contraction tree
Example:
from omeco import optimize_code, TreeSA
tree = optimize_code(
ixs=[[0, 1], [1, 2], [2, 3]],
out=[0, 3],
sizes={0: 10, 1: 20, 2: 30, 3: 10},
optimizer=TreeSA.fast()
)
Slicing Functions
slice_code
def slice_code(
tree: NestedEinsum,
ixs: List[List[int]],
sizes: Dict[int, int],
slicer: Optional[TreeSASlicer] = None
) -> SlicedCode
Apply slicing to reduce memory usage.
Parameters:
- `tree`: Optimized contraction tree from `optimize_code`
- `ixs`: Original index lists
- `sizes`: Dimension sizes
- `slicer`: Slicer instance (optional, defaults to `TreeSASlicer()`)
Returns: SlicedCode with slicing applied
Example:
from omeco import optimize_code, slice_code, TreeSASlicer, ScoreFunction
tree = optimize_code(ixs, out, sizes)
slicer = TreeSASlicer.fast(
score=ScoreFunction(sc_target=28.0)
)
sliced = slice_code(tree, ixs, sizes, slicer)
Complexity Functions
contraction_complexity
def contraction_complexity(
tree: NestedEinsum,
ixs: List[List[int]],
sizes: Dict[int, int]
) -> Complexity
Calculate complexity metrics for a contraction tree.
Returns: Complexity object with fields:
- `tc`: Time complexity (log₂ FLOPs)
- `sc`: Space complexity (log₂ elements)
- `rwc`: Read-write complexity (log₂ elements)
Example:
from omeco import optimize_code
tree = optimize_code(ixs, out, sizes)
comp = tree.complexity(ixs, sizes)
print(f"Time: 2^{comp.tc:.2f} FLOPs")
print(f"Space: 2^{comp.sc:.2f} elements")
print(f"Read-write: 2^{comp.rwc:.2f} elements")
sliced_complexity
def sliced_complexity(
sliced: SlicedCode,
ixs: List[List[int]],
sizes: Dict[int, int]
) -> Complexity
Calculate complexity for sliced code.
Example:
from omeco import slice_code
sliced = slice_code(tree, ixs, sizes, slicer)
comp = sliced.complexity(ixs, sizes)
print(f"Sliced space: 2^{comp.sc:.2f}")
Optimizer Classes
GreedyMethod
class GreedyMethod:
def __init__(
self,
max_iter: Optional[int] = None,
temperature: float = 0.0,
max_branches: int = 1
)
Greedy optimization algorithm.
Parameters:
- `max_iter`: Maximum iterations (default: unlimited)
- `temperature`: Temperature for the stochastic variant (0.0 = deterministic)
- `max_branches`: Number of branches to explore (1 = pure greedy)
Presets:
# Default greedy
GreedyMethod()
# Stochastic variant
GreedyMethod(temperature=1.0, max_branches=5)
TreeSA
class TreeSA:
def __init__(
self,
ntrials: int = 1,
niters: int = 50,
score: Optional[ScoreFunction] = None
)
@staticmethod
def fast(score: Optional[ScoreFunction] = None) -> TreeSA
Tree-based simulated annealing optimizer.
Parameters:
- `ntrials`: Number of independent trials
- `niters`: Iterations per trial
- `score`: Score function for evaluating candidates
Presets:
# Fast preset (good balance)
TreeSA.fast()
# Custom configuration
TreeSA(ntrials=10, niters=100, score=ScoreFunction(sc_target=28.0))
TreeSASlicer
class TreeSASlicer:
def __init__(
self,
ntrials: int = 1,
niters: int = 50,
fixed_slices: Optional[List[int]] = None,
score: Optional[ScoreFunction] = None
)
@staticmethod
def fast(
fixed_slices: Optional[List[int]] = None,
score: Optional[ScoreFunction] = None
) -> TreeSASlicer
Simulated annealing slicer.
Parameters:
- `ntrials`: Number of trials
- `niters`: Iterations per trial
- `fixed_slices`: Indices that must be sliced
- `score`: Score function
Example:
# Automatic slice selection
TreeSASlicer.fast(score=ScoreFunction(sc_target=25.0))
# Force slicing on specific indices
TreeSASlicer.fast(fixed_slices=[1, 2])
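If one pass does not reach your memory budget, a simple retry loop (a usage pattern sketch, not a built-in feature) lowers `sc_target` until the sliced space complexity fits:
from omeco import slice_code, TreeSASlicer, ScoreFunction

sc_budget = 28.0  # log2 of the element budget
sc_target = sc_budget
for _ in range(5):
    slicer = TreeSASlicer.fast(score=ScoreFunction(sc_target=sc_target))
    sliced = slice_code(tree, ixs, sizes, slicer)
    if sliced.complexity(ixs, sizes).sc <= sc_budget:
        break
    sc_target -= 1.0  # ask the slicer for a stricter target and retry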
Configuration Classes
ScoreFunction
class ScoreFunction:
def __init__(
self,
tc_weight: float = 1.0,
sc_weight: float = 1.0,
rw_weight: float = 0.0,
sc_target: float = 20.0
)
Configure optimization objectives.
Parameters:
- `tc_weight`: Weight for time complexity
- `sc_weight`: Weight for space complexity
- `rw_weight`: Weight for read-write complexity (default: 0.0 for CPU; see GPU Optimization Guide for GPU)
- `sc_target`: Target space complexity (log₂ elements)
Score Formula:
score = tc_weight × 2^tc + rw_weight × 2^rwc + sc_weight × max(0, 2^sc - 2^sc_target)
Examples:
# CPU optimization
ScoreFunction(tc_weight=1.0, sc_weight=1.0, rw_weight=0.0, sc_target=28.0)
# GPU optimization (see GPU Optimization Guide)
ScoreFunction(tc_weight=1.0, sc_weight=1.0, rw_weight=10.0, sc_target=30.0)
# Memory-constrained
ScoreFunction(tc_weight=0.1, sc_weight=10.0, rw_weight=0.0, sc_target=25.0)
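For intuition, the sketch below re-evaluates the score formula by hand for a given Complexity object; it is a plain reimplementation of the formula above, not the library's internal code:
def score_of(comp, tc_weight=1.0, sc_weight=1.0, rw_weight=0.0, sc_target=20.0):
    # Mirrors: tc_weight × 2^tc + rw_weight × 2^rwc
    #          + sc_weight × max(0, 2^sc - 2^sc_target)
    return (tc_weight * 2**comp.tc
            + rw_weight * 2**comp.rwc
            + sc_weight * max(0.0, 2**comp.sc - 2**sc_target))

comp = tree.complexity(ixs, sizes)
print(f"CPU score: {score_of(comp):.3e}")
print(f"GPU score: {score_of(comp, rw_weight=10.0, sc_target=30.0):.3e}")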
Data Classes
NestedEinsum
Represents an optimized contraction tree.
Methods:
tree.to_dict() -> dict # Convert to dictionary
tree.leaf_count() -> int # Number of leaf tensors
tree.depth() -> int # Tree depth
str(tree) # Pretty-print tree
repr(tree) # Summary representation
Dictionary Structure:
# Leaf node
{"tensor_index": int}
# Internal node
{
"args": [child1_dict, child2_dict, ...],
"eins": {
"ixs": [[int, ...], ...], # Input indices
"iy": [int, ...] # Output indices
}
}
Example:
tree = optimize_code(ixs, out, sizes)
# Pretty print
print(tree)
# Output:
# ab, bc -> ac
# ├─ ab
# └─ bc
# Convert to dict for execution
tree_dict = tree.to_dict()
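The dictionary form is the basis for custom executors. As a sketch of working with it, the following walk prints each contraction in post-order and cross-checks the leaf count (assuming the structure documented above):
def walk(node, depth=0):
    if "tensor_index" in node:  # leaf node
        print("  " * depth + f"tensor {node['tensor_index']}")
        return 1
    # Visit children first, then report this contraction step
    leaves = sum(walk(child, depth + 1) for child in node["args"])
    eins = node["eins"]
    print("  " * depth + f"contract {eins['ixs']} -> {eins['iy']}")
    return leaves

num_leaves = walk(tree.to_dict())
assert num_leaves == tree.leaf_count()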
SlicedCode
Represents a contraction with slicing applied.
Methods:
sliced.slicing() -> List[int] # Get sliced indices
Example:
sliced = slice_code(tree, ixs, sizes, slicer)
print(f"Sliced indices: {sliced.slicing()}")
# Output: [1, 3]
Complexity
Complexity metrics for a contraction.
Fields:
comp.tc: float # Time complexity (log₂ FLOPs)
comp.sc: float # Space complexity (log₂ elements)
comp.rwc: float # Read-write complexity (log₂ elements)
Example:
comp = tree.complexity(ixs, sizes)
# Interpret log₂ values
flops = 2 ** comp.tc
memory_elements = 2 ** comp.sc
io_operations = 2 ** comp.rwc
print(f"{flops:.2e} FLOPs")
print(f"{memory_elements:.2e} elements in memory")
Rust API
Full Rust API documentation: docs.rs/omeco
Core Types
NestedEinsum<I>
#![allow(unused)]
fn main() {
pub struct NestedEinsum<I> { /* ... */ }
impl<I> NestedEinsum<I> {
pub fn leaf(tensor_index: usize) -> Self
pub fn new(args: Vec<Self>, eins: ContractionOperation<I>) -> Self
pub fn depth(&self) -> usize
pub fn leaf_count(&self) -> usize
}
}
ContractionOperation<I>
#![allow(unused)]
fn main() {
pub struct ContractionOperation<I> {
pub ixs: Vec<Vec<I>>, // Input indices
pub iy: Vec<I>, // Output indices
}
}
Optimization Functions
optimize_greedy
#![allow(unused)]
fn main() {
pub fn optimize_greedy<I>(
ixs: &[Vec<I>],
out: &[I],
sizes: &HashMap<I, usize>
) -> NestedEinsum<I>
where
I: Hash + Eq + Clone + Ord
}
optimize_code
#![allow(unused)]
fn main() {
pub fn optimize_code<I, O>(
ixs: &[Vec<I>],
out: &[I],
sizes: &HashMap<I, usize>,
optimizer: O
) -> NestedEinsum<I>
where
I: Hash + Eq + Clone + Ord,
O: Optimizer<I>
}
Optimizer Traits
Optimizer<I>
#![allow(unused)]
fn main() {
pub trait Optimizer<I> {
fn optimize(
&self,
ixs: &[Vec<I>],
out: &[I],
sizes: &HashMap<I, usize>
) -> NestedEinsum<I>;
}
}
Configuration
ScoreFunction
#![allow(unused)]
fn main() {
pub struct ScoreFunction {
pub tc_weight: f64,
pub sc_weight: f64,
pub rw_weight: f64,
pub sc_target: f64,
}
impl ScoreFunction {
pub fn new(tc_weight: f64, sc_weight: f64, rw_weight: f64, sc_target: f64) -> Self
}
}
Type Mapping (Python ↔ Rust)
| Python | Rust | Notes |
|---|---|---|
List[List[int]] | Vec<Vec<i64>> | Index lists |
Dict[int, int] | HashMap<i64, usize> | Dimension sizes |
NestedEinsum | NestedEinsum<i64> | Contraction tree |
ScoreFunction | ScoreFunction | Same fields |
Complexity | ContractionComplexity | Metrics |
Common Patterns
Optimize with custom score
from omeco import optimize_code, TreeSA, ScoreFunction
score = ScoreFunction(
tc_weight=1.0,
sc_weight=1.0,
rw_weight=10.0, # For GPU (see GPU Optimization Guide)
sc_target=30.0
)
tree = optimize_code(
ixs=[[0, 1], [1, 2]],
out=[0, 2],
sizes={0: 10, 1: 20, 2: 10},
optimizer=TreeSA(score=score)
)
Optimize then slice
from omeco import optimize_code, slice_code, TreeSA, TreeSASlicer, ScoreFunction
# Step 1: Optimize contraction order
tree = optimize_code(ixs, out, sizes, TreeSA.fast())
# Step 2: Apply slicing if needed
slicer = TreeSASlicer.fast(score=ScoreFunction(sc_target=28.0))
sliced = slice_code(tree, ixs, sizes, slicer)
Check complexity
from omeco import optimize_code
tree = optimize_code(ixs, out, sizes)
comp = tree.complexity(ixs, sizes)
print(f"Time: 2^{comp.tc:.2f} FLOPs = {2**comp.tc:.2e} FLOPs")
print(f"Space: 2^{comp.sc:.2f} elements = {2**comp.sc:.2e} elements")
# Memory in GiB (float64)
memory_gib = (2 ** comp.sc) * 8 / 1024**3
print(f"Memory: {memory_gib:.2f} GiB")
Next Steps
- Quick Start Guide - Get started with examples
- Score Function Configuration - Configure optimization
- Troubleshooting - Common issues and solutions
- Full Rust Docs - Complete Rust API documentation
Changelog
All notable changes to omeco are documented here.
The format follows Keep a Changelog, and this project adheres to Semantic Versioning.
[Unreleased]
Added
- Comprehensive mdBook documentation following tropical-gemm standard
- Pretty printing for Python `NestedEinsum` with ASCII tree visualization
- PyTorch integration guide and examples
- GPU optimization guide with `rw_weight` configuration
- Slicing strategy guide for memory-constrained environments
- Troubleshooting guide with common issues and solutions
- API reference for Python and Rust APIs
- Performance benchmarks comparing Rust vs Julia
- Algorithm comparison guide (Greedy vs TreeSA)
Changed
- Migrated documentation from scattered markdown files to structured mdBook
- Improved Python bindings with better `__str__` and `__repr__` methods
Deprecated
- Legacy `docs/score_function_guide.md` (migrated to mdBook)
[0.2.1] - 2024-01-XX
Fixed
- Issue #6: Hyperedge index preservation in contraction operations (PR #7)
- Fixed `contract_tree!` macro to correctly preserve tensor indices during contraction
- Added regression tests to verify hyperedge handling
- Ensures contraction order matches input tensor order specified in `ixs`
Added
- Test suite for hyperedge index preservation
- CI improvements for better test coverage
[0.2.0] - 2024-01-XX
Added
- TreeSA (Tree-based Simulated Annealing) optimizer
- `TreeSA.fast()` preset for quick high-quality optimization
- Slicing support with `TreeSASlicer` for memory reduction
- `ScoreFunction` for configurable optimization objectives
- `contraction_complexity` and `sliced_complexity` functions
- Python bindings via PyO3
- `optimize_code` generic function accepting optimizer instances
- Read-write complexity (rwc) metric for GPU optimization
Changed
- Improved API ergonomics with preset methods
- Better default parameters for optimizers
Performance
- 1.4-1.5x faster than Julia OMEinsumContractionOrders.jl on benchmarks
- Efficient TreeSA implementation with better exploration
[0.1.0] - 2023-XX-XX
Added
- Initial release
- GreedyMethod optimizer
- Basic contraction order optimization
- Support for tensor networks with arbitrary indices
- Complexity calculation (time and space)
- Rust core library
- Basic documentation
Features
- Greedy algorithm with configurable parameters
- Stochastic variants for improved solutions
- Efficient index handling with generic types
- HashMap-based dimension tracking
Version Numbering
omeco follows Semantic Versioning:
- MAJOR version for incompatible API changes
- MINOR version for new functionality in a backward compatible manner
- PATCH version for backward compatible bug fixes
Release Process
1. Update version in `Cargo.toml` files
2. Update `CHANGELOG.md` with release notes
3. Tag the release: `git tag v0.X.Y`
4. Push tags: `git push --tags`
5. Publish to crates.io: `cargo publish`
6. Publish to PyPI: `maturin publish` (Python bindings)
Contributing
See CONTRIBUTING.md for guidelines on contributing to omeco.
Migration Guides
Migrating from 0.1.x to 0.2.x
Major Changes:
1. New `optimize_code` function: Unified API for all optimizers
   # Old (0.1.x)
   tree = optimize_greedy(ixs, out, sizes)
   tree = optimize_treesa(ixs, out, sizes)
   # New (0.2.x) - unified interface
   from omeco import optimize_code, GreedyMethod, TreeSA
   tree = optimize_code(ixs, out, sizes)  # Uses GreedyMethod by default
   tree = optimize_code(ixs, out, sizes, TreeSA.fast())
2. ScoreFunction configuration:
   # New in 0.2.x
   from omeco import ScoreFunction, TreeSA
   score = ScoreFunction(tc_weight=1.0, sc_weight=1.0, rw_weight=10.0, sc_target=30.0)
   tree = optimize_code(ixs, out, sizes, TreeSA(score=score))
3. Slicing support:
   # New in 0.2.x
   from omeco import slice_code, TreeSASlicer
   sliced = slice_code(tree, ixs, sizes, TreeSASlicer.fast())
Breaking Changes:
- Removed `optimize_greedy()` and `optimize_treesa()` from Python exports
- Use `optimize_code(...)` with `GreedyMethod()` or `TreeSA()` optimizers instead
- Rust API unchanged
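If many 0.1.x call sites remain, a thin compatibility shim in your own code (hypothetical, not shipped with omeco) can bridge the gap during migration:
from omeco import optimize_code, GreedyMethod, TreeSA

# Shims reproducing the removed 0.1.x entry points
def optimize_greedy(ixs, out, sizes):
    return optimize_code(ixs, out, sizes, GreedyMethod())

def optimize_treesa(ixs, out, sizes):
    return optimize_code(ixs, out, sizes, TreeSA.fast())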
Migrating from Julia OMEinsumContractionOrders.jl
Index Differences:
- Julia: 1-based indexing
- Rust/Python: 0-based indexing (or use arbitrary hashable types)
# Julia
ixs = [[1, 2], [2, 3], [3, 1]]
sizes = Dict(1 => 10, 2 => 20, 3 => 10)
# Python (0-based)
ixs = [[0, 1], [1, 2], [2, 0]]
sizes = {0: 10, 1: 20, 2: 10}
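When porting an existing Julia problem definition, a small shift (a sketch) converts the 1-based indices and size dictionary to the 0-based form:
# Julia-style (1-based) inputs
ixs_julia = [[1, 2], [2, 3], [3, 1]]
sizes_julia = {1: 10, 2: 20, 3: 10}

# Shift every index down by one for omeco
ixs = [[i - 1 for i in ix] for ix in ixs_julia]
sizes = {i - 1: d for i, d in sizes_julia.items()}
# ixs == [[0, 1], [1, 2], [2, 0]], sizes == {0: 10, 1: 20, 2: 10}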
Function Names:
| Julia | omeco (Python/Rust) |
|---|---|
optimize_greedy | optimize_code(..., GreedyMethod()) |
optimize_treesa | optimize_code(..., TreeSA.fast()) |
contraction_complexity | contraction_complexity |
slicing | slice_code |
API Compatibility: Most functions have similar signatures and behavior.