omeco
High-performance tensor network contraction order optimization in Rust.
What is omeco?
omeco (One More Einsum Contraction Order) is a library for optimizing tensor network contractions. When contracting multiple tensors together, the order of contractions can make an exponential difference in computational cost. omeco provides fast algorithms to find near-optimal contraction orders.
Ported from the Julia library OMEinsumContractionOrders.jl.
Why Optimize Contraction Order?
Consider contracting three matrices: A[10×100] × B[100×20] × C[20×5].
Different orders give vastly different costs:
| Order | Operations | Peak Memory |
|---|---|---|
| (A×B)×C | 20,000 + 1,000 = 21,000 | 200 elements |
| A×(B×C) | 10,000 + 5,000 = 15,000 | 500 elements |
A×(B×C) needs 1.4x fewer operations, while (A×B)×C keeps the intermediate 2.5x smaller: even three matrices force a time/space trade-off. For larger tensor networks, differences can be exponential.
Finding the optimal order is NP-complete, but heuristics find near-optimal solutions in seconds.
Key Features
- Fast Greedy Algorithm: O(n² log n) greedy method for quick optimization
- Simulated Annealing: TreeSA finds higher-quality solutions for complex networks
- Automatic Slicing: TreeSASlicer reduces memory by trading time for space
- Hardware-Aware: Configure for CPU or GPU with memory I/O optimization
- Python & Rust: Native Rust with Python bindings via PyO3
Example
Python:
from omeco import optimize_code
# Matrix chain: A[i,j] × B[j,k] × C[k,l]
ixs = [[0, 1], [1, 2], [2, 3]]
out = [0, 3]
sizes = {0: 10, 1: 100, 2: 20, 3: 5}
tree = optimize_code(ixs, out, sizes)
complexity = tree.complexity(ixs, sizes)
print(f"Time: 2^{complexity.tc:.2f}, Space: 2^{complexity.sc:.2f}")
Rust:
use omeco::{contraction_complexity, EinCode, GreedyMethod, optimize_code};
use std::collections::HashMap;

fn main() {
    let code = EinCode::new(vec![vec![0, 1], vec![1, 2], vec![2, 3]], vec![0, 3]);
    let sizes = HashMap::from([(0, 10), (1, 100), (2, 20), (3, 5)]);
    let tree = optimize_code(&code, &sizes, &GreedyMethod::default()).unwrap();
    let complexity = contraction_complexity(&tree, &code, &sizes);
    println!("Time: 2^{:.2}, Space: 2^{:.2}", complexity.tc, complexity.sc);
}
Use Cases
- Quantum Circuit Simulation: Optimizing tensor network contractions in quantum algorithms
- Neural Networks: Efficient einsum operations in deep learning
- Scientific Computing: Large-scale tensor computations in physics and chemistry
- Graph Algorithms: Computing graph polynomials and partition functions
Next Steps
- Installation - Install omeco for Python or Rust
- Quick Start - Your first tensor contraction optimization
- Concepts - Understand tensor networks and complexity metrics
Getting Started
This section covers installation and basic usage of omeco.
Overview
omeco is available as both a Rust crate and a Python package. Choose the language that best fits your workflow:
- Python: Quick prototyping, PyTorch/NumPy integration, Jupyter notebooks
- Rust: Maximum performance, embedded systems, library development
Crate Organization
| Package | Description |
|---|---|
| omeco | Core Rust library with optimization algorithms |
| omeco-python | Python bindings built with PyO3 |
System Requirements
For Python
- Python 3.8 or later
- pip or conda package manager
For Rust
- Rust 1.70 or later
- Cargo build system
Next Steps
- Installation - Install omeco for your platform
- Quick Start - Your first contraction optimization
Installation
Python
Via pip
The easiest way to install omeco for Python:
pip install omeco
Via conda (not yet available)
# Coming soon
conda install -c conda-forge omeco
From source
For development or the latest features:
git clone https://github.com/GiggleLiu/omeco.git
cd omeco/omeco-python
pip install maturin
maturin develop
Verify Installation
import omeco
print(omeco.__version__)
# Quick test
from omeco import optimize_code, GreedyMethod
tree = optimize_code([[0, 1], [1, 2]], [0, 2], {0: 2, 1: 3, 2: 2})
print(tree)
Expected output:
ab, bc -> ac
├─ tensor_0
└─ tensor_1
Rust
Via Cargo
Add to your Cargo.toml:
[dependencies]
omeco = "0.2"
Or use cargo add:
cargo add omeco
From source
git clone https://github.com/GiggleLiu/omeco.git
cd omeco
cargo build --release
Verify Installation
Create src/main.rs:
use omeco::{EinCode, GreedyMethod, optimize_code};
use std::collections::HashMap;
fn main() {
let code = EinCode::new(vec![vec![0, 1], vec![1, 2]], vec![0, 2]);
let sizes = HashMap::from([(0, 2), (1, 3), (2, 2)]);
let tree = optimize_code(&code, &sizes, &GreedyMethod::default()).unwrap();
println!("{} leaves, depth {}", tree.leaf_count(), tree.depth());
}
Run:
cargo run
Troubleshooting
Python: “No module named omeco”
- Ensure you’re using the correct Python environment
- Try `pip install --force-reinstall omeco`
- Check `pip list | grep omeco`
Rust: Compilation errors
- Update Rust: `rustup update`
- Check the minimum version: `rustc --version` (need 1.70+)
- Clean and rebuild: `cargo clean && cargo build`
Installation takes too long
- Python: Pre-built wheels are available for common platforms
- Rust: First build compiles dependencies, subsequent builds are fast
Next Steps
- Quick Start - Your first optimization
- Concepts - Understand tensor networks
Quick Start
This guide shows the basics of tensor contraction optimization.
Basic Contraction Optimization
Python
from omeco import optimize_code
# Matrix chain: A[i,j] × B[j,k] × C[k,l] → D[i,l]
ixs = [[0, 1], [1, 2], [2, 3]] # Input tensor indices
out = [0, 3] # Output indices
sizes = {0: 10, 1: 100, 2: 20, 3: 5} # Dimension sizes
# Optimize contraction order
tree = optimize_code(ixs, out, sizes)
# Compute complexity
complexity = tree.complexity(ixs, sizes)
print(f"Time complexity: 2^{complexity.tc:.2f} FLOPs")
print(f"Space complexity: 2^{complexity.sc:.2f} elements")
print(f"Actual FLOPs: {complexity.flops():,.0f}")
print(f"Peak memory: {complexity.peak_memory():,.0f} elements")
Output:
Time complexity: 2^14.29 FLOPs
Space complexity: 2^11.29 elements
Actual FLOPs: 21,000
Peak memory: 2,500 elements
Visualize the contraction tree:
print(tree)
ab, bd -> ad
├─ tensor_0
└─ bc, cd -> bd
   ├─ tensor_1
   └─ tensor_2
Rust
use omeco::{
EinCode, GreedyMethod, optimize_code
};
use std::collections::HashMap;
fn main() {
// Matrix chain: A[i,j] × B[j,k] × C[k,l] → D[i,l]
let code = EinCode::new(
vec![vec![0, 1], vec![1, 2], vec![2, 3]],
vec![0, 3]
);
let sizes = HashMap::from([
(0, 10),
(1, 100),
(2, 20),
(3, 5),
]);
// Optimize with default greedy method
let tree = optimize_code(&code, &sizes, &GreedyMethod::default())
.expect("Optimization failed");
// Compute complexity
let complexity = contraction_complexity(&tree, &code, &sizes);
println!("Time: 2^{:.2}, Space: 2^{:.2}",
complexity.tc, complexity.sc);
println!("Leaves: {}, Depth: {}",
tree.leaf_count(), tree.depth());
}
Understanding the Output
NestedEinsum Structure
The optimized contraction tree shows the order of operations:
ab, bd -> ad
├─ tensor_0
└─ bc, cd -> bd
   ├─ tensor_1
   └─ tensor_2
This means:
- First contract `tensor_1` (indices `bc`) with `tensor_2` (indices `cd`) → the result has indices `bd`
- Then contract `tensor_0` (indices `ab`) with that result → the final output has indices `ad`
Complexity Metrics
- tc (time complexity): log₂ of total FLOPs needed
- sc (space complexity): log₂ of peak memory usage
- rwc (read-write complexity): log₂ of memory I/O operations (for GPU optimization)
Lower values = better performance.
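These log-scale values convert back to absolute numbers with a power of two; for example, reusing the complexity object computed above:

flops = 2 ** complexity.tc          # total floating-point operations
peak_elements = 2 ** complexity.sc  # peak memory in tensor elements
peak_bytes = peak_elements * 8      # assuming float64 (8 bytes per element)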
Using Different Optimizers
Greedy Method (Fast)
from omeco import GreedyMethod
# Deterministic greedy (default)
tree = optimize_code(ixs, out, sizes, GreedyMethod())
# Stochastic greedy (explores more options)
tree = optimize_code(ixs, out, sizes, GreedyMethod(alpha=0.5, temperature=1.0))
TreeSA (Higher Quality)
from omeco import TreeSA, ScoreFunction
# Fast preset (good for most cases)
tree = optimize_code(ixs, out, sizes, TreeSA.fast())
# Custom configuration
score = ScoreFunction(sc_target=15.0)
optimizer = TreeSA(ntrials=10, niters=50, score=score)
tree = optimize_code(ixs, out, sizes, optimizer)
Working with the Tree
Convert to Dictionary
tree_dict = tree.to_dict()
# Traverse the tree
def traverse(node):
if "tensor_index" in node:
print(f"Leaf: tensor {node['tensor_index']}")
else:
print(f"Contraction: {node['eins']}")
for child in node["args"]:
traverse(child)
traverse(tree_dict)
Tree Properties
# Number of input tensors
num_tensors = tree.leaf_count()
# Depth of the tree
depth = tree.depth()
# Check if binary tree
is_binary = tree.is_binary()
# Get leaf indices in order
leaves = tree.leaf_indices()
Memory Reduction with Slicing
For large tensor networks that don’t fit in memory:
from omeco import slice_code, TreeSASlicer, ScoreFunction
# Optimize contraction first
tree = optimize_code(ixs, out, sizes)
# Check if memory is too large
complexity = tree.complexity(ixs, sizes)
if complexity.sc > 25.0: # More than 2^25 elements (~256MB for float64)
# Slice to reduce memory
score = ScoreFunction(sc_target=25.0)
slicer = TreeSASlicer.fast(score=score)
sliced = slice_code(tree, ixs, sizes, slicer)
# Check new complexity
sliced_comp = sliced.complexity(ixs, sizes)
print(f"Original space: 2^{complexity.sc:.2f}")
print(f"Sliced space: 2^{sliced_comp.sc:.2f}")
print(f"Sliced indices: {sliced.slicing()}")
Next Steps
- Concepts - Understand tensor networks and complexity
- Algorithms - Learn about GreedyMethod and TreeSA
- Score Function Configuration - Optimize for your hardware
- PyTorch Integration - Use with PyTorch tensors
Concepts
Understanding these core concepts will help you use omeco effectively.
Overview
Tensor network contraction is a fundamental operation in many scientific and machine learning applications. This section explains:
- What tensor networks are and why they matter
- Why contraction order makes an exponential difference
- How to interpret complexity metrics
Topics
- Tensor Networks - Einstein summation and tensor operations
- Contraction Order Problem - Why order matters
- Complexity Metrics - Understanding tc, sc, and rwc
Quick Summary
The Problem: Contracting N tensors can be done in many orders, with costs differing exponentially.
The Solution: Heuristic algorithms find near-optimal orders in polynomial time.
The Trade-off: Time vs quality - greedy is fast, simulated annealing is better.
Tensor Networks
What are Tensor Networks?
A tensor network is a collection of multidimensional arrays (tensors) connected by shared indices. Contracting a tensor network means computing their products and sums according to the Einstein summation convention.
Einstein Summation
The einsum notation specifies which indices to sum over:
einsum("ij,jk->ik", A, B)
This means:
- `A` has shape `[i, j]`
- `B` has shape `[j, k]`
- Sum over the shared index `j`
- The result has shape `[i, k]`
Equivalent to matrix multiplication: C[i,k] = Σⱼ A[i,j] * B[j,k]
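The equivalence is easy to check with NumPy:

import numpy as np

A = np.random.rand(3, 4)
B = np.random.rand(4, 5)
C = np.einsum("ij,jk->ik", A, B)
assert np.allclose(C, A @ B)  # einsum result matches matrix multiplication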
Common Patterns
Matrix Multiplication
# C = A × B
einsum("ij,jk->ik", A, B)
Batch Matrix Multiplication
# Multiple matrix multiplications at once
einsum("bij,bjk->bik", A, B)
Trace
# Sum of diagonal elements
einsum("ii->", A)
Outer Product
# C[i,j] = A[i] * B[j]
einsum("i,j->ij", a, b)
Tensor Contraction with Multiple Indices
# Contract over indices j and k
einsum("ijk,klj->il", A, B)
Tensor Networks in Practice
Quantum Circuit Simulation
Quantum gates are represented as tensors, and circuit evaluation becomes a tensor network contraction:
# 3-qubit quantum circuit
gates = [
    [[...]],  # Gate on qubits 0,1
    [[...]],  # Gate on qubits 1,2
    [[...]],  # Gate on qubits 0,2
]
# Each qubit index appears multiple times
# Contracting computes the circuit output
Neural Networks
Einsum operations appear in:
- Attention mechanisms: `einsum("bqd,bkd->bqk", Q, K)`
- Tensor decompositions: Tucker, CP, tensor train
- Graph neural networks
Scientific Computing
- Physics: Partition functions, path integrals
- Chemistry: Molecular orbital calculations
- Mathematics: Graph polynomials, combinatorial problems
Contraction Order Matters
Given tensors A, B, C to contract, we can do:
- (A × B) × C
- A × (B × C)
- (A × C) × B
These give the same result but have vastly different costs.
Example with shapes A[10,100], B[100,20], C[20,5]:
| Order | FLOPs | Peak Memory |
|---|---|---|
| (A×B)×C | 20,000 + 1,000 = 21,000 | 200 |
| A×(B×C) | 10,000 + 5,000 = 15,000 | 500 |
The second order needs 1.4x fewer FLOPs, while the first keeps the peak intermediate 2.5x smaller. Which one wins depends on your constraint, and the gap grows rapidly with network size.
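The per-step costs above follow one rule: a pairwise contraction costs the product of the sizes of every index involved. A small helper (illustrative only, not part of omeco's API) reproduces the numbers:

from math import prod

def pair_cost(ix_a, ix_b, sizes):
    # every index of either tensor appears in the contraction loop nest
    return prod(sizes[i] for i in set(ix_a) | set(ix_b))

sizes = {"i": 10, "j": 100, "k": 20, "l": 5}
print(pair_cost("ij", "jk", sizes))  # A×B: 10·100·20 = 20,000
print(pair_cost("jk", "kl", sizes))  # B×C: 100·20·5 = 10,000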
Representing Tensor Networks
As Lists of Index Lists
# Matrix chain A×B×C
ixs = [
[0, 1], # A has indices [i, j]
[1, 2], # B has indices [j, k]
[2, 3], # C has indices [k, l]
]
out = [0, 3] # Output has indices [i, l]
As Graphs
Each tensor is a node, shared indices are edges:
i --- A[i,j] --- j --- B[j,k] --- k --- C[k,l] --- l
Contracting removes edges and merges nodes.
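One way to see this graph programmatically is to map each index to the tensors that use it (again an illustrative helper, not omeco API):

from collections import defaultdict

def index_incidence(ixs):
    incidence = defaultdict(list)
    for tensor_id, ix in enumerate(ixs):
        for index in ix:
            incidence[index].append(tensor_id)
    return dict(incidence)

print(index_incidence([[0, 1], [1, 2], [2, 3]]))
# {0: [0], 1: [0, 1], 2: [1, 2], 3: [2]} : index 1 is the edge between tensors 0 and 1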
Next Steps
- Contraction Order Problem - Why optimization is needed
- Complexity Metrics - How to measure cost
- Quick Start - Optimize your first tensor network
Contraction Order Problem
The Problem
When contracting multiple tensors, the order matters exponentially.
Simple Example: Three Matrices
Consider A[10×100] × B[100×20] × C[20×5]:
Option 1: Left-to-right (A×B)×C
Step 1: A[10×100] × B[100×20] = T[10×20]
Cost: 10 × 100 × 20 = 20,000 FLOPs
Memory: 200 elements
Step 2: T[10×20] × C[20×5] = D[10×5]
Cost: 10 × 20 × 5 = 1,000 FLOPs
Memory: 50 elements
Total: 21,000 FLOPs, peak memory 200 elements
Option 2: Right-to-left A×(B×C)
Step 1: B[100×20] × C[20×5] = T[100×5]
Cost: 100 × 20 × 5 = 10,000 FLOPs
Memory: 500 elements
Step 2: A[10×100] × T[100×5] = D[10×5]
Cost: 10 × 100 × 5 = 5,000 FLOPs
Memory: 50 elements
Total: 15,000 FLOPs, peak memory 500 elements
Result: Option 2 needs 1.4x fewer FLOPs, while Option 1 uses 2.5x less peak memory. The contraction order determines both costs, and for larger networks the spread becomes exponential.
Why It’s Hard
Finding the optimal contraction order is NP-complete:
- With N tensors, there are approximately (2N)!/(N+1)! possible orders
- For 10 tensors: ~17 million orders
- For 20 tensors: ~6 × 10²² orders (roughly the number of stars in the observable universe)
Brute force is impossible for realistic problems.
Heuristic Approaches
Since optimal search is intractable, we use heuristics:
1. Greedy Algorithm
Idea: At each step, contract the pair with minimum cost.
Pros:
- Fast: O(n² log n)
- Deterministic
- Good for many practical cases
Cons:
- Can get stuck in local optima
- May miss better global solutions
2. Simulated Annealing (TreeSA)
Idea: Random tree mutations with temperature-based acceptance.
Pros:
- Explores more of the search space
- Often finds better solutions than greedy
- Configurable trade-off (time vs quality)
Cons:
- Slower than greedy
- Non-deterministic
- Requires parameter tuning
Real-World Impact
Quantum Circuit with 50 Qubits
- Random order: ~10⁵⁰ operations (impossible)
- Greedy order: ~10³⁰ operations (still impossible)
- Optimized order: ~10²⁰ operations (feasible on supercomputer)
10,000,000,000x speedup from optimization!
Neural Network Attention
# Naive: einsum("bqd,bkd,bkv->bqv", Q, K, V)
# Time: O(qkdv) per batch when contracted in a single step
# Optimized: Q @ K.T @ V
# Time: O(qkd + qkv)
# Speedup: ~10x for typical dimensions
Optimization Objectives
Different scenarios need different trade-offs:
| Scenario | Optimize | Accept | Example |
|---|---|---|---|
| GPU limited | Space (sc) | Higher time | Large models on 8GB GPU |
| CPU compute | Time (tc) | Higher space | Batch processing with 256GB RAM |
| Both | Balanced | sc_target = available memory | Most cases |
Practical Guidelines
1. Start with greedy: a fast baseline for most cases
2. Use TreeSA if:
   - The greedy result is too slow or too large
   - You have time to optimize (minutes, not seconds)
   - The problem is large and complex
3. Set sc_target to the available memory:
   # For an 8GB GPU with float32 tensors
   sc_target = log2(8 * 1024**3 / 4) ≈ 31
4. Use slicing when memory is the bottleneck:
   if complexity.sc > sc_target:
       sliced = slice_code(tree, ixs, sizes, TreeSASlicer.fast())
Next Steps
- Complexity Metrics - Understand tc, sc, rwc
- Greedy Method - Fast optimization
- TreeSA - Higher quality optimization
Complexity Metrics
omeco tracks three key metrics for tensor contractions.
The Three Metrics
Time Complexity (tc)
Definition: log₂ of total floating-point operations (FLOPs).
Interpretation:
- `tc = 20` means 2²⁰ = 1,048,576 FLOPs
- `tc = 30` means 2³⁰ ≈ 1 billion FLOPs
- Lower is better (less computation)
Example:
complexity = tree.complexity(ixs, sizes)
print(f"Time: 2^{complexity.tc:.2f} FLOPs")
print(f"Actual FLOPs: {complexity.flops():,.0f}")
Space Complexity (sc)
Definition: log₂ of peak memory usage in number of tensor elements.
Interpretation:
- `sc = 20` means 2²⁰ = 1,048,576 elements (8MB for float64, 4MB for float32)
- `sc = 30` means 2³⁰ ≈ 1 billion elements (8GB for float64, 4GB for float32)
- Lower is better (less memory)
Example:
complexity = tree.complexity(ixs, sizes)
print(f"Space: 2^{complexity.sc:.2f} elements")
print(f"Peak memory: {complexity.peak_memory():,.0f} elements")
# For float64 (8 bytes)
memory_gib = complexity.peak_memory() * 8 / 1024**3
print(f"Memory: {memory_gib:.2f} GiB")
Read-Write Complexity (rwc)
Definition: log₂ of total memory I/O operations.
Use Case: GPU optimization where memory bandwidth is the bottleneck.
Interpretation:
- On CPU: Usually ignored (rw_weight=0)
- On GPU: Critical! Compute-to-memory-bandwidth ratio is high (10-30x)
- Lower is better (less data movement)
Example:
# CPU optimization (ignore I/O)
score = ScoreFunction(tc_weight=1.0, sc_weight=1.0, rw_weight=0.0)
# GPU optimization (I/O matters! See GPU Optimization Guide)
score = ScoreFunction(tc_weight=1.0, sc_weight=1.0, rw_weight=10.0)
Why Logarithmic Scale?
Using log₂ makes the numbers manageable:
| Linear Scale | Log Scale | Human Readable |
|---|---|---|
| 1,024 | 10 | ~thousand |
| 1,048,576 | 20 | ~million |
| 1,073,741,824 | 30 | ~billion |
| 1,099,511,627,776 | 40 | ~trillion |
Advantages:
- Easy to compare: `tc=25` vs `tc=30` is a clear 32x difference
- Handles huge ranges: from thousands to quintillions
- Matches memory sizes: 2ⁿ bytes
Calculating Complexity
For a Single Contraction
Contracting A[i,j] with B[j,k] → C[i,k]:
Time: i × j × k FLOPs
Space: max(i×j, j×k, i×k) elements
For a Contraction Tree
Sum time over all contractions, take max space:
def compute_complexity(tree):
if tree.is_leaf():
return Complexity(tc=0, sc=log2(tensor_size))
# Recurse on children
left_comp = compute_complexity(tree.left)
right_comp = compute_complexity(tree.right)
# Contraction cost
contraction_tc = log2(flops_for_this_step)
contraction_sc = log2(intermediate_size)
# Combine
tc = log2sumexp2([left_comp.tc, right_comp.tc, contraction_tc])
sc = max(left_comp.sc, right_comp.sc, contraction_sc)
return Complexity(tc=tc, sc=sc)
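Here `log2sumexp2` adds quantities that are stored in log space; a minimal sketch:

import math

def log2sumexp2(xs):
    # computes log2(2^x1 + 2^x2 + ...) without overflowing for large exponents
    m = max(xs)
    return m + math.log2(sum(2 ** (x - m) for x in xs))

print(log2sumexp2([20.0, 20.0]))  # 21.0: doubling the work adds 1 to tc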
Practical Interpretation
Time Complexity
| tc | FLOPs | CPU Time (1 TFLOP/s) | GPU Time (10 TFLOP/s) |
|---|---|---|---|
| 20 | 1M | 0.001 ms | 0.0001 ms |
| 30 | 1B | 1 ms | 0.1 ms |
| 40 | 1T | 1 second | 0.1 second |
| 50 | 1P | 17 minutes | 1.7 minutes |
Space Complexity
| sc | Elements | float32 Memory | float64 Memory |
|---|---|---|---|
| 20 | 1M | 4 MB | 8 MB |
| 25 | 32M | 128 MB | 256 MB |
| 30 | 1B | 4 GB | 8 GB |
| 35 | 32B | 128 GB | 256 GB |
Optimization Goals
Minimize Time (Speed-Critical)
score = ScoreFunction(tc_weight=1.0, sc_weight=0.0, sc_target=float('inf'))
Use when:
- Memory is abundant
- Need fastest execution
- Real-time applications
Minimize Space (Memory-Limited)
score = ScoreFunction(tc_weight=0.0, sc_weight=1.0, sc_target=25.0)
Use when:
- Memory is constrained (e.g., 8GB GPU)
- Can afford longer computation
- Embedded systems
Balanced (Most Common)
score = ScoreFunction(tc_weight=1.0, sc_weight=1.0, sc_target=28.0)
Use when:
- Both time and space matter
- Setting sc_target to available memory
- General-purpose optimization
Example: Comparing Orders
# Greedy result
tree_greedy = optimize_code(ixs, out, sizes, GreedyMethod())
comp_greedy = tree_greedy.complexity(ixs, sizes)
# TreeSA result
tree_sa = optimize_code(ixs, out, sizes, TreeSA.fast())
comp_sa = tree_sa.complexity(ixs, sizes)
# Compare
print(f"Greedy - Time: 2^{comp_greedy.tc:.2f}, Space: 2^{comp_greedy.sc:.2f}")
print(f"TreeSA - Time: 2^{comp_sa.tc:.2f}, Space: 2^{comp_sa.sc:.2f}")
# Speedup
speedup = 2 ** (comp_greedy.tc - comp_sa.tc)
print(f"TreeSA is {speedup:.1f}x faster")
Next Steps
- Score Function Configuration - Optimize for your hardware
- GPU Optimization - Maximize GPU performance
- Performance Benchmarks - Real-world performance data
Algorithms
omeco provides two main optimization algorithms with different speed-quality trade-offs.
Algorithm Comparison
| Algorithm | Speed | Quality | Use Case |
|---|---|---|---|
| GreedyMethod | Fast (seconds) | Good | Quick optimization, large networks |
| TreeSA | Slower (minutes) | Better | High-quality solutions, important workloads |
Quick Guide
Use GreedyMethod when:
- You need results quickly
- Network has <100 tensors
- Greedy result is good enough
Use TreeSA when:
- You have time to optimize
- Need best possible solution
- Greedy result is too slow/large
- Working with complex tensor networks
Topics
- Greedy Method - Fast O(n² log n) optimization
- TreeSA - Simulated annealing for quality
- Algorithm Comparison - Detailed benchmarks
Next Steps
Choose an algorithm to learn more, or see the comparison for benchmarks.
Greedy Method
Fast greedy algorithm for tensor contraction order optimization.
How It Works
The greedy algorithm repeatedly contracts the pair of tensors with minimum cost until one tensor remains.
Algorithm:
while more than one tensor remains:
1. Consider all pairs of tensors
2. Compute cost of contracting each pair
3. Contract the pair with minimum cost
4. Replace the pair with their contraction result
Time Complexity: O(n² log n) where n is the number of tensors.
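A heavily simplified sketch of this loop (it ignores output indices and omeco's actual cost model, but shows the shape of the algorithm):

from math import prod

def greedy_order(ixs, sizes):
    tensors = [set(ix) for ix in ixs]
    order = []
    while len(tensors) > 1:
        # cost of a pair ~ product of the union of their index sizes
        pairs = [(a, b) for a in range(len(tensors)) for b in range(a + 1, len(tensors))]
        a, b = min(pairs, key=lambda p: prod(sizes[i] for i in tensors[p[0]] | tensors[p[1]]))
        order.append((a, b))  # positions refer to the current working list
        merged = tensors[a] ^ tensors[b]  # simplified result indices (symmetric difference)
        tensors = [t for i, t in enumerate(tensors) if i not in (a, b)] + [merged]
    return order

print(greedy_order([[0, 1], [1, 2], [2, 3]], {0: 100, 1: 10, 2: 20, 3: 5}))  # contracts B, C first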
Basic Usage
Python
from omeco import optimize_code, GreedyMethod
# Deterministic greedy (default optimizer)
tree = optimize_code(ixs, out, sizes)
# Or explicitly
tree = optimize_code(ixs, out, sizes, GreedyMethod())
Rust
use omeco::{EinCode, GreedyMethod, optimize_code};

let method = GreedyMethod::default();
let tree = optimize_code(&code, &sizes, &method)?;
Stochastic Variants
Add randomness to explore more solutions:
# alpha: controls randomness (0 = deterministic, 1 = fully random)
# temperature: softmax temperature for selection
method = GreedyMethod(alpha=0.5, temperature=1.0)
tree = optimize_code(ixs, out, sizes, method)
Parameters:
- `alpha=0.0` (default): always pick the minimum-cost pair (deterministic)
- `alpha=0.5`: a mix of greedy and random choices
- `alpha=1.0`: uniform random selection
- `temperature`: controls the selection distribution (higher = more random)
Performance Characteristics
Advantages:
- ⚡ Very fast: seconds for 100+ tensors
- 🎯 Deterministic by default (reproducible)
- 📈 Scales well to large networks
- 💡 Good baseline for most cases
Limitations:
- 🎲 Can get stuck in local optima
- 🔍 Myopic: only considers immediate cost
- 📊 May miss global optimal solution
Example: Matrix Chain
from omeco import optimize_code
# A[100×10] × B[10×20] × C[20×5]
ixs = [[0, 1], [1, 2], [2, 3]]
out = [0, 3]
sizes = {0: 100, 1: 10, 2: 20, 3: 5}
tree = optimize_code(ixs, out, sizes)
print(tree)
Output:
ab, bd -> ad
├─ tensor_0
└─ bc, cd -> bd
   ├─ tensor_1
   └─ tensor_2
This contracts B×C first (cost: 10×20×5 = 1,000), then A×(BC) (cost: 100×10×5 = 5,000).
Total: 6,000 FLOPs.
Alternative order (A×B)×C would cost: 100×10×20 + 100×20×5 = 30,000 FLOPs (5x worse!).
When to Use
✅ Use GreedyMethod when:
- You need quick results (prototyping, iteration)
- Network is straightforward (chains, grids)
- Memory/time constraints are relaxed
❌ Consider TreeSA instead when:
- Greedy result is too slow/large
- Network is complex (irregular graphs)
- You have time for better optimization
- Result will be used repeatedly
Tips
1. Start with the default: `GreedyMethod()` works for most cases.
2. Try stochastic runs for variety:
   # Run 10 times with randomness, pick the best
   best_tree = None
   best_complexity = float('inf')
   for _ in range(10):
       tree = optimize_code(ixs, out, sizes, GreedyMethod(alpha=0.3))
       complexity = tree.complexity(ixs, sizes)
       if complexity.tc < best_complexity:
           best_tree = tree
           best_complexity = complexity.tc
3. Combine with slicing if memory is tight:
   tree = optimize_code(ixs, out, sizes)
   if tree.complexity(ixs, sizes).sc > 25.0:
       sliced = slice_code(tree, ixs, sizes, TreeSASlicer.fast())
Next Steps
- TreeSA - For higher quality solutions
- Algorithm Comparison - Benchmark results
- Quick Start - Complete examples
TreeSA
Simulated annealing optimizer for high-quality tensor contraction orders.
How It Works
TreeSA uses simulated annealing to explore the space of contraction trees:
- Start with a random or greedy tree
- Repeatedly apply random mutations (subtree rotations, swaps)
- Accept better mutations, sometimes accept worse ones (to escape local optima)
- Gradually decrease “temperature” to converge
Result: Often finds significantly better orders than greedy.
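The accept/reject step is the standard Metropolis criterion. Conceptually, with β = 1/temperature as in the betas parameter described below:

import math, random

def accept(delta_score, beta):
    # always accept improvements; accept a worse tree with probability exp(-beta * delta)
    return delta_score <= 0 or random.random() < math.exp(-beta * delta_score)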
Basic Usage
Fast Preset (Recommended)
from omeco import optimize_code, TreeSA
# Quick optimization with sensible defaults
tree = optimize_code(ixs, out, sizes, TreeSA.fast())
Custom Configuration
from omeco import optimize_code, TreeSA, ScoreFunction
# More thorough search
score = ScoreFunction(sc_target=25.0)
optimizer = TreeSA(
ntrials=10, # Number of independent runs
niters=50, # Iterations per run
score=score # Custom scoring function
)
tree = optimize_code(ixs, out, sizes, optimizer)
Rust
use omeco::{TreeSA, ScoreFunction, optimize_code};

let score = ScoreFunction::default();
let optimizer = TreeSA::fast(score);
let tree = optimize_code(&code, &sizes, &optimizer)?;
Parameters
ntrials (default: 10)
Number of independent optimization runs. The best result is returned.
- Higher = better quality (more exploration)
- Lower = faster
- Typical range: 5-20
niters (default: 50)
Number of iterations per trial.
- Higher = better convergence
- Lower = faster
- Typical range: 20-100
betas (optional)
Temperature schedule for simulated annealing.
- Default: logarithmic schedule from 1 to 40
- Custom: `[β₁, β₂, ..., βₙ]` where β = 1/temperature
- Higher β = less randomness (exploitation)
- Lower β = more randomness (exploration)
score (optional)
Custom ScoreFunction for hardware-specific optimization.
# GPU optimization (see GPU Optimization Guide)
score = ScoreFunction(
tc_weight=1.0,
sc_weight=1.0,
rw_weight=10.0,
sc_target=30.0
)
optimizer = TreeSA(ntrials=5, niters=50, score=score)
Performance Characteristics
Advantages:
- 🏆 Higher quality solutions than greedy
- 🔍 Explores global search space
- ⚙️ Configurable time-quality trade-off
- 🎯 Often finds near-optimal solutions
Limitations:
- ⏱️ Slower than greedy (minutes vs seconds)
- 🎲 Non-deterministic (different runs give different results)
- 🔧 Requires parameter tuning for best results
Example Comparison
from omeco import optimize_code, GreedyMethod, TreeSA
# Same problem, different optimizers
ixs = [[0, 1, 2], [2, 3, 4], [4, 5, 6], [6, 7, 0]] # Cycle
out = [1, 3, 5, 7]
sizes = {i: 2 for i in range(8)}
# Greedy
tree_greedy = optimize_code(ixs, out, sizes, GreedyMethod())
comp_greedy = tree_greedy.complexity(ixs, sizes)
# TreeSA
tree_sa = optimize_code(ixs, out, sizes, TreeSA.fast())
comp_sa = tree_sa.complexity(ixs, sizes)
print(f"Greedy - tc: 2^{comp_greedy.tc:.2f}, sc: 2^{comp_greedy.sc:.2f}")
print(f"TreeSA - tc: 2^{comp_sa.tc:.2f}, sc: 2^{comp_sa.sc:.2f}")
# Speedup
speedup = 2 ** (comp_greedy.tc - comp_sa.tc)
print(f"TreeSA is {speedup:.1f}x faster")
Typical output:
Greedy - tc: 2^16.00, sc: 2^12.00
TreeSA - tc: 2^14.58, sc: 2^11.00
TreeSA is 2.7x faster
When to Use
✅ Use TreeSA when:
- Greedy result is too slow/uses too much memory
- Network is complex (cycles, irregular graphs)
- Result will be used many times (worth optimizing once)
- You have minutes to spare for optimization
❌ Stick with GreedyMethod when:
- Need results in seconds
- Network is simple (chains, small grids)
- Iterating/prototyping
- Greedy is already good enough
Optimization Workflow
# 1. Quick baseline with greedy
tree_greedy = optimize_code(ixs, out, sizes, GreedyMethod())
comp_greedy = tree_greedy.complexity(ixs, sizes)
print(f"Greedy: tc={comp_greedy.tc:.2f}, sc={comp_greedy.sc:.2f}")
# 2. If not satisfactory, try TreeSA
if comp_greedy.sc > 28.0: # Too much memory
score = ScoreFunction(sc_target=28.0, sc_weight=2.0)
tree_sa = optimize_code(ixs, out, sizes, TreeSA(ntrials=10, score=score))
comp_sa = tree_sa.complexity(ixs, sizes)
print(f"TreeSA: tc={comp_sa.tc:.2f}, sc={comp_sa.sc:.2f}")
Tuning Parameters
For Speed
# Fast TreeSA (2-3x slower than greedy)
TreeSA(ntrials=2, niters=20)
For Quality
# Thorough TreeSA (10-20x slower than greedy)
TreeSA(ntrials=20, niters=100)
For Memory-Constrained
# Prioritize space over time
score = ScoreFunction(tc_weight=0.1, sc_weight=10.0, sc_target=25.0)
TreeSA(ntrials=10, niters=50, score=score)
Advanced: Custom Temperature Schedule
# Custom annealing schedule
import numpy as np
# Slow cooling for thorough search
betas = np.logspace(0, 2, 50) # 1 to 100
optimizer = TreeSA(ntrials=5, betas=betas.tolist())
Next Steps
- Score Function Configuration - Optimize for your hardware
- Algorithm Comparison - Detailed benchmarks
- Slicing Strategy - Reduce memory further
Algorithm Comparison
Detailed comparison of GreedyMethod vs TreeSA.
Quick Summary
| Metric | GreedyMethod | TreeSA |
|---|---|---|
| Speed | Seconds | Minutes |
| Solution Quality | Good | Better |
| Deterministic | Yes (default) | No |
| Scalability | 100+ tensors | 50-100 tensors |
| Tuning Required | Minimal | Some |
Performance Benchmarks
Based on benchmark results from the Julia implementation (OMEinsumContractionOrders.jl):
3-Regular Graph (50 vertices, bond dim 2)
| Algorithm | Time Complexity (tc) | Space Complexity (sc) | Runtime |
|---|---|---|---|
| GreedyMethod | 2^34.1 | 2^11.4 | 0.05s |
| TreeSA (fast) | 2^33.8 | 2^10.9 | 2.1s |
| TreeSA (thorough) | 2^33.5 | 2^10.8 | 8.4s |
Result: TreeSA finds a solution that executes ~1.5x faster and uses ~1.5x less memory, at the cost of 40-170x longer optimization time.
Matrix Chain (20 matrices)
| Algorithm | Time Complexity | Space Complexity | Runtime |
|---|---|---|---|
| GreedyMethod | 2^18.2 | 2^10.5 | 0.01s |
| TreeSA | 2^17.9 | 2^10.3 | 0.3s |
Result: TreeSA's solution executes ~1.2x faster, using 30x more optimization time.
Speed vs Quality Trade-off
Quality
↑
│
TreeSA │
(thorough) │
│
TreeSA │
(fast) │
│
Greedy │
(stochastic)│
│
Greedy │
(default) │
│
└──────────────→ Speed
When Each Algorithm Shines
GreedyMethod Excels
Problem Type: Chains and trees
# Chain: A×B×C×D×E
ixs = [[0,1], [1,2], [2,3], [3,4]]
out = [0,4]
# Greedy finds optimal order in O(n²)
tree = optimize_code(ixs, out, sizes, GreedyMethod())
Characteristics:
- Sequential structure
- No cycles
- Clear optimal pairing strategy
TreeSA Excels
Problem Type: Cycles and complex graphs
# Cycle graph (much harder!)
ixs = [[0,1,2], [2,3,4], [4,5,6], [6,7,0]]
out = [1,3,5,7]
# TreeSA explores better orderings
tree = optimize_code(ixs, out, sizes, TreeSA.fast())
Characteristics:
- Irregular connectivity
- Hyperedges (indices in 3+ tensors)
- Multiple reasonable orderings
Real-World Example: Quantum Circuit
Problem: 30-qubit quantum circuit, 50 gates
# Greedy result
tree_greedy = optimize_code(circuit_ixs, out, sizes, GreedyMethod())
comp_greedy = contraction_complexity(tree_greedy, circuit_ixs, sizes)
# tc: 2^42.3, sc: 2^28.1, time: 0.2s
# TreeSA result
tree_sa = optimize_code(circuit_ixs, out, sizes, TreeSA(ntrials=10, niters=50))
comp_sa = contraction_complexity(tree_sa, circuit_ixs, sizes)
# tc: 2^40.1, sc: 2^26.8, time: 15s
# Improvement
speedup = 2 ** (comp_greedy.tc - comp_sa.tc) # 4.6x faster execution
memory_reduction = 2 ** (comp_greedy.sc - comp_sa.sc) # 2.5x less memory
Verdict: TreeSA took 75x longer to optimize but found a solution that’s 4.6x faster to execute. If you’ll run the circuit 100+ times, TreeSA pays for itself.
Optimization Time vs Problem Size
GreedyMethod Scaling
| Tensors | Optimization Time |
|---|---|
| 10 | <0.01s |
| 50 | 0.05s |
| 100 | 0.2s |
| 500 | 4s |
Complexity: O(n² log n)
TreeSA Scaling
| Tensors | TreeSA.fast() | TreeSA (default) |
|---|---|---|
| 10 | 0.1s | 0.5s |
| 50 | 2s | 10s |
| 100 | 15s | 60s |
| 200 | 90s | 300s |
Complexity: O(ntrials × niters × n²)
Decision Guide
Start
│
├─ Need results in <1 second? ──→ GreedyMethod
│
├─ Problem is a chain/tree? ──→ GreedyMethod
│
├─ Greedy result good enough? ──→ GreedyMethod
│
├─ Have >1 minute to optimize? ──→ TreeSA
│
├─ Complex graph structure? ──→ TreeSA
│
└─ Need best possible solution? ──→ TreeSA (thorough)
Hybrid Approach
Use both in sequence:
# 1. Quick baseline with greedy
tree_greedy = optimize_code(ixs, out, sizes, GreedyMethod())
comp_greedy = tree_greedy.complexity(ixs, sizes)
print(f"Greedy baseline: tc={comp_greedy.tc:.2f}")
# 2. If not good enough, refine with TreeSA
if comp_greedy.tc > 35.0: # Too slow
print("Refining with TreeSA...")
tree_sa = optimize_code(ixs, out, sizes, TreeSA.fast())
comp_sa = tree_sa.complexity(ixs, sizes)
print(f"TreeSA result: tc={comp_sa.tc:.2f}")
improvement = 2 ** (comp_greedy.tc - comp_sa.tc)
print(f"Improvement: {improvement:.1f}x speedup")
Summary Recommendations
| Scenario | Algorithm | Configuration |
|---|---|---|
| Quick prototyping | GreedyMethod | Default |
| Production (simple) | GreedyMethod | Default |
| Production (complex) | TreeSA | TreeSA.fast() |
| Critical optimization | TreeSA | ntrials=20, niters=100 |
| Memory-constrained | TreeSA + Slicing | Custom ScoreFunction |
Next Steps
- Greedy Method - Details on greedy algorithm
- TreeSA - Details on simulated annealing
- Score Function - Tune for your hardware
Score Function Configuration
Configure omeco’s optimization for different hardware and use cases.
What is the Score Function?
The ScoreFunction controls how the optimizer balances three objectives:
- tc (time complexity): Total FLOPs
- sc (space complexity): Peak memory usage
- rwc (read-write complexity): Memory I/O operations
Formula:
score = tc_weight × 2^tc + rw_weight × 2^rwc + sc_weight × max(0, 2^sc − 2^sc_target)
Lower score is better; the optimizer tries to minimize this score.
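Written out as plain Python (mirroring the documented formula and the documented defaults):

def score(tc, sc, rwc, tc_weight=1.0, sc_weight=1.0, rw_weight=0.0, sc_target=20.0):
    # defaults mirror ScoreFunction's documented defaults
    return (tc_weight * 2 ** tc
            + rw_weight * 2 ** rwc
            + sc_weight * max(0.0, 2 ** sc - 2 ** sc_target))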
Basic Usage
Default (Balanced)
from omeco import ScoreFunction
# Balanced optimization (works for most cases)
score = ScoreFunction()
# Equivalent to:
# ScoreFunction(tc_weight=1.0, sc_weight=1.0, rw_weight=0.0, sc_target=20.0)
Custom Weights
# Prioritize memory over time
score = ScoreFunction(
tc_weight=1.0, # Time matters
sc_weight=2.0, # Memory matters 2x more
rw_weight=0.0, # Ignore I/O (CPU)
sc_target=25.0 # Target 2^25 elements (~256MB for float64)
)
Using with Optimizers
from omeco import TreeSA, optimize_code
score = ScoreFunction(sc_target=28.0)
tree = optimize_code(ixs, out, sizes, TreeSA(score=score))
Hardware-Specific Configuration
CPU Optimization
Characteristics:
- Memory bandwidth is not the bottleneck
- Balance time and space
- Ignore read-write complexity
score = ScoreFunction(
tc_weight=1.0,
sc_weight=1.0,
rw_weight=0.0, # ← CPU: I/O is not a bottleneck
sc_target=28.0 # ~2GB (2^28 × 8 bytes for float64)
)
Calculate sc_target:
import math
# For 16GB RAM, reserve half for tensors
available_gb = 8
bytes_per_element = 8 # float64
sc_target = math.log2(available_gb * 1024**3 / bytes_per_element)
# sc_target ≈ 30.0 (8GB)
GPU Optimization
Note: GPU memory bandwidth is often the bottleneck, not compute.
score = ScoreFunction(
tc_weight=1.0,
sc_weight=1.0,
rw_weight=10.0, # See GPU Optimization Guide for tuning
sc_target=30.0 # ~4GB of float32 tensors (2^30 × 4 bytes), conservative for an 8GB GPU
)
For GPU optimization: See GPU Optimization Guide for rw_weight tuning methodology.
Calculate sc_target for GPU:
import math
# NVIDIA RTX 3090: 24GB VRAM
gpu_gb = 24
bytes_per_element = 4 # float32 (most common on GPU)
sc_target = math.log2(gpu_gb * 1024**3 / bytes_per_element)
# sc_target ≈ 32.5
# Be conservative (leave room for framework overhead)
sc_target = 32.0 # ~16GB usable
Decision Guide
Choose based on your bottleneck:
- Default (CPU, balanced): use `ScoreFunction()` with defaults
  - Works for most CPU scenarios
  - Balances time and memory
- GPU: see the GPU Optimization Guide for the complete rw_weight tuning methodology
- Memory constrained: increase `sc_weight` and lower `sc_target`
  - Example: `ScoreFunction(sc_weight=3.0, sc_target=25.0)`
  - Accepts slower execution to fit in memory
- Need slicing: see the Slicing Strategy Guide
  - Use when the contraction exceeds available memory
  - Trades time for space
Advanced Configuration
Memory-Constrained Optimization
When memory is the primary constraint:
# Heavily penalize exceeding target
score = ScoreFunction(
tc_weight=0.1, # Time is secondary
sc_weight=10.0, # Memory is critical
rw_weight=0.0,
sc_target=25.0 # Hard limit: 256MB
)
# Use with TreeSA for best results
optimizer = TreeSA(ntrials=10, niters=50, score=score)
tree = optimize_code(ixs, out, sizes, optimizer)
Speed-Critical Optimization
When execution speed is paramount:
# Minimize time complexity only
score = ScoreFunction(
tc_weight=1.0,
sc_weight=0.0, # Ignore memory
rw_weight=0.0,
sc_target=float('inf') # No memory limit
)
Hybrid CPU+GPU
For heterogeneous systems:
# Moderate I/O penalty (between CPU and GPU)
score = ScoreFunction(
tc_weight=1.0,
sc_weight=1.0,
rw_weight=10.0, # Moderate I/O cost
sc_target=29.0 # 4GB limit
)
sc_target Reference
| Memory | float32 | float64 | sc_target |
|---|---|---|---|
| 256 MB | 64M elements | 32M elements | 25.0-26.0 |
| 1 GB | 256M | 128M | 27.0-28.0 |
| 4 GB | 1B | 512M | 29.0-30.0 |
| 8 GB | 2B | 1B | 30.0-31.0 |
| 16 GB | 4B | 2B | 31.0-32.0 |
| 32 GB | 8B | 4B | 32.0-33.0 |
Tuning Workflow
1. Start with defaults:
   tree = optimize_code(ixs, out, sizes)
   complexity = tree.complexity(ixs, sizes)
   print(f"tc: {complexity.tc:.2f}, sc: {complexity.sc:.2f}")
2. Identify the bottleneck:
   - If `sc` is too high → increase `sc_weight` or lower `sc_target`
   - If `tc` is too high → try TreeSA with the default score
   - If running on GPU → see the GPU Optimization Guide
3. Adjust and re-optimize:
   score = ScoreFunction(sc_weight=2.0, sc_target=28.0)
   tree = optimize_code(ixs, out, sizes, TreeSA(score=score))
4. Verify the improvement:
   new_complexity = tree.complexity(ixs, sizes)
   print(f"New tc: {new_complexity.tc:.2f}, sc: {new_complexity.sc:.2f}")
Examples
See examples/score_function_examples.py for complete examples:
- CPU optimization
- GPU optimization (experimental rw_weight tuning)
- Memory-limited environments
- Dynamic sc_target calculation
Next Steps
- GPU Optimization - Maximize GPU performance
- Slicing Strategy - Reduce memory with slicing
- Complexity Metrics - Understand tc, sc, rwc
GPU Optimization
Maximize performance when running tensor contractions on GPUs.
The GPU Bottleneck
GPUs have 10-100x higher compute-to-memory-bandwidth ratio than CPUs. Memory I/O is the bottleneck.
Configuration
from omeco import ScoreFunction, TreeSA, optimize_code
# GPU optimization
score = ScoreFunction(
tc_weight=1.0,
sc_weight=1.0,
rw_weight=10.0, # Tune based on profiling (see below)
sc_target=30.0 # Your GPU memory limit
)
tree = optimize_code(ixs, out, sizes, TreeSA(score=score))
Tuning rw_weight:
There is no established best value; tune empirically:
- Start with `rw_weight=1.0`
- Try `10.0`, then `20.0` or higher
- Profile actual GPU execution time
- Adjust based on the results
Range: 0.1 to >20 depending on GPU model, tensor sizes, and precision.
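A simple sweep makes the tuning loop concrete; here profile_gpu_run stands in for your own timing harness (a hypothetical helper), and ixs/out/sizes are your network definition:

from omeco import ScoreFunction, TreeSA, optimize_code

for rw in [0.1, 1.0, 10.0, 20.0]:
    score = ScoreFunction(tc_weight=1.0, sc_weight=1.0, rw_weight=rw, sc_target=32.0)
    tree = optimize_code(ixs, out, sizes, TreeSA(score=score))
    print(rw, profile_gpu_run(tree))  # keep the rw_weight with the fastest measured run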
Calculate sc_target for Your GPU
import math
# NVIDIA A100: 40GB
gpu_gb = 40
bytes_per_element = 4 # float32
# Maximum usable memory (leave ~20% for overhead)
usable_gb = gpu_gb * 0.8
sc_target = math.log2(usable_gb * 1024**3 / bytes_per_element)
# sc_target ≈ 32.0
score = ScoreFunction(
tc_weight=1.0,
sc_weight=1.0,
rw_weight=10.0,
sc_target=sc_target
)
GPU Memory Reference
| GPU | VRAM | sc_target (float32) |
|---|---|---|
| RTX 3060 | 12GB | 31.0 |
| RTX 3090 | 24GB | 32.2 |
| A100 (40GB) | 40GB | 32.9 |
| A100 (80GB) | 80GB | 33.9 |
| H100 | 80GB | 33.9 |
When It Matters
- High impact: complex networks (quantum circuits, GNNs, irregular contractions)
- Low impact: simple chains, matrix multiplications, small networks (<10 tensors)
Common Mistakes
- ❌ Using `rw_weight=0.0` (the CPU default) on GPU
- ❌ Wrong `sc_target` (e.g., using CPU memory size for a GPU)
- ❌ Using GreedyMethod instead of TreeSA (greedy doesn't optimize for I/O)
Checklist
- Calculate `sc_target` from GPU memory (float32: 4 bytes/element, reserve 20% for overhead)
- Set `rw_weight` (start with 1.0, try 10.0 and 20.0, profile)
- Use the TreeSA optimizer (not greedy)
Next Steps
- Score Function Configuration - Full parameter guide
- Slicing Strategy - Reduce memory further
- PyTorch Integration - Use with PyTorch
Slicing Strategy
Reduce memory usage by slicing tensor indices, trading time for space.
What is Slicing?
Slicing breaks a tensor contraction into smaller chunks by looping over one or more indices.
Example:
# Original: C[i,k] = Σⱼ A[i,j] * B[j,k]
# Too large: j has size 10000
# Sliced: loop over j, one value at a time
import numpy as np
C = np.zeros((i, k))
for j_val in range(10000):
    C += np.outer(A[:, j_val], B[j_val, :])  # accumulate one rank-1 slice
Trade-off:
- ✅ Memory reduced: Only store one slice at a time
- ❌ Time increased: Repeat computation for each slice
When to Use Slicing
Use slicing when:
- Contraction exceeds available memory
sc(space complexity) is too high- Can’t reduce memory by changing contraction order alone
Decision flowchart:
Optimize contraction order
↓
Check space complexity
↓
sc > sc_target?
├─ No → Done
└─ Yes → Apply slicing
Basic Usage
Automatic Slicing
from omeco import optimize_code, slice_code, TreeSASlicer, ScoreFunction
# 1. Optimize contraction order
tree = optimize_code(ixs, out, sizes)
# 2. Check if memory is too large
complexity = tree.complexity(ixs, sizes)
if complexity.sc > 28.0: # More than 2^28 elements (~2GB for float64)
# 3. Slice to reduce memory
score = ScoreFunction(sc_target=28.0)
slicer = TreeSASlicer.fast(score=score)
sliced = slice_code(tree, ixs, sizes, slicer)
# 4. Verify reduction
sliced_comp = sliced.complexity(ixs, sizes)
print(f"Original: 2^{complexity.sc:.2f} elements")
print(f"Sliced: 2^{sliced_comp.sc:.2f} elements")
print(f"Sliced indices: {sliced.slicing()}")
Manual Slicing
Specify which indices to slice:
# Force slicing on specific indices
slicer = TreeSASlicer.fast(fixed_slices=[1, 2])
sliced = slice_code(tree, ixs, sizes, slicer)
# Verify that indices 1 and 2 are sliced
assert 1 in sliced.slicing()
assert 2 in sliced.slicing()
TreeSASlicer Configuration
Fast Preset
# Quick slicing optimization
slicer = TreeSASlicer.fast(score=ScoreFunction(sc_target=25.0))
sliced = slice_code(tree, ixs, sizes, slicer)
Custom Configuration
from omeco import TreeSASlicer, ScoreFunction
score = ScoreFunction(
tc_weight=0.1, # Accept higher time
sc_weight=10.0, # Prioritize space
sc_target=25.0 # Target 256MB
)
slicer = TreeSASlicer(
ntrials=10, # Number of optimization runs
niters=50, # Iterations per run
fixed_slices=[1], # Force slicing on index 1
score=score
)
sliced = slice_code(tree, ixs, sizes, slicer)
Understanding the Trade-off
Example: Large Matrix Multiplication
# A[10000 × 10000] × B[10000 × 10000]
ixs = [[0, 1], [1, 2]]
out = [0, 2]
sizes = {0: 10000, 1: 10000, 2: 10000}
# Without slicing
tree = optimize_code(ixs, out, sizes)
comp = tree.complexity(ixs, sizes)
# tc ≈ 39.9 (10^12 FLOPs)
# sc ≈ 26.6 (100M elements = 800MB for float64)
# With slicing on index 1 (middle dimension)
slicer = TreeSASlicer.fast(fixed_slices=[1])
sliced = slice_code(tree, ixs, sizes, slicer)
sliced_comp = sliced.complexity(ixs, sizes)
# tc ≈ 39.9 (same total FLOPs, distributed over slices)
# sc ≈ 23.3 (10M elements = 80MB for float64)
# Result: 10x memory reduction, same total time (but distributed)
Choosing Which Indices to Slice
Automatic Selection
TreeSASlicer automatically chooses which indices to slice:
# Let the optimizer decide
slicer = TreeSASlicer.fast(score=ScoreFunction(sc_target=25.0))
sliced = slice_code(tree, ixs, sizes, slicer)
print(f"Automatically sliced indices: {sliced.slicing()}")
Algorithm: Tries slicing different indices and picks the combination that best satisfies sc_target while minimizing tc increase.
Manual Selection Guidelines
Good candidates for slicing:
- Indices with large dimensions
- Indices that appear in many tensors (high degree)
- Indices not in the output (no outer loop needed)
Poor candidates:
- Output indices (can’t slice without changing result structure)
- Small dimensions (little memory benefit)
- Indices that would cause many small intermediate tensors
Advanced: Slicing with GPU
For GPU workloads, consider both memory and I/O:
# GPU slicing: balance memory and I/O
score = ScoreFunction(
tc_weight=1.0,
sc_weight=2.0, # Space is critical on GPU
rw_weight=0.1, # Experimental: tune empirically
sc_target=30.0 # 8GB GPU limit
)
slicer = TreeSASlicer(
ntrials=10,
niters=50,
score=score
)
sliced = slice_code(tree, ixs, sizes, slicer)
Note: Slicing may increase rwc (read-write complexity) because data is loaded multiple times. The optimizer balances this with the rw_weight parameter.
Example Workflow
from omeco import optimize_code, slice_code
from omeco import TreeSA, TreeSASlicer, ScoreFunction
# Problem setup
ixs = [[0,1,2], [2,3,4], [4,5,6], [6,7,0]] # Cycle graph
out = [1,3,5,7]
sizes = {i: 32 for i in range(8)}
# Step 1: Optimize contraction order
tree = optimize_code(ixs, out, sizes, TreeSA.fast())
comp = tree.complexity(ixs, sizes)
print(f"Optimized: sc = 2^{comp.sc:.2f}")
# Step 2: Check if slicing needed
if comp.sc > 20.0: # More than 2^20 elements (~8MB for float64)
print("Applying slicing...")
# Step 3: Slice
slicer = TreeSASlicer.fast(score=ScoreFunction(sc_target=18.0))
sliced = slice_code(tree, ixs, sizes, slicer)
# Step 4: Verify
sliced_comp = sliced.complexity(ixs, sizes)
print(f"Sliced: sc = 2^{sliced_comp.sc:.2f}")
print(f"Sliced indices: {sliced.slicing()}")
# Calculate overhead
time_overhead = 2 ** (sliced_comp.tc - comp.tc)
memory_reduction = 2 ** (comp.sc - sliced_comp.sc)
print(f"Time overhead: {time_overhead:.1f}x")
print(f"Memory reduction: {memory_reduction:.1f}x")
Multiple Slices
Slice multiple indices for extreme memory reduction:
# Very tight memory constraint
score = ScoreFunction(
tc_weight=0.01, # Accept much higher time
sc_weight=100.0, # Memory is critical
sc_target=15.0 # Only 2^15 ≈ 32K elements (~256KB for float64) available!
)
slicer = TreeSASlicer(ntrials=20, niters=100, score=score)
sliced = slice_code(tree, ixs, sizes, slicer)
# May slice multiple indices
print(f"Sliced {len(sliced.slicing())} indices: {sliced.slicing()}")
Performance Impact
| Scenario | Memory Reduction | Time Overhead | When to Use |
|---|---|---|---|
| Slice 1 index | 2-10x | 1-2x | Moderate memory constraint |
| Slice 2 indices | 10-100x | 2-5x | Severe memory constraint |
| Slice 3+ indices | 100-1000x | 5-20x | Extreme memory constraint |
Tips
1. Optimize the order first, slice second:
   # Good workflow
   tree = optimize_code(ixs, out, sizes, TreeSA.fast())
   sliced = slice_code(tree, ixs, sizes, TreeSASlicer.fast())
2. Set a realistic sc_target:
   # Target 80% of available memory (float64)
   available_gb = 8
   sc_target = math.log2(available_gb * 0.8 * 1024**3 / 8)
3. Profile actual memory usage: theoretical and actual memory may differ, so always test with real data.
4. Consider the time-memory trade-off: for a one-time computation, high memory may be acceptable; for repeated computation, slice to fit memory.
Next Steps
- Score Function Configuration - Tune slicing parameters
- GPU Optimization - GPU-specific slicing
- Complexity Metrics - Understand sc and tc
PyTorch Integration
Use omeco to optimize tensor contractions in PyTorch.
Overview
omeco optimizes the contraction order, then you execute with PyTorch tensors.
Workflow:
- Define network structure (indices and sizes)
- Optimize with omeco → get contraction tree
- Execute tree with PyTorch tensors
Basic Example
import torch
from omeco import optimize_code
# Define tensor network structure
ixs = [[0, 1], [1, 2], [2, 3]] # Matrix chain
out = [0, 3]
sizes = {0: 100, 1: 200, 2: 50, 3: 100}
# Optimize contraction order
tree = optimize_code(ixs, out, sizes)
print(f"Optimized tree:\n{tree}")
# Create PyTorch tensors
tensors = [
torch.randn(100, 200), # A[0,1]
torch.randn(200, 50), # B[1,2]
torch.randn(50, 100), # C[2,3]
]
# Execute contraction (see full example below)
result = execute_tree(tree.to_dict(), tensors)
Complete Example
See examples/pytorch_tensor_network_example.py for a full working example with:
- Tree traversal and execution
- Verification against direct einsum
- 4×4 grid tensor network
- Memory reduction via slicing
Converting Tree to Dict
tree_dict = tree.to_dict()
# Dictionary structure:
# Leaf: {"tensor_index": int}
# Node: {"args": [children], "eins": {"ixs": [[int]], "iy": [int]}}
Executing the Tree
def execute_tree(tree_dict, tensors):
"""Recursively execute contraction tree."""
if "tensor_index" in tree_dict:
# Leaf node - return tensor
return tensors[tree_dict["tensor_index"]]
# Internal node - execute children then contract
args = [execute_tree(child, tensors) for child in tree_dict["args"]]
eins = tree_dict["eins"]
# Build einsum string
ixs_str = [indices_to_str(ix) for ix in eins["ixs"]]
iy_str = indices_to_str(eins["iy"])
einsum_eq = f"{','.join(ixs_str)}->{iy_str}"
# Execute with PyTorch
return torch.einsum(einsum_eq, *args)
def indices_to_str(indices):
"""Convert integer indices to letters."""
return ''.join(chr(ord('a') + i) for i in indices)
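A quick correctness check against a direct einsum, using the tensors and tree defined above:

ref = torch.einsum("ab,bc,cd->ad", *tensors)
out = execute_tree(tree.to_dict(), tensors)
print(torch.allclose(ref, out))  # True (up to floating-point tolerance)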
GPU Execution
# Move tensors to GPU
tensors_gpu = [t.cuda() for t in tensors]
# Execute on GPU
result = execute_tree(tree.to_dict(), tensors_gpu)
With Autograd
# Tensors with gradients
tensors = [torch.randn(100, 200, requires_grad=True) for _ in range(3)]
# Execute
result = execute_tree(tree.to_dict(), tensors)
# Backward pass
result.sum().backward()
# Gradients available
print(tensors[0].grad)
Next Steps
- Score Function - Optimize for your hardware
- GPU Optimization - Maximize GPU performance
- Examples - Complete code
Performance Benchmarks
Real-world performance comparison with Julia implementation.
Benchmark Methodology
All benchmarks compare against OMEinsumContractionOrders.jl, the original Julia implementation.
Test Environment:
- Benchmarks run via Python bindings (PyO3)
- Configuration:
ntrials=1,niters=50-100 - See
benchmarks/directory for scripts
Rust vs Julia (TreeSA)
Comparison of TreeSA optimizer performance:
| Problem | Tensors | Indices | Rust (ms) | Julia (ms) | tc (Rust) | tc (Julia) | Speedup |
|---|---|---|---|---|---|---|---|
| chain_10 | 10 | 11 | 18.5 | 31.6 | 23.10 | 23.10 | 1.7x |
| grid_4x4 | 16 | 24 | 88.0 | 150.7 | 9.18 | 9.26 | 1.7x |
| grid_5x5 | 25 | 40 | 155.4 | 297.7 | 10.96 | 10.96 | 1.9x |
| reg3_250 | 250 | 372 | 2,435 | 5,099 | 48.00 | 47.17 | 2.1x |
Results:
- Rust is 1.7-2.1x faster than Julia for TreeSA optimization
- Both implementations find solutions with comparable time complexity (tc)
- Solution quality is nearly identical between implementations
Greedy Method Performance
GreedyMethod is extremely fast for quick optimization:
| Problem | Tensors | Time (ms) | tc | sc |
|---|---|---|---|---|
| chain_10 | 10 | 0.04 | 23.10 | 13.29 |
| grid_4x4 | 16 | 0.12 | 9.54 | 5.0 |
| grid_5x5 | 25 | 0.23 | 11.28 | 6.0 |
| reg3_250 | 250 | 7.5 | 69.00 | 47.0 |
Key Observations:
- Greedy is 100-300x faster than TreeSA
- Greedy gives good results for small problems (chain_10, grids)
- For large problems (reg3_250), TreeSA finds much better solutions:
- Greedy: tc = 69.00
- TreeSA: tc = 48.00
- Improvement: 2^(69-48) = 2^21 = 2 million times faster execution!
Python Overhead
Python bindings via PyO3 add minimal overhead:
TreeSA (grid_4x4, 100 iterations):
- Pure Rust backend: ~88ms
- Python call overhead: <1ms (~1%)
Greedy (reg3_250):
- Pure optimization: ~7.5ms
- Python overhead: <0.5ms (~6%)
Conclusion: Python overhead is negligible, especially for TreeSA optimization.
Algorithm Comparison
When to use each algorithm:
| Scenario | Recommended | Reason |
|---|---|---|
| Quick prototyping | Greedy | Sub-millisecond optimization |
| Production (< 50 tensors) | Greedy | Fast enough, good quality |
| Production (50-200 tensors) | TreeSA.fast() | Better quality, still reasonable time |
| Large networks (> 200 tensors) | TreeSA | Significant quality improvement |
| Time-critical | Greedy | Predictable fast performance |
| Execution-critical | TreeSA | Better contraction order = faster execution |
Execution Time Savings
The time spent optimizing pays off during execution:
Example: reg3_250 network
- Greedy optimization: 7.5ms
- TreeSA optimization: 2,435ms (~2.4 seconds)
- Extra optimization time: +2.4 seconds
But the execution improvement:
- Greedy tc: 69.00 → ~2^69 FLOPs
- TreeSA tc: 48.00 → ~2^48 FLOPs
- Execution speedup: 2^21 = 2 million times faster
If executing even once, TreeSA optimization is worth it!
Hardware Recommendations
| Tensor Count | RAM | CPU | Recommendation |
|---|---|---|---|
| < 20 | 4GB | Any | Greedy is sufficient |
| 20-100 | 8GB | 4+ cores | TreeSA.fast() for production |
| 100-500 | 16GB | 8+ cores | TreeSA with multiple trials |
| > 500 | 32GB+ | 16+ cores | TreeSA (may take minutes) |
Running Benchmarks
Reproduce the benchmarks yourself:
cd benchmarks
# Python (Rust via PyO3)
python benchmark_python.py
# Julia (original implementation)
julia --project=. benchmark_julia.jl
# Compare results
python -c "
import json
with open('results_rust_treesa.json') as f:
rust = json.load(f)
with open('results_julia_treesa.json') as f:
julia = json.load(f)
for problem in rust['results']:
r = rust['results'][problem]['avg_ms']
j = julia['results'][problem]['avg_ms']
print(f'{problem}: Rust {r:.1f}ms, Julia {j:.1f}ms, Speedup {j/r:.2f}x')
"
Profiling Your Code
Python Profiling
import time
from omeco import optimize_code, TreeSA
# Time optimization
start = time.time()
tree = optimize_code(ixs, out, sizes, TreeSA.fast())
opt_time = time.time() - start
# Check complexity
comp = tree.complexity(ixs, sizes)
print(f"Optimization: {opt_time:.3f}s")
print(f"Time complexity: 2^{comp.tc:.2f} = {2**comp.tc:.2e} FLOPs")
print(f"Space complexity: 2^{comp.sc:.2f} elements")
# Estimate execution time (assuming 10 GFLOP/s CPU)
execution_seconds = 2**comp.tc / 1e10
print(f"Estimated execution: {execution_seconds:.1f}s @ 10 GFLOP/s")
Rust Profiling
# CPU profiling with flamegraph
cargo install flamegraph
cargo flamegraph --example your_example
# Criterion benchmarks (if available)
cargo bench
Performance Tips
- Start with Greedy: Always try Greedy first to get a baseline
- Use TreeSA for production: The extra optimization time pays off
- Increase ntrials for critical code: More trials = better solutions
- Profile execution time: Verify that better tc actually improves runtime
- Consider memory: Use sc_target if memory is constrained
Next Steps
- Algorithm Comparison - Detailed algorithm trade-offs
- Quick Start - Get started optimizing
- Troubleshooting - Common performance issues
Troubleshooting
Common issues and solutions when using omeco.
Installation Issues
Python: pip install fails
Problem: pip install omeco fails with compilation errors.
Solutions:
1. Ensure the Rust toolchain is installed:
   curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
   rustup update
2. Update pip and setuptools:
   pip install --upgrade pip setuptools wheel
3. Install from source:
   git clone https://github.com/GiggleLiu/omeco
   cd omeco/omeco-python
   pip install maturin
   maturin develop --release
Rust: Dependency resolution errors
Problem: Cargo fails to resolve dependencies.
Solution: Update Cargo.lock and rebuild:
cargo clean
cargo update
cargo build --release
Runtime Errors
ImportError: No module named ‘omeco’
Problem: Python cannot find the omeco module after installation.
Diagnosis:
import sys
print(sys.path) # Check if package is in path
Solutions:
1. Verify the installation:
   pip show omeco
2. Check the virtual environment:
   which python  # Ensure you're using the right Python
3. Reinstall in development mode:
   cd omeco-python
   maturin develop --release
Memory Errors (OOM)
Problem: Program crashes with “Out of Memory” error during optimization.
Cause: Space complexity (sc) exceeds available memory.
Solutions:
1. Check the space complexity:
   from omeco import optimize_code
   tree = optimize_code(ixs, out, sizes)
   comp = tree.complexity(ixs, sizes)
   print(f"Space complexity: 2^{comp.sc:.2f} elements")
   # Calculate memory usage
   bytes_per_element = 8  # float64
   memory_gib = (2 ** comp.sc) * bytes_per_element / 1024**3
   print(f"Estimated memory: {memory_gib:.2f} GiB")
2. Use slicing to reduce memory:
   from omeco import slice_code, TreeSASlicer, ScoreFunction
   import math
   # Set the target memory (e.g., 8GB = sc_target ≈ 30.0 for float64)
   available_gb = 8
   sc_target = math.log2(available_gb * 1024**3 / 8)
   score = ScoreFunction(sc_target=sc_target, sc_weight=2.0)
   slicer = TreeSASlicer.fast(score=score)
   sliced = slice_code(tree, ixs, sizes, slicer)
3. Optimize with memory constraints from the start:
   from omeco import TreeSA, ScoreFunction
   score = ScoreFunction(sc_target=28.0, sc_weight=3.0)
   tree = optimize_code(ixs, out, sizes, TreeSA(score=score))
TypeError: Invalid index type
Problem: Indices must be integers, but got strings or floats.
Cause: Einstein indices in ixs and out must be integers (or strings in some contexts).
Solution: Convert indices to integers:
# Wrong
ixs = [["a", "b"], ["b", "c"]]
# Correct
ixs = [[0, 1], [1, 2]]
out = [0, 2]
sizes = {0: 10, 1: 20, 2: 10}
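If your network is naturally labeled with strings, a small relabeling pass (a sketch, not part of the library) produces the integer form the Python bindings expect:
# Map arbitrary hashable labels to consecutive integers
labels = {}
def to_int(label):
    return labels.setdefault(label, len(labels))

ixs_str = [["a", "b"], ["b", "c"]]
out_str = ["a", "c"]
sizes_str = {"a": 10, "b": 20, "c": 10}

ixs = [[to_int(l) for l in ix] for ix in ixs_str]
out = [to_int(l) for l in out_str]
sizes = {to_int(l): d for l, d in sizes_str.items()}
# ixs == [[0, 1], [1, 2]], out == [0, 2], sizes == {0: 10, 1: 20, 2: 10}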
Performance Issues
Optimization is too slow
Problem: TreeSA takes too long to find a good contraction order.
Solutions:
1. Use the faster preset:
   from omeco import TreeSA
   # Instead of the default TreeSA()
   tree = optimize_code(ixs, out, sizes, TreeSA.fast())
2. Try GreedyMethod first:
   from omeco import optimize_code, GreedyMethod
   # Much faster, good baseline
   tree = optimize_code(ixs, out, sizes)
3. Reduce TreeSA iterations:
   from omeco import TreeSA
   optimizer = TreeSA(ntrials=5, niters=20)  # Faster but lower quality
   tree = optimize_code(ixs, out, sizes, optimizer)
GPU performance is poor
Problem: Execution on GPU is slower than expected.
Diagnosis: Check if optimization used correct GPU settings:
comp = tree.complexity(ixs, sizes)
print(f"Read-write complexity: 2^{comp.rwc:.2f}")
If rwc is high, the contraction does excessive memory I/O.
Solution: Re-optimize with GPU-specific score function:
from omeco import ScoreFunction, TreeSA
# GPU optimization (see GPU Optimization Guide for rw_weight tuning)
score = ScoreFunction(
tc_weight=1.0,
sc_weight=1.0,
rw_weight=10.0,
sc_target=30.0
)
tree = optimize_code(ixs, out, sizes, TreeSA(score=score))
See GPU Optimization Guide for details.
Results differ from expected
Problem: Optimized contraction order gives different complexity than expected.
Diagnosis:
1. Print the contraction tree:
   tree = optimize_code(ixs, out, sizes)
   print(tree)  # Shows ASCII tree structure
2. Check complexity metrics:
   comp = tree.complexity(ixs, sizes)
   print(f"tc: {comp.tc:.2f}, sc: {comp.sc:.2f}, rwc: {comp.rwc:.2f}")
3. Verify the input:
   print(f"ixs: {ixs}")
   print(f"out: {out}")
   print(f"sizes: {sizes}")
   # Check all indices are covered
   all_indices = set()
   for ix in ixs:
       all_indices.update(ix)
   print(f"All indices: {all_indices}")
   print(f"Missing sizes: {all_indices - set(sizes.keys())}")
Common causes (the helper below checks the first two):
- Missing indices in the `sizes` dictionary
- Wrong output indices in `out`
- Mismatched dimensions between `ixs` and `sizes`
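A small validation helper (hypothetical, not part of omeco) catches the first two causes before you optimize:
def validate_inputs(ixs, out, sizes):
    # Collect every index that appears in any tensor
    all_indices = set()
    for ix in ixs:
        all_indices.update(ix)
    missing = all_indices - set(sizes)
    assert not missing, f"Missing sizes for indices: {missing}"
    stray = set(out) - all_indices
    assert not stray, f"Output indices not in any tensor: {stray}"

validate_inputs([[0, 1], [1, 2]], [0, 2], {0: 10, 1: 20, 2: 10})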
Algorithm-Specific Issues
GreedyMethod gives poor results
Problem: Greedy optimization produces sub-optimal contraction order.
Expected behavior: Greedy is fast but not optimal; its locally optimal choices can miss a globally better order.
Solution: Use TreeSA for better quality:
from omeco import TreeSA
tree = optimize_code(ixs, out, sizes, TreeSA.fast())
Comparison:
from omeco import optimize_code, GreedyMethod, TreeSA
# Greedy (fast, lower quality)
greedy_tree = optimize_code(ixs, out, sizes)
greedy_comp = greedy_tree.complexity(ixs, sizes)
# TreeSA (slower, higher quality)
sa_tree = optimize_code(ixs, out, sizes, TreeSA.fast())
sa_comp = sa_tree.complexity(ixs, sizes)
print(f"Greedy tc: {greedy_comp.tc:.2f}")
print(f"TreeSA tc: {sa_comp.tc:.2f}")
print(f"Improvement: {2**(greedy_comp.tc - sa_comp.tc):.1f}x faster")
TreeSA gets stuck in local minimum
Problem: Multiple runs of TreeSA give inconsistent results.
Cause: Simulated annealing is stochastic.
Solutions:
1. Increase trials:
   optimizer = TreeSA(ntrials=20, niters=50)  # More exploration
   tree = optimize_code(ixs, out, sizes, optimizer)
2. Use best-of-many:
   from omeco import TreeSA
   best_tree = None
   best_tc = float('inf')
   for _ in range(10):
       tree = optimize_code(ixs, out, sizes, TreeSA.fast())
       comp = tree.complexity(ixs, sizes)
       if comp.tc < best_tc:
           best_tc = comp.tc
           best_tree = tree
   print(f"Best tc: {best_tc:.2f}")
Integration Issues
PyTorch: Contraction gives wrong results
Problem: Executing the contraction tree with PyTorch tensors produces incorrect output.
Diagnosis:
1. Verify against a direct einsum:
   import torch
   from omeco import optimize_code
   # Direct einsum (slow but correct)
   result_ref = torch.einsum("ab,bc,cd->ad", A, B, C)
   # Optimized tree
   tree = optimize_code([[0, 1], [1, 2], [2, 3]], [0, 3], {0: 10, 1: 20, 2: 30, 3: 10})
   result_tree = execute_tree(tree.to_dict(), [A, B, C])
   # Compare
   print(f"Match: {torch.allclose(result_ref, result_tree)}")
   print(f"Max error: {(result_ref - result_tree).abs().max()}")
2. Check tensor shapes:
   for i, t in enumerate([A, B, C]):
       print(f"Tensor {i}: {t.shape}")
Common causes:
- Mismatched tensor shapes and `sizes` dictionary
- Wrong index mapping in `execute_tree`
- Incorrect einsum string construction
See PyTorch Integration Guide for complete example.
Slicing: Incorrect memory calculation
Problem: Slicing doesn't reduce memory as expected.
Diagnosis:
from omeco import slice_code
# Before slicing
comp_original = tree.complexity(ixs, sizes)
# After slicing
sliced = slice_code(tree, ixs, sizes, slicer)
comp_sliced = sliced.complexity(ixs, sizes)
print(f"Original sc: 2^{comp_original.sc:.2f}")
print(f"Sliced sc: 2^{comp_sliced.sc:.2f}")
print(f"Reduction: {2**(comp_original.sc - comp_sliced.sc):.1f}x")
print(f"Sliced indices: {sliced.slicing()}")
Solutions:
1. Increase sc_weight:
   score = ScoreFunction(sc_weight=5.0, sc_target=25.0)
   slicer = TreeSASlicer.fast(score=score)
2. Force slicing of specific indices:
   slicer = TreeSASlicer.fast(fixed_slices=[1, 2])
   sliced = slice_code(tree, ixs, sizes, slicer)
Debugging Tips
Enable verbose output
omeco has no built-in verbose mode, but you can inspect intermediate results with print statements:
# Check optimization result
tree = optimize_code(ixs, out, sizes)
print(f"Tree:\n{tree}")
# Check complexity
comp = tree.complexity(ixs, sizes)
print(f"tc: 2^{comp.tc:.2f}, sc: 2^{comp.sc:.2f}, rwc: 2^{comp.rwc:.2f}")
# Check slicing
sliced = slice_code(tree, ixs, sizes, slicer)
print(f"Sliced indices: {sliced.slicing()}")
Compare with Julia implementation
If you suspect a bug, compare with the original Julia implementation:
using OMEinsumContractionOrders
ixs = [[1, 2], [2, 3], [3, 1]]
out = [1]
sizes = Dict(1 => 10, 2 => 20, 3 => 10)
# Greedy
tree = optimize_code(ixs, out, sizes, GreedyMethod())
println(tree)
# TreeSA
tree = optimize_code(ixs, out, sizes, TreeSA())
println(tree)
Minimal reproducible example
When reporting issues, provide:
from omeco import optimize_code
# Minimal example
ixs = [[0, 1], [1, 2]]
out = [0, 2]
sizes = {0: 10, 1: 20, 2: 10}
tree = optimize_code(ixs, out, sizes)
print(f"Tree:\n{tree}")
comp = tree.complexity(ixs, sizes)
print(f"Complexity: tc={comp.tc:.2f}, sc={comp.sc:.2f}")
# What was expected vs what you got
print(f"Expected tc: ~13.0")
print(f"Got tc: {comp.tc:.2f}")
FAQ
Q: How do I use different optimizers?
A: optimize_code is the unified function that accepts different optimizer types:
from omeco import optimize_code, GreedyMethod, TreeSA
# Use default optimizer (GreedyMethod)
tree1 = optimize_code(ixs, out, sizes)
# Explicitly specify GreedyMethod
tree2 = optimize_code(ixs, out, sizes, GreedyMethod())
# Use TreeSA optimizer
tree3 = optimize_code(ixs, out, sizes, TreeSA.fast())
Q: How do I choose between Greedy and TreeSA?
A:
- GreedyMethod: Fast (milliseconds to seconds), good for prototyping
- TreeSA: Slower (seconds to minutes), better quality, use for production
Rule of thumb (see the sketch after this list):
- < 50 tensors → Greedy is fine
- 50-200 tensors → TreeSA.fast()
- > 200 tensors → Start with Greedy, then try TreeSA if needed
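One way to encode this rule of thumb in code (a sketch; the thresholds are rough guidance, not library defaults):
from omeco import optimize_code, GreedyMethod, TreeSA

def choose_optimizer(num_tensors):
    if num_tensors < 50:
        return GreedyMethod()
    # For very large networks, compare against a Greedy baseline first
    return TreeSA.fast()

tree = optimize_code(ixs, out, sizes, choose_optimizer(len(ixs)))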
Q: What does sc_target mean?
A: sc_target is the log₂ of the target memory limit in elements.
Calculate it:
import math
# For 4GB memory, float64 (8 bytes)
memory_gb = 4
bytes_per_element = 8
sc_target = math.log2(memory_gb * 1024**3 / bytes_per_element)
# sc_target ≈ 29.0
# For 8GB memory, float32 (4 bytes)
memory_gb = 8
bytes_per_element = 4
sc_target = math.log2(memory_gb * 1024**3 / bytes_per_element)
# sc_target ≈ 31.0
Q: Why is GPU optimization different?
A: GPUs have limited memory bandwidth relative to compute (10-100x higher compute-to-memory-bandwidth ratio than CPUs).
- CPU: Memory I/O is fast → `rw_weight=0.0`
- GPU: Use `rw_weight` to penalize memory I/O
See GPU Optimization Guide for complete tuning methodology.
Q: Can I save and load optimized trees?
A: Yes, use to_dict() and JSON:
import json
from omeco import optimize_code
# Optimize
tree = optimize_code(ixs, out, sizes)
# Save
tree_dict = tree.to_dict()
with open("tree.json", "w") as f:
json.dump(tree_dict, f)
# Load
with open("tree.json", "r") as f:
tree_dict = json.load(f)
# Use dict directly or reconstruct (implementation dependent)
Q: What if my tensor network has repeated indices?
A: Repeated indices (traces) are supported:
# Trace: C[i,j] = Σₖ A[i,k,k] * B[k,j]
ixs = [[0, 1, 1], [1, 2]] # Note: index 1 appears twice in first tensor
out = [0, 2]
sizes = {0: 10, 1: 20, 2: 10}
tree = optimize_code(ixs, out, sizes)
Still Having Issues?
- Check the documentation
- Search existing issues
- Report a bug:
  - Include a minimal reproducible example
  - Include Python/Rust version, OS, and omeco version
  - Show expected vs actual behavior
- Ask for help:
  - GitHub Discussions
  - Include context and what you've tried
API Reference
Quick reference for omeco's Python and Rust APIs.
Python API
Full Python API documentation is available in the package docstrings. This page provides a quick overview.
Optimization Functions
optimize_code
def optimize_code(
ixs: List[List[int]],
out: List[int],
sizes: Dict[int, int],
optimizer: Optional[Optimizer] = None
) -> NestedEinsum
Optimize tensor contraction order.
Parameters:
- `ixs`: List of index lists for each tensor (e.g., `[[0, 1], [1, 2]]`)
- `out`: Output indices (e.g., `[0, 2]`)
- `sizes`: Dictionary mapping indices to dimensions (e.g., `{0: 10, 1: 20, 2: 10}`)
- `optimizer`: Optimizer instance (default: `GreedyMethod()`)
Returns: NestedEinsum representing the optimized contraction tree
Example:
from omeco import optimize_code, TreeSA
tree = optimize_code(
ixs=[[0, 1], [1, 2], [2, 3]],
out=[0, 3],
sizes={0: 10, 1: 20, 2: 30, 3: 10},
optimizer=TreeSA.fast()
)
Slicing Functions
slice_code
def slice_code(
tree: NestedEinsum,
ixs: List[List[int]],
sizes: Dict[int, int],
slicer: Optional[TreeSASlicer] = None
) -> SlicedCode
Apply slicing to reduce memory usage.
Parameters:
- `tree`: Optimized contraction tree from `optimize_code`
- `ixs`: Original index lists
- `sizes`: Dimension sizes
- `slicer`: Slicer instance (optional, defaults to `TreeSASlicer()`)
Returns: SlicedCode with slicing applied
Example:
from omeco import optimize_code, slice_code, TreeSASlicer, ScoreFunction
tree = optimize_code(ixs, out, sizes)
slicer = TreeSASlicer.fast(
score=ScoreFunction(sc_target=28.0)
)
sliced = slice_code(tree, ixs, sizes, slicer)
Complexity Functions
contraction_complexity
def contraction_complexity(
tree: NestedEinsum,
ixs: List[List[int]],
sizes: Dict[int, int]
) -> Complexity
Calculate complexity metrics for a contraction tree.
Returns: Complexity object with fields:
- `tc`: Time complexity (log₂ FLOPs)
- `sc`: Space complexity (log₂ elements)
- `rwc`: Read-write complexity (log₂ elements)
Example:
from omeco import optimize_code
tree = optimize_code(ixs, out, sizes)
comp = tree.complexity(ixs, sizes)
print(f"Time: 2^{comp.tc:.2f} FLOPs")
print(f"Space: 2^{comp.sc:.2f} elements")
print(f"Read-write: 2^{comp.rwc:.2f} elements")
sliced_complexity
def sliced_complexity(
sliced: SlicedCode,
ixs: List[List[int]],
sizes: Dict[int, int]
) -> Complexity
Calculate complexity for sliced code.
Example:
from omeco import slice_code
sliced = slice_code(tree, ixs, sizes, slicer)
comp = sliced.complexity(ixs, sizes)
print(f"Sliced space: 2^{comp.sc:.2f}")
Optimizer Classes
GreedyMethod
class GreedyMethod:
def __init__(
self,
max_iter: Optional[int] = None,
temperature: float = 0.0,
max_branches: int = 1
)
Greedy optimization algorithm.
Parameters:
- `max_iter`: Maximum iterations (default: unlimited)
- `temperature`: Temperature for the stochastic variant (0.0 = deterministic)
- `max_branches`: Number of branches to explore (1 = pure greedy)
Presets:
# Default greedy
GreedyMethod()
# Stochastic variant
GreedyMethod(temperature=1.0, max_branches=5)
TreeSA
class TreeSA:
def __init__(
self,
ntrials: int = 1,
niters: int = 50,
score: Optional[ScoreFunction] = None
)
@staticmethod
def fast(score: Optional[ScoreFunction] = None) -> TreeSA
Tree-based simulated annealing optimizer.
Parameters:
- `ntrials`: Number of independent trials
- `niters`: Iterations per trial
- `score`: Score function for evaluating candidates
Presets:
# Fast preset (good balance)
TreeSA.fast()
# Custom configuration
TreeSA(ntrials=10, niters=100, score=ScoreFunction(sc_target=28.0))
TreeSASlicer
class TreeSASlicer:
def __init__(
self,
ntrials: int = 1,
niters: int = 50,
fixed_slices: Optional[List[int]] = None,
score: Optional[ScoreFunction] = None
)
@staticmethod
def fast(
fixed_slices: Optional[List[int]] = None,
score: Optional[ScoreFunction] = None
) -> TreeSASlicer
Simulated annealing slicer.
Parameters:
- `ntrials`: Number of trials
- `niters`: Iterations per trial
- `fixed_slices`: Indices that must be sliced
- `score`: Score function
Example:
# Automatic slice selection
TreeSASlicer.fast(score=ScoreFunction(sc_target=25.0))
# Force slicing on specific indices
TreeSASlicer.fast(fixed_slices=[1, 2])
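If one pass does not reach your memory budget, a simple retry loop (a usage pattern sketch, not a built-in feature) lowers `sc_target` until the sliced space complexity fits:
from omeco import slice_code, TreeSASlicer, ScoreFunction

sc_budget = 28.0  # log2 of the element budget
sc_target = sc_budget
for _ in range(5):
    slicer = TreeSASlicer.fast(score=ScoreFunction(sc_target=sc_target))
    sliced = slice_code(tree, ixs, sizes, slicer)
    if sliced.complexity(ixs, sizes).sc <= sc_budget:
        break
    sc_target -= 1.0  # ask the slicer for a stricter target and retry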
Configuration Classes
ScoreFunction
class ScoreFunction:
def __init__(
self,
tc_weight: float = 1.0,
sc_weight: float = 1.0,
rw_weight: float = 0.0,
sc_target: float = 20.0
)
Configure optimization objectives.
Parameters:
- `tc_weight`: Weight for time complexity
- `sc_weight`: Weight for space complexity
- `rw_weight`: Weight for read-write complexity (default: 0.0 for CPU; see GPU Optimization Guide for GPU)
- `sc_target`: Target space complexity (log₂ elements)
Score Formula:
score = tc_weight × 2^tc + rw_weight × 2^rwc + sc_weight × max(0, 2^sc - 2^sc_target)
Examples:
# CPU optimization
ScoreFunction(tc_weight=1.0, sc_weight=1.0, rw_weight=0.0, sc_target=28.0)
# GPU optimization (see GPU Optimization Guide)
ScoreFunction(tc_weight=1.0, sc_weight=1.0, rw_weight=10.0, sc_target=30.0)
# Memory-constrained
ScoreFunction(tc_weight=0.1, sc_weight=10.0, rw_weight=0.0, sc_target=25.0)
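For intuition, the sketch below re-evaluates the score formula by hand for a given Complexity object; it is a plain reimplementation of the formula above, not the library's internal code:
def score_of(comp, tc_weight=1.0, sc_weight=1.0, rw_weight=0.0, sc_target=20.0):
    # Mirrors: tc_weight × 2^tc + rw_weight × 2^rwc
    #          + sc_weight × max(0, 2^sc - 2^sc_target)
    return (tc_weight * 2**comp.tc
            + rw_weight * 2**comp.rwc
            + sc_weight * max(0.0, 2**comp.sc - 2**sc_target))

comp = tree.complexity(ixs, sizes)
print(f"CPU score: {score_of(comp):.3e}")
print(f"GPU score: {score_of(comp, rw_weight=10.0, sc_target=30.0):.3e}")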
Data Classes
NestedEinsum
Represents an optimized contraction tree.
Methods:
tree.to_dict() -> dict # Convert to dictionary
tree.leaf_count() -> int # Number of leaf tensors
tree.depth() -> int # Tree depth
str(tree) # Pretty-print tree
repr(tree) # Summary representation
Dictionary Structure:
# Leaf node
{"tensor_index": int}
# Internal node
{
"args": [child1_dict, child2_dict, ...],
"eins": {
"ixs": [[int, ...], ...], # Input indices
"iy": [int, ...] # Output indices
}
}
Example:
tree = optimize_code(ixs, out, sizes)
# Pretty print
print(tree)
# Output:
# ab, bc -> ac
# ├─ ab
# └─ bc
# Convert to dict for execution
tree_dict = tree.to_dict()
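The dictionary form is the basis for custom executors. As a sketch of working with it, the following walk prints each contraction in post-order and cross-checks the leaf count (assuming the structure documented above):
def walk(node, depth=0):
    if "tensor_index" in node:  # leaf node
        print("  " * depth + f"tensor {node['tensor_index']}")
        return 1
    # Visit children first, then report this contraction step
    leaves = sum(walk(child, depth + 1) for child in node["args"])
    eins = node["eins"]
    print("  " * depth + f"contract {eins['ixs']} -> {eins['iy']}")
    return leaves

num_leaves = walk(tree.to_dict())
assert num_leaves == tree.leaf_count()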
SlicedCode
Represents a contraction with slicing applied.
Methods:
sliced.slicing() -> List[int] # Get sliced indices
Example:
sliced = slice_code(tree, ixs, sizes, slicer)
print(f"Sliced indices: {sliced.slicing()}")
# Output: [1, 3]
Complexity
Complexity metrics for a contraction.
Fields:
comp.tc: float # Time complexity (log₂ FLOPs)
comp.sc: float # Space complexity (log₂ elements)
comp.rwc: float # Read-write complexity (log₂ elements)
Example:
comp = tree.complexity(ixs, sizes)
# Interpret log₂ values
flops = 2 ** comp.tc
memory_elements = 2 ** comp.sc
io_operations = 2 ** comp.rwc
print(f"{flops:.2e} FLOPs")
print(f"{memory_elements:.2e} elements in memory")
Rust API
Full Rust API documentation: docs.rs/omeco
Core Types
NestedEinsum<I>
#![allow(unused)]
fn main() {
pub struct NestedEinsum<I> { /* ... */ }
impl<I> NestedEinsum<I> {
pub fn leaf(tensor_index: usize) -> Self
pub fn new(args: Vec<Self>, eins: ContractionOperation<I>) -> Self
pub fn depth(&self) -> usize
pub fn leaf_count(&self) -> usize
}
}
ContractionOperation<I>
#![allow(unused)]
fn main() {
pub struct ContractionOperation<I> {
pub ixs: Vec<Vec<I>>, // Input indices
pub iy: Vec<I>, // Output indices
}
}
Optimization Functions
optimize_greedy
#![allow(unused)]
fn main() {
pub fn optimize_greedy<I>(
ixs: &[Vec<I>],
out: &[I],
sizes: &HashMap<I, usize>
) -> NestedEinsum<I>
where
I: Hash + Eq + Clone + Ord
}
optimize_code
#![allow(unused)]
fn main() {
pub fn optimize_code<I, O>(
ixs: &[Vec<I>],
out: &[I],
sizes: &HashMap<I, usize>,
optimizer: O
) -> NestedEinsum<I>
where
I: Hash + Eq + Clone + Ord,
O: Optimizer<I>
}
Optimizer Traits
Optimizer<I>
#![allow(unused)]
fn main() {
pub trait Optimizer<I> {
fn optimize(
&self,
ixs: &[Vec<I>],
out: &[I],
sizes: &HashMap<I, usize>
) -> NestedEinsum<I>;
}
}
Configuration
ScoreFunction
#![allow(unused)]
fn main() {
pub struct ScoreFunction {
pub tc_weight: f64,
pub sc_weight: f64,
pub rw_weight: f64,
pub sc_target: f64,
}
impl ScoreFunction {
pub fn new(tc_weight: f64, sc_weight: f64, rw_weight: f64, sc_target: f64) -> Self
}
}
Type Mapping (Python ↔ Rust)
| Python | Rust | Notes |
|---|---|---|
List[List[int]] | Vec<Vec<i64>> | Index lists |
Dict[int, int] | HashMap<i64, usize> | Dimension sizes |
NestedEinsum | NestedEinsum<i64> | Contraction tree |
ScoreFunction | ScoreFunction | Same fields |
Complexity | ContractionComplexity | Metrics |
Common Patterns
Optimize with custom score
from omeco import optimize_code, TreeSA, ScoreFunction
score = ScoreFunction(
tc_weight=1.0,
sc_weight=1.0,
rw_weight=10.0, # For GPU (see GPU Optimization Guide)
sc_target=30.0
)
tree = optimize_code(
ixs=[[0, 1], [1, 2]],
out=[0, 2],
sizes={0: 10, 1: 20, 2: 10},
optimizer=TreeSA(score=score)
)
Optimize then slice
from omeco import optimize_code, slice_code, TreeSA, TreeSASlicer, ScoreFunction
# Step 1: Optimize contraction order
tree = optimize_code(ixs, out, sizes, TreeSA.fast())
# Step 2: Apply slicing if needed
slicer = TreeSASlicer.fast(score=ScoreFunction(sc_target=28.0))
sliced = slice_code(tree, ixs, sizes, slicer)
Check complexity
from omeco import optimize_code
tree = optimize_code(ixs, out, sizes)
comp = tree.complexity(ixs, sizes)
print(f"Time: 2^{comp.tc:.2f} FLOPs = {2**comp.tc:.2e} FLOPs")
print(f"Space: 2^{comp.sc:.2f} elements = {2**comp.sc:.2e} elements")
# Memory in GiB (float64)
memory_gib = (2 ** comp.sc) * 8 / 1024**3
print(f"Memory: {memory_gib:.2f} GiB")
Next Steps
- Quick Start Guide - Get started with examples
- Score Function Configuration - Configure optimization
- Troubleshooting - Common issues and solutions
- Full Rust Docs - Complete Rust API documentation
Changelog
All notable changes to omeco are documented here.
The format follows Keep a Changelog, and this project adheres to Semantic Versioning.
[Unreleased]
Added
- Comprehensive mdBook documentation following tropical-gemm standard
- Pretty printing for Python `NestedEinsum` with ASCII tree visualization
- PyTorch integration guide and examples
- GPU optimization guide with `rw_weight` configuration
- Slicing strategy guide for memory-constrained environments
- Troubleshooting guide with common issues and solutions
- API reference for Python and Rust APIs
- Performance benchmarks comparing Rust vs Julia
- Algorithm comparison guide (Greedy vs TreeSA)
Changed
- Migrated documentation from scattered markdown files to structured mdBook
- Improved Python bindings with better `__str__` and `__repr__` methods
Deprecated
- Legacy `docs/score_function_guide.md` (migrated to mdBook)
[0.2.1] - 2024-01-XX
Fixed
- Issue #6: Hyperedge index preservation in contraction operations (PR #7)
- Fixed `contract_tree!` macro to correctly preserve tensor indices during contraction
- Added regression tests to verify hyperedge handling
- Ensures contraction order matches input tensor order specified in `ixs`
Added
- Test suite for hyperedge index preservation
- CI improvements for better test coverage
[0.2.0] - 2024-01-XX
Added
- TreeSA (Tree-based Simulated Annealing) optimizer
- `TreeSA.fast()` preset for quick high-quality optimization
- Slicing support with `TreeSASlicer` for memory reduction
- `ScoreFunction` for configurable optimization objectives
- `contraction_complexity` and `sliced_complexity` functions
- Python bindings via PyO3
- `optimize_code` generic function accepting optimizer instances
- Read-write complexity (rwc) metric for GPU optimization
Changed
- Improved API ergonomics with preset methods
- Better default parameters for optimizers
Performance
- 1.4-1.5x faster than Julia OMEinsumContractionOrders.jl on benchmarks
- Efficient TreeSA implementation with better exploration
[0.1.0] - 2023-XX-XX
Added
- Initial release
- GreedyMethod optimizer
- Basic contraction order optimization
- Support for tensor networks with arbitrary indices
- Complexity calculation (time and space)
- Rust core library
- Basic documentation
Features
- Greedy algorithm with configurable parameters
- Stochastic variants for improved solutions
- Efficient index handling with generic types
- HashMap-based dimension tracking
Version Numbering
omeco follows Semantic Versioning:
- MAJOR version for incompatible API changes
- MINOR version for new functionality in a backward compatible manner
- PATCH version for backward compatible bug fixes
Release Process
1. Update version in `Cargo.toml` files
2. Update `CHANGELOG.md` with release notes
3. Tag the release: `git tag v0.X.Y`
4. Push tags: `git push --tags`
5. Publish to crates.io: `cargo publish`
6. Publish to PyPI: `maturin publish` (Python bindings)
Contributing
See CONTRIBUTING.md for guidelines on contributing to omeco.
Migration Guides
Migrating from 0.1.x to 0.2.x
Major Changes:
1. New `optimize_code` function: Unified API for all optimizers
   # Old (0.1.x)
   tree = optimize_greedy(ixs, out, sizes)
   tree = optimize_treesa(ixs, out, sizes)
   # New (0.2.x) - unified interface
   from omeco import optimize_code, GreedyMethod, TreeSA
   tree = optimize_code(ixs, out, sizes)  # Uses GreedyMethod by default
   tree = optimize_code(ixs, out, sizes, TreeSA.fast())
2. ScoreFunction configuration:
   # New in 0.2.x
   from omeco import ScoreFunction, TreeSA
   score = ScoreFunction(tc_weight=1.0, sc_weight=1.0, rw_weight=10.0, sc_target=30.0)
   tree = optimize_code(ixs, out, sizes, TreeSA(score=score))
3. Slicing support:
   # New in 0.2.x
   from omeco import slice_code, TreeSASlicer
   sliced = slice_code(tree, ixs, sizes, TreeSASlicer.fast())
Breaking Changes:
- Removed `optimize_greedy()` and `optimize_treesa()` from Python exports
- Use `optimize_code(...)` with `GreedyMethod()` or `TreeSA()` optimizers instead
- Rust API unchanged
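If many 0.1.x call sites remain, a thin compatibility shim in your own code (hypothetical, not shipped with omeco) can bridge the gap during migration:
from omeco import optimize_code, GreedyMethod, TreeSA

# Shims reproducing the removed 0.1.x entry points
def optimize_greedy(ixs, out, sizes):
    return optimize_code(ixs, out, sizes, GreedyMethod())

def optimize_treesa(ixs, out, sizes):
    return optimize_code(ixs, out, sizes, TreeSA.fast())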
Migrating from Julia OMEinsumContractionOrders.jl
Index Differences:
- Julia: 1-based indexing
- Rust/Python: 0-based indexing (or use arbitrary hashable types)
# Julia
ixs = [[1, 2], [2, 3], [3, 1]]
sizes = Dict(1 => 10, 2 => 20, 3 => 10)
# Python (0-based)
ixs = [[0, 1], [1, 2], [2, 0]]
sizes = {0: 10, 1: 20, 2: 10}
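When porting an existing Julia problem definition, a small shift (a sketch) converts the 1-based indices and size dictionary to the 0-based form:
# Julia-style (1-based) inputs
ixs_julia = [[1, 2], [2, 3], [3, 1]]
sizes_julia = {1: 10, 2: 20, 3: 10}

# Shift every index down by one for omeco
ixs = [[i - 1 for i in ix] for ix in ixs_julia]
sizes = {i - 1: d for i, d in sizes_julia.items()}
# ixs == [[0, 1], [1, 2], [2, 0]], sizes == {0: 10, 1: 20, 2: 10}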
Function Names:
| Julia | omeco (Python/Rust) |
|---|---|
optimize_greedy | optimize_code(..., GreedyMethod()) |
optimize_treesa | optimize_code(..., TreeSA.fast()) |
contraction_complexity | contraction_complexity |
slicing | slice_code |
API Compatibility: Most functions have similar signatures and behavior.