
omeco

High-performance tensor network contraction order optimization in Rust.

What is omeco?

omeco (One More Einsum Contraction Order) is a library for optimizing tensor network contractions. When contracting multiple tensors together, the order of contractions can make an exponential difference in computational cost. omeco provides fast algorithms to find near-optimal contraction orders.

Ported from the Julia library OMEinsumContractionOrders.jl.

Why Optimize Contraction Order?

Consider contracting three matrices: A[10×100] × B[100×20] × C[20×5].

Different orders give different costs:

| Order | Operations | Peak Memory |
|---------|--------------------------|--------------|
| (A×B)×C | 20,000 + 1,000 = 21,000 | 200 elements |
| A×(B×C) | 10,000 + 5,000 = 15,000 | 500 elements |

Even in this tiny chain, the two orders differ by roughly 40% in operations and 2.5x in intermediate memory, and neither order wins on both counts. For larger tensor networks, the differences can be exponential.

Finding the optimal order is NP-complete, but heuristics find near-optimal solutions in seconds.

Key Features

  • Fast Greedy Algorithm: O(n² log n) greedy method for quick optimization
  • Simulated Annealing: TreeSA finds higher-quality solutions for complex networks
  • Automatic Slicing: TreeSASlicer reduces memory by trading time for space
  • Hardware-Aware: Configure for CPU or GPU with memory I/O optimization
  • Python & Rust: Native Rust with Python bindings via PyO3

Example

Python:

from omeco import optimize_code

# Matrix chain: A[i,j] × B[j,k] × C[k,l]
ixs = [[0, 1], [1, 2], [2, 3]]
out = [0, 3]
sizes = {0: 10, 1: 100, 2: 20, 3: 5}

tree = optimize_code(ixs, out, sizes)
complexity = tree.complexity(ixs, sizes)
print(f"Time: 2^{complexity.tc:.2f}, Space: 2^{complexity.sc:.2f}")

Rust:

use omeco::{contraction_complexity, EinCode, GreedyMethod, optimize_code};
use std::collections::HashMap;

fn main() {
    let code = EinCode::new(vec![vec![0, 1], vec![1, 2], vec![2, 3]], vec![0, 3]);
    let sizes = HashMap::from([(0, 10), (1, 100), (2, 20), (3, 5)]);

    let tree = optimize_code(&code, &sizes, &GreedyMethod::default()).unwrap();
    let complexity = contraction_complexity(&tree, &code, &sizes);
    println!("Time: 2^{:.2}, Space: 2^{:.2}", complexity.tc, complexity.sc);
}

Use Cases

  • Quantum Circuit Simulation: Optimizing tensor network contractions in quantum algorithms
  • Neural Networks: Efficient einsum operations in deep learning
  • Scientific Computing: Large-scale tensor computations in physics and chemistry
  • Graph Algorithms: Computing graph polynomials and partition functions

Next Steps

  • Installation - Install omeco for Python or Rust
  • Quick Start - Your first tensor contraction optimization
  • Concepts - Understand tensor networks and complexity metrics

Getting Started

This section covers installation and basic usage of omeco.

Overview

omeco is available as both a Rust crate and a Python package. Choose the language that best fits your workflow:

  • Python: Quick prototyping, PyTorch/NumPy integration, Jupyter notebooks
  • Rust: Maximum performance, embedded systems, library development

Crate Organization

| Package | Description |
|--------------|------------------------------------------------|
| omeco | Core Rust library with optimization algorithms |
| omeco-python | Python bindings built with PyO3 |

System Requirements

For Python

  • Python 3.8 or later
  • pip or conda package manager

For Rust

  • Rust 1.70 or later
  • Cargo build system

Next Steps

Installation

Python

Via pip

The easiest way to install omeco for Python:

pip install omeco

Via conda (not yet available)

# Coming soon
conda install -c conda-forge omeco

From source

For development or the latest features:

git clone https://github.com/GiggleLiu/omeco.git
cd omeco/omeco-python
pip install maturin
maturin develop

Verify Installation

import omeco
print(omeco.__version__)

# Quick test
from omeco import optimize_code, GreedyMethod
tree = optimize_code([[0, 1], [1, 2]], [0, 2], {0: 2, 1: 3, 2: 2})
print(tree)

Expected output:

ab, bc -> ac
├─ tensor_0
└─ tensor_1

Rust

Via Cargo

Add to your Cargo.toml:

[dependencies]
omeco = "0.2"

Or use cargo add:

cargo add omeco

From source

git clone https://github.com/GiggleLiu/omeco.git
cd omeco
cargo build --release

Verify Installation

Create src/main.rs:

use omeco::{EinCode, GreedyMethod, optimize_code};
use std::collections::HashMap;

fn main() {
    let code = EinCode::new(vec![vec![0, 1], vec![1, 2]], vec![0, 2]);
    let sizes = HashMap::from([(0, 2), (1, 3), (2, 2)]);
    
    let tree = optimize_code(&code, &sizes, &GreedyMethod::default()).unwrap();
    println!("{} leaves, depth {}", tree.leaf_count(), tree.depth());
}

Run:

cargo run

Troubleshooting

Python: “No module named omeco”

  • Ensure you’re using the correct Python environment
  • Try pip install --force-reinstall omeco
  • Check pip list | grep omeco

Rust: Compilation errors

  • Update Rust: rustup update
  • Check minimum version: rustc --version (need 1.70+)
  • Clean and rebuild: cargo clean && cargo build

Installation takes too long

  • Python: Pre-built wheels are available for common platforms
  • Rust: First build compiles dependencies, subsequent builds are fast

Next Steps

Quick Start

This guide shows the basics of tensor contraction optimization.

Basic Contraction Optimization

Python

from omeco import optimize_code

# Matrix chain: A[i,j] × B[j,k] × C[k,l] → D[i,l]
ixs = [[0, 1], [1, 2], [2, 3]]  # Input tensor indices
out = [0, 3]                     # Output indices
sizes = {0: 10, 1: 100, 2: 20, 3: 5}  # Dimension sizes

# Optimize contraction order
tree = optimize_code(ixs, out, sizes)

# Compute complexity
complexity = tree.complexity(ixs, sizes)
print(f"Time complexity: 2^{complexity.tc:.2f} FLOPs")
print(f"Space complexity: 2^{complexity.sc:.2f} elements")
print(f"Actual FLOPs: {complexity.flops():,.0f}")
print(f"Peak memory: {complexity.peak_memory():,.0f} elements")

Output:

Time complexity: 2^14.29 FLOPs
Space complexity: 2^11.29 elements
Actual FLOPs: 21,000
Peak memory: 2,500 elements

Visualize the contraction tree:

print(tree)
ab, bd -> ad
├─ tensor_0
└─ bc, cd -> bd
   ├─ tensor_1
   └─ tensor_2

Rust

use omeco::{
    contraction_complexity, EinCode, GreedyMethod, optimize_code
};
use std::collections::HashMap;

fn main() {
    // Matrix chain: A[i,j] × B[j,k] × C[k,l] → D[i,l]
    let code = EinCode::new(
        vec![vec![0, 1], vec![1, 2], vec![2, 3]],
        vec![0, 3]
    );
    let sizes = HashMap::from([
        (0, 10),
        (1, 100),
        (2, 20),
        (3, 5),
    ]);

    // Optimize with default greedy method
    let tree = optimize_code(&code, &sizes, &GreedyMethod::default())
        .expect("Optimization failed");

    // Compute complexity
    let complexity = contraction_complexity(&tree, &code, &sizes);
    
    println!("Time: 2^{:.2}, Space: 2^{:.2}", 
             complexity.tc, complexity.sc);
    println!("Leaves: {}, Depth: {}", 
             tree.leaf_count(), tree.depth());
}

Understanding the Output

NestedEinsum Structure

The optimized contraction tree shows the order of operations:

ab, bd -> ad
├─ tensor_0
└─ bc, cd -> bd
   ├─ tensor_1
   └─ tensor_2

This means:

  1. First contract tensor_1 (indices bc) with tensor_2 (indices cd) → result has indices bd
  2. Then contract tensor_0 (indices ab) with the result → final output has indices ad

Complexity Metrics

  • tc (time complexity): log₂ of total FLOPs needed
  • sc (space complexity): log₂ of peak memory usage
  • rwc (read-write complexity): log₂ of memory I/O operations (for GPU optimization)

Lower values = better performance.
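
Because tc and sc are just base-2 logarithms, converting them back to concrete numbers takes one line each. A minimal sketch (the describe helper is ours, not part of omeco):

def describe(tc, sc, bytes_per_element=8):
    # Convert the log2 metrics into human-readable figures (float64 by default)
    flops = 2 ** tc
    mem_bytes = 2 ** sc * bytes_per_element
    return f"{flops:,.0f} FLOPs, {mem_bytes:,.0f} bytes peak"

print(describe(14.29, 11.29))  # ≈20,000 FLOPs, ≈20 KB peak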

Using Different Optimizers

Greedy Method (Fast)

from omeco import GreedyMethod

# Deterministic greedy (default)
tree = optimize_code(ixs, out, sizes, GreedyMethod())

# Stochastic greedy (explores more options)
tree = optimize_code(ixs, out, sizes, GreedyMethod(alpha=0.5, temperature=1.0))

TreeSA (Higher Quality)

from omeco import TreeSA, ScoreFunction

# Fast preset (good for most cases)
tree = optimize_code(ixs, out, sizes, TreeSA.fast())

# Custom configuration
score = ScoreFunction(sc_target=15.0)
optimizer = TreeSA(ntrials=10, niters=50, score=score)
tree = optimize_code(ixs, out, sizes, optimizer)

Working with the Tree

Convert to Dictionary

tree_dict = tree.to_dict()

# Traverse the tree
def traverse(node):
    if "tensor_index" in node:
        print(f"Leaf: tensor {node['tensor_index']}")
    else:
        print(f"Contraction: {node['eins']}")
        for child in node["args"]:
            traverse(child)

traverse(tree_dict)

Tree Properties

# Number of input tensors
num_tensors = tree.leaf_count()

# Depth of the tree
depth = tree.depth()

# Check if binary tree
is_binary = tree.is_binary()

# Get leaf indices in order
leaves = tree.leaf_indices()

Memory Reduction with Slicing

For large tensor networks that don’t fit in memory:

from omeco import slice_code, TreeSASlicer, ScoreFunction

# Optimize contraction first
tree = optimize_code(ixs, out, sizes)

# Check if memory is too large
complexity = tree.complexity(ixs, sizes)
if complexity.sc > 25.0:  # More than 2^25 elements (~256MB for float64)
    # Slice to reduce memory
    score = ScoreFunction(sc_target=25.0)
    slicer = TreeSASlicer.fast(score=score)
    sliced = slice_code(tree, ixs, sizes, slicer)
    
    # Check new complexity
    sliced_comp = sliced.complexity(ixs, sizes)
    print(f"Original space: 2^{complexity.sc:.2f}")
    print(f"Sliced space: 2^{sliced_comp.sc:.2f}")
    print(f"Sliced indices: {sliced.slicing()}")

Next Steps

Concepts

Understanding these core concepts will help you use omeco effectively.

Overview

Tensor network contraction is a fundamental operation in many scientific and machine learning applications. This section explains:

  • What tensor networks are and why they matter
  • Why contraction order makes an exponential difference
  • How to interpret complexity metrics

Topics

Quick Summary

The Problem: Contracting N tensors can be done in many orders, with costs differing exponentially.

The Solution: Heuristic algorithms find near-optimal orders in polynomial time.

The Trade-off: Time vs quality - greedy is fast, simulated annealing is better.

Tensor Networks

What are Tensor Networks?

A tensor network is a collection of multidimensional arrays (tensors) connected by shared indices. Contracting a tensor network means computing their products and sums according to the Einstein summation convention.

Einstein Summation

The einsum notation specifies which indices to sum over:

einsum("ij,jk->ik", A, B)

This means:

  • A has shape [i, j]
  • B has shape [j, k]
  • Sum over shared index j
  • Result has shape [i, k]

Equivalent to matrix multiplication: C[i,k] = Σⱼ A[i,j] * B[j,k]
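
A quick NumPy check of the convention (NumPy is used here only for illustration):

import numpy as np

A = np.random.rand(3, 4)
B = np.random.rand(4, 5)

# "ij,jk->ik" sums over the shared index j: exactly matrix multiplication
C = np.einsum("ij,jk->ik", A, B)
assert np.allclose(C, A @ B)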

Common Patterns

Matrix Multiplication

# C = A × B
einsum("ij,jk->ik", A, B)

Batch Matrix Multiplication

# Multiple matrix multiplications at once
einsum("bij,bjk->bik", A, B)

Trace

# Sum of diagonal elements
einsum("ii->", A)

Outer Product

# C[i,j] = A[i] * B[j]
einsum("i,j->ij", a, b)

Tensor Contraction with Multiple Indices

# Contract over indices j and k
einsum("ijk,klj->il", A, B)

Tensor Networks in Practice

Quantum Circuit Simulation

Quantum gates are represented as tensors, and circuit evaluation becomes a tensor network contraction:

# 3-qubit quantum circuit
gates = [
    [[...]],  # Gate on qubits 0,1
    [[...]],  # Gate on qubits 1,2
    [[...]],  # Gate on qubits 0,2
]

# Each qubit index appears multiple times
# Contracting computes the circuit output

Neural Networks

Einsum operations appear in:

  • Attention mechanisms: einsum("bqd,bkd->bqk", Q, K)
  • Tensor decompositions: Tucker, CP, tensor train
  • Graph neural networks

Scientific Computing

  • Physics: Partition functions, path integrals
  • Chemistry: Molecular orbital calculations
  • Mathematics: Graph polynomials, combinatorial problems

Contraction Order Matters

Given tensors A, B, C to contract, we can do:

  1. (A × B) × C
  2. A × (B × C)
  3. (A × C) × B

These give the same result but have vastly different costs.

Example with shapes A[10,100], B[100,20], C[20,5]:

| Order | FLOPs | Peak Memory |
|---------|--------------------------|-------------|
| (A×B)×C | 20,000 + 1,000 = 21,000 | 200 |
| A×(B×C) | 10,000 + 5,000 = 15,000 | 500 |

The second order needs ~30% fewer FLOPs; the first keeps the intermediate 2.5x smaller. Either way, the choice of order changes both cost and memory.

Representing Tensor Networks

As Lists of Index Lists

# Matrix chain A×B×C
ixs = [
    [0, 1],  # A has indices [i, j]
    [1, 2],  # B has indices [j, k]
    [2, 3],  # C has indices [k, l]
]
out = [0, 3]  # Output has indices [i, l]

As Graphs

Each tensor is a node, shared indices are edges:

A[i,j] --- j --- B[j,k] --- k --- C[k,l]
  i                                   l

Contracting removes edges and merges nodes.
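
The graph view is easy to build from the index lists used throughout this book; a small sketch:

from collections import defaultdict

ixs = [[0, 1], [1, 2], [2, 3]]  # matrix chain A, B, C

# Map each index (edge) to the tensors (nodes) it connects
edges = defaultdict(list)
for tensor_id, ix in enumerate(ixs):
    for index in ix:
        edges[index].append(tensor_id)

print(dict(edges))  # {0: [0], 1: [0, 1], 2: [1, 2], 3: [2]}
# Index 1 is the edge labeled j above, shared by A and B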

Next Steps

Contraction Order Problem

The Problem

When contracting multiple tensors, the order matters exponentially.

Simple Example: Three Matrices

Consider A[10×100] × B[100×20] × C[20×5]:

Option 1: Left-to-right (A×B)×C

Step 1: A[10×100] × B[100×20] = T[10×20]
        Cost: 10 × 100 × 20 = 20,000 FLOPs
        Memory: 200 elements

Step 2: T[10×20] × C[20×5] = D[10×5]
        Cost: 10 × 20 × 5 = 1,000 FLOPs
        Memory: 50 elements

Total: 21,000 FLOPs, peak memory 200 elements

Option 2: Right-to-left A×(B×C)

Step 1: B[100×20] × C[20×5] = T[100×5]
        Cost: 100 × 20 × 5 = 10,000 FLOPs
        Memory: 500 elements

Step 2: A[10×100] × T[100×5] = D[10×5]
        Cost: 10 × 100 × 5 = 5,000 FLOPs
        Memory: 50 elements

Total: 15,000 FLOPs, peak memory 500 elements

Result: Option 2 needs about 30% fewer FLOPs, while Option 1 keeps the intermediate tensor 2.5x smaller. The order changes both cost and memory, and for larger networks the differences grow exponentially.

Why It’s Hard

Finding the optimal contraction order is NP-complete:

  • With N tensors, there are approximately (2N)!/(N+1)! possible orders
  • For 10 tensors: ~17 million orders
  • For 20 tensors: ~6 × 10²² orders (roughly the number of stars in the observable universe)

Brute force is impossible for realistic problems.

Heuristic Approaches

Since optimal search is intractable, we use heuristics:

1. Greedy Algorithm

Idea: At each step, contract the pair with minimum cost.

Pros:

  • Fast: O(n² log n)
  • Deterministic
  • Good for many practical cases

Cons:

  • Can get stuck in local optima
  • May miss better global solutions

2. Simulated Annealing (TreeSA)

Idea: Random tree mutations with temperature-based acceptance.

Pros:

  • Explores more of the search space
  • Often finds better solutions than greedy
  • Configurable trade-off (time vs quality)

Cons:

  • Slower than greedy
  • Non-deterministic
  • Requires parameter tuning

Real-World Impact

Quantum Circuit with 50 Qubits

  • Random order: ~10⁵⁰ operations (impossible)
  • Greedy order: ~10³⁰ operations (still impossible)
  • Optimized order: ~10²⁰ operations (feasible on supercomputer)

A 10,000,000,000x (10¹⁰) speedup over the greedy order!

Neural Network Attention

# Naive: einsum("bqd,bkd,bkv->bqv", Q, K, V) evaluated as one nested loop
# Time: O(qkdv) per batch

# Optimized: (Q @ K.T) @ V, i.e., two pairwise contractions
# Time: O(qkd + qkv) per batch
# Speedup: ~10x for typical dimensions

Optimization Objectives

Different scenarios need different trade-offs:

| Scenario | Optimize | Accept | Example |
|-------------|------------|------------------------------|----------------------------------|
| GPU limited | Space (sc) | Higher time | Large models on 8GB GPU |
| CPU compute | Time (tc) | Higher space | Batch processing with 256GB RAM |
| Both | Balanced | sc_target = available memory | Most cases |

Practical Guidelines

  1. Start with greedy: Fast baseline for most cases

  2. Use TreeSA if:

    • Greedy result is too slow/large
    • You have time to optimize (minutes, not seconds)
    • Problem is large and complex
  3. Set sc_target to available memory:

    # For 8GB GPU with float32 tensors
    sc_target = log2(8 * 1024**3 / 4) ≈ 31
    
  4. Use slicing when memory is the bottleneck:

    if complexity.sc > sc_target:
        sliced = slice_code(tree, ixs, sizes, TreeSASlicer.fast())
    

Next Steps

Complexity Metrics

omeco tracks three key metrics for tensor contractions.

The Three Metrics

Time Complexity (tc)

Definition: log₂ of total floating-point operations (FLOPs).

Interpretation:

  • tc = 20 means 2²⁰ = 1,048,576 FLOPs
  • tc = 30 means 2³⁰ ≈ 1 billion FLOPs
  • Lower is better (less computation)

Example:

complexity = tree.complexity(ixs, sizes)
print(f"Time: 2^{complexity.tc:.2f} FLOPs")
print(f"Actual FLOPs: {complexity.flops():,.0f}")

Space Complexity (sc)

Definition: log₂ of peak memory usage in number of tensor elements.

Interpretation:

  • sc = 20 means 2²⁰ = 1,048,576 elements (8MB for float64, 4MB for float32)
  • sc = 30 means 2³⁰ ≈ 1 billion elements (8GB for float64, 4GB for float32)
  • Lower is better (less memory)

Example:

complexity = tree.complexity(ixs, sizes)
print(f"Space: 2^{complexity.sc:.2f} elements")
print(f"Peak memory: {complexity.peak_memory():,.0f} elements")

# For float64 (8 bytes)
memory_gib = complexity.peak_memory() * 8 / 1024**3
print(f"Memory: {memory_gib:.2f} GiB")

Read-Write Complexity (rwc)

Definition: log₂ of total memory I/O operations.

Use Case: GPU optimization where memory bandwidth is the bottleneck.

Interpretation:

  • On CPU: Usually ignored (rw_weight=0)
  • On GPU: Critical! Compute-to-memory-bandwidth ratio is high (10-30x)
  • Lower is better (less data movement)

Example:

# CPU optimization (ignore I/O)
score = ScoreFunction(tc_weight=1.0, sc_weight=1.0, rw_weight=0.0)

# GPU optimization (I/O matters! See GPU Optimization Guide)
score = ScoreFunction(tc_weight=1.0, sc_weight=1.0, rw_weight=10.0)

Why Logarithmic Scale?

Using log₂ makes the numbers manageable:

| Linear Scale | Log Scale | Human Readable |
|---------------------|-----------|----------------|
| 1,024 | 10 | ~thousand |
| 1,048,576 | 20 | ~million |
| 1,073,741,824 | 30 | ~billion |
| 1,099,511,627,776 | 40 | ~trillion |

Advantages:

  • Easy to compare: tc=25 vs tc=30 is clear 32x difference
  • Handles huge ranges: from thousands to quintillions
  • Matches memory sizes: 2ⁿ bytes

Calculating Complexity

For a Single Contraction

Contracting A[i,j] with B[j,k] → C[i,k]:

Time: i × j × k FLOPs
Space: max(i×j, j×k, i×k) elements
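
For example, with the chain dimensions used earlier (i=10, j=100, k=20):

import math

i, j, k = 10, 100, 20
tc = math.log2(i * j * k)                  # log2(20,000) ≈ 14.29
sc = math.log2(max(i * j, j * k, i * k))   # log2(2,000) ≈ 10.97
print(f"tc = {tc:.2f}, sc = {sc:.2f}")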

For a Contraction Tree

Sum time over all contractions, take max space:

def compute_complexity(tree):
    if tree.is_leaf():
        return Complexity(tc=0, sc=log2(tensor_size))
    
    # Recurse on children
    left_comp = compute_complexity(tree.left)
    right_comp = compute_complexity(tree.right)
    
    # Contraction cost
    contraction_tc = log2(flops_for_this_step)
    contraction_sc = log2(intermediate_size)
    
    # Combine
    tc = log2sumexp2([left_comp.tc, right_comp.tc, contraction_tc])
    sc = max(left_comp.sc, right_comp.sc, contraction_sc)
    
    return Complexity(tc=tc, sc=sc)
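
The log2sumexp2 helper adds FLOP counts without leaving the log domain; a minimal, numerically stable sketch:

import math

def log2sumexp2(xs):
    # log2(sum(2**x for x in xs)), stabilized by factoring out the maximum
    m = max(xs)
    return m + math.log2(sum(2 ** (x - m) for x in xs))

print(log2sumexp2([10.0, 10.0]))  # 11.0: two 2^10-FLOP steps total 2^11 FLOPs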

Practical Interpretation

Time Complexity

| tc | FLOPs | CPU Time (1 TFLOP/s) | GPU Time (10 TFLOP/s) |
|----|-------|----------------------|------------------------|
| 20 | 1M | 0.001 ms | 0.0001 ms |
| 30 | 1B | 1 ms | 0.1 ms |
| 40 | 1T | 1 second | 0.1 second |
| 50 | 1P | 17 minutes | 1.7 minutes |

Space Complexity

| sc | Elements | float32 Memory | float64 Memory |
|----|----------|----------------|----------------|
| 20 | 1M | 4 MB | 8 MB |
| 25 | 32M | 128 MB | 256 MB |
| 30 | 1B | 4 GB | 8 GB |
| 35 | 32B | 128 GB | 256 GB |

Optimization Goals

Minimize Time (Speed-Critical)

score = ScoreFunction(tc_weight=1.0, sc_weight=0.0, sc_target=float('inf'))

Use when:

  • Memory is abundant
  • Need fastest execution
  • Real-time applications

Minimize Space (Memory-Limited)

score = ScoreFunction(tc_weight=0.0, sc_weight=1.0, sc_target=25.0)

Use when:

  • Memory is constrained (e.g., 8GB GPU)
  • Can afford longer computation
  • Embedded systems

Balanced (Most Common)

score = ScoreFunction(tc_weight=1.0, sc_weight=1.0, sc_target=28.0)

Use when:

  • Both time and space matter
  • Setting sc_target to available memory
  • General-purpose optimization

Example: Comparing Orders

# Greedy result
tree_greedy = optimize_code(ixs, out, sizes, GreedyMethod())
comp_greedy = tree_greedy.complexity(ixs, sizes)

# TreeSA result
tree_sa = optimize_code(ixs, out, sizes, TreeSA.fast())
comp_sa = tree_sa.complexity(ixs, sizes)

# Compare
print(f"Greedy - Time: 2^{comp_greedy.tc:.2f}, Space: 2^{comp_greedy.sc:.2f}")
print(f"TreeSA - Time: 2^{comp_sa.tc:.2f}, Space: 2^{comp_sa.sc:.2f}")

# Speedup
speedup = 2 ** (comp_greedy.tc - comp_sa.tc)
print(f"TreeSA is {speedup:.1f}x faster")

Next Steps

Algorithms

omeco provides two main optimization algorithms with different speed-quality trade-offs.

Algorithm Comparison

| Algorithm | Speed | Quality | Use Case |
|--------------|------------------|---------|--------------------------------------------|
| GreedyMethod | Fast (seconds) | Good | Quick optimization, large networks |
| TreeSA | Slower (minutes) | Better | High-quality solutions, important workloads |

Quick Guide

Use GreedyMethod when:

  • You need results quickly
  • Network has <100 tensors
  • Greedy result is good enough

Use TreeSA when:

  • You have time to optimize
  • Need best possible solution
  • Greedy result is too slow/large
  • Working with complex tensor networks

Topics

Next Steps

Choose an algorithm to learn more, or see the comparison for benchmarks.

Greedy Method

Fast greedy algorithm for tensor contraction order optimization.

How It Works

The greedy algorithm repeatedly contracts the pair of tensors with minimum cost until one tensor remains.

Algorithm:

while more than one tensor remains:
    1. Consider all pairs of tensors
    2. Compute cost of contracting each pair
    3. Contract the pair with minimum cost
    4. Replace the pair with their contraction result

Time Complexity: O(n² log n) where n is the number of tensors.
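
A single greedy step can be sketched in a few lines (an educational cost model; omeco's internal cost function and data structures differ):

import math

def contraction_flops(ix_a, ix_b, sizes):
    # Every index in the union of both tensors is looped over exactly once
    return math.prod(sizes[i] for i in set(ix_a) | set(ix_b))

def cheapest_pair(ixs, sizes):
    pairs = [(a, b) for a in range(len(ixs)) for b in range(a + 1, len(ixs))]
    return min(pairs, key=lambda p: contraction_flops(ixs[p[0]], ixs[p[1]], sizes))

# A[100×10] × B[10×20] × C[20×5]: contracting B and C first is cheapest
print(cheapest_pair([[0, 1], [1, 2], [2, 3]], {0: 100, 1: 10, 2: 20, 3: 5}))  # (1, 2)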

Basic Usage

Python

from omeco import optimize_code, GreedyMethod

# Deterministic greedy (default optimizer)
tree = optimize_code(ixs, out, sizes)

# Or explicitly
tree = optimize_code(ixs, out, sizes, GreedyMethod())

Rust

use omeco::{EinCode, GreedyMethod, optimize_code};
use std::collections::HashMap;

let code = EinCode::new(vec![vec![0, 1], vec![1, 2], vec![2, 3]], vec![0, 3]);
let sizes = HashMap::from([(0, 10), (1, 100), (2, 20), (3, 5)]);

let method = GreedyMethod::default();
let tree = optimize_code(&code, &sizes, &method).unwrap();

Stochastic Variants

Add randomness to explore more solutions:

# alpha: controls randomness (0 = deterministic, 1 = fully random)
# temperature: softmax temperature for selection
method = GreedyMethod(alpha=0.5, temperature=1.0)
tree = optimize_code(ixs, out, sizes, method)

Parameters:

  • alpha=0.0 (default): Always pick minimum cost (deterministic)
  • alpha=0.5: Mix of greedy and random choices
  • alpha=1.0: Uniform random selection
  • temperature: Controls selection distribution (higher = more random)

Performance Characteristics

Advantages:

  • ⚡ Very fast: seconds for 100+ tensors
  • 🎯 Deterministic by default (reproducible)
  • 📈 Scales well to large networks
  • 💡 Good baseline for most cases

Limitations:

  • 🎲 Can get stuck in local optima
  • 🔍 Myopic: only considers immediate cost
  • 📊 May miss global optimal solution

Example: Matrix Chain

from omeco import optimize_code

# A[100×10] × B[10×20] × C[20×5]
ixs = [[0, 1], [1, 2], [2, 3]]
out = [0, 3]
sizes = {0: 100, 1: 10, 2: 20, 3: 5}

tree = optimize_code(ixs, out, sizes)
print(tree)

Output:

ab, bd -> ad
├─ tensor_0
└─ bc, cd -> bd
   ├─ tensor_1
   └─ tensor_2

This contracts B×C first (cost: 10×20×5 = 1,000), then A×(BC) (cost: 100×10×5 = 5,000). Total: 6,000 FLOPs.

Alternative order (A×B)×C would cost: 100×10×20 + 100×20×5 = 30,000 FLOPs (5x worse!).

When to Use

Use GreedyMethod when:

  • You need quick results (prototyping, iteration)
  • Network is straightforward (chains, grids)
  • Memory/time constraints are relaxed

Consider TreeSA instead when:

  • Greedy result is too slow/large
  • Network is complex (irregular graphs)
  • You have time for better optimization
  • Result will be used repeatedly

Tips

  1. Start with default: GreedyMethod() works for most cases

  2. Try stochastic for variety:

    # Run 10 times with randomness, pick best
    best_tree = None
    best_complexity = float('inf')
    
    for _ in range(10):
        tree = optimize_code(ixs, out, sizes, GreedyMethod(alpha=0.3))
        complexity = tree.complexity(ixs, sizes)
        if complexity.tc < best_complexity:
            best_tree = tree
            best_complexity = complexity.tc
    
  3. Combine with slicing if memory is tight:

    tree = optimize_code(ixs, out, sizes)
    if tree.complexity(ixs, sizes).sc > 25.0:
        sliced = slice_code(tree, ixs, sizes, TreeSASlicer.fast())
    

Next Steps

TreeSA

Simulated annealing optimizer for high-quality tensor contraction orders.

How It Works

TreeSA uses simulated annealing to explore the space of contraction trees:

  1. Start with a random or greedy tree
  2. Repeatedly apply random mutations (subtree rotations, swaps)
  3. Accept better mutations, sometimes accept worse ones (to escape local optima)
  4. Gradually decrease “temperature” to converge

Result: Often finds significantly better orders than greedy.
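
Step 3 is the standard Metropolis acceptance rule; a sketch (illustrative, not omeco's internals):

import math
import random

def accept(delta_score, beta):
    # Accept a mutation that changes the score by delta_score at inverse temperature beta
    if delta_score <= 0:
        return True  # improvement: always accept
    return random.random() < math.exp(-beta * delta_score)  # worse: accept sometimes

# Low beta (hot): the search wanders freely; high beta (cold): it converges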

Basic Usage

from omeco import optimize_code, TreeSA

# Quick optimization with sensible defaults
tree = optimize_code(ixs, out, sizes, TreeSA.fast())

Custom Configuration

from omeco import optimize_code, TreeSA, ScoreFunction

# More thorough search
score = ScoreFunction(sc_target=25.0)
optimizer = TreeSA(
    ntrials=10,      # Number of independent runs
    niters=50,       # Iterations per run
    score=score      # Custom scoring function
)
tree = optimize_code(ixs, out, sizes, optimizer)

Rust

use omeco::{optimize_code, ScoreFunction, TreeSA};

// `code` and `sizes` as in the greedy example above
let score = ScoreFunction::default();
let optimizer = TreeSA::fast(score);
let tree = optimize_code(&code, &sizes, &optimizer).unwrap();

Parameters

ntrials (default: 10)

Number of independent optimization runs. The best result is returned.

  • Higher = better quality (more exploration)
  • Lower = faster
  • Typical range: 5-20

niters (default: 50)

Number of iterations per trial.

  • Higher = better convergence
  • Lower = faster
  • Typical range: 20-100

betas (optional)

Temperature schedule for simulated annealing.

  • Default: logarithmic schedule from 1 to 40
  • Custom: [β₁, β₂, ..., βₙ] where β = 1/temperature
  • Higher β = less randomness (exploitation)
  • Lower β = more randomness (exploration)

score (optional)

Custom ScoreFunction for hardware-specific optimization.

# GPU optimization (see GPU Optimization Guide)
score = ScoreFunction(
    tc_weight=1.0,
    sc_weight=1.0,
    rw_weight=10.0,
    sc_target=30.0
)
optimizer = TreeSA(ntrials=5, niters=50, score=score)

Performance Characteristics

Advantages:

  • 🏆 Higher quality solutions than greedy
  • 🔍 Explores global search space
  • ⚙️ Configurable time-quality trade-off
  • 🎯 Often finds near-optimal solutions

Limitations:

  • ⏱️ Slower than greedy (minutes vs seconds)
  • 🎲 Non-deterministic (different runs give different results)
  • 🔧 Requires parameter tuning for best results

Example Comparison

from omeco import optimize_code, GreedyMethod, TreeSA

# Same problem, different optimizers
ixs = [[0, 1, 2], [2, 3, 4], [4, 5, 6], [6, 7, 0]]  # Cycle
out = [1, 3, 5, 7]
sizes = {i: 2 for i in range(8)}

# Greedy
tree_greedy = optimize_code(ixs, out, sizes, GreedyMethod())
comp_greedy = tree_greedy.complexity(ixs, sizes)

# TreeSA
tree_sa = optimize_code(ixs, out, sizes, TreeSA.fast())
comp_sa = tree_sa.complexity(ixs, sizes)

print(f"Greedy - tc: 2^{comp_greedy.tc:.2f}, sc: 2^{comp_greedy.sc:.2f}")
print(f"TreeSA - tc: 2^{comp_sa.tc:.2f}, sc: 2^{comp_sa.sc:.2f}")

# Speedup
speedup = 2 ** (comp_greedy.tc - comp_sa.tc)
print(f"TreeSA is {speedup:.1f}x faster")

Typical output:

Greedy - tc: 2^16.00, sc: 2^12.00
TreeSA - tc: 2^14.58, sc: 2^11.00
TreeSA is 2.7x faster

When to Use

Use TreeSA when:

  • Greedy result is too slow/uses too much memory
  • Network is complex (cycles, irregular graphs)
  • Result will be used many times (worth optimizing once)
  • You have minutes to spare for optimization

Stick with GreedyMethod when:

  • Need results in seconds
  • Network is simple (chains, small grids)
  • Iterating/prototyping
  • Greedy is already good enough

Optimization Workflow

# 1. Quick baseline with greedy
tree_greedy = optimize_code(ixs, out, sizes, GreedyMethod())
comp_greedy = tree_greedy.complexity(ixs, sizes)
print(f"Greedy: tc={comp_greedy.tc:.2f}, sc={comp_greedy.sc:.2f}")

# 2. If not satisfactory, try TreeSA
if comp_greedy.sc > 28.0:  # Too much memory
    score = ScoreFunction(sc_target=28.0, sc_weight=2.0)
    tree_sa = optimize_code(ixs, out, sizes, TreeSA(ntrials=10, score=score))
    comp_sa = tree_sa.complexity(ixs, sizes)
    print(f"TreeSA: tc={comp_sa.tc:.2f}, sc={comp_sa.sc:.2f}")

Tuning Parameters

For Speed

# Fast TreeSA (2-3x slower than greedy)
TreeSA(ntrials=2, niters=20)

For Quality

# Thorough TreeSA (10-20x slower than greedy)
TreeSA(ntrials=20, niters=100)

For Memory-Constrained

# Prioritize space over time
score = ScoreFunction(tc_weight=0.1, sc_weight=10.0, sc_target=25.0)
TreeSA(ntrials=10, niters=50, score=score)

Advanced: Custom Temperature Schedule

# Custom annealing schedule
import numpy as np

# Slow cooling for thorough search
betas = np.logspace(0, 2, 50)  # 1 to 100

optimizer = TreeSA(ntrials=5, betas=betas.tolist())

Next Steps

Algorithm Comparison

Detailed comparison of GreedyMethod vs TreeSA.

Quick Summary

| Metric | GreedyMethod | TreeSA |
|------------------|---------------|----------------|
| Speed | Seconds | Minutes |
| Solution Quality | Good | Better |
| Deterministic | Yes (default) | No |
| Scalability | 100+ tensors | 50-100 tensors |
| Tuning Required | Minimal | Some |

Performance Benchmarks

Based on benchmark results from the Julia implementation (OMEinsumContractionOrders.jl):

3-Regular Graph (50 vertices, bond dim 2)

| Algorithm | Time Complexity (tc) | Space Complexity (sc) | Runtime |
|-------------------|----------------------|------------------------|---------|
| GreedyMethod | 2^34.1 | 2^11.4 | 0.05s |
| TreeSA (fast) | 2^33.8 | 2^10.9 | 2.1s |
| TreeSA (thorough) | 2^33.5 | 2^10.8 | 8.4s |

Result: TreeSA finds solutions that execute ~1.5x faster and use ~1.5x less memory, at the cost of 40-170x more optimization time.

Matrix Chain (20 matrices)

| Algorithm | Time Complexity | Space Complexity | Runtime |
|--------------|-----------------|------------------|---------|
| GreedyMethod | 2^18.2 | 2^10.5 | 0.01s |
| TreeSA | 2^17.9 | 2^10.3 | 0.3s |

Result: TreeSA's solution executes ~1.2x faster, at the cost of 30x more optimization time.

Speed vs Quality Trade-off

                Quality
                   ↑
                   │
         TreeSA    │
        (thorough) │
                   │
         TreeSA    │
         (fast)    │
                   │
        Greedy     │
       (stochastic)│
                   │
        Greedy     │
      (default)    │
                   │
                   └──────────────→ Speed

When Each Algorithm Shines

GreedyMethod Excels

Problem Type: Chains and trees

# Chain: A×B×C×D×E
ixs = [[0,1], [1,2], [2,3], [3,4]]
out = [0,4]

# Greedy finds optimal order in O(n²)
tree = optimize_code(ixs, out, sizes, GreedyMethod())

Characteristics:

  • Sequential structure
  • No cycles
  • Clear optimal pairing strategy

TreeSA Excels

Problem Type: Cycles and complex graphs

# Cycle graph (much harder!)
ixs = [[0,1,2], [2,3,4], [4,5,6], [6,7,0]]
out = [1,3,5,7]

# TreeSA explores better orderings
tree = optimize_code(ixs, out, sizes, TreeSA.fast())

Characteristics:

  • Irregular connectivity
  • Hyperedges (indices in 3+ tensors)
  • Multiple reasonable orderings

Real-World Example: Quantum Circuit

Problem: 30-qubit quantum circuit, 50 gates

# Greedy result
tree_greedy = optimize_code(circuit_ixs, out, sizes, GreedyMethod())
comp_greedy = contraction_complexity(tree_greedy, circuit_ixs, sizes)
# tc: 2^42.3, sc: 2^28.1, time: 0.2s

# TreeSA result
tree_sa = optimize_code(circuit_ixs, out, sizes, TreeSA(ntrials=10, niters=50))
comp_sa = contraction_complexity(tree_sa, circuit_ixs, sizes)
# tc: 2^40.1, sc: 2^26.8, time: 15s

# Improvement
speedup = 2 ** (comp_greedy.tc - comp_sa.tc)  # 4.6x faster execution
memory_reduction = 2 ** (comp_greedy.sc - comp_sa.sc)  # 2.5x less memory

Verdict: TreeSA took 75x longer to optimize but found a solution that’s 4.6x faster to execute. If you’ll run the circuit 100+ times, TreeSA pays for itself.

Optimization Time vs Problem Size

GreedyMethod Scaling

| Tensors | Optimization Time |
|---------|-------------------|
| 10 | <0.01s |
| 50 | 0.05s |
| 100 | 0.2s |
| 500 | 4s |

Complexity: O(n² log n)

TreeSA Scaling

| Tensors | TreeSA.fast() | TreeSA (default) |
|---------|---------------|------------------|
| 10 | 0.1s | 0.5s |
| 50 | 2s | 10s |
| 100 | 15s | 60s |
| 200 | 90s | 300s |

Complexity: O(ntrials × niters × n²)

Decision Guide

Start
  │
  ├─ Need results in <1 second? ──→ GreedyMethod
  │
  ├─ Problem is a chain/tree? ──→ GreedyMethod
  │
  ├─ Greedy result good enough? ──→ GreedyMethod
  │
  ├─ Have >1 minute to optimize? ──→ TreeSA
  │
  ├─ Complex graph structure? ──→ TreeSA
  │
  └─ Need best possible solution? ──→ TreeSA (thorough)

Hybrid Approach

Use both in sequence:

# 1. Quick baseline with greedy
tree_greedy = optimize_code(ixs, out, sizes, GreedyMethod())
comp_greedy = tree_greedy.complexity(ixs, sizes)
print(f"Greedy baseline: tc={comp_greedy.tc:.2f}")

# 2. If not good enough, refine with TreeSA
if comp_greedy.tc > 35.0:  # Too slow
    print("Refining with TreeSA...")
    tree_sa = optimize_code(ixs, out, sizes, TreeSA.fast())
    comp_sa = tree_sa.complexity(ixs, sizes)
    print(f"TreeSA result: tc={comp_sa.tc:.2f}")

    improvement = 2 ** (comp_greedy.tc - comp_sa.tc)
    print(f"Improvement: {improvement:.1f}x speedup")

Summary Recommendations

| Scenario | Algorithm | Configuration |
|-----------------------|------------------|--------------------------|
| Quick prototyping | GreedyMethod | Default |
| Production (simple) | GreedyMethod | Default |
| Production (complex) | TreeSA | TreeSA.fast() |
| Critical optimization | TreeSA | ntrials=20, niters=100 |
| Memory-constrained | TreeSA + Slicing | Custom ScoreFunction |

Next Steps

Score Function Configuration

Configure omeco’s optimization for different hardware and use cases.

What is the Score Function?

The ScoreFunction controls how the optimizer balances three objectives:

  • tc (time complexity): Total FLOPs
  • sc (space complexity): Peak memory usage
  • rwc (read-write complexity): Memory I/O operations

Formula:

score = tc_weight × 2^tc + rw_weight × 2^rwc + sc_weight × max(0, 2^sc - 2^sc_target)

Lower score is better. The optimizer tries to minimize this score.
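
Spelled out in Python, the formula reads as follows (a direct transcription of the formula above, not the library's internal code):

def score(tc, sc, rwc, tc_weight=1.0, sc_weight=1.0, rw_weight=0.0, sc_target=20.0):
    return (tc_weight * 2**tc
            + rw_weight * 2**rwc
            + sc_weight * max(0.0, 2**sc - 2**sc_target))

# Memory below sc_target is free; only the excess over 2^sc_target is penalized
print(score(tc=30.0, sc=18.0, rwc=25.0))  # 2^30: sc is under target, rw_weight is 0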

Basic Usage

Default (Balanced)

from omeco import ScoreFunction

# Balanced optimization (works for most cases)
score = ScoreFunction()
# Equivalent to:
# ScoreFunction(tc_weight=1.0, sc_weight=1.0, rw_weight=0.0, sc_target=20.0)

Custom Weights

# Prioritize memory over time
score = ScoreFunction(
    tc_weight=1.0,      # Time matters
    sc_weight=2.0,      # Memory matters 2x more
    rw_weight=0.0,      # Ignore I/O (CPU)
    sc_target=25.0      # Target 2^25 elements (~256MB for float64)
)

Using with Optimizers

from omeco import TreeSA, optimize_code

score = ScoreFunction(sc_target=28.0)
tree = optimize_code(ixs, out, sizes, TreeSA(score=score))

Hardware-Specific Configuration

CPU Optimization

Characteristics:

  • Memory bandwidth is not the bottleneck
  • Balance time and space
  • Ignore read-write complexity

score = ScoreFunction(
    tc_weight=1.0,
    sc_weight=1.0,
    rw_weight=0.0,      # ← CPU: I/O is not a bottleneck
    sc_target=28.0      # 2^28 elements ≈ 2 GiB for float64
)

Calculate sc_target:

import math

# For 16GB RAM, reserve half for tensors
available_gb = 8
bytes_per_element = 8  # float64
sc_target = math.log2(available_gb * 1024**3 / bytes_per_element)
# sc_target ≈ 30.0 (8GB)

GPU Optimization

Note: GPU memory bandwidth is often the bottleneck, not compute.

score = ScoreFunction(
    tc_weight=1.0,
    sc_weight=1.0,
    rw_weight=10.0,     # See GPU Optimization Guide for tuning
    sc_target=30.0      # 2^30 elements × 4 bytes = 4 GiB, leaves headroom on an 8GB GPU
)

For GPU optimization: See GPU Optimization Guide for rw_weight tuning methodology.

Calculate sc_target for GPU:

import math

# NVIDIA RTX 3090: 24GB VRAM
gpu_gb = 24
bytes_per_element = 4  # float32 (most common on GPU)
sc_target = math.log2(gpu_gb * 1024**3 / bytes_per_element)
# sc_target ≈ 32.5

# Be conservative (leave room for framework overhead)
sc_target = 32.0  # ~16GB usable

Decision Guide

Choose based on your bottleneck:

  1. Default (CPU, balanced): Use ScoreFunction() with defaults

    • Works for most CPU scenarios
    • Balances time and memory
  2. GPU: See GPU Optimization Guide for complete rw_weight tuning methodology

  3. Memory constrained: Increase sc_weight and lower sc_target

    • Example: ScoreFunction(sc_weight=3.0, sc_target=25.0)
    • Accepts slower execution to fit in memory
  4. Need slicing: See Slicing Strategy Guide

    • Use when contraction exceeds available memory
    • Trades time for space

Advanced Configuration

Memory-Constrained Optimization

When memory is the primary constraint:

# Heavily penalize exceeding target
score = ScoreFunction(
    tc_weight=0.1,      # Time is secondary
    sc_weight=10.0,     # Memory is critical
    rw_weight=0.0,
    sc_target=25.0      # Hard limit: 256MB
)

# Use with TreeSA for best results
optimizer = TreeSA(ntrials=10, niters=50, score=score)
tree = optimize_code(ixs, out, sizes, optimizer)

Speed-Critical Optimization

When execution speed is paramount:

# Minimize time complexity only
score = ScoreFunction(
    tc_weight=1.0,
    sc_weight=0.0,      # Ignore memory
    rw_weight=0.0,
    sc_target=float('inf')  # No memory limit
)

Hybrid CPU+GPU

For heterogeneous systems:

# Moderate I/O penalty (between CPU and GPU)
score = ScoreFunction(
    tc_weight=1.0,
    sc_weight=1.0,
    rw_weight=10.0,     # Moderate I/O cost
    sc_target=29.0      # 4GB limit
)

sc_target Reference

| Memory | float32 | float64 | sc_target |
|--------|--------------|--------------|-----------|
| 256 MB | 64M elements | 32M elements | 25.0-26.0 |
| 1 GB | 256M | 128M | 27.0-28.0 |
| 4 GB | 1B | 512M | 29.0-30.0 |
| 8 GB | 2B | 1B | 30.0-31.0 |
| 16 GB | 4B | 2B | 31.0-32.0 |
| 32 GB | 8B | 4B | 32.0-33.0 |

Tuning Workflow

  1. Start with defaults:

    tree = optimize_code(ixs, out, sizes)
    complexity = tree.complexity(ixs, sizes)
    print(f"tc: {complexity.tc:.2f}, sc: {complexity.sc:.2f}")
    
  2. Identify bottleneck:

    • If sc too high → increase sc_weight or lower sc_target
    • If tc too high → try TreeSA with default score
    • If running on GPU → see GPU Optimization Guide
  3. Adjust and re-optimize:

    score = ScoreFunction(sc_weight=2.0, sc_target=28.0)
    tree = optimize_code(ixs, out, sizes, TreeSA(score=score))
    
  4. Verify improvement:

    new_complexity = tree.complexity(ixs, sizes)
    print(f"New tc: {new_complexity.tc:.2f}, sc: {new_complexity.sc:.2f}")
    

Examples

See examples/score_function_examples.py for complete examples:

  • CPU optimization
  • GPU optimization (experimental rw_weight tuning)
  • Memory-limited environments
  • Dynamic sc_target calculation

Next Steps

GPU Optimization

Maximize performance when running tensor contractions on GPUs.

The GPU Bottleneck

GPUs have 10-100x higher compute-to-memory-bandwidth ratio than CPUs. Memory I/O is the bottleneck.

Configuration

from omeco import ScoreFunction, TreeSA, optimize_code

# GPU optimization
score = ScoreFunction(
    tc_weight=1.0,
    sc_weight=1.0,
    rw_weight=10.0,     # Tune based on profiling (see below)
    sc_target=30.0      # Your GPU memory limit
)

tree = optimize_code(ixs, out, sizes, TreeSA(score=score))

Tuning rw_weight:

No established “best” value—tune empirically:

  1. Start with rw_weight=1.0
  2. Try 10.0, then 20.0 or higher
  3. Profile actual GPU execution time
  4. Adjust based on results

Range: 0.1 to >20 depending on GPU model, tensor sizes, and precision.

Calculate sc_target for Your GPU

import math

# NVIDIA A100: 40GB
gpu_gb = 40
bytes_per_element = 4  # float32

# Maximum usable memory (leave ~20% for overhead)
usable_gb = gpu_gb * 0.8
sc_target = math.log2(usable_gb * 1024**3 / bytes_per_element)
# sc_target ≈ 33.0

score = ScoreFunction(
    tc_weight=1.0,
    sc_weight=1.0,
    rw_weight=10.0,
    sc_target=sc_target
)

GPU Memory Reference

| GPU | VRAM | sc_target (float32) |
|-------------|------|---------------------|
| RTX 3060 | 12GB | 31.0 |
| RTX 3090 | 24GB | 32.2 |
| A100 (40GB) | 40GB | 32.9 |
| A100 (80GB) | 80GB | 33.9 |
| H100 | 80GB | 33.9 |

When It Matters

High impact: Complex networks (quantum circuits, GNNs, irregular contractions)

Low impact: Simple chains, matrix multiplications, small networks (<10 tensors)

Common Mistakes

  • ❌ Using rw_weight=0.0 (CPU default) on GPU
  • ❌ Wrong sc_target (e.g., using CPU memory size for GPU)
  • ❌ Using GreedyMethod instead of TreeSA (greedy doesn’t optimize for I/O)

Checklist

  1. Calculate sc_target from GPU memory (float32: 4 bytes/element, reserve 20% overhead)
  2. Set rw_weight (start with 1.0, try 10.0 and 20.0, profile)
  3. Use TreeSA optimizer (not greedy)

Next Steps

Slicing Strategy

Reduce memory usage by slicing tensor indices, trading time for space.

What is Slicing?

Slicing breaks a tensor contraction into smaller chunks by looping over one or more indices.

Example:

# Original: C[i,k] = Σⱼ A[i,j] * B[j,k]
# Too large: j has size 10000

# Sliced: loop over j one value at a time (pseudocode)
C = zeros(i, k)
for j_val in range(10000):
    C += outer(A[:, j_val], B[j_val, :])  # rank-1 update per value of j

Trade-off:

  • ✅ Memory reduced: Only store one slice at a time
  • ❌ Time increased: Repeat computation for each slice

When to Use Slicing

Use slicing when:

  • Contraction exceeds available memory
  • sc (space complexity) is too high
  • Can’t reduce memory by changing contraction order alone

Decision flowchart:

Optimize contraction order
    ↓
Check space complexity
    ↓
sc > sc_target?
  ├─ No  → Done
  └─ Yes → Apply slicing

Basic Usage

Automatic Slicing

from omeco import optimize_code, slice_code, TreeSASlicer, ScoreFunction

# 1. Optimize contraction order
tree = optimize_code(ixs, out, sizes)

# 2. Check if memory is too large
complexity = tree.complexity(ixs, sizes)
if complexity.sc > 28.0:  # More than 2^28 elements (~2 GiB for float64)
    # 3. Slice to reduce memory
    score = ScoreFunction(sc_target=28.0)
    slicer = TreeSASlicer.fast(score=score)
    sliced = slice_code(tree, ixs, sizes, slicer)
    
    # 4. Verify reduction
    sliced_comp = sliced.complexity(ixs, sizes)
    print(f"Original: 2^{complexity.sc:.2f} elements")
    print(f"Sliced: 2^{sliced_comp.sc:.2f} elements")
    print(f"Sliced indices: {sliced.slicing()}")

Manual Slicing

Specify which indices to slice:

# Force slicing on specific indices
slicer = TreeSASlicer.fast(fixed_slices=[1, 2])
sliced = slice_code(tree, ixs, sizes, slicer)

# Verify that indices 1 and 2 are sliced
assert 1 in sliced.slicing()
assert 2 in sliced.slicing()

TreeSASlicer Configuration

Fast Preset

# Quick slicing optimization
slicer = TreeSASlicer.fast(score=ScoreFunction(sc_target=25.0))
sliced = slice_code(tree, ixs, sizes, slicer)

Custom Configuration

from omeco import TreeSASlicer, ScoreFunction

score = ScoreFunction(
    tc_weight=0.1,      # Accept higher time
    sc_weight=10.0,     # Prioritize space
    sc_target=25.0      # Target 256MB
)

slicer = TreeSASlicer(
    ntrials=10,          # Number of optimization runs
    niters=50,           # Iterations per run
    fixed_slices=[1],    # Force slicing on index 1
    score=score
)

sliced = slice_code(tree, ixs, sizes, slicer)

Understanding the Trade-off

Example: Large Matrix Multiplication

# A[10000 × 10000] × B[10000 × 10000]
ixs = [[0, 1], [1, 2]]
out = [0, 2]
sizes = {0: 10000, 1: 10000, 2: 10000}

# Without slicing
tree = optimize_code(ixs, out, sizes)
comp = tree.complexity(ixs, sizes)
# tc ≈ 33.2 (10^10 FLOPs)
# sc ≈ 26.6 (100M elements = 800MB for float64)

# With slicing on index 1 (middle dimension)
slicer = TreeSASlicer.fast(fixed_slices=[1])
sliced = slice_code(tree, ixs, sizes, slicer)
sliced_comp = sliced.complexity(ixs, sizes)
# tc ≈ 33.2 (same FLOPs, distributed over slices)
# sc ≈ 23.3 (10M elements = 80MB for float64)

# Result: 10x memory reduction, same total time (but distributed)

Choosing Which Indices to Slice

Automatic Selection

TreeSASlicer automatically chooses which indices to slice:

# Let the optimizer decide
slicer = TreeSASlicer.fast(score=ScoreFunction(sc_target=25.0))
sliced = slice_code(tree, ixs, sizes, slicer)

print(f"Automatically sliced indices: {sliced.slicing()}")

Algorithm: Tries slicing different indices and picks the combination that best satisfies sc_target while minimizing tc increase.

Manual Selection Guidelines

Good candidates for slicing:

  • Indices with large dimensions
  • Indices that appear in many tensors (high degree)
  • Indices not in the output (no outer loop needed)

Poor candidates:

  • Output indices (can’t slice without changing result structure)
  • Small dimensions (little memory benefit)
  • Indices that would cause many small intermediate tensors
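
These guidelines can be turned into a quick ranking heuristic; a hypothetical helper, not part of omeco:

from collections import Counter

def slice_candidates(ixs, out, sizes):
    # Rank non-output indices by degree (how many tensors share them), then by size
    degree = Counter(i for ix in ixs for i in ix)
    inner = [i for i in degree if i not in out]
    return sorted(inner, key=lambda i: (degree[i], sizes[i]), reverse=True)

ixs = [[0, 1, 2], [2, 3, 4], [4, 5, 6], [6, 7, 0]]
print(slice_candidates(ixs, out=[1, 3, 5, 7], sizes={i: 32 for i in range(8)}))
# [0, 2, 4, 6]: the shared cycle indices; all tie here, so any is a fair pick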

Advanced: Slicing with GPU

For GPU workloads, consider both memory and I/O:

# GPU slicing: balance memory and I/O
score = ScoreFunction(
    tc_weight=1.0,
    sc_weight=2.0,      # Space is critical on GPU
    rw_weight=0.1,      # Experimental: tune empirically
    sc_target=30.0      # 8GB GPU limit
)

slicer = TreeSASlicer(
    ntrials=10,
    niters=50,
    score=score
)

sliced = slice_code(tree, ixs, sizes, slicer)

Note: Slicing may increase rwc (read-write complexity) because data is loaded multiple times. The optimizer balances this with the rw_weight parameter.

Example Workflow

from omeco import optimize_code, slice_code
from omeco import TreeSA, TreeSASlicer, ScoreFunction

# Problem setup
ixs = [[0,1,2], [2,3,4], [4,5,6], [6,7,0]]  # Cycle graph
out = [1,3,5,7]
sizes = {i: 32 for i in range(8)}

# Step 1: Optimize contraction order
tree = optimize_code(ixs, out, sizes, TreeSA.fast())
comp = tree.complexity(ixs, sizes)
print(f"Optimized: sc = 2^{comp.sc:.2f}")

# Step 2: Check if slicing needed
if comp.sc > 20.0:  # Exceeds 2^20 elements (8 MiB for float64)
    print("Applying slicing...")
    
    # Step 3: Slice
    slicer = TreeSASlicer.fast(score=ScoreFunction(sc_target=18.0))
    sliced = slice_code(tree, ixs, sizes, slicer)
    
    # Step 4: Verify
    sliced_comp = sliced.complexity(ixs, sizes)
    print(f"Sliced: sc = 2^{sliced_comp.sc:.2f}")
    print(f"Sliced indices: {sliced.slicing()}")
    
    # Calculate overhead
    time_overhead = 2 ** (sliced_comp.tc - comp.tc)
    memory_reduction = 2 ** (comp.sc - sliced_comp.sc)
    print(f"Time overhead: {time_overhead:.1f}x")
    print(f"Memory reduction: {memory_reduction:.1f}x")

Multiple Slices

Slice multiple indices for extreme memory reduction:

# Very tight memory constraint
score = ScoreFunction(
    tc_weight=0.01,     # Accept much higher time
    sc_weight=100.0,    # Memory is critical
    sc_target=15.0      # Only 2^15 ≈ 32K elements (256 KB for float64)!
)

slicer = TreeSASlicer(ntrials=20, niters=100, score=score)
sliced = slice_code(tree, ixs, sizes, slicer)

# May slice multiple indices
print(f"Sliced {len(sliced.slicing())} indices: {sliced.slicing()}")

Performance Impact

| Scenario | Memory Reduction | Time Overhead | When to Use |
|------------------|------------------|---------------|----------------------------|
| Slice 1 index | 2-10x | 1-2x | Moderate memory constraint |
| Slice 2 indices | 10-100x | 2-5x | Severe memory constraint |
| Slice 3+ indices | 100-1000x | 5-20x | Extreme memory constraint |

Tips

  1. Optimize order first, slice second:

    # Good workflow
    tree = optimize_code(ixs, out, sizes, TreeSA.fast())
    sliced = slice_code(tree, ixs, sizes, TreeSASlicer.fast())
    
  2. Set realistic sc_target:

    # Target 80% of available memory
    available_gb = 8
    sc_target = math.log2(available_gb * 0.8 * 1024**3 / 8)
    
  3. Profile actual memory usage:

    # Theoretical vs actual memory may differ
    # Always test with real data
    
  4. Consider time-memory trade-off:

    # For one-time computation: accept high memory
    # For repeated computation: slice to fit memory
    

Next Steps

PyTorch Integration

Use omeco to optimize tensor contractions in PyTorch.

Overview

omeco optimizes the contraction order, then you execute with PyTorch tensors.

Workflow:

  1. Define network structure (indices and sizes)
  2. Optimize with omeco → get contraction tree
  3. Execute tree with PyTorch tensors

Basic Example

import torch
from omeco import optimize_code

# Define tensor network structure
ixs = [[0, 1], [1, 2], [2, 3]]  # Matrix chain
out = [0, 3]
sizes = {0: 100, 1: 200, 2: 50, 3: 100}

# Optimize contraction order
tree = optimize_code(ixs, out, sizes)
print(f"Optimized tree:\n{tree}")

# Create PyTorch tensors
tensors = [
    torch.randn(100, 200),  # A[0,1]
    torch.randn(200, 50),   # B[1,2]
    torch.randn(50, 100),   # C[2,3]
]

# Execute contraction (see full example below)
result = execute_tree(tree.to_dict(), tensors)

Complete Example

See examples/pytorch_tensor_network_example.py for a full working example with:

  • Tree traversal and execution
  • Verification against direct einsum
  • 4×4 grid tensor network
  • Memory reduction via slicing

Converting Tree to Dict

tree_dict = tree.to_dict()

# Dictionary structure:
# Leaf: {"tensor_index": int}
# Node: {"args": [children], "eins": {"ixs": [[int]], "iy": [int]}}

Executing the Tree

def execute_tree(tree_dict, tensors):
    """Recursively execute contraction tree."""
    if "tensor_index" in tree_dict:
        # Leaf node - return tensor
        return tensors[tree_dict["tensor_index"]]
    
    # Internal node - execute children then contract
    args = [execute_tree(child, tensors) for child in tree_dict["args"]]
    eins = tree_dict["eins"]
    
    # Build einsum string
    ixs_str = [indices_to_str(ix) for ix in eins["ixs"]]
    iy_str = indices_to_str(eins["iy"])
    einsum_eq = f"{','.join(ixs_str)}->{iy_str}"
    
    # Execute with PyTorch
    return torch.einsum(einsum_eq, *args)

def indices_to_str(indices):
    """Convert integer indices to letters."""
    return ''.join(chr(ord('a') + i) for i in indices)
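
It is worth verifying the tree execution against a direct einsum once, using the matrix-chain tensors from the basic example above:

# Reference: contract everything in a single einsum call
ref = torch.einsum("ab,bc,cd->ad", *tensors)

# The optimized tree must agree up to floating-point noise
result = execute_tree(tree.to_dict(), tensors)
assert torch.allclose(result, ref, atol=1e-3)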

GPU Execution

# Move tensors to GPU
tensors_gpu = [t.cuda() for t in tensors]

# Execute on GPU
result = execute_tree(tree.to_dict(), tensors_gpu)

With Autograd

# Tensors with gradients
tensors = [torch.randn(100, 200, requires_grad=True) for _ in range(3)]

# Execute
result = execute_tree(tree.to_dict(), tensors)

# Backward pass
result.sum().backward()

# Gradients available
print(tensors[0].grad)

Next Steps

Performance Benchmarks

Real-world performance comparison with Julia implementation.

Benchmark Methodology

All benchmarks compare against OMEinsumContractionOrders.jl, the original Julia implementation.

Test Environment:

  • Benchmarks run via Python bindings (PyO3)
  • Configuration: ntrials=1, niters=50-100
  • See benchmarks/ directory for scripts

Rust vs Julia (TreeSA)

Comparison of TreeSA optimizer performance:

| Problem | Tensors | Indices | Rust (ms) | Julia (ms) | tc (Rust) | tc (Julia) | Speedup |
|----------|---------|---------|-----------|------------|-----------|------------|---------|
| chain_10 | 10 | 11 | 18.5 | 31.6 | 23.10 | 23.10 | 1.7x |
| grid_4x4 | 16 | 24 | 88.0 | 150.7 | 9.18 | 9.26 | 1.7x |
| grid_5x5 | 25 | 40 | 155.4 | 297.7 | 10.96 | 10.96 | 1.9x |
| reg3_250 | 250 | 372 | 2,435 | 5,099 | 48.00 | 47.17 | 2.1x |

Results:

  • Rust is 1.7-2.1x faster than Julia for TreeSA optimization
  • Both implementations find solutions with comparable time complexity (tc)
  • Solution quality is nearly identical between implementations

Greedy Method Performance

GreedyMethod is extremely fast for quick optimization:

| Problem | Tensors | Time (ms) | tc | sc |
|----------|---------|-----------|-------|-------|
| chain_10 | 10 | 0.04 | 23.10 | 13.29 |
| grid_4x4 | 16 | 0.12 | 9.54 | 5.0 |
| grid_5x5 | 25 | 0.23 | 11.28 | 6.0 |
| reg3_250 | 250 | 7.5 | 69.00 | 47.0 |

Key Observations:

  • Greedy is 100-300x faster than TreeSA
  • Greedy gives good results for small problems (chain_10, grids)
  • For large problems (reg3_250), TreeSA finds much better solutions:
    • Greedy: tc = 69.00
    • TreeSA: tc = 48.00
    • Improvement: 2^(69-48) = 2^21 = 2 million times faster execution!

Python Overhead

Python bindings via PyO3 add minimal overhead:

TreeSA (grid_4x4, 100 iterations):

  • Pure Rust backend: ~88ms
  • Python call overhead: <1ms (~1%)

Greedy (reg3_250):

  • Pure optimization: ~7.5ms
  • Python overhead: <0.5ms (~6%)

Conclusion: Python overhead is negligible, especially for TreeSA optimization.

Algorithm Comparison

When to use each algorithm:

| Scenario | Recommended | Reason |
|--------------------------------|---------------|----------------------------------------------|
| Quick prototyping | Greedy | Sub-millisecond optimization |
| Production (< 50 tensors) | Greedy | Fast enough, good quality |
| Production (50-200 tensors) | TreeSA.fast() | Better quality, still reasonable time |
| Large networks (> 200 tensors) | TreeSA | Significant quality improvement |
| Time-critical | Greedy | Predictable fast performance |
| Execution-critical | TreeSA | Better contraction order = faster execution |

Execution Time Savings

The time spent optimizing pays off during execution:

Example: reg3_250 network

  • Greedy optimization: 7.5ms
  • TreeSA optimization: 2,435ms (~2.4 seconds)
  • Extra optimization time: +2.4 seconds

But the execution improvement:

  • Greedy tc: 69.00 → ~2^69 FLOPs
  • TreeSA tc: 48.00 → ~2^48 FLOPs
  • Execution speedup: 2^21 = 2 million times faster

If executing even once, TreeSA optimization is worth it!
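
A back-of-envelope check of that claim, assuming a 10 GFLOP/s machine and the tc values from the tables above:

flops_per_second = 1e10

t_greedy = 2 ** 69.00 / flops_per_second  # ≈ 5.9e10 seconds: infeasible
t_treesa = 2 ** 48.00 / flops_per_second  # ≈ 2.8e4 seconds: about 8 hours
extra_optimization_s = 2.4                # extra seconds spent by TreeSA

print(f"greedy: {t_greedy:.1e} s, TreeSA: {t_treesa:.1e} s")
# 2.4 extra seconds of optimization turn an infeasible job into an 8-hour one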

Hardware Recommendations

| Tensor Count | RAM | CPU | Recommendation |
|--------------|-------|-----------|-------------------------------|
| < 20 | 4GB | Any | Greedy is sufficient |
| 20-100 | 8GB | 4+ cores | TreeSA.fast() for production |
| 100-500 | 16GB | 8+ cores | TreeSA with multiple trials |
| > 500 | 32GB+ | 16+ cores | TreeSA (may take minutes) |

Running Benchmarks

Reproduce the benchmarks yourself:

cd benchmarks

# Python (Rust via PyO3)
python benchmark_python.py

# Julia (original implementation)
julia --project=. benchmark_julia.jl

# Compare results
python -c "
import json
with open('results_rust_treesa.json') as f:
    rust = json.load(f)
with open('results_julia_treesa.json') as f:
    julia = json.load(f)

for problem in rust['results']:
    r = rust['results'][problem]['avg_ms']
    j = julia['results'][problem]['avg_ms']
    print(f'{problem}: Rust {r:.1f}ms, Julia {j:.1f}ms, Speedup {j/r:.2f}x')
"

Profiling Your Code

Python Profiling

import time
from omeco import optimize_code, TreeSA

# Time optimization
start = time.time()
tree = optimize_code(ixs, out, sizes, TreeSA.fast())
opt_time = time.time() - start

# Check complexity
comp = tree.complexity(ixs, sizes)

print(f"Optimization: {opt_time:.3f}s")
print(f"Time complexity: 2^{comp.tc:.2f} = {2**comp.tc:.2e} FLOPs")
print(f"Space complexity: 2^{comp.sc:.2f} elements")

# Estimate execution time (assuming 10 GFLOP/s CPU)
execution_seconds = 2**comp.tc / 1e10
print(f"Estimated execution: {execution_seconds:.1f}s @ 10 GFLOP/s")

Rust Profiling

# CPU profiling with flamegraph
cargo install flamegraph
cargo flamegraph --example your_example

# Criterion benchmarks (if available)
cargo bench

Performance Tips

  1. Start with Greedy: Always try Greedy first to get a baseline
  2. Use TreeSA for production: The extra optimization time pays off
  3. Increase ntrials for critical code: More trials = better solutions
  4. Profile execution time: Verify that better tc actually improves runtime
  5. Consider memory: Use sc_target if memory is constrained

Next Steps

Troubleshooting

Common issues and solutions when using omeco.

Installation Issues

Python: pip install fails

Problem: pip install omeco fails with compilation errors.

Solutions:

  1. Ensure Rust toolchain is installed:

    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
    rustup update
    
  2. Update pip and setuptools:

    pip install --upgrade pip setuptools wheel
    
  3. Install from source:

    git clone https://github.com/GiggleLiu/omeco
    cd omeco/omeco-python
    pip install maturin
    maturin develop --release
    

Rust: Dependency resolution errors

Problem: Cargo fails to resolve dependencies.

Solution: Update Cargo.lock and rebuild:

cargo clean
cargo update
cargo build --release

Runtime Errors

ImportError: No module named ‘omeco’

Problem: Python cannot find the omeco module after installation.

Diagnosis:

import sys
print(sys.path)  # Check if package is in path

Solutions:

  1. Verify installation:

    pip show omeco
    
  2. Check virtual environment:

    which python  # Ensure you're using the right Python
    
  3. Reinstall in development mode:

    cd omeco-python
    maturin develop --release
    

Memory Errors (OOM)

Problem: Program crashes with “Out of Memory” error during optimization.

Cause: Space complexity (sc) exceeds available memory.

Solutions:

  1. Check space complexity:

    from omeco import optimize_code
    
    tree = optimize_code(ixs, out, sizes)
    comp = tree.complexity(ixs, sizes)
    print(f"Space complexity: 2^{comp.sc:.2f} elements")
    
    # Calculate memory usage
    bytes_per_element = 8  # float64
    memory_gib = (2 ** comp.sc) * bytes_per_element / 1024**3
    print(f"Estimated memory: {memory_gib:.2f} GiB")
    
  2. Use slicing to reduce memory:

    from omeco import slice_code, TreeSASlicer, ScoreFunction
    
    # Set target memory (e.g., 8GB = sc_target ≈ 30.0 for float64)
    import math
    available_gb = 8
    sc_target = math.log2(available_gb * 1024**3 / 8)
    
    score = ScoreFunction(sc_target=sc_target, sc_weight=2.0)
    slicer = TreeSASlicer.fast(score=score)
    sliced = slice_code(tree, ixs, sizes, slicer)
    
  3. Optimize with memory constraints from the start:

    from omeco import TreeSA, ScoreFunction
    
    score = ScoreFunction(sc_target=28.0, sc_weight=3.0)
    tree = optimize_code(ixs, out, sizes, TreeSA(score=score))
    

TypeError: Invalid index type

Problem: Indices must be integers, but got strings or floats.

Cause: Einstein indices in ixs and out must be integers (or strings in some contexts).

Solution: Convert indices to integers:

# Wrong
ixs = [["a", "b"], ["b", "c"]]

# Correct
ixs = [[0, 1], [1, 2]]
out = [0, 2]
sizes = {0: 10, 1: 20, 2: 10}

Performance Issues

Optimization is too slow

Problem: TreeSA takes too long to find a good contraction order.

Solutions:

  1. Use faster preset:

    from omeco import TreeSA
    
    # Instead of default
    tree = optimize_code(ixs, out, sizes, TreeSA.fast())
    
  2. Try GreedyMethod first:

    from omeco import optimize_code, GreedyMethod
    
    # Much faster than TreeSA; a good baseline
    tree = optimize_code(ixs, out, sizes, GreedyMethod())
    
  3. Reduce TreeSA iterations:

    from omeco import TreeSA
    
    optimizer = TreeSA(ntrials=1, niters=20)  # Fewer iterations: faster but lower quality
    tree = optimize_code(ixs, out, sizes, optimizer)
    

GPU performance is poor

Problem: Execution on GPU is slower than expected.

Diagnosis: Check if optimization used correct GPU settings:

comp = tree.complexity(ixs, sizes)
print(f"Read-write complexity: 2^{comp.rwc:.2f}")

If rwc is high, the contraction does excessive memory I/O.

Solution: Re-optimize with GPU-specific score function:

from omeco import ScoreFunction, TreeSA

# GPU optimization (see GPU Optimization Guide for rw_weight tuning)
score = ScoreFunction(
    tc_weight=1.0,
    sc_weight=1.0,
    rw_weight=10.0,
    sc_target=30.0
)

tree = optimize_code(ixs, out, sizes, TreeSA(score=score))

See GPU Optimization Guide for details.

Results differ from expected

Problem: Optimized contraction order gives different complexity than expected.

Diagnosis:

  1. Print the contraction tree:

    tree = optimize_code(ixs, out, sizes)
    print(tree)  # Shows ASCII tree structure
    
  2. Check complexity metrics:

    comp = tree.complexity(ixs, sizes)
    print(f"tc: {comp.tc:.2f}, sc: {comp.sc:.2f}, rwc: {comp.rwc:.2f}")
    
  3. Verify input:

    print(f"ixs: {ixs}")
    print(f"out: {out}")
    print(f"sizes: {sizes}")
    
    # Check all indices are covered
    all_indices = set()
    for ix in ixs:
        all_indices.update(ix)
    print(f"All indices: {all_indices}")
    print(f"Missing sizes: {all_indices - set(sizes.keys())}")
    

Common causes:

  • Missing indices in sizes dictionary
  • Wrong output indices in out
  • Mismatched dimensions between ixs and sizes

Algorithm-Specific Issues

GreedyMethod gives poor results

Problem: Greedy optimization produces sub-optimal contraction order.

Expected behavior: Greedy is a heuristic; it makes locally optimal choices quickly but can miss the globally optimal order.

Solution: Use TreeSA for better quality:

from omeco import TreeSA

tree = optimize_code(ixs, out, sizes, TreeSA.fast())

Comparison:

from omeco import optimize_code, GreedyMethod, TreeSA

# Greedy (fast, lower quality)
greedy_tree = optimize_code(ixs, out, sizes, GreedyMethod())
greedy_comp = greedy_tree.complexity(ixs, sizes)

# TreeSA (slower, higher quality)
sa_tree = optimize_code(ixs, out, sizes, TreeSA.fast())
sa_comp = sa_tree.complexity(ixs, sizes)

print(f"Greedy tc: {greedy_comp.tc:.2f}")
print(f"TreeSA tc: {sa_comp.tc:.2f}")
print(f"Improvement: {2**(greedy_comp.tc - sa_comp.tc):.1f}x faster")

TreeSA gets stuck in local minimum

Problem: Multiple runs of TreeSA give inconsistent results.

Cause: Simulated annealing is stochastic; each trial starts from a different random state and can settle in a different local minimum.

Solutions:

  1. Increase trials:

    optimizer = TreeSA(ntrials=20, niters=50)  # More exploration
    tree = optimize_code(ixs, out, sizes, optimizer)
    
  2. Use best-of-many:

    from omeco import TreeSA
    
    best_tree = None
    best_tc = float('inf')
    
    for _ in range(10):
        tree = optimize_code(ixs, out, sizes, TreeSA.fast())
        comp = tree.complexity(ixs, sizes)
        if comp.tc < best_tc:
            best_tc = comp.tc
            best_tree = tree
    
    print(f"Best tc: {best_tc:.2f}")
    

Integration Issues

PyTorch: Contraction gives wrong results

Problem: Executing the contraction tree with PyTorch tensors produces incorrect output.

Diagnosis:

  1. Verify with einsum:

    import torch
    from omeco import optimize_code
    
    # Shapes must match the sizes dictionary below
    A, B, C = torch.randn(10, 20), torch.randn(20, 30), torch.randn(30, 10)
    
    # Direct einsum (slow but correct)
    result_ref = torch.einsum("ab,bc,cd->ad", A, B, C)
    
    # Optimized tree; execute_tree is the helper from the PyTorch Integration Guide
    tree = optimize_code([[0, 1], [1, 2], [2, 3]], [0, 3], {0: 10, 1: 20, 2: 30, 3: 10})
    result_tree = execute_tree(tree.to_dict(), [A, B, C])
    
    # Compare
    print(f"Match: {torch.allclose(result_ref, result_tree)}")
    print(f"Max error: {(result_ref - result_tree).abs().max()}")
    
  2. Check tensor shapes:

    for i, t in enumerate([A, B, C]):
        print(f"Tensor {i}: {t.shape}")
    

Common causes:

  • Mismatched tensor shapes and sizes dictionary
  • Wrong index mapping in execute_tree
  • Incorrect einsum string construction

See PyTorch Integration Guide for complete example.

Slicing: Incorrect memory calculation

Problem: Slicing doesn’t reduce memory as expected.

Diagnosis:

from omeco import slice_code

# Before slicing
comp_original = tree.complexity(ixs, sizes)

# After slicing
sliced = slice_code(tree, ixs, sizes, slicer)
comp_sliced = sliced.complexity(ixs, sizes)

print(f"Original sc: 2^{comp_original.sc:.2f}")
print(f"Sliced sc: 2^{comp_sliced.sc:.2f}")
print(f"Reduction: {2**(comp_original.sc - comp_sliced.sc):.1f}x")
print(f"Sliced indices: {sliced.slicing()}")

Solutions:

  1. Increase sc_weight:

    from omeco import ScoreFunction, TreeSASlicer
    
    score = ScoreFunction(sc_weight=5.0, sc_target=25.0)
    slicer = TreeSASlicer.fast(score=score)
    
  2. Force specific indices:

    slicer = TreeSASlicer.fast(fixed_slices=[1, 2])
    sliced = slice_code(tree, ixs, sizes, slicer)
    

Debugging Tips

Enable verbose output

omeco has no built-in verbose mode, but you can inspect intermediate results with print statements:

# Check optimization result
tree = optimize_code(ixs, out, sizes)
print(f"Tree:\n{tree}")

# Check complexity
comp = tree.complexity(ixs, sizes)
print(f"tc: 2^{comp.tc:.2f}, sc: 2^{comp.sc:.2f}, rwc: 2^{comp.rwc:.2f}")

# Check slicing
sliced = slice_code(tree, ixs, sizes, slicer)
print(f"Sliced indices: {sliced.slicing()}")

Compare with Julia implementation

If you suspect a bug, compare with the original Julia implementation:

using OMEinsumContractionOrders

code = EinCode([[1, 2], [2, 3], [3, 1]], [1])
sizes = Dict(1 => 10, 2 => 20, 3 => 10)

# Greedy
tree = optimize_code(code, sizes, GreedyMethod())
println(tree)

# TreeSA
tree = optimize_code(code, sizes, TreeSA())
println(tree)

Minimal reproducible example

When reporting issues, provide:

from omeco import optimize_code

# Minimal example
ixs = [[0, 1], [1, 2]]
out = [0, 2]
sizes = {0: 10, 1: 20, 2: 10}

tree = optimize_code(ixs, out, sizes)
print(f"Tree:\n{tree}")

comp = tree.complexity(ixs, sizes)
print(f"Complexity: tc={comp.tc:.2f}, sc={comp.sc:.2f}")

# What was expected vs what you got
print("Expected tc: ~11.0")
print(f"Got tc: {comp.tc:.2f}")

FAQ

Q: How do I use different optimizers?

A: optimize_code is the unified function that accepts different optimizer types:

from omeco import optimize_code, GreedyMethod, TreeSA

# Use default optimizer (GreedyMethod)
tree1 = optimize_code(ixs, out, sizes)

# Explicitly specify GreedyMethod
tree2 = optimize_code(ixs, out, sizes, GreedyMethod())

# Use TreeSA optimizer
tree3 = optimize_code(ixs, out, sizes, TreeSA.fast())

Q: How do I choose between Greedy and TreeSA?

A:

  • GreedyMethod: Fast (milliseconds to seconds), good for prototyping
  • TreeSA: Slower (seconds to minutes), better quality, use for production

Rule of thumb (a sketch implementing it follows below):

  • < 50 tensors → Greedy is fine
  • 50-200 tensors → TreeSA.fast()
  • > 200 tensors → Start with Greedy, then try TreeSA if needed
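
A minimal sketch of this rule, assuming the optimize_code API described above (pick_optimizer is an illustrative helper, not part of the library):

from omeco import optimize_code, GreedyMethod, TreeSA

def pick_optimizer(num_tensors):
    # Rough cut following the rule of thumb above
    return GreedyMethod() if num_tensors < 50 else TreeSA.fast()

ixs = [[0, 1], [1, 2], [2, 3]]
out = [0, 3]
sizes = {0: 10, 1: 20, 2: 30, 3: 10}
tree = optimize_code(ixs, out, sizes, pick_optimizer(len(ixs)))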

Q: What does sc_target mean?

A: sc_target is the log₂ of the target memory limit in elements.

Calculate it:

import math

# For 4GB memory, float64 (8 bytes)
memory_gb = 4
bytes_per_element = 8
sc_target = math.log2(memory_gb * 1024**3 / bytes_per_element)
# sc_target ≈ 29.0

# For 8GB memory, float32 (4 bytes)
memory_gb = 8
bytes_per_element = 4
sc_target = math.log2(memory_gb * 1024**3 / bytes_per_element)
# sc_target ≈ 31.0

Q: Why is GPU optimization different?

A: GPUs have limited memory bandwidth relative to compute (10-100x higher compute-to-memory-bandwidth ratio than CPUs).

  • CPU: Memory I/O is fast → rw_weight=0.0
  • GPU: Use rw_weight to penalize memory I/O

See GPU Optimization Guide for complete tuning methodology.

Q: Can I save and load optimized trees?

A: Yes, use to_dict() and JSON:

import json
from omeco import optimize_code

# Optimize
tree = optimize_code(ixs, out, sizes)

# Save
tree_dict = tree.to_dict()
with open("tree.json", "w") as f:
    json.dump(tree_dict, f)

# Load
with open("tree.json", "r") as f:
    tree_dict = json.load(f)

# Use dict directly or reconstruct (implementation dependent)

Q: What if my tensor network has repeated indices?

A: Repeated indices (traces) are supported:

# Trace: C[i,j] = Σₖ A[i,k,k] * B[k,j]
ixs = [[0, 1, 1], [1, 2]]  # Note: index 1 appears twice in first tensor
out = [0, 2]
sizes = {0: 10, 1: 20, 2: 10}

tree = optimize_code(ixs, out, sizes)

Still Having Issues?

  1. Check the documentation: review the relevant guide in this book

  2. Search existing issues: someone may have hit the same problem

  3. Report a bug:

    • Include a minimal reproducible example
    • Include Python/Rust version, OS, and omeco version
    • Show expected vs actual behavior
  4. Ask for help:

    • GitHub Discussions
    • Include context and what you’ve tried

API Reference

Quick reference for omeco’s Python and Rust APIs.

Python API

Full Python API documentation is available in the package docstrings. This page provides a quick overview.

Optimization Functions

optimize_code

def optimize_code(
    ixs: List[List[int]],
    out: List[int],
    sizes: Dict[int, int],
    optimizer: Optional[Optimizer] = None
) -> NestedEinsum

Optimize tensor contraction order.

Parameters:

  • ixs: List of index lists for each tensor (e.g., [[0, 1], [1, 2]])
  • out: Output indices (e.g., [0, 2])
  • sizes: Dictionary mapping indices to dimensions (e.g., {0: 10, 1: 20, 2: 10})
  • optimizer: Optimizer instance (default: GreedyMethod())

Returns: NestedEinsum representing the optimized contraction tree

Example:

from omeco import optimize_code, TreeSA

tree = optimize_code(
    ixs=[[0, 1], [1, 2], [2, 3]],
    out=[0, 3],
    sizes={0: 10, 1: 20, 2: 30, 3: 10},
    optimizer=TreeSA.fast()
)

Slicing Functions

slice_code

def slice_code(
    tree: NestedEinsum,
    ixs: List[List[int]],
    sizes: Dict[int, int],
    slicer: Optional[TreeSASlicer] = None
) -> SlicedCode

Apply slicing to reduce memory usage.

Parameters:

  • tree: Optimized contraction tree from optimize_code
  • ixs: Original index lists
  • sizes: Dimension sizes
  • slicer: Slicer instance (optional, defaults to TreeSASlicer())

Returns: SlicedCode with slicing applied

Example:

from omeco import optimize_code, slice_code, TreeSASlicer, ScoreFunction

tree = optimize_code(ixs, out, sizes)

slicer = TreeSASlicer.fast(
    score=ScoreFunction(sc_target=28.0)
)
sliced = slice_code(tree, ixs, sizes, slicer)

Complexity Functions

contraction_complexity

def contraction_complexity(
    tree: NestedEinsum,
    ixs: List[List[int]],
    sizes: Dict[int, int]
) -> Complexity

Calculate complexity metrics for a contraction tree.

Returns: Complexity object with fields:

  • tc: Time complexity (log₂ FLOPs)
  • sc: Space complexity (log₂ elements)
  • rwc: Read-write complexity (log₂ elements)

Example:

from omeco import optimize_code

tree = optimize_code(ixs, out, sizes)
comp = tree.complexity(ixs, sizes)

print(f"Time: 2^{comp.tc:.2f} FLOPs")
print(f"Space: 2^{comp.sc:.2f} elements")
print(f"Read-write: 2^{comp.rwc:.2f} elements")

sliced_complexity

def sliced_complexity(
    sliced: SlicedCode,
    ixs: List[List[int]],
    sizes: Dict[int, int]
) -> Complexity

Calculate complexity for sliced code.

Example:

from omeco import slice_code

sliced = slice_code(tree, ixs, sizes, slicer)
comp = sliced.complexity(ixs, sizes)
print(f"Sliced space: 2^{comp.sc:.2f}")

Optimizer Classes

GreedyMethod

class GreedyMethod:
    def __init__(
        self,
        max_iter: Optional[int] = None,
        temperature: float = 0.0,
        max_branches: int = 1
    )

Greedy optimization algorithm.

Parameters:

  • max_iter: Maximum iterations (default: unlimited)
  • temperature: Temperature for stochastic variant (0.0 = deterministic)
  • max_branches: Number of branches to explore (1 = pure greedy)

Presets:

# Default greedy
GreedyMethod()

# Stochastic variant
GreedyMethod(temperature=1.0, max_branches=5)

TreeSA

class TreeSA:
    def __init__(
        self,
        ntrials: int = 1,
        niters: int = 50,
        score: Optional[ScoreFunction] = None
    )
    
    @staticmethod
    def fast(score: Optional[ScoreFunction] = None) -> TreeSA

Tree-based simulated annealing optimizer.

Parameters:

  • ntrials: Number of independent trials
  • niters: Iterations per trial
  • score: Score function for evaluating candidates

Presets:

# Fast preset (good balance)
TreeSA.fast()

# Custom configuration
TreeSA(ntrials=10, niters=100, score=ScoreFunction(sc_target=28.0))

TreeSASlicer

class TreeSASlicer:
    def __init__(
        self,
        ntrials: int = 1,
        niters: int = 50,
        fixed_slices: Optional[List[int]] = None,
        score: Optional[ScoreFunction] = None
    )
    
    @staticmethod
    def fast(
        fixed_slices: Optional[List[int]] = None,
        score: Optional[ScoreFunction] = None
    ) -> TreeSASlicer

Simulated annealing slicer.

Parameters:

  • ntrials: Number of trials
  • niters: Iterations per trial
  • fixed_slices: Indices that must be sliced
  • score: Score function

Example:

# Automatic slice selection
TreeSASlicer.fast(score=ScoreFunction(sc_target=25.0))

# Force slicing on specific indices
TreeSASlicer.fast(fixed_slices=[1, 2])

Configuration Classes

ScoreFunction

class ScoreFunction:
    def __init__(
        self,
        tc_weight: float = 1.0,
        sc_weight: float = 1.0,
        rw_weight: float = 0.0,
        sc_target: float = 20.0
    )

Configure optimization objectives.

Parameters:

  • tc_weight: Weight for time complexity
  • sc_weight: Weight for space complexity
  • rw_weight: Weight for read-write complexity (default: 0.0 for CPU, see GPU Optimization Guide for GPU)
  • sc_target: Target space complexity (log₂ elements)

Score Formula:

score = tc_weight × 2^tc + rw_weight × 2^rwc + sc_weight × max(0, 2^sc - 2^sc_target)
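
For intuition, a plain-Python transcription of this formula (score_of is an illustrative helper; the library evaluates the score internally):

def score_of(tc, sc, rwc, tc_weight=1.0, sc_weight=1.0, rw_weight=0.0, sc_target=20.0):
    # All inputs are log2 values, matching the Complexity fields tc, sc, rwc
    return (tc_weight * 2**tc
            + rw_weight * 2**rwc
            + sc_weight * max(0.0, 2**sc - 2**sc_target))

# Lower is better; space only contributes once sc exceeds sc_target
print(score_of(tc=20.0, sc=18.0, rwc=15.0))  # no memory penalty
print(score_of(tc=20.0, sc=22.0, rwc=15.0))  # penalized for exceeding sc_target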

Examples:

# CPU optimization
ScoreFunction(tc_weight=1.0, sc_weight=1.0, rw_weight=0.0, sc_target=28.0)

# GPU optimization (see GPU Optimization Guide)
ScoreFunction(tc_weight=1.0, sc_weight=1.0, rw_weight=10.0, sc_target=30.0)

# Memory-constrained
ScoreFunction(tc_weight=0.1, sc_weight=10.0, rw_weight=0.0, sc_target=25.0)

Data Classes

NestedEinsum

Represents an optimized contraction tree.

Methods:

tree.to_dict() -> dict          # Convert to dictionary
tree.leaf_count() -> int        # Number of leaf tensors
tree.depth() -> int             # Tree depth
str(tree)                       # Pretty-print tree
repr(tree)                      # Summary representation

Dictionary Structure:

# Leaf node
{"tensor_index": int}

# Internal node
{
    "args": [child1_dict, child2_dict, ...],
    "eins": {
        "ixs": [[int, ...], ...],  # Input indices
        "iy": [int, ...]            # Output indices
    }
}

Example:

tree = optimize_code(ixs, out, sizes)

# Pretty print
print(tree)
# Output:
# ab, bc -> ac
# ├─ ab
# └─ bc

# Convert to dict for execution
tree_dict = tree.to_dict()
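
Continuing this example, a minimal sketch that walks the dictionary recursively (count_leaves is an illustrative helper, not part of the API):

def count_leaves(node):
    # Leaf nodes carry "tensor_index"; internal nodes carry "args" and "eins"
    if "tensor_index" in node:
        return 1
    return sum(count_leaves(child) for child in node["args"])

assert count_leaves(tree_dict) == tree.leaf_count()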

SlicedCode

Represents a contraction with slicing applied.

Methods:

sliced.slicing() -> List[int]   # Get sliced indices

Example:

sliced = slice_code(tree, ixs, sizes, slicer)
print(f"Sliced indices: {sliced.slicing()}")
# Output: [1, 3]

Complexity

Complexity metrics for a contraction.

Fields:

comp.tc: float   # Time complexity (log₂ FLOPs)
comp.sc: float   # Space complexity (log₂ elements)
comp.rwc: float  # Read-write complexity (log₂ elements)

Example:

comp = tree.complexity(ixs, sizes)

# Interpret log₂ values
flops = 2 ** comp.tc
memory_elements = 2 ** comp.sc
io_operations = 2 ** comp.rwc

print(f"{flops:.2e} FLOPs")
print(f"{memory_elements:.2e} elements in memory")

Rust API

Full Rust API documentation: docs.rs/omeco

Core Types

NestedEinsum<I>

#![allow(unused)]
fn main() {
pub struct NestedEinsum<I> { /* ... */ }

impl<I> NestedEinsum<I> {
    pub fn leaf(tensor_index: usize) -> Self
    pub fn new(args: Vec<Self>, eins: ContractionOperation<I>) -> Self
    pub fn depth(&self) -> usize
    pub fn leaf_count(&self) -> usize
}
}

ContractionOperation<I>

#![allow(unused)]
fn main() {
pub struct ContractionOperation<I> {
    pub ixs: Vec<Vec<I>>,  // Input indices
    pub iy: Vec<I>,        // Output indices
}
}

Optimization Functions

optimize_greedy

#![allow(unused)]
fn main() {
pub fn optimize_greedy<I>(
    ixs: &[Vec<I>],
    out: &[I],
    sizes: &HashMap<I, usize>
) -> NestedEinsum<I>
where
    I: Hash + Eq + Clone + Ord
}

optimize_code

#![allow(unused)]
fn main() {
pub fn optimize_code<I, O>(
    ixs: &[Vec<I>],
    out: &[I],
    sizes: &HashMap<I, usize>,
    optimizer: O
) -> NestedEinsum<I>
where
    I: Hash + Eq + Clone + Ord,
    O: Optimizer<I>
}

Optimizer Traits

Optimizer<I>

#![allow(unused)]
fn main() {
pub trait Optimizer<I> {
    fn optimize(
        &self,
        ixs: &[Vec<I>],
        out: &[I],
        sizes: &HashMap<I, usize>
    ) -> NestedEinsum<I>;
}
}

Configuration

ScoreFunction

#![allow(unused)]
fn main() {
pub struct ScoreFunction {
    pub tc_weight: f64,
    pub sc_weight: f64,
    pub rw_weight: f64,
    pub sc_target: f64,
}

impl ScoreFunction {
    pub fn new(tc_weight: f64, sc_weight: f64, rw_weight: f64, sc_target: f64) -> Self
}
}

Type Mapping (Python ↔ Rust)

| Python          | Rust                  | Notes            |
|-----------------|-----------------------|------------------|
| List[List[int]] | Vec<Vec<i64>>         | Index lists      |
| Dict[int, int]  | HashMap<i64, usize>   | Dimension sizes  |
| NestedEinsum    | NestedEinsum<i64>     | Contraction tree |
| ScoreFunction   | ScoreFunction         | Same fields      |
| Complexity      | ContractionComplexity | Metrics          |

Common Patterns

Optimize with custom score

from omeco import optimize_code, TreeSA, ScoreFunction

score = ScoreFunction(
    tc_weight=1.0,
    sc_weight=1.0,
    rw_weight=10.0,  # For GPU (see GPU Optimization Guide)
    sc_target=30.0
)

tree = optimize_code(
    ixs=[[0, 1], [1, 2]],
    out=[0, 2],
    sizes={0: 10, 1: 20, 2: 10},
    optimizer=TreeSA(score=score)
)

Optimize then slice

from omeco import optimize_code, slice_code, TreeSA, TreeSASlicer, ScoreFunction

# Step 1: Optimize contraction order
tree = optimize_code(ixs, out, sizes, TreeSA.fast())

# Step 2: Apply slicing if needed
slicer = TreeSASlicer.fast(score=ScoreFunction(sc_target=28.0))
sliced = slice_code(tree, ixs, sizes, slicer)

Check complexity

from omeco import optimize_code

tree = optimize_code(ixs, out, sizes)
comp = tree.complexity(ixs, sizes)

print(f"Time: 2^{comp.tc:.2f} FLOPs = {2**comp.tc:.2e} FLOPs")
print(f"Space: 2^{comp.sc:.2f} elements = {2**comp.sc:.2e} elements")

# Memory in GiB (float64)
memory_gib = (2 ** comp.sc) * 8 / 1024**3
print(f"Memory: {memory_gib:.2f} GiB")

Changelog

All notable changes to omeco are documented here.

The format follows Keep a Changelog, and this project adheres to Semantic Versioning.

[Unreleased]

Added

  • Comprehensive mdBook documentation following tropical-gemm standard
  • Pretty printing for Python NestedEinsum with ASCII tree visualization
  • PyTorch integration guide and examples
  • GPU optimization guide with rw_weight configuration
  • Slicing strategy guide for memory-constrained environments
  • Troubleshooting guide with common issues and solutions
  • API reference for Python and Rust APIs
  • Performance benchmarks comparing Rust vs Julia
  • Algorithm comparison guide (Greedy vs TreeSA)

Changed

  • Migrated documentation from scattered markdown files to structured mdBook
  • Improved Python bindings with better __str__ and __repr__ methods

Deprecated

  • Legacy docs/score_function_guide.md (migrated to mdBook)

[0.2.1] - 2024-01-XX

Fixed

  • Issue #6: Hyperedge index preservation in contraction operations (PR #7)
    • Fixed contract_tree! macro to correctly preserve tensor indices during contraction
    • Added regression tests to verify hyperedge handling
    • Ensures contraction order matches input tensor order specified in ixs

Added

  • Test suite for hyperedge index preservation
  • CI improvements for better test coverage

[0.2.0] - 2024-01-XX

Added

  • TreeSA (Tree-based Simulated Annealing) optimizer
  • TreeSA.fast() preset for quick high-quality optimization
  • Slicing support with TreeSASlicer for memory reduction
  • ScoreFunction for configurable optimization objectives
  • contraction_complexity and sliced_complexity functions
  • Python bindings via PyO3
  • optimize_code generic function accepting optimizer instances
  • Read-write complexity (rwc) metric for GPU optimization

Changed

  • Improved API ergonomics with preset methods
  • Better default parameters for optimizers

Performance

  • 1.4-1.5x faster than Julia OMEinsumContractionOrders.jl on benchmarks
  • Efficient TreeSA implementation with better exploration

[0.1.0] - 2023-XX-XX

Added

  • Initial release
  • GreedyMethod optimizer
  • Basic contraction order optimization
  • Support for tensor networks with arbitrary indices
  • Complexity calculation (time and space)
  • Rust core library
  • Basic documentation

Features

  • Greedy algorithm with configurable parameters
  • Stochastic variants for improved solutions
  • Efficient index handling with generic types
  • HashMap-based dimension tracking

Version Numbering

omeco follows Semantic Versioning:

  • MAJOR version for incompatible API changes
  • MINOR version for new functionality in a backward compatible manner
  • PATCH version for backward compatible bug fixes

Release Process

  1. Update version in Cargo.toml files
  2. Update CHANGELOG.md with release notes
  3. Tag release: git tag v0.X.Y
  4. Push tags: git push --tags
  5. Publish to crates.io: cargo publish
  6. Publish to PyPI: maturin publish (Python bindings)

Contributing

See CONTRIBUTING.md for guidelines on contributing to omeco.

Migration Guides

Migrating from 0.1.x to 0.2.x

Major Changes:

  1. New optimize_code function: Unified API for all optimizers

    # Old (0.1.x)
    tree = optimize_greedy(ixs, out, sizes)
    tree = optimize_treesa(ixs, out, sizes)
    
    # New (0.2.x) - unified interface
    from omeco import optimize_code, GreedyMethod, TreeSA
    tree = optimize_code(ixs, out, sizes)  # Uses GreedyMethod by default
    tree = optimize_code(ixs, out, sizes, TreeSA.fast())
    
  2. ScoreFunction configuration:

    # New in 0.2.x
    from omeco import ScoreFunction, TreeSA
    
    score = ScoreFunction(tc_weight=1.0, sc_weight=1.0, rw_weight=10.0, sc_target=30.0)
    tree = optimize_code(ixs, out, sizes, TreeSA(score=score))
    
  3. Slicing support:

    # New in 0.2.x
    from omeco import slice_code, TreeSASlicer
    
    sliced = slice_code(tree, ixs, sizes, TreeSASlicer.fast())
    

Breaking Changes:

  • Removed optimize_greedy() and optimize_treesa() from Python exports
  • Use optimize_code(...) instead with GreedyMethod() or TreeSA() optimizers
  • Rust API unchanged

Migrating from Julia OMEinsumContractionOrders.jl

Index Differences:

  • Julia: 1-based indexing
  • Rust/Python: 0-based indexing (or use arbitrary hashable types)
# Julia (1-based)
ixs = [[1, 2], [2, 3], [3, 1]]
sizes = Dict(1 => 10, 2 => 20, 3 => 10)

# Python (0-based)
ixs = [[0, 1], [1, 2], [2, 0]]
sizes = {0: 10, 1: 20, 2: 10}

Function Names:

| Julia                  | omeco (Python/Rust)                |
|------------------------|------------------------------------|
| optimize_greedy        | optimize_code(..., GreedyMethod()) |
| optimize_treesa        | optimize_code(..., TreeSA.fast())  |
| contraction_complexity | contraction_complexity             |
| slicing                | slice_code                         |
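
For example, the two optimizer rows above map to the following omeco calls (a sketch reusing ixs and sizes from the Python snippet above):

from omeco import optimize_code, GreedyMethod, TreeSA

out = []  # scalar output: contract the whole triangle network
tree_greedy = optimize_code(ixs, out, sizes, GreedyMethod())  # Julia: optimize_greedy
tree_sa = optimize_code(ixs, out, sizes, TreeSA.fast())       # Julia: optimize_treesa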

API Compatibility: Most functions have similar signatures and behavior.