Existing attention accelerators often sacrifice exact softmax semantics or suffer from sequential depth that limits throughput on long sequences. We present ELSA, an algorithmic reformulation of softmax attention that preserves exact semantics in real arithmetic. By casting the online softmax update into a prefix scan over an associative monoid of triples $(m, S, W)$, ELSA achieves $O(n)$ extra memory and $O(\log n)$ parallel depth. Implemented in Triton and CUDA C++ without Tensor Core dependencies, it is a drop-in replacement that reduces FP32 inference latency by 1.3–3.5× relative to memory-efficient SDPA, with no retraining required.
Vision Transformers have become the backbone of modern perception systems—from satellite imagery and medical diagnostics to autonomous driving and robot navigation. But as these models get deployed in the real world, a hard constraint emerges: not every device has a data-centre GPU.
Platforms like NVIDIA Jetson AGX Orin and Drive AGX run perception stacks in real time with tight power and VRAM budgets. Materialising the full O(n²) attention matrix is simply not an option, and FlashAttention's Tensor Core dependency (HMMA/GMMA) makes it unavailable on many embedded GPUs (e.g. Jetson TX2, compute capability 6.2). ME-SDPA preserves memory but processes tokens sequentially, leaving the majority of CUDA threads idle and burning the latency budget.
Clinical imaging, hyperspectral sensors (12–16-bit dynamic range), and scientific microscopy demand FP32 exactness: half-precision accumulation is clinically inadmissible. FlashAttention's FP32 fallback reverts to untuned SIMD code without fused HMMA kernels, losing its throughput advantage entirely. ME-SDPA supports FP32 but is up to 3.5× slower than ELSA at 16K tokens.
Re-training CLIP, SAM, VGGT, or LLaMA to compensate for approximate attention costs millions of GPU-hours. Yet standard exact attention OOMs at long sequences, and FlashAttention requires Ampere/Hopper Tensor Cores unavailable on older, AMD, or edge hardware. There is no hardware-agnostic, exact, high-throughput drop-in for FP32 inference—until ELSA.
| Approach | Exact FP32 | No Tensor Cores needed | Parallel (O(log n) depth) | Retrain-free | Verdict |
|---|---|---|---|---|---|
| Standard SDPA (Math) | ✓ | ✓ | ✗ O(n²) memory | ✓ | OOM at long sequences |
| ME-SDPA (xFormers) | ✓ | ✓ | ✗ O(n/Tₖ) sequential depth | ✓ | 3.5× slower than ELSA at 16K |
| FlashAttention-2/3 | ✗ FP32 path uncompetitive | ✗ Requires HMMA/GMMA | ✓ | ✓ | Unavailable on edge / AMD / older GPUs |
| Linear / Approximate Attention | ✗ Modifies softmax | ✓ | ✓ | ✗ Requires retraining | Impractical for foundation models |
| ELSA (Ours) | ✓ Provable O(u log n) | ✓ Scalar max/exp + fma only | ✓ O(log n) depth | ✓ Drop-in, zero retraining | Fills the gap |
The key insight behind ELSA comes from an analogy with classical image processing. A Summed-Area Table (SAT)—also known as an integral image—lets you compute the sum of any rectangular region in constant time, regardless of its size, by a single-pass precomputation.
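For readers unfamiliar with integral images, here is a minimal NumPy illustration of the SAT trick (purely illustrative, unrelated to the ELSA code itself):

```python
import numpy as np

img = np.random.rand(480, 640)
sat = img.cumsum(axis=0).cumsum(axis=1)   # single-pass precomputation

def rect_sum(sat, r0, c0, r1, c1):
    """Sum over the inclusive rectangle [r0:r1, c0:c1] from four corner lookups."""
    total = sat[r1, c1]
    if r0 > 0:
        total -= sat[r0 - 1, c1]
    if c0 > 0:
        total -= sat[r1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += sat[r0 - 1, c0 - 1]
    return total

assert np.isclose(rect_sum(sat, 10, 20, 99, 199), img[10:100, 20:200].sum())
```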
Softmax attention faces an analogous bottleneck. The standard online recurrence of Rabe & Staats compresses the $O(n^2)$ attention matrix to $O(n)$ memory, but at the cost of strict sequentiality. ELSA resolves this by showing that the attention state $(m, S, W)$ satisfies the same composability property as prefix sums—enabling a fully parallel reduction with $O(\log n)$ depth.
ELSA transforms the sequential online softmax recurrence into a fully parallelizable computation by proving that attention states $(m, S, W)$ form an associative monoid under a merge operator $\oplus$.
Classical ME-SDPA chains tokens one by one: each step depends on the previous running maximum $m_j$, forcing $O(n)$ sequential depth. ELSA breaks this chain by observing that any two contiguous blocks can be merged exactly via:
$u_A \oplus u_B = \mathrm{renorm}\!\left(\max(m_A, m_B),\; \tilde{S}_A + \tilde{S}_B,\; \tilde{W}_A + \tilde{W}_B\right)$, where $\tilde{S}_X = e^{\,m_X - \max(m_A, m_B)}\, S_X$ and $\tilde{W}_X = e^{\,m_X - \max(m_A, m_B)}\, W_X$ rescale each block's partial sums to the joint maximum before adding them.
Total scan depth: $L(n, B) = \lceil\log_2 B\rceil + 2\lceil\log_2(n/B)\rceil + 3 \leq 2\lceil\log_2 n\rceil + 3$, with a provable FP32 relative error of $O(u \log n)$.
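The following NumPy sketch of the merge is illustrative only (the block size, names, and single-query layout are ours, not the shipped Triton/CUDA kernels); it shows that a tree reduction over per-block states reproduces ordinary softmax attention:

```python
import numpy as np

def block_state(q, K, V, scale):
    """Local softmax state (m, S, W) for one query against one key/value block."""
    logits = (K @ q) * scale
    m = logits.max()
    p = np.exp(logits - m)
    return m, p.sum(), p @ V            # running max, normaliser, weighted values

def merge(a, b):
    """Associative merge (⊕): rescale both partial states to the joint maximum."""
    (ma, Sa, Wa), (mb, Sb, Wb) = a, b
    m = max(ma, mb)
    ca, cb = np.exp(ma - m), np.exp(mb - m)
    return m, ca * Sa + cb * Sb, ca * Wa + cb * Wb

rng = np.random.default_rng(0)
n, d, B = 1024, 64, 128
scale = 1.0 / np.sqrt(d)
q = rng.standard_normal(d)
K, V = rng.standard_normal((n, d)), rng.standard_normal((n, d))

# Tree reduction over blocks; associativity makes the pairing order irrelevant.
states = [block_state(q, K[i:i + B], V[i:i + B], scale) for i in range(0, n, B)]
while len(states) > 1:
    states = [merge(states[i], states[i + 1]) if i + 1 < len(states) else states[i]
              for i in range(0, len(states), 2)]
m, S, W = states[0]

# Reference: ordinary softmax attention for the same query.
logits = (K @ q) * scale
p = np.exp(logits - logits.max()); p /= p.sum()
assert np.allclose(W / S, p @ V)
```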
The animation below visualises how ELSA moves tokens through the GPU memory hierarchy compared to ME-SDPA (xFormers FP32). Left: ME-SDPA loads one tile at a time sequentially (O(n) depth). Right: ELSA loads 64 tiles in parallel across 8 SRAM blocks, runs Hillis–Steele intra-block scans (log₂8 = 3 levels), then a Blelloch inter-block sweep—achieving O(log n) = 6 steps with no Tensor Core dependency.
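To see where the log₂ levels come from, here is the Hillis–Steele pattern in plain Python, reusing the merge operator from the sketch above (the real kernel executes each level as one parallel step over a tile held in SRAM; this loop only emulates it):

```python
def hillis_steele_scan(states, merge):
    """Inclusive scan: after ceil(log2(len(states))) levels, states[i] = s_0 ⊕ ... ⊕ s_i."""
    states = list(states)
    stride = 1
    while stride < len(states):
        # On the GPU every element of this level is updated simultaneously.
        nxt = list(states)
        for i in range(stride, len(states)):
            nxt[i] = merge(states[i - stride], states[i])
        states = nxt
        stride *= 2
    return states
```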
No retraining. No architecture changes. Pick your entry point and swap one line.
Functional call (before):

```python
import torch.nn.functional as F

out = F.scaled_dot_product_attention(q, k, v, scale=scale)
```

Functional call (after, ELSA):

```python
from elsa import ELSA_triton_fp32

out = ELSA_triton_fp32.apply(q, k, v, scale)
```

timm Attention module (before):

```python
from timm.models.vision_transformer import Attention

self.attn = Attention(dim=768, num_heads=12)
```

timm Attention module (after, ELSA):

```python
from elsa import ElsaAttention

self.attn = ElsaAttention(dim=768, num_heads=12, backend="auto")
```
`backend="auto"` selects the best kernel per dtype.

Whole-model patching via timm (before):

```python
import timm
import torch

model = timm.create_model(
    "vit_large_patch14_clip_224",
    pretrained=True,
).cuda().eval()

with torch.no_grad():
    out = model(images)   # images: a preprocessed batch on the GPU
```

Whole-model patching via timm (after, ELSA):

```python
import timm
import torch
from elsa import patch_vit_attention

model = timm.create_model(
    "vit_large_patch14_clip_224",
    pretrained=True,
).cuda().eval()

patch_vit_attention(model, method="elsa", precision="fp32")

with torch.no_grad():
    out = model(images)
```
Hugging Face LLaMA (before):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    torch_dtype=torch.float32,
    device_map="cuda",
)

# OOM at ≥32K tokens in FP32
out = model(**long_input)
```

Hugging Face LLaMA (after, ELSA):

```python
import torch
from transformers import AutoModelForCausalLM
from elsa import patch_llama_attention

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    torch_dtype=torch.float32,
    device_map="cuda",
)

patch_llama_attention(model)
out = model(**long_input)
```
A parallel scan only helps if it produces the same answer as sequential attention. We verify this empirically against a float64 oracle across three scenarios (regular, long, stress) and six complementary drift metrics.
Regular: n=1024, d=64 · Long: n=8192 · Stress: adversarial logits, n=4096. All values well below single-precision unit roundoff u ≈ 6×10⁻⁸.
| Metric | Regular | Long | Stress | Interpretation |
|---|---|---|---|---|
| ‖δP‖∞ (max abs prob drift) | 3.12×10⁻¹⁷ | 2.34×10⁻¹⁷ | 3.33×10⁻¹⁶ | Worst-case pointwise attention weight error |
| ‖δP‖₂/‖P‖₂ (rel. L2 prob) | 1.73×10⁻¹⁵ | 3.42×10⁻¹⁵ | 3.50×10⁻¹⁵ | Global distributional shift across all attention weights |
| JS Divergence (per-row) | 3.56×10⁻¹⁶ | 3.77×10⁻¹⁶ | 3.07×10⁻¹⁶ | Per-query attention distribution shape preserved |
| Argmax Disagreement Rate | 0 | 0 | 0 | Top-attended token never changes — exact behavioral equivalence |
| ‖δY‖∞ (max abs output drift) | 4.99×10⁻¹⁶ | 4.99×10⁻¹⁶ | 3.28×10⁻¹⁵ | Worst-case error in output embeddings to next layer |
| ‖δY‖₂/‖Y‖₂ (rel. L2 output) | 2.39×10⁻¹⁵ | 4.72×10⁻¹⁵ | 4.94×10⁻¹⁵ | Normalized aggregate output error across full tensor |
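The output-side rows of this table can be reproduced with a check along the following lines (a sketch only; `elsa_attention` is a placeholder for whichever ELSA entry point you use, e.g. `ELSA_triton_fp32.apply`):

```python
import torch

def output_drift(q, k, v, scale, elsa_attention):
    """Drift of an ELSA output against a float64 softmax-attention oracle."""
    q64, k64, v64 = (t.double() for t in (q, k, v))
    P = torch.softmax((q64 @ k64.transpose(-2, -1)) * scale, dim=-1)  # oracle probabilities
    Y_ref = P @ v64                                                   # oracle output
    Y = elsa_attention(q, k, v, scale).double()                       # output under test
    dY = (Y - Y_ref).abs()
    return {"max_abs_output_drift": dY.max().item(),
            "rel_l2_output_drift": (dY.norm() / Y_ref.norm()).item()}
```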
ELSA exposes three precision modes, spanning a continuum from maximum throughput to minimal VRAM. All variants are drop-in replaceable; select based on your hardware constraints.
| Method | Throughput (M tok/s) | vs ME-SDPA | Peak Memory (GB) | Best For |
|---|---|---|---|---|
| ME-SDPA (baseline) | 0.40 | 1.00× | 0.289 | — |
| ELSA-Strict (FP32) | 1.36 | 3.40× | 0.336 | Auditable FP32, medical / scientific |
| ELSA-Turbo (TF32) ★ | 1.43 | 3.58× | 0.288 | General use — default configuration |
| ELSA-Lean (η=0.25) | 0.13 | 0.33× | 0.288 | Minimum VRAM, embedded / consumer GPU |
Select a mode via the `backend=` argument of `ElsaAttention`, or through an environment variable.
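A sketch of how one might pick a mode at module construction time; the mode strings below are placeholders of our own (the table names the variants, but we have not verified the exact identifiers; `backend="auto"` from the earlier example is the documented default):

```python
from elsa import ElsaAttention

def pick_backend(need_auditable_fp32: bool, vram_constrained: bool) -> str:
    # Placeholder identifiers -- substitute whatever strings the package exposes.
    if need_auditable_fp32:
        return "strict"   # ELSA-Strict: auditable FP32 accumulation
    if vram_constrained:
        return "lean"     # ELSA-Lean: minimum peak memory (η = 0.25)
    return "turbo"        # ELSA-Turbo: TF32, the default configuration

attn = ElsaAttention(dim=768, num_heads=12,
                     backend=pick_backend(need_auditable_fp32=False,
                                          vram_constrained=False))
```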
All A100 benchmarks: single NVIDIA A100 (40 GB), CUDA 12.6, Triton 3.2.0, PyTorch 2.6, averaged over 200 runs unless otherwise noted. We compare only exact, drop-in replacements; approximate linear-attention variants require retraining and are excluded.
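For context, the kind of CUDA-event timing loop these numbers imply looks roughly like this (our own sketch, not the authors' benchmark script):

```python
import torch
import torch.nn.functional as F

def mean_latency_ms(fn, iters=200, warmup=20):
    """Average GPU latency over `iters` runs, timed with CUDA events."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters   # milliseconds per call

q = k = v = torch.randn(1, 1, 4096, 64, device="cuda", dtype=torch.float32)
baseline = mean_latency_ms(
    lambda: F.scaled_dot_product_attention(q, k, v, scale=64 ** -0.5))
```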
Single attention head · synthetic sequences · 1K / 4K / 16K tokens
| Method | Precision | Tokens | Latency (ms) | Peak Mem (GB) | Speedup vs ME-SDPA |
|---|---|---|---|---|---|
| Math-SDPA | FP32 | 1K | 0.65 | 0.22 | 0.26× |
| ME-SDPA | FP32 | 1K | 0.25 | 0.11 | 1.00× (baseline) |
| ELSA | FP32 | 1K | 0.19 | 0.10 | 1.32× |
| Math-SDPA | FP32 | 4K | 8.28 | 1.83 | 0.35× |
| ME-SDPA | FP32 | 4K | 2.89 | 0.15 | 1.00× (baseline) |
| ELSA | FP32 | 4K | 0.92 | 0.16 | 3.14× |
| Math-SDPA | FP32 | 16K | OOM | OOM | — |
| ME-SDPA | FP32 | 16K | 12.06 | 0.29 | 1.00× (baseline) |
| ELSA | FP32 | 16K | 3.46 | 0.34 | 3.49× |
| Math-SDPA | FP16 | 1K | 0.55 | 0.22 | 0.22× |
| FA2 | FP16 | 1K | 0.18 | 0.11 | 0.67× |
| FA3 | FP16 | 1K | 0.27 | 0.12 | 0.44× |
| ME-SDPA | FP16 | 1K | 0.12 | 0.11 | 1.00× (baseline) |
| ELSA | FP16 | 1K | 0.12 | 0.10 | 1.00× |
| Math-SDPA | FP16 | 4K | 5.80 | 1.85 | 0.08× |
| FA2 | FP16 | 4K | 0.47 | 0.15 | 0.96× |
| FA3 | FP16 | 4K | 0.55 | 0.16 | 0.82× |
| ME-SDPA | FP16 | 4K | 0.49 | 0.15 | 1.00× (baseline) |
| ELSA | FP16 | 4K | 0.49 | 0.12 | 1.00× |
| Math-SDPA | FP16 | 16K | OOM | OOM | — |
| FA2 | FP16 | 16K | 6.01 | 0.26 | 0.93× |
| FA3 | FP16 | 16K | 6.66 | 0.36 | 0.84× |
| ME-SDPA | FP16 | 16K | 5.71 | 0.29 | 1.00× (baseline) |
| ELSA | FP16 | 16K | 6.66 | 0.19 | 0.86× |
ELSA vs. ME-SDPA baseline · no retraining · A100 · attention-scope speedup in parentheses
| Model | Variant | Method | Latency (ms) | Latency Reduction | Speedup vs ME-SDPA |
|---|---|---|---|---|---|
| ViT-B/32 | CLIP | ME-SDPA | 2.398 | — | 1.00× |
| ViT-B/32 | CLIP | ELSA-Strict | 2.159 | −10.0% | 1.111× |
| ViT-B/32 | CLIP | ELSA-Turbo | 2.150 | −10.3% | 1.115× |
| ViT-B/16 | CLIP | ME-SDPA | 4.568 | — | 1.00× |
| ViT-B/16 | CLIP | ELSA-Strict | 4.365 | −4.4% | 1.046× |
| ViT-B/16 | CLIP | ELSA-Turbo | 4.329 | −5.2% | 1.055× |
| ViT-L/14 | CLIP | ME-SDPA | 15.419 | — | 1.00× |
| ViT-L/14 | CLIP | ELSA-Strict | 14.914 | −3.3% | 1.034× |
| ViT-L/14 | CLIP | ELSA-Turbo | 14.842 | −3.7% | 1.039× |
| ViT-L/14@336 | CLIP | ME-SDPA | 29.665 | — | 1.00× |
| ViT-L/14@336 | CLIP | ELSA-Strict | 27.989 | −5.7% | 1.060× |
| ViT-L/14@336 | CLIP | ELSA-Turbo | 27.931 | −5.8% | 1.062× |
ELSA-FP32 vs. xFormers-FP32 · 350×518 input · 1,041 tokens/frame
| Model | Method | Frames | Time (s) | Memory (GB) | Speedup vs xFormers |
|---|---|---|---|---|---|
| VGGT | xFormers-FP32 | 50 | 18.36 | 15.74 | 1.00× (baseline) |
| VGGT | FA2-FP16 | 50 | 3.67 | 17.16 | 5.00× (FP16, +9% mem) |
| VGGT | ELSA-FP32 | 50 | 12.56 | 15.74 | 1.46× |
| VGGT | xFormers-FP32 | 100 | 61.75 | 26.67 | 1.00× (baseline) |
| VGGT | FA2-FP16 | 100 | 9.44 | 27.13 | 6.54× (FP16) |
| VGGT | ELSA-FP32 | 100 | 29.49 | 26.68 | 2.09× |
| VGGT | xFormers-FP32 | 150 | 130.83 | 37.59 | 1.00× (baseline) |
| VGGT | FA2-FP16 | 150 | 17.88 | 37.43 | 7.31× (FP16) |
| VGGT | ELSA-FP32 | 150 | 55.97 | 37.59 | 2.34× |
| FastVGGT | xFormers-FP32 | 100 | 23.94 | 11.50 | 1.00× |
| FastVGGT | ELSA-FP32 | 100 | 20.22 | 11.50 | 1.18× |
| FastVGGT | xFormers-FP32 | 200 | 65.81 | 18.56 | 1.00× |
| FastVGGT | ELSA-FP32 | 200 | 50.76 | 18.56 | 1.30× |
| FastVGGT | xFormers-FP32 | 300 | 129.97 | 25.63 | 1.00× |
| FastVGGT | ELSA-FP32 | 300 | 97.89 | 25.63 | 1.33× |
| FastVGGT | xFormers-FP32 | 400 | 218.30 | 32.70 | 1.00× |
| FastVGGT | ELSA-FP32 | 400 | 158.04 | 32.70 | 1.38× |
ELSA vs. PyTorch Math kernel · batch 1 · d=64 · no Tensor Core dependency
| Tokens | Precision | Math (ms) | ELSA (ms) | Speedup | Latency Reduction |
|---|---|---|---|---|---|
| 64 | FP16 | 1.06 | 0.69 | 1.54× | −34.9% |
| 196 | FP16 | 10.10 | 6.47 | 1.56× | −35.9% |
| 400 | FP16 | 41.23 | 25.95 | 1.59× | −37.0% |
| 576 | FP16 | 84.54 | 52.78 | 1.60× | −37.6% |
| 784 | FP16 | 157.45 | 98.47 | 1.60× | −37.4% |
| 900 | FP16 | 207.55 | 129.74 | 1.60× | −37.5% |
| 64 | FP32 | 1.06 | 0.69 | 1.54× | −34.9% |
| 196 | FP32 | 10.12 | 6.48 | 1.56× | −36.0% |
| 400 | FP32 | 41.34 | 26.04 | 1.59× | −37.0% |
| 576 | FP32 | 84.76 | 52.96 | 1.60× | −37.5% |
| 784 | FP32 | 157.89 | 98.86 | 1.60× | −37.4% |
| 900 | FP32 | 208.91 | 131.14 | 1.59× | −37.2% |
BERT fixed-shape · variable-length packing · DPT segmentation · Swin COCO
| Task / Dataset | Method | Precision | Latency (ms) | Throughput | Speedup |
|---|---|---|---|---|---|
| BERT SST-2 | ME-SDPA | FP32 | 1.308 | — | 1.00× |
| BERT SST-2 | ELSA | FP32 | 0.664 | — | 1.97× |
| BERT IMDB | ME-SDPA | FP32 | 17.693 | — | 1.00× |
| BERT IMDB | ELSA | FP32 | 7.794 | — | 2.27× |
| Padding-free VarLen | SDPA-FP32-mem | FP32 | — | 0.96 M tok/s | 1.00× |
| Padding-free VarLen | ELSA-Strict | FP32 | — | 2.00 M tok/s | 2.08× |
| Padding-free VarLen | ELSA-Turbo | TF32 | — | 2.64 M tok/s | 2.75× |
| Padding-free VarLen | ELSA-FP16 | FP16 | — | 5.03 M tok/s | 5.24× |
| DPT Segmentation (COCO) | ME-SDPA | FP32 | 295.7 | — | 1.00× |
| DPT Segmentation (COCO) | ELSA | FP32 | 252.7 | — | 1.17× |
| BERT Sentiment (FP16) | FA2 | FP16 | 224.1 | — | — |
| BERT Sentiment (FP16) | ELSA | FP16 | 96.8 | — | 2.31× |
| Swin COCO Det. (FP16) | FA2 | FP16 | 46.0 | — | — |
| Swin COCO Det. (FP16) | ELSA | FP16 | 40.1 | — | 1.15× |
Throughput (imgs/s) and peak memory on Pavia, Salinas, WHU datasets
| Model | Dataset | Method | Throughput (imgs/s) | Peak Mem (GB) | Gain vs ME-SDPA |
|---|---|---|---|---|---|
| HSIMAE-B | Pavia | ME-SDPA | 10489 ± 14 | 0.11 | — |
| HSIMAE-B | Pavia | ELSA | 14401 ± 53 | 0.11 | +37.1% |
| HSIMAE-L | Pavia | ME-SDPA | 6471 ± 97 | 0.23 | — |
| HSIMAE-L | Pavia | ELSA | 10479 ± 91 | 0.25 | +62.0% |
| HSIMAE-B | Salinas | ME-SDPA | 10427 ± 18 | 0.10 | — |
| HSIMAE-B | Salinas | ELSA | 14424 ± 106 | 0.10 | +38.3% |
| HSIMAE-L | Salinas | ME-SDPA | 6490 ± 21 | 0.23 | — |
| HSIMAE-L | Salinas | ELSA | 10396 ± 27 | 0.25 | +60.2% |
| HSIMAE-B | WHU | ME-SDPA | 10631 ± 14 | 0.11 | — |
| HSIMAE-B | WHU | ELSA | 14929 ± 16 | 0.11 | +40.4% |
| HSIMAE-L | WHU | ME-SDPA | 6588 ± 5 | 0.23 | — |
| HSIMAE-L | WHU | ELSA | 10657 ± 7 | 0.24 | +61.7% |
Benchmark plots (latency curves accompanying the tables above): single attention head at 1K / 4K / 16K tokens; ELSA vs. Math-SDPA across four ViT variants (attention module only); ELSA-FP32 vs. xFormers-FP32 on FastVGGT; ELSA vs. Math kernel (FP32, batch 1).
```bibtex
@inproceedings{hsu2026elsa,
  title     = {ELSA: Exact Linear-Scan Attention for Fast and Memory-Light Vision Transformers},
  author    = {Hsu, Chih-Chung and Ma, Xin-Di and Liao, Wo-Ting and Lee, Chia-Ming},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings},
  year      = {2026}
}
```