ELSA: Exact Linear-Scan Attention for Fast and Memory-Light Vision Transformers

Chih-Chung Hsu, Xin-Di Ma, Wo-Ting Liao, Chia-Ming Lee
National Yang Ming Chiao Tung University

CVPR Findings 2026

Can't run FlashAttention on your device? Try ELSA instead!

TL;DR

Standard FP32 attention is slow, and FlashAttention requires Tensor Cores (HMMA/GMMA) unavailable on older GPUs, AMD hardware, and edge devices. ELSA breaks this limitation by recasting attention as a parallel prefix scan over an associative monoid, achieving O(log n) parallel depth and O(n) memory on any hardware—with no retraining required.

On A100 (FP32), ELSA delivers up to 3.5× speedup over ME-SDPA at 16K tokens. On CLIP ViT-L/14 (strict FP32), ELSA reduces full image-encoder latency by 3.7% as a drop-in replacement—no retraining, no precision loss. At the attention-module level, speedup reaches 1.46–2.15× depending on resolution; gains are partially amortized in the full pipeline by non-attention compute. On Jetson TX2 (edge device), ELSA achieves a consistent ~1.6× speedup over the Math kernel across all sequence lengths tested (64–900 tokens).

Abstract

Existing attention accelerators often trade exact softmax semantics or suffer from sequential depth that limits throughput on long sequences. We present ELSA, an algorithmic reformulation of softmax attention that preserves exact semantics in real arithmetic. By casting the online softmax update into a prefix scan over an associative monoid of triples $(m, S, W)$, ELSA achieves $O(n)$ extra memory and $O(\log n)$ parallel depth. Implemented in Triton and CUDA C++ without Tensor Core dependencies, it improves FP32 inference latency over memory-efficient SDPA by 1.3–3.5× as a drop-in replacement, with no retraining required.

Introduction

Vision Transformers have become the backbone of modern perception systems—from satellite imagery and medical diagnostics to autonomous driving and robot navigation. But as these models get deployed in the real world, a hard constraint emerges: not every device has a data-centre GPU.

🤖 Robotics & Autonomous Systems

Platforms like NVIDIA Jetson AGX Orin and Drive AGX run perception stacks in real time with tight power and VRAM budgets. Materialising the full O(n²) attention matrix is simply not an option, and FlashAttention's Tensor Core dependency (HMMA/GMMA) makes it unavailable on many embedded GPUs (e.g. Jetson TX2, compute capability 6.2). ME-SDPA preserves memory but processes tokens sequentially, leaving the majority of CUDA threads idle and burning latency budget.

🏥 High-Precision Domains (Medical, Hyperspectral)

Clinical imaging, hyperspectral sensors (12–16-bit dynamic range), and scientific microscopy demand FP32 exactness; half-precision accumulation is clinically inadmissible. FlashAttention's FP32 fallback reverts to untuned SIMD without fused HMMA kernels, losing its throughput advantage entirely. ME-SDPA supports FP32 but is up to 3.5× slower than ELSA at 16K tokens.

🌐 Foundation Models on Constrained Hardware

Re-training CLIP, SAM, VGGT, or LLaMA to compensate for approximate attention costs millions of GPU-hours. Yet standard exact attention OOMs at long sequences, and FlashAttention requires Ampere/Hopper Tensor Cores unavailable on older, AMD, or edge hardware. There is no hardware-agnostic, exact, high-throughput drop-in for FP32 inference—until ELSA.

Why Existing Solutions Fall Short

| Approach | Exact FP32 | No Tensor Cores needed | Parallel (O(log n) depth) | Retrain-free | Verdict |
|---|---|---|---|---|---|
| Standard SDPA (Math) | ✓ | ✓ | ✗ O(n²) memory | ✓ | OOM at long sequences |
| ME-SDPA (xFormers) | ✓ | ✓ | ✗ O(n/Tₖ) sequential depth | ✓ | 3.5× slower than ELSA at 16K |
| FlashAttention-2/3 | ✗ FP32 path uncompetitive | ✗ requires HMMA/GMMA | ✓ | ✓ | Unavailable on edge / AMD / older GPUs |
| Linear / Approximate Attention | ✗ modifies softmax | ✓ | ✓ | ✗ requires retraining | Impractical for foundation models |
| ELSA (Ours) | ✓ provable O(u log n) | ✓ scalar max/exp + fma only | ✓ O(log n) depth | ✓ drop-in, zero retraining | Fills the gap |

Motivation

The key insight behind ELSA comes from an analogy with classical image processing. A Summed-Area Table (SAT)—also known as an integral image—lets you compute the sum of any rectangular region in constant time, regardless of its size, by a single-pass precomputation.
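As a quick, generic illustration of the SAT idea (a standalone sketch, not ELSA code): one cumulative-sum pass builds the table, after which any rectangle sum costs four look-ups.

```python
import torch

# Summed-area table (integral image): single-pass precomputation.
img = torch.rand(480, 640, dtype=torch.float64)
sat = img.cumsum(dim=0).cumsum(dim=1)

def rect_sum(sat, r0, c0, r1, c1):
    """Inclusive sum of img[r0:r1+1, c0:c1+1] in O(1) via inclusion-exclusion."""
    total = sat[r1, c1]
    if r0 > 0:
        total = total - sat[r0 - 1, c1]
    if c0 > 0:
        total = total - sat[r1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total = total + sat[r0 - 1, c0 - 1]
    return total

assert torch.allclose(rect_sum(sat, 10, 20, 99, 199), img[10:100, 20:200].sum())
```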

Softmax attention faces an analogous bottleneck. The standard online recurrence of Rabe & Staats compresses the $O(n^2)$ attention matrix to $O(n)$ memory, but at the cost of strict sequentiality. ELSA resolves this by showing that the attention state $(m, S, W)$ satisfies the same composability property as prefix sums—enabling a fully parallel reduction with $O(\log n)$ depth.

Method: Parallel Monoid Scan

ELSA transforms the sequential online softmax recurrence into a fully parallelizable computation by proving that attention states $(m, S, W)$ form an associative monoid under a merge operator $\oplus$.

Integral Image analogy for ELSA

From Sequential Recurrence to Parallel Scan

Classical ME-SDPA chains tokens one by one: each step depends on the previous running maximum $m_j$, forcing $O(n)$ sequential depth. ELSA breaks this chain by observing that any two contiguous blocks can be merged exactly via:

$u_A \oplus u_B = \mathrm{renorm}\!\left(\max(m_A, m_B),\; \tilde{S}_A + \tilde{S}_B,\; \tilde{W}_A + \tilde{W}_B\right)$

where $\tilde{S}$ and $\tilde{W}$ denote each block's partial sums rescaled by $\exp(m - \max(m_A, m_B))$ to the shared maximum. Because the operator is associative, blocks can be combined in any tree order without changing the result.
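A minimal PyTorch sketch of this merge (illustrative naming such as `elsa_merge` and the tuple layout are ours, not the released Triton/CUDA kernels):

```python
import torch

def elsa_merge(a, b):
    """Exactly merge two partial attention states (m, S, W).

    m: running max of logits,              shape (..., n_q, 1)
    S: sum of exp(logit - m) * V per block, shape (..., n_q, d)
    W: sum of exp(logit - m) per block,     shape (..., n_q, 1)
    """
    m_a, S_a, W_a = a
    m_b, S_b, W_b = b
    m = torch.maximum(m_a, m_b)
    alpha = torch.exp(m_a - m)   # rescale block A to the shared max
    beta = torch.exp(m_b - m)    # rescale block B to the shared max
    return m, alpha * S_a + beta * S_b, alpha * W_a + beta * W_b
```

In real arithmetic the merged state does not depend on how the blocks are grouped, which is exactly the associativity the parallel scan exploits.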

Two-Level GPU Scan

  • Intra-block (Hillis–Steele): Each thread block independently reduces 128 tokens in shared memory using a tree scan.
  • Inter-block (Blelloch): A work-optimal two-pass scan aggregates block-level states in global memory, adding only $2\lceil\log_2(n/B)\rceil$ merge levels.

Total scan depth: $L(n, B) = \lceil\log_2 B\rceil + 2\lceil\log_2(n/B)\rceil + 3 \leq 2\lceil\log_2 n\rceil + 3$, with a provable FP32 relative error of $O(u \log n)$.
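For reference, the same computation can be written as a self-contained log-depth pairwise reduction over per-block states, reusing `elsa_merge` from the sketch above (again our illustrative names, covering the non-causal case used in ViTs, not the production kernels):

```python
def block_state(q, k_blk, v_blk, scale):
    """Local (m, S, W) state of one KV block, for every query row."""
    logits = (q @ k_blk.transpose(-2, -1)) * scale            # (..., n_q, B)
    m = logits.max(dim=-1, keepdim=True).values                # block-local max
    p = torch.exp(logits - m)
    return m, p @ v_blk, p.sum(dim=-1, keepdim=True)

def elsa_reference(q, k, v, scale, block=128):
    """Pairwise tree reduction over block states: O(log n) merge levels."""
    states = [block_state(q, k[..., i:i + block, :], v[..., i:i + block, :], scale)
              for i in range(0, k.shape[-2], block)]
    while len(states) > 1:                                     # one merge level per pass
        states = [elsa_merge(states[i], states[i + 1]) if i + 1 < len(states) else states[i]
                  for i in range(0, len(states), 2)]
    m, S, W = states[0]
    return S / W                                               # equals softmax(QKᵀ·scale) V
```

Up to FP32 roundoff, `elsa_reference(q, k, v, scale)` matches `torch.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1) @ v`.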

GPU Memory Hierarchy — Token Movement

The animation below visualises how ELSA moves tokens through the GPU memory hierarchy compared to ME-SDPA (xFormers FP32). Left: ME-SDPA loads one tile at a time sequentially (O(n) depth). Right: ELSA loads 64 tiles in parallel across 8 SRAM blocks, runs Hillis–Steele intra-block scans (log₂ 8 = 3 levels), then a Blelloch inter-block sweep, reaching O(log n) depth (6 merge steps for the 64 tiles shown) with no Tensor Core dependency.

Drop-in in Seconds

No retraining. No architecture changes. Pick your entry point and swap one line.

Level 1 · Raw kernel — you already have Q, K, V
Before
import torch.nn.functional as F

out = F.scaled_dot_product_attention(
    q, k, v, scale=scale
)
After (ELSA)
from elsa import ELSA_triton_fp32

out = ELSA_triton_fp32.apply(
    q, k, v, scale
)
✓ FP32 exact  ·  ✓ O(n) memory  ·  ✓ works on any CUDA GPU incl. Jetson TX2
Level 2 · Replace one attention layer in your model
Before
from timm.models.vision_transformer \
    import Attention

self.attn = Attention(
    dim=768,
    num_heads=12
)
After (ELSA)
from elsa import ElsaAttention

self.attn = ElsaAttention(
    dim=768,
    num_heads=12,
    backend="auto"
)
✓ Matches timm Attention interface exactly  ·  ✓ backend="auto" selects best kernel per dtype
Level 3 · Patch a pretrained timm model — zero model-code changes
Before
import timm

model = timm.create_model(
    "vit_large_patch14_clip_224",
    pretrained=True
).cuda().eval()

with torch.no_grad():
    out = model(images)
After (ELSA) — 2 lines added
import timm
from elsa import patch_vit_attention

model = timm.create_model(
    "vit_large_patch14_clip_224",
    pretrained=True
).cuda().eval()
patch_vit_attention(model,
    method="elsa", precision="fp32")
with torch.no_grad():
    out = model(images)
✓ Weights untouched  ·  ✓ accuracy identical  ·  ✓ up to 2.15× attention speedup on CLIP ViT-L/14
Level 5 · HuggingFace LLaMA — FP32 long-context inference
Before
from transformers import \
    AutoModelForCausalLM

model = AutoModelForCausalLM\
  .from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    torch_dtype=torch.float32,
    device_map="cuda"
  )

# OOM at ≥32K tokens in FP32
out = model(**long_input)
After (ELSA) — 2 lines added
from transformers import \
    AutoModelForCausalLM
from elsa import patch_llama_attention

model = AutoModelForCausalLM\
  .from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    torch_dtype=torch.float32,
    device_map="cuda"
  )
patch_llama_attention(model)
out = model(**long_input)
✓ No retraining  ·  ✓ perplexity identical  ·  ✓ +20.1% throughput at 65K tokens (LLaMA-13B, FP32 offloading)

Exactness Guarantee

A parallel scan only helps if it produces the same answer as sequential attention. We verify this empirically against a float64 oracle across three scenarios (regular, long, stress) and six complementary drift metrics.
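A drift check of this kind can be sketched as follows, using the `elsa_reference` toy from the Method section against a float64 softmax oracle (the paper's actual harness and metric set are richer; this only probes the two output-drift metrics):

```python
def output_drift(q, k, v, scale):
    """Relative L2 and max-abs output drift of the FP32 scan vs a float64 oracle."""
    p64 = torch.softmax((q.double() @ k.double().transpose(-2, -1)) * scale, dim=-1)
    y64 = p64 @ v.double()
    y32 = elsa_reference(q.float(), k.float(), v.float(), scale).double()
    rel_l2 = (y32 - y64).norm() / y64.norm()    # ‖δY‖₂ / ‖Y‖₂
    max_abs = (y32 - y64).abs().max()           # ‖δY‖∞
    return rel_l2.item(), max_abs.item()
```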

  • Argmax disagreement rate: 0. The top-attended token is identical to the float64 oracle in every scenario; this is the most behaviorally critical metric.
  • Max probability drift ‖δP‖∞ (p95, stress): 3.3×10⁻¹⁶, orders of magnitude below the FP32 unit roundoff u ≈ 6×10⁻⁸.
  • Relative L2 output error ‖δY‖₂/‖Y‖₂ (p95, stress): 4.9×10⁻¹⁵, negligible even under adversarial stress inputs at long sequences.
  • Provable FP32 error bound (Theorem 1): O(u log n), i.e. logarithmic growth versus the O(nu) sequential recurrence; numerically safer at scale.

95th-Percentile Drift vs. float64 Oracle

Regular: n=1024, d=64 · Long: n=8192 · Stress: adversarial logits, n=4096. All values well below single-precision unit roundoff u ≈ 6×10⁻⁸.

| Metric | Regular | Long | Stress | Interpretation |
|---|---|---|---|---|
| ‖δP‖∞ (max abs prob drift) | 3.12×10⁻¹⁷ | 2.34×10⁻¹⁷ | 3.33×10⁻¹⁶ | Worst-case pointwise attention-weight error |
| ‖δP‖₂/‖P‖₂ (rel. L2 prob) | 1.73×10⁻¹⁵ | 3.42×10⁻¹⁵ | 3.50×10⁻¹⁵ | Global distributional shift across all attention weights |
| JS divergence (per-row) | 3.56×10⁻¹⁶ | 3.77×10⁻¹⁶ | 3.07×10⁻¹⁶ | Per-query attention distribution shape preserved |
| Argmax disagreement rate | 0 | 0 | 0 | Top-attended token never changes — exact behavioral equivalence |
| ‖δY‖∞ (max abs output drift) | 4.99×10⁻¹⁶ | 4.99×10⁻¹⁶ | 3.28×10⁻¹⁵ | Worst-case error in output embeddings passed to the next layer |
| ‖δY‖₂/‖Y‖₂ (rel. L2 output) | 2.39×10⁻¹⁵ | 4.72×10⁻¹⁵ | 4.94×10⁻¹⁵ | Normalized aggregate output error across the full tensor |

Takeaways:
  • All metrics sit at float64-oracle level — indistinguishable from exact arithmetic.
  • Blockwise monoid validation: per-block drift is identical to the global result (Figure 7 in the paper).
  • Error grows as O(log n), not O(n) — safer than the sequential recurrence at scale.

ELSA Variants — Pick Your Operating Point

ELSA exposes three precision modes, spanning a continuum from maximum throughput to minimal VRAM. All variants are drop-in replaceable; select based on your hardware constraints.

ELSA-Strict (FP32) · 1.36 M tok/s · 0.336 GB peak
Full IEEE-754 FP32 throughout, with the provable O(u log n) error bound; 3.4× over ME-SDPA. Recommended for medical / scientific workloads where numerical auditability matters.

ELSA-Turbo (TF32) · 1.43 M tok/s · 0.288 GB peak
Default configuration for general use: the same memory footprint as ME-SDPA at 3.58× the throughput (see the table below).

ELSA-Lean (FP32, η=0.25) · 0.13 M tok/s · 0.288 GB peak (matches ME-SDPA exactly)
Streams smaller KV chunks (controlled by η), trading throughput for minimal VRAM. Best for severely memory-constrained deployments (embedded devices, consumer GPUs at very long sequences).

FP32 Variant Comparison at 16K Tokens (A100)

| Method | Throughput (M tok/s) | vs ME-SDPA | Peak Memory (GB) | Best For |
|---|---|---|---|---|
| ME-SDPA (baseline) | 0.40 | 1.00× | 0.289 | — |
| ELSA-Strict (FP32) | 1.36 | 3.40× (+240%) | 0.336 | Auditable FP32, medical / scientific |
| ELSA-Turbo (TF32) ★ | 1.43 | 3.58× (+258%) | 0.288 | General use — default configuration |
| ELSA-Lean (η=0.25) | 0.13 | 0.33× | 0.288 | Minimum VRAM, embedded / consumer GPU |

Takeaways:
  • ELSA-Turbo: same memory as ME-SDPA, 3.58× the throughput.
  • All variants preserve exact softmax semantics with no retraining.
  • Switch variants by setting backend= in ElsaAttention or via an environment variable.
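For illustration, switching variants might look like the following. The backend argument itself is documented above, but the specific string values here are assumptions; check the ELSA repository for the exact accepted names.

```python
from elsa import ElsaAttention

# NOTE: backend names below are hypothetical placeholders, not confirmed API values.
attn_strict = ElsaAttention(dim=768, num_heads=12, backend="elsa_strict_fp32")  # auditable FP32
attn_turbo  = ElsaAttention(dim=768, num_heads=12, backend="elsa_turbo_tf32")   # throughput default
attn_lean   = ElsaAttention(dim=768, num_heads=12, backend="elsa_lean_fp32")    # minimum VRAM
```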

FP16 Regime (vs FlashAttention)

At 16K tokens in FP16: ELSA delivers 2.49M tok/s at 0.19 GB—the lowest memory among all exact kernels. FA2 is ~15% faster (2.88M tok/s) but needs 0.29 GB; FA3 reaches 2.73M tok/s at 0.36 GB. ME-SDPA matches ELSA's 0.19 GB footprint but achieves only 1.47M tok/s. ELSA narrows the speed gap as sequence length grows, while retaining full FP32 capability on the same kernel.

Performance Benchmarks

All A100 benchmarks: single NVIDIA A100 (40 GB), CUDA 12.6, Triton 3.2.0, PyTorch 2.6, averaged over 200 runs unless otherwise noted. We compare only exact, drop-in replacements; approximate linear-attention variants require retraining and are excluded.

FP32 Latency & Peak Memory — A100

Single attention head · synthetic sequences · 1K / 4K / 16K tokens

* Standard SDPA OOM at 16K tokens in FP32/FP16. † FA2/FA3 FP32 path reverts to untuned SIMD and is not reported.
| Method | Precision | Tokens | Latency (ms) | Peak Mem (GB) | Speedup vs ME-SDPA |
|---|---|---|---|---|---|
| Math-SDPA | FP32 | 1K | 0.65 | 0.22 | 0.26× |
| ME-SDPA | FP32 | 1K | 0.25 | 0.11 | 1.00× (baseline) |
| ELSA | FP32 | 1K | 0.19 | 0.10 | 1.32× (+32%) |
| Math-SDPA | FP32 | 4K | 8.28 | 1.83 | 0.35× |
| ME-SDPA | FP32 | 4K | 2.89 | 0.15 | 1.00× (baseline) |
| ELSA | FP32 | 4K | 0.92 | 0.16 | 3.14× (+214%) |
| Math-SDPA | FP32 | 16K | OOM | OOM | — |
| ME-SDPA | FP32 | 16K | 12.06 | 0.29 | 1.00× (baseline) |
| ELSA | FP32 | 16K | 3.46 | 0.34 | 3.49× (+249%) |
| Math-SDPA | FP16 | 1K | 0.55 | 0.22 | 0.22× |
| FA2 | FP16 | 1K | 0.18 | 0.11 | 0.67× |
| FA3 | FP16 | 1K | 0.27 | 0.12 | 0.44× |
| ME-SDPA | FP16 | 1K | 0.12 | 0.11 | 1.00× (baseline) |
| ELSA | FP16 | 1K | 0.12 | 0.10 | 1.00× |
| Math-SDPA | FP16 | 4K | 5.80 | 1.85 | 0.08× |
| FA2 | FP16 | 4K | 0.47 | 0.15 | 0.96× |
| FA3 | FP16 | 4K | 0.55 | 0.16 | 0.82× |
| ME-SDPA | FP16 | 4K | 0.49 | 0.15 | 1.00× (baseline) |
| ELSA | FP16 | 4K | 0.49 | 0.12 | 1.00× |
| Math-SDPA | FP16 | 16K | OOM | OOM | — |
| FA2 | FP16 | 16K | 6.01 | 0.26 | 0.93× |
| FA3 | FP16 | 16K | 6.66 | 0.36 | 0.84× |
| ME-SDPA | FP16 | 16K | 5.71 | 0.29 | 1.00× (baseline) |
| ELSA | FP16 | 16K | 6.66 | 0.19 | 0.86× |

Takeaways:
  • FP32 peak speedup: 3.49× at 16K tokens.
  • ELSA has the lowest memory at 16K FP16: 0.19 GB.
  • Math-SDPA OOMs at 16K tokens (FP32).

CLIP Image-Encoder Inference — Full Pipeline, Strict FP32

ELSA vs. ME-SDPA baseline · no retraining · A100 · attention-scope speedup in parentheses

* FA2/FA3 have no compatible FP32 path for CLIP; ELSA is the only hardware-agnostic exact-attention alternative. † Attention-proxy speedups (attention module only): ViT-B/16 1.46–1.98×, ViT-L/14 1.54–2.15×.
| Model | Variant | Method | Latency (ms) | Latency Reduction | Speedup vs ME-SDPA |
|---|---|---|---|---|---|
| ViT-B/32 | CLIP | ME-SDPA | 2.398 | — | 1.00× |
| ViT-B/32 | CLIP | ELSA-Strict | 2.159 | −10.0% | 1.111× (+11.1%) |
| ViT-B/32 | CLIP | ELSA-Turbo | 2.150 | −10.3% | 1.115× (+11.5%) |
| ViT-B/16 | CLIP | ME-SDPA | 4.568 | — | 1.00× |
| ViT-B/16 | CLIP | ELSA-Strict | 4.365 | −4.4% | 1.046× (+4.6%) |
| ViT-B/16 | CLIP | ELSA-Turbo | 4.329 | −5.2% | 1.055× (+5.5%) |
| ViT-L/14 | CLIP | ME-SDPA | 15.419 | — | 1.00× |
| ViT-L/14 | CLIP | ELSA-Strict | 14.914 | −3.3% | 1.034× (+3.4%) |
| ViT-L/14 | CLIP | ELSA-Turbo | 14.842 | −3.7% | 1.039× (+3.9%) |
| ViT-L/14@336 | CLIP | ME-SDPA | 29.665 | — | 1.00× |
| ViT-L/14@336 | CLIP | ELSA-Strict | 27.989 | −5.7% | 1.060× (+6.0%) |
| ViT-L/14@336 | CLIP | ELSA-Turbo | 27.931 | −5.8% | 1.062× (+6.2%) |

Takeaways:
  • Best full-pipeline speedup: 1.115× (ViT-B/32, ELSA-Turbo).
  • FA2/FA3 have no compatible FP32 path for CLIP.
  • Attention module only: up to 2.15× (ViT-L/14-336 at 560 px).

3D Reconstruction — VGGT & FastVGGT Speedup

ELSA-FP32 vs. xFormers-FP32 · 350×518 input · 1,041 tokens/frame

| Model | Method | Frames | Time (s) | Memory (GB) | Speedup vs xFormers |
|---|---|---|---|---|---|
| VGGT | xFormers-FP32 | 50 | 18.36 | 15.74 | 1.00× (baseline) |
| VGGT | FA2-FP16 | 50 | 3.67 | 17.16 | 5.00× (FP16, +9% mem) |
| VGGT | ELSA-FP32 | 50 | 12.56 | 15.74 | 1.46× (+46%) |
| VGGT | xFormers-FP32 | 100 | 61.75 | 26.67 | 1.00× (baseline) |
| VGGT | FA2-FP16 | 100 | 9.44 | 27.13 | 6.54× (FP16) |
| VGGT | ELSA-FP32 | 100 | 29.49 | 26.68 | 2.09× (+109%) |
| VGGT | xFormers-FP32 | 150 | 130.83 | 37.59 | 1.00× (baseline) |
| VGGT | FA2-FP16 | 150 | 17.88 | 37.43 | 7.31× (FP16) |
| VGGT | ELSA-FP32 | 150 | 55.97 | 37.59 | 2.34× (+134%) |
| FastVGGT | xFormers-FP32 | 100 | 23.94 | 11.50 | 1.00× |
| FastVGGT | ELSA-FP32 | 100 | 20.22 | 11.50 | 1.18× (+18%) |
| FastVGGT | xFormers-FP32 | 200 | 65.81 | 18.56 | 1.00× |
| FastVGGT | ELSA-FP32 | 200 | 50.76 | 18.56 | 1.30× (+30%) |
| FastVGGT | xFormers-FP32 | 300 | 129.97 | 25.63 | 1.00× |
| FastVGGT | ELSA-FP32 | 300 | 97.89 | 25.63 | 1.33× (+33%) |
| FastVGGT | xFormers-FP32 | 400 | 218.30 | 32.70 | 1.00× |
| FastVGGT | ELSA-FP32 | 400 | 158.04 | 32.70 | 1.38× (+38%) |

Takeaways:
  • VGGT peak: 2.34× at 150 frames (FP32, exact).
  • Memory identical to xFormers-FP32 across all configurations.
  • Math-FP32 OOMs beyond 30 frames.

Embedded Device — Jetson TX2 (FP32 & FP16)

ELSA vs. PyTorch Math kernel · batch 1 · d=64 · no Tensor Core dependency

| Tokens | Precision | Math (ms) | ELSA (ms) | Speedup | Latency Reduction |
|---|---|---|---|---|---|
| 64 | FP16 | 1.06 | 0.69 | 1.54× | −34.9% |
| 196 | FP16 | 10.10 | 6.47 | 1.56× | −35.9% |
| 400 | FP16 | 41.23 | 25.95 | 1.59× | −37.0% |
| 576 | FP16 | 84.54 | 52.78 | 1.60× | −37.6% |
| 784 | FP16 | 157.45 | 98.47 | 1.60× | −37.4% |
| 900 | FP16 | 207.55 | 129.74 | 1.60× | −37.5% |
| 64 | FP32 | 1.06 | 0.69 | 1.54× | −34.9% |
| 196 | FP32 | 10.12 | 6.48 | 1.56× | −36.0% |
| 400 | FP32 | 41.34 | 26.04 | 1.59× | −37.0% |
| 576 | FP32 | 84.76 | 52.96 | 1.60× | −37.5% |
| 784 | FP32 | 157.89 | 98.86 | 1.60× | −37.4% |
| 900 | FP32 | 208.91 | 131.14 | 1.59× | −37.2% |

Takeaways:
  • Consistent ~1.6× speedup across all token lengths.
  • FP16 ≈ FP32 latency on TX2 (SIMD, no Tensor Cores).
  • FA2/FA3 are unavailable on Jetson TX2 (compute capability 6.2).

NLP & Downstream Tasks

BERT fixed-shape · variable-length packing · DPT segmentation · Swin COCO

| Task / Dataset | Method | Precision | Latency (ms) | Throughput | Speedup |
|---|---|---|---|---|---|
| BERT SST-2 | ME-SDPA | FP32 | 1.308 | — | 1.00× |
| BERT SST-2 | ELSA | FP32 | 0.664 | — | 1.97× (+97%) |
| BERT IMDB | ME-SDPA | FP32 | 17.693 | — | 1.00× |
| BERT IMDB | ELSA | FP32 | 7.794 | — | 2.27× (+127%) |
| Padding-free VarLen | SDPA-FP32-mem | FP32 | — | 0.96 M tok/s | 1.00× |
| Padding-free VarLen | ELSA-Strict | FP32 | — | 2.00 M tok/s | 2.08× (+109%) |
| Padding-free VarLen | ELSA-Turbo | TF32 | — | 2.64 M tok/s | 2.75× (+175%) |
| Padding-free VarLen | ELSA-FP16 | FP16 | — | 5.03 M tok/s | 5.24× (+424%) |
| DPT Segmentation (COCO) | ME-SDPA | FP32 | 295.7 | — | 1.00× |
| DPT Segmentation (COCO) | ELSA | FP32 | 252.7 | — | 1.17× (+17%) |
| BERT Sentiment (FP16) | FA2 | FP16 | 224.1 | — | — |
| BERT Sentiment (FP16) | ELSA | FP16 | 96.8 | — | 2.31× (+131%) |
| Swin COCO Det. (FP16) | FA2 | FP16 | 46.0 | — | — |
| Swin COCO Det. (FP16) | ELSA | FP16 | 40.1 | — | 1.15× (+15%) |

Takeaways:
  • BERT IMDB: 2.27× (longer sequences benefit more).
  • FP16 variable-length packing: 5.24× over SDPA-mem.
  • FP16 BERT vs FA2: 2.31× (true sequence lengths vs FA2's padded 512-token batches).

Hyperspectral Classification — HSIMAE-B/L (FP32)

Throughput (imgs/s) and peak memory on Pavia, Salinas, WHU datasets

| Model | Dataset | Method | Throughput (imgs/s) | Peak Mem (GB) | Gain vs ME-SDPA |
|---|---|---|---|---|---|
| HSIMAE-B | Pavia | ME-SDPA | 10489 ± 14 | 0.11 | — |
| HSIMAE-B | Pavia | ELSA | 14401 ± 53 | 0.11 | +37.1% |
| HSIMAE-L | Pavia | ME-SDPA | 6471 ± 97 | 0.23 | — |
| HSIMAE-L | Pavia | ELSA | 10479 ± 91 | 0.25 | +62.0% |
| HSIMAE-B | Salinas | ME-SDPA | 10427 ± 18 | 0.10 | — |
| HSIMAE-B | Salinas | ELSA | 14424 ± 106 | 0.10 | +38.3% |
| HSIMAE-L | Salinas | ME-SDPA | 6490 ± 21 | 0.23 | — |
| HSIMAE-L | Salinas | ELSA | 10396 ± 27 | 0.25 | +60.2% |
| HSIMAE-B | WHU | ME-SDPA | 10631 ± 14 | 0.11 | — |
| HSIMAE-B | WHU | ELSA | 14929 ± 16 | 0.11 | +40.4% |
| HSIMAE-L | WHU | ME-SDPA | 6588 ± 5 | 0.23 | — |
| HSIMAE-L | WHU | ELSA | 10657 ± 7 | 0.24 | +61.7% |

Takeaways:
  • HSIMAE-L peak: +62.0% throughput (Pavia).
  • Negligible memory overhead across all configurations.

Visual Overview

  • FP32 latency (ms) on A100 — single attention head, 1K / 4K / 16K tokens.
  • CLIP attention-proxy speedup vs. resolution — ELSA vs. Math-SDPA, four ViT variants (attention module only).
  • 3D reconstruction speedup (VGGT & FastVGGT) — ELSA-FP32 vs. xFormers-FP32.
  • Embedded device (Jetson TX2) — ELSA vs. Math kernel, FP32, batch 1.

BibTeX

@inproceedings{hsu2026elsa,
  title     = {ELSA: Exact Linear-Scan Attention for Fast and Memory-Light Vision Transformers},
  author    = {Hsu, Chih-Chung and Ma, Xin-Di and Liao, Wo-Ting and Lee, Chia-Ming},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings},
  year      = {2026}
}