ELSA: Exact Linear-Scan Attention for Fast and Memory-Light Vision Transformers

Chih-Chung Hsu, Xin-Di Ma, Wo-Ting Liao, Chia-Ming Lee
National Yang Ming Chiao Tung University

CVPR Findings 2026

Can't run FlashAttention on your device? Try ELSA instead!

TL;DR

Standard FP32 attention is slow, and FlashAttention requires Tensor Cores (HMMA/GMMA) unavailable on older GPUs, AMD hardware, and edge devices. ELSA breaks this limitation by recasting attention as a parallel prefix scan over an associative monoid, achieving O(log n) parallel depth and O(n) memory on any hardware—with no retraining required.

On A100 (FP32), ELSA delivers up to 3.5× speedup over ME-SDPA at 16K tokens. On CLIP ViT-L/14 (strict FP32), ELSA reduces full image-encoder latency by 3.7% as a drop-in replacement—no retraining, no precision loss. At the attention-module level, speedup reaches 1.46–2.15× depending on resolution; gains are partially amortized in the full pipeline by non-attention compute. On Jetson TX2 (edge device), ELSA achieves a consistent ~1.6× speedup over the Math kernel across all sequence lengths tested (64–900 tokens).

Abstract

Existing attention accelerators often trade exact softmax semantics or suffer from sequential depth that limits throughput on long sequences. We present ELSA, an algorithmic reformulation of softmax attention that preserves exact semantics in real arithmetic. By casting the online softmax update into a prefix scan over an associative monoid of triples $(m, S, W)$, ELSA achieves $O(n)$ extra memory and $O(\log n)$ parallel depth. Implemented in Triton and CUDA C++ without Tensor Core dependencies, it improves FP32 inference latency over memory-efficient SDPA by 1.3–3.5× as a drop-in replacement, with no retraining required.

Introduction

Vision Transformers have become the backbone of modern perception systems—from satellite imagery and medical diagnostics to autonomous driving and robot navigation. But as these models get deployed in the real world, a hard constraint emerges: not every device has a data-centre GPU.

🤖 Robotics & Autonomous Systems

Platforms like NVIDIA Jetson AGX Orin and Drive AGX run perception stacks in real time with tight power and VRAM budgets. Materialising the full O(n²) attention matrix is simply not an option, and FlashAttention's Tensor Core dependency (HMMA/GMMA) makes it unavailable on many embedded GPUs (e.g. Jetson TX2, compute capability 6.2). ME-SDPA preserves memory but processes tokens sequentially, leaving the majority of CUDA threads idle and burning latency budget.

🏥 High-Precision Domains (Medical, Hyperspectral)

Clinical imaging, hyperspectral sensors (12–16-bit dynamic range), and scientific microscopy demand FP32 exactness; half-precision accumulation is clinically inadmissible. FlashAttention's FP32 fallback reverts to untuned SIMD without fused HMMA kernels, losing its throughput advantage entirely. ME-SDPA supports FP32 but is up to 3.5× slower than ELSA at 16K tokens.

🌐 Foundation Models on Constrained Hardware

Re-training CLIP, SAM, VGGT, or LLaMA to compensate for approximate attention costs millions of GPU-hours. Yet standard exact attention OOMs at long sequences, and FlashAttention requires Ampere/Hopper Tensor Cores unavailable on older, AMD, or edge hardware. There is no hardware-agnostic, exact, high-throughput drop-in for FP32 inference—until ELSA.

Why Existing Solutions Fall Short

| Approach | Exact FP32 | No Tensor Cores needed | Parallel (O(log n) depth) | Retrain-free | Verdict |
|---|---|---|---|---|---|
| Standard SDPA (Math) | ✓ | ✓ | ✗ O(n²) memory | ✓ | OOM at long sequences |
| ME-SDPA (xFormers) | ✓ | ✓ | ✗ O(n/Tₖ) sequential depth | ✓ | 3.5× slower than ELSA at 16K |
| FlashAttention-2/3 | ✗ FP32 path uncompetitive | ✗ requires HMMA/GMMA | ✓ | ✓ | Unavailable on edge / AMD / older GPUs |
| Linear / Approximate Attention | ✗ modifies softmax | ✓ | ✓ | ✗ requires retraining | Impractical for foundation models |
| ELSA (Ours) | ✓ provable O(u log n) | ✓ scalar max/exp + fma only | ✓ O(log n) depth | ✓ drop-in, zero retraining | Fills the gap |

Motivation

The key insight behind ELSA comes from an analogy with classical image processing. A Summed-Area Table (SAT)—also known as an integral image—lets you compute the sum of any rectangular region in constant time, regardless of its size, by a single-pass precomputation.
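As a quick, generic illustration of the SAT idea (a standalone sketch, not ELSA code): one cumulative-sum pass builds the table, after which any rectangle sum costs four look-ups.

```python
import torch

# Summed-area table (integral image): single-pass precomputation.
img = torch.rand(480, 640, dtype=torch.float64)
sat = img.cumsum(dim=0).cumsum(dim=1)

def rect_sum(sat, r0, c0, r1, c1):
    """Inclusive sum of img[r0:r1+1, c0:c1+1] in O(1) via inclusion-exclusion."""
    total = sat[r1, c1]
    if r0 > 0:
        total = total - sat[r0 - 1, c1]
    if c0 > 0:
        total = total - sat[r1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total = total + sat[r0 - 1, c0 - 1]
    return total

assert torch.allclose(rect_sum(sat, 10, 20, 99, 199), img[10:100, 20:200].sum())
```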

Softmax attention faces an analogous bottleneck. The standard online recurrence of Rabe & Staats compresses the $O(n^2)$ attention matrix to $O(n)$ memory, but at the cost of strict sequentiality. ELSA resolves this by showing that the attention state $(m, S, W)$ satisfies the same composability property as prefix sums—enabling a fully parallel reduction with $O(\log n)$ depth.

Method: Parallel Monoid Scan

ELSA transforms the sequential online softmax recurrence into a fully parallelizable computation by proving that attention states $(m, S, W)$ form an associative monoid under a merge operator $\oplus$.

Integral Image analogy for ELSA

From Sequential Recurrence to Parallel Scan

Classical ME-SDPA chains tokens one by one: each step depends on the previous running maximum $m_j$, forcing $O(n)$ sequential depth. ELSA breaks this chain by observing that any two contiguous blocks can be merged exactly via:

$u_A \oplus u_B = \mathrm{renorm}\!\left(\max(m_A, m_B),\; \tilde{S}_A + \tilde{S}_B,\; \tilde{W}_A + \tilde{W}_B\right)$

where $\tilde{S}$ and $\tilde{W}$ denote each block's partial sums rescaled by $\exp(m - \max(m_A, m_B))$ to the shared maximum. Because the operator is associative, blocks can be combined in any tree order without changing the result.
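A minimal PyTorch sketch of this merge (illustrative naming such as `elsa_merge` and the tuple layout are ours, not the released Triton/CUDA kernels):

```python
import torch

def elsa_merge(a, b):
    """Exactly merge two partial attention states (m, S, W).

    m: running max of logits,              shape (..., n_q, 1)
    S: sum of exp(logit - m) * V per block, shape (..., n_q, d)
    W: sum of exp(logit - m) per block,     shape (..., n_q, 1)
    """
    m_a, S_a, W_a = a
    m_b, S_b, W_b = b
    m = torch.maximum(m_a, m_b)
    alpha = torch.exp(m_a - m)   # rescale block A to the shared max
    beta = torch.exp(m_b - m)    # rescale block B to the shared max
    return m, alpha * S_a + beta * S_b, alpha * W_a + beta * W_b
```

In real arithmetic the merged state does not depend on how the blocks are grouped, which is exactly the associativity the parallel scan exploits.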

Two-Level GPU Scan

  • Intra-block (Hillis–Steele): Each thread block independently reduces 128 tokens in shared memory using a tree scan.
  • Inter-block (Blelloch): A work-optimal two-pass scan aggregates block-level states in global memory, adding only $2\lceil\log_2(n/B)\rceil$ merge levels.

Total scan depth: $L(n, B) = \lceil\log_2 B\rceil + 2\lceil\log_2(n/B)\rceil + 3 \leq 2\lceil\log_2 n\rceil + 3$, with a provable FP32 relative error of $O(u \log n)$.
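For reference, the same computation can be written as a self-contained log-depth pairwise reduction over per-block states, reusing `elsa_merge` from the sketch above (again our illustrative names, covering the non-causal case used in ViTs, not the production kernels):

```python
def block_state(q, k_blk, v_blk, scale):
    """Local (m, S, W) state of one KV block, for every query row."""
    logits = (q @ k_blk.transpose(-2, -1)) * scale            # (..., n_q, B)
    m = logits.max(dim=-1, keepdim=True).values                # block-local max
    p = torch.exp(logits - m)
    return m, p @ v_blk, p.sum(dim=-1, keepdim=True)

def elsa_reference(q, k, v, scale, block=128):
    """Pairwise tree reduction over block states: O(log n) merge levels."""
    states = [block_state(q, k[..., i:i + block, :], v[..., i:i + block, :], scale)
              for i in range(0, k.shape[-2], block)]
    while len(states) > 1:                                     # one merge level per pass
        states = [elsa_merge(states[i], states[i + 1]) if i + 1 < len(states) else states[i]
                  for i in range(0, len(states), 2)]
    m, S, W = states[0]
    return S / W                                               # equals softmax(QKᵀ·scale) V
```

Up to FP32 roundoff, `elsa_reference(q, k, v, scale)` matches `torch.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1) @ v`.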

GPU Memory Hierarchy — Token Movement

The animation below visualises how ELSA moves tokens through the GPU memory hierarchy compared to ME-SDPA (xFormers FP32). Left: ME-SDPA loads one tile at a time sequentially (O(n) depth). Right: ELSA loads 64 tiles in parallel across 8 SRAM blocks, runs Hillis–Steele intra-block scans (log₂ 8 = 3 levels), then a Blelloch inter-block sweep, reaching O(log n) depth (6 merge steps for the 64 tiles shown) with no Tensor Core dependency.

Drop-in in Seconds

No retraining. No architecture changes. Pick your entry point and swap one line.

Level 1 · Raw kernel — you already have Q, K, V
Before
import torch.nn.functional as F

out = F.scaled_dot_product_attention(
    q, k, v, scale=scale
)
After (ELSA)
from elsa import ELSA_triton_fp32

out = ELSA_triton_fp32.apply(
    q, k, v, scale
)
✓ FP32 exact  ·  ✓ O(n) memory  ·  ✓ works on any CUDA GPU incl. Jetson TX2
Level 2 · Replace one attention layer in your model
Before
from timm.models.vision_transformer \
    import Attention

self.attn = Attention(
    dim=768,
    num_heads=12
)
After (ELSA)
from elsa import ElsaAttention

self.attn = ElsaAttention(
    dim=768,
    num_heads=12,
    backend="auto"
)
✓ Matches timm Attention interface exactly  ·  ✓ backend="auto" selects best kernel per dtype
Level 3 · Patch a pretrained timm model — zero model-code changes
Before
import timm

model = timm.create_model(
    "vit_large_patch14_clip_224",
    pretrained=True
).cuda().eval()

with torch.no_grad():
    out = model(images)
After (ELSA) — 2 lines added
import timm
from elsa import patch_vit_attention

model = timm.create_model(
    "vit_large_patch14_clip_224",
    pretrained=True
).cuda().eval()
patch_vit_attention(model,
    method="elsa", precision="fp32")
with torch.no_grad():
    out = model(images)
✓ Weights untouched  ·  ✓ accuracy identical  ·  ✓ up to 2.15× attention speedup on CLIP ViT-L/14
Level 5 · HuggingFace LLaMA — FP32 long-context inference
Before
from transformers import \
    AutoModelForCausalLM

model = AutoModelForCausalLM\
  .from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    torch_dtype=torch.float32,
    device_map="cuda"
  )

# OOM at ≥32K tokens in FP32
out = model(**long_input)
After (ELSA) — 2 lines added
from transformers import \
    AutoModelForCausalLM
from elsa import patch_llama_attention

model = AutoModelForCausalLM\
  .from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    torch_dtype=torch.float32,
    device_map="cuda"
  )
patch_llama_attention(model)
out = model(**long_input)
✓ No retraining  ·  ✓ perplexity identical  ·  ✓ +20.1% throughput at 65K tokens (LLaMA-13B, FP32 offloading)

Exactness Guarantee

A parallel scan only helps if it produces the same answer as sequential attention. We verify this empirically against a float64 oracle across three scenarios (regular, long, stress) and six complementary drift metrics.
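A drift check of this kind can be sketched as follows, using the `elsa_reference` toy from the Method section against a float64 softmax oracle (the paper's actual harness and metric set are richer; this only probes the two output-drift metrics):

```python
def output_drift(q, k, v, scale):
    """Relative L2 and max-abs output drift of the FP32 scan vs a float64 oracle."""
    p64 = torch.softmax((q.double() @ k.double().transpose(-2, -1)) * scale, dim=-1)
    y64 = p64 @ v.double()
    y32 = elsa_reference(q.float(), k.float(), v.float(), scale).double()
    rel_l2 = (y32 - y64).norm() / y64.norm()    # ‖δY‖₂ / ‖Y‖₂
    max_abs = (y32 - y64).abs().max()           # ‖δY‖∞
    return rel_l2.item(), max_abs.item()
```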

  • Argmax disagreement rate: 0. The top-attended token is identical to the float64 oracle in every scenario; this is the most behaviorally critical metric.
  • Max probability drift ‖δP‖∞ (p95, stress): 3.3×10⁻¹⁶, orders of magnitude below the FP32 unit roundoff u ≈ 6×10⁻⁸.
  • Relative L2 output error ‖δY‖₂/‖Y‖₂ (p95, stress): 4.9×10⁻¹⁵, negligible even under adversarial stress inputs at long sequences.
  • Provable FP32 error bound (Theorem 1): O(u log n), i.e. logarithmic growth versus the O(nu) sequential recurrence; numerically safer at scale.

95th-Percentile Drift vs. float64 Oracle

Regular: n=1024, d=64 · Long: n=8192 · Stress: adversarial logits, n=4096. All values well below single-precision unit roundoff u ≈ 6×10⁻⁸.

| Metric | Regular | Long | Stress | Interpretation |
|---|---|---|---|---|
| ‖δP‖∞ (max abs prob drift) | 3.12×10⁻¹⁷ | 2.34×10⁻¹⁷ | 3.33×10⁻¹⁶ | Worst-case pointwise attention-weight error |
| ‖δP‖₂/‖P‖₂ (rel. L2 prob) | 1.73×10⁻¹⁵ | 3.42×10⁻¹⁵ | 3.50×10⁻¹⁵ | Global distributional shift across all attention weights |
| JS divergence (per-row) | 3.56×10⁻¹⁶ | 3.77×10⁻¹⁶ | 3.07×10⁻¹⁶ | Per-query attention distribution shape preserved |
| Argmax disagreement rate | 0 | 0 | 0 | Top-attended token never changes — exact behavioral equivalence |
| ‖δY‖∞ (max abs output drift) | 4.99×10⁻¹⁶ | 4.99×10⁻¹⁶ | 3.28×10⁻¹⁵ | Worst-case error in output embeddings passed to the next layer |
| ‖δY‖₂/‖Y‖₂ (rel. L2 output) | 2.39×10⁻¹⁵ | 4.72×10⁻¹⁵ | 4.94×10⁻¹⁵ | Normalized aggregate output error across the full tensor |

Takeaways:
  • All metrics sit at float64-oracle level — indistinguishable from exact arithmetic.
  • Blockwise monoid validation: per-block drift is identical to the global result (Figure 7 in the paper).
  • Error grows as O(log n), not O(n) — safer than the sequential recurrence at scale.

ELSA Variants — Pick Your Operating Point

ELSA exposes three precision modes, spanning a continuum from maximum throughput to minimal VRAM. All variants are drop-in replaceable; select based on your hardware constraints.

ELSA-Strict (FP32) · 1.36 M tok/s · 0.336 GB peak
Full IEEE-754 FP32 throughout, with the provable O(u log n) error bound; 3.4× over ME-SDPA. Recommended for medical / scientific workloads where numerical auditability matters.

ELSA-Turbo (TF32) · 1.43 M tok/s · 0.288 GB peak
Default configuration for general use: the same memory footprint as ME-SDPA at 3.58× the throughput (see the table below).

ELSA-Lean (FP32, η=0.25) · 0.13 M tok/s · 0.288 GB peak (matches ME-SDPA exactly)
Streams smaller KV chunks (controlled by η), trading throughput for minimal VRAM. Best for severely memory-constrained deployments (embedded devices, consumer GPUs at very long sequences).

FP32 Variant Comparison at 16K Tokens (A100)

| Method | Throughput (M tok/s) | vs ME-SDPA | Peak Memory (GB) | Best For |
|---|---|---|---|---|
| ME-SDPA (baseline) | 0.40 | 1.00× | 0.289 | — |
| ELSA-Strict (FP32) | 1.36 | 3.40× (+240%) | 0.336 | Auditable FP32, medical / scientific |
| ELSA-Turbo (TF32) ★ | 1.43 | 3.58× (+258%) | 0.288 | General use — default configuration |
| ELSA-Lean (η=0.25) | 0.13 | 0.33× | 0.288 | Minimum VRAM, embedded / consumer GPU |

Takeaways:
  • ELSA-Turbo: same memory as ME-SDPA, 3.58× the throughput.
  • All variants preserve exact softmax semantics with no retraining.
  • Switch variants by setting backend= in ElsaAttention or via an environment variable.
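For illustration, switching variants might look like the following. The backend argument itself is documented above, but the specific string values here are assumptions; check the ELSA repository for the exact accepted names.

```python
from elsa import ElsaAttention

# NOTE: backend names below are hypothetical placeholders, not confirmed API values.
attn_strict = ElsaAttention(dim=768, num_heads=12, backend="elsa_strict_fp32")  # auditable FP32
attn_turbo  = ElsaAttention(dim=768, num_heads=12, backend="elsa_turbo_tf32")   # throughput default
attn_lean   = ElsaAttention(dim=768, num_heads=12, backend="elsa_lean_fp32")    # minimum VRAM
```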

FP16 Regime (vs FlashAttention)

At 16K tokens in FP16: ELSA delivers 2.49M tok/s at 0.19 GB—the lowest memory among all exact kernels. FA2 is ~15% faster (2.88M tok/s) but needs 0.29 GB; FA3 reaches 2.73M tok/s at 0.36 GB. ME-SDPA matches ELSA's 0.19 GB footprint but achieves only 1.47M tok/s. ELSA narrows the speed gap as sequence length grows, while retaining full FP32 capability on the same kernel.

Performance Benchmarks

All A100 benchmarks: single NVIDIA A100 (40 GB), CUDA 12.6, Triton 3.2.0, PyTorch 2.6, averaged over 200 runs unless otherwise noted. We compare only exact, drop-in replacements; approximate linear-attention variants require retraining and are excluded.

FP32 Latency & Peak Memory — A100

Single attention head · synthetic sequences · 1K / 4K / 16K tokens

* Standard SDPA OOM at 16K tokens in FP32/FP16. † FA2/FA3 FP32 path reverts to untuned SIMD and is not reported.
| Method | Precision | Tokens | Latency (ms) | Peak Mem (GB) | Speedup vs ME-SDPA |
|---|---|---|---|---|---|
| Math-SDPA | FP32 | 1K | 0.65 | 0.22 | 0.26× |
| ME-SDPA | FP32 | 1K | 0.25 | 0.11 | 1.00× (baseline) |
| ELSA | FP32 | 1K | 0.19 | 0.10 | 1.32× (+32%) |
| Math-SDPA | FP32 | 4K | 8.28 | 1.83 | 0.35× |
| ME-SDPA | FP32 | 4K | 2.89 | 0.15 | 1.00× (baseline) |
| ELSA | FP32 | 4K | 0.92 | 0.16 | 3.14× (+214%) |
| Math-SDPA | FP32 | 16K | OOM | OOM | — |
| ME-SDPA | FP32 | 16K | 12.06 | 0.29 | 1.00× (baseline) |
| ELSA | FP32 | 16K | 3.46 | 0.34 | 3.49× (+249%) |
| Math-SDPA | FP16 | 1K | 0.55 | 0.22 | 0.22× |
| FA2 | FP16 | 1K | 0.18 | 0.11 | 0.67× |
| FA3 | FP16 | 1K | 0.27 | 0.12 | 0.44× |
| ME-SDPA | FP16 | 1K | 0.12 | 0.11 | 1.00× (baseline) |
| ELSA | FP16 | 1K | 0.12 | 0.10 | 1.00× |
| Math-SDPA | FP16 | 4K | 5.80 | 1.85 | 0.08× |
| FA2 | FP16 | 4K | 0.47 | 0.15 | 0.96× |
| FA3 | FP16 | 4K | 0.55 | 0.16 | 0.82× |
| ME-SDPA | FP16 | 4K | 0.49 | 0.15 | 1.00× (baseline) |
| ELSA | FP16 | 4K | 0.49 | 0.12 | 1.00× |
| Math-SDPA | FP16 | 16K | OOM | OOM | — |
| FA2 | FP16 | 16K | 6.01 | 0.26 | 0.93× |
| FA3 | FP16 | 16K | 6.66 | 0.36 | 0.84× |
| ME-SDPA | FP16 | 16K | 5.71 | 0.29 | 1.00× (baseline) |
| ELSA | FP16 | 16K | 6.66 | 0.19 | 0.86× |

Takeaways:
  • FP32 peak speedup: 3.49× at 16K tokens.
  • ELSA has the lowest memory at 16K FP16: 0.19 GB.
  • Math-SDPA OOMs at 16K tokens (FP32).

CLIP Image-Encoder Inference — Full Pipeline, Strict FP32

ELSA vs. ME-SDPA baseline · no retraining · A100 · attention-scope speedup in parentheses

* FA2/FA3 have no compatible FP32 path for CLIP; ELSA is the only hardware-agnostic exact-attention alternative. † Attention-proxy speedups (attention module only): ViT-B/16 1.46–1.98×, ViT-L/14 1.54–2.15×.
| Model | Variant | Method | Latency (ms) | Latency Reduction | Speedup vs ME-SDPA |
|---|---|---|---|---|---|
| ViT-B/32 | CLIP | ME-SDPA | 2.398 | — | 1.00× |
| ViT-B/32 | CLIP | ELSA-Strict | 2.159 | −10.0% | 1.111× (+11.1%) |
| ViT-B/32 | CLIP | ELSA-Turbo | 2.150 | −10.3% | 1.115× (+11.5%) |
| ViT-B/16 | CLIP | ME-SDPA | 4.568 | — | 1.00× |
| ViT-B/16 | CLIP | ELSA-Strict | 4.365 | −4.4% | 1.046× (+4.6%) |
| ViT-B/16 | CLIP | ELSA-Turbo | 4.329 | −5.2% | 1.055× (+5.5%) |
| ViT-L/14 | CLIP | ME-SDPA | 15.419 | — | 1.00× |
| ViT-L/14 | CLIP | ELSA-Strict | 14.914 | −3.3% | 1.034× (+3.4%) |
| ViT-L/14 | CLIP | ELSA-Turbo | 14.842 | −3.7% | 1.039× (+3.9%) |
| ViT-L/14@336 | CLIP | ME-SDPA | 29.665 | — | 1.00× |
| ViT-L/14@336 | CLIP | ELSA-Strict | 27.989 | −5.7% | 1.060× (+6.0%) |
| ViT-L/14@336 | CLIP | ELSA-Turbo | 27.931 | −5.8% | 1.062× (+6.2%) |

Takeaways:
  • Best full-pipeline speedup: 1.115× (ViT-B/32, ELSA-Turbo).
  • FA2/FA3 have no compatible FP32 path for CLIP.
  • Attention module only: up to 2.15× (ViT-L/14-336 at 560 px).

3D Reconstruction — VGGT & FastVGGT Speedup

ELSA-FP32 vs. xFormers-FP32 · 350×518 input · 1,041 tokens/frame

| Model | Method | Frames | Time (s) | Memory (GB) | Speedup vs xFormers |
|---|---|---|---|---|---|
| VGGT | xFormers-FP32 | 50 | 18.36 | 15.74 | 1.00× (baseline) |
| VGGT | FA2-FP16 | 50 | 3.67 | 17.16 | 5.00× (FP16, +9% mem) |
| VGGT | ELSA-FP32 | 50 | 12.56 | 15.74 | 1.46× (+46%) |
| VGGT | xFormers-FP32 | 100 | 61.75 | 26.67 | 1.00× (baseline) |
| VGGT | FA2-FP16 | 100 | 9.44 | 27.13 | 6.54× (FP16) |
| VGGT | ELSA-FP32 | 100 | 29.49 | 26.68 | 2.09× (+109%) |
| VGGT | xFormers-FP32 | 150 | 130.83 | 37.59 | 1.00× (baseline) |
| VGGT | FA2-FP16 | 150 | 17.88 | 37.43 | 7.31× (FP16) |
| VGGT | ELSA-FP32 | 150 | 55.97 | 37.59 | 2.34× (+134%) |
| FastVGGT | xFormers-FP32 | 100 | 23.94 | 11.50 | 1.00× |
| FastVGGT | ELSA-FP32 | 100 | 20.22 | 11.50 | 1.18× (+18%) |
| FastVGGT | xFormers-FP32 | 200 | 65.81 | 18.56 | 1.00× |
| FastVGGT | ELSA-FP32 | 200 | 50.76 | 18.56 | 1.30× (+30%) |
| FastVGGT | xFormers-FP32 | 300 | 129.97 | 25.63 | 1.00× |
| FastVGGT | ELSA-FP32 | 300 | 97.89 | 25.63 | 1.33× (+33%) |
| FastVGGT | xFormers-FP32 | 400 | 218.30 | 32.70 | 1.00× |
| FastVGGT | ELSA-FP32 | 400 | 158.04 | 32.70 | 1.38× (+38%) |

Takeaways:
  • VGGT peak: 2.34× at 150 frames (FP32, exact).
  • Memory identical to xFormers-FP32 across all configurations.
  • Math-FP32 OOMs beyond 30 frames.

Embedded Device — Jetson TX2 (FP32 & FP16)

ELSA vs. PyTorch Math kernel · batch 1 · d=64 · no Tensor Core dependency

| Tokens | Precision | Math (ms) | ELSA (ms) | Speedup | Latency Reduction |
|---|---|---|---|---|---|
| 64 | FP16 | 1.06 | 0.69 | 1.54× | −34.9% |
| 196 | FP16 | 10.10 | 6.47 | 1.56× | −35.9% |
| 400 | FP16 | 41.23 | 25.95 | 1.59× | −37.0% |
| 576 | FP16 | 84.54 | 52.78 | 1.60× | −37.6% |
| 784 | FP16 | 157.45 | 98.47 | 1.60× | −37.4% |
| 900 | FP16 | 207.55 | 129.74 | 1.60× | −37.5% |
| 64 | FP32 | 1.06 | 0.69 | 1.54× | −34.9% |
| 196 | FP32 | 10.12 | 6.48 | 1.56× | −36.0% |
| 400 | FP32 | 41.34 | 26.04 | 1.59× | −37.0% |
| 576 | FP32 | 84.76 | 52.96 | 1.60× | −37.5% |
| 784 | FP32 | 157.89 | 98.86 | 1.60× | −37.4% |
| 900 | FP32 | 208.91 | 131.14 | 1.59× | −37.2% |

Takeaways:
  • Consistent ~1.6× speedup across all token lengths.
  • FP16 ≈ FP32 latency on TX2 (SIMD, no Tensor Cores).
  • FA2/FA3 are unavailable on Jetson TX2 (compute capability 6.2).

NLP & Downstream Tasks

BERT fixed-shape · variable-length packing · DPT segmentation · Swin COCO

| Task / Dataset | Method | Precision | Latency (ms) | Throughput | Speedup |
|---|---|---|---|---|---|
| BERT SST-2 | ME-SDPA | FP32 | 1.308 | — | 1.00× |
| BERT SST-2 | ELSA | FP32 | 0.664 | — | 1.97× (+97%) |
| BERT IMDB | ME-SDPA | FP32 | 17.693 | — | 1.00× |
| BERT IMDB | ELSA | FP32 | 7.794 | — | 2.27× (+127%) |
| Padding-free VarLen | SDPA-FP32-mem | FP32 | — | 0.96 M tok/s | 1.00× |
| Padding-free VarLen | ELSA-Strict | FP32 | — | 2.00 M tok/s | 2.08× (+109%) |
| Padding-free VarLen | ELSA-Turbo | TF32 | — | 2.64 M tok/s | 2.75× (+175%) |
| Padding-free VarLen | ELSA-FP16 | FP16 | — | 5.03 M tok/s | 5.24× (+424%) |
| DPT Segmentation (COCO) | ME-SDPA | FP32 | 295.7 | — | 1.00× |
| DPT Segmentation (COCO) | ELSA | FP32 | 252.7 | — | 1.17× (+17%) |
| BERT Sentiment (FP16) | FA2 | FP16 | 224.1 | — | — |
| BERT Sentiment (FP16) | ELSA | FP16 | 96.8 | — | 2.31× (+131%) |
| Swin COCO Det. (FP16) | FA2 | FP16 | 46.0 | — | — |
| Swin COCO Det. (FP16) | ELSA | FP16 | 40.1 | — | 1.15× (+15%) |

Takeaways:
  • BERT IMDB: 2.27× (longer sequences benefit more).
  • FP16 variable-length packing: 5.24× over SDPA-mem.
  • FP16 BERT vs FA2: 2.31× (true sequence lengths vs FA2's padded 512-token batches).

Hyperspectral Classification — HSIMAE-B/L (FP32)

Throughput (imgs/s) and peak memory on Pavia, Salinas, WHU datasets

| Model | Dataset | Method | Throughput (imgs/s) | Peak Mem (GB) | Gain vs ME-SDPA |
|---|---|---|---|---|---|
| HSIMAE-B | Pavia | ME-SDPA | 10489 ± 14 | 0.11 | — |
| HSIMAE-B | Pavia | ELSA | 14401 ± 53 | 0.11 | +37.1% |
| HSIMAE-L | Pavia | ME-SDPA | 6471 ± 97 | 0.23 | — |
| HSIMAE-L | Pavia | ELSA | 10479 ± 91 | 0.25 | +62.0% |
| HSIMAE-B | Salinas | ME-SDPA | 10427 ± 18 | 0.10 | — |
| HSIMAE-B | Salinas | ELSA | 14424 ± 106 | 0.10 | +38.3% |
| HSIMAE-L | Salinas | ME-SDPA | 6490 ± 21 | 0.23 | — |
| HSIMAE-L | Salinas | ELSA | 10396 ± 27 | 0.25 | +60.2% |
| HSIMAE-B | WHU | ME-SDPA | 10631 ± 14 | 0.11 | — |
| HSIMAE-B | WHU | ELSA | 14929 ± 16 | 0.11 | +40.4% |
| HSIMAE-L | WHU | ME-SDPA | 6588 ± 5 | 0.23 | — |
| HSIMAE-L | WHU | ELSA | 10657 ± 7 | 0.24 | +61.7% |

Takeaways:
  • HSIMAE-L peak: +62.0% throughput (Pavia).
  • Negligible memory overhead across all configurations.

Visual Overview

  • FP32 latency (ms) on A100 — single attention head, 1K / 4K / 16K tokens.
  • CLIP attention-proxy speedup vs. resolution — ELSA vs. Math-SDPA, four ViT variants (attention module only).
  • 3D reconstruction speedup (VGGT & FastVGGT) — ELSA-FP32 vs. xFormers-FP32.
  • Embedded device (Jetson TX2) — ELSA vs. Math kernel, FP32, batch 1.

BibTeX

@inproceedings{hsu2026elsa,
  title     = {ELSA: Exact Linear-Scan Attention for Fast and Memory-Light Vision Transformers},
  author    = {Hsu, Chih-Chung and Ma, Xin-Di and Liao, Wo-Ting and Lee, Chia-Ming},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings},
  year      = {2026}
}