PhaSR: Generalized Image Shadow Removal with Physically Aligned Priors

1National Yang Ming Chiao Tung University, 2National Cheng Kung University

CVPR 2026

PhaSR Motivation

Motivation: Existing methods lose physical priors through encoder-decoder bottlenecks, failing to localize shadows accurately (see feature visualizations above). PhaSR addresses this via dual-level physically aligned priors, generalizing from single-light to multi-source ambient lighting scenarios.

Abstract

Shadow removal under diverse lighting requires disentangling illumination from intrinsic reflectance—a challenge when physical priors are misaligned.

We propose PhaSR with dual-level prior alignment: (1) Physically Aligned Normalization (PAN) performs parameter-free illumination correction via Gray-world normalization and log-domain Retinex decomposition, suppressing chromatic bias. (2) Geometric-Semantic Rectification Attention (GSRA) extends differential attention to harmonize depth-derived geometry with DINO-v2 semantics, resolving modal conflicts across illumination conditions.

Experiments demonstrate competitive performance with lower complexity, generalizing to ambient lighting where traditional methods fail.

Network Architecture

A multi-scale Transformer encoder-decoder integrates frozen DINO-v2 semantic features and DepthAnything-v2 geometric priors via GSRA's cross-modal differential attention ($\mathbf{A}_\text{rect} = \mathbf{A}_\text{sem} - \lambda \cdot \mathbf{A}_\text{geo}$).

⚡ Geometric-Semantic Rectification Attention (GSRA)

Real-world scenes carry two physically distinct signals that respond to illumination very differently. Geometric priors (depth, surface normals from DepthAnything-v2) are sharp at shadow boundaries but noisy in uniformly lit regions. Semantic embeddings (DINO-v2) stay stable across lighting changes—a red apple is always semantically a red apple—but are spatially coarse. GSRA harmonizes these two modalities through cross-modal differential attention:

$$\mathbf{A}_\text{rect} = \mathbf{A}_\text{sem} - \lambda \cdot \mathbf{A}_\text{geo}$$

The subtraction suppresses geometric noise in uniformly lit regions while preserving geometric precision at true illumination boundaries—producing features that balance local edge sharpness with global material consistency.
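The rectification above can be sketched in a few lines of NumPy. The per-modality query/key projections, token count, and the fixed value of $\lambda$ below are illustrative stand-ins for the learned components (in GSRA, $\lambda$ is learnable):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gsra_rectified_attention(q_sem, k_sem, q_geo, k_geo, lam=0.3):
    """Cross-modal differential attention: A_rect = A_sem - lam * A_geo.

    q_*, k_*: (tokens, dim) query/key projections for each modality
    (stand-ins for learned projections of DINO-v2 / depth features).
    lam: rectification weight; a learnable scalar in GSRA, fixed here.
    """
    d = q_sem.shape[-1]
    a_sem = softmax(q_sem @ k_sem.T / np.sqrt(d))  # semantic attention map
    a_geo = softmax(q_geo @ k_geo.T / np.sqrt(d))  # geometric attention map
    return a_sem - lam * a_geo                     # rectified attention

rng = np.random.default_rng(0)
n, d = 16, 8
proj = lambda: rng.normal(size=(n, d))  # toy stand-in for a projection
A = gsra_rectified_attention(proj(), proj(), proj(), proj(), lam=0.3)
print(A.shape)  # (16, 16); each row sums to 1 - lam = 0.7
```

Because both attention maps are row-normalized, each rectified row sums to $1 - \lambda$, making the subtraction act as a bounded suppression of the geometric map rather than an unconstrained mix.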

OmniSR (geometric-semantic fusion): Concatenates or additively merges geometric and semantic features without explicit alignment. Under ambient or multi-source lighting, geometric noise from interreflections bleeds into semantic attention, causing boundary blurring and color artifacts near shadow edges.

PhaSR (GSRA, cross-modal differential rectification): Explicitly subtracts geometric attention from semantic attention with a learnable $\lambda$. This rectification gate lets the network adaptively weight geometric precision (high $\lambda$ at boundaries) against semantic stability (low $\lambda$ in smooth regions), achieving sharper boundary localization and cleaner reflectance recovery under complex indirect illumination.

Advantage 1 (ambient lighting generalization): OmniSR's fusion struggles with multi-source indirect illumination (23.01 dB on Ambient6K); GSRA's modal decoupling reaches 23.32 dB on the same benchmark.

Advantage 2 (shadow boundary precision): Geometric attention is applied as a rectifier, not a fuser, preventing semantic over-smoothing at shadow edges, as visible in intermediate feature maps (Figs. 2 and 17).

Advantage 3 (lower complexity): Despite richer prior integration, PhaSR requires 55.6 G FLOPs versus OmniSR's 78.3 G (roughly 29% fewer), thanks to its lightweight asymmetric decoder design.

🎬 Physically Aligned Normalization (PAN) — Distribution Visualization

PAN is a parameter-free preprocessing module that corrects illumination before any learned feature extraction. The animation below illustrates how PAN decomposes the input pixel intensity distribution, identifies shadow pixels as outliers on the distribution tail, and recombines the components so that those outliers are pulled back into the normal range—effectively suppressing shadows without a single trainable weight.

Step 1
Gray-world normalization
$$\mathbf{I}_\text{norm} = \mathbf{I} \cdot \frac{\mathbb{E}[\mathbf{I}]}{\mathbb{E}_c[\mathbf{I}] + \varepsilon}$$
Balances per-channel means to remove color cast from the illuminant (warm indoor / cool daylight), centering the distribution.
Step 2
Log-domain Retinex decomposition
$$\log \hat{S} = \mathbb{E}_{H,W}[\log(\mathbf{I}_\text{norm} + \varepsilon)]$$
$$\log \hat{R} = \log(\mathbf{I}_\text{norm} + \varepsilon) - \log \hat{S}$$
Separates global illumination $\hat{S}$ from surface reflectance $\hat{R}$ via additive log-space decomposition. Shadow pixels become extreme outliers in the $\hat{R}$ distribution tail.
Step 3
Recombination & range normalization
$$\hat{I} = \frac{\hat{R} \otimes \hat{S} - \min(\hat{R} \otimes \hat{S})}{\max(\hat{R} \otimes \hat{S}) - \min(\hat{R} \otimes \hat{S}) + \varepsilon}$$
Recombines components and applies min-max normalization. Outlier pixels on the distribution tail are pulled back into the valid radiometric range, suppressing shadow artifacts.
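Assuming the expectations above denote channel-wise means (Step 1) and a spatial mean over $H \times W$ (Step 2), the three steps can be sketched in NumPy; the value of $\varepsilon$ and the toy input are illustrative:

```python
import numpy as np

def pan(img, eps=1e-6):
    """Parameter-free PAN-style preprocessing (sketch of the three steps).

    img: float array of shape (H, W, 3), values in [0, 1].
    """
    # Step 1: Gray-world normalization -- scale each channel so its mean
    # matches the global mean, removing the illuminant's color cast.
    global_mean = img.mean()
    chan_mean = img.mean(axis=(0, 1), keepdims=True)  # E_c[I] per channel
    i_norm = img * global_mean / (chan_mean + eps)

    # Step 2: log-domain Retinex decomposition -- the spatial mean of the
    # log image estimates global illumination S; the residual is the
    # reflectance R, where shadow pixels land on the distribution tail.
    log_i = np.log(i_norm + eps)
    log_s = log_i.mean(axis=(0, 1), keepdims=True)    # E_{H,W}[log I_norm]
    log_r = log_i - log_s

    # Step 3: recombine in linear space and min-max normalize to [0, 1],
    # pulling tail outliers back into the valid radiometric range.
    recon = np.exp(log_r + log_s)                     # R (x) S
    return (recon - recon.min()) / (recon.max() - recon.min() + eps)

rng = np.random.default_rng(0)
out = pan(rng.uniform(0.05, 1.0, size=(32, 32, 3)))
print(out.min(), out.max())  # output lies in [0, 1]
```

Every operation is a closed-form statistic of the input, which is what makes the module parameter-free: it can sit in front of any learned backbone without adding trainable weights.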

Each panel shows the pixel intensity histogram at a different stage of PAN. Red bars represent shadow-region pixels clustered at the low-brightness end; blue bars represent the normal (non-shadow) distribution. After PAN, the red distribution shifts toward brighter values—shadows become less dark—and the two distributions converge into a more uniform output. The darker the residue map (GT − Î), the better the correction.

Qualitative Results

Comparison Results on WSRD+ (Realworld-Indoor)

Sample 1: input vs. result comparisons for PhaSR (Ours), DenseSR, ShadowRefiner, and StableShadowDiffusion.

Sample 2: input vs. result comparisons for PhaSR (Ours), DenseSR, ShadowRefiner, and StableShadowDiffusion.

Comparison Results on INS Dataset (Synthesized-Indoor)

Sample 3 (full image and cropped region): input vs. result comparisons for PhaSR (Ours), DenseSR, OmniSR, and StableShadowDiffusion.

Sample 4 (full image and cropped region): input vs. result comparisons for PhaSR (Ours), DenseSR, OmniSR, and StableShadowDiffusion.

BibTeX

@inproceedings{lee2026phasr,
  title     = {PhaSR: Generalized Image Shadow Removal with Physically Aligned Priors},
  author    = {Lee, Chia-Ming and Lin, Yu-Fan and Hsiao, Yu-Jou and Jiang, Jin-Hui and Liu, Yu-Lun and Hsu, Chih-Chung},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026},
}