Local Scale Equivariance with Latent Deep Equilibrium Canonicalizer
Md Ashiqur Rahman1, Chiao-An Yang1, Michael N. Cheng1, Lim Jun Hao2, Jeremiah Jiang2, Teck-Yian Lim2, Raymond A. Yeh1
1 Purdue University 2 DSO National Laboratories
Global vs. Local Scaling
🌍 Global Scaling
Uniformly resizes the entire image. Widely studied but rarely encountered in isolation.
🎯 Local Scaling
Different objects change size independently. Here, only the Haworthia plant changes size.
Key Contributions
Local Scale Equivariance
We model realistic, spatially-varying scale changes rather than uniform global scaling.
Monotone Scaling Group
We approximate local scaling using invertible monotonic functions that form a transformation group.
Deep Equilibrium Canonicalizer (DEC)
We introduce a novel canonicalizer using deep equilibrium models instead of explicit optimization.
Plug-and-Play Boost
Our method improves scale robustness and accuracy across models with minimal overhead.
Monotone Scaling: A Clean Approximation
1D Intuition: Stretching a Rubber Band
To build an intuition for local scale equivariance, consider a one-dimensional function \( f(x) \) defined on the interval \([0,1]\). Traditionally, global scaling uniformly resizes the entire domain, expressed as:
\( R_a[f](x) = f(a^{-1} x) \)
where \( a \) is a constant scaling factor. This operation applies a uniform stretch or compression across the entire function.
In contrast, to model non-uniform or monotone scale changes, we introduce a strictly increasing and invertible warp function \( l: [0,1] \to [0,1] \). This allows us to define a locally scaled function as:
\( S(f; l)(x) = f(l^{-1}(x)) \)
Conceptually, this is akin to stretching a rubber band: instead of pulling both ends evenly (global scaling), certain segments are stretched or compressed more than others, but the order of points is always preserved. The local rate of stretch or compression at any point \( x \) is given by the derivative \( \frac{dl}{dx} \). This formulation ensures that the transformation is monotonic and invertible, properties that are crucial for maintaining structural integrity.
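The 1D formulation above can be sketched numerically. The following is a minimal NumPy example (not from the paper): it contrasts global scaling \( R_a[f](x) = f(a^{-1}x) \) with a hand-picked monotone warp \( l(x) = (x + x^2)/2 \), which is strictly increasing, fixes the endpoints of \([0,1]\), and has the closed-form inverse \( l^{-1}(y) = (\sqrt{1+8y}-1)/2 \).

```python
import numpy as np

# Sample a 1D signal f on a uniform grid over [0, 1].
x = np.linspace(0.0, 1.0, 101)
f = np.sin(2.0 * np.pi * x)

# Global scaling R_a[f](x) = f(x / a): one constant factor everywhere.
a = 2.0
globally_scaled = np.interp(x / a, x, f)

# Monotone local warp l(x) = (x + x^2) / 2: strictly increasing,
# l(0) = 0, l(1) = 1, with inverse l^{-1}(y) = (sqrt(1 + 8y) - 1) / 2.
l = 0.5 * (x + x ** 2)
l_inv = 0.5 * (-1.0 + np.sqrt(1.0 + 8.0 * x))

# S(f; l)(x) = f(l^{-1}(x)): some segments stretch more than others,
# but the order of sample points is preserved.
locally_scaled = np.interp(l_inv, x, f)

assert np.all(np.diff(l) > 0)  # monotone: no folding of the domain
```

The choice of \( l \) here is only illustrative; any strictly increasing, invertible map on \([0,1]\) plays the same role.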
2D Extension: Warping an Image Grid
Extending this concept to two-dimensional images \( I(x, y): [0,1]^2 \to \mathbb{R} \), we apply a smooth, invertible spatial warp \( l(x, y) = (l_X(x, y), l_Y(x, y)) \). This warp maps the original image coordinates to new coordinates, defining the locally scaled image as:
\( S(I; l)(x, y) = I(l^{-1}(x, y)) \)
Here, \( l_X \) and \( l_Y \) are strictly increasing functions that independently control the local horizontal and vertical scaling, respectively. The key characteristic of this transformation is that it deforms the image grid smoothly without tearing or folding, thereby preserving the topological order of pixels. This contrasts with rigid global rescaling by allowing different regions of the image to undergo independent scaling, which more accurately reflects realistic viewpoint and depth-based distortions.
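As a concrete illustration of the 2D case, the sketch below (assumptions: a separable warp with \( l_X = l_Y = l \) reused from the 1D example, and nearest-neighbor resampling in place of proper interpolation) applies \( S(I; l)(x, y) = I(l^{-1}(x, y)) \) to a toy image on \([0,1]^2\).

```python
import numpy as np

# A toy "image" I on a [0, 1]^2 grid.
n = 64
xs = np.linspace(0.0, 1.0, n)
X, Y = np.meshgrid(xs, xs, indexing="xy")  # rows index y, columns index x
I = np.sin(4.0 * np.pi * X) * np.cos(4.0 * np.pi * Y)

def l_inv(t):
    # Inverse of the separable monotone warp l(t) = (t + t^2) / 2.
    return 0.5 * (-1.0 + np.sqrt(1.0 + 8.0 * t))

# S(I; l)(x, y) = I(l^{-1}(x), l^{-1}(y)), realized by nearest-neighbor
# lookup: map each output coordinate back to a source pixel index.
u = np.clip(np.round(l_inv(X) * (n - 1)).astype(int), 0, n - 1)
v = np.clip(np.round(l_inv(Y) * (n - 1)).astype(int), 0, n - 1)
warped = I[v, u]
```

Because \( l^{-1} \) is monotone in each axis, the index maps `u` and `v` never reorder pixels: the grid deforms smoothly without tearing or folding, exactly as described above.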
Theorem: The set of such 2D monotone maps, characterized by commuting, symmetric positive-definite (SPD) Jacobians, forms a group under composition, satisfying closure (composing two such transformations yields another valid transformation), invertibility (every transformation has an inverse), and possessing an identity element (representing no change). These properties are essential for developing robust equivariant models.
The Jacobian matrix \( J_l(x, y) = \nabla l(x,y) \) quantifies the local scaling and deformation at each point \((x, y)\). For the transformations to form a group, the Jacobians must be symmetric positive-definite (ensuring local invertibility and non-folding) and commute (ensuring that the order of applying transformations does not affect the final outcome). This mathematical rigor underpins the ability to stack multiple such scalings while maintaining validity and invertibility.
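The group properties can be checked numerically on a simple case. In the sketch below (an illustration, not the paper's proof), two separable monotone warps are composed: their Jacobians are diagonal, hence SPD and mutually commuting, and the composition is again strictly increasing with fixed endpoints, confirming closure.

```python
import numpy as np

# Two separable monotone 1D warps on [0, 1] with positive derivatives:
# l1'(t) = (1 + 2t) / 2 > 0 and l2'(t) = 0.6 t^2 + 0.8 > 0.
l1 = lambda t: 0.5 * (t + t ** 2)
l2 = lambda t: 0.2 * t ** 3 + 0.8 * t
dl1 = lambda t: 0.5 * (1.0 + 2.0 * t)
dl2 = lambda t: 0.6 * t ** 2 + 0.8

t = np.linspace(0.0, 1.0, 201)

# Closure: l2 ∘ l1 is again strictly increasing and maps [0,1] onto [0,1].
comp = l2(l1(t))

# Commuting SPD Jacobians: for separable warps the Jacobian at a point
# (x0, y0) is diagonal, e.g. diag(l'(x0), l'(y0)), so products commute.
x0, y0 = 0.3, 0.7
J1 = np.diag([dl1(x0), dl1(y0)])
J2 = np.diag([dl2(x0), dl2(y0)])
```

Diagonal Jacobians are the simplest commuting SPD family; the theorem above covers the general commuting case.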
Analogy: Visualize an image as an elastic fabric. Monotone scaling is analogous to gently stretching or compressing this fabric in various directions without introducing tears or overlaps. Regardless of how many such smooth, untangling stretches are applied, the fabric's inherent structure remains preserved.
Deep Equilibrium Canonicalizer (DEC)
To achieve local scale equivariance, our method aims to transform input features into a canonical representation that is invariant to spatial warps. Instead of relying on computationally expensive iterative optimization for each input, we leverage a Deep Equilibrium Model (DEQ) to efficiently find this canonicalizing warp. The DEQ operates by finding a fixed point \( \Phi_k \) of a learned update function:
\( \Phi_k = g_\theta(F_k, \Phi_k) \)
Here, \( F_k \) represents the input feature map, and \( \Phi_k \) is the predicted warp field. This fixed point solution effectively approximates the result of an energy minimization problem: \[ \Phi_k \approx \arg\min_\Phi \mathcal{E}(F_k; \Phi) \] where \( \mathcal{E}(F_k; \Phi) \) is an energy function that quantifies how "non-canonical" the feature map \( F_k \) is under a given warp \( \Phi \). The optimal \( \Phi_k \) thus represents the warp required to bring \( F_k \) into its canonical, scale-normalized frame.
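The fixed-point solve itself is simple to sketch. Below is a toy NumPy stand-in (not the paper's learned \( g_\theta \)): a contractive update `g` is iterated until \( \Phi = g(F, \Phi) \), which is the same forward pass a DEQ performs, typically with a faster root solver such as Anderson acceleration or Broyden's method.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
F = rng.normal(size=d)  # stand-in for the feature map F_k

# Small random weights keep the update map contractive, so plain
# fixed-point iteration converges.
W = 0.2 * rng.normal(size=(d, d)) / np.sqrt(d)

def g(F, Phi):
    # Toy update in the role of g_theta(F_k, Phi_k).
    return np.tanh(W @ Phi + F)

# Solve Phi = g(F, Phi) by fixed-point iteration.
Phi = np.zeros(d)
for _ in range(100):
    Phi_next = g(F, Phi)
    if np.linalg.norm(Phi_next - Phi) < 1e-8:
        break
    Phi = Phi_next
```

In the actual model, the converged \( \Phi_k \) is a warp field rather than an abstract vector, and gradients flow through the fixed point via implicit differentiation rather than through the unrolled iterations.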
Once the canonicalizing warp \( \Phi_k \) is determined, the network processes the canonicalized feature map \( \hat{F}_k = S^{-1}(F_k; \Phi_k) \). After this processing, \( \Phi_k \) is re-applied to the output features, ensuring that the subsequent layers receive representations that maintain equivariance to local scale changes. This "plug-and-play" approach allows our DEC to be integrated into existing deep learning architectures with minimal overhead, enhancing their robustness to local scale variations.
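The canonicalize–process–re-apply sandwich can be illustrated on a discrete 1D toy, where a warp is just a permutation of sample positions (a simplification: the paper's warps are continuous; `dec_layer`, `apply_warp`, and the index arrays are hypothetical names for this sketch).

```python
import numpy as np

def apply_warp(f, idx):
    # Resample f at precomputed warp indices (discrete stand-in for S).
    return f[idx]

def dec_layer(F, layer, phi_idx, phi_inv_idx):
    # Canonicalize: F_hat = S^{-1}(F; Phi), i.e. undo the warp.
    F_hat = apply_warp(F, phi_inv_idx)
    # Process in the canonical, scale-normalized frame.
    out = layer(F_hat)
    # Re-apply Phi so downstream layers see consistently warped features.
    return apply_warp(out, phi_idx)

# Usage: for a pointwise layer, the sandwich reproduces the layer's
# output exactly, since pointwise ops commute with any warp.
rng = np.random.default_rng(1)
F = rng.normal(size=8)
phi_idx = rng.permutation(8)
phi_inv_idx = np.argsort(phi_idx)  # inverse of the permutation warp
relu = lambda z: np.maximum(z, 0.0)
out = dec_layer(F, relu, phi_idx, phi_inv_idx)
```

The pointwise case is only a sanity check; the benefit of the sandwich is that it makes non-pointwise layers (convolutions, attention) behave equivariantly under the warps predicted by the DEC.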
Qualitative Results
For full quantitative results, see our ICCV 2025 paper.