Danskin’s Theorem

Value functions, envelope formulas, directional derivatives, and nonsmooth gradients.

Let $x \in X$ be the decision variable and $y \in Y$ be a parameter. Define the value function $g(y) := \min_{x \in X} f(x, y)$. When the minimizer is not unique, $g$ may fail to be differentiable. The right object is a directional derivative (and, if desired, a generalized subdifferential).

Reference. The material is standard; see Theorem 4.13 in the book Perturbation Analysis of Optimization Problems (Bonnans–Shapiro) for closely related statements and proofs.

Danskin’s theorem is a workhorse because it turns a seemingly “implicit” differentiation problem into a “direct” one. When a value function is defined by an inner optimization,

$$g(y) = \min_{x \in X} f(x, y),$$

it is tempting to think that differentiating $g$ requires differentiating the optimizer map $y \mapsto x^\star(y)$. In practice that map can be discontinuous, non-unique, or simply expensive to differentiate (it may require sensitivity equations or differentiating through an iterative solver). Danskin’s theorem says you typically don’t need any of that: whenever $g$ is differentiable at $y$, its gradient is obtained by holding the optimizer fixed,

$$\nabla g(y) = \nabla_y f(x^\star(y), y),$$

so the dependence on $x^\star(y)$ “drops out.” This is exactly what makes bilevel objectives and value functions tractable in modern optimization pipelines: compute an optimal (or near-optimal) $x^\star(y)$ by solving the inner problem, then compute $\nabla_y f(x^\star(y), y)$ by standard autodiff, without backpropagating through the solver or ever forming $\mathrm{d}x^\star/\mathrm{d}y$. In large-scale settings this is the difference between a cheap, stable gradient oracle and an impractical one, and it underpins widely used methods such as envelope-based gradient descent, alternating minimization with outer updates, and many dual/regularized formulations where the “min over $x$” defines a smooth (or directionally smooth) outer objective.
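As a sketch of this recipe in code (the quadratic $f$, the SciPy-based inner solver, and all names below are illustrative choices, not anything from the reference): solve the inner problem numerically, then evaluate $\nabla_y f$ with the minimizer held fixed, and compare against finite differences of $g$.

```python
# Envelope/Danskin gradient on a toy problem (illustrative data):
#   g(y) = min_x f(x, y),  f(x, y) = ||x - y||^2 + 0.1 ||x||^2.
# Danskin: grad g(y) = grad_y f(x*(y), y) = 2 (y - x*(y)), x*(y) held fixed.
import numpy as np
from scipy.optimize import minimize

LAM = 0.1

def f(x, y):
    return np.sum((x - y) ** 2) + LAM * np.sum(x ** 2)

def inner_solve(y):
    # Solve the inner problem numerically; we never differentiate through this.
    res = minimize(lambda x: f(x, y), x0=np.zeros_like(y))
    return res.x

def g(y):
    return f(inner_solve(y), y)

def danskin_grad(y):
    x_star = inner_solve(y)      # treat the minimizer as a constant
    return 2.0 * (y - x_star)    # = grad_y f(x, y) evaluated at x = x_star

# Check against central finite differences of g.
y = np.array([0.5, -1.2, 2.0])
eps = 1e-5
fd = np.array([(g(y + eps * e) - g(y - eps * e)) / (2 * eps)
               for e in np.eye(len(y))])
assert np.allclose(danskin_grad(y), fd, atol=1e-3)
```

For this toy problem the closed form is $x^\star(y) = y/1.1$ and $\nabla g(y) = 2y/11$, so the Danskin gradient and the finite-difference gradient agree; the point of the sketch is that only the forward inner solve and one partial gradient are ever needed.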


Assumptions

Let $X \subset \mathbb{R}^n$ be nonempty and compact, and let $Y \subseteq \mathbb{R}^m$ be open. Assume:

1. $f : X \times Y \to \mathbb{R}$ is continuous;
2. the partial gradient $\nabla_y f(x, y)$ exists for every $(x, y)$ and is continuous on $X \times Y$.

Under these assumptions the solution set $S(y) := \operatorname{argmin}_{x \in X} f(x, y)$ is nonempty and compact for every $y \in Y$.


Theorem (directional derivative and generalized gradient)

Directional derivative.

Under the assumptions above, $g$ is directionally differentiable on $Y$, and for every $y \in Y$ and $d \in \mathbb{R}^m$,

$$g'(y; d) = \min_{x \in S(y)} \nabla_y f(x, y)^\top d.$$
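A quick numerical sanity check of this formula, on an illustrative problem of my own choosing (not from the reference): $f(x, y) = (x^2 - 1)^2 + x y$ on $X = [-2, 2]$. At $y = 0$ the minimizers are $S(0) = \{-1, +1\}$ and $\nabla_y f(x, y) = x$, so the theorem predicts $g'(0; d) = \min(d, -d) = -|d|$.

```python
# Check g'(0; d) = min_{x in S(0)} x * d = -|d| for f(x,y) = (x^2-1)^2 + x*y.
import numpy as np

X_GRID = np.linspace(-2.0, 2.0, 400001)  # fine grid standing in for X

def g(y):
    # value function: minimize f(., y) over the grid
    return np.min((X_GRID ** 2 - 1.0) ** 2 + X_GRID * y)

def dir_deriv(y, d, eps=1e-6):
    # one-sided difference quotient approximating g'(y; d)
    return (g(y + eps * d) - g(y)) / eps

for d in (1.0, -1.0, 3.0):
    predicted = min(d, -d)  # = -|d|, the min over S(0) of x*d
    assert abs(dir_deriv(0.0, d) - predicted) < 1e-2
```

Note that $g$ itself behaves like $-|y|$ near $0$, so it is not differentiable there, yet every one-sided directional derivative exists and matches the min formula.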

Differentiability criterion.

If either $S(y)$ is a singleton, or $\nabla_y f(x, y)$ is the same for all $x \in S(y)$, then $g$ is differentiable at $y$ and

$$\nabla g(y) = \nabla_y f(x, y) \quad \text{for any } x \in S(y).$$

Clarke subdifferential (optional).

If $g$ is locally Lipschitz (e.g., $\nabla_y f$ is bounded on a neighborhood of $\{y\} \times S(y)$), then the Clarke subdifferential satisfies

$$\partial_C g(y) = \operatorname{conv}\{\nabla_y f(x, y) : x \in S(y)\}.$$

Proof (directional derivative)

Fix $y \in Y$ and a direction $d \in \mathbb{R}^m$. For $\epsilon > 0$ small, set $y_\epsilon = y + \epsilon d \in Y$.

Upper bound. For any minimizer $x \in S(y)$, we have $g(y) = f(x, y)$ and $g(y_\epsilon) \le f(x, y_\epsilon)$. Hence

$$\frac{g(y_\epsilon) - g(y)}{\epsilon} \le \frac{f(x, y_\epsilon) - f(x, y)}{\epsilon}.$$

Letting $\epsilon \downarrow 0$ and using differentiability of $f(x, \cdot)$,

$$\limsup_{\epsilon \downarrow 0} \frac{g(y_\epsilon) - g(y)}{\epsilon} \le \nabla_y f(x, y)^\top d.$$

Since this holds for every $x \in S(y)$,

$$\limsup_{\epsilon \downarrow 0} \frac{g(y_\epsilon) - g(y)}{\epsilon} \le \min_{x \in S(y)} \nabla_y f(x, y)^\top d.$$

Lower bound. Let $x_\epsilon \in S(y_\epsilon)$ be any minimizer at $y_\epsilon$. Since $g(y) \le f(x_\epsilon, y)$, we have

$$g(y_\epsilon) - g(y) = f(x_\epsilon, y_\epsilon) - g(y) \ge f(x_\epsilon, y_\epsilon) - f(x_\epsilon, y).$$

Divide by $\epsilon$. By the mean value theorem applied to $\theta \mapsto f(x_\epsilon, y + \theta d)$, there exists $\theta_\epsilon \in (0, \epsilon)$ such that

$$\frac{f(x_\epsilon, y_\epsilon) - f(x_\epsilon, y)}{\epsilon} = \nabla_y f(x_\epsilon, y + \theta_\epsilon d)^\top d.$$

Since $X$ is compact, the family $(x_\epsilon)$ has limit points as $\epsilon \downarrow 0$. Take a sequence $\epsilon_k \downarrow 0$ such that $x_{\epsilon_k} \to \bar{x}$. By standard stability of the argmin under continuity and compactness, $\bar{x} \in S(y)$. Using continuity of $\nabla_y f$,

$$\liminf_{k \to \infty} \frac{g(y_{\epsilon_k}) - g(y)}{\epsilon_k} \ge \nabla_y f(\bar{x}, y)^\top d \ge \min_{x \in S(y)} \nabla_y f(x, y)^\top d.$$

Since the right-hand side does not depend on the subsequence, we obtain

$$\liminf_{\epsilon \downarrow 0} \frac{g(y_\epsilon) - g(y)}{\epsilon} \ge \min_{x \in S(y)} \nabla_y f(x, y)^\top d.$$

Conclusion. The limsup and liminf bounds coincide, so the directional derivative exists and equals $\min_{x \in S(y)} \nabla_y f(x, y)^\top d$.


Remark (Clarke subdifferential)

The Clarke formula can be derived quickly by rewriting $g(y) = -h(y)$ with $h(y) = \max_{x \in X} \bigl(-f(x, y)\bigr)$ and applying the standard Clarke/Danskin rule for maxima, which yields $\partial_C h(y) = \operatorname{conv}\{-\nabla_y f(x, y) : x \in S(y)\}$, hence $\partial_C g(y) = -\partial_C h(y) = \operatorname{conv}\{\nabla_y f(x, y) : x \in S(y)\}$.
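For a concrete instance of the formula (an illustrative example of my own, not taken from the reference), take the linear-in-$x$ problem below; the set of minimizer gradients sweeps out the whole Clarke subdifferential:

```latex
f(x, y) = x\,y, \quad X = [-1, 1]
\;\Longrightarrow\;
g(y) = \min_{x \in [-1, 1]} x\,y = -\lvert y \rvert .
% At y = 0 every x in X is optimal:
S(0) = [-1, 1], \qquad \nabla_y f(x, 0) = x,
% so the formula gives the full interval
\partial_C g(0) = \operatorname{conv}\{\, x : x \in [-1, 1] \,\} = [-1, 1],
% which is exactly the Clarke subdifferential of -\lvert \cdot \rvert at 0.
```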