Danskin’s Theorem

Value functions, envelope formulas, directional derivatives, and nonsmooth gradients.

Let \(x\in\mathcal X\) be the decision variable and \(y\in\mathcal Y\) be a parameter. Define the value function \[ g(y) := \min_{x\in\mathcal X} f(x, y). \] When the minimizer is not unique, \(g\) may fail to be differentiable. The right object is a directional derivative (and, if desired, a generalized subdifferential).

Reference. This material is standard; see Bonnans and Shapiro, Perturbation Analysis of Optimization Problems, Theorem 4.13, for closely related statements and proofs.

Danskin’s theorem is a workhorse because it turns a seemingly “implicit” differentiation problem into a “direct” one. When a value function is defined by an inner optimization,

\[ g(y)=\min_{x\in\mathcal X} f(x, y), \]

it is tempting to think that differentiating \(g\) requires differentiating the optimizer map \(y\mapsto x^\star(y)\). In practice that map can be discontinuous, non-unique, or simply expensive to differentiate (it may require sensitivity equations or differentiating through an iterative solver). Danskin’s theorem says you typically don’t need any of that: whenever \(g\) is differentiable at \(y\), its gradient is obtained by holding the optimizer fixed,

\[ \nabla g(y)=\nabla_y f\bigl(x^\star(y), y\bigr), \]

so the dependence \(x^\star(y)\) “drops out.” This is exactly what makes bilevel objectives and value functions tractable in modern optimization pipelines: compute an optimal (or near-optimal) \(x^\star(y)\) by solving the inner problem, then compute \(\nabla_y f(x^\star(y), y)\) by standard autodiff—without backpropagating through the solver or ever forming \(\mathrm d x^\star/\mathrm d y\). In large-scale settings this is the difference between a cheap, stable gradient oracle and an impractical one, and it underpins widely used methods such as envelope-based gradient descent, alternating minimization with outer updates, and many dual/regularized formulations where the “min over \(x\)” is used to define a smooth (or directionally smooth) outer objective.
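To make the pattern concrete, here is a minimal sketch. The inner problem \(f(x,y)=(x-y)^2\) on \(\mathcal X=[0,1]\) and the ternary-search inner solver are illustrative assumptions, not anything from the discussion above; the point is that the gradient of \(g\) is obtained by differentiating \(f\) in \(y\) at the fixed optimizer, with the solver treated as a black box.

```python
# Toy inner problem (an assumption for illustration):
#   f(x, y) = (x - y)^2 with x constrained to X = [0, 1].
# Then g(y) = min_x f(x, y) = dist(y, [0, 1])^2.

def f(x, y):
    return (x - y) ** 2

def inner_solve(y, lo=0.0, hi=1.0, iters=100):
    # Black-box inner solver: ternary search, valid since f(., y) is convex on [0, 1].
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1, y) < f(m2, y):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

def danskin_grad(y):
    # Danskin: solve for x*(y), then differentiate f in y with x* held fixed.
    # Here d/dy f(x, y) = 2 (y - x); no derivative of the solver is needed.
    x_star = inner_solve(y)
    return 2.0 * (y - x_star)

def g(y):
    # Value function g(y) = min_x f(x, y).
    return f(inner_solve(y), y)

y0 = 1.7
fd = (g(y0 + 1e-5) - g(y0 - 1e-5)) / 2e-5  # central finite difference of g
print(danskin_grad(y0), fd)  # both close to 2 * (y0 - 1) = 1.4
```

The finite-difference check agrees with the Danskin gradient because here the minimizer is unique for every \(y\), so \(g\) is differentiable.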


Assumptions

Let \(\mathcal X\subseteq\mathbb R^n\) be nonempty and compact, and let \(\mathcal Y\subseteq\mathbb R^m\) be open. Assume:

1. \(f:\mathcal X\times\mathcal Y\to\mathbb R\) is continuous; by compactness the minimum defining \(g\) is attained, and the solution set \(S(y):=\operatorname{argmin}_{x\in\mathcal X} f(x,y)\) is nonempty and compact for every \(y\in\mathcal Y\).
2. For every \(x\in\mathcal X\), the map \(y\mapsto f(x,y)\) is differentiable, and \((x,y)\mapsto\nabla_y f(x,y)\) is continuous on \(\mathcal X\times\mathcal Y\).


Theorem (directional derivative and generalized gradient)

Directional derivative.

Under the assumptions above, \(g\) is directionally differentiable on \(\mathcal Y\) and for every \(y\in\mathcal Y\) and \(d\in\mathbb R^m\),

\[ g'(y;d) = \min_{x^* \in S(y)} \nabla_y f(x^*, y)^\top d. \]

Differentiability criterion.

If either \(S(y)\) is a singleton, or \(\nabla_y f(x^*, y)\) is the same for all \(x^*\in S(y)\), then \(g\) is differentiable at \(y\) and

\[ \nabla g(y) = \nabla_y f(x^*, y)\quad \text{for any }x^*\in S(y). \]
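Both parts of the theorem can be seen in a toy sketch. Assume, purely for illustration, \(f(x,y)=xy\) on \(\mathcal X=\{-1,+1\}\): then \(g(y)=\min(y,-y)=-|y|\), and at \(y=0\) the solution set is \(S(0)=\{-1,+1\}\), so \(g\) has a kink there and only the directional derivative \(g'(0;d)=\min_{x\in S(0)} x\,d=-|d|\) survives.

```python
# Toy illustration (assumed, not from the text): f(x, y) = x * y on X = {-1, +1},
# so g(y) = min(y, -y) = -|y|.  At y = 0, S(0) = {-1, +1} and g has a kink.
X = (-1.0, 1.0)

def g(y):
    return min(x * y for x in X)

def danskin_dir_deriv(y, d):
    # g'(y; d) = min over S(y) of grad_y f(x, y) . d; here grad_y f(x, y) = x.
    S = [x for x in X if abs(x * y - g(y)) < 1e-12]
    return min(x * d for x in S)

# Compare with the one-sided difference quotient (g(y + eps d) - g(y)) / eps.
eps = 1e-6
for d in (1.0, -2.0):
    fd = (g(0.0 + eps * d) - g(0.0)) / eps
    print(d, danskin_dir_deriv(0.0, d), fd)  # both equal -abs(d)
```

Note that \(d\mapsto -|d|\) is not linear, which is exactly why the differentiability criterion fails at \(y=0\) while holding at every \(y\neq 0\), where \(S(y)\) is a singleton.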

Clarke subdifferential (optional).

If \(g\) is locally Lipschitz (e.g., \(\nabla_y f\) is bounded on a neighborhood of \((y,S(y))\)), then the Clarke subdifferential satisfies

\[ \partial^{\mathrm C} g(y) = \operatorname{conv}\{\nabla_y f(x^*, y) : x^*\in S(y)\}. \]
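A quick sketch of the convex-hull formula, again under the assumed toy problem \(f(x,y)=xy\) on \(\mathcal X=\{-1,+1\}\) (so \(g(y)=-|y|\)): enumerating the gradient set at the kink recovers \(\partial^{\mathrm C} g(0)=\operatorname{conv}\{-1,+1\}=[-1,1]\).

```python
# Assumed toy problem: f(x, y) = x * y on X = {-1, +1}, hence g(y) = -|y|.
# Clarke formula: d^C g(0) = conv{grad_y f(x, 0) : x in S(0)} = conv{-1, +1} = [-1, 1].
X = (-1.0, 1.0)

def gradient_set(y):
    # {grad_y f(x, y) : x in S(y)}; here grad_y f(x, y) = x.
    val = min(x * y for x in X)
    return sorted(x for x in X if x * y == val)

print(gradient_set(0.0))   # [-1.0, 1.0] -> Clarke subdifferential = conv = [-1, 1]
print(gradient_set(0.5))   # [-1.0]      -> g differentiable, grad g = -1
print(gradient_set(-0.5))  # [1.0]       -> g differentiable, grad g = +1
```

Away from the kink the gradient set is a singleton and the Clarke subdifferential collapses to the ordinary gradient \(-\operatorname{sign}(y)\).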

Proof (directional derivative)

Fix \(y\in\mathcal Y\) and direction \(d\in\mathbb R^m\). For \(\epsilon>0\) small, set \(y_\epsilon=y+\epsilon d\in\mathcal Y\).

Upper bound. For any minimizer \(x^*\in S(y)\), we have \(g(y)=f(x^*,y)\) and \(g(y_\epsilon)\le f(x^*,y_\epsilon)\). Hence

\[ \frac{g(y_\epsilon)-g(y)}{\epsilon} \le \frac{f(x^*,y_\epsilon)-f(x^*,y)}{\epsilon}. \]

Letting \(\epsilon\downarrow 0\) and using differentiability of \(f(x^*,\cdot)\),

\[ \limsup_{\epsilon\downarrow 0}\frac{g(y_\epsilon)-g(y)}{\epsilon} \le \nabla_y f(x^*, y)^\top d. \]

Since this holds for every \(x^*\in S(y)\),

\[ \limsup_{\epsilon\downarrow 0}\frac{g(y_\epsilon)-g(y)}{\epsilon} \le \min_{x^*\in S(y)} \nabla_y f(x^*, y)^\top d. \]

Lower bound. Let \(x_\epsilon\in S(y_\epsilon)\) be any minimizer at \(y_\epsilon\). Since \(g(y)\le f(x_\epsilon,y)\), we have

\[ g(y_\epsilon)-g(y) = f(x_\epsilon,y_\epsilon)-g(y) \ge f(x_\epsilon,y_\epsilon)-f(x_\epsilon,y). \]

Divide by \(\epsilon\). By the mean value theorem applied to \(\theta\mapsto f(x_\epsilon, y+\theta d)\), there exists \(\theta_\epsilon\in(0,\epsilon)\) such that

\[ \frac{f(x_\epsilon, y_\epsilon)-f(x_\epsilon,y)}{\epsilon} = \nabla_y f(x_\epsilon, y+\theta_\epsilon d)^\top d. \]

By compactness of \(\mathcal X\), the family \((x_\epsilon)\) has limit points as \(\epsilon\downarrow 0\). Take a sequence \(\epsilon_k\downarrow 0\) such that \(x_{\epsilon_k}\to \bar x\). Passing to the limit in \(g(y_{\epsilon_k})=f(x_{\epsilon_k},y_{\epsilon_k})\le f(x,y_{\epsilon_k})\) for each fixed \(x\in\mathcal X\), continuity of \(f\) gives \(f(\bar x,y)\le f(x,y)\), i.e. \(\bar x\in S(y)\). Using continuity of \(\nabla_y f\),

\[ \liminf_{k\to\infty}\frac{g(y_{\epsilon_k})-g(y)}{\epsilon_k} \ge \nabla_y f(\bar x, y)^\top d \ge \min_{x^*\in S(y)} \nabla_y f(x^*, y)^\top d. \]

Since the right-hand side does not depend on the subsequence, we obtain

\[ \liminf_{\epsilon\downarrow 0}\frac{g(y_\epsilon)-g(y)}{\epsilon} \ge \min_{x^*\in S(y)} \nabla_y f(x^*, y)^\top d. \]

Conclusion. The limsup and liminf match, so the directional derivative exists and equals \(\min_{x^*\in S(y)} \nabla_y f(x^*, y)^\top d\).


Remark (Clarke subdifferential)

The Clarke formula can be derived quickly by rewriting \(g(y)=-h(y)\) with \(h(y)=\max_{x\in\mathcal X}(-f(x,y))\) and applying the standard Clarke/Danskin rule for maxima, which yields \(\partial^{\mathrm C} h(y)=\operatorname{conv}\{-\nabla_y f(x^*, y):x^*\in S(y)\}\), hence \(\partial^{\mathrm C} g(y)=\operatorname{conv}\{\nabla_y f(x^*, y):x^*\in S(y)\}\).