Danskin’s Theorem
Let \(x\in\mathcal X\) be the decision variable and \(y\in\mathcal Y\) be a parameter. Define the value function \[ g(y) := \min_{x\in\mathcal X} f(x, y). \] When the minimizer is not unique, \(g\) may fail to be differentiable. The right object is a directional derivative (and, if desired, a generalized subdifferential).
Danskin’s theorem is a workhorse because it turns a seemingly “implicit” differentiation problem into a “direct” one. When a value function is defined by an inner optimization,
it is tempting to think that differentiating \(g\) requires differentiating the optimizer map \(y\mapsto x^\star(y)\). In practice that map can be discontinuous, non-unique, or simply expensive to differentiate (it may require sensitivity equations or differentiating through an iterative solver). Danskin’s theorem says you typically don’t need any of that: whenever \(g\) is differentiable at \(y\), its gradient is obtained by holding the optimizer fixed,
so the dependence \(x^\star(y)\) “drops out.” This is exactly what makes bilevel objectives and value functions tractable in modern optimization pipelines: compute an optimal (or near-optimal) \(x^\star(y)\) by solving the inner problem, then compute \(\nabla_y f(x^\star(y), y)\) by standard autodiff—without backpropagating through the solver or ever forming \(\mathrm d x^\star/\mathrm d y\). In large-scale settings this is the difference between a cheap, stable gradient oracle and an impractical one, and it underpins widely used methods such as envelope-based gradient descent, alternating minimization with outer updates, and many dual/regularized formulations where the “min over \(x\)” is used to define a smooth (or directionally smooth) outer objective.
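This "hold the optimizer fixed" recipe is easy to verify numerically. The following sketch (the quadratic objective, the matrix `A`, and the helper names `solve_inner`, `grad_y_f` are illustrative choices, not from the text) solves the inner problem in closed form and checks the Danskin gradient \(\nabla_y f(x^\star(y), y)\) against finite differences of \(g\), never forming \(\mathrm d x^\star/\mathrm d y\):

```python
import numpy as np

# Illustrative inner objective: f(x, y) = ||x - A y||^2 + ||x||^2, with X = R^n.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2))

def f(x, y):
    return np.sum((x - A @ y) ** 2) + np.sum(x ** 2)

def solve_inner(y):
    # Closed-form minimizer: setting the gradient in x to zero gives x* = A y / 2.
    return A @ y / 2.0

def grad_y_f(x, y):
    # Partial gradient in y with x held fixed: -2 A^T (x - A y).
    return -2.0 * A.T @ (x - A @ y)

def g(y):
    return f(solve_inner(y), y)

y = rng.standard_normal(2)
danskin = grad_y_f(solve_inner(y), y)   # Danskin: evaluate at x*, hold it fixed

# Central finite differences of g for comparison.
eps = 1e-6
fd = np.array([(g(y + eps * e) - g(y - eps * e)) / (2 * eps)
               for e in np.eye(2)])
print(np.max(np.abs(danskin - fd)))  # small: the two gradients agree
```

Here the solver happens to be a closed-form expression, but the point of the theorem is that the same check passes when `solve_inner` is an iterative black box, since its derivative never enters.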
Assumptions
Let \(\mathcal X\subseteq\mathbb R^n\) be nonempty and closed, and let \(\mathcal Y\subseteq\mathbb R^m\) be open. Assume:
- \(f: \mathcal X \times \mathcal Y \to \mathbb R\) is continuous in \((x, y)\).
- For every \(y\in\mathcal Y\), the minimum \(g(y)=\min_{x\in\mathcal X} f(x, y)\) is attained, and the minimizer set \[ S(y):=\arg\min_{x\in\mathcal X} f(x, y) \] is nonempty and compact.
- For every \(x\in\mathcal X\), the map \(y\mapsto f(x,y)\) is differentiable on \(\mathcal Y\), and \(\nabla_y f(x,y)\) is continuous in \((x,y)\).
Theorem (directional derivative and generalized gradient)
Directional derivative.
Under the assumptions above, \(g\) is directionally differentiable on \(\mathcal Y\), and for every \(y\in\mathcal Y\) and \(d\in\mathbb R^m\), \[ g'(y; d) := \lim_{\epsilon\downarrow 0} \frac{g(y+\epsilon d)-g(y)}{\epsilon} = \min_{x^*\in S(y)} \nabla_y f(x^*, y)^\top d. \]
Differentiability criterion.
If either \(S(y)\) is a singleton, or \(\nabla_y f(x^*, y)\) is the same for all \(x^*\in S(y)\), then \(g\) is differentiable at \(y\) and \[ \nabla g(y) = \nabla_y f(x^*, y) \qquad \text{for any } x^*\in S(y). \]
Clarke subdifferential (optional).
If \(g\) is locally Lipschitz (e.g., \(\nabla_y f\) is bounded on a neighborhood of \(\{y\}\times S(y)\)), then the Clarke subdifferential satisfies \[ \partial^{\mathrm C} g(y) = \operatorname{conv}\{\nabla_y f(x^*, y) : x^*\in S(y)\}. \]
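A minimal instance makes the theorem statement concrete. Take the finite set \(\mathcal X=\{-1,+1\}\) and \(f(x,y)=xy\) (both chosen purely for illustration), so \(g(y)=-|y|\): at \(y=0\) the minimizer is not unique, \(S(0)=\{-1,+1\}\), and the directional-derivative formula predicts \(g'(0;d)=\min\{d,-d\}=-|d|\). The sketch below checks this against a one-sided difference quotient:

```python
# Illustrative instance with a nonunique minimizer: X = {-1, +1}, f(x, y) = x*y,
# so g(y) = min_x x*y = -|y| and S(0) = {-1, +1}.
X = [-1.0, 1.0]

def f(x, y):
    return x * y

def g(y):
    return min(f(x, y) for x in X)

def S(y, tol=1e-12):
    # Minimizer set, up to a numerical tolerance.
    gy = g(y)
    return [x for x in X if f(x, y) <= gy + tol]

def danskin_dir_deriv(y, d):
    # grad_y f(x, y) = x, so the theorem gives g'(y; d) = min_{x in S(y)} x*d.
    return min(x * d for x in S(y))

errs = []
for d in (1.0, -2.0):
    eps = 1e-8
    numeric = (g(0.0 + eps * d) - g(0.0)) / eps  # one-sided quotient at y = 0
    errs.append(abs(danskin_dir_deriv(0.0, d) - numeric))
print(errs)  # both errors are (numerically) zero
```

Note that at \(y=0\) the two minimizers carry different gradients \(\pm 1\), so by the differentiability criterion \(g\) is not differentiable there, even though every \(f(x,\cdot)\) is smooth.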
Proof (directional derivative)
Fix \(y\in\mathcal Y\) and direction \(d\in\mathbb R^m\). For \(\epsilon>0\) small, set \(y_\epsilon=y+\epsilon d\in\mathcal Y\).
Upper bound. For any minimizer \(x^*\in S(y)\), we have \(g(y)=f(x^*,y)\) and \(g(y_\epsilon)\le f(x^*,y_\epsilon)\). Hence \[ \frac{g(y_\epsilon)-g(y)}{\epsilon} \le \frac{f(x^*,y_\epsilon)-f(x^*,y)}{\epsilon}. \]
Letting \(\epsilon\downarrow 0\) and using differentiability of \(f(x^*,\cdot)\), \[ \limsup_{\epsilon\downarrow 0} \frac{g(y_\epsilon)-g(y)}{\epsilon} \le \nabla_y f(x^*, y)^\top d. \]
Since this holds for every \(x^*\in S(y)\), \[ \limsup_{\epsilon\downarrow 0} \frac{g(y_\epsilon)-g(y)}{\epsilon} \le \min_{x^*\in S(y)} \nabla_y f(x^*, y)^\top d. \]
Lower bound. Let \(x_\epsilon\in S(y_\epsilon)\) be any minimizer at \(y_\epsilon\). Since \(g(y)\le f(x_\epsilon,y)\), we have \[ g(y_\epsilon)-g(y) = f(x_\epsilon,y_\epsilon)-g(y) \ge f(x_\epsilon,y_\epsilon)-f(x_\epsilon,y). \]
Divide by \(\epsilon\). By the mean value theorem applied to \(\theta\mapsto f(x_\epsilon, y+\theta d)\), there exists \(\theta_\epsilon\in(0,\epsilon)\) such that \[ \frac{g(y_\epsilon)-g(y)}{\epsilon} \ge \frac{f(x_\epsilon,y_\epsilon)-f(x_\epsilon,y)}{\epsilon} = \nabla_y f(x_\epsilon, y+\theta_\epsilon d)^\top d. \]
By local boundedness of the minimizers (which holds, e.g., when \(\mathcal X\) is compact, or when the sets \(S(y')\) for \(y'\) near \(y\) lie in a common compact set), the family \((x_\epsilon)\) has limit points as \(\epsilon\downarrow 0\). Take a sequence \(\epsilon_k\downarrow 0\) realizing the liminf and such that \(x_{\epsilon_k}\to \bar x\). By standard stability of the argmin under continuity and compactness, \(\bar x\in S(y)\). Using continuity of \(\nabla_y f\), \[ \lim_{k\to\infty} \nabla_y f(x_{\epsilon_k}, y+\theta_{\epsilon_k} d)^\top d = \nabla_y f(\bar x, y)^\top d \ge \min_{x^*\in S(y)} \nabla_y f(x^*, y)^\top d. \]
Since the right-hand side does not depend on the chosen subsequence, we obtain \[ \liminf_{\epsilon\downarrow 0} \frac{g(y_\epsilon)-g(y)}{\epsilon} \ge \min_{x^*\in S(y)} \nabla_y f(x^*, y)^\top d. \]
Conclusion. The limsup and liminf match, so the directional derivative exists and equals \(\min_{x^*\in S(y)} \nabla_y f(x^*, y)^\top d\).
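The proof also covers the constrained case, where the minimizer depends nonsmoothly on the parameter. The sketch below (the interval constraint \(\mathcal X=[-1,1]\) and \(f(x,y)=(x-y)^2\) are illustrative choices) uses the projection \(x^\star(y)=\operatorname{clip}(y,-1,1)\), which is not differentiable at \(y=\pm 1\), yet the Danskin gradient \(\nabla_y f(x^\star,y)=-2(x^\star-y)\) matches finite differences of \(g\) at smooth points without ever differentiating the clip:

```python
import numpy as np

# Illustrative constrained instance: X = [-1, 1], f(x, y) = (x - y)^2,
# so x*(y) = clip(y, -1, 1) and g(y) = (clip(y) - y)^2.
def xstar(y):
    return np.clip(y, -1.0, 1.0)

def g(y):
    return (xstar(y) - y) ** 2

def danskin_grad(y):
    # Hold x* fixed: grad_y f(x*, y) = -2 (x* - y).
    return -2.0 * (xstar(y) - y)

checks = []
for y in (-2.0, 0.3, 1.7):   # one point in each smooth regime
    eps = 1e-6
    fd = (g(y + eps) - g(y - eps)) / (2 * eps)
    checks.append((danskin_grad(y), fd))
print(checks)
```

Inside the constraint (\(y=0.3\)) the gradient is zero; outside it equals \(2(y \mp 1)\), and both regimes are reproduced by the same fixed-optimizer formula.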
Remark (Clarke subdifferential)
The Clarke formula can be derived quickly by rewriting \(g(y)=-h(y)\) with \(h(y)=\max_{x\in\mathcal X}(-f(x,y))\) and applying the standard Clarke/Danskin rule for maxima, which yields \(\partial^{\mathrm C} h(y)=\operatorname{conv}\{-\nabla_y f(x^*, y):x^*\in S(y)\}\), hence \(\partial^{\mathrm C} g(y)=\operatorname{conv}\{\nabla_y f(x^*, y):x^*\in S(y)\}\).
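For scalar \(y\) the Clarke formula reduces to an interval. Revisiting the illustrative instance \(g(y)=\min_{x\in\{-1,1\}} xy=-|y|\): at \(y=0\) the minimizer gradients are \(\nabla_y f(\pm 1, 0)=\pm 1\), so the convex hull is \([-1,1]\), which is exactly the Clarke subdifferential of \(-|y|\) at \(0\):

```python
# Clarke formula at y = 0 for g(y) = min_{x in {-1,1}} x*y = -|y|:
# grad_y f(x, y) = x, so the gradients over S(0) = {-1, +1} are {-1, +1}.
grads = sorted(x for x in (-1.0, 1.0))
# Convex hull of finitely many reals is the interval [min, max].
clarke_interval = (min(grads), max(grads))
print(clarke_interval)  # (-1.0, 1.0)
```

Note that \(-|y|\) is not Clarke-regular at \(0\) (its directional derivative \(-|d|\) is strictly below the support function of \([-1,1]\)), which is why the theorem states the gradient formula via the min over \(S(y)\) rather than via the subdifferential alone.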