• \(df = f_xdx + f_ydy+f_zdz\)
  • \(df \neq \Delta f\)
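  • \(\Delta f \approx f_x\Delta x + f_y\Delta y + f_z\Delta z\) (an approximation that improves as the steps shrink)
  • \(df\) encodes infinitesimal changes
  • \(dx, dy, dz\) act as placeholders: substituting \(\Delta x, \Delta y, \Delta z\) gives the approximation formula above
  • divide by \(dt\) to get the rate of change \(\rightarrow\) chain rule: \(\frac{df}{dt} = f_x\frac{dx}{dt} + f_y\frac{dy}{dt} + f_z\frac{dz}{dt}\)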
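
To see why \(df\) and \(\Delta f\) differ, the following minimal sketch compares the linear estimate \(f_x\Delta x + f_y\Delta y + f_z\Delta z\) with the actual change for shrinking steps. The function \(f(x, y, z) = x^2y + z^3\) and the base point are assumptions chosen purely for illustration.

```python
# Hypothetical example function (not from the notes): f(x, y, z) = x^2 * y + z^3
def f(x, y, z):
    return x**2 * y + z**3

def partials(x, y, z):
    """Analytic partial derivatives (f_x, f_y, f_z) of the example f."""
    return (2 * x * y, x**2, 3 * z**2)

def linear_estimate(point, step):
    """f_x*dx + f_y*dy + f_z*dz evaluated on a finite step."""
    return sum(p * h for p, h in zip(partials(*point), step))

def actual_change(point, step):
    """Delta f = f(point + step) - f(point)."""
    moved = [c + h for c, h in zip(point, step)]
    return f(*moved) - f(*point)

point = (1.0, 2.0, 3.0)
for size in (1e-1, 1e-2, 1e-3):
    step = (size, size, size)
    est = linear_estimate(point, step)
    act = actual_change(point, step)
    # The gap between the two shrinks roughly like size**2.
    print(f"step={size:g}  estimate={est:.6f}  actual={act:.6f}  gap={abs(act - est):.2e}")
```

For this particular \(f\), the gap is second order in the step size, which is why the linear estimate is only an approximation of \(\Delta f\).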

Chain Rule with More Variables

Let \(w = f(x, y)\), where \(x = x(u, v)\) and \(y = y(u, v)\). Then,

\[ dw = f_x dx + f_y dy = (f_xx_u+ f_yy_u)du + (f_xx_v+ f_yy_v)dv = \frac{\partial f}{\partial u}du + \frac{\partial f}{\partial v}dv \]
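
The chain rule above can be checked numerically. The sketch below assumes an illustrative \(f(x, y) = x^2 + 3xy\) with the polar substitution \(x = u\cos v\), \(y = u\sin v\) (both are assumptions, not taken from the notes) and compares \(f_xx_u + f_yy_u\) and \(f_xx_v + f_yy_v\) against central finite differences of the composite function.

```python
import math

# Hypothetical example: f(x, y) = x^2 + 3xy with x = u cos v, y = u sin v
def f(x, y):
    return x**2 + 3 * x * y

def f_x(x, y):
    return 2 * x + 3 * y

def f_y(x, y):
    return 3 * x

def x_of(u, v):
    return u * math.cos(v)

def y_of(u, v):
    return u * math.sin(v)

def chain_rule_partials(u, v):
    """(dw/du, dw/dv) from f_x x_u + f_y y_u and f_x x_v + f_y y_v."""
    x, y = x_of(u, v), y_of(u, v)
    x_u, x_v = math.cos(v), -u * math.sin(v)
    y_u, y_v = math.sin(v), u * math.cos(v)
    w_u = f_x(x, y) * x_u + f_y(x, y) * y_u
    w_v = f_x(x, y) * x_v + f_y(x, y) * y_v
    return w_u, w_v

def numeric_partials(u, v, h=1e-6):
    """Central finite differences of the composite w(u, v) = f(x(u, v), y(u, v))."""
    def w(a, b):
        return f(x_of(a, b), y_of(a, b))
    w_u = (w(u + h, v) - w(u - h, v)) / (2 * h)
    w_v = (w(u, v + h) - w(u, v - h)) / (2 * h)
    return w_u, w_v

u0, v0 = 2.0, 0.5
print("chain rule:", chain_rule_partials(u0, v0))
print("numeric:   ", numeric_partials(u0, v0))
```

The two pairs should agree up to finite-difference error.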

Gradient Vector

\[ \frac{dw}{dt} = w_x \frac{dx}{dt} + w_y \frac{dy}{dt} + w_z \frac{dz}{dt} = \vec{\nabla} w \cdot \frac{d\vec{r}}{dt} \]

Note: \(\vec{\nabla}w \perp \text{level surfaces}\) (it is normal to the tangent plane of the level surface at any given point)

Directional Derivatives

\[ \left.\frac{dw}{ds}\right|_{\hat{u}} = \vec{\nabla}w \cdot \frac{d\vec{r}}{ds} = \vec{\nabla}w \cdot \hat{u} \]

where \(\hat{u}\) is a unit vector and \(s\) is the distance travelled along it.


The direction of \(\vec{\nabla}w\) is the direction of fastest increase of \(w\), and \(|\vec{\nabla}w|\) is the rate of increase in that direction.
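
As a quick illustration, the sketch below evaluates \(\vec{\nabla}w \cdot \hat{u}\) for an assumed example \(w = xy + z^2\) (chosen for illustration only), once along an arbitrary direction and once along the gradient itself.

```python
import math

# Hypothetical example: w(x, y, z) = x*y + z^2, so grad w = (y, x, 2z)
def grad_w(x, y, z):
    return (y, x, 2 * z)

def unit(v):
    """Normalize v to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)

def directional_derivative(point, direction):
    """grad w . u_hat, with u_hat the unit vector along `direction`."""
    u_hat = unit(direction)
    return sum(g * u for g, u in zip(grad_w(*point), u_hat))

point = (1.0, 2.0, 3.0)
along = directional_derivative(point, (1.0, 2.0, 2.0))     # an arbitrary direction
steepest = directional_derivative(point, grad_w(*point))   # along the gradient
print("along (1, 2, 2):", along)
print("along grad w:   ", steepest)
```

Here \(\vec{\nabla}w = (2, 1, 6)\) at the chosen point, so the derivative along \((1, 2, 2)/3\) is \(16/3\), while the derivative along the gradient equals \(|\vec{\nabla}w| = \sqrt{41}\), the largest value any unit direction can give.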

Lagrange Multipliers

Goal: minimize/maximize a multivariable function \(f(x, y, z)\) where \(x, y, z\) are not independent but are tied together by a constraint \(g(x, y, z) = c\).

The candidate points are obtained by solving the given constraint together with the following condition.

\[ \vec{\nabla}f = \lambda \vec{\nabla}g \]

Basic idea: find the points where the level curves (level surfaces, in three variables) of \(f\) and \(g\) are tangent to each other, i.e. \(\vec{\nabla}f \parallel \vec{\nabla}g\).

Note: Check that the point found is indeed a maximum or minimum as required and not just a saddle point. The usual second derivative test does not apply to constrained problems, so compare the values of \(f\) at the candidate points instead.
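
A small worked example may help; the function and constraint here are assumptions chosen for illustration. Maximize \(f(x, y) = xy\) subject to \(g(x, y) = x + y = 10\):

\[ \vec{\nabla}f = (y, x), \quad \vec{\nabla}g = (1, 1) \quad \Rightarrow \quad y = \lambda, \ x = \lambda \]

\[ x + y = 10 \quad \Rightarrow \quad x = y = 5, \ \lambda = 5, \ f(5, 5) = 25 \]

Along the constraint, \(f = x(10 - x) = 25 - (x - 5)^2 \le 25\), so the candidate point is a maximum rather than a minimum or a saddle.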

| Function | Example | Value | First derivative | Second derivative |
|---|---|---|---|---|
| \(f: \mathbb R \to \mathbb R\) | \(x^2\) | \(\mathbb R\) | \(\mathbb R\) | \(\mathbb R\) |
| \(f: \mathbb R^d \to \mathbb R\) | loss function | \(\mathbb R\) | \(\mathbb R^d\ [\text{gradient}]\) | \(\mathbb R^{d\times d}\ [\text{Hessian}]\) |
| \(f: \mathbb R^d \to \mathbb R^p\) | neural net layer | \(\mathbb R^p\) | \(\mathbb R^{p \times d}\ [\text{Jacobian}]\) | \(\mathbb R^{p \times d \times d}\) |


\[ \nabla_x f(x) = \nabla_x f(x_1, \cdots, x_d) = \begin{bmatrix} \frac{\partial f(x)}{\partial x_1}\\ \vdots\\ \frac{\partial f(x)}{\partial x_d} \end{bmatrix} \]
\[ \nabla_A f(A) = \begin{bmatrix} \frac{\partial f(A)}{\partial A_{11}} & \cdots & \frac{\partial f(A)}{\partial A_{1n}}\\ \vdots & \ddots & \vdots\\ \frac{\partial f(A)}{\partial A_{m1}} & \cdots & \frac{\partial f(A)}{\partial A_{mn}} \end{bmatrix} \]


For \(f: \mathbb R^d \to \mathbb R^p\), write \(f\) component-wise: \(f(x_1, \cdots, x_d) = \begin{bmatrix} f_1(x_1, \cdots, x_d)\\ \vdots \\ f_p(x_1, \cdots, x_d) \end{bmatrix}\)

Note: Hessians are square matrices, and they are symmetric whenever the second partial derivatives are continuous.

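\[ \nabla_x^2 f(x) = \begin{bmatrix} \frac{\partial^2f(x)}{\partial x_1^2} & \frac{\partial^2f(x)}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2f(x)}{\partial x_1 \partial x_d}\\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2f(x)}{\partial x_d \partial x_1} & \frac{\partial^2f(x)}{\partial x_d \partial x_2} & \cdots & \frac{\partial^2f(x)}{\partial x_d^2} \end{bmatrix} \]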


\[ J = \begin{bmatrix} \dfrac{\partial \mathbf{f}}{\partial x_1} & \cdots & \dfrac{\partial \mathbf{f}}{\partial x_d}\end{bmatrix}= \begin{bmatrix} \nabla^{\mathrm T} f_1 \\ \vdots \\ \nabla^{\mathrm T} f_p \end{bmatrix}= \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_d}\\ \vdots & \ddots & \vdots\\ \dfrac{\partial f_p}{\partial x_1} & \cdots & \dfrac{\partial f_p}{\partial x_d}\end{bmatrix} \]

where \(\nabla^{\mathrm T} f_i\) is the transpose (row vector) of the gradient of the \(i\)-th component \(f_i\).
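
To make the shape of \(J\) concrete, here is a minimal finite-difference sketch for an assumed map \(f: \mathbb R^2 \to \mathbb R^3\), \(f(x_1, x_2) = (x_1x_2,\ x_1 + x_2,\ x_1^2)\). The function and the helper name `jacobian` are illustrative choices, not part of the notes.

```python
# Hypothetical example map f: R^2 -> R^3
def f(x):
    x1, x2 = x
    return [x1 * x2, x1 + x2, x1**2]

def jacobian(func, x, h=1e-6):
    """Central-difference Jacobian: row i approximates the gradient of f_i."""
    p = len(func(x))   # number of outputs (rows)
    d = len(x)         # number of inputs (columns)
    J = [[0.0] * d for _ in range(p)]
    for j in range(d):
        forward = list(x)
        forward[j] += h
        backward = list(x)
        backward[j] -= h
        f_plus, f_minus = func(forward), func(backward)
        for i in range(p):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * h)
    return J

J = jacobian(f, [2.0, 3.0])
print(len(J), "rows x", len(J[0]), "columns")
for row in J:
    print(row)
```

The result has \(p = 3\) rows and \(d = 2\) columns. For this example the analytic Jacobian is \(\begin{bmatrix} x_2 & x_1 \\ 1 & 1 \\ 2x_1 & 0 \end{bmatrix}\), so the numerical rows should be close to \((3, 2)\), \((1, 1)\), \((4, 0)\) at the chosen point.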


  • \(\nabla_xb^Tx = b\)
  • \(\nabla_x^2 b^Tx = 0\)
  • \(\nabla_xx^TAx = 2Ax\), if \(A\) is symmetric
  • \(\nabla_x^2x^TAx = 2A\), if \(A\) is symmetric
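
The first and third identities above can be sanity-checked with finite differences. The sketch below uses a small symmetric \(A\), a vector \(b\), and a point \(x\) that are all assumed values for illustration.

```python
def quad_form(A, x):
    """x^T A x for a square matrix A given as a list of rows."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

def linear_form(b, x):
    """b^T x."""
    return sum(bi * xi for bi, xi in zip(b, x))

def numeric_gradient(func, x, h=1e-6):
    """Central-difference approximation of the gradient of a scalar function."""
    grad = []
    for k in range(len(x)):
        forward = list(x)
        forward[k] += h
        backward = list(x)
        backward[k] -= h
        grad.append((func(forward) - func(backward)) / (2 * h))
    return grad

A = [[2.0, 1.0], [1.0, 3.0]]   # symmetric (assumed example)
b = [4.0, -1.0]                # assumed example
x = [1.0, -2.0]                # assumed evaluation point

grad_quad = numeric_gradient(lambda v: quad_form(A, v), x)
two_Ax = [2 * sum(A[i][j] * x[j] for j in range(len(x))) for i in range(len(x))]
grad_lin = numeric_gradient(lambda v: linear_form(b, v), x)

print("numeric grad of x^T A x:", grad_quad, " vs 2Ax:", two_Ax)
print("numeric grad of b^T x:  ", grad_lin, " vs b:  ", b)
```

If \(A\) were not symmetric, the quadratic-form check would fail, since the general gradient of \(x^TAx\) is \((A + A^T)x\).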