## Differentials

• $$df = f_xdx + f_ydy+f_zdz$$
• $$df \neq \Delta f$$
• $$\Delta f \approx f_x\Delta x + f_y\Delta y + f_z\Delta z$$
• $$df$$ is used to encode infinitesimal changes
• acts as a placeholder value
• divide by $$dt$$ to get the rate of change $$\rightarrow$$ CHAIN RULE (see the numerical sketch below)
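
A minimal numerical sketch of the $$df$$ vs. $$\Delta f$$ distinction, assuming JAX is available (the function $$f(x, y, z) = xy + z^2$$, the point, and the step are made up for illustration):

```python
# Compare the actual change Delta f with the linear estimate
# df = f_x dx + f_y dy + f_z dz for a sample function at a sample point.
import jax
import jax.numpy as jnp

def f(p):
    x, y, z = p
    return x * y + z ** 2

p0 = jnp.array([1.0, 2.0, 3.0])               # base point
dp = jnp.array([0.01, -0.02, 0.005])          # small change (dx, dy, dz)

grad_f = jax.grad(f)(p0)                      # (f_x, f_y, f_z) at p0
df = jnp.dot(grad_f, dp)                      # linear estimate f_x dx + f_y dy + f_z dz
delta_f = f(p0 + dp) - f(p0)                  # actual change

print(df, delta_f)                            # close, but not equal: df != Delta f
```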

## Chain Rule with More Variables

Let $$w = f(x, y)$$ where $$x = x(u, v)$$ and $$y = y(u, v)$$. Then,

$dw = f_x dx + f_y dy = (f_xx_u+ f_yy_u)du + (f_xx_v+ f_yy_v)dv = \frac{\partial f}{\partial u}du + \frac{\partial f}{\partial v}dv$

Similarly, for $$w = f(x, y, z)$$ with $$x, y, z$$ functions of $$t$$,

$\frac{dw}{dt} = w_x \frac{dx}{dt} + w_y \frac{dy}{dt} + w_z \frac{dz}{dt} = \vec{\nabla} w \cdot \frac{d\vec{r}}{dt}$

Note: $$\vec{\nabla}w \ \perp \text{ level surfaces}$$ (normal to the level surface at any given point)
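
A quick check of the chain rule along a curve, assuming JAX (the function $$w$$ and the path $$\vec{r}(t)$$ are made up for illustration):

```python
# dw/dt computed as grad(w) . dr/dt should match differentiating the
# composition w(r(t)) directly with respect to t.
import jax
import jax.numpy as jnp

def w(p):                                   # w = x^2 y + z
    x, y, z = p
    return x ** 2 * y + z

def r(t):                                   # sample path r(t) = (cos t, sin t, t)
    return jnp.stack([jnp.cos(t), jnp.sin(t), t])

t0 = 0.7
grad_w = jax.grad(w)(r(t0))                 # gradient of w at r(t0)
dr_dt = jax.jacobian(r)(t0)                 # velocity dr/dt, shape (3,)

print(jnp.dot(grad_w, dr_dt))               # chain rule: grad(w) . dr/dt
print(jax.grad(lambda t: w(r(t)))(t0))      # direct d/dt of the composition
```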

## Directional Derivatives

$\frac{dw}{ds}|_{\hat{u}} = \vec{\nabla}w \cdot \frac{d\vec{r}}{ds} = \vec{\nabla}w \cdot \hat{u}$
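
A short sketch, assuming JAX (the function $$w$$, the point, and the direction are made up for illustration):

```python
# The directional derivative along a unit vector u_hat is grad(w) . u_hat;
# compare against a finite difference taken along that direction.
import jax
import jax.numpy as jnp

def w(p):
    x, y = p
    return x ** 2 + 3 * x * y

p0 = jnp.array([1.0, 2.0])
u = jnp.array([3.0, 4.0])
u_hat = u / jnp.linalg.norm(u)              # unit direction

dw_ds = jnp.dot(jax.grad(w)(p0), u_hat)     # grad(w) . u_hat

h = 1e-4
fd = (w(p0 + h * u_hat) - w(p0)) / h        # finite-difference check
print(dw_ds, fd)                            # should agree closely
```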

## Implications

The direction of $$\vec{\nabla}w$$ is the direction of fastest increase of $$w$$; its magnitude $$|\vec{\nabla}w|$$ is the rate of increase in that direction.
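
An illustrative check, assuming JAX (the function and point are made up): among many random unit directions, the largest directional derivative occurs along $$\vec{\nabla}w / |\vec{\nabla}w|$$ and equals $$|\vec{\nabla}w|$$.

```python
import jax
import jax.numpy as jnp

def w(p):
    x, y = p
    return x ** 2 * y + jnp.sin(y)

p0 = jnp.array([0.5, 1.0])
g = jax.grad(w)(p0)

key = jax.random.PRNGKey(0)
dirs = jax.random.normal(key, (1000, 2))
dirs = dirs / jnp.linalg.norm(dirs, axis=1, keepdims=True)   # random unit vectors

dd = dirs @ g                        # directional derivatives grad(w) . u_hat
print(dd.max())                      # close to |grad(w)| ...
print(jnp.linalg.norm(g))            # ... which is attained along g / |g|
```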

## Lagrange Multipliers

Goal: minimize/maximize a multi-variable function ($$\min / \max\ \ f(x, y, z)$$) where $$x, y, z$$ are not independent but satisfy a constraint $$g(x, y, z) = c$$.

The candidate points are obtained by combining the given constraint with the following.

$\vec{\nabla}f = \lambda \vec{\nabla}g$

Basic idea: find $$(x, y)$$ where the level curves of $$f$$ and $$g$$ are tangent to each other ($$\vec{\nabla}f \parallel \vec{\nabla}g$$).

Note: Check that the point found is indeed a maximum or minimum as required and not just a saddle point (the second derivative test won't be applicable, so be vigilant).
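
For example (a standard worked case): maximize/minimize $$f(x, y) = xy$$ subject to $$g(x, y) = x^2 + y^2 = 1$$.

$\vec{\nabla}f = \langle y, x \rangle = \lambda \vec{\nabla}g = \lambda \langle 2x, 2y \rangle \implies y = 2\lambda x, \quad x = 2\lambda y, \quad x^2 + y^2 = 1$

Substituting gives $$y = 4\lambda^2 y$$, so $$\lambda = \pm\tfrac{1}{2}$$ and $$y = \pm x$$ with $$x^2 = \tfrac{1}{2}$$ (the case $$y = 0$$ forces $$x = 0$$, which violates the constraint). The points $$\pm(\tfrac{1}{\sqrt 2}, \tfrac{1}{\sqrt 2})$$ give the maximum $$f = \tfrac{1}{2}$$ and $$\pm(\tfrac{1}{\sqrt 2}, -\tfrac{1}{\sqrt 2})$$ give the minimum $$f = -\tfrac{1}{2}$$.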

## Gradients, Hessians, and Jacobians

| Functions | Example | Value | First derivative | Second derivative |
| --- | --- | --- | --- | --- |
| $$f: \mathbb R \to \mathbb R$$ | $$x^2$$ | $$\mathbb R$$ | $$\mathbb R$$ | $$\mathbb R$$ |
| $$f: \mathbb R^d \to \mathbb R$$ | loss function | $$\mathbb R$$ | $$\mathbb R^d$$ [gradient] | $$\mathbb R^{d\times d}$$ [Hessian] |
| $$f: \mathbb R^d \to \mathbb R^p$$ | neural net layer | $$\mathbb R^p$$ | $$\mathbb R^{p \times d}$$ [Jacobian] | $$\mathbb R^{p \times d \times d}$$ |
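
A shape check of the table above, assuming JAX (the toy "loss" and "layer" functions are made up for illustration):

```python
import jax
import jax.numpy as jnp

d, p = 4, 3
x = jnp.arange(1.0, d + 1)

def loss(x):                          # f: R^d -> R (scalar-valued)
    return jnp.sum(x ** 2)

def layer(x):                         # g: R^d -> R^p (a toy linear map)
    W = jnp.ones((p, d))
    return W @ x

print(jax.grad(loss)(x).shape)        # (d,)    gradient
print(jax.hessian(loss)(x).shape)     # (d, d)  Hessian
print(jax.jacobian(layer)(x).shape)   # (p, d)  Jacobian
```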

$\nabla_x f(x) = \nabla_x f(x_1, \cdots, x_d) = \begin{bmatrix} \frac{\partial f(x)}{\partial x_1}\\ \vdots\\ \frac{\partial f(x)}{\partial x_d} \end{bmatrix}$
$\nabla_Af(A) = \begin{bmatrix} \frac{\partial f(A)}{\partial A_{11}} & \cdots & \frac{\partial f(A)}{\partial A_{1n}}\\ \vdots & \ddots & \vdots\\ \frac{\partial f(A)}{\partial A_{m1}} & \cdots & \frac{\partial f(A)}{\partial A_{mn}} \end{bmatrix}$
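
A brief sketch, assuming JAX: the gradient with respect to a matrix argument has the same shape as the matrix, entry $$(i, j)$$ being $$\partial f / \partial A_{ij}$$ (the function $$f(A) = \sum_{ij} A_{ij}^2$$ is made up for illustration, so $$\nabla_A f = 2A$$).

```python
import jax
import jax.numpy as jnp

def f(A):                              # f: R^{m x n} -> R
    return jnp.sum(A ** 2)

A = jnp.array([[1.0, 2.0, 3.0],
               [4.0, 5.0, 6.0]])       # m x n = 2 x 3

print(jax.grad(f)(A))                  # shape (2, 3), equals 2 * A
```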

### Hessian

For a vector-valued $$f: \mathbb R^d \to \mathbb R^p$$ we write $$f(x_1, \cdots, x_d) = \begin{bmatrix} f_1(x_1, \cdots, x_d)\\ \vdots \\ f_p(x_1, \cdots, x_d) \end{bmatrix}$$; the Hessian below is defined for the scalar-valued case $$f: \mathbb R^d \to \mathbb R$$.

Note: Hessians are square, symmetric matrices (symmetric whenever the mixed second partials are continuous).

$\nabla_x^2 f(x) = \begin{bmatrix} \frac{\partial^2f(x)}{\partial x_1^2} & \frac{\partial^2f(x)}{\partial x_1\partial x_2} & \cdots\\ \vdots & \ddots & \vdots \\ \cdots & \cdots & \frac{\partial^2f(x)}{\partial x_d^2} \end{bmatrix}$
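
A minimal check, assuming JAX (the function and point are made up for illustration): the Hessian of a scalar-valued $$f$$ is square and symmetric.

```python
import jax
import jax.numpy as jnp

def f(x):
    return x[0] ** 2 * x[1] + jnp.exp(x[1] * x[2])

x0 = jnp.array([1.0, 0.5, -0.3])
H = jax.hessian(f)(x0)

print(H.shape)                         # (3, 3): square
print(jnp.allclose(H, H.T))            # True: symmetric
```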

### Jacobian

$J = \begin{bmatrix} \dfrac{\partial \mathbf{f}}{\partial x_1} & \cdots & \dfrac{\partial \mathbf{f}}{\partial x_d}\end{bmatrix}= \begin{bmatrix} \nabla^{\mathrm T} f_1 \\ \vdots \\ \nabla^{\mathrm T} f_p \end{bmatrix}= \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_d}\\ \vdots & \ddots & \vdots\\ \dfrac{\partial f_p}{\partial x_1} & \cdots & \dfrac{\partial f_p}{\partial x_d}\end{bmatrix}$

where $$\nabla^{\mathrm T} f_i$$ is the transpose (row vector) of the gradient of the $$i$$-th component $$f_i$$.
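
A sketch, assuming JAX (the function $$f: \mathbb R^3 \to \mathbb R^2$$ is made up for illustration): the Jacobian has shape $$(p, d)$$ and its $$i$$-th row is the gradient of $$f_i$$.

```python
import jax
import jax.numpy as jnp

def f(x):                               # f: R^3 -> R^2
    return jnp.stack([x[0] * x[1], jnp.sin(x[2]) + x[0]])

x0 = jnp.array([1.0, 2.0, 3.0])
J = jax.jacobian(f)(x0)

print(J.shape)                                                # (2, 3) = (p, d)
print(jnp.allclose(J[0], jax.grad(lambda x: f(x)[0])(x0)))    # row 0 = grad f_1
```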

### Examples

• $$\nabla_xb^Tx = b$$
• $$\nabla_x^2 b^Tx = 0$$
• $$\nabla_xx^TAx = 2Ax$$, if $$A$$ is symmetric
• $$\nabla_x^2x^TAx = 2A$$, if $$A$$ is symmetric
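
A numerical spot check of these identities, assuming JAX (the vectors $$b$$, $$x$$ and the symmetric matrix $$A$$ are made up for illustration):

```python
import jax
import jax.numpy as jnp

b = jnp.array([1.0, -2.0, 3.0])
A = jnp.array([[2.0, 1.0, 0.0],
               [1.0, 3.0, 1.0],
               [0.0, 1.0, 4.0]])        # symmetric
x = jnp.array([0.5, 1.5, -1.0])

f1 = lambda x: jnp.dot(b, x)            # b^T x
f2 = lambda x: x @ A @ x                # x^T A x

print(jnp.allclose(jax.grad(f1)(x), b))             # grad b^T x   = b
print(jnp.allclose(jax.hessian(f1)(x), 0.0))        # hess b^T x   = 0
print(jnp.allclose(jax.grad(f2)(x), 2 * A @ x))     # grad x^T A x = 2Ax
print(jnp.allclose(jax.hessian(f2)(x), 2 * A))      # hess x^T A x = 2A
```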