*Authors: Vladimir Mikulik and Matteo Hessel*

Computing gradients is a critical part of modern machine learning methods. This section covers a few advanced topics in automatic differentiation as it relates to modern machine learning.

While understanding how automatic differentiation works under the hood is not crucial for using JAX in most contexts, we encourage the reader to check out this accessible video on automatic differentiation to get a deeper sense of what is going on.

The Autodiff Cookbook is a more advanced and more detailed explanation of how these ideas are implemented in the JAX backend. You don't need to understand it to do most things in JAX. However, some features (such as defining custom derivatives) depend on understanding it, so it is worth knowing that this explanation exists if you ever need to use them.

## Higher-order derivatives

JAX's autodiff makes it easy to compute higher-order derivatives, because the functions that compute derivatives are themselves differentiable. Higher-order derivatives are therefore as easy as stacking transformations.

We illustrate this in the case of a single variable:

The derivative of \(f(x) = x^3 + 2x^2 - 3x + 1\) can be computed as:

    import jax

    f = lambda x: x**3 + 2*x**2 - 3*x + 1

    dfdx = jax.grad(f)

The higher-order derivatives of \(f\) are:

\[\begin{split}\begin{array}{l}f'(x) = 3x^2 + 4x - 3\\f''(x) = 6x + 4\\f'''(x) = 6\\f^{iv}(x) = 0\end{array}\end{split}\]

Computing any of these in JAX is as easy as chaining the `grad` function:

    d2fdx = jax.grad(dfdx)
    d3fdx = jax.grad(d2fdx)
    d4fdx = jax.grad(d3fdx)

Evaluating the above at \(x=1\) would give us:

\[\begin{split}\begin{array}{l}f'(1) = 4\\f''(1) = 10\\f'''(1) = 6\\f^{iv}(1) = 0\end{array}\end{split}\]

With JAX:

    print(dfdx(1.))
    print(d2fdx(1.))
    print(d3fdx(1.))
    print(d4fdx(1.))

    4.0
    10.0
    6.0
    0.0

In the multivariable case, higher-order derivatives are more complicated. The second-order derivative of a function is represented by its Hessian matrix, defined according to

\[(\mathbf{H}f)_{i,j} = \frac{\partial^2 f}{\partial_i\partial_j}.\]

The Hessian of a real-valued function of several variables, \(f:\mathbb R^n\to\mathbb R\), can be identified with the Jacobian of its gradient. JAX provides two transformations for computing the Jacobian of a function, `jax.jacfwd` and `jax.jacrev`, corresponding to forward-mode and reverse-mode autodiff. They give the same answer, but one can be more efficient than the other in different circumstances; see the video about automatic differentiation mentioned above for an explanation.
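As a small sanity check of the claim that both modes give the same answer, the sketch below compares `jax.jacfwd` and `jax.jacrev` on a test function `g`. The function `g` and the input `x` are assumptions made up for this illustration; they do not appear elsewhere in this section.

```python
import jax
import jax.numpy as jnp

# Hypothetical test function g: R^3 -> R^2, chosen only for illustration.
def g(x):
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])

jac_fwd = jax.jacfwd(g)(x)  # forward mode: built one input direction at a time
jac_rev = jax.jacrev(g)(x)  # reverse mode: built one output direction at a time

print(jac_fwd.shape)                   # one row per output, one column per input
print(jnp.allclose(jac_fwd, jac_rev))  # the two modes agree numerically
```

As a rough rule of thumb, forward mode tends to suit functions with few inputs and many outputs, and reverse mode the opposite, which is why gradients of scalar losses use reverse mode.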

    def hessian(f):
        return jax.jacfwd(jax.grad(f))

Let's double check that this is correct on the dot product \(f:\mathbf{x}\mapsto\mathbf{x}^\top\mathbf{x}\).

If \(i=j\), \(\frac{\partial^2 f}{\partial_i\partial_j}(\mathbf{x}) = 2\). Otherwise, \(\frac{\partial^2 f}{\partial_i\partial_j}(\mathbf{x}) = 0\).

    import jax.numpy as jnp

    def f(x):
        return jnp.dot(x, x)

    hessian(f)(jnp.array([1., 2., 3.]))

    DeviceArray([[2., 0., 0.],
                 [0., 2., 0.],
                 [0., 0., 2.]], dtype=float32)

Often, however, we are not interested in computing the full Hessian itself, and doing so can be very inefficient. The Autodiff Cookbook explains some tricks, such as the Hessian-vector product, that let you use the Hessian without materialising the whole matrix.
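To make the Hessian-vector product idea concrete, here is a minimal sketch using the forward-over-reverse pattern: a `jax.jvp` applied to `jax.grad(f)`. The helper name `hvp` and the vectors `x` and `v` are chosen for this illustration; `f` is the same dot-product function used above.

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.dot(x, x)  # same dot-product example as above

def hvp(f, x, v):
    # Forward-over-reverse: push the tangent v through grad(f) with jvp.
    # The second output of jax.jvp is the directional derivative of
    # grad(f) at x along v, i.e. H(x) @ v, without ever forming H.
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

x = jnp.array([1.0, 2.0, 3.0])
v = jnp.array([1.0, 0.0, -1.0])

print(hvp(f, x, v))  # the Hessian is 2*I here, so this equals 2 * v
```

The product costs roughly a small constant multiple of one gradient evaluation, whereas the full Hessian of a function of \(n\) inputs needs \(n^2\) entries.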

If you want to work with higher order derivatives in JAX, we strongly recommend that you read the Autodiff Cookbook.

## Higher-order optimization

Some meta-learning techniques, such as Model-Agnostic Meta-Learning (MAML), require differentiating through gradient updates. In other frameworks this can be quite cumbersome, but in JAX it is much easier:

    # `loss_fn`, `lr`, `params` and `data` are assumed to be defined elsewhere.
    def meta_loss_fn(params, data):
        """Computes the loss after one step of SGD."""
        grads = jax.grad(loss_fn)(params, data)
        return loss_fn(params - lr * grads, data)

    meta_grads = jax.grad(meta_loss_fn)(params, data)
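The snippet above leaves `loss_fn`, `lr`, `params` and `data` undefined. The following is a runnable sketch that fills them in with hypothetical choices (a mean-squared-error loss for a linear model, a step size of 0.01 and a tiny made-up dataset), purely to show that the outer `jax.grad` differentiates through the inner gradient step.

```python
import jax
import jax.numpy as jnp

lr = 0.01  # hypothetical inner-loop step size

def loss_fn(params, data):
    # Hypothetical loss: mean squared error of a linear model.
    x, y = data
    return jnp.mean((jnp.dot(x, params) - y) ** 2)

def meta_loss_fn(params, data):
    """Computes the loss after one step of SGD."""
    grads = jax.grad(loss_fn)(params, data)
    return loss_fn(params - lr * grads, data)

# Made-up parameters and data, for illustration only.
params = jnp.array([1.0, -1.0])
data = (jnp.array([[1.0, 2.0], [3.0, 4.0]]), jnp.array([1.0, 2.0]))

# Differentiates through the SGD step, so second derivatives of loss_fn appear.
meta_grads = jax.grad(meta_loss_fn)(params, data)
print(meta_grads.shape)  # same shape as params
```

In a full MAML setup the inner step and the outer loss would typically use different data; the same data is reused here only to keep the sketch short.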

## Stopping gradients

Autodiff lets us automatically compute the gradient of a function with respect to its inputs. Sometimes, however, we want additional control: for instance, we might want to avoid backpropagating gradients through some subset of the computational graph.

Consider, for instance, the TD(0) (temporal difference) reinforcement learning update. It is used to learn to estimate the *value* of a state in an environment from the experience of interacting with that environment. Let's assume that the value estimate \(v_{\theta}(s_{t-1})\) in a state \(s_{t-1}\) is parameterised by a linear function.

    # Value function and initial parameters
    value_fn = lambda theta, state: jnp.dot(theta, state)
    theta = jnp.array([0.1, -0.1, 0.])

Consider a transition from a state \(s_{t-1}\) to a state \(s_t\) during which we observed the reward \(r_t\):

    # An example transition.
    s_tm1 = jnp.array([1., 2., -1.])
    r_t = jnp.array(1.)
    s_t = jnp.array([2., 1., 0.])

The TD(0) update to the network parameters is:

\[\Delta \theta = (r_t + v_{\theta}(s_t) - v_{\theta}(s_{t-1})) \nabla v_{\theta}(s_{t-1})\]

This update is not the gradient of a loss function.

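However, it can be **written** as the gradient of the pseudo loss function

\[L(\theta) = [r_t + v_{\theta}(s_t) - v_{\theta}(s_{t-1})]^2\]

if the dependency of the target \(r_t + v_{\theta}(s_t)\) on the parameter \(\theta\) is ignored. (With the target held fixed, \(\nabla_\theta L(\theta) = -2\,\Delta\theta\), so a gradient-descent step on \(L\) moves \(\theta\) in the direction of \(\Delta\theta\).)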

How can we implement this in JAX? Writing the pseudoloss naively, we get:

    def td_loss(theta, s_tm1, r_t, s_t):
        v_tm1 = value_fn(theta, s_tm1)
        target = r_t + value_fn(theta, s_t)
        return (target - v_tm1) ** 2

    td_update = jax.grad(td_loss)
    delta_theta = td_update(theta, s_tm1, r_t, s_t)

    delta_theta

    DeviceArray([ 2.4, -2.4,  2.4], dtype=float32)

But `td_update` will **not** compute a TD(0) update, because the gradient computation will include the dependency of `target` on \(\theta\).

We can use `jax.lax.stop_gradient` to force JAX to ignore the dependency of the target on \(\theta\):

    def td_loss(theta, s_tm1, r_t, s_t):
        v_tm1 = value_fn(theta, s_tm1)
        target = r_t + value_fn(theta, s_t)
        return (jax.lax.stop_gradient(target) - v_tm1) ** 2

    td_update = jax.grad(td_loss)
    delta_theta = td_update(theta, s_tm1, r_t, s_t)

    delta_theta

    DeviceArray([-2.4, -4.8,  2.4], dtype=float32)

This treats `target` as if it did **not** depend on the parameters \(\theta\) and computes the correct update to the parameters.
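One way to convince yourself that the `stop_gradient` version is the right one is to compare it against the TD(0) formula written out by hand. The sketch below does that with the same example transition; the names `td_error`, `grad_v_tm1`, `manual_update` and `pseudo_grad` are introduced here for illustration. Note that the gradient of the squared pseudo loss equals the hand-written update scaled by \(-2\); in practice that factor is absorbed by the sign and size of the optimiser step.

```python
import jax
import jax.numpy as jnp

value_fn = lambda theta, state: jnp.dot(theta, state)

def td_loss(theta, s_tm1, r_t, s_t):
    v_tm1 = value_fn(theta, s_tm1)
    target = r_t + value_fn(theta, s_t)
    return (jax.lax.stop_gradient(target) - v_tm1) ** 2

theta = jnp.array([0.1, -0.1, 0.0])
s_tm1 = jnp.array([1.0, 2.0, -1.0])
r_t = jnp.array(1.0)
s_t = jnp.array([2.0, 1.0, 0.0])

# TD error and gradient of the value estimate at s_{t-1} w.r.t. theta.
td_error = r_t + value_fn(theta, s_t) - value_fn(theta, s_tm1)
grad_v_tm1 = jax.grad(value_fn)(theta, s_tm1)

# Hand-written TD(0) update from the formula in the text.
manual_update = td_error * grad_v_tm1

# Gradient of the pseudo loss with the target held fixed.
pseudo_grad = jax.grad(td_loss)(theta, s_tm1, r_t, s_t)
print(jnp.allclose(pseudo_grad, -2.0 * manual_update))
```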

`jax.lax.stop_gradient` can also be useful in other settings, for instance when you want the gradient of some loss to affect only a subset of the parameters of a neural network (because, for instance, the other parameters are trained with a different loss).
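As a sketch of that second use case, the example below blocks the gradient of a loss from reaching one group of parameters. The two-part parameter dictionary (`"encoder"` and `"head"`), the loss and the input are all hypothetical and exist only for this illustration.

```python
import jax
import jax.numpy as jnp

# Hypothetical two-part model: an "encoder" weight and a "head" weight.
params = {"encoder": jnp.array([1.0, 2.0]), "head": jnp.array([0.5, -0.5])}

def loss_fn(params, x):
    # Block gradients into the encoder; only the head is trained by this loss.
    features = jax.lax.stop_gradient(params["encoder"]) * x
    return jnp.sum(params["head"] * features) ** 2

grads = jax.grad(loss_fn)(params, jnp.array([1.0, 1.0]))
print(grads["encoder"])  # zeros: this loss does not update the encoder
print(grads["head"])     # non-zero: the head still receives a gradient
```

The encoder parameters can then be updated by a separate loss without this one interfering.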

## Straight-through estimator using `stop_gradient`

The straight-through estimator is a trick for defining a "gradient" of a function that is otherwise non-differentiable. Given a non-differentiable function \(f : \mathbb{R}^n \to \mathbb{R}^n\) that is used as part of a larger function whose gradient we want, we simply pretend during the backward pass that \(f\) is the identity function. This can be implemented neatly with `jax.lax.stop_gradient`:

    def f(x):
        return jnp.round(x)  # non-differentiable

    def straight_through_f(x):
        # Create an exactly-zero expression with Sterbenz lemma that has
        # an exactly-one gradient.
        zero = x - jax.lax.stop_gradient(x)
        return zero + jax.lax.stop_gradient(f(x))

    print("f(x):", f(3.2))
    print("straight_through_f(x):", straight_through_f(3.2))

    print("grad(f)(x):", jax.grad(f)(3.2))
    print("grad(straight_through_f)(x):", jax.grad(straight_through_f)(3.2))

    f(x): 3.0
    straight_through_f(x): 3.0
    grad(f)(x): 0.0
    grad(straight_through_f)(x): 1.0
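The estimator matters most when the non-differentiable step sits inside a larger function. The sketch below, which assumes a hypothetical loss with a rounding step in the middle, shows that the plain version yields a zero gradient for the weight `w`, while the straight-through version passes a usable gradient back. The names `loss_plain`, `loss_st`, `w`, `x` and `y` are made up for this example.

```python
import jax
import jax.numpy as jnp

def straight_through_round(x):
    # Forward pass: exactly round(x). Backward pass: identity.
    zero = x - jax.lax.stop_gradient(x)
    return zero + jax.lax.stop_gradient(jnp.round(x))

def loss_plain(w, x, y):
    # Hypothetical loss with a hard rounding step: gradient w.r.t. w is zero.
    return jnp.sum((jnp.round(w * x) - y) ** 2)

def loss_st(w, x, y):
    # Same forward value, but gradients pass straight through the rounding.
    return jnp.sum((straight_through_round(w * x) - y) ** 2)

w = jnp.array(1.3)
x = jnp.array([1.0, 2.0])
y = jnp.array([2.0, 2.0])

print(loss_plain(w, x, y), loss_st(w, x, y))  # identical forward values
print(jax.grad(loss_plain)(w, x, y))          # zero: rounding blocks the gradient
print(jax.grad(loss_st)(w, x, y))             # non-zero straight-through gradient
```

The straight-through gradient is a biased surrogate rather than a true derivative, so whether it helps depends on the problem.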

## Per-example gradients

Most ML systems compute gradients and updates from batches of data, for reasons of computational efficiency and/or variance reduction. Sometimes, however, it is necessary to have access to the gradient/update associated with each specific sample in the batch.

This is required, for example, to prioritize data based on gradient size or to apply clipping/normalization on a sample-by-sample basis.

In many frameworks (PyTorch, TF, Theano) it is often not trivial to compute per-example gradients, because the library directly accumulates the gradient over the batch. Naive workarounds, such as computing a separate loss per example and then aggregating the resulting gradients, are typically very inefficient.

In JAX we can define the code to calculate the gradient per sample in a simple but efficient way.

Just combine the `jit`, `vmap` and `grad` transformations:

    perex_grads = jax.jit(jax.vmap(jax.grad(td_loss), in_axes=(None, 0, 0, 0)))

    # Test it:
    batched_s_tm1 = jnp.stack([s_tm1, s_tm1])
    batched_r_t = jnp.stack([r_t, r_t])
    batched_s_t = jnp.stack([s_t, s_t])

    perex_grads(theta, batched_s_tm1, batched_r_t, batched_s_t)

    DeviceArray([[-2.4, -4.8,  2.4],
                 [-2.4, -4.8,  2.4]], dtype=float32)

Let's walk through this one transformation at a time.

First, we apply `jax.grad` to `td_loss` to obtain a function that computes the gradient of the loss with respect to the parameters on single (unbatched) inputs:

    dtdloss_dtheta = jax.grad(td_loss)

    dtdloss_dtheta(theta, s_tm1, r_t, s_t)

    DeviceArray([-2.4, -4.8,  2.4], dtype=float32)

This function calculates a row of the above matrix.

Then we vectorise this function using `jax.vmap`. This adds a batch dimension to all inputs and outputs. Now, given a batch of inputs, we produce a batch of outputs: each output in the batch is the gradient for the corresponding member of the input batch.

    almost_perex_grads = jax.vmap(dtdloss_dtheta)

    batched_theta = jnp.stack([theta, theta])
    almost_perex_grads(batched_theta, batched_s_tm1, batched_r_t, batched_s_t)

    DeviceArray([[-2.4, -4.8,  2.4],
                 [-2.4, -4.8,  2.4]], dtype=float32)

This isn't quite what we want, because we have to manually feed this function a batch of `theta`s, whereas we actually want to use a single `theta`. We fix this by passing `in_axes` to `jax.vmap`, specifying `theta` as `None` and the other arguments as `0`. The resulting function then adds an extra axis only to the other arguments and leaves `theta` unbatched, as we want:

    inefficient_perex_grads = jax.vmap(dtdloss_dtheta, in_axes=(None, 0, 0, 0))

    inefficient_perex_grads(theta, batched_s_tm1, batched_r_t, batched_s_t)

    DeviceArray([[-2.4, -4.8,  2.4],
                 [-2.4, -4.8,  2.4]], dtype=float32)

Almost there! This does what we want, but it is slower than it has to be. Now we wrap the whole thing in a `jax.jit` to get the compiled, efficient version of the same function:

    perex_grads = jax.jit(inefficient_perex_grads)

    perex_grads(theta, batched_s_tm1, batched_r_t, batched_s_t)

    DeviceArray([[-2.4, -4.8,  2.4],
                 [-2.4, -4.8,  2.4]], dtype=float32)

    %timeit inefficient_perex_grads(theta, batched_s_tm1, batched_r_t, batched_s_t).block_until_ready()
    %timeit perex_grads(theta, batched_s_tm1, batched_r_t, batched_s_t).block_until_ready()

    100 loops, best of 5: 7.74 ms per loop
    10000 loops, best of 5: 86.2 µs per loop
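As an illustration of the per-sample clipping use case mentioned at the start of this section, the sketch below clips each per-example gradient to a maximum L2 norm before averaging. The helper `clip_by_norm`, the threshold of 1.0 and the second transition in the batch are assumptions made for this example, not part of the tutorial's code.

```python
import jax
import jax.numpy as jnp

value_fn = lambda theta, state: jnp.dot(theta, state)

def td_loss(theta, s_tm1, r_t, s_t):
    v_tm1 = value_fn(theta, s_tm1)
    target = r_t + value_fn(theta, s_t)
    return (jax.lax.stop_gradient(target) - v_tm1) ** 2

perex_grads = jax.jit(jax.vmap(jax.grad(td_loss), in_axes=(None, 0, 0, 0)))

def clip_by_norm(g, max_norm):
    # Hypothetical helper: rescale one gradient so its L2 norm is at most max_norm.
    norm = jnp.linalg.norm(g)
    return g * jnp.minimum(1.0, max_norm / (norm + 1e-6))

theta = jnp.array([0.1, -0.1, 0.0])
# A batch of two transitions; the second one is made up and has a tiny gradient.
batched_s_tm1 = jnp.array([[1.0, 2.0, -1.0], [0.1, 0.0, 0.0]])
batched_r_t = jnp.array([1.0, 0.0])
batched_s_t = jnp.array([[2.0, 1.0, 0.0], [0.0, 0.1, 0.0]])

grads = perex_grads(theta, batched_s_tm1, batched_r_t, batched_s_t)

# Clip each row (one gradient per sample) independently, then average.
clipped = jax.vmap(clip_by_norm, in_axes=(0, None))(grads, 1.0)
mean_grad = clipped.mean(axis=0)

print(jnp.linalg.norm(grads, axis=1))    # per-sample norms before clipping
print(jnp.linalg.norm(clipped, axis=1))  # per-sample norms after clipping
```

Clipping per sample bounds the influence of any single example on the averaged update, which is not possible once the gradients have already been summed over the batch.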