Advanced Automatic Differentiation in JAX - JAX Documentation (2023)


Authors: Vlatimir Mikulik and Matteo Hessel

Computing gradients is a critical part of modern machine learning methods. This section considers a few advanced topics in automatic differentiation as it relates to modern machine learning.

While understanding how automatic differentiation works under the hood is not crucial for using JAX in most contexts, we encourage the reader to check out this quite accessible video to get a deeper sense of what is going on.

The Autodiff Cookbook is a more advanced and more detailed explanation of how these ideas are implemented in the JAX backend. You don't need to understand it to do most things in JAX. However, some features (e.g. defining custom derivatives) depend on understanding it, so it is worth knowing that this explanation exists if you ever need to use them.

Higher-order derivatives#

JAX's autodiff makes it easy to compute higher-order derivatives, because the functions that compute derivatives are themselves differentiable. Thus, higher-order derivatives are as easy as stacking transformations.

We illustrate this in the case of a single variable:

The derivative of \(f(x) = x^3 + 2x^2 - 3x + 1\) can be computed as:

import jax

f = lambda x: x**3 + 2*x**2 - 3*x + 1

dfdx = jax.grad(f)

The higher-order derivatives of \(f\) are:

\[\begin{split}\begin{array}{l}f'(x) = 3x^2 + 4x - 3\\f''(x) = 6x + 4\\f'''(x) = 6\\f^{iv}(x) = 0\end{array}\end{split}\]

Computing any of these in JAX is as easy as chaining the grad function:

d2fdx = jax.grad(dfdx)
d3fdx = jax.grad(d2fdx)
d4fdx = jax.grad(d3fdx)

Evaluating the above at \(x=1\) would give us:

\[\begin{split}\begin{array}{l}f'(1) = 4\\f''(1) = 10\\f'''(1) = 6\\f^{iv}(1) = 0\end{array}\end{split}\]

With JAX:
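A minimal, self-contained sketch of this evaluation (it redefines f and the derivative functions so that it can be run on its own):

```python
import jax

f = lambda x: x**3 + 2*x**2 - 3*x + 1

# Each derivative is obtained by applying jax.grad to the previous one.
dfdx = jax.grad(f)
d2fdx = jax.grad(dfdx)
d3fdx = jax.grad(d2fdx)
d4fdx = jax.grad(d3fdx)

# jax.grad needs a floating-point input, hence 1. rather than 1
print(dfdx(1.))   # 4.0
print(d2fdx(1.))  # 10.0
print(d3fdx(1.))  # 6.0
print(d4fdx(1.))  # 0.0
```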


In the multivariable case, higher-order derivatives are more complicated. The second-order derivative of a function is represented by its Hessian matrix, defined according to

\[(\mathbf{H}f)_{i,j} = \frac{\partial^2 f}{\partial_i\partial_j}.\]

The Hessian of a real-valued function of several variables, \(f: \mathbb R^n\to\mathbb R\), can be identified with the Jacobian of its gradient. JAX provides two transformations for computing the Jacobian of a function, jax.jacfwd and jax.jacrev, corresponding to forward- and reverse-mode autodiff. They give the same answer, but one can be more efficient than the other in different circumstances; see the video about autodiff linked above for an explanation.

def hessian(f):
    return jax.jacfwd(jax.grad(f))

Let's double check this is correct on the dot product \(f: \mathbf{x} \mapsto \mathbf{x}^\top \mathbf{x}\).

If \(i=j\), \(\frac{\partial^2 f}{\partial_i\partial_j}(\mathbf{x}) = 2\). Otherwise, \(\frac{\partial^2 f}{\partial_i\partial_j}(\mathbf{x}) = 0\).

import jax.numpy as jnp

def f(x):
    return jnp.dot(x, x)

hessian(f)(jnp.array([1., 2., 3.]))
DeviceArray([[2., 0., 0.],
             [0., 2., 0.],
             [0., 0., 2.]], dtype=float32)
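Since forward- and reverse-mode Jacobians agree, the same Hessian can be built with either one. A minimal sketch, where hessian_fwd and hessian_rev are hypothetical helper names used only for this comparison:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.dot(x, x)

def hessian_fwd(f):
    return jax.jacfwd(jax.grad(f))  # forward-over-reverse

def hessian_rev(f):
    return jax.jacrev(jax.grad(f))  # reverse-over-reverse

x = jnp.array([1., 2., 3.])
h_fwd = hessian_fwd(f)(x)
h_rev = hessian_rev(f)(x)

print(h_fwd)                       # 2 on the diagonal, 0 elsewhere
print(jnp.allclose(h_fwd, h_rev))  # both modes give the same matrix
```

Which of the two is faster depends on the shapes involved, so it is worth timing both on the function at hand.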

Often, however, we aren't interested in computing the full Hessian itself, and doing so can be very inefficient. The Autodiff Cookbook explains some tricks, such as the Hessian-vector product, that allow you to use the Hessian without materialising the whole matrix.
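As an illustration of the Hessian-vector product idea, here is a minimal sketch using forward-over-reverse differentiation. The helper name hvp is an assumption made for this example; jax.jvp and jax.grad are the only JAX calls it relies on.

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.dot(x, x)

def hvp(f, x, v):
    # Differentiate grad(f) at x along direction v.
    # jax.jvp returns (primal_out, tangent_out); the tangent is H(x) @ v.
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

x = jnp.array([1., 2., 3.])
v = jnp.array([1., 0., -1.])

# The Hessian of the dot product is 2 * I, so the result equals 2 * v.
print(hvp(f, x, v))
```

The full \(n \times n\) Hessian is never formed here; only a vector of the same shape as x is returned.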

If you want to work with higher order derivatives in JAX, we strongly recommend that you read the Autodiff Cookbook.

Higher-order optimization#

Some meta-learning techniques, such as Model-Agnostic Meta-Learning (MAML), require differentiating through gradient updates. In other frameworks this can be quite cumbersome, but in JAX it is much easier:

def meta_loss_fn(params, data):
    """Computes the loss after one step of SGD."""
    grads = jax.grad(loss_fn)(params, data)
    return loss_fn(params - lr * grads, data)

meta_grads = jax.grad(meta_loss_fn)(params, data)
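The snippet above leaves loss_fn, lr, params and data undefined. A runnable sketch under assumed definitions (a toy quadratic loss and a step size of 0.1, both chosen only so the result can be checked by hand):

```python
import jax
import jax.numpy as jnp

lr = 0.1  # hypothetical inner-loop step size

def loss_fn(params, data):
    # Hypothetical toy loss: squared distance between params and data.
    return jnp.sum((params - data) ** 2)

def meta_loss_fn(params, data):
    """Computes the loss after one step of SGD."""
    grads = jax.grad(loss_fn)(params, data)
    return loss_fn(params - lr * grads, data)

params = jnp.array([3., 0.])
data = jnp.array([1., 1.])

# Differentiates through the inner gradient step.
meta_grads = jax.grad(meta_loss_fn)(params, data)

# For this toy loss, meta_loss = (1 - 2*lr)**2 * sum((params - data)**2),
# so the meta-gradient has a closed form to compare against.
expected = 2 * (1 - 2 * lr) ** 2 * (params - data)
print(meta_grads, expected)
```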

Stopping gradients#

Autodiff enables automatic computation of the gradient of a function with respect to its inputs. Sometimes, however, we might want some additional control: for instance, we might want to avoid back-propagating gradients through some subset of the computational graph.

Consider, for instance, the TD(0) (temporal difference) reinforcement learning update. This is used to learn to estimate the value of a state in an environment from experience of interacting with the environment. Let's assume the value estimate \(v_{\theta}(s_{t-1})\) in a state \(s_{t-1}\) is parameterised by a linear function.

# Value function and initial parameters
value_fn = lambda theta, state: jnp.dot(theta, state)
theta = jnp.array([0.1, -0.1, 0.])

Consider a transition from a state \(s_{t-1}\) to a state \(s_t\) during which we observed the reward \(r_t\).

# An example transition.
s_tm1 = jnp.array([1., 2., -1.])
r_t = jnp.array(1.)
s_t = jnp.array([2., 1., 0.])

The TD(0) update to the network parameters is:

\[\Delta \theta = (r_t + v_{\theta}(s_t) - v_{\theta}(s_{t-1})) \nabla v_{\theta}(s_{t-1})\]

This update is not the gradient of a loss function.

However, up to a constant factor, it can be written as the gradient of the pseudo loss function

\[L(\theta) = [r_t + v_{\theta}(s_t) - v_{\theta}(s_{t-1})]^2\]

if the dependency of the target \(r_t + v_{\theta}(s_t)\) on the parameter \(\theta\) is ignored.

How can we implement this in JAX? If we write the pseudo loss naively, we get:

def td_loss(theta, s_tm1, r_t, s_t):
    v_tm1 = value_fn(theta, s_tm1)
    target = r_t + value_fn(theta, s_t)
    return (target - v_tm1) ** 2

td_update = jax.grad(td_loss)
delta_theta = td_update(theta, s_tm1, r_t, s_t)

delta_theta
DeviceArray([ 2.4, -2.4,  2.4], dtype=float32)

But td_update will not compute a TD(0) update, because the gradient computation will include the dependency of target on \(\theta\).

We can use jax.lax.stop_gradient to force JAX to ignore the dependency of the target on \(\theta\):

def td_loss(theta, s_tm1, r_t, s_t):
    v_tm1 = value_fn(theta, s_tm1)
    target = r_t + value_fn(theta, s_t)
    return (jax.lax.stop_gradient(target) - v_tm1) ** 2

td_update = jax.grad(td_loss)
delta_theta = td_update(theta, s_tm1, r_t, s_t)

delta_theta
DeviceArray([-2.4, -4.8,  2.4], dtype=float32)

This will treat target as if it did not depend on the parameters \(\theta\) and compute the correct update to the parameters.

jax.lax.stop_gradient may also be useful in other settings, for instance if you want the gradient from some loss to affect only a subset of the parameters of the neural network (because, for instance, the other parameters are trained using a different loss).
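To see the effect of jax.lax.stop_gradient side by side, here is a self-contained sketch of the TD(0) example. The name naive_td_loss is introduced only to keep both variants in scope, and the last lines compute the update directly from its definition for comparison.

```python
import jax
import jax.numpy as jnp

value_fn = lambda theta, state: jnp.dot(theta, state)

theta = jnp.array([0.1, -0.1, 0.])
s_tm1 = jnp.array([1., 2., -1.])
r_t = jnp.array(1.)
s_t = jnp.array([2., 1., 0.])

def naive_td_loss(theta, s_tm1, r_t, s_t):
    v_tm1 = value_fn(theta, s_tm1)
    target = r_t + value_fn(theta, s_t)
    return (target - v_tm1) ** 2

def td_loss(theta, s_tm1, r_t, s_t):
    v_tm1 = value_fn(theta, s_tm1)
    target = r_t + value_fn(theta, s_t)
    # The target is treated as a constant during differentiation.
    return (jax.lax.stop_gradient(target) - v_tm1) ** 2

naive_grad = jax.grad(naive_td_loss)(theta, s_tm1, r_t, s_t)
semi_grad = jax.grad(td_loss)(theta, s_tm1, r_t, s_t)

# TD(0) update computed directly from its definition.
td_error = r_t + value_fn(theta, s_t) - value_fn(theta, s_tm1)
delta_theta = td_error * jax.grad(value_fn)(theta, s_tm1)

print(naive_grad)   # includes the gradient flowing through the target
print(semi_grad)    # target held fixed
print(delta_theta)  # equals -0.5 * semi_grad
```

The gradient of the squared pseudo loss is \(-2\) times the TD(0) update, so a gradient-descent step on the pseudo loss moves \(\theta\) in the direction of that update.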

Straight-through estimator using stop_gradient#

The straight-through estimator is a trick for defining a "gradient" of a function that is otherwise non-differentiable. Given a non-differentiable function \(f : \mathbb{R}^n \to \mathbb{R}^n\) that is used as part of a larger function that we wish to find a gradient of, we simply pretend during the backward pass that \(f\) is the identity function. This can be implemented neatly using jax.lax.stop_gradient:

def f(x):
    return jnp.round(x)  # non-differentiable

def straight_through_f(x):
    # Create an exactly-zero expression with Sterbenz lemma that has
    # an exactly-one gradient.
    zero = x - jax.lax.stop_gradient(x)
    return zero + jax.lax.stop_gradient(f(x))

print("f(x):", f(3.2))
print("straight_through_f(x):", straight_through_f(3.2))

print("grad(f)(x):", jax.grad(f)(3.2))
print("grad(straight_through_f)(x):", jax.grad(straight_through_f)(3.2))
f(x): 3.0
straight_through_f(x): 3.0
grad(f)(x): 0.0
grad(straight_through_f)(x): 1.0
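The estimator is mostly useful inside a larger function. A minimal sketch, assuming a hypothetical loss that squares the rounded values, compares the gradient with and without the straight-through trick:

```python
import jax
import jax.numpy as jnp

def straight_through_round(x):
    # Forward pass: round(x). Backward pass: identity.
    zero = x - jax.lax.stop_gradient(x)
    return zero + jax.lax.stop_gradient(jnp.round(x))

def loss(x):
    # Hypothetical larger function that uses the rounded value.
    return jnp.sum(straight_through_round(x) ** 2)

def loss_plain(x):
    # Same computation without the straight-through trick.
    return jnp.sum(jnp.round(x) ** 2)

x = jnp.array([3.2, -1.7])

print(jax.grad(loss)(x))        # 2 * round(x), as if round were the identity
print(jax.grad(loss_plain)(x))  # zeros, because round has zero gradient
```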

Per-example gradients#

While most ML systems compute gradients and updates from batches of data, for reasons of computational efficiency and/or variance reduction, it is sometimes necessary to have access to the gradient/update associated with each specific sample in the batch.

This is required, for example, to prioritize data based on gradient size or to apply clipping/normalization on a sample-by-sample basis.

In many frameworks (PyTorch, TF, Theano) it is often not trivial to compute per-example gradients, because the library directly accumulates the gradient over the batch. Naive workarounds, such as computing a separate loss per example and then aggregating the resulting gradients, are typically very inefficient.

In JAX we can define the code to calculate the gradient per sample in a simple but efficient way.

Just combine the jit, vmap and grad transformations together:

perex_grads = jax.jit(jax.vmap(jax.grad(td_loss), in_axes=(None, 0, 0, 0)))

# Test it:
batched_s_tm1 = jnp.stack([s_tm1, s_tm1])
batched_r_t = jnp.stack([r_t, r_t])
batched_s_t = jnp.stack([s_t, s_t])

perex_grads(theta, batched_s_tm1, batched_r_t, batched_s_t)
DeviceArray([[-2.4, -4.8,  2.4],
             [-2.4, -4.8,  2.4]], dtype=float32)

Let's walk through this one transformation at a time.

First, we apply jax.grad to td_loss to obtain a function that computes the gradient of the loss with respect to the parameters on single (unbatched) inputs:

dtdloss_dtheta = jax.grad(td_loss)

dtdloss_dtheta(theta, s_tm1, r_t, s_t)
DeviceArray([-2.4, -4.8,  2.4], dtype=float32)

This function computes one row of the array above.

Then, we vectorise this function using jax.vmap. This adds a batch dimension to all inputs and outputs. Now, given a batch of inputs, we produce a batch of outputs: each output in the batch corresponds to the gradient for the corresponding member of the input batch.

almost_perex_grads = jax.vmap(dtdloss_dtheta)

batched_theta = jnp.stack([theta, theta])
almost_perex_grads(batched_theta, batched_s_tm1, batched_r_t, batched_s_t)
DeviceArray([[-2.4, -4.8,  2.4],
             [-2.4, -4.8,  2.4]], dtype=float32)

This isn't quite what we want, because we have to manually feed this function a batch of thetas, whereas we actually want to use a single theta. We fix this by adding in_axes to the jax.vmap call, specifying theta as None and the other arguments as 0. This makes the resulting function add an extra axis only to the other arguments, leaving theta unbatched, as we want:

inefficient_perex_grads = jax.vmap(dtdloss_dtheta, in_axes=(None, 0, 0, 0))

inefficient_perex_grads(theta, batched_s_tm1, batched_r_t, batched_s_t)
DeviceArray([[-2.4, -4.8,  2.4],
             [-2.4, -4.8,  2.4]], dtype=float32)
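To make the role of in_axes concrete, the following self-contained sketch checks the vmapped gradient against a plain Python loop. The second transition in the batch is a made-up example added so that the two rows differ.

```python
import jax
import jax.numpy as jnp

value_fn = lambda theta, state: jnp.dot(theta, state)

def td_loss(theta, s_tm1, r_t, s_t):
    v_tm1 = value_fn(theta, s_tm1)
    target = r_t + value_fn(theta, s_t)
    return (jax.lax.stop_gradient(target) - v_tm1) ** 2

theta = jnp.array([0.1, -0.1, 0.])
# Hypothetical batch of two different transitions.
batched_s_tm1 = jnp.array([[1., 2., -1.], [0., 1., 1.]])
batched_r_t = jnp.array([1., 0.5])
batched_s_t = jnp.array([[2., 1., 0.], [1., 0., 2.]])

# theta is shared (None); the other arguments are mapped over axis 0.
vmapped = jax.vmap(jax.grad(td_loss), in_axes=(None, 0, 0, 0))
per_example = vmapped(theta, batched_s_tm1, batched_r_t, batched_s_t)

# Reference: one jax.grad call per example, stacked afterwards.
looped = jnp.stack([
    jax.grad(td_loss)(theta, s, r, s_next)
    for s, r, s_next in zip(batched_s_tm1, batched_r_t, batched_s_t)
])

print(per_example.shape)                  # (2, 3): one gradient per example
print(jnp.allclose(per_example, looped))
```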

We're almost there! This does what we want, but is slower than it has to be. Now, we wrap the whole thing in a jax.jit to get the compiled, efficient version of the same function:

perex_grads = jax.jit(inefficient_perex_grads)

perex_grads(theta, batched_s_tm1, batched_r_t, batched_s_t)
DeviceArray([[-2.4, -4.8,  2.4],
             [-2.4, -4.8,  2.4]], dtype=float32)

%timeit inefficient_perex_grads(theta, batched_s_tm1, batched_r_t, batched_s_t).block_until_ready()
%timeit perex_grads(theta, batched_s_tm1, batched_r_t, batched_s_t).block_until_ready()
100 loops, best of 5: 7.74 ms per loop
10000 loops, best of 5: 86.2 µs per loop
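Per-example gradients make the sample-by-sample clipping mentioned earlier straightforward. A minimal sketch, where clipped_mean_grad and max_norm are hypothetical names introduced for illustration and the clipping rule (rescaling by the L2 norm) is one common choice among several:

```python
import jax
import jax.numpy as jnp

value_fn = lambda theta, state: jnp.dot(theta, state)

def td_loss(theta, s_tm1, r_t, s_t):
    v_tm1 = value_fn(theta, s_tm1)
    target = r_t + value_fn(theta, s_t)
    return (jax.lax.stop_gradient(target) - v_tm1) ** 2

perex_grads = jax.jit(jax.vmap(jax.grad(td_loss), in_axes=(None, 0, 0, 0)))

def clipped_mean_grad(theta, s_tm1, r_t, s_t, max_norm=1.0):
    grads = perex_grads(theta, s_tm1, r_t, s_t)            # (batch, dim)
    norms = jnp.linalg.norm(grads, axis=1, keepdims=True)  # (batch, 1)
    # Scale down only the gradients whose norm exceeds max_norm.
    scale = jnp.minimum(1.0, max_norm / (norms + 1e-6))
    return jnp.mean(grads * scale, axis=0)

theta = jnp.array([0.1, -0.1, 0.])
s_tm1 = jnp.array([1., 2., -1.])
r_t = jnp.array(1.)
s_t = jnp.array([2., 1., 0.])

batched_s_tm1 = jnp.stack([s_tm1, s_tm1])
batched_r_t = jnp.stack([r_t, r_t])
batched_s_t = jnp.stack([s_t, s_t])

clipped = clipped_mean_grad(theta, batched_s_tm1, batched_r_t, batched_s_t)
print(clipped)  # averaged gradient with norm at most max_norm
```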