Improving differentiation for algebraic solver #1257
Comments
If f:R^N -> R^M, then we can compute a Jacobian-adjoint with N directional derivative calculations (forward mode) or with M gradient calculations (reverse mode) and an explicit Jacobian-adjoint multiply. You seem to be indicating it can be done in a single sweep of reverse mode, but each sweep of reverse mode only computes derivatives with respect to a single output component of f.
The reason we require M gradient calculations in reverse mode is because of the initial cotangents we use. Suppose the output is y, with adjoint v = (a, b, c), and that the input is x.
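As an aside, here is a minimal sketch of the single-reverse-sweep idea under discussion: if the output adjoints are seeded with the cotangent v (equivalently, if we differentiate the scalar v . f(x)), one backward pass produces v^T J. The functor f and the helper name jacobian_adjoint below are illustrative, not part of the solver code.

```cpp
#include <stan/math.hpp>
#include <Eigen/Dense>

// Sketch only: compute v^T J for f: R^N -> R^M with a single reverse sweep.
template <typename F>
Eigen::VectorXd jacobian_adjoint(const F& f, const Eigen::VectorXd& x_val,
                                 const Eigen::VectorXd& v) {
  using stan::math::var;
  Eigen::Matrix<var, Eigen::Dynamic, 1> x = stan::math::to_var(x_val);
  Eigen::Matrix<var, Eigen::Dynamic, 1> y = f(x);  // y in R^M
  // Differentiating v . y seeds each y[m] adjoint with v(m), so one
  // backward pass leaves v^T J in the adjoints of x.
  var vty = stan::math::dot_product(stan::math::to_var(v), y);
  vty.grad();
  Eigen::VectorXd vtJ(x.size());
  for (int i = 0; i < x.size(); ++i) vtJ(i) = x(i).adj();
  stan::math::recover_memory();  // standalone use only; don't wipe the stack mid-model
  return vtJ;
}
```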
We have three options here. Let f(x, y) be the function whose root we want, where x in R^M is the unknown and y in R^N are the parameters. Let v be the adjoint of the output. The implicit function theorem gives us dx/dy = - [df/dx]^-1 df/dy. The options are:
According to @betanalpha, when N, the number of parameters, becomes large, the third option becomes better. I've encountered both regimes (M > N and M < N), but we'll need to pick one. Option 2 is the simplest to implement.
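To make the trade-off concrete, here is an Eigen-only sketch (all names are placeholders) of the two ways the parameter adjoint can be obtained from the implicit function theorem: form the full sensitivity matrix dx/dy and then contract with v, or contract with v first, which needs only a single linear solve against [df/dx]^T and is presumably what makes the third option attractive when N is large.

```cpp
#include <Eigen/Dense>

// J_x = df/dx (M x M), J_y = df/dy (M x N), v = adjoint of the solution x.
// Implicit function theorem: dx/dy = -J_x^{-1} J_y.

// Form the full sensitivity matrix, then contract with v.
Eigen::VectorXd y_adjoint_full_jacobian(const Eigen::MatrixXd& J_x,
                                        const Eigen::MatrixXd& J_y,
                                        const Eigen::VectorXd& v) {
  Eigen::MatrixXd dxdy = -J_x.fullPivLu().solve(J_y);  // solve with N right-hand sides
  return dxdy.transpose() * v;                         // length-N adjoint of y
}

// Contract with v first: one solve against J_x^T plus one matrix-vector product.
Eigen::VectorXd y_adjoint_cotangent_first(const Eigen::MatrixXd& J_x,
                                          const Eigen::MatrixXd& J_y,
                                          const Eigen::VectorXd& v) {
  Eigen::VectorXd w = J_x.transpose().fullPivLu().solve(v);
  return -J_y.transpose() * w;
}
```

Forming dx/dy costs a solve with N right-hand sides; contracting first costs one solve plus a matrix-vector product, so it scales better in N.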
The function grad(v) initializes v's adjoint to 1 and then does the backward pass. If all we need to do is initialize other adjoints, it'd be easy to write a more general grad(v).
The adjoint-Jacobian thing from Ben is just a way to define a lazy chain() method for reverse mode. Rather than storing the Jacobian, multiplying it by the adjoint, and adding the result to the operand adjoints, we lazily increment the operand adjoints with the Jacobian-adjoint product. That lets us get away with a lot less memory. It also saves virtual function calls on multivariate functions by managing non-propagating vari instances and putting all the work on a single vari.
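A stripped-down sketch of that pattern (illustrative names only, not Stan Math's actual adjoint-Jacobian machinery): the M outputs share one controlling vari whose chain() increments the N operand adjoints with the Jacobian-adjoint product, so the outputs themselves never need their own propagating chain() calls.

```cpp
#include <stan/math/rev/core.hpp>

// Sketch: one vari propagates adjoints for an M-output, N-input function,
// instead of M separate varis each with a virtual chain() call.
struct multi_output_vari_sketch : public stan::math::vari {
  int M_, N_;
  stan::math::vari** y_;  // output varis; their own chain() is a no-op
  stan::math::vari** x_;  // operand varis
  const double* jac_;     // row-major M x N Jacobian (arena memory in real code)

  multi_output_vari_sketch(int M, int N, stan::math::vari** y,
                           stan::math::vari** x, const double* jac)
      : stan::math::vari(0.0), M_(M), N_(N), y_(y), x_(x), jac_(jac) {}

  void chain() {
    // Increment operand adjoints with the Jacobian-adjoint product
    // x_adj += J^T y_adj, evaluated here rather than stored up front.
    for (int n = 0; n < N_; ++n) {
      double sum = 0.0;
      for (int m = 0; m < M_; ++m)
        sum += y_[m]->adj_ * jac_[m * N_ + n];
      x_[n]->adj_ += sum;
    }
  }
};
```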
Have you, @charlesm93, considered applying the same trick as in the ODE speedup PR which you reviewed recently? It sounds as if you are doing something similar here: repeated gradients to get the Jacobian, so the parameters always stay the same and you can take advantage of that. This is probably something additional to do here. For non-stiff ODE integration this gave me a 20% speedup for the sie example.
@bob-carpenter Indeed we would need a more general grad(v). I'm guessing it would have other applications too. @wds15 That's a very good point. The trick applies to both options: for option (1) we use the fact that the input x is constant, and for option (2) the fact that both x and y remain constant. I still want to do this incrementally: first one of the above-described improvements, then your trick on top of it.
We don’t necessarily have to use grad(v) here — the Jacobian-adjoint product can be computed by implementing the nested sweeps by hand, as I originally did in the ODE code.
Per the discussion during yesterday's Stan meeting, I'm going with option (2).
On Jun 6, 2019, at 11:44 PM, Michael Betancourt ***@***.***> wrote:

> We don’t necessarily have to use grad(v) here — the Jacobian-adjoint product can be computed by implementing the nested sweeps by hand as I originally did in the ODE code.
That's what I meant---writing a new function like grad(v) that computes an adjoint gradient. I don't think it's a good idea to try to generalize grad(v).
You mean in general, or for this specific issue?
I implemented option 2 -- fairly straightforward -- and ran some performance tests. The first test is based on pharma steady states, as described in the appendix of the autodiff review paper (https://arxiv.org/abs/1811.05031). The second test involves finding the mode of a conditional distribution, as required when doing INLA on a Poisson process with a latent Gaussian variable. These tests show two things:
Hmmm... did I do something wrong? The number of sweeps is halved and there is shared computation, so there should be some speedup. In any case, this change provides at best a marginal improvement. The code is written and the PR review should not take very long, but I'm not sure that it's worth it.
I'd be happy to take a look at the code diff if you can point me to something manageable.
The trick to answering your second question is to figure out what's dominating computation. If you speed up an operation that only takes 5% of the computation, you won't see much overall gain. Often computation's dominated by memory or branch prediction (and how the code gets generated to exploit CPU features) rather than by sheer number of operations.
@bob-carpenter See the code for the chain method on lines 66 - 73. There are some changes elsewhere as well. I think you're right: costs are dominated by memory and branch prediction, as opposed to the number of operations. I'm running other tests that corroborate this conjecture.
One thing you can do is avoid [i], which checks bounds, and use .coeffRef(i). We should be doing this in a lot of places in the code. The bigger issue is whether you can automate a lot of the stuff you built by hand using adjoint-Jacobian apply.
The Jacobian itself is calculated using a call to Jacobian().
Can you wait to construct the Jacobian until the chain() method is called?
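(On the bounds-checking point above: Eigen's operator[] and operator() go through eigen_assert, which range-checks unless assertions are disabled, while coeff() and coeffRef() never check. A tiny illustration with made-up names:)

```cpp
#include <Eigen/Dense>

// adj[i] goes through eigen_assert (a range check unless NDEBUG/EIGEN_NO_DEBUG
// is set); coeffRef(i) and coeff(i) skip the check entirely.
void increment_adjoints(Eigen::VectorXd& adj, const Eigen::VectorXd& contrib) {
  for (int i = 0; i < adj.size(); ++i)
    adj.coeffRef(i) += contrib.coeff(i);
}
```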
Right... this came up last time I did a PR. Ok, I'll save that one for last.
There are actually a few complications.
Here's the prototype code, with some print statements I used for testing:
Without too much knowledge about the details... have you considered looking into using IDAS from SUNDIALS? That package supports the forward sensitivity method and an adjoint sensitivity method. At least for ODE problems, the adjoint sensitivity approach is far better for problems with many parameters. I hope this comment points you in a useful direction...
On Jun 13, 2019, at 4:04 AM, Charles Margossian ***@***.***> wrote:

> > One thing you can do is avoid [i], which checks bounds, and use .coeffRef(i). We should be doing this in a lot of places in the code.
>
> Right... this came up last time I did a PR. Ok, I'll save that one for last.
>
> > Can you wait to construct the Jacobian until the chain() method is called?
>
> There are actually a few complications.
>
> - First, more arguments of the constructor need to be stored in the class, since they are needed to construct the Jacobian matrix. Those elements are theta_dbl, the structures fs and f, and the arrays dat and dat_int. This is likely less memory than storing J, although I'm not sure what the cost of storing f and fs may be.
> - Secondly, aren't there issues with running Jacobian() inside the chain() method? Based on the unit tests I conducted, this messes up the adjoints for the gradient. Specifically, theta[i]->adj_ goes to 0 for all i. I'm not sure how best to fix this, but using start_nested() doesn't seem to work.
You're right---you can't call autodiff functions during a call to chain(). Forward-mode over double values would work, but not reverse. But that's not fully general for Stan functions, even when I've finished testing.

Sorry for the confusion---I didn't realize you were using nested autodiff to compute that Jacobian.

- Bob
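For reference, here is a minimal sketch of the forward-mode-over-doubles route mentioned above; fvar<double> never touches the reverse-mode tape, which is why it is safe inside chain(). The functor f and the helper name jacobian_fwd are placeholders.

```cpp
#include <stan/math/fwd.hpp>  // stan/math/fwd/mat.hpp in older releases
#include <Eigen/Dense>

// Sketch: Jacobian of f: R^N -> R^M, one forward-mode sweep per input direction.
template <typename F>
Eigen::MatrixXd jacobian_fwd(const F& f, const Eigen::VectorXd& x) {
  using stan::math::fvar;
  const int N = x.size();
  Eigen::MatrixXd J;
  for (int n = 0; n < N; ++n) {
    Eigen::Matrix<fvar<double>, Eigen::Dynamic, 1> x_fvar(N);
    for (int i = 0; i < N; ++i)
      x_fvar(i) = fvar<double>(x(i), i == n ? 1.0 : 0.0);  // seed direction n
    Eigen::Matrix<fvar<double>, Eigen::Dynamic, 1> fx = f(x_fvar);
    if (n == 0) J.resize(fx.size(), N);
    for (int m = 0; m < fx.size(); ++m) J(m, n) = fx(m).tangent();
  }
  return J;
}
```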
That's good to know; I might use that for specialized solvers. I could also do the forward pass when instantiating the vari class, if needed (the INLA problem is an example where forward diff would be much better than reverse mode, since the system is high dimensional but the model parameters are low dimensional). @betanalpha Can you implement nested sweeps inside the chain method? I'm guessing you'd need that, else you wouldn't have had access to y[j]->adj_ when initializing cotangents. @wds15 Thanks for the pointers. I haven't looked at IDAS yet.
> @betanalpha Can you implement nested sweeps inside the chain method? I'm guessing you'd need that, else you wouldn't have had access to y[j]->adj_ when initializing cotangents.

Did you mean to tag Bob here?

Take a look at how the `grad` function is implemented to see where you can access the adjoints after the forward sweep has been executed.
Accessing the adjoints is not my problem here. Once I have these adjoints, I use them to build the initial cotangent, with something like the following:

The problem is that I then want to do an autodiff call inside the chain method, but this doesn't seem possible, at least not for the reverse call. I detailed what I wanted to do in this post. To be clear, I think this scheme is useful for the INLA solver, but I'm not convinced it is the way to go in general, i.e. for
> @betanalpha Can you implement nested sweeps inside the chain method? I'm guessing you'd need that, else you wouldn't have had access to y[j]->adj_ when initializing cotangents.

Given a var, you can get the adjoint of its vari. So you can initialize whatever you want. The var.grad() method initializes var's adjoint to 1 and then propagates back from there.
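A tiny illustration of that point (names made up): the vari behind each output var is directly accessible, so an arbitrary cotangent can be seeded before running a backward sweep.

```cpp
#include <stan/math.hpp>
#include <Eigen/Dense>
#include <vector>

// Seed each output adjoint with a cotangent component instead of the single
// 1.0 that var::grad() would set on one output.
void seed_cotangent(std::vector<stan::math::var>& y, const Eigen::VectorXd& v) {
  for (int i = 0; i < static_cast<int>(y.size()); ++i)
    y[i].vi_->adj_ = v(i);
}
```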
The problem is that if you have nested autodiff inside the chain() method, you have to be careful not to pass any gradient information back to things outside the nested autodiff. So that means no simple functions of parameters for the thing being computed. You'd have to pull their double values out, use those, and then put everything back together.
More inline on your specifics.
On Jun 14, 2019, at 2:27 AM, Charles Margossian ***@***.***> wrote:

> > Take a look at how the grad function is implemented to see where you can access the adjoints after the forward sweep has been executed.
>
> Accessing the adjoints is not my problem here. Once I have these adjoints I use them to build the initial cotangent, with something like the following:
>
>     void chain() {
>       // compute the initial cotangent vector
>       Eigen::VectorXd J_l_theta(theta_size_);

This is OK as it's a double.

>       for (int i = 0; i < theta_size_; i++) J_l_theta(i) = theta_[i]->adj_;

I don't think theta[i]->adj_ has a value at this point if it's a component of the vari whose chain method is being defined. Only the parents have values at the point chain() is being called. That is, in reverse mode, you start from the final result and work backward, at each step only calling chain() on a vari if all of its parents have already had their chain() methods called. For example, if you have y = f(x1, x2), then y is the parent of x1 and x2.

>       Eigen::VectorXd init_cotangent = -mdivide_left(J_f_theta_, J_l_theta);
>
> The problem is that I then want to do an autodiff call inside the chain method, but this doesn't seem possible, at least not for the reverse call.

I think you could get away with a nested autodiff call that started with fresh double values---you can't involve any existing vari in expressions or you'll wind up erroneously passing down to those in the nested evaluation.
Linking to this relevant post on the forum: https://discourse.mc-stan.org/t/improved-differentiation-of-the-algebraic-solver/15040
@charlesm93, this issue has been resolved by #2421!
Description
There are two ways to improve the way in which we compute the Jacobian matrix of an algebraic solver:
The two options are competing but should be fairly straightforward to implement and compare.
Note the procedure here will work for any algebraic solver where the sensitivities are computed via the implicit function theorem.
Expected Output
Increased speed when solving algebraic equations and computing their sensitivities. Will test on a classical PK steady-state problem and on finding the mode of a Laplace approximation.
Discussion on forum
https://discourse.mc-stan.org/t/better-computation-of-the-jacobian-matrix-for-algebraic-solver/8593
Current Version:
v2.19.1