
Improving differentiation for algebraic solver #1257

Closed
charlesm93 opened this issue Jun 4, 2019 · 24 comments
@charlesm93 (Contributor) commented Jun 4, 2019

Description

There are two ways to improve the way in which we compute the Jacobian matrix of an algebraic solver:

  • compute the Jacobian of the system with respect to the parameters and the unknowns simultaneously, so that their shared expression tree is reused.
  • compute the Jacobian-adjoint vector product directly, and hence do only one sweep of reverse-mode AD.

The two options are competing but should be fairly straightforward to implement and compare.
Note the procedure here will work for any algebraic solver where the sensitivities are computed via the implicit function theorem.

Expected Output

Increased speed when solving algebraic equations and computing their sensitivities. Will test on a classical PK steady-state problem and on finding the mode of a Laplace approximation.

Discussion on forum

https://discourse.mc-stan.org/t/better-computation-of-the-jacobian-matrix-for-algebraic-solver/8593

Current Version:

v2.19.1

charlesm93 self-assigned this Jun 4, 2019
@bob-carpenter (Member)

If f: R^N -> R^M, then we can compute a Jacobian-adjoint product with N directional derivative calculations (forward mode) or with M gradient calculations (reverse mode) and an explicit Jacobian-adjoint multiply. You seem to be indicating it can be done in a single sweep of reverse mode, but each sweep of reverse mode only computes the derivatives of a single output component of f.

@charlesm93 (Contributor, Author)

We require M gradient calculations in reverse mode because of the initial cotangents we use. Suppose the output is y, with adjoint v = (a, b, c), and that the input is x.

With Jacobian(), the initial cotangents are (1, 0, 0), (0, 1, 0), and (0, 0, 1). This gives J, and we then multiply it by v to get J^T v, the adjoint of x. Instead, we could use a single initial cotangent, (a, b, c), and directly get J^T v in one sweep. This avoids redundant calculations. My understanding is that controlling the initial cotangent can be done with the adjoint-Jacobian method @bbbales2 developed.
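
To make the idea concrete, here is a minimal sketch (not the adjoint-Jacobian machinery itself; the function name and the functor f are hypothetical): seeding the reverse pass with the cotangent v is equivalent to differentiating the scalar v . f(x), so J^T v comes out of a single call to stan::math::gradient.

  #include <stan/math.hpp>

  // Obtain J^T v (v = adjoint of the output) in a single reverse sweep by
  // differentiating the scalar v . f(x) instead of building J row by row.
  // stan::math::gradient and stan::math::dot_product are existing functions;
  // everything else here (names, the functor f) is illustrative.
  template <typename F>
  Eigen::VectorXd jacobian_transpose_vector_product(const F& f,
                                                    const Eigen::VectorXd& x,
                                                    const Eigen::VectorXd& v) {
    using stan::math::var;
    double vtf_val;                // value of v . f(x), not needed afterwards
    Eigen::VectorXd Jtv(x.size());
    auto vtf = [&](const Eigen::Matrix<var, Eigen::Dynamic, 1>& x_var) {
      return stan::math::dot_product(v, f(x_var));
    };
    stan::math::gradient(vtf, x, vtf_val, Jtv);
    return Jtv;
  }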

@charlesm93 (Contributor, Author)

We have three options here.

Let f(x, y) be the function whose root we want, where x in R^M is the unknown and y in R^N are the parameters. Let v be the adjoint of the output. The implicit function theorem gives us dx/dy = - [df/dx]^-1 df/dy. The options are:

  1. compute df/dx and df/dy separately --> 2M sweeps
  2. compute df/d(x + y) together --> M sweeps
  3. compute df/dx and df/dy * v --> M + 1 sweeps

According to @betanalpha, when N, the number of parameters, becomes large the third option becomes better. I've encountered both regimes (M > N, and M < N), but we'll need to pick. Option 2 is the simplest to implement.
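
For concreteness, here is a minimal Eigen sketch (the function names are hypothetical) of how the sensitivities and the parameter adjoints are assembled once the two Jacobians are available, whichever option produces them:

  #include <Eigen/Dense>

  // Illustrative sketch (not Stan Math code): given the two Jacobians of f at
  // the root, J_x = df/dx (M x M) and J_y = df/dy (M x N), the implicit
  // function theorem gives the sensitivities dx/dy = -J_x^{-1} J_y, and the
  // adjoint contribution to the parameters is (dx/dy)^T v.
  Eigen::MatrixXd solver_sensitivities(const Eigen::MatrixXd& J_x,
                                       const Eigen::MatrixXd& J_y) {
    return -J_x.fullPivLu().solve(J_y);  // M x N matrix dx/dy
  }

  Eigen::VectorXd parameter_adjoints(const Eigen::MatrixXd& dx_dy,
                                     const Eigen::VectorXd& v) {
    return dx_dy.transpose() * v;  // length-N adjoint of the parameters y
  }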

@bob-carpenter (Member) commented Jun 5, 2019 via email

@wds15 (Contributor) commented Jun 5, 2019

Have you, @charlesm93, considered applying the same trick as in the ODE speedup PR which you reviewed recently? It sounds as if you are doing similar things here, namely repeated gradients to get the Jacobian. The parameters always stay the same, so you can take advantage of that. Probably this is an additional thing to do here? For non-stiff ODE integration this gave me a 20% speedup for the sie example.

@charlesm93 (Contributor, Author)

@bob-carpenter Indeed we would need a more general grad(v). I'm guessing it would have other applications too.

@wds15 That's a very good point. The trick applies to both options: for option (1) we use the fact that the input x is constant, and for option (2) the fact that both x and y remain constant. I still want to do this incrementally: first one of the above-described improvements, then your trick on top of it.

@betanalpha (Contributor) commented Jun 7, 2019 via email

@charlesm93 (Contributor, Author)

Per the discussion during yesterday's Stan meeting, I'm going with option (2).

@bob-carpenter (Member) commented Jun 7, 2019 via email

@charlesm93 (Contributor, Author)

> That's what I meant---writing a new function like grad(v) that computes an adjoint gradient. I don't think it's a good idea to try to generalize grad(v).

You mean in general, or for this specific issue?

@charlesm93 (Contributor, Author)

I implemented option 2 -- fairly straightforward -- and ran some performance tests.

The first test is based on pharma steady states, as described in the appendix of the autodiff review paper (https://arxiv.org/abs/1811.05031). The second test involves finding the mode of a conditional distribution, as required when doing INLA on a Poisson process with a latent Gaussian variable.

These tests show two things:

  • The time to compute the Jacobian is small compared to the time required to solve the algebraic system. For the steady-state problem, solving + differentiating takes about 0.03 s for 200 states, while constructing the Jacobian matrix requires 0.0075 s. But the latter is constructed using the grad function and looping over the 200 outputs; in practice, the Jacobian matrix is only calculated once, and this takes on the order of 1e-5 s, whereas solving the equations takes on the order of 1e-2 s. Same with the INLA problem: for 500 states, the solver needs 633 seconds, and the Jacobian calculation is on the order of 1e-6 s.
  • This is more intriguing: I did not observe any speed-up when looking at the time to calculate the Jacobian matrix. See this graph for the steady-state problem: the red line is the current solver, the blue one the new implementation. For the INLA problem I went down from 3e-6 to 2e-6 s, but I only ran the experiment once and the difference is likely due to noise.

Hmmm... did I do something wrong? The number of sweeps is halved and there is shared computation, so there should be some speedup. In any case, this change provides at best a marginal improvement. The code is written and the PR review should not take very long, but I'm not sure it's worth it.

@bob-carpenter (Member) commented Jun 10, 2019 via email

@charlesm93 (Contributor, Author)

@bob-carpenter See the code for the chain method on lines 66 - 73 in algebra_solver.hpp.

There are some changes in algebra_system, lines 61 - 74, where the () operator is modified to allow a vector containing both unknowns and parameters to be passed as the input variable. This is required for the call to Jacobian().
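
For illustration only (this is not the actual algebra_system code; the functor and member names below are hypothetical), such a wrapper boils down to splitting a single concatenated input back into unknowns and parameters before calling the user-supplied system functor:

  #include <Eigen/Dense>
  #include <ostream>
  #include <vector>

  template <typename F>
  struct system_both_functor {
    const F& f_;                       // user-supplied algebraic system functor
    const std::vector<double>& dat_;   // real data
    const std::vector<int>& dat_int_;  // integer data
    int x_size_;                       // number of unknowns at the front of the input
    std::ostream* msgs_;

    template <typename T>
    Eigen::Matrix<T, Eigen::Dynamic, 1> operator()(
        const Eigen::Matrix<T, Eigen::Dynamic, 1>& x_and_y) const {
      // First x_size_ entries are the unknowns x, the rest are the parameters y.
      Eigen::Matrix<T, Eigen::Dynamic, 1> x = x_and_y.head(x_size_);
      Eigen::Matrix<T, Eigen::Dynamic, 1> y = x_and_y.tail(x_and_y.size() - x_size_);
      return f_(x, y, dat_, dat_int_, msgs_);
    }
  };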

I think you're right: costs are dominated by memory and branch prediction, as opposed to number of operations. I'm running other tests that corroborate this conjecture.

@bob-carpenter (Member)

One thing you can do is avoid [i], which checks bounds, and use .coeffRef(i). We should be doing this in a lot of places in the code.

The bigger issue is whether you can automate a lot of the stuff you built by hand using adjoint-Jacobian apply. The solver's chain() method is just doing that:

  void chain() {
    for (int j = 0; j < y_size_; j++)
      for (int i = 0; i < x_size_; i++)
        y_[j]->adj_ += theta_[i]->adj_ * Jx_y_[j * x_size_ + i];
  }

The Jacobian itself is calculated using a map and stored in Jx_y_ until the reverse pass:

   Map<MatrixXd>(&Jx_y_[0], x_size_, y_size_)
       = -mdivide_left(fx.get_jacobian(theta_dbl),
                       f_y(fs, f, theta_dbl, value_of(y), dat, dat_int, msgs)
                           .get_jacobian(value_of(y)));

Can you wait to construct the Jacobian until the chain() method is called? It should be more efficient as it'll cut down on the preallocated memory. Then you can also define it using holistic matrix operations rather than in a loop.
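
As a sketch of that last point (assuming Jx_y_ is stored column-major with x_size_ rows and y_size_ columns, as in the Map above; this is illustrative, not a drop-in patch):

  void chain() {
    using Eigen::Map;
    using Eigen::MatrixXd;
    using Eigen::VectorXd;

    // Collect the adjoints of the solution theta into a dense vector.
    VectorXd theta_adj(x_size_);
    for (int i = 0; i < x_size_; ++i)
      theta_adj(i) = theta_[i]->adj_;

    // One matrix-vector product replaces the double loop:
    // y_adj = J^T * theta_adj, with J = d(theta)/d(y) as stored in Jx_y_.
    Map<const MatrixXd> J(&Jx_y_[0], x_size_, y_size_);
    VectorXd y_adj = J.transpose() * theta_adj;

    // Increment the adjoints of the parameters y.
    for (int j = 0; j < y_size_; ++j)
      y_[j]->adj_ += y_adj(j);
  }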

@charlesm93 (Contributor, Author)

> One thing you can do is avoid [i], which checks bounds, and use .coeffRef(i). We should be doing this in a lot of places in the code.

Right... this came up last time I did a PR. Ok, I'll save that one for last.

> Can you wait to construct the Jacobian until the chain() method is called?

There are actually a few complications.

  • First, more arguments of the constructor need to be stored in the class, since they are needed to construct the Jacobian matrix. Those elements are theta_dbl, the structures fs and f, and the arrays dat and dat_int. This is likely less memory than storing J, although I'm not sure what the cost of storing f and fs may be.
  • Secondly, aren't there issues with running Jacobian() inside the chain() method? Based on the unit tests I conducted, this messes up the adjoints for the gradient: specifically, theta_[i]->adj_ goes to 0 for all i. I'm not sure how best to fix this, but using start_nested() doesn't seem to work.

Here's the prototype code, with some print statements I used for testing:

template <typename Fs, typename F, typename T>  //, typename Fx>
struct algebra_solver_vari : public vari {
  /** vector of parameters */
  vari** y_;
  /** number of parameters */
  int y_size_;
  /** number of unknowns */
  int x_size_;
  /** vector of solution */
  vari** theta_;
  /** double vector of solution */
  Eigen::VectorXd theta_dbl_;
  /** wrapper structure for computing Jacobians */
  Fs fs_;
  /** functor for system of algebraic equation */
  F f_;
  /** array with real data for the algebraic system */
  std::vector<double> dat_;
  /** array with integer data for the algebraic system */
  std::vector<int> dat_int_;

  Eigen::VectorXd y_dbl_;

  algebra_solver_vari(const Fs& fs, const F& f, const Eigen::VectorXd& x,
                      const Eigen::Matrix<T, Eigen::Dynamic, 1>& y,
                      const std::vector<double>& dat,
                      const std::vector<int>& dat_int,
                      const Eigen::VectorXd& theta_dbl, // Fx& fx,
                      std::ostream* msgs)
      : vari(theta_dbl(0)),
        y_(ChainableStack::instance().memalloc_.alloc_array<vari*>(y.size())),
        y_size_(y.size()),
        x_size_(x.size()),
        theta_(
            ChainableStack::instance().memalloc_.alloc_array<vari*>(x_size_)),
        theta_dbl_(theta_dbl), fs_(fs), f_(f), dat_(dat), dat_int_(dat_int),
        y_dbl_(value_of(y)) {
    using Eigen::Map;
    using Eigen::MatrixXd;
    for (int i = 0; i < y.size(); ++i)
      y_[i] = y(i).vi_;

    theta_[0] = this;
    for (int i = 1; i < x.size(); ++i)
      theta_[i] = new vari(theta_dbl(i), false);
  }

  void chain() {
    // compute Jacobian with respect to theta_dbl and y.
    typedef hybrj_functor_solver<Fs, F, double, double> f_all;
    // start_nested();
    Eigen::MatrixXd
      Jacobian_all = f_all(fs_, f_, theta_dbl_, y_dbl_,
                           dat_, dat_int_,
                           0).get_jacobian(append_row(theta_dbl_, y_dbl_));
    // recover_memory_nested();
    
    std::cout << "Marker 1" << std::endl;

    // start_nested();
    Eigen::MatrixXd sensitivities =
      - mdivide_left(block(Jacobian_all, 1, 1, x_size_, x_size_),
                       block(Jacobian_all, 1, x_size_ + 1,
                             x_size_, y_size_));
    // recover_memory_nested();

    std::cout << "Marker 2 " << std::endl;
    
    for (int i = 0; i < x_size_; i++)
      std::cout << theta_[i]->adj_ << " ";
    std::cout << std::endl;

    for (int j = 0; j < y_size_; j++)
      for (int i = 0; i < x_size_; i++)
        y_[j]->adj_ += theta_[i]->adj_ * sensitivities(i, j);
        // y_[j]->adj_ += theta_[i]->adj_ * Jx_y_[j * x_size_ + i];
  }
};

@wds15 (Contributor) commented Jun 13, 2019

Without too much knowledge about the details... have you considered looking into using IDAS from SUNDIALS? That package supports the forward sensitivity method and an adjoint sensitivity method. At least for ODE problems, the adjoint sensitivity approach is far better for problems with many parameters. I hope this comment points you in a useful direction...

@bob-carpenter (Member) commented Jun 13, 2019 via email

@charlesm93 (Contributor, Author)

> Forward-mode over double values would work, but not reverse.

That's good to know; I might use that for specialized solvers. I could also do the forward pass when instantiating the vari class if needed (the INLA problem is an example where forward diff would be much better than reverse mode, since the system is high-dimensional but the model parameters are low-dimensional).

@betanalpha Can you implement nested sweeps inside the chain method? I'm guessing you'd need that, else you wouldn't have had access to y[j]->adj_ when initializing cotangents.

@wds15 Thanks for the pointers. I haven't looked at IDAS yet.

@betanalpha (Contributor) commented Jun 14, 2019 via email

@charlesm93 (Contributor, Author) commented Jun 14, 2019

> Take a look at how the grad function is implemented to see where you can access the adjoints after the forward sweep has been executed.

Accessing the adjoints is not my problem here. Once I have these adjoints, I use them to build the initial cotangent, with something like the following:

    void chain() {
      // compute the initial cotangent vector
      Eigen::VectorXd J_l_theta(theta_size_);
      for (int i = 0; i < theta_size_; i++)
        J_l_theta(i) = theta_[i]->adj_;
      Eigen::VectorXd init_cotangent = -mdivide_left(J_f_theta_, J_l_theta);
      // ...

The problem is that I then want to do an autodiff call inside the chain method, but this doesn't seem possible, at least not for the reverse-mode call. I detailed what I wanted to do in this post.

To be clear, I think this scheme is useful for the INLA solver, but I'm not convinced it is the way to go in general, i.e. for algebra_solver(), as it competes with method (2), or other approaches we might consider.

@bob-carpenter (Member) commented Jun 14, 2019 via email

@bob-carpenter (Member) commented Jun 14, 2019 via email

@charlesm93 (Contributor, Author)

Linking to this relevant post on the forum: https://discourse.mc-stan.org/t/improved-differentiation-of-the-algebraic-solver/15040

@jgaeb (Contributor) commented Aug 19, 2021

@charlesm93, this issue has been resolved by #2421!
