Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement stmt.locked getter #424

Closed
lsemprini opened this issue Jun 25, 2020 · 5 comments
Closed

Implement stmt.locked getter #424

lsemprini opened this issue Jun 25, 2020 · 5 comments

Comments

@lsemprini
Copy link

This is related to #203 and #367 but not the same.

I am SO happy that you did the fix in #203 to allow multiple simultaneous read-only iterate()s (and read-write with an unsafe mode flag in not-yet-shipped code) as I need that.

But if I am understanding the results I am seeing correctly, it looks like the implementation of multiple simultaneous iterate()s prevents another useful optimization, re-using statement objects. Let me explain...

I am still hitting cases of iterate() returning "This statement is busy executing a query" on read-only queries even with v6.0.1.

I tracked it down and the behavior seems to occur when both of these are true:

  1. I use the same JavaScript statement object, e.g. from db.prepare('SELECT ...'), for two iterate()s at the same time, AND
  2. The iterator from the first iterate() has not reached the end of its iteration when I call the second iterate() (the second iterate() returns the error)

This even happens when there are NO other intervening statements being executed other than SELECTs (no read-write statements on any table).

The reason why I am re-using the same JavaScript statement object is that I understand it is expensive to parse the SQL and do other things in prepare() and not something one would want to do on every statement execution, so I have a cache like:

db_prepare = function (sql)
{
    let st = this.cached_statements[sql];
    if (undefined !== st)  return st; // cache hit
    // cache miss
    return this.cached_statements[sql] = this.db.prepare(sql);
};

However, when I hit the cache and a case happens to satisfy both part 1 and 2 above, I get the error. For my application, I DEFINITELY need to have a sizable number of iterators (say 10-100) that have not yet hit their end active simultaneously, and some of those will be from the same prepare()d SQL statement.

While I don't 100% understand the code, the restriction seems to come from node_modules/better-sqlite3/src/objects/statement-iterator.lzz:

    explicit StatementIterator(Statement* _stmt, bool _bound) : node::ObjectWrap(),
        ....
        stmt->locked = true;
    }
    void Cleanup() {
        ....
        stmt->locked = false;
        ....
    }

where a given stmt becomes completely locked up from subsequent iterate() until its Cleanup(). Cleanup() apparently happens from Throw(), from Return() (which called from the final JS_next() and Next() at end of iteration even for those not using for...of) and presumably also when the iterator is garbage-collected?

So, the questions are:

  • is my understanding of the problem correct?

  • is prepare() in fact expensive and worth caching?

  • how can we achieve multiple simultaneous iterate()s and also get the performance benefits of re-using prepare()? I could perhaps envision complicating my cache with some kind of "pool" of JavaScript statement objects for the same SQL string, but that seems like a sketchy venture...not sure I have enough info at the library user level to know when to add more objects for a given SQL and when to retire them.

  • is the limitation artificial at the better-sqlite3 implementation level? in sqlite3 at the C level, are the expensive parsing actions in fact separate from the identity of the object which does iteration, and can we somehow split those two at the better-sqlite3 API level to get the best of both worlds?

Thanks!

@JoshuaWise
Copy link
Member

JoshuaWise commented Jun 26, 2020

is my understanding of the problem correct?

You seem to have the right idea about what's going on.

is prepare() in fact expensive and worth caching?

It can be relatively expensive, yes. However, the only way to know if it's worth caching is to benchmark it yourself on the machine that you're actually using.

how can we achieve multiple simultaneous iterate()s and also get the performance benefits of re-using prepare()? I could perhaps envision complicating my cache with some kind of "pool" of JavaScript statement objects for the same SQL string, but that seems like a sketchy venture...not sure I have enough info at the library user level to know when to add more objects for a given SQL and when to retire them.

Your cache would need to manage multiple statements containing the same SQL. An array will probably work fine. You only need to create a new statement object when trying to use an SQL string whose associated statements are ALL in-use. In practice, that should only happen when nesting iterators. So unless someone is nesting hundreds of iterators (which nobody would do), the number of statements per SQL string should remain pretty small (~1-4). Therefore, I don't see a need to ever retire them.

If you're really worried, you could implement a rule that your cache will only keep a maximum of 10 statements per SQL string—any extras must be compiled on the spot. That would cause a performance hit in extreme (extremely unlikely) cases, but would protect your memory usage from ballooning (in extremely unlikely cases).

is the limitation artificial at the better-sqlite3 implementation level? in sqlite3 at the C level, are the expensive parsing actions in fact separate from the identity of the object which does iteration, and can we somehow split those two at the better-sqlite3 API level to get the best of both worlds?

This is not an artificial limitation of better-sqlite3. At the C level, a prepared statement can only be used for one iteration at a time. It's a limitation of SQLite itself.

@JoshuaWise
Copy link
Member

You only need to create a new statement object when trying to use an SQL string whose associated statements are ALL in-use

This means you need to keep track of which statements are in-use. You'll need to develop an API that allows you to keep track of this information. Here is one way to structure such a cache:

const cache = new Map();

// Used to borrow a statement from cache, or create a new one
function claim(sql) {
  let list = cache.get(sql);
  if (!list) cache.set(sql, list = []);

  return list.pop() || db.prepare(sql);
}

// Used to return a statement to cache (up to a maximum of 10 per SQL string)
function relinquish(stmt) {
  let list = cache.get(sql);
  if (list.length < 10) list.push(stmt);
}

You could have your application use these functions directly (borrow, relinquish). But it's pretty brittle right now, because you could forget to relinquish a statement, or an unexpected error could be thrown which prevents relinquish() from being called. To make it more robust, you have a few different options:

"Borrowing callback"

With the borrow function, you explicitly define where you're going to use a statement object. The statement object is provided in the callback, and you're expected not to use it anymore when the callback returns.

function borrow(sql, fn) {
  const stmt = claim(sql);
  try {
    return fn(stmt);
  } finally {
    relinquish(stmt);
  }
}

borrow('SELECT * FROM data', (stmt) => {
  // do stuff with stmt...
})

"Proxy class"

With a proxy Statement class, you gain the functionality that you dreamed of: nesting as many iterators as you'd like, with just one "statement" object. Of course, under the hood, it could be using many actual statement objects, but that's all managed by the cache. The main downside to this approach is that you can't use methods like stmt.pluck() because you can't control which statement object is actually used under the hood.

class Statement {
  constructor(sql) {
    this.sql = sql;
  }

  run(...args) {
    return borrow(this.sql, stmt => stmt.run(...args));
  }

  get(...args) {
    return borrow(this.sql, stmt => stmt.get(...args));
  }

  all(...args) {
    return borrow(this.sql, stmt => stmt.all(...args));
  }

  *iterate(...args) {
    const stmt = claim(sql);
    try {
      for (const row of stmt.iterate(...args)) {
        yield row;
      }
    } finally {
      relinquish(stmt);
    }
  }
}

And there are probably other approaches too. You must decide which abstraction is right for your application. Also, it's worth mentioning that all this caching and extra logic isn't free. You should definitely measure the performance of the approach you choose to know if it's worth doing at all.

@lsemprini
Copy link
Author

Wow, thanks for the detailed response and sample code!

It occurred to me that one very simple thing that could greatly reduce the brittle-ness of the first approach and obviate the need for the complexity of the second and third approach is a simple class Statement property or method locked that simply returns the stmt->locked value.

Using this primitive, it would be possible for us (or, perhaps in the future, better-sqlite3) to implement a cached_prepare() function that could quickly tell which of the cached statement objects for a given SQL string is currently busy (in an iterate() or whatever future things might cause statements to become busy), without requiring the caller of cached_prepare() to explicitly call a relinquish() function when they are done with the iterator, and without the need to completely wrap class Statement.

In a way, iterate() is that function (since it throws if locked is true) but it has the rather inconvenient side-effect of starting a query if the statement is not locked :) and it is not necessarily the case that the caller wants to iterate() with the statement object every time. It would be nice if the cache code which calls prepare() could correctly choose whether to re-use an old statement object or create a new one without any additional information from its caller about whether that statement object is going to be used with iterate() or not, and whether it has been used that way or not.

What do you think about adding a statement.locked property or statement.isLocked() method? The documentation would be:

.locked -> boolean true if you have called iterate() on this statement but you have not yet made the final call to iter.next() that returns { done: true }, nor have you called iter.return(). A Statement which is locked cannot be passed to iterate() again, nor can it be used to start any other kind of execution (e.g. get(), all()).

I took a look at what existing functions will throw when stmt->locked is true. The list is: JS_bind, JS_pluck, JS_expand, JS_raw, and JS_safeIntegers (and StatementIterator::JS_new). All of these seem to have pretty large side-effects so don't serve as a locked query.

By the way, in my application I will normally have 10-100 iterators that are going simultaneously on a small number of big tables. They are not really "nested" in any sense; they are scanning various tables in parallel, where the next iterator to be advanced depends on the values read from existing iterators (think walking a set of sorted lists in parallel). This is actually a pretty common operation for certain kinds of text processing, so is not that far-fetched. Most key-value database packages (the descendants of dbm) have some kind of lightweight cursor operation to facilitate this kind of parallel table-walking and it would be awesome if better-sqlite3 could do it too. It seems like sqlite3 has everything we need; we just have to find a clean and non-brittle way of knowing when we need to create more statement handles.

@lsemprini
Copy link
Author

lsemprini commented Jun 27, 2020

There's another way I thought of to get the statement.locked information, though it's definitely not as good as having statement.locked built into better-sqlite3 as it is still brittle.

The idea is that rather than calling statement.iterate(), the user calls a utility function which calls statement.iterate() and hacks a new flag in the statement object itself (say, statement.myhack_locked = true). Then, the utility function does not return the original iterator from iterate() but rather it wraps the returned iterator in another outer iterator. The outer iterator next() and return() call the original iterator. If the outer iterator sees the original iterator next() return { done: true } or sees any call to iterator return(), then it knows that the original iterate() process is done so it sets statement.myhack_locked = false again. In theory the outer iterator should also handle iterator.throw() but this is so poorly documented (existing dox on the net are even worse than return(), which MDN doesn't even mention) that I have no idea how to implement it.

This produces a ghetto version of statement.locked called statement.myhack_locked which a prepare_cached() routine can use to decide whether to return a statement cached for a given SQL or create a new one.

It's still brittle because

  • iterate() is likely not going to be the only thing in the future that locks a statement
  • the user needs to remember to call the utility version of iterate() instead of calling the real statement.iterate
  • the hack pollutes the object property namespace of better-sqlite3's statement object, leading to possible property name conflicts. To avoid this problem would mean wrapping the better-sqlite3 statement object, which is even more brittle because it brings back the same heavyweight requirement to wrap all Class Statement functions and force the user to use separate functions for all statement execution (not just for iterate()).
  • that whole throw() thing, ugh

@JoshuaWise JoshuaWise changed the title Enjoying Multiple iterators on one statement while ALSO re-using JavaScript statement objects Implement stmt.locked getter May 5, 2021
@JoshuaWise
Copy link
Member

JoshuaWise commented Jan 19, 2022

I know it's a bit late, but this is now supported via the stmt.busy property, as of better-sqlite3 version 7.5.0.

Docs: https://github.com/JoshuaWise/better-sqlite3/blob/master/docs/api.md#properties-1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

2 participants