
iterate(): small but important doc clarification #406

Open
lsemprini opened this issue Jun 1, 2020 · 8 comments

Comments

@lsemprini

Thanks for your awesome library! The KISS design and lack of async and other zombie borg "you will be assimilated" thinking is truly a breath of fresh air!

The dox for iterate() at

https://github.com/JoshuaWise/better-sqlite3/blob/master/docs/api.md

don't actually say what the memory usage of iterate() is.

If iterate() guarantees to hold only a certain constant number of rows in memory at once in the worst case (all the way down to the C library level), then this is SUPER useful to know and should be documented.

If iterate() is actually a wrapper around all(), or otherwise could hold the whole record set in memory in the worst case (at any level), then this is ALSO super useful to know and should be documented.

Right now, there is a comment about performance vs. all() which makes me lean towards the second interpretation; whatever is true should be explicitly documented.

@JoshuaWise
Member

JoshuaWise commented Jun 1, 2020

So here's the thing. The iterate() function is supposed to work as you describe in your first example. However, some of the memory that's allocated is managed by the JavaScript garbage collector, which we don't have control over. So theoretically, you should be able to iterate over a huge number of rows without too many being held in memory at a time, but I can't provide such guarantees without also having such a guarantee from the garbage collector, which I don't have because I don't own that code.

But for all practical purposes, it will probably work as you expect in your first example. Any memory that this library allocates (that isn't managed by V8) is de-allocated after each loop iteration. The iterate() function is not a wrapper around all().
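
So, as long as your loop doesn't hold onto the rows, something like this (the events table and handleRow() are made up for illustration) should stay at roughly constant memory however many rows the query returns:

const db = require('better-sqlite3')('app.db');
const stmt = db.prepare('SELECT id, payload FROM events');

// Each iteration releases the library's own reference to the previous row;
// only the current row (plus whatever V8 hasn't collected yet) stays alive.
for (const row of stmt.iterate()) {
	handleRow(row); // process it, then let the reference go
}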

Also:

The KISS design and lack of async and other zombie borg "you will be assimilated"

This made me laugh.

@lsemprini
Author

lsemprini commented Jun 1, 2020

Interesting. The main question is "when fetching a large result set, what is the worst-case amount of memory that better-sqlite3 will allocate and hold onto during the full iteration, and is that amount proportional to the number of results?" Unreferenced memory held by the JS engine is less of a concern, because there should only ever be a constant amount of it before the GC kicks in, so it is not proportional to the number of results. So some dox like this will do the trick:

Unlike all(), iterate() lets go of its reference to the memory for a given result each time you iterate forward to the next result. So, when iterating through large result sets with iterate() (and assuming you also let go of your references to the older results), the peak amount of memory referenced at any given moment will be proportional to the size of the current result, but not proportional to the number of results or the total size of all results combined.

This is important if you are using better-sqlite3 with large result sets in strictly memory-limited environments (e.g. if the JavaScript engine itself is configured with memory limits via --max-old-space-size, or the JavaScript engine process is subject to being killed when it exceeds certain memory limits, e.g. UNIX rlimits imposed by a web server on a CGI script). In such an environment, it is still possible for the JavaScript engine to hold onto operating system memory (memory that nobody references but that the JavaScript engine has not yet garbage-collected), but the worst-case amount of such memory should be constant and not proportional to the number of records in the result.
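
To make the "assuming you also let go of your references" caveat concrete, the dox could contrast the two patterns; here's a sketch (stmt and the size column are made up):

// Constant memory: each row is processed and then becomes unreachable.
let total = 0;
for (const row of stmt.iterate()) {
	total += row.size;
}

// Proportional memory: accumulating rows keeps every one of them alive,
// which recreates the memory footprint of all().
const rows = [];
for (const row of stmt.iterate()) {
	rows.push(row);
}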

@lsemprini
Author

I think it would be helpful to include some dox like the example in the previous message in the official dox for iterate() at:

https://github.com/JoshuaWise/better-sqlite3/blob/master/docs/api.md

@pelletier197

I agree it would be helpful to have that in the documentation. It got me a little confused, because what I observe seems to be much more like the second example.

When I execute iterate() on a query that returns 10M rows, the query first takes a whopping full minute to return the first result, and during that minute, you can see the memory increasing steadily up to 6 GB of RAM 👀. And then, once the iteration begins and I receive the first result, the memory starts decreasing slowly as I'm iterating through the records.

This question is a bit more than three years old, so maybe this is not the best place to ask, but has something changed since then? And this should definitely be documented if this is the actual behavior, because I used this function thinking it would be awesome and use little memory, and I got quite disappointed when I saw that 😅.

@JoshuaWise
Member

JoshuaWise commented Jul 21, 2023

@pelletier197 I am not able to reproduce the behavior you described.

I filled a database with 10,000,000 rows (~5GiB of data), using the following script:

const db = require('better-sqlite3')('test.db');
db.pragma('journal_mode = MEMORY'); // Makes it faster
db.pragma('locking_mode = EXCLUSIVE'); // Makes it faster
db.exec('BEGIN EXCLUSIVE; COMMIT'); // Needed for locking_mode = EXCLUSIVE

db.exec(`
	CREATE TABLE data (
		id INTEGER PRIMARY KEY,
		chunk BLOB NOT NULL
	) STRICT;
`);

const chunk = Buffer.alloc(512);
const stmt = db.prepare('INSERT INTO data (id, chunk) VALUES (?, ?)');
for (let id = 0; id < 10000000; ++id) {
	stmt.run(id, chunk);
}

db.close();

Then I ran the following script to iterate through all the rows:

console.log('opening database...');
const db = require('better-sqlite3')('test.db', { readonly: true });

console.log('preparing statement...');
const stmt = db.prepare('SELECT chunk FROM data ORDER BY id ASC').pluck();

console.log('running query...');
const iterator = stmt.iterate();

let index = 0;
console.log('getting row #0...');
for (const _ of iterator) {
	console.log(`getting row #${++index}...`);
}

console.log('done!');

The process memory usage never went above ~70MiB. Then I tried replacing .iterate() with .all() and the memory usage went to 5GiB as expected.
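
If you want to watch the memory yourself, one option is to log process.memoryUsage() every so often instead of logging every row; a rough sketch of the loop:

let index = 0;
for (const _ of iterator) {
	if (++index % 1000000 === 0) {
		// rss is the process's total resident memory, in bytes
		const rssMiB = process.memoryUsage().rss / 1048576;
		console.log(`row #${index}: rss ~${rssMiB.toFixed(0)} MiB`);
	}
}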

@JoshuaWise
Member

Note that if you're using RETURNING in your query, SQLite will buffer all the results internally: https://www.sqlite.org/lang_returning.html#processing_order
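
For example (a sketch against the data table from my script above), even via iterate(), SQLite finishes the whole DELETE and buffers every RETURNING row internally before the first one is yielded:

const stmt = db.prepare('DELETE FROM data WHERE id < ? RETURNING id');
for (const row of stmt.iterate(5000000)) {
	// These rows stream from SQLite's internal RETURNING buffer, not
	// incrementally from the table scan, so memory scales with the
	// number of affected rows.
	console.log(row.id);
}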

@pelletier197

@JoshuaWise thank you for your help. I realised that there was an ORDER BY in the query that SQLite didn't like much, which caused it to store the whole query result in memory. That makes plenty of sense when you think about it, but I wasn't aware of the effect of that ORDER BY clause. I removed it and now it runs smoothly.
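
For anyone hitting the same thing, a quick way to check is EXPLAIN QUERY PLAN (just a sketch, reusing the data table from the script above; the exact plan text varies a bit between SQLite versions). A USE TEMP B-TREE FOR ORDER BY line means SQLite has to sort, and therefore materialize, the whole result before returning the first row:

// Ordering by the INTEGER PRIMARY KEY can stream straight off the table:
console.log(db.prepare('EXPLAIN QUERY PLAN SELECT chunk FROM data ORDER BY id').all());
// -> SCAN data

// Ordering by an unindexed column forces an up-front sort:
console.log(db.prepare('EXPLAIN QUERY PLAN SELECT id FROM data ORDER BY chunk').all());
// -> SCAN data
// -> USE TEMP B-TREE FOR ORDER BY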

@LRagji

LRagji commented Nov 3, 2024

@pelletier197, that's an interesting find about the ORDER BY clause. Where did you get that clue? And does it also happen for numeric primary key columns? I assume those become the clustered index for the table they are stored in. Also, doesn't SQLite implement an on-disk merge sort for large datasets? Every database needs one, since it can never know in advance how large the result will be. Or am I missing something?
