Skip to content
This repository was archived by the owner on Sep 12, 2018. It is now read-only.

Consider an alternative disk representation for datom flags #29

Open
rnewman opened this issue Aug 13, 2016 · 8 comments
Open

Consider an alternative disk representation for datom flags #29

rnewman opened this issue Aug 13, 2016 · 8 comments
Labels
A-core A-design Planning and overall architecture. A-transact Issues or requests in the transactor. size

Comments

@rnewman
Copy link
Collaborator

rnewman commented Aug 13, 2016

We have four fields in datoms: index_vaet, index_avet, index_fulltext and unique_value.

These fields appear in datoms and in the indices idx_datoms_eavt and idx_datoms_aevt.

Some rows also appear in idx_datoms_avet, idx_datoms_vaet, idx_datoms_fulltext, and idx_datoms_unique_value.

That is: each datom is responsible for anywhere from 12 to 28 TINYINTs in the DB. Sometimes these won't contribute to space (if they're zero, I've read that sqlite should pack them down to nothing), but assuming one byte per field, half a million datoms will add up to 14MB to the DB just for these flags.

These flags are entirely derived from the schema: a datom's attribute is the sole determinant (AFAICS) of whether these flags are 1 or 0 for a row.

Now, partial indexes cannot refer to other tables or call functions, so we can't simply join against the schema in the CREATE INDEX … WHERE clause. But we could use a more complex operator-driven expression, including bitmasks, to compress these flags.

We could also consider approaches to simply removing columns:

  • index_vaet and index_fulltext are mutually exclusive, so they could be two integer values in a single field.
  • index_vaet corresponds to :db/valueType :db.type/ref, which is already implicitly represented by a value_type_tag of 0, so we can filter on that instead.

Finally, we could consider direct schema creation as an approach: rather than having idx_datoms_fulltext, for example, we could create per-attribute indices when we register a schema fragment that includes :fulltext true:

CREATE INDEX idx_datoms_fulltext_page_title ON datoms (value_type_tag, v, a, e) WHERE a = 65538 -- The interned attribute.

I don't know if sqlite will happily use a collection of such indices (perhaps it'd be quicker for real-world queries!), but it's worth exploring.

@rnewman
Copy link
Collaborator Author

rnewman commented Aug 13, 2016

Note that each of these columns also appears transiently when inserting datoms into tx_lookup, so this has a bearing on #28, too.

@rnewman
Copy link
Collaborator Author

rnewman commented Oct 24, 2016

SQLite 3.15 allows the use of deterministic functions in partial index expressions, so we can now go whole-hog on bitfields if we choose.

TryGhost/node-sqlite3#724 adds this for node-sqlite3.

@rnewman
Copy link
Collaborator Author

rnewman commented Oct 24, 2016

@rnewman
Copy link
Collaborator Author

rnewman commented Oct 24, 2016

Hooray: https://bugzilla.mozilla.org/show_bug.cgi?id=1310361 already landed.

@rnewman
Copy link
Collaborator Author

rnewman commented Oct 24, 2016

0.2.0-SNAPSHOT now includes SQLite 3.15.

@rnewman rnewman modified the milestone: Production-ready Apr 24, 2017
@rnewman rnewman added A-core A-transact Issues or requests in the transactor. A-design Planning and overall architecture. labels Mar 22, 2018
@thomcc
Copy link
Contributor

thomcc commented Aug 20, 2018

I did this in https://github.com/thomcc/mentat/commit/ba29de5998d8a905e024339e9e5167ee56f2551c (note that not all tests pass, but it was just a test to see how beneficial it was). Sadly, it does not appear to make a meaningful difference in the on-disc database size :(

@rnewman
Copy link
Collaborator Author

rnewman commented Aug 21, 2018

You need to make unique the least significant bit. It’s common, and here it guarantees you a byte, so versus separate-zeroes it’ll be of little benefit.

@thomcc
Copy link
Contributor

thomcc commented Aug 21, 2018

Yeah, it's not nothing (average of 2-3 bytes per datom), but not really clear that it's worth the version breakage or additional complexity. It probably saves more for things using :db/index true though!

$ ls -al
<snip>
-rw-r--r--   1 thom  staff  194564096 Aug 20 20:50 mentat_places_with_flag_opt.db
-rw-r--r--   1 thom  staff  197406720 Aug 20 20:50 mentat_places.db
<snip>
$ sqlite3 mentat_places.db
SQLite version 3.19.3 2017-06-27 16:48:08
Enter ".help" for usage hints.
sqlite> select count(*) from datoms;
1119529
$ node
> 197406720 - 194564096
2842624
> (197406720 - 194564096) / 1119529
2.5391249355755856

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-core A-design Planning and overall architecture. A-transact Issues or requests in the transactor. size
Projects
None yet
Development

No branches or pull requests

2 participants