Corrupt cache reads of native format over HTTP #1957

Open
dehorsley opened this issue Feb 14, 2025 · 1 comment

@dehorsley

What happens?

When querying a native DuckDB file over HTTP (with range requests), I see corrupt reads once the browser cache is hit:

  1. Load page: query works
  2. Refresh (sometimes needs more than once): query fails with the following error:
Error: IO Error: Corrupt database file: computed checksum 16933857704068960742 does not match stored checksum 0 in block at location 4206592
    at ma.startPendingQuery (bindings_base.ts:188:19)
    at Fo.onMessage (worker_dispatcher.ts:228:51)
    at Wc.globalThis.onmessage (duckdb-browser-eh.worker.ts:29:19)
  3. Clear cache: query works again for one load (then back to step 2)

Some observations:

  • The stored checksum in the error message is always 0.

  • It seems the file and the query result both need to be around 100 MB to trigger it.

  • It likely needs HTTP range requests. I've been testing with the Node http-server package, but I have also seen the same behaviour on IIS, so I'm assuming it's a browser/library issue.

  • Sometimes the location in the TypeScript stack trace is different, but that is probably a red herring, e.g.:

    Error: IO Error: Corrupt database file: computed checksum 11427024748155090702 does not match stored checksum 0 in block at location 337653760
        at ma.pollPendingQuery (bindings_base.ts:201:19)
        at Fo.onMessage (worker_dispatcher.ts:245:51)
        at Wc.globalThis.onmessage (duckdb-browser-eh.worker.ts:29:19)
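
One way to read the two locations above is to map them to block numbers and then look at the checksum actually stored in the file on disk. The following is a minimal sketch, not DuckDB's own code. It assumes the default single-file layout (three 4096-byte headers followed by 262144-byte blocks, each starting with an 8-byte little-endian checksum); these constants and the helper names `block_index` and `stored_checksum` are assumptions for illustration and are not stated in the report.

```python
import struct

# Assumed layout (not confirmed by the report): three 4096-byte headers,
# then 262144-byte blocks whose first 8 bytes hold the stored checksum.
HEADER_BYTES = 3 * 4096
BLOCK_BYTES = 262144


def block_index(location: int) -> int:
    """Map a byte offset from the error message to a block number."""
    offset = location - HEADER_BYTES
    if offset < 0 or offset % BLOCK_BYTES:
        raise ValueError(f"{location} is not a block boundary under the assumed layout")
    return offset // BLOCK_BYTES


def stored_checksum(path: str, location: int) -> int:
    """Read the 8 bytes at `location` as a little-endian unsigned integer."""
    with open(path, "rb") as f:
        f.seek(location)
        raw = f.read(8)
    if len(raw) != 8:
        raise ValueError("file is shorter than location + 8 bytes")
    return struct.unpack("<Q", raw)[0]


if __name__ == "__main__":
    # The two locations quoted in the error messages.
    for location in (4206592, 337653760):
        print(location, "-> block", block_index(location))
```

Under these assumptions both locations fall exactly on block boundaries (blocks 16 and 1288). Calling `stored_checksum("prices.db", 4206592)` on the local file would then show whether the value on disk is non-zero; if it is, the zero reported by DuckDB-Wasm would more plausibly come from the transport or cache layer than from the file itself.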
    

To Reproduce

Reproducible example (not minimal)

  1. Create a largish dummy database with Python, e.g.:
"""
create a dummy duckdb native database with a timeseries table
"""

import duckdb # v1.2.0
import pandas as pd
import numpy as np


con = duckdb.connect("prices.db")

con.sql(
    """
CREATE OR REPLACE TABLE prices (
    datetime TIMESTAMP,
    forecast_scenario int64,
    member int64,
    price float
)
"""
)

t = pd.date_range("2020-01-01", end="2026", freq="h")
df = pd.concat(
    (
        pd.DataFrame(
            {
                "datetime": t,
                "forecast_scenario": j,
                "member": i,
                "price": np.random.rand(len(t)),
            }
        )
        for i in range(100)
        for j in range(10)
    )
)


# Insert the generated rows, sorted so that each scenario is stored contiguously
con.execute("""INSERT INTO prices SELECT * FROM df
ORDER BY forecast_scenario, datetime, member
""")

con.query("SELECT * FROM prices LIMIT 10")
con.close()
  2. Serve the created db file and the following HTML with Node http-server:
<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Document</title>
</head>

<body>
    <script>
        const getDb = async () => {
            const duckdb = window.duckdb;
            // @ts-ignore
            if (window._db) return window._db;
            const JSDELIVR_BUNDLES = duckdb.getJsDelivrBundles();

            // Select a bundle based on browser checks
            const bundle = await duckdb.selectBundle(JSDELIVR_BUNDLES);

            const worker_url = URL.createObjectURL(
                new Blob([`importScripts("${bundle.mainWorker}");`], {
                    type: "text/javascript",
                })
            );

            // Instantiate the asynchronous version of DuckDB-wasm
            const worker = new Worker(worker_url);
            // const logger = null //new duckdb.ConsoleLogger();
            const logger = new duckdb.ConsoleLogger();
            const db = new duckdb.AsyncDuckDB(logger, worker);
            await db.instantiate(bundle.mainModule, bundle.pthreadWorker);
            URL.revokeObjectURL(worker_url);
            window._db = db;
            return db;
        };
    </script>
    <script type="module">
        import * as duckdb from 'https://cdn.jsdelivr.net/npm/@duckdb/duckdb-wasm@1.29.0/+esm';
        window.duckdb = duckdb;
        getDb().then(async (db) => {
            // The third argument, 4, is duckdb.DuckDBDataProtocol.HTTP
            await db.registerFileURL('prices.db', new URL('../prices.db', window.location.href).href, 4)
            const conn = await db.connect();
            await conn.query(`ATTACH 'prices.db' (READ_ONLY)`)
            for await (const batch of await conn.send(`
                SELECT * FROM prices.prices WHERE 
                datetime > '2025-01-01' and datetime <= '2025-01-02';
            `)) {
                console.log(batch);
            }
        });
    </script>

    <div id="output"></div>
</body>

</html>
  3. Load page in browser: should succeed.
  4. Refresh (potentially more than once): hit the above error.
  5. Clear browser cache and repeat: works again.
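
To help rule out the server side, the range response can be compared against the local file outside the browser. This is a rough sketch only: it bypasses the browser cache entirely, so it can show at most that the server honours `Range` requests correctly. The URL `http://localhost:8080/prices.db` and the helper names are assumptions for illustration.

```python
import urllib.request


def range_header(offset: int, length: int) -> str:
    """Build an HTTP Range header value covering `length` bytes from `offset`."""
    return f"bytes={offset}-{offset + length - 1}"


def fetch_range(url: str, offset: int, length: int):
    """Request a byte range and return (status, Content-Range header, body)."""
    req = urllib.request.Request(url, headers={"Range": range_header(offset, length)})
    with urllib.request.urlopen(req) as resp:
        return resp.status, resp.headers.get("Content-Range"), resp.read()


def matches_local(url: str, path: str, offset: int, length: int = 262144) -> bool:
    """True if the server answers 206 and the bytes equal the local file's bytes."""
    status, _content_range, body = fetch_range(url, offset, length)
    with open(path, "rb") as f:
        f.seek(offset)
        expected = f.read(length)
    return status == 206 and body == expected


# Hypothetical usage, with the server from step 2 running:
# matches_local("http://localhost:8080/prices.db", "prices.db", 4206592)
```

If this returns `True` for the failing location, the server is probably returning the right bytes and the corruption would have to be introduced later, for example when a cached partial response is reused.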

Browser/Environment:

Chrome 131

Device:

Windows 10 x86-64

DuckDB-Wasm Version:

1.29.0

DuckDB-Wasm Deployment:

JSDelivr

Full Name:

David Horsley

Affiliation:

Hydro Tasmania

@carlopi
Collaborator

carlopi commented Feb 15, 2025

Thanks for the report, this is indeed quite strange / unexpected.
