Large column value not split into chunks #91
Hello, I browsed the copy-to implementation here - https://github.com/brianc/node-pg-copy-streams/blob/master/copy-to.js#L41. I fear that some people using the library currently expect it to output full rows, so we would have to think about backward compatibility if we were to modify this behaviour. I did not check the postgres documentation about this, but I remember that there were cases where postgres itself was handling full rows. Can I ask what your use case is?
That is a very valid fear. I'm wondering whether the abstraction used here is ideal. I was under the impression that the contract for streams is that they deliver a larger whole efficiently in small chunks and avoid holding anything in memory in its entirety; one effectively receives parts of the end result. For a contract that emits full rows, would an EventEmitter-based approach be a more intuitive solution? There one could explicitly state what is arriving:
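For illustration, a hypothetical row-oriented interface could look like the following sketch (none of these names exist in pg-copy-streams; the point is only that the event name states the unit of delivery):

```js
// Hypothetical sketch only -- this API is not part of pg-copy-streams.
// An EventEmitter makes the unit of delivery explicit via the event name.
const { EventEmitter } = require('events')

const copyEvents = new EventEmitter()

copyEvents.on('row', (row) => {
  // `row` would be one complete row, buffered by the producer before emission
  console.log('received a full row of %d bytes', row.length)
})
copyEvents.on('end', () => console.log('copy complete'))

// a producer would feed it, e.g. copyEvents.emit('row', someRowBuffer)
```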
To stream an image stored in a BYTEA column to the client through a Node.js server.
Inspecting this further: in theory, the pg-copy-streams implementation could strip the control data and push the payload from those 64kB chunks forward as they arrive. Would you be open to this kind of solution? I could create a PR.
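A rough sketch of that idea, assuming the input already consists of whole backend protocol messages (which a real implementation cannot assume, since message headers can straddle chunk boundaries):

```js
const { Transform } = require('stream')

// Simplified sketch: extract CopyData ('d') messages from a chunk and push
// only their payload, without waiting for row boundaries. The source of raw
// protocol bytes is assumed; partial messages are not handled here.
class CopyDataPayload extends Transform {
  _transform(chunk, _encoding, callback) {
    let offset = 0
    while (offset + 5 <= chunk.length) {
      const code = chunk[offset] // message type byte
      const length = chunk.readUInt32BE(offset + 1) // length field includes its own 4 bytes
      if (code === 0x64 /* 'd' = CopyData */) {
        this.push(chunk.slice(offset + 5, offset + 1 + length))
      }
      offset += 1 + length // skip to the next message
    }
    callback()
  }
}

// usage (hypothetical): rawProtocolBytes.pipe(new CopyDataPayload()).pipe(destination)
```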
Hello, I was re-reading the protocol spec regarding COPY, and I think that the current implementation was misled by this sentence about CopyData: "Data that forms part of a COPY data stream. Messages sent from the backend will always correspond to single data rows, but messages sent by frontends might divide the data stream arbitrarily."
Since the goal of this library is to have high throughput and low memory requirements, I could envision a major 3.0 version that would break backward compatibility and more or less keep the chunk size that postgres outputs. Some people might not be happy, but we could document it and recommend a simple streaming csv parser that would aggregate the csv-compatible chunks - https://csv.js.org/parse/ might do the trick (see the sketch after this comment); we need to test it and add a warning with an upgrade path in the changelog. I am not in favor of outputting 'row' events, since we would need to buffer the whole row to pass it in the event, which is exactly what we want to avoid. If you want to give it a try and send a PR, I will find time to review it. As a reminder, we will have to check the following boxes before a 3.0 can be published:
Regarding the storage of images as BYTEA inside postgres, I have been tinkering for some time with ideas around this (small images as a BYTEA field, or big images split over many rows in a table with an offset for re-ordering). Maybe the problem you uncovered can lead to an acceptable "1 row per image" solution, whatever the image size. But there might still be a problem with > 1GB images, because oversized fields are TOASTed - https://www.postgresql.org/docs/9.5/storage-toast.html . Have you already encountered this issue? Keep me posted on this.
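Sketch of the csv parser recommendation above: let postgres emit CSV and reassemble full rows downstream, whatever the chunking. The table name and parser options are assumptions, and the csv-parse import shown is the v4 style (v5 exports { parse }).

```js
const { Client } = require('pg')
const { to: copyTo } = require('pg-copy-streams')
const parse = require('csv-parse') // csv-parse v4 style; v5 uses: const { parse } = require('csv-parse')

async function main() {
  const client = new Client()
  await client.connect()

  // ask postgres for CSV so a downstream parser can find the row boundaries
  const copyStream = client.query(copyTo('COPY my_table TO STDOUT WITH (FORMAT csv)'))
  const parser = parse() // re-assembles full rows regardless of how chunks were cut

  parser.on('data', (row) => {
    // `row` is an array of column values for one complete row
  })
  parser.on('end', () => client.end())
  parser.on('error', (err) => { console.error(err); client.end() })

  copyStream.pipe(parser)
}

main().catch(console.error)
```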
Ohhh yeah, it makes sense to emit fixed-size chunks. It would be a subtle backwards-compatibility break, but I think it's probably better overall for efficiency, as @jeromew said. I agree with all his thoughts FWIW! 😄
Hello @pspi, have you started working on a PR for this? I could do it, but I do not want to collide with your work.
@jeromew Yep, I'm working on it. The radio silence was because I don't have anything to show yet. |
@pspi Great! Thanks for your work on this.
@jeromew You know what, you probably know your way around this module far better than I do, and this issue looks like a tricky nut to crack. If you're eager, why don't you give it a shot as well? We could join forces and compare notes off-channel. You can find my contact info in my profile.
OK, I'll take a look in parallel. I'll send you an email.
After reading the code once again, I better understand the problem: rows are extracted only when a full row has been seen, so when there is a huge row (>64kB), the buffer that we prefix to the pg chunk keeps growing until a full row can be seen. This is probably the degradation that you experience with a huge BYTEA, since the module needs to buffer at least the whole BYTEA before pushing it.
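The shape of the problem is roughly the following (a simplified, self-contained sketch using a newline as the row boundary, not the actual copy-to.js code):

```js
// Simplified sketch of the behaviour described above (not the real copy-to.js):
// the leftover buffer is prefixed to every incoming chunk, and nothing is
// pushed until a complete row (here: a newline) has been seen.
let remainder = Buffer.alloc(0)
const pushed = []

function onChunk(chunk) {
  remainder = Buffer.concat([remainder, chunk]) // keeps growing for a huge row
  let newline
  while ((newline = remainder.indexOf(0x0a)) !== -1) {
    pushed.push(remainder.slice(0, newline + 1)) // only full rows go out
    remainder = remainder.slice(newline + 1)
  }
  // for a multi-megabyte BYTEA value no newline shows up for many chunks,
  // so the whole value sits in `remainder` and is re-concatenated every time
}

// three chunks that together contain a single row
onChunk(Buffer.from('aaaa'))
onChunk(Buffer.from('bbbb'))
onChunk(Buffer.from('cccc\n'))
console.log(pushed.length) // 1 -- nothing was emitted until the row completed
```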
I am temporarily re-opening this issue to make sure we have a common understanding of what is happening.
but after reading @brianc do you confirm that even with the line https://github.com/brianc/node-pg-copy-streams/blob/master/copy-to.js#L30 This would mean that even with I suppose that this does not trigger the same memory/GC issue that @pspi was seeing because Maybe we can find a solution for this
I'll close this as I think it is fixed with 4.0.0.
When reading a large column value, only one large chunk covering the complete value gets emitted. I was expecting the stream to emit multiple smaller chunks.
Test
Output
For large values (for example, a BYTEA column that holds a file), emitting a single chunk slows down performance significantly, because the emitted Buffer has to be allocated for several megabytes. To give you an idea, running this same test with larger field sizes:
Versions: pg 8.0.2, pg-copy-streams 2.2.2, node v13.8.0, PostgreSQL 12.1 on x86_64-apple-darwin16.7.0
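A minimal way to observe the chunking behaviour described above is to log the size of each chunk emitted by the copy-to stream; the table and column names below are assumptions, not the original test:

```js
// Hypothetical reproduction sketch -- table/column names are assumptions.
const { Client } = require('pg')
const { to: copyTo } = require('pg-copy-streams')

async function main() {
  const client = new Client()
  await client.connect()

  const stream = client.query(copyTo('COPY (SELECT payload FROM files WHERE id = 1) TO STDOUT'))

  stream.on('data', (chunk) => {
    // with 2.2.2 this logs one multi-megabyte chunk for a large BYTEA value
    // instead of a series of ~64kB chunks
    console.log('chunk of %d bytes', chunk.length)
  })
  stream.on('end', () => client.end())
  stream.on('error', (err) => {
    console.error(err)
    client.end()
  })
}

main().catch(console.error)
```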