
cannot stream cursor on very large table #462

@AlJohri

Description

  • asyncpg version: 0.18.3

  • PostgreSQL version: 9.6.8

  • Do you use a PostgreSQL SaaS? If so, which? Can you reproduce
    the issue with a local PostgreSQL install?: AWS Aurora Postgres. Tried
    connecting to both writer and read replica. Did not try local Postgres.

  • Python version: 3.7.3

  • Platform: macOS 10.14.5

  • Do you use pgbouncer?: No

  • Did you install asyncpg with pip?: Yes

  • If you built asyncpg locally, which version of Cython did you use?: N/A

  • Can the issue be reproduced under both asyncio and
    uvloop?: Did not try.

I'm running into an issue with using cursors on a very large Postgres table. The table has approximately 60 million rows, only 8 columns, and each row is very small. I'm trying to use a server-side cursor to stream the results in order, but it completely freezes while establishing the cursor (con.cursor). This is my code:

import asyncio
import asyncpg

dsn = 'postgresql://...:...@.../...'

async def run():
    con = await asyncpg.connect(dsn)
    print(con)
    async with con.transaction():
        # The hang happens here, while the server-side cursor is being established.
        async for record in con.cursor('SELECT * FROM pageviews ORDER BY user_id', prefetch=10, timeout=2):
            print(record)
    await con.close()

loop = asyncio.get_event_loop()
loop.run_until_complete(run())
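
I also tried the regular (non-iterating) cursor interface with explicit fetch() calls. That attempt looked roughly like the sketch below; the batch size of 1000 is illustrative, not necessarily what I used:

async def run_fetch():
    con = await asyncpg.connect(dsn)
    async with con.transaction():
        # Awaiting con.cursor(...) declares a server-side cursor without iterating it.
        cur = await con.cursor('SELECT * FROM pageviews ORDER BY user_id')
        while True:
            # Fetch the next batch; an empty list means the cursor is exhausted.
            rows = await cur.fetch(1000)
            if not rows:
                break
            for record in rows:
                print(record)
    await con.close()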

Neither the iterable cursor (with explicit prefetch) nor the regular cursor worked. The following psycopg2 code using a "named cursor" (server-side cursor) works properly:

import datetime
import psycopg2
from psycopg2.extras import RealDictCursor

cursor_name = datetime.datetime.now().isoformat()

dsn = 'postgresql://...:...@.../...'
conn = psycopg2.connect(dsn, cursor_factory=RealDictCursor)

# Passing a name to cursor() creates a server-side ("named") cursor.
with conn.cursor(cursor_name) as cursor:
    sql = "SELECT * FROM pageviews ORDER BY user_id;"
    cursor.execute(sql)
    for i, row in enumerate(cursor):
        print(i, row)

I need to stream the entire ordered table, perform a streaming groupby, and then run some computation on each grouped object. I'm looking to asyncpg to decrease the I/O time.
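
The downstream processing is conceptually a streaming groupby over the ordered rows, along the lines of the sketch below. Here process_group is a placeholder for the actual per-group computation, user_id is assumed to be the grouping column, and dsn is the connection string from above:

import asyncpg

async def group_by_user(dsn):
    con = await asyncpg.connect(dsn)
    try:
        async with con.transaction():
            current_key, group = None, []
            async for record in con.cursor(
                    'SELECT * FROM pageviews ORDER BY user_id', prefetch=1000):
                # Rows arrive ordered by user_id, so a group is complete
                # as soon as the key changes.
                if group and record['user_id'] != current_key:
                    process_group(current_key, group)  # placeholder
                    group = []
                current_key = record['user_id']
                group.append(record)
            if group:
                process_group(current_key, group)
    finally:
        await con.close()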
