Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ctypes.POINTER memory leak in _buffer_reader / _buffer_writer #346

Open
1 task done
richardhundt opened this issue Mar 9, 2025 · 7 comments · May be fixed by #347
Open
1 task done

ctypes.POINTER memory leak in _buffer_reader / _buffer_writer #346

richardhundt opened this issue Mar 9, 2025 · 7 comments · May be fixed by #347

Comments

@richardhundt
Copy link

richardhundt commented Mar 9, 2025

Checklist

  • I confirm to be using an official package of pypdfium2 from PyPI or GitHub/pypdfium2-team.

Description

In the callbacks for reading and writing PDF files, ctypes.POINTER objects are being created at runtime:

        block = ctypes.cast(data, ctypes.POINTER(ctypes.c_ubyte * size))

This is problematic since pointer types are cached, and since size can vary in the last chunk, there's an unbounded growth of the type cache.

This leads to a 500Mb process eventually growing to 2.6Gb after ~8hrs of read/write operations on PDFs.

Running tracemalloc.compare_to confirms that this is where all the memory is going.

A workaround is to call ctypes._reset_cache() after closing the PDF (or after context __exit__), but that seems like a bit of a hack.
I'm not really familiar with the CFFI, so not sure what the canonical way of reading/writing buffers is given a pointer and a size, but it seems like there should be another way which doesn't require declaring pointer types whenever you read a chunk into a buffer. No idea.

Either way, memory is stable if you do a ctypes._reset_cache() when releasing PDF resources after I/O operations.

Install Info

Python 3.10.12 (main, Feb  4 2025, 14:57:36) [GCC 11.4.0]
Linux-6.8.0-1024-aws-x86_64-with-glibc2.35
pypdfium2 4.30.1
pdfium 133.0.6899.0 at /opt/app/code/_venv/lib/python3.10/site-packages/pypdfium2_raw/libpdfium.so

Name: pypdfium2
Version: 4.30.1
Summary: Python bindings to PDFium
Home-page: https://github.com/pypdfium2-team/pypdfium2
Author: pypdfium2-team
Author-email: [email protected]
License: BSD-3-Clause, Apache-2.0, PdfiumThirdParty
Location: /opt/app/code/_venv/lib/python3.10/site-packages
Requires: 
Required-by:
@mara004
Copy link
Member

mara004 commented Mar 9, 2025

That's crazy, POINTER(type * size) is really fundamental ctypes and used all around the library.
Isn't this more of an issue with ctypes not limiting its cache size?
But otherwise, yes, thanks for the info. I will think about what can be done. Maybe we can use (type * size).from_*(...) or something instead.

As for ctypes._reset_cache(), that's technically a non-public API, so I'd be hesitant to use it in pypdfium2.

@mara004 mara004 pinned this issue Mar 9, 2025
@mara004
Copy link
Member

mara004 commented Mar 9, 2025

Well, we could do

addr = ctypes.cast(p_buf_first, ctypes.c_void_p).value
buf = (ctypes.c_char * size).from_address(addr)

but this is a bit sketchy as it mixes up pointer and address. Pointer values are usually memory addresses, but they don't necessarily have to be according to the C spec. Yet, in practice, we could probably assume they are.

@mara004
Copy link
Member

mara004 commented Mar 9, 2025

Ah wait, how about .from_address( addressof(p_buf_first.contents) ) ? That should work cleanly.

mara004 added a commit that referenced this issue Mar 9, 2025
@mara004 mara004 linked a pull request Mar 9, 2025 that will close this issue
mara004 added a commit that referenced this issue Mar 9, 2025
@mara004 mara004 linked a pull request Mar 9, 2025 that will close this issue
@mara004
Copy link
Member

mara004 commented Mar 9, 2025

BTW, this is not only _buffer_reader/_buffer_writer.
PdfBitmap.from_raw() rsp. PdfBitmap._get_buffer() and their callers are also affected.

Edit: Previously, this comment stated that PdfPage.render() would be affected, but I figured this is not the case with the default PdfBitmap.new_native bitmap maker.

@mara004
Copy link
Member

mara004 commented Mar 9, 2025

Isn't this more of an issue with ctypes not limiting its cache size?

CC python/cpython#100926

@richardhundt
Copy link
Author

Yeah, I agree that calling a private method to clear the cache is nasty. We still get dynamic class construction:

>>> ctypes.c_char * 32
<class '__main__.c_char_Array_32'>
>>> ctypes.c_char * 33
<class '__main__.c_char_Array_33'>

Perhaps those are GC'd correctly though. I'll test once a new release is out.

Thanks for a fast turn-around on this.

@mara004
Copy link
Member

mara004 commented Mar 10, 2025

Well, that's just the ctypes way of creating arrays: https://docs.python.org/3/library/ctypes.html#arrays
We need to specify the size at some point, and that's the API to do it.
If these array types are also cached then I'm at my wit's end. But I don't think they are ;)
AFAICS the docs mention caching only for POINTER(),1 not arrays.

Footnotes

  1. what they don't mention is that the cache is unbounded...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants