Description
Bug report
I've got a bunch of tar files containing millions of small files. Never mind how we got here - I need to process those tars, handling each of the files inside. Furthermore, I need to process a lot of these tars, and I'd like to do it relatively quickly on cheapish hardware, so I'm at least a little sensitive to memory consumption.
The natural thing to do is to iterate through the tarfile, extracting each file one at a time, and carefully closing them when done:
import tarfile

with tarfile.open(filepath, "r:gz") as tar:
    for member in tar:
        file_buf = tar.extractfile(member)
        try:
            handle(file_buf)
        finally:
            file_buf.close()
This looks like it should handle each small file and discard it when done, so memory should stay pretty tame. I was very surprised to discover that this actually uses gigabytes of memory. That's fixed if you do this:
with tarfile.open(filepath, "r:gz") as tar:
    for member in tar:
        file_buf = tar.extractfile(member)
        try:
            handle(file_buf)
        finally:
            file_buf.close()
        tar.members = []  # evil!
That works because tarfile.TarFile keeps a cache, self.members. That cache is appended to in TarFile.next(), which in turn is used by TarFile.__iter__.

The cache stores TarInfo objects, the headers describing each file. Individually they're not very large, but with lots and lots of files, those headers add up.
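You can watch the cache grow while iterating; here's a minimal sketch, assuming a large archive at a hypothetical path big.tar.gz:

import tarfile

with tarfile.open("big.tar.gz", "r:gz") as tar:  # hypothetical path
    for i, member in enumerate(tar, start=1):
        if i % 100_000 == 0:
            # One TarInfo is retained per member seen so far.
            print(f"{i} members read, cache holds {len(tar.members)} TarInfo objects")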
The net result is that it's not possible to stream a tarfile's contents without memory growing linearly with the number of files in the tarfile. This has been partially addressed in the past (see #46334, from way back in 2008), but never fully resolved. It shows up on StackOverflow and probably elsewhere, with a clumsy recommended solution of resetting tar.members each time, but there ought to be a better way.
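In the meantime, the workaround can at least be confined to one place. This is a sketch only, relying on the undocumented members attribute; iter_tar_files is a name I made up:

import tarfile

def iter_tar_files(filepath):
    # Yield (member, fileobj) pairs while keeping TarFile.members bounded.
    with tarfile.open(filepath, "r:gz") as tar:
        for member in tar:
            file_buf = tar.extractfile(member)  # None for non-regular members
            try:
                yield member, file_buf
            finally:
                if file_buf is not None:
                    file_buf.close()
                tar.members = []  # still evil, just hidden in one helper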
Your environment
CPython 3.10, mostly; I don't think OS, architecture, etc. are relevant.