Description
Bug report
I've got a bunch of tar files containing millions of small files. Never mind how we got here - I need to process those tars, handling each of the files inside. Furthermore, I need to process a lot of these tars, and I'd like to do it relatively quickly on cheapish hardware, so I'm at least a little sensitive to memory consumption.
The natural thing to do is to iterate through the tarfile, extracting each file one at a time, and carefully closing them when done:
import tarfile

with tarfile.open(filepath, "r:gz") as tar:
    for member in tar:
        file_buf = tar.extractfile(member)
        try:
            handle(file_buf)
        finally:
            file_buf.close()
This looks like it should handle each small file and discard it when done, so memory should stay pretty tame. I was very surprised to discover that this actually uses gigabytes of memory. That's fixed if you do this:
with tarfile.open(filepath, "r:gz") as tar:
    for member in tar:
        file_buf = tar.extractfile(member)
        try:
            handle(file_buf)
        finally:
            file_buf.close()
        tar.members = []  # evil!
That works because tarfile.TarFile keeps a cache, self.members. That cache is appended to in TarFile.next(), which in turn is used by TarFile.__iter__.

The cache stores TarInfo objects, the headers describing each file. Individually they're not very large, but with lots and lots of files, those headers add up.
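You can watch the cache grow while iterating; here's a minimal sketch, assuming a large archive at a hypothetical path big.tar.gz:

import tarfile

with tarfile.open("big.tar.gz", "r:gz") as tar:  # hypothetical path
    for i, member in enumerate(tar, start=1):
        if i % 100_000 == 0:
            # One TarInfo is retained per member seen so far.
            print(f"{i} members read, cache holds {len(tar.members)} TarInfo objects")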
The net result is that it's not possible to stream a tarfile's contents without memory growing linearly with the number of files in the tarfile. This has been partially addressed in the past (see #46334, from way back in 2008), but never fully resolved. It shows up on StackOverflow and probably elsewhere, with a clumsy recommended solution of resetting tar.members each time, but there ought to be a better way.
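In the meantime, the workaround can at least be confined to one place. This is a sketch only, relying on the undocumented members attribute; iter_tar_files is a name I made up:

import tarfile

def iter_tar_files(filepath):
    # Yield (member, fileobj) pairs while keeping TarFile.members bounded.
    with tarfile.open(filepath, "r:gz") as tar:
        for member in tar:
            file_buf = tar.extractfile(member)  # None for non-regular members
            try:
                yield member, file_buf
            finally:
                if file_buf is not None:
                    file_buf.close()
                tar.members = []  # still evil, just hidden in one helper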
Your environment
CPython 3.10, mostly; I don't think OS, architecture, etc. are relevant.