
PERF: proof of concept, parallel installation of wheels #12816

Draft
wants to merge 1 commit into main
Conversation

morotti
Contributor

@morotti morotti commented Jul 1, 2024

Hello,

This is a proof of concept to install wheels in parallel, mostly to demonstrate that it's possible.

It took a whole 10 minutes with ChatGPT to parallelize your code :D (and a whole week of debugging pip for other performance issues to understand how to go about it).
I don't expect this to be merged; however, if somebody wants to add a proper optional argument --parallel-install=N, it should be possible to merge.

There are two ways to go about parallelizing the pip installation:

  • Either you parallelize the extraction of files within a wheel, i.e. the save() function. I don't think this is worthwhile: a lot of files are just kilobytes and a wheel might only have a few files, so there could be a lot of overhead to manage threads and other bits. Besides, the ZipFile extractor is very suboptimal and would be better rewritten, maybe with a bit of async I/O.
  • Or you parallelize the extraction of wheels. This is what this PR does.

It's pretty easy to do with a ThreadPool. It could be done with a ProcessPool for much larger performance improvements, but managing errors across subprocesses is problematic and I'm not sure fork/ProcessPool is available on all systems that pip has to support.
So let's investigate what gain you can get with a simple ThreadPool :)
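As a rough illustration, the approach can be sketched like this (a minimal sketch, not pip's actual code; `install_wheel` here is a simplified stand-in for pip's extraction logic, which also handles RECORD files, scripts, etc.):

```python
# Sketch: extract each wheel in a worker thread instead of a sequential loop.
import zipfile
from concurrent.futures import ThreadPoolExecutor

def install_wheel(wheel_path, dest):
    # Wheels are zip archives; real pip does much more than extraction,
    # but extraction is the bulk of the I/O.
    with zipfile.ZipFile(wheel_path) as zf:
        zf.extractall(dest)

def install_all(wheel_paths, dest, max_workers=2):
    # Submit one extraction task per wheel; collecting each future's
    # result re-raises any worker exception in the main thread.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(install_wheel, p, dest) for p in wheel_paths]
        for f in futures:
            f.result()
```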

Turns out, there isn't much to get because of the GIL, unless your filesystem has very high latency (NFS).
You get a small gain with 2 threads, usually no more improvement with 3 threads, it gets worse with 4+ threads.

I note there are multiple critical bugs in pip that make the extraction very slow and inefficient and that hold the GIL.
There should be room for more improvement after all these PRs are merged:

pending PR: #12803
pending PR: #12782
pending PR for download, don't try to parallelize downloads without that fix #12810

some benchmarks below on different types of disks:

[image: benchmark results]
Above: run on NFS. NFS is high latency and the gains are substantial.

[image: benchmark results]
Above: run on two different disks: /tmp, which should be in memory (or close to it), and a local volume that is block storage on a virtual machine.

[image: benchmark results]
Above: run with larger packages for comparison.
tensorflow is 1-2 GB extracted; pandas, numpy, numba, llvmlite, and scikit-learn are around 50-100 MB each.

I do note that pip seems to install packages alphabetically, which is not ideal.
tensorflow (and torch) fall toward the very end, and I'd rather they start toward the beginning: a lot of the extraction time is spent waiting at the end for the final package, tensorflow, to finish. It would complete sooner if it started sooner.
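One way to avoid the big-package-last problem would be to submit the largest wheels first, using archive size as a cheap proxy for extraction time (a hypothetical helper, not something pip does today):

```python
# Sketch: order wheels largest-first so a multi-GB package (e.g. tensorflow)
# doesn't start last and dominate the tail of a parallel install.
import os

def order_for_parallel_install(wheel_paths):
    # sorted() is stable, so equally sized wheels keep their input order.
    return sorted(wheel_paths, key=os.path.getsize, reverse=True)
```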

@notatallshaw
Member

notatallshaw commented Jul 1, 2024

I recently opened an issue on this, given that uv did this with almost no reported issues.

One issue I did notice since I posted was this one: astral-sh/uv#4328

So it's definitely a test case that should be checked.

There are other edge cases I would be concerned about as well, e.g. What about two packages of the same name? What if one was editable and the other was not?

@morotti
Contributor Author

morotti commented Jul 1, 2024

astral-sh/uv#4328

That one is an issue on Windows. It's hard to tell what uv is doing because it works from some cache with hardlinks or symlinks.

pip's extraction always erases an existing file (a call to unlink()) and writes it again.
The problem arises if 2 threads concurrently try to erase/rewrite the same __init__ file:

  • I think it works on Linux because Linux allows an opened file to be removed.
  • I think it doesn't work on Windows because Windows doesn't allow an opened file to be deleted while it's being written to.

Quick thought: I wonder if a solution would be to have a write lock per package directory, venv/python3.x/site-packages/<packagename>. That would allow parallel extraction while handling packages that erase each other's files. (I think we have to avoid a lock per file because the locking would be slower than the extraction.)
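The per-package-directory lock idea could look roughly like this (a hypothetical sketch; the names and structure are illustrative, not pip code):

```python
# Sketch: one write lock per top-level package directory, so two threads
# extracting into the same site-packages/<packagename> serialize, while
# different packages proceed in parallel.
import threading
from collections import defaultdict

_locks = defaultdict(threading.Lock)
_locks_guard = threading.Lock()  # protects lazy creation of per-dir locks

def lock_for_package(package_dir):
    # defaultdict creates the lock on first access; guard that creation
    # so two threads can't race to create two different locks.
    with _locks_guard:
        return _locks[package_dir]

def write_under_lock(package_dir, write):
    # The unlink() + rewrite sequence runs under the per-package lock.
    with lock_for_package(package_dir):
        write()
```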

There are other edge cases I would be concerned about as well, e.g. What about two packages of the same name? What if one was editable and the other was not?

I think that makes no sense from the installation perspective: the installation loop runs after the resolver.

The resolver should have already resolved which packages to install, i.e. one package per name.
I'm not sure there is a way to force pip to install conflicting packages like pip install ./pandas-1.2.3.whl ./pandas-1.2.4.whl ./pandas-1.2.5.whl -e ./gitclone/pandas

@morotti morotti mentioned this pull request Jul 17, 2024
1 task
@morotti
Contributor Author

morotti commented Dec 11, 2024

Some newer benchmarks, with the latest pip main branch.

Running with the Python interpreters from the official manywheel container image.
Running on a VM; the location is some sort of block storage for VMs, I'm not sure exactly what.

The previous benchmarks in this discussion were made around pip 24.1.
I made a lot of patches over pip 24.2 and pip 24.3 that made pip more than twice as fast and reduced CPU usage to a fraction.
BEFORE, there was a fair gain (as much as 10%) when using 2 threads, then 3 threads and above gradually got worse than the baseline.
NOW, there is a massive performance drop when using 2 threads, and it gets much worse with every extra thread. I guess I've optimized away all the extremely inefficient I/O operations that were releasing the GIL.

Unfortunately, this demonstrates that any parallelization of the wheel extraction is utterly useless because of the GIL :D
(except on NFS or any extremely slow, high-latency filesystem).

Besides that, I wanted to try Python 3.13t with free threading (no more GIL) to see if pip install can take advantage of it.
It works! pip install can scale parallel extraction to many threads on 3.13t :D

The fastest install doesn't get below 6 seconds because the last 2 seconds are spent waiting for the last, larger package (pandas) to finish extracting. An efficient parallel installation would require better ordering of the packages to install, probably starting with the largest packages by size or by number of files.
(uv has the same problem: it can install many packages quickly, then you spend the last minute waiting for tensorflow/torch, which are multiple GB, to finish downloading and extracting.)

The set of packages is just a bunch of stuff that could install on 3.13t; there are very few packages compatible with 3.13t.
[image: benchmark results]

@ichard26
Member

This isn't something that is on our radar, but nonetheless, your testing and data are appreciated!

@ichard26 ichard26 added the type: performance Commands take too long to run label Dec 26, 2024
@ichard26
Member

I'm going to convert this to a draft as it isn't meant to be merged anyway. I'm trying to filter the open PRs down to those that have a chance of being merged (in an effort to shrink the PR backlog), and I've seen this one too many times :)

@ichard26 ichard26 marked this pull request as draft December 28, 2024 17:45