
PERF: proof of concept, parallel installation of wheels #12816

Draft
wants to merge 1 commit into main
Conversation

morotti
Contributor

@morotti morotti commented Jul 1, 2024

Hello,

This is a proof of concept to install wheels in parallel, mostly to demonstrate that it's possible.

It took a whole 10 minutes with ChatGPT to parallelize your code :D (and a whole week of debugging pip for other performance issues to understand how to go about it).
I don't expect this to be merged; however, if somebody wants to add a proper optional argument --parallel-install=N, it should be possible to merge.

There are two ways to go about parallelizing the pip installation:

  • Either you parallelize the extraction of files within a wheel, i.e. the save() function. I don't think this is worthwhile: a lot of files are just kilobytes and a wheel might only have a few files, so there could be a lot of overhead to manage threads and other bits. Besides, the ZipFile extractor is very suboptimal and would be better rewritten, maybe with a bit of async I/O.
  • Or you parallelize the extraction of wheels. This is what this PR does.

It's pretty easy to do with a ThreadPool. It could be done with a ProcessPool for much larger performance improvements, but managing errors across subprocesses is problematic and I'm not sure fork/ProcessPool is available on all systems that pip has to support.
So let's investigate what gain you can get with a simple ThreadPool :)
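As a rough illustration, the approach can be sketched like this (a minimal sketch, not pip's actual code; `install_wheel` here is a simplified stand-in for pip's extraction logic, which also handles RECORD files, scripts, etc.):

```python
# Sketch: extract each wheel in a worker thread instead of a sequential loop.
import zipfile
from concurrent.futures import ThreadPoolExecutor

def install_wheel(wheel_path, dest):
    # Wheels are zip archives; real pip does much more than extraction,
    # but extraction is the bulk of the I/O.
    with zipfile.ZipFile(wheel_path) as zf:
        zf.extractall(dest)

def install_all(wheel_paths, dest, max_workers=2):
    # Submit one extraction task per wheel; collecting each future's
    # result re-raises any worker exception in the main thread.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(install_wheel, p, dest) for p in wheel_paths]
        for f in futures:
            f.result()
```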

Turns out, there isn't much to get because of the GIL, unless your filesystem has very high latency (NFS).
You get a small gain with 2 threads, usually no more improvement with 3 threads, it gets worse with 4+ threads.

I note there are multiple critical bugs in pip that make the extraction very slow and inefficient and that hold the GIL.
There should be room for more improvement after all these PRs are merged:

pending PR: #12803
pending PR: #12782
pending PR for download, don't try to parallelize downloads without that fix #12810

some benchmarks below on different types of disks:

[image: benchmark results]
Above: run on NFS. NFS is high latency and the gains are substantial.

[image: benchmark results]
Above: run on two different disks: /tmp, which should be in memory (or close to it), and a local volume that is block storage on a virtual machine.

[image: benchmark results]
Above: run with larger packages for comparison.
tensorflow is 1-2 GB extracted; pandas, numpy, numba, llvmlite, and scikit-learn are around 50-100 MB each.

I do note that pip seems to install packages alphabetically, which is not ideal.
tensorflow (and torch) fall toward the very end, and I'd rather they start toward the beginning: a lot of the extraction time is spent waiting at the end for the final package, tensorflow, to finish. It would complete sooner if it started sooner.
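One way to avoid the big-package-last problem would be to submit the largest wheels first, using archive size as a cheap proxy for extraction time (a hypothetical helper, not something pip does today):

```python
# Sketch: order wheels largest-first so a multi-GB package (e.g. tensorflow)
# doesn't start last and dominate the tail of a parallel install.
import os

def order_for_parallel_install(wheel_paths):
    # sorted() is stable, so equally sized wheels keep their input order.
    return sorted(wheel_paths, key=os.path.getsize, reverse=True)
```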

@notatallshaw
Member

notatallshaw commented Jul 1, 2024

I recently opened an issue on this, given that uv did this with almost no reported issues.

One issue I did notice since I posted was this one: astral-sh/uv#4328

So it's definitely a test case that should be checked.

There are other edge cases I would be concerned about as well, e.g. What about two packages of the same name? What if one was editable and the other was not?

@morotti
Contributor Author

morotti commented Jul 1, 2024

astral-sh/uv#4328

That one is an issue on Windows. It's hard to tell what uv is doing because it works from some cache with hardlinks or symlinks.

pip's extraction always erases an existing file (a call to unlink()) and writes it again.
The problem arises if 2 threads concurrently try to erase/rewrite the same __init__ file:

  • I think it works on Linux because Linux allows an opened file to be removed.
  • I think it doesn't work on Windows because Windows doesn't allow an opened file to be deleted while it's being written to.

Quick thought: I wonder if a solution would be to have a write lock per package directory, venv/python3.x/site-packages/<packagename>. That would allow parallel extraction while handling packages that erase each other's files. (I think we have to avoid a lock per file because the locking would be slower than the extraction.)
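The per-package-directory lock idea could look roughly like this (a hypothetical sketch; the names and structure are illustrative, not pip code):

```python
# Sketch: one write lock per top-level package directory, so two threads
# extracting into the same site-packages/<packagename> serialize, while
# different packages proceed in parallel.
import threading
from collections import defaultdict

_locks = defaultdict(threading.Lock)
_locks_guard = threading.Lock()  # protects lazy creation of per-dir locks

def lock_for_package(package_dir):
    # defaultdict creates the lock on first access; guard that creation
    # so two threads can't race to create two different locks.
    with _locks_guard:
        return _locks[package_dir]

def write_under_lock(package_dir, write):
    # The unlink() + rewrite sequence runs under the per-package lock.
    with lock_for_package(package_dir):
        write()
```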

There are other edge cases I would be concerned about as well, e.g. What about two packages of the same name? What if one was editable and the other was not?

I think that makes no sense from the installation perspective: the installation loop runs after the resolver.

The resolver should have already resolved which packages to install, i.e. one package per name.
I'm not sure there is a way to force pip to install conflicting packages like pip install ./pandas-1.2.3.whl ./pandas-1.2.4.whl ./pandas-1.2.5.whl -e ./gitclone/pandas

@morotti morotti mentioned this pull request Jul 17, 2024
1 task
@morotti
Contributor Author

morotti commented Dec 11, 2024

Some newer benchmarks, with the latest pip main branch.

Running with the Python interpreters from the official manywheel container image.
Running on a VM; the location is some sort of block storage for VMs, I'm not sure exactly what.

The previous benchmarks in this discussion were made around pip 24.1.
I made a lot of patches over pip 24.2 and pip 24.3 that made pip more than twice as fast and reduced CPU usage to a fraction.
BEFORE, there was a fair gain (as much as 10%) when using 2 threads, then 3 threads and above gradually got worse than the baseline.
NOW, there is a massive performance drop when using 2 threads, and it gets much worse with every extra thread. I guess I've optimized away all the extremely inefficient I/O operations that were releasing the GIL.

Unfortunately, this demonstrates that any parallelization of the wheel extraction is utterly useless because of the GIL :D
(except on NFS or any extremely slow, high-latency filesystem).

Besides that, I wanted to try Python 3.13t with free threading (no more GIL) to see if pip install can take advantage of it.
It works! pip install can scale parallel extraction to many threads on 3.13t :D

The fastest install doesn't get below 6 seconds because the last 2 seconds are spent waiting for the last, larger package (pandas) to finish extracting. An efficient parallel installation would require better ordering of the packages to install, probably starting with the largest packages by size or by number of files.
(uv has the same problem: it can install many packages quickly, then you spend the last minute waiting for tensorflow/torch, which are multiple GB, to finish downloading and extracting.)

The set of packages is just a bunch of stuff that could install on 3.13t; there are very few packages compatible with 3.13t.
[image: benchmark results]

@ichard26
Member

This isn't something that is on our radar, but nonetheless, your testing and data are appreciated!

@ichard26 ichard26 added the type: performance Commands take too long to run label Dec 26, 2024
@ichard26
Member

I'm going to convert this to a draft as it isn't meant to be merged anyway. I'm trying to filter the open PRs down to those that have a chance of being merged (in an effort to shrink the PR backlog), and I've seen this one too many times :)

@ichard26 ichard26 marked this pull request as draft December 28, 2024 17:45