Int8 pipeline parallelism #1482

psinger · 2025-01-22T09:04:36Z

I am trying to use CUDA streams for pipeline parallelism, i.e. executing different parts of a model at the same time on different GPUs.
With int4, float16, and bfloat16, everything seems to work as expected.

However, with int8 something appears to be blocking, and the GPUs execute sequentially.
Since int4 works, I am wondering whether anyone knows of a blocking operation in the int8 path.

Thanks!

TimDettmers · 2025-02-28T15:24:49Z

Are you also using bitsandbytes code for int4, or only for int8? Some operations in bitsandbytes force the CUDA device before C calls, because not doing so sometimes introduced bugs. It might be that this is causing your problems.

This behavior was changed in 0.45. Can you check your bitsandbytes version and see if you still have this problem with the newer version?
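A quick way to run that check is sketched below. It uses only the standard library; the helper names `release_tuple` and `has_device_change` are made up for illustration, and the 0.45 cutoff is taken from the comment above rather than from the changelog.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def release_tuple(v):
    """Return the leading numeric components of a version string as ints."""
    m = re.match(r"\d+(?:\.\d+)*", v)
    return tuple(int(p) for p in m.group(0).split(".")) if m else ()


def has_device_change(v, minimum=(0, 45)):
    """True if version string v is at least `minimum` (0.45 per the comment above)."""
    return release_tuple(v) >= minimum


try:
    installed = version("bitsandbytes")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("bitsandbytes is not installed in this environment")
elif has_device_change(installed):
    print(f"bitsandbytes {installed}: 0.45 or newer")
else:
    print(f"bitsandbytes {installed}: older than 0.45, consider upgrading and retesting")
```

If the installed version is older than 0.45, upgrading and rerunning the multi-GPU test would show whether the device handling was the cause.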

matthewdouglas · 2025-02-28T16:25:15Z

One further thing to note is that int8 has a host-device synchronization that is forced when decomposing the problem into separate int8 and fp16 matmuls. Using threshold=0.0 should avoid that, and will be faster in general, at the potential cost of accuracy.
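As a rough sketch of what that setting could look like, assuming the model is loaded through Hugging Face transformers (the thread does not say how it is loaded): there, the threshold is exposed as `llm_int8_threshold` on `BitsAndBytesConfig`. The names `INT8_NO_DECOMPOSITION` and `build_quantization_config` are hypothetical and exist only for this example.

```python
# Settings for int8 loading with the outlier threshold set to 0.0, which,
# per the comment above, should avoid the int8/fp16 decomposition path.
INT8_NO_DECOMPOSITION = {
    "load_in_8bit": True,
    "llm_int8_threshold": 0.0,
}


def build_quantization_config():
    """Build a transformers BitsAndBytesConfig from the settings above.

    transformers is imported lazily so that this module can be imported
    even where transformers is not installed.
    """
    from transformers import BitsAndBytesConfig

    return BitsAndBytesConfig(**INT8_NO_DECOMPOSITION)
```

The result would be passed as `quantization_config=build_quantization_config()` to `from_pretrained`. When constructing layers with bitsandbytes directly, the corresponding argument is `threshold=0.0` on `bnb.nn.Linear8bitLt`. Either way, it is worth comparing output quality against the default threshold, given the accuracy caveat above.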

TimDettmers added the labels "High Risk" (Risk of bugs in transformers and other libraries) and "medium priority" (will be worked on after all high priority issues) on Feb 28, 2025
