Adds Optional AVX2 Support, Cache Alignment, and Enhances Model Export Speed #94
Conversation
…ment in Config

This commit includes:

1. Optional AVX2 support for the matmul and rmsnorm functions.
2. Fused matrix multiplies with new matmul2 and matmul3 functions (rough sketch below).
3. Cache-aligned allocations for better performance and compatibility with SIMD/vector intrinsics.
4. Updated Config struct to support cache alignment. NOTE: previous models should be re-exported due to this change.
5. Enhanced performance of serialization in the Llama export code.
6. Updated Makefile to support AVX2 and OMP AVX2 builds.
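To illustrate item 2, here is a rough sketch of the fused-matmul idea (illustrative only, not the exact matmul2 from this PR): two matrix-vector products share the same input x in a single pass, so x is streamed through the cache once. A matmul3 would extend the same pattern to three weight matrices.

void matmul2(float* o1, float* o2, const float* x,
             const float* w1, const float* w2, int n, int d) {
    // w1 (d,n) @ x (n,) -> o1 (d,)  and  w2 (d,n) @ x (n,) -> o2 (d,)
    #pragma omp parallel for
    for (int i = 0; i < d; i++) {
        float val1 = 0.0f, val2 = 0.0f;
        for (int j = 0; j < n; j++) {
            float xj = x[j];            // load x[j] once, reuse it for both products
            val1 += w1[i * n + j] * xj;
            val2 += w2[i * n + j] * xj;
        }
        o1[i] = val1;
        o2[i] = val2;
    }
}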
These changes are great, but they add a ton of complexity to run.c, which makes the project not as easy to follow for less experienced developers. Nevertheless, I think most of them can be applied:
Yes, I understand the tradeoffs, and I totally get where you are coming from. Thank you for your thoughtful feedback on my PR. I understand the concern regarding added complexity, particularly with the introduction of AVX/AVX2, and I appreciate your perspective on maintaining accessibility for less experienced developers. Here are my thoughts on your points:

Export Script: I agree that the changes to the export script can be separated into a different merge request. I will proceed with that and appreciate your support for this enhancement.

Aligned Alloc and Local Variables: Thank you for recognizing this improvement. I agree that the difference in aligned memory allocation between POSIX and Windows adds a layer of complexity. However, this is a necessary change for significant performance enhancements and compatibility with SIMD/vector intrinsics. I would be glad to explore options to keep this manageable and clear in the code.

AVX/AVX2: I understand your concerns here. The AVX2 support does indeed add complexity, and adding further support for other SIMD architectures like Neon, AVX-512, etc. would increase it further. However, the AVX2 support provides at least a 30-40% speed increase, a significant improvement that might warrant the added complexity for users who need high performance. While a BLAS-based approach could potentially offer a cleaner solution, it may not provide the same speed benefits, especially given that BLAS doesn't support FP16, which AVX easily can. Also, while BLAS could enable GPU acceleration, that would add another layer of complexity and dependency, moving away from the simplicity of the current "just C" solution.

My intention with this PR was to provide an optional speed enhancement for those who require it, while keeping the base version accessible for less experienced developers or those who don't need the speed-up. I'm open to suggestions on how we can best balance performance and complexity in this project, and am willing to make the necessary changes to ensure the improvements can be integrated effectively.

Adding to my previous points, I also view this project as an excellent piece of educational material for developers who are looking to deepen their understanding of these types of systems. To this end, introducing advanced optimization techniques, such as those enabled by AVX2, can provide an invaluable learning experience. These techniques are widely used in industry, and offering exposure to them in this project can significantly benefit developers in their learning journey.

To balance the goals of education and high performance, we could consider maintaining two versions of the project: a 'vanilla' version for newer developers or those who prefer simplicity, and an 'advanced' version that includes these performance optimizations. This way, users can choose the version that best suits their needs and experience level.
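On the aligned-allocation point specifically, here is a minimal sketch of the cross-platform idea (illustrative only, not the exact code in this PR; the helper name aligned_calloc32 is made up):

#include <stdlib.h>
#include <string.h>
#ifdef _WIN32
#include <malloc.h>
#endif

// Allocate zeroed, 32-byte-aligned memory on both POSIX and Windows.
void* aligned_calloc32(size_t count, size_t size) {
    size_t bytes = count * size;
    size_t padded = (bytes + 31) & ~(size_t)31;   // round up to a multiple of 32
#ifdef _WIN32
    void* p = _aligned_malloc(padded, 32);        // must be freed with _aligned_free
#else
    void* p = NULL;
    if (posix_memalign(&p, 32, padded) != 0) p = NULL;  // freed with free
#endif
    if (p) memset(p, 0, padded);
    return p;
}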
Thanks for your comments, I totally understand where you're coming from. There are defo possible trade-offs to be made.
Thank you for your thoughtful feedback. I completely agree with your points on the importance of both simplicity and performance, and I understand the need to maintain a balance so that the project remains accessible to developers of all skill levels while still performing optimally.

My primary intention with these changes was to introduce performance optimization techniques and demonstrate how they can coexist within an accessible codebase like llama2.c. I also wanted to ensure that newer developers have the opportunity to see and learn from these techniques, as they are often crucial in real-world applications.

Your suggestion to obtain performance numbers to better understand the trade-offs is a great idea. I am willing to run performance tests and provide more data to help inform the decision-making process. I do believe that the substantial speed gains we're seeing (up to 30-40%) could justify the added complexity, especially since these optimizations can be wrapped in an #ifdef guard to allow users to enable or disable them according to their needs.

Also, I appreciate your comment on having an advanced version like llama.cpp. It makes sense to have a high-performance variant alongside a simpler, more educational variant. However, given that llama.cpp introduces additional complexities such as GPUs, BLAS, quantization, etc., there might be value in considering llama2.c as a middle ground, where we introduce more advanced techniques such as SIMD while keeping the rest of the codebase relatively simple.

Ultimately, the direction of the project lies with @karpathy and the community's vision, but I'm hopeful we can find a way to incorporate these performance enhancements in a way that aligns with that vision. I look forward to further discussions and feedback.

I've created the PR for the exporter changes. Very much hope it works out for you.
Here are some preliminary benchmarks for your consideration.

Baseline

f42@formica:~/dev/llama2.c$ ./run out44m/model44m.bin
<s>
Once upon a time, there was a boy named Timmy. Timmy loved to play outside and look for treasure. One day, he found a big chest buried in the ground. He was so excited!
Timmy ran home to show his mom the chest. "Mommy, mommy, look what I found!" he said, holding up the chest.
His mom looked at the chest and said, "Wow, that's a big chest! Let's open it!"
Inside the chest, there was a stuffed bear that was very hairy. Timmy loved the bear and hugged it tight.
But then, Timmy's little sister came in and wanted to play with the bear too. Timmy didn't want to share, so he folded the bear up upside down and said, "No, it's mine!"
His mom reminded him, "Timmy, it's important to share with others. Remember, it's nice to share."
Timmy thought about what his mom said and realized she was right. He cut the bear in half and gave one half to his sister. They both hugged
achieved tok/s: 20.202020

Fast

f42@formica:~/dev/llama2.c$ ./run out44m/model44m.bin
<s>
One day, a little girl named Sue wanted to help her mom. Her mom was making tasty food in the kitchen. Sue asked, "Mom, can I help you make the food?" Her mom said, "Yes, you can help me cut the celery."
Sue was very happy to help. She took the celery and started to cut it. But then, something unexpected happened. A big, funny dog came into the kitchen. The dog saw the celery and wanted some too.
The dog jumped up and took the celery from Sue. Sue was sad and cried. The dog ran away with the celery and Sue did not get any tasty food that day. The dog ate the celery and Sue was still sad.
<s>
One day, a little girl named Lily went for a walk. She saw a wide tree with a happy face. The tree was smiling at her. She was very happy to see the tree.
Lily saw a big red ball under the tree. She wanted to play with it. She tried to push the ball, but it was too wide. Lily tried and tried, but she could not move the ball.
Finally, Lily had an
achieved tok/s: 53.906085

Fast AVX2

f42@formica:~/dev/llama2.c$ ./run out44m/model44m.bin
<s>
Once upon a time, there was a little girl. Her name was Mary. Mary was very small but she was also very brave.
One sunny day, Mary was playing in the garden when she saw a big, hairy bug on the wall.
"Oh, look at the bug!" Mary said.
When Mary realized what the bug was, she was so surprised. She had never seen a hairy bug before.
"What kind of bug is it?" Mary said.
She walked over to it and jumped up and down. The bug didn't move.
Then her mom called out, "Mary, come here, I want to give you a hug!"
So Mary hugged her mom tight and then she looked at the hairy bug.
"That bug is so hairy," Mary said.
"Yes," said her mom, "And how did it get up so high?"
Mary smiled and said, "I just followed it down to the ground and then put my arm under it."
And so Mary they did just that, just like Mary did that day.
<s>
Once upon a time, there was a little boy named Timmy. Timmy loved to eat fruit, especially app
achieved tok/s: 78.383344

OMP

f42@formica:~/dev/llama2.c$ OMP_NUM_THREADS=12 ./run out44m/model44m.bin
<s>
Once upon a time, there was a little girl named Lily. She loved to watch cartoons on TV. One day, she saw a funny cartoon about a silly dog. She laughed and laughed until her mom came in.
"Mommy, can we wrap this cake for your birthday?" asked Lily.
"Sure, sweetie," replied her mom. "Let's go to the kitchen and get the wrapping paper."
Lily helped her mom wrap the cake and put it in the oven. Later that day, Lily and her friends went on a parade. They saw a parade with lots of funny animals like a clown and a princess.
"Look at those funny clowns!" said Lily.
"They're not very normal," agreed her friend, Sarah.
Lily didn't mind though, she loved spending time with her friends and watching the parade. When they got home, they had a big piece of cake and it was the best birthday ever.
<s>
Once upon a time, there was a little boy named Timmy. Timmy loved to play outside with his friends. One day, Timmy and
achieved tok/s: 254.220457

OMP/AVX2

f42@formica:~/dev/llama2.c$ OMP_NUM_THREADS=12 ./run out44m/model44m.bin
<s>
Once upon a time, there was a messy dog named Spot. Spot loved to play outside and make his toys lay all over the place. One day, he found a can of paint and decided to have some fun and spread it all over his toys.
As Spot played with his toys, he made a big mess. The paint dripped down the toys like rain. Spot didn't know he was making a mess, so he kept on playing and getting it on his toys.
When Spot's mom saw the mess, she was sad. She told Spot that he needed to clean up and not make a mess for real long. Spot learned that it was better to have fun and not make a mess than to make things right.
<s>
Once upon a time, there was a little girl named Lily. She was very scared of ghosts. One night, she saw a ghost in her room. She screamed and ran to her mom.
"Mommy, there's a ghost in my room!" Lily cried.
"It's okay, Lily. Ghosts aren't real. Let's measure how far away you go,"
achieved tok/s: 290.249433
Nice, thank you. I think the trade-offs are more clear now.
No, yours was a separate PR so I haven't integrated that yet. The loop-unrolling stuff is a good idea. Will post benchmarks on your PR.
BTW, one possible middle-ground solution which potentially wouldn't add too much complexity and would be portable: use vector extensions (https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html). That would allow using SIMD in a portable way with minimal changes via:
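For example (a rough sketch only; the original snippet isn't shown here, the v8f type name is just illustrative, and it assumes 32-byte-aligned buffers):

typedef float v8f __attribute__((vector_size(32)));  // 8 floats, compiler picks the SIMD lowering

void matmul(float* o, const float* x, const float* w, int n, int d) {
    // W (d,n) @ x (n,) -> o (d,); assumes n is a multiple of 8 and w/x are 32-byte aligned
    #pragma omp parallel for
    for (int i = 0; i < d; i++) {
        v8f acc = {0, 0, 0, 0, 0, 0, 0, 0};
        const v8f* wv = (const v8f*)&w[i * n];
        const v8f* xv = (const v8f*)x;
        for (int j = 0; j < n / 8; j++) {
            acc += wv[j] * xv[j];   // plain operators; the compiler emits SIMD for the target
        }
        float val = 0.0f;
        for (int k = 0; k < 8; k++) val += acc[k];  // horizontal sum
        o[i] = val;
    }
}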
combined with your changes to the aligned memory. Most of the perf gain comes either from reducing how memory-bound we are or from accelerating matmul, so there is still tons of space to improve. It's just about finding the balance between complexity and performance.
Really appreciate this work. As both of you mentioned, I'd like to keep this project readable and not go too wild on optimizations; that space will be best served by llama.cpp. So I'm looking for the simplest tweaks that maintain readability and cross-platformness as a solid foundation, and those should go into master. I'm sure there will be some proliferation of forks/other repos that create extensions with slightly different tradeoffs, which is great.
Really awesome PR @Foundation42, thank you for posting this. I've implemented the fused matmul and the aligned allocation based on your PR in my project and seen a 12% lift in token/s. I've added you to the credits, thanks for teaching me something new!
Bleh, I tried to cherry-pick minimal changes from this PR to use AVX2 but I get a Segmentation Fault... Is there no way to just surgically use a single AVX2 matmul in the code with minimal changes? Is it necessary to do the cache alignment part and the model re-export parts?
Yes, I am pretty sure that for using these intrinsics you need to have the alignment set to 32 bytes. If you want to avoid having to change the allocation patterns, you can try using the unaligned load operations (_mm256_loadu_ps):

#ifdef __AVX2__
#include <immintrin.h> // AVX2

void matmul(float* o, const float* x, const float* w, int n, int d) {
    // W (d,n) @ x (n,) -> o (d,)
    int nn = n / 8 * 8; // round n down to a multiple of 8
    #pragma omp parallel for
    for (int i = 0; i < d; i++) {
        __m256 sum_vec = _mm256_setzero_ps(); // for AVX2, sum of 8 floats
        int i_n = i * n;
        for (int j = 0; j < nn; j += 8) {
            // Load 8 values from w and x (unaligned loads, so no 32-byte alignment required)
            __m256 w_vec = _mm256_loadu_ps(&w[i_n + j]);
            __m256 x_vec = _mm256_loadu_ps(&x[j]);
            // Multiply and accumulate
            __m256 prod_vec = _mm256_mul_ps(w_vec, x_vec);
            sum_vec = _mm256_add_ps(sum_vec, prod_vec);
        }
        // Horizontal add: after two hadds, lanes 0 and 4 hold the sums of the
        // lower and upper 128-bit halves
        sum_vec = _mm256_hadd_ps(sum_vec, sum_vec);
        sum_vec = _mm256_hadd_ps(sum_vec, sum_vec);
        float vals[8];
        _mm256_storeu_ps(vals, sum_vec);
        float val = vals[0] + vals[4];
        // handle the remainder if n is not a multiple of 8
        for (int j = nn; j < n; j++) {
            val += w[i_n + j] * x[j];
        }
        o[i] = val;
    }
}
#else
void matmul(float* xout, float* x, float* w, int n, int d) {
    // W (d,n) @ x (n,) -> xout (d,)
    // by far the most amount of time is spent inside this little function
    int i;
    #pragma omp parallel for private(i)
    for (i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += w[i * n + j] * x[j];
        }
        xout[i] = val;
    }
}
#endif

You must pass -mavx2 (or -march=native) when compiling, otherwise __AVX2__ is not defined and the scalar path is used.
Thank you @cgbur, that worked!! On my machine this takes the 110M model from 26 tok/s -> 33 tok/s, for around a 27% boost. Very cool. I like AVX2. I will look into memory alignment.
Random note: I love that via llama2.c I've learned so much that I didn't know previously. I haven't spent as much time "below Python", but it's really fun here.
I don't fully understand which part of the export guarantees the memory alignment to 32 bytes. There are a number of changes there (e.g. the change to HalfFloat) mixed in. What is the minimal diff that would be needed to make the export work with aligned AVX2 intrinsics?
I'm not too familiar with the details of the PR changes, but it looks like by ensuring that the header is aligned properly, the rest of it becomes aligned correctly?

# header magic version integer added for two reasons
# 1) so that we can version the header
# 2) so that the struct maintains strict cache alignment
#    which is necessary so that the weights that follow the header are also cache aligned
header_magic_version = 0x42000000
header = struct.pack('iiiiiiii', header_magic_version, p['dim'], hidden_dim, n_layers, p['n_heads'],
                     n_kv_heads, -p['vocab_size'], p['max_seq_len'])

// Config structure needs to be CACHE ALIGNED (Typically 32 Bytes)
// If you change this, it is important that export_meta_llama_bin.py is updated as well

IIRC your current config is 28 bytes, which might throw off the rest of the weights. Once again, just a guess. I will have to play with it later to be more certain.
Ahhh, maybe that makes sense. With the addition of header_magic_version we'd be writing 8 ints, not 7, in the header, so 8*4 = 32 bytes. Then it still has to be the case that in run.c, later on, the pointer that holds the weights is aligned. The RunState is aligned because of the use of the special calloc function. But the weights don't seem to be aligned, yet they are still checked for alignment. :\
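To make that concrete, the kind of check being discussed amounts to something like this (a sketch, not quoting run.c; the names weights_ptr and Config usage are just for illustration):

#include <stdint.h>

// true if p sits on a 32-byte boundary, as required for aligned AVX2 loads
static int is_32byte_aligned(const void* p) {
    return ((uintptr_t)p & 31) == 0;
}

// usage sketch: float* weights_ptr = (float*)(data + sizeof(Config));
// if (!is_32byte_aligned(weights_ptr)) { /* fall back to unaligned loads */ }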
Minor nit: the author says cache aligned, but most cache lines today are 64 bytes. I think it is more appropriate to say that it's vector-width aligned. A value of 32 here means that you are supporting 32*8 = 256-bit instructions.
I think they are depending on the dimensions of the model being sane enough to stumble into being aligned. I assume these models are designed for GPUs with much greater alignment requirements, so it's probably a decent bet that the model weights will fit nicely. Instead of putting a magic value, you could write a u32 that is the alignment of the model by checking all the dimensions? Just occurred to me. But then I guess it's probably safer to make the config 64 bytes to support AVX-512. I would need to test these things before making more claims :).

Another reason for stumbling so successfully is that mmap aligns to the page boundary when null is passed, which run.c is doing. I realized this in my implementation because Zig required me to specify page alignment on the return type. This is why ensuring that the weights data has a good alignment is enough to ensure that the in-memory mapping is aligned properly.

const data: []align(mem.page_size) u8 = try std.os.mmap(null, file_size, std.os.PROT.READ, std.os.MAP.PRIVATE, mapped_checkpoint.handle, 0);
This pull request introduces several key optimizations and enhancements to the Llama2.c codebase.
Key Changes:

1. Optional AVX2 support for the matmul and rmsnorm functions.
2. Fused matrix multiplies via new matmul2 and matmul3 functions.
3. Cache-aligned allocations for better performance and compatibility with SIMD/vector intrinsics.
4. An updated Config struct to support cache alignment (previously exported models must be re-exported).
5. Faster serialization in the Llama export code.
6. An updated Makefile with AVX2 and OMP AVX2 build targets.
These changes collectively result in a substantial performance boost and make the code more flexible and efficient. I hope these enhancements prove to be beneficial to the project. I look forward to your feedback and am ready to make any necessary adjustments.
Please be aware that, due to the new cache alignment requirements, model re-exporting is necessary. This PR introduces a version field in the file header, which simultaneously serves as padding for cache alignment; as a result, pre-existing models will need to be re-exported to ensure proper functionality.
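For context, the updated config header would look roughly like this (a sketch; the field names, in particular magic_version, are illustrative):

// With the added magic/version integer the header is 8 * 4 = 32 bytes, so the
// weights written immediately after it start on a 32-byte boundary.
typedef struct {
    int magic_version; // header version; also pads the struct to 32 bytes
    int dim;           // transformer dimension
    int hidden_dim;    // FFN hidden dimension
    int n_layers;      // number of layers
    int n_heads;       // number of query heads
    int n_kv_heads;    // number of key/value heads
    int vocab_size;    // vocabulary size
    int seq_len;       // max sequence length
} Config;              // sizeof(Config) == 32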