even moar GCC flags - adding PGO optimizations #207

Closed

@bhaktatejas922 bhaktatejas922 commented Aug 1, 2023

# PGO build: instrument, run once to collect a profile, merge it, rebuild with it.
.PHONY: runfastpgo
runfastpgo: run.c
	$(CC) -Ofast -fprofile-instr-generate -o run run.c -lm
	./run stories15M.bin
	xcrun llvm-profdata merge -sparse default.profraw -o default.profdata
	$(CC) -Ofast -fprofile-instr-use=default.profdata -o run run.c -lm

No noticeable speed difference from runfast on an M1 Max, but it may be useful more generally, perhaps with different/larger weights or longer runs. Only tested on the default run.

DNM, needs more benchmarks

@bhaktatejas922 bhaktatejas922 changed the title from "adding PGO optimizations" to "even moar GCC flags - adding PGO optimizations" Aug 1, 2023
@mrsteyk

mrsteyk commented Aug 1, 2023

xcrun is macOS-specific. llvm-profdata isn't.
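A minimal sketch of how the merge step could avoid hard-coding `xcrun`, assuming `llvm-profdata` is on the PATH on non-macOS systems. The helper name `profdata_cmd` is hypothetical and not part of the PR; the script only prints the merge command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): decide how to invoke llvm-profdata.
# xcrun exists only on macOS, so use it only when running on Darwin.
profdata_cmd() {
    if [ "$(uname -s)" = "Darwin" ] && command -v xcrun >/dev/null 2>&1; then
        echo "xcrun llvm-profdata"
    else
        # Assumption: llvm-profdata is installed and on PATH elsewhere.
        echo "llvm-profdata"
    fi
}

# Print (rather than run) the merge step with the chosen tool.
echo "$(profdata_cmd) merge -sparse default.profraw -o default.profdata"
```

In a Makefile the same idea would probably be a variable set from `uname -s`, but that is a design choice for the repo rather than something this thread settles.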

@karpathy
Owner

karpathy commented Aug 1, 2023

If it makes no difference then why add 14 lines to the repo...

@twobob

twobob commented Aug 1, 2023

TL;DR: 3 build types, best of three runs each, on a Windows box. Temperature 0 used so the output is deterministic.

Processors: 4
Processor speed: 2.0 GHz
Processor type: AMD Athlon(tm) 5350 APU with Radeon(tm) R3
Physical memory: 3530 MB

x86_64-w64-mingw32-gcc
x86_64-w64-mingw32-gcc -march=native -fopenmp -Ofast -D_WIN32 -o run.exe -I. run.c win.c
set OMP_NUM_THREADS=4 && mingw_run.exe ../out/model110M.bin 0 1000 "and away they went" False
achieved tok/s: 62.4
achieved tok/s: 64.6
achieved tok/s: 65.4

(additional details below about how the profiling was done)
clang -Ofast -fopenmp -D_WIN32 -fprofile-instr-use=default.profdata
set OMP_NUM_THREADS=4 && run.exe ../out/model110M.bin 0 1000 "and away they went" False
achieved tok/s: 68.9
achieved tok/s: 68.2
achieved tok/s: 67.3
Note: 5 threads achieved 58.0 tok/s and 3 threads achieved 64.3 tok/s; 4 is consistently best.

(The instrumented build below uses win_clang.c instead of win.c because of these linker errors:
clang_rt.profile-x86_64.lib(WindowsMMap.c.obj) : error LNK2005: mmap already defined in win-827fb0.o
clang_rt.profile-x86_64.lib(WindowsMMap.c.obj) : error LNK2005: munmap already defined in win-827fb0.o
clang_rt.profile-x86_64.lib(WindowsMMap.c.obj) : error LNK2005: msync already defined in win-827fb0.o
I magicked the mem stuff out of the way, hence win_clang.c.)
clang -Ofast -fopenmp -D_WIN32 -fprofile-instr-generate -o run.exe -I. run.c win_clang.c
set OMP_NUM_THREADS=4 && ./run.exe ../out/model110M.bin
llvm-profdata merge -sparse default.profraw -o default.profdata
clang -Ofast -fopenmp -D_WIN32 -fprofile-instr-use=default.profdata -o run.exe -I. run.c win.c

and finally just a regular clang build for reference
clang
clang -Ofast -fopenmp -D_WIN32 -o run_clang.exe -I. run.c win.c

set OMP_NUM_THREADS=4 && run_clang.exe ../out/model110M.bin 0 1000 "and away they went"
achieved tok/s: 67.9
achieved tok/s: 68.8
achieved tok/s: 69.0

Takeaways: Clang seems to be the "better" compiler (roughly 68 vs 64 tok/s here).
It's a faff to have the macros when linking against clang_rt.profile-x86_64.lib.
The profiling did not seem to bring any benefits on the face of it.
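To make the comparison concrete, here is a small sketch that averages the three tok/s readings reported above for each build. The `mean` helper is hypothetical, and three runs per build is a very small sample, so treat the output as a rough summary rather than a benchmark.

```shell
#!/bin/sh
# Hypothetical helper: arithmetic mean of its arguments, one decimal place.
mean() {
    printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.1f\n", sum / NR }'
}

# tok/s figures copied from the three builds reported in this comment.
echo "mingw gcc:  $(mean 62.4 64.6 65.4) tok/s"
echo "clang PGO:  $(mean 68.9 68.2 67.3) tok/s"
echo "clang:      $(mean 67.9 68.8 69.0) tok/s"
```

The means come out at about 64.1, 68.1 and 68.6 tok/s, so the PGO and plain clang builds are within run-to-run noise of each other.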

I'm no expert but you looked like you might want some metrics. hope it helps

¯\\\_(ツ)\_/¯

Eh, full disclosure: those speeds are with pull #95 implemented, but that doesn't affect the relative speeds.

@bhaktatejas922
Author

> If it makes no difference then why add 14 lines to the repo...

True, opened this PR in a sleep-deprived stupor. Closing since the optimizations only seem to help ~10% on gcc and not at all on clang.
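For anyone rerunning the gcc side of this: the `-fprofile-instr-generate` / `-fprofile-instr-use` flags in the PR are clang's, while GCC's own PGO uses `-fprofile-generate` and `-fprofile-use` with no separate merge step. A rough sketch that just prints the two sequences side by side; the `pgo_steps` function is hypothetical, and the file names are taken from the Makefile snippet in this PR.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) the PGO build steps per compiler family.
pgo_steps() {
    case "$1" in
        gcc)
            # GCC writes .gcda profile files during the training run and
            # reads them back with -fprofile-use; no merge tool is needed.
            echo "gcc -Ofast -fprofile-generate -o run run.c -lm"
            echo "./run stories15M.bin"
            echo "gcc -Ofast -fprofile-use -o run run.c -lm"
            ;;
        clang)
            # clang writes default.profraw, which llvm-profdata must merge.
            echo "clang -Ofast -fprofile-instr-generate -o run run.c -lm"
            echo "./run stories15M.bin"
            echo "llvm-profdata merge -sparse default.profraw -o default.profdata"
            echo "clang -Ofast -fprofile-instr-use=default.profdata -o run run.c -lm"
            ;;
        *)
            echo "usage: pgo_steps gcc|clang" >&2
            return 1
            ;;
    esac
}

pgo_steps gcc
```

Note that on macOS `gcc` is usually an alias for Apple clang, which is presumably why `$(CC)` accepted the clang flags in the original recipe.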
