
Pi and e to Float32 and Float16 #559

Open · wants to merge 4 commits into main
Conversation

christiangnrd
Member

Bypasses the conversion to BigFloat when converting pi and e to Float32 and Float16 on gpu. Values are taken from the constants in Tables 6.5 and 6.6 of the Metal shading language specification.

Closes #551


github-actions bot commented Mar 4, 2025

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (git runic main) to apply these changes.

Suggested changes:
diff --git a/test/device/intrinsics/math.jl b/test/device/intrinsics/math.jl
index de7ab2f4..34095e3b 100644
--- a/test/device/intrinsics/math.jl
+++ b/test/device/intrinsics/math.jl
@@ -312,32 +312,32 @@ end
         @test occursin(Regex("@air\\.sign\\.f$(8*sizeof(T))"), ir)
     end
 
-    # Borrowed from the Julia "Irrationals compared with Rationals and Floats" testset
-    @testset "Comparisons with $irr" for irr in (π, ℯ)
-        function convert_test_32(res)
-            res[1] = Float32(irr,RoundDown) < irr
-            res[2] = Float32(irr,RoundUp) > irr
-            res[3] = !(Float32(irr,RoundDown) > irr)
-            res[4] = !(Float32(irr,RoundUp) < irr)
-            return nothing
+        # Borrowed from the Julia "Irrationals compared with Rationals and Floats" testset
+        @testset "Comparisons with $irr" for irr in (π, ℯ)
+            function convert_test_32(res)
+                res[1] = Float32(irr, RoundDown) < irr
+                res[2] = Float32(irr, RoundUp) > irr
+                res[3] = !(Float32(irr, RoundDown) > irr)
+                res[4] = !(Float32(irr, RoundUp) < irr)
+                return nothing
+            end
+
+            res_32 = MtlArray(zeros(Bool, 4))
+            Metal.@sync @metal convert_test_32(res_32)
+            @test all(Array(res_32))
+
+            function convert_test_16(res)
+                res[1] = Float16(irr, RoundDown) < irr
+                res[2] = Float16(irr, RoundUp) > irr
+                res[3] = !(Float16(irr, RoundDown) > irr)
+                res[4] = !(Float16(irr, RoundUp) < irr)
+                return nothing
+            end
+
+            res_16 = MtlArray(zeros(Bool, 4))
+            Metal.@sync @metal convert_test_16(res_16)
+            @test all(Array(res_16))
         end
-
-        res_32 = MtlArray(zeros(Bool,4))
-        Metal.@sync @metal convert_test_32(res_32)
-        @test all(Array(res_32))
-
-        function convert_test_16(res)
-            res[1] = Float16(irr,RoundDown) < irr
-            res[2] = Float16(irr,RoundUp) > irr
-            res[3] = !(Float16(irr,RoundDown) > irr)
-            res[4] = !(Float16(irr,RoundUp) < irr)
-            return nothing
-        end
-
-        res_16 = MtlArray(zeros(Bool,4))
-        Metal.@sync @metal convert_test_16(res_16)
-        @test all(Array(res_16))
-    end
 end
 end
 

@christiangnrd
Member Author

The Metal constants seem to have the same values as RoundNearest. Should I instead implement the RoundUp/RoundDown behaviour that the CPU comparison uses (irrationals.jl in Julia Base), or should I leave it as the constants, matching the default Metal behaviour?

julia> Float32(π, RoundUp)
3.1415927f0

julia> Float32(π, RoundDown)
3.1415925f0

julia> Float16(π, RoundDown)
Float16(3.14)

julia> Float16(π, RoundUp)
Float16(3.143)

julia> Float16(π, RoundNearest)
Float16(3.14)

julia> Float32(π, RoundNearest)
3.1415927f0
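The comparison tests in this PR rely on a bracketing property: the RoundDown value lies strictly below the true irrational and the RoundUp value strictly above it. As a cross-check outside Julia (a minimal sketch, not part of the PR), the same bit patterns can be inspected with Python's struct module:

```python
import math
import struct

def f32(bits):
    """Reinterpret a 32-bit pattern as an IEEE-754 single."""
    return struct.unpack('<f', struct.pack('<I', bits))[0]

def f16(bits):
    """Reinterpret a 16-bit pattern as an IEEE-754 half."""
    return struct.unpack('<e', struct.pack('<H', bits))[0]

# Float32: 0x40490fdb (round-to-nearest) lands above pi; 0x40490fda is one ULP below.
assert f32(0x40490fda) < math.pi < f32(0x40490fdb)

# Float16: 0x4248 (round-to-nearest) lands below pi; 0x4249 is one ULP above.
assert f16(0x4248) < math.pi < f16(0x4249)

print(f32(0x40490fdb), f32(0x40490fda))  # 3.1415927410125732 3.1415925025939941
print(f16(0x4248), f16(0x4249))          # 3.140625 3.142578125
```

This also shows why round-to-nearest coincides with RoundUp for Float32(π) but with RoundDown for Float16(π): whichever neighbour is closer to π becomes the nearest value.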

github-actions bot left a comment

Metal Benchmarks

Benchmark suite Current: df7aae2 Previous: 4903c64 Ratio
private array/construct 26923.666666666668 ns 26097.25 ns 1.03
private array/broadcast 262562.5 ns 464250 ns 0.57
private array/random/randn/Float32 499666 ns 802521.5 ns 0.62
private array/random/randn!/Float32 486083 ns 634417 ns 0.77
private array/random/rand!/Int64 455749.5 ns 555166.5 ns 0.82
private array/random/rand!/Float32 358333 ns 593917 ns 0.60
private array/random/rand/Int64 694875 ns 786666.5 ns 0.88
private array/random/rand/Float32 377292 ns 607291.5 ns 0.62
private array/copyto!/gpu_to_gpu 227208 ns 651959 ns 0.35
private array/copyto!/cpu_to_gpu 253417 ns 816583 ns 0.31
private array/copyto!/gpu_to_cpu 252500 ns 695209 ns 0.36
private array/accumulate/1d 7986834 ns 1337584 ns 5.97
private array/accumulate/2d 961916 ns 1415584 ns 0.68
private array/iteration/findall/int 7972541.5 ns 2090292 ns 3.81
private array/iteration/findall/bool 7985875 ns 1820792 ns 4.39
private array/iteration/findfirst/int 1161312 ns 1682334 ns 0.69
private array/iteration/findfirst/bool 1150166 ns 1668062.5 ns 0.69
private array/iteration/scalar 1524271 ns 3837708 ns 0.40
private array/iteration/logical 8099917 ns 3204958.5 ns 2.53
private array/iteration/findmin/1d 1178667 ns 1767791.5 ns 0.67
private array/iteration/findmin/2d 890083 ns 1357208 ns 0.66
private array/reductions/reduce/1d 462459 ns 1034666.5 ns 0.45
private array/reductions/reduce/2d 470292 ns 666375 ns 0.71
private array/reductions/mapreduce/1d 480792 ns 1037104 ns 0.46
private array/reductions/mapreduce/2d 462916.5 ns 668667 ns 0.69
private array/permutedims/4d 1440458 ns 2542895.5 ns 0.57
private array/permutedims/2d 748666 ns 1024354.5 ns 0.73
private array/permutedims/3d 1124604 ns 1585750 ns 0.71
private array/copy 348041.5 ns 618375 ns 0.56
latency/precompile 9080074000 ns 9065026583 ns 1.00
latency/ttfp 3633120458 ns 3618215084 ns 1.00
latency/import 1243107375 ns 1245634417 ns 1.00
integration/metaldevrt 531458 ns 715042 ns 0.74
integration/byval/slices=1 1587937 ns 1564979.5 ns 1.01
integration/byval/slices=3 10447709 ns 10783271 ns 0.97
integration/byval/reference 1483209 ns 1534875 ns 0.97
integration/byval/slices=2 2461354.5 ns 2619125 ns 0.94
kernel/indexing 240458 ns 468687.5 ns 0.51
kernel/indexing_checked 237895.5 ns 468687 ns 0.51
kernel/launch 50916.666666666664 ns 9437.666666666666 ns 5.40
metal/synchronization/stream 14250 ns 14125 ns 1.01
metal/synchronization/context 14750 ns 14708 ns 1.00
shared array/construct 26416.666666666668 ns 24409.75 ns 1.08
shared array/broadcast 253916.5 ns 460208 ns 0.55
shared array/random/randn/Float32 503146 ns 880875 ns 0.57
shared array/random/randn!/Float32 419083.5 ns 636250 ns 0.66
shared array/random/rand!/Int64 430020.5 ns 551291 ns 0.78
shared array/random/rand!/Float32 411750 ns 594125 ns 0.69
shared array/random/rand/Int64 715750 ns 789250 ns 0.91
shared array/random/rand/Float32 342500 ns 634542 ns 0.54
shared array/copyto!/gpu_to_gpu 85542 ns 83625 ns 1.02
shared array/copyto!/cpu_to_gpu 82000 ns 83041 ns 0.99
shared array/copyto!/gpu_to_cpu 84375 ns 82250 ns 1.03
shared array/accumulate/1d 7989271 ns 1340667 ns 5.96
shared array/accumulate/2d 961625 ns 1394125 ns 0.69
shared array/iteration/findall/int 7976667 ns 1845750 ns 4.32
shared array/iteration/findall/bool 7989229.5 ns 1576917 ns 5.07
shared array/iteration/findfirst/int 940166.5 ns 1392979.5 ns 0.67
shared array/iteration/findfirst/bool 925708 ns 1375687.5 ns 0.67
shared array/iteration/scalar 153542 ns 153000 ns 1.00
shared array/iteration/logical 8054708 ns 2990542 ns 2.69
shared array/iteration/findmin/1d 976625 ns 1483333.5 ns 0.66
shared array/iteration/findmin/2d 895709 ns 1364208.5 ns 0.66
shared array/reductions/reduce/1d 377583 ns 730667 ns 0.52
shared array/reductions/reduce/2d 474500 ns 670583 ns 0.71
shared array/reductions/mapreduce/1d 371250 ns 734062.5 ns 0.51
shared array/reductions/mapreduce/2d 478292 ns 670875 ns 0.71
shared array/permutedims/4d 1444000 ns 2547458.5 ns 0.57
shared array/permutedims/2d 747125 ns 1023687 ns 0.73
shared array/permutedims/3d 1119000 ns 1588812.5 ns 0.70
shared array/copy 241729.5 ns 238750 ns 1.01

This comment was automatically generated by workflow using github-action-benchmark.

@maleadt
Member

maleadt commented Mar 11, 2025

> The Metal constants seem to have the same values as RoundNearest

What do you mean by "the Metal constants"? As is coincidentally being discussed in JuliaGPU/CUDA.jl#2644 (comment), I guess it's better to be consistent with other Julia code than with Metal C.

@christiangnrd
Member Author

The latest push makes the Metal behaviour the same as the CPU behaviour, at least for comparisons.

Comment on lines 11 to 22
### Constants
# π
@device_override Core.Float32(::typeof(π), ::RoundingMode) = reinterpret(Float32, 0x40490fdb) # 3.1415927f0 reinterpret(UInt32,Float32(reinterpret(Float64,0x400921FB60000000)))
@device_override Core.Float32(::typeof(π), ::RoundingMode{:Down}) = reinterpret(Float32, 0x40490fda) # 3.1415925f0 prevfloat(reinterpret(UInt32,Float32(reinterpret(Float64,0x400921FB60000000))))
@device_override Core.Float16(::typeof(π), ::RoundingMode{:Up}) = reinterpret(Float16, 0x4249) # Float16(3.143)
@device_override Core.Float16(::typeof(π), ::RoundingMode) = reinterpret(Float16, 0x4248) # Float16(3.14)

# ℯ
@device_override Core.Float32(::typeof(ℯ), ::RoundingMode{:Up}) = reinterpret(Float32, 0x402df855) # 2.718282f0 nextfloat(reinterpret(UInt32,Float32(reinterpret(Float64,0x4005BF0A80000000))))
@device_override Core.Float32(::typeof(ℯ), ::RoundingMode) = reinterpret(Float32, 0x402df854) # 2.7182817f0 reinterpret(UInt32,Float32(reinterpret(Float64,0x4005BF0A80000000)))
@device_override Core.Float16(::typeof(ℯ), ::RoundingMode) = reinterpret(Float16, 0x4170) # Float16(2.719)
@device_override Core.Float16(::typeof(ℯ), ::RoundingMode{:Down}) = reinterpret(Float16, 0x416f) # Float16(2.717)
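As a sanity check on the hard-coded hex values (an illustrative sketch in Python, not part of the PR), each round-to-nearest pattern should equal the IEEE-754 nearest float of the double-precision constant, with the opposite-direction value one ULP away:

```python
import math
import struct

def f32_bits(x):
    """Bit pattern of x rounded to the nearest float32."""
    return struct.unpack('<I', struct.pack('<f', x))[0]

def f16_bits(x):
    """Bit pattern of x rounded to the nearest float16."""
    return struct.unpack('<H', struct.pack('<e', x))[0]

# pi: nearest float32 is the RoundNearest/RoundUp value; RoundDown is one ULP below.
assert f32_bits(math.pi) == 0x40490fdb
# pi: nearest float16 is the RoundNearest/RoundDown value; RoundUp is one ULP above.
assert f16_bits(math.pi) == 0x4248

# e: nearest float32 is the RoundNearest/RoundDown value; RoundUp is one ULP above.
assert f32_bits(math.e) == 0x402df854
# e: nearest float16 is the RoundNearest/RoundUp value; RoundDown is one ULP below.
assert f16_bits(math.e) == 0x4170
```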
Member

Is it possible to generate those definitions with some metaprogramming, computing the constants on the fly, instead of hard-coding them?

Member Author

The best I could come up with (this includes a definition for the CPU that wouldn't make it into the PR):

macro _const_convert(irr, T, r)
    :($T($irr, $r))
end
for T in (:Float32, :Float16), irr in (:π, :ℯ), r in (:RoundUp, :RoundDown)
    @eval begin
        @device_override $T(::typeof($irr), ::typeof($r)) = @_const_convert($irr, $T, $r)
    end
end

And while maybe not the best approach, the @code_llvm for the CPU is:

; Function Signature: newFloat32(Base.Irrational{:π}, Base.Rounding.RoundingMode{:Up})
;  @ REPL[9]:4 within `newFloat32`
define float @julia_newFloat32_6871() #0 {
top:
  ret float 0x400921FB60000000
}

But when I try to run it, I get a GPUCompiler error:

julia> @device_code_llvm @metal convert_test_32(res_32)
; GPUCompiler.CompilerJob{GPUCompiler.MetalCompilerTarget, Metal.MetalCompilerParams}(MethodInstance for convert_test_32(::MtlDeviceVector{Bool, 1}), CompilerConfig for GPUCompiler.MetalCompilerTarget, 0x0000000000006877)
ERROR: old function still has uses (via a constant expr)
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] add_global_address_spaces!(job::GPUCompiler.CompilerJob, mod::LLVM.Module, entry::LLVM.Function)
    @ GPUCompiler ~/.julia/dev/GPUCompiler/src/metal.jl:414
  [3] finish_ir!(job::GPUCompiler.CompilerJob{GPUCompiler.MetalCompilerTarget}, mod::LLVM.Module, entry::LLVM.Function)
    @ GPUCompiler ~/.julia/dev/GPUCompiler/src/metal.jl:166
  [4] finish_ir!(job::GPUCompiler.CompilerJob{GPUCompiler.MetalCompilerTarget, Metal.MetalCompilerParams}, mod::LLVM.Module, entry::LLVM.Function)
    @ Metal ~/.julia/dev/Metal/src/compiler/compilation.jl:14
  [5] macro expansion
    @ ~/.julia/dev/GPUCompiler/src/driver.jl:284 [inlined]
  [6] emit_llvm(job::GPUCompiler.CompilerJob; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/dev/GPUCompiler/src/utils.jl:110
  [7] emit_llvm(job::GPUCompiler.CompilerJob)
    @ GPUCompiler ~/.julia/dev/GPUCompiler/src/utils.jl:108
  [8] compile_unhooked(output::Symbol, job::GPUCompiler.CompilerJob; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/dev/GPUCompiler/src/driver.jl:95
  [9] compile_unhooked
    @ ~/.julia/dev/GPUCompiler/src/driver.jl:80 [inlined]
 [10] compile(target::Symbol, job::GPUCompiler.CompilerJob; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/dev/GPUCompiler/src/driver.jl:67
 [11] compile
    @ ~/.julia/dev/GPUCompiler/src/driver.jl:55 [inlined]
 [12] (::GPUCompiler.var"#235#236"{Bool, Symbol, Bool, GPUCompiler.CompilerJob{…}, GPUCompiler.CompilerConfig{…}})(ctx::Context)
    @ GPUCompiler ~/.julia/dev/GPUCompiler/src/reflection.jl:191
 [13] JuliaContext(f::GPUCompiler.var"#235#236"{Bool, Symbol, Bool, GPUCompiler.CompilerJob{…}, GPUCompiler.CompilerConfig{…}}; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/dev/GPUCompiler/src/driver.jl:34
 [14] JuliaContext(f::Function)
    @ GPUCompiler ~/.julia/dev/GPUCompiler/src/driver.jl:25
 [15] code_llvm(io::Base.TTY, job::GPUCompiler.CompilerJob; optimize::Bool, raw::Bool, debuginfo::Symbol, dump_module::Bool, kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/dev/GPUCompiler/src/reflection.jl:190
 [16] code_llvm
    @ ~/.julia/dev/GPUCompiler/src/reflection.jl:186 [inlined]
 [17] (::GPUCompiler.var"#hook#246"{GPUCompiler.var"#hook#245#247"})(job::GPUCompiler.CompilerJob{GPUCompiler.MetalCompilerTarget, Metal.MetalCompilerParams}; io::Base.TTY, kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/dev/GPUCompiler/src/reflection.jl:337
 [18] (::GPUCompiler.var"#hook#246"{GPUCompiler.var"#hook#245#247"})(job::GPUCompiler.CompilerJob{GPUCompiler.MetalCompilerTarget, Metal.MetalCompilerParams})
    @ GPUCompiler ~/.julia/dev/GPUCompiler/src/reflection.jl:335
 [19] var"#3#outer_hook"(job::GPUCompiler.CompilerJob{GPUCompiler.MetalCompilerTarget, Metal.MetalCompilerParams})
    @ Main ~/.julia/dev/GPUCompiler/src/reflection.jl:246
 [20] compile(target::Symbol, job::GPUCompiler.CompilerJob; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/dev/GPUCompiler/src/driver.jl:64
 [21] compile
    @ ~/.julia/dev/GPUCompiler/src/driver.jl:55 [inlined]
 [22] (::Metal.var"#155#163"{GPUCompiler.CompilerJob{GPUCompiler.MetalCompilerTarget, Metal.MetalCompilerParams}})(ctx::Context)
    @ Metal ~/.julia/dev/Metal/src/compiler/compilation.jl:108
 [23] JuliaContext(f::Metal.var"#155#163"{GPUCompiler.CompilerJob{GPUCompiler.MetalCompilerTarget, Metal.MetalCompilerParams}}; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/dev/GPUCompiler/src/driver.jl:34
 [24] JuliaContext(f::Function)
    @ GPUCompiler ~/.julia/dev/GPUCompiler/src/driver.jl:25
 [25] macro expansion
    @ ~/.julia/dev/Metal/src/compiler/compilation.jl:107 [inlined]
 [26] macro expansion
    @ ~/.julia/packages/ObjectiveC/TgrW6/src/os.jl:264 [inlined]
 [27] compile(job::GPUCompiler.CompilerJob)
    @ Metal ~/.julia/dev/Metal/src/compiler/compilation.jl:105
 [28] actual_compilation(cache::Dict{…}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{…}, compiler::typeof(Metal.compile), linker::typeof(Metal.link))
    @ GPUCompiler ~/.julia/dev/GPUCompiler/src/execution.jl:245
 [29] cached_compilation(cache::Dict{Any, Any}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.MetalCompilerTarget, Metal.MetalCompilerParams}, compiler::Function, linker::Function)
    @ GPUCompiler ~/.julia/dev/GPUCompiler/src/execution.jl:159
 [30] macro expansion
    @ ~/.julia/dev/Metal/src/compiler/execution.jl:189 [inlined]
 [31] macro expansion
    @ ./lock.jl:273 [inlined]
 [32] mtlfunction(f::typeof(convert_test_32), tt::Type{Tuple{MtlDeviceVector{Bool, 1}}}; name::Nothing, kwargs::@Kwargs{})
    @ Metal ~/.julia/dev/Metal/src/compiler/execution.jl:184
 [33] mtlfunction(f::typeof(convert_test_32), tt::Type{Tuple{MtlDeviceVector{Bool, 1}}})
    @ Metal ~/.julia/dev/Metal/src/compiler/execution.jl:182
 [34] macro expansion
    @ ~/.julia/dev/Metal/src/compiler/execution.jl:85 [inlined]
 [35] top-level scope
    @ ~/.julia/dev/GPUCompiler/src/reflection.jl:257
 [36] top-level scope
    @ ~/.julia/dev/Metal/src/initialization.jl:79
Some type information was truncated. Use `show(err)` to see complete types.

While writing this I also tried:

for T in (:Float32, :Float16), irr in (:π, :ℯ), r in (:RoundUp, :RoundDown)
    @eval begin
        @device_override $T(::typeof($irr), ::typeof($r)) = Base.Rounding._convert_rounding($T, $irr, $r)
    end
end

But that also gives the "old function still has uses (via a constant expr)" error.

christiangnrd (Member Author) commented Mar 12, 2025

This errors in some really weird ways when running tests in CI and locally, so here's a copy-paste code snippet:

using Metal

irr = π  # assumed here; `irr` was the testset loop variable ranging over (π, ℯ)
function convert_test_32(res)
    res[1] = Float32(irr, RoundDown) < irr
    res[2] = Float32(irr, RoundUp) > irr
    res[3] = !(Float32(irr, RoundDown) > irr)
    res[4] = !(Float32(irr, RoundUp) < irr)
    return nothing
end
res_32 = MtlArray(zeros(Bool, 4))
Metal.@sync @metal convert_test_32(res_32)

Development

Successfully merging this pull request may close these issues.

Can't compare Float32 with pi on Metal
2 participants