
Remove use of threadid #429

Open
devmotion opened this issue Oct 17, 2022 · 9 comments

@devmotion
Member

Just noticed that ThreadSafeVarInfo makes heavy use of threadid:

function acclogp!!(vi::ThreadSafeVarInfo, logp)
    vi.logps[Threads.threadid()] += logp
    return vi
end
function acclogp!!(vi::ThreadSafeVarInfoWithRef, logp)
    vi.logps[Threads.threadid()][] += logp
    return vi
end

That should be changed since it is broken in Julia >= 1.8 if users don't use @threads :static for ... (see e.g. https://discourse.julialang.org/t/behavior-of-threads-threads-for-loop/76042).

@devmotion
Member Author

Generally, I feel the use of ThreadSafeVarInfo with non-mutable varinfo objects is probably broken and should maybe be disallowed. Every time different threads call any of the !! methods (or the ones without !! that still return a new varinfo object), those threads end up working with different varinfo objects.

The more general use with mutable varinfo objects could be fixed, I assume, by restricting access with a lock. But maybe that's not the most efficient way.
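Concretely, something like this could work (a sketch only; the single shared logp field and the lock field are assumptions for illustration, not the current ThreadSafeVarInfo layout):

# Hedged sketch: assumes a hypothetical `lock::ReentrantLock` field and a
# single shared `logp::Base.RefValue` instead of the per-thread `logps` vector.
function acclogp!!(vi::ThreadSafeVarInfo, logp)
    lock(vi.lock) do
        vi.logp[] += logp
    end
    return vi
end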

@torfjelde
Member

Generally, I feel the use of ThreadSafeVarInfo with non-mutable varinfo objects is probably broken and should maybe be disallowed.

It should still be functional for evaluation because, even though the returned __varinfo__ is thread-local, the logp field that is mutated is shared across all of them.

For the immutable varinfos we actually have the opportunity to parallelize sampling itself by implementing a merge for the varinfos, though it would require some manual labour on the user's side, e.g. a mapreduce, as sketched below.
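Roughly along these lines (a sketch only: merge for varinfos does not exist yet, and evaluate_chunk/chunks are stand-ins for whatever per-task work the user runs):

# Hypothetical sketch: `evaluate_chunk` stands in for user code that returns a
# new immutable varinfo, and `merge` is the not-yet-implemented combiner.
tasks = [Threads.@spawn evaluate_chunk(chunk) for chunk in chunks]
vi = mapreduce(fetch, merge, tasks)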

The more general use with mutable varinfo objects could be fixed, I assume, by restricting access with a lock.

Seems like a good idea 👍

@torfjelde
Member

Btw, regarding this, how can :dynamic cause issues here?

It seems like :dynamic means that tasks can migrate between threads, but how is it then possible that two different tasks try to mutate the same index at the same time? For this to happen, you'd seemingly need one Task to yield on the line

vi.logps[Threads.threadid()] += logp

and then another Task, which has been moved to the same thread, to also hit this line before the original Task resumes and "completes" it.

But there's no way the above line will actually cause the Task to yield back to the scheduler, right? So even though, yes, there won't be a specific Task per thread anymore, there still doesn't seem to be a scenario where the same thread will try to perform the above accumulation using two different tasks simultaneously?

Oooor am I misunderstanding something here?

@devmotion
Member Author

True, it seems that when opening the issue I thought it was worse than it actually is. Generally though, IMO using threadid is bad style even in cases where it currently does not introduce concurrency issues. I think the explanation in https://juliafolds.github.io/FLoops.jl/stable/explanation/faq/#faq-state-threadid is quite good. Currently this specific line does not yield to the scheduler AFAIK (e.g., it does not involve any IO operation or logging), but that's not guaranteed to remain the case in future Julia versions.
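For concreteness, a minimal illustration of the failure mode (the explicit yield() is a stand-in for any yield point a future Julia version might introduce on such a line):

using Base.Threads: @threads, threadid, nthreads

acc = zeros(Int, nthreads())
@threads :dynamic for i in 1:1000
    old = acc[threadid()]      # read the slot of the current thread ...
    yield()                    # ... the task may migrate to another thread here ...
    acc[threadid()] = old + i  # ... and write a stale value into a different slot
end
sum(acc)  # can differ from sum(1:1000) == 500500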

@torfjelde
Member

torfjelde commented Feb 2, 2023

I think the explanation in https://juliafolds.github.io/FLoops.jl/stable/explanation/faq/#faq-state-threadid is quite good

Haha, this is exactly the explanation I was looking at that made me question your original worry 😅

But no, I agree with you that the usage of threadid should be removed, because it could potentially introduce issues like those mentioned 👍

@devmotion
Member Author

JuliaLang/julia#48542

@torfjelde
Member

torfjelde commented Feb 6, 2023

Ah neat!

Something like Dictionaries.jl would probably also be useful if you already know the return type and it's consistent across tasks.

EDIT: On second thought, using Dictionaries.jl is probably not a great idea, as it presumably expands the underlying Vector on demand, which would surely lead to issues when multiple threads try to write to it 😅

@torfjelde
Member

torfjelde commented Feb 6, 2023

How bad an idea is it to create a buffer of Atomics, pick a slot at random, and then use atomic_add!? I.e.

julia> Threads.nthreads()
16

julia> f(x) = log(x)
f (generic function with 1 method)

julia> g(n) = sum(f, 1:n)
g (generic function with 1 method)

julia> function g_t(n; buffer_size=Threads.nthreads())
           buffers = [Threads.Atomic{Float64}(0) for _ = 1:buffer_size]
           Threads.@threads for i = 1:n
               buffer_idx = rand(1:buffer_size)
               Threads.atomic_add!(buffers[buffer_idx], f(i))
           end
           return sum(x -> x[], buffers)  # read each Atomic's value
       end
g_t (generic function with 1 method)

julia> @btime $g($(10_000_000));
  74.448 ms (0 allocations: 0 bytes)

julia> @btime $g_t($(10_000_000); buffer_size=$(1));
  613.850 ms (101 allocations: 8.56 KiB)

julia> @btime $g_t($(10_000_000); buffer_size=$(2));
  386.434 ms (101 allocations: 8.55 KiB)

julia> @btime $g_t($(10_000_000); buffer_size=$(4));
  351.702 ms (102 allocations: 8.56 KiB)

julia> @btime $g_t($(10_000_000); buffer_size=$(16));
  152.913 ms (114 allocations: 8.84 KiB)

julia> @btime $g_t($(10_000_000); buffer_size=$(32));
  91.895 ms (130 allocations: 9.22 KiB)

julia> @btime $g_t($(10_000_000); buffer_size=$(64));
  88.032 ms (162 allocations: 9.98 KiB)

julia> @btime $g_t($(10_000_000); buffer_size=$(128));
  83.266 ms (227 allocations: 11.52 KiB)

julia> @btime $g_t($(10_000_000); buffer_size=$(1024));
  67.109 ms (1122 allocations: 32.55 KiB)

🙃

EDIT: A probably better solution would be to have buffer_size locks that we can check for being currently acquired, and then iterate through them until we hit one that isn't locked (with some max number of tries, of course, before just blocking). The above was just a quick and dirty way to do something that would still improve efficiency quite drastically vs. a single Atomic.
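Something like this, for concreteness (all names made up for the example; buffers as Base.RefValue{Float64} and plain ReentrantLocks):

# Illustrative sketch: try stripes until one is free, blocking only as a fallback.
function striped_add!(buffers::Vector{Base.RefValue{Float64}},
                      locks::Vector{ReentrantLock}, x::Float64; max_tries=3)
    i = rand(1:length(buffers))
    for _ in 1:max_tries
        if trylock(locks[i])              # non-blocking attempt
            try
                buffers[i][] += x
            finally
                unlock(locks[i])
            end
            return nothing
        end
        i = mod1(i + 1, length(buffers))  # move on to the next stripe
    end
    lock(locks[i]) do                     # give up and block on the current one
        buffers[i][] += x
    end
    return nothing
end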

@torfjelde
Member

Oh and another thing: Atomics won't work once you put AD in there, right? Atomic seems to support only a rather limited set of types, so I'd expect even something like Dual to not be usable.
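A quick way to see that limitation (Base.Threads.Atomic's constructors are only defined for a fixed set of primitive numeric types, so this is expected to fail for AD number types):

using Base.Threads: Atomic, atomic_add!

a = Atomic{Float64}(0.0)       # Float64 is one of the supported primitive types
atomic_add!(a, 1.5)            # fine
# For an AD number type the constructor is not defined, e.g. with ForwardDiff:
#     D = ForwardDiff.Dual{Nothing,Float64,1}
#     Atomic{D}(zero(D))       # errors: Atomic only supports primitive types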
