Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[java] update master table locations cache #10

Closed
wants to merge 1 commit into from

Conversation

zhangyifan27
Copy link
Owner

@zhangyifan27 zhangyifan27 commented Jul 4, 2022

Recently a master in our cluster is down because of network issues and somehow the server didn't close the connections to some clients. Then these clients keep trying to connect to the dead master but can't receive response until timeout, even when this server is up the client still send rpc through the old channel and can't connect to the server and the new leader master. The only solution is to restart clients.

This patch fixes the issue that java client can't invalidate stale locations of the leader master. Maybe in this case we also need a better way to trigger connection shutdown for an inactive channel.

@zhangyifan27
Copy link
Owner Author

A better way is to disconnect the old channel.

@zhangyifan27 zhangyifan27 force-pushed the fix_java_master_failover branch 2 times, most recently from cb5457b to 8bd6406 Compare July 7, 2022 03:03
Recently a master in our cluster was down because of network issues and somehow
the server didn't close the connections to some clients. Then these clients
keep trying to connect to the dead master but can't receive response until rpc
timeout, even when this server is up the client still send rpc through the old
channel and can't connect to the server and the new leader master. The only
solution is to restart clients.

This patch fixes the issue that java client can't invalidate stale locations of
the leader master. Maybe in this case we also need a better way to trigger
connection shutdown for an inactive channel.

Change-Id: Ia2877518866ac4c2d1dda6427ce57d08df48a864
@zhangyifan27 zhangyifan27 force-pushed the fix_java_master_failover branch from 8bd6406 to 469b6b4 Compare July 7, 2022 03:42
zhangyifan27 pushed a commit that referenced this pull request Jun 12, 2024
It turned out that auto leader rebalancing task wasn't explicitly
shutdown upon shutting down catalog manager.  That lead to race
conditions as reported by TSAN, at least in test scenarios (see below).
This patch addresses the issue.

  WARNING: ThreadSanitizer: data race (pid=23827)
    Write of size 1 at 0x7b4000008208 by main thread:
      #0 AnnotateRWLockDestroy thirdparty/src/llvm-11.0.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_interface_ann.cpp:264 (auto_rebalancer-test+0x33575e)
      #1 kudu::rw_spinlock::~rw_spinlock() src/kudu/util/locks.h:89:5 (libmaster.so+0x359376)
      #2 kudu::master::TSManager::~TSManager() src/kudu/master/ts_manager.cc:108:1 (libmaster.so+0x4ad201)
      #3 kudu::master::TSManager::~TSManager() src/kudu/master/ts_manager.cc:107:25 (libmaster.so+0x4ad229)
      #4 std::__1::default_delete<kudu::master::TSManager>::operator()(kudu::master::TSManager*) const thirdparty/installed/tsan/include/c++/v1/memory:2262:5 (libmaster.so+0x407ce7)
      #5 std::__1::unique_ptr<kudu::master::TSManager, std::__1::default_delete<kudu::master::TSManager> >::reset(kudu::master::TSManager*) thirdparty/installed/tsan/include/c++/v1/memory:2517:7 (libmaster.so+0x40157d)
      #6 std::__1::unique_ptr<kudu::master::TSManager, std::__1::default_delete<kudu::master::TSManager> >::~unique_ptr() thirdparty/installed/tsan/include/c++/v1/memory:2471:19 (libmaster.so+0x4015eb)
      #7 kudu::master::Master::~Master() src/kudu/master/master.cc:263:1 (libmaster.so+0x3f7a4a)
      #8 kudu::master::Master::~Master() src/kudu/master/master.cc:261:19 (libmaster.so+0x3f7dc9)
      #9 std::__1::default_delete<kudu::master::Master>::operator()(kudu::master::Master*) const thirdparty/installed/tsan/include/c++/v1/memory:2262:5 (libmaster.so+0x435627)
      #10 std::__1::unique_ptr<kudu::master::Master, std::__1::default_delete<kudu::master::Master> >::reset(kudu::master::Master*) thirdparty/installed/tsan/include/c++/v1/memory:2517:7 (libmaster.so+0x42e6ed)
      #11 kudu::master::MiniMaster::Shutdown() src/kudu/master/mini_master.cc:120:13 (libmaster.so+0x4c2612)
    ...
    Previous atomic write of size 4 at 0x7b4000008208 by thread T439 (mutexes: write M1141235379631443968):
      #0 __tsan_atomic32_compare_exchange_strong thirdparty/src/llvm-11.0.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_interface_atomic.cpp:780 (auto_rebalancer-test+0x33eb60)
      #1 base::subtle::Release_CompareAndSwap(int volatile*, int, int) /src/kudu/gutil/atomicops-internals-tsan.h:88:3 (libmaster.so+0x2e2b34)
      #2 kudu::rw_semaphore::unlock_shared() src/kudu/util/rw_semaphore.h:91:19 (libmaster.so+0x2e29c8)
      #3 kudu::rw_spinlock::unlock_shared() src/kudu/util/locks.h:99:10 (libmaster.so+0x2e28ef)
      #4 std::__1::shared_lock<kudu::rw_spinlock>::~shared_lock() /thirdparty/installed/tsan/include/c++/v1/shared_mutex:369:19 (libmaster.so+0x2e23e0)
      #5 kudu::master::TSManager::GetAllDescriptors(std::__1::vector<std::__1::shared_ptr<kudu::master::TSDescriptor>, std::__1::allocator<std::__1::shared_ptr<kudu::master::TSDescriptor> > >*) const src/kudu/master/ts_manager.cc:206:1 (libmaster.so+0x4adeb6)
      #6 kudu::master::AutoLeaderRebalancerTask::RunLeaderRebalancer() src/kudu/master/auto_leader_rebalancer.cc:405:16 (libmaster.so+0x2fb51b)
      #7 kudu::master::AutoLeaderRebalancerTask::RunLoop() src/kudu/master/auto_leader_rebalancer.cc:445:7 (libmaster.so+0x2fbaa9)

This is a follow-up to 10efaf2.

Change-Id: Iccd66d00280d22b37386230874937e5260f07f3b
Reviewed-on: http://gerrit.cloudera.org:8080/21417
Reviewed-by: Wang Xixu <[email protected]>
Tested-by: Alexey Serbin <[email protected]>
Reviewed-by: Yifan Zhang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant