[java] update master table locations cache #10

zhangyifan27 · 2022-07-04T09:45:37Z

Recently a master in our cluster went down because of network issues, and for some reason the server did not close its connections to some clients. Those clients kept trying to connect to the dead master and received no response until the timeout expired. Even after the server came back up, the clients still sent RPCs through the old channel, so they could reach neither that server nor the new leader master. The only workaround was to restart the clients.

This patch fixes the issue that the Java client can't invalidate stale cached locations of the leader master. In this case we may also need a better way to trigger connection shutdown for an inactive channel.
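To illustrate the invalidation idea described above, here is a minimal sketch of a cache that forgets the leader location when an RPC to it fails. It is not the code in `TableLocationsCache.java`: the class `LeaderLocationCacheSketch`, its methods, and the `host:port` strings are all hypothetical names chosen for the example.

```java
/**
 * Hypothetical sketch of a leader-location cache. The class and method
 * names are illustrative and are not the Kudu client's API.
 */
final class LeaderLocationCacheSketch {
  /** Cached "host:port" of the leader master, or null when unknown. */
  private String leaderHostPort;

  synchronized String getLeader() {
    return leaderHostPort;
  }

  synchronized void setLeader(String hostPort) {
    leaderHostPort = hostPort;
  }

  /**
   * Drops the cached entry only if it still names the server that failed,
   * so a fresher entry written by another thread is left untouched.
   */
  synchronized boolean invalidateIfMatches(String failedHostPort) {
    if (failedHostPort != null && failedHostPort.equals(leaderHostPort)) {
      leaderHostPort = null;
      return true;
    }
    return false;
  }

  public static void main(String[] args) {
    LeaderLocationCacheSketch cache = new LeaderLocationCacheSketch();
    cache.setLeader("master-1:7051"); // made-up address for the demo
    // An RPC to master-1 timed out: forget it so the next call re-resolves.
    System.out.println(cache.invalidateIfMatches("master-1:7051"));
    System.out.println(cache.getLeader());
  }
}
```

Comparing against the failed address before clearing is a design choice in this sketch: it avoids wiping out a newer leader entry that another thread may have stored in the meantime.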

java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java

java/kudu-client/src/main/java/org/apache/kudu/client/TableLocationsCache.java

zhangyifan27 · 2022-07-04T09:54:13Z

A better way is to disconnect the old channel.
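One way to decide when an old channel should be disconnected is to track how long it has been silent while requests are outstanding. The sketch below shows only that bookkeeping, with the clock passed in explicitly. `ChannelLivenessSketch` and its methods are hypothetical and do not correspond to anything in the Kudu client or its networking library; actually closing the channel is left to the caller.

```java
/**
 * Hypothetical liveness tracker for one connection. The caller reports
 * sends and receives and periodically asks whether the channel looks dead.
 */
final class ChannelLivenessSketch {
  private final long maxSilenceMillis;
  private long lastActivityMillis;
  private int inFlight;

  ChannelLivenessSketch(long maxSilenceMillis, long nowMillis) {
    this.maxSilenceMillis = maxSilenceMillis;
    this.lastActivityMillis = nowMillis;
  }

  synchronized void onRequestSent(long nowMillis) {
    if (inFlight == 0) {
      // The silence window starts when the first outstanding request is sent.
      lastActivityMillis = nowMillis;
    }
    inFlight++;
  }

  synchronized void onResponseReceived(long nowMillis) {
    if (inFlight > 0) {
      inFlight--;
    }
    lastActivityMillis = nowMillis;
  }

  /** True when requests are pending and nothing has arrived for too long. */
  synchronized boolean shouldDisconnect(long nowMillis) {
    return inFlight > 0 && nowMillis - lastActivityMillis >= maxSilenceMillis;
  }

  public static void main(String[] args) {
    ChannelLivenessSketch liveness = new ChannelLivenessSketch(1000, 0);
    liveness.onRequestSent(0);
    System.out.println(liveness.shouldDisconnect(500));
    System.out.println(liveness.shouldDisconnect(1500));
  }
}
```

An idle channel with no pending requests is never flagged here, so long-lived but quiet connections are not torn down needlessly.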

Recently a master in our cluster went down because of network issues, and for some reason the server did not close its connections to some clients. Those clients kept trying to connect to the dead master and received no response until the RPC timeout expired. Even after the server came back up, the clients still sent RPCs through the old channel, so they could reach neither that server nor the new leader master. The only workaround was to restart the clients.

This patch fixes the issue that the Java client can't invalidate stale cached locations of the leader master. In this case we may also need a better way to trigger connection shutdown for an inactive channel.

Change-Id: Ia2877518866ac4c2d1dda6427ce57d08df48a864

It turned out that the auto leader rebalancing task wasn't explicitly shut down upon shutting down the catalog manager. That led to race conditions as reported by TSAN, at least in test scenarios (see below). This patch addresses the issue.

  WARNING: ThreadSanitizer: data race (pid=23827)
    Write of size 1 at 0x7b4000008208 by main thread:
      #0 AnnotateRWLockDestroy thirdparty/src/llvm-11.0.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_interface_ann.cpp:264 (auto_rebalancer-test+0x33575e)
      #1 kudu::rw_spinlock::~rw_spinlock() src/kudu/util/locks.h:89:5 (libmaster.so+0x359376)
      #2 kudu::master::TSManager::~TSManager() src/kudu/master/ts_manager.cc:108:1 (libmaster.so+0x4ad201)
      #3 kudu::master::TSManager::~TSManager() src/kudu/master/ts_manager.cc:107:25 (libmaster.so+0x4ad229)
      #4 std::__1::default_delete<kudu::master::TSManager>::operator()(kudu::master::TSManager*) const thirdparty/installed/tsan/include/c++/v1/memory:2262:5 (libmaster.so+0x407ce7)
      #5 std::__1::unique_ptr<kudu::master::TSManager, std::__1::default_delete<kudu::master::TSManager> >::reset(kudu::master::TSManager*) thirdparty/installed/tsan/include/c++/v1/memory:2517:7 (libmaster.so+0x40157d)
      #6 std::__1::unique_ptr<kudu::master::TSManager, std::__1::default_delete<kudu::master::TSManager> >::~unique_ptr() thirdparty/installed/tsan/include/c++/v1/memory:2471:19 (libmaster.so+0x4015eb)
      #7 kudu::master::Master::~Master() src/kudu/master/master.cc:263:1 (libmaster.so+0x3f7a4a)
      #8 kudu::master::Master::~Master() src/kudu/master/master.cc:261:19 (libmaster.so+0x3f7dc9)
      #9 std::__1::default_delete<kudu::master::Master>::operator()(kudu::master::Master*) const thirdparty/installed/tsan/include/c++/v1/memory:2262:5 (libmaster.so+0x435627)
      #10 std::__1::unique_ptr<kudu::master::Master, std::__1::default_delete<kudu::master::Master> >::reset(kudu::master::Master*) thirdparty/installed/tsan/include/c++/v1/memory:2517:7 (libmaster.so+0x42e6ed)
      #11 kudu::master::MiniMaster::Shutdown() src/kudu/master/mini_master.cc:120:13 (libmaster.so+0x4c2612)
      ...

    Previous atomic write of size 4 at 0x7b4000008208 by thread T439 (mutexes: write M1141235379631443968):
      #0 __tsan_atomic32_compare_exchange_strong thirdparty/src/llvm-11.0.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_interface_atomic.cpp:780 (auto_rebalancer-test+0x33eb60)
      #1 base::subtle::Release_CompareAndSwap(int volatile*, int, int) /src/kudu/gutil/atomicops-internals-tsan.h:88:3 (libmaster.so+0x2e2b34)
      #2 kudu::rw_semaphore::unlock_shared() src/kudu/util/rw_semaphore.h:91:19 (libmaster.so+0x2e29c8)
      #3 kudu::rw_spinlock::unlock_shared() src/kudu/util/locks.h:99:10 (libmaster.so+0x2e28ef)
      #4 std::__1::shared_lock<kudu::rw_spinlock>::~shared_lock() /thirdparty/installed/tsan/include/c++/v1/shared_mutex:369:19 (libmaster.so+0x2e23e0)
      #5 kudu::master::TSManager::GetAllDescriptors(std::__1::vector<std::__1::shared_ptr<kudu::master::TSDescriptor>, std::__1::allocator<std::__1::shared_ptr<kudu::master::TSDescriptor> > >*) const src/kudu/master/ts_manager.cc:206:1 (libmaster.so+0x4adeb6)
      #6 kudu::master::AutoLeaderRebalancerTask::RunLeaderRebalancer() src/kudu/master/auto_leader_rebalancer.cc:405:16 (libmaster.so+0x2fb51b)
      #7 kudu::master::AutoLeaderRebalancerTask::RunLoop() src/kudu/master/auto_leader_rebalancer.cc:445:7 (libmaster.so+0x2fbaa9)

This is a follow-up to 10efaf2.

Change-Id: Iccd66d00280d22b37386230874937e5260f07f3b
Reviewed-on: http://gerrit.cloudera.org:8080/21417
Reviewed-by: Wang Xixu <[email protected]>
Tested-by: Alexey Serbin <[email protected]>
Reviewed-by: Yifan Zhang <[email protected]>
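The race in the commit message above comes from a background task still touching state that its owner is destroying. The actual fix is in the C++ master code and is not shown on this page; as a rough Java analogue of the ordering it describes, the sketch below stops and joins a periodic worker before the owner releases anything the worker uses. `PeriodicTaskSketch` and its methods are hypothetical names.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical periodic task with an explicit, blocking shutdown. */
final class PeriodicTaskSketch {
  private final CountDownLatch stopRequested = new CountDownLatch(1);
  private final Thread thread;

  PeriodicTaskSketch(Runnable body, long periodMillis) {
    thread = new Thread(() -> {
      try {
        // await() returns true once shutdown() has counted the latch down.
        while (!stopRequested.await(periodMillis, TimeUnit.MILLISECONDS)) {
          body.run();
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "periodic-task-sketch");
  }

  void start() {
    thread.start();
  }

  /** Requests a stop and waits until the worker thread has exited. */
  void shutdown() {
    stopRequested.countDown();
    boolean interrupted = false;
    while (true) {
      try {
        thread.join();
        break;
      } catch (InterruptedException e) {
        interrupted = true; // keep waiting, restore the flag afterwards
      }
    }
    if (interrupted) {
      Thread.currentThread().interrupt();
    }
  }

  boolean isRunning() {
    return thread.isAlive();
  }

  public static void main(String[] args) {
    PeriodicTaskSketch task =
        new PeriodicTaskSketch(() -> System.out.println("tick"), 10);
    task.start();
    // The owner calls shutdown() first, then tears down shared state.
    task.shutdown();
    System.out.println("running after shutdown: " + task.isRunning());
  }
}
```

Because `shutdown()` returns only after `join()`, any state the task body reads can be released safely once the call completes.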

zhangyifan27 commented Jul 4, 2022

java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java (outdated, resolved)

zhangyifan27 commented Jul 4, 2022

java/kudu-client/src/main/java/org/apache/kudu/client/TableLocationsCache.java (resolved)

zhangyifan27 force-pushed the fix_java_master_failover branch 2 times, most recently from cb5457b to 8bd6406 · July 7, 2022 03:03

zhangyifan27 force-pushed the fix_java_master_failover branch from 8bd6406 to 469b6b4 · July 7, 2022 03:42

zhangyifan27 closed this Jul 14, 2022
