Kudu-2915 #6

Closed

zhangyifan27

No description provided.

zhangyifan27 force-pushed the KUDU-2915 branch 2 times, most recently from aa53871 to e3eb2e2 (January 6, 2022 14:35)
zhangyifan27 force-pushed the KUDU-2915 branch 4 times, most recently from 66899b6 to 022de63 (January 14, 2022 11:37)
zhangyifan27 force-pushed the KUDU-2915 branch 3 times, most recently from 3cf21d9 to 525927c (January 18, 2022 14:02)
zhangyifan27 force-pushed the KUDU-2915 branch 2 times, most recently from 51f9c1d to 565c2ea (January 19, 2022 09:21)
Add a 'kudu tserver unregister' tool to unregister a tserver from the
master. This tool will be useful when we want to decommission a tserver
without restarting masters.

It removes the dead tserver from master's in-memory map and persisted
catalog by default. It's also possible to unregister a tserver which is
not presumed dead by adding '-force_unregister_live_tserver', and keep
tserver's persisted state by adding '-remove_tserver_state=false'.
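
The semantics described above can be modeled with a small sketch. This is not Kudu's implementation: `TServerRegistry`, its `Result` enum, and the two containers are hypothetical stand-ins for the master's in-memory map and persisted catalog. Only the two parameter names mirror the flags named in the commit message, and the refusal to unregister a live tserver by default is inferred from that description.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical model of the master's bookkeeping; not Kudu's TSManager.
class TServerRegistry {
 public:
  enum class Result { kOk, kNotFound, kStillLive };

  // Records a tserver both in memory and in the (simulated) persisted catalog.
  void Register(const std::string& uuid, bool presumed_dead) {
    in_memory_[uuid] = presumed_dead;
    persisted_.insert(uuid);
  }

  // Parameter names mirror the flags from the commit message; the defaults
  // follow the behavior it describes (assumption: live tservers are refused
  // unless forced).
  Result Unregister(const std::string& uuid,
                    bool force_unregister_live_tserver = false,
                    bool remove_tserver_state = true) {
    auto it = in_memory_.find(uuid);
    if (it == in_memory_.end()) {
      return Result::kNotFound;
    }
    const bool presumed_dead = it->second;
    if (!presumed_dead && !force_unregister_live_tserver) {
      return Result::kStillLive;
    }
    in_memory_.erase(it);
    if (remove_tserver_state) {
      persisted_.erase(uuid);
    }
    return Result::kOk;
  }

  bool InMemory(const std::string& uuid) const {
    return in_memory_.count(uuid) > 0;
  }
  bool Persisted(const std::string& uuid) const {
    return persisted_.count(uuid) > 0;
  }

 private:
  std::map<std::string, bool> in_memory_;  // uuid -> presumed dead?
  std::set<std::string> persisted_;        // uuids with persisted state
};
```

In this model, unregistering a live tserver fails unless the force argument is set, and passing `false` for `remove_tserver_state` drops only the in-memory entry.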

Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
zhangyifan27 pushed a commit that referenced this pull request Jun 12, 2024
It turned out that the auto leader rebalancing task wasn't explicitly
shut down upon shutting down the catalog manager.  That led to race
conditions as reported by TSAN, at least in test scenarios (see below).
This patch addresses the issue.

  WARNING: ThreadSanitizer: data race (pid=23827)
    Write of size 1 at 0x7b4000008208 by main thread:
      #0 AnnotateRWLockDestroy thirdparty/src/llvm-11.0.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_interface_ann.cpp:264 (auto_rebalancer-test+0x33575e)
      #1 kudu::rw_spinlock::~rw_spinlock() src/kudu/util/locks.h:89:5 (libmaster.so+0x359376)
      #2 kudu::master::TSManager::~TSManager() src/kudu/master/ts_manager.cc:108:1 (libmaster.so+0x4ad201)
      #3 kudu::master::TSManager::~TSManager() src/kudu/master/ts_manager.cc:107:25 (libmaster.so+0x4ad229)
      #4 std::__1::default_delete<kudu::master::TSManager>::operator()(kudu::master::TSManager*) const thirdparty/installed/tsan/include/c++/v1/memory:2262:5 (libmaster.so+0x407ce7)
      #5 std::__1::unique_ptr<kudu::master::TSManager, std::__1::default_delete<kudu::master::TSManager> >::reset(kudu::master::TSManager*) thirdparty/installed/tsan/include/c++/v1/memory:2517:7 (libmaster.so+0x40157d)
      #6 std::__1::unique_ptr<kudu::master::TSManager, std::__1::default_delete<kudu::master::TSManager> >::~unique_ptr() thirdparty/installed/tsan/include/c++/v1/memory:2471:19 (libmaster.so+0x4015eb)
      #7 kudu::master::Master::~Master() src/kudu/master/master.cc:263:1 (libmaster.so+0x3f7a4a)
      #8 kudu::master::Master::~Master() src/kudu/master/master.cc:261:19 (libmaster.so+0x3f7dc9)
      #9 std::__1::default_delete<kudu::master::Master>::operator()(kudu::master::Master*) const thirdparty/installed/tsan/include/c++/v1/memory:2262:5 (libmaster.so+0x435627)
      #10 std::__1::unique_ptr<kudu::master::Master, std::__1::default_delete<kudu::master::Master> >::reset(kudu::master::Master*) thirdparty/installed/tsan/include/c++/v1/memory:2517:7 (libmaster.so+0x42e6ed)
      #11 kudu::master::MiniMaster::Shutdown() src/kudu/master/mini_master.cc:120:13 (libmaster.so+0x4c2612)
    ...
    Previous atomic write of size 4 at 0x7b4000008208 by thread T439 (mutexes: write M1141235379631443968):
      #0 __tsan_atomic32_compare_exchange_strong thirdparty/src/llvm-11.0.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_interface_atomic.cpp:780 (auto_rebalancer-test+0x33eb60)
      #1 base::subtle::Release_CompareAndSwap(int volatile*, int, int) /src/kudu/gutil/atomicops-internals-tsan.h:88:3 (libmaster.so+0x2e2b34)
      #2 kudu::rw_semaphore::unlock_shared() src/kudu/util/rw_semaphore.h:91:19 (libmaster.so+0x2e29c8)
      #3 kudu::rw_spinlock::unlock_shared() src/kudu/util/locks.h:99:10 (libmaster.so+0x2e28ef)
      #4 std::__1::shared_lock<kudu::rw_spinlock>::~shared_lock() /thirdparty/installed/tsan/include/c++/v1/shared_mutex:369:19 (libmaster.so+0x2e23e0)
      #5 kudu::master::TSManager::GetAllDescriptors(std::__1::vector<std::__1::shared_ptr<kudu::master::TSDescriptor>, std::__1::allocator<std::__1::shared_ptr<kudu::master::TSDescriptor> > >*) const src/kudu/master/ts_manager.cc:206:1 (libmaster.so+0x4adeb6)
      #6 kudu::master::AutoLeaderRebalancerTask::RunLeaderRebalancer() src/kudu/master/auto_leader_rebalancer.cc:405:16 (libmaster.so+0x2fb51b)
      #7 kudu::master::AutoLeaderRebalancerTask::RunLoop() src/kudu/master/auto_leader_rebalancer.cc:445:7 (libmaster.so+0x2fbaa9)

This is a follow-up to 10efaf2.
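
The pattern behind the report above can be sketched as follows: a background loop keeps reading lock-protected state, so the owner has to stop and join that loop before the state is destroyed. This is a minimal illustration, not the actual patch; `Registry`, `BackgroundTask`, and `Owner` are hypothetical stand-ins for TSManager, the rebalancing task, and the master.

```cpp
#include <atomic>
#include <chrono>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for shared state guarded by a reader-writer lock.
class Registry {
 public:
  std::vector<int> GetAll() const {
    std::shared_lock<std::shared_mutex> l(lock_);
    return items_;
  }
  void Add(int v) {
    std::lock_guard<std::shared_mutex> l(lock_);
    items_.push_back(v);
  }

 private:
  mutable std::shared_mutex lock_;
  std::vector<int> items_;
};

// Hypothetical stand-in for a periodic background task.
class BackgroundTask {
 public:
  explicit BackgroundTask(const Registry* registry) : registry_(registry) {}
  ~BackgroundTask() { Shutdown(); }

  void Start() { thread_ = std::thread([this] { RunLoop(); }); }

  // Safe to call more than once: asks the loop to stop, then waits for it.
  void Shutdown() {
    stop_.store(true);
    if (thread_.joinable()) {
      thread_.join();
    }
  }

  int iterations() const { return iterations_.load(); }

 private:
  void RunLoop() {
    while (!stop_.load()) {
      registry_->GetAll();  // takes the registry's lock in shared mode
      iterations_.fetch_add(1);
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
  }

  const Registry* registry_;
  std::atomic<bool> stop_{false};
  std::atomic<int> iterations_{0};
  std::thread thread_;
};

// Hypothetical owner of both objects.
class Owner {
 public:
  Owner()
      : registry_(std::make_unique<Registry>()),
        task_(std::make_unique<BackgroundTask>(registry_.get())) {
    task_->Start();
  }
  ~Owner() { Shutdown(); }

  // Stop the task explicitly so nothing reads registry_ while it is destroyed.
  void Shutdown() { task_->Shutdown(); }

  Registry* registry() { return registry_.get(); }
  int task_iterations() const { return task_->iterations(); }

 private:
  std::unique_ptr<Registry> registry_;
  std::unique_ptr<BackgroundTask> task_;
};
```

Once `Owner::Shutdown()` returns, the loop thread has been joined, so the registry and its lock can be destroyed without a concurrent reader.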

Change-Id: Iccd66d00280d22b37386230874937e5260f07f3b
Reviewed-on: http://gerrit.cloudera.org:8080/21417
Reviewed-by: Wang Xixu <[email protected]>
Tested-by: Alexey Serbin <[email protected]>
Reviewed-by: Yifan Zhang <[email protected]>
zhangyifan27 pushed a commit that referenced this pull request Oct 11, 2024
TSAN reported the race condition as follows
(with some information omitted):

  WARNING: ThreadSanitizer: data race (pid=1924273)
    Write of size 8 at 0x7b30002fe7c0 by thread T6 (mutexes: write M247597861, write M247597860, write M247597300):
      #0 std::__1::enable_if<(...), void>::type std::__1::swap<kudu::BlockId*>(...) thirdparty/installed/tsan/include/c++/v1/type_traits:4076:9
      ...
      #4 kudu::tablet::RowSetMetadata::CommitRedoDeltaDataBlock(...) src/kudu/tablet/rowset_metadata.cc:197:22
      #5 kudu::tablet::DeltaTracker::FlushDMS(...) src/kudu/tablet/delta_tracker.cc:826:23
      #6 kudu::tablet::DeltaTracker::Flush(...) src/kudu/tablet/delta_tracker.cc:877:14
      #7 kudu::tablet::DiskRowSet::FlushDeltas(...) src/kudu/tablet/diskrowset.cc:552:26
      ...

    Previous read of size 8 at 0x7b30002fe7c0 by thread T34 (mutexes: write M247598319, write M919714229363433616, write M303002710007881612):
      #0 std::__1::vector<...>::size() const thirdparty/installed/tsan/include/c++/v1/vector:658:61
      #1 kudu::tablet::RowSetMetadata::GetAllBlocks() const src/kudu/tablet/rowset_metadata.cc:306:37
      #2 kudu::tablet::TabletMetadata::UpdateUnlocked(...) src/kudu/tablet/tablet_metadata.cc:677:40
      #3 kudu::tablet::TabletMetadata::UpdateAndFlush(...) src/kudu/tablet/tablet_metadata.cc:549:5
      #4 kudu::tablet::Tablet::FlushMetadata(...) src/kudu/tablet/tablet.cc:1992:21
      #5 kudu::tablet::Tablet::HandleEmptyCompactionOrFlush() src/kudu/tablet/tablet.cc:2308:3
      #6 kudu::tablet::Tablet::DeleteAncientDeletedRowsets() src/kudu/tablet/tablet.cc:3084:3
      ...
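
The commit message does not describe how the race was fixed. As an illustration only, one common way to avoid this kind of race is to make the writer that swaps the block list and the reader that copies it take the same lock. In the sketch below, `RowSetMeta` and its members are hypothetical and do not reflect Kudu's actual classes or locking scheme.

```cpp
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical stand-in for per-rowset metadata holding a list of block ids.
class RowSetMeta {
 public:
  // Writer side: replaces the block list under the lock.
  void CommitRedoBlocks(std::vector<int64_t> new_blocks) {
    std::lock_guard<std::mutex> l(lock_);
    redo_blocks_.swap(new_blocks);
  }

  // Reader side: returns a copy taken under the same lock, so the caller
  // never observes the vector in the middle of a swap.
  std::vector<int64_t> GetAllBlocks() const {
    std::lock_guard<std::mutex> l(lock_);
    return redo_blocks_;
  }

 private:
  mutable std::mutex lock_;
  std::vector<int64_t> redo_blocks_;
};
```

Returning a copy rather than a reference keeps the reader's snapshot valid after the lock is released, at the cost of one allocation per call.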

Change-Id: I07103269526d0ee98b0bb19e76e11f7d47a5b217
Reviewed-on: http://gerrit.cloudera.org:8080/21799
Reviewed-by: Abhishek Chennaka <[email protected]>
Tested-by: Alexey Serbin <[email protected]>
zhangyifan27 pushed a commit that referenced this pull request Feb 5, 2025
The thread pool of the DNS resolver should be shut down along with the
messenger in ServerBase to prevent retrying of RPCs that failed as
collateral damage of the shutdown process in progress.  Those RPCs might be
retried by invoking rpc::Proxy::RefreshDnsAndEnqueueRequest(), etc.

On a related note, I also added a guard to protect ThreadPool::tokens_
in the destructor of the ThreadPool class, as is done elsewhere.  I also
snuck in an update to call DCHECK() in a loop only when the DCHECK_IS_ON()
macro evaluates to 'true'.
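
The two smaller changes in the paragraph above can be illustrated with a simplified sketch: the destructor takes the same lock as the code that releases tokens before it inspects the token set, and the per-element check loop is compiled only in debug builds. `MiniPool` is hypothetical, and the standard `assert` and `NDEBUG` stand in for Kudu's DCHECK() and DCHECK_IS_ON(). The guard only protects the access to the set; it does not replace shutting the pool down before its owner is destroyed.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <unordered_set>

// Hypothetical, heavily simplified pool that only tracks token ids.
class MiniPool {
 public:
  ~MiniPool() {
    // Take the same lock ReleaseToken() uses before touching tokens_.
    std::lock_guard<std::mutex> l(lock_);
#ifndef NDEBUG
    // Debug-only loop: skipped entirely when assertions are compiled out.
    for (int id : tokens_) {
      assert(id >= 0);
    }
#endif
    tokens_.clear();
  }

  int NewToken() {
    std::lock_guard<std::mutex> l(lock_);
    const int id = next_id_++;
    tokens_.insert(id);
    return id;
  }

  // Returns true if the token was registered and has now been removed.
  bool ReleaseToken(int id) {
    std::lock_guard<std::mutex> l(lock_);
    return tokens_.erase(id) == 1;
  }

  std::size_t num_tokens() const {
    std::lock_guard<std::mutex> l(lock_);
    return tokens_.size();
  }

 private:
  mutable std::mutex lock_;
  std::unordered_set<int> tokens_;
  int next_id_ = 0;
};
```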

This addresses flakiness reported at least in one of the RemoteKsckTest
scenarios (e.g., TestFilterOnNotabletTable in [1]).  One of the related
TSAN reports looked like below:

RemoteKsckTest.TestFilterOnNotabletTable: WARNING: ThreadSanitizer: data race
  Read of size 8 at 0x7b54001e5118 by main thread:
    #0 std::__1::__hash_table<kudu::ThreadPoolToken*, ...>::size() const
    #1 std::__1::unordered_set<kudu::ThreadPoolToken*, ...>::size() const
    #2 kudu::ThreadPool::~ThreadPool()
    ...
    #6 kudu::kserver::KuduServer::~KuduServer()
    #7 kudu::tserver::TabletServer::~TabletServer()
    ...

  Previous write of size 8 at 0x7b54001e5118 by thread T262 ...:
    #0 std::__1::__hash_table<kudu::ThreadPoolToken*, ...>::remove(...)
    ...
    #4 kudu::ThreadPool::ReleaseToken(...)
    #5 kudu::ThreadPoolToken::~ThreadPoolToken()
    ...
    #24 kudu::consensus::LeaderElection::~LeaderElection()
    ...
    #35 kudu::rpc::Proxy::RefreshDnsAndEnqueueRequest(...)
    ...
    #41 kudu::DnsResolver::RefreshAddressesAsync()
    ...

  Thread T262 'dns-resolver [w' (tid=29102, running) created by thread T182 at:
    #0 pthread_create
    #1 kudu::Thread::StartThread(...)
    #2 kudu::Thread::Create(...)
    #3 kudu::ThreadPool::CreateThread()
    #4 kudu::ThreadPool::DoSubmit(..., kudu::ThreadPoolToken*)
    #5 kudu::ThreadPool::Submit(...)
    #6 kudu::DnsResolver::RefreshAddressesAsync(..)
    #7 kudu::rpc::Proxy::RefreshDnsAndEnqueueRequest(...)
    #8 kudu::rpc::Proxy::AsyncRequest(...)
    ...
    #15 kudu::rpc::OutboundCall::CallCallback()
    #16 kudu::rpc::OutboundCall::SetFailed()
    #17 kudu::rpc::Connection::Shutdown()
    #18 kudu::rpc::ReactorThread::ShutdownInternal()
    ...
    #25 kudu::rpc::ReactorThread::RunThread()
    ...

[1] http://dist-test.cloudera.org:8080/test_drilldown?test_name=ksck_remote-test

Change-Id: I525f1078a349dbd2926938bb4fcc3e80888dfbb4
Reviewed-on: http://gerrit.cloudera.org:8080/22434
Tested-by: Alexey Serbin <[email protected]>
Reviewed-by: Abhishek Chennaka <[email protected]>