At HubSpot we use the `dba` workload on Vitess for migrations.
We discovered that certain shards would sometimes hang during a planned reparent and eventually fail.
We were able to isolate the hanging behavior to the `tx_pool` on vttablet failing to drain:
```
goroutine 19866670 [semacquire, 5 minutes]:
sync.runtime_notifyListWait(0xc420430390, 0xc400000000)
/usr/local/go/src/runtime/sema.go:507 +0x110
sync.(*Cond).Wait(0xc420430380)
/usr/local/go/src/sync/cond.go:56 +0x80
vitess.io/vitess/go/pools.(*Numbered).WaitForEmpty(0xc420440240)
/usr/share/hubspot/build/workspace/vitess-internal/rpm_builder/vitess-build/src/vitess.io/vitess/go/pools/numbered.go:182 +0x66
vitess.io/vitess/go/vt/vttablet/tabletserver.(*TxPool).WaitForEmpty(0xc4202de700)
/usr/share/hubspot/build/workspace/vitess-internal/rpm_builder/vitess-build/src/vitess.io/vitess/go/vt/vttablet/tabletserver/tx_pool.go:185 +0x2f
vitess.io/vitess/go/vt/vttablet/tabletserver.(*TxEngine).Close(0xc420391200, 0xc421647e00)
/usr/share/hubspot/build/workspace/vitess-internal/rpm_builder/vitess-build/src/vitess.io/vitess/go/vt/vttablet/tabletserver/tx_engine.go:195 +0xce
vitess.io/vitess/go/vt/vttablet/tabletserver.(*TabletServer).waitForShutdown(0xc4200de100)
/usr/share/hubspot/build/workspace/vitess-internal/rpm_builder/vitess-build/src/vitess.io/vitess/go/vt/vttablet/tabletserver/tabletserver.go:565 +0x67
vitess.io/vitess/go/vt/vttablet/tabletserver.(*TabletServer).gracefulStop(0xc4200de100)
/usr/share/hubspot/build/workspace/vitess-internal/rpm_builder/vitess-build/src/vitess.io/vitess/go/vt/vttablet/tabletserver/tabletserver.go:525 +0x63
vitess.io/vitess/go/vt/vttablet/tabletserver.(*TabletServer).SetServingType(0xc4200de100, 0x1, 0x0, 0x0, 0x0, 0xc42147b800, 0x0, 0x0)
/usr/share/hubspot/build/workspace/vitess-internal/rpm_builder/vitess-build/src/vitess.io/vitess/go/vt/vttablet/tabletserver/tabletserver.go:418 +0x1db
vitess.io/vitess/go/vt/vttablet/tabletmanager.(*ActionAgent).DemoteMaster(0xc4200dec00, 0x7fa630d54aa8, 0xc421862420, 0x0, 0x0, 0x0, 0x0)
/usr/share/hubspot/build/workspace/vitess-internal/rpm_builder/vitess-build/src/vitess.io/vitess/go/vt/vttablet/tabletmanager/rpc_replication.go:307 +0x16a
vitess.io/vitess/go/vt/vttablet/grpctmserver.(*server).DemoteMaster(0xc42017c890, 0x7fa630d54aa8, 0xc421862420, 0x1b7a1b8, 0xc42166b220, 0x7fa630d54aa8, 0xc421062ab0)
/usr/share/hubspot/build/workspace/vitess-internal/rpm_builder/vitess-build/src/vitess.io/vitess/go/vt/vttablet/grpctmserver/server.go:366 +0x182
vitess.io/vitess/go/vt/proto/tabletmanagerservice._TabletManager_DemoteMaster_Handler(0x10abc40, 0xc42017c890, 0x7fa630d54aa8, 0xc4218623c0, 0xc422ae1810, 0x0, 0x0, 0x0, 0x12000, 0x198eeb0)
/usr/share/hubspot/build/workspace/vitess-internal/rpm_builder/vitess-build/src/vitess.io/vitess/go/vt/proto/tabletmanagerservice/tabletmanagerservice.pb.go:1289 +0x276
vitess.io/vitess/vendor/google.golang.org/grpc.(*Server).processUnaryRPC(0xc4200d6580, 0x19bbec0, 0xc42159fe00, 0xc4211b7680, 0xc4202fba10, 0x1999ed8, 0x0, 0x0, 0x0)
/usr/share/hubspot/build/workspace/vitess-internal/rpm_builder/vitess-build/src/vitess.io/vitess/vendor/google.golang.org/grpc/server.go:923 +0x92d
vitess.io/vitess/vendor/google.golang.org/grpc.(*Server).handleStream(0xc4200d6580, 0x19bbec0, 0xc42159fe00, 0xc4211b7680, 0x0)
/usr/share/hubspot/build/workspace/vitess-internal/rpm_builder/vitess-build/src/vitess.io/vitess/vendor/google.golang.org/grpc/server.go:1148 +0x1528
vitess.io/vitess/vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc420f73120, 0xc4200d6580, 0x19bbec0, 0xc42159fe00, 0xc4211b7680)
/usr/share/hubspot/build/workspace/vitess-internal/rpm_builder/vitess-build/src/vitess.io/vitess/vendor/google.golang.org/grpc/server.go:637 +0x9f
created by vitess.io/vitess/vendor/google.golang.org/grpc.(*Server).serveStreams.func1
/usr/share/hubspot/build/workspace/vitess-internal/rpm_builder/vitess-build/src/vitess.io/vitess/vendor/google.golang.org/grpc/server.go:635 +0xa1
```
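
The goroutine is parked in `pools.(*Numbered).WaitForEmpty`, which blocks on a `sync.Cond` until every registered transaction has been unregistered. A minimal sketch of that drain pattern (simplified, not the actual Vitess code) shows why a `dba` transaction that is never killed and never completed blocks the reparent indefinitely:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// numbered mimics the shape of go/pools.Numbered: a registry of live
// transactions guarded by a condition variable.
type numbered struct {
	mu        sync.Mutex
	empty     *sync.Cond
	resources map[int64]struct{}
}

func newNumbered() *numbered {
	n := &numbered{resources: map[int64]struct{}{}}
	n.empty = sync.NewCond(&n.mu)
	return n
}

func (n *numbered) register(id int64) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.resources[id] = struct{}{}
}

func (n *numbered) unregister(id int64) {
	n.mu.Lock()
	defer n.mu.Unlock()
	delete(n.resources, id)
	if len(n.resources) == 0 {
		n.empty.Broadcast()
	}
}

// waitForEmpty blocks until every registered transaction is gone. With a
// leaked dba transaction, unregister never runs and this never returns,
// which is exactly where DemoteMaster was stuck in the trace above.
func (n *numbered) waitForEmpty() {
	n.mu.Lock()
	defer n.mu.Unlock()
	for len(n.resources) > 0 {
		n.empty.Wait()
	}
}

func main() {
	pool := newNumbered()
	pool.register(1) // a dba txn: timeout enforcement disabled, never killed
	go func() {
		time.Sleep(time.Second)
		pool.unregister(1) // if the client died, this never happens
	}()
	pool.waitForEmpty() // hangs forever when unregister never runs
	fmt.Println("tx_pool drained")
}
```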
- We discovered that planned reparent does not gracefully handle `dba` transactions because it assumes all transactions will eventually exceed their deadline and be killed.
- There are no indicators/stats for transactions like this sitting in the `tx_pool`, since timeout enforcement is just a per-transaction boolean flag (https://github.com/vitessio/vitess/blob/master/go/vt/vttablet/tabletserver/tx_pool.go#L256); a sketch of this bookkeeping follows the list.
- We discovered that because transactions are not tied to RPC connections, a `dba` connection can be leaked if the client dies and a rollback/commit never occurs (see the repro sketch below).
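
For context on the first two bullets, a rough sketch of how we understand the bookkeeping (field and function names here are hypothetical, not the actual `tx_pool.go` code the link above points to): whether a transaction can be killed on deadline is tracked only as a per-transaction boolean, so the transaction killer silently skips `dba` transactions and nothing is exported to stats about them.

```go
package main

import (
	"fmt"
	"time"
)

type txConn struct {
	id             int64
	startTime      time.Time
	enforceTimeout bool // the single boolean: false for dba-workload txns
}

// killOldTransactions is roughly what the transaction killer does:
// dba transactions are skipped entirely, and nothing is exported to
// stats saying how many such unkillable transactions are open.
func killOldTransactions(conns []*txConn, timeout time.Duration) (killed int) {
	for _, c := range conns {
		if !c.enforceTimeout {
			continue // dba txn: lives until commit/rollback, invisible here
		}
		if time.Since(c.startTime) > timeout {
			killed++ // a real implementation would roll the txn back
		}
	}
	return killed
}

func main() {
	conns := []*txConn{
		{id: 1, startTime: time.Now().Add(-time.Hour), enforceTimeout: true},
		{id: 2, startTime: time.Now().Add(-time.Hour), enforceTimeout: false},
	}
	// Only the first transaction is killable; the dba one (id 2) survives.
	fmt.Println("killed:", killOldTransactions(conns, 30*time.Second))
}
```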
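
To illustrate the leak in the last bullet, a hedged repro sketch. The DSN, port, and keyspace are placeholders, and it assumes vtgate's MySQL protocol endpoint with `set workload` support; the point is only that a client exiting mid-transaction leaves the `dba` transaction open in the `tx_pool`:

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"os"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	ctx := context.Background()
	db, err := sql.Open("mysql", "user@tcp(vtgate:15306)/keyspace")
	if err != nil {
		log.Fatal(err)
	}
	// Pin a single connection so the workload setting and the BEGIN
	// happen on the same session.
	conn, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	// Switch the session to the dba workload, exempting its transactions
	// from the transaction timeout.
	if _, err := conn.ExecContext(ctx, "set workload = 'dba'"); err != nil {
		log.Fatal(err)
	}
	tx, err := conn.BeginTx(ctx, nil)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := tx.ExecContext(ctx, "select 1"); err != nil {
		log.Fatal(err)
	}
	// Simulate the client dying: exit without commit or rollback. The
	// dba transaction stays in the vttablet tx_pool, and a subsequent
	// planned reparent blocks in WaitForEmpty as in the trace above.
	os.Exit(1)
}
```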