
@leoxlin
Created September 18, 2018 19:32

Problem

At HubSpot, we run migrations on Vitess using the dba workload.

We discovered that certain shards sometimes hang during a planned reparent and eventually fail.

Deep diving

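We were able to isolate the hang to vttablet waiting for its tx_pool to drain. In the goroutine dump below, DemoteMaster has been blocked for 5 minutes: it closes the TxEngine, which waits in TxPool.WaitForEmpty.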

goroutine 19866670 [semacquire, 5 minutes]:
sync.runtime_notifyListWait(0xc420430390, 0xc400000000)
	/usr/local/go/src/runtime/sema.go:507 +0x110
sync.(*Cond).Wait(0xc420430380)
	/usr/local/go/src/sync/cond.go:56 +0x80
vitess.io/vitess/go/pools.(*Numbered).WaitForEmpty(0xc420440240)
	/usr/share/hubspot/build/workspace/vitess-internal/rpm_builder/vitess-build/src/vitess.io/vitess/go/pools/numbered.go:182 +0x66
vitess.io/vitess/go/vt/vttablet/tabletserver.(*TxPool).WaitForEmpty(0xc4202de700)
	/usr/share/hubspot/build/workspace/vitess-internal/rpm_builder/vitess-build/src/vitess.io/vitess/go/vt/vttablet/tabletserver/tx_pool.go:185 +0x2f
vitess.io/vitess/go/vt/vttablet/tabletserver.(*TxEngine).Close(0xc420391200, 0xc421647e00)
	/usr/share/hubspot/build/workspace/vitess-internal/rpm_builder/vitess-build/src/vitess.io/vitess/go/vt/vttablet/tabletserver/tx_engine.go:195 +0xce
vitess.io/vitess/go/vt/vttablet/tabletserver.(*TabletServer).waitForShutdown(0xc4200de100)
	/usr/share/hubspot/build/workspace/vitess-internal/rpm_builder/vitess-build/src/vitess.io/vitess/go/vt/vttablet/tabletserver/tabletserver.go:565 +0x67
vitess.io/vitess/go/vt/vttablet/tabletserver.(*TabletServer).gracefulStop(0xc4200de100)
	/usr/share/hubspot/build/workspace/vitess-internal/rpm_builder/vitess-build/src/vitess.io/vitess/go/vt/vttablet/tabletserver/tabletserver.go:525 +0x63
vitess.io/vitess/go/vt/vttablet/tabletserver.(*TabletServer).SetServingType(0xc4200de100, 0x1, 0x0, 0x0, 0x0, 0xc42147b800, 0x0, 0x0)
	/usr/share/hubspot/build/workspace/vitess-internal/rpm_builder/vitess-build/src/vitess.io/vitess/go/vt/vttablet/tabletserver/tabletserver.go:418 +0x1db
vitess.io/vitess/go/vt/vttablet/tabletmanager.(*ActionAgent).DemoteMaster(0xc4200dec00, 0x7fa630d54aa8, 0xc421862420, 0x0, 0x0, 0x0, 0x0)
	/usr/share/hubspot/build/workspace/vitess-internal/rpm_builder/vitess-build/src/vitess.io/vitess/go/vt/vttablet/tabletmanager/rpc_replication.go:307 +0x16a
vitess.io/vitess/go/vt/vttablet/grpctmserver.(*server).DemoteMaster(0xc42017c890, 0x7fa630d54aa8, 0xc421862420, 0x1b7a1b8, 0xc42166b220, 0x7fa630d54aa8, 0xc421062ab0)
	/usr/share/hubspot/build/workspace/vitess-internal/rpm_builder/vitess-build/src/vitess.io/vitess/go/vt/vttablet/grpctmserver/server.go:366 +0x182
vitess.io/vitess/go/vt/proto/tabletmanagerservice._TabletManager_DemoteMaster_Handler(0x10abc40, 0xc42017c890, 0x7fa630d54aa8, 0xc4218623c0, 0xc422ae1810, 0x0, 0x0, 0x0, 0x12000, 0x198eeb0)
	/usr/share/hubspot/build/workspace/vitess-internal/rpm_builder/vitess-build/src/vitess.io/vitess/go/vt/proto/tabletmanagerservice/tabletmanagerservice.pb.go:1289 +0x276
vitess.io/vitess/vendor/google.golang.org/grpc.(*Server).processUnaryRPC(0xc4200d6580, 0x19bbec0, 0xc42159fe00, 0xc4211b7680, 0xc4202fba10, 0x1999ed8, 0x0, 0x0, 0x0)
	/usr/share/hubspot/build/workspace/vitess-internal/rpm_builder/vitess-build/src/vitess.io/vitess/vendor/google.golang.org/grpc/server.go:923 +0x92d
vitess.io/vitess/vendor/google.golang.org/grpc.(*Server).handleStream(0xc4200d6580, 0x19bbec0, 0xc42159fe00, 0xc4211b7680, 0x0)
	/usr/share/hubspot/build/workspace/vitess-internal/rpm_builder/vitess-build/src/vitess.io/vitess/vendor/google.golang.org/grpc/server.go:1148 +0x1528
vitess.io/vitess/vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc420f73120, 0xc4200d6580, 0x19bbec0, 0xc42159fe00, 0xc4211b7680)
	/usr/share/hubspot/build/workspace/vitess-internal/rpm_builder/vitess-build/src/vitess.io/vitess/vendor/google.golang.org/grpc/server.go:637 +0x9f
created by vitess.io/vitess/vendor/google.golang.org/grpc.(*Server).serveStreams.func1
	/usr/share/hubspot/build/workspace/vitess-internal/rpm_builder/vitess-build/src/vitess.io/vitess/vendor/google.golang.org/grpc/server.go:635 +0xa1

Issues

  • Planned reparent does not gracefully handle dba transactions, because it assumes every open transaction will eventually exceed its deadline, which does not hold for dba transactions.
  • There are no indicators or stats showing that such transactions are present in the tx_pool, since the only marker is a boolean flag (https://github.com/vitessio/vitess/blob/master/go/vt/vttablet/tabletserver/tx_pool.go#L256)
  • Because transactions are not tied to RPC connections, a dba connection can leak if the client dies before a rollback or commit is issued.