HA Active Task Handoff Between Nodes (for Drain)
Currently, I believe that "drain" on HA clusters simply waits for running tasks to finish. This is okay when we are draining during scheduled downtime or a quiet deployment period.
However, because CI and automation mean deployments are happening constantly, this creates a few issues.
Scenario 1: Hard reboots need to wait for a full drain - We had a rogue deployment spitting out 200k lines of text on an HA node, effectively bringing it to its knees. Even task cancellation was failing because of all the excessive buffering. Hard resetting the node is a problem while it still has active tasks.
This scenario also applies to hardware upgrades (even in VMs). We have dozens of teams with 100s of projects.
Furthermore, this feature could potentially allow one to dedicate a node as a "task handler" if nodes behind the LB are being hammered with API/AJAX requests.
Hi Jai, that's an interesting thought. Maybe we could signal that a task should "pause" - much as a manual intervention or guided failure does today - so that it can be picked up by another node.
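To make the "pause and pick up" idea a little more concrete, here is a rough sketch in Python. It is purely illustrative and does not describe how the product works today: the `Task` record, the `drain` and `pick_up` functions, and the `completed_steps` checkpoint are all hypothetical names, and a real implementation would need shared storage and locking so that two nodes cannot claim the same task.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    task_id: str
    owner: Optional[str]       # node currently running the task (hypothetical field)
    state: str = "running"     # "running" or "paused"
    completed_steps: int = 0   # checkpoint: steps already finished


def drain(node: str, tasks: List[Task]) -> None:
    """Pause every running task owned by `node` and release ownership.

    The checkpoint (completed_steps) is left untouched so another node
    can continue from the same point.
    """
    for task in tasks:
        if task.owner == node and task.state == "running":
            task.state = "paused"
            task.owner = None


def pick_up(node: str, tasks: List[Task], capacity: int) -> List[str]:
    """Let `node` claim up to `capacity` paused, unowned tasks and resume them."""
    claimed: List[str] = []
    for task in tasks:
        if len(claimed) >= capacity:
            break
        if task.state == "paused" and task.owner is None:
            task.owner = node
            task.state = "running"
            claimed.append(task.task_id)
    return claimed


# Example: node-a is drained mid-deployment and node-b takes over its task.
tasks = [
    Task("deploy-1", owner="node-a", completed_steps=3),
    Task("deploy-2", owner="node-b"),
]
drain("node-a", tasks)
claimed = pick_up("node-b", tasks, capacity=5)
print(claimed, tasks[0].owner, tasks[0].completed_steps)
```

In this sketch the drained node gives up ownership immediately instead of waiting for its tasks to finish, which is the behaviour the hard-reboot scenario would need.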
One thing you can do in the meantime is to dedicate some nodes in your cluster to handling HTTP requests only, by setting their task cap to 0, and others to running tasks, by setting their task cap above 0. You could then direct HTTP requests to the API nodes and let the task-running nodes do all the background work.
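As a small illustration of that split, the Python sketch below sorts a cluster into the two roles based on each node's task cap. The `Node` record, the node names, and the `split_by_role` helper are made up for the example and are not part of any real API; the task cap itself would be set through your usual node configuration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Node:
    name: str
    task_cap: int  # maximum concurrent tasks; 0 means the node runs no tasks


def split_by_role(nodes: List[Node]) -> Tuple[List[str], List[str]]:
    """Return (api_only, task_runners) based on each node's task cap."""
    api_only = [n.name for n in nodes if n.task_cap == 0]
    task_runners = [n.name for n in nodes if n.task_cap > 0]
    return api_only, task_runners


# Hypothetical three-node cluster: two API nodes and one task runner.
cluster = [Node("node-a", 0), Node("node-b", 0), Node("node-c", 5)]
api_only, task_runners = split_by_role(cluster)
print("Load balancer pool (HTTP only):", api_only)
print("Task runners:", task_runners)
```

Only the nodes in the first list would sit behind the load balancer; the nodes in the second list would take the background work.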
We are also working on some ideas to make the "dozens of teams with 100s of projects" scenario easier to scale and manage - watch our blog for more details.