Improve transactional integrity for starting controller jobs in dispatcherd (#16300)

Remove SELECT FOR UPDATE from job dispatch to reduce transaction rollbacks

  Move status transition from BaseTask.transition_status (which used
  SELECT FOR UPDATE inside transaction.atomic()) into
  dispatch_waiting_jobs. The new approach uses filter().update() which
  is atomic at the database level without requiring explicit row locks,
  reducing transaction contention and rollbacks observed in perfscale
  testing.
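
A Django `filter(...).update(...)` call issues a single SQL UPDATE with a WHERE clause and returns the number of matched rows, so the status check and the write happen in one statement. The sketch below illustrates that pattern with the standard-library `sqlite3` module as a stand-in for the ORM. The `job` table and the `mark_running` helper are hypothetical names chosen for illustration and are not taken from the AWX code.

```python
import sqlite3


def mark_running(conn, job_id):
    """Flip a job from 'waiting' to 'running' in one UPDATE statement.

    Hypothetical helper: returns True only if this call did the transition.
    """
    cur = conn.execute(
        "UPDATE job SET status = 'running' WHERE id = ? AND status = 'waiting'",
        (job_id,),
    )
    conn.commit()
    # rowcount is the number of rows the UPDATE modified (0 or 1 here).
    return cur.rowcount == 1


# Hypothetical single-table schema, just enough to show the pattern.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO job (id, status) VALUES (1, 'waiting')")
conn.commit()

first = mark_running(conn, 1)   # performs the transition
second = mark_running(conn, 1)  # no-op: the row is no longer 'waiting'
status = conn.execute("SELECT status FROM job WHERE id = 1").fetchone()[0]
```

Because the WHERE clause carries the precondition, a second caller simply matches zero rows instead of waiting on a row lock held inside a transaction.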

  The transition_status method was an artifact of the feature flag era
  where we needed to support both old and new code paths. Since
  dispatch_waiting_jobs is already a singleton
  (on_duplicate='queue_one') scoped to the local node, the
  de-duplication logic is unnecessary.

  Status is updated after the task is submitted to dispatcherd, so the
  job's UUID is already in the dispatch pipeline before the job is
  marked running. This prevents the reaper from incorrectly reaping
  jobs during the handoff window. If a worker picks up the task before
  the status update lands, RunJob.run() handles the race by accepting
  a job that is still waiting and transitioning it to running itself.
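
The two possible orderings of the handoff can be sketched with a plain dictionary standing in for the job table. This is a simplified model under stated assumptions, not the AWX implementation: `dispatcher_mark_running` and `worker_start` are hypothetical names, and the real code works against the database rather than an in-memory dict.

```python
def dispatcher_mark_running(jobs, job_id):
    """Dispatcher side: after submitting the task, flip waiting -> running."""
    if jobs[job_id] == "waiting":
        jobs[job_id] = "running"
        return True
    return False


def worker_start(jobs, job_id):
    """Worker side: accept either status and make sure the job ends up running."""
    status = jobs[job_id]
    if status == "waiting":
        # The dispatcher's update has not landed yet; do the transition here.
        jobs[job_id] = "running"
    elif status != "running":
        raise RuntimeError(f"job {job_id} cannot start from status {status!r}")
    return jobs[job_id]


# Ordering A: the dispatcher's status update lands before the worker starts.
jobs_a = {1: "waiting"}
dispatcher_mark_running(jobs_a, 1)
result_a = worker_start(jobs_a, 1)

# Ordering B: the worker picks up the task first; the late update is a no-op.
jobs_b = {1: "waiting"}
result_b = worker_start(jobs_b, 1)
late_update = dispatcher_mark_running(jobs_b, 1)
```

Either ordering leaves the job in the running status, which is the property the new test checks from the worker's side.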

  Signed-off-by: Seth Foster <fosterbseth@gmail.com>
  Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Seth Foster
2026-02-26 14:16:36 -05:00
committed by GitHub
parent a21f9fbdb8
commit 2c71bcda32
2 changed files with 43 additions and 32 deletions

@@ -29,3 +29,30 @@ def test_cancel_flag_on_start(jt_linked, caplog):
    job = Job.objects.get(id=job.id)
    assert job.status == 'canceled'


@pytest.mark.django_db
def test_runjob_run_can_accept_waiting_status(jt_linked, mocker):
    """Test that RunJob.run() can accept a job in 'waiting' status and transition it to 'running'
    before the pre_run_hook is called"""
    job = jt_linked.create_unified_job()
    job.status = 'waiting'
    job.save()

    status_at_pre_run = None

    def capture_status(instance, private_data_dir):
        nonlocal status_at_pre_run
        instance.refresh_from_db()
        status_at_pre_run = instance.status

    mock_pre_run = mocker.patch.object(RunJob, 'pre_run_hook', side_effect=capture_status)

    task = RunJob()
    try:
        task.run(job.id)
    except Exception:
        pass

    mock_pre_run.assert_called_once()
    assert status_at_pre_run == 'running'