    v11.3: fix lock_ref lifecycle in rank-0-decide materialize path · 4891f404
    qiang zhu authored
    v11p2 crashed mid-request (PP1+ only, PP0 fine):
      AssertionError: dec_lock_ref on node with node.mamba_lock_ref=0
      (mamba_radix_cache.cache_unfinished_req -> dec_lock_ref)
    
    Root cause: on rank 0, PrefillAdder.add_one_req -> _req_inc_lock_ref takes a
    persistent inc_lock_ref on req.last_node at first admission; that is the lock
    cache_unfinished_req / cache_finished_req later release via dec_lock_ref.
    v11 (<=v11p2) skipped the whole PrefillAdder path on rank 1+ (materializing
    the batch instead), so the lock was never taken and the first dec_lock_ref
    tripped the assert.
    
    Why it "runs a while, then crashes": two stacked bugs on rank 1+:
     1. new admits never call inc_lock_ref, so their first dec_lock_ref may
        silently consume a lock held by another req before the assert fires
     2. a continuing chunked_req was re-matched via
        init_next_round_input(tree_cache), which can reassign last_node away
        from the locked node (rank 0 calls it with NO tree_cache,
        scheduler.py:2702)
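    
    The failure modes above can be reproduced in a toy model. This is only a
    sketch: Node, steal_scenario and rematch_scenario are hypothetical names,
    and the single counter stands in for mamba_lock_ref, not the real radix
    cache code.

```python
class Node:
    """Toy stand-in for a radix-cache node; only the lock counter is modeled."""

    def __init__(self, name):
        self.name = name
        self.lock_ref = 0

    def inc_lock_ref(self):
        self.lock_ref += 1

    def dec_lock_ref(self):
        # Same shape of check as the assertion quoted in the crash.
        assert self.lock_ref > 0, f"dec_lock_ref on node {self.name} with lock_ref=0"
        self.lock_ref -= 1


def steal_scenario():
    """Bug 1: an unlocked request releases a lock that another request took."""
    events = []
    shared = Node("shared")
    shared.inc_lock_ref()          # request A admitted with a lock
    shared.dec_lock_ref()          # request B never locked, but its release passes
    events.append(("B release ok", shared.lock_ref))
    try:
        shared.dec_lock_ref()      # A's own release now finds the counter at 0
        events.append(("A release ok", shared.lock_ref))
    except AssertionError as exc:
        events.append(("A release failed", str(exc)))
    return events


def rematch_scenario():
    """Bug 2: last_node is moved off the locked node before the release."""
    locked, other = Node("locked"), Node("other")
    locked.inc_lock_ref()          # lock taken at first admission
    last_node = locked
    last_node = other              # re-match reassigns last_node
    try:
        last_node.dec_lock_ref()
        return "ok"
    except AssertionError:
        return "assert tripped"
```

    In the first scenario the crash is delayed until the request that really
    owns the lock releases it, which matches the "runs a while" symptom.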
    
    Fix (0004-v11p3-lockref-fix.patch, single file +62/-15): mirror rank 0's two
    branches in _pp_pd_materialize_batch_from_decision:
     - continuing chunked_req: init_next_round_input() (no tree_cache, keep lock)
     - new admit: init_next_round_input(tree_cache, cap) + inc_lock_ref(last_node)
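    
    A minimal sketch of the two branches, assuming toy classes: ToyTreeCache,
    ToyReq and materialize_one are hypothetical, the cap argument is omitted,
    and the real change lives in _pp_pd_materialize_batch_from_decision.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.lock_ref = 0


class ToyTreeCache:
    """Hypothetical cache: match() returns the deepest node it currently holds."""

    def __init__(self):
        self.nodes = [Node("root")]

    def match(self):
        return self.nodes[-1]

    def grow(self, name):
        self.nodes.append(Node(name))

    def inc_lock_ref(self, node):
        node.lock_ref += 1

    def dec_lock_ref(self, node):
        assert node.lock_ref > 0, "dec_lock_ref on node with lock_ref=0"
        node.lock_ref -= 1


class ToyReq:
    def __init__(self):
        self.last_node = None

    def init_next_round_input(self, tree_cache=None):
        # Re-match (and possibly move last_node) only when a cache is passed.
        if tree_cache is not None:
            self.last_node = tree_cache.match()


def materialize_one(req, tree_cache, continuing):
    if continuing:
        # Continuing chunk: no tree_cache, so last_node stays on the locked node.
        req.init_next_round_input()
    else:
        # New admit: match against the cache, then take the persistent lock.
        req.init_next_round_input(tree_cache)
        tree_cache.inc_lock_ref(req.last_node)


def demo():
    cache = ToyTreeCache()
    req = ToyReq()
    materialize_one(req, cache, continuing=False)
    locked = req.last_node
    cache.grow("longer-prefix")    # the tree changes between chunks
    materialize_one(req, cache, continuing=True)
    still_on_locked_node = req.last_node is locked
    cache.dec_lock_ref(req.last_node)  # release succeeds, counter back to 0
    return still_on_locked_node, locked.lock_ref
```

    With both branches in place, every dec_lock_ref in this toy is paired
    with the inc_lock_ref taken at admission, on the same node.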
    
    docker tag: v0.5.12-pp-v11p2-rank0-decide -> v0.5.12-pp-v11p3-rank0-decide