qiang zhu authored
v11p2 crashed mid-request (PP1+ only, PP0 fine):

  AssertionError: dec_lock_ref on node with node.mamba_lock_ref=0
  (mamba_radix_cache.cache_unfinished_req -> dec_lock_ref)

Root cause: rank 0's PrefillAdder.add_one_req -> _req_inc_lock_ref takes a
persistent inc_lock_ref on req.last_node at first admission; that lock is
what cache_unfinished_req / cache_finished_req later dec. v11 (<=v11p2)
skipped the whole PrefillAdder path on rank 1+ (materialize instead), so the
lock was never taken -> first dec_lock_ref tripped the assert.

'runs a while then crashes' = two stacked bugs:
  1. new admits never inc_lock_ref (may silently steal another req's lock
     first)
  2. continuing chunked_req re-matched via init_next_round_input(tree_cache),
     reassigning last_node away from the locked node (rank 0 calls it with
     NO tree_cache, scheduler.py:2702)

Fix (0004-v11p3-lockref-fix.patch, single file +62/-15): mirror rank 0's two
branches in _pp_pd_materialize_batch_from_decision:
  - continuing chunked_req: init_next_round_input() (no tree_cache, keep lock)
  - new admit: init_next_round_input(tree_cache, cap) + inc_lock_ref(last_node)

docker tag: v0.5.12-pp-v11p2-rank0-decide -> v0.5.12-pp-v11p3-rank0-decide

4891f404
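
Illustration only, not part of the patch: a toy model of the lock-ref invariant behind both bugs. ToyNode, ToyCache, admit, finish, run_new_admit and run_rematch are hypothetical names invented for this sketch. Only the inc_lock_ref / dec_lock_ref naming and the "dec on a node with lock_ref=0 asserts" behaviour are taken from the report above. The real cache and scheduler code is assumed to be considerably more involved.

```python
# Toy model: these classes are hypothetical stand-ins, not the real cache.
class ToyNode:
    def __init__(self, name):
        self.name = name
        self.lock_ref = 0


class ToyCache:
    def inc_lock_ref(self, node):
        node.lock_ref += 1

    def dec_lock_ref(self, node):
        # Same shape of check as the assertion in the crash report.
        assert node.lock_ref > 0, f"dec_lock_ref on {node.name} with lock_ref=0"
        node.lock_ref -= 1


def admit(cache, node, take_lock):
    """First admission: take the persistent lock unless the path skips it."""
    if take_lock:
        cache.inc_lock_ref(node)
    return node


def finish(cache, last_node):
    """Completion path: always releases the lock on the request's last_node."""
    try:
        cache.dec_lock_ref(last_node)
        return "ok"
    except AssertionError:
        return "assert"


def run_new_admit(take_lock):
    """Bug 1: a new admit that never takes the lock trips the first dec."""
    cache = ToyCache()
    node = admit(cache, ToyNode("last_node"), take_lock)
    return finish(cache, node)


def run_rematch():
    """Bug 2: a re-match moves last_node off the node that holds the lock."""
    cache = ToyCache()
    locked = admit(cache, ToyNode("locked"), take_lock=True)
    last_node = ToyNode("rematched")  # reassigned away from the locked node
    return finish(cache, last_node), locked.lock_ref


if __name__ == "__main__":
    print(run_new_admit(True), run_new_admit(False), run_rematch())
```

In this sketch the second scenario also leaves the original node's count at 1, so under these assumptions the re-match both trips the assert and leaks the lock that was taken at admission.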