Bug Report
Description
Generating music with lm_repetition_penalty != 1.0 always fails with:
RuntimeError: Inplace update to inference tensor outside InferenceMode is not allowed.
You can make a clone to get a normal tensor before doing inplace update.
Root Cause
In acestep/third_parts/nano-vllm/nanovllm/engine/model_runner.py, self.run_model() returns a tensor created inside torch.inference_mode(). The code then performs an in-place assignment on a slice of that tensor before cloning it:
logits = self.run_model(input_ids, positions, is_prefill) # inference tensor
reset_context()
# ... inside the repetition penalty block:
logits[i] = torch.where(token_mask, penalty_scores, logits[i]) # CRASH: in-place write to inference tensor
# ... only later:
logits = logits.clone() # too late - clone comes AFTER the inplace write
The clone() call even carries the comment # Clone logits to avoid in-place update issues in inference mode, confirming awareness of the issue, but it is placed after the problematic line.
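The failure mode is easy to reproduce in isolation. The minimal sketch below (independent of the nano-vllm code) shows that a tensor created inside torch.inference_mode() rejects in-place writes once the context has exited, and that clone() produces a normal tensor that accepts them:

```python
import torch

with torch.inference_mode():
    logits = torch.zeros(2, 4)  # an inference tensor

# Outside inference mode, any in-place write to it raises RuntimeError
try:
    logits[0] = 1.0
except RuntimeError as e:
    print("in-place write failed:", e)

# clone() outside inference mode returns a regular tensor,
# which can be updated in place without error
safe = logits.clone()
safe[0] = 1.0
print("write to clone succeeded")
```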
Fix
Move logits = logits.clone() to immediately after reset_context(), before any in-place writes:
logits = self.run_model(input_ids, positions, is_prefill)
reset_context()
logits = logits.clone() # clone before any inplace writes
if self.rank == 0:
if repetition_penalties is not None:
for i, seq in enumerate(seqs):
# repetition penalty logic is now safe
logits[i] = torch.where(token_mask, penalty_scores, logits[i]) # OK
# Remove the old misplaced clone
for i, seq in enumerate(seqs):
# logits processor ...
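Once the clone precedes the penalty block, the in-place update pattern itself is sound. As a hedged sketch of what the repetition-penalty step does (function and variable names here are illustrative, not the actual model_runner.py code), using the standard CTRL-style rule of dividing positive logits and multiplying negative ones by the penalty:

```python
import torch

def apply_repetition_penalty(logits_row, seen_token_ids, penalty=1.3):
    """Penalize logits of tokens already generated (illustrative sketch)."""
    scores = logits_row[seen_token_ids]
    # Divide positive logits, multiply negative ones, so both move
    # toward lower selection probability
    penalized = torch.where(scores > 0, scores / penalty, scores * penalty)
    # In-place write: safe only because logits_row is a cloned,
    # non-inference tensor
    logits_row[seen_token_ids] = penalized
    return logits_row

row = torch.tensor([2.6, -1.0, 0.5, 3.0])
apply_repetition_penalty(row, torch.tensor([0, 1]), penalty=1.3)
print(row)  # token 0 scaled down, token 1 pushed further negative
```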
Steps to Reproduce
- Call any generation endpoint with lm_repetition_penalty set to any value other than 1.0
- Observe a 100% failure rate
Impact
Any call with repetition penalty enabled (the recommended default to avoid token loops is 1.3) always fails, leaving the feature completely unusable.