Bug Report
Description
Generating music with lm_repetition_penalty != 1.0 always fails with:
RuntimeError: Inplace update to inference tensor outside InferenceMode is not allowed.
You can make a clone to get a normal tensor before doing inplace update.
Root Cause
In acestep/third_parts/nano-vllm/nanovllm/engine/model_runner.py, self.run_model() returns a tensor created inside torch.inference_mode(). The code then performs an in-place assignment on a slice of that tensor before cloning it:
logits = self.run_model(input_ids, positions, is_prefill) # inference tensor
reset_context()
# ... inside the repetition penalty block:
logits[i] = torch.where(token_mask, penalty_scores, logits[i]) # CRASH: in-place write to inference tensor
# ... only later:
logits = logits.clone() # too late - clone comes AFTER the inplace write
The clone() call even carries the comment # Clone logits to avoid in-place update issues in inference mode, confirming awareness of the issue, but it is placed after the problematic line.
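The failure mode is easy to reproduce in isolation. The minimal sketch below (independent of the nano-vllm code) shows that a tensor created inside torch.inference_mode() rejects in-place writes once the context has exited, and that clone() produces a normal tensor that accepts them:

```python
import torch

with torch.inference_mode():
    logits = torch.zeros(2, 4)  # an inference tensor

# Outside inference mode, any in-place write to it raises RuntimeError
try:
    logits[0] = 1.0
except RuntimeError as e:
    print("in-place write failed:", e)

# clone() outside inference mode returns a regular tensor,
# which can be updated in place without error
safe = logits.clone()
safe[0] = 1.0
print("write to clone succeeded")
```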
Fix
Move logits = logits.clone() to immediately after reset_context(), before any in-place writes:
logits = self.run_model(input_ids, positions, is_prefill)
reset_context()
logits = logits.clone() # clone before any inplace writes
if self.rank == 0:
if repetition_penalties is not None:
for i, seq in enumerate(seqs):
# repetition penalty logic is now safe
logits[i] = torch.where(token_mask, penalty_scores, logits[i]) # OK
# Remove the old misplaced clone
for i, seq in enumerate(seqs):
# logits processor ...
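Once the clone precedes the penalty block, the in-place update pattern itself is sound. As a hedged sketch of what the repetition-penalty step does (function and variable names here are illustrative, not the actual model_runner.py code), using the standard CTRL-style rule of dividing positive logits and multiplying negative ones by the penalty:

```python
import torch

def apply_repetition_penalty(logits_row, seen_token_ids, penalty=1.3):
    """Penalize logits of tokens already generated (illustrative sketch)."""
    scores = logits_row[seen_token_ids]
    # Divide positive logits, multiply negative ones, so both move
    # toward lower selection probability
    penalized = torch.where(scores > 0, scores / penalty, scores * penalty)
    # In-place write: safe only because logits_row is a cloned,
    # non-inference tensor
    logits_row[seen_token_ids] = penalized
    return logits_row

row = torch.tensor([2.6, -1.0, 0.5, 3.0])
apply_repetition_penalty(row, torch.tensor([0, 1]), penalty=1.3)
print(row)  # token 0 scaled down, token 1 pushed further negative
```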
Steps to Reproduce
- Call any generation endpoint with lm_repetition_penalty set to any value other than 1.0
- Observe a 100% failure rate
Impact
Any call with repetition penalty enabled (the recommended default to avoid token loops is 1.3) always fails, leaving the feature completely unusable.