Incorrect Decoder Compilation for DEIMv2 on neuronx #116

## Summary

The AWS Neuron compiler (`neuronx-cc`, invoked through `torch-neuronx`) produces incorrect outputs when compiling a transformer-based object detection decoder that uses a discrete attention mechanism. The compiled model's logits and bounding boxes differ vastly from PyTorch's, resulting in zero detections on real images.

## Environment

```bash
# System info
OS: Ubuntu (Linux 6.8.0-1040-aws)
Instance: inf2.xlarge (tested on CPU instance for compilation)

# Python packages (confirmed versions)
Python: 3.10.12
torch: 2.8.0
torch-neuronx: 2.8.0.2.10.16998+e9bf8a50
torch-xla: 2.8.1
neuronx-cc: 2.21.33363.0+82129205
torchvision: 0.23.0
```

## Problem Description

### Expected Behavior
When compiling a transformer decoder with discrete attention to Neuron format, the compiled model should produce outputs numerically close to PyTorch (within typical BF16 precision tolerance of ~1e-2).

### Actual Behavior
The Neuron-compiled decoder produces completely incorrect outputs:
- **Logits max difference**: 7.84 (compared to ~1e-2 expected)
- **Logits mean difference**: 1.25 (compared to ~1e-4 expected)
- **Relative difference**: 160,087% (!)
- **Detection count**: 0 detections (PyTorch: 60 detections)
- **Top-5 confidence scores**: [0.25, 0.24, 0.23, 0.21, 0.21] (PyTorch: [0.93, 0.90, 0.89, 0.84, 0.84])
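For reference, difference metrics of this kind could be computed along the following lines. This is a minimal sketch, not the code from the verification scripts: the helper names `diff_report` and `within_tolerance` are hypothetical, the inputs are assumed to be flattened lists of floats (for example from `tensor.flatten().tolist()`), and the relative-difference definition (max of `|ref - out| / |ref|`) is an assumption, since the exact formula behind the figures above is not stated.

```python
def diff_report(reference, candidate, eps=1e-8):
    """Compare two equally sized flat sequences of floats.

    `reference` is assumed to hold the PyTorch outputs and `candidate`
    the Neuron outputs. The relative-difference definition used here is
    an assumption, not necessarily the one used in the report above.
    """
    if len(reference) != len(candidate):
        raise ValueError("inputs must have the same length")
    if not reference:
        raise ValueError("inputs must not be empty")

    abs_diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    # eps guards against division by zero for reference values of 0.
    rel_diffs = [d / (abs(r) + eps) for d, r in zip(abs_diffs, reference)]
    return {
        "max_abs": max(abs_diffs),
        "mean_abs": sum(abs_diffs) / len(abs_diffs),
        "max_rel_percent": 100.0 * max(rel_diffs),
    }


def within_tolerance(report, atol=1e-2):
    """Return True if the max absolute difference is within `atol`."""
    return report["max_abs"] <= atol


if __name__ == "__main__":
    report = diff_report([1.0, 2.0, 4.0], [1.0, 2.5, 3.0])
    print(report, within_tolerance(report))
```

With the ~1e-2 tolerance from the expected behavior, a max absolute difference of 7.84 fails this check by a wide margin.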

### Components Affected
- **Backbone (HGNetv2)**: ✓ Works correctly (max diff ~0.7)
- **Encoder (HybridEncoder)**: ✓ Works correctly (max diff ~0.18)
- **Decoder (DEIMTransformer with discrete attention)**: ✗ Completely broken

## Reproduction

### 1. Model Architecture

The decoder is a DETR-style transformer decoder with:
- Multi-scale deformable attention using **discrete indexing** (not F.grid_sample)
- 3 decoder layers
- 300 object queries
- 80 classes (COCO)
- 2 feature levels

**Key detail**: The model uses `cross_attn_method: 'discrete'` which implements deformable attention via discrete tensor indexing instead of `F.grid_sample()` (which is not supported by Neuron).
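To illustrate the distinction, here is a minimal pure-Python sketch of sampling one feature-map location by discrete indexing versus bilinear interpolation. It is not the DEIMv2 implementation: the function names, the nearest-pixel rounding rule, the `x * W - 0.5` pixel mapping, and the border clamping are all assumptions chosen for illustration.

```python
import math


def _clamp(value, low, high):
    return min(max(value, low), high)


def sample_discrete(feature, x_norm, y_norm):
    """Pick the nearest pixel by integer indexing (assumed rounding rule).

    `feature` is a 2D list [H][W]; coordinates are normalized to [0, 1].
    """
    h, w = len(feature), len(feature[0])
    # Map normalized coordinates to pixel centers (assumed convention).
    px = x_norm * w - 0.5
    py = y_norm * h - 0.5
    ix = _clamp(int(math.floor(px + 0.5)), 0, w - 1)
    iy = _clamp(int(math.floor(py + 0.5)), 0, h - 1)
    return feature[iy][ix]


def sample_bilinear(feature, x_norm, y_norm):
    """Interpolate between the four neighboring pixels (border clamped)."""
    h, w = len(feature), len(feature[0])
    px = x_norm * w - 0.5
    py = y_norm * h - 0.5
    x0, y0 = int(math.floor(px)), int(math.floor(py))
    wx, wy = px - x0, py - y0
    x0c, x1c = _clamp(x0, 0, w - 1), _clamp(x0 + 1, 0, w - 1)
    y0c, y1c = _clamp(y0, 0, h - 1), _clamp(y0 + 1, 0, h - 1)
    top = (1 - wx) * feature[y0c][x0c] + wx * feature[y0c][x1c]
    bottom = (1 - wx) * feature[y1c][x0c] + wx * feature[y1c][x1c]
    return (1 - wy) * top + wy * bottom


if __name__ == "__main__":
    fmap = [[0.0, 1.0], [2.0, 3.0]]
    print(sample_discrete(fmap, 0.5, 0.5), sample_bilinear(fmap, 0.5, 0.5))
```

The discrete variant reduces each sampling location to a single integer gather. In a tensor implementation that would typically become an index or gather operation rather than an interpolation kernel.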


### 3. Full Reproduction

Complete reproduction code available at: `/home/ubuntu/sh_deimv2/`

Key files:
- `convert_components_neuronx.py`: Conversion script ([View on GitHub Gist](https://gist.github.com/cricket1/65c14f61644d312532172f1e9939b252))
- `verify_neuronx_on_image.py`: Verification on real image ([View on GitHub Gist](https://gist.github.com/cricket1/5aa2c3fb2611e2cde983262ec8f827d1))
- `compare_all_methods.py`: Comprehensive comparison ([View on GitHub Gist](https://gist.github.com/cricket1/e03f92661eb1b2773a954ade95046b70))

To reproduce:
```bash
# Convert the model components to Neuron (--component all)
python convert_components_neuronx.py \
    --checkpoint models/deimv2_hgnetv2_n_coco.pth \
    --config configs/deimv2/deimv2_hgnetv2_n_coco.yml \
    --component all

# Verify (shows the bug)
python verify_neuronx_on_image.py --image example.jpg
```

## Investigation Results

### What We've Tried

1. ✗ **Different compiler optimizations**: Tested `-O0`, `-O1`, `-O2`
2. ✗ **FP32 precision**: `--fp32-cast=all`, `--fp32-cast=matmult`
3. ✗ **Disabled auto-casting**: `--auto-cast=none`
4. ✗ **Model type hint**: `--model-type=transformer`
5. ✗ **Various combinations**: See `try_neuronx_compiler_settings.py`

None of the above produced correct outputs.
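A sweep over the settings listed above could be organized roughly as follows. This is a hedged sketch and does not reproduce `try_neuronx_compiler_settings.py`: the names `compiler_arg_sets`, `sweep`, and `compile_and_measure` are hypothetical, and the flag strings are copied verbatim from the list above without checking which ones the installed compiler accepts. The `compile_and_measure` callable is assumed to compile the decoder with the given flags (for example by passing them as `compiler_args` to `torch_neuronx.trace`) and return the max absolute logit difference against PyTorch.

```python
from itertools import product

# Flag values taken from the list above; None means "flag omitted".
OPT_LEVELS = ["-O0", "-O1", "-O2"]
PRECISION_FLAGS = [None, "--fp32-cast=all", "--fp32-cast=matmult", "--auto-cast=none"]
MODEL_TYPE_FLAGS = [None, "--model-type=transformer"]


def compiler_arg_sets():
    """Return every combination of the flags above as a list of strings."""
    arg_sets = []
    for combo in product(OPT_LEVELS, PRECISION_FLAGS, MODEL_TYPE_FLAGS):
        arg_sets.append([flag for flag in combo if flag is not None])
    return arg_sets


def sweep(compile_and_measure, tolerance=1e-2):
    """Run a user-supplied callable for each flag set.

    `compile_and_measure(args)` is a hypothetical callable that compiles
    with `args` and returns the max absolute difference to PyTorch.
    Returns a list of (args, max_diff, passed) tuples.
    """
    results = []
    for args in compiler_arg_sets():
        max_diff = compile_and_measure(args)
        results.append((args, max_diff, max_diff <= tolerance))
    return results


if __name__ == "__main__":
    # Stand-in callable for demonstration only; no compilation happens here.
    for args, max_diff, passed in sweep(lambda args: 7.84):
        print(" ".join(args), max_diff, "OK" if passed else "FAIL")
```

Recording the result per flag set makes it easy to show that no combination brings the difference within tolerance.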

### Key Observations

1. **Backbone and encoder compile correctly** - The issue is isolated to the decoder
2. **PyTorch discrete attention works perfectly** - The implementation is correct in PyTorch
3. **Error is not precision-related** - The max logit difference (7.84) is nearly 800x the expected BF16 tolerance (~1e-2)
4. **Consistent failure** - Fails on both random inputs and real images

