Incorrect Decoder Compilation for DEIMv2 on neuronx #116

@cricket1

Summary

The AWS Neuron compiler (neuronx-cc, invoked through torch-neuronx) produces incorrect outputs when compiling a transformer-based object detection decoder that uses a discrete attention mechanism. The compiled model's logits and bounding boxes differ vastly from the PyTorch reference, resulting in zero detections on real images.

Environment

# System info
OS: Ubuntu (Linux 6.8.0-1040-aws)
Instance: inf2.xlarge (tested on CPU instance for compilation)

# Python packages (confirmed versions)
Python: 3.10.12
torch: 2.8.0
torch-neuronx: 2.8.0.2.10.16998+e9bf8a50
torch-xla: 2.8.1
neuronx-cc: 2.21.33363.0+82129205
torchvision: 0.23.0

Problem Description

Expected Behavior

When compiling a transformer decoder with discrete attention to Neuron format, the compiled model should produce outputs numerically close to PyTorch (within typical BF16 precision tolerance of ~1e-2).

Actual Behavior

The Neuron-compiled decoder produces completely incorrect outputs:

  • Logits max difference: 7.84 (compared to ~1e-2 expected)
  • Logits mean difference: 1.25 (compared to ~1e-4 expected)
  • Relative difference: 160,087% (!)
  • Detection count: 0 detections (PyTorch: 60 detections)
  • Top-5 confidence scores: [0.25, 0.24, 0.23, 0.21, 0.21] (PyTorch: [0.93, 0.90, 0.89, 0.84, 0.84])
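For anyone who wants to check their own build against these numbers, here is a minimal sketch of the three difference metrics. It is not the code from verify_neuronx_on_image.py: the function names are hypothetical, the outputs are assumed to be flattened to plain Python floats (e.g. via `tensor.flatten().tolist()`), and the relative-difference definition (mean absolute difference over mean reference magnitude) is an assumption, since the report does not say how its percentage was computed.

```python
def compare_outputs(reference, candidate, eps=1e-12):
    """Compare two flat sequences of floats (e.g. PyTorch vs. Neuron logits)."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    if not reference:
        raise ValueError("outputs must not be empty")

    abs_diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    max_diff = max(abs_diffs)
    mean_diff = sum(abs_diffs) / len(abs_diffs)

    # Assumed definition: mean absolute difference relative to the mean
    # magnitude of the reference. eps guards against an all-zero reference.
    mean_ref = sum(abs(r) for r in reference) / len(reference)
    rel_pct = 100.0 * mean_diff / max(mean_ref, eps)

    return {
        "max_abs_diff": max_diff,
        "mean_abs_diff": mean_diff,
        "relative_diff_pct": rel_pct,
    }


def within_tolerance(metrics, atol=1e-2):
    """True if the worst-case element stays inside the BF16-style tolerance."""
    return metrics["max_abs_diff"] <= atol
```

A ratio-style relative metric like this one becomes very large whenever the reference values are close to zero, so the max and mean absolute differences are the more robust numbers to compare.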

Components Affected

  • Backbone (HGNetv2): ✓ Works correctly (max diff ~0.7)
  • Encoder (HybridEncoder): ✓ Works correctly (max diff ~0.18)
  • Decoder (DEIMTransformer with discrete attention): ✗ Completely broken
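One way to obtain per-component numbers like these is to feed every compiled stage the PyTorch reference input for that stage, so that upstream error cannot leak into the decoder measurement. The sketch below illustrates the idea with plain callables on lists of floats; `isolate_components` and its arguments are hypothetical names, not part of the repository.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two flat sequences."""
    return max(abs(x - y) for x, y in zip(a, b))


def isolate_components(reference_stages, compiled_stages, x):
    """Measure each compiled stage against its reference on identical input.

    Both arguments are ordered lists of (name, callable) pairs, e.g.
    backbone, encoder, decoder. Returns {name: max abs diff}.
    """
    report = {}
    stage_input = x
    for (name, ref_fn), (_, test_fn) in zip(reference_stages, compiled_stages):
        ref_out = ref_fn(stage_input)
        # The compiled stage sees the reference input, not the output of the
        # previous compiled stage, so its error is measured in isolation.
        test_out = test_fn(stage_input)
        report[name] = max_abs_diff(ref_out, test_out)
        # Only the reference output is propagated to the next stage.
        stage_input = ref_out
    return report
```

With this setup a large value for one stage points at that stage alone rather than at accumulated drift from earlier stages.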

Reproduction

1. Model Architecture

The decoder is a DETR-style transformer decoder with:

  • Multi-scale deformable attention using discrete indexing (not F.grid_sample)
  • 3 decoder layers
  • 300 object queries
  • 80 classes (COCO)
  • 2 feature levels

Key detail: The model uses cross_attn_method: 'discrete' which implements deformable attention via discrete tensor indexing instead of F.grid_sample() (which is not supported by Neuron).
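To make the distinction concrete, here is a framework-free sketch of what "discrete" sampling means: each normalized sampling location is mapped to a single integer pixel and the value is read by indexing, with no bilinear interpolation. This is an illustration only; the rounding and clamping rule (floor, then clamp to the border) and the function names are assumptions and may not match the DEIMv2 implementation.

```python
import math


def discrete_sample(feature_map, x_norm, y_norm):
    """Nearest-pixel lookup on a 2-D feature map given as a list of rows.

    x_norm and y_norm are normalized to [0, 1]; out-of-range locations are
    clamped to the border (an assumption made for this sketch).
    """
    height = len(feature_map)
    width = len(feature_map[0])
    # floor(x * size) selects the pixel whose cell contains the location.
    col = min(max(math.floor(x_norm * width), 0), width - 1)
    row = min(max(math.floor(y_norm * height), 0), height - 1)
    return feature_map[row][col]


def deformable_attend(feature_map, points, weights):
    """Attention-weighted sum of discretely sampled values for one query."""
    return sum(
        w * discrete_sample(feature_map, x, y)
        for (x, y), w in zip(points, weights)
    )
```

In a tensor implementation the same lookup becomes an integer gather on the feature tensor, which is the kind of indexing operation the compiled decoder has to reproduce exactly.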

3. Full Reproduction

Complete reproduction code available at: /home/ubuntu/sh_deimv2/

Key files:

To reproduce:

# Convert decoder to Neuron
python convert_components_neuronx.py \
    --checkpoint models/deimv2_hgnetv2_n_coco.pth \
    --config configs/deimv2/deimv2_hgnetv2_n_coco.yml \
    --component all

# Verify (shows the bug)
python verify_neuronx_on_image.py --image example.jpg

Investigation Results

What We've Tried

  1. Different compiler optimizations: Tested -O0, -O1, -O2
  2. FP32 precision: --fp32-cast=all, --fp32-cast=matmult
  3. Disabled auto-casting: --auto-cast=none
  4. Model type hint: --model-type=transformer
  5. Various combinations: See try_neuronx_compiler_settings.py

None of the above produced correct outputs.
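For reference, a sweep over the settings listed above can be enumerated as follows. This is a hypothetical sketch, not the contents of try_neuronx_compiler_settings.py: the flag strings are copied verbatim from the list and are not validated against the installed neuronx-cc version, and the grouping into three axes is an assumption. Each resulting list could be passed as the `compiler_args` argument of `torch_neuronx.trace`.

```python
from itertools import product

# Flag spellings copied from the list above; not checked against neuronx-cc.
OPT_LEVELS = ["-O0", "-O1", "-O2"]
CAST_FLAGS = ["--fp32-cast=all", "--fp32-cast=matmult", "--auto-cast=none"]
MODEL_TYPE_FLAGS = [None, "--model-type=transformer"]  # None = flag omitted


def build_compiler_arg_sets():
    """Return every combination of the three axes as a list of flag lists."""
    arg_sets = []
    for opt, cast, model_type in product(OPT_LEVELS, CAST_FLAGS, MODEL_TYPE_FLAGS):
        args = [opt, cast]
        if model_type is not None:
            args.append(model_type)
        arg_sets.append(args)
    return arg_sets
```

Enumerating the combinations explicitly makes it easy to log which exact flag set produced which difference metrics.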

Key Observations

  1. Backbone and encoder compile correctly - The issue is isolated to the decoder
  2. PyTorch discrete attention works perfectly - The implementation is correct in PyTorch
  3. Error is not precision-related - The max logit difference (7.84) is roughly 800x the expected ~1e-2 BF16 tolerance
  4. Consistent failure - Fails on both random inputs and real images

Labels: Inf2, bug, compilation
