[Bug] ALLREDUCE assert failed #682

Description

@zhangandy0727-jpg

torchrun --nproc_per_node=8 --master_port=29503 test/torch/correctness_test.py --collective allreduce --nelem 10556587 --dtype float

github.com/microsoft/mscclpp/apps/nccl/src/allreduce.hpp:679: void allreduce11(const void *, void *, void *, mscclpp::BaseMemoryChannelDeviceHandle *, mscclpp::SwitchChannelDeviceHandle *, unsigned long, unsigned long, int, int) [with T = float]: block: [31,0,0], thread: [943,0,0] Assertion sizePerRank % alignment == 0 failed.

Is there any plan to provide a fallback version of allreduce for sizes that do not satisfy this alignment requirement?
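For reference, here is a rough sketch of why this element count may trip the assertion and how a caller could pad the buffer as a workaround. It assumes that `sizePerRank` is the total byte size divided by the number of ranks and that `alignment` is 16 bytes; both are guesses on my part and not taken from the mscclpp source. The helper names `is_aligned` and `padded_nelem` are hypothetical.

```python
import math


def is_aligned(nelem: int, elem_size: int, nranks: int, alignment: int) -> bool:
    """Check whether each rank's byte share is a whole multiple of `alignment`.

    Assumption: sizePerRank = (nelem * elem_size) / nranks, as guessed above.
    """
    total_bytes = nelem * elem_size
    if total_bytes % nranks != 0:
        return False
    return (total_bytes // nranks) % alignment == 0


def padded_nelem(nelem: int, elem_size: int, nranks: int, alignment: int) -> int:
    """Round `nelem` up to the next count for which is_aligned() holds."""
    # The total byte size must be a multiple of nranks * alignment and of elem_size.
    unit_bytes = math.lcm(nranks * alignment, elem_size)
    unit_elems = unit_bytes // elem_size
    return math.ceil(nelem / unit_elems) * unit_elems


if __name__ == "__main__":
    nelem, elem_size, nranks = 10556587, 4, 8  # values from the command above (float = 4 bytes)
    alignment = 16  # assumed alignment in bytes
    print(is_aligned(nelem, elem_size, nranks, alignment))
    print(padded_nelem(nelem, elem_size, nranks, alignment))
```

Under these assumptions, the caller would allocate the padded element count, zero-fill the tail for a sum reduction, and ignore the extra elements in the result. A built-in fallback would still be preferable.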
