
Fork safety #7918

Description

@Flamefire

It looks like deepspeed is not fork-safe:

  • During import it checks builder.is_compatible() for all ops
  • Some ops use torch.cuda.get_device_properties
  • This initializes the CUDA context

See

compatible_ops[op_name] = op_compatible
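
One way to check the claim above is to ask PyTorch whether a CUDA context exists before and after the import. The following is only a sketch: the helper name cuda_initialized_by_import is made up for illustration, and it relies on torch.cuda.is_initialized() to report context state. It returns None when PyTorch is missing or CUDA was already initialized, since in that case the import cannot be blamed.

```python
import importlib


def cuda_initialized_by_import(module_name):
    """Report whether importing `module_name` initializes CUDA.

    Hypothetical helper, not part of deepspeed or PyTorch.
    Returns True/False, or None if the question cannot be answered
    (PyTorch not installed, or CUDA already initialized beforehand).
    """
    try:
        import torch
    except ImportError:
        return None
    if torch.cuda.is_initialized():
        # Something else already created the context.
        return None
    # Note: if the module was imported earlier, this is a no-op.
    importlib.import_module(module_name)
    return torch.cuda.is_initialized()


# Intended use in a fresh interpreter on a CUDA machine:
#   print(cuda_initialized_by_import("deepspeed"))
```

If this prints True in a fresh interpreter, the import alone is enough to make later forks unsafe for CUDA use.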

When the process then forks to run in parallel, any access to CUDA through PyTorch in the child fails:

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
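
The failure mode can be illustrated without a GPU. The sketch below is a stand-in, not PyTorch's actual implementation: FakeDeviceRuntime and child_can_query are hypothetical names, and the at-fork hook only mimics the idea that a child forked after initialization is marked as unusable. It assumes a Unix system where os.fork is available.

```python
import os


class FakeDeviceRuntime:
    """Stand-in for a runtime whose context cannot survive fork()."""

    def __init__(self):
        self.initialized = False
        self.in_bad_fork = False
        # In the child, flag the runtime as unusable if the parent
        # had already created a context before forking.
        os.register_at_fork(after_in_child=self._after_fork_in_child)

    def _after_fork_in_child(self):
        if self.initialized:
            self.in_bad_fork = True

    def get_device_properties(self):
        if self.in_bad_fork:
            raise RuntimeError("Cannot re-initialize in forked subprocess")
        self.initialized = True  # the first query creates the context
        return {"major": 8}


def child_can_query(runtime):
    """Fork, query the runtime in the child, report whether it worked."""
    pid = os.fork()
    if pid == 0:  # child process
        try:
            runtime.get_device_properties()
            os._exit(0)
        except RuntimeError:
            os._exit(1)
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


fresh = FakeDeviceRuntime()
print(child_can_query(fresh))   # parent never queried before forking

probed = FakeDeviceRuntime()
probed.get_device_properties()  # plays the role of the import-time check
print(child_can_query(probed))  # child inherits an initialized runtime
```

The only difference between the two cases is whether the parent queried the device before forking, which is the role the import-time is_compatible() checks play here.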

I'm not sure why this isn't fully consistent, but I can reproduce it with:
pytest --forked tests/unit/ops/deepspeed4science/test_DS4Sci_EvoformerAttention.py -k 'test_DS4Sci_EvoformerAttention[tensor_shape1-dtype1]' -s

It fails when invoking skip_on_arch(8 if dtype == torch.bfloat16 else 7), which calls torch.cuda.get_device_properties, this time in the forked subprocess.

This is an issue in general: fork-based multiprocessing cannot be used after importing deepspeed.

But it also contradicts the documentation:

Note that pytest-forked and the --forked flag are required to test CUDA functionality in distributed tests.

It seems the opposite is true: The flag must not be used.

Or am I missing anything?
