Migrate OMP use to Aten/Parallel.h (#5947)

Closed
q10 wants to merge 1 commit into pytorch:main from q10:export-D109460231

@q10 q10 commented Jun 23, 2026

Summary:

X-link: https://github.com/facebookresearch/FBGEMM/pull/2862

This is the only direct use of `omp` in the library. I believe that `ATen/Parallel.h` uses `omp` under the hood anyway when `pytorch` is built with `AT_PARALLEL_OPENMP` set to 1, so migrating to it lets `fbgemm_gpu` users set their threading model consistently based on their PyTorch build.
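
For readers unfamiliar with the pattern, here is a minimal sketch of what moving a loop from `#pragma omp parallel for` to the `at::parallel_for` call shape can look like. It is not the actual change to `csr2csc_template_`. `chunked_for_sketch` and `double_all` are hypothetical names invented for illustration; the stand-in runs its chunks sequentially so the example compiles without PyTorch, and it only mirrors the `(begin, end, grain_size, f)` argument order and the `f(chunk_begin, chunk_end)` callback of `at::parallel_for`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in that mirrors the call shape of at::parallel_for:
// f is invoked once per chunk as f(chunk_begin, chunk_end).
// Assumption: chunks run sequentially here to avoid any threading dependency;
// this is NOT the ATen implementation and does no real parallel work.
template <typename F>
void chunked_for_sketch(int64_t begin, int64_t end, int64_t grain_size, const F& f) {
  assert(grain_size >= 0);
  const int64_t step = std::max<int64_t>(grain_size, 1);
  for (int64_t b = begin; b < end; b += step) {
    f(b, std::min(end, b + step));
  }
}

// OpenMP-style original, shown only as a comment:
//   #pragma omp parallel for
//   for (int64_t i = 0; i < n; ++i) out[i] = 2 * in[i];
//
// Chunked-callback version: the loop body moves inside a lambda that
// iterates over its own [chunk_begin, chunk_end) sub-range.
std::vector<int64_t> double_all(const std::vector<int64_t>& in) {
  std::vector<int64_t> out(in.size());
  const int64_t n = static_cast<int64_t>(in.size());
  chunked_for_sketch(0, n, /*grain_size=*/2,
      [&](int64_t chunk_begin, int64_t chunk_end) {
        for (int64_t i = chunk_begin; i < chunk_end; ++i) {
          out[i] = 2 * in[i];  // each index is written by exactly one chunk
        }
      });
  return out;
}
```

With PyTorch available, the same lambda could be passed to `at::parallel_for(0, n, grain_size, ...)` after including `ATen/Parallel.h`, and the backend selected at PyTorch build time would then decide how the chunks are scheduled.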

Test Plan:
Ran the CPU unit test that directly exercises the rewritten `csr2csc_template_` (the function this diff migrates from `#pragma omp` to `at::parallel_for`), covering both `index_t` instantiations:

buck2 test fbcode//mode/opt fbcode//deeplearning/fbgemm/fbgemm_gpu/test/tbe:cpu_kernel

Result: Pass 2. Fail 0. (CpuKernelTest.csr2csc_test_int32, CpuKernelTest.csr2csc_test_int64)
Test session: https://www.internalfb.com/intern/testinfra/testrun/26458647838875303

Higher-level TBE suites that exercise csr2csc indirectly:

buck2 test fbcode//mode/opt \
  fbcode//deeplearning/fbgemm/fbgemm_gpu/test/tbe:forward \
  fbcode//deeplearning/fbgemm/fbgemm_gpu/test/tbe:backward_sgd

Not exercised locally: the Python test runtime fails with `cannot execute binary file: Exec format error` (signal 126) on this aarch64 devserver, including the CPU-only cases (e.g. `test_forward_cpu_fp32`, `test_backward_sgd_fp32_pmNONE_cpu`). This is a host arch/toolchain limitation, not a code issue. Run these on an x86_64 GPU host or via Sandcastle CI.

Differential Revision: D109460231

Pulled By: q10

@meta-cla meta-cla Bot added the cla signed label Jun 23, 2026
meta-codesync Bot commented Jun 23, 2026

@q10 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D109460231.

@meta-codesync meta-codesync Bot changed the title from "Migrate OMP use to Aten/Parallel.h (#5943)" to "Migrate OMP use to Aten/Parallel.h (#5947)" Jun 23, 2026
@q10 q10 force-pushed the export-D109460231 branch from 6a3a3bd to 8c963a9 Compare June 23, 2026 23:20
q10 pushed a commit to q10/FBGEMM that referenced this pull request Jun 23, 2026
@q10 q10 force-pushed the export-D109460231 branch from 8c963a9 to 08afaaf Compare June 23, 2026 23:27
@meta-codesync meta-codesync Bot closed this in a632f25 Jun 24, 2026
@meta-codesync meta-codesync Bot added the Merged label Jun 24, 2026
meta-codesync Bot commented Jun 24, 2026

@q10 merged this pull request in a632f25.
