Add multithreading to table lookup (#5849)
Open
ShuyangLiu wants to merge 1 commit into
@ShuyangLiu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D102867249.
21bca1a to
b863b50
ShuyangLiu
pushed a commit
to ShuyangLiu/FBGEMM-1
that referenced
this pull request
Jun 8, 2026
Summary:

X-link: facebookresearch/FBGEMM#2767

## What

Parallelizes the per-table loop in the CPU TBE forward kernel (`IntNBitTableBatchedEmbeddingBagsCodegen::forward`) across tables. The per-table loop is embarrassingly parallel: each table reads its own weight slice and writes a disjoint slice of the output, so fanning tables out across threads speeds up table-heavy inference models (1.4x to 1.7x at 2 threads and up to 2.9x at 8 threads in the benchmark below).

Gated by the `TBE_TABLE_THREADS` env var:

- `TBE_TABLE_THREADS=1` (default): unchanged sequential behavior.
- `TBE_TABLE_THREADS=N>1`: tables are distributed over N OpenMP threads with dynamic scheduling (good load balancing when table sizes are skewed).

## Default behavior is unchanged

When `TBE_TABLE_THREADS<=1` the helper takes an early return and runs the loop body sequentially with no try/catch wrapper and no thread-local-state guard, so the default path is functionally identical to the pre-change code: same iteration order, the DEVICE-placement `TORCH_CHECK` in its original per-table position, same error semantics, and the same generated machine code for the body. The only always-on changes are mechanical and behavior-preserving (loop-local `weights` pointer, int64 loop index).

## Design

- Raw OpenMP (`#pragma omp parallel` + `omp for`, `schedule(dynamic)`) rather than `at::parallel_for`, so TBE gets its own thread count independent of the global intra-op pool / `OMP_NUM_THREADS` (predictors run with `OMP_NUM_THREADS=1`).
- Thread count is read once from the env var and cached (thread-safe static init), and clamped to the number of tables.

## Correctness

- Removed the function-scoped `weights_acc` pointer, which every iteration overwrote; that is a data race once the loop is parallel. Replaced with a loop-local pointer (identical pointer value). Every other variable in the loop body is already loop-local, and each table writes a disjoint output slice (`output_acc + D_start`), so results are bitwise-identical to the sequential path.
- The per-table DEVICE-placement `TORCH_CHECK` stays in its original position. In the threaded path it is handled like any other throw from the loop body (kernel errors, `at::arange` checks): the exception is captured (first one wins) and rethrown after the join, so no exception escapes the OpenMP region.
- Worker threads restore the caller's `at::ThreadLocalState` (dispatch keys, grad/inference mode, autocast, ...), so ATen calls inside the loop (e.g. `at::arange` in the nobag path) run with the correct thread-local context.

## Verification

- Builds clean (mode/opt). Confirmed OpenMP is actually enabled for this target by inspecting the compiled object: `gen_*_codegen_cpu.cpp.pic.o` references `__kmpc_fork_call` / `omp_get_num_threads` (the pragma is NOT a no-op, even though no `-fopenmp` appears in the TARGETS; the fbcode default toolchain supplies it).
- nbit_forward CPU unit tests pass with `TBE_TABLE_THREADS=4`, including `test_nbit_forward_cpu_with_table_sharing` (non-monotonic `weights_offsets`).

## Benchmark (INT4, B=512, D=128, E=100K, L=20, FP16/SUM, iters=100, 3-run avg)

Speedup is relative to the 1-thread row with the same table count; efficiency is speedup divided by thread count.

| Tables | Threads | Avg (µs) | BW (GB/s) | Speedup | Efficiency |
| --- | --- | --- | --- | --- | --- |
| 8 | 1 | 1,436 | 4.38 | --- | --- |
| 8 | 2 | 1,042 | 6.03 | 1.38x | 69% |
| 8 | 4 | 844 | 7.46 | 1.70x | 43% |
| 8 | 8 | 773 | 8.14 | 1.86x | 23% |
| 32 | 1 | 4,797 | 5.25 | --- | --- |
| 32 | 2 | 2,830 | 8.90 | 1.69x | 85% |
| 32 | 4 | 2,003 | 12.60 | 2.40x | 60% |
| 32 | 8 | 1,633 | 15.49 | 2.94x | 37% |
| 64 | 1 | 10,132 | 4.97 | --- | --- |
| 64 | 2 | 6,767 | 7.44 | 1.50x | 75% |
| 64 | 4 | 4,864 | 10.35 | 2.08x | 52% |
| 64 | 8 | 4,033 | 12.50 | 2.51x | 31% |

2 threads is the efficiency sweet spot (69-85%); efficiency falls off at higher counts due to fixed fork/join overhead per call. This matches the production recommendation `TBE_TABLE_THREADS=2` (+7.3% QPS measured in ICE). Microbenchmark kernel speedups overpredict end-to-end QPS (Amdahl: CPU TBE is a small fraction of total inference latency).

Reviewed By: q10

Differential Revision: D102867249
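The Design and Correctness points above (env-var thread count cached once, clamp to the number of tables, sequential early return, dynamic-schedule OpenMP loop, first-exception-wins capture and rethrow after the join) can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `parse_thread_count`, `table_threads_from_env`, and `for_each_table` are hypothetical names, and the `at::ThreadLocalState` restore is omitted because it needs ATen. If the file is compiled without OpenMP the pragmas are ignored and the loop simply runs sequentially.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <exception>
#include <limits>

// Hypothetical helper: parse a thread-count string. A missing, non-numeric,
// or < 1 value maps to 1, i.e. the sequential default.
inline int parse_thread_count(const char* s) {
  if (s == nullptr) {
    return 1;
  }
  char* end = nullptr;
  const long v = std::strtol(s, &end, 10);
  if (end == s || v < 1) {
    return 1;
  }
  const long cap = std::numeric_limits<int>::max();
  return static_cast<int>(std::min(v, cap));
}

// Read TBE_TABLE_THREADS once; a function-local static is initialized
// exactly once and in a thread-safe way since C++11.
inline int table_threads_from_env() {
  static const int cached =
      parse_thread_count(std::getenv("TBE_TABLE_THREADS"));
  return cached;
}

// Hypothetical helper: run body(t) for every table index t in [0, num_tables).
template <typename Body>
void for_each_table(int64_t num_tables, int requested_threads, Body&& body) {
  assert(num_tables >= 0);
  // Never spawn more threads than there are tables.
  const int threads = static_cast<int>(
      std::min<int64_t>(requested_threads, num_tables));

  // Default path: plain sequential loop, no try/catch wrapper.
  if (threads <= 1) {
    for (int64_t t = 0; t < num_tables; ++t) {
      body(t);
    }
    return;
  }

  // An exception must not escape an OpenMP region, so each iteration
  // catches locally and the first captured one is rethrown after the join.
  std::exception_ptr first_error;
#pragma omp parallel for schedule(dynamic) num_threads(threads)
  for (int64_t t = 0; t < num_tables; ++t) {
    try {
      body(t);
    } catch (...) {
#pragma omp critical(tbe_sketch_first_error)
      {
        if (!first_error) {
          first_error = std::current_exception();
        }
      }
    }
  }
  if (first_error) {
    std::rethrow_exception(first_error);
  }
}
```

A caller would use it as `for_each_table(T, table_threads_from_env(), [&](int64_t t) { /* per-table work writing a disjoint output slice */ });`. Passing an explicit `num_threads` clause is what keeps the team size independent of `OMP_NUM_THREADS` in this sketch.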
ShuyangLiu
pushed a commit
to ShuyangLiu/FBGEMM-1
that referenced
this pull request
Jun 8, 2026
93771b1 to
6cac33b
ShuyangLiu
pushed a commit
to ShuyangLiu/FBGEMM-1
that referenced
this pull request
Jun 10, 2026
ShuyangLiu
pushed a commit
to ShuyangLiu/FBGEMM-1
that referenced
this pull request
Jun 10, 2026
6cac33b to
fbd093a
ShuyangLiu
pushed a commit
to ShuyangLiu/FBGEMM-1
that referenced
this pull request
Jun 12, 2026
Summary: same commit message as in the first push above, with the reviewer line updated. Reviewed By: helloguo, q10 Differential Revision: D102867249
c028af0 to
6b10e0f
ShuyangLiu
pushed a commit
to ShuyangLiu/FBGEMM-1
that referenced
this pull request
Jun 12, 2026
6b10e0f to
00dd8d3
ShuyangLiu
pushed a commit
to ShuyangLiu/FBGEMM-1
that referenced
this pull request
Jun 15, 2026
ShuyangLiu
pushed a commit
to ShuyangLiu/FBGEMM-1
that referenced
this pull request
Jun 15, 2026
a18e209 to d668d4e
ShuyangLiu
pushed a commit
to ShuyangLiu/FBGEMM-1
that referenced
this pull request
Jun 17, 2026
ShuyangLiu
pushed a commit
to ShuyangLiu/FBGEMM-1
that referenced
this pull request
Jun 17, 2026
e1a5014 to af6ea34
ShuyangLiu
pushed a commit
to ShuyangLiu/FBGEMM-1
that referenced
this pull request
Jun 19, 2026
ShuyangLiu
pushed a commit
to ShuyangLiu/FBGEMM-1
that referenced
this pull request
Jun 19, 2026
b67a2f3 to bf90b49
ShuyangLiu
pushed a commit
to ShuyangLiu/FBGEMM-1
that referenced
this pull request
Jun 22, 2026
ShuyangLiu
pushed a commit
to ShuyangLiu/FBGEMM-1
that referenced
this pull request
Jun 22, 2026
bf90b49 to 02c92f4
ShuyangLiu
pushed a commit
to ShuyangLiu/FBGEMM-1
that referenced
this pull request
Jun 23, 2026
02c92f4 to 4aaaa93
ShuyangLiu
pushed a commit
to ShuyangLiu/FBGEMM-1
that referenced
this pull request
Jun 23, 2026
4aaaa93 to
ecafadc
Compare
ShuyangLiu
pushed a commit
to ShuyangLiu/FBGEMM-1
that referenced
this pull request
Jun 23, 2026
ecafadc to
efff636
Compare
ShuyangLiu
pushed a commit
to ShuyangLiu/FBGEMM-1
that referenced
this pull request
Jun 23, 2026
340e16a to
772a375
Compare
ShuyangLiu
pushed a commit
to ShuyangLiu/FBGEMM-1
that referenced
this pull request
Jun 24, 2026
ShuyangLiu
pushed a commit
to ShuyangLiu/FBGEMM-1
that referenced
this pull request
Jun 24, 2026
772a375 to
769e134
Compare
ShuyangLiu
pushed a commit
to ShuyangLiu/FBGEMM-1
that referenced
this pull request
Jun 24, 2026
337bcb0 to
25580b6
Compare
ShuyangLiu
pushed a commit
to ShuyangLiu/FBGEMM-1
that referenced
this pull request
Jun 25, 2026
Summary:

X-link: facebookresearch/FBGEMM#2767

## What

Parallelizes the per-table loop in the CPU TBE forward kernel (IntNBitTableBatchedEmbeddingBagsCodegen::forward) across tables. The per-table loop is embarrassingly parallel: each table reads its own weight slice and writes a disjoint slice of the output, so fanning tables out across threads speeds up table-heavy inference models (1.4x to 1.7x at 2 threads in the benchmark below).

Two env knobs control it:

- TBE_TABLE_THREADS (default 1): the thread-count cap. =1 keeps the unchanged sequential behavior; =N>1 distributes tables over up to N OpenMP threads with dynamic scheduling (good load balancing when table sizes are skewed).
- TBE_TABLES_PER_THREAD (default 16): work-granularity for the per-call thread count (see the work-size guard below).

## Work-size guard (skip threading for small few-table lookups)

A uniform thread count *regresses* small few-table TBE lookups: splitting a loop of roughly 7 to 13 tables across threads adds OpenMP fork/join + cross-call contention with no parallelism benefit. (Measured on dpa: its three small remote_ro_event lookups slowed down under threading, while its single 358-table remote_ro lookup sped up.)

choose_table_threads() therefore both GATES and GRADES the per-call thread count on the work size via a single formula:

    N = clamp(work_units / G, 1, cap)    (cap=TBE_TABLE_THREADS, G=TBE_TABLES_PER_THREAD)

A call with fewer than 2*G tables runs serial; larger calls scale one thread per G tables, up to the cap. With the default G=16 the threading onset is 32 tables, matching the gate validated in ICE A/B testing. Set TBE_TABLES_PER_THREAD=1 to restore unconditional threading.

The pure decision (choose_table_threads_impl) is split into fbgemm_gpu/utils/embedding_cpu_threading.h so it can be unit tested without touching the process environment.
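As a rough illustration of the work-size guard, the decision could look like the sketch below. The function name `choose_table_threads_sketch` and its signature are hypothetical (the real choose_table_threads_impl may differ), and it assumes that `work_units / G` is integer division, which is what makes a call with fewer than 2*G tables come out serial. Treating a granularity below 1 as 1 is also an assumption added here to avoid dividing by zero.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the pure thread-count decision:
//   N = clamp(work_units / G, 1, cap)
// cap corresponds to TBE_TABLE_THREADS, tables_per_thread to
// TBE_TABLES_PER_THREAD.
inline int64_t choose_table_threads_sketch(
    int64_t work_units,
    int64_t cap,
    int64_t tables_per_thread) {
  assert(work_units >= 0);
  if (cap <= 1) {
    // Default configuration: always serial, granularity is irrelevant.
    return 1;
  }
  // Assumption: guard against a zero or negative granularity.
  const int64_t g = std::max<int64_t>(tables_per_thread, 1);
  // Integer division: one thread per g tables, at least 1, at most cap.
  return std::min<int64_t>(cap, std::max<int64_t>(work_units / g, 1));
}
```

With cap=4 and the default granularity of 16, a 13-table lookup stays serial, a 32-table lookup gets 2 threads, and a 358-table lookup is limited to the cap of 4.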
## Default behavior is unchanged

When TBE_TABLE_THREADS<=1 (the default) choose_table_threads short-circuits to 1 before TBE_TABLES_PER_THREAD is even read, and the helper takes an early return that runs the loop body sequentially with no try/catch wrapper and no thread-local-state guard. So the default path is functionally identical to the pre-change code: same iteration order, the DEVICE-placement TORCH_CHECK in its original per-table position, same error semantics, and the same generated machine code for the body. The only always-on changes are mechanical and behavior-preserving (loop-local `weights` pointer, int64 loop index).

## Design

- Raw OpenMP (#pragma omp parallel + omp for, schedule(dynamic)) rather than at::parallel_for, so TBE gets its own thread count independent of the global intra-op pool / OMP_NUM_THREADS (predictors run with OMP_NUM_THREADS=1).
- The cap and granularity are read once from the env vars and cached (thread-safe static init); the per-call thread count from choose_table_threads() is clamped to the number of tables.

## Correctness

- Removed the function-scoped `weights_acc` pointer, which every iteration overwrote, a data race once the loop is parallel. Replaced with a loop-local pointer (identical pointer value). Every other variable in the loop body is already loop-local, and each table writes a disjoint output slice (output_acc + D_start), so results are bitwise-identical to the sequential path.
- The per-table DEVICE-placement TORCH_CHECK stays in its original position. In the threaded path it, like any other throw from the loop body (kernel errors, at::arange checks), is captured (first one wins) and rethrown after the join, so no exception escapes the OpenMP region.
- Worker threads restore the caller's at::ThreadLocalState (dispatch keys, grad/inference mode, autocast, ...), so ATen calls inside the loop (e.g. at::arange in the nobag path) run with the correct thread-local context.
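The "read once and cached (thread-safe static init)" pattern from the Design section can be sketched as below. The helper names `parse_positive_int` and `tbe_table_threads_cap` are hypothetical, and falling back to 1 for unset, non-numeric, non-positive, or out-of-range values is an assumption of this sketch rather than something the description specifies.

```cpp
#include <cassert>
#include <climits>
#include <cstdlib>

// Hypothetical parser: returns fallback unless text is a whole decimal
// integer in [1, INT_MAX].
inline int parse_positive_int(const char* text, int fallback) {
  assert(fallback >= 1);
  if (text == nullptr || *text == '\0') {
    return fallback;
  }
  char* end = nullptr;
  const long value = std::strtol(text, &end, 10);
  if (*end != '\0' || value < 1 || value > INT_MAX) {
    return fallback;
  }
  return static_cast<int>(value);
}

// Hypothetical accessor for the thread-count cap. A function-local static
// is initialized exactly once, and C++11 guarantees that initialization is
// thread-safe, so the environment is only consulted on the first call.
inline int tbe_table_threads_cap() {
  static const int cap =
      parse_positive_int(std::getenv("TBE_TABLE_THREADS"), 1);
  return cap;
}
```

One consequence of caching is that changing the variable after the first lookup has no effect in that process, which fits the description of the threading test running each setting in a separate process.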
## Verification

- Builds clean (mode/opt); confirmed OpenMP is actually enabled for this target by inspecting the compiled object: gen_*_codegen_cpu.cpp.pic.o references __kmpc_fork_call / omp_get_num_threads (the pragma is NOT a no-op).
- New unit tests in this diff:
  - test/tbe:embedding_cpu_threading_test: C++ tests of the pure choose_table_threads_impl decision: cap=1 (no env var) is ALWAYS serial regardless of work/granularity; onset at 2*G=32; grading scales to the cap; G=1 threads every call.
  - test/tbe:nbit_forward_threading: runs the real CPU int-nbit forward in separate processes under TBE_TABLE_THREADS in {unset, 1, 2, 4} (and TBE_TABLES_PER_THREAD) and asserts the outputs are BITWISE identical (torch.equal), i.e. table-threading does not change the result.
- nbit_forward CPU unit tests pass with TBE_TABLE_THREADS=4, including test_nbit_forward_cpu_with_table_sharing (non-monotonic weights_offsets).

## Benchmark

Configuration: INT4, B=512, D=128, E=100K, L=20, FP16/SUM, iters=100, 3-run avg.

| Tables | Threads | Avg (us) | BW (GB/s) | Speedup | Efficiency |
| --- | --- | --- | --- | --- | --- |
| 8 | 1 | 1,436 | 4.38 | --- | --- |
| 8 | 2 | 1,042 | 6.03 | 1.38x | 69% |
| 8 | 4 | 844 | 7.46 | 1.70x | 43% |
| 8 | 8 | 773 | 8.14 | 1.86x | 23% |
| 32 | 1 | 4,797 | 5.25 | --- | --- |
| 32 | 2 | 2,830 | 8.90 | 1.69x | 85% |
| 32 | 4 | 2,003 | 12.60 | 2.40x | 60% |
| 32 | 8 | 1,633 | 15.49 | 2.94x | 37% |
| 64 | 1 | 10,132 | 4.97 | --- | --- |
| 64 | 2 | 6,767 | 7.44 | 1.50x | 75% |
| 64 | 4 | 4,864 | 10.35 | 2.08x | 52% |
| 64 | 8 | 4,033 | 12.50 | 2.51x | 31% |

2 threads is the efficiency sweet spot (69-85%); efficiency falls off at higher counts due to fixed fork/join overhead per call. This matches the production recommendation TBE_TABLE_THREADS=2 (+7.3% QPS measured in ICE). Microbenchmark kernel speedups overpredict end-to-end QPS (Amdahl: CPU TBE is a small fraction of total inference latency).

Reviewed By: helloguo, q10

Differential Revision: D102867249
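The Speedup and Efficiency columns are consistent with the usual definitions, speedup = T(1 thread) / T(N threads) and efficiency = speedup / N. The small helper below, with the hypothetical names `Scaling` and `scaling`, reproduces them from the average latencies so the rows can be checked or extended with new measurements.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper for checking the benchmark rows.
struct Scaling {
  double speedup;     // baseline latency divided by threaded latency
  double efficiency;  // speedup divided by thread count (1.0 is ideal)
};

inline Scaling scaling(double baseline_us, double threaded_us, int threads) {
  assert(baseline_us > 0.0 && threaded_us > 0.0 && threads >= 1);
  const double speedup = baseline_us / threaded_us;
  return Scaling{speedup, speedup / threads};
}
```

For the 32-table row at 2 threads, `scaling(4797.0, 2830.0, 2)` gives a speedup of about 1.69 and an efficiency of about 0.85, matching the table.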
ShuyangLiu
pushed a commit
to ShuyangLiu/FBGEMM-1
that referenced
this pull request
Jun 25, 2026
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2767
## What
Parallelizes the per-table loop in the CPU TBE forward kernel
(IntNBitTableBatchedEmbeddingBagsCodegen::forward) across tables. The
per-table loop is embarrassingly parallel (each table reads its own weight
slice and writes a disjoint slice of the output), so fanning tables out across
threads speeds up table-heavy inference models (1.4-1.7x at 2 threads and up to
2.9x at 8 threads in the benchmark below).
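Two env knobs control it:
- TBE_TABLE_THREADS (default 1): the thread-count cap. =1 keeps the unchanged
  sequential behavior; =N>1 distributes tables over up to N OpenMP threads with
  dynamic scheduling (good load balancing when table sizes are skewed).
- TBE_TABLES_PER_THREAD (default 16): work-granularity for the per-call thread
  count (see the work-size guard below).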
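The fan-out described above can be sketched as follows. This is a minimal illustration, not the FBGEMM kernel: `Table` and `forward_all_tables` are hypothetical names, and a plain sum stands in for the real embedding lookup. It assumes OpenMP is enabled at compile time (for example `-fopenmp` with GCC or Clang); without it the pragma is ignored and the loop runs serially.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one embedding table (not an FBGEMM type).
struct Table {
  std::vector<float> weights;  // this table's private weight slice
};

// Processes every table and writes table t's result to out[t] only.
// No two iterations touch the same output element, so no locking is needed.
// With OpenMP enabled, tables are handed out dynamically to at most
// n_threads threads (n_threads must be >= 1).
void forward_all_tables(const std::vector<Table>& tables,
                        std::vector<float>& out,
                        int n_threads) {
  const std::int64_t num_tables = static_cast<std::int64_t>(tables.size());
  out.assign(tables.size(), 0.0f);
#pragma omp parallel for schedule(dynamic) num_threads(n_threads)
  for (std::int64_t t = 0; t < num_tables; ++t) {
    // Loop-local pointer: each iteration has its own copy, so nothing is shared.
    const float* weights = tables[t].weights.data();
    float acc = 0.0f;
    for (std::size_t i = 0; i < tables[t].weights.size(); ++i) {
      acc += weights[i];  // placeholder for the real per-table lookup
    }
    out[t] = acc;  // disjoint write: slot t belongs to table t alone
  }
}
```

Because each table's work is done start to finish by a single thread, the per-table result does not depend on how many threads run the loop.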
## Work-size guard (skip threading for small few-table lookups)
A uniform thread count regresses small few-table TBE lookups: splitting a
~7–13 table loop across threads adds OpenMP fork/join + cross-call contention
with no parallelism benefit. (Measured on dpa: its three small remote_ro_event
lookups slowed down under threading, while its single 358-table remote_ro
lookup sped up.) choose_table_threads() therefore both GATES and GRADES the
per-call thread count on the work size via a single formula:

    N = clamp(work_units / G, 1, cap)    (cap=TBE_TABLE_THREADS, G=TBE_TABLES_PER_THREAD)

A call with fewer than 2*G tables runs serial; larger calls scale one thread per
G tables, up to the cap. With the default G=16 the threading onset is 32 tables,
matching the gate validated in ICE A/B testing. Set TBE_TABLES_PER_THREAD=1 to
restore unconditional threading.
The pure decision (choose_table_threads_impl) is split into
fbgemm_gpu/utils/embedding_cpu_threading.h so it can be unit tested without
touching the process environment.
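A minimal sketch of the gate-and-grade rule, assuming integer division and that work_units is the number of tables in the call. The function name and signature are hypothetical; only the formula N = clamp(work_units / G, 1, cap) comes from the description above, and the real choose_table_threads_impl may differ in detail.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper illustrating N = clamp(work_units / G, 1, cap).
inline std::int64_t choose_table_threads_sketch(std::int64_t work_units,
                                                std::int64_t tables_per_thread,
                                                std::int64_t cap) {
  // cap <= 1 is the default configuration: always serial.
  if (cap <= 1) {
    return 1;
  }
  // Guard against a zero or negative granularity (an assumption of this sketch).
  const std::int64_t g = std::max<std::int64_t>(tables_per_thread, 1);
  // Integer division grades the count; min/max implement clamp(x, 1, cap).
  return std::min(cap, std::max<std::int64_t>(work_units / g, 1));
}
```

With cap=4 and G=16, a 13-table call gets 1 thread, a 32-table call gets 2, and a 358-table call is held at the cap of 4.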
## Default behavior is unchanged
When TBE_TABLE_THREADS<=1 (the default) choose_table_threads short-circuits to 1
before TBE_TABLES_PER_THREAD is even read, and the helper takes an early return
that runs the loop body sequentially with no try/catch wrapper and no
thread-local-state guard. So the default path is functionally identical to the
pre-change code: same iteration order, the DEVICE-placement TORCH_CHECK in its
original per-table position, same error semantics, and the same generated machine
code for the body. The only always-on changes are mechanical and
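behavior-preserving (loop-local `weights` pointer, int64 loop index).
## Design
- Raw OpenMP (#pragma omp parallel + omp for, schedule(dynamic)) rather than
  at::parallel_for, so TBE gets its own thread count independent of the global
  intra-op pool / OMP_NUM_THREADS (predictors run with OMP_NUM_THREADS=1).
- The cap and granularity are read once from the env vars and cached (thread-safe
  static init); the per-call thread count from choose_table_threads() is clamped
  to the number of tables.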
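One common way to read an env knob once and cache it is a function-local static, which C++11 guarantees is initialized exactly once even under concurrent first calls. The sketch below assumes that pattern; `parse_knob` and `table_thread_cap` are hypothetical names, and falling back to the default on malformed or non-positive values is an assumption of this sketch, not a statement about the PR's parsing.

```cpp
#include <climits>
#include <cstdlib>

// Parses a positive integer knob; returns fallback when the text is unset,
// empty, malformed, or out of range. (Fallback policy is an assumption.)
int parse_knob(const char* text, int fallback) {
  if (text == nullptr || *text == '\0') {
    return fallback;
  }
  char* end = nullptr;
  const long value = std::strtol(text, &end, 10);
  if (*end != '\0' || value < 1 || value > INT_MAX) {
    return fallback;
  }
  return static_cast<int>(value);
}

// The environment is consulted only on the first call; later calls return
// the cached value, so changing the variable afterwards has no effect.
int table_thread_cap() {
  static const int cap = parse_knob(std::getenv("TBE_TABLE_THREADS"), 1);
  return cap;
}
```

Keeping the parser separate from the cached accessor mirrors the split described earlier: the pure part can be tested without touching the process environment.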
## Correctness
- Removed the function-scoped `weights_acc` pointer, which every iteration
  overwrote, a data race once the loop is parallel. Replaced with a loop-local
  pointer (identical pointer value). Every other variable in the loop body is
  already loop-local, and each table writes a disjoint output slice
  (output_acc + D_start), so results are bitwise-identical to the sequential
  path.
- The per-table DEVICE-placement TORCH_CHECK stays in its original position. In
  the threaded path it, like any other throw from the loop body (kernel
  errors, at::arange checks), is captured (first one wins) and rethrown after
  the join, so no exception escapes the OpenMP region.
- Worker threads restore the caller's at::ThreadLocalState (dispatch keys,
  grad/inference mode, autocast, ...), so ATen calls inside the loop (e.g.
  at::arange in the nobag path) run with the correct thread-local context.
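## Verification
- Builds clean (mode/opt); confirmed OpenMP is actually enabled for this
  target by inspecting the compiled object: gen_*_codegen_cpu.cpp.pic.o
  references __kmpc_fork_call / omp_get_num_threads (the pragma is NOT a no-op).
- New unit tests in this diff:
  - test/tbe:embedding_cpu_threading_test: C++ tests of the pure
    choose_table_threads_impl decision: cap=1 (no env var) is ALWAYS serial
    regardless of work/granularity; onset at 2*G=32; grading scales to the cap;
    G=1 threads every call.
  - test/tbe:nbit_forward_threading: runs the real CPU int-nbit forward in
    separate processes under TBE_TABLE_THREADS in {unset, 1, 2, 4} (and
    TBE_TABLES_PER_THREAD) and asserts the outputs are BITWISE identical
    (torch.equal), i.e. table-threading does not change the result.
- nbit_forward CPU unit tests pass with TBE_TABLE_THREADS=4, including
  test_nbit_forward_cpu_with_table_sharing (non-monotonic weights_offsets).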
## Benchmark (INT4, B=512, D=128, E=100K, L=20, FP16/SUM, iters=100, 3-run avg)
| Tables | Threads | Avg time (us) | BW (GB/s) | Speedup | Efficiency |
|--------|---------|---------------|-----------|---------|------------|
| 8 | 1 | 1,436 | 4.38 | --- | --- |
| 8 | 2 | 1,042 | 6.03 | 1.38x | 69% |
| 8 | 4 | 844 | 7.46 | 1.70x | 43% |
| 8 | 8 | 773 | 8.14 | 1.86x | 23% |
| 32 | 1 | 4,797 | 5.25 | --- | --- |
| 32 | 2 | 2,830 | 8.90 | 1.69x | 85% |
| 32 | 4 | 2,003 | 12.60 | 2.40x | 60% |
| 32 | 8 | 1,633 | 15.49 | 2.94x | 37% |
| 64 | 1 | 10,132 | 4.97 | --- | --- |
| 64 | 2 | 6,767 | 7.44 | 1.50x | 75% |
| 64 | 4 | 4,864 | 10.35 | 2.08x | 52% |
| 64 | 8 | 4,033 | 12.50 | 2.51x | 31% |
Efficiency is speedup divided by thread count. 2 threads is the efficiency
sweet spot (69-85%); efficiency falls off at higher counts due to fixed
fork/join overhead per call. This matches the production recommendation
TBE_TABLE_THREADS=2 (+7.3% QPS measured in ICE). Microbenchmark kernel speedups
overpredict end-to-end QPS (Amdahl: CPU TBE is a small fraction of total
inference latency).
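The Speedup and Efficiency columns follow from the timings alone: speedup is the 1-thread time divided by the N-thread time, and efficiency is that speedup divided by N. A small sketch of the arithmetic, with a hypothetical `Scaling` struct and `scaling` function:

```cpp
// Hypothetical helper reproducing the Speedup and Efficiency columns.
struct Scaling {
  double speedup;     // serial time / threaded time
  double efficiency;  // speedup / thread count, as a fraction of ideal
};

Scaling scaling(double serial_us, double threaded_us, int threads) {
  const double speedup = serial_us / threaded_us;
  return {speedup, speedup / threads};
}
```

For the 8-table rows, scaling(1436, 1042, 2) gives a speedup of about 1.38 and an efficiency of about 0.69, matching the 1.38x and 69% in the table.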
Reviewed By: helloguo, q10
Differential Revision: D102867249