Add multithreading to table lookup (#5849) by ShuyangLiu · Pull Request #5849 · pytorch/FBGEMM

ShuyangLiu · 2026-06-08T18:11:09Z

Summary:

X-link: https://github.com/facebookresearch/FBGEMM/pull/2767

What

Parallelizes the per-table loop in the CPU TBE forward kernel
(IntNBitTableBatchedEmbeddingBagsCodegen::forward) across tables. The
per-table loop is embarrassingly parallel — each table reads its own weight
slice and writes a disjoint slice of the output — so fanning tables out across
threads gives near-linear speedup on table-heavy inference models.

Two env knobs control it:

TBE_TABLE_THREADS (default 1): the thread-count cap. =1 keeps the unchanged
sequential behavior; =N>1 distributes tables over up to N OpenMP threads with
dynamic scheduling (good load balancing when table sizes are skewed).
TBE_TABLES_PER_THREAD (default 16): work-granularity for the per-call thread
count (see the work-size guard below).

Work-size guard (skip threading for small few-table lookups)

A uniform thread count regresses small few-table TBE lookups: splitting a
~7–13 table loop across threads adds OpenMP fork/join + cross-call contention
with no parallelism benefit. (Measured on dpa: its three small remote_ro_event
lookups slowed down under threading, while its single 358-table remote_ro
lookup sped up.) choose_table_threads() therefore both GATES and GRADES the
per-call thread count on the work size via a single formula:

N = clamp(work_units / G, 1, cap)   (cap=TBE_TABLE_THREADS, G=TBE_TABLES_PER_THREAD)

A call with fewer than 2*G tables runs serial; larger calls scale one thread per
G tables, up to the cap. With the default G=16 the threading onset is 32 tables,
matching the gate validated in ICE A/B testing. Set TBE_TABLES_PER_THREAD=1 to
restore unconditional threading.

The pure decision (choose_table_threads_impl) is split into
fbgemm_gpu/utils/embedding_cpu_threading.h so it can be unit tested without
touching the process environment.

Default behavior is unchanged

When TBE_TABLE_THREADS<=1 (the default) choose_table_threads short-circuits to 1
before TBE_TABLES_PER_THREAD is even read, and the helper takes an early return
that runs the loop body sequentially with no try/catch wrapper and no
thread-local-state guard. So the default path is functionally identical to the
pre-change code: same iteration order, the DEVICE-placement TORCH_CHECK in its
original per-table position, same error semantics, and the same generated machine
code for the body. The only always-on changes are mechanical and
behavior-preserving (loop-local weights pointer, int64 loop index).

Design

Raw OpenMP (#pragma omp parallel + omp for, schedule(dynamic)) rather than
at::parallel_for, so TBE gets its own thread count independent of the global
intra-op pool / OMP_NUM_THREADS (predictors run with OMP_NUM_THREADS=1).
The cap and granularity are read once from the env vars and cached (thread-safe
static init); the per-call thread count from choose_table_threads() is clamped
to the number of tables.

Correctness

Removed the function-scoped weights_acc pointer, which every iteration
overwrote — a data race once the loop is parallel. Replaced with a loop-local
pointer (identical pointer value). Every other variable in the loop body is
already loop-local, and each table writes a disjoint output slice
(output_acc + D_start), so results are bitwise-identical to the sequential
path.
The per-table DEVICE-placement TORCH_CHECK stays in its original position. In
the threaded path it — like any other throw from the loop body (kernel
errors, at::arange checks) — is captured (first one wins) and rethrown after
the join, so no exception escapes the OpenMP region.
Worker threads restore the caller's at::ThreadLocalState (dispatch keys,
grad/inference mode, autocast, ...), so ATen calls inside the loop (e.g.
at::arange in the nobag path) run with the correct thread-local context.

Verification

Builds clean (mode/opt); confirmed OpenMP is actually enabled for this
target by inspecting the compiled object — gen_*_codegen_cpu.cpp.pic.o
references __kmpc_fork_call / omp_get_num_threads (the pragma is NOT a no-op).
New unit tests in this diff:
- test/tbe:embedding_cpu_threading_test — C++ tests of the pure
  choose_table_threads_impl decision: cap=1 (no env var) is ALWAYS serial
  regardless of work/granularity; onset at 2*G=32; grading scales to the cap;
  G=1 threads every call.
- test/tbe:nbit_forward_threading — runs the real CPU int-nbit forward in
  separate processes under TBE_TABLE_THREADS in {unset, 1, 2, 4} (and
  TBE_TABLES_PER_THREAD) and asserts the outputs are BITWISE identical
  (torch.equal) — i.e. table-threading does not change the result.
nbit_forward CPU unit tests pass with TBE_TABLE_THREADS=4, including
test_nbit_forward_cpu_with_table_sharing (non-monotonic weights_offsets).

Benchmark (INT4, B=512, D=128, E=100K, L=20, FP16/SUM, iters=100, 3-run avg)

| Tables | Threads | Avg us | BW (GB/s) | Speedup | Efficiency |
| 8 | 1 | 1,436 | 4.38 | --- | --- |
| 8 | 2 | 1,042 | 6.03 | 1.38x | 69% |
| 8 | 4 | 844 | 7.46 | 1.70x | 43% |
| 8 | 8 | 773 | 8.14 | 1.86x | 23% |
| 32 | 1 | 4,797 | 5.25 | --- | --- |
| 32 | 2 | 2,830 | 8.90 | 1.69x | 85% |
| 32 | 4 | 2,003 | 12.60 | 2.40x | 60% |
| 32 | 8 | 1,633 | 15.49 | 2.94x | 37% |
| 64 | 1 | 10,132 | 4.97 | --- | --- |
| 64 | 2 | 6,767 | 7.44 | 1.50x | 75% |
| 64 | 4 | 4,864 | 10.35 | 2.08x | 52% |
| 64 | 8 | 4,033 | 12.50 | 2.51x | 31% |

2 threads is the efficiency sweet spot (69-85%); efficiency falls off at higher
counts due to fixed fork/join overhead per call. This matches the production
recommendation TBE_TABLE_THREADS=2 (+7.3% QPS measured in ICE). Microbenchmark
kernel speedups overpredict end-to-end QPS (Amdahl: CPU TBE is a small fraction
of total inference latency).

Reviewed By: helloguo, q10

Differential Revision: D102867249

meta-codesync · 2026-06-08T18:11:46Z

@ShuyangLiu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D102867249.

Summary: X-link: facebookresearch/FBGEMM#2767 ## What Parallelizes the per-table loop in the CPU TBE forward kernel (IntNBitTableBatchedEmbeddingBagsCodegen::forward) across tables. The per-table loop is embarrassingly parallel — each table reads its own weight slice and writes a disjoint slice of the output — so fanning tables out across threads gives near-linear speedup on table-heavy inference models. Gated by the TBE_TABLE_THREADS env var: - TBE_TABLE_THREADS=1 (default): unchanged sequential behavior. - TBE_TABLE_THREADS=N>1: tables are distributed over N OpenMP threads with dynamic scheduling (good load balancing when table sizes are skewed). ## Default behavior is unchanged When TBE_TABLE_THREADS<=1 the helper takes an early return and runs the loop body sequentially with no try/catch wrapper and no thread-local-state guard, so the default path is functionally identical to the pre-change code: same iteration order, the DEVICE-placement TORCH_CHECK in its original per-table position, same error semantics, and the same generated machine code for the body. The only always-on changes are mechanical and behavior-preserving (loop-local `weights` pointer, int64 loop index). ## Design - Raw OpenMP (#pragma omp parallel + omp for, schedule(dynamic)) rather than at::parallel_for, so TBE gets its own thread count independent of the global intra-op pool / OMP_NUM_THREADS (predictors run with OMP_NUM_THREADS=1). - Thread count is read once from the env var and cached (thread-safe static init), and clamped to the number of tables. ## Correctness - Removed the function-scoped `weights_acc` pointer, which every iteration overwrote — a data race once the loop is parallel. Replaced with a loop-local pointer (identical pointer value). Every other variable in the loop body is already loop-local, and each table writes a disjoint output slice (output_acc + D_start), so results are bitwise-identical to the sequential path. - The per-table DEVICE-placement TORCH_CHECK stays in its original position. In the threaded path it — like any other throw from the loop body (kernel errors, at::arange checks) — is captured (first one wins) and rethrown after the join, so no exception escapes the OpenMP region. - Worker threads restore the caller's at::ThreadLocalState (dispatch keys, grad/inference mode, autocast, ...), so ATen calls inside the loop (e.g. at::arange in the nobag path) run with the correct thread-local context. ## Verification - Builds clean (mode/opt); confirmed OpenMP is actually enabled for this target by inspecting the compiled object — gen_*_codegen_cpu.cpp.pic.o references __kmpc_fork_call / omp_get_num_threads (the pragma is NOT a no-op, even though no -fopenmp appears in the TARGETS; the fbcode default toolchain supplies it). - nbit_forward CPU unit tests pass with TBE_TABLE_THREADS=4, including test_nbit_forward_cpu_with_table_sharing (non-monotonic weights_offsets). ## Benchmark (INT4, B=512, D=128, E=100K, L=20, FP16/SUM, iters=100, 3-run avg) | Tables | Threads | Avg us | BW (GB/s) | Speedup | Efficiency | | 8 | 1 | 1,436 | 4.38 | --- | --- | | 8 | 2 | 1,042 | 6.03 | 1.38x | 69% | | 8 | 4 | 844 | 7.46 | 1.70x | 43% | | 8 | 8 | 773 | 8.14 | 1.86x | 23% | | 32 | 1 | 4,797 | 5.25 | --- | --- | | 32 | 2 | 2,830 | 8.90 | 1.69x | 85% | | 32 | 4 | 2,003 | 12.60 | 2.40x | 60% | | 32 | 8 | 1,633 | 15.49 | 2.94x | 37% | | 64 | 1 | 10,132 | 4.97 | --- | --- | | 64 | 2 | 6,767 | 7.44 | 1.50x | 75% | | 64 | 4 | 4,864 | 10.35 | 2.08x | 52% | | 64 | 8 | 4,033 | 12.50 | 2.51x | 31% | 2 threads is the efficiency sweet spot (69-85%); efficiency falls off at higher counts due to fixed fork/join overhead per call. This matches the production recommendation TBE_TABLE_THREADS=2 (+7.3% QPS measured in ICE). Microbenchmark kernel speedups overpredict end-to-end QPS (Amdahl: CPU TBE is a small fraction of total inference latency). Reviewed By: q10 Differential Revision: D102867249

Summary: X-link: facebookresearch/FBGEMM#2767 ## What Parallelizes the per-table loop in the CPU TBE forward kernel (IntNBitTableBatchedEmbeddingBagsCodegen::forward) across tables. The per-table loop is embarrassingly parallel — each table reads its own weight slice and writes a disjoint slice of the output — so fanning tables out across threads gives near-linear speedup on table-heavy inference models. Gated by the TBE_TABLE_THREADS env var: - TBE_TABLE_THREADS=1 (default): unchanged sequential behavior. - TBE_TABLE_THREADS=N>1: tables are distributed over N OpenMP threads with dynamic scheduling (good load balancing when table sizes are skewed). ## Default behavior is unchanged When TBE_TABLE_THREADS<=1 the helper takes an early return and runs the loop body sequentially with no try/catch wrapper and no thread-local-state guard, so the default path is functionally identical to the pre-change code: same iteration order, the DEVICE-placement TORCH_CHECK in its original per-table position, same error semantics, and the same generated machine code for the body. The only always-on changes are mechanical and behavior-preserving (loop-local `weights` pointer, int64 loop index). ## Design - Raw OpenMP (#pragma omp parallel + omp for, schedule(dynamic)) rather than at::parallel_for, so TBE gets its own thread count independent of the global intra-op pool / OMP_NUM_THREADS (predictors run with OMP_NUM_THREADS=1). - Thread count is read once from the env var and cached (thread-safe static init), and clamped to the number of tables. ## Correctness - Removed the function-scoped `weights_acc` pointer, which every iteration overwrote — a data race once the loop is parallel. Replaced with a loop-local pointer (identical pointer value). Every other variable in the loop body is already loop-local, and each table writes a disjoint output slice (output_acc + D_start), so results are bitwise-identical to the sequential path. - The per-table DEVICE-placement TORCH_CHECK stays in its original position. In the threaded path it — like any other throw from the loop body (kernel errors, at::arange checks) — is captured (first one wins) and rethrown after the join, so no exception escapes the OpenMP region. - Worker threads restore the caller's at::ThreadLocalState (dispatch keys, grad/inference mode, autocast, ...), so ATen calls inside the loop (e.g. at::arange in the nobag path) run with the correct thread-local context. ## Verification - Builds clean (mode/opt); confirmed OpenMP is actually enabled for this target by inspecting the compiled object — gen_*_codegen_cpu.cpp.pic.o references __kmpc_fork_call / omp_get_num_threads (the pragma is NOT a no-op, even though no -fopenmp appears in the TARGETS; the fbcode default toolchain supplies it). - nbit_forward CPU unit tests pass with TBE_TABLE_THREADS=4, including test_nbit_forward_cpu_with_table_sharing (non-monotonic weights_offsets). ## Benchmark (INT4, B=512, D=128, E=100K, L=20, FP16/SUM, iters=100, 3-run avg) | Tables | Threads | Avg us | BW (GB/s) | Speedup | Efficiency | | 8 | 1 | 1,436 | 4.38 | --- | --- | | 8 | 2 | 1,042 | 6.03 | 1.38x | 69% | | 8 | 4 | 844 | 7.46 | 1.70x | 43% | | 8 | 8 | 773 | 8.14 | 1.86x | 23% | | 32 | 1 | 4,797 | 5.25 | --- | --- | | 32 | 2 | 2,830 | 8.90 | 1.69x | 85% | | 32 | 4 | 2,003 | 12.60 | 2.40x | 60% | | 32 | 8 | 1,633 | 15.49 | 2.94x | 37% | | 64 | 1 | 10,132 | 4.97 | --- | --- | | 64 | 2 | 6,767 | 7.44 | 1.50x | 75% | | 64 | 4 | 4,864 | 10.35 | 2.08x | 52% | | 64 | 8 | 4,033 | 12.50 | 2.51x | 31% | 2 threads is the efficiency sweet spot (69-85%); efficiency falls off at higher counts due to fixed fork/join overhead per call. This matches the production recommendation TBE_TABLE_THREADS=2 (+7.3% QPS measured in ICE). Microbenchmark kernel speedups overpredict end-to-end QPS (Amdahl: CPU TBE is a small fraction of total inference latency). Reviewed By: helloguo, q10 Differential Revision: D102867249

Summary: X-link: facebookresearch/FBGEMM#2767 ## What Parallelizes the per-table loop in the CPU TBE forward kernel (IntNBitTableBatchedEmbeddingBagsCodegen::forward) across tables. The per-table loop is embarrassingly parallel — each table reads its own weight slice and writes a disjoint slice of the output — so fanning tables out across threads gives near-linear speedup on table-heavy inference models. Two env knobs control it: - TBE_TABLE_THREADS (default 1): the thread-count cap. =1 keeps the unchanged sequential behavior; =N>1 distributes tables over up to N OpenMP threads with dynamic scheduling (good load balancing when table sizes are skewed). - TBE_TABLES_PER_THREAD (default 16): work-granularity for the per-call thread count (see the work-size guard below). ## Work-size guard (skip threading for small few-table lookups) A uniform thread count *regresses* small few-table TBE lookups: splitting a ~7–13 table loop across threads adds OpenMP fork/join + cross-call contention with no parallelism benefit. (Measured on dpa: its three small remote_ro_event lookups slowed down under threading, while its single 358-table remote_ro lookup sped up.) choose_table_threads() therefore both GATES and GRADES the per-call thread count on the work size via a single formula: N = clamp(work_units / G, 1, cap) (cap=TBE_TABLE_THREADS, G=TBE_TABLES_PER_THREAD) A call with fewer than 2*G tables runs serial; larger calls scale one thread per G tables, up to the cap. With the default G=16 the threading onset is 32 tables, matching the gate validated in ICE A/B testing. Set TBE_TABLES_PER_THREAD=1 to restore unconditional threading. The pure decision (choose_table_threads_impl) is split into fbgemm_gpu/utils/embedding_cpu_threading.h so it can be unit tested without touching the process environment. ## Default behavior is unchanged When TBE_TABLE_THREADS<=1 (the default) choose_table_threads short-circuits to 1 before TBE_TABLES_PER_THREAD is even read, and the helper takes an early return that runs the loop body sequentially with no try/catch wrapper and no thread-local-state guard. So the default path is functionally identical to the pre-change code: same iteration order, the DEVICE-placement TORCH_CHECK in its original per-table position, same error semantics, and the same generated machine code for the body. The only always-on changes are mechanical and behavior-preserving (loop-local `weights` pointer, int64 loop index). ## Design - Raw OpenMP (#pragma omp parallel + omp for, schedule(dynamic)) rather than at::parallel_for, so TBE gets its own thread count independent of the global intra-op pool / OMP_NUM_THREADS (predictors run with OMP_NUM_THREADS=1). - The cap and granularity are read once from the env vars and cached (thread-safe static init); the per-call thread count from choose_table_threads() is clamped to the number of tables. ## Correctness - Removed the function-scoped `weights_acc` pointer, which every iteration overwrote — a data race once the loop is parallel. Replaced with a loop-local pointer (identical pointer value). Every other variable in the loop body is already loop-local, and each table writes a disjoint output slice (output_acc + D_start), so results are bitwise-identical to the sequential path. - The per-table DEVICE-placement TORCH_CHECK stays in its original position. In the threaded path it — like any other throw from the loop body (kernel errors, at::arange checks) — is captured (first one wins) and rethrown after the join, so no exception escapes the OpenMP region. - Worker threads restore the caller's at::ThreadLocalState (dispatch keys, grad/inference mode, autocast, ...), so ATen calls inside the loop (e.g. at::arange in the nobag path) run with the correct thread-local context. ## Verification - Builds clean (mode/opt); confirmed OpenMP is actually enabled for this target by inspecting the compiled object — gen_*_codegen_cpu.cpp.pic.o references __kmpc_fork_call / omp_get_num_threads (the pragma is NOT a no-op). - New unit tests in this diff: - test/tbe:embedding_cpu_threading_test — C++ tests of the pure choose_table_threads_impl decision: cap=1 (no env var) is ALWAYS serial regardless of work/granularity; onset at 2*G=32; grading scales to the cap; G=1 threads every call. - test/tbe:nbit_forward_threading — runs the real CPU int-nbit forward in separate processes under TBE_TABLE_THREADS in {unset, 1, 2, 4} (and TBE_TABLES_PER_THREAD) and asserts the outputs are BITWISE identical (torch.equal) — i.e. table-threading does not change the result. - nbit_forward CPU unit tests pass with TBE_TABLE_THREADS=4, including test_nbit_forward_cpu_with_table_sharing (non-monotonic weights_offsets). ## Benchmark (INT4, B=512, D=128, E=100K, L=20, FP16/SUM, iters=100, 3-run avg) | Tables | Threads | Avg us | BW (GB/s) | Speedup | Efficiency | | 8 | 1 | 1,436 | 4.38 | --- | --- | | 8 | 2 | 1,042 | 6.03 | 1.38x | 69% | | 8 | 4 | 844 | 7.46 | 1.70x | 43% | | 8 | 8 | 773 | 8.14 | 1.86x | 23% | | 32 | 1 | 4,797 | 5.25 | --- | --- | | 32 | 2 | 2,830 | 8.90 | 1.69x | 85% | | 32 | 4 | 2,003 | 12.60 | 2.40x | 60% | | 32 | 8 | 1,633 | 15.49 | 2.94x | 37% | | 64 | 1 | 10,132 | 4.97 | --- | --- | | 64 | 2 | 6,767 | 7.44 | 1.50x | 75% | | 64 | 4 | 4,864 | 10.35 | 2.08x | 52% | | 64 | 8 | 4,033 | 12.50 | 2.51x | 31% | 2 threads is the efficiency sweet spot (69-85%); efficiency falls off at higher counts due to fixed fork/join overhead per call. This matches the production recommendation TBE_TABLE_THREADS=2 (+7.3% QPS measured in ICE). Microbenchmark kernel speedups overpredict end-to-end QPS (Amdahl: CPU TBE is a small fraction of total inference latency). Reviewed By: helloguo, q10 Differential Revision: D102867249

meta-cla Bot added the cla signed label Jun 8, 2026

meta-codesync Bot added the meta-exported label Jun 8, 2026

ShuyangLiu force-pushed the export-D102867249 branch from 21bca1a to b863b50 Compare June 8, 2026 22:50

meta-codesync Bot changed the title ~~Add multithreading to table lookup~~ Add multithreading to table lookup (#5849) Jun 8, 2026

ShuyangLiu force-pushed the export-D102867249 branch 2 times, most recently from 93771b1 to 6cac33b Compare June 10, 2026 19:28

ShuyangLiu force-pushed the export-D102867249 branch from 6cac33b to fbd093a Compare June 10, 2026 19:28

ShuyangLiu force-pushed the export-D102867249 branch 2 times, most recently from c028af0 to 6b10e0f Compare June 12, 2026 18:03

ShuyangLiu force-pushed the export-D102867249 branch from 6b10e0f to 00dd8d3 Compare June 15, 2026 19:38

ShuyangLiu force-pushed the export-D102867249 branch 2 times, most recently from a18e209 to d668d4e Compare June 17, 2026 01:52

ShuyangLiu force-pushed the export-D102867249 branch 2 times, most recently from e1a5014 to af6ea34 Compare June 19, 2026 20:19

ShuyangLiu force-pushed the export-D102867249 branch 2 times, most recently from b67a2f3 to bf90b49 Compare June 22, 2026 18:27

ShuyangLiu force-pushed the export-D102867249 branch from bf90b49 to 02c92f4 Compare June 22, 2026 18:27

ShuyangLiu force-pushed the export-D102867249 branch from 02c92f4 to 4aaaa93 Compare June 23, 2026 06:23

ShuyangLiu force-pushed the export-D102867249 branch from 4aaaa93 to ecafadc Compare June 23, 2026 06:23

ShuyangLiu force-pushed the export-D102867249 branch from ecafadc to efff636 Compare June 23, 2026 19:07

ShuyangLiu force-pushed the export-D102867249 branch 2 times, most recently from 340e16a to 772a375 Compare June 24, 2026 05:41

ShuyangLiu force-pushed the export-D102867249 branch from 772a375 to 769e134 Compare June 24, 2026 18:15

ShuyangLiu force-pushed the export-D102867249 branch 2 times, most recently from 337bcb0 to 25580b6 Compare June 25, 2026 01:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add multithreading to table lookup (#5849)#5849

Add multithreading to table lookup (#5849)#5849
ShuyangLiu wants to merge 1 commit into
pytorch:mainfrom
ShuyangLiu:export-D102867249

ShuyangLiu commented Jun 8, 2026 •

edited by meta-codesync Bot

Loading

Uh oh!

meta-codesync Bot commented Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

ShuyangLiu commented Jun 8, 2026 • edited by meta-codesync Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Work-size guard (skip threading for small few-table lookups)

Default behavior is unchanged

Design

Correctness

Verification

Benchmark (INT4, B=512, D=128, E=100K, L=20, FP16/SUM, iters=100, 3-run avg)

Uh oh!

meta-codesync Bot commented Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ShuyangLiu commented Jun 8, 2026 •

edited by meta-codesync Bot

Loading