feat(moe): add moe support and fused topk & moe kernels #37

MikanAffine wants to merge 5 commits into SJTU-DENG-Lab:main from
Conversation
📝 Walkthrough

This PR introduces a comprehensive refactoring of the Mixture-of-Experts (MoE) implementation, replacing the monolithic block with modular router, dispatcher, and runner components.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Input as Input Token<br/>Sequence
    participant FusedMoE as FusedMoE<br/>Layer
    participant Router as TopK<br/>Router
    participant Dispatcher as Token<br/>Dispatcher
    participant Runner as MoE<br/>Runner
    participant Output as Output<br/>Sequence
    Input->>FusedMoE: hidden_states
    FusedMoE->>Router: router_logits
    Router->>Router: compute top-k<br/>routing scores
    Router->>Dispatcher: TopKOutput<br/>(ids, weights)
    Dispatcher->>Dispatcher: map tokens to<br/>local experts
    Dispatcher->>Runner: DispatchOutput
    Runner->>Runner: execute experts<br/>fused_moe(...) or<br/>per-expert MLP
    Runner->>Runner: all_reduce if<br/>expert_parallel
    Runner->>Output: CombineInput<br/>(combined_hidden_states)
    Output->>FusedMoE: (final_hidden_states,<br/>router_logits)
```
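The route → dispatch → run → combine flow in the diagram can be sketched as a plain-PyTorch reference (an illustrative sketch, not the PR's actual `FusedMoE` API; the function and argument names here are assumptions):

```python
import torch
import torch.nn.functional as F

def moe_forward_reference(hidden_states, router_weight, experts, top_k):
    """Reference MoE forward: route -> dispatch -> run experts -> combine.

    hidden_states: (M, H); router_weight: (E, H);
    experts: list of callables mapping (n, H) -> (n, H).
    """
    router_logits = hidden_states @ router_weight.t()           # (M, E)
    scores = F.softmax(router_logits, dim=-1)
    topk_weights, topk_ids = torch.topk(scores, top_k, dim=-1)  # (M, top_k)
    topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)

    out = torch.zeros_like(hidden_states)
    for e, expert in enumerate(experts):
        # (token, slot) pairs routed to expert e
        token_idx, slot_idx = (topk_ids == e).nonzero(as_tuple=True)
        if token_idx.numel() == 0:
            continue
        expert_out = expert(hidden_states[token_idx])           # (n, H)
        # weighted scatter-add back into the combined output
        out.index_add_(0, token_idx,
                       expert_out * topk_weights[token_idx, slot_idx, None])
    return out, router_logits
```

A fused kernel replaces the per-expert Python loop with a single launch, but the math is the same: because the top-k weights are renormalized to sum to 1, routing through identity experts returns the input unchanged.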
Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

This PR spans multiple interconnected subsystems (MoE architecture, kernel implementations, distributed parallelism metadata, checkpoint loading, and model integration). It requires careful review of: new abstract base classes and factory patterns across dispatcher/runner/router frameworks; high-density Triton kernel code with GEMM tiling and masking logic; parallelism metadata initialization and validation with cross-process consistency checks; complex weight resolution and tensor slicing for expert-parallel sharding; and distributed training coordination changes affecting multiple worker types.
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 6
🧹 Nitpick comments (3)
diffulex/moe/__init__.py (1)
10-16: Remove the dead fallback or make it a real branch.

Line 15 returns unconditionally, so Line 16 can never execute. If SparseMoEBlock is still meant to be a fallback, this needs an availability/config check instead of an unreachable return.

Suggested cleanup

```diff
 def build_mlp_or_moe(config, layer_idx: int, dense_factory):
     """Build a dense MLP or MoE block according to the config."""
     if is_moe_layer(config, layer_idx):
         return FusedSparseMoEBlock.from_config(config)
-    return SparseMoEBlock.from_config(config)
     return dense_factory()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@diffulex/moe/__init__.py` around lines 10 - 16, The function build_mlp_or_moe currently returns FusedSparseMoEBlock.from_config(config) unconditionally, making the subsequent return of SparseMoEBlock.from_config(config) dead code; either remove the unreachable fallback or implement a real branch that chooses SparseMoEBlock when FusedSparseMoEBlock is unavailable. Update build_mlp_or_moe to check availability (e.g., try/except ImportError or a feature flag) before calling FusedSparseMoEBlock.from_config(config) and only call SparseMoEBlock.from_config(config) when the fused implementation is not available, or delete the redundant return line if the fused block is the sole supported implementation; reference the symbols build_mlp_or_moe, FusedSparseMoEBlock, SparseMoEBlock, and is_moe_layer when making the change.

diffulex_kernel/__init__.py (1)
42-50: Keep the lazy-export surface consistent.
fused_topk is exposed through both __getattr__ and __all__, but fused_moe is only exposed through __getattr__ while Line 62 keeps it commented out. If fused_moe is public, export it consistently; if it is private, hiding it in one place and exposing it in another is confusing.

Also applies to: 55-63
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@diffulex_kernel/__init__.py` around lines 42 - 50, The lazy-export surface is inconsistent: __getattr__ exposes fused_topk and fused_moe but __all__ only lists fused_topk (fused_moe is commented out). Make the exports consistent by either adding "fused_moe" to the module-level __all__ list (or uncommenting the existing entry) if it should be public, or remove/deny export in __getattr__ for fused_moe if it should be private; update the code paths around __getattr__, the fused_topk and fused_moe import lines, and the __all__ definition so both symbols are treated the same way.

diffulex/moe/topk.py (1)
9-10: Lazy-import the Triton backend.
TopKRouter(impl="torch") and topk_pytorch_reference() do not need the kernel package, but the module-level from diffulex_kernel import fused_topk makes diffulex.moe.topk depend on that stack at import time anyway. If diffulex_kernel resolves the Triton module eagerly, CPU/reference users will fail before they can ever select the torch path.

♻️ Suggested change

```diff
-from diffulex_kernel import fused_topk
-
 def topk_pytorch_reference(
     router_logits: torch.Tensor,
     top_k: int,
@@
         if impl == "torch":
             self.impl = topk_pytorch_reference
         elif impl == "triton":
+            from diffulex_kernel import fused_topk
             self.impl = fused_topk
         else:
             raise ValueError(f"Unsupported impl: {impl!r}")
```

Also applies to: 59-64
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@diffulex/moe/topk.py` around lines 9 - 10, The module currently imports fused_topk at top-level causing an eager dependency; change to lazy-import diffulex_kernel.fused_topk only where needed: move the import into the code paths that actually call it (e.g., inside TopKRouter implementation branch that selects the Triton backend and inside the function that calls fused_topk), so TopKRouter(impl="torch") and topk_pytorch_reference() can be imported without resolving diffulex_kernel; ensure you reference fused_topk by importing it locally just before use and keep existing function/class names (TopKRouter, topk_pytorch_reference, fused_topk) to locate the spots to modify.
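The deferred-import shape the comment asks for can be sketched like this (the symbol names `TopKRouter`, `topk_pytorch_reference`, and `fused_topk` come from the review; the class body itself is illustrative, not the repo's actual code):

```python
import torch

def topk_pytorch_reference(router_logits: torch.Tensor, top_k: int):
    """Pure-torch routing path; importable without the kernel package."""
    scores = torch.softmax(router_logits, dim=-1)
    weights, ids = torch.topk(scores, top_k, dim=-1)
    return weights, ids

class TopKRouter:
    def __init__(self, impl: str = "torch"):
        if impl == "torch":
            self.impl = topk_pytorch_reference
        elif impl == "triton":
            # Deferred import: only paid by users who select the Triton
            # backend, so CPU/reference users never resolve diffulex_kernel.
            from diffulex_kernel import fused_topk
            self.impl = fused_topk
        else:
            raise ValueError(f"Unsupported impl: {impl!r}")
```

With this shape, `import diffulex.moe.topk` succeeds on machines without the Triton stack; the dependency error only surfaces when someone explicitly constructs `TopKRouter(impl="triton")`.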
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@diffulex_kernel/python/fused_moe_triton.py`:
- Around line 21-23: Rename the single-letter kernel dimension `I` to a
descriptive name (e.g., `I_dim` or `INPUT_SIZE`) both in the Triton kernel
signature (the `I: tl.constexpr` parameter) and in the wrapper/local variables
that reference it so Ruff E741 is resolved; update every usage (pointer
arithmetic, index calculations, and any calls that pass the argument) to the new
name, including the other occurrences noted around the later blocks (the local
wrapper variable and the other kernel signatures/usages), ensuring all
references (e.g., kernel definition, launch invocation, and any local variable
named `I`) are consistently renamed.
- Around line 266-343: Replace the fragile asserts in _run_fused_moe_kernels
(and validate in fused_moe caller) with explicit runtime checks that raise clear
exceptions: verify w13 and w2 are 3D tensors, check w13.shape[1] % 2 == 0 and
compute I = w13.shape[1] // 2 only after that, ensure w13.shape[2] == H and
w2.shape == (E, H, I), confirm topk_ids and topk_weights are 2D with identical
shapes and that topk_ids.shape[0] == M; raise ValueError (or TypeError) with
descriptive messages naming the offending tensor (w13, w2, topk_ids,
topk_weights) so kernel launches fail fast with clear Python errors.
In `@diffulex/engine/dllm_block.py`:
- Around line 36-47: The file defines __getstate__ and __setstate__ multiple
times which causes the later/older definitions to shadow the new weakref-based
serialization path; consolidate to a single pair of methods that implement the
weakref handling: keep the implementations that use weakref_fn and convert
s['_req'] and s['_dllm_block_buffer'] to/from weak references, remove or merge
the older duplicate __getstate__/__setstate__ definitions so only one canonical
implementation remains, and ensure the final __setstate__/__getstate__ reference
the weakref_fn helper and the _req and _dllm_block_buffer attributes
consistently.
In `@diffulex/moe/fused_moe.py`:
- Around line 29-38: Constructor currently accepts arbitrary hidden_act causing
instantiation of a fused MoE that only supports "silu"; update the validation to
fail fast by checking hidden_act in __init__ (and the alternate
constructor/loader referenced at lines ~148-156, e.g., from_config or similar
factory) and raise a clear ValueError if hidden_act != "silu" that points users
to use the unfused MoE block instead; ensure the check is implemented early in
the FusedMoE initialization path (reference symbols: __init__, hidden_act,
from_config) so unsupported configs never create a fused instance.
In `@test/python/kernel/test_fused_moe.py`:
- Around line 12-18: Rename the ambiguous dimension name `I` to a descriptive
name (e.g., `intermediate_size` or `intermediate_dim`) everywhere in this test
module: update the function signature of fused_moe_pytorch_reference (change
comments and type hints for w13 and w2), update usages inside
fused_moe_pytorch_reference, update the helper function `_run_test` and any
local variables or test locals that use `I` (including the later locals around
lines 58-69) so all occurrences are consistently renamed and lint E741 is
resolved.
- Around line 375-395: The test named test_determinism does not actually
guarantee determinism because top_k=2 allows atomic_add race-induced FP32
variations; update the test to either set top_k=1 (change the local top_k
variable to 1 so fused_moe runs without expert conflicts and true determinism is
validated) or rename the test (e.g., test_approximation) and its docstring to
reflect it verifies bounded numerical closeness for fused_moe with top_k=2; also
update the test name/docstring and any inline comment accordingly so readers and
CI expectations match the chosen behavior.
---
Nitpick comments:
In `@diffulex_kernel/__init__.py`:
- Around line 42-50: The lazy-export surface is inconsistent: __getattr__
exposes fused_topk and fused_moe but __all__ only lists fused_topk (fused_moe is
commented out). Make the exports consistent by either adding "fused_moe" to the
module-level __all__ list (or uncommenting the existing entry) if it should be
public, or remove/deny export in __getattr__ for fused_moe if it should be
private; update the code paths around __getattr__, the fused_topk and fused_moe
import lines, and the __all__ definition so both symbols are treated the same
way.
In `@diffulex/moe/__init__.py`:
- Around line 10-16: The function build_mlp_or_moe currently returns
FusedSparseMoEBlock.from_config(config) unconditionally, making the subsequent
return of SparseMoEBlock.from_config(config) dead code; either remove the
unreachable fallback or implement a real branch that chooses SparseMoEBlock when
FusedSparseMoEBlock is unavailable. Update build_mlp_or_moe to check
availability (e.g., try/except ImportError or a feature flag) before calling
FusedSparseMoEBlock.from_config(config) and only call
SparseMoEBlock.from_config(config) when the fused implementation is not
available, or delete the redundant return line if the fused block is the sole
supported implementation; reference the symbols build_mlp_or_moe,
FusedSparseMoEBlock, SparseMoEBlock, and is_moe_layer when making the change.
In `@diffulex/moe/topk.py`:
- Around line 9-10: The module currently imports fused_topk at top-level causing
an eager dependency; change to lazy-import diffulex_kernel.fused_topk only where
needed: move the import into the code paths that actually call it (e.g., inside
TopKRouter implementation branch that selects the Triton backend and inside the
function that calls fused_topk), so TopKRouter(impl="torch") and
topk_pytorch_reference() can be imported without resolving diffulex_kernel;
ensure you reference fused_topk by importing it locally just before use and keep
existing function/class names (TopKRouter, topk_pytorch_reference, fused_topk)
to locate the spots to modify.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2ad030d3-f366-4b9c-9e61-c3dd60f9c15e
📒 Files selected for processing (15)
diffulex/engine/dllm_block.py
diffulex/mixin/multi_block/engine/request.py
diffulex/model/sdar_moe.py
diffulex/moe/__init__.py
diffulex/moe/fused_moe.py
diffulex/moe/moe_impl.py
diffulex/moe/topk.py
diffulex/sampler/sdar.py
diffulex/utils/loader.py
diffulex_kernel/__init__.py
diffulex_kernel/python/fused_moe_triton.py
diffulex_kernel/python/fused_topk_triton.py
pyproject.toml
test/python/kernel/test_fused_moe.py
test/python/kernel/test_fused_topk.py
diffulex/engine/dllm_block.py
Outdated
```python
def __getstate__(self):
    s = self.__dict__.copy()
    s['_req'] = s['_req']()
    if '_dllm_block_buffer' in s:
        s['_dllm_block_buffer'] = s['_dllm_block_buffer']()
    return s

def __setstate__(self, state):
    s = self.__dict__ = state.copy()
    s['_req'] = weakref_fn(s['_req'])
    if '_dllm_block_buffer' in s:
        s['_dllm_block_buffer'] = weakref_fn(s['_dllm_block_buffer'])
```
These new pickle hooks are shadowed by the older ones below.
Lines 73-80 and 223-229 redefine __getstate__/__setstate__, so Python drops the versions added here. That makes the new weakref rehydration path unreachable and leaves two conflicting serialization implementations in the same file.
Also applies to: 198-205
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@diffulex/engine/dllm_block.py` around lines 36 - 47, The file defines
__getstate__ and __setstate__ multiple times which causes the later/older
definitions to shadow the new weakref-based serialization path; consolidate to a
single pair of methods that implement the weakref handling: keep the
implementations that use weakref_fn and convert s['_req'] and
s['_dllm_block_buffer'] to/from weak references, remove or merge the older
duplicate __getstate__/__setstate__ definitions so only one canonical
implementation remains, and ensure the final __setstate__/__getstate__ reference
the weakref_fn helper and the _req and _dllm_block_buffer attributes
consistently.
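The consolidation described above amounts to keeping exactly one weakref-aware hook pair. A self-contained sketch of the pattern (here `weakref_fn`, `Request`, and `Block` are stand-ins for the review's helper and classes, not the actual dllm_block code):

```python
import pickle
import weakref

def weakref_fn(obj):
    """Wrap obj in a weak reference, passing None through
    (stand-in for the weakref_fn helper referenced in the review)."""
    return weakref.ref(obj) if obj is not None else None

class Request:
    pass

class Block:
    def __init__(self, req):
        self._req = weakref_fn(req)
        self.payload = [1, 2, 3]

    def __getstate__(self):
        # Single canonical hook: dereference weakrefs so pickle sees
        # plain objects (weakref.ref itself is not picklable).
        s = self.__dict__.copy()
        s['_req'] = s['_req']() if s['_req'] is not None else None
        return s

    def __setstate__(self, state):
        # Re-wrap the rehydrated strong reference as a weakref on load.
        s = self.__dict__ = state.copy()
        s['_req'] = weakref_fn(s['_req'])
```

Because Python keeps only the last definition of a method name in a class body, a second `__getstate__`/`__setstate__` pair later in the file silently replaces this one, which is exactly the shadowing bug the comment flags.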
diffulex/moe/fused_moe.py
Outdated
```python
def __init__(
    self,
    hidden_size: int,
    intermediate_size: int,
    num_experts: int,
    top_k: int,
    *,
    hidden_act: str = "silu",
    norm_topk_prob: bool = True,
) -> None:
```
🛠️ Refactor suggestion | 🟠 Major
Fail fast on unsupported activations.
from_config() forwards arbitrary config.hidden_act, but the backend currently only supports "silu". Right now an unsupported model will instantiate fine and then abort on its first forward. Guard it here or route those configs to the unfused MoE block.
♻️ Suggested change

```diff
     self.num_experts = num_experts
     self.top_k = top_k
     self.norm_topk_prob = norm_topk_prob
-    self.hidden_act = hidden_act
+    if hidden_act != "silu":
+        raise ValueError(
+            "FusedSparseMoEBlock only supports hidden_act='silu'"
+        )
+    self.hidden_act = hidden_act
```

Also applies to: 148-156
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@diffulex/moe/fused_moe.py` around lines 29 - 38, Constructor currently
accepts arbitrary hidden_act causing instantiation of a fused MoE that only
supports "silu"; update the validation to fail fast by checking hidden_act in
__init__ (and the alternate constructor/loader referenced at lines ~148-156,
e.g., from_config or similar factory) and raise a clear ValueError if hidden_act
!= "silu" that points users to use the unfused MoE block instead; ensure the
check is implemented early in the FusedMoE initialization path (reference
symbols: __init__, hidden_act, from_config) so unsupported configs never create
a fused instance.
test/python/kernel/test_fused_moe.py
Outdated
```python
def fused_moe_pytorch_reference(
    hidden_states: torch.Tensor,  # (M, H)
    w13: torch.Tensor,            # (E, 2*I, H)
    w2: torch.Tensor,             # (E, H, I)
    topk_weights: torch.Tensor,   # (M, top_k)
    topk_ids: torch.Tensor,       # (M, top_k)
    top_k: int,
```
Rename I across this file.
Ruff is already flagging every I here as E741, so the new test module will stay lint-red until the dimension name is expanded (intermediate_size, intermediate_dim, etc.). The same rename needs to be applied in _run_test and the later test locals.
Also applies to: 58-69
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/python/kernel/test_fused_moe.py` around lines 12 - 18, Rename the
ambiguous dimension name `I` to a descriptive name (e.g., `intermediate_size` or
`intermediate_dim`) everywhere in this test module: update the function
signature of fused_moe_pytorch_reference (change comments and type hints for w13
and w2), update usages inside fused_moe_pytorch_reference, update the helper
function `_run_test` and any local variables or test locals that use `I`
(including the later locals around lines 58-69) so all occurrences are
consistently renamed and lint E741 is resolved.
test/python/kernel/test_fused_moe.py
Outdated
```python
def test_determinism():
    """Same inputs → same outputs across two runs."""
    device = "cuda"
    M, H, I, E, top_k = 64, 128, 64, 8, 2

    torch.manual_seed(42)
    hidden_states = torch.randn(M, H, device=device) * 0.1
    w13 = torch.randn(E, 2 * I, H, device=device) * 0.1
    w2 = torch.randn(E, H, I, device=device) * 0.1
    router_logits = torch.randn(M, E, device=device)
    scores = torch.softmax(router_logits, dim=-1)
    topk_weights, topk_ids = torch.topk(scores, top_k, dim=-1)
    topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    topk_ids = topk_ids.to(torch.int32)

    out1 = fused_moe(hidden_states, w13, w2, topk_weights, topk_ids, top_k)
    out2 = fused_moe(hidden_states, w13, w2, topk_weights, topk_ids, top_k)

    # Note: atomic_add may have non-deterministic ordering for fp32,
    # but values should still be very close
    torch.testing.assert_close(out1, out2, atol=1e-5, rtol=1e-5)
```
🧩 Analysis chain
🏁 Script executed:

```shell
find . -type f -name "*.py" | grep -E "(fused_moe|test_fused)" | head -20
```

Repository: SJTU-DENG-Lab/Diffulex
Length of output: 217

🏁 Script executed:

```shell
find . -type f -name "*fused_moe*" | head -20
```

Repository: SJTU-DENG-Lab/Diffulex
Length of output: 177

🏁 Script executed:

```shell
cat -n ./diffulex_kernel/python/fused_moe_triton.py | head -150
```

Repository: SJTU-DENG-Lab/Diffulex
Length of output: 6333

🏁 Script executed:

```shell
rg "atomic_add" ./diffulex_kernel/python/fused_moe_triton.py -A 3 -B 3
```

Repository: SJTU-DENG-Lab/Diffulex
Length of output: 351

🏁 Script executed:

```shell
cat -n ./test/python/kernel/test_fused_moe.py | sed -n '370,400p'
```

Repository: SJTU-DENG-Lab/Diffulex
Length of output: 1621

🏁 Script executed:

```shell
cat -n ./diffulex/moe/fused_moe.py | head -100
```

Repository: SJTU-DENG-Lab/Diffulex
Length of output: 3809

🏁 Script executed:

```shell
rg "determinism|atomic" ./test/python/kernel/test_fused_moe.py -B 2 -A 2
```

Repository: SJTU-DENG-Lab/Diffulex
Length of output: 497

🏁 Script executed:

```shell
wc -l ./diffulex_kernel/python/fused_moe_triton.py
```

Repository: SJTU-DENG-Lab/Diffulex
Length of output: 114

🏁 Script executed:

```shell
cat -n ./diffulex_kernel/python/fused_moe_triton.py | sed -n '150,180p'
```

Repository: SJTU-DENG-Lab/Diffulex
Length of output: 1515

🏁 Script executed:

```shell
rg "top_k" ./test/python/kernel/test_fused_moe.py | head -20
```

Repository: SJTU-DENG-Lab/Diffulex
Length of output: 960

🏁 Script executed:

```shell
rg "deterministic|CUDA_LAUNCH_BLOCKING" . --type py -l
```

Repository: SJTU-DENG-Lab/Diffulex
Length of output: 216

🏁 Script executed:

```shell
cat -n ./diffulex_kernel/python/fused_moe_triton.py | sed -n '175,185p'
```

Repository: SJTU-DENG-Lab/Diffulex
Length of output: 533
Rename or adjust this test—it does not validate determinism.
With top_k=2 and tl.atomic_add(), thread ordering is non-deterministic on GPU, causing rounding variations. The test mitigates this with relaxed tolerances (atol=1e-5, rtol=1e-5), making it a bounded-approximation test, not a determinism test. Either:
- Use top_k=1 to eliminate expert conflicts and validate actual determinism, or
- Rename to test_approximation() or similar to reflect what it actually validates.
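The underlying issue is that floating-point addition is not associative, so any accumulation whose ordering varies between runs (as with `tl.atomic_add` when top_k > 1 routes two expert contributions to the same output row) can change the result in the last bits. A minimal demonstration with plain Python floats:

```python
# Float addition is not associative: a kernel that accumulates partial
# expert outputs with atomics in arbitrary order can therefore produce
# run-to-run differences at the level of rounding error.
a, b, c = 1e16, -1e16, 1.0
left = (a + b) + c   # cancellation first, then add 1.0
right = a + (b + c)  # 1.0 is absorbed by -1e16 before the cancellation
print(left, right)   # left and right differ
```

Here `left` is 1.0 while `right` is 0.0, because `-1e16 + 1.0` rounds back to `-1e16` at that magnitude; the same mechanism, at much smaller scale, is why the test needs tolerances instead of exact equality.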
🧰 Tools
🪛 Ruff (0.15.7)
[error] 378-378: Ambiguous variable name: I
(E741)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/python/kernel/test_fused_moe.py` around lines 375 - 395, The test named
test_determinism does not actually guarantee determinism because top_k=2 allows
atomic_add race-induced FP32 variations; update the test to either set top_k=1
(change the local top_k variable to 1 so fused_moe runs without expert conflicts
and true determinism is validated) or rename the test (e.g., test_approximation)
and its docstring to reflect it verifies bounded numerical closeness for
fused_moe with top_k=2; also update the test name/docstring and any inline
comment accordingly so readers and CI expectations match the chosen behavior.
…oading

- model-specific param remapping is not enough to support MoE weight repacking
- provides support for distinguishing between TP and EP world sizes and ranks
- add trivial and a2a token dispatcher for EP
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
diffulex/moe/topk/trivial.py (1)
21-22: ⚠️ Potential issue | 🟡 Minor

Potential division by zero in renormalization.

If topk_weights sums to zero (possible with extreme numerical edge cases), this division produces NaN. Consider adding a small epsilon for robustness, consistent with the Triton kernel which uses tl.maximum(selected_sum, 1e-20).

Suggested fix

```diff
 if self.renormalize:
-    topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
+    topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True).clamp(min=1e-20)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@diffulex/moe/topk/trivial.py` around lines 21 - 22, The renormalization step in trivial.py can divide by zero when topk_weights.sum(...) == 0; modify the block in the method where self.renormalize is used to compute a safe denominator (e.g., denom = topk_weights.sum(dim=-1, keepdim=True).clamp_min(1e-20) or torch.maximum(..., torch.tensor(1e-20, device=topk_weights.device))) and then do topk_weights = topk_weights / denom so the division is robust to zero sums (use 1e-20 to match the Triton kernel).
♻️ Duplicate comments (1)
diffulex_kernel/python/fused_moe_triton.py (1)
41-51: ⚠️ Potential issue | 🟡 Minor

Missing validation: topk_ids.shape[0] must equal hidden_states.shape[0].

The validation checks that topk_ids and topk_weights have matching shapes, but does not verify that the number of tokens in topk_ids matches hidden_states. A mismatch would cause incorrect indexing in the kernel.

🛡️ Proposed fix

```diff
 if topk_ids.shape != topk_weights.shape:
     raise ValueError(
         f"topk_ids and topk_weights must have the same shape, got {topk_ids.shape} and {topk_weights.shape}."
     )
+if topk_ids.shape[0] != hidden_states.shape[0]:
+    raise ValueError(
+        f"topk_ids must have one row per token, got {topk_ids.shape[0]} rows but {hidden_states.shape[0]} tokens."
+    )
 if w13.shape[0] != w2.shape[0] or w13.shape[2] != hidden_states.shape[1]:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@diffulex_kernel/python/fused_moe_triton.py` around lines 41 - 51, The code validates shapes of topk_ids/topk_weights and weight matrices but misses verifying that the number of tokens in topk_ids matches hidden_states, which can break indexing; add a validation after the existing topk_ids/topk_weights check that asserts topk_ids.shape[0] == hidden_states.shape[0] and raises a ValueError with a clear message referencing topk_ids.shape and hidden_states.shape so callers can see the mismatch (place this check alongside the other shape validations in the same block inside fused_moe_triton where topk_ids, topk_weights, and hidden_states are validated).
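Taken together with the earlier assert-replacement comment, the fail-fast validation can be collected into one helper. A sketch under the review's tensor-naming conventions (the helper name and exact messages are illustrative, not the PR's code):

```python
import torch

def validate_fused_moe_inputs(hidden_states, w13, w2, topk_weights, topk_ids):
    """Explicit shape checks that raise ValueError instead of assert,
    so they survive `python -O` and fail before any kernel launch."""
    M, H = hidden_states.shape
    if w13.dim() != 3 or w2.dim() != 3:
        raise ValueError(
            f"w13 and w2 must be 3D, got {w13.dim()}D and {w2.dim()}D")
    if w13.shape[1] % 2 != 0:
        raise ValueError(
            f"w13.shape[1] must be even (gate+up packed), got {w13.shape[1]}")
    E, two_i, h = w13.shape
    inter = two_i // 2  # intermediate size, derived only after the parity check
    if h != H:
        raise ValueError(f"w13 hidden dim {h} != hidden_states hidden dim {H}")
    if tuple(w2.shape) != (E, H, inter):
        raise ValueError(f"w2 must be {(E, H, inter)}, got {tuple(w2.shape)}")
    if topk_ids.shape != topk_weights.shape:
        raise ValueError("topk_ids and topk_weights must share a shape")
    if topk_ids.shape[0] != M:
        raise ValueError(
            f"topk_ids has {topk_ids.shape[0]} rows for {M} tokens")
```

Calling this at the top of the wrapper turns an out-of-bounds kernel launch into a readable Python exception naming the offending tensor.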
🧹 Nitpick comments (16)
diffulex/moe/topk/bypass.py (2)
7-8: Typo in docstring.

Minor: "implemenation" → "implementation".
✏️ Proposed fix
```diff
 class BypassTopKRouter(TopKRouter):
-    """Bypass implemenation, use this if fused moe runner also handles topk"""
+    """Bypass implementation, use this if fused moe runner also handles topk."""
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@diffulex/moe/topk/bypass.py` around lines 7 - 8, Fix the typo in the BypassTopKRouter class docstring: change "Bypass implemenation, use this if fused moe runner also handles topk" to "Bypass implementation, use this if fused moe runner also handles topk" in the BypassTopKRouter (subclass of TopKRouter) docstring.
10-15: Ensure correct router/runner pairing at runtime.

This bypass router returns weights=None and ids=None, which will cause TritonFusedMoERunner to raise a RuntimeError (per diffulex/moe/runner/triton.py:10-14). The design is intentional for runners that handle top-k internally, but consider adding a note in the docstring about compatible runners to prevent misconfiguration.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@diffulex/moe/topk/bypass.py` around lines 10 - 15, The bypass router's forward currently returns weights=None and ids=None which is intentional for routers that let runners perform top-k, but it lacks documentation and can lead to a confusing RuntimeError when paired with runners that expect ids/weights (see TritonFusedMoERunner). Update the forward method's docstring in the bypass router to clearly state that returning None for weights and ids is intentional and that this router must only be used with runners that implement internal top-k (e.g., TritonFusedMoERunner-like runners); additionally, add an optional runtime compatibility check in forward or router initialization to raise a clear error message if paired with an incompatible runner instead of allowing the downstream RuntimeError.

diffulex/utils/loader.py (2)
157-169: Consider using direct attribute access instead of getattr.

Static analysis (B009) flags getattr(param, "weight_loader") with a constant attribute. Since the missing attribute case is already handled by the try/except block, you could use direct attribute access for clarity.

♻️ Proposed fix

```diff
 try:
     param = model.get_parameter(param_name)
-    weight_loader = partial(
-        getattr(param, "weight_loader"),
-        param,
-        loaded_weight,
-    )
+    weight_loader = partial(param.weight_loader, param, loaded_weight)
     if shard_id is None:
         weight_loader()
     else:
         weight_loader(shard_id)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@diffulex/utils/loader.py` around lines 157 - 169, The code uses getattr(param, "weight_loader") inside loader logic though AttributeError/KeyError is already caught; change to direct attribute access param.weight_loader when building the partial (in the block that calls model.get_parameter and constructs weight_loader) to satisfy static analysis and improve clarity while keeping the same behavior — continue to wrap it in the existing try/except, create the partial with param.weight_loader, param, loaded_weight, and then call weight_loader() or weight_loader(shard_id) depending on shard_id.
139-139: Prefer explicit exception over assert for runtime validation.

Using assert for runtime validation can be disabled with the -O flag. Consider raising ValueError for clearer error handling.

🛡️ Proposed fix

```diff
-assert v == "lm_head"
+if v != "lm_head":
+    raise ValueError(f"Expected 'lm_head' for transformer.ff_out mapping, got '{v}'")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@diffulex/utils/loader.py` at line 139, Replace the runtime assert "assert v == 'lm_head'" with an explicit check that raises a ValueError when the condition fails; specifically, change it to: if v != "lm_head": raise ValueError(f"Expected 'lm_head' for variable v, got {v!r}"). This ensures the validation cannot be disabled with -O and provides a clear, descriptive error message including the actual value.

diffulex/moe/topk/triton.py (1)
2-2: Remove unused import.
torch.nn.functional as F is imported but never used in this file.

🧹 Proposed fix

```diff
 import torch
-import torch.nn.functional as F
 from diffulex.moe.topk.base import TopKRouter
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@diffulex/moe/topk/triton.py` at line 2, Remove the unused import "import torch.nn.functional as F" from triton.py; locate the import statement in the top of the file (the symbol to remove is "torch.nn.functional as F") and delete it so the module no longer includes an unused dependency and to satisfy linting.

diffulex/moe/runner/__init__.py (2)
19-24: Consider sorting __all__ for consistency.

Per static analysis (RUF022), __all__ is not sorted. Sorting improves readability and makes diffs cleaner when adding new exports.

🧹 Proposed fix

```diff
 __all__ = [
     "MoERunner",
-    "TrivialMoERunner",
     "TritonFusedMoERunner",
+    "TrivialMoERunner",
     "build_runner",
 ]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@diffulex/moe/runner/__init__.py` around lines 19 - 24, The __all__ export list is unsorted; reorder the list assigned to __all__ so its string entries are in lexicographical order (e.g., place "MoERunner", "TritonFusedMoERunner", "TrivialMoERunner", "build_runner" sorted appropriately), updating the __all__ assignment to the sorted sequence to satisfy RUF022 and keep exports consistent.
15-16: Include the invalid impl value in the error message.

When an unsupported backend is requested, the error should indicate what value was passed for easier debugging. This is consistent with the error message pattern in build_dispatcher.

🔧 Proposed fix

```diff
 elif impl == "triton":
     return TritonFusedMoERunner(*args, **kwargs)
 else:
-    raise NotImplementedError
+    raise NotImplementedError(f"Unsupported runner backend: {impl!r}")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@diffulex/moe/runner/__init__.py` around lines 15 - 16, The else branch raising NotImplementedError in __init__.py should include the actual invalid impl value for clearer debugging; update the error raised in that branch (the raise in the same block that handles backend selection and mirrors build_dispatcher's pattern) to include impl in the message (e.g., "Unsupported impl: {impl}") so callers see what value was passed.

diffulex/moe/runner/trivial.py (1)
1-5: Missing torch import.

The file uses dispatch_output.hidden_states.new_zeros(...), which works via the tensor method, but adding import torch would be more explicit and consistent with other files.

Suggested addition

```diff
 from __future__ import annotations

+import torch
 import torch.nn.functional as F
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@diffulex/moe/runner/trivial.py` around lines 1 - 5, The module is missing an explicit torch import used when creating new tensors (e.g., dispatch_output.hidden_states.new_zeros(...)); add "import torch" to trivial.py so tensor creation is explicit and consistent with other modules that use torch, and ensure any references in MoERunner, CombineInput, or DispatchOutput code that rely on torch remain unchanged.

diffulex/moe/topk/__init__.py (2)
19-20: Improve error message for unsupported implementations.

The `NotImplementedError` should include the invalid `impl` value and available options for easier debugging.

Suggested improvement

```diff
     else:
-        raise NotImplementedError
+        raise NotImplementedError(
+            f"Unknown top-k router implementation: {impl!r}. "
+            f"Supported: 'trivial', 'bypass', 'triton'."
+        )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@diffulex/moe/topk/__init__.py` around lines 19 - 20, The current else branch raises a bare NotImplementedError; update the else in diffulex.moe.topk.__init__.py to raise a NotImplementedError that includes the invalid impl value and the list of supported implementations (e.g., f"Unsupported impl '{impl}'; supported: {supported_list}") so callers can see what was passed and what options exist; locate the else that currently does raise NotImplementedError and replace it with an informative message referencing the impl variable and the available implementations constant or keys from the dispatch map used in this module.
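One way to keep the message and the supported list from drifting apart is a registry-based factory. A sketch under assumptions: the dispatch map and the stub classes below are hypothetical stand-ins, not the module's actual structure.

```python
# Stand-in router classes; the real ones live in diffulex/moe/topk/.
class TrivialTopKRouter: ...
class BypassTopKRouter: ...
class TritonFusedTopKRouter: ...

# Hypothetical registry: the error message is derived from its keys,
# so adding a backend automatically updates the message.
_ROUTERS = {
    "trivial": TrivialTopKRouter,
    "bypass": BypassTopKRouter,
    "triton": TritonFusedTopKRouter,
}

def build_topk_router(impl: str, *args, **kwargs):
    try:
        cls = _ROUTERS[impl]
    except KeyError:
        raise NotImplementedError(
            f"Unknown top-k router implementation: {impl!r}. "
            f"Supported: {sorted(_ROUTERS)}."
        ) from None
    return cls(*args, **kwargs)
```

With this shape, a typo such as `build_topk_router("cuda")` fails with a message that names both the bad value and every valid option.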
23-30: Consider sorting `__all__` for consistency.

Ruff flagged that `__all__` is not sorted (RUF022). While optional, alphabetical ordering improves maintainability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@diffulex/moe/topk/__init__.py` around lines 23 - 30, The __all__ list in this module is not alphabetically sorted (RUF022); reorder the entries in the __all__ variable so they are in alphabetical order (e.g., "BypassTopKRouter", "TopKOutput", "TopKRouter", "TritonFusedTopKRouter", "TrivialTopKRouter", "build_topk_router") to satisfy the lint rule and improve maintainability; update the __all__ definition accordingly.

diffulex/engine/model_runner.py (1)
60-61: Redundant assignments overwritten later.

`self.world_size` and `self.rank` are assigned here from config values, but they are immediately overwritten at lines 84-85 from the `layout` object. Consider removing these initial assignments to avoid confusion.

Suggested cleanup

```diff
-        self.world_size = config.tensor_parallel_size
-        self.rank = rank
         self.event = event
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@diffulex/engine/model_runner.py` around lines 60 - 61, The assignments to self.world_size and self.rank from config at the start of ModelRunner.__init__ are redundant because they are overwritten later from layout; remove the initial assignments (self.world_size = config.tensor_parallel_size and self.rank = rank) to avoid confusion and keep the single authoritative source (layout) — update or delete those two lines in ModelRunner.__init__ and ensure any subsequent logic relies on the values set from layout rather than the earlier config-derived values.

diffulex/moe/runner/base.py (1)
36-38: Weight tensors not registered as parameters.

`w13` and `w2` are stored as plain attributes rather than `nn.Parameter` or registered buffers. This means they won't appear in `state_dict()` or be moved by `.to()`. If this is intentional (e.g., weights managed externally by the loader), consider adding a brief comment explaining the design choice.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@diffulex/moe/runner/base.py` around lines 36 - 38, w13 and w2 are being stored as plain attributes (self.w13, self.w2) so they won't be included in state_dict() or moved with .to(); update the class initializer to register them properly: if these are trainable weights wrap them with torch.nn.Parameter and assign to self.w13/self.w2, otherwise register them as buffers via self.register_buffer('w13', w13) / self.register_buffer('w2', w2); if the current design intentionally keeps them external, add a short comment next to local_expert_start/w13/w2 explaining that these tensors are externally managed and intentionally not registered so reviewers understand the choice.

diffulex_kernel/python/fused_topk_triton.py (1)
48-63: Output validation adds safety at runtime cost.

The `_validate_fused_topk_outputs` function performs GPU-to-CPU synchronization (`detach().cpu()`) on every call, which can impact performance. Consider making this validation optional (e.g., via an environment variable or debug flag) for production use.

Suggested change

```diff
+import os
+
+_VALIDATE_OUTPUTS = os.environ.get("DIFFULEX_VALIDATE_TOPK_OUTPUTS", "0") == "1"
+
 def _validate_fused_topk_outputs(
     topk_weights: torch.Tensor,
     topk_ids: torch.Tensor,
     *,
     num_experts: int,
 ) -> None:
+    if not _VALIDATE_OUTPUTS:
+        return
     invalid_id_mask = (topk_ids < 0) | (topk_ids >= num_experts)
     ...
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@diffulex_kernel/python/fused_topk_triton.py` around lines 48 - 63, The runtime validation in _validate_fused_topk_outputs causes GPU->CPU sync (detach().cpu()) on every call; make this validation conditional by introducing a toggle (e.g., module-level boolean FUSED_TOPK_VALIDATE_OUTPUTS set from an environment variable like "FUSED_TOPK_VALIDATE_OUTPUTS" or a debug flag) and only perform the invalid_id_mask check, detach().cpu(), and isfinite checks when that toggle is true; update the function to early-return if validation is disabled and ensure the env-var default is disabled in production but can be enabled for testing/debugging.

diffulex/moe/dispatcher/trivial.py (1)
67-70: Strict expert ID matching may be overly restrictive.

The current logic raises if `active_expert_ids` doesn't exactly match the expected list. Consider allowing a subset for scenarios where only specific experts need processing (e.g., during debugging or selective expert evaluation).

Alternative: allow subset

```diff
     expert_ids = list(active_expert_ids)
-    if expert_ids != expected:
-        raise ValueError(f"Expected active_expert_ids={expected}, got {expert_ids}.")
+    if not set(expert_ids).issubset(set(expected)):
+        raise ValueError(
+            f"active_expert_ids must be a subset of {expected}, got {expert_ids}."
+        )
     return expert_ids
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@diffulex/moe/dispatcher/trivial.py` around lines 67 - 70, Change the strict equality check between active_expert_ids and expected to allow expected to be a subset: compute expert_ids = list(active_expert_ids), then validate with something like if expected and not set(expected).issubset(set(expert_ids)): raise ValueError(...); finally return only the relevant experts (e.g., filter expert_ids to those in expected if expected is provided, otherwise return all expert_ids). Update the error message in the raise to reflect that expected must be a subset when applicable; reference variables active_expert_ids, expected, expert_ids and the existing ValueError raise site.

diffulex_kernel/python/fused_moe_triton.py (2)
131-132: Early return shape may be inconsistent with weight dimensions.

When `hidden_states.shape[0] == 0`, the function returns `hidden_states.new_zeros((0, w2.shape[1]))`. However, `w2.shape[1]` is `hidden_size` (the output dimension of down_proj). This is correct, and when `topk_ids.shape[1] == 0` (zero top-k), returning zeros is also semantically correct since no experts contribute.

Consider adding a comment clarifying that `w2.shape[1] == hidden_size` to improve readability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@diffulex_kernel/python/fused_moe_triton.py` around lines 131 - 132, The early-return when hidden_states.shape[0] == 0 or topk_ids.shape[1] == 0 returns a zero tensor using w2.shape[1]; clarify this by adding an inline comment in fused_moe_triton.py next to the return explaining that w2.shape[1] is the model hidden_size (the output dim of down_proj), and that returning zeros is correct when there are no tokens or no top-k experts; keep the existing return value but add the comment referencing hidden_states, topk_ids, and w2 for readability.
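A quick shape check of the early-return pattern, using toy dimensions (the shapes here are made up for illustration; only the relationship `w2.shape[1] == hidden_size` is taken from the comment above):

```python
import torch

hidden_states = torch.randn(0, 16)  # zero tokens, hidden_size = 16
w2 = torch.randn(64, 16)            # toy down_proj weight; w2.shape[1] == hidden_size

# new_zeros inherits dtype and device from hidden_states, so the empty
# early return matches the shape/dtype the caller expects downstream.
out = hidden_states.new_zeros((0, w2.shape[1]))
print(tuple(out.shape))  # (0, 16)
```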
101-112: 3D tensor load may cause high register pressure.

The weight load creates a `(BLOCK_M, BLOCK_N, BLOCK_K)` tensor in registers before reduction. With `BLOCK_M=8, BLOCK_N=64, BLOCK_K=32`, this allocates 16K elements per thread block. This approach works but may limit occupancy on memory-bound workloads.

Consider whether a more traditional 2D tiled GEMM with per-row expert lookup would reduce register pressure, though this is acceptable for a draft implementation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@diffulex_kernel/python/fused_moe_triton.py` around lines 101 - 112, The current tl.load builds a full (BLOCK_M, BLOCK_N, BLOCK_K) tensor (see the load using w_ptr + expert_offsets... + current_k * stride_wk and mask combining row_mask, valid_expert, offs_n, k_mask) which raises register pressure; change the implementation to tile along the K dimension and load smaller 2D slices instead (e.g., loop over K sub-blocks: load (BLOCK_M, BLOCK_K) or (BLOCK_N, BLOCK_K) chunks per iteration, multiply-accumulate into acc incrementally, and only keep a small b-slice in registers per loop) so you perform partial reductions across K rather than materializing the full 3D block in registers. Ensure masks (row_mask, valid_expert, offs_n < num_cols, k_mask) and strides (stride_we, stride_wn, stride_wk) are applied per sub-load and that the accumulation acc += tl.sum(a[:, None, :] * b, axis=2) still operates on the smaller b-slice.
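The K-tiling idea can be sketched outside Triton with NumPy (toy shapes; the real kernel would additionally fold in the per-row expert lookup and the masks named above):

```python
import numpy as np

M, N, K, BLOCK_K = 8, 64, 96, 32
rng = np.random.default_rng(0)
a = rng.standard_normal((M, K)).astype(np.float32)
b = rng.standard_normal((K, N)).astype(np.float32)

# Accumulate partial products over K sub-blocks instead of materializing
# one large (M, N, K) intermediate, which is what drives register pressure.
acc = np.zeros((M, N), dtype=np.float32)
for k0 in range(0, K, BLOCK_K):
    acc += a[:, k0:k0 + BLOCK_K] @ b[k0:k0 + BLOCK_K, :]

assert np.allclose(acc, a @ b, atol=1e-4)
```

Each iteration only needs one `(M, BLOCK_K)` and one `(BLOCK_K, N)` slice live at a time, which is the register-pressure win the comment describes.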
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@diffulex/utils/loader.py`:
- Line 1: The file contains a UTF-8 BOM at the start (before the first token
"import os"); remove the leading BOM character so the file is plain UTF-8
without BOM and re-save loader.py (ensure your editor/IDE writes UTF-8 without
BOM) so the module import line ("import os") and the rest of the file no longer
include the hidden character.
---
Outside diff comments:
In `@diffulex/moe/topk/trivial.py`:
- Around line 21-22: The renormalization step in trivial.py can divide by zero
when topk_weights.sum(...) == 0; modify the block in the method where
self.renormalize is used to compute a safe denominator (e.g., denom =
topk_weights.sum(dim=-1, keepdim=True).clamp_min(1e-20) or torch.maximum(...,
torch.tensor(1e-20, device=topk_weights.device))) and then do topk_weights =
topk_weights / denom so the division is robust to zero sums (use 1e-20 to match
the Triton kernel).
---
Duplicate comments:
In `@diffulex_kernel/python/fused_moe_triton.py`:
- Around line 41-51: The code validates shapes of topk_ids/topk_weights and
weight matrices but misses verifying that the number of tokens in topk_ids
matches hidden_states, which can break indexing; add a validation after the
existing topk_ids/topk_weights check that asserts topk_ids.shape[0] ==
hidden_states.shape[0] and raises a ValueError with a clear message referencing
topk_ids.shape and hidden_states.shape so callers can see the mismatch (place
this check alongside the other shape validations in the same block inside
fused_moe_triton where topk_ids, topk_weights, and hidden_states are validated).
---
Nitpick comments:
In `@diffulex_kernel/python/fused_moe_triton.py`:
- Around line 131-132: The early-return when hidden_states.shape[0] == 0 or
topk_ids.shape[1] == 0 returns a zero tensor using w2.shape[1]; clarify this by
adding an inline comment in fused_moe_triton.py next to the return explaining
that w2.shape[1] is the model hidden_size (the output dim of down_proj), and
that returning zeros is correct when there are no tokens or no top-k experts;
keep the existing return value but add the comment referencing hidden_states,
topk_ids, and w2 for readability.
- Around line 101-112: The current tl.load builds a full (BLOCK_M, BLOCK_N,
BLOCK_K) tensor (see the load using w_ptr + expert_offsets... + current_k *
stride_wk and mask combining row_mask, valid_expert, offs_n, k_mask) which
raises register pressure; change the implementation to tile along the K
dimension and load smaller 2D slices instead (e.g., loop over K sub-blocks: load
(BLOCK_M, BLOCK_K) or (BLOCK_N, BLOCK_K) chunks per iteration,
multiply-accumulate into acc incrementally, and only keep a small b-slice in
registers per loop) so you perform partial reductions across K rather than
materializing the full 3D block in registers. Ensure masks (row_mask,
valid_expert, offs_n < num_cols, k_mask) and strides (stride_we, stride_wn,
stride_wk) are applied per sub-load and that the accumulation acc += tl.sum(a[:,
None, :] * b, axis=2) still operates on the smaller b-slice.
In `@diffulex_kernel/python/fused_topk_triton.py`:
- Around line 48-63: The runtime validation in _validate_fused_topk_outputs
causes GPU->CPU sync (detach().cpu()) on every call; make this validation
conditional by introducing a toggle (e.g., module-level boolean
FUSED_TOPK_VALIDATE_OUTPUTS set from an environment variable like
"FUSED_TOPK_VALIDATE_OUTPUTS" or a debug flag) and only perform the
invalid_id_mask check, detach().cpu(), and isfinite checks when that toggle is
true; update the function to early-return if validation is disabled and ensure
the env-var default is disabled in production but can be enabled for
testing/debugging.
In `@diffulex/engine/model_runner.py`:
- Around line 60-61: The assignments to self.world_size and self.rank from
config at the start of ModelRunner.__init__ are redundant because they are
overwritten later from layout; remove the initial assignments (self.world_size =
config.tensor_parallel_size and self.rank = rank) to avoid confusion and keep
the single authoritative source (layout) — update or delete those two lines in
ModelRunner.__init__ and ensure any subsequent logic relies on the values set
from layout rather than the earlier config-derived values.
In `@diffulex/moe/dispatcher/trivial.py`:
- Around line 67-70: Change the strict equality check between active_expert_ids
and expected to allow expected to be a subset: compute expert_ids =
list(active_expert_ids), then validate with something like if expected and not
set(expected).issubset(set(expert_ids)): raise ValueError(...); finally return
only the relevant experts (e.g., filter expert_ids to those in expected if
expected is provided, otherwise return all expert_ids). Update the error message
in the raise to reflect that expected must be a subset when applicable;
reference variables active_expert_ids, expected, expert_ids and the existing
ValueError raise site.
In `@diffulex/moe/runner/__init__.py`:
- Around line 19-24: The __all__ export list is unsorted; reorder the list
assigned to __all__ so its string entries are in lexicographical order (e.g.,
place "MoERunner", "TritonFusedMoERunner", "TrivialMoERunner", "build_runner"
sorted appropriately), updating the __all__ assignment to the sorted sequence to
satisfy RUF022 and keep exports consistent.
- Around line 15-16: The else branch raising NotImplementedError in __init__.py
should include the actual invalid impl value for clearer debugging; update the
error raised in that branch (the raise in the same block that handles backend
selection and mirrors build_dispatcher’s pattern) to include impl in the message
(e.g., "Unsupported impl: {impl}") so callers see what value was passed.
In `@diffulex/moe/runner/base.py`:
- Around line 36-38: w13 and w2 are being stored as plain attributes (self.w13,
self.w2) so they won't be included in state_dict() or moved with .to(); update
the class initializer to register them properly: if these are trainable weights
wrap them with torch.nn.Parameter and assign to self.w13/self.w2, otherwise
register them as buffers via self.register_buffer('w13', w13) /
self.register_buffer('w2', w2); if the current design intentionally keeps them
external, add a short comment next to local_expert_start/w13/w2 explaining that
these tensors are externally managed and intentionally not registered so
reviewers understand the choice.
In `@diffulex/moe/runner/trivial.py`:
- Around line 1-5: The module is missing an explicit torch import used when
creating new tensors (e.g., dispatch_output.hidden_states.new_zeros(...)); add
"import torch" to trivial.py so tensor creation is explicit and consistent with
other modules that use torch, and ensure any references in MoERunner,
CombineInput, or DispatchOutput code that rely on torch remain unchanged.
In `@diffulex/moe/topk/__init__.py`:
- Around line 19-20: The current else branch raises a bare NotImplementedError;
update the else in diffulex.moe.topk.__init__.py to raise a NotImplementedError
that includes the invalid impl value and the list of supported implementations
(e.g., f"Unsupported impl '{impl}'; supported: {supported_list}") so callers can
see what was passed and what options exist; locate the else that currently does
raise NotImplementedError and replace it with an informative message referencing
the impl variable and the available implementations constant or keys from the
dispatch map used in this module.
- Around line 23-30: The __all__ list in this module is not alphabetically
sorted (RUF022); reorder the entries in the __all__ variable so they are in
alphabetical order (e.g., "BypassTopKRouter", "TopKOutput", "TopKRouter",
"TritonFusedTopKRouter", "TrivialTopKRouter", "build_topk_router") to satisfy
the lint rule and improve maintainability; update the __all__ definition
accordingly.
In `@diffulex/moe/topk/bypass.py`:
- Around line 7-8: Fix the typo in the BypassTopKRouter class docstring: change
"Bypass implemenation, use this if fused moe runner also handles topk" to
"Bypass implementation, use this if fused moe runner also handles topk" in the
BypassTopKRouter (subclass of TopKRouter) docstring.
- Around line 10-15: The bypass router's forward currently returns weights=None
and ids=None which is intentional for routers that let runners perform top-k,
but it lacks documentation and can lead to a confusing RuntimeError when paired
with runners that expect ids/weights (see TritonFusedMoERunner). Update the
forward method's docstring in the bypass router to clearly state that returning
None for weights and ids is intentional and that this router must only be used
with runners that implement internal top-k (e.g., TritonFusedMoERunner-like
runners); additionally, add an optional runtime compatibility check in forward
or router initialization to raise a clear error message if paired with an
incompatible runner instead of allowing the downstream RuntimeError.
In `@diffulex/moe/topk/triton.py`:
- Line 2: Remove the unused import "import torch.nn.functional as F" from
triton.py; locate the import statement in the top of the file (the symbol to
remove is "torch.nn.functional as F") and delete it so the module no longer
includes an unused dependency and to satisfy linting.
In `@diffulex/utils/loader.py`:
- Around line 157-169: The code uses getattr(param, "weight_loader") inside
loader logic though AttributeError/KeyError is already caught; change to direct
attribute access param.weight_loader when building the partial (in the block
that calls model.get_parameter and constructs weight_loader) to satisfy static
analysis and improve clarity while keeping the same behavior — continue to wrap
it in the existing try/except, create the partial with param.weight_loader,
param, loaded_weight, and then call weight_loader() or weight_loader(shard_id)
depending on shard_id.
- Line 139: Replace the runtime assert "assert v == 'lm_head'" with an explicit
check that raises a ValueError when the condition fails; specifically, change it
to: if v != "lm_head": raise ValueError(f"Expected 'lm_head' for variable v, got
{v!r}"). This ensures the validation cannot be disabled with -O and provides a
clear, descriptive error message including the actual value.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 984d317a-ff55-46a2-a604-ddeb568c046d
📒 Files selected for processing (35)
- .gitignore
- diffulex/config.py
- diffulex/engine/dp_worker.py
- diffulex/engine/model_runner.py
- diffulex/engine/tp_worker.py
- diffulex/layer/embed_head.py
- diffulex/layer/linear.py
- diffulex/model/dream.py
- diffulex/model/fast_dllm_v2.py
- diffulex/model/llada.py
- diffulex/model/sdar.py
- diffulex/moe/__init__.py
- diffulex/moe/dispatcher/__init__.py
- diffulex/moe/dispatcher/base.py
- diffulex/moe/dispatcher/datatype.py
- diffulex/moe/dispatcher/trivial.py
- diffulex/moe/layers.py
- diffulex/moe/moe_impl.py
- diffulex/moe/runner/__init__.py
- diffulex/moe/runner/base.py
- diffulex/moe/runner/triton.py
- diffulex/moe/runner/trivial.py
- diffulex/moe/topk/__init__.py
- diffulex/moe/topk/base.py
- diffulex/moe/topk/bypass.py
- diffulex/moe/topk/datatype.py
- diffulex/moe/topk/triton.py
- diffulex/moe/topk/trivial.py
- diffulex/sampler/sdar.py
- diffulex/utils/checkpoint.py
- diffulex/utils/loader.py
- diffulex/utils/parallelism.py
- diffulex_kernel/__init__.py
- diffulex_kernel/python/fused_moe_triton.py
- diffulex_kernel/python/fused_topk_triton.py
💤 Files with no reviewable changes (1)
- diffulex/moe/moe_impl.py
✅ Files skipped from review due to trivial changes (4)
- .gitignore
- diffulex/moe/topk/datatype.py
- diffulex/layer/embed_head.py
- diffulex/utils/checkpoint.py
🚧 Files skipped from review as they are similar to previous changes (3)
- diffulex/sampler/sdar.py
- diffulex/moe/__init__.py
- diffulex_kernel/__init__.py
```diff
@@ -1,4 +1,4 @@
-import os
+import os
```
Remove BOM character from file.

The file starts with a UTF-8 BOM (byte order mark, U+FEFF), which is unusual for Python source files and can cause issues with some tools. This was also flagged by static analysis (EXE002).

🔧 Proposed fix

```diff
-import os
+import os
```

(The two lines differ only by the invisible leading BOM on the removed line.)

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
import os
```
🧰 Tools
🪛 Ruff (0.15.9)
[warning] 1-1: The file is executable but no shebang is present
(EXE002)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@diffulex/utils/loader.py` at line 1, The file contains a UTF-8 BOM at the
start (before the first token "import os"); remove the leading BOM character so
the file is plain UTF-8 without BOM and re-save loader.py (ensure your
editor/IDE writes UTF-8 without BOM) so the module import line ("import os") and
the rest of the file no longer include the hidden character.
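A small helper for the in-place fix (a hypothetical one-off script, not part of this PR); alternatively, reading the file with `encoding="utf-8-sig"` and rewriting it as plain UTF-8 achieves the same result:

```python
def strip_utf8_bom(path: str) -> bool:
    """Remove a leading UTF-8 BOM (bytes EF BB BF) in place; return True if one was found."""
    with open(path, "rb") as f:
        data = f.read()
    if data.startswith(b"\xef\xbb\xbf"):
        with open(path, "wb") as f:
            f.write(data[3:])
        return True
    return False
```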
Feature:
Summary by CodeRabbit
New Features
Refactor