Conversation
Summary of Changes

Hello @xiaoxi-s, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the flexibility of the Llama3 draft decoder by enabling the configuration of multiple multi-layer perceptrons (MLPs) within each decoder layer. Previously limited to a single MLP, the model can now be instantiated with a user-defined number of MLPs, allowing for more complex and potentially more powerful draft model architectures. The changes include adding a new command-line option for specifying the number of MLPs, updating the core model classes to handle this configuration, and ensuring comprehensive test coverage for the new functionality.
Code Review
This pull request successfully adds support for a multi-layer MLP in the draft decoder, with corresponding changes to the training script and tests. The implementation is consistent and the new functionality is well-tested. I have provided a couple of suggestions to improve the architecture of the multi-layer MLP for better training stability by incorporating residual connections for each MLP block. Additionally, I've pointed out a minor indentation issue in one of the test files.
specforge/modeling/draft/llama3_eagle.py (line 1252)

Using nn.Sequential here creates a deep MLP stack without any intermediate normalization or residual connections. This can lead to training instability, especially if num_draft_hidden_layers is large. A more standard and stable approach is to use nn.ModuleList to treat each MLP as a separate block. I'll add a related suggestion on the forward method to complete this change.

```python
self.mlps = nn.ModuleList([LlamaMLP(config) for _ in range(num_draft_hidden_layers)])
```
specforge/modeling/draft/llama3_eagle.py (lines 1308-1311)

To accompany the change to nn.ModuleList for self.mlps, the forward pass should be updated to loop through the MLPs. This applies a residual connection around each MLP block, which is a more standard and stable architecture than a single deep MLP. This implementation reuses the same post_attention_layernorm for each block.

```python
for mlp in self.mlps:
    residual = hidden_states
    hidden_states = self.post_attention_layernorm(hidden_states)
    hidden_states = mlp(hidden_states)
    hidden_states = residual + hidden_states
```
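To make the suggested architecture concrete, here is a minimal, self-contained sketch of the nn.ModuleList stack with a pre-norm residual around each block. `SimpleMLP` is a simplified SwiGLU-style stand-in for the real LlamaMLP, and the layer sizes are arbitrary illustration values, not the model's actual config:

```python
import torch
import torch.nn as nn


class SimpleMLP(nn.Module):
    """Stand-in for LlamaMLP: gate/up/down projections (SwiGLU-style)."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        return self.down_proj(nn.functional.silu(self.gate_proj(x)) * self.up_proj(x))


class ResidualMLPStack(nn.Module):
    """nn.ModuleList of MLP blocks, each wrapped in pre-norm + residual,
    reusing one post_attention_layernorm as in the suggestion above."""

    def __init__(self, hidden_size: int, intermediate_size: int, num_draft_hidden_layers: int):
        super().__init__()
        self.post_attention_layernorm = nn.LayerNorm(hidden_size)
        self.mlps = nn.ModuleList(
            [SimpleMLP(hidden_size, intermediate_size) for _ in range(num_draft_hidden_layers)]
        )

    def forward(self, hidden_states):
        for mlp in self.mlps:
            residual = hidden_states
            hidden_states = self.post_attention_layernorm(hidden_states)
            hidden_states = mlp(hidden_states)
            hidden_states = residual + hidden_states
        return hidden_states


stack = ResidualMLPStack(hidden_size=64, intermediate_size=128, num_draft_hidden_layers=3)
out = stack(torch.randn(2, 5, 64))
print(out.shape)  # torch.Size([2, 5, 64])
```

Because each block adds its input back after the MLP, the stack stays close to identity at initialization, which is what makes deep stacks of this form easier to train than one unbroken nn.Sequential.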
tests/test_modeling/test_draft/test_llama3.py (line 63)

This line has incorrect indentation. According to the PEP 8 style guide, indentation should be in multiples of 4 spaces. The line inside the for loop should be indented by 4 more spaces relative to the for statement.

```python
    self.assertIsInstance(model.midlayer.mlps[i], LlamaMLP)
```
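For context, a self-contained version of such a loop inside a unittest looks like the following. `nn.Linear` stands in here for LlamaMLP so the example runs without the SpecForge model classes:

```python
import unittest

import torch.nn as nn


class TestMLPStack(unittest.TestCase):
    def test_each_entry_is_expected_module_type(self):
        num_layers = 3
        mlps = nn.ModuleList([nn.Linear(8, 8) for _ in range(num_layers)])
        for i in range(num_layers):
            # Loop body is indented 4 spaces past the `for`, per PEP 8.
            self.assertIsInstance(mlps[i], nn.Linear)


suite = unittest.TestLoader().loadTestsFromTestCase(TestMLPStack)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # True
```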
Dogacel left a comment
I guess for a true multi-layer architecture, we should duplicate the decoder, not the MLP. Why did you want to duplicate the MLP instead?
scripts/train_eagle3.py (outdated)

```python
    choices=["sglang", "hf", "custom"],
    help="The backend of the target model",
)
model_group.add_argument(
```
This variable can already be set using the model config (SpecForge/configs/llama3-8B-eagle3.json, line 15 in 886ab9c), which allows us to save and load models directly with the right number of hidden layers.
@sleepcoo LGTM
Your implementation fuses input_emb and hidden_states at every layer. May I ask why not just fuse them in the first layer and let the other layers handle hidden_states only?
I believe iterating only on hidden_states in later layers mostly exists in traditional transformer architectures. For the decoder layers used in the draft model, this implementation keeps the initial token information available across all decoder layers, for better speculative decoding rather than only next-token generation.
I agree, I think the input embedding should only be injected in the first layer. The original EAGLE only has one decoder layer, so it didn't matter where this injection was done. Injecting the same data into each layer would result in excessive computation. However, changing this means you have to update some code to make sure your input/output shapes match. For example, attention takes 2 * hidden_size and outputs hidden_size. I think it would be nice to make it configurable whether you map from 2 * hidden_size down to hidden_size and save compute on deeper layers, or keep the full 2 * hidden_size shape and be more expressive. SpecForge/specforge/modeling/draft/llama3_eagle.py, lines 527 to 529 in e1e9695.
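The shape bookkeeping being discussed can be sketched as follows. This is a hypothetical illustration of the "fuse only in the first layer" variant, not the actual SpecForge code: plain nn.Linear layers stand in for full attention/MLP decoder blocks, and the sizes are arbitrary:

```python
import torch
import torch.nn as nn

hidden_size = 64

# First layer consumes the fused [input_embeds; hidden_states] tensor
# (2 * hidden_size wide) and projects back down to hidden_size; the later
# layers then operate on hidden_states alone, saving compute per layer.
first_layer = nn.Linear(2 * hidden_size, hidden_size)
later_layers = nn.ModuleList([nn.Linear(hidden_size, hidden_size) for _ in range(2)])

input_embeds = torch.randn(2, 5, hidden_size)
hidden_states = torch.randn(2, 5, hidden_size)

fused = torch.cat([input_embeds, hidden_states], dim=-1)  # (2, 5, 128)
hidden_states = first_layer(fused)                        # (2, 5, 64)
for layer in later_layers:
    hidden_states = layer(hidden_states)                  # shape preserved
print(hidden_states.shape)  # torch.Size([2, 5, 64])
```

The alternative discussed above would keep every layer at the 2 * hidden_size width (re-concatenating input_embeds before each layer), which is more expressive but repeats the wide projection at every depth.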
@xiaoxi-s Hi, xiaoxi. It seems fusing them only in the first layer could be more efficient and less redundant. Have you experimented with or compared both approaches? If not, could we test the alternative to verify performance? Appreciate your input.
Working on the experiments; will share once done. #449 contains the implementation of the more traditional decoder layer for the additional layers. Will share its benchmarks on ShareGPT after the experiment is done.
The attached files contain benchmark results from training one epoch on the ShareGPT dataset. The architectures compared are: 1. both hidden_states and input_embeddings as inputs to all of the draft model's decoder layers (this PR), and 2. only hidden_states as input to all layers except the first decoder layer (PR #449). Both architectures have three decoder layers. According to the benchmark diffs below, the traditional architecture yields longer accept length and smaller latency. Shall we then merge #449 and its SGLang code changes? @sleepcoo @uygnef @Dogacel

Benchmark diffs
Their deployment code for SGLang is also updated with the following two PRs on the SGLang repo: with input embeddings and without input embeddings. (The two PRs can target v0.5.6 for SpecForge compatibility if required.)

Benchmark files
Models

The models are stored on the atlas machine at the path shown below.

Models on atlas
```shell
ubuntu@atlascloud-h100:~/xiaoxi/dev/SpecForge/outputs$ pwd
/home/ubuntu/xiaoxi/dev/SpecForge/outputs
ubuntu@atlascloud-h100:~/xiaoxi/dev/SpecForge/outputs$ ls
llama3-8b-eagle3-sharegpt-main-1layer                llama3-8b-w-emb-multi-layer-eagle3-sharegpt-3layers  llama3-8b-wo-emb-multi-layer-eagle3-sharegpt-3layers
llama3-8b-w-emb-multi-layer-eagle3-sharegpt-2layers  llama3-8b-wo-emb-multi-layer-eagle3-sharegpt-2layers
```

Below are some additional benchmarks for the 2-layer models, with main as the baseline. Under the "without embedding" architecture, the diffs between the 2-layer and 3-layer results show that multi-layer did improve the accept length of draft models. However, under the "with embedding" architecture, the decrease of

Additional benchmark results for the 2-layer models and main
main_results_20260202_111819.txt
@sleepcoo Based on @xiaoxi-s's benchmark (table above), the implementation in #449 shows consistently better performance.
A couple of things:

Beyond that, I am pretty OK with having some kind of implementation baked into SpecForge, preferably the no-embeddings one to start with, as it is simpler. Another option is having a config variable to control this behavior. Future research could show what is ideal; it seems like a hard question to answer within a PR. Thanks for the effort 🙏
@uygnef Thank you for your comment!
The "Additional benchmark results" section contains the training results on main (which has only one decoder layer in the draft model).
@Dogacel Agreed. The benchmark comparison also shows the w/o-embedding variant works better. I wouldn't personally bother with a config for the w/wo-embeddings behavior in the additional layers, because exposing the embeddings might give the draft models too much information (even using 1-2 additional decoder layers leads to the overfitting problem). But of course, configuring the two kinds of architectures should be addressed in a separate issue/PR. Thank you for all the feedback. We can wait for @sleepcoo's opinion on merging #449.
Use the traditional decoder architecture in the additional decoders of Llama3 EAGLE3's draft model.
Motivation
Support multiple Llama3DecoderLayers in LlamaForCausalLMEagle3.

Modifications

A LlamaForCausalLMEagle3 instance can now hold an arbitrary number (>= 1) of Llama3DecoderLayers, configured via the num_hidden_layers option of LlamaConfig. The test for the new config is also updated in tests/test_modeling/test_draft/test_llama3.py.

Related Issues
Sub-issue of #374
Accuracy Test
Tested under tests/test_modeling/test_draft/test_llama3.py.

Benchmark & Profiling
See this comment. We compared the two architectures; it turns out the w/o-embeddings architecture wins in terms of accept length.
Checklist