
Cleanup and Removal of Deprecated Optimizers#1367

Draft
Koratahiu wants to merge 1 commit into Nerogar:master from Koratahiu:remove_p

Conversation

@Koratahiu
Contributor

Summary of Changes

This PR removes the following optimizers and their associated configurations, UI elements, and logic:

  • DAdaptation Suite: Removed DADAPT_ADA_GRAD, DADAPT_ADAM, DADAPT_ADAN, DADAPT_LION, and DADAPT_SGD. (Superseded by Prodigy.)
  • BitsAndBytes Adagrad: Removed ADAGRAD and ADAGRAD_8BIT. (Very outdated and unstable.)
  • PyTorch Optimizers: Removed TIGER and YOGI. (Tiger is just SignSGD with tweaked momentum, and YOGI is obscure enough that it's unclear anyone uses it.)

To Consider

  • Standard SGD doesn’t work for Transformer models (the newer architectures) due to gradient heterogeneity - where gradient norms vary dramatically across different parameter blocks. This makes it effectively non-functional in OT.
  • Regarding Adam (not AdamW): it has two versions (Original and 8-bit), but the only real difference from AdamW is that their Weight Decay (WD) is flawed. Since it isn't implemented correctly, I think they should be removed.
  • Would splitting the optimizers into sections (Standard, 8-bit, Advanced) further simplify the optimizer list?
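The Adam-vs-AdamW weight-decay point above is the standard coupled-vs-decoupled distinction; a minimal NumPy sketch of the two update rules (illustrative only, not OneTrainer's actual code):

```python
# Sketch contrasting Adam's coupled L2 penalty with AdamW's decoupled
# weight decay. Names and hyperparameters are illustrative.
import numpy as np

def adam_step(w, g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01, t=1):
    # Adam: the decay is folded into the gradient, so it also passes
    # through the adaptive second-moment preconditioner (the "flawed" WD).
    g = g + wd * w
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

def adamw_step(w, g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01, t=1):
    # AdamW: the decay is applied directly to the weights, uniformly,
    # untouched by the second-moment normalization.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v
```

Even with a zero gradient, Adam's coupled decay still gets rescaled by the preconditioner, while AdamW shrinks the weights by exactly `lr * wd * w`.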

@dxqb
Collaborator

dxqb commented Mar 13, 2026

Regarding Adam (not AdamW): it has two versions (Original and 8-bit), but the only real difference from AdamW is that their Weight Decay (WD) is flawed. Since it isn't implemented correctly, I think they should be removed.

Any idea why it could have been added in the first place?
AdamW is (much) older than OneTrainer.

@O-J1
Collaborator

O-J1 commented Mar 13, 2026

Regarding Adam (not AdamW): it has two versions (Original and 8-bit), but the only real difference from AdamW is that their Weight Decay (WD) is flawed. Since it isn't implemented correctly, I think they should be removed.

any idea why it could have been added in the first place? AdamW is (much) older than OneTrainer

I can see a mention from Nero of "Adam being the default" from nearly 3 years ago. Maybe it was added for completeness' sake? (Speculation)

Here are some quotes from Nero, madmen, and surgo, 2 years ago:

  • I'd suggest adamw instead of adam
  • im just saying Adam to save a letter xd, adamW has replaced it in pretty much every context
  • Newbies ought to be using only prodigy, adamw, or adafactor. that's it.

Even then, Adam (without the W) was regarded as bad. Safe to remove imo. Do we need a migration for this, so that AdamW becomes the default after these are removed?
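If a migration is wanted, a hedged sketch of what it might look like (the names and config shape here are illustrative assumptions, not OneTrainer's actual schema or migration mechanism):

```python
# Hypothetical migration: remap removed optimizer names in a saved
# config to ADAMW so old configs still load after this PR.
REMOVED_OPTIMIZERS = {
    "DADAPT_ADA_GRAD", "DADAPT_ADAM", "DADAPT_ADAN", "DADAPT_LION",
    "DADAPT_SGD", "ADAGRAD", "ADAGRAD_8BIT", "TIGER", "YOGI",
    "ADAM", "ADAM_8BIT",
}

def migrate_optimizer(config: dict) -> dict:
    """Replace a removed optimizer value with ADAMW, leaving others as-is."""
    if config.get("optimizer") in REMOVED_OPTIMIZERS:
        config = {**config, "optimizer": "ADAMW"}
    return config
```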

@dxqb
Collaborator

dxqb commented Mar 15, 2026


Standard SGD doesn’t work for Transformer models (the newer architectures) due to gradient heterogeneity - where gradient norms vary dramatically across different parameter blocks. This makes it effectively non-functional in OT.

maybe keep SGD for "historical reasons" and maybe it's useful for embeddings or tests? I've used it once for a test with no momentum because I wanted pure gradient descent. Not sure if you could configure AdamW to do that.
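For reference, that configuration is just plain `torch.optim.SGD` with momentum left at 0, which reduces to pure gradient descent (`w <- w - lr * g`):

```python
# Sketch: SGD with momentum=0.0 is pure gradient descent. AdamW has no
# setting that recovers this exact update, since it always divides by
# its second-moment estimate.
import torch

w = torch.nn.Parameter(torch.tensor([3.0]))
opt = torch.optim.SGD([w], lr=0.1, momentum=0.0)

loss = (w ** 2).sum()   # gradient of w^2 is 2*w = 6.0
loss.backward()
opt.step()              # w <- 3.0 - 0.1 * 6.0 = 2.4
```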

but we don't need the optimized variants (8bit and definitely not schedulefree)

Regarding Adam (not AdamW): it has two versions (Original and 8-bit), but the only real difference from AdamW is that their Weight Decay (WD) is flawed. Since it isn't implemented correctly, I think they should be removed.

feels weird to remove Adam, but I can't think of a reason to keep it then either

@Koratahiu
Contributor Author

maybe keep SGD for "historical reasons" and maybe it's useful for embeddings or tests? I've used it once for a test with no momentum because I wanted pure gradient descent. Not sure if you could configure AdamW to do that.

It’s true that it has a unique geometry (L^2 norm) with rotational invariance, which is optimal for embeddings like CLIP tokens.
Theoretically, it is also the only optimizer that converges with a small batch size and no momentum/state.
So, I think it’s fine to keep the original as is.

feels weird to remove Adam, but I can't think of a reason to keep it then either

Unrelated, but theoretically, weight decay should be applied opposite to how the original Adam effectively applies it: because Adam folds the decay into the gradient, the decay ends up scaled by the inverse square root of the second moment. Scaling it by the square root instead would bring the weight decay into alignment with how the optimizer normalizes its updates.
AdamW takes the middle ground by applying a uniform weight decay.
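A sketch of the three decay placements under that reading (NumPy, hypothetical helper; "proposed" multiplies by the root of the second moment where original Adam effectively divides by it):

```python
# Per-step decay contribution for a weight w with bias-corrected
# second-moment estimate v_hat, under three placements. Illustrative only.
import numpy as np

def decay_term(w, v_hat, lr=1e-3, wd=0.01, eps=1e-8, mode="adamw"):
    if mode == "adam":       # original Adam: decay divided by sqrt(v_hat)
        return lr * wd * w / (np.sqrt(v_hat) + eps)
    if mode == "proposed":   # opposite: decay multiplied by sqrt(v_hat)
        return lr * wd * w * np.sqrt(v_hat)
    if mode == "adamw":      # middle ground: uniform decay
        return lr * wd * w
    raise ValueError(mode)
```

For a weight with small gradient variance (v_hat < 1), original Adam decays it the most, the "proposed" scaling the least, and AdamW sits uniformly in between.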
