
Cleanup and Removal of Deprecated Optimizers#1367

Draft
Koratahiu wants to merge 1 commit into Nerogar:master from Koratahiu:remove_p

Conversation

@Koratahiu
Contributor

Summary of Changes

This PR removes the following optimizers and their associated configurations, UI elements, and logic:

  • DAdaptation Suite: Removed DADAPT_ADA_GRAD, DADAPT_ADAM, DADAPT_ADAN, DADAPT_LION, and DADAPT_SGD. (Superseded by Prodigy.)
  • BitsAndBytes Adagrad: Removed ADAGRAD and ADAGRAD_8BIT. (Very outdated and unstable.)
  • PyTorch Optimizers: Removed TIGER and YOGI. (Tiger is just SignSGD with tweaked momentum, and YOGI is obscure enough that it's unclear anyone uses it.)

To Consider

  • Standard SGD doesn’t work for Transformer models (the newer architectures) due to gradient heterogeneity - where gradient norms vary dramatically across different parameter blocks. This makes it effectively non-functional in OT.
  • Regarding Adam (not AdamW): it has two versions (Original and 8-bit), but the only real difference from AdamW is that their Weight Decay (WD) is flawed. Since it isn't implemented correctly, I think they should be removed.
  • Would splitting the optimizers into sections (Standard, 8-bit, Advanced) further simplify the optimizer list?
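The Adam-vs-AdamW weight-decay point above is the standard coupled-vs-decoupled distinction; a minimal NumPy sketch of the two update rules (illustrative only, not OneTrainer's actual code):

```python
# Sketch contrasting Adam's coupled L2 penalty with AdamW's decoupled
# weight decay. Names and hyperparameters are illustrative.
import numpy as np

def adam_step(w, g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01, t=1):
    # Adam: the decay is folded into the gradient, so it also passes
    # through the adaptive second-moment preconditioner (the "flawed" WD).
    g = g + wd * w
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

def adamw_step(w, g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01, t=1):
    # AdamW: the decay is applied directly to the weights, uniformly,
    # untouched by the second-moment normalization.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v
```

Even with a zero gradient, Adam's coupled decay still gets rescaled by the preconditioner, while AdamW shrinks the weights by exactly `lr * wd * w`.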

@dxqb
Collaborator

dxqb commented Mar 13, 2026

Regarding Adam (not AdamW): it has two versions (Original and 8-bit), but the only real difference from AdamW is that their Weight Decay (WD) is flawed. Since it isn't implemented correctly, I think they should be removed.

Any idea why it could have been added in the first place?
AdamW is (much) older than OneTrainer.

@O-J1
Collaborator

O-J1 commented Mar 13, 2026

Regarding Adam (not AdamW): it has two versions (Original and 8-bit), but the only real difference from AdamW is that their Weight Decay (WD) is flawed. Since it isn't implemented correctly, I think they should be removed.

any idea why it could have been added in the first place? AdamW is (much) older than OneTrainer

I can see a mention from Nero of "Adam being the default" from nearly 3 years ago. Maybe it was added for completeness' sake? (Speculation)

Here are some quotes from Nero, madmen, and surgo, 2 years ago:

  • I'd suggest adamw instead of adam
  • im just saying Adam to save a letter xd, adamW has replaced it in pretty much every context
  • Newbies ought to be using only prodigy, adamw, or adafactor. that's it.

Even then, Adam (without the W) was regarded as bad. Safe to remove imo. Do we need a migration for this, so that AdamW becomes the default after these are removed?
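If a migration is wanted, a hedged sketch of what it might look like (the names and config shape here are illustrative assumptions, not OneTrainer's actual schema or migration mechanism):

```python
# Hypothetical migration: remap removed optimizer names in a saved
# config to ADAMW so old configs still load after this PR.
REMOVED_OPTIMIZERS = {
    "DADAPT_ADA_GRAD", "DADAPT_ADAM", "DADAPT_ADAN", "DADAPT_LION",
    "DADAPT_SGD", "ADAGRAD", "ADAGRAD_8BIT", "TIGER", "YOGI",
    "ADAM", "ADAM_8BIT",
}

def migrate_optimizer(config: dict) -> dict:
    """Replace a removed optimizer value with ADAMW, leaving others as-is."""
    if config.get("optimizer") in REMOVED_OPTIMIZERS:
        config = {**config, "optimizer": "ADAMW"}
    return config
```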

@dxqb
Collaborator

dxqb commented Mar 15, 2026


Standard SGD doesn’t work for Transformer models (the newer architectures) due to gradient heterogeneity - where gradient norms vary dramatically across different parameter blocks. This makes it effectively non-functional in OT.

maybe keep SGD for "historical reasons" and maybe it's useful for embeddings or tests? I've used it once for a test with no momentum because I wanted pure gradient descent. Not sure if you could configure AdamW to do that.
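For reference, that configuration is just plain `torch.optim.SGD` with momentum left at 0, which reduces to pure gradient descent (`w <- w - lr * g`):

```python
# Sketch: SGD with momentum=0.0 is pure gradient descent. AdamW has no
# setting that recovers this exact update, since it always divides by
# its second-moment estimate.
import torch

w = torch.nn.Parameter(torch.tensor([3.0]))
opt = torch.optim.SGD([w], lr=0.1, momentum=0.0)

loss = (w ** 2).sum()   # gradient of w^2 is 2*w = 6.0
loss.backward()
opt.step()              # w <- 3.0 - 0.1 * 6.0 = 2.4
```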

but we don't need the optimized variants (8bit and definitely not schedulefree)

Regarding Adam (not AdamW): it has two versions (Original and 8-bit), but the only real difference from AdamW is that their Weight Decay (WD) is flawed. Since it isn't implemented correctly, I think they should be removed.

feels weird to remove Adam, but I can't think of a reason to keep it then either

@Koratahiu
Contributor Author

maybe keep SGD for "historical reasons" and maybe it's useful for embeddings or tests? I've used it once for a test with no momentum because I wanted pure gradient descent. Not sure if you could configure AdamW to do that.

It’s true that it has a unique geometry (L^2 norm) with rotational invariance, which is optimal for embeddings like CLIP tokens.
Theoretically, it is also the only optimizer that converges with a small batch size and no momentum/state.
So, I think it’s fine to keep the original as is.

feels weird to remove Adam, but I can't think of a reason to keep it then either

Unrelated, but theoretically, weight decay should be applied opposite to how the original Adam effectively applies it: because Adam folds the decay into the gradient, the decay ends up scaled by the inverse square root of the second moment. Scaling it by the square root instead would bring the weight decay into alignment with how the optimizer normalizes its updates.
AdamW takes the middle ground by applying a uniform weight decay.
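A sketch of the three decay placements under that reading (NumPy, hypothetical helper; "proposed" multiplies by the root of the second moment where original Adam effectively divides by it):

```python
# Per-step decay contribution for a weight w with bias-corrected
# second-moment estimate v_hat, under three placements. Illustrative only.
import numpy as np

def decay_term(w, v_hat, lr=1e-3, wd=0.01, eps=1e-8, mode="adamw"):
    if mode == "adam":       # original Adam: decay divided by sqrt(v_hat)
        return lr * wd * w / (np.sqrt(v_hat) + eps)
    if mode == "proposed":   # opposite: decay multiplied by sqrt(v_hat)
        return lr * wd * w * np.sqrt(v_hat)
    if mode == "adamw":      # middle ground: uniform decay
        return lr * wd * w
    raise ValueError(mode)
```

For a weight with small gradient variance (v_hat < 1), original Adam decays it the most, the "proposed" scaling the least, and AdamW sits uniformly in between.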
