Retry jobs on bad GPU hosts #1663

Closed
jathu wants to merge 5 commits into main from jathu/retry-gpu-health

Conversation

jathu commented Mar 13, 2026

WIP

meta-cla bot added the CLA Signed label Mar 13, 2026
jathu marked this pull request as ready for review March 13, 2026 19:31
atalman left a comment

lgtm

v0i0 (Contributor) commented Mar 13, 2026

Let's make sure we also cover the benchmark dispatch.

jathu (Author) commented Mar 13, 2026

@huydhn will take over this work

jathu closed this Mar 13, 2026
jathu deleted the jathu/retry-gpu-health branch March 13, 2026 23:35
v0i0 (Contributor) commented Mar 14, 2026

> @huydhn will take over this work

@jathu is there an issue with this approach?

Labels

CLA Signed (this label is managed by the Meta Open Source bot)

4 participants