[examples] Add WideEP DP group fault tolerance example #54

Open
jeffreywang-anyscale wants to merge 16 commits into main from wideep-dp-group-ft

jeffreywang-anyscale commented Apr 1, 2026

Summary

Add examples that demonstrate WideEP fault tolerance and autoscaling with Anyscale services.

Tests

Verified by prompting CC to "Read README.md and run demo 1" and "Run demo 2", then observing the expected replica changes in the Ray Serve dashboard.

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
jeffreywang-anyscale and others added 8 commits April 1, 2026 21:53
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
g6.12xlarge (4x L4 GPUs) has 192 GiB RAM, not 32 GiB.
The incorrect value caused free pod shape validation to fail
on deployment.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Robert Nishihara <rkn@anyscale.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Robert Nishihara <rkn@anyscale.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Robert Nishihara <rkn@anyscale.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Robert Nishihara <rkn@anyscale.com>
- Rewrite README to match the pattern of other examples (Install CLI,
  Clone, Deploy, Query, Understanding, Shutdown)
- Split into two clear demos: autoscaling service and fault tolerance job
- Add job.yaml so fault_tolerance_demo.py can be run as an Anyscale job
  from a laptop without needing direct cluster access

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Robert Nishihara <rkn@anyscale.com>
Use the deployed service + console terminal to demonstrate fault
tolerance, which is simpler and more visual than submitting a job.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Robert Nishihara <rkn@anyscale.com>
Signed-off-by: jeffreywang <jeffreywang@anyscale.com>
# Deploy with: anyscale service deploy -f autoscaling/service.yaml

name: wide-ep-autoscaling
image_uri: anyscale/ray-llm:nightly-py311-cu128

We can replace this with the 2.55 image once it's released next week.

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

jeffreywang-anyscale commented Apr 13, 2026

DeepSeek DP group fault tolerance (dp_size=16):

[Image: Deepseek-fault-tolerance]

env_vars:
  VLLM_DISABLE_COMPILE_CACHE: "1"

- name: simulate-fault

Unify these into the same service and put both endpoints under the same application.

Flatten the directories.

#
# Deploy with: anyscale service deploy -f fault_tolerance/service.yaml

name: wide-ep-fault-tolerance

Test on different clouds.

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>