
Nemo curator clean #51

Draft

avigyabb wants to merge 6 commits into anyscale:main from avigyabb:nemo-curator-clean


Conversation

@avigyabb (Contributor)

End-to-end example that downloads images from a HuggingFace parquet dataset, generates CLIP embeddings on GPUs using NeMo Curator, finds near-duplicates via K-means + DBSCAN, and writes a clean deduplicated dataset — all running as a distributed Anyscale job.
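
As a rough orientation, the stages described above can be wired together as a toy skeleton, where an exact-hash "embedding" and exact-match grouping stand in for the real CLIP + K-means/DBSCAN steps. All names here are illustrative, not the PR's actual API:

```python
import hashlib


def download_images(urls):
    # Stand-in for the distributed download stage: in the real example,
    # image bytes are fetched from URLs in a HuggingFace parquet dataset.
    return [u.encode() for u in urls]  # pretend the URL bytes are the image


def embed_images(images):
    # Stand-in for GPU CLIP embeddings via NeMo Curator: here, a hash digest.
    return [hashlib.sha256(img).hexdigest() for img in images]


def find_duplicates(embeddings):
    # Stand-in for K-means + DBSCAN near-duplicate detection:
    # exact-match grouping marks every repeat after the first as a duplicate.
    seen, dupes = set(), set()
    for i, e in enumerate(embeddings):
        if e in seen:
            dupes.add(i)
        seen.add(e)
    return dupes


def run_pipeline(urls):
    images = download_images(urls)
    dupes = find_duplicates(embed_images(images))
    # Write the clean dataset: keep only non-duplicate entries.
    return [u for i, u in enumerate(urls) if i not in dupes]
```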

from PIL import Image


def download_single_image(url: str, session: requests.Session) -> bytes | None:
Contributor

Can you try the Ray download expr and benchmark the perf? There is a major update to this feature, ray-project/ray#61735, that recently got merged that you should try.

pool_connections=100, pool_maxsize=100, max_retries=0,
)
session.mount("http://", adapter)
session.mount("https://", adapter)
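
For context, the snippet above presumably comes from a session factory along these lines. This is a sketch: the pool sizes and `max_retries` mirror the diff, the rest is assumed:

```python
import requests
from requests.adapters import HTTPAdapter


def make_session() -> requests.Session:
    """Build a Session with a large connection pool for parallel downloads."""
    session = requests.Session()
    adapter = HTTPAdapter(
        pool_connections=100,  # number of distinct host pools to keep
        pool_maxsize=100,      # connections per pool
        max_retries=0,         # fail fast; the caller decides how to retry
    )
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```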
Contributor

Use the Ray download expr and compare with this.

file_extensions=["parquet"],
columns=["url", "caption"],
filesystem=HfFileSystem(token=os.environ["HF_TOKEN"]),
concurrency=10,
@xyuzh (Contributor), Apr 6, 2026

Why is the concurrency 10? Don't you pass concurrency as a param?

memory: 32Gi
min_nodes: 0
max_nodes: 10
- name: a10g-gpu-workers
Contributor

Have you been using CPUs on the A10G workers as well?

Contributor Author

Yes, the CPUs should also be utilized.

"""
reader = ImageReaderStage(batch_size=config.batch_size, num_gpus_per_worker=0)
reader.resources.cpus = config.reader_cpus_per_task
reader.resources.gpus = 0.01
Contributor

Why is gpus set here?

Contributor

Is a placement group created on the NeMo side of the code?

@avigyabb (Contributor Author), Apr 7, 2026

Why is gpus set here?

I could probably be more organized with this. This is essentially a trick I use to make sure DALI gets scheduled on the GPU nodes and not the CPU nodes, which are meant for downloading the images.

Is a placement group created on the NeMo side of the code?

No, I don't believe they are used anywhere in the example.
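
The fractional-GPU trick described above can be sketched with stand-in stage objects. `Stage` and `Resources` here are hypothetical simplifications of NeMo Curator's stage API, not the real classes:

```python
from dataclasses import dataclass, field


@dataclass
class Resources:
    cpus: float = 1.0
    gpus: float = 0.0


@dataclass
class Stage:
    name: str
    resources: Resources = field(default_factory=Resources)


def pin_to_gpu_nodes(stage: Stage, gpu_fraction: float = 0.01) -> Stage:
    """Request a tiny GPU fraction so the scheduler places this otherwise
    CPU-bound stage on GPU nodes, keeping CPU-only nodes free for downloads."""
    stage.resources.gpus = gpu_fraction
    return stage
```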


if config.max_entries is not None:
ds = ds.limit(config.max_entries)
ds = ds.repartition(num_blocks=max(100, config.max_entries // 1000))
Contributor

Also, shouldn't num_blocks be related to the number of CPUs available?
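
A sketch of tying the block count to cluster size rather than a fixed floor of 100; `os.cpu_count()` stands in for the cluster's total CPUs, which on a Ray cluster would come from the cluster resources instead:

```python
import os


def choose_num_blocks(max_entries: int,
                      entries_per_block: int = 1000,
                      min_blocks_per_cpu: int = 2) -> int:
    """Pick a block count that keeps every CPU busy while capping block size."""
    total_cpus = os.cpu_count() or 1
    # At least a couple of blocks per CPU for load balancing, and enough
    # blocks that none holds more than entries_per_block rows.
    return max(min_blocks_per_cpu * total_cpus,
               max_entries // entries_per_block)
```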

logger.info(f"Download complete: {total_success} images in {num_shards} shards ({success_rate:.1f}% success rate)")

# Use executors that avoid scheduling on CPU-only head node
streaming_executor = RayDataExecutor(ignore_head_node=True)
Contributor

Does the head node get used by default? I remember Anyscale doesn't allow scheduling on the head node in the multi-node case.

Contributor Author

Yeah, I think we can remove these, mainly since I added some logic on the NeMo Curator side to handle this. For some reason I was running into errors there; let me take a closer look and let you know.

