
Conversation

@hanaol (Collaborator) commented Jan 16, 2026

Handles logging/saving the performance metric across multiple ranks.

for ind, err in zip(index, nmae.tolist(), strict=False):
    f.write(f"{ind},{err}\n")
# write final CSV with header
with open(final_csv, "w") as f_out:
Collaborator:

Currently, all ranks will try to write to this file, which can lead to conflicting writes. Let's add a torch.distributed.barrier() to synchronize the processes and only let rank == 0 write final_csv.
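A minimal sketch of the suggested pattern, assuming torch.distributed is initialized and reusing the final_csv / self.tmp_dir / self.global_rank names from the PR; the header column names and the glob pattern are assumptions for illustration:

import torch.distributed as dist

# make sure every rank has finished writing its per-rank tmp CSVs
if dist.is_available() and dist.is_initialized():
    dist.barrier()

# only rank 0 merges the temporary files into the final CSV
if self.global_rank == 0:
    with open(final_csv, "w") as f_out:
        f_out.write("index,nmae\n")  # header (column names assumed)
        for tmp_csv in sorted(self.tmp_dir.glob("metrics_batch_*_*.csv")):
            with open(tmp_csv) as f_in:
                f_out.writelines(f_in)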

if self.log_dir is not None:
    self.log_dir = Path(self.log_dir)
    self.log_dir.mkdir(exist_ok=True, parents=True)
self.tmp_dir = Path(self.out_dir) / "tmp"
Collaborator:

self.out_dir could be None here. Should this be in the previous block?
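One possible fix, sketched under the assumption that tmp_dir should only be created when an output directory is configured; the None guard and the mkdir call are assumptions, while the attribute names come from the PR:

from pathlib import Path

if self.out_dir is not None:
    self.tmp_dir = Path(self.out_dir) / "tmp"
    self.tmp_dir.mkdir(exist_ok=True, parents=True)
else:
    self.tmp_dir = None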

nmae = nmae.unsqueeze(0)
tmp_csv = self.tmp_dir / f"metrics_batch_{self.global_rank}_{batch_idx}.csv"
with open(tmp_csv, "w") as f:
    for i, n in zip(indices, nmae, strict=False):
Collaborator:

Strongly prefer strict=True unless there is a reason we do not expect the dimensions to match.
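For illustration, the loop with the suggested change; the write line is an assumption modeled on the analogous loop earlier in the diff:

# strict=True raises ValueError on a length mismatch instead of
# silently truncating to the shorter iterable
for i, n in zip(indices, nmae, strict=True):
    f.write(f"{i},{float(n)}\n")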

for tmp_csv in all_tmp_csvs:
    with open(tmp_csv) as f_in:
        for line in f_in:
            f_out.write(line)
Collaborator:

Would be nice to clean up the tmp_dir files once this is done.
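A possible cleanup step, sketched under the assumption that it runs on rank 0 after the final CSV has been written; the use of shutil.rmtree is a suggestion, not part of the PR:

import shutil

# remove the per-rank temporary CSVs (and the tmp dir itself) after merging
if self.global_rank == 0:
    shutil.rmtree(self.tmp_dir, ignore_errors=True)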
