saving output from different ranks #62
base: hanaol/test-feature
for ind, err in zip(index, nmae.tolist(), strict=False):
    f.write(f"{ind},{err}\n")
# write final CSV with header
with open(final_csv, "w") as f_out:
Currently, every rank will try to write to this file, so the processes can overwrite each other's output. Let's add torch.distributed.barrier() to synchronize the processes and let only rank 0 write final_csv.
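A minimal sketch of what the suggestion above could look like. The helper name `merge_rank_csvs`, the header `index,nmae`, and the glob pattern are assumptions for illustration, not code from this PR. The barrier is passed in as a callable so the sketch runs without a process group.

```python
from pathlib import Path
from typing import Callable


def merge_rank_csvs(
    tmp_dir: Path,
    final_csv: Path,
    rank: int,
    barrier: Callable[[], None],
) -> None:
    """Merge per-rank temporary CSVs into one file, written by rank 0 only."""
    # Wait until every rank has finished writing its temporary files.
    barrier()
    if rank == 0:
        with open(final_csv, "w") as f_out:
            f_out.write("index,nmae\n")  # header column names are assumed
            # sorted() gives a deterministic merge order across runs
            for tmp_csv in sorted(tmp_dir.glob("metrics_batch_*.csv")):
                with open(tmp_csv) as f_in:
                    for line in f_in:
                        f_out.write(line)
    # Hold the other ranks until rank 0 has finished merging.
    barrier()
```

In a distributed run the call would presumably be `merge_rank_csvs(self.tmp_dir, final_csv, torch.distributed.get_rank(), torch.distributed.barrier)`.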
if self.log_dir is not None:
    self.log_dir = Path(self.log_dir)
    self.log_dir.mkdir(exist_ok=True, parents=True)
self.tmp_dir = Path(self.out_dir) / "tmp"
self.out_dir could be None here. Should this be in the previous block?
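One possible shape for the guard, as a sketch only. `MetricWriter` is a hypothetical stand-in for the class in this PR, and leaving `tmp_dir` as `None` when no `out_dir` is given is an assumption about the desired behavior.

```python
from pathlib import Path
from typing import Optional


class MetricWriter:
    """Hypothetical stand-in for the class under review."""

    def __init__(
        self,
        log_dir: Optional[str] = None,
        out_dir: Optional[str] = None,
    ) -> None:
        self.log_dir = Path(log_dir) if log_dir is not None else None
        if self.log_dir is not None:
            self.log_dir.mkdir(exist_ok=True, parents=True)

        self.out_dir = Path(out_dir) if out_dir is not None else None
        # Only derive tmp_dir when out_dir is set; Path(None) raises TypeError.
        self.tmp_dir: Optional[Path] = None
        if self.out_dir is not None:
            self.tmp_dir = self.out_dir / "tmp"
            self.tmp_dir.mkdir(exist_ok=True, parents=True)
```

Code that writes the temporary CSVs would then need to skip that step when `self.tmp_dir` is `None`.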
nmae = nmae.unsqueeze(0)
tmp_csv = self.tmp_dir / f"metrics_batch_{self.global_rank}_{batch_idx}.csv"
with open(tmp_csv, "w") as f:
    for i, n in zip(indices, nmae, strict=False):
Strongly prefer strict=True unless there is a reason we do not expect the dimensions to match.
for tmp_csv in all_tmp_csvs:
    with open(tmp_csv) as f_in:
        for line in f_in:
            f_out.write(line)
Would be nice to clean up the tmp_dir files once this is done.
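A small sketch of the cleanup, assuming the whole `tmp` directory can be removed once the merge has finished and that only rank 0 should do it. `cleanup_tmp_dir` is a hypothetical helper name.

```python
import shutil
from pathlib import Path


def cleanup_tmp_dir(tmp_dir: Path, rank: int) -> None:
    """Delete the temporary per-rank CSVs once they have been merged."""
    # Restrict deletion to rank 0 so ranks do not race on the same directory.
    if rank == 0 and tmp_dir.exists():
        shutil.rmtree(tmp_dir)
```

This should run only after the final CSV is fully written; in a multi-process job a barrier before the call would keep rank 0 from deleting files another rank is still writing.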
Handles logging/saving the performance metric across multiple ranks.