Add mongo_sync script for syncing MongoDB data to JSON files #150
SimoneBendazzoli93 wants to merge 11 commits into
Pull request overview
Adds a new mongo_sync utility script under the dashboard to export MongoDB maia_projects data to per-namespace JSON files, enriching each project with a derived users email list and normalizing the date field.
Changes:
- Connect to MongoDB via env-provided credentials and fetch maia_projects/maia_users.
- Build a filtered project payload (selected fields + derived users) and write Projects/<namespace>.json.
- Add basic date normalization for multiple MongoDB date representations.
Overall Verdict
REQUEST CHANGES
Critical Issues
- The user-to-project membership check is incorrect (substring match vs exact membership in a comma-separated namespace list), which can produce wrong users lists in the generated JSON.
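The difference between the two checks can be illustrated with a minimal sketch. It assumes, as the quoted code later in this review suggests, that each user document stores its namespaces as one comma-separated string; the helper name `parse_namespaces` and the sample values are hypothetical.

```python
def parse_namespaces(value):
    """Split a comma-separated namespace string into a list of trimmed names."""
    return [part.strip() for part in (value or "").split(",") if part.strip()]


# Hypothetical user field value, for illustration only.
user_namespace_value = "team-alpha, demo-project"

# Substring match: reports membership for a partial name such as "team".
substring_hit = "team" in user_namespace_value

# Exact membership: compares against each parsed namespace as a whole.
exact_hit = "team" in parse_namespaces(user_namespace_value)
```

With the substring check, a project named `team` would pick up every user whose namespace string merely contains those characters; the list-based check only matches whole entries.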
Optional Notes (non-blocking)
- Some key-mapping branches (group_id/username) are currently unreachable due to the metadata gate and should be removed or reworked.
- The script performs DB access and file writes at import time; guarding with a main()/if __name__ == "__main__": would prevent accidental side effects.
- Remove or rewrite the inline "git clone with credentials in URL" comment to avoid promoting a secret-leaking workflow.
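The import-time side effect noted above is usually avoided with an entry-point guard. The following is only a sketch of the pattern: the body of `main()` is a placeholder, not the script's actual sync logic.

```python
def main() -> int:
    """Placeholder entry point; the DB reads and file writes would live here."""
    print("mongo_sync would run here")
    return 0


# Runs only when the file is executed directly, not when it is imported.
if __name__ == "__main__":
    main()
```

With this layout, importing the module (for example from a unit test) defines `main()` without connecting to MongoDB or touching the filesystem.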
This new script connects to a MongoDB database, retrieves project and user data, filters the information, and writes it to JSON files organized by project namespace. It includes error handling for date formatting and ensures the project folder is created if it doesn't exist.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Simone Bendazzoli <simonebendazzoli93@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Simone Bendazzoli <simonebendazzoli93@gmail.com>
This update organizes the mongo_sync script by wrapping the main functionality in a `main()` function. It maintains the existing logic for connecting to the MongoDB database, retrieving project and user data, filtering the information, and writing it to JSON files. The project folder creation and date formatting error handling remain intact.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Simone Bendazzoli <simonebendazzoli93@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Simone Bendazzoli <simonebendazzoli93@gmail.com>
- Introduced a new script, mongo_sync.py, that connects to a MongoDB database and retrieves project and user data.
- The script filters and formats project information, including user emails associated with each project, and saves the data as JSON files in a designated "Projects" directory.
- Implemented error handling for date formatting and ensured safe filename generation to prevent path traversal vulnerabilities.
3af7f4b to 72b77fb
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com> Signed-off-by: Simone Bendazzoli <simonebendazzoli93@gmail.com>
Walkthrough
This PR adds a MongoDB sync script that reads user and project documents from MongoDB, filters projects by matching user namespaces, normalizes metadata fields including dates, and exports each filtered project to a JSON file with a path-safe filename derived from the project namespace.
Changes: MongoDB Project Sync
Estimated code review effort: 3 (Moderate) | ~20 minutes
Pre-merge checks: 4 passed, 1 failed (1 warning)
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@dashboard/core/mongo_sync.py`:
- Around line 59-67: The loop that populates filtered_project["users"] currently
appends raw user.get("email") values (in the users -> filtered_project
construction) which exports PII; change it to export a stable non-PII identifier
instead (e.g., user.get("id") or a deterministic hash of the email using a
project/system salt) so no raw emails are written to disk; update the same
pattern found around the other occurrence noted (lines ~103-104) to use the
ID/hash, and ensure any downstream consumers that expect emails read the new
field name/format (adjust keys or document the change) in the code paths
referencing filtered_project and project.
- Line 63: The code directly indexes project["namespace"] which can raise
KeyError for missing/invalid project docs; change the membership check in the
sync logic to use project.get("namespace") (or assign ns =
project.get("namespace")) and skip the record if ns is None or not a str, i.e.
only proceed to check membership against user_namespaces when a valid namespace
exists to avoid aborting the whole sync.
- Around line 71-77: The code assumes v["$date"] exists and is a string before
parsing in the block around filtered_project[k], which can raise KeyError or
TypeError; change it to safely fetch and normalize the $date value (e.g. use
v.get("$date") or check "$date" in v), ensure non-string values are converted to
str() before calling replace("Z", "+00:00")/fromisoformat, and keep the
datetime.fromisoformat(...) call inside the try/except so missing or malformed
values fall back to assigning the original (or stringified) date to
filtered_project[k]; update the logic around variables v, k, filtered_project
and datetime.fromisoformat accordingly.
- Around line 98-104: The current sanitization of raw_namespace into
safe_namespace can cause collisions (e.g., "team/a" -> "team_a") and overwrite
existing exports; update the write logic around safe_namespace (used in
project_folder.joinpath(safe_namespace + ".json") and
json.dump(filtered_project,...)) to detect collisions and produce a unique,
deterministic filename instead of silently overwriting: after computing
safe_namespace, check whether a file with that name already exists and if it was
produced from a different raw_namespace (track a mapping or compare existing
metadata), and if a collision is detected append a short deterministic suffix
(e.g., a hex hash of raw_namespace like first 8 chars) to the filename or raise
a ValueError; ensure the chosen approach uses raw_namespace for uniqueness and
still prevents path traversal via the existing Path(safe_namespace).name check
before opening the file.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: b0ec6139-1168-4315-9029-4f9fde529d77
📒 Files selected for processing (1)
dashboard/core/mongo_sync.py
    filtered_project = {"users": []}
    for user in users:
        user_namespace_value = user.get("namespace") or ""
        user_namespaces = [namespace.strip() for namespace in user_namespace_value.split(",") if namespace.strip()]
        if project["namespace"] in user_namespaces:
            user_email = user.get("email")
            if user_email:
                filtered_project["users"].append(user_email)
    for k, v in project.items():
Reassess exporting raw user emails to disk (PII retention risk).
The script writes email addresses into project JSON files under Projects/. If these files are committed/synced, this creates a privacy/compliance exposure. Prefer stable user IDs or hashed emails unless explicit policy/legal basis requires raw emails.
Also applies to: 103-104
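One way to export a stable identifier instead of a raw address is a keyed hash of the email. This is a minimal sketch, not the project's implementation: the function name `pseudonymize_email` and the salt handling are assumptions, and the salt would need to be stored outside the exported files.

```python
import hashlib
import hmac


def pseudonymize_email(email: str, salt: bytes) -> str:
    """Return a deterministic hex digest for an email (hypothetical helper)."""
    # Normalize so that case or surrounding whitespace does not change the ID.
    normalized = email.strip().lower()
    return hmac.new(salt, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# Example salt for illustration only; a real one should come from a secret store.
example_salt = b"example-salt"
user_id = pseudonymize_email("Alice@Example.org ", example_salt)
```

The same email always maps to the same identifier for a given salt, so downstream consumers can still group by user, but the address itself is not written to disk.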
    for user in users:
        user_namespace_value = user.get("namespace") or ""
        user_namespaces = [namespace.strip() for namespace in user_namespace_value.split(",") if namespace.strip()]
        if project["namespace"] in user_namespaces:
Guard missing project namespace before membership checks.
project["namespace"] can raise KeyError and abort the whole sync when a project document is missing/invalid. Use .get() and skip invalid records safely.
Proposed fix
    -    if project["namespace"] in user_namespaces:
    +    project_namespace = project.get("namespace")
    +    if not project_namespace:
    +        continue
    +    if project_namespace in user_namespaces:
             user_email = user.get("email")
             if user_email:
                 filtered_project["users"].append(user_email)
    if isinstance(v, dict):
        date_str = v["$date"]
        try:
            date_obj = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
            filtered_project[k] = date_obj.strftime("%Y-%m-%d")
        except Exception:
            filtered_project[k] = date_str
Handle $date shape safely before parsing.
date_str = v["$date"] is outside the try, so missing $date crashes the run. Also, non-string $date values should be normalized safely.
Proposed fix
     if k == "date":
         if isinstance(v, dict):
    -        date_str = v["$date"]
    +        date_raw = v.get("$date")
             try:
    -            date_obj = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
    -            filtered_project[k] = date_obj.strftime("%Y-%m-%d")
    +            if isinstance(date_raw, str):
    +                date_obj = datetime.fromisoformat(date_raw.replace("Z", "+00:00"))
    +                filtered_project[k] = date_obj.strftime("%Y-%m-%d")
    +            elif isinstance(date_raw, datetime):
    +                filtered_project[k] = date_raw.strftime("%Y-%m-%d")
    +            else:
    +                filtered_project[k] = str(date_raw) if date_raw is not None else ""
             except Exception:
    -            filtered_project[k] = date_str
    +            filtered_project[k] = str(date_raw) if date_raw is not None else ""
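The same normalization can be sketched as a standalone helper, which makes the fallback behavior easy to unit test. The name `normalize_date` is hypothetical, and accepting a bare string or `datetime` in addition to a `{"$date": ...}` dict is an assumption that goes slightly beyond the diff above.

```python
from datetime import datetime


def normalize_date(value):
    """Return YYYY-MM-DD when the value can be parsed, else a string fallback."""
    # Unwrap the extended-JSON shape {"$date": ...}; a missing key yields None.
    if isinstance(value, dict):
        value = value.get("$date")
    if isinstance(value, datetime):
        return value.strftime("%Y-%m-%d")
    if isinstance(value, str):
        try:
            parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
        except ValueError:
            # Malformed strings are passed through unchanged.
            return value
        return parsed.strftime("%Y-%m-%d")
    # Anything else (None, numbers) falls back to an empty or stringified value.
    return "" if value is None else str(value)
```

Catching only `ValueError` instead of a bare `Exception` keeps unexpected errors visible while still tolerating malformed date strings.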
    safe_namespace = re.sub(r"[^a-z0-9-]", "_", raw_namespace.lower())
    # Additionally, enforce that no path traversal can happen
    if Path(safe_namespace).name != safe_namespace or not safe_namespace:
        raise ValueError(f"Unsafe or empty namespace for filename: {raw_namespace}")

    with open(project_folder.joinpath(safe_namespace + ".json"), "w") as f:
        json.dump(filtered_project, f, indent=4)
Prevent sanitized filename collisions from overwriting project exports.
Different namespaces can collapse to the same safe_namespace (e.g., team/a and team_a) and silently overwrite JSON files.
Proposed fix
    +    existing_name = filtered_project.get("namespace", "")
    +    file_name = f"{safe_namespace}.json"
    +    output_path = project_folder / file_name
    +    if output_path.exists():
    +        with open(output_path, "r") as rf:
    +            existing = json.load(rf)
    +        if existing.get("namespace") != existing_name:
    +            raise ValueError(
    +                f"Namespace collision after sanitization: '{existing_name}' -> '{file_name}'"
    +            )
    -
    -    with open(project_folder.joinpath(safe_namespace + ".json"), "w") as f:
    +    with open(output_path, "w") as f:
             json.dump(filtered_project, f, indent=4)
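An alternative to raising on collision is to make every filename unique by construction. The sketch below always appends a short hash of the raw namespace; `export_filename` is a hypothetical helper, and unlike the fix above it would change the filenames of all existing exports.

```python
import hashlib
import re


def export_filename(raw_namespace: str) -> str:
    """Build a path-safe, collision-resistant JSON filename (hypothetical helper)."""
    # Same sanitization as the reviewed code: keep only a-z, 0-9 and "-".
    safe = re.sub(r"[^a-z0-9-]", "_", raw_namespace.lower())
    if not safe:
        raise ValueError(f"Unsafe or empty namespace for filename: {raw_namespace!r}")
    # The suffix is derived from the raw value, so "team/a" and "team_a" differ.
    digest = hashlib.sha256(raw_namespace.encode("utf-8")).hexdigest()[:8]
    return f"{safe}-{digest}.json"
```

Because the sanitized part contains no path separators or dots, the result is always a single filename component, and the suffix stays the same across runs for the same namespace.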
Note: Docstrings generation - SUCCESS
Docstrings generation was requested by @SimoneBendazzoli93 in #150 (comment). The following files were modified:
- `dashboard/core/mongo_sync.py`
Note: Unit test generation is a public access feature. Expect some limitations and changes as we gather feedback and continue to improve it.
Created PR with unit tests: #155
Summary by CodeRabbit