Add Creator Record Generation and Automatic Indexing #8
alexdryden wants to merge 11 commits into main from
Conversation
Pull request overview
Implements Phase 1 of standalone creator/agent support by generating EAC-CPF XML from ArchivesSpace agents and indexing those agent records into Solr via Traject for ArcLight discovery.
Changes:
- Add a Traject config for mapping EAC-CPF creator XML to Solr fields.
- Extend ArcFlow to fetch agents from ArchivesSpace, generate EAC-CPF XML files, and optionally index them.
- Document new creator-record workflow and CLI flags in the README.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| traject_config_eac_cpf.rb | New Traject mapping for EAC-CPF → Solr creator documents. |
| arcflow/main.py | Adds agent extraction, EAC-CPF generation, creator indexing, and new CLI flags. |
| README.md | Documents quick start, creator record behavior, indexing, and schema prerequisites. |
@alexdryden I've opened a new pull request, #9, to work on those changes. Once the pull request is ready, I'll request review from you.

From the meeting:

@alexdryden I've opened a new pull request, #10, to work on those changes. Once the pull request is ready, I'll request review from you.

@alexdryden I've opened a new pull request, #11, to work on those changes. Once the pull request is ready, I'll request review from you.
@copilot we need to exclude some agents from being indexed into ArcLight Solr because ArchivesSpace is too aggressive in what it considers an agent. ArchivesSpace will make users (i.e., users of the ASpace software) agents, as well as donors. Our aim is for researchers to be able to browse the agents who are responsible for the archival material and see their relationships to one another and the content they create. We don't want (for various reasons, including privacy/security) users and donors to appear in that collection of agents. I'd like you to consult the ArchivesSpace repo https://github.com/archivesspace/archivesspace in detail to understand this issue and learn how agents are represented in the codebase, how they can be discriminated, and how that data can be accessed and assessed from the various API endpoints we have access to, and then to design a solution that would exclude these non-target agents from our indexing routines. Also, report back if you find other instances of agents that don't fit our needs assessment (summarized above). One idea we had was to infer the target types of agent by detecting whether they have a bioghist statement or any relationships present, as we didn't see those in some of our test runs with the non-target agents--but that could have just been a fluke or an artifact of running this code before some of the bug fixes. So, try to come up with your own solution based on what you find in the ArchivesSpace code and/or the forum discussions here: https://groups.google.com/a/lyrasislists.org/g/Archivesspace_Users_Group. I expect this to be a difficult, time-consuming task that may require multiple passes and consume as many as 10 premium requests. If you reach 10 premium requests before arriving at a solution, report back to me with your results and we'll discuss.
@alexdryden I've opened a new pull request, #12, to work on those changes. Once the pull request is ready, I'll request review from you. |
|
@alexdryden I've opened a new pull request, #13, to work on those changes. Once the pull request is ready, I'll request review from you. |
|
@copilot can you review how the traject indexing file is being discovered and where it is being stored? At the very least I'd like it to be logged to the user running the script, but I'd also like your input on where it should belong (i.e., in arcflow or in arcuit). Recall that arcflow is a data pipeline between ArchivesSpace and ArcLight, and that arcuit is the UIUC gem that serves as the home for our ArcLight customizations. I don't want to accidentally run the wrong version of the traject config, and I don't want to wonder whether arcflow is configured to work with a version other than the one that has been loaded.
@alexdryden I've opened a new pull request, #14, to work on those changes. Once the pull request is ready, I'll request review from you. |
Extract bioghist from ArchivesSpace agent records and inject into EAD:
- Retrieve bioghist notes from linked agent records
- Inject structured XML into EAD <archdesc> section
- Preserve HTML markup for proper rendering in ArcLight
- Fix bioghist element nesting per EAD schema requirements
- Add Copilot agent onboarding documentation

This enables archival collections to display biographical and historical context about creators directly in the finding aid.
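The injection step described in this commit can be sketched as follows. This is an illustrative helper, not the code from arcflow/main.py; the note shape ('note_bioghist' with 'subnotes') follows the ArchivesSpace JSONModel, and the fabricated example data stands in for a record fetched from the ASpace API.

```python
import xml.etree.ElementTree as ET

def inject_bioghist(ead_root, agent_json):
    """Copy bioghist notes from an ArchivesSpace agent JSON record into
    the <archdesc> of an EAD tree (hypothetical helper; the PR's real
    implementation may differ)."""
    archdesc = ead_root.find('.//archdesc')
    if archdesc is None:
        return ead_root
    for note in agent_json.get('notes', []):
        if note.get('jsonmodel_type') != 'note_bioghist':
            continue
        # One <bioghist> per agent note, nested inside <archdesc>
        # per the EAD schema fix mentioned in the commit.
        bioghist = ET.SubElement(archdesc, 'bioghist')
        for subnote in note.get('subnotes', []):
            p = ET.SubElement(bioghist, 'p')
            p.text = subnote.get('content', '')
    return ead_root

# Fabricated example data; a real agent record comes from the ASpace API.
ead = ET.fromstring('<ead><archdesc level="collection"></archdesc></ead>')
agent = {'notes': [{'jsonmodel_type': 'note_bioghist',
                    'subnotes': [{'content': 'Founded in 1901.'}]}]}
inject_bioghist(ead, agent)
```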
Force-pushed from e1645c2 to a6053f5
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 14 comments.
Comments suppressed due to low confidence (2)
arcflow/main.py:544
- The `delete_ead` method is called with the keyword argument `indent=4` on line 544, but the method signature on line 1222 only accepts `indent_size` as a parameter name (not `indent`). This will cause a TypeError at runtime ("got an unexpected keyword argument 'indent'"). The call should use `indent_size=4` instead.
self.delete_ead(
    resource_id,
    ead_id.replace('.', '-'),  # dashes in Solr
    f'{xml_dir}/{ead_id}.xml',  # dots in filenames
    f'{pdf_dir}/{ead_id}.pdf',
    indent=4)
arcflow/main.py:542
- Path inconsistency in deleted resource handling: line 542 constructs the XML file path as `f'{xml_dir}/{ead_id}.xml'`, but resources are now stored in `resource_dir` (defined as `f'{xml_dir}/resources'` on line 510). This should be `f'{resource_dir}/{ead_id}.xml'` to match the new directory structure established in the `update_eads` method. Similarly, `symlink_path` on line 536 correctly uses `resource_dir`, but the actual file deletion references the wrong directory.
self.delete_ead(
    resource_id,
    ead_id.replace('.', '-'),  # dashes in Solr
    f'{xml_dir}/{ead_id}.xml',  # dots in filenames
except subprocess.CalledProcessError as e:
    self.log.error(f'{indent}Error indexing pending resources in repository ID {repo_id} to ArcLight Solr: {e}')
The exception handler on line 618 catches subprocess.CalledProcessError, but this exception will never be raised because subprocess.run() is called without check=True on line 604. The code relies on checking result.returncode instead (line 614), making this except block dead code. Either remove the exception handler or add check=True to subprocess.run() and handle the exception properly.
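One way to resolve the dead except block is the `check=True` route the comment suggests. This is a hedged sketch, not the PR's code: `run_indexer` is a hypothetical helper name, and the real call site in arcflow/main.py builds a different command.

```python
import logging
import subprocess

log = logging.getLogger(__name__)

def run_indexer(cmd, cwd):
    """Run an external indexing command; with check=True a non-zero
    exit raises CalledProcessError, so the handler below is reachable
    (unlike the dead except block flagged in the review)."""
    try:
        subprocess.run(cmd, cwd=cwd, check=True,
                       capture_output=True, text=True)
        return True
    except subprocess.CalledProcessError as e:
        log.error('Indexing failed (exit %s): %s', e.returncode, e.stderr)
        return False
```

The alternative is equally valid: drop the except block entirely and keep the explicit `result.returncode` check; the bug is only in mixing the two styles.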
@copilot open a new pull request to apply changes based on this feedback

@alexdryden I've opened a new pull request, #25, to work on those changes. Once the pull request is ready, I'll request review from you.

@alexdryden I've opened a new pull request, #26, to work on those changes. Once the pull request is ready, I'll request review from you.

@alexdryden I've opened a new pull request, #27, to work on those changes. Once the pull request is ready, I'll request review from you.

@alexdryden I've opened a new pull request, #28, to work on those changes. Once the pull request is ready, I'll request review from you.
Implement complete ETL pipeline for ArchivesSpace agents:
- Extract all agent records via ArchivesSpace API
- Generate EAC-CPF XML documents for each agent
- Auto-discover and configure traject indexing
- Batch index to Solr (100 files per call for performance)
- Support multiple processing modes (agents-only, collections-only, both)
- Add 11 new Solr fields for agent metadata
- Include 271-line traject config for EAC-CPF → Solr mapping

Key features:
- Parallel to existing collection record indexing
- Dynamic Solr field mapping for ArcLight compatibility
- Robust error handling and logging
- Configurable traject config discovery paths

This allows ArcLight to provide dedicated agent/creator pages with full biographical information, related collections, and authority control.
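The batching step above can be sketched as follows. The traject invocation is an assumption for illustration, not the PR's exact command line; only the 100-files-per-call batching comes from the commit message.

```python
import subprocess
from pathlib import Path

BATCH_SIZE = 100  # files per traject call, per the commit message above

def chunk(items, size=BATCH_SIZE):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def index_in_batches(xml_dir, traject_config):
    """Index EAC-CPF files batch-by-batch so one traject startup cost
    is amortized over many documents (illustrative sketch)."""
    files = sorted(str(p) for p in Path(xml_dir).glob('*.xml'))
    for batch in chunk(files):
        # Hypothetical command line; verify against your traject setup.
        subprocess.run(['traject', '-c', traject_config, *batch], check=True)
    return len(files)
```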
Replace per-agent API calls with single Solr query for better performance:
- Query ArchivesSpace Solr to filter agents in bulk
- Exclude system users (publish=false)
- Exclude donors (linked_agent_role includes "dnr")
- Exclude software agents (agent_type="agent_software")
- Use consistent EAC namespace prefixes in XPath queries
- Refactor dates extraction for improved readability

Performance improvement: O(n) API calls → O(1) Solr query. Reduces processing time from minutes to seconds for large repositories.

to reflect the required command line arguments

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
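The exclusion filters this commit describes can be expressed as one Solr query. A hedged sketch: the field names (`primary_type`, `publish`, `linked_agent_roles`, `agent_type`) mirror common ArchivesSpace Solr schema fields but should be verified against your instance before relying on them.

```python
import urllib.parse

def build_agent_filter_query():
    """Build the query string for a single bulk agent-filtering request
    against ArchivesSpace Solr (field names are assumptions)."""
    params = {
        'q': 'primary_type:(agent_person OR agent_corporate_entity OR agent_family)',
        'fq': ['publish:true',                 # drop unpublished system users
               '-linked_agent_roles:dnr',      # drop donors
               '-agent_type:agent_software'],  # drop software agents
        'fl': 'id,title',
        'rows': '10000',
        'wt': 'json',
    }
    # doseq=True repeats fq= once per filter clause, as Solr expects.
    return urllib.parse.urlencode(params, doseq=True)

query = build_agent_filter_query()
```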
log record if filename is not expected pattern: creator_{type}_{id}
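The filename check this commit adds can be sketched like so. The helper name and log wording are illustrative; only the `creator_{type}_{id}` pattern comes from the commit message.

```python
import logging
import re

log = logging.getLogger(__name__)

# Expected pattern from the commit above: creator_{type}_{id}.xml
CREATOR_FILENAME = re.compile(r'^creator_(?P<type>[a-z_]+)_(?P<id>\d+)\.xml$')

def parse_creator_filename(filename):
    """Return (agent_type, agent_id) if the filename matches the expected
    pattern; otherwise log the record and return None (sketch; arcflow's
    real log message may differ)."""
    m = CREATOR_FILENAME.match(filename)
    if m is None:
        log.warning('Skipping %s: does not match creator_{type}_{id} pattern',
                    filename)
        return None
    return m.group('type'), int(m.group('id'))
```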
Force-pushed from ff41cdb to 7b9522a
… conservative list of things we know are relevant
…tring command with shell=True
Force-pushed from b57eeac to 5952798
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
Comments suppressed due to low confidence (1)
arcflow/main.py:477
- This logs `stderr` at error level unconditionally, even when it's empty on successful `rm` runs. That will produce noisy "error" logs during normal operation. Consider logging stderr only when non-empty and/or when `returncode != 0` (and possibly using info/debug for successful cleanup).
result = subprocess.run(
    f'rm {xml_file_path}',
    shell=True,
    cwd=self.arclight_dir,
    stderr=subprocess.PIPE,)
self.log.error(f'{" " * indent_size}{result.stderr.decode("utf-8")}')
if result.returncode != 0:
    self.log.error(f'{" " * indent_size}Failed to remove pending symlinks {xml_file_path}. Return code: {result.returncode}')
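A sketch of the fix the reviewer suggests, which also sidesteps `shell=True` by removing the file in-process instead of shelling out to `rm` (the function name is illustrative, not from arcflow):

```python
import logging
import os

log = logging.getLogger(__name__)

def remove_file(xml_file_path, indent_size=4):
    """Remove a file, logging at error level only on failure and at
    debug level for successful cleanup, so routine runs stay quiet."""
    pad = ' ' * indent_size
    try:
        os.remove(xml_file_path)
        log.debug('%sRemoved %s', pad, xml_file_path)
        return True
    except OSError as e:
        log.error('%sFailed to remove %s: %s', pad, xml_file_path, e)
        return False
```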
@alexdryden I've opened a new pull request, #31, to work on those changes. Once the pull request is ready, I'll request review from you.
…-scope runs

Restructured the pipeline for collections and creators to run independently with their own timestamps, proper cleanup, and parallel execution orchestrated via ThreadPoolExecutor.

Changes:
- Split last_updated into last_updated_collections and last_updated_creators
- Extract run_collections() and run_creators() from monolithic run()
- Add run_all() that orchestrates both via ThreadPoolExecutor
- Scope Solr cleanup to record type using is_creator flag
- Update process_deleted_records() to accept scope parameter
- Move update_repositories() into run_all() (only runs for full updates)
- Fix timestamp comparisons to use min() where needed
- Add directory creation safeguards (os.makedirs with exist_ok)
- Change is_creator from string 'true' to boolean true
- Add proper exception handling in parallel execution

Benefits:
- Collections and creators can be rebuilt independently (--collections-only, --agents-only)
- Full runs execute both pipelines in parallel (faster)
- Each record type maintains its own timestamp state
- Solr cleanup is scoped to avoid deleting unrelated records
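The orchestration described in this commit can be sketched as follows. The function names mirror the commit message, but the bodies here are illustrative, not the code from arcflow/main.py.

```python
from concurrent.futures import ThreadPoolExecutor

def run_all(run_collections, run_creators):
    """Run both pipelines in parallel and surface worker failures
    (sketch of the run_all() orchestration described above)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = {'collections': pool.submit(run_collections),
                   'creators': pool.submit(run_creators)}
        results = {}
        for scope, future in futures.items():
            # .result() re-raises any exception from inside the worker,
            # so a failure in one pipeline is not silently swallowed.
            results[scope] = future.result()
        return results
```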
Force-pushed from f549e77 to 51d2eea
I'm closing this to open a fresh instance for human reviewers.
Overview
This PR implements Phase 1 of agent records support: generating standalone agent records from ArchivesSpace and automatically indexing them to Solr for discovery in ArcLight.
Problem Statement
ArchivesSpace agents (people, organizations, families) were not discoverable as standalone entities in ArcLight. Users could not:
Solution
This PR adds automated agent record generation and indexing:
Architecture
Processing Workflow
Step 1: Agent ID Extraction
Step 2: EAC-CPF XML Document Generation
-- Note: repository 1 is always used, as the repository context does not appear to affect the generated output
Step 3: Indexing (Transform and Load)
Usage
Basic Processing
Full workflow (collections + agents)
Added Processing Modes
Integration Steps
Known Limitations
Questions for Leo
Does the implementation adhere to the patterns you expect? Are there any maintenance red flags?
Are you okay with the additional CLI arguments?
Are there additional agent metadata fields you think we'll need?