Add bidirectional creator↔collection links via EAD authfilenumber and EAC-CPF descriptiveNote #35
Extract bioghist from ArchivesSpace agent records and inject into EAD:
- Retrieve bioghist notes from linked agent records
- Inject structured XML into EAD <archdesc> section
- Preserve HTML markup for proper rendering in ArcLight
- Fix bioghist element nesting per EAD schema requirements
- Add Copilot agent onboarding documentation
This enables archival collections to display biographical and historical context about creators directly in the finding aid.
Implement complete ETL pipeline for ArchivesSpace agents:
- Extract all agent records via ArchivesSpace API
- Generate EAC-CPF XML documents for each agent
- Auto-discover and configure traject indexing
- Batch index to Solr (100 files per call for performance)
- Support multiple processing modes (agents-only, collections-only, both)
- Add 11 new Solr fields for agent metadata
- Include 271-line traject config for EAC-CPF → Solr mapping
Key features:
- Parallel to existing collection record indexing
- Dynamic Solr field mapping for ArcLight compatibility
- Robust error handling and logging
- Configurable traject config discovery paths
This allows ArcLight to provide dedicated agent/creator pages with full biographical information, related collections, and authority control.
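The "100 files per call" batching mentioned in this commit can be illustrated with a small chunking helper. This is only a sketch: the helper name `batched` and the sample file names are hypothetical, and how each batch is actually handed to traject is not shown in this conversation.

```python
from typing import Iterator, List, Sequence


def batched(paths: Sequence[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of `paths`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(paths), size):
        yield list(paths[start:start + size])


# Hypothetical file names, used only to show how 250 files split into calls.
files = [f"agent_{n}.xml" for n in range(250)]
batch_sizes = [len(batch) for batch in batched(files)]
```

Each yielded batch would then become the argument list of one indexing call, so 250 files need three calls rather than 250.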
Replace per-agent API calls with single Solr query for better performance:
- Query ArchivesSpace Solr to filter agents in bulk
- Exclude system users (publish=false)
- Exclude donors (linked_agent_role includes "dnr")
- Exclude software agents (agent_type="agent_software")
- Use consistent EAC namespace prefixes in XPath queries
- Refactor dates extraction for improved readability
Performance improvement: O(n) API calls → O(1) Solr query
Reduces processing time from minutes to seconds for large repositories.
to reflect the required command line arguments
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- always use filename for id
- reduce duplicate fields and make fields dynamic
- store related agent ids, uris, and relationships in arrays
Restructured the pipeline for collections and creators to run independently with their own timestamps, proper cleanup, and parallel execution orchestrated via ThreadPoolExecutor.
Changes:
- Split last_updated into last_updated_collections and last_updated_creators
- Extract run_collections() and run_creators() from monolithic run()
- Add run_all() that orchestrates both via ThreadPoolExecutor
- Scope Solr cleanup to record type using is_creator flag
- Update process_deleted_records() to accept scope parameter
- Move update_repositories() into run_all() (only runs for full updates)
- Fix timestamp comparisons to use min() where needed
- Add directory creation safeguards (os.makedirs with exist_ok)
- Change is_creator from string 'true' to boolean true
- Add proper exception handling in parallel execution
Benefits:
- Collections and creators can be rebuilt independently (--collections-only, --agents-only)
- Full runs execute both pipelines in parallel (faster)
- Each record type maintains its own timestamp state
- Solr cleanup is scoped to avoid deleting unrelated records
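A minimal sketch of the orchestration described in this commit, assuming each pipeline is a zero-argument callable. The names run_all(), run_collections() and run_creators() come from the commit message, but the signatures, the returned status dictionary, and the stand-in bodies below are assumptions made for illustration, not the code in arcflow/main.py.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict


def run_all(pipelines: Dict[str, Callable[[], None]]) -> Dict[str, str]:
    """Run every pipeline in its own thread and record 'ok' or the error text."""
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(pipelines))) as pool:
        futures = {pool.submit(fn): name for name, fn in pipelines.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                future.result()  # re-raises any exception from the worker thread
                results[name] = "ok"
            except Exception as exc:  # one failing pipeline must not hide the other
                results[name] = f"failed: {exc}"
    return results


def run_collections() -> None:
    """Stand-in for the real collections pipeline."""


def run_creators() -> None:
    """Stand-in for the real creators pipeline; simulates a failure."""
    raise RuntimeError("Solr unavailable")


status = run_all({"collections": run_collections, "creators": run_creators})
```

Calling future.result() inside the try block is what surfaces worker exceptions, so a failure in one record type is reported without aborting the other.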
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
When used as a fallback, this will process all of our EAD/collection document customizations
@copilot we are running into a lot of digital objects in the EAC-CPF, like so:
This is creating a lot of needless network traffic as we hit that endpoint and can't parse the result. How about we verify that the URL is a resource first so we can lighten the network load? For reference, what we are after are these:
So, essentially we need to check that it has the form: repositories/{number}/resources/{number}
Updated in ae3b79e to validate URLs match the pattern
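The check requested in this thread (only fetch hrefs of the form repositories/{number}/resources/{number}) could look roughly like the following. The function name `parse_resource_uri` and the exact regular expression are assumptions; the pattern actually committed in ae3b79e is not shown in this conversation.

```python
import re
from typing import Optional, Tuple
from urllib.parse import urlparse

# Matches a path ending in /repositories/<digits>/resources/<digits>,
# which excludes digital objects and any deeper sub-path.
RESOURCE_PATH = re.compile(r"/repositories/(\d+)/resources/(\d+)/?$")


def parse_resource_uri(href: str) -> Optional[Tuple[str, str]]:
    """Return (repo_id, resource_id) when href points at a resource, else None."""
    match = RESOURCE_PATH.search(urlparse(href).path)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Rejecting non-resource hrefs before calling the API avoids the wasted request entirely, which is the network saving asked for above.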
ae3b79e to 2adc57d
arcflow/main.py
Outdated
    closing_tag = match.group(3)

    # Idempotent: skip if descriptiveNote already added
    if '<descriptiveNote>' in content:
This is too broad: <descriptiveNote> is a valid element and might already be in the EAC-CPF. You need to look for the full pattern of what we are making. Keep in mind that the ead_id can take a variety of forms.
2adc57d to 56c619c
alexdryden left a comment
@copilot see feedback for changes. If you have any questions or something seems unclear, stop and ask before you proceed.
Fixed in 1b35516. Changed the check to match the specific pattern we create: <descriptiveNote>\s*<p>ead_id:[^<]+</p>\s*</descriptiveNote> instead of just looking for any <descriptiveNote> tag.
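For illustration, the narrowed idempotency check can be exercised on its own. The regular expression is the one quoted in the reply above; the helper name `has_ead_id_note` and the two sample strings are hypothetical, and the sketch assumes un-prefixed element names.

```python
import re

# Pattern quoted in the reply: only the note this pipeline itself writes.
EAD_ID_NOTE = re.compile(
    r"<descriptiveNote>\s*<p>ead_id:[^<]+</p>\s*</descriptiveNote>"
)


def has_ead_id_note(relation_content: str) -> bool:
    """True when the resourceRelation body already carries an ead_id note."""
    return EAD_ID_NOTE.search(relation_content) is not None


# A note added by the pipeline versus an unrelated, pre-existing note.
ours = "<descriptiveNote><p>ead_id:ALA.9.5.16</p></descriptiveNote>"
other = "<descriptiveNote><p>Finding aid for the records.</p></descriptiveNote>"
```

Because `[^<]+` accepts any run of characters up to the closing tag, the check does not depend on the ead_id following one particular format.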
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@copilot apply changes based on the comments in this thread that are not already resolved. Note that some of them that relate to the namespacing are partially resolved.
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Applied namespace-aware fixes in commit 0463690:
All 17 tests passing.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- AgentService._extract_paragraphs() now returns plain text instead of pre-wrapped <p> tags
- XmlTransformService.build_bioghist_element() creates <p> elements using ElementTree for proper XML escaping
- This prevents parse errors when bioghist content contains &, <, >, etc.
- Updated all tests to expect plain text paragraphs
- Added test to verify XML special characters are properly escaped
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
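The escaping behaviour this commit relies on can be seen in a small standalone sketch. The function `build_bioghist` below is a simplified, hypothetical stand-in for XmlTransformService.build_bioghist_element(), whose real signature is not shown here.

```python
import xml.etree.ElementTree as ET
from typing import List


def build_bioghist(paragraphs: List[str]) -> str:
    """Wrap plain-text paragraphs in <p> elements inside a <bioghist> element."""
    bioghist = ET.Element("bioghist")
    for text in paragraphs:
        # Assigning .text lets ElementTree escape &, < and > on serialization.
        ET.SubElement(bioghist, "p").text = text
    return ET.tostring(bioghist, encoding="unicode")


xml = build_bioghist(["Smith & Sons <est. 1901>"])
# xml == '<bioghist><p>Smith &amp; Sons &lt;est. 1901&gt;</p></bioghist>'
```

Building the elements rather than concatenating strings means the result always parses again, whatever characters the agent note contains.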
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
arcflow/services/agent_service.py
Outdated
        indent_size: Indentation size for logging

    Returns:
        List of paragraph strings wrapped in <p> tags
The _extract_paragraphs docstring says it returns strings “wrapped in <p> tags”, but the implementation returns plain text lines and relies on build_bioghist_element to wrap them. Updating the docstring will prevent consumers from double-wrapping or misusing the returned data.
Suggested change:
- List of paragraph strings wrapped in <p> tags
+ List of plain text paragraph strings (not wrapped in <p> tags)
    eac_cpf_xml = response.text

    # Parse the EAC-CPF XML to validate and inspect its structure
    try:
        root = ET.fromstring(eac_cpf_xml)
        self.log.debug(f'{indent}Parsed EAC-CPF XML root element: {root.tag}')
    except ET.ParseError as e:
        self.log.error(f'{indent}Failed to parse EAC-CPF XML for {agent_uri}: {e}')
        return None
    # Add collection ead_ids to resourceRelation creatorOf elements
    eac_cpf_xml = self.xml_transform.add_collection_links_to_eac_cpf(eac_cpf_xml, indent_size=indent_size)

    # Generate creator ID
The PR description mentions refactoring task_agent() to use a validate_eac_cpf_xml() service, but the updated code no longer validates the EAC-CPF document before writing it (it only attempts a parse inside add_collection_links_to_eac_cpf). If validation is still required, consider calling validate_eac_cpf_xml() here and aborting on invalid XML, or update the PR description if validation was intentionally removed.
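One way the validate-then-abort step suggested here might look, as a sketch only. The name validate_eac_cpf_xml() comes from the PR description, but this signature and return type are assumptions, and the check covers well-formedness only, not conformance to the EAC-CPF schema.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def validate_eac_cpf_xml(xml_text: str) -> Optional[ET.Element]:
    """Return the parsed root element if the text is well-formed XML, else None."""
    try:
        return ET.fromstring(xml_text)
    except ET.ParseError:
        return None


good = '<eac-cpf xmlns="urn:isbn:1-931666-33-4"><control/></eac-cpf>'
bad = "<eac-cpf><control></eac-cpf>"  # mismatched tags

root = validate_eac_cpf_xml(good)
rejected = validate_eac_cpf_xml(bad)
```

A caller such as task_agent() could return early when the result is None, before any file is written.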
    # Convert back to string
    result = ET.tostring(root, encoding='unicode', method='xml')
    return result
Serializing with ET.tostring() after parsing will typically rewrite the original namespace prefixes (e.g., default EAD/EAC namespaces often become ns0:) and will drop the XML declaration. If downstream tooling expects ArchivesSpace’s original namespace/prefix layout or the <?xml ...?> prolog, consider registering the document’s default namespace before serializing and/or writing with ElementTree.write(..., xml_declaration=True, encoding='utf-8') to preserve the prolog.
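The two remedies named in this comment can be combined in a short sketch. The function name `serialize_with_default_ns` is hypothetical; the namespace URI is the one used in the test fixture in this review. Note that ET.register_namespace() changes module-wide state, so it affects every later serialization in the process.

```python
import io
import xml.etree.ElementTree as ET

EAD_NS = "urn:isbn:1-931666-22-9"  # namespace URI taken from the test fixture


def serialize_with_default_ns(xml_text: str, namespace: str) -> bytes:
    """Round-trip XML while keeping `namespace` as the un-prefixed default."""
    # An empty prefix tells ElementTree to emit xmlns="..." instead of ns0:.
    ET.register_namespace("", namespace)
    root = ET.fromstring(xml_text)
    buffer = io.BytesIO()
    # write() can emit the <?xml ...?> prolog that tostring() drops by default.
    ET.ElementTree(root).write(buffer, encoding="utf-8", xml_declaration=True)
    return buffer.getvalue()


source = f'<ead xmlns="{EAD_NS}"><archdesc level="collection"/></ead>'
output = serialize_with_default_ns(source, EAD_NS).decode("utf-8")
```

Without the register_namespace() call, the same round trip would typically produce ns0:ead and ns0:archdesc, which downstream XPath or traject rules written against un-prefixed output may not expect.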
    def test_multiple_creators_with_namespace(self):
            """Test handling multiple creators when EAD has default namespace."""
            xml_with_namespace = '''<?xml version="1.0" encoding="UTF-8"?>
            <ead xmlns="urn:isbn:1-931666-22-9">
              <archdesc level="collection">
                <did>
                  <origination label="Creator">
                    <corpname source="lcnaf">First Corp</corpname>
                  </origination>
                  <origination label="Creator">
                    <persname source="lcnaf">Second Person</persname>
                  </origination>
                </did>
              </archdesc>
            </ead>'''

            resource = {
                'linked_agents': [
                    {'role': 'creator', 'ref': '/agents/corporate_entities/123'},
                    {'role': 'creator', 'ref': '/agents/people/456'}
                ]
            }

            result = self.service.add_creator_ids_to_ead(xml_with_namespace, resource)

            # Should add both creator IDs
            self.assertIn('creator_id="creator_corporate_entities_123"', result)
            self.assertIn('creator_id="creator_people_456"', result)
            # Should preserve namespace
            self.assertIn('urn:isbn:1-931666-22-9', result)
This test method’s body is indented more than necessary (extra level under the def), which is easy to miss in reviews and makes future edits error-prone. Align the docstring and body to a single indentation level inside the method (4 spaces beyond the def).
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@copilot apply changes based on the comments in this thread
- Update docstring in AgentService._extract_paragraphs to clarify it returns plain text
- Add EAC-CPF XML validation in task_agent() before processing
- Fix test indentation in test_multiple_creators_with_namespace
- Add comment documenting ElementTree serialization behavior with namespaces
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Applied changes in commit 82c2e23:
- creator_arclight_ids_ssim to EAD
- task_resource(): calls new add_creator_ids_to_ead() helper
- example_traject_config_ead_extra.rb shows how to index arcuit:creator_id attributes
- collection_arclight_ids_ssim to EAC-CPF
- task_agent(): calls new add_collection_links_to_eac_cpf() helper
- example_traject_config_eac_cpf.rb: new to_field 'collection_arclight_ids_ssim'
- add_collection_links_to_eac_cpf() to look for full pattern
- XmlTransformService with XML manipulation methods
- AgentService for agent data fetching
- task_resource() to use inject_collection_metadata() service
- task_agent() to use validate_eac_cpf_xml() service
- get_creator_bioghist() to use AgentService and XmlTransformService
Original prompt
Background
We are improving how creator records and collections are linked in ArcLight/Solr. Previously, we inferred creator information from bioghist statements in the first search result, which was clunky and error-prone when multiple creators were involved or when collections had custom bioghist notes.
Goal
Create explicit bidirectional links between creators and collections in Solr by:
- creator_arclight_ids_ssim field to collection records containing array of creator IDs
- collection_arclight_ids_ssim field to creator records containing array of collection EAD IDs
Implementation Requirements
Part 1: Collections → Creators Linkage
Location: arcflow/main.py in task_resource() method
Approach:
- linked_agents (line ~235-239)
- role='creator'
- Agent URIs (e.g., /agents/corporate_entities/123) to creator IDs (e.g., creator_corporate_entities_123)
- <origination> elements
- authfilenumber attribute to the name elements (<corpname>, <persname>, <famname>) with the corresponding creator ID
- authfilenumber is a valid EAD attribute for authority file identifiers
Traject Config Update:
- arclight_dir/lib/arcuit/traject/ead_extra_config.rb (or wherever EAD collection indexing happens)
- authfilenumber attributes from origination elements
- creator_arclight_ids_ssim (string, stored, indexed, multivalued)
Part 2: Creators → Collections Linkage
Location: arcflow/main.py in task_agent() method
Approach:
- <resourceRelation resourceRelationType="creatorOf"> elements
- xlink:href attribute values (ArchivesSpace resource URIs like https://archivesspace-stage.library.illinois.edu/repositories/2/resources/200)
- self.client.get('/repositories/{repo_id}/resources/{resource_id}').json()
- ead_id from the response
- <descriptiveNote><p>ead_id:{ead_id}</p></descriptiveNote> child element to each <resourceRelation> element (standard EAC-CPF element)
Traject Config Update:
- example_traject_config_eac_cpf.rb
- <resourceRelation><descriptiveNote> elements
- collection_arclight_ids_ssim (string, stored, indexed, multivalued)
Technical Details
Creator ID Format
Use the existing pattern: creator_{agent_type}_{id}
- corporate_entities, people, families
- creator_corporate_entities_123
Collection ID Format
Use the ead_id field from ArchivesSpace resources
- ALA.9.5.16
Helper Functions
Reuse the pattern from the existing traject config:
- aspace_uri_to_solr_id(uri) - converts ArchivesSpace URIs to creator IDs
- main.py
Error Handling
Example Data Flow
EAC-CPF Input (from ArchivesSpace):
Modified EAC-CPF (after processing):
EAD Input (from ArchivesSpace):
Modified EAD (after processing):
Branch Strategy
- index_creators (contains all the creator indexing infrastructure)
- index_creators (will be merged to main separately)
Testing Considerations
This pull request was created from Copilot chat.