Add bidirectional creator↔collection links via EAD authfilenumber and EAC-CPF descriptiveNote #35

Open
Copilot wants to merge 37 commits into main from copilot/add-bidirectional-creator-collection-links

Contributor

Copilot AI commented Mar 9, 2026

  • Part 1: Collections → Creators - Add creator_arclight_ids_ssim to EAD
    • In task_resource(): calls new add_creator_ids_to_ead() helper
    • New example_traject_config_ead_extra.rb shows how to index arcuit:creator_id attributes
  • Part 2: Creators → Collections - Add collection_arclight_ids_ssim to EAC-CPF
    • In task_agent(): calls new add_collection_links_to_eac_cpf() helper
    • example_traject_config_eac_cpf.rb: new to_field 'collection_arclight_ids_ssim'
  • Refactoring based on code review
    • Fix idempotent check in add_collection_links_to_eac_cpf() to look for full pattern
    • Create XmlTransformService with XML manipulation methods
    • Create AgentService for agent data fetching
    • Add tests for both services (28 tests, all passing)
    • Refactor task_resource() to use inject_collection_metadata() service
    • Refactor task_agent() to use validate_eac_cpf_xml() service
    • Refactor get_creator_bioghist() to use AgentService and XmlTransformService
    • Delegate XML transformations to services (add_creator_ids_to_ead, add_collection_links_to_eac_cpf)
    • Use relative imports for services
    • Add package documentation
    • Refactor XML manipulation to use ElementTree parser instead of string manipulation
    • Make all XML transformations namespace-aware for proper handling of EAD and EAC-CPF namespaces
    • Use ElementTree for building bioghist elements to ensure proper XML escaping
    • Restore validate_eac_cpf_xml method with namespace fallback handling
    • Properly escape XML special characters in bioghist paragraph content
    • Address code review feedback (docstrings, validation, test formatting, namespace comments)
Original prompt

Background

We are improving how creator records and collections are linked in ArcLight/Solr. Previously, we inferred creator information from bioghist statements in the first search result, which was clunky and error-prone when multiple creators were involved or when collections had custom bioghist notes.

Goal

Create explicit bidirectional links between creators and collections in Solr by:

  1. Adding creator_arclight_ids_ssim field to collection records containing array of creator IDs
  2. Adding collection_arclight_ids_ssim field to creator records containing array of collection EAD IDs

Implementation Requirements

Part 1: Collections → Creators Linkage

Location: arcflow/main.py in task_resource() method

Approach:

  1. After fetching the resource with resolved linked_agents (line ~235-239)
  2. Extract agent refs where role='creator'
  3. Convert ArchivesSpace agent URIs (e.g., /agents/corporate_entities/123) to creator IDs (e.g., creator_corporate_entities_123)
  4. Parse the EAD XML and find <origination> elements
  5. Add authfilenumber attribute to the name elements (<corpname>, <persname>, <famname>) with the corresponding creator ID
  6. This keeps the EAD standards-compliant since authfilenumber is a valid EAD attribute for authority file identifiers
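A minimal sketch of steps 4 and 5, assuming the creator IDs have already been derived from the resource's linked agents. The function name `add_authfilenumbers` and the rule that the n-th creator ID belongs to the n-th origination name element are illustrative assumptions, not the code in this PR.

```python
import xml.etree.ElementTree as ET

NAME_TAGS = {"corpname", "persname", "famname"}


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def add_authfilenumbers(ead_xml: str, creator_ids: list[str]) -> str:
    """Set authfilenumber on name elements inside <origination>, in document order.

    Assumption: the n-th creator ID belongs to the n-th origination name element.
    """
    root = ET.fromstring(ead_xml)
    names = [
        child
        for elem in root.iter()
        if local_name(elem.tag) == "origination"
        for child in elem
        if local_name(child.tag) in NAME_TAGS
    ]
    for name_elem, creator_id in zip(names, creator_ids):
        name_elem.set("authfilenumber", creator_id)
    return ET.tostring(root, encoding="unicode")
```

Comparing local names rather than full tags lets the same code handle EAD with or without a default namespace. A real implementation would probably need a sturdier way to match agents to name elements than document order.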

Traject Config Update:

  • File: arclight_dir/lib/arcuit/traject/ead_extra_config.rb (or wherever EAD collection indexing happens)
  • Extract authfilenumber attributes from origination elements
  • Index as creator_arclight_ids_ssim (string, stored, indexed, multivalued)

Part 2: Creators → Collections Linkage

Location: arcflow/main.py in task_agent() method

Approach:

  1. After fetching EAC-CPF XML (line ~880)
  2. Parse the XML to find <resourceRelation resourceRelationType="creatorOf"> elements
  3. Extract xlink:href attribute values (ArchivesSpace resource URIs like https://archivesspace-stage.library.illinois.edu/repositories/2/resources/200)
  4. For each URI:
    • Extract repo_id and resource_id
    • Use ASnake client to fetch the resource: self.client.get('/repositories/{repo_id}/resources/{resource_id}').json()
    • Get the ead_id from the response
    • Handle errors: if resource can't be fetched (deleted, unpublished, etc.), log warning and skip that collection
  5. Add <descriptiveNote><p>ead_id:{ead_id}</p></descriptiveNote> child element to each <resourceRelation> element (standard EAC-CPF element)
  6. Save the modified EAC-CPF XML
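A minimal sketch of step 5, assuming the ead_ids have already been looked up and are passed in as a mapping from `xlink:href` value to ead_id. The function name `add_ead_id_notes` and the mapping parameter are hypothetical and stand in for the ArchivesSpace fetch described in step 4.

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def add_ead_id_notes(eac_xml: str, ead_ids_by_href: dict[str, str]) -> str:
    """Append <descriptiveNote><p>ead_id:...</p></descriptiveNote> to creatorOf relations."""
    root = ET.fromstring(eac_xml)
    for rel in root.iter():
        namespace, _, local = rel.tag.rpartition("}")
        if local != "resourceRelation" or rel.get("resourceRelationType") != "creatorOf":
            continue
        ead_id = ead_ids_by_href.get(rel.get(XLINK_HREF, ""))
        if ead_id is None:
            continue  # unresolved or non-resource link: leave the element untouched
        # Reuse the parent's namespace (if any) so the new elements match it.
        prefix = f"{namespace}}}" if namespace else ""
        note = ET.SubElement(rel, f"{prefix}descriptiveNote")
        ET.SubElement(note, f"{prefix}p").text = f"ead_id:{ead_id}"
    return ET.tostring(root, encoding="unicode")
```

Relations whose href is missing from the mapping are left as they are, which matches the "log warning and skip" behaviour asked for in step 4.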

Traject Config Update:

  • File: example_traject_config_eac_cpf.rb
  • Extract ead_ids from <resourceRelation><descriptiveNote> elements
  • Index as collection_arclight_ids_ssim (string, stored, indexed, multivalued)

Technical Details

Creator ID Format

Use the existing pattern: creator_{agent_type}_{id}

  • Entity types: corporate_entities, people, families
  • Example: creator_corporate_entities_123

Collection ID Format

Use the ead_id field from ArchivesSpace resources

  • Example: ALA.9.5.16

Helper Functions

Reuse the pattern from the existing traject config:

  • aspace_uri_to_solr_id(uri) - converts ArchivesSpace URIs to creator IDs
  • Create Python equivalent for use in main.py
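A minimal sketch of what the Python equivalent might look like, based only on the URI shape and ID pattern given above. The name `aspace_uri_to_creator_id` and the choice to return `None` for unrecognized URIs are illustrative assumptions.

```python
import re
from typing import Optional

# Agent types named in the "Creator ID Format" section above.
_AGENT_URI = re.compile(r"^/agents/(corporate_entities|people|families)/(\d+)$")


def aspace_uri_to_creator_id(uri: str) -> Optional[str]:
    """Convert an ArchivesSpace agent URI to a creator ID, or None if it does not match."""
    match = _AGENT_URI.match(uri.strip())
    if match is None:
        return None
    agent_type, agent_id = match.groups()
    return f"creator_{agent_type}_{agent_id}"
```

For example, `aspace_uri_to_creator_id("/agents/corporate_entities/123")` gives `"creator_corporate_entities_123"`.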

Error Handling

  • If fetching a resource fails (404, permission error, etc.):
    • Log a warning with the resource URI
    • Skip that specific collection
    • Continue processing the creator and other collections
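The skip-and-continue behaviour could be sketched as follows. `fetch_resource` is a hypothetical callable standing in for the ASnake client call, and the logger name is made up for the example.

```python
import logging

log = logging.getLogger("arcflow.sketch")


def collect_ead_ids(resource_uris, fetch_resource):
    """Return ead_ids for the URIs that can be fetched; warn and skip the rest.

    fetch_resource(uri) is a hypothetical callable that returns the resource
    as a dict, or raises on a 404, permission error, and so on.
    """
    ead_ids = []
    for uri in resource_uris:
        try:
            resource = fetch_resource(uri)
        except Exception as exc:  # broad on purpose: any failure skips one collection
            log.warning("Could not fetch resource %s: %s", uri, exc)
            continue
        ead_id = resource.get("ead_id")
        if not ead_id:
            log.warning("Resource %s has no ead_id; skipping", uri)
            continue
        ead_ids.append(ead_id)
    return ead_ids
```

One failed lookup costs only that collection; the creator and its remaining collections are still processed.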

Example Data Flow

EAC-CPF Input (from ArchivesSpace):

<resourceRelation resourceRelationType="creatorOf" 
                  xlink:href="https://archivesspace-stage.library.illinois.edu/repositories/2/resources/200">
  <relationEntry>Certified Public Library Administrator (CPLA)</relationEntry>
</resourceRelation>

Modified EAC-CPF (after processing):

<resourceRelation resourceRelationType="creatorOf" 
                  xlink:href="https://archivesspace-stage.library.illinois.edu/repositories/2/resources/200">
  <relationEntry>Certified Public Library Administrator (CPLA)</relationEntry>
  <descriptiveNote>
    <p>ead_id:ALA.9.5.16</p>
  </descriptiveNote>
</resourceRelation>

EAD Input (from ArchivesSpace):

<origination label="Creator">
  <corpname source="lcnaf">ALA Allied Professional Association</corpname>
</origination>

Modified EAD (after processing):

<origination label="Creator">
  <corpname source="lcnaf" authfilenumber="creator_corporate_entities_123">ALA Allied Professional Association</corpname>
</origination>

Branch Strategy

  • Base branch: index_creators (contains all the creator indexing infrastructure)
  • Target branch: index_creators (will be merged to main separately)

Testing Considerations

  • Verify that collections with multiple creators get multiple IDs in array
  • Verify that creators linked to multiple collections get multiple EAD IDs in array
  • Verify error handling when resources are unpublished/deleted
  • Verify that modified...

This pull request was created from Copilot chat.


Copilot AI and others added 9 commits February 26, 2026 13:52
Extract bioghist from ArchivesSpace agent records and inject into EAD:
- Retrieve bioghist notes from linked agent records
- Inject structured XML into EAD <archdesc> section
- Preserve HTML markup for proper rendering in ArcLight
- Fix bioghist element nesting per EAD schema requirements
- Add Copilot agent onboarding documentation

This enables archival collections to display biographical and historical
context about creators directly in the finding aid.
Implement complete ETL pipeline for ArchivesSpace agents:
- Extract all agent records via ArchivesSpace API
- Generate EAC-CPF XML documents for each agent
- Auto-discover and configure traject indexing
- Batch index to Solr (100 files per call for performance)
- Support multiple processing modes (agents-only, collections-only, both)
- Add 11 new Solr fields for agent metadata
- Include 271-line traject config for EAC-CPF → Solr mapping

Key features:
- Parallel to existing collection record indexing
- Dynamic Solr field mapping for ArcLight compatibility
- Robust error handling and logging
- Configurable traject config discovery paths

This allows ArcLight to provide dedicated agent/creator pages with
full biographical information, related collections, and authority control.
Replace per-agent API calls with single Solr query for better performance:
- Query ArchivesSpace Solr to filter agents in bulk
- Exclude system users (publish=false)
- Exclude donors (linked_agent_role includes "dnr")
- Exclude software agents (agent_type="agent_software")
- Use consistent EAC namespace prefixes in XPath queries
- Refactor dates extraction for improved readability

Performance improvement: O(n) API calls → O(1) Solr query
Reduces processing time from minutes to seconds for large repositories.

to reflect the required command line arguments

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
-always use filename for id
-reduce duplicate fields and make fields dynamic
-store related agent ids, uris, and relationships in arrays
Restructured the pipeline for collections and creators to run independently
with their own timestamps, proper cleanup, and parallel execution orchestrated
via ThreadPoolExecutor

Changes:
- Split last_updated into last_updated_collections and last_updated_creators
- Extract run_collections() and run_creators() from monolithic run()
- Add run_all() that orchestrates both via ThreadPoolExecutor
- Scope Solr cleanup to record type using is_creator flag
- Update process_deleted_records() to accept scope parameter
- Move update_repositories() into run_all() (only runs for full updates)
- Fix timestamp comparisons to use min() where needed
- Add directory creation safeguards (os.makedirs with exist_ok)
- Change is_creator from string 'true' to boolean true
- Add proper exception handling in parallel execution

Benefits:
- Collections and creators can be rebuilt independently (--collections-only, --agents-only)
- Full runs execute both pipelines in parallel (faster)
- Each record type maintains its own timestamp state
- Solr cleanup is scoped to avoid deleting unrelated records
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
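As a rough illustration of the orchestration described in this commit, here is a sketch of running two pipelines in parallel with `ThreadPoolExecutor` and collecting their failures. The standalone `run_all` signature that takes two callables is an assumption made for the example; the method in arcflow may be shaped differently.

```python
from concurrent.futures import ThreadPoolExecutor


def run_all(run_collections, run_creators):
    """Run both pipelines in parallel; return a dict of name -> exception (empty on success)."""
    errors = {}
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = {
            "collections": pool.submit(run_collections),
            "creators": pool.submit(run_creators),
        }
        for name, future in futures.items():
            try:
                future.result()  # re-raises anything the pipeline raised
            except Exception as exc:
                errors[name] = exc
    return errors
```

Collecting exceptions per pipeline means a failure in one does not hide the outcome of the other.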
Copilot AI changed the title from "[WIP] Add bidirectional links between creators and collections in Solr" to "Add bidirectional creator↔collection links via EAD authfilenumber and EAC-CPF descriptiveNote" Mar 9, 2026
When used as a fallback, this will process all of our EAD/collection
document customizations
@UIUCLibrary UIUCLibrary deleted a comment from Copilot AI Mar 10, 2026
@alexdryden
Contributor

@copilot we are running into a lot of digital objects in the EAC_CPF, like so:

      <resourceRelation resourceRelationType="creatorOf" xlink:href="https://archivesspace-stage.library.illinois.edu/repositories/7/digital_objects/627" xlink:type="simple">
        <relationEntry>Jack Sherman Baker Papers (Digital Surrogates and Born Digital Records)</relationEntry>
      </resourceRelation>

This is creating a lot of needless network traffic as we hit that endpoint and can't parse the result. How about we verify that the url is a resource first so we can lighten the network load. For reference, what we are after are these:

      <resourceRelation resourceRelationType="creatorOf" xlink:href="https://archivesspace-stage.library.illinois.edu/repositories/7/resources/1216" xlink:type="simple">
        <relationEntry>Jack Sherman Baker Papers</relationEntry>

So, essentially we need to check that it has the form: repositories/{number}/resources/{number}
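A minimal sketch of such a check, assuming a regular expression is acceptable here. The function name `parse_resource_url` is hypothetical.

```python
import re

# Matches .../repositories/{number}/resources/{number} at the end of the URL.
RESOURCE_URL = re.compile(r"(?:^|/)repositories/(\d+)/resources/(\d+)/?$")


def parse_resource_url(url: str):
    """Return (repo_id, resource_id) for a resource URL, or None for anything else."""
    match = RESOURCE_URL.search(url)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

The digital object URL above returns `None`, so it can be skipped before any request is made.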

Contributor Author

Copilot AI commented Mar 10, 2026

@copilot we are running into a lot of digital objects in the EAC_CPF, like so:

      <resourceRelation resourceRelationType="creatorOf" xlink:href="https://archivesspace-stage....

Updated in ae3b79e to validate URLs match the pattern repositories/{number}/resources/{number} before attempting to fetch them. Digital objects and other non-resource URLs are now skipped silently without generating warnings or network traffic.

@alexdryden alexdryden force-pushed the copilot/add-bidirectional-creator-collection-links branch from ae3b79e to 2adc57d Compare March 10, 2026 17:49
@alexdryden alexdryden marked this pull request as ready for review March 10, 2026 17:50
@alexdryden alexdryden self-requested a review March 10, 2026 17:50
arcflow/main.py Outdated
closing_tag = match.group(3)

# Idempotent: skip if descriptiveNote already added
if '<descriptiveNote>' in content:
Contributor

This is too broad: <descriptiveNote> is a valid element and might already be present in the EAC-CPF. You need to look for the full pattern of what we are creating. Keep in mind that the ead_id can take a variety of forms.

@alexdryden alexdryden force-pushed the copilot/add-bidirectional-creator-collection-links branch from 2adc57d to 56c619c Compare March 10, 2026 18:00
Contributor

@alexdryden alexdryden left a comment

@copilot see feedback for changes. If you have any questions or something seems unclear, stop and ask before you proceed.

arcflow/main.py Outdated
content = match.group(2)
closing_tag = match.group(3)

# Idempotent: skip if descriptiveNote already added
Contributor

This is too broad: <descriptiveNote> is a valid element and might already be present in the EAC-CPF. You need to look for the full pattern of what we are creating. Keep in mind that the ead_id can take a variety of forms.

Contributor Author

Fixed in 1b35516. Changed the check to match the specific pattern we create: <descriptiveNote>\s*<p>ead_id:[^<]+</p>\s*</descriptiveNote> instead of just looking for any <descriptiveNote> tag.
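For illustration, the narrower check could look like this sketch. The pattern is the one quoted above; the function name `already_linked` is hypothetical.

```python
import re

# The exact shape the pipeline writes; [^<]+ accepts any ead_id that contains no markup.
EAD_ID_NOTE = re.compile(r"<descriptiveNote>\s*<p>ead_id:[^<]+</p>\s*</descriptiveNote>")


def already_linked(resource_relation_content: str) -> bool:
    """True only if the content holds a descriptiveNote in the ead_id:... form."""
    return EAD_ID_NOTE.search(resource_relation_content) is not None
```

A `<descriptiveNote>` that came from ArchivesSpace with other content no longer counts as "already processed".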

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
alexdryden and others added 3 commits March 10, 2026 21:43
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@alexdryden
Contributor

@copilot apply changes based on the comments in this thread that are not already resolved. Note that some of them that relate to the namespacing are partially resolved.

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Contributor Author

Copilot AI commented Mar 11, 2026

@copilot apply changes based on the comments in this thread that are not already resolved. Note that some of them that re...

Applied namespace-aware fixes in commit 0463690:

  1. inject_collection_metadata: Now creates recordgroup, subgroup, and bioghist elements in the same namespace as the parent EAD document
  2. add_collection_links_to_eac_cpf: Detects EAC-CPF namespace and creates descriptiveNote and p elements with proper namespace
  3. build_bioghist_element: Refactored to use ElementTree for proper XML construction and escaping instead of string concatenation
  4. validate_eac_cpf_xml: Restored method with namespace fallback handling for xlink attributes
  5. Tests: Updated to handle namespaced XML output from ElementTree serialization

All 17 tests passing.

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 10 changed files in this pull request and generated 8 comments.

alexdryden and others added 4 commits March 10, 2026 22:14
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- AgentService._extract_paragraphs() now returns plain text instead of pre-wrapped <p> tags
- XmlTransformService.build_bioghist_element() creates <p> elements using ElementTree for proper XML escaping
- This prevents parse errors when bioghist content contains &, <, >, etc.
- Updated all tests to expect plain text paragraphs
- Added test to verify XML special characters are properly escaped

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
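The escaping behaviour this commit relies on can be seen in a small sketch. `build_bioghist` here is a simplified, hypothetical stand-in; the real `build_bioghist_element` may take different arguments and add more structure.

```python
import xml.etree.ElementTree as ET


def build_bioghist(paragraphs: list[str]) -> str:
    """Build a <bioghist> element from plain-text paragraphs and serialize it."""
    bioghist = ET.Element("bioghist")
    for text in paragraphs:
        # Assigning to .text lets ElementTree escape &, < and > on output.
        ET.SubElement(bioghist, "p").text = text
    return ET.tostring(bioghist, encoding="unicode")
```

Because the paragraphs arrive as plain text, characters such as `&` are escaped exactly once at serialization time, and parsing the result gives back the original text.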
alexdryden and others added 2 commits March 10, 2026 22:33
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 10 changed files in this pull request and generated 5 comments.

indent_size: Indentation size for logging

Returns:
List of paragraph strings wrapped in <p> tags

Copilot AI Mar 11, 2026

The _extract_paragraphs docstring says it returns strings “wrapped in <p> tags”, but the implementation returns plain text lines and relies on build_bioghist_element to wrap them. Updating the docstring will prevent consumers from double-wrapping or misusing the returned data.

Suggested change
- List of paragraph strings wrapped in <p> tags
+ List of plain text paragraph strings (not wrapped in <p> tags)

Comment on lines 817 to 822
eac_cpf_xml = response.text

# Parse the EAC-CPF XML to validate and inspect its structure
try:
    root = ET.fromstring(eac_cpf_xml)
    self.log.debug(f'{indent}Parsed EAC-CPF XML root element: {root.tag}')
except ET.ParseError as e:
    self.log.error(f'{indent}Failed to parse EAC-CPF XML for {agent_uri}: {e}')
    return None
# Add collection ead_ids to resourceRelation creatorOf elements
eac_cpf_xml = self.xml_transform.add_collection_links_to_eac_cpf(eac_cpf_xml, indent_size=indent_size)

# Generate creator ID

Copilot AI Mar 11, 2026

The PR description mentions refactoring task_agent() to use a validate_eac_cpf_xml() service, but the updated code no longer validates the EAC-CPF document before writing it (it only attempts a parse inside add_collection_links_to_eac_cpf). If validation is still required, consider calling validate_eac_cpf_xml() here and aborting on invalid XML, or update the PR description if validation was intentionally removed.

Comment on lines +104 to +106
# Convert back to string
result = ET.tostring(root, encoding='unicode', method='xml')
return result

Copilot AI Mar 11, 2026

Serializing with ET.tostring() after parsing will typically rewrite the original namespace prefixes (e.g., default EAD/EAC namespaces often become ns0:) and will drop the XML declaration. If downstream tooling expects ArchivesSpace’s original namespace/prefix layout or the <?xml ...?> prolog, consider registering the document’s default namespace before serializing and/or writing with ElementTree.write(..., xml_declaration=True, encoding='utf-8') to preserve the prolog.
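A small sketch of that suggestion, using the EAD namespace URI that appears in this PR's tests. The function name `serialize_with_default_ns` is hypothetical, and note that `ET.register_namespace` changes process-wide state.

```python
import io
import xml.etree.ElementTree as ET

EAD_NS = "urn:isbn:1-931666-22-9"


def serialize_with_default_ns(xml_text: str) -> bytes:
    """Round-trip XML, keeping EAD_NS as the default namespace and the XML declaration."""
    # An empty prefix maps the URI to the default (unprefixed) namespace on output.
    ET.register_namespace("", EAD_NS)
    root = ET.fromstring(xml_text)
    buffer = io.BytesIO()
    ET.ElementTree(root).write(buffer, encoding="utf-8", xml_declaration=True)
    return buffer.getvalue()
```

With the namespace registered, elements serialize without an `ns0:` prefix, and `xml_declaration=True` restores the prolog that `ET.tostring(..., encoding='unicode')` omits.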

Comment on lines +198 to +228
def test_multiple_creators_with_namespace(self):
    """Test handling multiple creators when EAD has default namespace."""
    xml_with_namespace = '''<?xml version="1.0" encoding="UTF-8"?>
<ead xmlns="urn:isbn:1-931666-22-9">
  <archdesc level="collection">
    <did>
      <origination label="Creator">
        <corpname source="lcnaf">First Corp</corpname>
      </origination>
      <origination label="Creator">
        <persname source="lcnaf">Second Person</persname>
      </origination>
    </did>
  </archdesc>
</ead>'''

    resource = {
        'linked_agents': [
            {'role': 'creator', 'ref': '/agents/corporate_entities/123'},
            {'role': 'creator', 'ref': '/agents/people/456'}
        ]
    }

    result = self.service.add_creator_ids_to_ead(xml_with_namespace, resource)

    # Should add both creator IDs
    self.assertIn('creator_id="creator_corporate_entities_123"', result)
    self.assertIn('creator_id="creator_people_456"', result)
    # Should preserve namespace
    self.assertIn('urn:isbn:1-931666-22-9', result)

Copilot AI Mar 11, 2026

This test method’s body is indented more than necessary (extra level under the def), which is easy to miss in reviews and makes future edits error-prone. Align the docstring and body to a single indentation level inside the method (4 spaces beyond the def).

alexdryden and others added 2 commits March 10, 2026 22:51
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@alexdryden
Contributor

@copilot apply changes based on the comments in this thread

- Update docstring in AgentService._extract_paragraphs to clarify it returns plain text
- Add EAC-CPF XML validation in task_agent() before processing
- Fix test indentation in test_multiple_creators_with_namespace
- Add comment documenting ElementTree serialization behavior with namespaces

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Contributor Author

Copilot AI commented Mar 11, 2026

@copilot apply changes based on the comments in this thread

Applied changes in commit 82c2e23:

  1. Updated _extract_paragraphs() docstring to clarify it returns plain text (not wrapped in <p> tags)
  2. Added validate_eac_cpf_xml() call in task_agent() to validate EAC-CPF XML before processing
  3. Fixed over-indentation in test_multiple_creators_with_namespace() test method
  4. Added comment documenting ElementTree serialization behavior regarding namespace prefix rewriting and XML declaration
