Add bidirectional creator↔collection links via EAD authfilenumber and EAC-CPF descriptiveNote by Copilot · Pull Request #35 · UIUCLibrary/arcflow

Copilot · 2026-03-09T19:07:37Z

Original prompt

Background

We are improving how creator records and collections are linked in ArcLight/Solr. Previously, we inferred creator information from bioghist statements in the first search result, which was clunky and error-prone when multiple creators were involved or when collections had custom bioghist notes.

Goal

Create explicit bidirectional links between creators and collections in Solr by:

Adding creator_arclight_ids_ssim field to collection records containing array of creator IDs
Adding collection_arclight_ids_ssim field to creator records containing array of collection EAD IDs

Implementation Requirements

Part 1: Collections → Creators Linkage

Location: arcflow/main.py in task_resource() method

Approach:

After fetching the resource with resolved linked_agents (line ~235-239)
Extract agent refs where role='creator'
Convert ArchivesSpace agent URIs (e.g., /agents/corporate_entities/123) to creator IDs (e.g., creator_corporate_entities_123)
Parse the EAD XML and find <origination> elements
Add authfilenumber attribute to the name elements (<corpname>, <persname>, <famname>) with the corresponding creator ID
This keeps the EAD standards-compliant since authfilenumber is a valid EAD attribute for authority file identifiers
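
The Part 1 steps above could be sketched roughly as follows. This is a minimal illustration, not the PR's code: the function name add_creator_ids is hypothetical, and it assumes the creator IDs arrive in the same order as the name elements inside <origination>, which the prompt does not guarantee.

```python
import xml.etree.ElementTree as ET
from typing import List

NAME_TAGS = {"corpname", "persname", "famname"}


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix so namespaced and plain EAD both match."""
    return tag.rsplit("}", 1)[-1]


def add_creator_ids(ead_xml: str, creator_ids: List[str]) -> str:
    """Hypothetical helper: set authfilenumber on origination name elements.

    Assumption: creator_ids is ordered like the name elements in the EAD.
    """
    root = ET.fromstring(ead_xml)
    names = [
        child
        for element in root.iter()
        if local_name(element.tag) == "origination"
        for child in element
        if local_name(child.tag) in NAME_TAGS
    ]
    # zip() stops at the shorter sequence, so unmatched elements are left alone.
    for name_element, creator_id in zip(names, creator_ids):
        name_element.set("authfilenumber", creator_id)
    return ET.tostring(root, encoding="unicode")
```

Matching by position is the weakest part of this sketch; matching on the name text or on agent data from the resolved linked_agents would be more robust if the order can differ.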

Traject Config Update:

File: arclight_dir/lib/arcuit/traject/ead_extra_config.rb (or wherever EAD collection indexing happens)
Extract authfilenumber attributes from origination elements
Index as creator_arclight_ids_ssim (string, stored, indexed, multivalued)

Part 2: Creators → Collections Linkage

Location: arcflow/main.py in task_agent() method

Approach:

After fetching EAC-CPF XML (line ~880)
Parse the XML to find <resourceRelation resourceRelationType="creatorOf"> elements
Extract xlink:href attribute values (ArchivesSpace resource URIs like https://archivesspace-stage.library.illinois.edu/repositories/2/resources/200)
For each URI:
- Extract repo_id and resource_id
- Use ASnake client to fetch the resource: self.client.get('/repositories/{repo_id}/resources/{resource_id}').json()
- Get the ead_id from the response
- Handle errors: if resource can't be fetched (deleted, unpublished, etc.), log warning and skip that collection
Add <descriptiveNote>ead_id:{ead_id}</descriptiveNote> child element to each <resourceRelation> element (standard EAC-CPF element)
Save the modified EAC-CPF XML
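
A rough sketch of the injection step, assuming the ead_ids have already been looked up and are passed in as a dict keyed by the xlink:href value. The function name add_ead_id_notes and that dict are illustrative assumptions; the <p> wrapper follows the "Modified EAC-CPF" example rather than the one-line form in the step above.

```python
import xml.etree.ElementTree as ET
from typing import Dict

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def add_ead_id_notes(eac_xml: str, ead_ids_by_href: Dict[str, str]) -> str:
    """Hypothetical helper: append a descriptiveNote to creatorOf relations."""
    root = ET.fromstring(eac_xml)
    # Materialize the iterator first because the loop adds child elements.
    for relation in list(root.iter()):
        if relation.tag.rsplit("}", 1)[-1] != "resourceRelation":
            continue
        if relation.get("resourceRelationType") != "creatorOf":
            continue
        ead_id = ead_ids_by_href.get(relation.get(XLINK_HREF))
        if ead_id is None:
            continue  # nothing known for this href, leave the relation as-is
        # Reuse the parent's namespace (if any) for the new elements.
        ns = relation.tag[: relation.tag.index("}") + 1] if relation.tag.startswith("{") else ""
        note = ET.SubElement(relation, f"{ns}descriptiveNote")
        ET.SubElement(note, f"{ns}p").text = f"ead_id:{ead_id}"
    return ET.tostring(root, encoding="unicode")
```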

Traject Config Update:

File: example_traject_config_eac_cpf.rb
Extract ead_ids from <resourceRelation><descriptiveNote> elements
Index as collection_arclight_ids_ssim (string, stored, indexed, multivalued)

Technical Details

Creator ID Format

Use the existing pattern: creator_{agent_type}_{id}

Entity types: corporate_entities, people, families
Example: creator_corporate_entities_123

Collection ID Format

Use the ead_id field from ArchivesSpace resources

Example: ALA.9.5.16

Helper Functions

Reuse the pattern from the existing traject config:

aspace_uri_to_solr_id(uri) - converts ArchivesSpace URIs to creator IDs
Create Python equivalent for use in main.py
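
One possible shape for the Python equivalent, as a sketch only. The name aspace_uri_to_creator_id and the choice to return None for unrecognized URIs are assumptions; the traject helper's exact behavior is not shown in this thread.

```python
import re
from typing import Optional

# Only the three entity types listed under "Creator ID Format" are accepted.
_AGENT_URI = re.compile(r"^/agents/(corporate_entities|people|families)/(\d+)$")


def aspace_uri_to_creator_id(uri: str) -> Optional[str]:
    """Hypothetical helper: '/agents/people/456' -> 'creator_people_456'."""
    match = _AGENT_URI.match(uri)
    if match is None:
        return None  # e.g. software agents or malformed URIs
    agent_type, agent_id = match.groups()
    return f"creator_{agent_type}_{agent_id}"
```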

Error Handling

If fetching a resource fails (404, permission error, etc.):
- Log a warning with the resource URI
- Skip that specific collection
- Continue processing the creator and other collections
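
The log-and-skip behavior could look something like the sketch below. To keep it self-contained, the fetch is passed in as a callable; in arcflow it would presumably wrap the ASnake client call shown earlier. The names collect_ead_ids and fetch_resource are hypothetical.

```python
import logging
from typing import Callable, Iterable, List

log = logging.getLogger("arcflow.sketch")  # logger name is illustrative


def collect_ead_ids(
    resource_uris: Iterable[str],
    fetch_resource: Callable[[str], dict],
) -> List[str]:
    """Hypothetical helper: gather ead_ids, skipping resources that fail."""
    ead_ids = []
    for uri in resource_uris:
        try:
            resource = fetch_resource(uri)
            ead_id = resource["ead_id"]
        except Exception as exc:  # 404, permission error, missing key, etc.
            log.warning("Skipping resource %s: %s", uri, exc)
            continue  # keep processing the remaining collections
        ead_ids.append(ead_id)
    return ead_ids
```

Catching Exception broadly keeps one bad resource from aborting the creator; a real implementation might narrow this to the client's HTTP and JSON errors.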

Example Data Flow

EAC-CPF Input (from ArchivesSpace):

<resourceRelation resourceRelationType="creatorOf" 
                  xlink:href="https://archivesspace-stage.library.illinois.edu/repositories/2/resources/200">
  <relationEntry>Certified Public Library Administrator (CPLA)</relationEntry>
</resourceRelation>

Modified EAC-CPF (after processing):

<resourceRelation resourceRelationType="creatorOf" 
                  xlink:href="https://archivesspace-stage.library.illinois.edu/repositories/2/resources/200">
  <relationEntry>Certified Public Library Administrator (CPLA)</relationEntry>
  <descriptiveNote>
    <p>ead_id:ALA.9.5.16</p>
  </descriptiveNote>
</resourceRelation>

EAD Input (from ArchivesSpace):

<origination label="Creator">
  <corpname source="lcnaf">ALA Allied Professional Association</corpname>
</origination>

Modified EAD (after processing):

<origination label="Creator">
  <corpname source="lcnaf" authfilenumber="creator_corporate_entities_123">ALA Allied Professional Association</corpname>
</origination>

Branch Strategy

Base branch: index_creators (contains all the creator indexing infrastructure)
Target branch: index_creators (will be merged to main separately)

Testing Considerations

Verify that collections with multiple creators get multiple IDs in array
Verify that creators linked to multiple collections get multiple EAD IDs in array
Verify error handling when resources are unpublished/deleted
Verify that modified...

This pull request was created from Copilot chat.


Extract bioghist from ArchivesSpace agent records and inject into EAD:
- Retrieve bioghist notes from linked agent records
- Inject structured XML into EAD <archdesc> section
- Preserve HTML markup for proper rendering in ArcLight
- Fix bioghist element nesting per EAD schema requirements
- Add Copilot agent onboarding documentation

This enables archival collections to display biographical and historical context about creators directly in the finding aid.

Implement complete ETL pipeline for ArchivesSpace agents:
- Extract all agent records via ArchivesSpace API
- Generate EAC-CPF XML documents for each agent
- Auto-discover and configure traject indexing
- Batch index to Solr (100 files per call for performance)
- Support multiple processing modes (agents-only, collections-only, both)
- Add 11 new Solr fields for agent metadata
- Include 271-line traject config for EAC-CPF → Solr mapping

Key features:
- Parallel to existing collection record indexing
- Dynamic Solr field mapping for ArcLight compatibility
- Robust error handling and logging
- Configurable traject config discovery paths

This allows ArcLight to provide dedicated agent/creator pages with full biographical information, related collections, and authority control.

Replace per-agent API calls with single Solr query for better performance:
- Query ArchivesSpace Solr to filter agents in bulk
- Exclude system users (publish=false)
- Exclude donors (linked_agent_role includes "dnr")
- Exclude software agents (agent_type="agent_software")
- Use consistent EAC namespace prefixes in XPath queries
- Refactor dates extraction for improved readability

Performance improvement: O(n) API calls → O(1) Solr query
Reduces processing time from minutes to seconds for large repositories.

to reflect the required command line arguments

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

- always use filename for id
- reduce duplicate fields and make fields dynamic
- store related agent ids, uris, and relationships in arrays

Restructured the pipeline for collections and creators to run independently with their own timestamps, proper cleanup, and parallel execution orchestrated via ThreadPoolExecutor

Changes:
- Split last_updated into last_updated_collections and last_updated_creators
- Extract run_collections() and run_creators() from monolithic run()
- Add run_all() that orchestrates both via ThreadPoolExecutor
- Scope Solr cleanup to record type using is_creator flag
- Update process_deleted_records() to accept scope parameter
- Move update_repositories() into run_all() (only runs for full updates)
- Fix timestamp comparisons to use min() where needed
- Add directory creation safeguards (os.makedirs with exist_ok)
- Change is_creator from string 'true' to boolean true
- Add proper exception handling in parallel execution

Benefits:
- Collections and creators can be rebuilt independently (--collections-only, --agents-only)
- Full runs execute both pipelines in parallel (faster)
- Each record type maintains its own timestamp state
- Solr cleanup is scoped to avoid deleting unrelated records
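
For readers unfamiliar with the pattern, the parallel orchestration described in this commit might resemble the sketch below. It is not the commit's code: run_all here takes a dict of callables and returns results and errors separately, which are assumptions made to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Tuple


def run_all(pipelines: Dict[str, Callable[[], object]]) -> Tuple[dict, dict]:
    """Hypothetical orchestrator: run each pipeline in its own thread."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max(1, len(pipelines))) as pool:
        futures = {pool.submit(fn): name for name, fn in pipelines.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # One pipeline failing should not hide the other's outcome.
                errors[name] = exc
    return results, errors
```

A caller could pass something like {"collections": run_collections, "creators": run_creators} and inspect errors afterwards to decide whether to advance each timestamp.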

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

When used as a fallback, this will process all of our EAD/collection document customizations

alexdryden · 2026-03-10T17:38:09Z

@copilot we are running into a lot of digital objects in the EAC_CPF, like so:

      <resourceRelation resourceRelationType="creatorOf" xlink:href="https://archivesspace-stage.library.illinois.edu/repositories/7/digital_objects/627" xlink:type="simple">
        <relationEntry>Jack Sherman Baker Papers (Digital Surrogates and Born Digital Records)</relationEntry>
      </resourceRelation>

This is creating a lot of needless network traffic as we hit that endpoint and can't parse the result. How about we verify that the url is a resource first so we can lighten the network load. For reference, what we are after are these:

      <resourceRelation resourceRelationType="creatorOf" xlink:href="https://archivesspace-stage.library.illinois.edu/repositories/7/resources/1216" xlink:type="simple">
        <relationEntry>Jack Sherman Baker Papers</relationEntry>

So, essentially we need to check that it has the form: repositories/{number}/resources/{number}

Copilot · 2026-03-10T17:40:17Z

@copilot we are running into a lot of digital objects in the EAC_CPF, like so:
 <resourceRelation resourceRelationType="creatorOf" xlink:href="https://archivesspace-stage....

Updated in ae3b79e to validate URLs match the pattern repositories/{number}/resources/{number} before attempting to fetch them. Digital objects and other non-resource URLs are now skipped silently without generating warnings or network traffic.
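
A check of this kind can be sketched with a single regular expression. The function name parse_resource_url and the exact pattern are illustrative; the pattern in ae3b79e is not shown in this thread.

```python
import re
from typing import Optional, Tuple

# Matches .../repositories/{number}/resources/{number} at the end of the URL.
_RESOURCE_URL = re.compile(r"/repositories/(\d+)/resources/(\d+)/?$")


def parse_resource_url(href: str) -> Optional[Tuple[int, int]]:
    """Hypothetical helper: return (repo_id, resource_id) or None."""
    match = _RESOURCE_URL.search(href)
    if match is None:
        return None  # digital objects and other record types are skipped
    return int(match.group(1)), int(match.group(2))
```

Because the check is purely textual, non-resource links such as digital_objects are rejected before any request is made.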

alexdryden · 2026-03-10T17:58:48Z

arcflow/main.py

+            closing_tag = match.group(3)
+
+            # Idempotent: skip if descriptiveNote already added
+            if '<descriptiveNote>' in content:


This is too broad: <descriptiveNote> is a valid element and might just be in the eac_cpf. You need to look for the full pattern of what we are making--keep in mind that the ead_id can take a variety of forms.

alexdryden

@copilot see feedback for changes. if you have any questions or something seems unclear, stop and ask before you proceed.

arcflow/main.py

alexdryden · 2026-03-10T18:38:51Z

arcflow/main.py

+            content = match.group(2)
+            closing_tag = match.group(3)
+
+            # Idempotent: skip if descriptiveNote already added


This is too broad: <descriptiveNote> is a valid element and might just be in the eac_cpf. You need to look for the full pattern of what we are making--keep in mind that the ead_id can take a variety of forms.

Fixed in 1b35516. Changed the check to match the specific pattern we create: <descriptiveNote>\s*ead_id:[^<]+\s*</descriptiveNote> instead of just looking for any <descriptiveNote> tag.
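
As an illustration, a narrower idempotency check along these lines might look like the sketch below. It extends the quoted pattern with an optional <p> wrapper, since the "Modified EAC-CPF" example in the PR description wraps the value in <p>; that extension, and the name has_ead_id_note, are assumptions. It also assumes unprefixed tags, so namespace-prefixed output would need a different pattern.

```python
import re

# Only matches the note this pipeline writes, not arbitrary descriptiveNotes.
# [^<]+ accepts any ead_id form that contains no markup.
_EAD_ID_NOTE = re.compile(
    r"<descriptiveNote>\s*(?:<p>\s*)?ead_id:[^<]+(?:</p>\s*)?</descriptiveNote>"
)


def has_ead_id_note(content: str) -> bool:
    """Hypothetical helper: True if an ead_id note was already injected."""
    return _EAD_ID_NOTE.search(content) is not None
```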

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

alexdryden · 2026-03-11T01:58:29Z

@copilot apply changes based on the comments in this thread that are not already resolved. Note that some of them that relate to the namespacing are partially resolved.

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

Copilot · 2026-03-11T02:02:41Z

@copilot apply changes based on the comments in this thread that are not already resolved. Note that some of them that re...

Applied namespace-aware fixes in commit 0463690:

inject_collection_metadata: Now creates recordgroup, subgroup, and bioghist elements in the same namespace as the parent EAD document
add_collection_links_to_eac_cpf: Detects EAC-CPF namespace and creates descriptiveNote and p elements with proper namespace
build_bioghist_element: Refactored to use ElementTree for proper XML construction and escaping instead of string concatenation
validate_eac_cpf_xml: Restored method with namespace fallback handling for xlink attributes
Tests: Updated to handle namespaced XML output from ElementTree serialization

All 17 tests passing.

Copilot

Pull request overview

Copilot reviewed 8 out of 10 changed files in this pull request and generated 8 comments.

arcflow/services/xml_transform_service.py

tests/test_xml_transform_service.py

tests/test_agent_service.py

arcflow/services/xml_transform_service.py

example_traject_config_ead_extra.rb

arcflow/services/agent_service.py

arcflow/services/xml_transform_service.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

- AgentService._extract_paragraphs() now returns plain text instead of pre-wrapped tags
- XmlTransformService.build_bioghist_element() creates elements using ElementTree for proper XML escaping
- This prevents parse errors when bioghist content contains &, <, >, etc.
- Updated all tests to expect plain text paragraphs
- Added test to verify XML special characters are properly escaped

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
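
The escaping benefit can be seen in a small sketch. The function below is a simplified stand-in, not the real build_bioghist_element(); its name and the bare <bioghist>/<p> structure are assumptions.

```python
import xml.etree.ElementTree as ET
from typing import List


def build_bioghist_sketch(paragraphs: List[str]) -> str:
    """Hypothetical stand-in: wrap plain-text paragraphs in <bioghist><p>."""
    bioghist = ET.Element("bioghist")
    for text in paragraphs:
        # Assigning .text lets ElementTree escape &, < and > on serialization.
        ET.SubElement(bioghist, "p").text = text
    return ET.tostring(bioghist, encoding="unicode")
```

With string concatenation, a paragraph such as "Smith & Sons" would produce unparseable XML; here the ampersand is written out as &amp; and the result parses cleanly.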

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 8 out of 10 changed files in this pull request and generated 5 comments.

Copilot · 2026-03-11T02:46:25Z

arcflow/services/agent_service.py

+            indent_size: Indentation size for logging
+
+        Returns:
+            List of paragraph strings wrapped in <p> tags


The _extract_paragraphs docstring says it returns strings “wrapped in <p> tags”, but the implementation returns plain text lines and relies on build_bioghist_element to wrap them. Updating the docstring will prevent consumers from double-wrapping or misusing the returned data.

Suggested change

List of paragraph strings wrapped in <p> tags

List of plain text paragraph strings (not wrapped in <p> tags)

Copilot · 2026-03-11T02:46:26Z

arcflow/main.py

            eac_cpf_xml = response.text

-            # Parse the EAC-CPF XML to validate and inspect its structure
-            try:
-                root = ET.fromstring(eac_cpf_xml)
-                self.log.debug(f'{indent}Parsed EAC-CPF XML root element: {root.tag}')
-            except ET.ParseError as e:
-                self.log.error(f'{indent}Failed to parse EAC-CPF XML for {agent_uri}: {e}')
-                return None
+            # Add collection ead_ids to resourceRelation creatorOf elements
+            eac_cpf_xml = self.xml_transform.add_collection_links_to_eac_cpf(eac_cpf_xml, indent_size=indent_size)

            # Generate creator ID


The PR description mentions refactoring task_agent() to use a validate_eac_cpf_xml() service, but the updated code no longer validates the EAC-CPF document before writing it (it only attempts a parse inside add_collection_links_to_eac_cpf). If validation is still required, consider calling validate_eac_cpf_xml() here and aborting on invalid XML, or update the PR description if validation was intentionally removed.

Copilot · 2026-03-11T02:46:26Z

arcflow/services/xml_transform_service.py

+            # Convert back to string
+            result = ET.tostring(root, encoding='unicode', method='xml')
+            return result


Serializing with ET.tostring() after parsing will typically rewrite the original namespace prefixes (e.g., default EAD/EAC namespaces often become ns0:) and will drop the XML declaration. If downstream tooling expects ArchivesSpace’s original namespace/prefix layout or the <?xml ...?> prolog, consider registering the document’s default namespace before serializing and/or writing with ElementTree.write(..., xml_declaration=True, encoding='utf-8') to preserve the prolog.
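
A minimal sketch of that suggestion, assuming Python 3.8+ (for the xml_declaration argument to ET.tostring) and the EAD 2002 namespace used in the tests. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

EAD_NS = "urn:isbn:1-931666-22-9"


def serialize_with_default_namespace(root: ET.Element, namespace: str = EAD_NS) -> bytes:
    """Hypothetical helper: keep the default namespace and the XML prolog."""
    # register_namespace is process-wide; it maps the URI to an empty prefix
    # so elements are written as <ead xmlns="..."> instead of <ns0:ead ...>.
    ET.register_namespace("", namespace)
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)
```

Because the registration is global, a pipeline that serializes both EAD and EAC-CPF in the same process would need to register the right namespace before each call.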

Copilot · 2026-03-11T02:46:26Z

tests/test_xml_transform_service.py

+    def test_multiple_creators_with_namespace(self):
+            """Test handling multiple creators when EAD has default namespace."""
+            xml_with_namespace = '''<?xml version="1.0" encoding="UTF-8"?>
+<ead xmlns="urn:isbn:1-931666-22-9">
+  <archdesc level="collection">
+    <did>
+      <origination label="Creator">
+        <corpname source="lcnaf">First Corp</corpname>
+      </origination>
+      <origination label="Creator">
+        <persname source="lcnaf">Second Person</persname>
+      </origination>
+    </did>
+  </archdesc>
+</ead>'''
+
+            resource = {
+                'linked_agents': [
+                    {'role': 'creator', 'ref': '/agents/corporate_entities/123'},
+                    {'role': 'creator', 'ref': '/agents/people/456'}
+                ]
+            }
+
+            result = self.service.add_creator_ids_to_ead(xml_with_namespace, resource)
+
+            # Should add both creator IDs
+            self.assertIn('creator_id="creator_corporate_entities_123"', result)
+            self.assertIn('creator_id="creator_people_456"', result)
+            # Should preserve namespace
+            self.assertIn('urn:isbn:1-931666-22-9', result)
+


This test method’s body is indented more than necessary (extra level under the def), which is easy to miss in reviews and makes future edits error-prone. Align the docstring and body to a single indentation level inside the method (4 spaces beyond the def).

example_traject_config_ead_extra.rb

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

alexdryden · 2026-03-11T02:55:07Z

@copilot apply changes based on the comments in this thread

- Update docstring in AgentService._extract_paragraphs to clarify it returns plain text
- Add EAC-CPF XML validation in task_agent() before processing
- Fix test indentation in test_multiple_creators_with_namespace
- Add comment documenting ElementTree serialization behavior with namespaces

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

Copilot · 2026-03-11T02:57:24Z

@copilot apply changes based on the comments in this thread

Applied changes in commit 82c2e23:

Updated _extract_paragraphs() docstring to clarify it returns plain text (not wrapped in <p> tags)
Added validate_eac_cpf_xml() call in task_agent() to validate EAC-CPF XML before processing
Fixed over-indentation in test_multiple_creators_with_namespace() test method
Added comment documenting ElementTree serialization behavior regarding namespace prefix rewriting and XML declaration

Copilot AI and others added 9 commits February 26, 2026 13:52

fix: clean up traject config

3ef2b0a

- always use filename for id
- reduce duplicate fields and make fields dynamic
- store related agent ids, uris, and relationships in arrays

Merge branch 'main' into index_creators

551e563

find eac_cpf traject in arcuit, fallback to example in arcflow

fa0c562

Merge branch 'main' into index_creators

2a8e88a

Initial plan

256a19b

Copilot AI assigned Copilot and alexdryden Mar 9, 2026

Copilot started work on behalf of alexdryden March 9, 2026 19:07 View session

Add bidirectional creator-collection links in EAD and EAC-CPF

221f569

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Add bidirectional links between creators and collections in Solr~~ Add bidirectional creator↔collection links via EAD authfilenumber and EAC-CPF descriptiveNote Mar 9, 2026

Copilot finished work on behalf of alexdryden March 9, 2026 19:18

chore: add rg and sg to the example ead extra config

22d2cec

When used as a fallback, this will process all of our EAD/collection document customizations

UIUCLibrary deleted a comment from Copilot AI Mar 10, 2026

Copilot started work on behalf of alexdryden March 10, 2026 17:38 View session

Copilot finished work on behalf of alexdryden March 10, 2026 17:40

alexdryden force-pushed the copilot/add-bidirectional-creator-collection-links branch from ae3b79e to 2adc57d Compare March 10, 2026 17:49

alexdryden marked this pull request as ready for review March 10, 2026 17:50

alexdryden self-requested a review March 10, 2026 17:50

alexdryden reviewed Mar 10, 2026

View reviewed changes

update log message for clarity

56c619c

alexdryden force-pushed the copilot/add-bidirectional-creator-collection-links branch from 2adc57d to 56c619c Compare March 10, 2026 18:00

alexdryden requested changes Mar 10, 2026

View reviewed changes

Copilot started work on behalf of alexdryden March 10, 2026 18:40 View session

Fix idempotent check to match specific ead_id pattern

1b35516

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

alexdryden and others added 3 commits March 10, 2026 21:43

update documentation

2116f31

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

fix: preserve nested bioghists for multiple creators

a423017

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

chore: remove dead code

6621ae4

Copilot started work on behalf of alexdryden March 11, 2026 01:58 View session

fix: make XML transformations namespace-aware for EAD and EAC-CPF

0463690

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

Copilot finished work on behalf of alexdryden March 11, 2026 02:03

alexdryden requested a review from Copilot March 11, 2026 02:05

Copilot started reviewing on behalf of alexdryden March 11, 2026 02:06 View session

Copilot AI reviewed Mar 11, 2026

View reviewed changes

alexdryden and others added 4 commits March 10, 2026 22:14

fix: check for namespace explicitly

f2a451a

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

fix: formatting

1a76bd7

fix: remove unused import

1d4674a

fix: ensure consistent namespace in sub-elements

92e0c45

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot started work on behalf of alexdryden March 11, 2026 02:29 View session

Copilot finished work on behalf of alexdryden March 11, 2026 02:32

alexdryden and others added 2 commits March 10, 2026 22:33

remove unused import

170699a

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

fix: use custom namespace notation

e511635

alexdryden requested a review from Copilot March 11, 2026 02:39

Copilot started reviewing on behalf of alexdryden March 11, 2026 02:40 View session

Copilot AI reviewed Mar 11, 2026

View reviewed changes

alexdryden and others added 2 commits March 10, 2026 22:51

fix: use correct namespace

675ccb4

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

fix: formatting

8cf6e66

Copilot started work on behalf of alexdryden March 11, 2026 02:55 View session

Copilot finished work on behalf of alexdryden March 11, 2026 02:57


3 participants
