Add Creator Record Generation and Automatic Indexing #8
alexdryden wants to merge 11 commits into main from
Conversation
Pull request overview
Implements Phase 1 of standalone creator/agent support by generating EAC-CPF XML from ArchivesSpace agents and indexing those agent records into Solr via Traject for ArcLight discovery.
Changes:
- Add a Traject config for mapping EAC-CPF creator XML to Solr fields.
- Extend ArcFlow to fetch agents from ArchivesSpace, generate EAC-CPF XML files, and optionally index them.
- Document new creator-record workflow and CLI flags in the README.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| traject_config_eac_cpf.rb | New Traject mapping for EAC-CPF → Solr creator documents. |
| arcflow/main.py | Adds agent extraction, EAC-CPF generation, creator indexing, and new CLI flags. |
| README.md | Documents quick start, creator record behavior, indexing, and schema prerequisites. |
@alexdryden I've opened a new pull request, #9, to work on those changes. Once the pull request is ready, I'll request review from you.

From the meeting:

@alexdryden I've opened a new pull request, #10, to work on those changes. Once the pull request is ready, I'll request review from you.

@alexdryden I've opened a new pull request, #11, to work on those changes. Once the pull request is ready, I'll request review from you.
@copilot we need to exclude some agents from being indexed into ArcLight Solr because ArchivesSpace is too aggressive in what it considers an agent. ArchivesSpace will make users (i.e., users of the ASpace software) agents, as well as donors. Our aim is for researchers to be able to browse the agents who are responsible for the archival material and see their relationships to one another and the content they create. We don't want (for various reasons, including privacy/security) users and donors to appear in that collection of agents. I'd like you to consult the ArchivesSpace repo https://github.com/archivesspace/archivesspace in detail to understand this issue and learn how agents are represented in the codebase, how they can be discriminated, and how that data can be accessed and assessed from the various API endpoints we have access to, and then to design a solution that would exclude these non-target agents from our indexing routines. Also, report back if you find other instances of agents that don't fit our needs assessment (summarized above). One idea we had was to infer the target types of agent by detecting whether they have a bioghist statement or any relationships present, as we didn't see those in some of our test runs with the non-target agents--but that could have just been a fluke or an artifact of running this code before some of the bug fixes. So, try to come up with your own solution based on what you find in the ArchivesSpace code and/or the forum discussions here: https://groups.google.com/a/lyrasislists.org/g/Archivesspace_Users_Group. I expect this to be a difficult, time-consuming task that may require multiple passes and consume as many as 10 premium requests. If you reach 10 premium requests before arriving at a solution, report back to me with your results and we'll discuss.
@alexdryden I've opened a new pull request, #12, to work on those changes. Once the pull request is ready, I'll request review from you. |
|
@alexdryden I've opened a new pull request, #13, to work on those changes. Once the pull request is ready, I'll request review from you. |
|
@copilot can you review how the traject indexing file is being discovered and where it is being stored? At the very least I'd like it to be logged to the user running the script, but I'd also like your input on where it should belong (i.e., in arcflow or in arcuit). Recall that arcflow is a data pipeline between ArchivesSpace and ArcLight, and that arcuit is the UIUC gem that serves as the home for our ArcLight customizations. I don't want to accidentally run the wrong version of the traject config, and I don't want to wonder whether arcflow is configured to work with a version other than the one that has been loaded.
@alexdryden I've opened a new pull request, #14, to work on those changes. Once the pull request is ready, I'll request review from you. |
Extract bioghist from ArchivesSpace agent records and inject into EAD:
- Retrieve bioghist notes from linked agent records
- Inject structured XML into EAD <archdesc> section
- Preserve HTML markup for proper rendering in ArcLight
- Fix bioghist element nesting per EAD schema requirements
- Add Copilot agent onboarding documentation

This enables archival collections to display biographical and historical context about creators directly in the finding aid.
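The injection step described in this commit can be sketched as follows. This is an illustrative helper, not the code from arcflow/main.py; the note shape ('note_bioghist' with 'subnotes') follows the ArchivesSpace JSONModel, and the fabricated example data stands in for a record fetched from the ASpace API.

```python
import xml.etree.ElementTree as ET

def inject_bioghist(ead_root, agent_json):
    """Copy bioghist notes from an ArchivesSpace agent JSON record into
    the <archdesc> of an EAD tree (hypothetical helper; the PR's real
    implementation may differ)."""
    archdesc = ead_root.find('.//archdesc')
    if archdesc is None:
        return ead_root
    for note in agent_json.get('notes', []):
        if note.get('jsonmodel_type') != 'note_bioghist':
            continue
        # One <bioghist> per agent note, nested inside <archdesc>
        # per the EAD schema fix mentioned in the commit.
        bioghist = ET.SubElement(archdesc, 'bioghist')
        for subnote in note.get('subnotes', []):
            p = ET.SubElement(bioghist, 'p')
            p.text = subnote.get('content', '')
    return ead_root

# Fabricated example data; a real agent record comes from the ASpace API.
ead = ET.fromstring('<ead><archdesc level="collection"></archdesc></ead>')
agent = {'notes': [{'jsonmodel_type': 'note_bioghist',
                    'subnotes': [{'content': 'Founded in 1901.'}]}]}
inject_bioghist(ead, agent)
```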
Force-pushed from e1645c2 to a6053f5
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 14 comments.
Comments suppressed due to low confidence (2)
arcflow/main.py:544
- The `delete_ead` method is called with the keyword argument `indent=4` on line 544, but the method signature on line 1222 only accepts `indent_size` as a parameter name (not `indent`). This will cause a TypeError at runtime ("got an unexpected keyword argument 'indent'"). The call should use `indent_size=4` instead.
self.delete_ead(
    resource_id,
    ead_id.replace('.', '-'),  # dashes in Solr
    f'{xml_dir}/{ead_id}.xml',  # dots in filenames
    f'{pdf_dir}/{ead_id}.pdf',
    indent=4)
arcflow/main.py:542
- Path inconsistency in deleted resource handling: line 542 constructs the XML file path as `f'{xml_dir}/{ead_id}.xml'`, but resources are now stored in `resource_dir` (defined as `f'{xml_dir}/resources'` on line 510). This should be `f'{resource_dir}/{ead_id}.xml'` to match the new directory structure established in the `update_eads` method. Similarly, `symlink_path` on line 536 correctly uses `resource_dir`, but the actual file deletion references the wrong directory.
self.delete_ead(
    resource_id,
    ead_id.replace('.', '-'),  # dashes in Solr
    f'{xml_dir}/{ead_id}.xml',  # dots in filenames
except subprocess.CalledProcessError as e:
    self.log.error(f'{indent}Error indexing pending resources in repository ID {repo_id} to ArcLight Solr: {e}')
The exception handler on line 618 catches subprocess.CalledProcessError, but this exception will never be raised because subprocess.run() is called without check=True on line 604. The code relies on checking result.returncode instead (line 614), making this except block dead code. Either remove the exception handler or add check=True to subprocess.run() and handle the exception properly.
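One way to resolve the dead except block is the `check=True` route the comment suggests. This is a hedged sketch, not the PR's code: `run_indexer` is a hypothetical helper name, and the real call site in arcflow/main.py builds a different command.

```python
import logging
import subprocess

log = logging.getLogger(__name__)

def run_indexer(cmd, cwd):
    """Run an external indexing command; with check=True a non-zero
    exit raises CalledProcessError, so the handler below is reachable
    (unlike the dead except block flagged in the review)."""
    try:
        subprocess.run(cmd, cwd=cwd, check=True,
                       capture_output=True, text=True)
        return True
    except subprocess.CalledProcessError as e:
        log.error('Indexing failed (exit %s): %s', e.returncode, e.stderr)
        return False
```

The alternative is equally valid: drop the except block entirely and keep the explicit `result.returncode` check; the bug is only in mixing the two styles.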
@copilot open a new pull request to apply changes based on this feedback

@alexdryden I've opened a new pull request, #25, to work on those changes. Once the pull request is ready, I'll request review from you.

@alexdryden I've opened a new pull request, #26, to work on those changes. Once the pull request is ready, I'll request review from you.

@alexdryden I've opened a new pull request, #27, to work on those changes. Once the pull request is ready, I'll request review from you.

@alexdryden I've opened a new pull request, #28, to work on those changes. Once the pull request is ready, I'll request review from you.
Implement complete ETL pipeline for ArchivesSpace agents:
- Extract all agent records via ArchivesSpace API
- Generate EAC-CPF XML documents for each agent
- Auto-discover and configure traject indexing
- Batch index to Solr (100 files per call for performance)
- Support multiple processing modes (agents-only, collections-only, both)
- Add 11 new Solr fields for agent metadata
- Include 271-line traject config for EAC-CPF → Solr mapping

Key features:
- Parallel to existing collection record indexing
- Dynamic Solr field mapping for ArcLight compatibility
- Robust error handling and logging
- Configurable traject config discovery paths

This allows ArcLight to provide dedicated agent/creator pages with full biographical information, related collections, and authority control.
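The batching step above can be sketched as follows. The traject invocation is an assumption for illustration, not the PR's exact command line; only the 100-files-per-call batching comes from the commit message.

```python
import subprocess
from pathlib import Path

BATCH_SIZE = 100  # files per traject call, per the commit message above

def chunk(items, size=BATCH_SIZE):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def index_in_batches(xml_dir, traject_config):
    """Index EAC-CPF files batch-by-batch so one traject startup cost
    is amortized over many documents (illustrative sketch)."""
    files = sorted(str(p) for p in Path(xml_dir).glob('*.xml'))
    for batch in chunk(files):
        # Hypothetical command line; verify against your traject setup.
        subprocess.run(['traject', '-c', traject_config, *batch], check=True)
    return len(files)
```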
Replace per-agent API calls with single Solr query for better performance:
- Query ArchivesSpace Solr to filter agents in bulk
- Exclude system users (publish=false)
- Exclude donors (linked_agent_role includes "dnr")
- Exclude software agents (agent_type="agent_software")
- Use consistent EAC namespace prefixes in XPath queries
- Refactor dates extraction for improved readability

Performance improvement: O(n) API calls → O(1) Solr query. Reduces processing time from minutes to seconds for large repositories.

to reflect the required command line arguments

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
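The exclusion filters this commit describes can be expressed as one Solr query. A hedged sketch: the field names (`primary_type`, `publish`, `linked_agent_roles`, `agent_type`) mirror common ArchivesSpace Solr schema fields but should be verified against your instance before relying on them.

```python
import urllib.parse

def build_agent_filter_query():
    """Build the query string for a single bulk agent-filtering request
    against ArchivesSpace Solr (field names are assumptions)."""
    params = {
        'q': 'primary_type:(agent_person OR agent_corporate_entity OR agent_family)',
        'fq': ['publish:true',                 # drop unpublished system users
               '-linked_agent_roles:dnr',      # drop donors
               '-agent_type:agent_software'],  # drop software agents
        'fl': 'id,title',
        'rows': '10000',
        'wt': 'json',
    }
    # doseq=True repeats fq= once per filter clause, as Solr expects.
    return urllib.parse.urlencode(params, doseq=True)

query = build_agent_filter_query()
```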
log record if filename is not expected pattern: creator_{type}_{id}
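The filename check this commit adds can be sketched like so. The helper name and log wording are illustrative; only the `creator_{type}_{id}` pattern comes from the commit message.

```python
import logging
import re

log = logging.getLogger(__name__)

# Expected pattern from the commit above: creator_{type}_{id}.xml
CREATOR_FILENAME = re.compile(r'^creator_(?P<type>[a-z_]+)_(?P<id>\d+)\.xml$')

def parse_creator_filename(filename):
    """Return (agent_type, agent_id) if the filename matches the expected
    pattern; otherwise log the record and return None (sketch; arcflow's
    real log message may differ)."""
    m = CREATOR_FILENAME.match(filename)
    if m is None:
        log.warning('Skipping %s: does not match creator_{type}_{id} pattern',
                    filename)
        return None
    return m.group('type'), int(m.group('id'))
```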
Force-pushed from ff41cdb to 7b9522a
… conservative list of things we know are relevant
…tring command with shell=True
Force-pushed from b57eeac to 5952798
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
Comments suppressed due to low confidence (1)
arcflow/main.py:477
- This logs `stderr` at error level unconditionally, even when it's empty on successful `rm` runs. That will produce noisy "error" logs during normal operation. Consider logging stderr only when non-empty and/or when `returncode != 0` (and possibly using info/debug for successful cleanup).
result = subprocess.run(
    f'rm {xml_file_path}',
    shell=True,
    cwd=self.arclight_dir,
    stderr=subprocess.PIPE,)
self.log.error(f'{" " * indent_size}{result.stderr.decode("utf-8")}')
if result.returncode != 0:
    self.log.error(f'{" " * indent_size}Failed to remove pending symlinks {xml_file_path}. Return code: {result.returncode}')
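A sketch of the fix the reviewer suggests, which also sidesteps `shell=True` by removing the file in-process instead of shelling out to `rm` (the function name is illustrative, not from arcflow):

```python
import logging
import os

log = logging.getLogger(__name__)

def remove_file(xml_file_path, indent_size=4):
    """Remove a file, logging at error level only on failure and at
    debug level for successful cleanup, so routine runs stay quiet."""
    pad = ' ' * indent_size
    try:
        os.remove(xml_file_path)
        log.debug('%sRemoved %s', pad, xml_file_path)
        return True
    except OSError as e:
        log.error('%sFailed to remove %s: %s', pad, xml_file_path, e)
        return False
```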
@alexdryden I've opened a new pull request, #31, to work on those changes. Once the pull request is ready, I'll request review from you.
…-scope runs

Restructured the pipeline for collections and creators to run independently with their own timestamps, proper cleanup, and parallel execution orchestrated via ThreadPoolExecutor.

Changes:
- Split last_updated into last_updated_collections and last_updated_creators
- Extract run_collections() and run_creators() from monolithic run()
- Add run_all() that orchestrates both via ThreadPoolExecutor
- Scope Solr cleanup to record type using is_creator flag
- Update process_deleted_records() to accept scope parameter
- Move update_repositories() into run_all() (only runs for full updates)
- Fix timestamp comparisons to use min() where needed
- Add directory creation safeguards (os.makedirs with exist_ok)
- Change is_creator from string 'true' to boolean true
- Add proper exception handling in parallel execution

Benefits:
- Collections and creators can be rebuilt independently (--collections-only, --agents-only)
- Full runs execute both pipelines in parallel (faster)
- Each record type maintains its own timestamp state
- Solr cleanup is scoped to avoid deleting unrelated records
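The orchestration described in this commit can be sketched as follows. The function names mirror the commit message, but the bodies here are illustrative, not the code from arcflow/main.py.

```python
from concurrent.futures import ThreadPoolExecutor

def run_all(run_collections, run_creators):
    """Run both pipelines in parallel and surface worker failures
    (sketch of the run_all() orchestration described above)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = {'collections': pool.submit(run_collections),
                   'creators': pool.submit(run_creators)}
        results = {}
        for scope, future in futures.items():
            # .result() re-raises any exception from inside the worker,
            # so a failure in one pipeline is not silently swallowed.
            results[scope] = future.result()
        return results
```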
Force-pushed from f549e77 to 51d2eea
I'm closing this to open a fresh instance for human reviewers.
Overview
This PR implements Phase 1 of agent records support: generating standalone agent records from ArchivesSpace and automatically indexing them to Solr for discovery in ArcLight.
Problem Statement
ArchivesSpace agents (people, organizations, families) were not discoverable as standalone entities in ArcLight. Users could not:
Solution
This PR adds automated agent record generation and indexing:
Architecture
Processing Workflow
Step 1: Agent ID Extraction
Step 2: EAC-CPF XML Document Generation
-- Note: repository 1 is always used, as the repository context does not appear to affect the generated output
Step 3: Indexing (Transform and Load)
Usage
Basic Processing
Full workflow (collections + agents)
Added Processing Modes
Integration Steps
Known Limitations
Questions for Leo
Does the implementation adhere to the patterns you expect? Are there any maintenance red flags?
Are you okay with the additional CLI arguments?
Are there additional agent metadata fields you think we'll need?