Add Creator Record Generation and Automatic Indexing#8

Closed
alexdryden wants to merge 11 commits into main from index_creators

Conversation

@alexdryden
Contributor

@alexdryden alexdryden commented Feb 11, 2026

Overview

This PR implements Phase 1 of agent records support: generating standalone agent records from ArchivesSpace and automatically indexing them to Solr for discovery in ArcLight.

Problem Statement

ArchivesSpace agents (people, organizations, families) were not discoverable as standalone entities in ArcLight. Users could not:

  • Browse or search for agents independently
  • Navigate directly to agent pages
  • View all collections linked to a specific agent
  • Discover relationships between agents

Solution

This PR adds automated agent record generation and indexing:

  1. Extract all agents from ArchivesSpace via API
  2. Generate EAC-CPF XML for each agent via archival_contexts endpoint
  3. Define Solr schema with agent-specific fields
  4. Configure traject to map EAC-CPF to Solr fields
  5. Automatically index agents after XML generation

Architecture

Processing Workflow

Step 1: Agent ID Extraction

get_all_agents(agent_types=['people', 'corporate_entities', 'families'], modified_since=0)
  • Calls ArchivesSpace API endpoints and returns all agent URIs
  • Uses modified_since to filter results based on the last time arcflow was run
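As a hedged sketch (not the PR's actual implementation), the extraction step could look like the following. The `ASPACE_API` URL, session-token handling, and `agent_uri` helper are assumptions for illustration; `all_ids` and `modified_since` are standard ArchivesSpace API query parameters:

```python
import requests

ASPACE_API = "http://localhost:8089"  # assumed ArchivesSpace backend URL


def agent_uri(agent_type, agent_id):
    """Build the canonical URI for an agent record."""
    return f"/agents/{agent_type}/{agent_id}"


def get_all_agents(session_token,
                   agent_types=("people", "corporate_entities", "families"),
                   modified_since=0):
    """Return URIs for all agents modified since the given epoch timestamp."""
    headers = {"X-ArchivesSpace-Session": session_token}
    uris = []
    for agent_type in agent_types:
        # all_ids=true returns the full list of record IDs for the endpoint;
        # modified_since filters to records changed after the timestamp
        resp = requests.get(
            f"{ASPACE_API}/agents/{agent_type}",
            params={"all_ids": "true", "modified_since": modified_since},
            headers=headers,
        )
        resp.raise_for_status()
        uris.extend(agent_uri(agent_type, agent_id) for agent_id in resp.json())
    return uris
```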

Step 2: EAC-CPF XML Document Generation

process_creators()
task_agent(agent_uri, agents_dir)
  • Fetches EAC-CPF XML from /repositories/1/archival_contexts/{type}/{id}.xml and saves to agents directory
    -- Note: repository 1 is always used; the repository context does not appear to affect the exported EAC-CPF
  • Processes in parallel following existing pattern
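A sketch of the per-agent fetch under the same assumptions (the `ASPACE_API` URL and session-token handling are illustrative); the hard-coded repository 1 mirrors the note above, and the output filename follows the `creator_{type}_{id}` convention used elsewhere in this PR:

```python
import pathlib

import requests

ASPACE_API = "http://localhost:8089"  # assumed ArchivesSpace backend URL


def eac_cpf_path(agent_uri):
    """Map an agent URI like /agents/corporate_entities/123 to its
    archival_contexts export endpoint. Repository 1 is hard-coded;
    the repository context does not appear to affect the export."""
    _, agent_type, agent_id = agent_uri.strip("/").split("/")
    return f"/repositories/1/archival_contexts/{agent_type}/{agent_id}.xml"


def task_agent(agent_uri, agents_dir, session_token):
    """Fetch EAC-CPF XML for one agent and save it under agents_dir."""
    headers = {"X-ArchivesSpace-Session": session_token}
    resp = requests.get(ASPACE_API + eac_cpf_path(agent_uri), headers=headers)
    resp.raise_for_status()
    _, agent_type, agent_id = agent_uri.strip("/").split("/")
    out_path = pathlib.Path(agents_dir) / f"creator_{agent_type}_{agent_id}.xml"
    out_path.write_bytes(resp.content)
    return out_path
```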

Step 3: Indexing (Transform and Load)

index_creators(agents_dir, creator_ids)
  • Invokes traject to index to Solr
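A sketch of the indexing call, assuming traject is invoked as a subprocess with its `-c` (config file) and `-u` (Solr URL) options; the batch size of 100 reflects the later commit that batches files per call, and the config filename is the one added in this PR:

```python
import subprocess


def batches(items, size=100):
    """Split a list into chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def index_creators(agents_dir, xml_files, solr_url,
                   traject_config="traject_config_eac_cpf.rb"):
    """Index EAC-CPF XML files into Solr via traject, 100 files per call."""
    for batch in batches(xml_files, 100):
        subprocess.run(
            ["bundle", "exec", "traject",
             "-c", traject_config,   # the EAC-CPF -> Solr mapping
             "-u", solr_url,         # target Solr core
             *batch],
            cwd=agents_dir,
            check=True,  # raise CalledProcessError on non-zero exit
        )
```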

Usage

Basic Processing

Full workflow (collections + agents)
python -m arcflow.main \
  --arclight-dir /path/to/arclight \
  --aspace-dir /path/to/archivesspace \
  --solr-url http://localhost:8983/solr/blacklight-core
Added Processing Modes
# Process only agents (skip collections)
python -m arcflow.main ... --agents-only

# Process only collections (skip agents)
python -m arcflow.main ... --collections-only

# Generate XML but skip indexing
python -m arcflow.main ... --skip-creator-indexing

Integration Steps

  1. Add Solr Schema Fields
# Copy field definitions to your schema.xml
cat solr/conf/arcuit_creator_fields.xml >> /path/to/solr/schema.xml
In practice I have added these to a sub-section where the other fields were organized, though in theory just adding them to the end as suggested here should also work.

# Reload Solr core
curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=blacklight-core"
  2. Run arcflow
cd /path/to/arclight
python -m arcflow.main \
  --arclight-dir . \
  --aspace-dir /path/to/archivesspace \
  --solr-url http://localhost:8983/solr/blacklight-core
  3. Verify
# Check XML files
ls public/xml/agents/

# Query Solr for agents
curl "http://localhost:8983/solr/blacklight-core/select?q=is_creator:true"

# Check specific agent
curl "http://localhost:8983/solr/blacklight-core/select?q=id:creator_corporate_entities_123"

Known Limitations

  • Empty Control Element - ArchivesSpace EAC-CPF exports have an empty <control> element, so the ID is extracted from the filename
  • No Filtering - Processes ALL agents, not just those who created collections. This ensures that we capture parent/child organizations that may not have created collections, as well as former names of agents. It does mean we may capture more agents than we strictly target.
  • Collection Linking - Links are stored via the Solr persistent_id field, not in the XML. This is still in development/testing.
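Since the exported <control> element is empty, the Solr document ID falls back to the filename stem. A small illustrative guard (the regex and helper name are assumptions, not the PR's code) that also covers the later fix to log unexpected filenames:

```python
import re

# Expected filename stem pattern: creator_{type}_{id}
FILENAME_PATTERN = re.compile(r"^creator_(?P<type>[a-z_]+)_(?P<id>\d+)$")


def solr_id_from_filename(stem):
    """Return the Solr document ID derived from a filename stem such as
    'creator_corporate_entities_123', or None if the stem does not match
    the expected pattern (the caller should log such records)."""
    if FILENAME_PATTERN.match(stem) is None:
        return None
    return stem  # the stem itself serves as the Solr ID
```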

Questions for Leo

Does the implementation adhere to the patterns you expect? Are there any maintenance red flags?
Are you okay with the additional CLI arguments?
Are there additional agent metadata fields you think we'll need?

@alexdryden alexdryden requested a review from Copilot February 11, 2026 21:44

Copilot AI left a comment


Pull request overview

Implements Phase 1 of standalone creator/agent support by generating EAC-CPF XML from ArchivesSpace agents and indexing those agent records into Solr via Traject for ArcLight discovery.

Changes:

  • Add a Traject config for mapping EAC-CPF creator XML to Solr fields.
  • Extend ArcFlow to fetch agents from ArchivesSpace, generate EAC-CPF XML files, and optionally index them.
  • Document new creator-record workflow and CLI flags in the README.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 12 comments.

File Description
traject_config_eac_cpf.rb New Traject mapping for EAC-CPF → Solr creator documents.
arcflow/main.py Adds agent extraction, EAC-CPF generation, creator indexing, and new CLI flags.
README.md Documents quick start, creator record behavior, indexing, and schema prerequisites.

Contributor

Copilot AI commented Feb 11, 2026

@alexdryden I've opened a new pull request, #9, to work on those changes. Once the pull request is ready, I'll request review from you.


Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

@alexdryden
Contributor Author

From the meeting:
We need to exclude some agents that are created because they are users or donors (ideas: filter on bioghist present or relationship present; filter against local rules present in ASpace).
We need to consider (probably for later work as we flesh out the issue) what happens with alternative names that are not stored as a separate agent.

Contributor

Copilot AI commented Feb 13, 2026

@alexdryden I've opened a new pull request, #10, to work on those changes. Once the pull request is ready, I'll request review from you.

Contributor

Copilot AI commented Feb 13, 2026

@alexdryden I've opened a new pull request, #11, to work on those changes. Once the pull request is ready, I'll request review from you.

@alexdryden
Contributor Author

@copilot we need to exclude some agents from being indexed into the ArcLight Solr because ArchivesSpace is too aggressive in what it considers an agent. ArchivesSpace will make users (i.e., users of the ASpace software) agents, as well as donors. Our aim is for researchers to be able to browse the agents who are responsible for the archival material and see their relationships with one another and the content they created. We don't want (for various reasons, including privacy/security) users and donors in the collection of agents.

I'd like you to consult the ArchivesSpace repo https://github.com/archivesspace/archivesspace in detail to understand this issue and learn how agents are represented in the codebase, how they can be discriminated, and how that data can be accessed and assessed from the various API endpoints we have access to, and then to design a solution that would exclude these non-target agents from our indexing routines.

Also, report back if you find other instances of agents that don't fit into our needs assessment (summarized above).

One idea we had was to infer the target types of agent by detecting whether they have a bioghist statement or any relationships present, as we didn't see those in some of our test runs with the non-target agents; but that could have just been a fluke or an artifact from running this code before some of the bug fixes. So, try to come up with your own solution based on what you find in the ArchivesSpace code and/or the discussions in the users forum here https://groups.google.com/a/lyrasislists.org/g/Archivesspace_Users_Group. I expect this to be a difficult, time-consuming task that may require multiple passes and consume as many as 10 premium requests. If you get to 10 premium requests before arriving at a solution, then report back to me with your results and we'll discuss.

Contributor

Copilot AI commented Feb 13, 2026

@alexdryden I've opened a new pull request, #12, to work on those changes. Once the pull request is ready, I'll request review from you.

Contributor

Copilot AI commented Feb 13, 2026

@alexdryden I've opened a new pull request, #13, to work on those changes. Once the pull request is ready, I'll request review from you.

@alexdryden
Contributor Author

@copilot can you review how the traject indexing file is being discovered and where it is being stored? At the very least I'd like for it to be logged to the user running the script, but I'd also like your input on where it should belong (i.e., in arcflow or arcuit). Recall that arcflow is a data pipeline between ArchivesSpace and ArcLight, and that arcuit is the UIUC gem that serves as the home for our ArcLight customizations.

I don't want to accidentally run the wrong version of the traject config, and I don't want to wonder whether arcflow is configured to work with a version other than the one that has been loaded.

Contributor

Copilot AI commented Feb 13, 2026

@alexdryden I've opened a new pull request, #14, to work on those changes. Once the pull request is ready, I'll request review from you.

Extract bioghist from ArchivesSpace agent records and inject into EAD:
- Retrieve bioghist notes from linked agent records
- Inject structured XML into EAD <archdesc> section
- Preserve HTML markup for proper rendering in ArcLight
- Fix bioghist element nesting per EAD schema requirements
- Add Copilot agent onboarding documentation

This enables archival collections to display biographical and historical
context about creators directly in the finding aid.

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 14 comments.

Comments suppressed due to low confidence (2)

arcflow/main.py:544

  • The delete_ead method is called with a keyword argument indent=4 on line 544, but the method signature on line 1222 only accepts indent_size as a parameter name (not indent). This will cause a TypeError at runtime with "got an unexpected keyword argument 'indent'". The call should use indent_size=4 instead.
                        self.delete_ead(
                            resource_id, 
                            ead_id.replace('.', '-'),  # dashes in Solr
                            f'{xml_dir}/{ead_id}.xml', # dots in filenames
                            f'{pdf_dir}/{ead_id}.pdf', 
                            indent=4)

arcflow/main.py:542

  • Path inconsistency in deleted resource handling: Line 542 constructs the XML file path as f'{xml_dir}/{ead_id}.xml' but resources are now stored in resource_dir (defined as f'{xml_dir}/resources' on line 510). This should be f'{resource_dir}/{ead_id}.xml' to match the new directory structure established in the update_eads method. Similarly, symlink_path on line 536 correctly uses resource_dir, but the actual file deletion references the wrong directory.
                        self.delete_ead(
                            resource_id, 
                            ead_id.replace('.', '-'),  # dashes in Solr
                            f'{xml_dir}/{ead_id}.xml', # dots in filenames

Comment on lines 618 to 619
except subprocess.CalledProcessError as e:
self.log.error(f'{indent}Error indexing pending resources in repository ID {repo_id} to ArcLight Solr: {e}')

Copilot AI Feb 26, 2026


The exception handler on line 618 catches subprocess.CalledProcessError, but this exception will never be raised because subprocess.run() is called without check=True on line 604. The code relies on checking result.returncode instead (line 614), making this except block dead code. Either remove the exception handler or add check=True to subprocess.run() and handle the exception properly.

Contributor Author


@copilot open a new pull request to apply changes based on this feedback

Contributor

Copilot AI commented Feb 26, 2026

@alexdryden I've opened a new pull request, #25, to work on those changes. Once the pull request is ready, I'll request review from you.

Contributor

Copilot AI commented Feb 26, 2026

@alexdryden I've opened a new pull request, #26, to work on those changes. Once the pull request is ready, I'll request review from you.

Contributor

Copilot AI commented Feb 26, 2026

@alexdryden I've opened a new pull request, #27, to work on those changes. Once the pull request is ready, I'll request review from you.

Contributor

Copilot AI commented Feb 26, 2026

@alexdryden I've opened a new pull request, #28, to work on those changes. Once the pull request is ready, I'll request review from you.

alexdryden and others added 3 commits February 28, 2026 15:56
Implement complete ETL pipeline for ArchivesSpace agents:
- Extract all agent records via ArchivesSpace API
- Generate EAC-CPF XML documents for each agent
- Auto-discover and configure traject indexing
- Batch index to Solr (100 files per call for performance)
- Support multiple processing modes (agents-only, collections-only, both)
- Add 11 new Solr fields for agent metadata
- Include 271-line traject config for EAC-CPF → Solr mapping

Key features:
- Parallel to existing collection record indexing
- Dynamic Solr field mapping for ArcLight compatibility
- Robust error handling and logging
- Configurable traject config discovery paths

This allows ArcLight to provide dedicated agent/creator pages with
full biographical information, related collections, and authority control.
Replace per-agent API calls with single Solr query for better performance:
- Query ArchivesSpace Solr to filter agents in bulk
- Exclude system users (publish=false)
- Exclude donors (linked_agent_role includes "dnr")
- Exclude software agents (agent_type="agent_software")
- Use consistent EAC namespace prefixes in XPath queries
- Refactor dates extraction for improved readability

Performance improvement: O(n) API calls → O(1) Solr query
Reduces processing time from minutes to seconds for large repositories.
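The filter criteria above might translate to a single bulk Solr query along these lines. The internal Solr URL and the exact field names (`primary_type`, `publish`, `agent_type`, `linked_agent_roles`) are assumptions inferred from the commit description, not verified against the ArchivesSpace Solr schema:

```python
import requests

ASPACE_SOLR = "http://localhost:8091/collection1"  # assumed internal ASpace Solr


def agent_filter_params(rows=100000):
    """Build query parameters that keep published, non-software,
    non-donor agents in one bulk request."""
    return {
        "q": "primary_type:(agent_person OR agent_corporate_entity OR agent_family)",
        "fq": [
            "publish:true",                # exclude unpublished system users
            "-agent_type:agent_software",  # exclude software agents
            "-linked_agent_roles:dnr",     # exclude donor-linked agents
        ],
        "fl": "id",
        "rows": rows,
        "wt": "json",
    }


def fetch_target_agent_ids():
    """One O(1) bulk query replacing O(n) per-agent API calls."""
    resp = requests.get(f"{ASPACE_SOLR}/select", params=agent_filter_params())
    resp.raise_for_status()
    return [doc["id"] for doc in resp.json()["response"]["docs"]]
```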

to reflect the required command line arguments

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
log record if filename is not expected pattern: creator_{type}_{id}

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

Comments suppressed due to low confidence (1)

arcflow/main.py:477

  • This logs stderr at error level unconditionally, even when it’s empty on successful rm runs. That will produce noisy “error” logs during normal operation. Consider logging stderr only when non-empty and/or when returncode != 0 (and possibly using info/debug for successful cleanup).
                    result = subprocess.run(
                        f'rm {xml_file_path}',
                        shell=True,
                        cwd=self.arclight_dir,
                        stderr=subprocess.PIPE,)
                    self.log.error(f'{" " * indent_size}{result.stderr.decode("utf-8")}')
                    if result.returncode != 0:
                        self.log.error(f'{" " * indent_size}Failed to remove pending symlinks {xml_file_path}. Return code: {result.returncode}')

Contributor

Copilot AI commented Mar 4, 2026

@alexdryden I've opened a new pull request, #31, to work on those changes. Once the pull request is ready, I'll request review from you.

…-scope runs

Restructured the pipeline for collections and creators to run independently
with their own timestamps, proper cleanup, and parallel execution orchestrated
via ThreadPoolExecutor

Changes:
- Split last_updated into last_updated_collections and last_updated_creators
- Extract run_collections() and run_creators() from monolithic run()
- Add run_all() that orchestrates both via ThreadPoolExecutor
- Scope Solr cleanup to record type using is_creator flag
- Update process_deleted_records() to accept scope parameter
- Move update_repositories() into run_all() (only runs for full updates)
- Fix timestamp comparisons to use min() where needed
- Add directory creation safeguards (os.makedirs with exist_ok)
- Change is_creator from string 'true' to boolean true
- Add proper exception handling in parallel execution

Benefits:
- Collections and creators can be rebuilt independently (--collections-only, --agents-only)
- Full runs execute both pipelines in parallel (faster)
- Each record type maintains its own timestamp state
- Solr cleanup is scoped to avoid deleting unrelated records
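The parallel orchestration described above could be sketched as follows; the method names on the pipeline object mirror the commit message, while the surrounding wiring (the `pipeline` parameter and its `log` attribute) is illustrative:

```python
from concurrent.futures import ThreadPoolExecutor


def run_all(pipeline):
    """Run the collections and creators pipelines in parallel,
    re-raising any exception from either worker thread."""
    # Repository updates only happen on full runs, before the split
    pipeline.update_repositories()
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = {
            pool.submit(pipeline.run_collections): "collections",
            pool.submit(pipeline.run_creators): "creators",
        }
        for future, name in futures.items():
            try:
                future.result()  # blocks; re-raises worker exceptions
            except Exception:
                pipeline.log.exception(f"{name} pipeline failed")
                raise
```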
@alexdryden
Contributor Author

I'm closing this to open a fresh instance for human reviewers.

@alexdryden alexdryden closed this Mar 4, 2026