Add two-phase timeout for PDF job processing#17
Draft
Copilot wants to merge 49 commits intoindex_creatorsfrom
Draft
Add two-phase timeout for PDF job processing#17Copilot wants to merge 49 commits intoindex_creatorsfrom
Copilot wants to merge 49 commits intoindex_creatorsfrom
Conversation
Implement data ETL for standalone creator records from ArchivesSpace
agents and automatically index them to Solr for discovery in ArcLight.
PROBLEM:
ArchivesSpace agents (people, organizations, families) were not discoverable
as standalone entities in ArcLight. Users could not browse or search for
agents independently of their collections.
SOLUTION:
1. Extract ALL agents from ArchivesSpace via API
2. Generate EAC-CPF XML documents for each agent
3. Define Solr schema fields for agent metadata
4. Configure traject to index agent XML to Solr
5. Implement automatic indexing after XML generation
FEATURES:
- Processes all agent types (people, corporate entities, families)
- Generates standards-compliant EAC-CPF XML
- Links agents to their collections via persistent_id
- Automatic discovery of traject config (bundle show arcuit)
- Batch processing (100 files per traject call)
- Robust error handling with detailed logging
- Multiple processing modes (normal, agents-only, collections-only)
COMPONENTS:
1. Python Processing (arcflow/main.py - 1428 lines):
- get_all_agents() - Fetch ALL agents from ArchivesSpace API
- task_agent() - Generate EAC-CPF XML via archival_contexts endpoint
- process_creators() - Batch process all agents in parallel (10 workers)
- find_traject_config() - Auto-discover traject configuration
- index_creators() - Batch index to Solr via traject
2. Solr Schema (solr/conf/arcuit_creator_fields.xml - 11 fields):
- is_creator (boolean) - Identifies agent records
- creator_persistent_id (string) - Unique identifier
- agent_type (string) - Type: corporate/person/family
- agent_id (string) - ArchivesSpace agent ID
- agent_uri (string) - ArchivesSpace agent URI
- entity_type (string) - EAC-CPF entity type
- related_agents_ssim (multiValued) - Related agent names
- related_agent_uris_ssim (multiValued) - Related agent URIs
- relationship_types_ssim (multiValued) - Relationship types
- document_type (string) - Document type (eac-cpf)
- record_type (string) - Record type (creator/agent)
3. Traject Configuration (traject_config_eac_cpf.rb - 271 lines):
- Maps EAC-CPF XML elements to Solr fields
- Extracts agent identity information
- Processes biographical/historical notes
- Captures related agents and relationships
- Handles collection linkages
USAGE:
python -m arcflow.main \
--arclight-dir /path/to/arclight \
--aspace-dir /path/to/archivesspace \
--solr-url http://localhost:8983/solr/blacklight-core
python -m arcflow.main ... --agents-only
python -m arcflow.main ... --collections-only
python -m arcflow.main ... --skip-creator-indexing
python -m arcflow.main ... --arcuit-dir /path/to/arcuit
COMMAND LINE OPTIONS:
--arclight-dir PATH Path to ArcLight application (required)
--aspace-dir PATH Path to ArchivesSpace data (required)
--solr-url URL Solr instance URL (required)
--arcuit-dir PATH Path to arcuit gem (optional, auto-detected)
--agents-only Process only agents, skip collections
--collections-only Process only collections, skip agents
--skip-creator-indexing Generate XML but don't index to Solr
--force-update Process all records regardless of timestamps
ARCHITECTURE:
ArchivesSpace API
↓ (archivessnake library)
arcflow Python code
↓ (fetches via /repositories/1/archival_contexts)
EAC-CPF XML files (public/xml/agents/*.xml)
↓ (indexed via traject)
Solr (blacklight-core)
↓ (discovered via ArcLight)
ArcLight
DATA FLOW:
1. arcflow calls get_all_agents() - fetches ALL agents from ArchivesSpace API
2. For each agent, task_agent() retrieves EAC-CPF from archival_contexts endpoint
3. Saves EAC-CPF XML to public/xml/agents/ directory
4. find_traject_config() discovers config via 'bundle show arcuit' or --arcuit-dir
5. index_creators() batches XML files (100 per call) and invokes traject
6. traject indexes XML to Solr with is_creator:true flag
7. Agent records now searchable in ArcLight
BENEFITS:
- Users can discover all agents independently of collections
- Direct navigation to agent pages
- Browse all agents of a specific type
- View all collections linked to a specific agent
- Standards-based EAC-CPF format for interoperability
- Automatic indexing reduces manual steps
- Flexible processing modes for different workflows
TECHNICAL DETAILS:
- EAC-CPF format: urn:isbn:1-931666-33-4 namespace
- ID extraction: Filename-based (handles empty control element in EAC-CPF)
- Batch size: 100 files per traject call
- Parallel processing: 10 worker processes for agent generation
- Timeout: 5 minutes per batch
- Error handling: Log errors, continue processing
- Linking: Via Solr persistent_id field (not direct XML updates)
FILES CHANGED:
arcflow-phase1-revised/
├── arcflow/
│ ├── __init__.py (updated imports)
│ ├── main.py (1428 lines, core logic)
│ └── utils/
│ ├── __init__.py
│ ├── bulk_import.py
│ └── stage_classifications.py
├── traject_config_eac_cpf.rb (271 lines)
├── requirements.txt
├── .archivessnake.yml.example
├── README.md (updated)
├── HOW_TO_USE.md (updated)
└── TESTING.md (updated)
solr/
├── README.md (installation instructions)
└── conf/
└── arcuit_creator_fields.xml (11 field definitions)
Documentation:
├── CREATOR_INDEXING_GUIDE.md (comprehensive guide)
├── AUTOMATED_INDEXING_IMPLEMENTATION.md (technical details)
└── README.md (updated with creator section)
TESTING:
Manual verification:
1. Run arcflow with --agents-only flag
2. Verify XML files generated in public/xml/agents/
3. Check Solr for indexed agent records
4. Verify is_creator:true in Solr documents
5. Test agent-collection linking via persistent_id
Automated testing:
- Python syntax validation
- Ruby syntax validation (traject config)
- Solr schema validation
DEPLOYMENT:
1. Add Solr schema fields to schema.xml
2. Reload Solr core
3. Run arcflow to generate and index agents
4. Verify agents appear in Solr
5. Test in ArcLight interface
BACKWARD COMPATIBILITY:
- No breaking changes to existing functionality
- Collections continue to work as before
- Agent indexing is additive
- Can be disabled with --skip-creator-indexing
There is no longer any need for these to be defined outside the method
allow for whitespace in filenames and for quoted arguments Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
fix typo Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
In Nokogiri XPath, every namespaced element must be prefixed (e.g., //eac:control/eac:recordId) with a namespace mapping. exists. Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Fix EAC-CPF namespace handling in XPath queries
punctuation/spacing Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
as a separate command. Detailed explanation: self.traject_extra_config is constructed as a single string containing a space (e.g., "-c /path/to/file.rb"), but subprocess.run(cmd) passes arguments verbatim and Traject won’t parse that as two flags/values. Store the extra config as a path (or as an already-split argv list) and append it as ['-c', traject_extra_config] (or extend with a list) when building cmd. Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
to prevent commands from running together Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Revert to manual returncode checking for subprocess error handling
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Copilot
AI
changed the title
Add timeout and diagnostic logging for ArchivesSpace PDF jobs
Add timeout and fix troubleshooting for stuck ArchivesSpace PDF jobs
Feb 20, 2026
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Copilot
AI
changed the title
Add timeout and fix troubleshooting for stuck ArchivesSpace PDF jobs
Add adaptive timeout for ArchivesSpace PDF generation jobs
Feb 20, 2026
Copilot
AI
changed the title
Add adaptive timeout for ArchivesSpace PDF generation jobs
Add two-phase timeout for PDF job processing
Feb 20, 2026
e1645c2 to
a6053f5
Compare
Replace per-agent API calls with single Solr query for better performance: - Query ArchivesSpace Solr to filter agents in bulk - Exclude system users (publish=false) - Exclude donors (linked_agent_role includes "dnr") - Exclude software agents (agent_type="agent_software") - Use consistent EAC namespace prefixes in XPath queries - Refactor dates extraction for improved readability Performance improvement: O(n) API calls → O(1) Solr query Reduces processing time from minutes to seconds for large repositories. to reflect the required command line arguments Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
log record if filename is not expected pattern: creator_{type}_{id}
ff41cdb to
7b9522a
Compare
… conservative list of things we know are relevant
Contributor
|
@copilot the only thing I want to keep from this is the timeout for waiting for pdf generation, can you remove everything else so that I can smoothly merge this into index_creators? |
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Contributor
Author
Done! Removed all glob wildcard changes and kept only the PDF timeout features (commit 6674a21). The branch now contains only:
Removed:
Ready for smooth merge into index_creators. |
…tring command with shell=True
12f3297 to
02156e4
Compare
f23fe83 to
89057a9
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Remove glob wildcard changes, keep only PDF timeout features
Per user request, removed all changes except the PDF timeout functionality:
Removed (glob wildcard security fixes):
import globstatementglob.glob()andos.remove()index_collectionsmethodKept (PDF timeout features):
--pdf-timeout-queued,--pdf-timeout-running)Changes:
subprocess.run('rm {pattern}', shell=True)cmd.append(xml_file_path)Result:
Branch now contains ONLY the PDF timeout improvements needed for smooth merge into
index_creators.Original prompt
This pull request was created from Copilot chat.
💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.