Add two-phase timeout for PDF job processing #17

Draft
Copilot wants to merge 49 commits into index_creators from
copilot/replace-wildcard-expansion

Conversation


Copilot AI commented Feb 19, 2026

Remove glob wildcard changes, keep only PDF timeout features

Per user request, removed all changes except the PDF timeout functionality:

Removed (glob wildcard security fixes):

  • ❌ import glob statement
  • ❌ Symlink removal using glob.glob() and os.remove()
  • ❌ Wildcard expansion in index_collections method

Kept (PDF timeout features):

  • ✅ Two-phase timeout system (queued 5min, running 30min)
  • ✅ CLI arguments (--pdf-timeout-queued, --pdf-timeout-running)
  • ✅ Enhanced logging with status transitions
  • ✅ Context-aware error messages
  • ✅ Smart timeout reset on status change
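The two-phase behavior described above could be sketched as a polling loop like the one below. This is a minimal illustration, not the actual arcflow code: `get_job_status`, the status strings, and the injectable `clock`/`sleep` parameters are all assumptions; only the defaults (5 minutes queued, 30 minutes running) and the reset-on-transition behavior come from the PR description.

```python
import time

QUEUED_TIMEOUT = 5 * 60    # default for --pdf-timeout-queued
RUNNING_TIMEOUT = 30 * 60  # default for --pdf-timeout-running

def wait_for_pdf_job(get_job_status, poll_interval=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll a PDF job, applying a short timeout while it is queued and a
    longer one once it is running. The timer restarts on a status change
    (the 'smart timeout reset'). Illustrative sketch only."""
    last_status = None
    phase_start = clock()
    while True:
        status = get_job_status()
        if status in ('completed', 'failed'):
            return status
        if status != last_status:
            # A queued -> running transition restarts the clock.
            last_status = status
            phase_start = clock()
        limit = QUEUED_TIMEOUT if status == 'queued' else RUNNING_TIMEOUT
        if clock() - phase_start > limit:
            raise TimeoutError(f'PDF job stuck in {status!r} for over {limit}s')
        sleep(poll_interval)
```

A job that progresses queued → running → completed succeeds even if the running phase alone exceeds the queued limit, because the transition resets `phase_start`.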

Changes:

  • Reverted symlink removal to original shell command: subprocess.run('rm {pattern}', shell=True)
  • Reverted index_collections to original: cmd.append(xml_file_path)
  • Removed glob import
  • All PDF timeout logic remains intact

Result:

Branch now contains ONLY the PDF timeout improvements needed for smooth merge into index_creators.

Original prompt

Problem

Currently, the code uses subprocess.run() with file paths containing wildcards (*), relying on shell=True for expansion. This is problematic because:

  1. Using shell=True is a security risk
  2. The wildcard expansion happens in the shell, not in Python
  3. When the shell command fails, the actual asterisk character is passed literally
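The literal-asterisk failure mode is easy to reproduce. The snippet below is a standalone demonstration (not project code): without a shell, the `*` reaches the child process as a literal character, while `glob.glob()` expands it on the Python side.

```python
import glob
import os
import subprocess
import tempfile

def demo_wildcard_expansion():
    """Illustrative only: show that subprocess without a shell passes '*'
    literally, while glob.glob() expands it in Python."""
    with tempfile.TemporaryDirectory() as d:
        open(os.path.join(d, 'batch_1.xml'), 'w').close()
        pattern = os.path.join(d, '*.xml')
        # 'ls' receives the literal pattern and exits non-zero: no such file.
        returncode = subprocess.run(
            ['ls', pattern],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL).returncode
        # glob expands the pattern in Python before any subprocess is involved.
        matched = glob.glob(pattern)
        return returncode, matched
```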

Solution

Replace shell-based wildcard expansion with Python's glob.glob() module:

Changes Required

  1. Add import: Add import glob at the top of arcflow/main.py (around line 1-12 with other imports)

  2. Fix symlink removal (lines ~473-486):

    • Replace the shell command rm {xml_file_path} with Python's os.remove()
    • Use glob.glob() to expand the wildcard pattern before removing files
    • Remove shell=True parameter

    Current code:

    xml_file_path = f'{xml_dir}/{repo_id}_*_batch_{batch_num}.xml'
    result = subprocess.run(
        f'rm {xml_file_path}',
        shell=True,
        cwd=self.arclight_dir,
        stderr=subprocess.PIPE,)

    Should become:

    xml_file_pattern = f'{xml_dir}/{repo_id}_*_batch_{batch_num}.xml'
    xml_files = glob.glob(xml_file_pattern)
    
    for xml_file_path in xml_files:
        try:
            os.remove(xml_file_path)
            self.log.info(f'{" " * indent_size}Removed pending symlink {xml_file_path}')
        except FileNotFoundError:
            self.log.warning(f'{" " * indent_size}File not found: {xml_file_path}')
        except Exception as e:
            self.log.error(f'{" " * indent_size}Error removing pending symlink {xml_file_path}: {e}')
  3. Fix index_collections method (line ~576):

    • Use glob.glob() to expand wildcards in xml_file_path before passing to traject command
    • Add a check to warn if no files match the pattern

    Current code:

    cmd.append(xml_file_path)

    Should become:

    # Expand wildcards with glob
    xml_files = glob.glob(xml_file_path)
    
    if not xml_files:
        self.log.warning(f'{indent}No files found matching pattern: {xml_file_path}')
        return
    
    # Add all matching files to the command
    cmd.extend(xml_files)

Benefits

  • Security: Eliminates use of shell=True in subprocess calls
  • Reliability: Explicit wildcard expansion makes the code behavior clearer
  • Portability: Using Python built-ins (os.remove, glob.glob) instead of shell commands
  • Error handling: Better control over what happens when files don't exist

Testing

After making these changes, test with the same command that exposed the issue:

python -m arcflow.main \
  --arclight-dir /Users/adryden3/Projects/ASpace/development_env/arcuit \
  --aspace-dir /Users/adryden3/Projects/ASpace/development_env/my-app \
  --solr-url "http://localhost:8983/solr/blacklight-core" \
  --agents-only \
  --force-update

Verify that:

  1. Files matching the wildcard pattern are correctly identified
  2. No error messages about files not found
  3. The indexing completes successfully

This pull request was created from Copilot chat.



alexdryden and others added 30 commits February 11, 2026 14:45
Implement data ETL for standalone creator records from ArchivesSpace
agents and automatically index them to Solr for discovery in ArcLight.

PROBLEM:
ArchivesSpace agents (people, organizations, families) were not discoverable
as standalone entities in ArcLight. Users could not browse or search for
agents independently of their collections.

SOLUTION:
1. Extract ALL agents from ArchivesSpace via API
2. Generate EAC-CPF XML documents for each agent
3. Define Solr schema fields for agent metadata
4. Configure traject to index agent XML to Solr
5. Implement automatic indexing after XML generation

FEATURES:
- Processes all agent types (people, corporate entities, families)
- Generates standards-compliant EAC-CPF XML
- Links agents to their collections via persistent_id
- Automatic discovery of traject config (bundle show arcuit)
- Batch processing (100 files per traject call)
- Robust error handling with detailed logging
- Multiple processing modes (normal, agents-only, collections-only)

COMPONENTS:

1. Python Processing (arcflow/main.py - 1428 lines):
   - get_all_agents() - Fetch ALL agents from ArchivesSpace API
   - task_agent() - Generate EAC-CPF XML via archival_contexts endpoint
   - process_creators() - Batch process all agents in parallel (10 workers)
   - find_traject_config() - Auto-discover traject configuration
   - index_creators() - Batch index to Solr via traject

2. Solr Schema (solr/conf/arcuit_creator_fields.xml - 11 fields):
   - is_creator (boolean) - Identifies agent records
   - creator_persistent_id (string) - Unique identifier
   - agent_type (string) - Type: corporate/person/family
   - agent_id (string) - ArchivesSpace agent ID
   - agent_uri (string) - ArchivesSpace agent URI
   - entity_type (string) - EAC-CPF entity type
   - related_agents_ssim (multiValued) - Related agent names
   - related_agent_uris_ssim (multiValued) - Related agent URIs
   - relationship_types_ssim (multiValued) - Relationship types
   - document_type (string) - Document type (eac-cpf)
   - record_type (string) - Record type (creator/agent)

3. Traject Configuration (traject_config_eac_cpf.rb - 271 lines):
   - Maps EAC-CPF XML elements to Solr fields
   - Extracts agent identity information
   - Processes biographical/historical notes
   - Captures related agents and relationships
   - Handles collection linkages

USAGE:

python -m arcflow.main \
  --arclight-dir /path/to/arclight \
  --aspace-dir /path/to/archivesspace \
  --solr-url http://localhost:8983/solr/blacklight-core

python -m arcflow.main ... --agents-only

python -m arcflow.main ... --collections-only

python -m arcflow.main ... --skip-creator-indexing

python -m arcflow.main ... --arcuit-dir /path/to/arcuit

COMMAND LINE OPTIONS:
--arclight-dir PATH        Path to ArcLight application (required)
--aspace-dir PATH          Path to ArchivesSpace data (required)
--solr-url URL             Solr instance URL (required)
--arcuit-dir PATH          Path to arcuit gem (optional, auto-detected)
--agents-only              Process only agents, skip collections
--collections-only         Process only collections, skip agents
--skip-creator-indexing    Generate XML but don't index to Solr
--force-update             Process all records regardless of timestamps

ARCHITECTURE:

ArchivesSpace API
    ↓ (archivessnake library)
arcflow Python code
    ↓ (fetches via /repositories/1/archival_contexts)
EAC-CPF XML files (public/xml/agents/*.xml)
    ↓ (indexed via traject)
Solr (blacklight-core)
    ↓ (discovered via ArcLight)
ArcLight

DATA FLOW:
1. arcflow calls get_all_agents() - fetches ALL agents from ArchivesSpace API
2. For each agent, task_agent() retrieves EAC-CPF from archival_contexts endpoint
3. Saves EAC-CPF XML to public/xml/agents/ directory
4. find_traject_config() discovers config via 'bundle show arcuit' or --arcuit-dir
5. index_creators() batches XML files (100 per call) and invokes traject
6. traject indexes XML to Solr with is_creator:true flag
7. Agent records now searchable in ArcLight

BENEFITS:
- Users can discover all agents independently of collections
- Direct navigation to agent pages
- Browse all agents of a specific type
- View all collections linked to a specific agent
- Standards-based EAC-CPF format for interoperability
- Automatic indexing reduces manual steps
- Flexible processing modes for different workflows

TECHNICAL DETAILS:
- EAC-CPF format: urn:isbn:1-931666-33-4 namespace
- ID extraction: Filename-based (handles empty control element in EAC-CPF)
- Batch size: 100 files per traject call
- Parallel processing: 10 worker processes for agent generation
- Timeout: 5 minutes per batch
- Error handling: Log errors, continue processing
- Linking: Via Solr persistent_id field (not direct XML updates)
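The 100-files-per-traject-call batching listed above could be sketched as follows. This is a hedged illustration: `chunked()`, `build_traject_commands()`, and the exact traject invocation are assumptions, not the project's actual code; only the batch size of 100 comes from the description.

```python
BATCH_SIZE = 100  # files per traject call, per the technical details above

def chunked(items, size=BATCH_SIZE):
    """Yield successive fixed-size slices of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def build_traject_commands(xml_files, config='traject_config_eac_cpf.rb',
                           solr_url='http://localhost:8983/solr/blacklight-core'):
    """Build one argv list per batch; each list would be handed to
    subprocess.run(). Command shape is illustrative."""
    cmds = []
    for batch in chunked(xml_files):
        cmd = ['bundle', 'exec', 'traject', '-c', config,
               '-s', f'solr.url={solr_url}']
        cmd.extend(batch)  # all XML paths in this batch, as separate args
        cmds.append(cmd)
    return cmds
```

Passing the files as separate argv elements (rather than one space-joined string) is what keeps this compatible with filenames containing whitespace.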

FILES CHANGED:

arcflow-phase1-revised/
├── arcflow/
│   ├── __init__.py (updated imports)
│   ├── main.py (1428 lines, core logic)
│   └── utils/
│       ├── __init__.py
│       ├── bulk_import.py
│       └── stage_classifications.py
├── traject_config_eac_cpf.rb (271 lines)
├── requirements.txt
├── .archivessnake.yml.example
├── README.md (updated)
├── HOW_TO_USE.md (updated)
└── TESTING.md (updated)

solr/
├── README.md (installation instructions)
└── conf/
    └── arcuit_creator_fields.xml (11 field definitions)

Documentation:
├── CREATOR_INDEXING_GUIDE.md (comprehensive guide)
├── AUTOMATED_INDEXING_IMPLEMENTATION.md (technical details)
└── README.md (updated with creator section)

TESTING:

Manual verification:
1. Run arcflow with --agents-only flag
2. Verify XML files generated in public/xml/agents/
3. Check Solr for indexed agent records
4. Verify is_creator:true in Solr documents
5. Test agent-collection linking via persistent_id

Automated testing:
- Python syntax validation
- Ruby syntax validation (traject config)
- Solr schema validation

DEPLOYMENT:

1. Add Solr schema fields to schema.xml
2. Reload Solr core
3. Run arcflow to generate and index agents
4. Verify agents appear in Solr
5. Test in ArcLight interface

BACKWARD COMPATIBILITY:
- No breaking changes to existing functionality
- Collections continue to work as before
- Agent indexing is additive
- Can be disabled with --skip-creator-indexing
There is no longer any need for these to be defined outside the method.
Allow for whitespace in filenames and for quoted arguments.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
fix typo

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
In Nokogiri XPath, every namespaced element must be prefixed (e.g., //eac:control/eac:recordId), and a matching namespace mapping must exist.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Fix EAC-CPF namespace handling in XPath queries
punctuation/spacing

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
as a separate command. 

Detailed explanation: self.traject_extra_config is constructed as a single string containing a space (e.g., "-c /path/to/file.rb"), but subprocess.run(cmd) passes arguments verbatim and Traject won’t parse that as two flags/values. Store the extra config as a path (or as an already-split argv list) and append it as ['-c', traject_extra_config] (or extend with a list) when building cmd.
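The argv fix described above can be sketched in a few lines. The variable names and command shape below are illustrative, not the actual arcflow code:

```python
import shlex

traject_extra_config = '/path/to/traject_config_eac_cpf.rb'

# Broken: "-c /path/..." becomes a SINGLE argv element, which traject
# cannot parse as a flag plus a value.
broken_cmd = ['bundle', 'exec', 'traject', f'-c {traject_extra_config}']

# Correct: keep the flag and its value as separate argv elements.
cmd = ['bundle', 'exec', 'traject']
cmd.extend(['-c', traject_extra_config])

# If the extra config arrives as a pre-formatted string, split it with
# shlex so quoted paths containing spaces survive intact.
cmd2 = ['bundle', 'exec', 'traject'] + shlex.split('-c "/path/with space/cfg.rb"')
```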

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
to prevent commands from running together

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Revert to manual returncode checking for subprocess error handling
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Copilot AI changed the title Add timeout and diagnostic logging for ArchivesSpace PDF jobs Add timeout and fix troubleshooting for stuck ArchivesSpace PDF jobs Feb 20, 2026
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Copilot AI changed the title Add timeout and fix troubleshooting for stuck ArchivesSpace PDF jobs Add adaptive timeout for ArchivesSpace PDF generation jobs Feb 20, 2026
Copilot AI changed the title Add adaptive timeout for ArchivesSpace PDF generation jobs Add two-phase timeout for PDF job processing Feb 20, 2026
alexdryden and others added 2 commits February 28, 2026 16:13
Replace per-agent API calls with single Solr query for better performance:
- Query ArchivesSpace Solr to filter agents in bulk
- Exclude system users (publish=false)
- Exclude donors (linked_agent_role includes "dnr")
- Exclude software agents (agent_type="agent_software")
- Use consistent EAC namespace prefixes in XPath queries
- Refactor dates extraction for improved readability

Performance improvement: O(n) API calls → O(1) Solr query
Reduces processing time from minutes to seconds for large repositories.
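The bulk filter described in this commit might translate to a Solr query along these lines. This is a speculative sketch: the field names (`publish`, `linked_agent_roles`, `agent_type`, `primary_type`) are assumptions inferred from the commit message, not confirmed ArchivesSpace Solr schema fields.

```python
from urllib.parse import urlencode

# Hypothetical single Solr query replacing per-agent API calls: select all
# agent types, then exclude unpublished agents, donors, and software agents.
params = {
    'q': 'primary_type:agent_person OR primary_type:agent_corporate_entity '
         'OR primary_type:agent_family',
    'fq': ['publish:true', '-linked_agent_roles:dnr',
           '-agent_type:agent_software'],
    'rows': 10000,
    'wt': 'json',
}
query_string = urlencode(params, doseq=True)
```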

to reflect the required command line arguments

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
log record if filename is not expected pattern: creator_{type}_{id}
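That filename check could look like the following; the regex and helper are illustrative assumptions built from the `creator_{type}_{id}` pattern named in the commit, not the project's code.

```python
import re

# Expected filename shape: creator_{type}_{id}.xml, e.g. creator_person_42.xml
CREATOR_NAME = re.compile(r'^creator_(?P<type>[a-z_]+)_(?P<id>\d+)\.xml$')

def parse_creator_filename(filename):
    """Return (agent_type, agent_id) or None if the name does not match,
    in which case the caller would log the record and skip it."""
    m = CREATOR_NAME.match(filename)
    if m is None:
        return None
    return m.group('type'), int(m.group('id'))
```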
@alexdryden

@copilot the only thing I want to keep from this is the timeout for waiting for pdf generation, can you remove everything else so that I can smoothly merge this into index_creators?

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

Copilot AI commented Mar 3, 2026

@copilot the only thing I want to keep from this is the timeout for waiting for pdf generation, can you remove everything else so that I can smoothly merge this into index_creators?

Done! Removed all glob wildcard changes and kept only the PDF timeout features (commit 6674a21).

The branch now contains only:

  • Two-phase timeout system (queued 5min, running 30min)
  • CLI arguments for timeout configuration
  • Enhanced logging and status tracking
  • Context-aware error messages

Removed:

  • glob import and wildcard expansion
  • os.remove() file removal
  • All shell security fixes

Ready for smooth merge into index_creators.

@alexdryden alexdryden force-pushed the copilot/replace-wildcard-expansion branch from 12f3297 to 02156e4 Compare March 3, 2026 20:06
