Add pytest infrastructure with 113 tests covering core functionality#20

Draft
Copilot wants to merge 45 commits into index_creators from copilot/add-comprehensive-testing-infrastructure

Copilot AI commented Feb 23, 2026

Establishes comprehensive testing infrastructure to accelerate AI agent development workflows. Currently covers 8 key modules with targeted tests for file operations, EAD processing, batch logic, and configuration discovery.

Infrastructure

  • pytest.ini: Coverage reporting, custom markers (unit, integration, slow, skip_complex)
  • CI/CD workflow: Python 3.9/3.10/3.11, coverage upload to Codecov
  • Shared fixtures: Mock ArchivesSpace client, sample data structures, temp directories
  • Documentation: tests/README.md with usage patterns and testing philosophy
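A mock ArchivesSpace client of the kind listed above could look roughly like the following. This is a minimal sketch, not the contents of tests/conftest.py: the helper name `make_mock_asnake_client` and the sample agent record are hypothetical, and it assumes the code under test only calls `client.get(...)` and reads `.status_code` and `.json()` from the response.

```python
from unittest.mock import MagicMock


def make_mock_asnake_client(agents=None):
    """Build a stand-in client whose get() returns a canned JSON response.

    Hypothetical helper: the real fixture in tests/conftest.py may differ.
    """
    if agents is None:
        # Sample record invented for illustration only.
        agents = [{"uri": "/agents/people/1", "title": "Example Person"}]

    response = MagicMock(name="response")
    response.status_code = 200
    response.json.return_value = agents

    client = MagicMock(name="asnake_client")
    client.get.return_value = response
    return client


# In conftest.py the helper could be exposed as a fixture, for example:
#
#   @pytest.fixture
#   def mock_asnake_client():
#       return make_mock_asnake_client()
```

Because the client is a `MagicMock`, a test can also assert on how it was called, for example `client.get.assert_called_with("/agents/people")`.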

Test Coverage (113 passing, 10 skipped)

  • File operations (15 tests): save_file, create_symlink, get_ead_from_symlink
  • Subprocess/glob (18 tests): Wildcard expansion, iglob iteration, CSV discovery
  • EAD operations (14 tests): ID extraction, dots→dashes sanitization, XML parsing
  • Batch processing (21 tests): Size calculations, iteration patterns, edge cases
  • Config discovery (15 tests): find_traject_config multi-path search (arcuit_dir → bundle → fallback)
  • XML manipulation (20 tests): Escaping for plain text vs. structured XML preservation
  • Utilities (11 tests): get_repo_id, path construction helpers
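To give a sense of what one of the EAD tests might check, here is a small sketch of the dots→dashes sanitization with two pytest-style tests. The function name `sanitize_ead_id` and the sample IDs are assumptions made for illustration; the real helper in arcflow may have a different name and handle more cases.

```python
def sanitize_ead_id(ead_id: str) -> str:
    """Replace dots with dashes in an EAD ID (hypothetical helper)."""
    return ead_id.replace(".", "-")


def test_dots_become_dashes():
    assert sanitize_ead_id("ua.2024.001") == "ua-2024-001"


def test_id_without_dots_is_unchanged():
    assert sanitize_ead_id("ua-2024-001") == "ua-2024-001"
```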

Agent Filtering (Intentionally Stubbed)

test_agent_filtering.py documents the 5-tier filtering system but marks all implementation tests as skipped. The current logic is too complex to test without refactoring:

@pytest.mark.skip(reason="Agent filtering too complex - needs refactoring before testing")
def test_system_user_exclusion(self):
    """
    SKIPPED: Should test that agents with is_user=True are excluded.
    
    Current issues:
    - Requires full ArcFlow instance
    - Needs mock client with complex response structure
    - Tight coupling to logging
    """
    pass

3 documentation tests pass to verify filtering tier definitions remain accurate.
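One way the refactoring mentioned above could make the skipped tests practical is to pull each rule out into a pure function that takes an agent record and returns a boolean. The sketch below covers only the system-user rule named in the skipped test; the function names are hypothetical, and it assumes an agent is a dict carrying an `is_user` key. The other tiers are not described here, so they are not sketched.

```python
def is_system_user(agent: dict) -> bool:
    """True when the agent record is flagged as a system user.

    Hypothetical predicate; assumes the flag lives under the "is_user" key.
    """
    return bool(agent.get("is_user"))


def exclude_system_users(agents):
    """Drop system users without needing a client, logger, or ArcFlow instance."""
    return [agent for agent in agents if not is_system_user(agent)]


def test_system_user_exclusion():
    agents = [
        {"title": "admin", "is_user": True},
        {"title": "Example Creator"},
    ]
    assert exclude_system_users(agents) == [{"title": "Example Creator"}]
```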

Usage

pytest                              # All tests
pytest -m unit                      # Unit tests only
pytest --cov=arcflow --cov-report=term-missing

Metrics

  • Test count: 123 total (113 passing, 10 skipped)
  • Coverage: 16% baseline (targeted functions)
  • Runtime: ~5 seconds
  • Dependencies: pytest>=7.0.0, pytest-cov>=4.0.0, pytest-mock>=3.10.0

Original prompt

Add comprehensive testing infrastructure to speed up AI agent development workflow:

Infrastructure Files

  1. pytest.ini - Configure pytest with coverage and markers
  2. .github/workflows/test.yml - CI/CD for Python 3.9, 3.10, 3.11
  3. tests/conftest.py - Shared fixtures (mock_asnake_client, temp_dir, sample data)
  4. tests/README.md - Documentation on running and writing tests

Test Files

Create 8 test files in tests/ directory:

  1. test_file_operations.py - save_file, create_symlink, get_ead_from_symlink
  2. test_subprocess_fixes.py - glob.glob wildcard expansion for batch files
  3. test_ead_operations.py - EAD ID extraction, dots→dashes sanitization
  4. test_batching.py - batch calculation logic, edge cases
  5. test_config_discovery.py - find_traject_config method (new in index_creators)
  6. test_xml_manipulation.py - XML escaping, bioghist content handling
  7. test_utilities.py - Simple helpers like get_repo_id, path construction
  8. test_agent_filtering.py - STUB ONLY - Document that filtering logic is too complex and needs refactoring before testing. Mark all tests as skipped with note.

Key Points

  • Use pytest fixtures from conftest.py
  • Mock ArchivesSpace API calls
  • Test happy paths, edge cases, and error handling
  • Agent filtering gets stub warning file only (too complex to test as-is)
  • Include pytest-cov for coverage reporting
  • Tests should pass except agent filtering (intentionally skipped)

Dependencies

Add to setup.py or pyproject.toml:

pytest>=7.0.0
pytest-cov>=4.0.0  
pytest-mock>=3.10.0

This pull request was created from Copilot chat.

alexdryden and others added 30 commits February 11, 2026 14:45
Implement data ETL for standalone creator records from ArchivesSpace
agents and automatically index them to Solr for discovery in ArcLight.

PROBLEM:
ArchivesSpace agents (people, organizations, families) were not discoverable
as standalone entities in ArcLight. Users could not browse or search for
agents independently of their collections.

SOLUTION:
1. Extract ALL agents from ArchivesSpace via API
2. Generate EAC-CPF XML documents for each agent
3. Define Solr schema fields for agent metadata
4. Configure traject to index agent XML to Solr
5. Implement automatic indexing after XML generation

FEATURES:
- Processes all agent types (people, corporate entities, families)
- Generates standards-compliant EAC-CPF XML
- Links agents to their collections via persistent_id
- Automatic discovery of traject config (bundle show arcuit)
- Batch processing (100 files per traject call)
- Robust error handling with detailed logging
- Multiple processing modes (normal, agents-only, collections-only)

COMPONENTS:

1. Python Processing (arcflow/main.py - 1428 lines):
   - get_all_agents() - Fetch ALL agents from ArchivesSpace API
   - task_agent() - Generate EAC-CPF XML via archival_contexts endpoint
   - process_creators() - Batch process all agents in parallel (10 workers)
   - find_traject_config() - Auto-discover traject configuration
   - index_creators() - Batch index to Solr via traject

2. Solr Schema (solr/conf/arcuit_creator_fields.xml - 11 fields):
   - is_creator (boolean) - Identifies agent records
   - creator_persistent_id (string) - Unique identifier
   - agent_type (string) - Type: corporate/person/family
   - agent_id (string) - ArchivesSpace agent ID
   - agent_uri (string) - ArchivesSpace agent URI
   - entity_type (string) - EAC-CPF entity type
   - related_agents_ssim (multiValued) - Related agent names
   - related_agent_uris_ssim (multiValued) - Related agent URIs
   - relationship_types_ssim (multiValued) - Relationship types
   - document_type (string) - Document type (eac-cpf)
   - record_type (string) - Record type (creator/agent)

3. Traject Configuration (traject_config_eac_cpf.rb - 271 lines):
   - Maps EAC-CPF XML elements to Solr fields
   - Extracts agent identity information
   - Processes biographical/historical notes
   - Captures related agents and relationships
   - Handles collection linkages

USAGE:

python -m arcflow.main \
  --arclight-dir /path/to/arclight \
  --aspace-dir /path/to/archivesspace \
  --solr-url http://localhost:8983/solr/blacklight-core

python -m arcflow.main ... --agents-only

python -m arcflow.main ... --collections-only

python -m arcflow.main ... --skip-creator-indexing

python -m arcflow.main ... --arcuit-dir /path/to/arcuit

COMMAND LINE OPTIONS:
--arclight-dir PATH        Path to ArcLight application (required)
--aspace-dir PATH          Path to ArchivesSpace data (required)
--solr-url URL             Solr instance URL (required)
--arcuit-dir PATH          Path to arcuit gem (optional, auto-detected)
--agents-only              Process only agents, skip collections
--collections-only         Process only collections, skip agents
--skip-creator-indexing    Generate XML but don't index to Solr
--force-update             Process all records regardless of timestamps
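The options above map naturally onto an argparse parser. The following is a sketch built only from the flag names and descriptions in this list; it is not the parser in arcflow/main.py, and details such as defaults and whether --agents-only and --collections-only are mutually exclusive are left open.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser for the documented flags (not the real one)."""
    parser = argparse.ArgumentParser(prog="arcflow")
    parser.add_argument("--arclight-dir", required=True,
                        help="Path to ArcLight application")
    parser.add_argument("--aspace-dir", required=True,
                        help="Path to ArchivesSpace data")
    parser.add_argument("--solr-url", required=True,
                        help="Solr instance URL")
    parser.add_argument("--arcuit-dir", default=None,
                        help="Path to arcuit gem (auto-detected if omitted)")
    parser.add_argument("--agents-only", action="store_true",
                        help="Process only agents, skip collections")
    parser.add_argument("--collections-only", action="store_true",
                        help="Process only collections, skip agents")
    parser.add_argument("--skip-creator-indexing", action="store_true",
                        help="Generate XML but don't index to Solr")
    parser.add_argument("--force-update", action="store_true",
                        help="Process all records regardless of timestamps")
    return parser
```

argparse converts dashes to underscores, so `--agents-only` is read back as `args.agents_only`.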

ARCHITECTURE:

ArchivesSpace API
    ↓ (archivessnake library)
arcflow Python code
    ↓ (fetches via /repositories/1/archival_contexts)
EAC-CPF XML files (public/xml/agents/*.xml)
    ↓ (indexed via traject)
Solr (blacklight-core)
    ↓ (discovered via ArcLight)
ArcLight

DATA FLOW:
1. arcflow calls get_all_agents() - fetches ALL agents from ArchivesSpace API
2. For each agent, task_agent() retrieves EAC-CPF from archival_contexts endpoint
3. Saves EAC-CPF XML to public/xml/agents/ directory
4. find_traject_config() discovers config via 'bundle show arcuit' or --arcuit-dir
5. index_creators() batches XML files (100 per call) and invokes traject
6. traject indexes XML to Solr with is_creator:true flag
7. Agent records now searchable in ArcLight

BENEFITS:
- Users can discover all agents independently of collections
- Direct navigation to agent pages
- Browse all agents of a specific type
- View all collections linked to a specific agent
- Standards-based EAC-CPF format for interoperability
- Automatic indexing reduces manual steps
- Flexible processing modes for different workflows

TECHNICAL DETAILS:
- EAC-CPF format: urn:isbn:1-931666-33-4 namespace
- ID extraction: Filename-based (handles empty control element in EAC-CPF)
- Batch size: 100 files per traject call
- Parallel processing: 10 worker processes for agent generation
- Timeout: 5 minutes per batch
- Error handling: Log errors, continue processing
- Linking: Via Solr persistent_id field (not direct XML updates)
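The batch size, timeout, and log-and-continue behaviour listed above can be sketched as follows. This is an illustration under stated assumptions rather than the code in index_creators(): the function names are hypothetical, `base_cmd` stands for whatever traject command prefix the caller has built, and the `run` parameter exists only so the loop can be exercised without invoking traject.

```python
import logging
import subprocess

BATCH_SIZE = 100        # files per traject call, per the list above
TIMEOUT_SECONDS = 300   # 5 minutes per batch

log = logging.getLogger(__name__)


def batched(items, size=BATCH_SIZE):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def index_in_batches(xml_files, base_cmd, run=subprocess.run):
    """Run one command per batch; return the number of failed batches."""
    failures = 0
    for number, batch in enumerate(batched(xml_files), start=1):
        cmd = [*base_cmd, *batch]  # each file is its own argv element
        try:
            result = run(cmd, capture_output=True, text=True,
                         timeout=TIMEOUT_SECONDS)
        except subprocess.TimeoutExpired:
            log.error("Batch %d timed out after %d s", number, TIMEOUT_SECONDS)
            failures += 1
            continue  # keep going with the next batch
        if result.returncode != 0:  # manual returncode check
            log.error("Batch %d failed: %s", number, result.stderr)
            failures += 1
    return failures
```

Returning a failure count instead of raising keeps one bad batch from aborting the rest of the run, which matches the "log errors, continue processing" point.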

FILES CHANGED:

arcflow-phase1-revised/
├── arcflow/
│   ├── __init__.py (updated imports)
│   ├── main.py (1428 lines, core logic)
│   └── utils/
│       ├── __init__.py
│       ├── bulk_import.py
│       └── stage_classifications.py
├── traject_config_eac_cpf.rb (271 lines)
├── requirements.txt
├── .archivessnake.yml.example
├── README.md (updated)
├── HOW_TO_USE.md (updated)
└── TESTING.md (updated)

solr/
├── README.md (installation instructions)
└── conf/
    └── arcuit_creator_fields.xml (11 field definitions)

Documentation:
├── CREATOR_INDEXING_GUIDE.md (comprehensive guide)
├── AUTOMATED_INDEXING_IMPLEMENTATION.md (technical details)
└── README.md (updated with creator section)

TESTING:

Manual verification:
1. Run arcflow with --agents-only flag
2. Verify XML files generated in public/xml/agents/
3. Check Solr for indexed agent records
4. Verify is_creator:true in Solr documents
5. Test agent-collection linking via persistent_id
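For steps 3 and 4, a quick check is to ask Solr how many documents carry the is_creator flag. The helper below only builds the query URL against Solr's standard /select handler; the function name is hypothetical, and the Solr URL is the one from the usage example above. Fetching the URL and reading `response.numFound` from the JSON is left to the reader.

```python
from urllib.parse import urlencode


def creator_count_url(solr_url: str) -> str:
    """Build a /select URL that counts documents with is_creator:true."""
    params = {"q": "is_creator:true", "rows": 0, "wt": "json"}
    return f"{solr_url.rstrip('/')}/select?{urlencode(params)}"


url = creator_count_url("http://localhost:8983/solr/blacklight-core")
# url == "http://localhost:8983/solr/blacklight-core/select?q=is_creator%3Atrue&rows=0&wt=json"
```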

Automated testing:
- Python syntax validation
- Ruby syntax validation (traject config)
- Solr schema validation

DEPLOYMENT:

1. Add Solr schema fields to schema.xml
2. Reload Solr core
3. Run arcflow to generate and index agents
4. Verify agents appear in Solr
5. Test in ArcLight interface

BACKWARD COMPATIBILITY:
- No breaking changes to existing functionality
- Collections continue to work as before
- Agent indexing is additive
- Can be disabled with --skip-creator-indexing

There is no longer any need for these to be defined outside the method

allow for whitespace in filenames and for quoted arguments

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
fix typo

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
In Nokogiri XPath, every namespaced element must be prefixed (e.g., //eac:control/eac:recordId), and a namespace mapping for that prefix must exist.
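The same rule can be illustrated in Python with the standard library's ElementTree, used here as an analogue and not as the Nokogiri code in the traject config. The sample document is invented; the namespace URI is the EAC-CPF one cited in this PR (urn:isbn:1-931666-33-4).

```python
import xml.etree.ElementTree as ET

EAC_NS = {"eac": "urn:isbn:1-931666-33-4"}

# Invented sample: the default namespace applies to every element.
SAMPLE = """<eac-cpf xmlns="urn:isbn:1-931666-33-4">
  <control><recordId>example-id</recordId></control>
</eac-cpf>"""

root = ET.fromstring(SAMPLE)

# Without prefixes the path names elements in no namespace, so nothing matches.
unprefixed = root.find("control/recordId")

# With a prefix on every step, plus a mapping for that prefix, the element is found.
prefixed = root.find("eac:control/eac:recordId", EAC_NS)
```

Here `unprefixed` is `None`, while `prefixed.text` is `"example-id"`.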

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Fix EAC-CPF namespace handling in XPath queries
punctuation/spacing

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
as a separate command. 

Detailed explanation: self.traject_extra_config is constructed as a single string containing a space (e.g., "-c /path/to/file.rb"), but subprocess.run(cmd) passes arguments verbatim and Traject won’t parse that as two flags/values. Store the extra config as a path (or as an already-split argv list) and append it as ['-c', traject_extra_config] (or extend with a list) when building cmd.
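The difference is easy to observe by launching a child process that prints the arguments it received. This sketch uses the Python interpreter as the child, so it shows how subprocess.run passes list elements through unchanged; it does not run traject, and the helper name and the path are placeholders.

```python
import json
import subprocess
import sys

# Child program: print its own arguments as JSON.
ECHO = "import json, sys; print(json.dumps(sys.argv[1:]))"


def argv_seen_by_child(extra_args):
    """Return the argument list a child process actually receives."""
    result = subprocess.run(
        [sys.executable, "-c", ECHO, *extra_args],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


# One string containing a space arrives as ONE argument.
as_single_string = argv_seen_by_child(["-c /path/to/file.rb"])

# Flag and value as separate list elements arrive as TWO arguments.
as_two_elements = argv_seen_by_child(["-c", "/path/to/file.rb"])
```

`as_single_string` is `["-c /path/to/file.rb"]` and `as_two_elements` is `["-c", "/path/to/file.rb"]`; only the second form is something a command-line parser can read as a flag followed by its value.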

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
to prevent commands from running together

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Revert to manual returncode checking for subprocess error handling
Copilot AI and others added 10 commits February 13, 2026 17:34
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
… selection

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Filter non-creator agents from indexing (exclude system users, donors)
…#14)

* Improve traject config discovery and logging

- Add fallback search in arcflow package directory for development
- Add clear logging showing which traject config is being used
- Add warning when using arcflow package version (development mode)
- Improve error messages when traject config not found
- Document that traject config belongs in arcuit gem, not arcflow
- Update README with traject config location guidance

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

* Address code review feedback

- Change log level from error to warning for missing traject config
- Update example path to clarify arcuit gem location
- Show actual searched paths in error message for better troubleshooting

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

* Reorder traject config search to follow collection records pattern

- Change search order: arcuit_dir (1st) → bundle show (2nd) → example file (3rd)
- Rename traject_config_eac_cpf.rb to example_traject_config_eac_cpf.rb
- Prioritize arcuit_dir parameter as most up-to-date user control
- Fall back to example file for module usage without arcuit
- Update README with new search order and example file guidance

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

* Address code review feedback on example file

- Update usage comment to reference correct filename
- Improve log message formatting for consistency
- Add note about copying to arcuit for production use

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

* Update traject config search paths to follow ArcLight pattern

- Remove arcuit_dir/arcflow path (development artifact)
- Add arcuit_dir/lib/arcuit/traject path (matches EAD traject location)
- Apply same paths to both arcuit_dir and bundle show arcuit searches
- Update debug message to reflect new subdirectory checked

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

* Simplify example traject config search to single known location

- Remove candidate paths loop for example file
- Directly check the one known location at repo root
- Add comment explaining we know the exact location

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
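Taken together, the changes above describe a three-step search: the arcuit_dir parameter first, then the gem directory reported by bundle show arcuit, then the example file at the repo root. The sketch below illustrates that order under several assumptions: the config filename inside arcuit, the single subdirectory checked, and the injected `bundle_show` callable are all hypothetical, and the real find_traject_config method may check more locations and log along the way.

```python
from pathlib import Path
from typing import Callable, Optional

CONFIG_NAME = "traject_config_eac_cpf.rb"            # assumed name inside arcuit
EXAMPLE_NAME = "example_traject_config_eac_cpf.rb"   # example file at repo root
SUBDIRS = ("lib/arcuit/traject",)                    # only path named above; assumed list


def _search(base) -> Optional[Path]:
    """Look for the config under each known subdirectory of `base`."""
    for sub in SUBDIRS:
        candidate = Path(base) / sub / CONFIG_NAME
        if candidate.is_file():
            return candidate
    return None


def find_traject_config(arcuit_dir, bundle_show: Callable[[], Optional[str]],
                        repo_root) -> Optional[Path]:
    """Return the first config found, or None if every step fails."""
    # 1. Explicit arcuit_dir takes priority.
    if arcuit_dir:
        found = _search(arcuit_dir)
        if found:
            return found
    # 2. Gem location as reported by `bundle show arcuit` (injected here).
    gem_dir = bundle_show()
    if gem_dir:
        found = _search(gem_dir)
        if found:
            return found
    # 3. Fall back to the example file at the one known location.
    example = Path(repo_root) / EXAMPLE_NAME
    return example if example.is_file() else None
```

Passing `bundle_show` in as a callable keeps the sketch testable without a Ruby toolchain; in practice it would wrap a subprocess call.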

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
and update documentation to reflect creator records are also deleted
upon --force-update.
This adds logic to delete creator records, similar to resources, and
refactors the directory structure of the XML files so it is clear where
resource and agent documents live for create/read/delete.
- Add pytest, pytest-cov, pytest-mock to requirements.txt
- Create pytest.ini with coverage and marker configuration
- Create .github/workflows/test.yml for CI on Python 3.9, 3.10, 3.11
- Create tests/conftest.py with shared fixtures
- Create tests/README.md with documentation
- Add 8 test files covering key functionality
- All tests passing (113 passed, 10 intentionally skipped)

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add comprehensive testing infrastructure for AI agent development" to "Add pytest infrastructure with 113 tests covering core functionality" Feb 23, 2026
Copilot AI requested a review from alexdryden February 23, 2026 20:14
Copilot AI and others added 4 commits February 26, 2026 12:05
* Initial plan

* Add testing infrastructure with Traject smoke tests

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

* Add traject_plus gem and fix smoke test writer class

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

* Address code review feedback: fix fixture duplication and test logic

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

* Add explicit permissions to GitHub Actions workflow for security

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Co-authored-by: Alex Dryden <alex.dryden@gmail.com>
alexdryden force-pushed the index_creators branch 3 times, most recently from f23fe83 to 89057a9 March 4, 2026 23:52