[FEATURE] Composable extraction pipeline with stage registry and typed contracts #241

Draft
mykola-pereyma wants to merge 8 commits into awslabs:main from mykola-pereyma:feature/composable-extraction-pipeline

@mykola-pereyma commented May 5, 2026

Description

Closes #240

Introduces a composable extraction pipeline framework that allows users to define custom stage sequences mixing local extraction (CPU-based NER via GLiNER) and LLM-based extraction, with schema/ontology constraints for guiding and filtering results.

Changes

Core Abstractions

  • ExtractionStage ABC with typed input_keys()/output_keys()/as_transform() contracts
  • PipelineBuilder with build-time validation of stage input/output compatibility
  • ExtractionConfig.from_stages(stages, schema) factory method for custom pipelines
  • ExtractionSchema + EntityTypeConfig for defining entity types with descriptions, attributes, and aliases
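The contract idea above can be illustrated with a minimal, self-contained sketch. It does not use the toolkit's classes: `Stage`, `Propositions`, `Topics`, and `build` are hypothetical stand-ins, and the key names are assumptions chosen for illustration. It shows only how declared input and output keys allow a builder to reject an invalid stage order before any document is processed.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Set


class Stage(ABC):
    """Hypothetical stand-in for ExtractionStage (not the toolkit's class)."""

    @abstractmethod
    def input_keys(self) -> Set[str]:
        """Metadata keys this stage reads."""

    @abstractmethod
    def output_keys(self) -> Set[str]:
        """Metadata keys this stage writes."""

    @abstractmethod
    def as_transform(self) -> Callable[[Dict], Dict]:
        """Return the callable that performs the work."""


class Propositions(Stage):
    def input_keys(self) -> Set[str]:
        return {"text"}

    def output_keys(self) -> Set[str]:
        return {"propositions"}

    def as_transform(self) -> Callable[[Dict], Dict]:
        # Toy logic: split text into sentence-like pieces.
        return lambda meta: {**meta, "propositions": meta["text"].split(". ")}


class Topics(Stage):
    def input_keys(self) -> Set[str]:
        return {"propositions"}

    def output_keys(self) -> Set[str]:
        return {"topics"}

    def as_transform(self) -> Callable[[Dict], Dict]:
        # Toy logic: treat the first word of each proposition as its topic.
        return lambda meta: {
            **meta,
            "topics": [p.split()[0] for p in meta["propositions"] if p],
        }


def build(stages: List[Stage], initial_keys: Set[str]) -> Callable[[Dict], Dict]:
    """Validate key compatibility up front, then compose the transforms."""
    available = set(initial_keys)
    for stage in stages:
        missing = stage.input_keys() - available
        if missing:
            raise ValueError(
                f"{type(stage).__name__} is missing inputs: {sorted(missing)}"
            )
        available |= stage.output_keys()

    transforms = [stage.as_transform() for stage in stages]

    def run(meta: Dict) -> Dict:
        for transform in transforms:
            meta = transform(meta)
        return meta

    return run
```

With this sketch, `build([Topics()], {"text"})` raises a `ValueError` at build time because nothing upstream produces `propositions`, whereas `build([Propositions(), Topics()], {"text"})` succeeds.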

Built-in Stages (8)

| Stage | Type | Purpose |
| --- | --- | --- |
| LLMPropositionStage | LLM | Extract propositions from text |
| LocalPropositionStage | Local | Non-LLM proposition extraction |
| LLMTopicExtractionStage | LLM | Extract topics, entities, relationships |
| NERExtractionStage | Local | GLiNER-based CPU entity extraction |
| EntityMergeStage | Transform | Fuzzy-deduplicate NER + LLM entities |
| SchemaFilterStage | Filter | Filter output against schema (strict/non-strict) |
| BatchLLMPropositionStage | Batch | Batch variant of proposition extraction |
| BatchTopicExtractionStage | Batch | Batch variant of topic extraction |
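For readers unfamiliar with fuzzy deduplication, here is a rough sketch of how NER and LLM entity lists might be merged under a similarity threshold. This is not the EntityMergeStage implementation: the `merge_entities` function is hypothetical, and the use of `difflib.SequenceMatcher` as the similarity measure is an assumption, since the PR does not say which metric is used.

```python
from difflib import SequenceMatcher
from typing import List


def merge_entities(
    ner: List[str], llm: List[str], fuzzy_threshold: float = 0.85
) -> List[str]:
    """Keep the first spelling seen and drop later near-duplicates.

    Similarity is a case-insensitive SequenceMatcher ratio (an assumption
    made for this sketch, not the toolkit's documented metric).
    """
    merged: List[str] = []
    for name in ner + llm:
        is_duplicate = any(
            SequenceMatcher(None, name.lower(), kept.lower()).ratio()
            >= fuzzy_threshold
            for kept in merged
        )
        if not is_duplicate:
            merged.append(name)
    return merged
```

In this sketch NER results come first, so their spelling wins when an LLM entity is a near match. A higher threshold merges less aggressively.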

Additional

  • JSON topic extraction prompt with text parser fallback
  • Schema constraints auto-injected into LLM topic extraction prompt
  • INFO logging of custom pipeline composition
  • Pydantic PrivateAttr for NER model caching
  • Error handling with clear messages for NER model loading

Backward Compatibility

  • ExtractionConfig() constructor unchanged — existing code works without modification
  • Pipeline output remains TopicCollection format
  • Default chunking applies to both default and custom pipeline paths
  • Pure additive extension — no breaking changes

Testing

  • 229 extraction-specific tests
  • 1308 total tests passing
  • Integration tests for custom pipeline path (chunking + schema injection)
  • All pre-existing tests pass (no regression)

Usage

from graphrag_toolkit.lexical_graph import ExtractionConfig
from graphrag_toolkit.lexical_graph.indexing.extract import (
    LLMPropositionStage, LLMTopicExtractionStage,
    NERExtractionStage, EntityMergeStage, SchemaFilterStage,
    ExtractionSchema, EntityTypeConfig,
)

schema = ExtractionSchema(
    entity_types={
        'Person': EntityTypeConfig(description='A human being'),
        'Organization': EntityTypeConfig(description='A company or institution'),
    },
    strict=True,
)

config = ExtractionConfig.from_stages(
    stages=[
        LLMPropositionStage(),
        NERExtractionStage(entity_labels=['Person', 'Organization']),
        LLMTopicExtractionStage(),
        EntityMergeStage(fuzzy_threshold=0.85),
        SchemaFilterStage(schema=schema),
    ],
    schema=schema,
)

Documentation

  • Usage guide: docs-site/src/content/docs/lexical-graph/composable-extraction-pipeline-usage.md
  • Example notebook: examples/lexical-graph-local-dev/notebooks/06-Composable-Extraction-Pipeline.ipynb

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Define abstract base class with input_keys(), output_keys(), and
as_transform() methods. Stages declare their metadata key contracts
to enable build-time validation by PipelineBuilder.

Part of composable extraction pipeline (Option 2).

SchemaFilter TransformComponent filters entities and relationships
against ExtractionSchema. Supports strict mode, case-insensitive
entity type matching, and alias resolution.
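The filtering behaviour described in this commit could look roughly like the sketch below. The `Schema` and `TypeConfig` dataclasses and the `filter_entities` function are hypothetical stand-ins, not the toolkit's ExtractionSchema or SchemaFilter. The sketch also assumes that strict mode drops entities whose type is not in the schema while non-strict mode keeps them unchanged, which the PR does not spell out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class TypeConfig:
    """Hypothetical stand-in for EntityTypeConfig."""
    description: str = ""
    aliases: List[str] = field(default_factory=list)


@dataclass
class Schema:
    """Hypothetical stand-in for ExtractionSchema."""
    entity_types: Dict[str, TypeConfig]
    strict: bool = True

    def resolve(self, label: str) -> Optional[str]:
        """Map a label or alias to its canonical type name, ignoring case."""
        wanted = label.casefold()
        for name, config in self.entity_types.items():
            aliases = [alias.casefold() for alias in config.aliases]
            if wanted == name.casefold() or wanted in aliases:
                return name
        return None


def filter_entities(
    entities: List[Tuple[str, str]], schema: Schema
) -> List[Tuple[str, str]]:
    """Filter (value, type) pairs against the schema.

    Assumption: strict mode drops unknown types; non-strict keeps them as-is.
    """
    kept: List[Tuple[str, str]] = []
    for value, label in entities:
        canonical = schema.resolve(label)
        if canonical is not None:
            kept.append((value, canonical))
        elif not schema.strict:
            kept.append((value, label))
    return kept
```

Here an entity typed `company` resolves to `Organization` through an alias, and an entity typed `PERSON` would match a `Person` schema entry regardless of case.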

Part of composable extraction pipeline (Option 2).

ExtractionConfig now accepts optional stages (List[ExtractionStage])
and schema (ExtractionSchema). When stages are provided,
_configure_extraction_pipeline uses PipelineBuilder to compose them,
bypassing the default hardcoded pipeline. Backward compatible.

Part of composable extraction pipeline (Option 2).

Jupyter notebook with 9 examples: default pipeline using stages,
schema-constrained extraction, pipeline validation, skip propositions,
NER + LLM hybrid, full pipeline, local propositions, custom stages,
and end-to-end indexing.

When ExtractionConfig has a schema, it is automatically passed to
LLMTopicExtractionStage which formats it as prompt constraints via
format_as_prompt_constraint(). This guides the LLM during extraction,
not just during post-extraction filtering. Added {schema_constraints}
placeholder to the text prompt. Backward compatible — empty string
when no schema is set.
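To make the placeholder mechanism concrete, here is a small sketch of how a `{schema_constraints}` slot might be filled. The template wording, the `format_constraints` helper, and the `build_prompt` function are all hypothetical; the actual prompt text and the output of `format_as_prompt_constraint()` are not shown in this PR description.

```python
from typing import Dict, Optional

# Hypothetical template; only the {schema_constraints} placeholder name
# comes from the commit message.
PROMPT_TEMPLATE = (
    "Extract topics, entities, and relationships from the text.\n"
    "{schema_constraints}"
    "Text:\n{text}\n"
)


def format_constraints(entity_types: Optional[Dict[str, str]]) -> str:
    """Render entity types as prompt constraints, or '' when no schema is set."""
    if not entity_types:
        return ""
    lines = [f"- {name}: {description}" for name, description in entity_types.items()]
    return "Only use these entity types:\n" + "\n".join(lines) + "\n"


def build_prompt(text: str, entity_types: Optional[Dict[str, str]] = None) -> str:
    """Fill the template; the prompt is unchanged when no schema is given."""
    return PROMPT_TEMPLATE.format(
        schema_constraints=format_constraints(entity_types),
        text=text,
    )
```

Because the placeholder collapses to an empty string, callers that never set a schema get the same prompt as before, which is what keeps the change backward compatible.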
…bility, NER error handling

- Replace ExtractionConfig(stages=...) with ExtractionConfig.from_stages() factory
  method to prevent misuse of mutually exclusive parameters
- Move chunking above custom stages early return so it applies to both paths
- Add INFO log showing custom pipeline composition
- Use Pydantic PrivateAttr for NER model caching instead of fragile hasattr
- Add logging and error handling for NER model loading
- Simplify from_stages() to call cls() for future-proof attribute inheritance
- Add integration tests for custom pipeline path (chunking + schema injection)
- Update docs and notebook to use from_stages() API
@mykola-pereyma mykola-pereyma force-pushed the feature/composable-extraction-pipeline branch from 8221a98 to 9d28e55 Compare May 14, 2026 23:35
@mykola-pereyma mykola-pereyma marked this pull request as draft May 14, 2026 23:38