[FEATURE] Composable extraction pipeline with stage registry and typed contracts #241
Draft
mykola-pereyma wants to merge 8 commits into
Define the ExtractionStage abstract base class with input_keys(), output_keys(), and as_transform() methods. Each stage declares the metadata keys it reads and writes so that PipelineBuilder can validate the pipeline at build time. Part of composable extraction pipeline (Option 2).
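The contract idea in this commit can be sketched as follows. This is a minimal illustration, not the code in the PR: only the method names input_keys(), output_keys(), and as_transform() come from the commit message. The two concrete stages, the metadata keys ("text", "propositions", "topics"), and the validate_pipeline() helper are hypothetical stand-ins for whatever PipelineBuilder does internally.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Iterable, List, Set

Metadata = Dict[str, object]


class ExtractionStage(ABC):
    """Illustrative stand-in for the PR's base class (not the real one)."""

    @abstractmethod
    def input_keys(self) -> Set[str]:
        """Metadata keys this stage needs to find."""

    @abstractmethod
    def output_keys(self) -> Set[str]:
        """Metadata keys this stage adds."""

    @abstractmethod
    def as_transform(self) -> Callable[[Metadata], Metadata]:
        """Return the callable that performs the stage's work."""


class SketchPropositionStage(ExtractionStage):
    # Hypothetical stage: splits text into sentence-like propositions.
    def input_keys(self) -> Set[str]:
        return {"text"}

    def output_keys(self) -> Set[str]:
        return {"propositions"}

    def as_transform(self) -> Callable[[Metadata], Metadata]:
        def run(md: Metadata) -> Metadata:
            parts = [s.strip() for s in str(md["text"]).split(".")]
            return {**md, "propositions": [p for p in parts if p]}
        return run


class SketchTopicStage(ExtractionStage):
    # Hypothetical stage: placeholder logic, takes the last word as "topic".
    def input_keys(self) -> Set[str]:
        return {"propositions"}

    def output_keys(self) -> Set[str]:
        return {"topics"}

    def as_transform(self) -> Callable[[Metadata], Metadata]:
        def run(md: Metadata) -> Metadata:
            props = md["propositions"]
            return {**md, "topics": [p.split()[-1] for p in props]}
        return run


def validate_pipeline(stages: List[ExtractionStage],
                      initial_keys: Iterable[str]) -> None:
    """Fail before any data flows if a stage's inputs are never produced."""
    available = set(initial_keys)
    for index, stage in enumerate(stages):
        missing = stage.input_keys() - available
        if missing:
            raise ValueError(
                f"stage {index} ({type(stage).__name__}) "
                f"is missing inputs: {sorted(missing)}"
            )
        available |= stage.output_keys()
```

Checking contracts before running anything means a mis-ordered pipeline fails with a message naming the stage and the missing keys, rather than failing partway through an extraction run.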
Add SchemaFilter, a TransformComponent that filters entities and relationships against an ExtractionSchema. Supports strict mode, case-insensitive entity type matching, and alias resolution. Part of composable extraction pipeline (Option 2).
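The filtering behaviour described here might look roughly like the sketch below. It is self-contained and does not use the PR's classes: SketchEntityType, filter_entities(), and the dict shape of an entity are hypothetical. It also assumes that strict mode means dropping entities whose type is not in the schema, which the commit message does not spell out.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SketchEntityType:
    """Hypothetical stand-in for an entity type with aliases."""
    name: str
    aliases: List[str] = field(default_factory=list)


def build_lookup(types: List[SketchEntityType]) -> Dict[str, str]:
    """Map every lowercased name and alias to the canonical type name."""
    lookup: Dict[str, str] = {}
    for entity_type in types:
        lookup[entity_type.name.lower()] = entity_type.name
        for alias in entity_type.aliases:
            lookup[alias.lower()] = entity_type.name
    return lookup


def filter_entities(entities: List[Dict[str, str]],
                    types: List[SketchEntityType],
                    strict: bool = True) -> List[Dict[str, str]]:
    """Keep schema-known entities, rewriting their type to the canonical name.

    Assumption: with strict=True unknown types are dropped; with
    strict=False they pass through unchanged.
    """
    lookup = build_lookup(types)
    kept: List[Dict[str, str]] = []
    for entity in entities:
        canonical = lookup.get(entity["type"].lower())
        if canonical is not None:
            kept.append({**entity, "type": canonical})
        elif not strict:
            kept.append(entity)
    return kept
```

Resolving aliases to one canonical name at filter time keeps downstream stages from having to handle "Company" and "Organization" as separate types.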
ExtractionConfig now accepts optional stages (List[ExtractionStage]) and schema (ExtractionSchema). When stages are provided, _configure_extraction_pipeline uses PipelineBuilder to compose them, bypassing the default hardcoded pipeline. Backward compatible. Part of composable extraction pipeline (Option 2).
Jupyter notebook with 9 examples: default pipeline using stages, schema-constrained extraction, pipeline validation, skip propositions, NER + LLM hybrid, full pipeline, local propositions, custom stages, and end-to-end indexing.
When ExtractionConfig has a schema, it is automatically passed to
LLMTopicExtractionStage which formats it as prompt constraints via
format_as_prompt_constraint(). This guides the LLM during extraction,
not just during post-extraction filtering. Added {schema_constraints}
placeholder to the text prompt. Backward compatible — empty string
when no schema is set.
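A small sketch of how an optional {schema_constraints} placeholder can stay backward compatible. The template text, render_constraints(), and build_prompt() are all hypothetical; the PR's actual prompt and its format_as_prompt_constraint() output are not shown in this conversation, so only the placeholder name and the empty-string fallback are taken from the commit message.

```python
from typing import Dict, Optional

# Hypothetical prompt template; only the placeholder name comes from the PR.
PROMPT_TEMPLATE = (
    "Extract topics and entities from the text below.\n"
    "{schema_constraints}"
    "Text:\n{text}\n"
)


def render_constraints(entity_types: Optional[Dict[str, str]]) -> str:
    """Render entity types as prompt lines, or '' when no schema is set."""
    if not entity_types:
        return ""
    lines = [f"- {name}: {desc}" for name, desc in entity_types.items()]
    return "Only use these entity types:\n" + "\n".join(lines) + "\n"


def build_prompt(text: str,
                 entity_types: Optional[Dict[str, str]] = None) -> str:
    """Fill the template; without a schema the prompt has no constraint block."""
    return PROMPT_TEMPLATE.format(
        schema_constraints=render_constraints(entity_types),
        text=text,
    )
```

Because the placeholder collapses to an empty string, callers that never set a schema get a prompt with no constraint block at all.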
…bility, NER error handling
- Replace ExtractionConfig(stages=...) with ExtractionConfig.from_stages() factory method to prevent misuse of mutually exclusive parameters
- Move chunking above custom stages early return so it applies to both paths
- Add INFO log showing custom pipeline composition
- Use Pydantic PrivateAttr for NER model caching instead of fragile hasattr
- Add logging and error handling for NER model loading
- Simplify from_stages() to call cls() for future-proof attribute inheritance
- Add integration tests for custom pipeline path (chunking + schema injection)
- Update docs and notebook to use from_stages() API
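The factory-method change can be illustrated with a plain dataclass. This is a sketch under stated assumptions, not the PR's ExtractionConfig: SketchConfig, its chunk_size field, and the uses_custom_pipeline property are hypothetical, and stages are represented as plain strings. It shows two points from the list above: custom stages cannot be passed to the constructor, and from_stages() calls cls() so that every default is inherited.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SketchConfig:
    """Hypothetical config; not the PR's ExtractionConfig."""
    chunk_size: int = 256  # hypothetical default-path option

    # init=False keeps these out of the constructor, so the only way to
    # set them is the factory method below.
    stages: Optional[List[str]] = field(default=None, init=False)
    schema: Optional[Dict[str, str]] = field(default=None, init=False)

    @classmethod
    def from_stages(cls, stages: List[str],
                    schema: Optional[Dict[str, str]] = None) -> "SketchConfig":
        if not stages:
            raise ValueError("from_stages() needs at least one stage")
        config = cls()  # start from defaults so new fields are inherited
        config.stages = list(stages)
        config.schema = schema
        return config

    @property
    def uses_custom_pipeline(self) -> bool:
        return self.stages is not None
```

Passing stages= to the constructor raises TypeError in this sketch, which is one way to make the two configuration paths mutually exclusive by construction.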
8221a98 to 9d28e55
Description
Closes #240
Introduces a composable extraction pipeline framework that allows users to define custom stage sequences mixing local extraction (CPU-based NER via GLiNER) and LLM-based extraction, with schema/ontology constraints for guiding and filtering results.
Changes
Core Abstractions
- ExtractionStage ABC with typed input_keys()/output_keys()/as_transform() contracts
- PipelineBuilder with build-time validation of stage input/output compatibility
- ExtractionConfig.from_stages(stages, schema) factory method for custom pipelines
- ExtractionSchema + EntityTypeConfig for defining entity types with descriptions, attributes, and aliases
Built-in Stages (8)
- LLMPropositionStage
- LocalPropositionStage
- LLMTopicExtractionStage
- NERExtractionStage
- EntityMergeStage
- SchemaFilterStage
- BatchLLMPropositionStage
- BatchTopicExtractionStage
Additional
Backward Compatibility
- ExtractionConfig() constructor unchanged; existing code works without modification
- TopicCollection format
Testing
Usage
Documentation
- docs-site/src/content/docs/lexical-graph/composable-extraction-pipeline-usage.md
- examples/lexical-graph-local-dev/notebooks/06-Composable-Extraction-Pipeline.ipynb
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.