A reference implementation of the data quality metrics introduced in my Master’s research project, documented in doc/*/paper.typ (Property Graph Quality Assessment). This project provides a systematic framework for evaluating completeness, validity, consistency, integrity, and uniqueness in labeled property graphs (e.g., those stored in Neo4j).
Property graphs are schema‑flexible and semantically rich, but this freedom makes them prone to:
- Missing relationships or nodes → completeness issues
- Invalid label sets or malformed property values → conformity violations
- Inconsistent functional dependencies → coherence flaws
- Structural anomalies (duplicate edges, missing mandatory properties) → integrity / uniqueness degradation
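To make two of these issue types concrete, here is a minimal sketch of how a completeness score and a duplicate-edge score might be computed on a small in-memory graph. It is an illustration only: the function names, the dictionary layout for nodes, and the tuple layout for edges are assumptions made for this example, not the metric definitions from the paper or the project's actual API.

```python
from collections import Counter


def property_completeness(nodes, label, prop):
    """Share of nodes carrying `label` whose `prop` is present and non-null."""
    labelled = [n for n in nodes if label in n["labels"]]
    if not labelled:
        return 1.0  # assumption: no matching nodes counts as vacuously complete
    filled = sum(1 for n in labelled if n["props"].get(prop) is not None)
    return filled / len(labelled)


def duplicate_edge_ratio(edges):
    """Share of edges that repeat an earlier (source, type, target) triple."""
    if not edges:
        return 0.0
    counts = Counter(edges)
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(edges)


# Hypothetical toy graph: two Person nodes (one without a name) and one City.
nodes = [
    {"labels": {"Person"}, "props": {"name": "Ada"}},
    {"labels": {"Person"}, "props": {}},
    {"labels": {"City"}, "props": {"name": "Paris"}},
]
# Edges as (source index, relationship type, target index); the first is repeated.
edges = [(0, "LIVES_IN", 2), (0, "LIVES_IN", 2), (1, "LIVES_IN", 2)]

print(property_completeness(nodes, "Person", "name"))  # 0.5
print(duplicate_edge_ratio(edges))
```

Both scores fall in the range 0 to 1, which makes them easy to compare across labels and relationship types.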
Automated quality profiling helps:
- Validate graph‑based ETL pipelines
- Enforce domain constraints without a rigid schema
- Detect semantic drift in labels and relationships
- Improve downstream analytics (e.g., graph ML, path queries)
Requires Python 3.10+ and uv.
# Clone the repository
git clone https://github.com/LugolBis/data-quality.git
cd data-quality
# Create virtual environment and install dependencies
uv venv
uv pip install -e .
Create a .env file and configure it:
echo '' > .env
Then copy and paste the following into it, replacing the placeholder values:
URI="neo4j://127.0.0.1:7687"
DB_USER="your_neo4j_user"
DB_PW="your_neo4j_password"
DB_NAME="your_database"
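Before launching the profiler, it can be useful to check that the .env file is complete. The sketch below parses the file with the standard library only and reports which of the four keys above are missing. The helper names (`parse_env`, `missing_keys`, `load_settings`) are hypothetical, and how the project itself reads these values is not specified here.

```python
from pathlib import Path

# The four keys listed in the .env template above.
REQUIRED_KEYS = ("URI", "DB_USER", "DB_PW", "DB_NAME")


def parse_env(text):
    """Parse KEY="value" lines, skipping blanks, comments, and malformed lines."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding single or double quotes from the value.
        settings[key.strip()] = value.strip().strip("\"'")
    return settings


def missing_keys(settings):
    """Return the required keys that are absent or empty."""
    return [k for k in REQUIRED_KEYS if not settings.get(k)]


def load_settings(path=".env"):
    """Read and parse the file at `path`; a missing file yields no settings."""
    env_file = Path(path)
    text = env_file.read_text(encoding="utf-8") if env_file.exists() else ""
    return parse_env(text)


# Example with an incomplete file: only two of the four keys are set.
example = 'URI="neo4j://127.0.0.1:7687"\nDB_USER="your_neo4j_user"\n'
settings = parse_env(example)
print(settings["URI"])         # neo4j://127.0.0.1:7687
print(missing_keys(settings))  # ['DB_PW', 'DB_NAME']
```

With the official Neo4j Python driver, values like these would typically be passed to `GraphDatabase.driver(uri, auth=(user, password))`, with the database name given when opening a session; whether this project connects exactly that way is an assumption.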
Launch the interactive profiler (Streamlit UI):
streamlit run src/main.py
Then:
- Connect to a Neo4j database (or upload a Cypher dump).
- Define constraints based on your domain rules.
- Run the assessment and export the results as CSV.
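As a rough idea of what an exported result could look like, the sketch below writes a list of per-constraint scores to CSV text with Python's standard `csv` module. The column names (`dimension`, `constraint`, `score`) and the sample rows are invented for illustration; the profiler's real export format may differ.

```python
import csv
import io

# Hypothetical column layout for an exported assessment.
FIELDNAMES = ["dimension", "constraint", "score"]


def results_to_csv(results):
    """Serialize a list of result dictionaries to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(results)
    return buffer.getvalue()


# Made-up sample results, one row per checked constraint.
results = [
    {"dimension": "completeness", "constraint": "Person.name is set", "score": 0.5},
    {"dimension": "uniqueness", "constraint": "no duplicate LIVES_IN edges", "score": 0.75},
]

print(results_to_csv(results))
```

Keeping one row per constraint makes the file easy to filter by quality dimension in a spreadsheet or a dataframe.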