data-quality

A reference implementation of the data quality metrics introduced in my Master’s research project, documented in doc/*/paper.typ (Property Graph Quality Assessment). This project provides a systematic framework for evaluating completeness, validity, consistency, integrity, and uniqueness in labeled property graphs (e.g., Neo4j).

Why measure data quality in a property graph?

Property graphs are schema‑flexible and semantically rich, but this freedom makes them prone to:

  • Missing relationships or nodes → completeness issues
  • Invalid label sets or malformed property values → validity violations
  • Violated functional dependencies → consistency flaws
  • Structural anomalies (duplicate edges, missing mandatory properties) → integrity / uniqueness degradation
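
To make two of these issue types concrete, here is a minimal sketch that scores a toy in-memory graph. The node and edge shapes, the function names, and the ratio-based scoring are illustrative assumptions; they are not the project's data model or the metric definitions from the paper.

```python
# Toy in-memory graph; these shapes are assumptions made for illustration only.
nodes = [
    {"id": 1, "labels": {"Person"}, "props": {"name": "Ada", "email": "ada@example.org"}},
    {"id": 2, "labels": {"Person"}, "props": {"name": "Bob"}},
    {"id": 3, "labels": {"Person"}, "props": {"name": "Cy", "email": None}},
]
edges = [
    (1, "KNOWS", 2),
    (1, "KNOWS", 2),  # same source, type, and target as the edge above
    (2, "KNOWS", 3),
]


def property_completeness(nodes, label, key):
    """Share of nodes carrying `label` whose property `key` is present and not None."""
    matching = [n for n in nodes if label in n["labels"]]
    if not matching:
        return 1.0  # nothing to check, treated as vacuously complete
    filled = sum(1 for n in matching if n["props"].get(key) is not None)
    return filled / len(matching)


def edge_uniqueness(edges):
    """Share of distinct (source, type, target) triples among all edges."""
    if not edges:
        return 1.0
    return len(set(edges)) / len(edges)


email_score = property_completeness(nodes, "Person", "email")
uniq_score = edge_uniqueness(edges)
print(f"email completeness: {email_score:.2f}")
print(f"edge uniqueness: {uniq_score:.2f}")
```

Both scores fall in [0, 1], with 1 meaning no issue was found. A real assessment would run against the database rather than Python lists.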

Automated quality profiling helps:

  • Validate graph‑based ETL pipelines
  • Enforce domain constraints without a rigid schema
  • Detect semantic drift in labels and relationships
  • Improve downstream analytics (e.g., graph ML, path queries)

Getting started

Requires Python 3.10+ and uv.

# Clone the repository
git clone https://github.com/LugolBis/data-quality.git
cd data-quality

# Create virtual environment and install dependencies
uv venv
uv pip install -e .

Create a .env file and configure it:

echo '' > .env

Then paste the following into it, replacing the placeholder values with your own:

URI="neo4j://127.0.0.1:7687"
DB_USER="your_neo4j_user"
DB_PW="your_neo4j_password"
DB_NAME="your_database"
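
Before launching the app, it can help to check that the file defines all four variables. The sketch below uses only the standard library; `read_env` and `missing_keys` are hypothetical helpers written for this example, and how the project itself loads these values is not shown here.

```python
from pathlib import Path

# The four keys listed in the .env example above.
REQUIRED_KEYS = ("URI", "DB_USER", "DB_PW", "DB_NAME")


def read_env(path):
    """Parse simple KEY="value" lines; blank lines and # comments are skipped."""
    settings = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional single or double quotes.
        settings[key.strip()] = value.strip().strip("\"'")
    return settings


def missing_keys(settings):
    """Return the required keys that are absent or empty."""
    return [k for k in REQUIRED_KEYS if not settings.get(k)]


env_path = Path(".env")
if env_path.exists():
    print("missing keys:", missing_keys(read_env(env_path)))
```

An empty list means every required key has a non-empty value. The parser deliberately handles only the plain `KEY="value"` form used above, not the full dotenv syntax.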

Usage

Launch the interactive profiler (Streamlit UI):

streamlit run src/main.py

Then:

  1. Connect to a Neo4j database (or upload a Cypher dump).
  2. Define constraints based on your domain rules.
  3. Run the assessment and export the results as CSV.
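
As a rough illustration of what a CSV export of assessment results could look like, the sketch below serializes a few rows with the standard `csv` module. The column names, constraint descriptions, and scores are invented for this example and do not reflect the file the app actually produces.

```python
import csv
import io

# Hypothetical assessment results; names and scores are made up for illustration.
results = [
    {"dimension": "completeness", "constraint": "Person.email is mandatory", "score": 0.5},
    {"dimension": "uniqueness", "constraint": "no duplicate KNOWS edges", "score": 1.0},
]


def results_to_csv(rows):
    """Serialize result rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["dimension", "constraint", "score"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = results_to_csv(results)
print(csv_text)
```

One row per constraint keeps the export easy to filter or aggregate by dimension in a spreadsheet or with pandas.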