GS-Quant learns semantically structured discrete codes for KG entities and injects them into LLM-based knowledge graph completion.
- Granular Semantic Enhancement (GSE) aligns quantization levels with coarse-to-fine semantic granularity.
- Generative Structural Reconstruction (GSR) turns discrete codes into causally structured semantic descriptors.
- Quantized entity codes are injected into LLM prompts for knowledge graph completion.
- The repository provides an end-to-end pipeline from hierarchy construction and quantization to LoRA fine-tuning and evaluation.
.
├── cluster/ # hierarchy construction and optional LLM refinement
├── codebook/ # residual quantization modules
├── dataset/ # KG dataloaders and argument helpers
├── graph_embedding/ # hierarchy-aware KG embedding model
├── scripts/ # training / analysis shell scripts
├── train_graph_embedding.py # stage 1: graph embedding training
├── train_codebook.py # stage 2: codebook training
├── use_codebook.py # export quantized entity ids and token vocabulary
├── adapter_lora_data.py # convert external KG-LLM data into token-augmented JSONL
├── train_lora.py # stage 3: LoRA fine-tuning
├── eval_llm.py # generation and evaluation pipeline
└── eval_on_dataset.py # standalone metric computation from predictions
This project is managed with uv.
uv sync
The current pyproject.toml requires Python 3.13+. Main dependencies include PyTorch, Transformers, PEFT, DeepSpeed, Accelerate, and vLLM.
If you train large models, make sure the local environment also provides:
- CUDA-compatible PyTorch
- enough GPU memory for your chosen backbone
- a working DeepSpeed / NCCL setup for multi-GPU runs
Raw benchmark data is expected under:
data/<dataset>/
├── entities.dict
├── relations.dict
├── train.txt
├── valid.txt
├── test.txt
└── entity2text.txt / entity.json # optional but recommended
train_graph_embedding.py and related utilities assume the standard entity / relation dictionary format and triple files separated by tabs.
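As a rough illustration of the expected inputs, the sketch below reads a dictionary file and a triple file. It assumes each line of entities.dict and relations.dict is `<id>`, a tab, then `<name>`, and each triple line is `<head>`, a tab, `<relation>`, a tab, `<tail>`. That layout is a common KG benchmark convention and an assumption here; the helper names `load_dict` and `load_triples` are hypothetical and not part of this repository.

```python
from pathlib import Path


def load_dict(path):
    """Map each name to its integer id, assuming tab-separated `<id> <name>` lines."""
    mapping = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        idx, name = line.split("\t", 1)
        mapping[name] = int(idx)
    return mapping


def load_triples(path, entity2id, relation2id):
    """Read tab-separated `<head> <relation> <tail>` lines into id triples."""
    triples = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        head, relation, tail = line.split("\t")
        triples.append((entity2id[head], relation2id[relation], entity2id[tail]))
    return triples
```

If your files store the name first and the id second, swap the two fields in `load_dict`.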
Precomputed hierarchy artifacts are stored under:
processed_data/<dataset>/
├── entity_init_embeddings.npy
├── clusters_embeddings_seed.npy
├── clusters_embeddings_llm.npy
├── seed_hierarchy.json
├── llm_hierarchy.json
├── entity_info_seed_hier.json
└── entity_info_llm_hier.json
These files are produced by the hierarchy construction pipeline in cluster/.
adapter_lora_data.py can adapt third-party instruction-tuning datasets from
DIFT, stored in the following layout:
data/DIFT-dataset/<dataset_alias>/<source>/data_KGELlama/
├── train.json
├── valid.json
└── test.json
The script converts them into this repository's JSONL format and optionally inserts quantized entity tokens into prompts and candidates.
For convenience, the repository also provides one-command batch scripts for common runs, such as scripts/batch_lora_codebook_gpu_multi.sh and scripts/batch_lora_codebook_gpu.sh.
For hierarchy construction and the corresponding processed files, please refer to
KG-FIT. This repository assumes the hierarchy
artifacts have already been prepared under processed_data/<dataset>/.
For graph embedding training, please refer to the provided shell script scripts/run_train_graph_embedding.sh. The resulting checkpoints are written to:
processed_data/<dataset>/checkpoints/<model>_<hierarchy_type>_batch_<...>/
Depending on flags, this stage can also export top-k hit records for later adapter-data creation.
Train the residual quantizer with train_codebook.py, using:
- graph embedding checkpoints from Step 2
- hierarchy-aware cluster embeddings under processed_data/<dataset>/
- hierarchy metadata such as entity_info_llm_hier.json
This stage produces a codebook checkpoint and the corresponding quantized entity representations:
- entity_quantized.json: quantized code indices for each entity
- tokens.json: textual token vocabulary derived from the codebook
If needed, they can be exported afterward with use_codebook.py.
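For readers unfamiliar with residual quantization, the following is a minimal, generic sketch of how one vector becomes a sequence of code indices, with each level encoding what the previous levels left over. It is not the training code in train_codebook.py: the codebooks here are fixed, made-up toy values, and the function names are hypothetical.

```python
def nearest(codebook, vector):
    """Return the index of the codeword closest to `vector` (squared L2)."""
    def sq_dist(code):
        return sum((c - v) ** 2 for c, v in zip(code, vector))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def residual_quantize(vector, codebooks):
    """Quantize level by level; each level encodes the remaining residual."""
    residual = list(vector)
    codes = []
    for codebook in codebooks:
        index = nearest(codebook, residual)
        codes.append(index)
        # Subtract the chosen codeword so the next level sees only the error.
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return codes, residual


def reconstruct(codes, codebooks):
    """Sum the selected codewords to approximate the original vector."""
    total = [0.0] * len(codebooks[0][0])
    for index, codebook in zip(codes, codebooks):
        total = [t + c for t, c in zip(total, codebook[index])]
    return total


# Toy codebooks (made up): level 0 is coarse, level 1 refines the residual.
codebooks = [
    [[0.0, 0.0], [4.0, 4.0]],
    [[-0.5, 0.5], [0.5, -0.5], [0.0, 0.0]],
]
codes, residual = residual_quantize([4.4, 3.6], codebooks)
approximation = reconstruct(codes, codebooks)
```

Because earlier levels capture the largest part of the vector, the leading code indices behave like coarse categories and later ones like refinements, which is the property GSE aligns with semantic granularity.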
Use adapter_lora_data.py to convert DIFT-style supervision data into this repository's JSONL format by combining:
- the external KGC supervision data from DIFT
- the quantized entity codes from Step 3
- entity metadata from the processed hierarchy files
If --wrap_token is enabled, the script also updates tokens.json with:
- <#begin_of_entity>
- <#end_of_entity>
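The sketch below illustrates how quantized codes might be rendered as tokens, wrapped with the two boundary tokens, and placed into a record with instruction, input, and output fields. Only the two wrapper tokens come from this README. The `<#L{level}_{index}>` token naming, the prompt wording, the example codes, and all function names are assumptions made for illustration; the actual format written by adapter_lora_data.py may differ.

```python
import json

BEGIN, END = "<#begin_of_entity>", "<#end_of_entity>"


def code_tokens(codes):
    """Render code indices as tokens; the `<#L{level}_{index}>` naming is hypothetical."""
    return [f"<#L{level}_{index}>" for level, index in enumerate(codes)]


def entity_mention(name, codes, wrap_token=True):
    """Append an entity's code tokens to its name, optionally wrapped."""
    body = "".join(code_tokens(codes))
    if wrap_token:
        body = f"{BEGIN}{body}{END}"
    return f"{name} {body}"


def make_record(head, relation, tail, entity_codes):
    """Build one instruction-tuning record for a tail-prediction query."""
    query = entity_mention(head, entity_codes[head])
    return {
        "instruction": f"Predict the tail entity for ({query}, {relation}, ?).",
        "input": "",
        "output": entity_mention(tail, entity_codes[tail]),
    }


entity_codes = {"Paris": [3, 17, 5], "France": [3, 2, 9]}  # made-up codes
record = make_record("Paris", "capital_of", "France", entity_codes)
line = json.dumps(record, ensure_ascii=False)  # one JSONL line
```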
The generated records follow the standard instruction-tuning format:
{ "instruction": "...", "input": "", "output": "..." }
Use train_lora.py to fine-tune the LLM on the token-augmented JSONL data from Step 4. During training:
- New quantized tokens are appended to the tokenizer vocabulary when --tokens_file is provided.
- LoRA target modules default to q_proj,v_proj.
- The training file can be json, jsonl, or csv.
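To make the three accepted file types concrete, here is a standard-library sketch that dispatches on the file extension. It is only an illustration under assumptions: train_lora.py may load data differently, the `.json` case assumes the file holds a list of records, and `load_records` is a hypothetical helper name.

```python
import csv
import json
from pathlib import Path


def load_records(path):
    """Load instruction records from a .json, .jsonl, or .csv file."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".jsonl":
        # One JSON object per non-empty line.
        with path.open(encoding="utf-8") as handle:
            return [json.loads(line) for line in handle if line.strip()]
    if suffix == ".json":
        # Assumed to contain a single JSON list of records.
        with path.open(encoding="utf-8") as handle:
            return json.load(handle)
    if suffix == ".csv":
        # Column headers become record keys; all values are strings.
        with path.open(encoding="utf-8", newline="") as handle:
            return [dict(row) for row in csv.DictReader(handle)]
    raise ValueError(f"Unsupported training file type: {suffix}")
```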
Use eval_llm.py to generate predictions and compute ranking metrics on the test split. It saves:
- predictions.jsonl
- metrics.json
- rank_predictions.csv
- eval_args.json
For post-hoc metric computation from an existing prediction file, use eval_on_dataset.py.
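This README does not name the ranking metrics. As an assumption, the sketch below uses MRR and Hits@k, which are the usual choices for knowledge graph completion, and computes them from a list of 1-based ranks of the gold entity. The function name and the example ranks are hypothetical and do not reflect the interface or output of eval_on_dataset.py.

```python
def ranking_metrics(ranks, ks=(1, 3, 10)):
    """Compute MRR and Hits@k from 1-based ranks of the gold entities."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    n = len(ranks)
    # Mean reciprocal rank: average of 1/rank over all queries.
    metrics = {"MRR": sum(1.0 / r for r in ranks) / n}
    for k in ks:
        # Hits@k: fraction of queries whose gold entity ranks within the top k.
        metrics[f"Hits@{k}"] = sum(r <= k for r in ranks) / n
    return metrics


metrics = ranking_metrics([1, 2, 5, 20])  # made-up ranks for four queries
```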
If you use this repository, please cite our paper:
@misc{xie2026gsquantgranularsemanticgenerative,
title={GS-Quant: Granular Semantic and Generative Structural Quantization for Knowledge Graph Completion},
author={Qizhuo Xie and Yunhui Liu and Yu Xing and Qianzi Hou and Xudong Jin and Tao Zheng and Tieke He},
year={2026},
eprint={2604.21649},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2604.21649},
}
This project builds on open-source tooling from PyTorch, Hugging Face Transformers, PEFT, Accelerate, DeepSpeed, and related KG benchmark ecosystems.
This project is released under the Apache-2.0 License. See LICENSE for details.
