knowledge-lora

Continued Pretraining (CPT) + Supervised Fine-Tuning (SFT) LoRA adapter for Ministral-3-14B-Base-2512, trained on German Wikipedia and custom documents (PDFs, Markdown) to extend the model's knowledge cutoff.

Pipeline stages

Stage	Script	Output	Notes
01 download	`scripts/01_download_wiki.py`	`data/raw/`	Resumable, MD5-verified
02 extract	`scripts/02_extract_wiki.py`	`data/processed/wikipedia_*.jsonl`
03 PDF extract	`scripts/03_extract_pdfs.py`	`data/processed/pdfs.jsonl`	Optional
04 Markdown extract	`scripts/04_extract_markdown.py`	`data/processed/markdown.jsonl`	Optional
05 dedup	`scripts/05_clean_deduplicate.py`	`data/processed/corpus.jsonl`	SHA-256 + MinHash LSH
06 tokenize	`scripts/06_tokenize.py`	`data/tokenized/cpt_dataset/`	Packed Arrow, seq_len 8192
07 SFT data	`scripts/07_create_sft_data.py`	`data/processed/sft_data.jsonl`	Template-based, no model needed
08 QA gen	`scripts/08_generate_qa_llm.py`	`data/processed/sft_qa_llm.jsonl`	LLM-based, runs after CPT merge
CPT training	`train_cpt.sh`	`output/cpt/`	LoRA rank 128, BF16
SFT training	`train_sft.sh`	`output/sft/`	LoRA rank 64, from CPT checkpoint

Architecture overview

Wikipedia dump ──┐
PDF files        ├── Data Pipeline ──► deduplicated corpus ──► CPT LoRA ──► SFT LoRA
Markdown files ──┘                       (corpus.jsonl)         (rank 128)   (rank 64)

Phase 1 – CPT: The base model learns new factual knowledge via next-token prediction on the cleaned corpus (LoRA rank 128, full BF16, seq_len 8192).

Phase 2 – SFT: The CPT adapter is used as a starting point for instruction-following fine-tuning on template-generated Q&A / summarisation data (LoRA rank 64, lr 2e-5).

Hardware requirements

Component	Minimum	Recommended
GPU VRAM / unified memory	24 GB (with QLoRA)	128 GB (DGX Spark / GB10)
Storage for German Wikipedia	~25 GB raw	~60 GB with intermediates
Python	3.11+	3.11+

Training was designed for the Dell GB10 / NVIDIA DGX Spark (128 GB unified Blackwell memory) — full BF16 with no quantisation.

Installation

# 1. Clone
git clone https://github.com/cl4wb0rg/knowledge-lora.git
cd knowledge-lora

# 2. Create virtual environment
python -m venv .venv
source .venv/bin/activate

# 3. Install all training dependencies (handles CUDA 13.0 / GB10 quirks)
bash install.sh

# 4. Configure credentials
cp .env.example .env
# Edit .env and add your HuggingFace token

DGX Spark / CUDA 13.0 notes:

install.sh handles the correct installation order for torch, axolotl (GitHub HEAD), flash-attn, and xformers.

xformers cannot be built for CUDA 13.0 (removed driver API symbols) and is skipped. install.sh uses --only-binary xformers so pip never attempts a source compile — previously, the source build spawned many parallel nvcc/cicc processes that exhausted all RAM+swap and froze the system. flash-attn covers the same functionality.

flash-attn is built from source against CUDA 13.0 (no pre-built cu13 wheel exists). The build uses MAX_JOBS=1 and nice -n 10 to stay below 80 % CPU/RAM load. Expect ~20–30 minutes on first install; subsequent runs skip the build if flash-attn is already correctly installed.

GB10 freeze prevention: before each flash-attn build, install.sh clears the filesystem cache and caps GPU power to 80 % of its maximum (via nvidia-smi -pl) to prevent hard crashes from memory exhaustion or power spikes during nvcc compilation. Requires sudo; silently skipped if unavailable. The power limit is restored after the build completes. The flash-attn build subprocess also sets a high OOM score (oom_score_adj=500) so the kernel kills it first if memory runs out, keeping the system responsive.

axolotl is installed from GitHub HEAD; PyPI releases do not support torch 2.10+.

Verify the installation

After install.sh completes, run the smoke test to confirm the full training stack (torch CUDA, flash-attn, LoRA, forward/backward pass) works end-to-end. No large model download or pre-processed data needed — it uses gpt2 (124 M) with synthetic inputs.

python scripts/smoke_test.py
# Expected output:
#   [OK]  torch 2.10.0+cu130  |  NVIDIA GB10  |  120 GB
#   [OK]  flash-attn 2.8.3
#   [OK]  gpt2 loaded on cuda (bf16)
#   [OK]  LoRA applied  |  trainable 147.5K / 124.6M params
#     step 1/5  loss=...
#     ...
#   [OK]  all losses finite  |  first=...  last=...
#   === PASSED ===

vLLM inference environment (optional)

vLLM requires a different torch version than axolotl and must live in its own venv:

bash install_vllm.sh        # creates .venv-vllm/
source .venv-vllm/bin/activate

Use this environment only for running scripts/08_generate_qa_llm.py after CPT.

CUDA 13.0 vLLM binary patching: vLLM 0.16.0 binaries reference libcudart.so.12 but CUDA 13.0 ships libcudart.so.13. Two binary patches are required after install_vllm.sh:
Replace the DT_NEEDED string libcudart.so.12 → libcudart.so.13 in .dynstr (done by install_vllm.sh)
Fix the ELF vna_hash in .gnu.version_r (glibc checks this hash for version matching):
source .venv-vllm/bin/activate
pip install pyelftools
python scripts/patch_vllm_verneed_hash.py
This patches all 6 vLLM .so files in-place. Safe to re-run (idempotent).

Data pipeline

Run the steps in order. All steps are idempotent — re-running skips already completed work where possible.

Step 1 — Download Wikipedia

python scripts/01_download_wiki.py --lang de
# For multiple languages:
python scripts/01_download_wiki.py --lang de --lang en --lang fr

Downloads data/raw/dewiki-latest-pages-articles.xml.bz2 and verifies the MD5 checksum against Wikimedia's published checksums. The download is resumable — if the connection drops, re-running the script continues from where it left off.

Step 2 — Extract Wikipedia text

python scripts/02_extract_wiki.py \
    --dump data/raw/dewiki-latest-pages-articles.xml.bz2 \
    --lang de

Produces data/processed/wikipedia_de.jsonl. Uses all available CPU cores by default.

Step 3 — Extract PDFs (optional)

python scripts/03_extract_pdfs.py --input-dir /path/to/your/pdfs

Produces data/processed/pdfs.jsonl.

Step 4 — Extract Markdown / text files (optional)

python scripts/04_extract_markdown.py --input-dir /path/to/your/docs

Supports .md, .markdown, .txt, .rst. Produces data/processed/markdown.jsonl.

Step 5 — Deduplicate

python scripts/05_clean_deduplicate.py \
    --input-files \
        data/processed/wikipedia_de.jsonl \
        data/processed/pdfs.jsonl \
        data/processed/markdown.jsonl \
    --output-file data/processed/corpus.jsonl

Two-stage deduplication: exact (SHA-256) + near-duplicate (MinHash LSH, Jaccard ≥ 0.8).

Step 6 — Tokenize

HF_TOKEN=hf_... python scripts/06_tokenize.py \
    --input data/processed/corpus.jsonl \
    --model-id mistralai/Ministral-3-14B-Base-2512 \
    --seq-len 8192

Produces a packed Arrow dataset at data/tokenized/cpt_dataset/. Memory usage is bounded by --batch-size regardless of corpus size.

Step 7 — Generate SFT data (template-based)

python scripts/07_create_sft_data.py \
    --input data/processed/corpus.jsonl \
    --output data/processed/sft_data.jsonl \
    --max-docs 200000

Creates template-based summarisation, Q&A, and text-continuation examples. No model calls required — fast and deterministic.

Step 8 — Generate SFT data (LLM-based, optional)

Generates higher-quality Q&A pairs using the CPT-merged model via vLLM. Run this after CPT is complete and the adapter has been merged.

# Activate the vLLM venv (separate from training venv — see Installation)
source .venv-vllm/bin/activate

python scripts/08_generate_qa_llm.py \
    --model output/cpt/merged \
    --input data/processed/corpus.jsonl \
    --output data/processed/sft_qa_llm.jsonl \
    --qa-per-doc 3 \
    --batch-size 64

The CPT model is used (not the base model) so that generated questions reflect the newly learned knowledge. Output is in the same Alpaca format as step 7 and can be combined with or used instead of the template-based data for SFT.

Training

Phase 1: Continued Pretraining (CPT)

source .env          # loads HF_TOKEN, WANDB_* etc.
bash train_cpt.sh

Training checkpoints are saved to output/cpt/. See configs/cpt_config.yaml for all hyperparameters.

Key settings (DGX Spark / 128 GB):

Parameter	Value	Notes
LoRA rank	128	Higher rank → more capacity for new knowledge
Sequence length	8192	Increase to 16384 if needed
Micro batch size	2	Effective batch = 16 seqs/step
Gradient checkpointing	on	Required to fit training pass in unified memory
Learning rate	1e-4	Standard for CPT
Precision	BF16	No quantisation on 128 GB
Epochs	1	One pass is typically sufficient

Phase 2: Supervised Fine-Tuning (SFT)

After CPT completes, update configs/sft_config.yaml to point to the best CPT checkpoint:

# configs/sft_config.yaml — uncomment this line:
lora_weights: ./output/cpt/checkpoint-XXXX

Then run:

bash train_sft.sh

Output: output/sft/.

Merge and export

# Merge LoRA weights into the base model for standalone deployment
accelerate launch -m axolotl.cli.merge_lora configs/sft_config.yaml \
    --lora-model-dir output/sft

Project structure

knowledge-lora/
├── scripts/
│   ├── smoke_test.py             # Quick stack check (torch+CUDA, flash-attn, LoRA)
│   ├── 01_download_wiki.py       # Download Wikipedia dumps
│   ├── 02_extract_wiki.py        # Parse XML → JSONL
│   ├── 03_extract_pdfs.py        # PDF → JSONL
│   ├── 04_extract_markdown.py    # MD/RST/TXT → JSONL
│   ├── 05_clean_deduplicate.py   # SHA-256 + MinHash LSH dedup
│   ├── 06_tokenize.py            # Tokenise + pack → Arrow dataset
│   ├── 07_create_sft_data.py     # Template-based SFT data (no model needed)
│   ├── 08_generate_qa_llm.py     # LLM-based Q&A via vLLM (run after CPT)
│   └── patch_vllm_verneed_hash.py  # Fix vna_hash in vLLM .so files for CUDA 13
├── configs/
│   ├── cpt_config.yaml           # Axolotl CPT config
│   └── sft_config.yaml           # Axolotl SFT config
├── data/                         # Git-ignored; created at runtime
│   ├── raw/
│   ├── processed/
│   └── tokenized/
├── output/                       # Git-ignored; training checkpoints
├── .github/workflows/ci.yml      # Lint + type-check CI
├── .env.example                  # Credential template
├── pyproject.toml                # ruff + mypy config
├── requirements.txt
├── install.sh                    # Staged installer for CUDA 13.0 / GB10
├── install_vllm.sh               # Separate venv installer for vLLM inference
├── train_cpt.sh
├── train_sft.sh
└── LICENSE                       # Apache 2.0

Adding more languages

The pipeline is language-agnostic. To add English Wikipedia:

python scripts/01_download_wiki.py --lang en
python scripts/02_extract_wiki.py --dump data/raw/enwiki-latest-pages-articles.xml.bz2 --lang en

Then include data/processed/wikipedia_en.jsonl in the --input-files list for step 5. Earlier files in the list take priority during deduplication.

Security notes

HF tokens are read from the HF_TOKEN environment variable. Never pass tokens as CLI arguments (they appear in process listings and shell history).
Downloaded dumps are verified against Wikimedia's published MD5 checksums before being used.
File paths stored in JSONL output are always relative to the input directory to avoid leaking host filesystem layout.
Partial downloads are written to a .tmp file and renamed atomically only on success. A crashed download will not leave a corrupt file at the canonical path.

Uploading to HuggingFace Hub

After training, push adapters and datasets with:

# Push everything (private by default)
bash push_to_hub.sh

# Individual components
bash push_to_hub.sh --cpt          # CPT LoRA adapter only
bash push_to_hub.sh --sft          # SFT LoRA adapter only
bash push_to_hub.sh --data         # datasets only

# Under an org instead of your personal namespace
bash push_to_hub.sh --prefix my-org

# Make repos public
bash push_to_hub.sh --public

Requires HF_TOKEN in .env with write permission. Repos are private by default — pass --public to change.

Artifact	Hub repo	Notes
CPT LoRA adapter	`{user}/ministral-14b-de-cpt-lora`	`output/cpt/checkpoint-2400/`; 2.1 GB adapter weights
SFT LoRA adapter	`{user}/ministral-14b-de-sft-lora`	`output/sft/`; best checkpoint by eval loss
Datasets	`{user}/ministral-14b-de-dataset`	splits: `template` (step 07) + `llm_qa` (step 08)

Optimizer states (optimizer.pt, 3.7 GB) are excluded automatically — inference only.

Contributing

See CONTRIBUTING.md.

License

Apache License 2.0 — see LICENSE.

The base model (Ministral-3-14B-Base-2512) is also licensed under Apache 2.0. Wikipedia content is licensed under CC BY-SA 4.0. Ensure your custom PDFs and Markdown files are licensed for training use.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

knowledge-lora

Pipeline stages

Architecture overview

Hardware requirements

Installation

Verify the installation

vLLM inference environment (optional)

Data pipeline

Step 1 — Download Wikipedia

Step 2 — Extract Wikipedia text

Step 3 — Extract PDFs (optional)

Step 4 — Extract Markdown / text files (optional)

Step 5 — Deduplicate

Step 6 — Tokenize

Step 7 — Generate SFT data (template-based)

Step 8 — Generate SFT data (LLM-based, optional)

Training

Phase 1: Continued Pretraining (CPT)

Phase 2: Supervised Fine-Tuning (SFT)

Merge and export

Project structure

Adding more languages

Security notes

Uploading to HuggingFace Hub

Contributing

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 51 Commits
.github/workflows		.github/workflows
configs		configs
data		data
output		output
scripts		scripts
.env.example		.env.example
.gitattributes		.gitattributes
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
install.sh		install.sh
install_vllm.sh		install_vllm.sh
push_to_hub.sh		push_to_hub.sh
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
run_pipeline.sh		run_pipeline.sh
train_cpt.sh		train_cpt.sh
train_sft.sh		train_sft.sh

Folders and files

Latest commit

History

Repository files navigation

knowledge-lora

Pipeline stages

Architecture overview

Hardware requirements

Installation

Verify the installation

vLLM inference environment (optional)

Data pipeline

Step 1 — Download Wikipedia

Step 2 — Extract Wikipedia text

Step 3 — Extract PDFs (optional)

Step 4 — Extract Markdown / text files (optional)

Step 5 — Deduplicate

Step 6 — Tokenize

Step 7 — Generate SFT data (template-based)

Step 8 — Generate SFT data (LLM-based, optional)

Training

Phase 1: Continued Pretraining (CPT)

Phase 2: Supervised Fine-Tuning (SFT)

Merge and export

Project structure

Adding more languages

Security notes

Uploading to HuggingFace Hub

Contributing

License

About

Topics

Resources

License

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages