Continued Pretraining (CPT) + Supervised Fine-Tuning (SFT) LoRA adapter for Ministral-3-14B-Base-2512, trained on German Wikipedia and custom documents (PDFs, Markdown) to extend the model's knowledge cutoff.
| Stage | Script | Output | Notes |
|---|---|---|---|
| 01 download | scripts/01_download_wiki.py | data/raw/ | Resumable, MD5-verified |
| 02 extract | scripts/02_extract_wiki.py | data/processed/wikipedia_*.jsonl | |
| 03 PDF extract | scripts/03_extract_pdfs.py | data/processed/pdfs.jsonl | Optional |
| 04 Markdown extract | scripts/04_extract_markdown.py | data/processed/markdown.jsonl | Optional |
| 05 dedup | scripts/05_clean_deduplicate.py | data/processed/corpus.jsonl | SHA-256 + MinHash LSH |
| 06 tokenize | scripts/06_tokenize.py | data/tokenized/cpt_dataset/ | Packed Arrow, seq_len 8192 |
| 07 SFT data | scripts/07_create_sft_data.py | data/processed/sft_data.jsonl | Template-based, no model needed |
| 08 QA gen | scripts/08_generate_qa_llm.py | data/processed/sft_qa_llm.jsonl | LLM-based, runs after CPT merge |
| CPT training | train_cpt.sh | output/cpt/ | LoRA rank 128, BF16 |
| SFT training | train_sft.sh | output/sft/ | LoRA rank 64, from CPT checkpoint |
Wikipedia dump ──┐
PDF files        ├── Data Pipeline ──► deduplicated corpus ──► CPT LoRA ──► SFT LoRA
Markdown files ──┘                     (corpus.jsonl)          (rank 128)   (rank 64)
Phase 1 – CPT: The base model learns new factual knowledge via next-token prediction on the cleaned corpus (LoRA rank 128, full BF16, seq_len 8192).
Phase 2 – SFT: The CPT adapter is used as a starting point for instruction-following fine-tuning on template-generated Q&A / summarisation data (LoRA rank 64, lr 2e-5).
| Component | Minimum | Recommended |
|---|---|---|
| GPU VRAM / unified memory | 24 GB (with QLoRA) | 128 GB (DGX Spark / GB10) |
| Storage for German Wikipedia | ~25 GB raw | ~60 GB with intermediates |
| Python | 3.11+ | 3.11+ |
Training was designed for the Dell GB10 / NVIDIA DGX Spark (128 GB unified Blackwell memory) — full BF16 with no quantisation.
# 1. Clone
git clone https://github.com/cl4wb0rg/knowledge-lora.git
cd knowledge-lora
# 2. Create virtual environment
python -m venv .venv
source .venv/bin/activate
# 3. Install all training dependencies (handles CUDA 13.0 / GB10 quirks)
bash install.sh
# 4. Configure credentials
cp .env.example .env
# Edit .env and add your HuggingFace token
DGX Spark / CUDA 13.0 notes:
- install.sh handles the correct installation order for torch, axolotl (GitHub HEAD), flash-attn, and xformers.
- xformers cannot be built for CUDA 13.0 (removed driver API symbols) and is skipped. install.sh uses --only-binary xformers so pip never attempts a source compile; previously, the source build spawned many parallel nvcc/cicc processes that exhausted all RAM+swap and froze the system. flash-attn covers the same functionality.
- flash-attn is built from source against CUDA 13.0 (no pre-built cu13 wheel exists). The build uses MAX_JOBS=1 and nice -n 10 to stay below 80 % CPU/RAM load. Expect ~20–30 minutes on first install; subsequent runs skip the build if flash-attn is already correctly installed.
- GB10 freeze prevention: before each flash-attn build, install.sh clears the filesystem cache and caps GPU power to 80 % of its maximum (via nvidia-smi -pl) to prevent hard crashes from memory exhaustion or power spikes during nvcc compilation. Requires sudo; silently skipped if unavailable. The power limit is restored after the build completes. The flash-attn build subprocess also sets a high OOM score (oom_score_adj=500) so the kernel kills it first if memory runs out, keeping the system responsive.
- axolotl is installed from GitHub HEAD; PyPI releases do not support torch 2.10+.
After install.sh completes, run the smoke test to confirm the full training
stack (torch CUDA, flash-attn, LoRA, forward/backward pass) works end-to-end.
No large model download or pre-processed data needed — it uses gpt2 (124 M)
with synthetic inputs.
python scripts/smoke_test.py
# Expected output:
# [OK] torch 2.10.0+cu130 | NVIDIA GB10 | 120 GB
# [OK] flash-attn 2.8.3
# [OK] gpt2 loaded on cuda (bf16)
# [OK] LoRA applied | trainable 147.5K / 124.6M params
# step 1/5 loss=...
# ...
# [OK] all losses finite | first=... last=...
# === PASSED ===
vLLM requires a different torch version than axolotl and must live in its own venv:
bash install_vllm.sh # creates .venv-vllm/
source .venv-vllm/bin/activate
Use this environment only for running scripts/08_generate_qa_llm.py after CPT.
CUDA 13.0 vLLM binary patching: vLLM 0.16.0 binaries reference libcudart.so.12 but CUDA 13.0 ships libcudart.so.13. Two binary patches are required after install_vllm.sh:
- Replace the DT_NEEDED string libcudart.so.12 → libcudart.so.13 in .dynstr (done by install_vllm.sh)
- Fix the ELF vna_hash in .gnu.version_r (glibc checks this hash for version matching):
source .venv-vllm/bin/activate
pip install pyelftools
python scripts/patch_vllm_verneed_hash.py
This patches all 6 vLLM .so files in-place. Safe to re-run (idempotent).
Run the steps in order. All steps are idempotent — re-running skips already completed work where possible.
python scripts/01_download_wiki.py --lang de
# For multiple languages:
python scripts/01_download_wiki.py --lang de --lang en --lang fr
Downloads data/raw/dewiki-latest-pages-articles.xml.bz2 and verifies the MD5
checksum against Wikimedia's published checksums. The download is resumable:
if the connection drops, re-running the script continues from where it left off.
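The checksum verification described above can be sketched as follows. This is a minimal illustration and not the code in scripts/01_download_wiki.py; the function names md5_of_file and verify_md5 are hypothetical, and it assumes the expected digest has already been obtained as a hex string.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks.

    Chunked reading keeps memory use constant even for multi-GB dumps.
    """
    digest = hashlib.md5()
    with path.open("rb") as fh:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path: Path, expected_hex: str) -> bool:
    """Compare the file's digest with an expected hex string (hypothetical helper)."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

A caller would typically refuse to proceed to extraction when verify_md5 returns False, so that a corrupted or truncated dump never reaches the later stages.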
python scripts/02_extract_wiki.py \
--dump data/raw/dewiki-latest-pages-articles.xml.bz2 \
--lang de
Produces data/processed/wikipedia_de.jsonl.
Uses all available CPU cores by default.
python scripts/03_extract_pdfs.py --input-dir /path/to/your/pdfs
Produces data/processed/pdfs.jsonl.
python scripts/04_extract_markdown.py --input-dir /path/to/your/docs
Supports .md, .markdown, .txt, .rst.
Produces data/processed/markdown.jsonl.
python scripts/05_clean_deduplicate.py \
--input-files \
data/processed/wikipedia_de.jsonl \
data/processed/pdfs.jsonl \
data/processed/markdown.jsonl \
--output-file data/processed/corpus.jsonl
Two-stage deduplication: exact (SHA-256) + near-duplicate (MinHash LSH, Jaccard ≥ 0.8).
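To make the two stages concrete, here is a small standard-library sketch. It is not the implementation in scripts/05_clean_deduplicate.py: instead of MinHash LSH it computes the exact Jaccard similarity on word shingles by brute force, which is what MinHash approximates and which is only practical for small inputs. The shingle size of 3 and all function names are assumptions made for illustration.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word shingles of a text (k=3 is an assumed value)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(texts: list[str], threshold: float = 0.8) -> list[str]:
    """Keep the first occurrence of each document, dropping later duplicates."""
    seen_hashes: set[str] = set()
    kept: list[str] = []
    kept_shingles: list[set] = []
    for text in texts:
        # Stage 1: exact duplicates via SHA-256 of the stripped text.
        digest = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        # Stage 2: near duplicates via exact Jaccard (stand-in for MinHash LSH).
        sh = shingles(text)
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            continue
        kept.append(text)
        kept_shingles.append(sh)
    return kept
```

Because documents are processed in order and only the first match is kept, input that appears earlier wins over later near-duplicates. The brute-force comparison is quadratic in the number of documents, which is the cost that LSH is designed to avoid on a Wikipedia-sized corpus.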
HF_TOKEN=hf_... python scripts/06_tokenize.py \
--input data/processed/corpus.jsonl \
--model-id mistralai/Ministral-3-14B-Base-2512 \
--seq-len 8192
Produces a packed Arrow dataset at data/tokenized/cpt_dataset/.
Memory usage is bounded by --batch-size regardless of corpus size.
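Packing here means concatenating tokenised documents and cutting the stream into fixed-length sequences. The sketch below shows one common way to do that with plain lists of token ids. Whether scripts/06_tokenize.py inserts an end-of-sequence token between documents or drops the trailing remainder is not stated in this README, so both behaviours are assumptions, as is the name pack_sequences.

```python
from collections.abc import Iterable, Iterator


def pack_sequences(
    docs: Iterable[list[int]], seq_len: int, eos_id: int
) -> Iterator[list[int]]:
    """Yield sequences of exactly seq_len token ids from a stream of documents.

    Assumptions: each document is followed by eos_id, and a final partial
    sequence shorter than seq_len is dropped.
    """
    buffer: list[int] = []
    for token_ids in docs:
        buffer.extend(token_ids)
        buffer.append(eos_id)
        # Emit as many full sequences as the buffer currently holds.
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
```

Because the function is a generator and only holds one partial buffer at a time, its memory use depends on the sequence length and the largest single document, not on the size of the corpus.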
python scripts/07_create_sft_data.py \
--input data/processed/corpus.jsonl \
--output data/processed/sft_data.jsonl \
--max-docs 200000
Creates template-based summarisation, Q&A, and text-continuation examples. No model calls required, so the step is fast and deterministic.
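As an illustration of what template-based generation can look like, the sketch below turns one document into Alpaca-style records (instruction, input, output). The templates, the use of the first paragraph as the answer, and the title and text field names are all hypothetical; the actual templates and JSONL schema used by scripts/07_create_sft_data.py may differ.

```python
import json


def make_examples(doc: dict) -> list[dict]:
    """Build Alpaca-style records from one document.

    Assumes the document has "title" and "text" keys (hypothetical schema)
    and that paragraphs are separated by blank lines.
    """
    title = doc["title"]
    paragraphs = [p.strip() for p in doc["text"].split("\n\n") if p.strip()]
    if not paragraphs:
        return []
    lead = paragraphs[0]
    examples = [
        # Q&A: the lead paragraph serves as the answer.
        {"instruction": f"What is {title}?", "input": "", "output": lead},
        # Summarisation: full text in, lead paragraph out.
        {
            "instruction": f"Summarise the following text about {title}.",
            "input": doc["text"],
            "output": lead,
        },
    ]
    if len(paragraphs) > 1:
        # Text continuation: first paragraph in, second paragraph out.
        examples.append(
            {
                "instruction": "Continue the following text.",
                "input": lead,
                "output": paragraphs[1],
            }
        )
    return examples


def to_jsonl(examples: list[dict]) -> str:
    """Serialise records as JSON Lines, keeping non-ASCII characters readable."""
    return "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples)
```

Since no randomness or model call is involved, running this on the same document always yields the same records, which matches the deterministic behaviour described above.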
Generates higher-quality Q&A pairs using the CPT-merged model via vLLM. Run this after CPT is complete and the adapter has been merged.
# Activate the vLLM venv (separate from training venv — see Installation)
source .venv-vllm/bin/activate
python scripts/08_generate_qa_llm.py \
--model output/cpt/merged \
--input data/processed/corpus.jsonl \
--output data/processed/sft_qa_llm.jsonl \
--qa-per-doc 3 \
--batch-size 64
The CPT model is used (not the base model) so that generated questions reflect the newly learned knowledge. Output is in the same Alpaca format as step 7 and can be combined with or used instead of the template-based data for SFT.
source .env # loads HF_TOKEN, WANDB_* etc.
bash train_cpt.sh
Training checkpoints are saved to output/cpt/.
See configs/cpt_config.yaml for all hyperparameters.
Key settings (DGX Spark / 128 GB):
| Parameter | Value | Notes |
|---|---|---|
| LoRA rank | 128 | Higher rank → more capacity for new knowledge |
| Sequence length | 8192 | Increase to 16384 if needed |
| Micro batch size | 2 | Effective batch = 16 seqs/step |
| Gradient checkpointing | on | Required to fit training pass in unified memory |
| Learning rate | 1e-4 | Standard for CPT |
| Precision | BF16 | No quantisation on 128 GB |
| Epochs | 1 | One pass is typically sufficient |
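The relationship between the micro batch size and the effective batch in the table can be checked with a little arithmetic. The sketch below assumes a single device, in which case a micro batch of 2 and an effective batch of 16 imply 8 gradient accumulation steps; the accumulation value itself is inferred here and not quoted from configs/cpt_config.yaml.

```python
def grad_accum_steps(effective_batch: int, micro_batch: int, num_devices: int = 1) -> int:
    """Gradient accumulation steps needed to reach the effective batch size."""
    per_step = micro_batch * num_devices
    if effective_batch % per_step:
        raise ValueError("effective batch must be a multiple of micro_batch * num_devices")
    return effective_batch // per_step


def tokens_per_step(effective_batch: int, seq_len: int) -> int:
    """Tokens consumed per optimizer step when every sequence is fully packed."""
    return effective_batch * seq_len


# Values from the table above; num_devices=1 is an assumption.
steps = grad_accum_steps(effective_batch=16, micro_batch=2)
tokens = tokens_per_step(effective_batch=16, seq_len=8192)
```

With these values, each optimizer step sees 16 × 8192 = 131,072 tokens, so doubling the sequence length to 16384 at the same effective batch would double the tokens per step.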
After CPT completes, update configs/sft_config.yaml to point to the best
CPT checkpoint:
# configs/sft_config.yaml — uncomment this line:
lora_weights: ./output/cpt/checkpoint-XXXX
Then run:
bash train_sft.sh
Output: output/sft/.
# Merge LoRA weights into the base model for standalone deployment
accelerate launch -m axolotl.cli.merge_lora configs/sft_config.yaml \
--lora-model-dir output/sft
knowledge-lora/
├── scripts/
│ ├── smoke_test.py # Quick stack check (torch+CUDA, flash-attn, LoRA)
│ ├── 01_download_wiki.py # Download Wikipedia dumps
│ ├── 02_extract_wiki.py # Parse XML → JSONL
│ ├── 03_extract_pdfs.py # PDF → JSONL
│ ├── 04_extract_markdown.py # MD/RST/TXT → JSONL
│ ├── 05_clean_deduplicate.py # SHA-256 + MinHash LSH dedup
│ ├── 06_tokenize.py # Tokenise + pack → Arrow dataset
│ ├── 07_create_sft_data.py # Template-based SFT data (no model needed)
│ ├── 08_generate_qa_llm.py # LLM-based Q&A via vLLM (run after CPT)
│ └── patch_vllm_verneed_hash.py # Fix vna_hash in vLLM .so files for CUDA 13
├── configs/
│ ├── cpt_config.yaml # Axolotl CPT config
│ └── sft_config.yaml # Axolotl SFT config
├── data/ # Git-ignored; created at runtime
│ ├── raw/
│ ├── processed/
│ └── tokenized/
├── output/ # Git-ignored; training checkpoints
├── .github/workflows/ci.yml # Lint + type-check CI
├── .env.example # Credential template
├── pyproject.toml # ruff + mypy config
├── requirements.txt
├── install.sh # Staged installer for CUDA 13.0 / GB10
├── install_vllm.sh # Separate venv installer for vLLM inference
├── train_cpt.sh
├── train_sft.sh
└── LICENSE # Apache 2.0
The pipeline is language-agnostic. To add English Wikipedia:
python scripts/01_download_wiki.py --lang en
python scripts/02_extract_wiki.py --dump data/raw/enwiki-latest-pages-articles.xml.bz2 --lang en
Then include data/processed/wikipedia_en.jsonl in the --input-files list
for step 5. Earlier files in the list take priority during deduplication.
- HF tokens are read from the HF_TOKEN environment variable. Never pass tokens as CLI arguments (they appear in process listings and shell history).
- Downloaded dumps are verified against Wikimedia's published MD5 checksums before being used.
- File paths stored in JSONL output are always relative to the input directory to avoid leaking host filesystem layout.
- Partial downloads are written to a .tmp file and renamed atomically only on success. A crashed download will not leave a corrupt file at the canonical path.
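The write-then-rename pattern from the last point can be sketched as follows. This is an illustration under stated assumptions, not the repository's download code: atomic_write is a hypothetical helper, the data arrives as an iterable of byte chunks, and the .tmp file is left in place after a failure so that a later run could resume from it.

```python
import os
from collections.abc import Iterable
from pathlib import Path


def atomic_write(dest: Path, chunks: Iterable[bytes]) -> None:
    """Write chunks to dest via a sibling .tmp file and an atomic rename.

    If writing fails part-way, dest is never created or modified; only the
    .tmp file contains partial data.
    """
    tmp = dest.with_name(dest.name + ".tmp")
    with tmp.open("wb") as fh:
        for chunk in chunks:
            fh.write(chunk)
        # Flush Python and OS buffers so the data is on disk before the rename.
        fh.flush()
        os.fsync(fh.fileno())
    # os.replace is atomic when source and destination are on the same filesystem.
    os.replace(tmp, dest)
```

Keeping the temporary file in the same directory as the destination matters here, because a rename across filesystems is not atomic.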
After training, push adapters and datasets with:
# Push everything (private by default)
bash push_to_hub.sh
# Individual components
bash push_to_hub.sh --cpt # CPT LoRA adapter only
bash push_to_hub.sh --sft # SFT LoRA adapter only
bash push_to_hub.sh --data # datasets only
# Under an org instead of your personal namespace
bash push_to_hub.sh --prefix my-org
# Make repos public
bash push_to_hub.sh --public
Requires HF_TOKEN in .env with write permission.
Repos are private by default — pass --public to change.
| Artifact | Hub repo | Notes |
|---|---|---|
| CPT LoRA adapter | {user}/ministral-14b-de-cpt-lora | output/cpt/checkpoint-2400/; 2.1 GB adapter weights |
| SFT LoRA adapter | {user}/ministral-14b-de-sft-lora | output/sft/; best checkpoint by eval loss |
| Datasets | {user}/ministral-14b-de-dataset | splits: template (step 07) + llm_qa (step 08) |
Optimizer states (optimizer.pt, 3.7 GB) are excluded automatically, since the published adapters are intended for inference only.
See CONTRIBUTING.md.
Apache License 2.0 — see LICENSE.
The base model (Ministral-3-14B-Base-2512) is also licensed under Apache 2.0. Wikipedia content is licensed under CC BY-SA 4.0. Ensure your custom PDFs and Markdown files are licensed for training use.