Run large language models that don't fit in RAM. No GPU required.
WayInfer is a native GGUF inference engine that streams model weights from SSD on demand using memory-mapped I/O. An 80GB model loads in under 1 second and runs on a machine with 48GB of RAM — or less.
```
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│  SSD/NVMe   │─────>│     RAM     │─────>│   Compute   │
│ (model.gguf)│ mmap │ (OS paging) │      │ (AVX2 SIMD) │
│    80GB+    │      │  on-demand  │      │  8 threads  │
└─────────────┘      └─────────────┘      └─────────────┘
```
Traditional inference engines load the entire model into memory before running. WayInfer uses mmap to let the OS page model weights from SSD into RAM as needed. Only the active layers occupy physical memory at any time.
WayInfer's design is derived from the tiered memory manager in WayOS, an AI-first operating system that treats storage as a unified memory hierarchy (SSD ↔ RAM ↔ VRAM). The core insight: for Mixture of Experts (MoE) models, only 2 of 8+ experts are active per token — the rest can stay on disk until needed.
Key components:
| Component | File | Purpose |
|---|---|---|
| GGUF Parser | src/gguf.c | Parses GGUF headers in <1s, supports N-file splits |
| Tensor Engine | src/tensor_engine.c | Quantized dot products (AVX2 SIMD, 8-thread parallel) |
| Inference Engine | src/gguf_chat.c | Full transformer forward pass: attention, MoE routing, FFN |
| Tier Manager | src/memory/tier_manager.c | SSD streaming memory manager (WayOS architecture) |
| Platform | src/platform/ | Cross-platform mmap, threading (Windows + Linux) |
Mixtral 8x22B Instruct — 141B parameters, 80GB quantized, split across 2 GGUF files:
```
Model:  Mixtral-8x22B-Instruct-v0.1.Q4_K_M (80 GB)
RAM:    48 GB (model is 1.7x available memory)
Load:   0.3 seconds
Prompt: "What is 2+2?"
Output: "The sum of 2 and 2 is 4."
Speed:  ~0.08 tok/s (scalar + threading, no GPU)
```
The engine produces correct, coherent English from an 80GB MoE model on a machine that cannot hold the model in memory.
WayInfer works with GGUF files using K-quant quantization — the most common format on HuggingFace.
| Quant Type | Bits/Weight | Status | Notes |
|---|---|---|---|
| Q4_K_M | 4.5 | Supported | Most common, recommended |
| Q5_K_M | 5.5 | Supported | Higher quality |
| Q6_K | 6.5 | Supported | Near-lossless |
| Q8_0 | 8.0 | Supported | Used for K/V projections |
| F32 | 32 | Supported | Norm weights, metadata |
| F16 | 16 | Supported | Router weights |
| MXFP4, IQ* | varies | Not supported | Niche formats |
Split GGUF files are fully supported — models split across any number of files (2, 4, 8, etc.) are loaded and merged automatically.
Tested architectures:
- Mixtral / Mistral (MoE, GQA)
- Llama 3.x (dense, GQA)
Requirements: Visual Studio 2022 Build Tools, Windows 10/11 SDK, CPU with AVX2 support.
```
build.cmd
```

Output: `build\wayinfer.exe`
```
python validate.py --model path\to\model.gguf --prompt "Your question here" --max-tokens 30
```

Requires `pip install llama-cpp-python` for tokenization only (loads vocab in 0.2s, does NOT load model weights).
```
build\wayinfer.exe --model path\to\model.gguf --greedy --max-tokens 20
```

Flags:

- `--model <path>` — GGUF model file (first split if multi-file)
- `--ids-file <path>` — Pre-tokenized input (binary format)
- `--greedy` — Deterministic output (argmax sampling)
- `--temp <T>` — Sampling temperature (default 0.7)
- `--max-tokens <N>` — Maximum tokens to generate
- `--debug` — Enable diagnostic output
WayInfer does not depend on ggml, llama.cpp, or any external compute library. It implements its own quantized dot product kernels that match the numerical behavior of ggml's scalar path.
This matters because GGUF quantization is calibrated for a specific dot product computation order. Using a different method (e.g., dequant-to-float32 then dot product) produces numerically different results that compound across layers and destroy output quality. WayInfer's tensor engine replicates the exact computation:
- Input quantization — float32 input is quantized to Q8_K (256-element blocks with per-group sums)
- Block-level integer accumulation — weight and input quants are multiplied in int8/int16, accumulated in int32 across 8 parallel lanes
- Scale application — float conversion happens once per super-block, not per element
- AVX2 SIMD — 32-byte vector operations for the inner dot products
- 8-thread parallelism — output rows split across CPU cores
- Speed: ~0.08 tok/s on Mixtral 80GB with CPU-only scalar+AVX2. This is limited by SSD bandwidth and single-core throughput. AVX-512 VNNI and GPU offload would improve this significantly.
- Tokenization: Relies on
llama-cpp-pythonfor correct BPE tokenization. The built-in greedy tokenizer is inaccurate for production use. - Chat interface: No interactive chat loop yet. Use
validate.pyfor prompt-response testing. - Model support: Only K-quant GGUF formats (Q4_K, Q5_K, Q6_K, Q8_0). Models using MXFP4, IQ-quants, or GPTQ are not supported.
- Platform: Windows only (Linux mmap/threading stubs exist but are untested).
- AVX-512 / VNNI tensor engine kernels (~10x speedup)
- GPU offload for attention and FFN (CUDA/Vulkan)
- Interactive chat with streaming output
- Built-in BPE tokenizer (remove llama-cpp-python dependency)
- Linux build and testing
- SSD-aware expert prefetch (predict next experts, pre-page from SSD)
- KV cache compression for longer context
```
src/
├── gguf_chat.c            # Inference engine (forward pass, attention, MoE)
├── gguf.c / gguf.h        # GGUF file parser (instant load, N-file splits)
├── tensor_engine.c / .h   # Quantized compute kernels (AVX2, threaded)
├── memory/
│   ├── tier_manager.c     # WayOS-derived tiered memory manager
│   ├── expert_cache.c     # MoE expert caching
│   ├── prefetch.c         # Predictive expert prefetch
│   └── coherency.c        # Memory coherency
├── platform/
│   ├── io_win.c           # Windows mmap (CreateFileMapping)
│   ├── io_linux.c         # Linux mmap (mmap/madvise)
│   ├── threadpool_win.c   # Windows threading
│   └── threadpool_posix.c # POSIX threading
├── fmoe_main.c            # Reference: llama.dll wrapper
├── model_loader.c         # Model loading utilities
├── router.c               # MoE expert router
├── pipeline.c             # Inference pipeline
└── backend/               # GPU backend stubs (future)
validate.py                # End-to-end validation tool
tokenizer.py               # Fast GGUF tokenizer (reads vocab in 0.2s)
build.cmd                  # Windows build script
```
MIT