WayInfer

Run large language models that don't fit in RAM. No GPU required.

WayInfer is a native GGUF inference engine that streams model weights from SSD on demand using memory-mapped I/O. An 80GB model loads in under 1 second and runs on a machine with 48GB of RAM — or less.

How It Works

┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│   SSD/NVMe   │─────>│     RAM      │─────>│   Compute    │
│ (model.gguf) │ mmap │ (OS paging)  │      │ (AVX2 SIMD)  │
│    80GB+     │      │  on-demand   │      │  8 threads   │
└──────────────┘      └──────────────┘      └──────────────┘

Traditional inference engines load the entire model into memory before running. WayInfer uses mmap to let the OS page model weights from SSD into RAM as needed. Only the active layers occupy physical memory at any time.
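
Concretely, this is ordinary file mapping. Below is a minimal sketch of the Windows path (the tested platform; src/platform/io_win.c wraps CreateFileMapping). The function name and error handling are illustrative rather than WayInfer's actual interface, and a 64-bit process is assumed so an 80GB file fits in the address space.

#include <windows.h>
#include <stddef.h>

/* Map a GGUF file read-only into the address space. No physical RAM is
 * committed up front; pages are faulted in from SSD the first time the
 * inference loop touches them, and the OS evicts cold pages under pressure. */
static void *map_gguf_readonly(const char *path, size_t *size_out) {
    HANDLE file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE) return NULL;

    LARGE_INTEGER size;
    if (!GetFileSizeEx(file, &size)) { CloseHandle(file); return NULL; }

    HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
    if (!mapping) { CloseHandle(file); return NULL; }

    /* Map the entire file (even 80 GB) into virtual address space. */
    void *base = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);

    CloseHandle(mapping);   /* the view keeps the mapping object alive */
    CloseHandle(file);
    if (base) *size_out = (size_t)size.QuadPart;
    return base;
}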

Architecture

WayInfer's design is derived from the tiered memory manager in WayOS, an AI-first operating system that treats storage as a unified memory hierarchy (SSD ↔ RAM ↔ VRAM). The core insight: for Mixture of Experts (MoE) models, only 2 of 8+ experts are active per token — the rest can stay on disk until needed.
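
A rough sketch of that routing loop, with invented names (the real logic lives in src/router.c and src/gguf_chat.c). The point is that unselected experts are simply never dereferenced, so their pages are never faulted in from SSD.

#include <stddef.h>

typedef struct {
    const void *w_gate, *w_up, *w_down;    /* pointers into the mmap'd file */
} expert_t;

/* Placeholder body standing in for the real quantized FFN kernels. */
static void expert_forward(const float *x, const expert_t *e, float *out,
                           float score, size_t dim) {
    (void)e;                               /* the real kernel reads e->w_* */
    for (size_t i = 0; i < dim; i++) out[i] += score * x[i];
}

static void moe_layer(const float *x, float *out, const expert_t *experts,
                      const int *top_idx, const float *top_score,
                      int n_active, size_t dim) {
    for (int k = 0; k < n_active; k++) {   /* e.g. top-2 of 8 experts */
        /* Reading this expert's weights for the first time triggers the
         * page-in from SSD; the unselected experts stay on disk. */
        expert_forward(x, &experts[top_idx[k]], out, top_score[k], dim);
    }
}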

Key components:

Component         File                        Purpose
GGUF Parser       src/gguf.c                  Parses GGUF headers in <1s, supports N-file splits
Tensor Engine     src/tensor_engine.c         Quantized dot products (AVX2 SIMD, 8-thread parallel)
Inference Engine  src/gguf_chat.c             Full transformer forward pass: attention, MoE routing, FFN
Tier Manager      src/memory/tier_manager.c   SSD streaming memory manager (WayOS architecture)
Platform          src/platform/               Cross-platform mmap, threading (Windows + Linux)

Validated Results

Mixtral 8x22B Instruct — 141B parameters, 80GB quantized, split across 2 GGUF files:

Model:  Mixtral-8x22B-Instruct-v0.1.Q4_K_M (80 GB)
RAM:    48 GB (model is 1.7x available memory)
Load:   0.3 seconds
Prompt: "What is 2+2?"
Output: "The sum of 2 and 2 is 4."
Speed:  ~0.08 tok/s (scalar+threading, no GPU)

The engine produces correct, coherent English from an 80GB MoE model on a machine that cannot hold the model in memory.

Supported Model Formats

WayInfer works with GGUF files using K-quant quantization — the most common format on HuggingFace.

Quant Type   Bits/Weight   Status          Notes
Q4_K_M       4.5           Supported       Most common, recommended
Q5_K_M       5.5           Supported       Higher quality
Q6_K         6.5           Supported       Near-lossless
Q8_0         8.0           Supported       Used for K/V projections
F32          32            Supported       Norm weights, metadata
F16          16            Supported       Router weights
MXFP4, IQ*   varies        Not supported   Niche formats

Split GGUF files are fully supported — models split across any number of files (2, 4, 8, etc.) are loaded and merged automatically.
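
For illustration, the remaining splits can be derived from the first file's name, assuming the common "-00001-of-00002" style suffix produced by gguf-split. This sketch only shows the pattern and is not WayInfer's exact loader code.

#include <stdio.h>
#include <string.h>

/* Build the path of split i (1-based) from the first split's path, assuming
 * the "<name>-00001-of-000NN.gguf" naming convention. Returns 0 on success. */
static int split_path(const char *first, int i, int n_splits,
                      char *out, size_t out_len) {
    const char *tag = strstr(first, "-00001-of-");
    if (!tag) return -1;                   /* not a recognised split name */
    int prefix_len = (int)(tag - first);
    return snprintf(out, out_len, "%.*s-%05d-of-%05d.gguf",
                    prefix_len, first, i, n_splits) < (int)out_len ? 0 : -1;
}

/* Example:
 *   split_path("Mixtral-8x22B.Q4_K_M-00001-of-00002.gguf", 2, 2, buf, sizeof buf)
 *   yields "Mixtral-8x22B.Q4_K_M-00002-of-00002.gguf"
 */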

Tested architectures:

  • Mixtral / Mistral (MoE, GQA)
  • Llama 3.x (dense, GQA)

Build

Requirements: Visual Studio 2022 Build Tools, Windows 10/11 SDK, CPU with AVX2 support.

build.cmd

Output: build\wayinfer.exe

Usage

Quick Test

python validate.py --model path\to\model.gguf --prompt "Your question here" --max-tokens 30

Requires pip install llama-cpp-python for tokenization only (loads vocab in 0.2s, does NOT load model weights).

Direct Engine

build\wayinfer.exe --model path\to\model.gguf --greedy --max-tokens 20

Flags:

  • --model <path> — GGUF model file (first split if multi-file)
  • --ids-file <path> — Pre-tokenized input (binary format)
  • --greedy — Deterministic output (argmax sampling)
  • --temp <T> — Sampling temperature (default 0.7)
  • --max-tokens <N> — Maximum tokens to generate
  • --debug — Enable diagnostic output

Custom Tensor Engine

WayInfer does not depend on ggml, llama.cpp, or any external compute library. It implements its own quantized dot product kernels that match the numerical behavior of ggml's scalar path.

This matters because GGUF quantization is calibrated for a specific dot product computation order. Using a different method (e.g., dequant-to-float32 then dot product) produces numerically different results that compound across layers and destroy output quality. WayInfer's tensor engine replicates the exact computation:

  1. Input quantization — float32 input is quantized to Q8_K (256-element blocks with per-group sums)
  2. Block-level integer accumulation — weight and input quants are multiplied in int8/int16, accumulated in int32 across 8 parallel lanes
  3. Scale application — float conversion happens once per super-block, not per element
  4. AVX2 SIMD — 32-byte vector operations for the inner dot products
  5. 8-thread parallelism — output rows split across CPU cores
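
As a rough illustration of steps 2 and 3, here is a scalar sketch of the block-wise pattern using a simplified block layout (32 int8 values sharing one float scale). The real kernels follow GGUF's K-quant super-block format (256-wide blocks with packed sub-scales) and add the AVX2 and threading from steps 4 and 5; this is a teaching sketch, not the production code.

#include <stdint.h>
#include <stddef.h>

/* Simplified stand-in for a quantized block. GGUF's real Q8_0 uses an fp16
 * scale and K-quants use 256-wide super-blocks; this only shows the pattern:
 * integer multiply-accumulate inside the block, one float multiply per block. */
#define QBLOCK 32

typedef struct {
    float  d;              /* per-block scale */
    int8_t q[QBLOCK];      /* quantized values */
} q8_block_t;

/* Dot product of two quantized rows of n_blocks blocks. Accumulating in int32
 * and applying the scales once per block keeps the computation order close to
 * ggml's scalar path, which is what the quantization was calibrated against. */
static float q8_dot(const q8_block_t *w, const q8_block_t *x, size_t n_blocks) {
    float sum = 0.0f;
    for (size_t b = 0; b < n_blocks; b++) {
        int32_t acc = 0;
        for (int i = 0; i < QBLOCK; i++)
            acc += (int32_t)w[b].q[i] * (int32_t)x[b].q[i];  /* int8 * int8 */
        sum += (float)acc * w[b].d * x[b].d;                 /* one scale per block */
    }
    return sum;
}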

Limitations

  • Speed: ~0.08 tok/s on Mixtral 80GB with CPU-only scalar+AVX2. This is limited by SSD bandwidth and single-core throughput. AVX-512 VNNI and GPU offload would improve this significantly.
  • Tokenization: Relies on llama-cpp-python for correct BPE tokenization. The built-in greedy tokenizer is not accurate enough for production use.
  • Chat interface: No interactive chat loop yet. Use validate.py for prompt-response testing.
  • Model support: Only K-quant GGUF formats (Q4_K, Q5_K, Q6_K, Q8_0). Models using MXFP4, IQ-quants, or GPTQ are not supported.
  • Platform: Windows only (Linux mmap/threading stubs exist but are untested).

Roadmap

  • AVX-512 / VNNI tensor engine kernels (~10x speedup)
  • GPU offload for attention and FFN (CUDA/Vulkan)
  • Interactive chat with streaming output
  • Built-in BPE tokenizer (remove llama-cpp-python dependency)
  • Linux build and testing
  • SSD-aware expert prefetch (predict next experts, pre-page from SSD)
  • KV cache compression for longer context

Project Structure

src/
├── gguf_chat.c          # Inference engine (forward pass, attention, MoE)
├── gguf.c / gguf.h      # GGUF file parser (instant load, N-file splits)
├── tensor_engine.c / .h  # Quantized compute kernels (AVX2, threaded)
├── memory/
│   ├── tier_manager.c    # WayOS-derived tiered memory manager
│   ├── expert_cache.c    # MoE expert caching
│   ├── prefetch.c        # Predictive expert prefetch
│   └── coherency.c       # Memory coherency
├── platform/
│   ├── io_win.c          # Windows mmap (CreateFileMapping)
│   ├── io_linux.c        # Linux mmap (mmap/madvise)
│   ├── threadpool_win.c  # Windows threading
│   └── threadpool_posix.c # POSIX threading
├── fmoe_main.c           # Reference: llama.dll wrapper
├── model_loader.c        # Model loading utilities
├── router.c              # MoE expert router
├── pipeline.c            # Inference pipeline
└── backend/              # GPU backend stubs (future)
validate.py               # End-to-end validation tool
tokenizer.py              # Fast GGUF tokenizer (reads vocab in 0.2s)
build.cmd                 # Windows build script

License

MIT

Acknowledgments

  • Architecture derived from WayOS tiered memory manager
  • Quantization format compatible with GGUF specification
  • Tensor engine dot products match ggml scalar computation path
