WayInfer

Run large language models that don't fit in RAM. No GPU required.

WayInfer is a native GGUF inference engine that streams model weights from SSD on demand using memory-mapped I/O. An 80GB model loads in under 1 second and runs on a machine with 48GB of RAM — or less.

How It Works

┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│   SSD/NVMe   │─────>│     RAM      │─────>│   Compute    │
│ (model.gguf) │ mmap │ (OS paging)  │      │ (AVX2 SIMD)  │
│    80GB+     │      │  on-demand   │      │  8 threads   │
└──────────────┘      └──────────────┘      └──────────────┘

Traditional inference engines load the entire model into memory before running. WayInfer uses mmap to let the OS page model weights from SSD into RAM as needed. Only the active layers occupy physical memory at any time.
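
Concretely, this is ordinary file mapping. Below is a minimal sketch of the Windows path (the tested platform; src/platform/io_win.c wraps CreateFileMapping). The function name and error handling are illustrative rather than WayInfer's actual interface, and a 64-bit process is assumed so an 80GB file fits in the address space.

#include <windows.h>
#include <stddef.h>

/* Map a GGUF file read-only into the address space. No physical RAM is
 * committed up front; pages are faulted in from SSD the first time the
 * inference loop touches them, and the OS evicts cold pages under pressure. */
static void *map_gguf_readonly(const char *path, size_t *size_out) {
    HANDLE file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE) return NULL;

    LARGE_INTEGER size;
    if (!GetFileSizeEx(file, &size)) { CloseHandle(file); return NULL; }

    HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
    if (!mapping) { CloseHandle(file); return NULL; }

    /* Map the entire file (even 80 GB) into virtual address space. */
    void *base = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);

    CloseHandle(mapping);   /* the view keeps the mapping object alive */
    CloseHandle(file);
    if (base) *size_out = (size_t)size.QuadPart;
    return base;
}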

Architecture

WayInfer's design is derived from the tiered memory manager in WayOS, an AI-first operating system that treats storage as a unified memory hierarchy (SSD ↔ RAM ↔ VRAM). The core insight: for Mixture of Experts (MoE) models, only 2 of 8+ experts are active per token — the rest can stay on disk until needed.
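
A rough sketch of that routing loop, with invented names (the real logic lives in src/router.c and src/gguf_chat.c). The point is that unselected experts are simply never dereferenced, so their pages are never faulted in from SSD.

#include <stddef.h>

typedef struct {
    const void *w_gate, *w_up, *w_down;    /* pointers into the mmap'd file */
} expert_t;

/* Placeholder body standing in for the real quantized FFN kernels. */
static void expert_forward(const float *x, const expert_t *e, float *out,
                           float score, size_t dim) {
    (void)e;                               /* the real kernel reads e->w_* */
    for (size_t i = 0; i < dim; i++) out[i] += score * x[i];
}

static void moe_layer(const float *x, float *out, const expert_t *experts,
                      const int *top_idx, const float *top_score,
                      int n_active, size_t dim) {
    for (int k = 0; k < n_active; k++) {   /* e.g. top-2 of 8 experts */
        /* Reading this expert's weights for the first time triggers the
         * page-in from SSD; the unselected experts stay on disk. */
        expert_forward(x, &experts[top_idx[k]], out, top_score[k], dim);
    }
}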

Key components:

Component         File                        Purpose
GGUF Parser       src/gguf.c                  Parses GGUF headers in <1s, supports N-file splits
Tensor Engine     src/tensor_engine.c         Quantized dot products (AVX2 SIMD, 8-thread parallel)
Inference Engine  src/gguf_chat.c             Full transformer forward pass: attention, MoE routing, FFN
Tier Manager      src/memory/tier_manager.c   SSD streaming memory manager (WayOS architecture)
Platform          src/platform/               Cross-platform mmap, threading (Windows + Linux)

Validated Results

Mixtral 8x22B Instruct — 141B parameters, 80GB quantized, split across 2 GGUF files:

Model:  Mixtral-8x22B-Instruct-v0.1.Q4_K_M (80 GB)
RAM:    48 GB (model is 1.7x available memory)
Load:   0.3 seconds
Prompt: "What is 2+2?"
Output: "The sum of 2 and 2 is 4."
Speed:  ~0.08 tok/s (scalar+threading, no GPU)

The engine produces correct, coherent English from an 80GB MoE model on a machine that cannot hold the model in memory.

Supported Model Formats

WayInfer works with GGUF files using K-quant quantization — the most common format on HuggingFace.

Quant Type   Bits/Weight   Status          Notes
Q4_K_M       4.5           Supported       Most common, recommended
Q5_K_M       5.5           Supported       Higher quality
Q6_K         6.5           Supported       Near-lossless
Q8_0         8.0           Supported       Used for K/V projections
F32          32            Supported       Norm weights, metadata
F16          16            Supported       Router weights
MXFP4, IQ*   varies        Not supported   Niche formats

Split GGUF files are fully supported — models split across any number of files (2, 4, 8, etc.) are loaded and merged automatically.
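
For illustration, the remaining splits can be derived from the first file's name, assuming the common "-00001-of-00002" style suffix produced by gguf-split. This sketch only shows the pattern and is not WayInfer's exact loader code.

#include <stdio.h>
#include <string.h>

/* Build the path of split i (1-based) from the first split's path, assuming
 * the "<name>-00001-of-000NN.gguf" naming convention. Returns 0 on success. */
static int split_path(const char *first, int i, int n_splits,
                      char *out, size_t out_len) {
    const char *tag = strstr(first, "-00001-of-");
    if (!tag) return -1;                   /* not a recognised split name */
    int prefix_len = (int)(tag - first);
    return snprintf(out, out_len, "%.*s-%05d-of-%05d.gguf",
                    prefix_len, first, i, n_splits) < (int)out_len ? 0 : -1;
}

/* Example:
 *   split_path("Mixtral-8x22B.Q4_K_M-00001-of-00002.gguf", 2, 2, buf, sizeof buf)
 *   yields "Mixtral-8x22B.Q4_K_M-00002-of-00002.gguf"
 */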

Tested architectures:

  • Mixtral / Mistral (MoE, GQA)
  • Llama 3.x (dense, GQA)

Build

Requirements: Visual Studio 2022 Build Tools, Windows 10/11 SDK, CPU with AVX2 support.

build.cmd

Output: build\wayinfer.exe

Usage

Quick Test

python validate.py --model path\to\model.gguf --prompt "Your question here" --max-tokens 30

Requires pip install llama-cpp-python for tokenization only (loads vocab in 0.2s, does NOT load model weights).

Direct Engine

build\wayinfer.exe --model path\to\model.gguf --greedy --max-tokens 20

Flags:

  • --model <path> — GGUF model file (first split if multi-file)
  • --ids-file <path> — Pre-tokenized input (binary format)
  • --greedy — Deterministic output (argmax sampling)
  • --temp <T> — Sampling temperature (default 0.7)
  • --max-tokens <N> — Maximum tokens to generate
  • --debug — Enable diagnostic output

Custom Tensor Engine

WayInfer does not depend on ggml, llama.cpp, or any external compute library. It implements its own quantized dot product kernels that match the numerical behavior of ggml's scalar path.

This matters because GGUF quantization is calibrated for a specific dot product computation order. Using a different method (e.g., dequant-to-float32 then dot product) produces numerically different results that compound across layers and destroy output quality. WayInfer's tensor engine replicates the exact computation:

  1. Input quantization — float32 input is quantized to Q8_K (256-element blocks with per-group sums)
  2. Block-level integer accumulation — weight and input quants are multiplied in int8/int16, accumulated in int32 across 8 parallel lanes
  3. Scale application — float conversion happens once per super-block, not per element
  4. AVX2 SIMD — 32-byte vector operations for the inner dot products
  5. 8-thread parallelism — output rows split across CPU cores
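
As a rough illustration of steps 2 and 3, here is a scalar sketch of the block-wise pattern using a simplified block layout (32 int8 values sharing one float scale). The real kernels follow GGUF's K-quant super-block format (256-wide blocks with packed sub-scales) and add the AVX2 and threading from steps 4 and 5; this is a teaching sketch, not the production code.

#include <stdint.h>
#include <stddef.h>

/* Simplified stand-in for a quantized block. GGUF's real Q8_0 uses an fp16
 * scale and K-quants use 256-wide super-blocks; this only shows the pattern:
 * integer multiply-accumulate inside the block, one float multiply per block. */
#define QBLOCK 32

typedef struct {
    float  d;              /* per-block scale */
    int8_t q[QBLOCK];      /* quantized values */
} q8_block_t;

/* Dot product of two quantized rows of n_blocks blocks. Accumulating in int32
 * and applying the scales once per block keeps the computation order close to
 * ggml's scalar path, which is what the quantization was calibrated against. */
static float q8_dot(const q8_block_t *w, const q8_block_t *x, size_t n_blocks) {
    float sum = 0.0f;
    for (size_t b = 0; b < n_blocks; b++) {
        int32_t acc = 0;
        for (int i = 0; i < QBLOCK; i++)
            acc += (int32_t)w[b].q[i] * (int32_t)x[b].q[i];  /* int8 * int8 */
        sum += (float)acc * w[b].d * x[b].d;                 /* one scale per block */
    }
    return sum;
}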

Limitations

  • Speed: ~0.08 tok/s on Mixtral 80GB with CPU-only scalar+AVX2. This is limited by SSD bandwidth and single-core throughput. AVX-512 VNNI and GPU offload would improve this significantly.
  • Tokenization: Relies on llama-cpp-python for correct BPE tokenization. The built-in greedy tokenizer is not accurate enough for production use.
  • Chat interface: No interactive chat loop yet. Use validate.py for prompt-response testing.
  • Model support: Only K-quant GGUF formats (Q4_K, Q5_K, Q6_K, Q8_0). Models using MXFP4, IQ-quants, or GPTQ are not supported.
  • Platform: Windows only (Linux mmap/threading stubs exist but are untested).

Roadmap

  • AVX-512 / VNNI tensor engine kernels (~10x speedup)
  • GPU offload for attention and FFN (CUDA/Vulkan)
  • Interactive chat with streaming output
  • Built-in BPE tokenizer (remove llama-cpp-python dependency)
  • Linux build and testing
  • SSD-aware expert prefetch (predict next experts, pre-page from SSD)
  • KV cache compression for longer context

Project Structure

src/
├── gguf_chat.c          # Inference engine (forward pass, attention, MoE)
├── gguf.c / gguf.h      # GGUF file parser (instant load, N-file splits)
├── tensor_engine.c / .h  # Quantized compute kernels (AVX2, threaded)
├── memory/
│   ├── tier_manager.c    # WayOS-derived tiered memory manager
│   ├── expert_cache.c    # MoE expert caching
│   ├── prefetch.c        # Predictive expert prefetch
│   └── coherency.c       # Memory coherency
├── platform/
│   ├── io_win.c          # Windows mmap (CreateFileMapping)
│   ├── io_linux.c        # Linux mmap (mmap/madvise)
│   ├── threadpool_win.c  # Windows threading
│   └── threadpool_posix.c # POSIX threading
├── fmoe_main.c           # Reference: llama.dll wrapper
├── model_loader.c        # Model loading utilities
├── router.c              # MoE expert router
├── pipeline.c            # Inference pipeline
└── backend/              # GPU backend stubs (future)
validate.py               # End-to-end validation tool
tokenizer.py              # Fast GGUF tokenizer (reads vocab in 0.2s)
build.cmd                 # Windows build script

License

MIT

Acknowledgments

  • Architecture derived from WayOS tiered memory manager
  • Quantization format compatible with GGUF specification
  • Tensor engine dot products match ggml scalar computation path
