MultiX0/last-archive
"it's not perfect, but it works (i guess)"


The Last Archive

A high-performance, local-first search and archival engine powered by RAG.
This is a completely open-source project dedicated to knowledge preservation and semantic search.

Features · Quick Start · Architecture · Tech Stack · IMPORTANT · License


Overview

The Last Archive is an integrated solution for private web archival and semantic search. It empowers users to crawl, index, and interact with web content through a unified interface, leveraging Large Language Models (LLMs) and vector search technology—all while maintaining complete data sovereignty on local infrastructure.

Key Features

  • Privacy-First RAG: Context-aware intelligence using your personal archives without external API dependencies.
  • Semantic Retrieval: Advanced vector-based search that understands intent beyond simple keywords.
  • Modular Microservices: Dockerized components that communicate over a unified internal network.
  • Local-First Design: Engineered to run entirely offline, ensuring data privacy and fast response times.
  • Hybrid Storage: Combines relational metadata (SQLite) with vector embeddings (Qdrant).
  • Ethical Archival: Built-in support for robots.txt compliance and controlled crawl rates.

Architecture

The platform is engineered as a distributed system of specialized microservices to ensure modularity and performance:

  • Spider (Go): A high-concurrency web crawler capable of archiving complex site structures, including HTML, images, and PDF documents.
  • Embedding Service (Python): Utilizes state-of-the-art transformer models to generate high-dimensional vector representations of text.
  • Llama-Go (Go): A high-performance bridge for Ollama, providing OpenAI-compatible LLM inference endpoints.
  • Search Server (Node.js): The central orchestration layer that manages the API, coordinates between services, and handles persistent storage.
  • Frontend (Next.js): A premium, responsive user interface built with React 19 and Tailwind CSS.
  • Qdrant: A robust vector database used for semantic similarity search and high-speed retrieval.
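The retrieval flow across these services can be sketched in miniature. The toy vectors and function names below are illustrative only; in the real system the Embedding Service produces the vectors, Qdrant performs the similarity search, and Llama-Go generates the answer:

```python
import math

# Toy "archive": (text, embedding) pairs standing in for Qdrant points.
archive = [
    ("Qdrant stores vector embeddings.", [0.9, 0.1, 0.0]),
    ("The spider crawls HTML and PDFs.", [0.1, 0.9, 0.0]),
]

def cosine(a, b):
    # Cosine similarity: the standard ranking metric for semantic search.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, k=1):
    # Rank archived chunks by similarity to the query vector.
    ranked = sorted(archive, key=lambda p: cosine(query_vec, p[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Pretend the embedding service mapped "where are vectors kept?" to this vector;
# the retrieved chunk becomes the context handed to the LLM.
context = retrieve([1.0, 0.0, 0.0])
prompt = f"Answer using only this context:\n{context[0]}"
```

The real pipeline does the same three steps (embed, search, prompt), just over HTTP between the containers.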

Tech Stack

Component        Technology
Frontend         Next.js 16, React 19, Radix UI, Tailwind CSS
Primary API      Node.js (Express), Better-SQLite3
Logic/Crawling   Go 1.25
AI Inference     Python (PyTorch/Transformers), Ollama
Vector Engine    Qdrant

Quick Start

The easiest way to get started is using the provided CLI tools for your platform.

Prerequisites

  • Docker with the Compose plugin (all services run as containers)

1. Initialize & Start Services

Use the CLI to create the internal network and launch all microservices:

Linux/macOS:

chmod +x archive.sh
./archive.sh up

Windows:

.\archive.ps1 up

2. Seed the Archive

Once the services are active, use the CLI to add your first websites to the index:

Linux/macOS:

./archive.sh crawl

Windows:

.\archive.ps1 crawl

3. Access the Interface

Open your browser and navigate to:


Configuration

Key environment variables used across the services (managed via .env files in each service directory):

Variable             Description                              Default
LLAMA_BASE_URL       Endpoint for the LLM bridge              http://llm-server:1410
EMBEDDING_BASE_URL   Endpoint for the embedding service       http://embedding-service:5050
QDRANT_HOST          Hostname for the Qdrant database         qdrant
QDRANT_API_KEY       Authentication key for Qdrant            REQUIRED
PORT                 Listening port for the search server     1213
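A service might resolve these variables roughly as follows. This is a sketch using the names and defaults from the table above, not the actual loading code of any service (each service reads its own .env file):

```python
import os

def load_config(env=os.environ):
    # Fall back to the documented defaults when a variable is unset.
    cfg = {
        "llama_base_url": env.get("LLAMA_BASE_URL", "http://llm-server:1410"),
        "embedding_base_url": env.get("EMBEDDING_BASE_URL", "http://embedding-service:5050"),
        "qdrant_host": env.get("QDRANT_HOST", "qdrant"),
        "port": int(env.get("PORT", "1213")),
    }
    # QDRANT_API_KEY has no default; per the table it is required.
    api_key = env.get("QDRANT_API_KEY")
    if not api_key:
        raise RuntimeError("QDRANT_API_KEY is required")
    cfg["qdrant_api_key"] = api_key
    return cfg
```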

Repository Structure

.
├── frontend            # Next.js web application
├── server              # Node.js API Orchestrator (Express + SQLite)
├── spider              # Go-based high-concurrency crawler
├── llama-go            # Go bridge for Ollama inference
├── embedding_service   # Python-based embedding generation
├── qdrant              # Vector storage configuration
├── archive.sh          # Linux/macOS CLI tool
└── archive.ps1         # Windows PowerShell CLI tool

Very Important Note (Read Before Complaining)

The Last Archive is NOT magic out of the box.

This system is entirely database-driven.
If your database is empty or barely populated, the model will appear “dumb” — because it has nothing to work with.

Read this carefully:

  • You must let the crawler run for a decent amount of time.
  • You must crawl a meaningful amount of data before expecting useful answers.
  • The more content you crawl, the better the system gets — this is how RAG works.
  • If you crawl one website (or worse, one page) and then complain that “it’s not working”, that is your mistake, not a bug.

About Issues (Yes, This Is Blunt)

Do NOT open an issue that just says “it’s not working” when:

  • You didn’t let the crawler run long enough
  • Your database is almost empty
  • You don’t understand how the service works
  • You didn’t read the documentation

That is not a bug. That is user error.

Design Reality Check

  • The model does not hallucinate knowledge.
  • It only responds based on what exists in the database.
  • No data = no intelligence.
  • More crawled data = better answers.
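The points above can be made concrete: retrieval can only surface what was stored, so an empty archive yields an empty context. This toy sketch (word overlap instead of real embeddings, invented strings) shows the principle, not the actual pipeline:

```python
def retrieve(query, store, k=3):
    # Rank stored chunks by a naive relevance score: words shared with the query.
    scored = [(len(set(query.split()) & set(text.split())), text) for text in store]
    scored.sort(reverse=True)
    return [text for score, text in scored[:k] if score > 0]

# Empty database: nothing to retrieve, so the model has nothing to work with.
empty_context = retrieve("how does the crawler work", [])

# Populated database: the relevant chunk is found and can ground an answer.
populated = retrieve("how does the crawler work",
                     ["the crawler fetches pages politely"])
```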

Stability Statement

This project has been tested thoroughly on both Windows and Linux before publishing.
The services, crawler, and pipeline work as intended when used correctly.

If you don’t know how to use the service, that is on you, not a reason to open an issue.


How to Run the System (Yes, This Matters)

You must run the system using the provided shell or PowerShell scripts, depending on your operating system.

Do NOT run docker-compose manually from the root directory.
If you do that and nothing starts, that is NOT a bug.

Why?

  • The system is split into multiple microservices.
  • Each service has its own Docker configuration.
  • All services communicate over a specific internal Docker network.
  • That network is created and managed by the provided shell (archive.sh) or PowerShell (archive.ps1) scripts.

If you skip the scripts:

  • The network will not be created correctly
  • Services will not discover each other
  • The system will not start properly

Again: this is expected behavior, not a bug.

Final Warning (Seriously)

  • Do NOT run services separately unless you actually know what you’re doing.
  • Do NOT mix manual Docker commands with the automated setup.
  • Do NOT open a stupid issue about this when you:
    • Didn’t follow the instructions
    • Don’t understand Docker
    • Ignored the setup scripts

There is a shell script and a PowerShell script for a reason.
Use them.


Contributing

We welcome contributions to The Last Archive! Whether it's reporting a bug, suggesting a feature, or submitting a pull request, your help is appreciated.

  1. Fork the repository.
  2. Create your feature branch (git checkout -b feature/AmazingFeature).
  3. Commit your changes (git commit -m 'Add some AmazingFeature').
  4. Push to the branch (git push origin feature/AmazingFeature).
  5. Open a Pull Request.

License

This project is licensed under the MIT License. See the LICENSE file for the full text.

Disclaimer

The Last Archive is provided as-is for educational and personal archival purposes.

  • Ethical Crawling: This tool is designed to respect robots.txt rules. Users are responsible for ensuring their usage complies with the terms of service of any websites they archive.
  • Liability: The authors and contributors are not responsible for any misuse of this tool, including but not limited to unauthorized data scraping, copyright infringement, or any legal consequences arising from the use of this software.

Built for privacy, search, and the preservation of knowledge.
