The North Star for automated research.
A reproducible infrastructure for collecting, enriching, and querying papers from top AI conferences in the CCF rankings.
| Papers | Venues | Years | Vector index |
|---|---|---|---|
| 155,652 | 11 top AI conferences | 1982 – 2026 | 155 K × 384 |
Polaris turns the public scholarly web into a clean, queryable knowledge base. It ships with a reference dataset on HuggingFace and an end-to-end pipeline you can extend to new venues.
Venues covered
- CCF-A strict (7) · AAAI · NeurIPS · ACL · CVPR · ICCV · ICML · IJCAI
- High-value extended (4) · ICLR · ECCV (CCF-B) · EMNLP (CCF-B) · CHI (CCF-A, HCI track)
| Layer | Component |
|---|---|
| Collection | Cross-source collection from open conference proceedings (CVF · ACL Anthology · OpenReview · PMLR) + DBLP |
| Identity | DBLP-anchored deduplication; merge on DOI / arXiv ID / OpenReview ID / title + author |
| Enrichment | Semantic Scholar (citations, TLDR, fields of study) · OpenAlex · arXiv |
| Storage | PostgreSQL schema with 4 tables (papers / authors / paper_authors / links), 2 M+ rows |
| Knowledge | RAG semantic search · research-trend clustering · method × dataset × task knowledge graph · paper recommendation · related-work generation |
| Dashboard | Read-only web UI at http://127.0.0.1:8765 with 7 JSON APIs |
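The merge order named in the Identity row (DOI, then arXiv ID, then OpenReview ID, then title + author) can be illustrated with a small sketch. This is not the repository's implementation: the record fields (`doi`, `arxiv_id`, `openreview_id`, `title`, `authors`), the helper names, and the normalization rules are all assumptions made for illustration.

```python
import re
import unicodedata


def normalize_text(value: str) -> str:
    """Lowercase, strip accents, and collapse non-alphanumerics to single spaces."""
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", " ", ascii_only.lower()).strip()


def identity_key(record: dict) -> tuple:
    """Return a hashable key built from the strongest identifier the record has."""
    # Assumed priority: DOI, then arXiv ID, then OpenReview ID.
    for field in ("doi", "arxiv_id", "openreview_id"):
        value = (record.get(field) or "").strip().lower()
        if value:
            return (field, value)
    # Last resort: normalized title plus normalized first author.
    authors = record.get("authors") or [""]
    return (
        "title_author",
        normalize_text(record.get("title", "")),
        normalize_text(authors[0]),
    )


def deduplicate(records):
    """Merge records that share a key, filling empty fields from later duplicates."""
    merged = {}
    for record in records:
        key = identity_key(record)
        if key not in merged:
            merged[key] = dict(record)
            continue
        for name, value in record.items():
            if not merged[key].get(name):
                merged[key][name] = value
    return list(merged.values())
```

Note the limit of this simplification: a record that carries only an arXiv ID will not merge with a record of the same paper that carries a DOI. Linking records across identifier types needs an extra step (for example a union-find pass over all identifiers), which the sketch leaves out.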
# 1. Install
git clone https://github.com/20bytes/Polaris.git
cd Polaris
pip install -r requirements.txt
# 2. Get the dataset (~1 GB)
pip install huggingface-hub
huggingface-cli download beatless-ai/polaris-ccf \
--repo-type dataset --local-dir outputs/ai_ccf_a
# 3. Launch the dashboard
python -m dashboard.server --host 127.0.0.1 --port 8765
Open http://127.0.0.1:8765 and you're done.
Optional · load into PostgreSQL
createdb ai_research
psql -d ai_research -f sql/postgres_schema_ai_v2.sql
PGPASSWORD=<your-password> python scripts/import_ai_v2.py \
--host 127.0.0.1 --user <your-user> --database ai_research
The dashboard falls back to CSV automatically when Postgres is unavailable, so this step is optional for read-only use.
All endpoints return JSON. CSV fallback works without a database.
| Endpoint | Purpose |
|---|---|
| `GET /api/papers?conference=NeurIPS&year=2025&q=diffusion` | Paginated paper search |
| `GET /api/filters` | Available conferences / years / paper types |
| `GET /api/search?q=retrieval+augmented+generation&top_k=10` | RAG semantic search |
| `GET /api/recommend?paper_id=<id>` or `?q=<text>` | Similar-paper recommendation |
| `GET /api/related-work?q=<abstract>` | Auto-generated related-work paragraph (set `OPENAI_API_KEY` for LLM mode) |
| `GET /api/trends?type=emerging&conference=CVPR` | Hot / emerging / declining topics |
| `GET /api/kg?entity=transformer&relation=applied_to` | Knowledge-graph triplet query |
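As a minimal sketch of calling these endpoints from Python, the helper below builds the query URLs with the standard library and decodes the JSON body. The function names `build_url` and `fetch_json` are hypothetical, and since the response structure is not documented here, the sketch returns the decoded JSON without assuming any keys.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8765"  # dashboard address used in the launch step


def build_url(endpoint: str, **params) -> str:
    """Join the base URL, an endpoint path, and URL-encoded query parameters."""
    # Parameters set to None are dropped so optional filters can be omitted.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}{endpoint}?{query}" if query else f"{BASE_URL}{endpoint}"


def fetch_json(endpoint: str, **params):
    """GET an endpoint and decode its JSON body; no response keys are assumed."""
    with urlopen(build_url(endpoint, **params), timeout=30) as response:
        return json.load(response)


# Example call (requires the dashboard to be running locally):
# fetch_json("/api/search", q="retrieval augmented generation", top_k=10)
```

`urlencode` encodes spaces as `+`, so the example call produces the same query string as the `/api/search` row in the table.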
The scripts/ directory contains the canonical pipeline:
# Anchor to DBLP and resolve identity
python scripts/parse_dblp_ai_ccf_a.py
python scripts/match_current_to_dblp.py
# Enrich with Semantic Scholar
python scripts/enrich_semantic_scholar.py
# Normalize tracks and groups; produce final exports
python scripts/normalize_ai_groups.py
python scripts/build_ai_ccf_a_export.py
# Build embeddings + knowledge graph + trends
python scripts/build_ai_knowledge_index.py
python scripts/build_knowledge_graph.py
python scripts/analyze_research_trends.py
Need a venue DBLP doesn't cover? Drop a collector under scripts/ that emits the canonical CSV schema; the rest of the pipeline is source-agnostic.
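A custom collector could look roughly like the sketch below. The canonical CSV schema is not reproduced in this README, so the column names in `COLUMNS` are placeholders, and the venue, paper, and function names are invented for illustration. Take the real header from the repository before adapting this.

```python
import csv
import io

# Hypothetical column set; replace with the header the pipeline actually expects.
COLUMNS = ["title", "authors", "conference", "year", "url"]


def write_rows(rows, handle) -> int:
    """Write rows as CSV with a fixed header and return the number of rows written."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    count = 0
    for row in rows:
        # Keep only known columns; missing values become empty strings.
        writer.writerow({name: row.get(name, "") for name in COLUMNS})
        count += 1
    return count


def collect_example_venue():
    """Stand-in for a real scraper: yield one dict per paper (placeholder data)."""
    yield {
        "title": "An Example Paper",
        "authors": "A. Author; B. Author",
        "conference": "ExampleConf",
        "year": 2025,
        "url": "https://example.org/paper/1",
        "extra": "dropped because it is not in COLUMNS",
    }


buffer = io.StringIO()
written = write_rows(collect_example_venue(), buffer)
print(buffer.getvalue())
```

Writing to a file instead of a `StringIO` buffer only requires passing a handle opened with `open(path, "w", newline="", encoding="utf-8")`.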
See README_ZH.md for the original Chinese walkthrough with every flag explained.
polaris/
├── dashboard/ HTTP server + RAG / KG / recommend / related-work
├── scripts/ End-to-end pipeline (collection → enrichment → indexing)
├── rules/ Domain knowledge (CCF whitelist, source priority, group rules)
├── sql/ PostgreSQL schemas
└── outputs/ Pipeline artifacts (gitignored; download from HuggingFace)
| Source | License | Use |
|---|---|---|
| DBLP | CC0 (Public Domain) | Identity anchor and venue keys |
| Semantic Scholar API | Subject to S2 API Terms | Citation counts, TLDRs, fields of study |
| OpenAlex | CC0 | Backup metadata and abstracts |
| arXiv | arXiv Terms of Use | PDF links and IDs only |
| CCF rankings | Cited as metadata | Venue tier classification |
| CVF · ACL Anthology · OpenReview · PMLR | Public conference pages | Original title / author metadata |
The published dataset is released under CC-BY-4.0 with full source attribution in the HuggingFace dataset card. The code in this repository is released under the MIT License.
Configuration · environment variables
| Env var | Default | Purpose |
|---|---|---|
| `PGHOST` | `127.0.0.1` | PostgreSQL host |
| `PGPORT` | `5432` | PostgreSQL port |
| `PGDATABASE` | `ai_research` | Database name |
| `PGUSER` | (none) | Database user (required if using Postgres) |
| `PGPASSWORD` | (none) | Database password (required if using Postgres) |
| `OPENAI_API_KEY` | (none) | Enables LLM-mode related-work generation |
| `OPENAI_BASE_URL` | `https://api.openai.com/v1` | OpenAI-compatible endpoint |
| `LLM_MODEL` | `gpt-4o-mini` | Model to call for LLM mode |
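For scripts that want to honor the same variables, one possible way to resolve them with the documented defaults is sketched below. How the dashboard itself reads its configuration is not shown in this README, so `load_settings`, `postgres_configured`, and `llm_mode_enabled` are hypothetical helpers, not part of the project.

```python
import os

# Defaults copied from the table above; None marks variables with no default.
DEFAULTS = {
    "PGHOST": "127.0.0.1",
    "PGPORT": "5432",
    "PGDATABASE": "ai_research",
    "PGUSER": None,
    "PGPASSWORD": None,
    "OPENAI_API_KEY": None,
    "OPENAI_BASE_URL": "https://api.openai.com/v1",
    "LLM_MODEL": "gpt-4o-mini",
}


def load_settings(environ=None) -> dict:
    """Resolve each variable from the environment, falling back to its default."""
    source = os.environ if environ is None else environ
    # Empty strings are treated as unset.
    return {name: source.get(name) or default for name, default in DEFAULTS.items()}


def postgres_configured(settings: dict) -> bool:
    """Assumption: Postgres is usable only when both user and password are set."""
    return bool(settings["PGUSER"] and settings["PGPASSWORD"])


def llm_mode_enabled(settings: dict) -> bool:
    """LLM-mode related-work generation needs an API key."""
    return bool(settings["OPENAI_API_KEY"])
```

Passing a plain dict as `environ` makes the helpers easy to exercise without touching the real process environment.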
@misc{polaris2026,
title = {Polaris: a research infrastructure for top AI conferences in CCF rankings},
year = {2026},
url = {https://github.com/20bytes/Polaris},
note = {Dataset: https://huggingface.co/datasets/beatless-ai/polaris-ccf}
}
Issues and pull requests are welcome. The project uses beads for task tracking; run bd ready to see what's open.