Polaris

The North Star for automated research.

A reproducible infrastructure for collecting, enriching, and querying papers from top AI conferences in the CCF rankings.

License: MIT · Python 3.10+ · Dataset on HF

简体中文 · English


At a glance

| Papers | Venues | Years | Vector index |
| --- | --- | --- | --- |
| 155,652 | 11 top AI conferences | 1982 – 2026 | 155 K × 384 |

Polaris turns the public scholarly web into a clean, queryable knowledge base. It ships with a reference dataset on HuggingFace and an end-to-end pipeline you can extend to new venues.

Venues covered

  • CCF-A strict (7)  ·  AAAI · NeurIPS · ACL · CVPR · ICCV · ICML · IJCAI
  • High-value extended (4)  ·  ICLR · ECCV (CCF-B) · EMNLP (CCF-B) · CHI (CCF-A, HCI track)

What you get

| Layer | Component |
| --- | --- |
| Collection | Cross-source collection from open conference proceedings (CVF · ACL Anthology · OpenReview · PMLR) + DBLP |
| Identity | DBLP-anchored deduplication; merge on DOI / arXiv ID / OpenReview ID / title + author |
| Enrichment | Semantic Scholar (citations, TLDR, fields of study) · OpenAlex · arXiv |
| Storage | PostgreSQL schema with 4 tables (papers / authors / paper_authors / links), 2 M+ rows |
| Knowledge | RAG semantic search · research-trend clustering · method × dataset × task knowledge graph · paper recommendation · related-work generation |
| Dashboard | Read-only web UI at http://127.0.0.1:8765 with 7 JSON APIs |
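
The Identity layer's merge order (DOI, then arXiv ID, then OpenReview ID, then title + author) can be pictured as a priority-ordered key function. The sketch below is illustrative only: the record field names (`doi`, `arxiv_id`, `openreview_id`, `title`, `authors`) and the normalization rules are assumptions, not the repository's actual implementation.

```python
import re
from typing import Optional


def _clean(value: Optional[str]) -> str:
    """Lower-case and trim a value; return an empty string if it is missing."""
    return (value or "").strip().lower()


def identity_key(record: dict) -> tuple:
    """Return a (kind, value) key from the first identifier that is present.

    Field names are hypothetical; only the priority order comes from the README.
    """
    doi = _clean(record.get("doi"))
    if doi:
        # Strip a resolver prefix so URL and bare forms compare equal.
        return ("doi", re.sub(r"^https?://(dx\.)?doi\.org/", "", doi))
    arxiv_id = _clean(record.get("arxiv_id"))
    if arxiv_id:
        # Drop a trailing version suffix such as "v2" (illustrative rule).
        return ("arxiv", re.sub(r"v\d+$", "", arxiv_id))
    openreview_id = (record.get("openreview_id") or "").strip()
    if openreview_id:
        # Kept as-is, on the assumption that case is significant.
        return ("openreview", openreview_id)
    # Fallback: normalized title plus the first author's name.
    title = re.sub(r"[^a-z0-9]+", " ", _clean(record.get("title"))).strip()
    authors = record.get("authors") or []
    first_author = _clean(authors[0]) if authors else ""
    return ("title_author", f"{title}|{first_author}")
```

Two records that produce the same key would be candidates for merging; a real deduplicator would likely also reconcile records that share a weaker key but carry different strong identifiers.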

Quick start

# 1. Install
git clone https://github.com/20bytes/Polaris.git
cd Polaris
pip install -r requirements.txt

# 2. Get the dataset (~1 GB)
pip install huggingface-hub
huggingface-cli download beatless-ai/polaris-ccf \
  --repo-type dataset --local-dir outputs/ai_ccf_a

# 3. Launch the dashboard
python -m dashboard.server --host 127.0.0.1 --port 8765

Open http://127.0.0.1:8765 and you're done.

Optional · load into PostgreSQL
createdb ai_research
psql -d ai_research -f sql/postgres_schema_ai_v2.sql
PGPASSWORD=<your-password> python scripts/import_ai_v2.py \
  --host 127.0.0.1 --user <your-user> --database ai_research

The dashboard falls back to CSV automatically when PostgreSQL is unavailable, so this step is optional for read-only use.
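
The fallback behaviour described above can be sketched as "try the database, otherwise read the CSV export". This is a hedged illustration, not the dashboard's code: `load_papers`, the `query_db` callable, and the CSV path are all hypothetical names introduced for the example.

```python
import csv
from pathlib import Path
from typing import Callable, Iterable, Optional


def load_papers(query_db: Optional[Callable[[], Iterable[dict]]], csv_path: Path) -> list:
    """Return rows from the database if it answers, otherwise from the CSV file.

    `query_db` stands in for whatever function runs the PostgreSQL query;
    pass None to skip the database entirely.
    """
    if query_db is not None:
        try:
            return list(query_db())
        except Exception as exc:  # broad on purpose: driver errors vary
            print(f"Postgres unavailable ({exc}); falling back to CSV")
    # CSV path: every value is read as a string, keyed by the header row.
    with csv_path.open(newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))
```

Keeping the database call behind a callable means the same function can be exercised without a running PostgreSQL server.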


Reference API

All endpoints return JSON. CSV fallback works without a database.

| Endpoint | Purpose |
| --- | --- |
| GET /api/papers?conference=NeurIPS&year=2025&q=diffusion | Paginated paper search |
| GET /api/filters | Available conferences / years / paper types |
| GET /api/search?q=retrieval+augmented+generation&top_k=10 | RAG semantic search |
| GET /api/recommend?paper_id=<id>  or  ?q=<text> | Similar-paper recommendation |
| GET /api/related-work?q=<abstract> | Auto-generated related-work paragraph (set OPENAI_API_KEY for LLM mode) |
| GET /api/trends?type=emerging&conference=CVPR | Hot / emerging / declining topics |
| GET /api/kg?entity=transformer&relation=applied_to | Knowledge-graph triplet query |
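
A minimal client sketch for the endpoints listed above, using only the Python standard library. The paths and query parameters are taken from the table; the helper names `build_url` and `get_json` are hypothetical, and no assumption is made about the shape of the JSON that comes back.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8765"  # dashboard address from the quick start


def build_url(path: str, **params) -> str:
    """Build a dashboard URL, skipping parameters whose value is None."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


def get_json(path: str, **params):
    """GET an endpoint and decode the JSON body (requires a running dashboard)."""
    with urlopen(build_url(path, **params), timeout=30) as resp:
        return json.load(resp)
```

With the dashboard running, `get_json("/api/search", q="retrieval augmented generation", top_k=10)` would issue the semantic-search request shown in the table.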

Running the full pipeline yourself

The scripts/ directory contains the canonical pipeline:

# Anchor to DBLP and resolve identity
python scripts/parse_dblp_ai_ccf_a.py
python scripts/match_current_to_dblp.py

# Enrich with Semantic Scholar
python scripts/enrich_semantic_scholar.py

# Normalize tracks and groups; produce final exports
python scripts/normalize_ai_groups.py
python scripts/build_ai_ccf_a_export.py

# Build embeddings + knowledge graph + trends
python scripts/build_ai_knowledge_index.py
python scripts/build_knowledge_graph.py
python scripts/analyze_research_trends.py
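
The stages above can also be driven from a single Python entry point. The sketch below is an assumption-laden convenience wrapper, not a script shipped with the repository: `run_pipeline` is a hypothetical name, and it assumes each script runs without arguments from the repository root, as in the commands above.

```python
import subprocess
import sys

# Script order copied from the README.
PIPELINE = [
    "scripts/parse_dblp_ai_ccf_a.py",
    "scripts/match_current_to_dblp.py",
    "scripts/enrich_semantic_scholar.py",
    "scripts/normalize_ai_groups.py",
    "scripts/build_ai_ccf_a_export.py",
    "scripts/build_ai_knowledge_index.py",
    "scripts/build_knowledge_graph.py",
    "scripts/analyze_research_trends.py",
]


def run_pipeline(scripts=PIPELINE, dry_run=False):
    """Run each script with the current interpreter; stop at the first failure."""
    commands = [[sys.executable, script] for script in scripts]
    for cmd in commands:
        print("$", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `run_pipeline(dry_run=True)` prints the commands without executing them, which is a cheap way to confirm the order before a long run.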

Need a venue DBLP doesn't cover? Add a collector under scripts/ that emits the canonical CSV schema; the rest of the pipeline is source-agnostic.

See README_ZH.md for the original Chinese walkthrough with every flag explained.


Project layout

polaris/
├── dashboard/        HTTP server + RAG / KG / recommend / related-work
├── scripts/          End-to-end pipeline (collection → enrichment → indexing)
├── rules/            Domain knowledge (CCF whitelist, source priority, group rules)
├── sql/              PostgreSQL schemas
└── outputs/          Pipeline artifacts (gitignored; download from HuggingFace)

Data sources & licensing

| Source | License | Use |
| --- | --- | --- |
| DBLP | CC0 (Public Domain) | Identity anchor and venue keys |
| Semantic Scholar API | Subject to S2 API Terms | Citation counts, TLDRs, fields of study |
| OpenAlex | CC0 | Backup metadata and abstracts |
| arXiv | arXiv Terms of Use | PDF links and IDs only |
| CCF rankings | Cited as metadata | Venue tier classification |
| CVF · ACL Anthology · OpenReview · PMLR | Public conference pages | Original title / author metadata |

The published dataset is released under CC-BY-4.0 with full source attribution in the HuggingFace dataset card. The code in this repository is released under the MIT License.


Configuration · environment variables
| Env var | Default | Purpose |
| --- | --- | --- |
| PGHOST | 127.0.0.1 | PostgreSQL host |
| PGPORT | 5432 | PostgreSQL port |
| PGDATABASE | ai_research | Database name |
| PGUSER | (unset) | Database user (required if using Postgres) |
| PGPASSWORD | (unset) | Database password (required if using Postgres) |
| OPENAI_API_KEY | (unset) | Enables LLM-mode related-work generation |
| OPENAI_BASE_URL | https://api.openai.com/v1 | OpenAI-compatible endpoint |
| LLM_MODEL | gpt-4o-mini | Model to call for LLM mode |
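
One way to read these variables with their documented defaults is sketched below. The variable names and defaults come from the table; the `Settings` class, `load_settings`, and the two derived flags are hypothetical, and the rule that Postgres is used only when both PGUSER and PGPASSWORD are set is an assumption made for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    pg_host: str
    pg_port: int
    pg_database: str
    pg_user: Optional[str]
    pg_password: Optional[str]
    openai_api_key: Optional[str]
    openai_base_url: str
    llm_model: str

    @property
    def use_postgres(self) -> bool:
        # Assumed rule: both credentials must be present.
        return bool(self.pg_user and self.pg_password)

    @property
    def llm_mode(self) -> bool:
        # LLM mode is enabled by the presence of an API key.
        return bool(self.openai_api_key)


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, applying the documented defaults."""
    return Settings(
        pg_host=env.get("PGHOST", "127.0.0.1"),
        pg_port=int(env.get("PGPORT", "5432")),
        pg_database=env.get("PGDATABASE", "ai_research"),
        pg_user=env.get("PGUSER"),
        pg_password=env.get("PGPASSWORD"),
        openai_api_key=env.get("OPENAI_API_KEY"),
        openai_base_url=env.get("OPENAI_BASE_URL", "https://api.openai.com/v1"),
        llm_model=env.get("LLM_MODEL", "gpt-4o-mini"),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to test without touching the real environment.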

Citation

@misc{polaris2026,
  title  = {Polaris: a research infrastructure for top AI conferences in CCF rankings},
  year   = {2026},
  url    = {https://github.com/20bytes/Polaris},
  note   = {Dataset: https://huggingface.co/datasets/beatless-ai/polaris-ccf}
}

Contributing

Issues and pull requests are welcome. The project uses beads for task tracking; run the bd ready command to see what's open.

License

Code: MIT  ·  Dataset: CC-BY-4.0
