Polaris

The North Star for automated research.

A reproducible infrastructure for collecting, enriching, and querying papers from top AI conferences in the CCF rankings.

At a glance

| Papers | Venues | Years | Vector index |
| --- | --- | --- | --- |
| 155,652 | 11 top AI conferences | 1982 – 2026 | 155 K × 384 |

Polaris turns the public scholarly web into a clean, queryable knowledge base. It ships with a reference dataset on HuggingFace and an end-to-end pipeline you can extend to new venues.

Venues covered

CCF-A strict (7): AAAI · NeurIPS · ACL · CVPR · ICCV · ICML · IJCAI
High-value extended (4): ICLR · ECCV (CCF-B) · EMNLP (CCF-B) · CHI (CCF-A, HCI track)

What you get

| Layer | Component |
| --- | --- |
| Collection | Cross-source collection from open conference proceedings (CVF · ACL Anthology · OpenReview · PMLR) + DBLP |
| Identity | DBLP-anchored deduplication; merge on DOI / arXiv ID / OpenReview ID / title + author |
| Enrichment | Semantic Scholar (citations, TLDR, fields of study) · OpenAlex · arXiv |
| Storage | PostgreSQL schema with 4 tables (papers / authors / paper_authors / links), 2 M+ rows |
| Knowledge | RAG semantic search · research-trend clustering · method × dataset × task knowledge graph · paper recommendation · related-work generation |
| Dashboard | Read-only web UI at `http://127.0.0.1:8765` with 7 JSON APIs |
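
The Identity layer merges records on DOI / arXiv ID / OpenReview ID / title + author. A minimal sketch of how a priority-ordered merge key of that kind could be derived is shown here. The field names (`doi`, `arxiv_id`, `openreview_id`, `title`, `authors`), the normalization rules, and the use of the first author only are assumptions for illustration, not the behavior of the Polaris scripts.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_like = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", ascii_like.lower()).strip()


def identity_key(record: dict) -> tuple:
    """Return the strongest identifier available for a paper record.

    Field names are hypothetical. Precedence follows the order listed in
    the Identity row: DOI, arXiv ID, OpenReview ID, then title + author.
    """
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return ("doi", doi)
    arxiv_id = (record.get("arxiv_id") or "").strip().lower()
    if arxiv_id:
        # Drop a trailing version suffix such as "v2".
        return ("arxiv", re.sub(r"v\d+$", "", arxiv_id))
    openreview_id = (record.get("openreview_id") or "").strip()
    if openreview_id:
        return ("openreview", openreview_id)
    authors = record.get("authors") or [""]
    return (
        "title_author",
        normalize_text(record.get("title") or ""),
        normalize_text(authors[0]),
    )


def deduplicate(records: list) -> list:
    """Keep the first record seen for each identity key."""
    seen = {}
    for record in records:
        seen.setdefault(identity_key(record), record)
    return list(seen.values())
```

This sketch does not link a record that carries only a DOI to one that carries only an arXiv ID for the same paper; a fuller merge would need to join records across identifier types, which is presumably where a DBLP anchor helps.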

Quick start

# 1. Install
git clone https://github.com/20bytes/Polaris.git
cd Polaris
pip install -r requirements.txt

# 2. Get the dataset (~1 GB)
pip install huggingface-hub
huggingface-cli download beatless-ai/polaris-ccf \
  --repo-type dataset --local-dir outputs/ai_ccf_a

# 3. Launch the dashboard
python -m dashboard.server --host 127.0.0.1 --port 8765

Open http://127.0.0.1:8765 and you're done.

Optional · load into PostgreSQL

createdb ai_research
psql -d ai_research -f sql/postgres_schema_ai_v2.sql
PGPASSWORD=<your-password> python scripts/import_ai_v2.py \
  --host 127.0.0.1 --user <your-user> --database ai_research

The dashboard falls back to CSV automatically when PostgreSQL is unavailable, so this step is optional for read-only use.
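
As an illustration of the fallback pattern described above, here is a minimal sketch that tries a database query first and reads a CSV export if the query fails. The function name `load_rows`, the `query_db` callable, and the column names in the usage example are hypothetical; the dashboard's actual logic may differ.

```python
import csv
from typing import Callable, Iterable, List, TextIO, Tuple


def load_rows(
    query_db: Callable[[], Iterable[dict]], csv_file: TextIO
) -> Tuple[List[dict], str]:
    """Return (rows, source), preferring the database over the CSV export.

    `query_db` is a hypothetical callable that raises when PostgreSQL is
    unreachable; `csv_file` is an open text file with a header row.
    """
    try:
        return list(query_db()), "postgres"
    except Exception:
        # Broad on purpose: the error type depends on the database driver.
        return list(csv.DictReader(csv_file)), "csv"
```

For example, passing a `query_db` that raises together with `io.StringIO("title,year\nA,2024\n")` yields `([{"title": "A", "year": "2024"}], "csv")`.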

Reference API

All endpoints return JSON. CSV fallback works without a database.

| Endpoint | Purpose |
| --- | --- |
| `GET /api/papers?conference=NeurIPS&year=2025&q=diffusion` | Paginated paper search |
| `GET /api/filters` | Available conferences / years / paper types |
| `GET /api/search?q=retrieval+augmented+generation&top_k=10` | RAG semantic search |
| `GET /api/recommend?paper_id=<id>` or `?q=<text>` | Similar-paper recommendation |
| `GET /api/related-work?q=<abstract>` | Auto-generated related-work paragraph (set `OPENAI_API_KEY` for LLM mode) |
| `GET /api/trends?type=emerging&conference=CVPR` | Hot / emerging / declining topics |
| `GET /api/kg?entity=transformer&relation=applied_to` | Knowledge-graph triplet query |
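
The endpoints above can be called from any HTTP client. A minimal sketch using only the Python standard library follows; the helper names `build_url` and `get_json` are hypothetical, and because the response layout is not documented here, the sketch only decodes the JSON and does not assume any field names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8765"  # dashboard address from the Quick start


def build_url(path: str, **params) -> str:
    """Build a dashboard URL, skipping parameters whose value is None."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


def get_json(path: str, timeout: float = 30.0, **params):
    """Send a GET request to an endpoint and decode the JSON body."""
    with urlopen(build_url(path, **params), timeout=timeout) as response:
        return json.load(response)
```

With the dashboard running, `get_json("/api/search", q="retrieval augmented generation", top_k=10)` requests the semantic-search endpoint, and `get_json("/api/filters")` lists the available filters.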

Running the full pipeline yourself

The scripts/ directory contains the canonical pipeline:

# Anchor to DBLP and resolve identity
python scripts/parse_dblp_ai_ccf_a.py
python scripts/match_current_to_dblp.py

# Enrich with Semantic Scholar
python scripts/enrich_semantic_scholar.py

# Normalize tracks and groups; produce final exports
python scripts/normalize_ai_groups.py
python scripts/build_ai_ccf_a_export.py

# Build embeddings + knowledge graph + trends
python scripts/build_ai_knowledge_index.py
python scripts/build_knowledge_graph.py
python scripts/analyze_research_trends.py

Need a venue DBLP doesn't cover? Drop a collector under scripts/ that emits the canonical CSV schema; the rest of the pipeline is source-agnostic.

See README_ZH.md for the original Chinese walkthrough with every flag explained.

Project layout

polaris/
├── dashboard/        HTTP server + RAG / KG / recommend / related-work
├── scripts/          End-to-end pipeline (collection → enrichment → indexing)
├── rules/            Domain knowledge (CCF whitelist, source priority, group rules)
├── sql/              PostgreSQL schemas
└── outputs/          Pipeline artifacts (gitignored; download from HuggingFace)

Data sources & licensing

| Source | License | Use |
| --- | --- | --- |
| DBLP | CC0 (Public Domain) | Identity anchor and venue keys |
| Semantic Scholar API | Subject to S2 API Terms | Citation counts, TLDRs, fields of study |
| OpenAlex | CC0 | Backup metadata and abstracts |
| arXiv | arXiv Terms of Use | PDF links and IDs only |
| CCF rankings | Cited as metadata | Venue tier classification |
| CVF · ACL Anthology · OpenReview · PMLR | Public conference pages | Original title / author metadata |

The published dataset is released under CC-BY-4.0 with full source attribution in the HuggingFace dataset card. The code in this repository is released under the MIT License.

Configuration · environment variables

| Env var | Default | Purpose |
| --- | --- | --- |
| `PGHOST` | `127.0.0.1` | PostgreSQL host |
| `PGPORT` | `5432` | PostgreSQL port |
| `PGDATABASE` | `ai_research` | Database name |
| `PGUSER` | (none) | Database user (required if using Postgres) |
| `PGPASSWORD` | (none) | Database password (required if using Postgres) |
| `OPENAI_API_KEY` | (none) | Enables LLM-mode related-work generation |
| `OPENAI_BASE_URL` | `https://api.openai.com/v1` | OpenAI-compatible endpoint |
| `LLM_MODEL` | `gpt-4o-mini` | Model to call for LLM mode |
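
A minimal sketch of reading these variables with their documented defaults follows. The `Settings` class, `load_settings` function, and the two `*_enabled` properties are hypothetical helpers; in particular, treating Postgres as enabled only when both user and password are set is an assumption drawn from the "required if using Postgres" notes, not confirmed dashboard behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    pg_host: str
    pg_port: int
    pg_database: str
    pg_user: Optional[str]
    pg_password: Optional[str]
    openai_api_key: Optional[str]
    openai_base_url: str
    llm_model: str

    @property
    def postgres_enabled(self) -> bool:
        # Assumption: both credentials must be present to use Postgres.
        return bool(self.pg_user and self.pg_password)

    @property
    def llm_enabled(self) -> bool:
        # LLM-mode related-work generation needs an API key.
        return bool(self.openai_api_key)


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, applying the documented defaults."""
    return Settings(
        pg_host=env.get("PGHOST", "127.0.0.1"),
        pg_port=int(env.get("PGPORT", "5432")),
        pg_database=env.get("PGDATABASE", "ai_research"),
        pg_user=env.get("PGUSER") or None,
        pg_password=env.get("PGPASSWORD") or None,
        openai_api_key=env.get("OPENAI_API_KEY") or None,
        openai_base_url=env.get("OPENAI_BASE_URL", "https://api.openai.com/v1"),
        llm_model=env.get("LLM_MODEL", "gpt-4o-mini"),
    )
```

Passing a plain dictionary instead of `os.environ` makes the defaults easy to check in isolation, for example `load_settings({}).pg_port` evaluates to `5432`.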

Citation

@misc{polaris2026,
  title  = {Polaris: a research infrastructure for top AI conferences in CCF rankings},
  year   = {2026},
  url    = {https://github.com/20bytes/Polaris},
  note   = {Dataset: https://huggingface.co/datasets/beatless-ai/polaris-ccf}
}

Contributing

Issues and pull requests are welcome. The project uses beads for task tracking; run `bd ready` to see what's open.

License

Code: MIT · Dataset: CC-BY-4.0
