feat(codebase): deep-enrich + graph-aware recall + team-wiki-codebase skill #56

Open
m0Nst3r873 wants to merge 10 commits into Tencent:main from m0Nst3r873:feat/deep-enrich-v2

@m0Nst3r873


Summary

Part 3 of 3. Depends on #55 (feat/import-enrichment-v2).

  • deep-enrich.ts: Background AI knowledge generation (component design docs + architecture overview + graph docs G1-G3) with resume support via _review/progress.json
  • code-knowledge-recall.ts: BM25 + graph-boost retrieval engine for teamwiki/ knowledge graph
  • codebase-upgrade-wiki.ts: Migration from docs/team-codebase/ to teamwiki/ format
  • codebase-wiki-lint.ts: Graph health diagnostics (connectivity, orphans, staleness)
  • team-wiki-codebase skill bundle (909-line methodology)
  • CI: MR comment API + graph change detection
  • Recall: --depth option (route/context/lookup), graph-aware agent instructions, model_hint

Critical recall pipeline fixes

| Fix | Description |
| --- | --- |
| B8 | Graph boost was dead code because of a path format mismatch. Fixed by matching on slug/title instead of raw file paths. |
| B10 | BM25 document length (dl) used the deduplicated token count, which broke length normalization. It now uses the raw token count. |
| B11 | BM25 scores (20-50+) always dominated learnings scores (0-10). BM25 scores are now normalized to the 0-10 range before merging. |
| B14 | CJK queries could not match Chinese text. Added bigram segmentation. |
| B24 | Graph boost extended to 2-hop neighbors (at half weight). |
| B25 | deep-enrich concurrency raised from 2 to 5. |

Also: camelCase splitting in the tokenizer (getUserById → get, user, by, id).
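
To make B10 and B11 concrete, here is a minimal sketch of BM25 scoring that takes document length from the raw token stream and then rescales scores into 0-10. It is an illustration under stated assumptions, not the code in code-knowledge-recall.ts: the `Doc` shape, the `k1`/`b` defaults, and the divide-by-top-score normalization are all hypothetical, since the PR does not say how the normalization is done.

```typescript
// Hypothetical sketch of B10/B11; names and constants are assumptions.
interface Doc {
  id: string;
  tokens: string[]; // raw token stream, duplicates preserved
}

const K1 = 1.2; // common BM25 defaults, assumed here
const B = 0.75;

function termFrequencies(tokens: string[]): Record<string, number> {
  const tf: Record<string, number> = Object.create(null);
  for (const t of tokens) tf[t] = (tf[t] || 0) + 1;
  return tf;
}

function bm25Scores(query: string[], docs: Doc[]): Record<string, number> {
  const n = docs.length;
  // B10: document length is the raw token count, not the unique-token count.
  const avgdl = docs.reduce((sum, d) => sum + d.tokens.length, 0) / Math.max(n, 1);
  const tfs = docs.map((d) => termFrequencies(d.tokens));
  const scores: Record<string, number> = Object.create(null);
  docs.forEach((d, i) => {
    const dl = d.tokens.length;
    let score = 0;
    for (const term of query) {
      const tf = tfs[i][term] || 0;
      if (tf === 0) continue;
      const df = tfs.filter((m) => (m[term] || 0) > 0).length;
      const idf = Math.log(1 + (n - df + 0.5) / (df + 0.5));
      score += (idf * tf * (K1 + 1)) / (tf + K1 * (1 - B + (B * dl) / avgdl));
    }
    scores[d.id] = score;
  });
  return scores;
}

// B11 (assumed approach): rescale so the best hit scores 10 and the rest
// fall in [0, 10], comparable with the 0-10 learnings scores.
function normalizeTo10(scores: Record<string, number>): Record<string, number> {
  const ids = Object.keys(scores);
  const max = ids.reduce((m, id) => Math.max(m, scores[id]), 0);
  const out: Record<string, number> = Object.create(null);
  for (const id of ids) out[id] = max > 0 ? (scores[id] / max) * 10 : 0;
  return out;
}
```

With raw lengths, a short document that repeats the query term outranks a long document that mentions it once, which is the behavior length normalization is meant to produce.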

How it's reachable

  • teamai deep-enrich --project <slug> → generate docs/*.md
  • teamai codebase --lint (when teamwiki/ exists) → wiki-lint health check
  • teamai codebase --upgrade-wiki → migrate legacy format
  • teamai recall --depth lookup <query> → graph-boosted retrieval

Test plan

  • npx tsc --noEmit — zero errors
  • npx vitest run — 1505 tests passed

Dependency chain

#54 (wiki-engine) → #55 (AI enrichment) → PR 3 (this)

Replaces #52 + #53.

jaelgeng and others added 10 commits June 26, 2026 19:31
Vendored from team-wiki by @lurkacai (git.woa.com/lurkacai/team-wiki).
Import paths adjusted for teamai-cli project structure.

Files copied (all pure deterministic, no AI dependency):
- core/graph-index.schema.ts: graph node/edge types, merge, save/load
- core/wiki-protocol.ts: wiki category/confidence types, slugify
- code-knowledge/code-collector.ts: file collection with git-aware filtering
- code-knowledge/code-extractors.ts: multi-language fact extraction dispatch
- code-knowledge/code-graph.ts: build CodeGraphIndex from facts
- code-knowledge/code-incremental.ts: detect changed files via manifest
- code-knowledge/extractors/*: TS/Python/Go/Java/Rust/Config extractors
- interface-scanner.ts: HTTP/MQ/RPC endpoint detection (5 languages)
- call-chain-tracer.ts: 4-layer call chain tracing
- code-graph-overlay.ts: directory-level architecture nodes
- doc-graph-extractor.ts: extract API/config/error nodes from docs
- manifest-schema.ts: V2 manifest types (entrypoints, responsibilities)

Wire up vendored modules into the teamai extraction flow:

- adapters/index.ts: unified export layer for all wiki-engine modules
- adapters/templates.ts: router.md + index.md generation templates
- codebase-extract.ts: full extraction pipeline
  collectCode → extractCodeFacts → scanInterfaces → traceCallChains
  → buildEvidencePages (interfaces.md + call-chains.md)
  → buildIndexHubOverlay → mergedGraph → graph-index.json
  → buildModuleSummaries → detectKnowledgeGaps → router/index/hot/gaps
- utils/hook-output.ts: multi-tool Stop hook output formatting

- interface-scanner: HTTP/MQ/RPC detection across languages (12 tests)
- call-chain-tracer: entry detection, layer classification (8 tests)
- code-graph-overlay: buildIndexHubOverlay node/edge generation (5 tests)
- doc-graph-extractor: structure + entity extraction (8 tests)
- hook-output: formatStopHookOutput multi-tool format (6 tests)

All tests use in-memory data, no filesystem/network dependencies.

Bug fixes applied:
- B1: unify graph-index path to .indices/ (was .teamwiki/.indices/)
- B2: fix router.md links (evidence/code/ prefix)
- B3: add teamwiki to safeIgnore
- B4: remove stale .teamwiki/evidence check
- B5: use saveGraphIndex() instead of manual writeFile
- B9: unify graph schema to GraphIndex (remove CodeGraphIndex)
- B13: filter third-party npm imports from relation facts
- B15: priority sort: key files first, then shallow dirs
- B16: generate deterministic overview.md
- B17: rename call-chains to dependency-paths (not runtime calls)
- B18: Python extractor: only service-pattern functions as components
- B19: facts deduplication by kind:name
- B21: doc-graph config pattern restricted to SCREAMING_SNAKE_CASE
- B22: API path pattern no longer requires /v\d*/ prefix
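
Two of the fixes above, B19 and B21, are small enough to sketch directly. The following is a hedged illustration only: the `Fact` shape is hypothetical, and the regular expression is one possible reading of "SCREAMING_SNAKE_CASE" that assumes at least one underscore is required. The actual pattern in doc-graph-extractor.ts is not shown in this PR.

```typescript
// Hypothetical sketch of B19 (dedup by kind:name) and B21 (config key pattern).
interface Fact {
  kind: string;
  name: string;
  file: string;
}

// B19: keep the first fact seen for each "kind:name" key.
function dedupeFacts(facts: Fact[]): Fact[] {
  const seen: Record<string, boolean> = Object.create(null);
  const out: Fact[] = [];
  for (const f of facts) {
    const key = f.kind + ":" + f.name;
    if (seen[key]) continue;
    seen[key] = true;
    out.push(f);
  }
  return out;
}

// B21 (assumed pattern): uppercase words joined by underscores, e.g. MAX_FILES.
const SCREAMING_SNAKE = /^[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)+$/;

function isConfigKey(name: string): boolean {
  return SCREAMING_SNAKE.test(name);
}
```

Keeping the first occurrence makes the result depend on collection order, so a deterministic file ordering (as in B15) matters for reproducible output.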

CLI integration:
- Add --extract, --incremental, --project, --max-files to codebase command
- Add extract branch to codebase-cmd.ts
- Add teamwiki/ to .gitignore

New modules (vendored/adapted from team-wiki by @lurkacai):
- knowledge-reconciler.ts: 9-phase product↔code reconciliation
- reconciler-v2-types.ts: NumericConfidence scoring types
- manifest-compiler.ts: consume ManifestV2 → wiki pages

New teamai modules:
- enrich-with-ai.ts: per-module AI responsibility inference +
  repo-level domain classification via callClaudeParallel
- rebuild-wiki-index.ts: generate table-based router.md + stats index.md
  from _manifest.json + _domains.json + overview.md
- utils/git.ts: add autoPushTeamRepo for auto-push after import

Updated:
- wiki-engine/adapters/index.ts: export reconciler + confidence types
- wiki-engine/adapters/templates.ts: DomainGroup router + IndexStats
…ication

- import-repo.ts: add reconcile call after extraction, remove entire
  legacy AI domain classification flow (recommendDomain → domains.yaml)
- import-org.ts: add rebuildWikiIndex + autoPush after batch import
- codebase-extract.ts: integrate AI enrichment (enrichWithAI +
  writeManifest + _domains.json), domain-grouped router/index
- Tests updated to match new import flow

1. deep-enrich.ts: background THPC-quality knowledge generation
   - Phase 1: Component design docs per module (parallel AI calls)
   - Phase 2: Architecture overview document
   - Phase 3: Graph documents G1-G3 (deterministic)
   - Progress tracking with _review/progress.json resume support

2. skills/team-wiki-codebase/: bundled deep generation skill (by @lurkacai)
   - 909-line SKILL.md methodology (K0-K4 phases)
   - Sub-agents: kb-doc-generator, graph-rag-agent
   - Registered in builtin-skills.ts for auto-deploy on pull
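
The resume behavior described for deep-enrich.ts can be pictured with a short sketch. It assumes a hypothetical `Progress` shape (a list of completed module names) and leaves out file I/O. The real schema of _review/progress.json is not shown in this PR, and the batch size of 5 is taken from B25.

```typescript
// Hypothetical sketch of resume support; the Progress shape is an assumption.
interface Progress {
  completed: string[]; // modules whose design docs were already generated
}

// Modules still to process after a restart.
function pendingModules(all: string[], progress: Progress): string[] {
  return all.filter((m) => progress.completed.indexOf(m) < 0);
}

// Split work into fixed-size batches (B25 raises the batch size from 2 to 5).
function toBatches<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error("batch size must be at least 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Record a finished module without mutating the previous state.
function markDone(progress: Progress, moduleName: string): Progress {
  if (progress.completed.indexOf(moduleName) >= 0) return progress;
  return { completed: progress.completed.concat([moduleName]) };
}
```

In a real run, the state would presumably be parsed from _review/progress.json at startup and written back after each batch, so an interrupted run repeats at most one batch.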

Security (M1/M2/M4):
- enrich-with-ai.ts: sanitizeForPrompt() for prompt injection defense
- import-repo.ts: independent JSON.parse try/catch with warn logging
- knowledge-reconciler.ts: reject '../' and absolute paths
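
As an illustration of the path check in the last item, the sketch below rejects parent-directory segments and absolute paths using plain string operations. The function name and the exact rules (including the Windows drive-letter check) are assumptions; the real validation in knowledge-reconciler.ts may differ.

```typescript
// Hypothetical sketch: accept only relative paths that stay inside the root.
function isSafeRelativePath(p: string): boolean {
  if (p.length === 0) return false;
  const normalized = p.replace(/\\/g, "/"); // treat backslashes as separators
  if (normalized.charAt(0) === "/") return false; // POSIX absolute path
  if (/^[A-Za-z]:/.test(normalized)) return false; // Windows drive letter
  // Reject any ".." segment, wherever it appears in the path.
  return normalized.split("/").indexOf("..") < 0;
}
```

Checking whole segments rather than searching for the substring `..` avoids rejecting legitimate names such as `a..b/c.md`.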

Integration from main:
- import-repo.ts: deep-enrich trigger + reconcile call
- index.ts: hidden deep-enrich command + recall depth option
- recall.ts + code-knowledge-recall.ts: codebase graph recall
- contribute-check.ts: scoring adjustments + hook output fix
- hook-handlers.ts: formatStopHookOutput multi-tool compat
- clone.ts: HTTPS upgrade + SSH conversion
- pull.ts: MCP registration + teamwiki sync
- ci/extract-mr.ts: graph change detection in MR pipeline
- README: teamwiki docs + CLI command table simplification
- Various test updates to match new behavior

1. Recall agent model_hint (builtin-rules.ts + agents/teamai-recall.md):
   - Guide main agent to use mid-tier model for recall subagent
   - Balances cost/latency with tool-calling capability requirement

2. Auto-recall test fix (auto-recall.test.ts):
   - Add missing version/type/domain/df fields to test search index
   - Prevents isLegacyIndex() from triggering rebuild in isolated tests

Critical recall fixes:
- B7: use protocol loadGraphIndex instead of local hardcoded version
- B8: fix graph boost path resolution — match by slug/title not raw file paths
  (previously graph boost was dead code due to path format mismatch)
- B10: BM25 document length uses raw token count, not deduplicated count
- B11: normalize BM25 scores to 0-10 range before merging with learnings
- B14: add CJK bigram tokenization for Chinese query matching
- B24: extend graph boost to 2-hop neighbors (halved weight)
- B25: deep-enrich BATCH increased 2→5
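
The 2-hop boost in B24 can be sketched as follows. This is a minimal illustration, assuming edges are treated as undirected, nodes are identified by slug (per B8), and each seed hit adds `weight` to its direct neighbors and `weight / 2` to neighbors of neighbors. The `Edge` shape and function names are hypothetical.

```typescript
// Hypothetical sketch of B24: 1-hop neighbors get full weight, 2-hop get half.
interface Edge {
  from: string; // node slug
  to: string;   // node slug
}

// Direct neighbors of a slug, treating edges as undirected (an assumption).
function neighbors(slug: string, edges: Edge[]): string[] {
  const out: string[] = [];
  for (const e of edges) {
    if (e.from === slug && out.indexOf(e.to) < 0) out.push(e.to);
    if (e.to === slug && out.indexOf(e.from) < 0) out.push(e.from);
  }
  return out;
}

function graphBoost(seeds: string[], edges: Edge[], weight: number): Record<string, number> {
  const boost: Record<string, number> = Object.create(null);
  for (const seed of seeds) {
    const oneHop = neighbors(seed, edges).filter((s) => s !== seed);
    const seen = [seed].concat(oneHop);
    for (const n of oneHop) boost[n] = (boost[n] || 0) + weight;
    for (const n of oneHop) {
      for (const m of neighbors(n, edges)) {
        if (seen.indexOf(m) >= 0) continue; // skip the seed and 1-hop nodes
        seen.push(m);
        boost[m] = (boost[m] || 0) + weight / 2; // 2-hop: halved weight
      }
    }
  }
  return boost;
}
```

Halving the weight at the second hop keeps distant nodes from outranking documents that are directly linked to a hit.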

Also: camelCase splitting in tokenizer, GraphNode field migration
(slug/title/type instead of id/label/kind/file) in extract-mr, wiki-lint
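
A tokenizer combining the camelCase splitting and the CJK bigram segmentation from B14 might look like the sketch below. It is an assumption-laden illustration: it only covers the CJK Unified Ideographs block (U+4E00 to U+9FFF), lowercases all Latin tokens, and uses hypothetical function names. The tokenizer in the PR may handle more scripts and edge cases.

```typescript
// Hypothetical sketch: camelCase splitting plus CJK bigram segmentation.
const CJK_CHAR = /[\u4e00-\u9fff]/;

// getUserById -> ["get", "user", "by", "id"]
function splitCamelCase(word: string): string[] {
  return word
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2") // lower or digit followed by upper
    .replace(/([A-Z]+)([A-Z][a-z])/g, "$1 $2") // acronym followed by a word
    .split(" ")
    .filter((part) => part.length > 0)
    .map((part) => part.toLowerCase());
}

// A run of CJK characters becomes overlapping two-character tokens.
function cjkBigrams(run: string): string[] {
  if (run.length < 2) return [run];
  const out: string[] = [];
  for (let i = 0; i + 1 < run.length; i++) out.push(run.slice(i, i + 2));
  return out;
}

function tokenize(text: string): string[] {
  const chunks = text.match(/[\u4e00-\u9fff]+|[A-Za-z0-9]+/g) || [];
  const tokens: string[] = [];
  for (const chunk of chunks) {
    const parts = CJK_CHAR.test(chunk.charAt(0)) ? cjkBigrams(chunk) : splitCamelCase(chunk);
    for (const p of parts) tokens.push(p);
  }
  return tokens;
}
```

Because the query and the indexed text go through the same function, a Chinese query shares bigrams with any document containing the same character pairs, without needing a word-segmentation dictionary.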