
feat: add im-markdown output for doc fetch #1550

Open
liujiashu-shiro wants to merge 12 commits into main from feat/doc_im_markdown

Conversation

@liujiashu-shiro

@liujiashu-shiro liujiashu-shiro commented Jun 23, 2026

Collaborator

Summary

Add an im-markdown output format for doc fetch, converting Docx content into Markdown suitable for IM messages. The change expands conversion coverage for common document structures and documents the intended lark-doc to lark-im usage path.

Changes

  • Add Docx-to-IM-Markdown conversion logic
  • Support --doc-format im-markdown in doc fetch
  • Cover headings, lists, code blocks, tables, images, links, nested structures, and edge cases in unit tests
  • Extend docs_fetch_v2 tests for the new format behavior
  • Document im-markdown in lark-doc fetch references as a fetch-only format for lark-im usage
  • Document the lark-im sending workflow for forwarding fetched doc content with --markdown

Test Plan

  • Unit tests passed: go test ./shortcuts/doc
  • Format check passed: gofmt -l shortcuts/doc/docs_fetch_im_markdown.go shortcuts/doc/docs_fetch_im_markdown_test.go shortcuts/doc/docs_fetch_v2.go shortcuts/doc/docs_fetch_v2_test.go
  • Diff whitespace check passed: git diff --check
  • Manually verify lark-cli docs +fetch --doc-format im-markdown output can be sent through lark-im with --markdown

Related Issues

None

Summary by CodeRabbit

  • New Features

    • Added im-markdown as an allowed --doc-format for v2 +fetch. It fetches as standard Markdown from the API, then converts IM-style markup (headings, callouts, blockquotes, lists, grids/columns, tables, sheets/bookmarks, and citations) into clean Markdown, including nested and partially malformed fragments.
  • Bug Fixes

    • Improved post-processing robustness for unclosed containers and scanner/attribute edge cases, preserving or safely dropping fragments as appropriate.
  • Tests

    • Expanded unit and integration-style coverage for request construction/downgrades and Markdown conversion behaviors, including escaping.
  • Documentation

    • Updated lark-doc/lark-im docs to clarify im-markdown is fetch-only and how to send converted content as a message.

@coderabbitai

coderabbitai Bot commented Jun 23, 2026


Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds im-markdown as a new --doc-format option for +fetch. The flag value is downgraded to markdown when calling the /fetch API. The returned XML-ish IM-markup content is then post-processed by a new converter that scans for registered tags and rewrites them as standard Markdown, with comprehensive test coverage for all tag handlers and edge cases. Documentation updates explain the fetch-only usage in lark-im scenarios and the workflow for sending doc content as messages.
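
The downgrade step described above can be sketched as follows. The name `effectiveFetchFormat` comes from this PR, but the signature and body shown here are assumptions for illustration, not the actual implementation.

```go
package main

// effectiveFetchFormat maps the CLI-level --doc-format value to the value
// sent to the /fetch API. The signature is assumed for illustration.
func effectiveFetchFormat(docFormat string) string {
	if docFormat == "im-markdown" {
		// im-markdown is handled on the client: request plain markdown
		// and convert the response afterwards.
		return "markdown"
	}
	// Every other format is passed through unchanged.
	return docFormat
}
```

Keeping the mapping in one small function means the request builder never sees `im-markdown`, so the API contract stays unchanged while the CLI gains a new output format.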

Changes

IM-Markdown Conversion Pipeline

Layer / File(s) Summary
Fetch v2 flag, format downgrade, and post-processing hook
shortcuts/doc/docs_fetch_v2.go
Adds im-markdown to the --doc-format enum, introduces effectiveFetchFormat to map im-markdown to markdown for the outgoing API request, and inserts a post-processing call to applyFetchIMMarkdown after the fetch response is received.
Converter context, handler registry, and utilities
shortcuts/doc/docs_fetch_im_markdown.go
Defines imMarkdownContext and handler types, precompiles close-regexes and attribute/cell/link detection regexes, registers all supported tag handlers in init, and implements newIMMarkdownContext with tenant-aware base URL extraction and blockquote depth tracking.
Main tag scanning and dispatch loop
shortcuts/doc/docs_fetch_im_markdown.go
Implements the main convertToIMMarkdown loop: scans for the next registered tag, preserves intervening text unchanged, parses attributes with HTML unescaping, routes self-closing tags directly to handlers, and locates matching closing tags using depth tracking.
Block-level element handlers
shortcuts/doc/docs_fetch_im_markdown.go
Implements handlers for title/headings, paragraphs, line breaks, lists (ul/ol with li via seq), callout with optional emoji prefixes, blockquotes with depth tracking, and passthrough containers (grid/column), plus a generic discard handler.
Code, media, and resource link handlers
shortcuts/doc/docs_fetch_im_markdown.go
Implements handlers for inline backtick code (whiteboard), fenced code blocks with optional language and backtick-run sizing, inline LaTeX, horizontal rules, image/source rendering, and sheet/bookmark conversion to Markdown links using computed base URL.
Table-to-Markdown conversion
shortcuts/doc/docs_fetch_im_markdown.go
Extracts tr/td/th structures with depth-aware matching, recursively converts nested registered tags in cells, normalizes <br> to newlines, strips unknown tags while preserving <at> content, HTML-unescapes and pipe-escapes cell text, and pads rows to consistent column counts.
List-body conversion helper
shortcuts/doc/docs_fetch_im_markdown.go
Implements ul/ol list conversion by iterating li blocks via depth-tracked matching, converting each li body to Markdown, applying ordered numbering rules (seq or fallback index), and indenting continuation lines.
Citation, link, text, and escaping utilities
shortcuts/doc/docs_fetch_im_markdown.go
Adds helpers to extract inner anchor href/text, convert markup to plain text with tag stripping and unescaping, build Markdown links with character escaping, compute inline/fenced code fences from backtick runs, apply list continuation indentation, and select first non-empty value.
Converter test infrastructure and apply-function test
shortcuts/doc/docs_fetch_im_markdown_test.go
Adds test case structure and helpers for table-driven converter assertions, plus TestApplyFetchIMMarkdown to verify mutation behavior when document.content is a string and tenant URL extraction for context initialization.
Unit tests for tag handlers
shortcuts/doc/docs_fetch_im_markdown_test.go
Tests individual handler behavior: title (trimming, inner markup, concatenation, case-insensitivity, unclosed), callout (emoji, nesting, recursive same-name, embedded tags, unclosed), blockquote (nested markers, paragraph handling), grid/column (newline separation, nesting, empty behavior, unclosed), table (header/data inference, pipe escaping, br normalization, nested tags, padding, unclosed), discard tags (dropping specific containers including self-closing variants), whiteboard (backtick escaping, paired/self-closing, unclosed), sheet (context-dependent links, missing attributes), bookmark (label precedence, href fallback, escaping, missing href, wrapped tags), and cite variants (user/doc/citation/unknown with attribute precedence and fallbacks).
Edge case and integration tests
shortcuts/doc/docs_fetch_im_markdown_test.go
Tests scanner/parsing boundaries (unknown tag preservation with known child conversion, single-quoted attributes, leading text before tags, XML comments, br conversion, malformed attributes), composite nesting (callout-grid-table-cites-sheets, nested grids in table cells, bookmark-wrapping-callout fallback), unclosed fragments (preserving opening tags, leaving nested content unconverted across multiple tag types), deep nesting robustness (repeated grid/column and emoji-wrapped callout containers), document-wide tag/escaping smoke test (headings, paragraphs, lists, inline formatting, links, LaTeX, code blocks with backtick escaping, media), mixed-document smoke test (verifying conversions appear while raw fragments don't), and base URL extraction from various URL formats and token-only inputs.
Fetch v2 integration tests for im-markdown
shortcuts/doc/docs_fetch_v2_test.go
Adds tests for fetch-body construction (revision_id, export_option for with-ids detail, read_option mapping for scope/boundary), detail downgrade no-op assertion for supported format/detail combinations, im-markdown dry-run format downgrade, detail-downgrade export_option validation for markdown and im-markdown, API error handling, and end-to-end test verifying IM XML tags in stubbed response are converted in JSON output.
User-facing documentation updates
skills/lark-doc/SKILL.md, skills/lark-doc/references/lark-doc-fetch.md, skills/lark-doc/references/lark-doc-md.md, skills/lark-im/SKILL.md
Clarifies that im-markdown is a fetch-only format for lark-im scenarios (not for create/update), updates the --doc-format parameter table and notes to include im-markdown, and adds workflow guidance to lark-im SKILL.md explaining how to fetch doc content in im-markdown and send via --markdown while preserving user cites.

Sequence Diagram(s)

sequenceDiagram
  actor User
  participant CLI as +fetch CLI
  participant executeFetchV2
  participant effectiveFetchFormat
  participant APIServer as /fetch API
  participant applyFetchIMMarkdown
  participant convertToIMMarkdown

  User->>CLI: +fetch --doc-format im-markdown <docToken>
  CLI->>executeFetchV2: execute with im-markdown
  executeFetchV2->>effectiveFetchFormat: compute wire format
  effectiveFetchFormat-->>executeFetchV2: "markdown"
  executeFetchV2->>APIServer: POST /fetch {format: "markdown"}
  APIServer-->>executeFetchV2: {content: "<title>...</title><callout>...</callout>..."}
  executeFetchV2->>applyFetchIMMarkdown: post-process document
  applyFetchIMMarkdown->>convertToIMMarkdown: scan and convert IM tags
  convertToIMMarkdown-->>applyFetchIMMarkdown: "# Title\n> callout...\n..."
  applyFetchIMMarkdown-->>executeFetchV2: document with Markdown content
  executeFetchV2-->>User: JSON output with converted content
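
The post-processing hop in the diagram above can be sketched in Go. This is a minimal illustration, assuming the response is decoded into a generic map with a `document.content` string; the function name `applyIMMarkdown`, the map layout, and the `convert` callback are assumptions rather than the PR's actual code.

```go
package main

// applyIMMarkdown rewrites data["document"]["content"] in place when it is a
// string and leaves every other shape untouched. The map layout and the
// convert callback are assumed for illustration.
func applyIMMarkdown(data map[string]any, convert func(string) string) {
	doc, ok := data["document"].(map[string]any)
	if !ok {
		return // no document object: nothing to convert
	}
	content, ok := doc["content"].(string)
	if !ok {
		return // content missing or not a string: leave as-is
	}
	doc["content"] = convert(content)
}
```

Returning early on unexpected shapes keeps the hook safe to call on any response, which matches the walkthrough's note that the mutation applies when document.content is a string.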

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • larksuite/cli#1466: Modifies the same executeFetchV2 flow and --doc-format downgrade handling in docs_fetch_v2.go, with direct coupling to this PR's format-routing and post-processing wiring.
  • larksuite/cli#1291: Standardizes +fetch command onto v2 flag/validation paths; this PR extends v2FetchFlags and hooks into the same v2 fetch execution flow.
  • larksuite/cli#638: Introduces the original v2 fetch pipeline and executeFetchV2 infrastructure that this PR now extends with post-processing and im-markdown conversion support.

Suggested labels

feature, size/XL

Suggested reviewers

  • YangJunzhou-01
  • SunPeiYang996

Poem

🐇 A new format hops into the fold,
IM-markdown tags turned into gold!
<callout> and <title> take their bow,
Converted to Markdown — neat as a plow.
The rabbit scans tags with depth and care,
Tables and links bloom fresh in the air! 🌿

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 2.65% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name | Status | Explanation
Title check | ✅ Passed | The title accurately summarizes the main change: adding a new im-markdown output format for doc fetch functionality.
Description check | ✅ Passed | The description covers all required template sections: Summary, Changes with bullet points, Test Plan with checkboxes, and Related Issues. Testing details are documented.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.


@github-actions github-actions Bot added the domain/ccm (PR touches the ccm domain) and size/L (Large or sensitive change across domains or core paths) labels Jun 23, 2026
@codecov

codecov Bot commented Jun 23, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.29%. Comparing base (d71bab0) to head (df09172).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1550      +/-   ##
==========================================
+ Coverage   74.04%   74.29%   +0.24%     
==========================================
  Files         787      788       +1     
  Lines       76353    76916     +563     
==========================================
+ Hits        56534    57141     +607     
+ Misses      15572    15536      -36     
+ Partials     4247     4239       -8     


@github-actions

github-actions Bot commented Jun 23, 2026


🚀 PR Preview Install Guide

🧰 CLI update

npm i -g https://pkg.pr.new/larksuite/cli/@larksuite/cli@df091727e11420ecd182a00a39efd75696c86258

🧩 Skill update

npx skills add larksuite/cli#feat/doc_im_markdown -y -g

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
shortcuts/doc/docs_fetch_im_markdown.go (1)

296-313: 🎯 Functional Correctness | 🔵 Trivial

Nested tables inside table cells will be mis-parsed due to non-greedy regex matching.

The non-greedy (.*?)</t[dh]> pattern in imMarkdownCellsRE stops at the first closing </td> or </th>. When a <table> is nested within a <td>, that first closing tag belongs to the inner table's cell, so the outer cell's content is truncated and the row is corrupted. Other nested elements such as <grid> are handled correctly because they use different tag names; a nested <table> reuses td/th and breaks the match.

No tests currently cover nested tables. If Docx exports can produce nested tables, add a test case or document this as a known limitation and fall back to inline code for cells containing nested <table> elements.
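
The truncation can be reproduced with a small sketch. The regex `cellsRE` below is a hypothetical stand-in for the pattern described in this comment, not a copy of `imMarkdownCellsRE`, and `firstCellBody` is an illustrative helper.

```go
package main

import "regexp"

// cellsRE is a hypothetical stand-in for the cell regex described above: a
// non-greedy body that stops at the first closing </td> or </th>.
var cellsRE = regexp.MustCompile(`(?s)<t[dh][^>]*>(.*?)</t[dh]>`)

// firstCellBody returns the body captured for the first cell in row, or ""
// when no cell matches.
func firstCellBody(row string) string {
	m := cellsRE.FindStringSubmatch(row)
	if m == nil {
		return ""
	}
	return m[1]
}
```

For the input `<td>outer <table><tr><td>inner</td></tr></table> tail</td>`, the captured body is `outer <table><tr><td>inner`: the match ends at the inner `</td>`, and the rest of the outer cell is lost. A depth-tracking scan, or a guard that detects a nested `<table>` before matching, avoids this.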

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@shortcuts/doc/docs_fetch_im_markdown.go` around lines 296 - 313, The
handleIMMarkdownTable function has a bug where nested tables inside cells will
be mis-parsed because the imMarkdownCellsRE regex uses a non-greedy pattern that
matches the first closing </td> or </th> tag, which would be from an inner table
instead of the outer cell. To fix this, before processing the cell content in
the inner loop where cellMatch[1] is used, add a check to detect if the cell
content contains a nested <table> element. If it does, either fall back to
calling imMarkdownInlineCode on the segment or skip processing that row to avoid
corrupting the output. This guard should be placed before the
normalizeIMMarkdownTableCell and convertToIMMarkdown calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@shortcuts/doc/docs_fetch_im_markdown.go`:
- Around line 412-421: The `markdownLink` function does not URL-encode the href
parameter before inserting it into the markdown link format. Apply URL encoding
to the href string in the `markdownLink` function before passing it to
fmt.Sprintf, ensuring that special characters like spaces are encoded as %20 and
parentheses as %28 and %29 to comply with Lark/Feishu Markdown requirements. Use
the appropriate URL encoding function from the standard library to encode the
href while maintaining the fmt.Sprintf call structure.
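
A minimal sketch of the encoding this comment asks for, assuming only spaces and parentheses need to be percent-encoded. The names `hrefEscaper` and `markdownLinkSketch` are hypothetical; a targeted replacer is used here instead of `url.PathEscape` because the latter would also escape the `/` separators of a full URL. Escaping of the link text is omitted.

```go
package main

import (
	"fmt"
	"strings"
)

// hrefEscaper percent-encodes the characters named in the review comment.
// Existing percent-escapes are left untouched because '%' is not replaced.
var hrefEscaper = strings.NewReplacer(" ", "%20", "(", "%28", ")", "%29")

// markdownLinkSketch builds a [text](href) link with the href encoded so a
// space or parenthesis cannot end the link destination early.
// Hypothetical helper for illustration only.
func markdownLinkSketch(text, href string) string {
	return fmt.Sprintf("[%s](%s)", text, hrefEscaper.Replace(href))
}
```

With this sketch, `markdownLinkSketch("spec", "https://example.com/a b(1)")` yields `[spec](https://example.com/a%20b%281%29)`.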


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 29d4a8f7-db41-46e6-ba9c-332e69158f45

📥 Commits

Reviewing files that changed from the base of the PR and between 736b131 and 453c74b.

📒 Files selected for processing (4)
  • shortcuts/doc/docs_fetch_im_markdown.go
  • shortcuts/doc/docs_fetch_im_markdown_test.go
  • shortcuts/doc/docs_fetch_v2.go
  • shortcuts/doc/docs_fetch_v2_test.go

Comment thread shortcuts/doc/docs_fetch_im_markdown.go
@github-actions github-actions Bot added the domain/im (PR touches the im domain) label Jun 23, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@shortcuts/doc/docs_fetch_v2_test.go`:
- Around line 667-669: In the TestDocsFetchV2ReturnsAPIError test, replace the
simple strings.Contains check for "fetch failed" with comprehensive typed error
assertions. Use errs.ProblemOf to extract and validate the error's typed
metadata including category, subtype, and param fields to ensure the API error
contract is properly maintained. Additionally, verify that the error cause chain
is preserved by unwrapping the error to check that the underlying error is
accessible, rather than only validating the error message text.
- Around line 325-331: The test for validateReadModeFlags() currently validates
error details using string substring matching with strings.Contains(err.Error(),
tt.wantParam), which doesn't catch classification regressions. Replace this with
typed error metadata assertions by removing the substring check and instead use
errs.ProblemOf to assert the error's category and subtype, and use errors.As to
extract the *errs.ValidationError and directly assert its Param field. Apply
this pattern to all error-path tests in the file including the instance at lines
421-423.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cbea290c-ced3-4ba5-8a54-08e298cfa67a

📥 Commits

Reviewing files that changed from the base of the PR and between 389d80f and c882023.

📒 Files selected for processing (2)
  • shortcuts/doc/docs_fetch_im_markdown_test.go
  • shortcuts/doc/docs_fetch_v2_test.go

Comment thread shortcuts/doc/docs_fetch_v2_test.go Outdated
Comment thread shortcuts/doc/docs_fetch_v2_test.go Outdated
Comment thread skills/lark-doc/SKILL.md Outdated
Comment thread skills/lark-doc/SKILL.md Outdated

Labels

domain/ccm (PR touches the ccm domain), domain/im (PR touches the im domain), size/L (Large or sensitive change across domains or core paths)


2 participants