feat: improve API-ref extraction quality and coverage in the RAG#83
feat: improve API-ref extraction quality and coverage in the RAG#83omkarjoshi0304 wants to merge 6 commits into
Conversation
ef662f1 to
9516e0e
Compare
9516e0e to
55cdbad
Compare
Akrog
left a comment
There was a problem hiding this comment.
As a rule the source of truth is the git repository, so we should try to include as much information as possible in the commit message.
In this case I see there is more information on the PR description than the commit message itself.
| if [ "$OS_API_DOCS" = "true" ] && [ -d "./api-ref/source" ]; then | ||
| if ! grep -q "text-api-ref" tox.ini; then | ||
| echo "$tox_text_api_ref_target" >> tox.ini | ||
| # Build API-Ref |
There was a problem hiding this comment.
nit: Why remote the "if enabled" part of the comment?
There was a problem hiding this comment.
My bad i accidentally removed it
| text_file="./$api_dir/build/text/${rel_path%.html}.txt" | ||
| mkdir -p "$(dirname "$text_file")" | ||
|
|
||
| if command -v pandoc &> /dev/null; then |
There was a problem hiding this comment.
Not sure it's a good idea to allow building things with 2 different tools, as they would create different outputs, so we may start having different DB contents and not realize. This would make it harder to find out WHY things don't behave like they did beefore.
There was a problem hiding this comment.
I have removed the html2text fallback and standardized on Pandoc only.
The reasoning is that Pandoc produces significantly better structural output (especially for tables and code blocks) compared to html2text
|
|
||
| if command -v pandoc &> /dev/null; then | ||
| # shellcheck disable=SC2015 | ||
| pandoc -f html -t plain --wrap=preserve "$html_file" -o "$text_file" 2>/dev/null && ((converted_count++)) || true |
There was a problem hiding this comment.
Why is it OK for conversion to fail?
This could hide problems with the process.
There was a problem hiding this comment.
I removed the || true safety net. The script will now fail explicitly if a file cannot be converted, preventing us from silently indexing corrupted or missing data.
|
|
||
| if [ "$api_file_count" -gt 0 ]; then | ||
| echo "API-Ref: Found $api_file_count content files for $project" | ||
| # Only remove index.txt if other content files exist |
There was a problem hiding this comment.
I don't find this comment very helpful. I can see in the code at a glance that's what is happening, but it's not explained WHY we are deleting it. For example because when there are other files index.txt is a navigation page pointing to other pages, so we don't want it.
There was a problem hiding this comment.
I have updated the comment
Thanks for addressing i will edit the commit message |
42f3a1d to
ce60cfb
Compare
lpiwowar
left a comment
There was a problem hiding this comment.
Just two comments that crossed my mind this morning.
951ea86 to
212b8ed
Compare
| # Replace .txt with .html | ||
| remaining_path = remaining_path.replace(".txt", ".html") | ||
|
|
||
| # Build API-Ref URL: /api-ref/{service}/{remaining_path} |
There was a problem hiding this comment.
nit: The path in the comment doesn't look too useful to me. In my opinion the actual path doesn't add value since we can see it very clearly on the next line.
It is even misleading, as it has {service} but then we use {service_name}.
There was a problem hiding this comment.
I have removed it
| find ./api-ref/build/text -mindepth 1 -depth -type d -empty -delete 2>/dev/null || true | ||
| if ! tox -etext-api-ref; then | ||
| echo "WARNING: API-Ref build failed for $project" | ||
| api_ref_failed="true" |
There was a problem hiding this comment.
-1: Why don't we fail the process if we fail to build the api-ref?
Wouldn't this hide errors?
I think we should fail, but maybe I'm missing something.
There was a problem hiding this comment.
updated it to exit 1
| [ -d "doc/build/text" ] && cp -r doc/build/text "$project_output_dir"/"$_output_version"_docs | ||
| # Copy regular docs if they were built (not for neutron-lib) | ||
| if [ -d "doc/build/text" ]; then | ||
| cp -r doc/build/text "$project_output_dir"/"$_output_version"_docs |
There was a problem hiding this comment.
nit: Why multiple double quotes? `"${project_output_dir}/${_output_version}_docs"
| # Copy API-Ref documentation if it was built successfully | ||
| if [ "$OS_API_DOCS" = "true" ] && [ "$api_ref_failed" != "true" ] && \ | ||
| [ -n "$api_dir" ] && [ -d "$api_dir/build/text" ]; then | ||
| cp -r "$api_dir/build/text" "$project_output_dir"/"$_output_version"_api-ref |
There was a problem hiding this comment.
nit: Same here about multiple double quotes
b8fedc6 to
f5984b2
Compare
umago
left a comment
There was a problem hiding this comment.
I assume many part of bash scripting code has been AI generated ? I don't necessarily understand all of the commands there but the logic seems OK.
Assuming this has been tested and works I'm OK with it as we already have other parts of the code that has been assisted as well/.
index.txt
|
Code reviewFound 1 issue:
rag-content/scripts/get_openstack_plaintext_docs.sh Lines 240 to 255 in f5984b2 🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎. |
ece8977 to
eb09cb5
Compare
Thanks for address counter incremented |
lpiwowar
left a comment
There was a problem hiding this comment.
Apologies, these are just small things, but I think it should be fixed.
No worries I will fix it:) |
e890d85 to
347b7d0
Compare
69254f1 to
347b7d0
Compare
d4b1b21 to
16e2f3f
Compare
- Standardize on Pandoc for consistent RAG output; remove html2text fallback. - Add strict error handling for conversion failures and missing tools. - Fix "Url not reachable" errors by recursively pruning empty navigation index files. - Auto-detect api-ref and api-guide sources to avoid missing project docs. PR Comment fix
78d3bf4 to
755363b
Compare
lpiwowar
left a comment
There was a problem hiding this comment.
Looks good to me! But I'm not giving an approval since I haven't tested it yet. I want to build the image myself and see what is on the inside. Until then, I do not feel comfortable giving approve 🙈 . I'll try to get to this soon!:) If somebody wants to do an approval anyway I'm ok with it.
Thanks, @omkarjoshi0304, for all the changes so far! The commit history looks nicer:) 💯
Thoughts:
-
It would be helpful if this PR was accompanied by some sort of proof of testing comment showcasing how did some API-REF chunks changed (e.g.: this is how I build the image, this is how the chunks looked before and this is how they look now).
-
Since we are later moving to OKP not sure how needed this is. But I think it would be nice if our pipeline sampled some of the chunks and showed them to stdout. Not needed to do this in this PR. It is just a thought:).
-
I think it would be good to later move the api-ref code to a separate function. The
generate_text_doc()is getting quite busy.
Thanks for your comments:)
|
Proof of testingImage Build proof of html2text [Image link](podman pull quay.io/rh-ee-ojoshi/api-ref-html) There were few projects which were creating junk files and wrong chunking with this approach |
|
Image Build proof of Pandoc [Image link](podman pull quay.io/rh-ee-ojoshi/api-ref-pandoc) This approach has better chunking quality and no junk navigation files |
lpiwowar
left a comment
There was a problem hiding this comment.
LGTM 👍 Thanks for the PR and for the proof of testing comments!:) 🎉
Notes
- I think this is safe to merge since the PR does not influence the rest of the upstream documentation generation.
API-ref docs observations
- (non-blocking) They contain
¶symbols. - (non-blocking) They are missing any sense of hierarchy. There is no indication for headers and so on. I'm wondering why we do not convert to
rstormd:
pandoc -f html -t plain --wrap=preserve "$html_file" -o "$text_file" || {
echo "ERROR: Failed to convert $html_file"
return 1
}
- (non-blocking) It would be nice if they were converted to a similar style as the rest of the upstream documentation, but I guess we do not have to deal with that now. It is disabled by default, and we can polish this with the rest of the upstream documentation later - converting everything to nice Markdown (this would allow us to chunk the API ref documentation more nicely later).
- (non-blocking) These things ideally should be at least documented in Issues in this repo after merging.
Issue With Api Ref docs
- I ran the building of the image with the API docs locally, and it is failing for me on the following error (on main as well -> not caused by this PR). It might be related to the error we are seeing in the CI for the upstream docs though. Somebody should take a look on this.
$ make build-image-os BUILD_OCP_DOCS=true FLAVOR=gpu BUILD_GPU_ARGS="--device nvidia.com/gpu=all" OS_API_DOCS=true OS_PROJECTS="keystone"
...
+ tox -etext-api-ref
text-api-ref: install_deps> python -I -m pip install os-api-ref -r /tmp/os_docs_temp/keystone/doc/requirements.txt -c https://releases.openstack.org/constraints/upper/2025.2
.pkg: _optional_hooks> python /opt/app-root/lib64/python3.11/site-packages/pyproject_api/_backend.py True pbr.build
.pkg: get_requires_for_build_editable> python /opt/app-root/lib64/python3.11/site-packages/pyproject_api/_backend.py True pbr.build
.pkg: build_editable> python /opt/app-root/lib64/python3.11/site-packages/pyproject_api/_backend.py True pbr.build
text-api-ref: install_package_deps> python -I -m pip install 'Flask!=0.11,>=1.0.2' 'Flask-RESTful>=0.3.5' 'PyJWT>=1.6.1' 'SQLAlchemy>=1.4.0' 'WebOb>=1.7.1' 'Werkzeug>=0.15.0' 'bcrypt>=3.1.3' 'cryptography>=2.7' 'dogpile.cache>=1.0.2' 'jsonschema>=3.2.0' 'keystonemiddleware>=7.0.0' 'msgpack>=0.5.0' 'oauthlib>=0.6.2' 'oslo.cache>=1.26.0' 'oslo.config>=6.8.0' 'oslo.context>=2.22.0' 'oslo.db>=6.0.0' 'oslo.i18n>=3.15.3' 'oslo.log>=3.44.0' 'oslo.messaging>=5.29.0' 'oslo.middleware>=3.31.0' 'oslo.policy>=4.5.0' 'oslo.serialization!=2.19.1,>=2.18.0' 'oslo.upgradecheck>=1.3.0' 'oslo.utils>=3.33.0' 'osprofiler>=1.4.0' 'pbr!=2.1.0,>=2.0.0' 'pycadf!=2.0.0,>=1.1.0' 'pysaml2>=5.0.0' 'python-keystoneclient>=3.8.0' 'scrypt>=0.8.0' 'stevedore>=1.20.0'
text-api-ref: install_package> python -I -m pip install --force-reinstall --no-deps /tmp/os_docs_temp/keystone/.tox/.tmp/package/2/keystone-0.0.0-0.editable-py3-none-any.whl
text-api-ref: commands[0]> sphinx-build --keep-going -j auto -b html -d api-ref/build/doctrees api-ref/source api-ref/build/html
Running Sphinx v8.1.3
loading translations [en]... done
[openstackdocstheme] version: 3.5.0
[openstackdocstheme] connecting html-page-context event handler
Extension error:
Could not import extension os_openapi (exception: No module named 'os_openapi')
text-api-ref: exit 2 (0.57 seconds) /tmp/os_docs_temp/keystone> sphinx-build --keep-going -j auto -b html -d api-ref/build/doctrees api-ref/source api-ref/build/html pid=1709
.pkg: _exit> python /opt/app-root/lib64/python3.11/site-packages/pyproject_api/_backend.py True pbr.build
text-api-ref: FAIL code 2 (29.07=setup[28.50]+cmd[0.57] seconds)
evaluation failed :( (29.19 seconds)
+ echo 'WARNING: API-Ref build failed for keystone'
WARNING: API-Ref build failed for keystone
+ exit 1
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: lpiwowar, omkarjoshi0304 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
📝 WalkthroughWalkthroughContainer build and docs pipeline updated: Pandoc 3.1.11.1 is installed in the upstream docs stage, Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
scripts/generate_embeddings_openstack.py (1)
65-71:⚠️ Potential issue | 🟡 MinorKeep the fallback path as a
Path.If
relative_to()fails, Line 69 storespath_obj.nameas astr, but Line 71 unconditionally calls.as_posix(). That turns the fallback branch into anAttributeError.Proposed fix
try: relative_path = path_obj.relative_to(self.folder_path.resolve()) except ValueError: - relative_path = path_obj.name + relative_path = Path(path_obj.name) relative_path_str = relative_path.as_posix()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/generate_embeddings_openstack.py` around lines 65 - 71, The fallback in the except block currently sets relative_path to a str (path_obj.name) but later calls relative_path.as_posix(), causing AttributeError; change the except branch to assign a Path object (e.g., relative_path = Path(path_obj.name) or Path(path_obj.name).resolve()) so relative_path is always a pathlib.Path and relative_path_str = relative_path.as_posix() works; update references around path_obj.relative_to(self.folder_path.resolve()), relative_path, and relative_path_str accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@Containerfile`:
- Around line 24-27: The RUN block gated by BUILD_UPSTREAM_DOCS installs pandoc
using a non-existent release (pandoc-3.1.11.1) and uses curl without checksum
verification; replace the inlined curl -L download in that RUN step with a
sequence that references a real released pandoc version (e.g., 3.1.8 or newer),
use curl -fsSL to download the tarball to a temporary file, fetch the
corresponding published SHA256 checksum for that exact release, verify the
downloaded archive's checksum before extraction, then extract
(--strip-components=1) and remove the archive; update the filename referenced in
the RUN command and any hardcoded pandoc-* identifier to match the chosen
released version.
In `@scripts/generate_embeddings_openstack.py`:
- Around line 77-93: The code silently falls back to the repository name when
building API-Ref URLs using API_REF_SERVICE_MAPPING.get(project_name,
project_name); change this to require an explicit mapping: look up project_name
in API_REF_SERVICE_MAPPING (use API_REF_SERVICE_MAPPING[project_name] or check
'if project_name not in API_REF_SERVICE_MAPPING'), and if missing either raise
an exception (e.g., ValueError) or log an error and fail fast so you don't emit
incorrect /api-ref/{project}/... URLs; also add any missing mappings (e.g.,
include 'zun' -> the correct service slug) to API_REF_SERVICE_MAPPING to cover
supported API-ref projects.
In `@scripts/get_openstack_plaintext_docs.sh`:
- Around line 247-251: The script currently validates startup dependencies (tox
and PYTHON) but misses pandoc, which is required when OS_API_DOCS=true before
the HTML->plain conversion in the pandoc call (used in the block referencing
"$html_file" -> "$text_file"); add an early fast-fail check so when
OS_API_DOCS=true the script runs a command existence check (e.g. command -v
pandoc) and prints a clear error and exits non-zero if pandoc is not found,
matching the style used for the existing dependency checks.
- Around line 224-231: The substitution of tox_text_api_ref_target using
"${tox_text_api_ref_target//api-ref/$api_dir}" wrongly rewrites environment
names and other tokens; change the logic so you only rewrite path segments
inside tox_text_api_ref_target (e.g., replace occurrences of "api-ref/" or
"/api-ref" or the specific build/source path components) or compute an
adjusted_env variable (keep "text-api-ref" if the env exists, otherwise set
adjusted_env="text-${api_dir##*/}" or similar) and use that adjusted_env in the
tox invocation (replace the hardcoded "tox -etext-api-ref" with "tox
-e${adjusted_env}"); also add a pre-check that pandoc is installed before its
invocation (e.g., command -v pandoc || { echo ...; exit 1; }) to avoid running
pandoc without verification.
---
Outside diff comments:
In `@scripts/generate_embeddings_openstack.py`:
- Around line 65-71: The fallback in the except block currently sets
relative_path to a str (path_obj.name) but later calls relative_path.as_posix(),
causing AttributeError; change the except branch to assign a Path object (e.g.,
relative_path = Path(path_obj.name) or Path(path_obj.name).resolve()) so
relative_path is always a pathlib.Path and relative_path_str =
relative_path.as_posix() works; update references around
path_obj.relative_to(self.folder_path.resolve()), relative_path, and
relative_path_str accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ec1defaf-30cc-40f0-83fc-ddc67b202f3c
📒 Files selected for processing (4)
Containerfilerequirements.txtscripts/generate_embeddings_openstack.pyscripts/get_openstack_plaintext_docs.sh
💤 Files with no reviewable changes (1)
- requirements.txt
| RUN if [ "$BUILD_UPSTREAM_DOCS" = "true" ]; then \ | ||
| dnf install -y graphviz python-devel pcre-devel pip && \ | ||
| pip install tox html2text && \ | ||
| bash -c 'curl -L https://github.com/jgm/pandoc/releases/download/3.1.11.1/pandoc-3.1.11.1-linux-amd64.tar.gz | tar -zx --strip-components=1 -C /usr/local/' && \ | ||
| pip install tox && \ |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cat Containerfile | head -50Repository: openstack-lightspeed/rag-content
Length of output: 1625
🏁 Script executed:
# Search for Pandoc-related lines in the Containerfile
rg -n "pandoc" Containerfile -i -A 3 -B 3Repository: openstack-lightspeed/rag-content
Length of output: 538
🏁 Script executed:
cat -n ContainerfileRepository: openstack-lightspeed/rag-content
Length of output: 8242
🌐 Web query:
Pandoc 3.1.11.1 release checksum signature verification GitHub
💡 Result:
No official Pandoc 3.1.11.1 release exists on GitHub at https://github.com/jgm/pandoc/releases. The latest releases are 3.9.0.2 (2026-03-19), with earlier ones like 3.1.8 (2023-09-09) and 3.1.2 (2023-03-28). Pandoc 3.1.11.1 appears to be a Debian/Ubuntu package version (haskell-pandoc 3.1.11.1+ds-2 in Debian testing/unstable), not an upstream release, so it lacks official GitHub binaries, checksums, or PGP signatures from jgm/pandoc. Pandoc releases on GitHub provide SHA256 checksums next to each binary asset (e.g., for 3.9.0.2-linux-amd64.tar.gz: a69abfababda8a56969a254b09f9553a7be89ddec00d4e0fe9fd585d71a67508). No PGP signatures are provided; verification relies on GitHub's hosting and SHA256 checksums. macOS .pkg files are Apple-signed and notarized. An announcement discussion exists at jgm/pandoc#9307, but no assets/checksums there for 3.1.11.1. For Debian packages, use apt cryptographic verification or compute local checksums against published Debian values if available.
Citations:
- 1: ANN: pandoc 3.1.11.1 jgm/pandoc#9307
- 2: https://github.com/jgm/pandoc/releases
- 3: https://github.com/jgm/pandoc/releases/tag/3.1
- 4: https://github.com/jgm/pandoc/releases/tag/3.1.8
- 5: https://pandoc.org/releases.html
- 6: Tests failing on 3.1.11.1 jgm/pandoc#9589
- 7: https://sources.debian.org/src/haskell-pandoc/3.1.11.1-3/INSTALL.md/
- 8: https://github.com/jgm/pandoc/releases/tag/3.1.2
- 9: https://tracker.debian.org/pkg/pandoc
- 10: https://packages.debian.org/sid/pandoc
- 11: https://github.com/jgm/pandoc/releases/tag/1.13.1
🌐 Web query:
curl -L vs curl -fsSL HTTP error handling differences
💡 Result:
The key difference in HTTP error handling between curl -L and curl -fsSL lies in how they respond to HTTP errors (response codes >=400): - curl -L (follow redirects only): Does not treat HTTP errors as failures. It outputs the error page body to stdout, shows progress meter and any errors, and exits with code 0 (success). Scripts checking $? will not detect the error. - curl -fsSL (-f: fail on errors, -s: silent/no progress, -S: show-error, -L: follow redirects): Treats HTTP >=400 responses (after following redirects) as failures. Exits with code 22, suppresses the error body output (no data to stdout), hides progress but shows error message to stderr. Ideal for scripts to detect failures reliably. Both follow redirects (-L). Without -f, curl considers only connection issues as errors, not HTTP status codes. With -f, it checks the final response after redirects. Examples: curl -L https://httpstat.us/500 # Outputs error HTML, exit 0 curl -fsSL https://httpstat.us/500 # No output, "curl: (22) ... 500", exit 22 Note: -f may not catch 401/407 auth errors. Use --fail-with-body (curl >=7.76.0) to also output body on failure.
Citations:
- 1: https://curl.se/docs/manpage.html
- 2: https://linuxman7.org/linux/man-pages/man1/curl.1.html
- 3: https://www.joyfulbikeshedding.com/blog/2020-05-11-best-practices-when-using-curl-in-shell-scripts.html
- 4: https://superuser.com/questions/590099/can-i-make-curl-fail-with-an-exitcode-different-than-0-if-the-http-status-code-i
- 5: https://dotboris.io/articles/correct-curl/
- 6: https://stackoverflow.com/questions/89989/is-there-a-curl-wget-option-that-prevents-saving-files-in-case-of-http-errors
Use an actual released version of Pandoc and add checksum verification.
Pandoc 3.1.11.1 does not exist as an official GitHub release—it's a Debian/Ubuntu package version only. The download URL will fail with a 404 error. Use a released version from the Pandoc releases page (e.g., 3.1.8 or later). Additionally, switch from curl -L to curl -fsSL to fail fast on HTTP errors, download the archive explicitly, verify the published SHA256 checksum before extraction, and then extract it.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Containerfile` around lines 24 - 27, The RUN block gated by
BUILD_UPSTREAM_DOCS installs pandoc using a non-existent release
(pandoc-3.1.11.1) and uses curl without checksum verification; replace the
inlined curl -L download in that RUN step with a sequence that references a real
released pandoc version (e.g., 3.1.8 or newer), use curl -fsSL to download the
tarball to a temporary file, fetch the corresponding published SHA256 checksum
for that exact release, verify the downloaded archive's checksum before
extraction, then extract (--strip-components=1) and remove the archive; update
the filename referenced in the RUN command and any hardcoded pandoc-* identifier
to match the chosen released version.
| # Check if this is API-Ref documentation | ||
| if "_api-ref/" in relative_path_str: | ||
| # This is API-Ref documentation - use different URL pattern | ||
| # Pattern: project/version_api-ref/... -> /api-ref/service/... | ||
|
|
||
| # Get the service name from mapping | ||
| service_name = self.API_REF_SERVICE_MAPPING.get(project_name, project_name) | ||
|
|
||
| # Remove _api-ref suffix: /cinder/2025.2_api-ref/ → /cinder/2025.2/api-ref/ | ||
| relative_path = re.sub(r"/(\d+\.\d+)_api-ref/", r"/\1/api-ref/", relative_path) | ||
| # Remove project name and version_api-ref prefix | ||
| # Example: heat/2025.2_api-ref/v1/index.txt -> v1/index.txt | ||
| api_ref_pattern = re.compile(r"^[^/]+/(?:\d+\.\d+|latest)_api-ref/") | ||
| remaining_path = api_ref_pattern.sub("", relative_path_str) | ||
|
|
||
| # Replace .txt with .html | ||
| remaining_path = remaining_path.replace(".txt", ".html") | ||
| # Build API-Ref URL | ||
| return f"{self.base_url}/api-ref/{service_name}/{remaining_path}" |
There was a problem hiding this comment.
Don’t silently fall back to the repository name for API-Ref services.
service_name = self.API_REF_SERVICE_MAPPING.get(project_name, project_name) means any newly ingested _api-ref project that is missing from this table will still emit /api-ref/{project}/... URLs with no signal that the mapping is incomplete. The PR notes already mention Zun API-ref output, but zun is not covered here. Please either complete the mapping for supported API-ref projects or fail fast when an API-ref project has no explicit service slug.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/generate_embeddings_openstack.py` around lines 77 - 93, The code
silently falls back to the repository name when building API-Ref URLs using
API_REF_SERVICE_MAPPING.get(project_name, project_name); change this to require
an explicit mapping: look up project_name in API_REF_SERVICE_MAPPING (use
API_REF_SERVICE_MAPPING[project_name] or check 'if project_name not in
API_REF_SERVICE_MAPPING'), and if missing either raise an exception (e.g.,
ValueError) or log an error and fail fast so you don't emit incorrect
/api-ref/{project}/... URLs; also add any missing mappings (e.g., include 'zun'
-> the correct service slug) to API_REF_SERVICE_MAPPING to cover supported
API-ref projects.
| if ! grep -q "text-api-ref" tox.ini; then | ||
| # Adjust the target if it's api-guide instead of api-ref | ||
| local adjusted_target | ||
| adjusted_target="${tox_text_api_ref_target//api-ref/$api_dir}" | ||
| echo "$adjusted_target" >> tox.ini | ||
| fi | ||
|
|
||
| if ! tox -etext-api-ref; then |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's look at the script and understand the context
head -250 scripts/get_openstack_plaintext_docs.sh | tail -50Repository: openstack-lightspeed/rag-content
Length of output: 2228
🏁 Script executed:
# Get more context around the problematic lines
sed -n '200,260p' scripts/get_openstack_plaintext_docs.shRepository: openstack-lightspeed/rag-content
Length of output: 2774
🏁 Script executed:
# Look at where tox_text_api_ref_target is set
rg "tox_text_api_ref_target" scripts/get_openstack_plaintext_docs.sh -nRepository: openstack-lightspeed/rag-content
Length of output: 198
🏁 Script executed:
# Look at api_dir usage
rg "api_dir" scripts/get_openstack_plaintext_docs.sh -n -B2 -A2Repository: openstack-lightspeed/rag-content
Length of output: 3138
🏁 Script executed:
# Check for pandoc references
rg "pandoc" scripts/get_openstack_plaintext_docs.sh -n -B2 -A2Repository: openstack-lightspeed/rag-content
Length of output: 431
🏁 Script executed:
# Get the complete definition of tox_text_api_ref_target starting at line 142
sed -n '140,180p' scripts/get_openstack_plaintext_docs.shRepository: openstack-lightspeed/rag-content
Length of output: 1713
🏁 Script executed:
# Check if pandoc is validated anywhere in the script (startup checks)
rg "pandoc" scripts/get_openstack_plaintext_docs.sh -B5 -A5Repository: openstack-lightspeed/rag-content
Length of output: 691
🏁 Script executed:
# Check for startup validation of tools
sed -n '1,100p' scripts/get_openstack_plaintext_docs.sh | grep -A20 "Check"Repository: openstack-lightspeed/rag-content
Length of output: 750
🏁 Script executed:
# Look for any tool availability checks at the beginning
sed -n '1,150p' scripts/get_openstack_plaintext_docs.shRepository: openstack-lightspeed/rag-content
Length of output: 5564
🏁 Script executed:
# Let's verify the actual behavior by tracing through what happens with api-guide
# Create a test to show what the substitution produces
cat > /tmp/test_subst.sh << 'EOF'
#!/bin/bash
api_dir="api-guide"
tox_text_api_ref_target="
[testenv:text-api-ref]
description =
Build API reference documentation in HTML format.
commands =
sphinx-build --keep-going -j auto -b html -d api-ref/build/doctrees api-ref/source api-ref/build/html
deps =
-c{env:TOX_CONSTRAINTS_FILE:https://releases.openstack.org/constraints/upper/2025.2}
-r{toxinidir}/doc/requirements.txt
os-api-ref
"
echo "=== Original tox_text_api_ref_target ==="
echo "$tox_text_api_ref_target"
echo ""
echo "=== After global substitution with api_dir=api-guide ==="
adjusted_target="${tox_text_api_ref_target//api-ref/$api_dir}"
echo "$adjusted_target"
echo ""
echo "=== What tox environment name got created ==="
echo "$adjusted_target" | grep "testenv:"
echo ""
echo "=== Script tries to run ==="
echo "tox -etext-api-ref"
EOF
bash /tmp/test_subst.shRepository: openstack-lightspeed/rag-content
Length of output: 1008
The api-guide rewrite breaks the tox environment invocation.
${tox_text_api_ref_target//api-ref/$api_dir} performs a global substitution. When api_dir=api-guide, it rewrites [testenv:text-api-ref] to [testenv:text-api-guide] and os-api-ref to os-api-guide, but Line 231 still invokes tox -etext-api-ref. This causes the build to fail with "Unknown environment 'text-api-ref'" for repositories with only an api-guide directory.
The proposed fix handling only path segments (e.g., api-ref/build and api-ref/source) is incomplete—it doesn't prevent the environment name from being rewritten to text-api-guide, nor does it prevent the dependency from changing to os-api-guide. The proper fix requires either (1) keeping the environment name unchanged and building conditionally based on which directory exists, or (2) updating the hardcoded tox -etext-api-ref invocation to use the adjusted environment name.
Additionally, pandoc is invoked at line 248 without prior validation, but the startup checks only verify tox and $PYTHON.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/get_openstack_plaintext_docs.sh` around lines 224 - 231, The
substitution of tox_text_api_ref_target using
"${tox_text_api_ref_target//api-ref/$api_dir}" wrongly rewrites environment
names and other tokens; change the logic so you only rewrite path segments
inside tox_text_api_ref_target (e.g., replace occurrences of "api-ref/" or
"/api-ref" or the specific build/source path components) or compute an
adjusted_env variable (keep "text-api-ref" if the env exists, otherwise set
adjusted_env="text-${api_dir##*/}" or similar) and use that adjusted_env in the
tox invocation (replace the hardcoded "tox -etext-api-ref" with "tox
-e${adjusted_env}"); also add a pre-check that pandoc is installed before its
invocation (e.g., command -v pandoc || { echo ...; exit 1; }) to avoid running
pandoc without verification.
| # Convert HTML to plain text using pandoc (consistent output) | ||
| pandoc -f html -t plain --wrap=preserve "$html_file" -o "$text_file" || { | ||
| echo "ERROR: Failed to convert $html_file" | ||
| return 1 | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's check the file size and read the startup/early sections
wc -l scripts/get_openstack_plaintext_docs.shRepository: openstack-lightspeed/rag-content
Length of output: 119
🏁 Script executed:
# Read the beginning of the script to find startup validation
head -100 scripts/get_openstack_plaintext_docs.sh | cat -nRepository: openstack-lightspeed/rag-content
Length of output: 4796
🏁 Script executed:
# Search for dependency checks (tox, PYTHON, pandoc) in the script
rg -n "(command -v|which|type )" scripts/get_openstack_plaintext_docs.sh | head -20Repository: openstack-lightspeed/rag-content
Length of output: 612
🏁 Script executed:
# Check for pandoc usage throughout the script
rg -n "pandoc" scripts/get_openstack_plaintext_docs.shRepository: openstack-lightspeed/rag-content
Length of output: 259
🏁 Script executed:
# Search for any pandoc availability checks in the entire script
rg -n "pandoc" scripts/get_openstack_plaintext_docs.shRepository: openstack-lightspeed/rag-content
Length of output: 259
🏁 Script executed:
# Check the context around lines 247-251 to understand control flow
sed -n '200,280p' scripts/get_openstack_plaintext_docs.sh | cat -nRepository: openstack-lightspeed/rag-content
Length of output: 4434
🏁 Script executed:
# Check if there are any early checks for OS_API_DOCS and related setup
sed -n '35,150p' scripts/get_openstack_plaintext_docs.sh | cat -nRepository: openstack-lightspeed/rag-content
Length of output: 5321
🏁 Script executed:
# Check if there's any conditional logic around pandoc that we might have missed
sed -n '210,260p' scripts/get_openstack_plaintext_docs.sh | cat -nRepository: openstack-lightspeed/rag-content
Length of output: 2703
🏁 Script executed:
# Verify how OS_API_DOCS is used and when this code path executes
rg -B5 -A5 "OS_API_DOCS" scripts/get_openstack_plaintext_docs.sh | head -40Repository: openstack-lightspeed/rag-content
Length of output: 1359
Add pandoc to startup dependency checks.
When OS_API_DOCS=true, the script requires pandoc for HTML-to-plain-text conversion (line 248), but currently validates only tox and PYTHON at startup. Missing pandoc won't be detected until after the HTML build completes, wasting time on unnecessary work. Add an early check to fail fast if pandoc is unavailable.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/get_openstack_plaintext_docs.sh` around lines 247 - 251, The script
currently validates startup dependencies (tox and PYTHON) but misses pandoc,
which is required when OS_API_DOCS=true before the HTML->plain conversion in the
pandoc call (used in the block referencing "$html_file" -> "$text_file"); add an
early fast-fail check so when OS_API_DOCS=true the script runs a command
existence check (e.g. command -v pandoc) and prints a clear error and exits
non-zero if pandoc is not found, matching the style used for the existing
dependency checks.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
scripts/generate_embeddings_openstack.py (1)
65-71:⚠️ Potential issue | 🟡 MinorKeep the fallback branch as a
Path.When
relative_to()fails on line 68, line 69 assignspath_obj.name(astr), but line 71 calls.as_posix()on it, raisingAttributeError. Wrap the fallback inPath()to maintain the type consistency.Fix
try: relative_path = path_obj.relative_to(self.folder_path.resolve()) except ValueError: - relative_path = path_obj.name + relative_path = Path(path_obj.name)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/generate_embeddings_openstack.py` around lines 65 - 71, The fallback branch currently sets relative_path = path_obj.name (a str) causing .as_posix() to fail; change the except block to wrap the fallback in a Path so relative_path remains a Path instance (i.e., use Path(path_obj.name) or Path(path).name wrapped with Path) so that calling relative_path.as_posix() later works; locate this logic around the variables path_obj, relative_path and self.folder_path in the try/except where relative_to() is used and ensure the fallback produces a Path object.
♻️ Duplicate comments (1)
scripts/generate_embeddings_openstack.py (1)
36-56:⚠️ Potential issue | 🟠 MajorFail fast on unmapped API-Ref services.
This map is still incomplete for projects already exercised in this PR—
zunandmagnumare absent here—and Line 83 silently falls back toproject_name. That can generate baddocs_urlvalues and then warn, drop, or fail the run depending on--unreachable-action. Require an explicit mapping here instead of inventing one at runtime, even if some entries are identity mappings.🛠️ Minimal fix
- service_name = self.API_REF_SERVICE_MAPPING.get(project_name, project_name) + try: + service_name = self.API_REF_SERVICE_MAPPING[project_name] + except KeyError as exc: + raise ValueError( + f"Missing API-Ref service mapping for project '{project_name}'" + ) from excAlso applies to: 82-93
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/generate_embeddings_openstack.py` around lines 36 - 56, The API_REF_SERVICE_MAPPING is incomplete (missing entries like "zun" and "magnum") and the code currently falls back to project_name when a mapping is missing; update API_REF_SERVICE_MAPPING to include explicit mappings for all projects exercised in this PR (at least add "zun" and "magnum" with their correct service names) and change the lookup logic that builds docs_url to fail fast when a project key is absent (use explicit key access on API_REF_SERVICE_MAPPING and raise/exit with a clear error message referencing the missing project name instead of silently falling back to project_name), so downstream behavior and --unreachable-action handling are deterministic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@scripts/generate_embeddings_openstack.py`:
- Around line 65-71: The fallback branch currently sets relative_path =
path_obj.name (a str) causing .as_posix() to fail; change the except block to
wrap the fallback in a Path so relative_path remains a Path instance (i.e., use
Path(path_obj.name) or Path(path).name wrapped with Path) so that calling
relative_path.as_posix() later works; locate this logic around the variables
path_obj, relative_path and self.folder_path in the try/except where
relative_to() is used and ensure the fallback produces a Path object.
---
Duplicate comments:
In `@scripts/generate_embeddings_openstack.py`:
- Around line 36-56: The API_REF_SERVICE_MAPPING is incomplete (missing entries
like "zun" and "magnum") and the code currently falls back to project_name when
a mapping is missing; update API_REF_SERVICE_MAPPING to include explicit
mappings for all projects exercised in this PR (at least add "zun" and "magnum"
with their correct service names) and change the lookup logic that builds
docs_url to fail fast when a project key is absent (use explicit key access on
API_REF_SERVICE_MAPPING and raise/exit with a clear error message referencing
the missing project name instead of silently falling back to project_name), so
downstream behavior and --unreachable-action handling are deterministic.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 24647331-ec92-4d8a-bfc3-ff07815ef6fc
📒 Files selected for processing (2)
Containerfilescripts/generate_embeddings_openstack.py
🚧 Files skipped from review as they are similar to previous changes (1)
- Containerfile
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Description:-
This PR improves the quality of API reference data ingested into the RAG pipeline and ensures consistent coverage across OpenStack 2025.2 projects.
Use pandoc + html2text for cleaner API extraction (removes nav, headers, and boilerplate)
Apply heuristic filtering to index.txt files to keep meaningful single-page APIs and drop navigation-only indices
Auto-detect api-ref and api-guide sources to avoid missing project docs
Filter out placeholder/junk files using logo metadata detection
Summary by CodeRabbit