Gapic deduplication experimentation#17573
Conversation
- Add fallback to handle single-iteration quantile calculation crashes - Normalize .pyc file paths during line counting and log exceptions - Expose worker process stderr to facilitate debugging CalledProcessError - Fix absolute paths in documentation.md to use relative directory paths
…g, and encodings - Use importlib.util.source_from_cache for robust .pyc resolution - Move importlib.util and logging imports to module level - Refine json.loads to parse only the last line of stdout - Specify UTF-8 encoding when opening files for writing
- Extract worker logic to _run_worker_and_parse for robust JSON parsing - Add early validation for iterations parameter in run_master - Add non-zero exit code check and warning for run_trace failures
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
- Execute cProfile in a fresh subprocess to ensure cold-start accuracy - Execute tracemalloc in a fresh subprocess to capture all memory allocations - Clean up master process exception handling for missing taskset vs crashed workers
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
- Use json.dumps to safely escape target_module string in all subprocess commands (mprofile, cprofile, trace)
- Replace detailed implementation architecture and alternative solutions with a concise high-level usage and expected output format per PR review feedback
- Read /proc/self/statm inside worker before and after import - Exposes pure C-extension memory allocations (e.g. grpc/protobuf) missed by Python's internal tracemalloc - Requires zero dependencies, bypassing the need for psutil
- Recursively wipes __pycache__ and .pyc files via pathlib prior to iteration loops. - Employs Python's -B flag on worker subprocesses to prevent re-caching during the benchmark, enforcing raw .py parsing for every execution.
- Replaces --clear-pycache with an opt-out --keep-pycache flag - The profiler now automatically sweeps and enforces disk-level cold starts without requiring explicit flags from the user.
Address PR review comment by moving setup tracking (rss_before, modules_before) outside of the time.perf_counter() window to prevent artificial latency inflation.
Addresses PR feedback to avoid broad exception catching when parsing physical python files for line counts.
Addresses PR review by explicitly documenting the expected PEP 3147/488 error path for irregular .pyc files.
…ssing modules Addresses PR review by replacing null-checks with a cleaner try/except block that assumes normal module behavior and explicitly logs when a C-extension or built-in is skipped due to a missing __file__ attribute.
Addresses PR review by replacing the 'none' string check with a typed NO_CPU_PINNING = -1 constant for cleaner argument parsing and state tracking.
…askset Addresses PR review by removing the fallback unpinned execution logic when taskset is not installed. The script now fails immediately with a clear error message, preventing code duplication.
Addresses PR review by collapsing the repeated statistics.quantiles logic into a reusable _calculate_percentiles helper method.
…terations Addresses PR review by validating that 'loaded_modules' and 'loaded_lines' remain deterministic across cold-start iterations, warning the user if non-deterministic import paths are triggered.
… tracemalloc profiling Addresses PR review by replacing the hard-to-maintain dynamic code string inside run_mprofile with a properly structured multiprocessing 'spawn' context.
There was a problem hiding this comment.
Code Review
This pull request introduces shared utilities for mTLS endpoint handling in the google-cloud-kms package and adds an import profiling toolset under scripts/import_profiler/, which includes a profiler script, documentation, and a script to rewrite __init__.py files for lazy loading. The review feedback suggests several robust improvements to the profiling scripts: converting the rglob generator to a list to prevent traversal issues during directory deletion, updating the __all__ regex parser to support lists in addition to tuples, adding boundary checks to prevent potential IndexError during multi-line import parsing, and implementing __dir__ alongside __getattr__ to support autocompletion for lazy-loaded modules.
| """Recursively deletes all __pycache__ directories and .pyc files to ensure disk-level cold starts.""" | ||
| print("Sweeping directory to delete __pycache__ and force bytecode recompilation...") | ||
| count = 0 | ||
| for p in pathlib.Path('.').rglob('__pycache__'): |
There was a problem hiding this comment.
Modifying the directory structure (deleting __pycache__ directories) while traversing it with rglob can lead to unexpected behavior or FileNotFoundError. Converting the generator to a list first ensures safe traversal.
| for p in pathlib.Path('.').rglob('__pycache__'): | |
| for p in list(pathlib.Path('.').rglob('__pycache__')): |
| content = f.read() | ||
|
|
||
| # Find the __all__ tuple/list | ||
| all_match = re.search(r'__all__\s*=\s*\((.*?)\)', content, re.DOTALL) |
There was a problem hiding this comment.
The __all__ variable in Python can be defined as either a tuple or a list. Updating the regular expression to support both parentheses and square brackets prevents parsing failures.
| all_match = re.search(r'__all__\s*=\s*\((.*?)\)', content, re.DOTALL) | |
| all_match = re.search(r'__all__\s*=\s*[(\[](.*?)[)\]]', content, re.DOTALL) |
| if "(" in names_part and ")" not in names_part: | ||
| # multi-line import | ||
| import_names_str = names_part.replace("(", "") | ||
| i += 1 | ||
| while ")" not in lines[i]: | ||
| import_names_str += lines[i] | ||
| i += 1 | ||
| import_names_str += lines[i].replace(")", "") |
There was a problem hiding this comment.
If a multi-line import is malformed or lacks a closing parenthesis, iterating over lines without checking bounds can raise an IndexError. Adding a boundary check ensures safe parsing.
| if "(" in names_part and ")" not in names_part: | |
| # multi-line import | |
| import_names_str = names_part.replace("(", "") | |
| i += 1 | |
| while ")" not in lines[i]: | |
| import_names_str += lines[i] | |
| i += 1 | |
| import_names_str += lines[i].replace(")", "") | |
| if "(" in names_part and ")" not in names_part: | |
| # multi-line import | |
| import_names_str = names_part.replace("(", "") | |
| i += 1 | |
| while i < len(lines) and ")" not in lines[i]: | |
| import_names_str += lines[i] | |
| i += 1 | |
| if i < len(lines): | |
| import_names_str += lines[i].replace(")", "") |
| def __getattr__(name): | ||
| if name in _LAZY_IMPORTS: | ||
| module = importlib.import_module(_LAZY_IMPORTS[name], package=__name__) | ||
| attr = getattr(module, name) | ||
| globals()[name] = attr | ||
| return attr | ||
| raise AttributeError(f"module {{__name__!r}} has no attribute {{name!r}}") |
There was a problem hiding this comment.
When implementing lazy loading via __getattr__ (PEP 562), it is highly recommended to also define __dir__ so that interactive environments (like IPython, Jupyter, or standard REPL) and IDEs can correctly autocomplete the lazy-imported attributes.
def __getattr__(name):
if name in _LAZY_IMPORTS:
module = importlib.import_module(_LAZY_IMPORTS[name], package=__name__)
attr = getattr(module, name)
globals()[name] = attr
return attr
raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
def __dir__():
return sorted(list(globals().keys()) + list(_LAZY_IMPORTS.keys()))
GAPIC Client Deduplication (Shared Utility Layer)
Goal
Reduce package size and improve maintainability of generated GAPIC clients by deduplicating redundant static helper methods and configuration logic into a shared layer.
Background
Currently, every generated service client (e.g., in
google-cloud-compute,google-cloud-aiplatform) contains hundreds of lines of identical static methods for:_get_default_mtls_endpoint)_use_client_cert_effective)_read_environment_variables)This duplication inflates package size and makes horizontal fixes difficult.
Proposed Approach (Shared Utility Layer)
We are introducing a
shared_utils.pymodule within the generated services package. Common logic is extracted into this module. The generated service clients are refactored to delegate to these shared functions via thin pass-through stubs.Benefits:
Manual Validation Results (
google-cloud-kms)A manual experiment was conducted on
google-cloud-kmsto validate this approach before updating the generator templates.Steps Taken:
google/cloud/kms_v1/services/shared_utils.py.get_default_mtls_endpoint,use_client_cert_effective, andread_environment_variablestoshared_utils.KeyManagementServiceClientandEkmServiceClientto delegate toshared_utils.Impact:
Testing Note (Environmental Edge Case Discovered):
During testing on Cloudtop environments, tests assuming default
Falsestates for client certs might fail ifCLOUDSDK_CONTEXT_AWARE_USE_CLIENT_CERTIFICATE=trueis set globally. This is an artifact of the testing environment assumptions, not the deduplication logic. Overriding this variable during testing (CLOUDSDK_CONTEXT_AWARE_USE_CLIENT_CERTIFICATE=false) confirms logic is sound.References