Skip to content

Gapic deduplication experimentation#17573

Draft
hebaalazzeh wants to merge 32 commits into
mainfrom
gapic-deduplication-experimentation
Draft

Gapic deduplication experimentation#17573
hebaalazzeh wants to merge 32 commits into
mainfrom
gapic-deduplication-experimentation

Conversation

@hebaalazzeh

Copy link
Copy Markdown
Contributor

GAPIC Client Deduplication (Shared Utility Layer)

Goal

Reduce package size and improve maintainability of generated GAPIC clients by deduplicating redundant static helper methods and configuration logic into a shared layer.

Background

Currently, every generated service client (e.g., in google-cloud-compute, google-cloud-aiplatform) contains hundreds of lines of identical static methods for:

  • mTLS endpoint resolution (_get_default_mtls_endpoint)
  • Client certificate usage checks (_use_client_cert_effective)
  • Environment variable reading (_read_environment_variables)

This duplication inflates package size and makes horizontal fixes difficult.

Proposed Approach (Shared Utility Layer)

We are introducing a shared_utils.py module within the generated services package. Common logic is extracted into this module. The generated service clients are refactored to delegate to these shared functions via thin pass-through stubs.

Benefits:

  • Significant LOC Reduction: Thousands of lines of code saved across massive libraries.
  • Backward Compatibility Preserved: Backward-compatible stubs on the client class ensure that advanced users monkey-patching "private" methods continue to function.
  • Type Safety: Maintains strict type hints.

Manual Validation Results (google-cloud-kms)

A manual experiment was conducted on google-cloud-kms to validate this approach before updating the generator templates.

Steps Taken:

  1. Created google/cloud/kms_v1/services/shared_utils.py.
  2. Moved get_default_mtls_endpoint, use_client_cert_effective, and read_environment_variables to shared_utils.
  3. Refactored KeyManagementServiceClient and EkmServiceClient to delegate to shared_utils.

Impact:

  • Replaced ~35 lines of duplicate logic per client with one-line delegations.
  • No functional regressions.

Testing Note (Environmental Edge Case Discovered):

During testing on Cloudtop environments, tests assuming default False states for client certs might fail if CLOUDSDK_CONTEXT_AWARE_USE_CLIENT_CERTIFICATE=true is set globally. This is an artifact of the testing environment assumptions, not the deduplication logic. Overriding this variable during testing (CLOUDSDK_CONTEXT_AWARE_USE_CLIENT_CERTIFICATE=false) confirms logic is sound.

References

hebaalazzeh and others added 30 commits June 15, 2026 21:27
- Add fallback to handle single-iteration quantile calculation crashes
- Normalize .pyc file paths during line counting and log exceptions
- Expose worker process stderr to facilitate debugging CalledProcessError
- Fix absolute paths in documentation.md to use relative directory paths
…g, and encodings

- Use importlib.util.source_from_cache for robust .pyc resolution
- Move importlib.util and logging imports to module level
- Refine json.loads to parse only the last line of stdout
- Specify UTF-8 encoding when opening files for writing
- Extract worker logic to _run_worker_and_parse for robust JSON parsing
- Add early validation for iterations parameter in run_master
- Add non-zero exit code check and warning for run_trace failures
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
- Execute cProfile in a fresh subprocess to ensure cold-start accuracy
- Execute tracemalloc in a fresh subprocess to capture all memory allocations
- Clean up master process exception handling for missing taskset vs crashed workers
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
- Use json.dumps to safely escape target_module string in all subprocess commands (mprofile, cprofile, trace)
- Replace detailed implementation architecture and alternative solutions with a concise high-level usage and expected output format per PR review feedback
- Read /proc/self/statm inside worker before and after import
- Exposes pure C-extension memory allocations (e.g. grpc/protobuf) missed by Python's internal tracemalloc
- Requires zero dependencies, bypassing the need for psutil
- Recursively wipes __pycache__ and .pyc files via pathlib prior to iteration loops.
- Employs Python's -B flag on worker subprocesses to prevent re-caching during the benchmark, enforcing raw .py parsing for every execution.
- Replaces --clear-pycache with an opt-out --keep-pycache flag
- The profiler now automatically sweeps and enforces disk-level cold starts without requiring explicit flags from the user.
Address PR review comment by moving setup tracking (rss_before, modules_before) outside of the time.perf_counter() window to prevent artificial latency inflation.
Addresses PR feedback to avoid broad exception catching when parsing physical python files for line counts.
Addresses PR review by explicitly documenting the expected PEP 3147/488 error path for irregular .pyc files.
…ssing modules

Addresses PR review by replacing null-checks with a cleaner try/except block that assumes normal module behavior and explicitly logs when a C-extension or built-in is skipped due to a missing __file__ attribute.
Addresses PR review by replacing the 'none' string check with a typed NO_CPU_PINNING = -1 constant for cleaner argument parsing and state tracking.
…askset

Addresses PR review by removing the fallback unpinned execution logic when taskset is not installed. The script now fails immediately with a clear error message, preventing code duplication.
Addresses PR review by collapsing the repeated statistics.quantiles logic into a reusable _calculate_percentiles helper method.
…terations

Addresses PR review by validating that 'loaded_modules' and 'loaded_lines' remain deterministic across cold-start iterations, warning the user if non-deterministic import paths are triggered.
… tracemalloc profiling

Addresses PR review by replacing the hard-to-maintain dynamic code string inside run_mprofile with a properly structured multiprocessing 'spawn' context.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces shared utilities for mTLS endpoint handling in the google-cloud-kms package and adds an import profiling toolset under scripts/import_profiler/, which includes a profiler script, documentation, and a script to rewrite __init__.py files for lazy loading. The review feedback suggests several robust improvements to the profiling scripts: converting the rglob generator to a list to prevent traversal issues during directory deletion, updating the __all__ regex parser to support lists in addition to tuples, adding boundary checks to prevent potential IndexError during multi-line import parsing, and implementing __dir__ alongside __getattr__ to support autocompletion for lazy-loaded modules.

"""Recursively deletes all __pycache__ directories and .pyc files to ensure disk-level cold starts."""
print("Sweeping directory to delete __pycache__ and force bytecode recompilation...")
count = 0
for p in pathlib.Path('.').rglob('__pycache__'):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Modifying the directory structure (deleting __pycache__ directories) while traversing it with rglob can lead to unexpected behavior or FileNotFoundError. Converting the generator to a list first ensures safe traversal.

Suggested change
for p in pathlib.Path('.').rglob('__pycache__'):
for p in list(pathlib.Path('.').rglob('__pycache__')):

content = f.read()

# Find the __all__ tuple/list
all_match = re.search(r'__all__\s*=\s*\((.*?)\)', content, re.DOTALL)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The __all__ variable in Python can be defined as either a tuple or a list. Updating the regular expression to support both parentheses and square brackets prevents parsing failures.

Suggested change
all_match = re.search(r'__all__\s*=\s*\((.*?)\)', content, re.DOTALL)
all_match = re.search(r'__all__\s*=\s*[(\[](.*?)[)\]]', content, re.DOTALL)

Comment on lines +44 to +51
if "(" in names_part and ")" not in names_part:
# multi-line import
import_names_str = names_part.replace("(", "")
i += 1
while ")" not in lines[i]:
import_names_str += lines[i]
i += 1
import_names_str += lines[i].replace(")", "")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If a multi-line import is malformed or lacks a closing parenthesis, iterating over lines without checking bounds can raise an IndexError. Adding a boundary check ensures safe parsing.

Suggested change
if "(" in names_part and ")" not in names_part:
# multi-line import
import_names_str = names_part.replace("(", "")
i += 1
while ")" not in lines[i]:
import_names_str += lines[i]
i += 1
import_names_str += lines[i].replace(")", "")
if "(" in names_part and ")" not in names_part:
# multi-line import
import_names_str = names_part.replace("(", "")
i += 1
while i < len(lines) and ")" not in lines[i]:
import_names_str += lines[i]
i += 1
if i < len(lines):
import_names_str += lines[i].replace(")", "")

Comment on lines +109 to +115
def __getattr__(name):
if name in _LAZY_IMPORTS:
module = importlib.import_module(_LAZY_IMPORTS[name], package=__name__)
attr = getattr(module, name)
globals()[name] = attr
return attr
raise AttributeError(f"module {{__name__!r}} has no attribute {{name!r}}")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When implementing lazy loading via __getattr__ (PEP 562), it is highly recommended to also define __dir__ so that interactive environments (like IPython, Jupyter, or standard REPL) and IDEs can correctly autocomplete the lazy-imported attributes.

    def __getattr__(name):
        if name in _LAZY_IMPORTS:
            module = importlib.import_module(_LAZY_IMPORTS[name], package=__name__)
            attr = getattr(module, name)
            globals()[name] = attr
            return attr
        raise AttributeError(f"module {__name__!r} has no attribute {name!r}")

    def __dir__():
        return sorted(list(globals().keys()) + list(_LAZY_IMPORTS.keys()))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant