feat: add ddtrace, in-memory cache, and cache TTL to source-declarative-manifest image#932
Conversation
…g profiling support Co-Authored-By: gl_anatolii.yatsuk <gl_anatolii.yatsuk@airbyte.io>
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
👋 Greetings, Airbyte Team Member!Here are some helpful tips and reminders for your convenience. 💡 Show Tips and TricksTesting This CDK VersionYou can test this version of the CDK using the following: # Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1772724202-add-ddtrace-profiling-support#egg=airbyte-python-cdk[dev]' --help
# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1772724202-add-ddtrace-profiling-supportPR Slash CommandsAirbyte Maintainers can execute the following slash commands on your PR:
|
PyTest Results (Fast)3 918 tests +49 3 900 ✅ +43 6m 30s ⏱️ -21s Results for commit 6eee546. ± Comparison against base commit 7f41401. This pull request skips 6 tests.♻️ This comment has been updated with latest results. |
|
/prerelease
|
ddtrace v2.x profiling stack collector references _PyThread_CurrentExceptions which was removed in CPython 3.13. This causes profiling to silently fail (tracing works but profiles are never sent to Datadog). Upgrading to ddtrace>=3,<5 fixes Python 3.13 profiling support. Co-Authored-By: gl_anatolii.yatsuk <gl_anatolii.yatsuk@airbyte.io>
|
/prerelease
|
Co-Authored-By: gl_anatolii.yatsuk <gl_anatolii.yatsuk@airbyte.io>
|
/prerelease
|
…results Co-Authored-By: gl_anatolii.yatsuk <gl_anatolii.yatsuk@airbyte.io>
…gmentation Co-Authored-By: gl_anatolii.yatsuk <gl_anatolii.yatsuk@airbyte.io>
|
/prerelease
|
…emory Co-Authored-By: gl_anatolii.yatsuk <gl_anatolii.yatsuk@airbyte.io>
…te cache When set to 'true', forces requests_cache to use in-memory SQLite instead of file-based SQLite. This avoids OS page cache growth from file I/O, which Kubernetes counts as container memory (container_memory_working_set_bytes). Co-Authored-By: gl_anatolii.yatsuk <gl_anatolii.yatsuk@airbyte.io>
|
/prerelease
|
PyTest Results (Full)3 921 tests 3 903 ✅ 10m 31s ⏱️ Results for commit 6eee546. ♻️ This comment has been updated with latest results. |
Co-Authored-By: gl_anatolii.yatsuk <gl_anatolii.yatsuk@airbyte.io>
Co-Authored-By: gl_anatolii.yatsuk <gl_anatolii.yatsuk@airbyte.io>
|
/prerelease
|
requests_cache uses lazy expiration: expired entries are only removed when re-accessed, not automatically deleted. For connectors making thousands of unique API calls (paginated endpoints), expired entries accumulate in the SQLite database indefinitely, causing unbounded memory growth. This adds a _purge_expired_cache_entries() method that is called every 100 requests to actively delete expired entries and reclaim memory. Combined with the existing expire_after=3600 TTL, this ensures the cache stays bounded to approximately 1 hour of data instead of growing indefinitely. Co-Authored-By: gl_anatolii.yatsuk <gl_anatolii.yatsuk@airbyte.io>
|
/prerelease
|
Reduces expire_after from 3600 to 600 seconds. The 1-hour TTL was still allowing too much data to accumulate in the in-memory SQLite cache, causing container OOM at 2 GB. With 10-minute TTL + periodic purging, the cache should stay much smaller. Co-Authored-By: gl_anatolii.yatsuk <gl_anatolii.yatsuk@airbyte.io>
|
/prerelease |
…sponse caching Co-Authored-By: gl_anatolii.yatsuk <gl_anatolii.yatsuk@airbyte.io>
…ching for testing Co-Authored-By: gl_anatolii.yatsuk <gl_anatolii.yatsuk@airbyte.io>
Co-Authored-By: gl_anatolii.yatsuk <gl_anatolii.yatsuk@airbyte.io>
…bugging Co-Authored-By: gl_anatolii.yatsuk <gl_anatolii.yatsuk@airbyte.io>
|
/prerelease |
Summary
Adds
ddtrace(Datadog's tracing/profiling library) to thesource-declarative-manifestDocker image so that manifest-only connectors (like Twilio) can be built with Datadog memory profiling support. The library is installed but inactive at runtime unless profiling is explicitly enabled via environment variables (e.g.DD_PROFILING_ENABLED=true).Also adds an
AIRBYTE_USE_IN_MEMORY_CACHEenv var toHttpClientthat forcesrequests_cacheto use in-memory SQLite instead of file-based SQLite. This prevents OS page cache growth from file I/O, which Kubernetes counts as container memory (container_memory_working_set_bytes).Additionally sets a 1-hour TTL (
expire_after=3600) on allCachedLimiterSessioninstances to prevent unbounded cache growth during long-running syncs.This is the CDK-side prerequisite for enabling Datadog memory profiling on manifest-only connectors. A companion change in the
airbyterepo (#74308) modifies the connector Dockerfile to useddtrace-runas the entrypoint wrapper and inject the required env vars.Updates since last revision
expire_after=3600(1-hour TTL) to HTTP response cache: All cached responses now expire after 1 hour. This bounds cache growth during long syncs. Previously cached entries had no expiration and could accumulate indefinitely.Previous updates
LD_PRELOAD: jemalloc did not reduce container memory growth and broke ddtrace profiling (profiling data disappeared entirely). Removedlibjemalloc2installation andLD_PRELOADfrom the Dockerfile.AIRBYTE_USE_IN_MEMORY_CACHEenv var: Investigation usingkubernetes.memory.rssvskubernetes.memory.usagemetrics revealed that the container's high memory usage (~2 GB) was not a Python memory leak. RSS stayed flat at ~300 MB whileusage(which includes OS page cache) grew to 2 GB. The ~1.7 GB gap is kernel page cache from SQLite file I/O inrequests_cache. The new env var forces in-memory SQLite to verify this theory and potentially eliminate the page cache growth.ddtracefrom>=3,<4back to>=2.16,<3: Testing with ddtrace v3.19.6 showed heap profiling working but reporting ~70 MB for a connector using ~1 GB container memory.ddtracefrom>=3,<5to>=3,<4: ddtrace v4.5.1 heap profiler reported <3 MB due to API changes in_memalloc.heap().ddtracefrom>=2.16,<3to>=3,<5: v2.x profiling stack collector references_PyThread_CurrentExceptions, removed in CPython 3.13.Review & Testing Checklist for Human
expire_after=3600impact on parent-child stream syncs: This TTL applies globally to ALL connectors using the cache, not just Twilio. If a parent stream's cached responses are consumed by child streams and processing takes >1 hour, the parent data will be evicted mid-sync and need to be re-fetched. Verify that no critical connectors have parent streams whose data is accessed by child streams over a span exceeding 1 hour.AIRBYTE_USE_IN_MEMORY_CACHEimpact on process memory: Forcing in-memory SQLite means cached HTTP responses are held in process heap instead of on disk. For connectors with many cached parent stream requests, this could increase Python process RSS. Verify that the trade-off (lowerworking_set_bytesbut potentially higher RSS) is acceptable.file::memory:?cache=shared, which shares the cache across connections in the same process. Verify this doesn't causedatabase table is lockederrors under concurrent access (the file-based path usesfast_save=Trueand WAL mode to mitigate this)._memallocin ddtrace v2: ddtrace v2 does not officially support Python 3.13. The stack collector is disabled viaDD_PROFILING_STACK_ENABLED=falsein the companion PR, but the_memallocheap profiling C extension has not been verified on Python 3.13.kubernetes.memory.usagestays significantly lower withAIRBYTE_USE_IN_MEMORY_CACHE=truecompared to file-based cache, (3)kubernetes.memory.rssdoes not increase substantially (i.e., cached data in-memory doesn't push RSS beyond the previous ~300 MB baseline), (4) Syncs that rely on parent stream caching complete successfully even with the 1-hour TTL.Notes
AIRBYTE_USE_IN_MEMORY_CACHE=true,kubernetes.memory.usagenow matcheskubernetes.memory.rss(~300 MB), confirming that the previous ~1.7 GB gap was OS page cache from SQLite file I/O, not a Python memory leak.pyproject.tomlmanifest-server extra still declaresddtrace = { version = "^3", optional = true }— this Dockerfile explicitly overrides that to v2 for testing.