feat: Add smart retry system with rate limiting and exponential backoff #426

Merged
didiergarcia merged 23 commits into main from feat/smart-retry-system on Apr 17, 2026

@didiergarcia
Contributor

Summary

Port of the smart retry system from analytics-kotlin to analytics-swift, adding HTTP 429 rate limiting and 5xx exponential backoff capabilities to the HTTPClient.

  • Implements state machine pattern for retry decision logic
  • Adds persistent state management using Codable and UserDefaults
  • Provides configurable rate limit and backoff behavior via HttpConfig
  • Maintains backward compatibility (legacy mode when httpConfig is nil)
  • Includes comprehensive test coverage with time manipulation for deterministic chain tests

Key Components

  • RetryTypes.swift: Core enums and structs (PipelineState, RetryBehavior, UploadDecision, ResponseInfo)
  • RetryState.swift: Codable persistent state with per-batch metadata tracking
  • HttpConfig.swift: Configuration with validation/clamping for rate limit and backoff settings
  • TimeProvider.swift: Protocol for testable time (SystemTimeProvider, FakeTimeProvider)
  • RetryStateMachine.swift: Decision engine handling 200, 429, and 5xx responses
  • HTTPClient.swift: Integration with shouldUploadBatch/handleResponse
  • Storage.swift: Persistence via PropertyListEncoder/Decoder

Test Coverage

  • 21 tests passing across 6 test files
  • Unit tests for all components
  • Chain tests validating 429→429→200 and 500→500→200 sequences
  • Integration test confirming end-to-end behavior
  • Time manipulation using FakeTimeProvider for deterministic results
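
The time-manipulation approach in the last bullet can be sketched as follows. The PR names `TimeProvider`, `SystemTimeProvider`, and `FakeTimeProvider` but does not show their members, so the `now()` and `advance(by:)` methods and the `RateLimitGate` consumer below are assumptions made for illustration only.

```swift
import Foundation

// Hypothetical protocol shape: the member name `now()` is an assumption.
protocol TimeProvider {
    /// Current time as seconds since the Unix epoch.
    func now() -> TimeInterval
}

struct SystemTimeProvider: TimeProvider {
    func now() -> TimeInterval { Date().timeIntervalSince1970 }
}

final class FakeTimeProvider: TimeProvider {
    private var current: TimeInterval
    init(start: TimeInterval = 0) { current = start }
    func now() -> TimeInterval { current }
    /// Moves the fake clock forward without sleeping.
    func advance(by seconds: TimeInterval) { current += seconds }
}

// Illustrative consumer (not from the PR): a gate that reads time only
// through the injected provider, so tests control when the wait expires.
struct RateLimitGate {
    var waitUntilTime: TimeInterval?
    let time: TimeProvider

    func isRateLimited() -> Bool {
        guard let waitUntil = waitUntilTime else { return false }
        return time.now() < waitUntil
    }
}
```

Because the gate never calls `Date()` directly, a chain test such as 429→429→200 could advance the fake clock past each wait window instead of sleeping.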

Configuration Example

let config = Configuration(writeKey: "key")
    .httpConfig(HttpConfig(
        rateLimitConfig: RateLimitConfig(
            enabled: true,
            maxRetries: 5,
            useRetryAfterHeader: true,
            defaultRetryAfterSeconds: 300
        ),
        backoffConfig: BackoffConfig(
            enabled: true,
            maxRetryCount: 3,
            initialDelaySeconds: 1.0,
            maxDelaySeconds: 300.0,
            multiplier: 2.0,
            jitterFactor: 0.1,
            maxTotalBackoffDuration: 3600
        )
    ))
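
As a rough illustration of how the backoff settings above might combine, here is a minimal sketch. It assumes the delay is `initialDelaySeconds * multiplier^attempt`, capped at `maxDelaySeconds`, with symmetric jitter; the PR does not show the actual formula, and `BackoffSketch` and its methods are hypothetical names.

```swift
import Foundation

// Hypothetical mirror of some BackoffConfig fields from the example above.
struct BackoffSketch {
    var initialDelaySeconds: Double = 1.0
    var maxDelaySeconds: Double = 300.0
    var multiplier: Double = 2.0
    var jitterFactor: Double = 0.1

    /// Delay before retry number `attempt` (0-based), without jitter:
    /// initialDelaySeconds * multiplier^attempt, capped at maxDelaySeconds.
    func baseDelay(forAttempt attempt: Int) -> Double {
        let raw = initialDelaySeconds * pow(multiplier, Double(max(attempt, 0)))
        return min(raw, maxDelaySeconds)
    }

    /// Adds jitter of up to +/- jitterFactor of the base delay (assumed
    /// symmetric). `random` is injected and expected to return a value
    /// in -1...1 so tests can stay deterministic.
    func delay(forAttempt attempt: Int,
               random: () -> Double = { Double.random(in: -1...1) }) -> Double {
        let base = baseDelay(forAttempt: attempt)
        let jittered = base + base * jitterFactor * random()
        return min(max(jittered, 0), maxDelaySeconds)
    }
}
```

Under these assumptions, the example values give base delays of 1, 2, and 4 seconds for the first three attempts, each varied by up to 10% once jitter is applied.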

🤖 Generated with Claude Code

didiergarcia and others added 17 commits March 16, 2026 18:08
- Add PipelineState enum (ready, rateLimited)
- Add RetryBehavior enum (retry, drop)
- Add DropReason and UploadDecision types
- Add ResponseInfo struct

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@codecov

codecov bot commented Mar 17, 2026

Codecov Report

❌ Patch coverage is 93.81188% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.19%. Comparing base (110db3b) to head (90a62da).
⚠️ Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...rces/Segment/Utilities/Networking/HTTPClient.swift | 73.77% | 16 Missing ⚠️ |
| ...es/Segment/Utilities/Retry/RetryStateMachine.swift | 96.87% | 4 Missing ⚠️ |
| Sources/Segment/Plugins/SegmentDestination.swift | 95.45% | 3 Missing ⚠️ |
| .../Segment/Utilities/Storage/Types/MemoryStore.swift | 60.00% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #426      +/-   ##
==========================================
+ Coverage   71.20%   73.19%   +1.98%     
==========================================
  Files          49       54       +5     
  Lines        3706     4063     +357     
==========================================
+ Hits         2639     2974     +335     
- Misses       1067     1089      +22     

| Files with missing lines | Coverage Δ |
|---|---|
| Sources/Segment/Configuration.swift | 77.89% <100.00%> (+0.97%) ⬆️ |
| Sources/Segment/Utilities/Retry/HttpConfig.swift | 100.00% <100.00%> (ø) |
| Sources/Segment/Utilities/Retry/RetryState.swift | 100.00% <100.00%> (ø) |
| Sources/Segment/Utilities/Retry/RetryTypes.swift | 100.00% <100.00%> (ø) |
| Sources/Segment/Utilities/Retry/TimeProvider.swift | 100.00% <100.00%> (ø) |
| Sources/Segment/Utilities/Storage/DataStore.swift | 100.00% <ø> (ø) |
| Sources/Segment/Utilities/Storage/Storage.swift | 95.52% <100.00%> (+0.56%) ⬆️ |
| ...ources/Segment/Utilities/Storage/TransientDB.swift | 100.00% <100.00%> (ø) |
| ...gment/Utilities/Storage/Types/DirectoryStore.swift | 85.36% <100.00%> (ø) |
| .../Segment/Utilities/Storage/Types/MemoryStore.swift | 96.92% <60.00%> (-3.08%) ⬇️ |

... and 3 more

... and 2 files with indirect coverage changes

Expand test coverage to match analytics-kotlin implementation:
- Add 20 new RetryStateMachine tests (5 → 25)
- Add 10 new HttpConfig tests (3 → 13)
- Add 5 new Storage tests (2 → 7)
- Total: 52 tests (up from 17)

New test coverage:
- Status code overrides (408→RETRY, 501→DROP, etc)
- 4xx/5xx default behaviors and unknown codes
- Exponential backoff calculation verification
- Rate limit edge cases (clamps, defaults, global retry count reset)
- shouldUploadBatch drops (max retries, max duration exceeded)
- getRetryCount all scenarios (new batch, per-batch, global, max)
- Legacy mode comprehensive tests (all features disabled)
- Storage persistence edge cases (null fields, overwrites, multiple batches)

All 52 tests passing.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@didiergarcia
Contributor Author

Test Coverage Update

Expanded test coverage from 17 to 52 tests to match analytics-kotlin implementation.

Detailed Breakdown:

RetryStateMachine_Tests: 25 tests (+20)

  • ✅ Status code overrides (408→RETRY, 501→DROP)
  • ✅ 4xx/5xx default behaviors
  • ✅ Unknown code handling
  • ✅ Exponential backoff verification
  • ✅ Rate limit edge cases (clamps, defaults, global retry count reset)
  • ✅ shouldUploadBatch drops (max retries, max duration exceeded)
  • ✅ getRetryCount all scenarios (new batch, per-batch, global, max)
  • ✅ Legacy mode comprehensive tests (all features disabled)

HttpConfig_Tests: 13 tests (+10)

  • ✅ Validation and clamping (min/max bounds)
  • ✅ Status code override filtering (invalid codes removed)
  • ✅ Negative value handling
  • ✅ Automatic validation on init

Storage_RetryState_Tests: 7 tests (+5)

  • ✅ Persistence edge cases (null fields, overwrites, multiple batches)

RetryChain_Tests: 2 tests

  • ✅ 429→429→200 chain validation
  • ✅ 500→500→200 chain validation

RetryState_Tests: 4 tests
RetryTypes_Tests: 1 test

All 52 tests passing ✅
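
The status code override behavior covered by these tests (408→RETRY, 501→DROP) might look roughly like the sketch below. The `retry` and `drop` cases come from the PR's `RetryBehavior` enum; the function name, its defaults for 429, 5xx, and all other codes, and the override lookup order are assumptions, not the SDK's confirmed logic.

```swift
// Case names follow the PR; everything else here is a hypothetical sketch.
enum RetryBehavior: Equatable { case retry, drop }

/// Classifies a non-success HTTP status. An explicit override wins;
/// otherwise 429 and 5xx are retried and every other code is dropped
/// (the defaults are assumed for illustration).
func behavior(forFailureStatus status: Int,
              overrides: [Int: RetryBehavior] = [:]) -> RetryBehavior {
    if let override = overrides[status] { return override }
    switch status {
    case 429: return .retry
    case 500...599: return .retry
    default: return .drop
    }
}
```

With `overrides` set to `[408: .retry, 501: .drop]`, a 408 would be retried even though other 4xx codes are dropped, and a 501 would be dropped even though other 5xx codes are retried.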

Add 7 validation tests to guard against corrupted persisted state:

**RetryState_Tests (+5 tests):**
- testIsRateLimited_HandlesUnreasonableWaitTime: Documents infinite blocking risk when waitUntilTime is corrupted
- testExceedsMaxDuration_HandlesClockSkewGracefully: Verifies conservative behavior when firstFailureTime is in the future (clock went backwards)
- testBatchMetadata_HandlesNegativeFailureCount: Documents that negative failureCount bypasses max retry check
- testIsRateLimited_ReturnsFalseWhenWaitTimeIsNil: Verifies guard clause protects against nil waitUntilTime
- testExceedsMaxDuration_ReturnsFalseWhenFirstFailureTimeIsNil: Verifies guard clause protects against nil firstFailureTime

**Storage_RetryState_Tests (+2 tests):**
- testLoadRetryState_ReturnsDefaultsForCorruptData: Verifies PropertyListDecoder error handling returns safe defaults
- testLoadRetryState_HandlesUnreasonablePersistedValues: Documents that extreme values (Int.max, far-future timestamps) are loaded without error

These tests address potential failure modes from:
- System clock changes (NTP sync, user manual adjustment, daylight saving)
- Storage corruption (disk errors, incomplete writes)
- App updates with schema changes

Based on React Native RetryManager persistence validation patterns.

Total test count: 52 → 59 tests

All 59 tests passing ✅

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@didiergarcia
Contributor Author

Persistence Validation Tests Added

Based on React Native RetryManager review (PR #1159), added 7 validation tests to guard against corrupted persisted state.

Tests Added:

RetryState_Tests (+5):

  • ✅ Unreasonable waitUntilTime handling (infinite blocking risk)
  • ✅ Clock skew on firstFailureTime (conservative behavior when clock goes backwards)
  • ✅ Negative failureCount behavior (documents bypass of max retry check)
  • ✅ Nil waitUntilTime protection
  • ✅ Nil firstFailureTime protection

Storage_RetryState_Tests (+2):

  • ✅ Corrupt PropertyList data returns safe defaults
  • ✅ Extreme values (Int.max, far-future timestamps) load without error

Why These Tests Matter:

Clock Skew Scenarios:

  • NTP time sync corrections
  • User manual clock adjustment
  • Daylight saving transitions
  • Device battery replacement (clock reset)

Storage Corruption:

  • Disk I/O errors
  • App crash during write
  • iOS storage cleanup
  • Schema changes between app versions

Test Count: 52 → 59 tests

All 59 tests passing ✅

Commit: 2ed67f1
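
A minimal sketch of the "corrupt data returns safe defaults" behavior described above, using `PropertyListDecoder` as the PR does. `RetryStateSketch` is a trimmed-down, hypothetical stand-in for the real `RetryState`. The `sanitized` helper is purely an assumption: the PR's tests document that extreme values load without error, and do not say the SDK clamps them.

```swift
import Foundation

// Hypothetical, reduced stand-in for the PR's RetryState.
struct RetryStateSketch: Codable, Equatable {
    var globalRetryCount: Int = 0
    var waitUntilTime: TimeInterval? = nil
}

/// Decodes persisted state, falling back to defaults when the data is
/// missing or is not a valid property list for this type.
func loadRetryState(from data: Data?) -> RetryStateSketch {
    guard let data = data,
          let state = try? PropertyListDecoder().decode(RetryStateSketch.self, from: data)
    else { return RetryStateSketch() }
    return state
}

/// Optional sanity pass (an assumption, not something the PR implements):
/// clamp negative counts and cap waits implausibly far in the future.
func sanitized(_ state: RetryStateSketch,
               now: TimeInterval,
               maxWait: TimeInterval) -> RetryStateSketch {
    var result = state
    result.globalRetryCount = max(0, result.globalRetryCount)
    if let wait = result.waitUntilTime, wait - now > maxWait {
        result.waitUntilTime = now + maxWait
    }
    return result
}
```

A pass like `sanitized` is one possible way to address the infinite-blocking and negative-count risks that the tests document, at the cost of silently rewriting persisted values.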

* fix: Enable retry system for memory storage mode and add e2e retry tests

The retry state machine was only wired into the file-based upload path.
Memory mode (flushData) bypassed shouldUploadBatch, didn't send
X-Retry-Count headers, and silently retried non-retryable status codes.

SDK changes:
- Route memory uploads through shouldUploadBatch via checkBatchUpload()
- Add X-Retry-Count header to both file and data upload paths
- Drop batches on non-retryable status codes in both flushData/flushFiles
- Track dropped batches via @atomic droppedBatchCount on SegmentDestination
- Expose droppedBatchCount on Analytics for CLI/consumer use

E2E CLI changes:
- Configure HttpConfig with rate limiting + exponential backoff
- Use droppedBatchCount to detect dropped (not delivered) events
- Enable basic, retry, and settings test suites (59/59 passing)

* Replace droppedBatchCount with errorHandler for delivery failure detection

Remove the test-only droppedBatchCount property from the SDK and use the
existing errorHandler callback instead, matching the Kotlin SDK pattern.

SDK changes:
- Report errors via reportInternalError when batches are dropped (both
  in flushData's checkBatchUpload path and HTTPClient's shouldUploadBatch)
- Remove @atomic droppedBatchCount from SegmentDestination and Analytics

CLI changes:
- Add file-backed DeliveryErrorTracker with two channels: transient errors
  (cleared between retries) and permanent drops (never cleared)
- Use errorHandler with AnalyticsError pattern matching to classify errors
- Handle synchronous mode auto-flush timing where errorHandler fires
  during sendEvent, not during the explicit flush loop

* Revert e2e HTTP patch from SDK source

Remove the "E2E PATCH — DO NOT COMMIT" scheme detection that was
accidentally committed. This change is applied via the patch mechanism
in sdk-e2e-tests/patches/analytics-swift-http.patch at test time.

* feat: read httpConfig from CDN settings and fix retry enforcement (#429)

- Add custom Codable init(from:) to HttpConfig/BackoffConfig/RateLimitConfig
  to handle partial JSON from CDN (JSONDecoder requires all fields otherwise)
- Decode statusCodeOverrides from string-keyed JSON to [Int: RetryBehavior]
- Read httpConfig from integrations["Segment.io"] in SegmentDestination.update()
  and rebuild HTTPClient when CDN config arrives
- Default enabled to true for CDN-sourced configs (presence implies active)
- Enforce rateLimitConfig.maxRetryCount via globalRetryCount in
  RetryStateMachine.shouldUploadBatch()
- Add retry-settings test suite to e2e-config.json
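
The partial-JSON decoding described in this commit can be sketched with `decodeIfPresent`, which lets absent keys fall back to defaults. `BackoffConfigSketch` and `RetryBehaviorSketch` are hypothetical reductions of the real types, and the `"retry"`/`"drop"` string values and the skipping of non-numeric keys are assumptions about the CDN payload, not confirmed details.

```swift
import Foundation

// Assumed raw values; the actual CDN representation is not shown in the PR.
enum RetryBehaviorSketch: String, Codable { case retry, drop }

// Hypothetical subset of BackoffConfig.
struct BackoffConfigSketch: Codable, Equatable {
    var enabled = true            // presence of a CDN config implies active
    var maxRetryCount = 3
    var statusCodeOverrides: [Int: RetryBehaviorSketch] = [:]

    enum CodingKeys: String, CodingKey {
        case enabled, maxRetryCount, statusCodeOverrides
    }

    init() {}

    init(from decoder: Decoder) throws {
        let c = try decoder.container(keyedBy: CodingKeys.self)
        // Missing keys keep the defaults instead of failing the whole decode.
        enabled = try c.decodeIfPresent(Bool.self, forKey: .enabled) ?? true
        maxRetryCount = try c.decodeIfPresent(Int.self, forKey: .maxRetryCount) ?? 3
        // JSON object keys are strings; convert them to Int status codes and
        // skip keys that are not numeric (an assumed policy).
        let raw = try c.decodeIfPresent([String: RetryBehaviorSketch].self,
                                        forKey: .statusCodeOverrides) ?? [:]
        for (key, value) in raw {
            if let code = Int(key) { statusCodeOverrides[code] = value }
        }
    }
}
```

Decoding `{}` with this sketch yields the same value as `BackoffConfigSketch()`, which is the property that makes a partial CDN payload safe to apply.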

flushData fetched all pending events as a single batch. When a previously
failed (retryable) event accumulated alongside a new non-retryable event,
the entire batch was dropped, including events that should have been retried.

Add offset parameter to DataStore.fetch and process events in
flushAt-sized batches so each batch is independent, matching file mode
behavior where each file gets its own upload.
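
The batch isolation described above can be sketched as a loop over an offset-aware fetch. The PR only states that an offset parameter was added to `DataStore.fetch`; `EventStoreSketch`, `BatchOutcome`, and `flushInBatches` are hypothetical names, and events are modeled as plain strings for brevity.

```swift
// Hypothetical stand-in for a memory store with an offset-aware fetch.
struct EventStoreSketch {
    var events: [String]

    /// Returns up to `count` events starting at `offset`, or [] when the
    /// arguments are out of range.
    func fetch(count: Int, offset: Int) -> [String] {
        guard count > 0, offset >= 0, offset < events.count else { return [] }
        return Array(events[offset..<min(offset + count, events.count)])
    }
}

enum BatchOutcome { case delivered, dropped, retryLater }

/// Uploads in flushAt-sized batches so each batch succeeds, is dropped, or
/// is retried independently. Returns the events that remain queued.
func flushInBatches(store: EventStoreSketch,
                    flushAt: Int,
                    upload: ([String]) -> BatchOutcome) -> [String] {
    var remaining: [String] = []
    var offset = 0
    while true {
        let batch = store.fetch(count: flushAt, offset: offset)
        if batch.isEmpty { break }
        // Only batches marked for retry stay queued; a drop affects
        // that batch alone.
        if upload(batch) == .retryLater { remaining += batch }
        offset += batch.count
    }
    return remaining
}
```

With this shape, a drop decision for one batch no longer discards events in another batch that are still eligible for retry.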

- HttpConfig Codable: decode from partial JSON, empty JSON, encode/decode
  round-trip for RateLimitConfig, BackoffConfig, HttpConfig
- FlushData: CDN httpConfig parsing rebuilds HTTPClient, dropBatch on max
  retries, skipThisBatch during backoff, network error handling, 429 with
  Retry-After, retry after server error, memory-mode batch isolation

The completion-based flush { } + wait(for:) pattern deadlocks on iOS
when the main thread is blocked waiting on the expectation. Switch all
tests to flush() + RunLoop.main.run(until:) which is the established
pattern for synchronous-mode tests.
@didiergarcia didiergarcia merged commit 4d9c0df into main Apr 17, 2026
12 of 13 checks passed
@didiergarcia didiergarcia deleted the feat/smart-retry-system branch April 17, 2026 18:45