feat: Add monitoring and alerting on critical webhook failures (#23)#28
Conversation
…Org#23) Closes Kolo-Org#23 - Introduce ObservabilityService for critical failures - Update BotController to leverage observability for enqueue failures - Return 500 status on critical message enqueue failure to trigger WhatsApp retries - Test integration for observability flows
|
Thanks for the contribution @thebabalola, The observability improvements, structured logging, global error handlers, and additional tests are all valuable additions. I have one major concern before approving this PR. Webhook Retry BehaviorThe controller behavior has changed from returning res.sendStatus(500);The PR description states that this is intended to trigger WhatsApp retries for failed webhook processing. While I understand the reasoning, I do not see any evidence that webhook processing is idempotent. For a financial application, retries can introduce duplicate processing if the same webhook is delivered more than once. Could you please clarify:
Before merging, I would like confirmation that enabling retries cannot result in duplicate processing of financial actions. Additionally, please provide test coverage (or documentation) demonstrating the expected behavior when WhatsApp retries the same webhook multiple times. Once this concern is addressed, I'm happy to take another look. |
|
@Queenode . addressed your feedback — updated the bot controller to extract the native whatsapp message id ( |
Thank you for your contribution. |
Description
This pull request addresses issue #23 by introducing robust monitoring and alerting for critical webhook failures.
Changes Made
ObservabilityServicethat provides structured logging (info/warnings) and critical alerts ([CRITICAL ALERT]) for failures requiring immediate attention.ObservabilityServiceintoBotControllerto properly log webhook message enqueue failures.500status code on critical message enqueue failures, which ensures that WhatsApp's natural retry mechanism is triggered.BotController,MessageProcessor,WebhookIntegration, andStellarServiceto accommodate the new error-handling behavior. All 139 tests are now passing successfully.Closes #23