Enhance EFS mount health monitoring with real-time I/O testing and modernized CSI probes #1675

@oyiz-michael

Description


Is your feature request related to a problem?/Why is this needed
/feature

The AWS EFS CSI Driver currently lacks robust health monitoring capabilities for EFS mount points, leading to several critical production issues:

Pod Crash-Loop Problems: When EFS mounts degrade or become unavailable, pods continue to report as "healthy" through basic liveness probes, but actual I/O operations fail, causing application crashes and restart loops (as reported in issues #336, #1411, #1156).

Insufficient Health Visibility: The current health probes only check if the CSI driver process is running, not whether EFS mounts are actually functional for I/O operations. This leads to false positives where pods appear healthy but cannot access storage.

Reactive Problem Detection: Issues with EFS mounts are typically discovered after applications fail, rather than being proactively detected through proper health monitoring.

Limited Observability: There's no standardized way to monitor the health of individual EFS mounts or get detailed metrics about mount performance and reliability.

CSI Specification Gaps: The driver doesn't fully leverage modern CSI health monitoring capabilities that could provide better integration with Kubernetes health checking mechanisms.

Describe the solution you'd like in detail
I propose implementing a comprehensive Enhanced EFS Mount Health Monitoring System with the following components:

Core Health Monitoring
- Real-time mount health tracking: monitor each EFS mount point continuously at configurable intervals
- Actual I/O testing: perform real file read/write operations to verify mount functionality (not just mount-status checks)
- Timeout protection: configurable timeouts prevent health checks from hanging on unresponsive mounts
- Background processing: non-blocking health monitoring that doesn't interfere with normal CSI operations
Multiple Health Endpoints
- `/healthz` - overall driver health status
- `/healthz/ready` - readiness-probe-compatible endpoint
- `/healthz/live` - liveness-probe-compatible endpoint
- `/healthz/mounts` - detailed per-mount health information as a JSON response
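The endpoint set above could be served from a small stdlib HTTP handler inside the driver. The `MountHealth` record and handler layout below are illustrative assumptions, not an existing API; note that liveness only reflects process responsiveness, while overall health and readiness require every registered mount to pass its I/O check.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// MountHealth is a hypothetical per-mount record for this sketch.
type MountHealth struct {
	VolumeID  string    `json:"volumeId"`
	Healthy   bool      `json:"healthy"`
	LastCheck time.Time `json:"lastCheck"`
	LastError string    `json:"lastError,omitempty"`
}

// HealthServer serves the four proposed endpoints.
type HealthServer struct {
	mu     sync.RWMutex
	mounts map[string]MountHealth
}

func (s *HealthServer) allHealthy() bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	for _, m := range s.mounts {
		if !m.Healthy {
			return false
		}
	}
	return true
}

func (s *HealthServer) Handler() http.Handler {
	mux := http.NewServeMux()
	status := func(w http.ResponseWriter, healthy bool) {
		if healthy {
			fmt.Fprintln(w, "ok")
		} else {
			http.Error(w, "unhealthy mounts detected", http.StatusServiceUnavailable)
		}
	}
	// Overall health and readiness both require every mount to pass I/O checks.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) { status(w, s.allHealthy()) })
	mux.HandleFunc("/healthz/ready", func(w http.ResponseWriter, _ *http.Request) { status(w, s.allHealthy()) })
	// Liveness only reflects that the driver process is responsive.
	mux.HandleFunc("/healthz/live", func(w http.ResponseWriter, _ *http.Request) { status(w, true) })
	// Detailed per-mount state as JSON.
	mux.HandleFunc("/healthz/mounts", func(w http.ResponseWriter, _ *http.Request) {
		s.mu.RLock()
		defer s.mu.RUnlock()
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(s.mounts)
	})
	return mux
}

func main() {
	s := &HealthServer{mounts: map[string]MountHealth{
		"fs-0123:/": {VolumeID: "fs-0123", Healthy: true, LastCheck: time.Now()},
	}}
	srv := httptest.NewServer(s.Handler())
	defer srv.Close()
	resp, _ := http.Get(srv.URL + "/healthz/mounts")
	fmt.Println(resp.StatusCode)
}
```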
Observability & Metrics
- Prometheus-style metrics: export mount health, response times, and volume metrics
- Structured logging: detailed health-check results at appropriate log levels
- Health summary API: programmatic access to mount health status for monitoring tools
CSI Integration
- Enhanced Probe method: modernized implementation of the CSI Identity service's Probe RPC
- Mount registration: automatic registration/unregistration of mounts during volume operations
- Graceful lifecycle management: clean startup and shutdown of the health monitoring components
Configuration Options
- Configurable health-check interval (default: 30 seconds)
- Configurable I/O operation timeout (default: 10 seconds)
- Optional health server port
- Integration with the existing volume metrics settings

Describe alternatives you've considered
Simple Probe Configuration (as attempted in PR #1428): make only the existing probe timeouts configurable; this doesn't address the root cause of mount health issues.

External Health Monitoring: Using external tools like Prometheus node exporter or custom monitoring scripts, but this adds complexity and doesn't integrate with CSI driver lifecycle.

Application-Level Health Checks: Having applications implement their own EFS health checks, but this duplicates effort and doesn't provide driver-level visibility.

Basic Mount Point Checks: Only checking if mount points exist without I/O testing, but this provides false positives when mounts are stale or unresponsive.

Existing CSI Health Standards: Using only standard CSI health probes, but these don't provide storage-specific health validation.

Additional context
Related Issues
- Issue #336: liveness probe failure - demonstrates ongoing probe reliability problems
- Issue #1411: health should be configurable - shows community need for better health configuration
- Issue #1156: liveness probe failures after EFS version changes
- PR #1428: make health options configurable - an approved but stalled attempt at basic probe configuration

Implementation Benefits
- Prevents pod crash-loops: proactive detection of EFS mount issues before applications fail
- Production ready: comprehensive error handling, timeouts, and resource cleanup
- Cloud-native integration: full Kubernetes and Prometheus ecosystem compatibility
- Zero downtime: background health monitoring without blocking normal operations
- Enterprise observability: detailed metrics and logging for production monitoring

Labels: lifecycle/rotten (denotes an issue or PR that has aged beyond stale and will be auto-closed)
