chore: add validation for TransparentHugePage config#8015
chore: add validation for TransparentHugePage config#8015
Conversation
There was a problem hiding this comment.
Pull request overview
Adds validation for CustomLinuxOSConfig Transparent Huge Pages (THP) settings to ensure only kernel-supported values are accepted before generating Linux node bootstrapping artifacts.
Changes:
- Introduces constants for valid THP
enabledanddefragvalues and adds(*CustomLinuxOSConfig).ValidateTHPConfig()validation. - Wires THP validation into the Linux path of
GetNodeBootstrapping. - Adds unit tests for
ValidateTHPConfig().
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| pkg/agent/datamodel/types.go | Adds THP allowed-value constants and ValidateTHPConfig() to validate transparentHugePageEnabled/Defrag. |
| pkg/agent/datamodel/types_test.go | Adds table-driven tests covering valid/invalid THP configurations. |
| pkg/agent/bakerapi.go | Calls THP validation before Linux bootstrapping generation. |
|
The latest Buf updates on your PR. Results from workflow Buf CI / buf (pull_request).
|
| "never": true, | ||
| } | ||
|
|
||
| if v := osConfig.GetTransparentHugepageSupport(); v != "" && !validEnabled[strings.ToLower(v)] { |
There was a problem hiding this comment.
empty string check is redundant since validEnabled[strings.ToLower(v)] would resolve to false regardless if v is empty
There was a problem hiding this comment.
technically, if I didn't skip on empty string, then an empty string would throw the warning instead of proceeding silently. May be unnecessarily concerning to a user who intentionally left the config blank
| if config.AgentPoolProfile.IsWindows() { | ||
| validateAndSetWindowsNodeBootstrappingConfiguration(config) | ||
| } else { | ||
| config.AgentPoolProfile.CustomLinuxOSConfig.ValidateTHPConfig() |
There was a problem hiding this comment.
I might rename this to ValidateAndDefaultTHPConfig(), since this defaults the respective settings to empty strings if the input is invalid
There was a problem hiding this comment.
(this applies here and in aks-node-controller)
There was a problem hiding this comment.
I'm open to this rename, but the empty strings are also just for best practices. The original code leaves the invalid string in the config but it doesn't do anything, which causes the default to happen anyway.
| osConfig.TransparentHugepageSupport = "" | ||
| } | ||
| if v := osConfig.GetTransparentDefrag(); v != "" && !validDefrag[strings.ToLower(v)] { | ||
| log.Printf("WARNING: invalid transparent_defrag value %q, must be one of: always, defer, defer+madvise, madvise, never; resetting to default", v) |
There was a problem hiding this comment.
I'm not sure log is the correct mechanism by which we should be logging these messages out - @r2k1 @Devinwong can clarify maybe
| // ValidateTHPConfig checks that TransparentHugepageSupport and TransparentDefrag | ||
| // contain valid kernel values when set. Invalid values are logged as warnings | ||
| // and reset to empty (kernel default). | ||
| func ValidateTHPConfig(cfg *aksnodeconfigv1.Configuration) { |
There was a problem hiding this comment.
what's the actual intended behavior? if this is a customer-exposed setting, wouldn't we want RP to return a 4xx if they specify an invalid value, rather than falling back to empty for TransparentHugepageSupport and GetTransparentDefrag? for that to happen, this validation would need to happen within the RP rather than here
There was a problem hiding this comment.
From Tim:
"I'm in two minds about this. If we abort with an error then any user doing something strange in prod right now will suddenly have failing nodepool scale operations. Which could cause a sev2 or more depending on the Cx. But if we silently ignore the value, then customer's clusters might regress."
There was a problem hiding this comment.
In its current state, our code writes the invalid user string to the file, i.e. "false." The user intended "false" to disable THP, but since it's the wrong key word, it does nothing, which causes the default behavior ("always") to occur.
By clearing the user string, I'm not changing the code's existing behavior, I'm just cleaning it up a bit.
What this PR does / why we need it:
Added validation for TransparentHugePage config to ensure that legal values are used.
Which issue(s) this PR fixes:
Fixes #