Implement notification pattern for sync trigger operations#11690
Implement notification pattern for sync trigger operations#11690kamperiadis wants to merge 3 commits intodevfrom
Conversation
fc05dc0 to
7886b27
Compare
7886b27 to
c0a5fc4
Compare
| public const string ContainerName = "CONTAINER_NAME"; | ||
| public const string WebsitePodName = "WEBSITE_POD_NAME"; | ||
| public const string LegionServiceHost = "LEGION_SERVICE_HOST"; | ||
| public const string TriggersSyncSignalEnabled = "TRIGGERS_SYNC_SIGNAL_ENABLED"; |
There was a problem hiding this comment.
I didn't see mention of this env setting in any design docs. When will this mode be enabled - is it tied to FDM or ATOL? Depending on your answer, I'm wondering if instead of per feature settings like this you don't need just a single var that indicates we're in that mode - that can then be used by other features you need to make conditional.
There was a problem hiding this comment.
This is not tied to FDM since this doesn't impact the deployment flow but rather the non-deployment sync trigger operations (e.g. key rotation, customer sync triggers call). This is more ATOL-related and we already have env settings for ATOL detection, but the intention for this env setting is to ensure that the platform has the implementation to support this.
There was a problem hiding this comment.
So the behavior is tied to ATOL then - I.e. on stamps with ATOL enabled, you'll be injecting this into the container context always? Then my comment still applies - shouldn't we have a single env var that indicates we're running in the ATOL mode. What ATOL detection do we have already in host? I couldn't find any.
There was a problem hiding this comment.
For ATOL detection, we need to check IsFlexConsumptionSku() && IsConsumptionOnLegion() but checking if we are in ATOL mode wouldn't really help here since I am concerned that we'll release the host with a version of Legion that might not yet have the backing changes (e.g. delays in reviews for Legion PR). And then we are left in a state where the host makes the notification assuming that the platform is handling the triggers when in reality that is not the case.
| return result; | ||
| } | ||
|
|
||
| if (_environment.IsTriggersSyncSignalEnabled()) |
There was a problem hiding this comment.
Can we move this enable/disable check to mesh client.
| { | ||
| try | ||
| { | ||
| await _meshServiceClient.NotifyTriggersChanged(); |
There was a problem hiding this comment.
I'd prefer a better decoupling between sync manager and specific environment behaviors. Can we use a general event callback and have mesh client listen to that event?
Or some other decoupling instead of having the manager know to call mesh client in these circumstances.
There was a problem hiding this comment.
I was contemplating a similar comment, but hadn't made it yet because for better or worse, IMeshServiceClient is being used by other services in a similar way. I was thinking what event approaches might look like. Another option would be a custom IFunctionsSyncManager implementation that gets used in this mode, but that'd require a class hierarchy to be added such that the new impl. inherits behavior and only overrides a protected SetTriggersAsync for example to send its notification.
There was a problem hiding this comment.
I personally favor the custom IFunctionsSyncManager implementation over the event callback approach just because in this case, there would be a single subscriber listening for this event and if there were an error we might want to return that to the caller.
To build on what you mentioned @mathewc, wouldn't we want to override all of TrySyncTriggersAsync? If we only override SetTriggersAsync, we would still be generating the triggers payload which would be essentially thrown away.
There was a problem hiding this comment.
Just refactored to create a new SignalBasedFunctionsSyncManager.cs
cb8c16e to
699e30f
Compare
There was a problem hiding this comment.
Pull request overview
Implements a new “signal/notification” flow for SyncTriggers in Flex Consumption by adding a SignalBasedFunctionsSyncManager that notifies the mesh/appserver instead of performing the legacy trigger-sync payload flow, behind an environment flag.
Changes:
- Add
SignalBasedFunctionsSyncManagerand wire it into DI whenFUNCTIONS_TRIGGERS_SYNC_SIGNAL_ENABLEDis enabled. - Extend
IMeshServiceClientwithNotifyTriggersChanged()and implement it inMeshServiceClient/NullMeshServiceClient. - Add/adjust integration tests and update release notes.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| test/WebJobs.Script.Tests.Integration/Management/SignalBasedFunctionsSyncManagerTests.cs | New tests validating the notification-based sync manager behavior |
| test/WebJobs.Script.Tests.Integration/Management/MeshServiceClientTests.cs | Adds a test asserting the mesh request for trigger-change notification |
| test/WebJobs.Script.Tests.Integration/Management/FunctionsSyncManagerTests.cs | Updates strict environment mock to include the new env var |
| src/WebJobs.Script/Environment/EnvironmentSettingNames.cs | Adds FUNCTIONS_TRIGGERS_SYNC_SIGNAL_ENABLED constant |
| src/WebJobs.Script/Environment/EnvironmentExtensions.cs | Adds IsTriggersSyncSignalEnabled() helper |
| src/WebJobs.Script.WebHost/WebHostServiceCollectionExtensions.cs | Switches IFunctionsSyncManager registration based on the new flag |
| src/WebJobs.Script.WebHost/Management/SignalBasedFunctionsSyncManager.cs | New manager that sends mesh notification for trigger updates |
| src/WebJobs.Script.WebHost/Management/NullMeshServiceClient.cs | Adds no-op NotifyTriggersChanged() implementation |
| src/WebJobs.Script.WebHost/Management/MeshServiceClient.cs | Implements NotifyTriggersChanged() by posting update-triggers-timestamp |
| src/WebJobs.Script.WebHost/Management/IMeshServiceClient.cs | Adds NotifyTriggersChanged() to the interface |
| src/WebJobs.Script.WebHost/Management/FunctionsSyncManager.cs | Makes TrySyncTriggersAsync virtual to allow overriding |
| release_notes.md | Adds an entry describing the new notification behavior |
f2a78dd to
2c5804d
Compare
9a3f87b to
b0e7aa1
Compare
|
@jviau @mathewc Thank you so much for all your feedback. I refactored the changes based on our last discussion:
Can you please take another look? Thank you so much. |
| _environment = environment; | ||
| } | ||
|
|
||
| public override async Task<TriggersOperationResult> TrySyncTriggersAsync(bool isBackgroundSync = false) |
There was a problem hiding this comment.
Is this the right method to override? There's some amount of duplication happening here and maintenance moving forward will also have to account for this derived type.
If the difference between the two implementations is purely the way the platform is notified of the change, can we be more granular about what we override (e.g., either override SetTriggersAsync or introduce a method for this purpose)?
|
|
||
| try | ||
| { | ||
| var payload = await GetSyncTriggersPayload(); |
There was a problem hiding this comment.
There's a problem with this implementation (part of that would be addressed with the suggestion I made above). GetSyncTriggersPayload is guarded in the base type, but it's called without any synchronization here. The call to that method operates on shared mutable state, and removing that guard will lead to problems. One clear example is the call chain into GetSyncTriggersPayload() → PrepareSyncTriggers() →
_secretManagerProvider.Current.ClearCache().
| { | ||
| var payload = await GetSyncTriggersPayload(); | ||
| string contentHash = ComputeContentHash(payload.Content); | ||
| await _meshServiceClient.NotifyTriggersChanged(contentHash); |
There was a problem hiding this comment.
Are we delegating change validation to the platform? The NotifyTriggersChanged method name imply a change, but the current implementation will be calling this for every instance and relying on the platform to determine whether there has been a change. Where is the platform persisting the hash for comparison?
I realize you wanted to avoid blob dependencies, but I'm concerned this is leading to additional requirements already handled by the base implementation. If you were to preserve the current behavior and simply change the notification mechanism, that would address problems like the unnecessary notifications coming out of the host.
There was a problem hiding this comment.
Thank you for your feedback @fabiocav
I have updated the class so that we only override the SetTriggersAsync method instead of overriding TrySyncTriggersAsync
There was a problem hiding this comment.
It might be wasteful that we are generating the triggers payload when we are not really using it for anything other than just sending the notification.
Question - in the design meeting, it was mentioned that we might want to send the hash to Legion for dedupe purposes. But wouldn't it be considered expensive to generate the triggers payload and calculate the hash to send it over in the notification?
There was a problem hiding this comment.
Question - in the design meeting, it was mentioned that we might want to send the hash to Legion for dedupe purposes. But wouldn't it be considered expensive to generate the triggers payload and calculate the hash to send it over in the notification?
It's unnecessary work, and it leads to other inefficiencies from the platform side as well. I don't think we should be sending this hash, given the notification semantics.
There was a problem hiding this comment.
@kamperiadis I had a chat with @fabiocav and he makes a good point to leave this for later with this recent change. We are already doing all the blob storage checks before we get into your new set triggers handler.
Hopefully it won't be hard to change the notification payload at a later date if we want to remove storage dependency in the future.
There was a problem hiding this comment.
Given how we may need more significant changes to this signal in the future for multi-revision, and we don't have a path to removing the storage dependency just yet, my original motivation for adding this is no longer applicable. So I agree we can remove it.
There was a problem hiding this comment.
Thank you @jviau and @fabiocav. There is one edge case that I thought of that would be caught by sending the hash. Let's say the customer took an action where we rely on the background sync triggers to catch the update in the triggers payload.
- Host startup triggered the background sync triggers call
- The host notified the platform that there is an update to the triggers payload by sending the notification and then we updated the blob with the new hash
- The platform received the notification, but by the time it attempted to call GET admin/host/triggers, the worker was deleted
- Because the hash was already updated in the blob, in the next worker startup, we will not reattempt to notify the platform to fetch the triggers. We will only reattempt to get the triggers payload the next time we go through this flow for a non-background-sync-triggers scenario (e.g. key rotation, customer-facing sync triggers call)
Sending the hash in the notification would mitigate this in that the platform would compare this hash with what is in the db and reattempt to get the triggers payload.
But at the same time, my understanding is that we consider the background sync triggers as a best-effort attempt to have the triggers payload updated in the db (please correct me if I'm wrong) so in that case, we might not want to always send the hash just to mitigate the scenario I described above, especially since the probability of it happening is very small. What are your thoughts on this?
There was a problem hiding this comment.
Also, I was giving it more thought, we only do the hash check when isBackgroundSync == true, so for key rotation, customer-facing sync call, etc., for the SignalBasedFunctionsSyncManager class, it is wasteful to be generating the triggers payload when we are relying on the platform to fetch it later and this payload is just thrown away. So wouldn't it be better if I did the following in SignalBasedFunctionsSyncManager:
public override async Task<TriggersOperationResult> TrySyncTriggersAsync(bool isBackgroundSync = false)
{
if (isBackgroundSync)
{
return await base.TrySyncTriggersAsync(isBackgroundSync);
}
var result = new TriggersOperationResult
{
Success = true
};
if (!IsSyncTriggersEnvironment(_webHostEnvironment, _environment))
{
result.Success = false;
result.Error = "Invalid environment for SyncTriggers operation.";
_logger.LogWarning(result.Error);
return result;
}
try
{
await _meshServiceClient.NotifyTriggersChanged();
}
catch (Exception ex)
{
string message = "Failed to notify triggers changed.";
_logger.LogError(ex, message);
result.Success = false;
result.Error = message;
}
return result;
}
protected override async Task<(bool Success, string ErrorMessage)> SetTriggersAsync(string content)
{
try
{
await _meshServiceClient.NotifyTriggersChanged();
return (true, null);
}
catch (Exception ex)
{
string message = "Failed to notify triggers changed.";
_logger.LogError(ex, message);
return (false, message);
}
}
There was a problem hiding this comment.
Basically, for the background sync triggers flow, we just follow the existing logic in FunctionsSyncManager (i.e. generating triggers payload, comparing the hash with what is in blob storage, etc.) except that for SetTriggersAsync, we will be doing the notification to the platform.
And for non-background sync triggers flows, we will just be doing the notification to the platform.
What do you think?
There was a problem hiding this comment.
it is wasteful to be generating the triggers payload when we are relying on the platform to fetch it later and this payload is just thrown away
I wouldn't index too much on this. The overall cost of this should be negligible in the grand scale of an instance's lifetime. Plus, it is on a background thread, so not impacting cold start.
If we ignore that optimization as a goal, does it affect your approach here?
0dcef3b to
c2ad17e
Compare
| using System.Threading.Tasks; | ||
| using Xunit; | ||
|
|
||
| namespace Microsoft.Azure.WebJobs.Script.Tests.Managment |
| public const string ContainerName = "CONTAINER_NAME"; | ||
| public const string WebsitePodName = "WEBSITE_POD_NAME"; | ||
| public const string LegionServiceHost = "LEGION_SERVICE_HOST"; | ||
| public const string FunctionsTriggersSyncSignalEnabled = "FUNCTIONS_TRIGGERS_SYNC_SIGNAL_ENABLED"; |
There was a problem hiding this comment.
What do you think of a setting named FUNCTIONS_SYNC_TRIGGERS_MODE. The default value would be "FrontEnd" when not specified, and for ATOL you'd use "Platform". We'd map this to a TriggerSyncMode enum and use that for parsing.
|
|
||
| // Management services | ||
| services.AddSingleton<IFunctionsSyncManager, FunctionsSyncManager>(); | ||
| if (SystemEnvironment.Instance.IsTriggersSyncSignalEnabled()) |
There was a problem hiding this comment.
If you like the mode suggestion I make below, this'd become Instance.GetTrigerSyncMode() == TriggerSyncMode.AppServer
|
|
||
| var response = await SendAsync( | ||
| [ | ||
| KeyValuePair.Create(Operation, "update-triggers-timestamp") |
There was a problem hiding this comment.
Wondering if you shouldn't generalize this operation name to something like "update-triggers". That way if for whatever reason you want to pass any additional info along on the operation it makes sense.
|
|
||
| private static bool IsNotifyTriggersChangedRequest(HttpRequestMessage request) | ||
| { | ||
| var formData = request.Content.ReadAsFormDataAsync().Result; |
There was a problem hiding this comment.
I realize the rest of the methods in this test are using the sync over async pattern, but we should clean this up and make it truly async (ask @fabiocav :))
|
|
||
| public async Task NotifyTriggersChanged() | ||
| { | ||
| _logger.LogDebug("Posting triggers changed notification to appserver."); |
There was a problem hiding this comment.
I don't think we should log "appserver" - its not really a concept that exists currently in this codebase. Instead we could just be opaque and say "Performing platform triggers changed notification."
Issue describing the changes in this PR
Implement a new notification pattern for Flex Consumption sync trigger operations
Pull request checklist
IMPORTANT: Currently, changes must be backported to the
in-procbranch to be included in Core Tools and non-Flex deployments.in-procbranch is not requiredrelease_notes.md