feat: Add multi-cloud FOCUS test data generator for FinOps Hub #2006
FallenHoot wants to merge 5 commits into microsoft:dev from
RolandKrummenacher left a comment
Review: Generate-MultiCloudTestData.ps1
Great concept — this fills a real gap for FinOps Hub end-to-end testing. The FOCUS column coverage and multi-cloud provider modeling are thorough. However, there are several issues to address before merging:
Critical
- `Get-Random` overflow with 12-digit AWS account IDs (lines 335, 361): will throw at runtime
- Python dependency should be eliminated: budget scaling and Parquet output can be done in pure PowerShell, removing ~80 lines of fragile cross-language code with path-injection risk and dead code
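For the `Get-Random` overflow, here is a minimal sketch of one way to build a 12-digit ID without ever requesting a value outside the `[int]` range. The function name `New-AwsAccountIdSketch` is hypothetical, and this is not necessarily how the script's own helper should be written; it simply composes the ID from two six-digit halves.

```powershell
# Hypothetical helper: returns a 12-digit account ID as a string.
function New-AwsAccountIdSketch {
    # Each half stays well inside the Int32 range, so no overflow can occur.
    $high = Get-Random -Minimum 100000 -Maximum 1000000   # 100000..999999 (Maximum is exclusive)
    $low  = Get-Random -Minimum 0 -Maximum 1000000        # 0..999999

    # D6 zero-pads each half to six digits, so the result is always 12 characters.
    '{0:D6}{1:D6}' -f $high, $low
}

$accountId = New-AwsAccountIdSketch
```

Returning a string rather than a number also avoids any later implicit cast back to `[int]`.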
Required by repo conventions
- Missing changelog entry (v14 section in `docs-mslearn/toolkit/changelog.md`)
- Missing README.md in the test directory
- Missing `#Requires` statement and `.LINK` in help
- No `-WhatIf`/`-Confirm` support for destructive operations (file creation, uploads, trigger starts)
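As an illustration of the `-WhatIf`/`-Confirm` point, the sketch below gates a file write behind `ShouldProcess`. The function name, parameters, and sample content are assumptions made for this example, not code from the script.

```powershell
# Hypothetical function showing the SupportsShouldProcess pattern.
function New-TestDataFileSketch {
    [CmdletBinding(SupportsShouldProcess)]
    param(
        [Parameter(Mandatory)]
        [string] $Path,

        [string] $Content = 'BilledCost,ChargeCategory'
    )

    # ShouldProcess returns $false under -WhatIf, so the write is skipped
    # and PowerShell prints a "What if:" message instead.
    if ($PSCmdlet.ShouldProcess($Path, 'Create test data file')) {
        Set-Content -Path $Path -Value $Content
    }
}

# Preview only: nothing is written to disk.
$previewPath = Join-Path ([System.IO.Path]::GetTempPath()) "focus-sketch-$([guid]::NewGuid()).csv"
New-TestDataFileSketch -Path $previewPath -WhatIf
```

The same guard would wrap uploads and trigger starts, each with its own target and action description.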
Recommended
- Add Pester tests for helper functions
- Prefer Azure AD auth over storage account keys
- Add `-Seed` parameter for reproducible test data
- FOCUS version parameter is metadata-only: either vary the schema or simplify
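One way a `-Seed` parameter could work is sketched below, assuming the generator draws all of its random values from a single `System.Random` instance. `Get-SampleCostSketch` and its parameters are hypothetical names used only for illustration.

```powershell
# Hypothetical generator: the same seed yields the same sequence of costs.
function Get-SampleCostSketch {
    param(
        [int] $Count = 5,
        [int] $Seed
    )

    # Use a seeded generator only when the caller supplied -Seed.
    $rng = if ($PSBoundParameters.ContainsKey('Seed')) {
        [System.Random]::new($Seed)
    }
    else {
        [System.Random]::new()
    }

    for ($i = 0; $i -lt $Count; $i++) {
        # Cost between 0 and 100, rounded to 2 decimal places.
        [math]::Round($rng.NextDouble() * 100, 2)
    }
}

$first  = Get-SampleCostSketch -Seed 42
$second = Get-SampleCostSketch -Seed 42
```

Passing one generator instance through the helpers, rather than mixing it with unseeded `Get-Random` calls, is what keeps the output repeatable.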
Minor
- Inconsistent cost rounding (10 vs 2 decimal places)
- ADF trigger names hardcoded — should be parameterized or documented
FOCUS Specification Compliance Analysis
I did a detailed comparison of the script's output against the official FOCUS specification at focus.finops.org for all four claimed versions (1.0, 1.1, 1.2, 1.3). Here's what I found:
Critical Issue:
| Version | Columns Present | Columns Missing | Extra Columns | Compliant? |
|---|---|---|---|---|
| v1.0 | 43/43 | 0 | 8 | ❌ |
| v1.1 | 46/50 | 4 | 5 | ❌ |
| v1.2 | 49/57 | 8 | 2 | ❌ |
| v1.3 | 51/64+ | 13+ | 0 | ❌ |
Recommendation
Either:
- Target a single version (e.g., v1.0 or v1.3) and get that version fully correct, or
- Use `$FocusVersion` to dynamically select columns: only output columns valid for the chosen version and include all required columns for that version.
The closest match today is v1.3, but it's still missing conditional/recommended columns and the entire Contract Commitment dataset. For test data generation purposes, it may be acceptable to scope this to Cost and Usage only with a documented caveat, but the column set should still match the selected version.
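The second option could look roughly like the sketch below. The column lists are a small illustrative subset, not the full FOCUS column set, and the version grouping follows the one described later in this thread; `Get-FocusColumnSketch` and the sample row are hypothetical.

```powershell
# Illustrative subset only; a real list would be taken from the FOCUS spec.
$baseColumns = @('BilledCost', 'ChargeCategory', 'ChargeClass')

# Columns first emitted at each version (assumed grouping, for illustration).
$addedInVersion = [ordered]@{
    '1.1' = @('CommitmentDiscountQuantity', 'CommitmentDiscountUnit')
    '1.2' = @('BillingAccountType', 'SubAccountType', 'InvoiceId')
    '1.3' = @('HostProviderName', 'ServiceProviderName')
}

function Get-FocusColumnSketch {
    param([Parameter(Mandatory)][string] $FocusVersion)

    $columns = [System.Collections.Generic.List[string]]::new()
    $columns.AddRange([string[]]$baseColumns)

    # Add every column group whose version is at or below the requested one.
    foreach ($entry in $addedInVersion.GetEnumerator()) {
        if ([version]$FocusVersion -ge [version]$entry.Key) {
            $columns.AddRange([string[]]$entry.Value)
        }
    }
    $columns
}

# A generated row may carry every column; the selection trims it per version.
$row = [pscustomobject]@{
    BilledCost       = 1.23
    ChargeCategory   = 'Usage'
    ChargeClass      = $null
    InvoiceId        = 'INV-001'
    HostProviderName = 'Microsoft'
}
$v10Row = $row | Select-Object -Property (Get-FocusColumnSketch -FocusVersion '1.0')
```

Keeping the version-to-column mapping in one table makes it easy to compare against the spec when a new FOCUS version is added.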
Additional FOCUS Spec Compliance Findings
A few more items found during deeper analysis:
1. ServiceSubcategory: Invalid Values Against Spec's Closed Enumeration
The FOCUS spec (v1.1+) defines a closed list of allowed
Values that are correct include:
2. Cost Column Invariants: Math Broken by Anomaly Rows
The FOCUS spec requires:
Unit prices are derived on lines 741-742:
    $listUnitPrice = [math]::Round($listCost / $pricingQuantity, 10)
    $contractedUnitPrice = [math]::Round($contractedCost / $pricingQuantity, 10)
But the "data quality anomaly" block on lines 751-756 mutates costs AFTER unit prices were already calculated:
    if ($qualityRoll -eq 0) {
        $effectiveCost = [math]::Round($contractedCost * 1.1, 10) # breaks EffectiveCost invariant
    } elseif ($qualityRoll -eq 1) {
        $contractedCost = [math]::Round($listCost * 1.05, 10) # breaks ContractedCost = ContractedUnitPrice × PricingQuantity
    }
~2% of rows will have cost/unit-price mismatches that violate the spec's mathematical constraints. If these are intentionally anomalous test data, they should be documented as such (e.g., via
3. InvoiceId: Assigned to All Charge Categories
The script generates an InvoiceId for every row (lines 760-764), including
This is a lower-severity finding (more about realism than strict spec violation), but worth considering for test data that claims multi-version FOCUS compliance.
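To make the ordering problem in finding 2 concrete, here is a small sketch that applies the cost mutation first and derives the unit price afterwards, then checks the ContractedCost = ContractedUnitPrice × PricingQuantity relationship quoted above. All input values are made up for illustration.

```powershell
# Made-up sample inputs.
$pricingQuantity = 8.0
$listCost        = 12.5
$contractedCost  = 10.0
$qualityRoll     = 1      # forces the second anomaly branch for this example

# Step 1: apply any cost mutation first.
if ($qualityRoll -eq 1) {
    $contractedCost = [math]::Round($listCost * 1.05, 10)
}

# Step 2: derive unit prices only after all cost changes are final.
$listUnitPrice       = [math]::Round($listCost / $pricingQuantity, 10)
$contractedUnitPrice = [math]::Round($contractedCost / $pricingQuantity, 10)

# Step 3: verify the invariant within a small rounding tolerance.
$delta = [math]::Abs($contractedCost - ($contractedUnitPrice * $pricingQuantity))
$invariantHolds = $delta -lt 1e-6
```

A check like step 3 could also run over a sample of generated rows as a cheap regression guard.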
Comprehensive rewrite of Generate-MultiCloudTestData.ps1:
Critical fixes:
- Fix Get-Random [int] overflow with 12-digit AWS account IDs (New-AwsAccountId)
- Eliminate Python dependency entirely (inline budget scaling via scale factor)
- Remove dead code from Python/Parquet block
Required by repo conventions:
- Add #Requires -Version 7.0
- Add .LINK to comment-based help
- Add [CmdletBinding(SupportsShouldProcess)] with WhatIf/Confirm support
- Add changelog entry
- Add test directory README.md
FOCUS specification compliance:
- Fix ~12 ServiceSubcategory values to match FOCUS closed enumeration
- Fix cost invariants: unit prices calculated AFTER all cost modifications
- Anomaly rows now set ChargeClass=Correction (exempt from invariant rules)
- Credits/Adjustments get null InvoiceId (per FOCUS spec)
- Version-aware column sets: v1.1+ gets CommitmentDiscountQuantity/Unit, v1.2+ gets BillingAccountType/SubAccountType/InvoiceId, v1.3+ gets HostProviderName/ServiceProviderName
- Document scope as Cost and Usage dataset only
Recommended improvements:
- Add -Seed parameter for reproducible test data
- Add -UseStorageKey switch, default to Azure AD auth (--auth-mode login)
- Fix Get-RandomDecimal to use [long] instead of [int] for large ranges
RolandKrummenacher left a comment
Missing FOCUS Columns
The script is still missing several columns defined in the FOCUS specification across versions. These should either be implemented or explicitly documented as out-of-scope:
v1.1+ (4 columns)
- `CapacityReservationId`: Identifier for capacity reservations
- `CapacityReservationStatus`: Whether capacity reservation was used/unused
- `SkuMeter`: Meter-level SKU details
- `SkuPriceDetails`: JSON column with pricing metadata
v1.2+ (4 columns)
- `PricingCurrency`: Currency used for pricing columns
- `PricingCurrencyContractedUnitPrice`: Contracted unit price in pricing currency
- `PricingCurrencyEffectiveCost`: Effective cost in pricing currency
- `PricingCurrencyListUnitPrice`: List unit price in pricing currency
v1.3+ (5 columns)
- `AllocatedMethodDetails`: Details about cost allocation method
- `AllocatedResourceId`: Resource ID for split/allocated costs
- `AllocatedResourceName`: Resource name for split/allocated costs
- `AllocatedResourceType`: Resource type for split/allocated costs
- `ContractApplied`: JSON column bridging Cost and Usage rows to Contract Commitment dataset
Total: 13 columns missing across versions. Without these, the output cannot be fully compliant with any FOCUS version from v1.1 onward. At minimum, please document which columns are intentionally excluded and why (e.g., Contract Commitment dataset is already noted as out of scope in the help text — the same treatment should apply to these).
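A check along these lines could report such gaps automatically. The sketch below compares a row's property names against an expected list; the expected list is just the four v1.1+ columns named above, and the sample row and function name are hypothetical.

```powershell
# Expected columns for this example: the four v1.1+ columns listed in the review.
$expectedV11 = @(
    'CapacityReservationId'
    'CapacityReservationStatus'
    'SkuMeter'
    'SkuPriceDetails'
)

# Hypothetical helper: returns the expected columns that a row does not carry.
function Get-MissingColumnSketch {
    param(
        [Parameter(Mandatory)][psobject] $Row,
        [Parameter(Mandatory)][string[]] $Expected
    )
    $present = $Row.PSObject.Properties.Name
    $Expected | Where-Object { $_ -notin $present }
}

# Hypothetical generated row that only carries one of the expected columns.
$sampleRow = [pscustomobject]@{ BilledCost = 4.2; SkuMeter = 'Compute Hours' }
$missing = Get-MissingColumnSketch -Row $sampleRow -Expected $expectedV11
```

The resulting list could feed either a test assertion or the documented list of intentionally excluded columns.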
@RolandKrummenacher, thank you for the thorough review! Really appreciate the detailed feedback. Your comments were spot-on and gave us the opportunity to go back and revisit logic that was missing during a live demo. We've addressed all the review feedback in this latest push:
What changed
PR review items (all addressed):
Additional improvements (discovered while re-testing):
Pester tests
We'll look into adding Pester unit tests (for
Fix pushed (61a6d0c): Resolve
Add Generate-MultiCloudTestData.ps1 that generates synthetic, multi-cloud FOCUS-compliant cost data (v1.0-1.3) for testing FinOps Hub deployments. Supports Azure, AWS, GCP, and DataCenter providers with realistic data including commitment discounts, Azure Hybrid Benefit, spot pricing, marketplace purchases, tag coverage variation, and budget scaling. Generates up to 500K+ rows with Parquet/CSV output and optional Azure Storage upload with ADF trigger management.
- Add SuppressMessage attributes on internal New-* helper functions (New-AwsAccountId, New-ProviderIdentity, New-FocusRow create in-memory objects, not system state changes)
- Rename New-ProviderIdentities -> New-ProviderIdentity (singular noun)
- Gate AHB simulation with IncludeHybridBenefit switch (was declared but never checked, causing PSReviewUnusedParameter warning)
- Ran Invoke-ScriptAnalyzer: 0 errors, 0 warnings (excluding WriteHost)

- Added AllocatedResourceType column (FOCUS v1.3)
- Populated ContractApplied with JSON for committed-discount rows (v1.3+)
- Added split cost allocation simulation (~10% AKS/EKS/GKE rows)
- Extracted ADF trigger names to reusable variable
- Documented column emission per FOCUS version in summary output
- Updated README with NukeTestData section, output formats, additional datasets
- Removed .duplicate backup file
61a6d0c to 4fde70e
Can we move this to src/powershell/Public/New-FinOpsTestData.ps1 so it can be published in the PS module? I'm fine with another verb name, but we should use an approved verb.
Side question: Should this be a generic script for any purpose or do we want to make it hubs-specific? I'm fine either way, but we'd follow different conventions.
    .SYNOPSIS
    Generates multi-cloud FOCUS-compliant test data for FinOps Hub validation.
Can you look at the formatting conventions we have and apply them here as well. In this case, we indent the doc properties alongside the values.
Suggested change:
    -.SYNOPSIS
    -Generates multi-cloud FOCUS-compliant test data for FinOps Hub validation.
    +.SYNOPSIS
    +Generates multi-cloud FOCUS-compliant test data for FinOps Hub validation.
    - Prices (Azure EA/MCA price sheet → Prices_raw → Prices_final_v1_2)
    - CommitmentDiscountUsage (Reservation details → CommitmentDiscountUsage_raw)
    - Recommendations (Reservation recommendations → Recommendations_raw)
    - Transactions (Reservation transactions → Transactions_raw)
These are mentioning hubs-specific tables. Do you want to keep this specific to hubs? If so, I'd probably change the name to New-FinOpsHubTestData. But I also see value in breaking this out to support any number of scenarios:
- New-FinOpsTestData
- Set-FinOpsStorageBlobContent
- New-FinOpsExportManifest
- Add-FinOpsHubTestData
I see these as just breaking down what you have into smaller chunks. We don't need to do this now. I'm just thinking out loud about a growth path that would be reusable for more scenarios, if/when needed.
    .PARAMETER OutputPath
    Directory to save generated files. Default: ./test-data

    .PARAMETER CloudProvider
I need to double check what's in FOCUS 1.3, but I believe the best term here is ServiceProvider to account for SaaS services that we could hypothetically support in the future.
Suggested change:
    -.PARAMETER CloudProvider
    +.PARAMETER ServiceProvider
    .PARAMETER EndDate
    End date for generated data. Default: Today

    .PARAMETER TotalRowTarget
nit: MaxRowCount or maybe just RowCount?
    ServiceProviderName = "Microsoft"
    InvoiceIssuerName = "Microsoft"
    HostProviderName = "Microsoft"
    BillingAccountType = "Billing Profile"
nit: Billing Profile type doesn't match EA account agreement.
    # any that are missing or empty.
    # ============================================================================

    function Invoke-EnsureUpdatePolicy
Is this needed? Have you ever seen a case where this check failed? I'd love to get to a point where this code isn't necessary.
    else
    {
        Write-Host " Starting $trigger..." -ForegroundColor Cyan
        az datafactory trigger start --factory-name $AdfName --resource-group $ResourceGroupName --name $trigger --only-show-errors 2>$null
Don't use the Az CLI. You're in PowerShell. Stick with Az PowerShell. Applies to all commands.
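For reference, a rough Az PowerShell equivalent of the trigger start could look like the sketch below. It assumes the Az.DataFactory module is installed and that `Connect-AzAccount` has already been run; the wrapper name `Start-HubTriggerSketch` and its parameters are hypothetical.

```powershell
# Hypothetical wrapper around the Az.DataFactory cmdlet.
function Start-HubTriggerSketch {
    param(
        [Parameter(Mandatory)][string] $ResourceGroupName,
        [Parameter(Mandatory)][string] $AdfName,
        [Parameter(Mandatory)][string[]] $TriggerName
    )

    foreach ($trigger in $TriggerName) {
        Write-Host "  Starting $trigger..." -ForegroundColor Cyan

        # -Force skips the interactive confirmation prompt.
        Start-AzDataFactoryV2Trigger `
            -ResourceGroupName $ResourceGroupName `
            -DataFactoryName $AdfName `
            -Name $trigger `
            -Force | Out-Null
    }
}

# Usage (placeholder values):
# Start-HubTriggerSketch -ResourceGroupName '<resource-group>' -AdfName '<factory>' -TriggerName '<trigger-name>'
```

`Stop-AzDataFactoryV2Trigger` takes the same identifying parameters, so stopping triggers before an upload would follow the same pattern.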
    $blobPath = "$blobFolder/$dataFile"
    $manifestBlobPath = "$blobFolder/manifest.json"

    $manifest = @{
nit: Creating a manifest is an awesome capability in and of itself. I'd love to see this as a separate New-FinOpsExportManifest command.
    Write-Host " 3. Start ADF triggers to process the data"
    }

    Write-Host ""
Approaching 3K lines is a bit much. I'd love to see this broken out into multiple files.
Add multi-cloud FOCUS test data generator for FinOps Hub
Description
Adds `Generate-MultiCloudTestData.ps1`, a PowerShell script that generates synthetic, multi-cloud, FOCUS-compliant cost data for testing and validating FinOps Hub deployments end-to-end.
Closes #2005
What's Included
- `Generate-MultiCloudTestData.ps1` (~1,430 lines): Self-contained script that generates FOCUS 1.0–1.3 synthetic cost data for Azure, AWS, GCP, and DataCenter providers
Why This Script Is Needed
Testing a FinOps Hub deployment today requires real Cost Management export data. This script fills that gap by generating realistic synthetic data that:
Key Features
Testing
Tested with:
Prerequisites
- `pandas` and `pyarrow` (for Parquet conversion)
Checklist