feat: Add multi-cloud FOCUS test data generator for FinOps Hub #2006

Open
FallenHoot wants to merge 5 commits into microsoft:dev from FallenHoot:feature/multi-cloud-test-data-generator

Conversation

@FallenHoot

Add multi-cloud FOCUS test data generator for FinOps Hub

Description

Adds Generate-MultiCloudTestData.ps1 — a PowerShell script that generates synthetic, multi-cloud, FOCUS-compliant cost data for testing and validating FinOps Hub deployments end-to-end.

Closes #2005

What's Included

  • Generate-MultiCloudTestData.ps1 (~1,430 lines) — Self-contained script that generates FOCUS 1.0–1.3 synthetic cost data for Azure, AWS, GCP, and DataCenter providers

Why This Script Is Needed

Testing a FinOps Hub deployment today requires real Cost Management export data. This script fills that gap by generating realistic synthetic data that:

  1. Covers all 4 supported cloud providers with provider-specific conventions (Azure resource IDs, AWS ARNs, GCP resource paths)
  2. Populates every column referenced by FinOps Hub dashboard KQL queries
  3. Simulates real-world patterns: commitment discounts (Reservations + Savings Plans), Azure Hybrid Benefit, spot/dynamic pricing, marketplace purchases, negotiated discounts, and tag coverage variation
  4. Generates data with proper Cost Management manifest.json files for ingestion pipeline compatibility
  5. Optionally uploads to Azure Storage and manages ADF triggers

Key Features

| Feature | Details |
|---|---|
| FOCUS compliance | All mandatory + conditional FOCUS columns (v1.0–1.3) |
| Persistent identities | Resources, billing accounts, subscriptions consistent across days |
| Budget scaling | Costs scaled to target budget via Python/pandas |
| Memory-safe | Streams rows daily to CSV, avoids OOM on 500K+ row datasets |
| Output formats | Parquet (pyarrow), CSV, or both |
| Upload support | Uploads to msexports + ingestion containers with proper blob paths |

Testing

Tested with:

  • Default settings (500K rows, 6 months, all providers, $500K budget)
  • Single provider mode (Azure-only, 200K rows)
  • Full pipeline (generate → upload → ADF trigger → ADX ingestion → dashboard validation)
  • FOCUS versions 1.0, 1.2, and 1.3

Prerequisites

  • PowerShell 7+
  • Python 3 with pandas and pyarrow (for Parquet conversion)
  • Azure CLI (for upload functionality)

Checklist

  • Script follows FOCUS specification conventions
  • Microsoft copyright header included
  • Comment-based help with SYNOPSIS, DESCRIPTION, PARAMETERS, EXAMPLES
  • No hardcoded paths or environment-specific references
  • Tested with 498K+ rows successfully ingested into FinOps Hub

Collaborator

@RolandKrummenacher RolandKrummenacher left a comment

Review: Generate-MultiCloudTestData.ps1

Great concept — this fills a real gap for FinOps Hub end-to-end testing. The FOCUS column coverage and multi-cloud provider modeling are thorough. However, there are several issues to address before merging:

Critical

  • Get-Random overflow with 12-digit AWS account IDs (lines 335, 361) — will throw at runtime
  • Python dependency should be eliminated — budget scaling and Parquet output can be done in pure PowerShell, removing ~80 lines of fragile cross-language code with path-injection risk and dead code
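
One way to sidestep the overflow flagged above is to treat the AWS account ID as a 12-character string of digits rather than a number, so no value ever has to fit in an [int]. The following is a minimal sketch under that assumption; the function name New-AwsAccountIdSketch is hypothetical and is not taken from the script.

```powershell
# Hypothetical helper (not from the PR): builds a 12-digit AWS-style account ID
# one digit at a time, so Get-Random only ever sees small [int] bounds.
function New-AwsAccountIdSketch {
    # -Maximum is exclusive, so each call yields a digit from 0 through 9.
    $digits = foreach ($i in 1..12) { Get-Random -Minimum 0 -Maximum 10 }

    # Join into a string; leading zeros survive because the value is never cast to a number.
    -join $digits
}

New-AwsAccountIdSketch
```

Keeping the ID as a string also matches how it would be written to a CSV or Parquet column, where it is an identifier rather than a quantity.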

Required by repo conventions

  • Missing changelog entry (v14 section in docs-mslearn/toolkit/changelog.md)
  • Missing README.md in the test directory
  • Missing #Requires statement and .LINK in help
  • No -WhatIf/-Confirm support for destructive operations (file creation, uploads, trigger starts)
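
For the -WhatIf/-Confirm item, the usual PowerShell pattern is to declare SupportsShouldProcess and guard each state-changing call with $PSCmdlet.ShouldProcess. The sketch below illustrates the pattern on a file write only; Save-TestDataFileSketch and its parameters are hypothetical names, not functions from the script.

```powershell
# Hypothetical wrapper (not from the PR) showing the SupportsShouldProcess pattern.
function Save-TestDataFileSketch {
    [CmdletBinding(SupportsShouldProcess)]
    param(
        [Parameter(Mandatory)] [string] $Path,
        [Parameter(Mandatory)] [string] $Content
    )

    # ShouldProcess returns $false under -WhatIf, or when the user declines a -Confirm prompt.
    if ($PSCmdlet.ShouldProcess($Path, 'Write test data file')) {
        Set-Content -LiteralPath $Path -Value $Content
        return $true
    }
    return $false
}

# Dry run: reports what would happen without creating the file.
$target = Join-Path ([System.IO.Path]::GetTempPath()) 'focus-sketch.csv'
Save-TestDataFileSketch -Path $target -Content 'BilledCost,EffectiveCost' -WhatIf
```

The same guard would apply to uploads and trigger starts; each gets its own ShouldProcess call so a dry run lists every action separately.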

Recommended

  • Add Pester tests for helper functions
  • Prefer Azure AD auth over storage account keys
  • Add -Seed parameter for reproducible test data
  • FOCUS version parameter is metadata-only — either vary the schema or simplify
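
To illustrate what a -Seed parameter buys, the sketch below draws values from a seeded System.Random instance, which produces the same sequence whenever the same seed is used. Get-SeededCostsSketch is a hypothetical helper and is not how the script itself generates costs; Get-Random also accepts a -SetSeed parameter, which may be the simpler route inside an existing script.

```powershell
# Hypothetical helper (not from the PR): a seeded System.Random yields the same
# sequence for the same seed, which is what makes generated test data repeatable.
function Get-SeededCostsSketch {
    param(
        [Parameter(Mandatory)] [int] $Seed,
        [int] $Count = 5
    )
    $rng = [System.Random]::new($Seed)
    foreach ($i in 1..$Count) {
        # NextDouble() is in [0, 1); scale to a 0-100 cost and round to 2 places.
        [math]::Round($rng.NextDouble() * 100, 2)
    }
}

$first  = Get-SeededCostsSketch -Seed 42
$second = Get-SeededCostsSketch -Seed 42

# Same seed, same sequence: this comparison evaluates to True.
($first -join ',') -eq ($second -join ',')
```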

Minor

  • Inconsistent cost rounding (10 vs 2 decimal places)
  • ADF trigger names hardcoded — should be parameterized or documented

@microsoft-github-policy-service bot added the Needs: Attention 👋 label and removed the Needs: Review 👀 label on Feb 16, 2026
@RolandKrummenacher
Collaborator

FOCUS Specification Compliance Analysis

I did a detailed comparison of the script's output against the official FOCUS specification at focus.finops.org for all four claimed versions (1.0, 1.1, 1.2, 1.3). Here's what I found:


Critical Issue: $FocusVersion parameter is cosmetic only

The script accepts $FocusVersion (ValidateSet "1.0", "1.1", "1.2", "1.3") but never uses it to vary the output schema. The same columns are emitted regardless of version. The value is only written to x_FocusVersion. This means the output cannot be properly compliant with any single FOCUS version — it's a superset/subset mix.


Per-Version Column Compliance (Cost and Usage Dataset)

FOCUS v1.0 (43 columns)

  • ✅ All 43 v1.0 columns are present
  • 8 extra columns that don't exist in v1.0: BillingAccountType (v1.2), SubAccountType (v1.2), InvoiceId (v1.2), CommitmentDiscountQuantity (v1.1), CommitmentDiscountUnit (v1.1), ServiceSubcategory (v1.1), HostProviderName (v1.3), ServiceProviderName (v1.3)

FOCUS v1.1 (50 columns)

  • 4 columns missing: CapacityReservationId, CapacityReservationStatus, SkuMeter, SkuPriceDetails
  • 5 extra columns not in v1.1: BillingAccountType, SubAccountType, InvoiceId, HostProviderName, ServiceProviderName

FOCUS v1.2 (57 columns)

  • 8 columns missing: CapacityReservationId, CapacityReservationStatus, SkuMeter, SkuPriceDetails, PricingCurrency, PricingCurrencyContractedUnitPrice, PricingCurrencyEffectiveCost, PricingCurrencyListUnitPrice
  • 2 extra columns not in v1.2: HostProviderName, ServiceProviderName

FOCUS v1.3 (64+ columns)

  • 13 columns missing: AllocatedMethodDetails, AllocatedResourceId, AllocatedResourceName, AllocatedResourceType, CapacityReservationId, CapacityReservationStatus, ContractApplied, PricingCurrency, PricingCurrencyContractedUnitPrice, PricingCurrencyEffectiveCost, PricingCurrencyListUnitPrice, SkuMeter, SkuPriceDetails

Column Naming Issue

The script uses ServiceProviderName as the mandatory provider column for all versions, but the correct Column ID is ProviderName for v1.0–v1.2. ServiceProviderName only replaces ProviderName (deprecated) in v1.3. The script does output ProviderName too, but categorizes it under "FinOps Hub / Dashboard required columns" — it should be the primary mandatory column for v1.0–v1.2.


Missing v1.3 Structural Features

1. Contract Commitment Dataset (entirely absent)

FOCUS v1.3 introduced a second dataset with 13 mandatory columns (ContractId, ContractCommitmentId, ContractCommitmentCategory, ContractCommitmentCost, ContractCommitmentDescription, ContractCommitmentPeriodEnd/Start, ContractCommitmentQuantity, ContractCommitmentType, ContractCommitmentUnit, ContractPeriodEnd/Start, BillingCurrency). The script only generates Cost and Usage data — no Contract Commitment dataset is produced.

2. Data Generator-Calculated Split Cost Allocation (absent)

The 4 Allocated* columns (AllocatedMethodDetails, AllocatedResourceId, AllocatedResourceName, AllocatedResourceType) support shared cost splitting (e.g., K8s clusters, shared storage). Not implemented.

3. ContractApplied column (absent)

The JSON column that bridges Cost and Usage rows to the Contract Commitment dataset is not generated.


Summary Table

| Version | Columns Present | Columns Missing | Extra Columns | Compliant? |
|---|---|---|---|---|
| v1.0 | 43/43 | 0 | 8 | |
| v1.1 | 46/50 | 4 | 5 | |
| v1.2 | 49/57 | 8 | 2 | |
| v1.3 | 51/64+ | 13+ | 0 | |

Recommendation

Either:

  1. Target a single version (e.g., v1.0 or v1.3) and get that version fully correct, or
  2. Use $FocusVersion to dynamically select columns — only output columns valid for the chosen version and include all required columns for that version.

The closest match today is v1.3, but it's still missing conditional/recommended columns and the entire Contract Commitment dataset. For test data generation purposes, it may be acceptable to scope this to Cost and Usage only with a documented caveat, but the column set should still match the selected version.
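
Option 2 above can be sketched as a lookup from column name to the FOCUS version that introduced it, filtered by the requested version. The map below is an illustrative subset built only from the versions cited in this review (plus three v1.0 columns), and Get-FocusColumnsSketch is a hypothetical helper, not part of the script.

```powershell
# Illustrative subset only: each column maps to the FOCUS version that introduced it,
# using the versions cited in the review above. This is not the full column list.
$columnIntroducedIn = [ordered]@{
    BilledCost                 = '1.0'
    EffectiveCost              = '1.0'
    ProviderName               = '1.0'
    CommitmentDiscountQuantity = '1.1'
    CommitmentDiscountUnit     = '1.1'
    ServiceSubcategory         = '1.1'
    BillingAccountType         = '1.2'
    SubAccountType             = '1.2'
    InvoiceId                  = '1.2'
    HostProviderName           = '1.3'
    ServiceProviderName        = '1.3'
}

# Hypothetical helper (not from the PR): returns the columns valid for a given version.
function Get-FocusColumnsSketch {
    param(
        [Parameter(Mandatory)]
        [ValidateSet('1.0', '1.1', '1.2', '1.3')]
        [string] $FocusVersion,

        [Parameter(Mandatory)]
        [System.Collections.IDictionary] $ColumnMap
    )
    $target = [version]$FocusVersion
    foreach ($entry in $ColumnMap.GetEnumerator()) {
        # Keep a column only if it was introduced at or before the target version.
        if ([version]$entry.Value -le $target) { $entry.Key }
    }
}

Get-FocusColumnsSketch -FocusVersion '1.1' -ColumnMap $columnIntroducedIn
```

This handles additions only. Deprecations, such as ProviderName giving way to ServiceProviderName in v1.3, would need a second "removed in" lookup or an explicit per-version list.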

@RolandKrummenacher
Collaborator

Additional FOCUS Spec Compliance Findings

A few more items found during deeper analysis:


1. ServiceSubcategory — Invalid Values Against Spec's Closed Enumeration

The FOCUS spec (v1.1+) defines a closed list of allowed ServiceSubcategory values, each with a mandatory parent ServiceCategory. ~12 out of ~30 service entries use values that are not in the spec's allowed list:

| Line | Service | Category | Subcategory in Script | Issue |
|---|---|---|---|---|
| 169 | Storage Accounts | Storage | General Purpose v2 | Should be Object Storage or Block Storage |
| 170 | Azure Cosmos DB | Databases | NoSQL Databases | Should be NoSQL |
| 171 | Azure Data Explorer | Analytics | Data Analytics | Not in spec; closest: Log Analytics or Other (Analytics) |
| 172 | Azure App Service | Compute | App Services | Not in spec; closest: Containers or Other (Compute) |
| 173 | Azure Functions | Compute | Serverless Compute | Should be Functions |
| 174 | Azure Key Vault | Security | Key Management | Not in spec; closest: Other (Security) |
| 175 | Bandwidth | Networking | Data Transfer | Not in spec; closest: Content Delivery or Other (Networking) |
| 176 | Marketplace - 3rd Party | Compute | Marketplace | Not a valid subcategory |
| 220 | Amazon DynamoDB | Databases | NoSQL Databases | Should be NoSQL |
| 247 | Cloud Spanner | Databases | Distributed Databases | Not in spec; closest: Other (Databases) |
| 248 | Cloud Run | Compute | Serverless Containers | Should be Containers |
| 267 | Physical Servers | Compute | Bare Metal | Not in spec; closest: Other (Compute) |

Values that are correct include: Virtual Machines, Containers, Relational Databases, Object Storage, Block Storage, Content Delivery, Network Infrastructure, Data Warehouses.


2. Cost Column Invariants — Math Broken by Anomaly Rows

The FOCUS spec requires: ListCost = ListUnitPrice × PricingQuantity (and similarly for ContractedCost) when unit price and quantity are non-null and ChargeClass ≠ "Correction".

Unit prices are derived on lines 741-742:

$listUnitPrice = [math]::Round($listCost / $pricingQuantity, 10)
$contractedUnitPrice = [math]::Round($contractedCost / $pricingQuantity, 10)

But the "data quality anomaly" block on lines 751-756 mutates costs AFTER unit prices were already calculated:

if ($qualityRoll -eq 0) {
    $effectiveCost = [math]::Round($contractedCost * 1.1, 10)   # breaks EffectiveCost invariant
} elseif ($qualityRoll -eq 1) {
    $contractedCost = [math]::Round($listCost * 1.05, 10)       # breaks ContractedCost = ContractedUnitPrice × PricingQuantity
}

~2% of rows will have cost/unit-price mismatches that violate the spec's mathematical constraints. If these are intentionally anomalous test data, they should be documented as such (e.g., via x_SourceChanges), and ChargeClass should be set to "Correction" to exempt them from the spec's invariant rules.
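
A reordered version of that logic might look like the sketch below: apply every cost mutation first, derive unit prices last, and tag anomaly rows with ChargeClass "Correction". New-CostRowSketch is a hypothetical function; its parameter names mirror the variables in the excerpts above, the multipliers are copied from them, and the default of EffectiveCost equal to ContractedCost is an assumption made only for illustration.

```powershell
# Hypothetical helper (not from the PR). Assumes PricingQuantity is non-zero and
# that EffectiveCost starts out equal to ContractedCost.
function New-CostRowSketch {
    param(
        [Parameter(Mandatory)] [double] $ListCost,
        [Parameter(Mandatory)] [double] $ContractedCost,
        [Parameter(Mandatory)] [double] $PricingQuantity,
        [int] $QualityRoll = -1   # -1 means "no anomaly" in this sketch
    )
    $effectiveCost = $ContractedCost
    $chargeClass   = $null

    # 1. Apply every cost mutation first, and flag the row as a correction.
    if ($QualityRoll -eq 0) {
        $effectiveCost = [math]::Round($ContractedCost * 1.1, 10)
        $chargeClass   = 'Correction'
    } elseif ($QualityRoll -eq 1) {
        $ContractedCost = [math]::Round($ListCost * 1.05, 10)
        $chargeClass    = 'Correction'
    }

    # 2. Only then derive unit prices, so they agree with the final costs.
    [pscustomobject]@{
        ListCost            = $ListCost
        ContractedCost      = $ContractedCost
        EffectiveCost       = $effectiveCost
        PricingQuantity     = $PricingQuantity
        ListUnitPrice       = [math]::Round($ListCost / $PricingQuantity, 10)
        ContractedUnitPrice = [math]::Round($ContractedCost / $PricingQuantity, 10)
        ChargeClass         = $chargeClass
    }
}

New-CostRowSketch -ListCost 100 -ContractedCost 90 -PricingQuantity 4 -QualityRoll 1
```

Because unit prices are rounded to 10 places, unit price times quantity can still differ from the cost by a tiny rounding error, so any validation should compare within a tolerance rather than for exact equality.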


3. InvoiceId — Assigned to All Charge Categories

The script generates an InvoiceId for every row (lines 760-764), including Credit and Adjustment charges. In practice:

  • Some credits and adjustments are not tied to a specific invoice and should have InvoiceId = $null
  • ChargeClass = "Correction" rows reference a previously invoiced billing period and might carry the original invoice's ID, not a new one

This is a lower-severity finding (more about realism than strict spec violation), but worth considering for test data that claims multi-version FOCUS compliance.

FallenHoot pushed a commit to FallenHoot/finops-toolkit that referenced this pull request Feb 16, 2026
Comprehensive rewrite of Generate-MultiCloudTestData.ps1:

Critical fixes:
- Fix Get-Random [int] overflow with 12-digit AWS account IDs (New-AwsAccountId)
- Eliminate Python dependency entirely (inline budget scaling via scale factor)
- Remove dead code from Python/Parquet block

Required by repo conventions:
- Add #Requires -Version 7.0
- Add .LINK to comment-based help
- Add [CmdletBinding(SupportsShouldProcess)] with WhatIf/Confirm support
- Add changelog entry
- Add test directory README.md

FOCUS specification compliance:
- Fix ~12 ServiceSubcategory values to match FOCUS closed enumeration
- Fix cost invariants: unit prices calculated AFTER all cost modifications
- Anomaly rows now set ChargeClass=Correction (exempt from invariant rules)
- Credits/Adjustments get null InvoiceId (per FOCUS spec)
- Version-aware column sets: v1.1+ gets CommitmentDiscountQuantity/Unit,
  v1.2+ gets BillingAccountType/SubAccountType/InvoiceId,
  v1.3+ gets HostProviderName/ServiceProviderName
- Document scope as Cost and Usage dataset only

Recommended improvements:
- Add -Seed parameter for reproducible test data
- Add -UseStorageKey switch, default to Azure AD auth (--auth-mode login)
- Fix Get-RandomDecimal to use [long] instead of [int] for large ranges
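
The "inline budget scaling via scale factor" item in the commit message above can be pictured as follows: compute the ratio of the target budget to the generated total, then multiply every cost by it. This is a minimal sketch of that idea only; Invoke-BudgetScalingSketch is a hypothetical name and the script's actual implementation is not shown in this conversation.

```powershell
# Hypothetical helper (not from the PR): scales costs so their sum equals a target budget.
function Invoke-BudgetScalingSketch {
    param(
        [Parameter(Mandatory)] [double[]] $Costs,
        [Parameter(Mandatory)] [double] $TargetBudget
    )
    $total = ($Costs | Measure-Object -Sum).Sum

    # Nothing to scale if the costs sum to zero; return them unchanged.
    if ($total -eq 0) { return $Costs }

    $scaleFactor = $TargetBudget / $total
    foreach ($cost in $Costs) { $cost * $scaleFactor }
}

# Three costs summing to 100, scaled to a 500 budget.
Invoke-BudgetScalingSketch -Costs 10, 30, 60 -TargetBudget 500
```

Rounding is deliberately left to the caller here, since rounding each scaled value independently can make the total drift from the target by a few cents.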
@microsoft-github-policy-service bot added the Needs: Review 👀 label and removed the Needs: Attention 👋 label on Feb 16, 2026
Collaborator

@RolandKrummenacher RolandKrummenacher left a comment

Missing FOCUS Columns

The script is still missing several columns defined in the FOCUS specification across versions. These should either be implemented or explicitly documented as out-of-scope:

v1.1+ (4 columns)

  • CapacityReservationId — Identifier for capacity reservations
  • CapacityReservationStatus — Whether capacity reservation was used/unused
  • SkuMeter — Meter-level SKU details
  • SkuPriceDetails — JSON column with pricing metadata

v1.2+ (4 columns)

  • PricingCurrency — Currency used for pricing columns
  • PricingCurrencyContractedUnitPrice — Contracted unit price in pricing currency
  • PricingCurrencyEffectiveCost — Effective cost in pricing currency
  • PricingCurrencyListUnitPrice — List unit price in pricing currency

v1.3+ (5 columns)

  • AllocatedMethodDetails — Details about cost allocation method
  • AllocatedResourceId — Resource ID for split/allocated costs
  • AllocatedResourceName — Resource name for split/allocated costs
  • AllocatedResourceType — Resource type for split/allocated costs
  • ContractApplied — JSON column bridging Cost and Usage rows to Contract Commitment dataset

Total: 13 columns missing across versions. Without these, the output cannot be fully compliant with any FOCUS version from v1.1 onward. At minimum, please document which columns are intentionally excluded and why (e.g., Contract Commitment dataset is already noted as out of scope in the help text — the same treatment should apply to these).

@FallenHoot
Author

@RolandKrummenacher — Thank you for the thorough review! Really appreciate the detailed feedback.

Your comments were spot-on and gave us the opportunity to go back and revisit logic that was missing during a live demo. We've addressed all the review feedback in this latest push:

What changed

PR review items — all addressed:

  • AllocatedResourceType — Added as the missing FOCUS v1.3 column
  • ContractApplied — Now populated with JSON contract references for committed-discount rows (v1.3+)
  • Split cost allocation — ~10% of AKS/EKS/GKE rows now populate Allocated* columns with namespace-level allocation simulation
  • ADF trigger names: Extracted to a reusable variable (was hardcoded in 2 places)
  • Column emission documented — FOCUS Column Coverage summary now explicitly lists which columns are emitted per version

Additional improvements (discovered while re-testing):

  • Expanded README with NukeTestData section, output formats, and additional datasets documentation
  • Added NukeTestData Quick Start examples
  • Removed the .duplicate backup file that was accidentally included

Pester tests

We'll look into adding Pester unit tests (for Get-RandomDecimal, New-AwsAccountId, Get-WeightedRandomService, etc.) in a follow-up PR to keep this one focused on the generator itself.

@FallenHoot
Author

Fix pushed (61a6d0c): Resolve OutputPath to an absolute path before use. Export-Parquet is a .NET cmdlet that uses [IO.Directory]::GetCurrentDirectory(), which can differ from PowerShell's current location ($PWD). This caused 'Could not find a part of the path' errors when running from a different working directory. Fixed by calling $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath() on OutputPath after parameter binding.
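
The path fix described above relies on GetUnresolvedProviderPathFromPSPath, which turns a relative path into an absolute one based on PowerShell's current location and does not require the target to exist yet. A minimal sketch, assuming the method is reached through the automatic $ExecutionContext variable and that the current location is on the file system; Resolve-OutputPathSketch is a hypothetical name.

```powershell
# Hypothetical helper (not from the PR): resolves a possibly relative path against
# PowerShell's current location, without requiring the path to exist yet.
function Resolve-OutputPathSketch {
    param([Parameter(Mandatory)] [string] $OutputPath)
    $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($OutputPath)
}

# Returns an absolute path ending in 'test-data', rooted at the current location.
Resolve-OutputPathSketch -OutputPath './test-data'
```

Resolve-Path is not a drop-in alternative here because it fails when the directory has not been created yet.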

Zach Olinske added 5 commits February 17, 2026 08:59
Add Generate-MultiCloudTestData.ps1 that generates synthetic, multi-cloud
FOCUS-compliant cost data (v1.0-1.3) for testing FinOps Hub deployments.

Supports Azure, AWS, GCP, and DataCenter providers with realistic data
including commitment discounts, Azure Hybrid Benefit, spot pricing,
marketplace purchases, tag coverage variation, and budget scaling.

Generates up to 500K+ rows with Parquet/CSV output and optional Azure
Storage upload with ADF trigger management.

Comprehensive rewrite of Generate-MultiCloudTestData.ps1:

Critical fixes:
- Fix Get-Random [int] overflow with 12-digit AWS account IDs (New-AwsAccountId)
- Eliminate Python dependency entirely (inline budget scaling via scale factor)
- Remove dead code from Python/Parquet block

Required by repo conventions:
- Add #Requires -Version 7.0
- Add .LINK to comment-based help
- Add [CmdletBinding(SupportsShouldProcess)] with WhatIf/Confirm support
- Add changelog entry
- Add test directory README.md

FOCUS specification compliance:
- Fix ~12 ServiceSubcategory values to match FOCUS closed enumeration
- Fix cost invariants: unit prices calculated AFTER all cost modifications
- Anomaly rows now set ChargeClass=Correction (exempt from invariant rules)
- Credits/Adjustments get null InvoiceId (per FOCUS spec)
- Version-aware column sets: v1.1+ gets CommitmentDiscountQuantity/Unit,
  v1.2+ gets BillingAccountType/SubAccountType/InvoiceId,
  v1.3+ gets HostProviderName/ServiceProviderName
- Document scope as Cost and Usage dataset only

Recommended improvements:
- Add -Seed parameter for reproducible test data
- Add -UseStorageKey switch, default to Azure AD auth (--auth-mode login)
- Fix Get-RandomDecimal to use [long] instead of [int] for large ranges
- Add SuppressMessage attributes on internal New-* helper functions
  (New-AwsAccountId, New-ProviderIdentity, New-FocusRow create in-memory
  objects, not system state changes)
- Rename New-ProviderIdentities -> New-ProviderIdentity (singular noun)
- Gate AHB simulation with IncludeHybridBenefit switch (was declared
  but never checked, causing PSReviewUnusedParameter warning)
- Ran Invoke-ScriptAnalyzer: 0 errors, 0 warnings (excluding WriteHost)
- Added AllocatedResourceType column (FOCUS v1.3)
- Populated ContractApplied with JSON for committed-discount rows (v1.3+)
- Added split cost allocation simulation (~10% AKS/EKS/GKE rows)
- Extracted ADF trigger names to reusable variable
- Documented column emission per FOCUS version in summary output
- Updated README with NukeTestData section, output formats, additional datasets
- Removed .duplicate backup file
@FallenHoot FallenHoot force-pushed the feature/multi-cloud-test-data-generator branch from 61a6d0c to 4fde70e Compare February 17, 2026 08:01
Collaborator

Can we move this to src/powershell/Public/New-FinOpsTestData.ps1 so it can be published in the PS module? I'm fine with another verb name, but we should use an approved verb.

Side question: Should this be a generic script for any purpose or do we want to make it hubs-specific? I'm fine either way, but we'd follow different conventions.

Comment on lines +7 to +8
.SYNOPSIS
Generates multi-cloud FOCUS-compliant test data for FinOps Hub validation.
Collaborator

Can you look at the formatting conventions we have and apply them here as well? In this case, we indent the doc properties alongside the values.

Suggested change
.SYNOPSIS
Generates multi-cloud FOCUS-compliant test data for FinOps Hub validation.
    .SYNOPSIS
    Generates multi-cloud FOCUS-compliant test data for FinOps Hub validation.

Comment on lines +18 to +21
- Prices (Azure EA/MCA price sheet → Prices_raw → Prices_final_v1_2)
- CommitmentDiscountUsage (Reservation details → CommitmentDiscountUsage_raw)
- Recommendations (Reservation recommendations → Recommendations_raw)
- Transactions (Reservation transactions → Transactions_raw)
Collaborator

These mention hubs-specific tables. Do you want to keep this specific to hubs? If so, I'd probably change the name to New-FinOpsHubTestData. But I also see value in breaking this out to support any number of scenarios:

  • New-FinOpsTestData
  • Set-FinOpsStorageBlobContent
  • New-FinOpsExportManifest
  • Add-FinOpsHubTestData

I see these as just breaking down what you have into smaller chunks. We don't need to do this now. I'm just thinking out loud about a growth path that would be reusable for more scenarios, if/when needed.

.PARAMETER OutputPath
Directory to save generated files. Default: ./test-data

.PARAMETER CloudProvider
Collaborator

I need to double check what's in FOCUS 1.3, but I believe the best term here is ServiceProvider to account for SaaS services that we could hypothetically support in the future.

Suggested change
.PARAMETER CloudProvider
.PARAMETER ServiceProvider

.PARAMETER EndDate
End date for generated data. Default: Today

.PARAMETER TotalRowTarget
Collaborator

nit: MaxRowCount or maybe just RowCount?

ServiceProviderName = "Microsoft"
InvoiceIssuerName = "Microsoft"
HostProviderName = "Microsoft"
BillingAccountType = "Billing Profile"
Collaborator

nit: Billing Profile type doesn't match EA account agreement.

# any that are missing or empty.
# ============================================================================

function Invoke-EnsureUpdatePolicy
Collaborator

Is this needed? Have you ever seen a case where this check failed? I'd love to get to a point where this code isn't necessary.

else
{
Write-Host " Starting $trigger..." -ForegroundColor Cyan
az datafactory trigger start --factory-name $AdfName --resource-group $ResourceGroupName --name $trigger --only-show-errors 2>$null
Collaborator

Don't use the Az CLI. You're in PowerShell. Stick with Az PowerShell. Applies to all commands.
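
As a rough picture of what the Az PowerShell equivalent of the `az datafactory trigger start` call could look like, the sketch below wraps Start-AzDataFactoryV2Trigger from the Az.DataFactory module. It assumes that module is installed and that Connect-AzAccount has already been run; Start-HubTriggerSketch and the resource names are hypothetical placeholders.

```powershell
# Hypothetical wrapper (not from the PR). Assumes the Az.DataFactory module is
# installed and Connect-AzAccount has already been run.
function Start-HubTriggerSketch {
    [CmdletBinding(SupportsShouldProcess)]
    param(
        [Parameter(Mandatory)] [string] $ResourceGroupName,
        [Parameter(Mandatory)] [string] $AdfName,
        [Parameter(Mandatory)] [string[]] $TriggerName
    )
    foreach ($trigger in $TriggerName) {
        if ($PSCmdlet.ShouldProcess($trigger, 'Start ADF trigger')) {
            $startParams = @{
                ResourceGroupName = $ResourceGroupName
                DataFactoryName   = $AdfName
                Name              = $trigger
                Force             = $true   # suppress the cmdlet's own confirmation prompt
            }
            Start-AzDataFactoryV2Trigger @startParams
        }
    }
}

# Dry run with placeholder names; nothing is started under -WhatIf.
Start-HubTriggerSketch -ResourceGroupName 'my-rg' -AdfName 'my-adf' -TriggerName 'my-trigger' -WhatIf
```

Unlike the CLI call with `2>$null`, failures from the cmdlet surface as PowerShell errors that can be handled with -ErrorAction or try/catch.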

$blobPath = "$blobFolder/$dataFile"
$manifestBlobPath = "$blobFolder/manifest.json"

$manifest = @{
Collaborator

nit: Creating a manifest is an awesome capability in and of itself. I'd love to see this as a separate New-FinOpsExportManifest command.

Write-Host " 3. Start ADF triggers to process the data"
}

Write-Host ""
Collaborator

Approaching 3K lines is a bit much. I'd love to see this broken out into multiple files.

@microsoft-github-policy-service bot added the Needs: Attention 👋 label and removed the Needs: Review 👀 label on Feb 17, 2026
@flanakin added the Tool: PowerShell label on Feb 18, 2026
Labels

  • Needs: Attention 👋 (Issue or PR needs to be reviewed by the author or it will be closed due to no activity)
  • Tool: FinOps hubs (Data pipeline solution)
  • Tool: PowerShell (PowerShell scripts and automation)
  • Type: Feature 💎 (Idea to improve the product)

Development

Successfully merging this pull request may close these issues.

Feature: Multi-Cloud FOCUS Test Data Generator Script

7 participants