Skip to content

feat: Implement Statvar Calculation aggregation#589

Open
SandeepTuniki wants to merge 4 commits into
masterfrom
statvar-calculation-aggregation
Open

feat: Implement Statvar Calculation aggregation#589
SandeepTuniki wants to merge 4 commits into
masterfrom
statvar-calculation-aggregation

Conversation

@SandeepTuniki

Copy link
Copy Markdown
Contributor

No description provided.

@codacy-production

codacy-production Bot commented Jun 25, 2026

Copy link
Copy Markdown

Not up to standards ⛔

🔴 Issues 9 medium · 17 minor

Alerts:
⚠ 26 issues (≤ 0 issues of at least minor severity)

Results:
26 new issues

Category Results
Documentation 11 minor
Security 4 medium
CodeStyle 6 minor
Complexity 5 medium

View in Codacy

🟢 Metrics 39 complexity

Metric Results
Complexity 39

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the StatVarCalculationGenerator class along with its corresponding end-to-end integration tests. This generator builds and executes multi-statement SQL scripts using BigQuery Federation to perform mathematical operations (such as DIVIDE, MULTIPLY, ADD, and SUBTRACT) on statistical variables, constructing output TimeSeries and Observation rows to write back to Spanner. The review feedback highlights two critical SQL injection vulnerabilities: the multiplier value is embedded into the SQL query without validation or casting, and the regular expressions (sv_regex and mm_regex) are not escaped, which could allow malicious inputs to alter the query structure. Both issues should be addressed by casting the multiplier to a float and escaping single quotes in the regex filters.

Comment thread pipeline/workflow/ingestion-helper/aggregation/stat_var_calculation_generator.py Outdated
Comment thread pipeline/workflow/ingestion-helper/aggregation/stat_var_calculation_generator.py Outdated
@SandeepTuniki

Copy link
Copy Markdown
Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the StatVarCalculationGenerator class to perform statistical variable calculations (such as DIVIDE, MULTIPLY, ADD, and SUBTRACT) using BigQuery Federation and write the results back to Spanner, along with comprehensive integration tests. The review feedback highlights several critical issues: a major performance bottleneck caused by multiple full table scans on the Spanner Observation table, a correctness bug where the provenance column is missing from the TimeSeries export, an incorrect escaping helper used for BigQuery-only literals, and a regex escaping bug inside BigQuery raw string literals.

Comment thread pipeline/workflow/ingestion-helper/aggregation/stat_var_calculation_generator.py Outdated
Comment thread pipeline/workflow/ingestion-helper/aggregation/stat_var_calculation_generator.py Outdated
@SandeepTuniki SandeepTuniki marked this pull request as ready for review June 26, 2026 03:47

@n-h-diaz n-h-diaz left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! very cool sql :)

just want to double check: I see import_name_regex as an option to filter facets in the config - is anything needed to handle this here? or will the other facet matching take care of it? (and for my own understanding, does the calculation do a cross product of all input imports that match the facets or only within an import?)

also curious if you've run this on the existing configs and what the performance is like? (the climate calculation is huge(!), so want to make sure this is feasible. for example, if it gets very large, maybe we can do more aggressive filtering early on vs fetching the entire input import from spanner)

also since we're doing this for a subset of imports, we should ensure that all required input imports are included within the same aggregation run (not for this pr, but something to keep in mind)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants