L0 regularization paper (WIP)#685

Draft
juaristi22 wants to merge 3 commits into main from maria/l0-paper

Conversation

@juaristi22
Collaborator

No description provided.

@juaristi22 marked this pull request as draft April 3, 2026 05:50
Paper: "L0 regularization for subnational microsimulation calibration"
targeting the International Journal of Microsimulation.

Current state of the manuscript:
- Full paper structure: abstract, introduction, background, data,
  methodology, results, discussion, conclusion, appendix
- Formal survey calibration problem definition with GREG and IPF
  explained in depth, including benefits, drawbacks, and current
  practice in operational models (CBO, JCT, TPC, EUROMOD, TAXSIM)
- Four-stage pipeline methodology (clone, matrix, L0 optimize,
  assemble) documented against the pipeline source code
- Detailed appendix target tables populated from policy_data.db
  (37,758 targets: 33,572 district, 4,080 state, 106 national)
- All writing in US English, citations linked via plainnat/natbib

Still TODO:
- [ ] Implement IPF and GREG baselines on the same calibration matrix
      to populate the comparison table (tables/comparison.tex)
- [ ] Run calibration experiments and fill in all [TBC] placeholders
      in the results section (accuracy, sparsity, convergence, ESS)
- [ ] Generate convergence curve figure from calibration_log.csv
- [ ] Select and run a subnational policy application example
      (Section 5.5 — candidate: EITC expansion across CDs)
- [ ] Review pipeline methodology section against latest code for
      accuracy (clone-and-assign, matrix builder, assembly steps)
- [ ] Review and deepen background section: verify claims about
      GREG/IPF limitations, add any missing related work
- [ ] Resolve pre-existing overfull hbox warnings (long URLs in
      conclusion, hyperparameters table width)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Introduce a paper benchmarking scaffold that compares L0 and GREG on the same exported calibration matrix while routing IPF through a separate automatic preprocessing step that reconstructs IPF-ready unit and target inputs from the saved package metadata. The scaffold includes two R runners, manifest-driven bundle export, common scoring against the shared matrix, environment setup helpers, and end-to-end tests for the runner schemas.

@baogorek left a comment


First round of feedback!

  1. I think Author ordering is perfect
  2. Paragraphs are one long line in a text editor. Hey, we've all got text wrap. But I would recommend having an AI split up the lines very loosely, to 100-120 character length. If you notice, all my comments show up at the beginning of the paragraph, because the paragraph is the line.
  3. Other methods you may want to consider for the background:
    a) Covariate Balancing in Causal Inference
    b) Small Area Estimation
  4. We've got to figure out how to talk about states + DC and districts
  5. I'm less sure after reading that GREG and IPF are your "competitors." Let's chat.

\begin{abstract}
Tax-benefit microsimulation models typically operate at the national level, using household survey weights calibrated to aggregate population targets. Subnational analysis---at the level of states, congressional districts, or local authorities---requires datasets that simultaneously satisfy geographic distributional constraints while preserving household-level detail. We present a method based on $L_0$ regularization that jointly optimizes survey weight magnitudes and sparsity to produce calibrated subnational microsimulation datasets.

Our approach builds on the Hard Concrete distribution \citep{louizos2018}, which induces exact sparsity by multiplying each household's weight by a learned stochastic gate that collapses to a deterministic zero or one at inference time. We parameterize each gate with a log-alpha and temperature parameter, and jointly optimize these alongside log-transformed weight magnitudes using a single loss function combining scale-invariant relative calibration error, an $L_0$ sparsity penalty on the expected count of active households, and a light $L_2$ regularizer on weight magnitudes.
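The gate mechanism described here can be sketched numerically. Below is a minimal NumPy version for intuition only; the actual implementation is the PyTorch `l0-python` package, and these function names are illustrative, not the package's API:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hard_concrete_sample(log_alpha, beta=0.35, gamma=-0.1, zeta=1.1, rng=None):
    """Training-time stochastic gate: logistic noise, stretch, then clip."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(1e-6, 1 - 1e-6, size=np.shape(log_alpha))
    s = sigmoid((np.log(u) - np.log(1.0 - u) + log_alpha) / beta)
    return np.clip(s * (zeta - gamma) + gamma, 0.0, 1.0)

def hard_concrete_inference(log_alpha, gamma=-0.1, zeta=1.1):
    """Deterministic inference-time gate; extreme log_alpha gives exact 0 or 1."""
    return np.clip(sigmoid(log_alpha) * (zeta - gamma) + gamma, 0.0, 1.0)

def expected_l0(log_alpha, beta=0.35, gamma=-0.1, zeta=1.1):
    """Expected count of active gates: the quantity the L0 penalty targets."""
    return sigmoid(log_alpha - beta * np.log(-gamma / zeta)).sum()
```

In the full loss, `expected_l0` is what multiplies the sparsity coefficient, while the weight magnitudes are optimized in log space alongside `log_alpha`.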

This is good. There is one thing that we do differently than Louizos et al. that I want to talk to you about in person.


The pipeline begins with the US Current Population Survey. Each household record is cloned multiple times and assigned to random census blocks drawn from a population-weighted distribution. Program participation indicators are re-randomized per geographic assignment using local take-up rates. Each clone is then run through \policyengine{}'s tax-benefit microsimulation engine to generate geography-specific outputs. The $L_0$ optimizer selects which household-geography combinations to retain, calibrating simultaneously against approximately 37,800 targets across three geographic levels. The sparsity penalty is configurable: a higher penalty produces a compact national dataset of approximately 50,000 records, while a lower penalty yields a larger dataset of approximately 3--4 million records covering all 436 congressional districts and 50 states individually. The method is implemented as the open-source \texttt{l0-python} PyTorch package.

Paragraph feels less like an abstract and more like a body paragraph. I know it's still a WIP, but really the abstract needs to have cold, hard stats in it that show performance. (We'll get them.)


Uggh, one other thing we have to tackle: technically there are not 436 congressional districts. I was about to tell you to use 51 "state equivalents," but that's not exactly correct either. A TODO.

\section{Introduction}
\label{sec:introduction}

Microsimulation models estimate the effects of tax and benefit policies on households by applying program rules to individual-level microdata. Most operational models---including those maintained by the Congressional Budget Office \citep{cbo2018}, the Joint Committee on Taxation \citep{jct2023}, and the Tax Policy Center \citep{tpc2024}---operate at the national level. They calibrate household survey weights to aggregate administrative totals such as total income tax revenue, program enrollment counts, and demographic benchmarks, then use the reweighted dataset to simulate policy reforms.

"They calibrate" -> do we know that for sure?


Subnational policy analysis introduces a fundamentally different calibration challenge. Rather than matching a single set of national aggregates, the microdata must simultaneously reproduce distributional statistics at multiple geographic levels: congressional districts, states, and the nation as a whole. A dataset calibrated for the state of California must match California-specific IRS income totals, SNAP participation counts, Medicaid enrollment, and age distributions, while remaining consistent with national budget projections from the CBO and tax expenditure estimates from the JCT. Across 436 congressional districts and 50 states, this produces approximately 37,800 simultaneous calibration targets.

Existing calibration methods scale poorly to this setting. Iterative proportional fitting \citep[IPF;][]{deming1940, ireland1968} adjusts weights along one dimension at a time, cycling through marginal constraints until convergence. IPF handles cross-classified tables but does not naturally accommodate hierarchical geographic constraints---district targets must sum to state targets, which must sum to national targets---without ad hoc post-processing. Generalized regression (GREG) estimators \citep{deville1992, sarndal2007} solve a constrained optimization problem that minimizes distance from initial weights subject to exact calibration constraints. GREG produces a closed-form solution for moderate numbers of constraints but becomes computationally intractable and numerically unstable as the constraint count approaches the tens of thousands.
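For intuition, classical IPF (raking) on a single two-way table can be sketched in a few lines. This is a toy NumPy example, not the paper's pipeline:

```python
import numpy as np

def rake(table, row_targets, col_targets, tol=1e-10, max_iter=500):
    """Classical IPF: alternately rescale rows, then columns, until both
    sets of marginal totals are matched (or the iteration budget runs out)."""
    w = table.astype(float).copy()
    for _ in range(max_iter):
        w *= (row_targets / w.sum(axis=1))[:, None]   # match row margins
        w *= (col_targets / w.sum(axis=0))[None, :]   # match column margins
        if np.abs(w.sum(axis=1) - row_targets).max() < tol:
            break
    return w
```

Note that the row and column targets must share the same grand total, or the loop cycles without converging; inconsistencies of exactly this kind are why hierarchical district/state/national constraints need care.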

IPF is also just for categorical variables (which you do indicate with "cross-classification tables," but it may make sense to call it out), and also in classical raking there is no "almost." You either get the calibration perfect or it failed to converge. You may want to comment that this is also called "raking." There is a YouTube video called something like "generalized raking" where the presenter talks about relaxing that.

My understanding is that GREG too seeks to exactly match the targets and it's a failure if it can't (not just positive loss), but check me on that. GREG does permit quantitative, in addition to qualitative variables. But it will happily use negative weights, which is annoying.
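The linear-distance GREG closed form makes the exact-match behavior easy to check numerically. Toy data below, not the paper's calibration matrix; the choice of distant targets is deliberate so that some calibrated weights may go negative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))                  # auxiliary variables per unit
d = np.ones(n)                               # initial design weights
t = X.T @ d + np.array([30.0, -30.0, 40.0])  # targets far from the sample totals

# Linear-distance GREG: w = d * (1 + X @ lam), with lam chosen so X'w = t exactly.
lam = np.linalg.solve(X.T @ (d[:, None] * X), t - X.T @ d)
w = d * (1 + X @ lam)
```

Here `X.T @ w` equals `t` up to floating point, with no "almost"; and because nothing constrains `w` to be nonnegative, targets this far from the sample totals can push individual weights below zero, the annoyance noted above.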


"becomes computationally intractable and numerically unstable as the constraint count approaches the tens of thousands." A source would be helpful here. Is this because, as I suggested above, the procedure simply failing because it can't match the equations?


Spatial microsimulation methods take a different approach, often distinguishing between reweighting methods and synthetic reconstruction methods for constructing small-area microdata \citep{tanton2014review}. Within this broader literature, researchers have used combinatorial optimization and simulated annealing \citep{williamson1998, huang2001, harland2012} as well as deterministic reweighting \citep{tanton2011, lovelace2016}. These methods typically operate at a single geographic level and require separate calibration runs for each area, making joint multi-level calibration difficult.

So even though you have "different" in the first sentence, you've switched from classical statistics to Microsimulation literature, which could be jarring. I think flowing through the different conceptual frameworks is going to require some artistic thinking here.

\section{Methodology}
\label{sec:methodology}


Pure reader note: By the time we've gotten to methodology, there's been a lot of methodology!


\subsubsection{Block sampling}

Census blocks are the finest geographic unit in the decennial census. Each block maps deterministically to a congressional district, county, tract, and state. The sampling distribution $P_{\text{pop}}(\text{block})$ is proportional to the block's share of the national population. Drawing blocks rather than congressional districts ensures fine-grained geographic variation within districts and enables derivation of county-level variables (Section~\ref{sec:stage4}).
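The population-weighted draw can be defined directly as a multinomial over all blocks. A sketch with hypothetical block populations (the real pipeline uses decennial census counts):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical census-block populations, for illustration only.
block_pop = np.array([1200.0, 300.0, 4500.0, 800.0])
p_pop = block_pop / block_pop.sum()   # P_pop(block): one multinomial over all blocks

# Each household clone gets an independent population-weighted block draw;
# district, county, tract, and state then follow deterministically from the block.
clones = rng.choice(block_pop.size, size=10, p=p_pop)
```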

And of course, now we've got adjusted gross income in the formula


Ohhh, I see that's mentioned below. Hmm, I wonder if it's best to just show the formula all at once.

P_{AGI}(b) reads a bit strange since that would imply that AGI is the random variable. I think we want a multinomial distribution over all census blocks, and I do think it would be best to define that multinomial distribution directly.


\subsubsection{Per-state parallel simulation}

The matrix is populated by running each household through \policyengine{}'s tax-benefit microsimulation engine. Because many target variables depend on state-specific tax and benefit rules, a separate simulation is required for each state. A parallel dispatcher sends one job per unique state FIPS code to a pool of worker processes. Each worker creates a fresh \texttt{Microsimulation} instance, overwrites every household's \texttt{state\_fips} with the target state, invalidates cached downstream variables, and calculates all target variables at the household and person levels, accounting for differences in state legislation.
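The fan-out pattern can be sketched as follows. `simulate_state` is a hypothetical stand-in for the worker described above (fresh `Microsimulation`, `state_fips` overwrite, cache invalidation), and a thread pool substitutes for the process pool so the sketch stays self-contained:

```python
from concurrent.futures import ThreadPoolExecutor  # stand-in for the process pool

STATE_FIPS = ["01", "02", "04", "05", "06"]  # truncated list; one job per state

def simulate_state(fips):
    """Hypothetical worker: the real one builds a fresh Microsimulation,
    overwrites every household's state_fips with `fips`, invalidates cached
    downstream variables, and computes target variables under that state's rules."""
    return fips, {"targets_computed": True}  # placeholder result

with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(simulate_state, STATE_FIPS))
```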

ACA PTC has county specific rules, and eventually other policies are coming that are going to be at lower levels. Though admittedly, in the run I just ran, I'm only using the state level. A plug here for #598 which has been getting buried. We've got to figure out how to set the geographic level. Oh, that's right, Max had the idea for that and it's documented in that issue.


\subsubsection{Hyperparameters}

Table~\ref{tab:hyperparameters} lists the optimization hyperparameters with their values and roles. The stretch parameters $\gamma = -0.1$ and $\zeta = 1.1$ follow the original Hard Concrete paper, placing approximately 9\% of the sigmoid's mass below 0 and above 1, which is what allows clipping to produce exact zeros and ones.
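A quick Monte Carlo check (NumPy, with `log_alpha` fixed at 0) confirms that the stretch is what lets clipping produce exact zeros and ones; with $\gamma = 0$, $\zeta = 1$ no sample would land exactly on either endpoint:

```python
import numpy as np

gamma, zeta, beta = -0.1, 1.1, 0.35
rng = np.random.default_rng(0)
u = rng.uniform(1e-6, 1 - 1e-6, size=100_000)
logistic = np.log(u) - np.log(1.0 - u)               # logistic noise
s = 1.0 / (1.0 + np.exp(-logistic / beta))           # concrete sample, log_alpha = 0
z = np.clip(s * (zeta - gamma) + gamma, 0.0, 1.0)    # stretch, then clip
share_zero, share_one = (z == 0.0).mean(), (z == 1.0).mean()
```

Both shares are strictly positive: a nontrivial fraction of the stretched samples falls below 0 or above 1 and is clipped to an exact endpoint.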

Note that we've kept most of Louizos's parameters. I have been setting beta back to .67, which was what they used, but honestly I don't have a good reason, and .35 works as well. The initial probability is .999, and that's just from trial and error. I'd set it to 1 if I could (it blows up). It's intended to be the proportion of gates that are open at the start. I always saw worse performance when I started with some gates closed. If they start open, they will close (eventually) though it will take more epochs. I know this is a "trust me bro" style of argument.


\subsubsection{Unachievable targets}

Of the approximately 37,800 targets, \tbc[count] are marked unachievable (row sum zero in the calibration matrix). These correspond to congressional districts where no clones carry nonzero values for the target variable. Increasing the clone count from 430 reduces the number of unachievable targets, at the cost of a larger calibration matrix.
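Flagging unachievable targets reduces to a zero-row-sum check on the calibration matrix. A toy illustration:

```python
import numpy as np

# Toy calibration matrix: rows = targets, columns = household-geography clones.
M = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 0.0, 0.0],   # no clone contributes to this target -> unachievable
    [0.0, 3.0, 0.0],
])
unachievable = np.flatnonzero(M.sum(axis=1) == 0.0)
```

No reweighting can move an all-zero row off zero, so these targets are excluded from the loss; adding clones can only shrink this set.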

I know 37,800 is just a placeholder. This is inflated by the fact that multiple years of the same target can be in the database, etc.
