L0 regularization paper (WIP)#685

Draft
juaristi22 wants to merge 3 commits into main from maria/l0-paper

Conversation

@juaristi22
Collaborator

No description provided.

@juaristi22 marked this pull request as draft April 3, 2026 05:50
Paper: "L0 regularization for subnational microsimulation calibration"
targeting the International Journal of Microsimulation.

Current state of the manuscript:
- Full paper structure: abstract, introduction, background, data,
  methodology, results, discussion, conclusion, appendix
- Formal survey calibration problem definition with GREG and IPF
  explained in depth, including benefits, drawbacks, and current
  practice in operational models (CBO, JCT, TPC, EUROMOD, TAXSIM)
- Four-stage pipeline methodology (clone, matrix, L0 optimize,
  assemble) documented against the pipeline source code
- Detailed appendix target tables populated from policy_data.db
  (37,758 targets: 33,572 district, 4,080 state, 106 national)
- All writing in US English, citations linked via plainnat/natbib

Still TODO:
- [ ] Implement IPF and GREG baselines on the same calibration matrix
      to populate the comparison table (tables/comparison.tex)
- [ ] Run calibration experiments and fill in all [TBC] placeholders
      in the results section (accuracy, sparsity, convergence, ESS)
- [ ] Generate convergence curve figure from calibration_log.csv
- [ ] Select and run a subnational policy application example
      (Section 5.5 — candidate: EITC expansion across CDs)
- [ ] Review pipeline methodology section against latest code for
      accuracy (clone-and-assign, matrix builder, assembly steps)
- [ ] Review and deepen background section: verify claims about
      GREG/IPF limitations, add any missing related work
- [ ] Resolve pre-existing overfull hbox warnings (long URLs in
      conclusion, hyperparameters table width)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Introduce a paper benchmarking scaffold that compares L0 and GREG on the same exported calibration matrix while routing IPF through a separate automatic preprocessing step that reconstructs IPF-ready unit and target inputs from the saved package metadata. The scaffold includes two R runners, manifest-driven bundle export, common scoring against the shared matrix, environment setup helpers, and end-to-end tests for the runner schemas.

@baogorek left a comment


First round of feedback!

  1. I think Author ordering is perfect
  2. Paragraphs are one long line in a text editor. Hey, we've all got text wrap. But I would recommend having an AI split up the lines very loosely, to 100-120 character length. If you notice, all my comments show up at the beginning of the paragraph, because the paragraph is the line.
  3. Other methods you may want to consider for the background:
    a) Covariate Balancing in Causal Inference
    b) Small Area Estimation
  4. We've got to figure out how to talk about states + DC and districts
  5. I'm less sure after reading that GREG and IPF are your "competitors." Let's chat.

\begin{abstract}
Tax-benefit microsimulation models typically operate at the national level, using household survey weights calibrated to aggregate population targets. Subnational analysis---at the level of states, congressional districts, or local authorities---requires datasets that simultaneously satisfy geographic distributional constraints while preserving household-level detail. We present a method based on $L_0$ regularization that jointly optimizes survey weight magnitudes and sparsity to produce calibrated subnational microsimulation datasets.

Our approach builds on the Hard Concrete distribution \citep{louizos2018}, which induces exact sparsity by multiplying each household's weight by a learned stochastic gate that collapses to a deterministic zero or one at inference time. We parameterize each gate with a log-alpha and temperature parameter, and jointly optimize these alongside log-transformed weight magnitudes using a single loss function combining scale-invariant relative calibration error, an $L_0$ sparsity penalty on the expected count of active households, and a light $L_2$ regularizer on weight magnitudes.
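The gate mechanism described here can be sketched numerically. Below is a minimal NumPy version for intuition only; the actual implementation is the PyTorch `l0-python` package, and these function names are illustrative, not the package's API:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hard_concrete_sample(log_alpha, beta=0.35, gamma=-0.1, zeta=1.1, rng=None):
    """Training-time stochastic gate: logistic noise, stretch, then clip."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(1e-6, 1 - 1e-6, size=np.shape(log_alpha))
    s = sigmoid((np.log(u) - np.log(1.0 - u) + log_alpha) / beta)
    return np.clip(s * (zeta - gamma) + gamma, 0.0, 1.0)

def hard_concrete_inference(log_alpha, gamma=-0.1, zeta=1.1):
    """Deterministic inference-time gate; extreme log_alpha gives exact 0 or 1."""
    return np.clip(sigmoid(log_alpha) * (zeta - gamma) + gamma, 0.0, 1.0)

def expected_l0(log_alpha, beta=0.35, gamma=-0.1, zeta=1.1):
    """Expected count of active gates: the quantity the L0 penalty targets."""
    return sigmoid(log_alpha - beta * np.log(-gamma / zeta)).sum()
```

In the full loss, `expected_l0` is what multiplies the sparsity coefficient, while the weight magnitudes are optimized in log space alongside `log_alpha`.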

This is good. There is one thing that we do differently than Louizos et al. that I want to talk to you about in person.


The pipeline begins with the US Current Population Survey. Each household record is cloned multiple times and assigned to random census blocks drawn from a population-weighted distribution. Program participation indicators are re-randomized per geographic assignment using local take-up rates. Each clone is then run through \policyengine{}'s tax-benefit microsimulation engine to generate geography-specific outputs. The $L_0$ optimizer selects which household-geography combinations to retain, calibrating simultaneously against approximately 37,800 targets across three geographic levels. The sparsity penalty is configurable: a higher penalty produces a compact national dataset of approximately 50,000 records, while a lower penalty yields a larger dataset of approximately 3--4 million records covering all 436 congressional districts and 50 states individually. The method is implemented as the open-source \texttt{l0-python} PyTorch package.

Paragraph feels less like an abstract and more like a body paragraph. I know it's still a WIP, but really the abstract needs to have cold, hard stats in it that show performance. (We'll get them.)


Uggh, one other thing we have to tackle: technically there are not 436 congressional districts. I was about to tell you to use 51 "state equivalents," but that's not exactly correct either. A TODO.

\section{Introduction}
\label{sec:introduction}

Microsimulation models estimate the effects of tax and benefit policies on households by applying program rules to individual-level microdata. Most operational models---including those maintained by the Congressional Budget Office \citep{cbo2018}, the Joint Committee on Taxation \citep{jct2023}, and the Tax Policy Center \citep{tpc2024}---operate at the national level. They calibrate household survey weights to aggregate administrative totals such as total income tax revenue, program enrollment counts, and demographic benchmarks, then use the reweighted dataset to simulate policy reforms.

"They calibrate" -> do we know that for sure?


Subnational policy analysis introduces a fundamentally different calibration challenge. Rather than matching a single set of national aggregates, the microdata must simultaneously reproduce distributional statistics at multiple geographic levels: congressional districts, states, and the nation as a whole. A dataset calibrated for the state of California must match California-specific IRS income totals, SNAP participation counts, Medicaid enrollment, and age distributions, while remaining consistent with national budget projections from the CBO and tax expenditure estimates from the JCT. Across 436 congressional districts and 50 states, this produces approximately 37,800 simultaneous calibration targets.

Existing calibration methods scale poorly to this setting. Iterative proportional fitting \citep[IPF;][]{deming1940, ireland1968} adjusts weights along one dimension at a time, cycling through marginal constraints until convergence. IPF handles cross-classified tables but does not naturally accommodate hierarchical geographic constraints---district targets must sum to state targets, which must sum to national targets---without ad hoc post-processing. Generalized regression (GREG) estimators \citep{deville1992, sarndal2007} solve a constrained optimization problem that minimizes distance from initial weights subject to exact calibration constraints. GREG produces a closed-form solution for moderate numbers of constraints but becomes computationally intractable and numerically unstable as the constraint count approaches the tens of thousands.
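For intuition, classical IPF (raking) on a single two-way table can be sketched in a few lines. This is a toy NumPy example, not the paper's pipeline:

```python
import numpy as np

def rake(table, row_targets, col_targets, tol=1e-10, max_iter=500):
    """Classical IPF: alternately rescale rows, then columns, until both
    sets of marginal totals are matched (or the iteration budget runs out)."""
    w = table.astype(float).copy()
    for _ in range(max_iter):
        w *= (row_targets / w.sum(axis=1))[:, None]   # match row margins
        w *= (col_targets / w.sum(axis=0))[None, :]   # match column margins
        if np.abs(w.sum(axis=1) - row_targets).max() < tol:
            break
    return w
```

Note that the row and column targets must share the same grand total, or the loop cycles without converging; inconsistencies of exactly this kind are why hierarchical district/state/national constraints need care.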

IPF is also just for categorical variables (which you do indicate with "cross-classification tables," but it may make sense to call it out), and also in classical raking there is no "almost." You either get the calibration perfect or it failed to converge. You may want to comment that this is also called "raking." There is a YouTube video called something like "generalized raking" where the presenter talks about relaxing that.

My understanding is that GREG too seeks to exactly match the targets and it's a failure if it can't (not just positive loss), but check me on that. GREG does permit quantitative, in addition to qualitative variables. But it will happily use negative weights, which is annoying.
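The linear-distance GREG closed form makes the exact-match behavior easy to check numerically. Toy data below, not the paper's calibration matrix; the choice of distant targets is deliberate so that some calibrated weights may go negative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))                  # auxiliary variables per unit
d = np.ones(n)                               # initial design weights
t = X.T @ d + np.array([30.0, -30.0, 40.0])  # targets far from the sample totals

# Linear-distance GREG: w = d * (1 + X @ lam), with lam chosen so X'w = t exactly.
lam = np.linalg.solve(X.T @ (d[:, None] * X), t - X.T @ d)
w = d * (1 + X @ lam)
```

Here `X.T @ w` equals `t` up to floating point, with no "almost"; and because nothing constrains `w` to be nonnegative, targets this far from the sample totals can push individual weights below zero, the annoyance noted above.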


"becomes computationally intractable and numerically unstable as the constraint count approaches the tens of thousands." A source would be helpful here. Is this because, as I suggested above, the procedure simply failing because it can't match the equations?


Spatial microsimulation methods take a different approach, often distinguishing between reweighting methods and synthetic reconstruction methods for constructing small-area microdata \citep{tanton2014review}. Within this broader literature, researchers have used combinatorial optimization and simulated annealing \citep{williamson1998, huang2001, harland2012} as well as deterministic reweighting \citep{tanton2011, lovelace2016}. These methods typically operate at a single geographic level and require separate calibration runs for each area, making joint multi-level calibration difficult.

So even though you have "different" in the first sentence, you've switched from classical statistics to Microsimulation literature, which could be jarring. I think flowing through the different conceptual frameworks is going to require some artistic thinking here.

\section{Methodology}
\label{sec:methodology}


Pure reader note: By the time we've gotten to methodology, there's been a lot of methodology!


\subsubsection{Block sampling}

Census blocks are the finest geographic unit in the decennial census. Each block maps deterministically to a congressional district, county, tract, and state. The sampling distribution $P_{\text{pop}}(\text{block})$ is proportional to the block's share of the national population. Drawing blocks rather than congressional districts ensures fine-grained geographic variation within districts and enables derivation of county-level variables (Section~\ref{sec:stage4}).
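The population-weighted draw can be defined directly as a multinomial over all blocks. A sketch with hypothetical block populations (the real pipeline uses decennial census counts):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical census-block populations, for illustration only.
block_pop = np.array([1200.0, 300.0, 4500.0, 800.0])
p_pop = block_pop / block_pop.sum()   # P_pop(block): one multinomial over all blocks

# Each household clone gets an independent population-weighted block draw;
# district, county, tract, and state then follow deterministically from the block.
clones = rng.choice(block_pop.size, size=10, p=p_pop)
```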

And of course, now we've got adjusted gross income in the formula


Ohhh, I see that's mentioned below. Hmm, I wonder if it's best to just show the formula all at once.

P_{AGI}(b) reads a bit strange since that would imply that AGI is the random variable. I think we want a multinomial distribution over all census blocks, and I do think it would be best to define that multinomial distribution directly.


\subsubsection{Per-state parallel simulation}

The matrix is populated by running each household through \policyengine{}'s tax-benefit microsimulation engine. Because many target variables depend on state-specific tax and benefit rules, a separate simulation is required for each state. A parallel dispatcher sends one job per unique state FIPS code to a pool of worker processes. Each worker creates a fresh \texttt{Microsimulation} instance, overwrites every household's \texttt{state\_fips} with the target state, invalidates cached downstream variables, and calculates all target variables at the household and person levels, accounting for differences in state legislation.
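The fan-out pattern can be sketched as follows. `simulate_state` is a hypothetical stand-in for the worker described above (fresh `Microsimulation`, `state_fips` overwrite, cache invalidation), and a thread pool substitutes for the process pool so the sketch stays self-contained:

```python
from concurrent.futures import ThreadPoolExecutor  # stand-in for the process pool

STATE_FIPS = ["01", "02", "04", "05", "06"]  # truncated list; one job per state

def simulate_state(fips):
    """Hypothetical worker: the real one builds a fresh Microsimulation,
    overwrites every household's state_fips with `fips`, invalidates cached
    downstream variables, and computes target variables under that state's rules."""
    return fips, {"targets_computed": True}  # placeholder result

with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(simulate_state, STATE_FIPS))
```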

ACA PTC has county specific rules, and eventually other policies are coming that are going to be at lower levels. Though admittedly, in the run I just ran, I'm only using the state level. A plug here for #598 which has been getting buried. We've got to figure out how to set the geographic level. Oh, that's right, Max had the idea for that and it's documented in that issue.


\subsubsection{Hyperparameters}

Table~\ref{tab:hyperparameters} lists the optimization hyperparameters with their values and roles. The stretch parameters $\gamma = -0.1$ and $\zeta = 1.1$ follow the original Hard Concrete paper, placing approximately 9\% of the sigmoid's mass below 0 and above 1, which is what allows clipping to produce exact zeros and ones.
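A quick Monte Carlo check (NumPy, with `log_alpha` fixed at 0) confirms that the stretch is what lets clipping produce exact zeros and ones; with $\gamma = 0$, $\zeta = 1$ no sample would land exactly on either endpoint:

```python
import numpy as np

gamma, zeta, beta = -0.1, 1.1, 0.35
rng = np.random.default_rng(0)
u = rng.uniform(1e-6, 1 - 1e-6, size=100_000)
logistic = np.log(u) - np.log(1.0 - u)               # logistic noise
s = 1.0 / (1.0 + np.exp(-logistic / beta))           # concrete sample, log_alpha = 0
z = np.clip(s * (zeta - gamma) + gamma, 0.0, 1.0)    # stretch, then clip
share_zero, share_one = (z == 0.0).mean(), (z == 1.0).mean()
```

Both shares are strictly positive: a nontrivial fraction of the stretched samples falls below 0 or above 1 and is clipped to an exact endpoint.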

Note that we've kept most of Louizos's parameters. I have been setting beta back to .67, which was what they used, but honestly I don't have a good reason, and .35 works as well. The initial probability is .999, and that's just from trial and error. I'd set it to 1 if I could (it blows up). It's intended to be the proportion of gates that are open at the start. I always saw worse performance when I started with some gates closed. If they start open, they will close (eventually) though it will take more epochs. I know this is a "trust me bro" style of argument.


\subsubsection{Unachievable targets}

Of the approximately 37,800 targets, \tbc[count] are marked unachievable (row sum zero in the calibration matrix). These correspond to congressional districts where no clones carry nonzero values for the target variable. Increasing the clone count from 430 reduces the number of unachievable targets, at the cost of a larger calibration matrix.
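Flagging unachievable targets reduces to a zero-row-sum check on the calibration matrix. A toy illustration:

```python
import numpy as np

# Toy calibration matrix: rows = targets, columns = household-geography clones.
M = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 0.0, 0.0],   # no clone contributes to this target -> unachievable
    [0.0, 3.0, 0.0],
])
unachievable = np.flatnonzero(M.sum(axis=1) == 0.0)
```

No reweighting can move an all-zero row off zero, so these targets are excluded from the loss; adding clones can only shrink this set.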

I know 37,800 is just a placeholder. This is inflated by the fact that multiple years of the same target can be in the database, etc.
