1 change: 1 addition & 0 deletions Makefile
@@ -82,6 +82,7 @@ database:
python policyengine_us_data/db/etl_age.py --year $(YEAR)
python policyengine_us_data/db/etl_medicaid.py --year $(YEAR)
python policyengine_us_data/db/etl_snap.py --year $(YEAR)
python policyengine_us_data/db/etl_tanf.py --year $(YEAR)
python policyengine_us_data/db/etl_state_income_tax.py --year $(YEAR)
python policyengine_us_data/db/etl_irs_soi.py --year $(YEAR)
python policyengine_us_data/db/etl_pregnancy.py --year $(YEAR)
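The recipe lines in this hunk run each ETL script in sequence with a shared `--year` flag. A minimal Python sketch of that ordering, assuming only the script paths visible in the hunk above; the helper name `build_etl_commands` is hypothetical and not part of the repository:

```python
import sys

# Script paths copied from the Makefile hunk above, in execution order.
ETL_SCRIPTS = [
    "policyengine_us_data/db/etl_age.py",
    "policyengine_us_data/db/etl_medicaid.py",
    "policyengine_us_data/db/etl_snap.py",
    "policyengine_us_data/db/etl_tanf.py",  # the line this PR adds
    "policyengine_us_data/db/etl_state_income_tax.py",
    "policyengine_us_data/db/etl_irs_soi.py",
    "policyengine_us_data/db/etl_pregnancy.py",
]


def build_etl_commands(year, python=sys.executable):
    """Return one argv list per ETL script (hypothetical helper)."""
    return [[python, script, "--year", str(year)] for script in ETL_SCRIPTS]
```

Each list could be passed to `subprocess.run(cmd, check=True)` so that a failing step stops the sequence, which is also how `make` behaves by default when a recipe line exits non-zero.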
15 changes: 15 additions & 0 deletions README.md
@@ -14,6 +14,21 @@ which installs the development dependencies in a reference-only manner (so that
to the package code will be reflected immediately); `policyengine-us-data` is a dev package
and not intended for direct access.

## Pull Requests

PRs must come from branches pushed to `PolicyEngine/policyengine-us-data`, not from
personal forks. The PR workflow hard-fails fork-based PRs before the real test suite
runs because the required secrets are unavailable there.

Before opening a PR, push the current branch to the upstream repo:

```bash
make push-pr-branch
```

That target pushes the current branch to the `upstream` remote and sets tracking so
`gh pr create` opens the PR from `PolicyEngine/policyengine-us-data`.
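
The `push-pr-branch` target itself is not part of this diff. As a rough illustration of the behavior described above, and assuming the target wraps a plain `git push` with upstream tracking, the equivalent commands could be assembled like this (`push_branch_command` is a hypothetical helper, not repository code):

```python
# Standard git invocation that prints the name of the checked-out branch.
CURRENT_BRANCH_COMMAND = ["git", "rev-parse", "--abbrev-ref", "HEAD"]


def push_branch_command(branch, remote="upstream"):
    """Build the git argv that pushes `branch` to `remote` and sets tracking.

    Assumption: the Makefile target does something equivalent; its actual
    recipe is not shown in this PR.
    """
    return ["git", "push", "--set-upstream", remote, branch]
```

Returning argument lists rather than running the commands keeps the sketch free of side effects; a caller could hand them to `subprocess.run(..., check=True)`.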

## SSA Data Sources

The following SSA data sources are used in this project:
1 change: 1 addition & 0 deletions changelog.d/codex-tanf-state-targets.added.md
@@ -0,0 +1 @@
Added HHS ACF TANF caseload and cash-assistance ETL targets, exposed baseline CPS liquid-asset inputs, and aligned TANF calibration totals to FY2024 administrative data.
8 changes: 8 additions & 0 deletions policyengine_us_data/calibration/target_config.yaml
@@ -51,8 +51,13 @@ include:
# REMOVED: is_pregnant — 100% unachievable across all 51 state geos
- variable: snap
geo_level: state
- variable: tanf
geo_level: state
- variable: adjusted_gross_income
geo_level: state
- variable: spm_unit_count
geo_level: state
domain_variable: tanf

# === STATE — fine AGI bracket targets (stubs 9/10 from in55cmcsv) ===
- variable: person_count
@@ -127,6 +132,9 @@ include:
geo_level: national
- variable: tanf
geo_level: national
- variable: spm_unit_count
geo_level: national
domain_variable: tanf
- variable: tip_income
geo_level: national
- variable: unemployment_compensation
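Each `include` entry names a `variable` and a `geo_level`, and the new `spm_unit_count` entries also set a `domain_variable`. How the calibration code consumes these entries is not shown in this diff. The sketch below assumes a simple field-by-field match, where an entry selects a target when every key the entry specifies equals the target's value. The function names and sample dictionaries are hypothetical.

```python
def rule_matches(rule, target):
    """True when every key the rule sets equals the target's value for that key."""
    return all(target.get(key) == value for key, value in rule.items())


def select_targets(rules, targets):
    """Keep the targets matched by at least one include rule."""
    return [t for t in targets if any(rule_matches(r, t) for r in rules)]


# Rules mirroring the two state-level entries added in this hunk.
rules = [
    {"variable": "tanf", "geo_level": "state"},
    {"variable": "spm_unit_count", "geo_level": "state", "domain_variable": "tanf"},
]

# Made-up candidate targets, for illustration only.
targets = [
    {"variable": "tanf", "geo_level": "state"},
    {"variable": "spm_unit_count", "geo_level": "state", "domain_variable": "tanf"},
    {"variable": "spm_unit_count", "geo_level": "state", "domain_variable": "snap"},
]

selected = select_targets(rules, targets)
```

Under this assumption, the `spm_unit_count` rule picks up only the count restricted to the `tanf` domain and leaves the count for any other domain out.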
10 changes: 1 addition & 9 deletions policyengine_us_data/datasets/cps/cps.py
@@ -105,7 +105,6 @@ def _open_dataset_read_only(dataset_source):
with closing(dataset.load()) as store:
yield store


class CPS(Dataset):
name = "cps"
label = "CPS"
@@ -1534,8 +1533,6 @@ def select_random_subset_to_target(
}
)

final_counts = pd.Series(ssn_card_type).value_counts().sort_index()

# ============================================================================
# PROBABILISTIC FAMILY CORRELATION ADJUSTMENT
# ============================================================================
@@ -1559,8 +1556,6 @@ def select_random_subset_to_target(
)
print(f"Additional undocumented needed: {undocumented_needed:,.0f}")

families_adjusted = 0

if undocumented_needed > 0:
# Identify households with mixed status (code 0 + code 3 members)
mixed_household_candidates = []
@@ -1584,7 +1579,6 @@ def select_random_subset_to_target(
# Randomly select from eligible code 3 members in mixed households to hit target
if len(mixed_household_candidates) > 0:
mixed_household_candidates = np.array(mixed_household_candidates)
candidate_weights = person_weights[mixed_household_candidates]

# Use probabilistic selection to hit target
selected_indices = select_random_subset_to_target(
@@ -1596,7 +1590,6 @@ def select_random_subset_to_target(

if len(selected_indices) > 0:
ssn_card_type[selected_indices] = 0
families_adjusted = len(selected_indices)
print(
f"Selected {len(selected_indices)} people from {len(mixed_household_candidates)} candidates in mixed households"
)
@@ -1735,7 +1728,7 @@ def get_arrival_year_midpoint(peinusyr):
# Save as immigration_status_str since that's what PolicyEngine expects
cps["immigration_status_str"] = immigration_status.astype("S")
# Final population summary
-print(f"\nFinal populations:")
+print("\nFinal populations:")
code_to_str = {
0: "NONE", # Likely undocumented immigrants
1: "CITIZEN", # US citizens
@@ -1952,7 +1945,6 @@ def add_tips(self, cps: h5py.File):
# is_married is person-level here but policyengine-us defines it at Family
# level, so we must not save it
cps = cps.drop(columns=["is_married", "is_under_18", "is_under_6"], errors="ignore")

self.save_dataset(cps)


9 changes: 5 additions & 4 deletions policyengine_us_data/db/DATABASE_GUIDE.md
@@ -29,10 +29,11 @@ make database-refresh # Force re-download all sources and rebuild
| 4 | `etl_age.py` | Census ACS 1-year | Age distribution: 18 bins x 488 geographies |
| 5 | `etl_medicaid.py` | Census ACS + CMS | Medicaid enrollment (admin state-level, survey district-level) |
| 6 | `etl_snap.py` | USDA FNS + Census ACS | SNAP participation (admin state-level, survey district-level) |
-| 7 | `etl_state_income_tax.py` | Census STC | State income tax collections (Census STC FY2023 `T40`, downloaded and cached) |
-| 8 | `etl_irs_soi.py` | IRS | Tax variables, EITC by child count, AGI brackets, conditional strata |
-| 9 | `etl_pregnancy.py` | CDC VSRR + Census ACS | Pregnancy prevalence by state (provisional birth counts) |
-| 10 | `validate_database.py` | No | Checks all target variables exist in policyengine-us |
+| 7 | `etl_tanf.py` | HHS ACF | TANF caseload families and cash-assistance spending (FY2024) |
+| 8 | `etl_state_income_tax.py` | Census STC | State income tax collections (Census STC FY2023 `T40`, downloaded and cached) |
+| 9 | `etl_irs_soi.py` | IRS | Tax variables, EITC by child count, AGI brackets, conditional strata |
+| 10 | `etl_pregnancy.py` | CDC VSRR + Census ACS | Pregnancy prevalence by state (provisional birth counts) |
+| 11 | `validate_database.py` | No | Checks all target variables exist in policyengine-us |

### Raw Input Caching

2 changes: 2 additions & 0 deletions policyengine_us_data/db/create_field_valid_values.py
@@ -75,6 +75,8 @@ def populate_field_valid_values(session: Session) -> None:
("source", "Census ACS S2201", "survey"),
("source", "Census STC", "administrative"),
("source", "CDC VSRR Natality", "administrative"),
("source", "HHS ACF TANF Caseload", "administrative"),
("source", "HHS ACF TANF Financial", "administrative"),
("source", "PolicyEngine", "hardcoded"),
]

7 changes: 0 additions & 7 deletions policyengine_us_data/db/etl_national_targets.py
@@ -203,13 +203,6 @@ def extract_national_targets(year: int = DEFAULT_YEAR):
"notes": "Housing subsidies",
"year": HARDCODED_YEAR,
},
{
"variable": "tanf",
"value": 9e9,
"source": "HHS/ACF",
"notes": "TANF cash assistance",
"year": HARDCODED_YEAR,
},
{
"variable": "real_estate_taxes",
"value": 500e9,