A Python data engineering project that compiles American Community Survey (ACS) Census data from multiple tables into a single unified master dataset at the ZIP code level. Built to support Dallas real estate market analysis and neighborhood research.
The Census Bureau publishes ACS data as separate table downloads — one CSV per topic (income, housing, demographics, commuting, etc.), each with cryptic column codes and companion metadata files. Merging these into something usable requires understanding the data structure and building a repeatable pipeline.
A Python script that:
- Discovers all `*-Data.csv` files in the download directory using `glob`
- Reads each companion `*-Column-Metadata.csv` to map column codes to human-readable labels
- Extracts the ZIP code from the `GEO_ID` field (last 5 digits)
- Renames columns to a `table_prefix: description (Estimate/Margin of Error)` format
- Merges all tables with `functools.reduce`, so the approach scales to any number of tables
- Writes `Dallas_Master_Dataset.csv`, a single wide-format table
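The ZIP extraction and `functools.reduce` merge described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the helper names `with_zip` and `merge_tables`, the outer join, the `GEO_ID` values, and the dollar figures are all assumptions made for the example.

```python
from functools import reduce

import pandas as pd


def with_zip(df):
    """Add a ZIP column from the last 5 characters of GEO_ID, then drop GEO_ID."""
    return df.assign(ZIP=df["GEO_ID"].str[-5:]).drop(columns="GEO_ID")


def merge_tables(frames, key="ZIP"):
    """Fold any number of per-table frames into one wide frame on a shared key.

    The outer join is an assumption; the original join type is not stated.
    """
    return reduce(lambda left, right: left.merge(right, on=key, how="outer"), frames)


# In-memory stand-ins for two ACS table downloads (all values are made up).
income = pd.DataFrame(
    {"GEO_ID": ["860Z200US75201", "860Z200US75204"], "B19013_001E": [85000, 72000]}
)
housing = pd.DataFrame(
    {"GEO_ID": ["860Z200US75201", "860Z200US75204"], "B25077_001E": [410000, 365000]}
)

master = merge_tables([with_zip(income), with_zip(housing)])
print(master)
```

Because `reduce` folds the list pairwise, a third or tenth table only means a longer list, not another merge statement.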
| Category | Tools |
|---|---|
| Data Processing | Python, pandas |
| File Discovery | glob, os.path |
| Merge Pattern | functools.reduce |
| Data Source | U.S. Census Bureau ACS |
Metadata-driven labeling: Rather than hardcoding column names, the script reads Census metadata files at runtime. Adding a new ACS table to the analysis requires zero code changes — just drop the files in the directory.
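A minimal sketch of how metadata-driven renaming might look. It assumes the metadata file has `Column Name` and `Label` columns and that labels begin with `Estimate!!` or `Margin of Error!!`; the function name `build_label_map` and the sample rows are hypothetical and may differ from the project's script.

```python
import io

import pandas as pd


def build_label_map(metadata, table_prefix):
    """Map column codes to 'table_prefix: description (Estimate/Margin of Error)'."""
    labels = {}
    for code, label in zip(metadata["Column Name"], metadata["Label"]):
        kind, sep, description = label.partition("!!")
        if not sep:
            continue  # rows such as GEO_ID or NAME carry no Estimate/MOE prefix
        description = description.replace("!!", " - ")
        labels[code] = f"{table_prefix}: {description} ({kind})"
    return labels


# Stand-in for a *-Column-Metadata.csv file (contents are illustrative).
metadata_csv = io.StringIO(
    "Column Name,Label\n"
    "GEO_ID,Geography\n"
    "NAME,Geographic Area Name\n"
    "B19013_001E,Estimate!!Median household income\n"
    "B19013_001M,Margin of Error!!Median household income\n"
)
metadata = pd.read_csv(metadata_csv)
labels = build_label_map(metadata, "B19013")

data = pd.DataFrame({"B19013_001E": [85000], "B19013_001M": [5000]})
renamed = data.rename(columns=labels)
print(list(renamed.columns))
```

Since the mapping is built from the metadata at runtime, nothing in the code refers to a specific table or column code.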
Estimate vs. MOE separation: ACS estimates always come with margins of error at 90% confidence. Keeping these as distinct columns lets downstream users filter out unreliable estimates from small-population ZIP codes.
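One way a downstream user might apply this is to convert each margin of error to a coefficient of variation (CV) and drop rows above a cutoff. The sketch below divides the MOE by 1.645, the z-score behind a 90% confidence margin, to recover the standard error. The 30% cutoff, the column names, and the sample values are assumptions for illustration, not part of the project.

```python
import pandas as pd

Z_90 = 1.645  # z-score corresponding to a 90% confidence margin of error


def coefficient_of_variation(estimate, moe):
    """Return CV = standard error / estimate, with standard error = MOE / 1.645."""
    return (moe / Z_90) / estimate


# Made-up values: one well-measured ZIP code and one with a very wide margin.
df = pd.DataFrame(
    {
        "ZIP": ["75201", "75247"],
        "est": [85000.0, 40000.0],
        "moe": [5000.0, 30000.0],
    }
)
df["cv"] = coefficient_of_variation(df["est"], df["moe"])

# Hypothetical cutoff: keep rows whose CV is at most 30%.
# Rows with a zero estimate produce inf or NaN and are dropped by this filter.
reliable = df[df["cv"] <= 0.30]
print(reliable)
```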