# HeRoP Geodata
This website provides download links for geospatial datasets that we at the Healthy Regions & Policies Lab use for various analyses and cartographic applications. The source repository holds the Python processing pipeline that we use to generate these datasets directly from the US Census Bureau FTP.
## Processing Steps
Given a geography, year, and scale, all matching files in the Census FTP will be processed as follows:

- Downloaded and unzipped within `.cache/`
- Loaded and merged into a single GeoPandas dataframe
- New fields calculated and added to the dataframe
- The dataframe exported to one or more of the following file formats: Shapefile, GeoJSON, and PMTiles
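A minimal sketch of these steps with GeoPandas (the paths and file names here are illustrative, not the pipeline's actual internals):

```python
import geopandas as gpd
import pandas as pd

# Illustrative inputs: zipped shapefiles already downloaded into .cache/
paths = [
    ".cache/tract/cb_2018_17_tract_500k.zip",
    ".cache/tract/cb_2018_18_tract_500k.zip",
]

# Load each shapefile and merge into a single GeoPandas dataframe
gdf = pd.concat([gpd.read_file(p) for p in paths], ignore_index=True)

# ...new fields are calculated here (see Fields below)...

# Export; PMTiles are produced separately with tippecanoe
gdf.to_file("tract-2018-500k.geojson", driver="GeoJSON")
gdf.to_file("tract-2018-500k.shp")
```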
## Specs
The output files from this process have a few extra fields added to them, and come in a few different file formats that are useful for different contexts.
### Fields
#### LABEL

A human-readable label, calculated using each unit's name and its proper LSAD (Legal/Statistical Area Description).

#### BBOX
The bounding box of each feature is calculated and concatenated into a single text field with this format: `"{minx},{miny},{maxx},{maxy}"`.
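For example, this field could be derived with GeoPandas like so (a sketch, not the pipeline's exact code, reusing the `gdf` dataframe from the sketch above):

```python
# Derive the BBOX field from each feature's geometry bounds
bounds = gdf.geometry.bounds  # dataframe with minx, miny, maxx, maxy columns
gdf["BBOX"] = bounds.apply(
    lambda row: f"{row.minx},{row.miny},{row.maxx},{row.maxy}", axis=1
)
```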
#### HEROP_ID

In some of our projects we use what we call a `HEROP_ID` to identify geographic boundaries defined by the US Census Bureau; it is a slight variation on the commonly used standard GEOID. Our format is similar to the one used by the American FactFinder (now data.census.gov).
A HEROP_ID consists of three parts:
- The 3-digit Summary Level Code for the geography. Common summary level codes are:
  - `040` -- State
  - `050` -- County
  - `140` -- Census Tract
  - `150` -- Census Block Group
  - `860` -- Zip Code Tabulation Area (ZCTA)
- The 2-letter string `US`
- The standard GEOID for the given unit (length depends on the unit's summary level)
  - GEOIDs are, in turn, hierarchical aggregations of FIPS codes
Expanding out the FIPS codes for the five summary levels shown above, the full IDs would look like:
| Summary level | Format | Length | Example |
|---|---|---|---|
| State | `040US` + STATE (2) | 7 | `040US17` (Illinois) |
| County | `050US` + STATE (2) + COUNTY (3) | 10 | `050US17019` (Champaign County) |
| Tract | `140US` + STATE (2) + COUNTY (3) + TRACT (6) | 16 | `140US17019005900` |
| Block Group | `150US` + STATE (2) + COUNTY (3) + TRACT (6) + BLOCK GROUP (1) | 17 | `150US170190059002` |
| ZCTA | `860US` + ZIP CODE (5) | 10 | `860US61801` |
The advantages of this composite ID are:

- Unique across all geographic areas in the US
- Will always be handled as a string, never accidentally cast to an integer
- Easy to programmatically convert back into the more standard GEOIDs
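A minimal sketch of how an ID in this format can be assembled (the function and dictionary names are illustrative):

```python
# Summary level codes from the table above
SUMMARY_LEVELS = {
    "state": "040",
    "county": "050",
    "tract": "140",
    "bg": "150",
    "zcta": "860",
}

def make_herop_id(geography: str, geoid: str) -> str:
    """Prefix a standard GEOID with its summary level code and 'US'."""
    return f"{SUMMARY_LEVELS[geography]}US{geoid}"

print(make_herop_id("county", "17019"))  # -> 050US17019 (Champaign County)
```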
#### Convert to GEOID (integers)

The `HEROP_ID` can be converted back to a standard GEOID by removing the first 5 characters, or by taking everything after the substring "US". Here are some examples of what this looks like in different software:

- Excel: `REPLACE(A1, 1, 5, "")`
- R: `geoid <- str_split_i(HEROP_ID, "US", -1)`
- Python: `geoid = HEROP_ID.split("US")[1]`
- JavaScript: `const geoid = HEROP_ID.split("US")[1]`
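And for converting a whole column at once, a sketch in pandas (assuming a DataFrame `df` with a `HEROP_ID` column; the column names are illustrative):

```python
# Drop the 5-character prefix (3-digit summary level code + "US") from every row
df["GEOID"] = df["HEROP_ID"].str[5:]
```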
### Formats
Each processed dataset is exported to three different file formats.
#### GeoJSON

A simple plain text format that is good for small to medium size datasets and can be used in a wide variety of web and desktop software. [learn more](https://geojson.org/)

#### PMTiles

A "cloud-native" vector format that is very fast in the right web mapping environment. [learn more](https://docs.protomaps.com/pmtiles/)

#### Shapefile

Used in scripting and desktop software for performant display and analysis. [learn more](https://www.geographyrealm.com/what-is-a-shapefile/)

#### Using Shapefiles in scripts
You don't need to download and unzip these shapefiles to use them in R or Python scripts.
- R Example: `sf` allows you to directly open remote, zipped shapefiles without downloading them (learn more), though this capability does not seem to be documented for `read_sf` itself:

  ```r
  library('sf')
  tracts <- read_sf('/vsizip//vsicurl/https://herop-geodata.s3.us-east-2.amazonaws.com/oeps/tract-2018-500k-shp.zip')
  ```

- Python Example: `geopandas` allows you to directly open remote, zipped shapefiles without downloading them (learn more):

  ```python
  import geopandas as gpd
  tracts = gpd.read_file("/vsizip//vsicurl/https://herop-geodata.s3.us-east-2.amazonaws.com/oeps/state-2010-500k-shp.zip")
  ```
## CLI
The script we use for this processing pipeline can be run on its own to generate new copies of the files. This section serves to help with development and maintenance of that script.
### Install
```bash
git clone https://github.com/healthyregions/census-geofiles
cd census-geofiles
python3 -m venv env && source ./env/bin/activate
pip install -e .
```
To create exports in PMTiles format, you must also install `tippecanoe` and provide the path to its executable through an environment variable or command line argument (see below).
### Configuration
A few environment variables should be set before running the command.
```bash
cp .env.example .env
```
Our defaults are provided in the example file along with comments on each variable. These variables can be overridden at runtime like so:
```bash
AWS_BUCKET_NAME=my-bucket python ./census.py etc...
```
#### Available mirrors
By default, the process will download files directly from the US Census FTP, https://www2.census.gov/geo. You can direct it to use mirrors of that FTP if needed.
| Institution | Link | MIRROR_URL |
|---|---|---|
| University of Chicago | browse | https://pub-a835f667d17f4b6691fafec7e9ede33d.r2.dev |
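For example, to pull source files from the mirror above for a single run, reusing the runtime override pattern from Configuration (the flag values here are just an illustration):

```bash
MIRROR_URL=https://pub-a835f667d17f4b6691fafec7e9ede33d.r2.dev python ./census.py -g county -y 2020
```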
Sources lookup
The file lookups/sources.json is a master list of all file urls and important field names for each year, geography, and scale. This is necessary because field names and file naming conventions have changed over the years (and I have had trouble running ls
-type commands on the Census FTP server so for now this config is all just hard-coded).
### Usage
```bash
python ./census.py [OPTIONS]
```

Options:
| Arg | Input | Description |
|---|---|---|
| `-g`/`--geography` | place, bg, tract, state, zcta, county | Specify a geography to prepare. If left empty, all geographies will be processed. |
| `-y`/`--year` | 2010, 2018, 2020 | Specify one or more years. If left empty, all years will be processed. |
| `-s`/`--scale` | 500k, tiger | Specify one or more scales of geographic boundary file. If left empty, all scales will be processed. |
| `--destination` | path/to/directory | Output directory for export. If not provided, results will be in `.cache/{{geography}}/processed`. |
| `--upload` | (flag) | Upload the processed files to S3. Bucket name, AWS creds, and prefix will be acquired from environment variables. |
| `--no-cache` | (flag) | Force re-retrieval of source files. |
| `--verbose` | (flag) | Enable verbose output during process. |
Note: To export PMTiles, you will need to install `tippecanoe` (see Install above).
The available options for geography, year, and scale are collected from `sources.json`, which will continue to expand, so the best way to see all options is by running:

```bash
python ./census.py --help
```

If no arguments are provided, the entire `sources` lookup will be traversed and an attempt will be made to generate each file format (Shapefile, GeoJSON, and PMTiles) for every configured year, geography, and scale.
#### Examples
```bash
python ./census.py -y 2020 -g state -s 500k -f geojson --destination .
```

Result: A new GeoJSON file in the local directory, generated from the 500k Cartographic Boundary shapefile, 2020 vintage.
```bash
python ./census.py -g bg -s 500k -f pmtiles --upload
```

Result: 500k-scale cartographic boundary files for block groups will be merged into nationwide coverage and exported to PMTiles, one file per available year. Each file will be uploaded to the S3 bucket specified by environment variables.
## Building the website

A separate script, `build_pages.py`, generates the single HTML file hosted on GitHub Pages. After the package has been installed as above, run `python ./build_pages.py`. The `docs/index.html` file will be re-rendered based on the main README.