This Jupyter notebook describes the steps necessary to recreate the data for the Texas Attainment Map.

The code below describes in detail the process of converting the American Community Surevy data made available through IPUMS-NHGIS to dots on a map. This project requires the following files which can be downloaded from NHGIS:

  • 2010 census block polygons for Texas
  • 2010 census table P1, Total Population, for census blocks
  • 2011-2015 5-year summary statistics for table B15001, educational attainment by age for each census tract. The census tract field (GISJOIN) will be used to merge the data to the census tract polygons.
  • 2015 census tract polygons (we downloaded the entire US census tract file, you will only need Texas)

You will need access to:

  • ArcGIS (peferably with the advanced license - however, there are workarounds if you only have the basic license)
  • SAS 9.4 (there are many alternatives to SAS)
In [1]:
#These modules only load when launching in the arcgispro-3 environment
import arcpy
import os
import arcpy.management as mg
import re
import pandas as pd
arcpy.env.overwriteOutput = True
arcpy.env.qualifiedFieldNames = False
os.chdir('PUT_YOUR_FILE_PATH_HERE')

Identify the places where there shouldn't be any dots

Census tracts include bodies of water, airports, and parks - places where people don't usually live.

Here is a census tract near downtown in Austin, Texas. image.png

In order to avoid placing a dot representing 25 people with a bachelor's degree in the middle of Lady Bird Lake, or on a highway, or in a park, we will need adjust the census tract polygons. We do this using the much smaller census blocks from the 2010 census.

Download and unzip the files from NHGIS. We'll begin by identifying all of the census blocks form the 2010 census with a population of zero. (Note: these census blocks with zero population highways and waterways that can divide entire census tracts.)

In [2]:
#######
#Create NoPopBlocks from data downloaded from NHGIS
#######

#File for UN-ZIPPED NHGIS data is: "AttainmentMap/OriginalDownloadfromNHGIS/"


#The CensusBlock CSV file for Texas where H7v001=0 [i.e. population=0]


#Convert zipped CensusBlock csv file to dBase and limit to just Texas blocks where population=0.
CensusBlocks = pd.read_csv("data/OriginalDownloadfromNHGIS/nhgis0004_csv/nhgis0004_ds172_2010_block.csv",  usecols=[0,39])

NoPopBlocks=CensusBlocks.loc[CensusBlocks['H7V001'] == 0 ]

#set max rows to display when viewing dataframes
pd.options.display.max_rows = 10
print(NoPopBlocks)

#Save NoPopBlocks as a CSV
NoPopBlocks.to_csv('data/NoPopBlocks.csv')
                   GISJOIN  H7V001
3       G48000109501001003       0
4       G48000109501001004       0
7       G48000109501001007       0
8       G48000109501001008       0
11      G48000109501001011       0
...                    ...     ...
914197  G48050709503024041       0
914198  G48050709503024042       0
914201  G48050709503024045       0
914202  G48050709503024046       0
914203  G48050709503024047       0

[459573 rows x 2 columns]
In [3]:
# Create a File Geodatabase to store the NoPopBlocks 
arcpy.CreateFileGDB_management("data", "Attainment.gdb")

#Get the NoPopBlocks table into the geodatabase
arcpy.TableToTable_conversion("data/NoPopBlocks.csv", "data/Attainment.gdb", "NoPopBlocksTable")

#Index GISJOIN fields for improved performamce
arcpy.AddIndex_management ("data/Attainment.gdb/NoPopBlocksTable", "GISJOIN", "index1")
Out[3]:
<Result 'data/Attainment.gdb/NoPopBlocksTable'>
In [4]:
#Put TX census block polygons in GDB.
arcpy.FeatureClassToGeodatabase_conversion("data/blocks/TX_block_2010.shp", "data/Attainment.gdb")

#Index GISJOIN fields for improved performamce
arcpy.AddIndex_management ("data/Attainment.gdb/TX_block_2010", "GISJOIN", "index2")
Out[4]:
<Result 'data/Attainment.gdb/TX_block_2010'>

Now we'll select all of the census blocks with zero population and save those polygons to the geodatabase.

In [5]:
#Get just the Census blocks with zero population

# Set local variables    
layerName = "TX_BlocksLayer"
joinTable = "data/Attainment.gdb/NoPopBlocksTable"
joinField = "GISJOIN"
outFeature = "data/Attainment.gdb/NoPopBlocks"
expression = "NoPopBlocksTable.H7V001 = 0"
    
#Load TX 2010 Census Blocks as Feature Layer
arcpy.MakeFeatureLayer_management("data/blocks/TX_block_2010.shp", "TX_BlocksLayer")
    
# Join the feature layer to a table. Use KEEP_COMMON to eleiminate need for arcpy.SelectLayerByAttribute_management()
arcpy.AddJoin_management(layerName, joinField, joinTable, joinField)
    
arcpy.SelectLayerByAttribute_management(layerName,"NEW_SELECTION", expression)
    
# Copy the layer to a new permanent feature class
arcpy.CopyFeatures_management(layerName, outFeature)

    
Out[5]:
<Result 'data\\Attainment.gdb\\NoPopBlocks'>

Now we can see all the 2010 census blocks with zero population (shown below in grey).

Here we see that the lake, the highway, the park, and a lot of commercial areas of downtown Austin have census blocks with population=0.

image.png

Now we'll poke holes in the census tracts using the census block with zero population

Thanks to Prof. Walker at TCU for posting his python code on how to do this.

Note: You will need the advanced ArcGIS license to use the erase tool. However, you can use the union tool to join the data for both layers and get the same result. Western Washington University has a helpful how-to here.

What will be left of our census tracts is shown below in green.

image.png

In [6]:
#Load Texas Census Tracts

#Make temporary CensusTract Layer
arcpy.MakeFeatureLayer_management("data/OriginalDownloadfromNHGIS/nhgis0001_shape/us_tract_2015.shp", "CensusTractLayer")

#Select just the TX Census Tracts and save in GDB
arcpy.SelectLayerByAttribute_management("CensusTractLayer", "NEW_SELECTION", " STATEFP = '48' ")
arcpy.CopyFeatures_management("CensusTractLayer", "data/Attainment.gdb/TexasTracts")

#Erase the NoPopBlocks from the Texas Census Tracts
in_features =  "data/Attainment.gdb/TexasTracts"
erase_features = "data/Attainment.gdb/NoPopBlocks"
out_feature_class = "data/Attainment.gdb/BustedTracts"

arcpy.Erase_analysis(in_features, erase_features, out_feature_class)
Out[6]:
<Result 'data\\Attainment.gdb\\BustedTracts'>

This leaves some thin slices of census tracts.

That can complicate things when it comes to randomly placing the dots within the census tracts.

So we need to remove the slivers. We'll do so by eliminating isolated chuncks of census tracts < 5,000 sq. meters. This will leave us with our dasymetric census tracts.

Note: Below, the Arcpy Add Geometry Attribute tool is providing a syntax warning but no apparent problem on execution.

In [7]:
#This will allow us to grab each piece of busted up Census Tract individually
arcpy.MultipartToSinglepart_management("data/Attainment.gdb/BustedTracts", "data/Attainment.gdb/SinglePartTracts")

# Check for slivers and remove them - let's set a threshold of 5000 sq meters
singlepart_layer = arcpy.MakeFeatureLayer_management("data/Attainment.gdb/SinglePartTracts")
arcpy.AddGeometryAttributes_management(singlepart_layer, Geometry_Properties = "AREA", Area_Unit = "SQUARE_METERS")
    #yes there is a syntax warning - as far as I can tell, it is just part of the arcpy script for that tool

# Now, specify the where clause and dissolve
area_clause = '"POLY_AREA" >= 5000'
arcpy.SelectLayerByAttribute_management(singlepart_layer, "NEW_SELECTION", area_clause)
out_dissolve = "data/scratch/DissolvedSinglepartTracts.shp"

arcpy.Dissolve_management(singlepart_layer, out_dissolve, dissolve_field = "GISJOIN")
c:\program files\arcgis\pro\Resources\ArcToolbox\Scripts\AddGeometryAttributes.py:31: SyntaxWarning: name 'hasZ' is assigned to before global declaration
  global hasZ
c:\program files\arcgis\pro\Resources\ArcToolbox\Scripts\AddGeometryAttributes.py:32: SyntaxWarning: name 'hasM' is assigned to before global declaration
  global hasM
c:\program files\arcgis\pro\Resources\ArcToolbox\Scripts\AddGeometryAttributes.py:67: SyntaxWarning: name 'ucurFields' is assigned to before global declaration
  global ucurFields
Out[7]:
<Result 'data\\scratch\\DissolvedSinglepartTracts.shp'>

Prepare the NHGIS attainment data by census tract needs before continuing to the next step.

I used SAS to generate the number of dots per level of attainment at each scale (1 dot =10 people, 25 people, 50 people, 100 people, and 150 people) for each census tract (GISJOIN field).

The output file is: data\NHGIS0001_PreparedForMerge.csv

I used the following SAS code for preparing the NHGIS attainment data:

/* This program takes the raw attainment data in the ACS B15001 table from NHGIS and 
prepares the number of dots to show by census tract at different scales (10, 25, 50, 100, 150).
These scales will correspond to different zoom levels in the final map.
*/

PROC IMPORT OUT= WORK.NHGIS0001
            DATAFILE= "~\data\OriginalDownloadfromNHGIS\nhgis0001_csv\nhgis0001_csv\nhgis0001_ds216_20155_2015_tract.csv" DBMS=CSV REPLACE;
     GETNAMES=YES;
     DATAROW=2; 
RUN;

/* Keep the GISJOIN field and make the following new variables: 
            High School or less, Some college no degree, Associates,  
            Bachelor's, Graduate or Professional Degree 
*/

data AttainmentRaw;
    set NHGIS0001;
    where STATEA='48';
        Assoc = AD0PE016 + AD0PE057;
        Bacc =  AD0PE017 + AD0PE058;
        GradPro =  AD0PE018 + AD0PE059;
        SomeHE = AD0PE015 + AD0PE056;
        HSorLess = AD0PE012 + AD0PE013 + AD0PE014 + AD0PE053 + AD0PE054 + AD0PE055;
        TotalPop =  AD0PE001;
run;


data work.AttainmentbyCensusTract (keep=GISJOIN Assoc10 Assoc25 Assoc50 Assoc100 Assoc150 
        Bacc10 Bacc25 Bacc50 Bacc100 Bacc150 GradPro10 GradPro25 GradPro50 GradPro100 GradPro150 
        HSunder10 HSunder25 HSunder50 HSunder100 HSunder150 SomeHE10 SomeHE25 SomeHE50 SomeHE100 SomeHE150);
    set AttainmentRaw;

    Assoc10=Round(Assoc/10);
    Assoc25=Round(Assoc/25);
    Assoc50=Round(Assoc/50);
    Assoc100=Round(Assoc/100);
    Assoc150=Round(Assoc/150);

    Bacc10=Round(Bacc/10);
    Bacc25=Round(Bacc/25);
    Bacc50=Round(Bacc/50);
    Bacc100=Round(Bacc/100);
    Bacc150=Round(Bacc/150);

    GradPro10=Round(GradPro/10);
    GradPro25=Round(GradPro/25);
    GradPro50=Round(GradPro/50);
    GradPro100=Round(GradPro/100);
    GradPro150=Round(GradPro/150);

    SomeHE10=Round(SomeHE/10);
    SomeHE25=Round(SomeHE/25);
    SomeHE50=Round(SomeHE/50);
    SomeHE100=Round(SomeHE/100);
    SomeHE150=Round(SomeHE/150);

    HSunder10=Round(HSorLess/10);
    HSunder25=Round(HSorLess/25);
    HSunder50=Round(HSorLess/50);
    HSunder100=Round(HSorLess/100);
    HSunder150=Round(HSorLess/150);

run;

proc export data=work.AttainmentbyCensusTract
            outfile="~\data\NHGIS0001_PreparedForMerge.csv" 
            dbms=csv replace;
run;

The next step joins the output from SAS (number of dots per level of attainment at 5 different zoom layers for each census tract in Texas) to the census tract polygons.

In [8]:
#Merge the Dissolved_back_to_Tracks polygons with the .xls file exported from sas with the number of dots per tract

#Get .xls into GDB
arcpy.TableToTable_conversion("data/NHGIS0001_PreparedForMerge.csv", "data/Attainment.gdb", "AttainmentData_SASoutput")

#Index GISJOIN field for improved performamce
#arcpy.AddIndex_management ("data/Attainment.gdb/JoinedAttainmentData_DasymetricTracts", "GISJOIN", "GISIndex")

# Join the feature layer to a SAS output table
inFeatures = "data/scratch/DissolvedSinglepartTracts.shp"
layerName = "JoinedAttainmentData_DasymetricLayer"
joinTable = "data/Attainment.gdb/AttainmentData_SASoutput"
joinField = "GISJOIN"
outFeature = "data/Attainment.gdb/JoinedAttainmentData_DasymetricTracts"

# Create a feature layer 
arcpy.MakeFeatureLayer_management (inFeatures,  layerName)
    
# Join the feature layer to a table
arcpy.AddJoin_management(layerName, joinField, joinTable, joinField)

# Copy the layer to a new permanent feature class
arcpy.CopyFeatures_management(layerName, outFeature)
Out[8]:
<Result 'data\\Attainment.gdb\\JoinedAttainmentData_DasymetricTracts'>

And now we can make the dots!

The uses the arcpy.CreateRandomPoints tool which is also available only with the advanced license. There are similar tools out there that may also get the job done.

Once we make random points within our modified census tracts we will no longer need any polygons or census shape data!

From here on out, our data consists of X,Y coordinates associated with

  • Level of educational attainment
  • Approximate number of individuals represented by each point (i.e.: scale).
In [9]:
def MakePointLayers(scale):
    levels=["Assoc", "Bacc", "GradPro", "HSunder", "SomeHE"]
    for level in levels:

        outGDB = "data/Attainment.gdb"
        out_name= level + str(scale)
        constraining_feature_class = "data/Attainment.gdb/JoinedAttainmentData_DasymetricTracts"
        Field_with_Number_of_Random_Points = level + str(scale)
        Minimum_Distance_Between_Points= str(scale)
        
        
        arcpy.CreateRandomPoints_management(outGDB, out_name, constraining_feature_class, "", Field_with_Number_of_Random_Points, Minimum_Distance_Between_Points)    

scales=[10, 25, 50, 100, 150]
for scale in scales:
    MakePointLayers(scale)

Here we'll add fields for scale and attainment level and then make a single shapefile for each level.

We made five scales. They are:

1 dot = 10 people, 25 people, 50 people, 100 people, and 150 people
In [10]:
def arcmap_to_geojson(scale):
        levels=["Assoc", "Bacc", "GradPro", "HSunder", "SomeHE"]
        for level in levels:
            layername = level + str(scale)
            layer = level + str(scale)
            arcpy.management.MakeFeatureLayer("data/Attainment.gdb/" + layername, layer)
            arcpy.management.AddField(layer, "scale", "SHORT")
            arcpy.management.CalculateField(layer, "scale", scale)
            arcpy.management.AddField(layer, "level", "TEXT")
            arcpy.management.CalculateField(layer, "level", "'"+level+"'")
        
def combine_levels_by_scale(scale):
    levels=["Assoc"+str(scale), "Bacc"+str(scale), "GradPro"+str(scale), "HSunder"+str(scale), "SomeHE"+str(scale)]
    
    # Use Merge tool to move features into single dataset, set spatial reference to WGS84
    CombinedFeature = "data/Attainment.gdb/TempAttainment" + str(scale)
    OutFeature = "data/Attainment.gdb/Attainment" + str(scale)
    
    arcpy.Merge_management(levels, CombinedFeature)
    arcpy.management.Project(CombinedFeature, OutFeature, arcpy.SpatialReference(4326))
    arcpy.FeatureClassToShapefile_conversion (OutFeature, "data/scratch")
                
scales=[10, 25, 50, 100, 150]
for scale in scales:
    arcmap_to_geojson(scale)
    combine_levels_by_scale(scale)

Now we can see the dots on a map!

The image below shows all of the dots in blue for scale=25. You can control the color using the 'Level' field later to contrast different degree types. Remember, the placement of the dots within the dasymetric census tract in random so a repeat of this process won't have dots in exactely the same place. But the number of dots per census tract would remain the same.

image.png

From here you can upload the shapefiles to a service like Mapbox or make your own .mbtiles using open source tools.

For diehard SAS users, there is a new proc json command in SAS 9.4. We got our geojson using that. But the code isn't pretty. Contact john.dinning@thecb.state.tx.us for a copy of the code. Ogr2ogr is much easier!

You can use the ogr2ogr tool from the open source GDAL/OGR library to convert the shapefiles to GeoJSON.

Once you have installed GDAL, you can run the tool in the command line that takes the shapefile of dots for each zoom level. The following was for zoom-level range where each dot represented about 150 people:

ogr2ogr -f GeoJSON geojson/Attainment150.json Shapefiles/Attainment150.shp -progress

This outputs the points as a .json file of point data.

The tippecanoe tool takes .json data to make .MBtiles. Read the full documentation on GitHub for best results. We used tippecanoe with the following options:

tippecanoe --output=Attainment150.mbtiles geojson/Attainment150.json -r1 --drop-fraction-as-needed --no-feature-limit --no-tile-size-limit --maximum-zoom=7 --minimum-zoom=5 --exclude=CID