RCP Demographic Interactive Map: Methodology
This is the technical manual for the interactive map – basically this is where I explain all the nitty-gritty choices I made on the code and data. This is going to get relatively detailed and possibly a little dense, so only read this if you really want to get into the weeds. I’m writing with academics, data enthusiasts, quantitative political analysts, data science professionals and my fellow data journalists in mind. In other words, I want this write-up to be understandable and not too laden with jargon, but I don’t intend to gloss over a detail because it might be too quantitative or boring for the lay reader. In this document, the boring details are the point.
The first section deals with the calculations behind the popular vote table. The second section deals with the data used to make the state-by-state results in the map. The third section deals with some of the code used to generate state-by-state vote totals in the map. The fourth section provides links to data sources.
Section 1: Calculating the Popular Vote
Almost all of the calculations within the code are built around one equation:
total votes for party i = total VEP × turnout rate × two-party vote share for party i
It’s a pretty simple, intuitive equation – take all the people who can vote (the voting-eligible population or VEP), multiply that by the fraction of eligible voters who make it to the polls and then multiply that by the fraction that votes for a specific candidate. That produces the number of votes that candidate receives in a given election.
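As a minimal sketch of that equation (not the interactive's actual code), the calculation for one group and one party can be written in a few lines of Python. The function name and the input numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def party_votes(vep: float, turnout_rate: float, two_party_share: float) -> float:
    """Votes for one party = VEP x turnout rate x two-party vote share."""
    return vep * turnout_rate * two_party_share


# Hypothetical inputs: 1,000,000 eligible voters, 60% turnout, 55% for party i.
example_votes = party_votes(1_000_000, 0.60, 0.55)
print(round(example_votes))  # 330000
```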
But each of the factors on the right-hand side of the equation comes from a different data source, and only a careful choice of data sources leads to accurate results.
VEP was taken from a combination of data from Professor Michael McDonald’s “United States Election Project” and the AEI/CAP/Brookings “States of Change” project. We made a rough projection of the 2016 United States VEP using McDonald’s data. Specifically, McDonald’s VEP estimate grows by an average of 2.3 percent every two years from 2000 to 2014 with a standard deviation of 0.28 percent. So we took the 2014 VEP estimate, increased it by 2.3 percent and used that as our estimate of the 2016 VEP. This straight-line, simple method of projection is not the most sophisticated estimate of VEP, but on a short timeline (such as two years) it should give a good enough read on the country’s future VEP.
After that we used data provided to us by Bill Frey at Brookings to divide the 2016 total VEP into four racial and ethnic groups – non-Hispanic whites, African-Americans, Hispanic-Americans and Asian-Americans plus Others. Specifically, Frey provided RCP with estimates of the composition of the national and state-by-state electorate broken down by race (e.g. x percent of the national electorate will be non-Hispanic white in 2016, y percent will be African-American, etc.). Multiplying those percentages by the projected national VEP produced estimates for the total VEP of each racial group.
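The projection and the split by group can be sketched as follows. The 2.3 percent growth rate comes from the text; the 2014 VEP figure and the composition percentages are hypothetical placeholders, not McDonald's or Frey's numbers, and the function names are our own.

```python
BIENNIAL_GROWTH = 0.023  # average two-year VEP growth, 2000 to 2014 (from the text)


def project_vep(vep_2014: float, growth: float = BIENNIAL_GROWTH) -> float:
    """Straight-line projection: grow the 2014 VEP by one two-year step."""
    return vep_2014 * (1 + growth)


def split_by_group(total_vep: float, composition: dict) -> dict:
    """Multiply the total VEP by each group's share of the eligible population."""
    return {group: total_vep * share for group, share in composition.items()}


# Hypothetical inputs, for illustration only.
vep_2014 = 200_000_000
composition = {
    "non_hispanic_white": 0.70,
    "african_american": 0.12,
    "hispanic": 0.12,
    "asian_and_other": 0.06,
}

vep_2016 = project_vep(vep_2014)
group_vep = split_by_group(vep_2016, composition)
for group, vep in group_vep.items():
    print(group, round(vep))
```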
In order to get turnout numbers, we combined exit poll data, official election results and VEP data from “States of Change.” To demonstrate how this works, we’ll use the example of finding the percentage turnout for African-Americans in the 2012 election. In order to get this statistic it’s necessary to know 1) the total number of African-Americans who voted in 2012 and 2) the total number of African-Americans who were eligible to vote in 2012. To get the first quantity, we multiplied the total votes that Mitt Romney and President Obama received (we exclude third party candidates) by the percentage of the electorate identified as African-American in the national exit polls. To get the second quantity – the total number of eligible African-American voters – we multiplied McDonald’s estimate of the total VEP by the fraction of the VEP that was African-American according to “States of Change.” From here it should be apparent that dividing the first quantity by the second quantity – the total number of African-Americans who voted by the total number of eligible African-American voters – yields the turnout.
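Here is a small sketch of that turnout formula. The function and argument names are our own, and the inputs are round hypothetical numbers rather than the actual 2012 figures.

```python
def group_turnout(two_party_votes: float, exit_poll_share: float,
                  total_vep: float, vep_share: float) -> float:
    """Turnout = (group's voters) / (group's eligible voters).

    two_party_votes: total votes cast for the two major-party candidates
    exit_poll_share: group's share of the electorate in the exit polls
    total_vep:       total voting-eligible population
    vep_share:       group's share of the VEP
    """
    voters = two_party_votes * exit_poll_share
    eligible = total_vep * vep_share
    return voters / eligible


# Hypothetical inputs: 100M two-party votes, 10% of voters in the group,
# 200M eligible voters, 12.5% of them in the group.
turnout = group_turnout(100_000_000, 0.10, 200_000_000, 0.125)
print(f"{turnout:.1%}")  # 40.0%
```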
Those who are familiar with these datasets might suggest forgoing these calculations and simply using turnout statistics from the Census Bureau’s Current Population Survey Voting and Registration Supplement (CPS). The problem is that CPS only tracks whether a person voted, not who they voted for. And when CPS numbers are substituted into the main equation (VEP × turnout × vote split), the results don’t match up well with actual election returns. But when our more complex turnout formula is used, the final results track well with actual election results.
Astute observers will also notice that the “default” turnout numbers in the calculator match the 2012 CPS numbers. This is done so as not to confuse readers who are familiar with the CPS numbers and who believe them to be the true value for turnout. As soon as the user enters a turnout value into the calculator, the code shifts that value by the gap between the CPS default and the turnout value we calculated with our formula. For example, our non-Hispanic white turnout value is 57.8 percent for 2012 but the CPS turnout value is 64.1 percent, a gap of 6.3 percentage points. The user sees the 64.1 percent default, but the code works with the 57.8 percent value. So if the user changes the non-Hispanic white turnout value to 60.1 percent (a four percentage point drop), the code registers a drop of the same size and takes non-Hispanic white turnout to be 53.8 percent.
These small adjustments lead to odd-looking results when turnout nears zero or 100 percent (e.g. negative votes for a party). Since turnout will not be anywhere near zero or 100 percent for any group in the 2016 election, we simply tell the reader not to input values above 90 percent or below 10 percent and the interactive still serves its stated purpose.
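The displayed-versus-internal turnout adjustment can be sketched like this, using the non-Hispanic white figures quoted above. The function name is hypothetical, and the range check is only an illustration of the 10-to-90 percent guidance; the text does not say the interactive enforces it in code.

```python
CPS_DEFAULT = 64.1    # turnout shown to the user (2012 CPS, from the text)
MODEL_DEFAULT = 57.8  # turnout from the exit-poll-based formula (from the text)


def internal_turnout(displayed: float,
                     cps_default: float = CPS_DEFAULT,
                     model_default: float = MODEL_DEFAULT) -> float:
    """Apply the user's change from the CPS default to the model's default."""
    # Assumption: reject inputs outside the range readers are told to use.
    if not 10.0 <= displayed <= 90.0:
        raise ValueError("turnout should be between 10 and 90 percent")
    return model_default + (displayed - cps_default)


print(round(internal_turnout(60.1), 1))  # 53.8
```

Because the shift is a fixed number of points, a displayed value very close to zero could map to a negative internal turnout, which is where the odd-looking results come from.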
The third factor in the main equation is two-party vote share. We used exit polls from the Roper Center to get these values and we excluded third parties since they typically don’t make a big difference in election outcomes. For example, the exit polls showed that Romney took 27 percent of the Hispanic vote and Obama won 71 percent. The Republican share of the two-party vote would be calculated as 27/(27+71) = 27.6 percent. The Democratic share of the two-party vote is 100 – 27.6 = 72.4 percent.
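The two-party conversion can be checked with a short sketch. The function name is ours; the 27 and 71 percent inputs are the Hispanic exit poll figures quoted in the text.

```python
def two_party_shares(rep_pct: float, dem_pct: float) -> tuple:
    """Rescale raw exit poll percentages so the two major parties sum to 1."""
    rep = rep_pct / (rep_pct + dem_pct)
    return rep, 1 - rep


rep, dem = two_party_shares(27, 71)
print(f"{rep:.1%} {dem:.1%}")  # 27.6% 72.4%
```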
Once the right data is in place, only simple addition, subtraction, multiplication and division are needed to get the overall estimates of popular vote by racial group. The equations above are used to calculate the popular vote for each racial group. To get the overall popular vote, one only needs to add up the popular vote by race for all racial groups. Any other questions can be answered by looking at the update_pop() method within the code. The code is well-commented and we used intuitive variable names so it shouldn’t be too hard to understand what’s going on.
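To make the final summation concrete, here is a rough sketch of adding up the popular vote across groups. It is not the update_pop() method itself; the data layout, names and numbers are all hypothetical.

```python
def popular_vote(groups: dict) -> tuple:
    """Sum Republican and Democratic votes over all groups.

    Each group needs a VEP, a turnout rate and a Republican two-party share.
    """
    rep_total = 0.0
    dem_total = 0.0
    for g in groups.values():
        voters = g["vep"] * g["turnout"]
        rep_total += voters * g["rep_share"]
        dem_total += voters * (1 - g["rep_share"])
    return rep_total, dem_total


# Hypothetical two-group electorate.
groups = {
    "group_a": {"vep": 1000, "turnout": 0.5, "rep_share": 0.60},
    "group_b": {"vep": 400, "turnout": 0.5, "rep_share": 0.25},
}
rep_votes, dem_votes = popular_vote(groups)
print(round(rep_votes), round(dem_votes))  # 350 350
```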
Section 2: State-by-State Vote Data
The same simple equation – VEP multiplied by turnout and vote share – underlies the state-by-state calculations. One can easily imagine how, given the right data, it might be easy to calculate the state-by-state election results based on demographics. But we had to make some assumptions and fill in some blanks to get that data. We’ll take the election results equation (VEP × turnout × vote share) piece by piece to explain how we got what we needed.
Getting the VEP by state and race was relatively simple thanks to Frey’s “States of Change” data. For each state, Frey provided RCP with the projected percent of VEP for each racial group. We were also able to project out the total number of people in each demographic group for each state as of 2016. We measured the average percentage growth in VEP every two years from 2000 to 2014 using McDonald’s data. Applying that growth rate to the 2014 population gave estimates for the VEP of each state in 2016. Multiplying that total by the percentages from “States of Change” produced estimates for the size of each racial group in each state in 2016.
Turnout and vote share were somewhat trickier because not all states had exit polls in 2012. But by using 2012 and 2008 exit poll data, we were able to make a general guess of what the turnout and vote share by state might have looked like in 2012 and thus produce good “defaults” for each state. First, we found 2012 exit poll data for 30 states and essentially took it at face value. These polls broke the vote down by race, showing the percentage of the electorate of each race and how that group voted. There was some missing data within those exit polls (e.g. there were no estimates for what percentage of the Alabama electorate was Asian or how Hispanic/Latino Alabamians voted). When the percentage of the electorate was missing, we assumed that the demographic subgroup made up zero percent of the electorate. When vote share was missing, we assumed the demographic subgroup voted the same way that the group did nationally. For example, in Kansas, African-Americans made up 5 percent of the electorate, but the exit poll doesn’t report how they voted. So we assumed they followed the national average and voted for Obama over Romney, 93 percent to 6 percent.
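The two fill-in rules for missing exit poll data can be sketched as below. The function, the dictionary layout and the use of None for a missing value are assumptions made for illustration; the 93-to-6 national split and the Kansas example come from the text.

```python
from typing import Optional

# National 2012 exit poll split for African-Americans (from the text).
NATIONAL_BLACK_2012 = {"dem_pct": 93.0, "rep_pct": 6.0}


def fill_group(electorate_pct: Optional[float],
               dem_pct: Optional[float],
               rep_pct: Optional[float],
               national: dict) -> dict:
    """Fill gaps in one state/group exit poll cell."""
    # Rule 1: a missing share of the electorate is treated as zero.
    if electorate_pct is None:
        electorate_pct = 0.0
    # Rule 2: a missing vote split falls back to the national split.
    if dem_pct is None or rep_pct is None:
        dem_pct, rep_pct = national["dem_pct"], national["rep_pct"]
    return {"electorate_pct": electorate_pct, "dem_pct": dem_pct, "rep_pct": rep_pct}


# Kansas: 5 percent of the electorate, vote split not reported.
kansas_black = fill_group(5.0, None, None, NATIONAL_BLACK_2012)
print(kansas_black)
```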
For 16 other states, we took the 2008 exit poll numbers (applying the same changes: zeros for missing percentage of electorate values and 2008 national exit poll numbers for missing vote share values) and applied a “2012 swing” to them. This means that we took the 2012 and 2008 national exit poll numbers and found the difference between them. We then applied that difference to each of these other 16 states. For example, 9 percent of the electorate was Hispanic in 2008 and 10 percent was in 2012, so the 2012 adjustment for every state’s Hispanic percentage of the electorate was +1 percentage point.
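A minimal sketch of the “2012 swing” is below. The function name and the state-level starting value are hypothetical; the national 9 and 10 percent figures are the Hispanic electorate shares quoted above.

```python
def apply_2012_swing(state_2008: float, national_2008: float, national_2012: float) -> float:
    """Shift a state's 2008 value by the national change from 2008 to 2012."""
    return state_2008 + (national_2012 - national_2008)


# Hypothetical state where Hispanics were 4 percent of the 2008 electorate.
adjusted = apply_2012_swing(4.0, national_2008=9.0, national_2012=10.0)
print(adjusted)  # 5.0
```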
For three states, the 2008 exit poll numbers (again applying the same changes to deal with missing data) were used but no “2012 swing” was added.
For the two remaining states (for a total of 51 – 50 states plus D.C.), the 2008 numbers were again adjusted by the “2012 swing” but four additional points were added to the Republican share of the non-Hispanic white two-party vote.
There was no a priori reason to believe white voters were especially Republican in these two states (Wyoming and West Virginia). But these states were disproportionately white and the three previously mentioned methods of finding turnout and vote shares were overestimating the Democratic vote share when 2012 VEP statistics were plugged into the equation. So in order to get the states to more accurately match the true 2012 election results, we added more Republican votes into the largest portion of those states’ population – non-Hispanic whites. Adding Republican votes to the less populous racial/ethnic groups would have resulted in larger swings in how that group voted (e.g. a group that is 80 percent of the electorate only needs to move four points towards one party in order to move the election results 3.2 points, while a group that's 20 percent of the population would need to move 16 points towards some party to move the whole population 3.2 points).
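The arithmetic in that parenthetical is easy to verify. In this sketch the function name is ours; the 3.2-point target and the 80 and 20 percent group sizes are the ones used in the example.

```python
def group_shift_needed(target_shift_pts: float, group_share: float) -> float:
    """Points a group must move so that the whole electorate moves by the target."""
    return target_shift_pts / group_share


print(round(group_shift_needed(3.2, 0.80), 6))  # 4.0
print(round(group_shift_needed(3.2, 0.20), 6))  # 16.0
```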
To see which method was used to generate turnout and vote share for which states, simply open the underlying data and look for the “code” column. If code = 1, the 2012 exits were used; if code = 2, the 2008 exits were used but a “2012 swing” was applied; if code = 3, the 2008 exits were used without a 2012 swing; and if code = 4, then 2008 exits with a “2012 swing” were used but Republicans were given four extra points with non-Hispanic whites.
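For readers working with the data programmatically, the four codes could be captured in a simple lookup like the one below. The dictionary and function are hypothetical helpers written for this explanation; only the code values and their meanings come from the text.

```python
METHODS = {
    1: "2012 exit poll used directly",
    2: "2008 exit poll plus 2012 swing",
    3: "2008 exit poll, no 2012 swing",
    4: "2008 exit poll plus 2012 swing, plus four points to the "
       "Republican share of the non-Hispanic white two-party vote",
}


def describe_method(code: int) -> str:
    """Return a plain-language description of a value in the 'code' column."""
    return METHODS[code]


print(describe_method(2))  # 2008 exit poll plus 2012 swing
```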
After these initial calculations, eight states required minor tweaks. In Colorado, Kentucky, Minnesota, New Hampshire, South Dakota, Vermont, West Virginia and Wisconsin, turnout numbers for some of their less populous racial groups (typically African-Americans and/or Hispanics) were above 100 percent at their calculated 2012 default. This is one of the hazards of trying to work with multiple disparate datasets – occasionally weird results pop up. For each group that had a turnout over 100 percent, we increased their share of that state’s electorate by no more than two percentage points. This small change eliminated the problem without seriously warping final election results or producing an unreasonable scenario for that state.
The rules used to get turnout and vote share estimates are by nature somewhat ad hoc, so we used actual 2012 election results to make sure our data produced sensible results. Specifically, we used 2012 VEP statistics to calculate 2012 election results based only on our turnout and vote share estimates. Our estimates differed, on average, less than 1 percent from the actual state-by-state election results. Utah was discarded in this average because it was an outlier – the actual 2012 election result there was seven points more Republican than what our turnout and vote share numbers would have suggested. This makes sense – Utah’s turnout and vote share were calculated using 2008 numbers that were adjusted by a national 2012 swing. Romney, a Mormon, likely had something akin to a home state advantage in heavily Mormon Utah – something the 2008 exit polls and 2012 national exit polls wouldn’t have picked up on.
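The check described above amounts to an average absolute error with one outlier set aside. Here is a rough sketch; the state labels and numbers are hypothetical placeholders, not our actual estimates or the actual 2012 returns.

```python
def mean_abs_error(predicted: dict, actual: dict, exclude=()) -> float:
    """Average absolute gap between predicted and actual results, in points."""
    errors = [abs(predicted[s] - actual[s]) for s in predicted if s not in exclude]
    return sum(errors) / len(errors)


# Hypothetical Republican two-party shares, in percent.
predicted = {"State A": 51.0, "State B": 47.5, "Outlier": 68.0}
actual = {"State A": 51.5, "State B": 47.0, "Outlier": 75.0}

print(mean_abs_error(predicted, actual, exclude={"Outlier"}))  # 0.5
```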
Section 3: State-by-state Vote Calculations
There are two important issues in calculating the election result in the states. The first is how to pass national swings down into 51 subdivisions (states plus D.C.). The second is how to deal with increasing or decreasing the national turnout and vote share for a group after a state already has zero or full turnout or vote share.
For the sake of simplicity, changes in national turnout levels or vote share are distributed evenly across states. In other words, if the user increases the Hispanic turnout from its default level of 48 percent to 51 percent, then the Hispanic turnout in each state also increases three percentage points. There are other ways to distribute changes like this, but uniform swing has the advantage of transparency (readers will have a good understanding of what happens in the code) while maintaining plausibility.
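Uniform swing is simple enough to sketch in a few lines. The function name and the state-level values are hypothetical; the 48-to-51 percent change is the Hispanic turnout example from the text.

```python
def uniform_swing(state_values: dict, national_default: float, national_new: float) -> dict:
    """Add the national change, in percentage points, to every state's value."""
    delta = national_new - national_default
    return {state: value + delta for state, value in state_values.items()}


# Hypothetical state-level Hispanic turnout defaults, in percent.
state_turnout = {"State A": 40.0, "State B": 55.0}
swung = uniform_swing(state_turnout, national_default=48.0, national_new=51.0)
print(swung)  # {'State A': 43.0, 'State B': 58.0}
```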
The other issue is how to deal with situations in which one state hits a floor or ceiling in turnout or vote share but another does not. For example, suppose a user hugely increases the Republican share of the non-Hispanic white vote and Republicans get 110 percent of the non-Hispanic white vote in Alabama. Obviously Alabama’s white vote can’t go more than 100 percent for either party, but the user is still changing the national popular vote – so that extra 10 percent of Alabama’s white electorate has to show up somewhere else.
To deal with this, we redistributed both turnout and vote share based on that group’s population size in each state. That extra 10 percent of Alabama’s white Republican vote would be distributed to the rest of the states in which Republicans won less than 100 percent of the non-Hispanic white vote. The other states would take on those extra white Republican voters at the same rate, but Texas or California would take on more total voters than a state with fewer total non-Hispanic white voters, such as Alaska or West Virginia.
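One way to read that redistribution rule is sketched below. This is our interpretation for illustration, not the interactive's actual code: it handles only the ceiling case, the function name is hypothetical, and the voter counts are made up. Votes above the cap are pooled and handed to the uncapped states as an equal bump in share, so states with more voters in the group absorb more of the extra votes.

```python
def redistribute_overflow(shares: dict, voters: dict, cap: float = 1.0) -> dict:
    """Cap each state's share and spread the excess votes over uncapped states."""
    shares = dict(shares)  # work on a copy
    while True:
        over = [s for s in shares if shares[s] > cap]
        if not over:
            return shares
        # Votes that cannot exist in the capped states.
        excess_votes = sum((shares[s] - cap) * voters[s] for s in over)
        for s in over:
            shares[s] = cap
        open_states = [s for s in shares if shares[s] < cap]
        open_voters = sum(voters[s] for s in open_states)
        if open_voters == 0:
            return shares  # nowhere left to place the excess
        bump = excess_votes / open_voters  # same rate for every open state
        for s in open_states:
            shares[s] += bump


# Hypothetical Republican shares of the white vote after a large swing,
# and hypothetical counts of white voters in each state.
shares = {"Alabama": 1.10, "Texas": 0.80, "Alaska": 0.70}
voters = {"Alabama": 1000, "Texas": 4000, "Alaska": 1000}
result = redistribute_overflow(shares, voters)
print({s: round(v, 2) for s, v in result.items()})
# {'Alabama': 1.0, 'Texas': 0.82, 'Alaska': 0.72}
```

In this example the 100 excess Alabama votes become a two-point bump elsewhere: 80 votes land in Texas and 20 in Alaska, and the national total is unchanged.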
Section 4: Data Sources – Links
McDonald’s United States Elections Project
State-by-State 2012 and 2008 exit polls