Chapter 5 Data Wrangling: Cleaning and Transforming Data

Chapter 5 reviews functions for manipulating raw data to prepare it for analysis or visualization.

Learning objectives and chapter resources

By the end of this chapter, you should be able to (1) apply key functions to sort, filter, arrange, and restructure data tables for analysis and presentation; (2) use these techniques to process county-level voting returns from an official data source; (3) create clean data tables for visualization; and (4) document the steps involved in transforming raw data into a usable format. The material in the chapter requires the dplyr (Wickham, François, et al. 2023), ggplot2 (Wickham, Chang, et al. 2023), and tibble (Müller and Wickham 2023) packages, which should already be installed as part of the tidyverse. The material uses the following datasets: house_memb_117.csv and 2016GEN_MI_CENR_BY_COUNTY.xls from https://faculty.gvsu.edu/kilburnw/inpolr.html.

5.1 Wrangling and tidying data

Data is rarely analysis-ready or machine readable when obtained from its source. Transforming raw data into a format suitable for analysis — known as “data wrangling” — is a critical step in nearly any research project. And just as important is having code to make the process reproducible, both for transparency and for repeating the steps in the event of an error. This chapter provides tools to take messy or unwieldy data tables and to clean, structure, extract, and prepare data for visualization and further analysis. It is organized into three sections:

First, in brief we will learn about the use of indices to subset data tables by rows and columns. While these are essential to learn, the dplyr package provides a more flexible set of functions for sequencing operations into others such as filtering and arranging rows and columns. Second, we will learn about the core functions of the dplyr package, displayed by function name in Table ?? followed by a brief explanation. We will learn to use these functions with the %>% ‘pipe’ operator, for combining functions by passing the output of one to another. Third, we will apply these skills to cleaning and preparing for visualization Michigan County level voting records from an official source.

Attach the tidyverse to access functions from the packages dplyr and ggplot2.

library(tidyverse)

Save the two Chapter datafiles to your working directory; use the Session menus to change it as needed. Use read_csv() to read in house_memb_117.csv and store it as the data object house_memb_117:

house_memb_117<-read_csv(file="house_memb_117.csv")

The result, house_memb_117, is stored as a tibble, a particular format for printing a dataframe to the Console. Enter house_memb_117 to observe it. The file house_memb_117.csv is a table of characteristics of representatives to the 117th session of the U.S. House of Representatives, from the canonical source of data on members of Congress and their votes, Voteview.com (Lewis et al. 2023). The file consists of 455 rows and 9 columns. (See the Appendix for details.) Apart from identifiers, the data includes two measures of legislator ideology, nominate_dim1 and nominate_dim2 (K. T. Poole 2005; K. T. Poole and Rosenthal 2006). Developed by the political scientist Keith Poole, ‘NOMINATE’ (an acronym) scores estimate ideology on the principal line of political conflict in Congress, an economic liberalism-conservatism (dim1) and a second (dim2) capturing the legislator’s position on the remaining bundle of issues within a particular session of Congress. See Everson et al. (2016) for an explanation of the meaning and estimation of the scores. Given the file, imagine that we are interested in extracting observations on legislator ideology by State or party caucus, or perhaps sort the legislators by ideology.

Selecting rows and columns with indices

Indices are a fundamental way to select specific rows and columns in a dataframe. Below, the code selects four columns (state_abbrev, party_code, bioname, and nominate_dim1) from house_memb_117:

house_memb_117[, c(4, 5, 6, 8) ]
## # A tibble: 455 × 4
##    state_abbrev party_code bioname              nomin…¹
##    <chr>        <chr>      <chr>                  <dbl>
##  1 AL           Republican ROGERS, Mike Dennis    0.363
##  2 AL           Democratic SEWELL, Terri         -0.396
##  3 AL           Republican BROOKS, Mo             0.652
##  4 AL           Republican PALMER, Gary James     0.678
##  5 AL           Republican CARL, Jerry L.         0.52 
##  6 AL           Republican MOORE, Barry           0.642
##  7 AL           Republican ADERHOLT, Robert       0.386
##  8 AK           Republican YOUNG, Donald Edwin    0.283
##  9 AK           Democratic PELTOLA, Mary Sattl…  -0.156
## 10 AS           Republican RADEWAGEN, Aumua Am…   0.336
## # … with 445 more rows, and abbreviated variable name
## #   ¹​nominate_dim1

Indices specify positions within the dataframe using the format [r, c], where r represents rows, and c represents columns. When left blank, all rows or columns are selected. For example, house_memb_117[, 6] selects all rows from the sixth column, bioname. Moving the number 6 to the r position selects all columns for the sixth observation, house_memb_117[6, ]. Multiple columns can be selected with the c() function: house_memb_117[, c(1, 2, 3, 4, 5)] selects columns 1 through 5, while house_memb_117[, c(1:5)] achieves the same result with the sequence 1 through 5 specified by 1:5.

In the example above, house_memb_117[, c(4, 5, 6, 8)] returns all rows but only columns 4, 5, 6, and 8. When executed, the result is displayed as a tibble, with a header # A tibble: 455 × 4, showing that the dataset contains 455 rows and 4 columns. The first 10 rows are displayed along with column names, types (<chr> for character, <dbl> for numeric, <fct> for factor), and a comment indicating the remaining rows: # … with 445 more rows.

5.2 Functions for organizing rows and columns

Indices are essential for working with dataframes. The tools from dplyr are different: rather than directly identifying rows and columns with indices to specify a position, dplyr works with functions. As individual functions, each follows a clear pattern. The functions require first the name of a dataset, in this case, house_memb_117, followed by a set of conditions that select rows or columns of the dataset. The result is typically returned as a tibble. Below are examples of key dplyr functions:

Highlight a row of the dataset: filter()

The filter() function selects rows of a data table based on row values. For example, if we wanted to view ideology scores on the elected representatives from New York:13

filter(house_memb_117, state_abbrev=="NY")
## # A tibble: 29 × 4
##    state_abbrev party_code bioname              nomin…¹
##    <chr>        <chr>      <chr>                  <dbl>
##  1 NY           Democratic HIGGINS, Brian        -0.348
##  2 NY           Democratic CLARKE, Yvette Diane  -0.61 
##  3 NY           Democratic TONKO, Paul           -0.418
##  4 NY           Republican REED, Thomas W. II     0.269
##  5 NY           Democratic MENG, Grace           -0.379
##  6 NY           Democratic JEFFRIES, Hakeem      -0.489
##  7 NY           Democratic MALONEY, Sean Patri…  -0.239
##  8 NY           Republican ZELDIN, Lee M          0.397
##  9 NY           Democratic RICE, Kathleen Maura  -0.28 
## 10 NY           Republican STEFANIK, Elise M      0.269
## # … with 19 more rows, and abbreviated variable name
## #   ¹​nominate_dim1

Notice in the printout, the list of representative names followed by the ideology score, ranging from (-1 for more liberal) to (+1 for more conservative). For example, current House minority leader Representative Hakeem Jeffries is scored a -.489, while Representative Elise Stefanik is scored a .269.

Saving data subsets

The filter() function, like other dplyr functions, does not save a new dataset. Instead, it only returns the lines of the dataframe that match the conditions, printing the dataframe out as a tibble.

To save a filtered dataset, such as a New York subset of House member data, we use the assignment operator. To create a data table or ‘tibble’ NY and assign it the contents of the filter:

NY<-filter(house_memb_117, state_abbrev=="NY")

The result is a dataframe consisting of New York legislators. While the actual dataset includes all 9 variables, subscripts could select just the party, name, and ideology score for printing: party_code, bioname, nominate_dim1: [, c(5,6,8)]:

NY[, c(5,6,8)]
## # A tibble: 29 × 3
##    party_code bioname               nominate_dim1
##    <chr>      <chr>                         <dbl>
##  1 Democratic HIGGINS, Brian               -0.348
##  2 Democratic CLARKE, Yvette Diane         -0.61 
##  3 Democratic TONKO, Paul                  -0.418
##  4 Republican REED, Thomas W. II            0.269
##  5 Democratic MENG, Grace                  -0.379
##  6 Democratic JEFFRIES, Hakeem             -0.489
##  7 Democratic MALONEY, Sean Patrick        -0.239
##  8 Republican ZELDIN, Lee M                 0.397
##  9 Democratic RICE, Kathleen Maura         -0.28 
## 10 Republican STEFANIK, Elise M             0.269
## # … with 19 more rows

Note that inserting parentheses around the entire expression results in both saving the dataframe to a new object and returning the lines that fit the conditions. So the line (NY<-filter(house_memb_117, state_abbrev=="NY")) will both save the contents of filter() to a new dataframe, NY, and will return the result.

Filtering rows by condition

Filtering and other applicable functions, allow you to use familiar comparison operators. While we used == equal to in the prior examples , other operators are less than or equal to < = , greater than or equal to > = , not equal to !=, and of course less than < and greater than >. Other functions can be combined with filter(), such as is.na() for identifying observations with missing values; filter(house_memb_117, is.na(born)) would find any row with a missing birth year.

We can use simple Boolean logic to construct filter() functions that satisfy multiple conditions. For example, when conditions in filter() are listed sequentially, the result of the filter operation is the dataset when all conditions are evaluated as true. So the two statements below are equivalent:

filter(house_memb_117, state_abbrev=="NY" , born > 1969)
filter(house_memb_117, state_abbrev=="NY" & born > 1969)

These statements query for representatives who are from NY and born after 1969. The symbol for ‘and’ in the line above means that only for those lines when the two conditions state_abbrev=="NY" & born > 1969 are true does the filter() function return a line of the dataset. The symbol | as an operator for ‘or’, substituted for &, would return all NY representatives or those born after 1969, regardless of State.

Combining the two, we can find the names of representatives from New York and New Jersey that fit the two conditions.

filter(house_memb_117, state_abbrev=="NY" | 
                       state_abbrev=="NJ", born > 1969)
## # A tibble: 17 × 9
##    chamber icpsr distri…¹ state…² party…³ bioname  born
##    <chr>   <dbl>    <dbl> <chr>   <chr>   <chr>   <dbl>
##  1 House   21723        5 NJ      Democr… GOTTHE…  1975
##  2 House   21937        3 NJ      Democr… KIM, A…  1982
##  3 House   21964       11 NJ      Democr… SHERRI…  1972
##  4 House   21101       23 NY      Republ… REED, …  1971
##  5 House   21342        6 NY      Democr… MENG, …  1975
##  6 House   21343        8 NY      Democr… JEFFRI…  1970
##  7 House   21539        1 NY      Republ… ZELDIN…  1980
##  8 House   21541       21 NY      Republ… STEFAN…  1984
##  9 House   21916       19 NY      Democr… DELGAD…  1977
## 10 House   21949       14 NY      Democr… OCASIO…  1989
## 11 House   22105       16 NY      Democr… BOWMAN…  1976
## 12 House   22117        2 NY      Republ… GARBAR…  1984
## 13 House   22127       17 NY      Democr… JONES,…  1987
## 14 House   22133       11 NY      Republ… MALLIO…  1980
## 15 House   22154       15 NY      Democr… TORRES…  1988
## 16 House   22169       19 NY      Democr… RYAN, …  1982
## 17 House   22170       23 NY      Republ… SEMPOL…  1983
## # … with 2 more variables: nominate_dim1 <dbl>,
## #   nominate_dim2 <dbl>, and abbreviated variable
## #   names ¹​district_code, ²​state_abbrev, ³​party_code

A tempting shortcut might be to think you can enter the function as state_abbrev=="NY" | "NJ", but that would be incorrect. Instead, to avoid re-typing multiple state_abbrev== statements, use the “in” operator %in%, as in the following structure:

filter(house_memb_117, state_abbrev %in% c("NY", "NJ"), born > 1969)

Arranging rows

The arrange() function is used to arrange rows, or change the order in which rows are displayed, rather than select particular rows. With the nominate_dim1 scores, the most obvious use would be to find the most liberal or conservative representatives. For example, arrange(house_memb_117, nominate_dim1) would display, in ascending order, the rows of the dataset starting with the lowest (or most liberal) value. To reverse the order to descending, use desc().

## # A tibble: 455 × 4
##    state_abbrev party_code bioname              nomin…¹
##    <chr>        <chr>      <chr>                  <dbl>
##  1 NM           Republican HERRELL, Yvette        0.936
##  2 SC           Republican NORMAN, Ralph          0.848
##  3 AZ           Republican BIGGS, Andrew S.       0.833
##  4 GA           Republican CLYDE, Andrew S.       0.823
##  5 IL           Republican MILLER, Mary E.        0.806
##  6 GA           Republican GREENE, Marjorie Ta…   0.8  
##  7 TX           Republican ROY, Charles           0.8  
##  8 VA           Republican GOOD, Bob              0.8  
##  9 GA           Republican HICE, Jody Brownlow    0.797
## 10 TX           Republican JACKSON, Ronny         0.773
## # … with 445 more rows, and abbreviated variable name
## #   ¹​nominate_dim1
arrange(house_memb_117, desc(nominate_dim1))

The tibble, indexed to house_memb_117[, c(4, 5, 6, 8)], displays the top ten most conservative lawmakers, at the top Rep. Yvette Herrell (R-NM), followed by some familiar names.

Selecting columns

Without indices, the tibble sorted on ideology displayed all columns by default. While indexing with house_memb_117[, c(5, 6, 8)] selects columns by their numeric positions, select() makes it clearer and more intuitive by referencing the column names directly. For example, the following function selects three columns from the dataset for display:

select(house_memb_117, party_code, bioname, nominate_dim1)
## # A tibble: 455 × 3
##    party_code bioname                        nominate…¹
##    <chr>      <chr>                               <dbl>
##  1 Republican ROGERS, Mike Dennis                 0.363
##  2 Democratic SEWELL, Terri                      -0.396
##  3 Republican BROOKS, Mo                          0.652
##  4 Republican PALMER, Gary James                  0.678
##  5 Republican CARL, Jerry L.                      0.52 
##  6 Republican MOORE, Barry                        0.642
##  7 Republican ADERHOLT, Robert                    0.386
##  8 Republican YOUNG, Donald Edwin                 0.283
##  9 Democratic PELTOLA, Mary Sattler              -0.156
## 10 Republican RADEWAGEN, Aumua Amata Coleman      0.336
## # … with 445 more rows, and abbreviated variable name
## #   ¹​nominate_dim1

This use of select() is equivalent to `house_memb_117[, c(5,6,8)]. Or to display a sequential order of columns, separate the columns with a colon: select(house_memb_117, party_code:nominate_dim1). Adding the function everything() to the end of the function will simply place the remaining variables at the end of the dataframe, select(house_memb_117, party_code:nominate_dim1, everything()).

Renaming columns

To rename a column of a dataframe, use the rename function. In the rename() function, the new variable name is listed first, followed by the old variable name. So to rename nominate_dim1 as left_right_ideology:

rename(house_memb_117, left_right_ideology = nominate_dim1 )
## # A tibble: 455 × 3
##    party_code bioname                        left_rig…¹
##    <chr>      <chr>                               <dbl>
##  1 Republican ROGERS, Mike Dennis                 0.363
##  2 Democratic SEWELL, Terri                      -0.396
##  3 Republican BROOKS, Mo                          0.652
##  4 Republican PALMER, Gary James                  0.678
##  5 Republican CARL, Jerry L.                      0.52 
##  6 Republican MOORE, Barry                        0.642
##  7 Republican ADERHOLT, Robert                    0.386
##  8 Republican YOUNG, Donald Edwin                 0.283
##  9 Democratic PELTOLA, Mary Sattler              -0.156
## 10 Republican RADEWAGEN, Aumua Amata Coleman      0.336
## # … with 445 more rows, and abbreviated variable name
## #   ¹​left_right_ideology

As in the other functions, for the result to be saved to a dataset, the assignment operator needed to either overwrite the existing dataset or save a new version of it, as in house_memb_117<- rename(house_memb_117, left_right_ideology = nominate_dim1 ). Multiple renames can be combined, such as rename(house_memb_117, left_right_ideology = nominate_dim1, state=state_abbrev), to rename the ideology and state name.

Creating new columns

The mutate() function allows you to create new variables as combinations of other variables. For example, an age in years of each representative in the year 2022 could be calculated from born. With mutate() we name the new variable age and the formula for calculating it, which is the value 2022 minus the variable born, appended as the last column in the data:

mutate(house_memb_117, age = 2022-born) 
## # A tibble: 455 × 5
##    state_abbrev party_code bioname           born   age
##    <chr>        <chr>      <chr>            <dbl> <dbl>
##  1 AL           Republican ROGERS, Mike De…  1958    64
##  2 AL           Democratic SEWELL, Terri     1965    57
##  3 AL           Republican BROOKS, Mo        1954    68
##  4 AL           Republican PALMER, Gary Ja…  1954    68
##  5 AL           Republican CARL, Jerry L.    1958    64
##  6 AL           Republican MOORE, Barry      1966    56
##  7 AL           Republican ADERHOLT, Robert  1965    57
##  8 AK           Republican YOUNG, Donald E…  1933    89
##  9 AK           Democratic PELTOLA, Mary S…  1973    49
## 10 AS           Republican RADEWAGEN, Aumu…  1947    75
## # … with 445 more rows

There are many useful functions within mutate(). One is the ranking function.

We might be interested, for example, in ranking representatives in ideology. The min_rank() function will do this. By default, it assigns smallest values the smallest ranks; with the desc() function, we can assign smaller ranks to higher values. For example, assigning smallest ranks to more conservative (higher ideology scores):

mutate(house_memb_117, conservative_rank = 
         min_rank(desc(nominate_dim1)))

The result is a table displaying the most conservative legislators, rank-ordered from most conservative starting at rank position 1. Further limiting the display of columns to fit on the printed page results in:

## # A tibble: 455 × 4
##    bioname                         born nomin…¹ conse…²
##    <chr>                          <dbl>   <dbl>   <int>
##  1 ROGERS, Mike Dennis             1958   0.363     178
##  2 SEWELL, Terri                   1965  -0.396     349
##  3 BROOKS, Mo                      1954   0.652      41
##  4 PALMER, Gary James              1954   0.678      31
##  5 CARL, Jerry L.                  1958   0.52      103
##  6 MOORE, Barry                    1966   0.642      48
##  7 ADERHOLT, Robert                1965   0.386     170
##  8 YOUNG, Donald Edwin             1933   0.283     207
##  9 PELTOLA, Mary Sattler           1973  -0.156     227
## 10 RADEWAGEN, Aumua Amata Coleman  1947   0.336     188
## # … with 445 more rows, and abbreviated variable names
## #   ¹​nominate_dim1, ²​conservative_rank

Without the desc() function, min_rank() assigns higher ranks to smaller values.

Saving new datasets

Without an assignment operator, the functions such as mutate() and select() will not alter the dataset saved within the Environment pane. To create a subsetted data table (or to overwrite the existing data table), the assignment operator is necessary. With the assignment operator for the example with age, the variable is written to the end of the dataframe. For example the following lines add the variable age to the dataframe:

house_memb_117<-mutate(house_memb_117, age = 2022-born) 

To save only a subset of variables – including the newly created age variable – we would use the select() function to identify specific variables. For example the two lines below create the age variable with mutate(), appending age to house_memb_117_subset, which is a subset of the five variables identified in the next line, saved again as house_memb_117_subset.

house_memb_117_subset<-mutate(house_memb_117, age = 2022-born) 
house_memb_117_subset<-select(house_memb_117_subset, state_abbrev, 
                              party_code, bioname, born, age)

The new subset data table contains only the five variables:

house_memb_117_subset
## # A tibble: 455 × 5
##    state_abbrev party_code bioname           born   age
##    <chr>        <chr>      <chr>            <dbl> <dbl>
##  1 AL           Republican ROGERS, Mike De…  1958    64
##  2 AL           Democratic SEWELL, Terri     1965    57
##  3 AL           Republican BROOKS, Mo        1954    68
##  4 AL           Republican PALMER, Gary Ja…  1954    68
##  5 AL           Republican CARL, Jerry L.    1958    64
##  6 AL           Republican MOORE, Barry      1966    56
##  7 AL           Republican ADERHOLT, Robert  1965    57
##  8 AK           Republican YOUNG, Donald E…  1933    89
##  9 AK           Democratic PELTOLA, Mary S…  1973    49
## 10 AS           Republican RADEWAGEN, Aumu…  1947    75
## # … with 445 more rows

5.3 Combining functions

Often in managing unwieldy datasets, we combine different functions to accomplish a series of steps, such as mutating or creating a new variable, selecting columns, filtering the order of rows, and printing out or saving a subset. Doing so in an efficient way, however, requires us to combine multiple functions in a sequential order. The “pipe operator”, the characters %>% (Bache and Wickham 2020), allows multiple functions to be chained together.

Conceptually, the pipe operator translates to “and then”. So for example, we could filter a dataset, and then select particular rows, and then arrange the rows. The pipe operator simplifies the process of combining functions in a highly readable way, sequential steps separated by %>%. As a core element of the tidyverse, the piper operator is available once dplyr or the entire tidyverse is attached. It is worth noting that a recent update to R added a native pipe, |>, which works nearly identically, although the code in this text relies on %>%. Line by line we will use %>% to combine functions to do the following:

  1. Filter on representatives from New York.
  2. Select the columns for party_code, bioname, and nominate_dim1.
  3. Arrange the dataframe rows by conservatism, from most conservative in row 1 to least at the end.
  4. Print out the data table.

To do so, we start with the house_memb_117 dataframe and then add the functions line by line, adding %>% until the last command:

house_memb_117 %>%
  filter(state_abbrev=="NY") %>%
  select(party_code, bioname, nominate_dim1) %>%
  arrange(desc(nominate_dim1))
## # A tibble: 29 × 3
##    party_code bioname              nominate_dim1
##    <chr>      <chr>                        <dbl>
##  1 Republican SEMPOLINSKI, Joseph          0.545
##  2 Republican TENNEY, Claudia              0.453
##  3 Republican ZELDIN, Lee M                0.397
##  4 Republican JACOBS, Chris                0.306
##  5 Republican MALLIOTAKIS, Nicole          0.304
##  6 Republican GARBARINO, Andrew R.         0.274
##  7 Republican REED, Thomas W. II           0.269
##  8 Republican STEFANIK, Elise M            0.269
##  9 Republican KATKO, John                  0.185
## 10 Democratic RYAN, Patrick               -0.206
## # … with 19 more rows

Notice that at the end of the tibble, a comment states there are 19 more rows, in addition to the printed 10, for a total of 29 rows and 3 columns. To print the entire tibble (all of the rows) we add print(n=39) as an additional row. Below the print() function prints the top five:

house_memb_117 %>%
  filter(state_abbrev=="NY") %>%
  select(party_code, bioname, nominate_dim1) %>%
  arrange(desc(nominate_dim1)) %>%
  print(n=5)
## # A tibble: 29 × 3
##   party_code bioname             nominate_dim1
##   <chr>      <chr>                       <dbl>
## 1 Republican SEMPOLINSKI, Joseph         0.545
## 2 Republican TENNEY, Claudia             0.453
## 3 Republican ZELDIN, Lee M               0.397
## 4 Republican JACOBS, Chris               0.306
## 5 Republican MALLIOTAKIS, Nicole         0.304
## # … with 24 more rows

Notice that in each of the functions, such as filter(), it is not necessary to include the name of the dataframe. Through the pipe operator, the dataframe to the left of the first pipe operator house_memb_117 %>% is passed on to each of the functions below it. To save the table to a dataframe, for example named ny_reps, we would change the first line to _ny_reps<-house_memb_117 %>%.

Grouped summary statistics

It is often useful to sort through data by groups within it. The group_by() function, sets a grouping structure for the dataframe that is taken into account when functions like are subsequently applied. It is often useful in combination with the summarize() function, for calculating summary statistics on groups. For example, we may be interested in average ideological orientation of State delegations, or averages within parties within States. Or perhaps we would like to count the number of representatives from each State, by political party.

We would first group the data with group_by() then apply summarize(). Because the summarize() function reduces the data down to a single summary statistic, it requires both a name of the new summary statistic in the data table, and a function to calculate the summary statistic, such as mean(), sum(), range(), max() or min().

For example, in summarize(mean_ideology = mean(nominate_dim1)), mean_ideology will be the column name in the summary dataframe, and mean(nominate_dim1) is the function that calculates the average ideology score.

Let’s observe how it works within partisan groups of New York legislators.

house_memb_117 %>%
  filter(state_abbrev=="NY") %>%
  group_by(party_code) %>%
  summarize(mean_ideology = mean(nominate_dim1))
## # A tibble: 2 × 2
##   party_code mean_ideology
##   <chr>              <dbl>
## 1 Democratic        -0.381
## 2 Republican         0.334

The summarize() function reduces the dataframe to a set of means, ideology by party group. The average Democrat and Republican representative from New York is roughly similar, although the mean Democratic ideology is slightly larger in magnitude than Republican.

The summarize() function can take multiple arguments, allowing you to calculate several summary statistics at once. For example, we could calculate both the mean and standard deviation within parties:

house_memb_117 %>%
  filter(state_abbrev=="NY") %>%
  group_by(party_code) %>%
  summarize(
    mean_ideology = mean(nominate_dim1),
    sd_ideology = sd(nominate_dim1))
## # A tibble: 2 × 3
##   party_code mean_ideology sd_ideology
##   <chr>              <dbl>       <dbl>
## 1 Democratic        -0.381       0.123
## 2 Republican         0.334       0.111

The sd_ideology expression follows after mean_ideology. Note that in these examples of calculating summary statistics, there are no missing values in the data. Where there are missing values R requires specific instructions on what to do with the missing values, in this case it would be to remove missing values from any calculations. For example, the statement for the standard deviation would need to be sd_ideology = sd(nominate_dim1, na.rm = TRUE).

One additional useful function to add to summarize() is n(), which does not require any arguments. The function will count up the number of rows defined by the group_by() function. For example, to add counts of the party_code groups:

house_memb_117 %>%
  filter(state_abbrev=="NY") %>%
  group_by(party_code) %>%
  summarize(
    mean_ideology = mean(nominate_dim1),
    sd_ideology = sd(nominate_dim1),
    count_n = n())
## # A tibble: 2 × 4
##   party_code mean_ideology sd_ideology count_n
##   <chr>              <dbl>       <dbl>   <int>
## 1 Democratic        -0.381       0.123      20
## 2 Republican         0.334       0.111       9

Overall, with a few functions, starting with those listed in Table ??, datasets can be transformed in nearly limitless ways. The following section demonstrate the use of dplyr functions for shaping up raw datasets, wrangling and cleaning data for use in analysis.

5.4 Cleaning and transforming county-level votes

Another important application of data wrangling is in creating tidy datasets from a source, such as a US Secretary of State’s election portal, aggregated to a geographic level. In this example, we look at election results from the November 2016 general election, aggregated to the Michigan County level. We will extract County-level presidential votes and organize a tidy dataset. From there, we will construct visualizations of the election results across Michigan counties with a specific data visualization, a “Cleveland dotplot” (Cleveland and McGill 1984).

Working with county-level election results

Save the 2016GEN_MI_CENR_BY_COUNTY.xls datafile to your working directory; use the Session menus to change it.14

The data is imported with the read_tsv() (tab-separated value) function from the dplyr package. Place the file in your working directory and run the function. The datafile from the Secretary of State’s Office (2016GEN_MI_CENR_BY_COUNTY.xls) has the file extension .xls, suggesting it is a Microsoft Excel formatted file, but it is actually saved as a tab delimited text file:

MI_2016<-read_tsv(file="2016GEN_MI_CENR_BY_COUNTY.xls")

When you import the data, you will see a list of columns ‘parsed’ from the data, starting with the ElectionDate and ending with `Nominated(N)/Elected(E). Each row of the dataset corresponds to a single candidate for any elected office appearing on a ballot within each County. So for each County, there are as many rows as there are candidates appearing on the ballot for election to any office, which is why there are 6,082 rows:

MI_2016
## # A tibble: 6,082 × 19
##    ElectionDate Offic…¹ Distr…² Statu…³ Count…⁴ Count…⁵
##    <chr>        <chr>   <chr>   <chr>     <dbl> <chr>  
##  1 2016-11-08   01      00000   0             1 ALCONA 
##  2 2016-11-08   01      00000   0             1 ALCONA 
##  3 2016-11-08   01      00000   0             1 ALCONA 
##  4 2016-11-08   01      00000   0             1 ALCONA 
##  5 2016-11-08   01      00000   0             1 ALCONA 
##  6 2016-11-08   01      00000   0             1 ALCONA 
##  7 2016-11-08   01      00000   0             1 ALCONA 
##  8 2016-11-08   01      00000   0             1 ALCONA 
##  9 2016-11-08   01      00000   0             1 ALCONA 
## 10 2016-11-08   01      00000   0             1 ALCONA 
## # … with 6,072 more rows, 13 more variables:
## #   OfficeDescription <chr>, PartyOrder <chr>,
## #   PartyName <chr>, PartyDescription <chr>,
## #   CandidateID <dbl>, CandidateLastName <chr>,
## #   CandidateFirstName <chr>,
## #   CandidateMiddleName <chr>,
## #   CandidateFormerName <lgl>, CandidateVotes <dbl>, …

We will extract the vote totals within each county for the presidential election, and then organize a Michigan County level presidential elections dataset, to visualize candidate support across counties. So for this purpose the file from the State includes mostly rows of irrelevant election results. So the first step is to delete all rows that do not relate to the presidential election. We use the filter() function to select only the results — the rows — for the presidential election. The OfficeDescription field in the dataset contains indicators for different elections. Open the data spreadsheet in the Environment pane to review the offices. Or table out the office descriptions with table(MI_2016$OfficeDescription).

The presidential election results are stored as President of the United States 4 Year Term (1) Position, in the first set of counties. The other offices follow. To find any set of elections, such as all results for U.S. House elections, we could use a search function to look for all instances of the term Congress in the OfficeDescription field. For example, the function str_detect() will find and return the locations in the dataset of a search string. For example, to find the rows where “Congress” appears, we could enter str_detect(MI_2016$OfficeDescription, "Congress"). This function would give us a result, either TRUE or FALSE for each row. Let’s see what that summary would look like:

summary(str_detect(MI_2016$OfficeDescription, "Congress"))
##    Mode   FALSE    TRUE    NA's 
## logical    5651     430       1

It shows that 430 rows reference a race for Congress. If we wanted to select only these rows, we would use str_detect() inside the filter() function. These two lines subset the MI_2016 dataset by filtering the OfficeDescription field to select only the rows where the term “Congress” appears: MI_2016 %>% filter(str_detect(OfficeDescription, "Congress")).

Following this example, we filter on the presidential election and save a new version of the data – MI_prez_2016 with just the variables we needed to analyze vote returns over Counties: CountyName, CandidateLastName, CandidateVotes, CountyCode. CountyCode is a Federal Information Processing Standard (FIPS) indicator for each County15

MI_prez_2016<-MI_2016 %>%
  filter(OfficeDescription==
    "President of the United States 4 Year Term (1) Position") %>%
    select(CountyName, CandidateLastName, CandidateVotes, CountyCode)

MI_prez_2016
## # A tibble: 1,079 × 4
##    CountyName CandidateLastName CandidateVotes County…¹
##    <chr>      <chr>                      <dbl>    <dbl>
##  1 ALCONA     Trump                       4201        1
##  2 ALCONA     Clinton                     1732        1
##  3 ALCONA     Johnson                      164        1
##  4 ALCONA     Castle                        28        1
##  5 ALCONA     Stein                         54        1
##  6 ALCONA     Soltysik                       0        1
##  7 ALCONA     Fox                            0        1
##  8 ALCONA     Hartnell                       0        1
##  9 ALCONA     Hoefling                       0        1
## 10 ALCONA     Kotlikoff                      0        1
## # … with 1,069 more rows, and abbreviated variable
## #   name ¹​CountyCode

Reshaping the data, long to wide format

In MI_prez_2016 there is one column of vote counts (CandidateVotes), and for every County a row containing an entry for each presidential candidate listed on the ballot in that County. This way of organizing the data is called ‘long’ format, because the dataset is longer than it is wide; instead of having a different column (or variable) for each candidate’s votes, there is one column to record votes and separate rows for each candidate.

To simplify analysis, we want to restructure the data to ‘wide’ format. In this format, the dataset will have one row per county (exactly 83 rows) and a separate column for each candidate’s vote totals. This wide structure is particularly useful for calculations, such as summing vote totals over all third-party candidates, because it facilitates column based operations.

The function pivot_wider() will translate the data from long to wide data format. The required arguments for pivot_wider() are (1) the names of each new column in wide format, and (2) the values to be stored in these columns. From long format, the names in CandidateLastName will be passed to wide format, while the values in CandidateVotes will be stored in the new wide format columns.

MI_prez_2016_wide <- MI_prez_2016 %>%
    pivot_wider(names_from = CandidateLastName, 
      values_from=CandidateVotes)

MI_prez_2016_wide
## # A tibble: 83 × 15
##    CountyN…¹ Count…² Trump Clinton Johnson Castle Stein
##    <chr>       <dbl> <dbl>   <dbl>   <dbl>  <dbl> <dbl>
##  1 ALCONA          1  4201    1732     164     28    54
##  2 ALGER           2  2585    1663     177     19    67
##  3 ALLEGAN         3 34183   18050    2513    289   596
##  4 ALPENA          4  9090    4877     498     53   160
##  5 ANTRIM          5  8469    4448     459     47   146
##  6 ARENAC          6  4950    2384     270     26    60
##  7 BARAGA          7  2158    1156     106     15    49
##  8 BARRY           8 19202    9114    1424    181   346
##  9 BAY             9 28328   21642    2189    208   544
## 10 BENZIE         10  5539    4108     382     36   149
## # … with 73 more rows, 8 more variables:
## #   Soltysik <dbl>, Fox <dbl>, Hartnell <dbl>,
## #   Hoefling <dbl>, Kotlikoff <dbl>, Maturen <dbl>,
## #   McMullin <dbl>, Moorehead <dbl>, and abbreviated
## #   variable names ¹​CountyName, ²​CountyCode

With the data now in wide format, we create an ‘other’ vote count column. This variable will record vote totals for all candidates who are neither Trump nor Clinton. Within mutate() , the other variable is the sum of each individual candidate column:

MI_prez_2016_wide<-MI_prez_2016_wide %>%
  mutate(other=Castle+ Fox + Hartnell  + Hoefling + Kotlikoff + Maturen
         + McMullin +  Moorehead + Soltysik + Stein) %>%
  select(CountyName, CountyCode, Trump, Clinton, other) %>%
  print(n=5)
## # A tibble: 83 × 5
##   CountyName CountyCode Trump Clinton other
##   <chr>           <dbl> <dbl>   <dbl> <dbl>
## 1 ALCONA              1  4201    1732   101
## 2 ALGER               2  2585    1663    93
## 3 ALLEGAN             3 34183   18050  1040
## 4 ALPENA              4  9090    4877   233
## 5 ANTRIM              5  8469    4448   206
## # … with 78 more rows

Then we select() only the necessary columns for recording presidential vote totals across counties, identifiers CountyName and CountyCode, followed by Trump, Clinton, and other. Given the raw votes, we could create vote percentage totals for each candidate:

MI_prez_2016_wide<-MI_prez_2016_wide %>%
 mutate(Trump_percent=Trump/(Trump + Clinton + other)*100) %>%
 mutate(Clinton_percent=Clinton/(Trump + Clinton + other)*100) 
## # A tibble: 83 × 7
##    County…¹ Count…² Trump Clinton other Trump…³ Clint…⁴
##    <chr>      <dbl> <dbl>   <dbl> <dbl>   <dbl>   <dbl>
##  1 ALCONA         1  4201    1732   101    69.6    28.7
##  2 ALGER          2  2585    1663    93    59.5    38.3
##  3 ALLEGAN        3 34183   18050  1040    64.2    33.9
##  4 ALPENA         4  9090    4877   233    64.0    34.3
##  5 ANTRIM         5  8469    4448   206    64.5    33.9
##  6 ARENAC         6  4950    2384    92    66.7    32.1
##  7 BARAGA         7  2158    1156    70    63.8    34.2
##  8 BARRY          8 19202    9114   589    66.4    31.5
##  9 BAY            9 28328   21642   818    55.8    42.6
## 10 BENZIE        10  5539    4108   199    56.3    41.7
## # … with 73 more rows, and abbreviated variable names
## #   ¹​CountyName, ²​CountyCode, ³​Trump_percent,
## #   ⁴​Clinton_percent

The dataset now consists of an organized, wide table of presidential candidate support, one row for each county, and candidate support across multiple columns.

5.5 Wrangling election data and dotplotting results

The ‘Cleveland dotplot’, named after its creator William S. Cleveland, is a data visualization form for comparing quantitative values across categories of a qualitative variable. In this case, we are interested in visualizing levels of candidate support across each Michigan county. Similar to a bar graph, where the lengths of bars would be drawn to scale the percentage of the vote won by each candidate across counties, a Cleveland dotplot replaces bars with dots, arranged along a horizontal or vertical axis. The length of a bar is replaced by position of a dot. By eliminating the area of bars on the graph, the dotplot facilitates comparisons across many categories.

For MI_prez_2016_wide, visualizing raw vote totals is not feasible in one graphic because of the disparity between the largest counties, such as Detroit’s Wayne County, and any of the smaller counties that would appear indistinguishable. What interests us more than raw votes is the percentage of each candidate’s support by county. To construct the dotplot, first in MI_prez_2016_wide because the County names are in all caps, we will convert each one to “title case” – uppercase first letters and lowercase remaining letters — with the str_to_title() function within mutate():

MI_prez_2016_wide<- MI_prez_2016_wide %>%
 mutate(CountyName=str_to_title(CountyName))

The str_to_title() function (from the stringr package (Wickham 2022) in the tidyverse) converts each word, for example, “ST.CLAIR” to “St. Clair”. Notice the difference:

MI_prez_2016_wide %>%
  print(n=3)
## # A tibble: 83 × 7
##   CountyN…¹ Count…² Trump Clinton other Trump…³ Clint…⁴
##   <chr>       <dbl> <dbl>   <dbl> <dbl>   <dbl>   <dbl>
## 1 Alcona          1  4201    1732   101    69.6    28.7
## 2 Alger           2  2585    1663    93    59.5    38.3
## 3 Allegan         3 34183   18050  1040    64.2    33.9
## # … with 80 more rows, and abbreviated variable names
## #   ¹​CountyName, ²​CountyCode, ³​Trump_percent,
## #   ⁴​Clinton_percent

A key aspect of the Cleveland dotplot is to order the categories (counties) to make the visual comparisons clearer. For example, rather than organize the counties alphabetically, we will arrange counties by Trump_percent or Clinton_percent.

Reorder function

The dots are constructed in ggplot() by geom_point(). In the aes() argument, we specify the order of the counties within an additional function. The reorder() function sorts the names of the counties by Trump’s percentage of the vote (reorder(CountyName, Trump_percent)) or Clinton (reorder(CountyName, Clinton_percent)). This function is placed within the aes() function as the name of the County variable. Figure 5.1 displays a Cleveland dotplot with counties sorted by the Trump vote percent.

MI_prez_2016_wide %>%
  ggplot() +
  geom_point(aes(reorder(CountyName, Trump_percent), Trump_percent)) +
  labs(title="Trump Percentage Vote by Michigan County, 2016",
       x="County", y="Vote Percentage") +
  coord_flip()
Cleveland dotplot of Trump’s vote percentage by Michigan county in the 2016 presidential election, created with \textbf{ggplot2}. County points are ordered by vote percentage, making it easier to compare across counties than in a traditional bar chart.

FIGURE 5.1: Cleveland dotplot of Trump’s vote percentage by Michigan county in the 2016 presidential election, created with . County points are ordered by vote percentage, making it easier to compare across counties than in a traditional bar chart.

Each dot represents one county, plotted along the X-axis based on the vote percentage. The counties are arranged in ascending order of support on the Y-axis, which highlights patterns and outliers. For example, Trump won the vast majority of Michigan counties by wide margins, but lost most of the more populous counties of Michigan cities by a small margin. Wayne county and Washtenaw are clear outliers.

The X and Y axes are flipped in coord_flip() , and the graphic aspect ratio is elongated to make all 83 county names visible. The reorder() function changes the order of Michigan counties. Otherwise the syntax is similar to a scatterplot.

We could add two different series, Clinton and Trump, and color each series. Note that ggplot2 expects data to be in a long format. As a shortcut, or perhaps exploratory graphic, without having to convert the data to long format, it may be easier to plot the percents in two different geom_point() series. To save space, this dotplot is not produced below, but try running the lines to observe the result:

MI_prez_2016_wide %>%
  ggplot(aes(reorder(CountyName, Trump_percent), Trump_percent)) +
  geom_point(color="red") +
  labs(title="Trump and Clinton Votes by Michigan County, 2016",
       x="County", y="Number of Votes",
          caption="Clinton in blue, Trump in red") +
  coord_flip() +
  geom_point(aes(CountyName, Clinton_percent), color="blue")

In Chapter 6 we consider election data in the context of maps, and among other subjects, examine how to create election maps and other aspects of geographically referenced data.

Resources

The official dplyr reference guide details the package functions with examples: https://dplyr.tidyverse.org/reference/index.html.

5.6 Exercises

Answer the questions with R code supported by a narrative explanation within an RMarkdown document.

  1. Construct a series of box-whiskers plots for the ideology scores within State delegations to the US House. Pick a subset of five States to compare and wrangle the data, using filter() to identify the States of interest. Save the data as a dataframe or pass on the wrangled data to ggplot2 for the construction of the box-whiskers plots.

  2. Based on the Congressional ideology (Voteview.com) data, write a three-paragraph report on the ideology of the Congressional delegation from Florida. Be sure to include a discussion of how the delegation relates to each Chamber’s overall median and within party caucus median ideology score. Be sure to include appropriate summary statistics with your answer.

  3. Construct a data visualization consisting of two histograms to compare the ideological orientations of representatives to the U.S. House by party caucus, in one of the following states: Texas, California, or New York. Wrangle the data to create a subset for the state, and for the histograms use color and transparency to highlight differences between the two caucuses.

  4. In Michigan, which US House legislators had the most liberal and most conservative ideology scores? Create two data tables to display district number, name, party affiliation, and ideology (left - right) score.

  5. Pick a US State, identify the Congressional delegation from it. Create a Cleveland dotplot to visualize the most conservative to most liberal legislators, from a given State, by their ideology score.

References

Bache, Stefan Milton, and Hadley Wickham. 2020. Magrittr: A Forward-Pipe Operator for r. https://CRAN.R-project.org/package=magrittr.
Cleveland, William S., and Robert McGill. 1984. “Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods.” Journal of the American Statistical Association 79 (387): 531–54.
Everson, Phil, Rick Valelly, Arjun Vishwanath, and Jim Wiseman. 2016. “NOMINATE and American Political Development: A Primer.” Studies in American Political Development 30 (2): 97–115.
Lewis, Jeffrey B., Keith Poole, Howard Rosenthal, Adam Boche, Aaron Rudkin, and Luke Sonnet. 2023. “Voteview: Congressional Roll-Call Votes Database.” https://voteview.com. https://voteview.com.
Müller, Kirill, and Hadley Wickham. 2023. Tibble: Simple Data Frames. https://CRAN.R-project.org/package=tibble.
Poole, Keith T. 2005. Spatial Models of Parliamentary Voting. Cambridge University Press.
Poole, Keith T., and Howard Rosenthal. 2006. Ideology and Congress: A Political Economic History of Roll Call Voting. 2nd ed. New Brunswick, NJ: Transaction Publishers.
———. 2022. Stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.
Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, and Dewey Dunnington. 2023. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.
Wickham, Hadley, Romain François, Lionel Henry, Kirill Müller, and Davis Vaughan. 2023. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

  1. In this printed text, since all of the columns will not fit on a page, I select a subset of the columns, such as in this case, I entered filter(house_memb_117[, c(4, 5, 6, 8)], state_abbrev=="NY")↩︎

  2. To download the data directly see the Michigan Secretary of State website, https://www.michigan.gov/sos/elections/election-results-and-data. Scrolling through past election results, find the 2016 November general election, which will link to a webpage showing County-level results for all races on the ballot, https://mielections.us/election/results/2016GEN_CENR.html. At the top of this page, click “Data” then the link to “TAB-delimited by County”, which will start the process of downloading a file with the election results.↩︎

  3. County-level FIPS codes are a standardized set of numerical codes used to uniquely identify counties within the United States. More information about county-level FIPS codes, including their structure and usage, can be found on the U.S. Census Bureau’s website, https://www.census.gov/quickfacts/fact/note/US/fips.↩︎