Selecting Columns with dplyr’s select function

Generate a Small Data Set

To begin, we’ll use R’s built-in mtcars dataset, which provides details on 32 car models.

# Load the dplyr package
library(dplyr)

# Copy the built-in mtcars dataset into a working variable
data <- mtcars

Let’s inspect the data

Prior to delving into column selection, it’s important to examine the structure and contents of our dataset.

# View the structure of the data set
str(data)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
# View the first few rows of the data set
head(data)

Now it’s time for the select() function

The select() function lets us pick particular columns from a data frame. Columns can be specified by name, by position, or by a pattern in their names.

# Select specific columns by name
selected_columns_by_name <- select(data, mpg, hp, wt) # miles per US gallon, horsepower, weight
head(selected_columns_by_name)

The variable selected_columns_by_name contains a subset of the initial data set: only the columns mpg, hp, and wt.
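When the column names are stored as strings, for example in a character vector, they can be passed to select() through the all_of() helper. The following is a minimal sketch that uses mtcars directly; the names wanted and selected_from_vector are made up for this illustration.

```r
library(dplyr)

# Hypothetical vector of column names, introduced only for this example
wanted <- c("mpg", "hp", "wt")

# all_of() tells select() to treat the strings as column names
selected_from_vector <- select(mtcars, all_of(wanted))

# Check which columns ended up in the subset
names(selected_from_vector)
```

This is handy when the list of columns is built elsewhere in a script rather than typed into the select() call.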

# Select columns using column indexing
selected_columns_by_index <- select(data, 1:3)
head(selected_columns_by_index)

Again, we get a subset of the initial data set, this time selected by position: 1:3 picks the first three columns (mpg, cyl, disp).
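Positions break silently when the column order changes, so it can be safer to describe the same range by name. As a small sketch on mtcars (the variable names by_index and by_range are chosen just for this example), the range mpg:disp covers the same three columns as 1:3:

```r
library(dplyr)

# Select the first three columns by position
by_index <- select(mtcars, 1:3)

# Select the same range by name: every column from mpg through disp
by_range <- select(mtcars, mpg:disp)

# Both calls should return the same set of column names
identical(names(by_index), names(by_range))
```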

# Select columns whose names match a pattern (here: names starting with "cyl")
selected_columns_by_condition <- select(data, starts_with("cyl"))
head(selected_columns_by_condition)

A third way to get a subset with select() is to match column names with a selection helper, here: choose all variables whose names start with “cyl”. That gives us only one variable.
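starts_with() is one of several helpers that match column names. The sketch below applies three of them to mtcars; the variable names are invented for this example, and the comments list the columns each pattern matches.

```r
library(dplyr)

# Names that start with "c": cyl, carb
starts_c <- select(mtcars, starts_with("c"))

# Names that end with "t": drat, wt
ends_t <- select(mtcars, ends_with("t"))

# Names that contain "ar" anywhere: gear, carb
contains_ar <- select(mtcars, contains("ar"))

names(starts_c)
names(ends_t)
names(contains_ar)
```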

Renaming Columns

You can also use select() to rename columns as you select them.

# Rename the 'mpg' column to 'MilesPerGallon'
renamed_column <- select(data, MilesPerGallon = mpg)
head(renamed_column)

Now we have a subset (one column) stored in a new variable. On top of that, the column has a new name that we can work with from now on.
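Note that select() keeps only the columns it is given, so renaming this way drops everything else. If the goal is to rename a column and keep the rest, dplyr’s rename() does that, as does adding everything() to the select() call. A short sketch on mtcars, with variable names chosen only for this illustration:

```r
library(dplyr)

# select() with a new name: keeps only the renamed column
renamed_only <- select(mtcars, MilesPerGallon = mpg)

# rename(): changes the name and keeps all other columns
renamed_all <- rename(mtcars, MilesPerGallon = mpg)

# select() plus everything(): renamed column first, then the rest
renamed_first <- select(mtcars, MilesPerGallon = mpg, everything())

ncol(renamed_only)   # 1 column
ncol(renamed_all)    # all 11 columns
ncol(renamed_first)  # all 11 columns
```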

Excluding Columns

Besides selecting columns, we can also use select() to exclude certain columns from a data set.

# Exclude the 'disp', 'gear', and 'am' columns
excluded_columns <- select(data, -disp, -gear, -am)
head(excluded_columns)

Now we have a new variable that is a subset of the initial data set. We got this subset by excluding three variables: disp, gear, and am. Columns are excluded by putting a minus sign in front of their names.
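The columns to drop can also be grouped in c() and negated in one go, either with a minus sign or with the ! operator. The sketch below, again on mtcars and with variable names made up for this example, checks that all three spellings leave the same columns:

```r
library(dplyr)

# One minus sign per column
dropped_each <- select(mtcars, -disp, -gear, -am)

# One minus sign for the whole group
dropped_group <- select(mtcars, -c(disp, gear, am))

# Complement of the group using !
dropped_bang <- select(mtcars, !c(disp, gear, am))

# All three should keep the same eight columns
names(dropped_each)
identical(names(dropped_each), names(dropped_group))
identical(names(dropped_each), names(dropped_bang))
```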


Conclusion

This post introduced the select() function from the dplyr package, which lets you pick specific columns from a data frame. We explored different ways to specify columns: by name, by index, and by matching a pattern in the column names. We also saw how to rename columns and how to exclude unwanted ones. With select() in your toolbox, you have a handy tool for trimming data frames down to the columns you need. Keep exploring the dplyr package and its other functions to further enhance your data analysis skills. Happy coding!

