To begin, we’ll use the built-in mtcars dataset, which ships with R and provides details on different car models.
# Load the dplyr package
library(dplyr)
# Copy the built-in mtcars dataset into a working variable
data <- mtcars
Before selecting columns, let’s examine the structure and contents of the dataset.
# View the structure of the data set
str(data)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
# View the first few rows of the data set
head(data)
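Since we will later select columns both by name and by position, it can help to list the column names together with their order. The following is a minimal sketch using base R only; nothing here is specific to dplyr.

```r
# Work on a copy of the built-in mtcars dataset
data <- mtcars

# Column names in their stored order
names(data)

# Number of rows and columns
dim(data)  # 32 11

# Position of a given column, useful before selecting by index
which(names(data) == "hp")  # 4
```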
The select() function picks particular columns from a data frame. Columns can be specified in several ways: by name, by position, or with helpers that match column names.
# Select specific columns by name
selected_columns_by_name <- select(data, mpg, hp, wt) # we select miles per US gallon, horsepower and weight
head(selected_columns_by_name)
The variable selected_columns_by_name contains a subset of the initial data set with only the three columns we named: mpg, hp and wt.
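When the column names are stored in a character vector rather than typed directly, the all_of() helper can be used inside select(). The sketch below assumes a reasonably recent dplyr (1.0.0 or later); the names wanted and by_vector are only illustrative.

```r
library(dplyr)

data <- mtcars

# Hypothetical vector holding the names of the columns we want
wanted <- c("mpg", "hp", "wt")

# all_of() tells select() to treat the vector's contents as column names
by_vector <- select(data, all_of(wanted))

names(by_vector)  # "mpg" "hp" "wt"
```

all_of() raises an error if one of the names is missing from the data frame, which makes typos easy to spot.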
# Select columns using column indexing
selected_columns_by_index <- select(data, 1:3)
head(selected_columns_by_index)
Again, we get a subset of the original data, this time chosen by column position: the indices 1:3 pick the first three columns (mpg, cyl and disp).
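Positions depend on the column order, so a selection by index changes meaning if columns are reordered. As a small sketch, the same three columns can also be requested as a range of names, which reads more clearly. The variable names by_index and by_range are only illustrative.

```r
library(dplyr)

data <- mtcars

# Selection by position: the first three columns
by_index <- select(data, 1:3)

# Selection by a range of names: everything from mpg through disp
by_range <- select(data, mpg:disp)

names(by_index)  # "mpg" "cyl" "disp"

# Both approaches return the same columns for the unmodified mtcars data
identical(names(by_index), names(by_range))  # TRUE
```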
# Select columns whose names match a pattern, using a selection helper
selected_columns_by_condition <- select(data, starts_with("cyl"))
head(selected_columns_by_condition)
A third way to get a subset with select() is to use a selection helper that matches column names. Here, starts_with("cyl") chooses all variables whose names start with “cyl”. In mtcars that gives us only one variable, cyl.
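starts_with() is one of several name-matching helpers. The sketch below tries a few others on mtcars: ends_with(), contains() and matches(), the last of which takes a regular expression. The result variable names are only illustrative.

```r
library(dplyr)

data <- mtcars

# Columns whose names start with "d"
d_start <- select(data, starts_with("d"))
names(d_start)  # "disp" "drat"

# Columns whose names end with "t"
t_end <- select(data, ends_with("t"))
names(t_end)  # "drat" "wt"

# Columns whose names contain "ar"
has_ar <- select(data, contains("ar"))
names(has_ar)  # "gear" "carb"

# Columns whose names are exactly two characters long (regular expression)
two_chars <- select(data, matches("^.{2}$"))
names(two_chars)  # "hp" "wt" "vs" "am"
```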
You can also use select() to rename columns of the data set.
# Rename the 'mpg' column to 'MilesPerGallon'
renamed_column <- select(data, MilesPerGallon = mpg)
head(renamed_column)
Now we have a subset (one column) stored in a new variable, and that column carries the new name MilesPerGallon. Note that select() keeps only the columns it is given, so the other ten columns are not part of the result.
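If the goal is to rename a column while keeping all the others, dplyr’s rename() function may be a better fit, or select() can be combined with everything(). The following sketch compares the three variants; the result variable names are only illustrative.

```r
library(dplyr)

data <- mtcars

# select() with a new name keeps only the renamed column
renamed_one <- select(data, MilesPerGallon = mpg)
ncol(renamed_one)  # 1

# rename() changes the name and keeps every column
renamed_all <- rename(data, MilesPerGallon = mpg)
ncol(renamed_all)  # 11

# select() plus everything() renames and keeps the remaining columns
renamed_keep <- select(data, MilesPerGallon = mpg, everything())
ncol(renamed_keep)  # 11
```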
In addition to selecting columns from a data set, we can also use select() to exclude columns.
# Exclude the 'disp', 'gear' and 'am' columns
excluded_columns <- select(data, -disp, -gear, -am)
head(excluded_columns)
Now we have a new variable that is a subset of the initial data set. We got this subset by excluding three variables: disp, gear and am. The exclusion is done by placing a minus sign in front of each column name.
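The columns to drop can also be grouped with c() and negated once. The sketch below assumes dplyr 1.0.0 or later, where ! is accepted as well as the minus sign; the result variable names are only illustrative.

```r
library(dplyr)

data <- mtcars

# Negate a group of columns with a single minus sign
dropped_minus <- select(data, -c(disp, gear, am))

# The same exclusion written with the ! operator
dropped_bang <- select(data, !c(disp, gear, am))

# Both leave the remaining eight columns in their original order
ncol(dropped_minus)  # 8
identical(names(dropped_minus), names(dropped_bang))  # TRUE
```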
This post introduced the select() function from the dplyr package, which selects specific columns from a data frame. We explored different ways to specify columns: by name, by index, and with helpers that match column names. We also saw how to rename columns and how to exclude unwanted ones. With select(), you have a handy tool for data manipulation in R. Keep exploring the dplyr package and its other functions to further enhance your data analysis skills. Happy coding!