Often, you have the same data separated out into multiple files. Let's say that you have a ton of files following the filename structure: 'file_1.csv', 'file_2.csv', 'file_3.csv', and so on. The power of dplyr and tidyr is mainly in being able to manipulate large amounts of structured data, so you want to get all of the relevant information into one table so that you can analyze the aggregate data.
You can combine the base R functions list.files() and lapply() with readr and dplyr to organize this data better, as shown below (this assumes readr and dplyr have already been loaded with library()):
files <- list.files(pattern = "file_.*csv")
df_list <- lapply(files, read_csv)
df <- bind_rows(df_list)
- The first line uses list.files() and a regular expression, a sequence of characters describing a pattern of text that should be matched, to find any file in the current directory that starts with 'file_' and has an extension of .csv, storing the name of each file in the vector files (see the note on patterns after this list)
- The second line uses lapply() to read each file in files into a data frame with read_csv(), storing the resulting data frames in df_list
- The third line then concatenates all of those data frames together with dplyr's bind_rows() function
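One caveat: the pattern given to list.files() is not anchored, so "file_.*csv" also matches any filename that merely contains that text. A minimal sketch of a stricter, anchored pattern (the stray filenames below are hypothetical examples, not files from this lesson):

# Suppose the directory also held "my_file_1.csv" and "file_1.csv.bak"
list.files(pattern = "file_.*csv")       # unanchored: would match the strays too
list.files(pattern = "^file_.*\\.csv$")  # anchored: only file_1.csv, file_2.csv, ...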
Instructions
You have 10 different files containing 100 students each. These files follow the naming structure: exams_0.csv, exams_1.csv, … up to exams_9.csv.
You are going to read each file into an individual data frame and then combine all of the entries into one data frame.
First, create a variable called student_files and set it equal to the result of calling list.files() with a pattern that matches all of the CSV files you want to import.
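One way to write this step, assuming the exam files sit in the current working directory (the pattern shown is one reasonable choice, not the only one):

# Collect the names of exams_0.csv through exams_9.csv
student_files <- list.files(pattern = "exams_.*\\.csv")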
Read each file in student_files into a data frame using lapply() and save the result to df_list.
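For example, reusing student_files from the previous step (read_csv() comes from readr, so the package must be loaded):

library(readr)

# Read every file into its own data frame
df_list <- lapply(student_files, read_csv)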
Concatenate all of the data frames in df_list into one data frame called students.
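Following the pattern from earlier, dplyr's bind_rows() stacks them:

library(dplyr)

# Stack the 10 data frames into a single one
students <- bind_rows(df_list)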
Inspect students. Save the number of rows in students to nrow_students. Did you get all of them?
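A check along these lines should confirm the count; with 10 files of 100 students each, the expected total is 1000 rows (assuming every file was read and each really holds 100 entries):

head(students)                   # peek at the combined data
nrow_students <- nrow(students)  # expect 1000 rows
nrow_students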