5 Reproducible Data Cleanup with R

Modified

March 13, 2026

5.1 Set up an RStudio Project and install the relevant packages

File -> New Project
Name the Project as follows: las_demo
Create the following three folders:

data_raw
data_clean
code

You can create these in the project folder using your operating system’s “create folder” option, or you can create them within RStudio using the Files tab.
Install the libraries, load them, and verify that they work by running a few commands.
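The setup steps above can also be done from the R console. The following is a minimal sketch, not a required procedure: it builds the folders in a temporary location so it can be run anywhere (inside your RStudio Project you would use "." as the root), and the package list is an assumption, since the packages depend on your project.

```r
# Hypothetical location for this demo; inside an RStudio Project use "." instead.
project_root <- file.path(tempdir(), "las_demo")

# The three folders named above.
folders <- c("data_raw", "data_clean", "code")
for (folder in folders) {
  dir.create(file.path(project_root, folder), recursive = TRUE, showWarnings = FALSE)
}

# Confirm that the folders now exist.
print(list.files(project_root))

# Assumed package list; replace it with the packages your project needs.
packages <- c("tidyverse")

# Work out which of those packages are not installed yet.
installed <- vapply(packages, requireNamespace, logical(1), quietly = TRUE)
to_install <- packages[!installed]
print(to_install)

# Uncomment to install whatever is missing, then load it.
# install.packages(to_install)
# library(tidyverse)
```

Running a command or two from each package after `library()` is a quick way to verify that the installation worked.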

5.2 Gather the Raw Data to be corrected

Save your original (aka ‘raw’ or ‘dirty’) data files in the data_raw folder.

Important

Never make changes to the original file of dirty/raw data. Never. NEVER. Always import the uncorrected original data file, make corrections using a script, and then save a new, clean file.
Annotate your code with reminders for future you: the # symbol (aka pound sign or hashtag) lets you add comments to the script explaining what different sections or commands do. Annotate your scripts with lots of detail on what each command does, what is being done in each section, and even links to websites you used to figure things out. Remember: ABA (always be annotating).
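A minimal sketch of that import, correct, and save pattern is shown here in base R. The file names (`survey_raw.csv`, `survey_clean.csv`) and the tiny data set are made up for illustration, and the demo writes to a temporary folder rather than a real project.

```r
# Demo setup only: a temporary project with a small, made-up raw file.
project_root <- file.path(tempdir(), "las_demo_workflow")
dir.create(file.path(project_root, "data_raw"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project_root, "data_clean"), recursive = TRUE, showWarnings = FALSE)
raw_path <- file.path(project_root, "data_raw", "survey_raw.csv")
writeLines(c("site,count", "A,12", "B,7", "B,7"), raw_path)

# 1. Import the uncorrected original file. It is only ever read, never overwritten.
raw <- read.csv(raw_path, stringsAsFactors = FALSE)

# 2. Make corrections in code so that every change is documented and repeatable.
clean <- unique(raw)  # drop rows that are exact duplicates

# 3. Save the corrected data as a NEW file in data_clean.
clean_path <- file.path(project_root, "data_clean", "survey_clean.csv")
write.csv(clean, clean_path, row.names = FALSE)
```

Because the raw file is never modified, you can rerun the script from scratch whenever you change or add a correction.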

5.3 Take your data from `messy` to `clean` in these 5 steps:

Familiarize yourself with the data set
Check for potential errors. These can be structural errors (e.g., misaligned columns, duplicated rows/columns, missing values), data entry errors, or measurement errors. Decide how you will flag and deal with them.
Decide how to deal with missing values
Identify ways to simplify data values (e.g., codes, abbreviations) and column headings
Write code to load the ‘raw’ data file, implement your corrections/changes, and save the ‘clean’ version of the data.
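For the first steps (getting familiar with the data and checking for potential errors), a few base R commands go a long way. This sketch uses a made-up data frame in place of an imported raw file; the column names and values are assumptions for illustration only.

```r
# Made-up example data standing in for an imported raw file.
raw <- data.frame(
  site   = c("A", "B", "B", NA),
  count  = c("12", "7", "7", "n/a"),
  height = c(1.5, 2.1, 2.1, 0),
  stringsAsFactors = FALSE
)

str(raw)             # column names and data types (note that count is character)
print(head(raw))     # the first few rows
print(summary(raw))  # ranges and quartiles for numeric columns

# Missing values per column.
missing_per_column <- colSums(is.na(raw))
print(missing_per_column)

# Number of rows that exactly repeat an earlier row.
duplicate_rows <- sum(duplicated(raw))
print(duplicate_rows)

# Entries that prevent a column from being read as numeric.
not_numeric <- raw$count[is.na(suppressWarnings(as.numeric(raw$count)))]
print(not_numeric)
```

What you find here feeds directly into your list of corrections and your decisions about how to handle missing values.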

Things to look out for: Make a list of what you think needs to be corrected and the steps necessary to identify and implement each correction. Common problems include:

Numeric values stored as character data types
Factors stored as characters
Duplicate rows
Spelling mistakes
Inconsistent formatting (e.g., codes, capitalization)
White spaces
Missing data
Zeros instead of null values
Special characters (e.g., commas in numeric values instead of decimal points)
Column headings with spaces between words or that start with numerals
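Several of the problems listed above can be corrected with a line or two of code each. The following base R sketch uses made-up data; the column names are hypothetical, and treating a zero as a missing value is an assumption that only holds if a zero really means "not measured" in your data.

```r
# Made-up raw data showing several of the problems listed above.
raw <- data.frame(
  "Site Name"  = c(" forest", "Forest ", "FOREST", "pasture", "pasture"),
  "2024 count" = c("1,5", "2,0", "0", "3,25", "3,25"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

clean <- raw

# Column headings: remove spaces and leading numerals.
names(clean) <- c("site_name", "count_2024")

# White space and inconsistent capitalization.
clean$site_name <- tolower(trimws(clean$site_name))

# Special characters: decimal commas become decimal points,
# then the column is converted from character to numeric.
clean$count_2024 <- as.numeric(gsub(",", ".", clean$count_2024, fixed = TRUE))

# Zeros instead of null values (ASSUMPTION: here a 0 means "not measured").
clean$count_2024[which(clean$count_2024 == 0)] <- NA

# Factors stored as characters.
clean$site_name <- factor(clean$site_name)

# Duplicate rows.
clean <- unique(clean)

print(clean)
```

The tidyverse packages offer equivalents for each of these steps; base R is used here only so the sketch runs without installing anything.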

Remember, the characteristics of a clean data set include:

Free of duplicate rows/values
Error-free (correct misspellings, eliminate special characters)
Correct data type for analysis
Outliers identified and dealt with appropriately
“tidy” data structure
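Some of these characteristics can be checked in code at the end of a cleanup script. This is a sketch on made-up data; the 1.5 × IQR rule used to flag outliers is one common convention chosen for illustration, not a rule this chapter prescribes, and a flagged value still needs a human decision about how to handle it.

```r
# Made-up "clean" data to check.
clean <- data.frame(
  site  = factor(c("forest", "forest", "pasture", "pasture", "pasture")),
  count = c(4, 5, 6, 5, 40)
)

# Stop with an error if a basic expectation is not met.
stopifnot(
  !any(duplicated(clean)),  # free of duplicate rows
  is.factor(clean$site),    # correct data types
  is.numeric(clean$count)
)

# Flag (do not delete) possible outliers using the 1.5 * IQR convention.
quartiles <- unname(quantile(clean$count, c(0.25, 0.75), na.rm = TRUE))
fence <- 1.5 * (quartiles[2] - quartiles[1])
clean$outlier <- clean$count < quartiles[1] - fence |
  clean$count > quartiles[2] + fence

print(clean)
```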

When working with multiple files, should you correct before or after combining?

We often need to combine multiple files with the same kind of data (e.g., surveys conducted in Year 1 and in Year 2, each of which is recorded in its own .csv file). Is it more efficient to correct each file first and then combine them, or to combine first and then correct?

It depends, and the answer can vary from one project to another. Make an outline of the different steps and corrections to be made to each file and see if you can decide which is more efficient. Note that there might be different ways to do the same thing; this outline will help you figure out which is best. For instance, you could:

Option 1
1. Import Table 1
2. Correct column headings in Table 1
3. Import Table 2
4. Correct column headings in Table 2
5. Bind Table 1 and Table 2 Together

but this is less efficient than…

Option 2
1. Import Table 1
2. Import Table 2
3. Bind Table 1 and Table 2 Together
4. Correct the column headings in the combined table

5.4 Tools & Resources

These introductions to R and RStudio were made by Professor Ethan White (UF-WEC). They are a good overview of some R basics.
The Carpentries’ R workshops (self-paced or taught in person) are excellent; I use many of their materials in class:
- R for Social Scientists
- Data Analysis and Visualization in R for Ecologists
Software Carpentry lesson on Project Management with RStudio
Hadley Wickham wrote a book on using the tidyverse and the online version is FREE. This is a phenomenal resource on using R to import, tidy, and visualize data.
RStudio Cheat Sheets: help with commands for using the different tidyverse packages, RStudio shortcuts and tricks, help with R commands, and more. You definitely want the ones for Data Import, Work with Strings, Factors, Data Transformation, and Base R.
Where and How to ask for help
- Hadley Wickham’s advice on how to write a good reproducible example for getting help with R
- How to post good questions on Stack Overflow
- The UF R-users listserv is very user friendly and a great place to post requests for help.
Ten simple rules for biologists learning to program
Lots more on the course’s ‘Resources’ page

6 Additional (interesting) Reading

Lewis, Keith P., Eric Vander Wal, and David A. Fifield. 2018. Wildlife biology, big data, and reproducible research. Wildlife Society Bulletin 42(1): 172-179.
White EP, Baldridge E, Brym ZT, Locey KJ, McGlinn DJ, Supp SR. 2013. Nine simple ways to make it easier to (re)use your data. Ideas in Ecology and Evolution. 6(2):1-10.
The humanities have a ‘reproducibility’ problem
The humanities do not need a replication drive
Reproducible Research: A primer for the social sciences
Replicability and replication in the humanities
Towards reproducible science in the digital humanities
The possibility and desirability of replication in the humanities
Reproducible Research: A Retrospective

For when you feel more comfortable with R and programming

Bryan, J. (2018). Excuse me, do you have a moment to talk about version control? The American Statistician, 72(1), 20-27.
Wilson, Greg, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, and Tracy K. Teal. Good enough practices in scientific computing. PLoS Computational Biology 13, no. 6 (2017): e1005510.
