5  Reproducible Data Cleanup with R

Modified

February 20, 2026

5.1 Set up an RStudio Project and install the relevant packages

  1. File -> New Project

  2. Name the Project as follows: las_demo

  3. Create the following three folders:

    data_raw
    data_clean
    code

    You can create these using your operating system’s “create folder” option, or from within RStudio using the Files tab.

  4. Install the libraries, load them, and verify they work by running a few commands
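Steps 3 and 4 can also be done from the R console. The following is a minimal sketch, assuming the working directory is the root of the las_demo project. The package name "tidyverse" is only an illustrative assumption; substitute whatever packages your project needs.

```r
# Create the three project folders if they do not already exist
folders <- c("data_raw", "data_clean", "code")
for (f in folders) {
  if (!dir.exists(f)) dir.create(f)
}

# Packages this project relies on (assumed example; edit as needed)
pkgs <- c("tidyverse")

# Find out which packages are not yet installed
is_installed  <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!is_installed]

if (length(not_installed) > 0) {
  # Installation needs an internet connection, so it is left as a manual step:
  # install.packages(not_installed)
  message("Not yet installed: ", paste(not_installed, collapse = ", "))
}

# Load every package that is available
for (p in pkgs[is_installed]) {
  library(p, character.only = TRUE)
}
```

Once the packages load without errors, running a command or two from each one is a quick way to confirm that they work.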

5.2 Gather the Raw Data to be corrected

Save your original (aka ‘raw’ or ‘dirty’) data files in the data_raw folder.

Important
  1. Never make changes to the original file of dirty/raw data. Never. NEVER. Always import the uncorrected original data file, make corrections using a script, and then save a new, clean file.

  2. Annotate your code with reminders for future you: the # symbol (aka pound sign or hashtag) lets you add comments to a script. Use them generously to explain what each command does and what is being done in each section, and even to record links to websites you used to figure things out. Remember: ABA (always be annotating).
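Both rules can be seen in one short script. This is a sketch only: the script name, the file names, and the tiny demo data set are hypothetical, and the demo file is written here just so the example runs on its own.

```r
# ---- 01_clean_survey.R (hypothetical script name) ----
# Purpose: import the raw survey file, correct it, save a clean copy.
# The file in data_raw/ is only ever read, never overwritten.

# Hypothetical file names; replace with your own
raw_path   <- file.path("data_raw", "survey_raw.csv")
clean_path <- file.path("data_clean", "survey_clean.csv")

# Demo only: create a tiny raw file so this sketch is self-contained
dir.create("data_raw", showWarnings = FALSE)
dir.create("data_clean", showWarnings = FALSE)
writeLines(c("site,count", "A,3", "A,3", "B,5"), raw_path)

# 1. Import the uncorrected original data
survey <- read.csv(raw_path)

# 2. Corrections go here, one commented step per change
survey <- unique(survey)  # drop exact duplicate rows

# 3. Save the result as a NEW file in data_clean/
write.csv(survey, clean_path, row.names = FALSE)
```

Because every correction lives in the script, the clean file can be regenerated from the untouched raw file at any time.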

5.3 Take your data from messy to clean in these 5 steps:

  1. Familiarize yourself with the data set
  2. Check for potential errors. These can be structural errors (e.g., misaligned columns, duplicated rows/columns, missing values), data entry errors, or measurement errors. Decide how you will flag and deal with them.
  3. Decide how to deal with missing values
  4. Identify ways to simplify data values (i.e., codes, abbreviations) and column headings
  5. Write code to load the ‘raw’ data file, implement your corrections/changes, and save the ‘clean’ version of the data.
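Steps 1 and 2 usually start with a handful of inspection commands. The sketch below uses base R and a small hypothetical data frame (the column names and values are invented for illustration) in place of an imported raw file.

```r
# Hypothetical example data standing in for an imported raw file
raw <- data.frame(
  site   = c("A", "A", "B", "b ", NA),
  weight = c("12.5", "12.5", "9,8", "0", "11.1"),
  stringsAsFactors = FALSE
)

# Step 1: get familiar with the structure and contents
str(raw)      # column names and data types
summary(raw)  # quick per-column summary
head(raw)     # first few rows

# Step 2: look for structural and data entry problems
n_duplicates <- sum(duplicated(raw))              # exact duplicate rows
n_missing    <- colSums(is.na(raw))               # missing values per column
site_codes   <- table(raw$site, useNA = "ifany")  # inconsistent codes show up here

# Values that cannot be read as numbers; flag them rather than fixing silently
as_number   <- suppressWarnings(as.numeric(raw$weight))
not_numeric <- raw$weight[is.na(as_number)]
```

The results (duplicate counts, missing values per column, unexpected codes) feed directly into the decisions in steps 2 to 4.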

Things to look out for: Make a list of what you think needs to be corrected and the steps necessary to identify and implement each correction. Some of the things to look out for include:

  • Numeric values stored as character data types
  • Factors stored as characters
  • Duplicate rows
  • Spelling mistakes
  • Inconsistent formatting (e.g., codes, capitalization)
  • White spaces
  • Missing data
  • Zeros instead of null values
  • Special characters (e.g., commas in numeric values instead of decimal points)
  • Column headings with spaces between words or that start with numerals
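Several of the problems listed above can be corrected with a few lines of base R. The sketch below is one possible approach, not the only one; the data frame, its column names, and the decision to treat zeros as missing are all assumptions made for illustration.

```r
# Hypothetical messy data; check.names = FALSE keeps the bad headings as typed
messy <- data.frame(
  "Site Name"  = c(" forest", "Forest ", "FOREST", "pasture"),
  "2024 count" = c("1,5", "2,0", "2,0", "0"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

clean <- messy

# Column headings: no spaces, no leading numerals
names(clean) <- c("site_name", "count_2024")

# White space and inconsistent capitalization
clean$site_name <- tolower(trimws(clean$site_name))

# Decimal commas, then numbers stored as text -> numeric
clean$count_2024 <- as.numeric(sub(",", ".", clean$count_2024, fixed = TRUE))

# Zeros that really mean "not measured" -> NA
# (assumption: confirm what a zero means before doing this)
clean$count_2024[clean$count_2024 == 0] <- NA

# Categories stored as text -> factor
clean$site_name <- factor(clean$site_name)

# Duplicate rows, which only become detectable after the fixes above
clean <- unique(clean)
```

The order matters here: two of the rows only turn out to be duplicates once the white space and capitalization have been standardized.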

Remember, the characteristics of a clean data set include:

  • Free of duplicate rows/values
  • Error-free (correct misspellings, eliminate special characters)
  • Correct data type for analysis
  • Outliers identified and dealt with appropriately
  • “tidy” data structure
Tip: When working with multiple files, should you correct before or after combining?

We often need to combine multiple files with the same kind of data (e.g., surveys conducted in Year 1 and in Year 2, each of which is recorded in its own .csv file). Is it more efficient to correct each file first and then combine them, or to combine first and then correct?

It depends, and the answer can vary from one project to another. Make an outline of the steps and corrections needed for each file and see if you can decide which order is more efficient. There are often several ways to do the same thing, and this outline will help you figure out which is best. For instance, you could:

Option 1
1. Import Table 1
2. Correct the column headings in Table 1
3. Import Table 2
4. Correct the column headings in Table 2
5. Bind Table 1 and Table 2 together

but this is less efficient than…

Option 2
1. Import Table 1
2. Import Table 2
3. Bind Table 1 and Table 2 together
4. Correct the column headings in the combined table
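Option 2 might look like the sketch below. The two tables and their headings are hypothetical, and the example assumes both files share exactly the same (messy) column headings; base R's rbind() matches columns by name, so if the headings differ between files they need to be corrected before binding, as in Option 1. The added year column is my own illustration of keeping each row traceable to its source file.

```r
# Hypothetical Year 1 and Year 2 survey tables with identical messy headings
year1 <- data.frame("Site Name" = c("A", "B"), "Bird Count" = c(3, 5),
                    check.names = FALSE)
year2 <- data.frame("Site Name" = c("A", "C"), "Bird Count" = c(4, 1),
                    check.names = FALSE)

# Record the source before combining so each row stays traceable
year1$year <- 1
year2$year <- 2

# Bind first...
surveys <- rbind(year1, year2)

# ...then correct the headings once, for the combined table
names(surveys) <- gsub(" ", "_", tolower(names(surveys)))
```

The heading correction is written and run once instead of once per file, which is where the efficiency gain comes from.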

5.4 Tools & Resources

  1. These introductions to R and RStudio were made by Professor Ethan White (UF-WEC). They are a good overview of some R basics.

  2. The Carpentries’ R workshops (self-paced or taught in person) are excellent; I use many of their materials in class.

  3. Software Carpentry lesson on Project Management with RStudio

  4. Hadley Wickham wrote a book on using the tidyverse and the online version is FREE. This is a phenomenal resource on using R to import, tidy, and visualize data.

  5. RStudio Cheat Sheets: help with commands for using the different tidyverse packages, RStudio shortcuts and tricks, help with R commands, and more. You definitely want the ones for Data Import, Work with Strings, Factors, Data Transformation, and Base R.

  6. Where and How to ask for help

  7. Ten simple rules for biologists learning to program

  8. Lots more on the course’s ‘Resources’ page

6 Additional (interesting) Reading

  1. Lewis, Keith P., Eric Vander Wal, and David A. Fifield. 2018. Wildlife biology, big data, and reproducible research. Wildlife Society Bulletin 42(1): 172-179.

  2. White EP, Baldridge E, Brym ZT, Locey KJ, McGlinn DJ, Supp SR. 2013. Nine simple ways to make it easier to (re)use your data. Ideas in Ecology and Evolution 6(2): 1-10.

  3. The humanities have a ‘reproducibility’ problem

  4. The humanities do not need a replication drive

  5. Reproducible Research: A primer for the social sciences

  6. Replicability and replication in the humanities

  7. Towards reproducible science in the digital humanities

  8. The possibility and desirability of replication in the humanities

  9. Reproducible Research: A Retrospective

For when you feel more comfortable with R and programming

  1. Bryan, J. (2018). Excuse me, do you have a moment to talk about version control? The American Statistician, 72(1), 20-27.

  2. Wilson, Greg, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, and Tracy K. Teal. Good enough practices in scientific computing. PLoS Computational Biology 13, no. 6 (2017): e1005510.