20 Class Outline: QA/QC 2 - Open Refine
By the end of this lesson students will:
Be able to import a data set into OpenRefine, make changes to the data set and its structure, and export the revised data set
Learn how to automatically track changes made and export the record of changes
Be able to apply these changes to a different data set
20.1 Pre-Class Preparation (Instructor):
- Remind via email about OpenRefine Installation
- Post Data sets
Bring to Class:
- Snacks
- Tent cards for student names
20.2 Pre-class Preparation (Students):
20.2.1 Online Lectures: None
20.2.2 Readings: None
20.2.3 Computer Resources
Install OpenRefine on your computer and verify it works by following the instructions here.
Optional: Read and watch about how OpenRefine works here. You can also review the basic workflow we will learn.
20.3 In-Class: Using Open Refine to clean data
OpenRefine is a powerful, free, and open source tool that is used to work with and clean messy data. We will be working through some of OpenRefine’s basic features, after which you will trying them onm your own on a new data set.
This session uses the Data Carpentry Lesson OpenRefine for Social Science Data to introducee students to how OpenRefine can be used to clean datasets. The lesson’s objectives are:
- To describe OpenRefine’s uses and applications,
- Differentiate data cleaning from data organization, and
- Experiment with OpenRefine’s user interface.
20.3.1 Key Points to empghasize:
- OpenRefine provides a set of tools to identify and correct messy data.
- All actions are easily reversed
- If you save your work it will be to a new file. OpenRefine always uses a copy of your data and does not modify your original dataset
- OpenRefine keeps track of all of your actions and allows them to be applied to different datasets
- It is open source with a large user community, making it easy to get
- It works very well with ‘large-ish’ datasets (100,000 rows), but can adjust memory allocation to accommodate larger datasets.
- OpenRefine always keeps your data private on your own computer until you choose to share it. It works by running a small server on your computer and using your web browser to interact with it. Your private data never leaves your computer unless you want it to
20.3.2 Installation and Setup
Follow the Setup instructions to install OpenRefine.
If after installation and running OpenRefine, it does not automatically open for you, point your browser at http://127.0.0.1:3333/ or http://localhost:3333 to launch the program.
The data for this lesson is a part of the ‘Data Carpentry Social Sciences’ workshop. It is a teaching version of the Studying African Farmer-Led Irrigation (SAFI) database: interviews of farmers in two countries in eastern sub-Saharan Africa (Mozambique and Tanzania). These interviews were conducted between November 2016 and June 2017 and probed household features (e.g. construction materials used, number of household members), agricultural practices (e.g. water usage), and assets (e.g. number and types of livestock). The data used in this lesson is a subset of the teaching version that has been intentionally ‘messed up’ for this lesson.
Download the data file to your computer to a location where you will be able to find it during the lesson.
20.3.3 Part 1
- Intro to OR
- Working with OR
-Filtering and Sorting
20.3.4 Break
20.3.5 Part 2
- Examining Numbers
- Using Scripts, Exporting, and Saving
- Wrap-up, Questions
20.3.6 Assignment
Now it’s your turn. Download this csv file and use OpenRefine to clean it up. After you create a Project, edit the data as follows:
Correct and standardize the names of the countries in which the rodents were captured.
The column
scientificNamecontains two pieces of information (the genus and species of each animal). Split this into two columns, rename them asgenusandspecies, and then correct and standardize the data in each column as needed. NB: You may run into an obstacle when you try to rename the columns. How can you get around it?Save the clean data as a
.csvfile on your desktop.
Extract and save your steps (i.e., ‘operation history’ as JSON. Save this as a text file.
Bonus Brainteaser: Many of the cells in the column for the Latin bonomial are blank. How might you go about filling them in based on the column with the abbreviation?
Submission: Submit your clean
.csvand the.txtfile of JSON output on Canvas.
20.3.7 Grading Rubric:
- Data corrected and JSON file can be used on another data set: 35
- Most data correction properly programmed; some require instructor follow-up: 25
- Many of the corrections missing, JSON file unable to process new data : 20
- Instructor follow-up required to implement most changes: 15
The materials in this lesson are derived from the Data Carpentry Social Sciences workshop by The Carpentries. The materials Materials are licensed under CC-BY 4.0 by the authors