12 Automated Data Collection & Extraction
12.1 Optical Character Recognition
- Video Primer: What is OCR?
Online OCR Tools (text & data from .pdf to .csv, .txt, etc.)
Google Drive - Video Primer: OCR with Google Drive
Free online sites for small batches (can upgrade for larger numbers of files)
- Free Online OCR 1
- New OCR
- pdf to excel
- OnlineOCR
- PDFTables will convert PDF to .csv, and has an API so you can do your conversions in bulk with R. You can do ~25 pages free; large numbers are reasonably priced.
Mathpix Snip digitizes handwritten or printed text, and copies outputs to the clipboard that can be pasted into LaTeX editors like Overleaf, Markdown editors like Typora, Microsoft Word, and more.
OCR with R
- R package pdftools
- R package tabulapdf
- Detailed Blog Post / Tutorial
More advanced but more powerful from the Programming Historian: OCR with Google Vision API and Tesseract
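As a rough illustration of the R route, the sketch below reads the embedded text layer of a PDF with pdftools and falls back to OCR only for pages that come back blank (typically scanned pages). The names `is_blank_page()` and `extract_pdf_text()` are hypothetical helpers, not part of any package, and the OCR step assumes the tesseract package is installed alongside pdftools.

```r
# Flag pages whose extracted text is empty or whitespace only.
is_blank_page <- function(pages) {
  !nzchar(trimws(pages))
}

# Sketch: return one string per page, using OCR only where needed.
extract_pdf_text <- function(pdf_path) {
  # pdf_text() returns a character vector with one element per page
  pages <- pdftools::pdf_text(pdf_path)
  blank <- is_blank_page(pages)
  if (any(blank)) {
    # pdf_ocr_text() renders the requested pages and runs tesseract on them
    pages[blank] <- pdftools::pdf_ocr_text(pdf_path, pages = which(blank))
  }
  pages
}

# Usage (hypothetical file name):
# pages <- extract_pdf_text("scanned_report.pdf")
# writeLines(pages, "scanned_report.txt")
```

Running OCR only on blank pages keeps conversions fast for PDFs that mix digital and scanned pages.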
12.2 Extracting tables from images with R
- R package magick (this package actually includes several very powerful tools for image processing; this is just one of the things you can do with it)
- Detailed Blog Post / Tutorial
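A minimal sketch of the magick workflow, assuming the table image is clean and its columns are separated by whitespace: the image is converted to grayscale and enlarged before OCR, and the recognized text is then parsed into a data frame. `ocr_table_image()` and `ocr_text_to_table()` are hypothetical helper names, and `image_ocr()` requires the tesseract package.

```r
# Sketch: preprocess an image of a table and run OCR on it.
ocr_table_image <- function(image_path) {
  img <- magick::image_read(image_path)
  img <- magick::image_convert(img, type = "Grayscale")  # drop color information
  img <- magick::image_resize(img, "2000x")              # rescale to 2000 px wide
  magick::image_ocr(img)                                 # returns the recognized text
}

# Parse whitespace-separated OCR text whose first row holds the column names.
ocr_text_to_table <- function(ocr_text) {
  utils::read.table(text = ocr_text, header = TRUE, stringsAsFactors = FALSE)
}

# Usage with a hand-written string standing in for OCR output:
example_text <- "site count\nA 12\nB 7\n"
example_table <- ocr_text_to_table(example_text)
```

Real OCR output usually needs manual cleanup (stray characters, merged columns) before this parsing step works.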
12.3 Extracting Data from Published Figures
Ankit Rohatgi’s WebPlotDigitizer
- WPD Video Tutorial
- WPD Tutorial Blog Post
Alternative 1: R package magick
Alternative 2: GetData extracts data automatically from scanned images (~$30).
Alternative 3: R package digitize will extract data from scatterplots within the R environment. This article will walk you through the process.
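Figure digitizers generally work by calibrating each axis from two points with known values and then converting clicked pixel positions to data values. The sketch below shows that arithmetic in base R for linear axes only; `calibrate_axis()` and all pixel values are made up for illustration and are not taken from any of the tools above.

```r
# Convert pixel positions to data values by linear interpolation.
# pixel_ref: pixel positions of two known axis ticks
# value_ref: the data values printed at those two ticks
calibrate_axis <- function(pixel, pixel_ref, value_ref) {
  slope <- diff(value_ref) / diff(pixel_ref)  # data units per pixel
  value_ref[1] + (pixel - pixel_ref[1]) * slope
}

# Hypothetical clicks on two plotted points
x_data <- calibrate_axis(c(150, 300), pixel_ref = c(100, 500), value_ref = c(0, 10))
# Image y coordinates grow downward, so the y slope comes out negative
y_data <- calibrate_axis(c(400, 250), pixel_ref = c(450, 50), value_ref = c(0, 100))

data.frame(x = x_data, y = y_data)
```

For a logarithmic axis, the same function can be applied to `log10(value_ref)` and the result back-transformed with `10^`.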
12.4 Text Mining
Text Mining with R by Julia Silge and David Robinson
gutenbergr: Download and Process Public Domain Works from Project Gutenberg. Tutorial can be found here
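The tidy text approach described in Text Mining with R boils down to one token per row, followed by ordinary dplyr verbs. Here is a minimal sketch, assuming the dplyr and tidytext packages are installed; the two sentences are invented so that the example runs offline (a text downloaded with gutenbergr could be swapped in).

```r
library(dplyr)
library(tidytext)

# Made-up example text, one row per line of text
text_df <- tibble(
  line = 1:2,
  text = c("The scanned pages were noisy.",
           "Tesseract reads the scanned pages.")
)

word_counts <- text_df %>%
  unnest_tokens(word, text) %>%            # one lowercase word per row
  anti_join(stop_words, by = "word") %>%   # drop common words such as "the"
  count(word, sort = TRUE)                 # frequency of each remaining word

word_counts
```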
Useful reading on text mining
Atanassova I, Bertin M and Mayr P (2019) Editorial: Mining Scientific Papers: NLP-enhanced Bibliometrics. Front. Res. Metr. Anal. 4:2. doi: 10.3389/frma.2019.00002
Westergaard D, Stærfeldt H-H, Tønsberg C, Jensen LJ, Brunak S (2018) A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts. PLoS Comput Biol 14(2): e1005962. https://doi.org/10.1371/journal.pcbi.1005962
Salloum, S.A., Al-Emran, M., Monem, A.A., Shaalan, K. (2018). Using Text Mining Techniques for Extracting Information from Research Articles. In: Shaalan, K., Hassanien, A., Tolba, F. (eds) Intelligent Natural Language Processing: Trends and Applications. Studies in Computational Intelligence, vol 740. Springer, Cham. https://doi.org/10.1007/978-3-319-67056-0_18
Simon, C., Davidsen, K., Hansen, C. et al. BioReader: a text mining tool for performing classification of biomedical literature. BMC Bioinformatics 19, 57 (2019). https://doi.org/10.1186/s12859-019-2607-x
Benchimol, J., Kazinnik, S., & Saadon, Y. (2022). Text mining methodologies with R: An application to central bank texts. Machine Learning with Applications, 8, 100286. link
Yu, C., Zhang, C., & Wang, J. (2020). Extracting Body Text from Academic PDF Documents for Text Mining. arXiv preprint arXiv:2010.12647.
Gulo, C. A., & Rúbio, T. R. (2015, January). Text Mining Scientific Articles using the R. In Doctoral Symposium in Informatics Engineering. link
12.5 Collecting & Processing data from the Web of Science and Scopus
- R package refsplitr
- R package bibliometrix
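A minimal sketch of loading a Web of Science export with bibliometrix, assuming the records were exported from the Web of Science site as a plain-text file. `summarise_wos_export()` and the file name are hypothetical; `convert2df()` and `biblioAnalysis()` are the package's own functions.

```r
# Sketch: read a Web of Science plain-text export and compute summary statistics.
summarise_wos_export <- function(wos_file) {
  # convert2df() parses the export into a data frame with one row per record
  records <- bibliometrix::convert2df(
    file = wos_file, dbsource = "wos", format = "plaintext"
  )
  # biblioAnalysis() computes descriptive bibliometric measures
  bibliometrix::biblioAnalysis(records)
}

# Usage (hypothetical file name):
# results <- summarise_wos_export("savedrecs.txt")
# summary(results, k = 10, pause = FALSE)
```

For Scopus exports, `dbsource` and `format` need to be changed to match the exported file type.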
12.6 Scraping websites
- Library Carpentry Lesson on Webscraping
- Start Here: Introduction to webscraping
- Video: Scraping WebData in R with rvest
- Video: Practical Introduction to Web Scraping using R
- Very nice written tutorial…
- ….and another one, this time from the UC Business Analytics R Programming Guide
- scraping HTML text and scraping HTML tables
- SelectorGadget is useful to id CSS selectors.
- Noortje Marres & Esther Weltevrede (2013) Scraping the Social?, Journal of Cultural Economy, 6:3, 313-335, DOI: 10.1080/17530350.2013.772070
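The basic rvest pattern is to read a page, select nodes with CSS selectors (SelectorGadget helps find them), and extract their text or tables. The sketch below assumes rvest 1.0 or later and uses a small inline HTML snippet, invented for illustration, so that it runs without a network connection; for a live site, `minimal_html()` would be replaced by `read_html()` with the page's URL.

```r
library(rvest)

# Made-up page standing in for a real website
page <- minimal_html('
  <h1>Field sites</h1>
  <table id="sites">
    <tr><th>site</th><th>plots</th></tr>
    <tr><td>North</td><td>12</td></tr>
    <tr><td>South</td><td>8</td></tr>
  </table>
')

# Scrape text: select the heading by tag name
page_title <- page %>% html_element("h1") %>% html_text2()

# Scrape a table: select it by id and convert it to a data frame
sites <- page %>% html_element("#sites") %>% html_table()

page_title
sites
```

Before scraping a real site, check its terms of use and robots.txt, and pause between requests.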
12.7 Cell Phone Data
12.9 Automated Image Analysis
12.11 Building automated data collectors
Calipers that dump data directly to Excel link
PiSpy: An Affordable, Accessible, and Flexible Imaging Platform for the Automated Observation of Organismal Biology and Behavior
Jolles, J. W. (2021). Broad-scale applications of the Raspberry Pi: A review and guide for biologists. Methods in Ecology and Evolution, 12, 1562–1579. https://doi.org/10.1111/2041-210X.13652
12.12 Online data archives
Overview: Correia, R.A., Ladle, R., Jarić, I., Malhado, A.C.M., Mittermeier, J.C., Roll, U., Soriano‐Redondo, A., Veríssimo, D., Fink, C., Hausmann, A., Guedes‐Santos, J., Vardi, R. and Di Minin, E. (2021), Digital data sources and methods for conservation culturomics. Conservation Biology, 35: 398-411. https://doi.org/10.1111/cobi.13706
Government data
- Data.gov (the open data portal of the US Government) and Using Data.gov APIs in R
- the rOpengov Project
- Open Fiscal Data Package
- educationdata: Retrieve data from the Urban Institute’s Education Data API as a data.frame for easy analysis. See also here
- a huge list of data sources for social scientists available with R tools
- accessing World bank Data with R
US & World Census Data
- A Guide to Working with US Census Data in R
- R package tidycensus
- Tutorial 1
- Tutorial 2
- R package ipumsr: The ipumsr package helps import IPUMS extracts from the IPUMS website into R. IPUMS provides census and survey data from around the world integrated across time and space.
Education Data
edbuildr: import EdBuild’s master dataset of school district finance, student demographics, and community economic indicators for every school district in the United States.
Other Online Data Portals
- Giant compendium of open datasets #1
- Data on Amazonia
- R package bdc: toolkit for gathering & cleaning biodiversity data
Software for gathering data from online archives
- EcoRetriever: automates the tasks of finding, downloading, and cleaning up publicly available ecological data, and then stores them in a local database or csv files.
- litsearchr: an R package to facilitate quasi-automatic search strategy development for systematic reviews
12.8 Social Media Data
How to extract Biodiversity Data from Facebook
Fox, Nathan, Tom August, Francesca Mancini, Katherine E. Parks, Felix Eigenbrod, James M. Bullock, Louis Sutter, and Laura J. Graham. “‘photosearcher’ package in R: An accessible and reproducible method for harvesting large datasets from Flickr.” SoftwareX 12 (2020): 100624. https://www.sciencedirect.com/science/article/pii/S235271102030337X