13 Automated Data Collection & Extraction – LAS 6292: Data Collection and Management

13.1 APIs

Isabella Velásquez: Creating APIs for Data Science With plumber: [link] (and the link to the plumbr package)
SESYNC: Data APIs in R Lesson [link]
The Carpentries: APIs with R [link]
Md Sakhawat Hossen: Working with the APIs (Application Programming Interfaces) in R [link]

13.2 Optical Character Recognition

Video Primer: What is OCR?

Online OCR Tools (text & data from `.pdf` to `.csv`, `.txt`, etc.)

Google Drive - Video Primer: OCR with Google Drive
Free online sites for small batches (can upgrade for larger numbers of files)
- Free Online OCR 1
- New OCR
- pdf to excel
- OnlineOCR
- PDFTables will convert PDF to .csv, and has an API so you can do your conversions in bulk with R. You can do ~25 pages free; large numbers are reasonably priced.
Amazon TextExtract
Mathpix Snip digitizes handwritten or printed text, and copies outputs to the clipboard that can be pasted into LaTeX editors like Overleaf, Markdown editors like Typora, Microsoft Word, and more.

OCR with R

R package pdftools
R package tabulapdf
Detailed Blog Post & Tutorial
Other Written Tutorials: tutorial 1, tutorial 2, tutorial 3
Video Tutorials: Video Tutorial 1, Video Tutorial 2
Convert PDF to text in R - OCR pdftools
PDFtools in R
More advanced but more powerful from the Programming Historian: OCR with Google Vision API and Tesseract

13.3 Extracting tables from images with R

R package magick (this package actually includes several very powerful tools for image processing; this is just one of the things you can do with it)
Detailed Blog Post / Tutorial

13.4 Extracting Data from Published Figures

Ankit Rohagni’s Web Plot Digitizer
- WPD Video Tutorial
- WPD Tutorial Blog Post
Alternative 1: R package magick
Alternative 2: GetData extracts data automatically from scanned images (~$30).
Alternative 3: R package digitize will extract data from scatterplots within the R environment. Note that this is no longer on CRAN due to dependency issues but you can still install and try to work with it directly from the Github repository.
A more comprehensive overview: Brauckhoff, et al., “Exploring Image Analysis in R: Applications and Advancements”, The R Journal, 2025 [link]

13.5 Text Mining

Text Mining with R by Julia Silge and David Robinson
gutenbergr: Download and Process Public Domain Works from Project Gutenberg. Tutorial can be found here

Useful reading on text mining

Atanassova I, Bertin M and Mayr P (2019) Editorial: Mining Scientific Papers: NLP-enhanced Bibliometrics. Front. Res. Metr. Anal. 4:2. doi: 10.3389/frma.2019.00002
Westergaard D, Stærfeldt H-H, Tønsberg C, Jensen LJ, Brunak S (2018) A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts. PLoS Comput Biol 14(2): e1005962. https://doi.org/10.1371/journal.pcbi.1005962
Salloum, S.A., Al-Emran, M., Monem, A.A., Shaalan, K. (2018). Using Text Mining Techniques for Extracting Information from Research Articles. In: Shaalan, K., Hassanien, A., Tolba, F. (eds) Intelligent Natural Language Processing: Trends and Applications. Studies in Computational Intelligence, vol 740. Springer, Cham. https://doi.org/10.1007/978-3-319-67056-0_18
Simon, C., Davidsen, K., Hansen, C. et al. BioReader: a text mining tool for performing classification of biomedical literature. BMC Bioinformatics 19, 57 (2019). https://doi.org/10.1186/s12859-019-2607-x
Benchimol, J., Kazinnik, S., & Saadon, Y. (2022). Text mining methodologies with R: An application to central bank texts. Machine Learning with Applications, 8, 100286.link
Yu, C., Zhang, C., & Wang, J. (2020). Extracting Body Text from Academic PDF Documents for Text Mining. Proceedings of the 12th International Conference on Knowledge Discovery and Information Retrieval (KDIR 2020). https://www.scitepress.org/Papers/2020/101314/101314.pdf
Gulo, C. A., & Rúbio, T. R. (2015, January). Text Mining Scientific Articles using the R. In Doctoral Symposium in Informatics Engineering. link

13.6 Collecting & Processing data from the Web of Science and Scopus

R package refsplitr
R package bibliometrix
jstor: An R package for Analysing Scientific Articles and link to JSTORr package repository on Github. The package is no longer on CRAN but you may still find it useful to browse the repo.

13.7 Scraping websites

Library Carpentry Lesson on Webscraping Note: no longer updated and a bit out of date, but still a very useful introduction
Start Here: Introduction to webscraping
Video: Scraping WebData in R with rvest
Video: Practical Introduction to Web Scraping using R
Very nice written tutorial…
….and another one, this time from the UC Business Analytics R Programming Guide
SelectorGadget is useful to id CSS selectors.
Noortje Marres & Esther Weltevrede (2013) Scraping the Social?, Journal of Cultural Economy, 6:3, 313-335, DOI:10.1080/17530350.2013.772070

13.8 Cell Phone Data

Exploratory analyses Part 1 and Part 2

13.9 Social Media Data

How to extract Biodiversity Data from Facebook: Chowdhury, S., Ahmed, S., Alam, S., Callaghan, C. T., Das, P., Di Marco, M., Di Minin, E., Jarić, I., Labi, M. M., Rokonuzzaman, Md., Roll, U., Sbragaglia, V., Siddika, A., & Bonn, A. (2024). A protocol for harvesting biodiversity data from Facebook. Conservation Biology, 38, e14257. https://doi.org/10.1111/cobi.14257
Di Minin, E., Fink, C., Hausmann, A., Kremer, J. and Kulkarni, R. (2021), How to address data privacy concerns when using social media data in conservation science. Conservation Biology, 35: 437-446. https://doi.org/10.1111/cobi.13708
Correia, R.A., Ladle, R., Jarić, I., Malhado, A.C.M., Mittermeier, J.C., Roll, U., Soriano-Redondo, A., Veríssimo, D., Fink, C., Hausmann, A., Guedes-Santos, J., Vardi, R. and Di Minin, E. (2021), Digital data sources and methods for conservation culturomics. Conservation Biology, 35: 398-411. https://doi.org/10.1111/cobi.13706
Fox, Nathan, Tom August, Francesca Mancini, Katherine E. Parks, Felix Eigenbrod, James M. Bullock, Louis Sutter, and Laura J. Graham. ““photosearcher” package in R: An accessible and reproducible method for harvesting large datasets from Flickr.” SoftwareX 12 (2020): 100624. https://www.sciencedirect.com/science/article/pii/S235271102030337X
Ergon Cugler de Moraes Silva: TelegramScrap: A comprehensive tool for scraping Telegram data. https://arxiv.org/abs/2412.16786
Batrinca, B., Treleaven, P.C. Social media analytics: a survey of techniques, tools and platforms. AI & Soc 30, 89–116 (2015). https://doi.org/10.1007/s00146-014-0549-4

13.10 Automated Image Analysis

Pennekamp, F. and Schtickzelle, N. (2013), Implementing image analysis in laboratory‐based experimental systems for ecology and evolution: a hands‐on guide. Methods Ecol Evol, 4: 483-492. https://doi.org/10.1111/2041-210X.12036
How to build your own image recognition app with R! Part 1 and Part 2
LinkedIn Learning Course: Deep Learning - Image Recognition reequires UF login
UF Practicum AI courses include one on Image recognition models.

13.11 Wearable Devices & RFID tags

What is an RFID tag?
Rafiq, K., Appleby, R. G., Edgar, J. P., Radford, C., Smith, B. P., Jordan, N. R., Dexter, C. E., Jones, D. N., Blacker, A. R. F., & Cochrane, M. (2021). WildWID: An open-source active RFID system for wildlife research. Methods in Ecology and Evolution, 12, 1580– 1587. https://doi.org/10.1111/2041-210X.13651
Build your own RFID device
Izmailova, E.S., Wagner, J.A. and Perakslis, E.D. (2018), Wearable Devices in Clinical Trials: Hype and Hypothesis. Clin. Pharmacol. Ther., 104: 42-52. https://doi.org/10.1002/cpt.966
Loncar-Turukalo T, Zdravevski E, Machado da Silva J, Chouvarda I, Trajkovik V. Literature on Wearable Technology for Connected Health: Scoping Review of Research Trends, Advances, and Barriers J Med Internet Res 2019;21(9):e14017 doi: 10.2196/14017
Why Should Sociologists Care about Wearable Tech?
Harari, G. M., Lane, N. D., Wang, R., Crosier, B. S., Campbell, A. T., & Gosling, S. D. (2016). Using Smartphones to Collect Behavioral Data in Psychological Science: Opportunities, Practical Considerations, and Challenges. Perspectives on Psychological Science 11(6), 838-854 https://doi.org/10.1177/1745691616650285
Seifert Alexander, Hofer Matthias, Allemand Mathias. 2018. Mobile Data Collection: Smart, but Not (Yet) Smart Enough. 12. Frontiers in Neuroscience https://www.frontiersin.org/article/10.3389/fnins.2018.00971

13.12 Buildimg automated data collectors

Calipers that dump data directly to Excel link
Morris BI, Kittredge MJ, Casey B, Meng O, Chagas AM, Lamparter M, et al. (2022) PiSpy: An affordable, accessible, and flexible imaging platform for the automated observation of organismal biology and behavior. PLoS ONE 17(10): e0276652. https://doi.org/10.1371/journal.pone.0276652
Jolles, J. W. (2021). Broad-scale applications of the Raspberry Pi: A review and guide for biologists. Methods in Ecology and Evolution, 12, 1562– 1579. https://doi.org/10.1111/2041-210X.13652

13.13 Online data archives

Overview: Correia, R.A., Ladle, R., Jarić, I., Malhado, A.C.M., Mittermeier, J.C., Roll, U., Soriano‐Redondo, A., Veríssimo, D., Fink, C., Hausmann, A., Guedes‐Santos, J., Vardi, R. and Di Minin, E. (2021), Digital data sources and methods for conservation culturomics. Conservation Biology, 35: 398-411. https://doi.org/10.1111/cobi.13706

Government data

Data.gov (the open data portal of the US Government) and Using Data.gov APIs in R
the rOpengov Project
Open Fiscal Data Package
educationdata: Retrieve data from the Urban Institute’s Education Data API as a data.frame for easy analysis. See also here
a huge list of data sources for social scientists available with R tools
accessing World bank Data with R

US & World Census Data

A Guide to Working with US Census Data in R
R Package tidycensus
Tutorial 1
Tutorial 2
R package ipumsr: The ipumsr package helps import IPUMS extracts from the IPUMS website into R. IPUMS provides census and survey data from around the world integrated across time and space.

Education Data

edbuildr: import EdBuild’s master dataset of school district finance, student demographics, and community economic indicators for every school district in the United States.
Building R and Stata packages for the Education Data Portal

Other Online Data Portals

Giant compendium of open datasets #1
Data on Amazonia
R package bdc: toolkit for gathering & cleaning biodiversity data

Software for gathering data from online archives

EcoRetriever: automates the tasks of finding, downloading, and cleaning up publicly available ecological data, and then stores them in a local database or csv files.
litsearcher an R package to facilitate quasi-automatic search strategy development for systematic review