13 Automated Data Collection & Extraction
Automating your data collection is one of the best ways to make your research more efficient and rfree up time to dso other things. Some of these techniques take time to learn, but so take into account the time required to learn a method and the return on investment for the effort.

Automated data extraction is a rapidly changing field due to advances in machine learning and other ‘AI’ tools. The list below will be changing frequently to reflect these advances, but working through some of these methods by hand is an excellent to way to evaluate what these tools can and can’t do for your research.
13.1 APIs
13.2 Optical Character Recognition
- Video Primer: What is OCR?
Online OCR Tools (text & data from .pdf to .csv, .txt, etc.)
Google Drive - Video Primer: OCR with Google Drive
Free online sites for small batches (can upgrade for larger numbers of files)
- Free Online OCR 1
- New OCR
- pdf to excel
- OnlineOCR
- PDFTables will convert PDF to .csv, and has an API so you can do your conversions in bulk with R. You can do ~25 pages free; large numbers are reasonably priced.
Mathpix Snip digitizes handwritten or printed text, and copies outputs to the clipboard that can be pasted into LaTeX editors like Overleaf, Markdown editors like Typora, Microsoft Word, and more.
OCR with R
- R package
pdftools - R package
tabulapdf - Detailed Blog Post & Tutorial
- Other Written Tutorials: tutorial 1, tutorial 2, tutorial 3
- Video Tutorials: Video Tutorial 1, Video Tutorial 2
- Convert PDF to text in R - OCR pdftools
- PDFtools in R
- More advanced but more powerful from the Programming Historian: OCR with Google Vision API and Tesseract
13.3 Extracting tables from images with R
- R package
magick(this package actually includes several very powerful tools for image processing; this is just one of the things you can do with it) - Detailed Blog Post / Tutorial
13.4 Extracting Data from Published Figures
- Ankit Rohagni’s Web Plot Digitizer
- WPD Video Tutorial
- WPD Tutorial Blog Post
- Alternative 1: R package
magick - Alternative 2: GetData extracts data automatically from scanned images (~$30).
- Alternative 3: R package
digitizewill extract data from scatterplots within the R environment. Note that this is no longer on CRAN due to dependency issues but you can still install and try to work with it directly from the Github repository. - A more comprehensive overview: Brauckhoff, et al., “Exploring Image Analysis in R: Applications and Advancements”, The R Journal, 2025 [link]
13.5 Text Mining
Text Mining with R by Julia Silge and David Robinson
gutenbergr: Download and Process Public Domain Works from Project Gutenberg. Tutorial can be found here
Useful reading on text mining
Atanassova I, Bertin M and Mayr P (2019) Editorial: Mining Scientific Papers: NLP-enhanced Bibliometrics. Front. Res. Metr. Anal. 4:2. doi: 10.3389/frma.2019.00002
Westergaard D, Stærfeldt H-H, Tønsberg C, Jensen LJ, Brunak S (2018) A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts. PLoS Comput Biol 14(2): e1005962. https://doi.org/10.1371/journal.pcbi.1005962
Salloum, S.A., Al-Emran, M., Monem, A.A., Shaalan, K. (2018). Using Text Mining Techniques for Extracting Information from Research Articles. In: Shaalan, K., Hassanien, A., Tolba, F. (eds) Intelligent Natural Language Processing: Trends and Applications. Studies in Computational Intelligence, vol 740. Springer, Cham. https://doi.org/10.1007/978-3-319-67056-0_18
Simon, C., Davidsen, K., Hansen, C. et al. BioReader: a text mining tool for performing classification of biomedical literature. BMC Bioinformatics 19, 57 (2019). https://doi.org/10.1186/s12859-019-2607-x
Benchimol, J., Kazinnik, S., & Saadon, Y. (2022). Text mining methodologies with R: An application to central bank texts. Machine Learning with Applications, 8, 100286.link
Yu, C., Zhang, C., & Wang, J. (2020). Extracting Body Text from Academic PDF Documents for Text Mining. Proceedings of the 12th International Conference on Knowledge Discovery and Information Retrieval (KDIR 2020). https://www.scitepress.org/Papers/2020/101314/101314.pdf
Gulo, C. A., & Rúbio, T. R. (2015, January). Text Mining Scientific Articles using the R. In Doctoral Symposium in Informatics Engineering. link
13.6 Collecting & Processing data from the Web of Science and Scopus
R package
refsplitrR package
bibliometrixjstor: An R package for Analysing Scientific Articles and link to JSTORr package repository on Github. The package is no longer on CRAN but you may still find it useful to browse the repo.
13.7 Scraping websites
- Library Carpentry Lesson on Webscraping Note: no longer updated and a bit out of date, but still a very useful introduction
- Start Here: Introduction to webscraping
- Video: Scraping WebData in R with rvest
- Video: Practical Introduction to Web Scraping using R
- Very nice written tutorial…
- ….and another one, this time from the UC Business Analytics R Programming Guide
- SelectorGadget is useful to id CSS selectors.
- Noortje Marres & Esther Weltevrede (2013) Scraping the Social?, Journal of Cultural Economy, 6:3, 313-335, DOI:10.1080/17530350.2013.772070
13.8 Cell Phone Data
13.10 Automated Image Analysis
Pennekamp, F. and Schtickzelle, N. (2013), Implementing image analysis in laboratory‐based experimental systems for ecology and evolution: a hands‐on guide. Methods Ecol Evol, 4: 483-492. https://doi.org/10.1111/2041-210X.12036
How to build your own image recognition app with R! Part 1 and Part 2
LinkedIn Learning Course: Deep Learning - Image Recognition reequires UF login
UF Practicum AI courses include one on Image recognition models.
13.12 Buildimg automated data collectors
Calipers that dump data directly to Excel link
Morris BI, Kittredge MJ, Casey B, Meng O, Chagas AM, Lamparter M, et al. (2022) PiSpy: An affordable, accessible, and flexible imaging platform for the automated observation of organismal biology and behavior. PLoS ONE 17(10): e0276652. https://doi.org/10.1371/journal.pone.0276652
Jolles, J. W. (2021). Broad-scale applications of the Raspberry Pi: A review and guide for biologists. Methods in Ecology and Evolution, 12, 1562– 1579. https://doi.org/10.1111/2041-210X.13652
13.13 Online data archives
Overview: Correia, R.A., Ladle, R., Jarić, I., Malhado, A.C.M., Mittermeier, J.C., Roll, U., Soriano‐Redondo, A., Veríssimo, D., Fink, C., Hausmann, A., Guedes‐Santos, J., Vardi, R. and Di Minin, E. (2021), Digital data sources and methods for conservation culturomics. Conservation Biology, 35: 398-411. https://doi.org/10.1111/cobi.13706
Government data
- Data.gov (the open data portal of the US Government) and Using Data.gov APIs in R
- the rOpengov Project
- Open Fiscal Data Package
educationdata: Retrieve data from the Urban Institute’s Education Data API as a data.frame for easy analysis. See also here- a huge list of data sources for social scientists available with R tools
- accessing World bank Data with R
US & World Census Data
- A Guide to Working with US Census Data in R
- R Package
tidycensus - Tutorial 1
- Tutorial 2
- R package
ipumsr: The ipumsr package helps import IPUMS extracts from the IPUMS website into R. IPUMS provides census and survey data from around the world integrated across time and space.
Education Data
edbuildr: import EdBuild’s master dataset of school district finance, student demographics, and community economic indicators for every school district in the United States.
Other Online Data Portals
- Giant compendium of open datasets #1
- Data on Amazonia
- R package
bdc: toolkit for gathering & cleaning biodiversity data
Software for gathering data from online archives
- EcoRetriever: automates the tasks of finding, downloading, and cleaning up publicly available ecological data, and then stores them in a local database or csv files.
- litsearcher an R package to facilitate quasi-automatic search strategy development for systematic review
13.9 Social Media Data
How to extract Biodiversity Data from Facebook: Chowdhury, S., Ahmed, S., Alam, S., Callaghan, C. T., Das, P., Di Marco, M., Di Minin, E., Jarić, I., Labi, M. M., Rokonuzzaman, Md., Roll, U., Sbragaglia, V., Siddika, A., & Bonn, A. (2024). A protocol for harvesting biodiversity data from Facebook. Conservation Biology, 38, e14257. https://doi.org/10.1111/cobi.14257
Di Minin, E., Fink, C., Hausmann, A., Kremer, J. and Kulkarni, R. (2021), How to address data privacy concerns when using social media data in conservation science. Conservation Biology, 35: 437-446. https://doi.org/10.1111/cobi.13708
Correia, R.A., Ladle, R., Jarić, I., Malhado, A.C.M., Mittermeier, J.C., Roll, U., Soriano-Redondo, A., Veríssimo, D., Fink, C., Hausmann, A., Guedes-Santos, J., Vardi, R. and Di Minin, E. (2021), Digital data sources and methods for conservation culturomics. Conservation Biology, 35: 398-411. https://doi.org/10.1111/cobi.13706
Fox, Nathan, Tom August, Francesca Mancini, Katherine E. Parks, Felix Eigenbrod, James M. Bullock, Louis Sutter, and Laura J. Graham. ““photosearcher” package in R: An accessible and reproducible method for harvesting large datasets from Flickr.” SoftwareX 12 (2020): 100624. https://www.sciencedirect.com/science/article/pii/S235271102030337X
Ergon Cugler de Moraes Silva: TelegramScrap: A comprehensive tool for scraping Telegram data. https://arxiv.org/abs/2412.16786
Batrinca, B., Treleaven, P.C. Social media analytics: a survey of techniques, tools and platforms. AI & Soc 30, 89–116 (2015). https://doi.org/10.1007/s00146-014-0549-4