13  Automated Data Collection & Extraction

Modified

May 7, 2026

Automating your data collection is one of the best ways to make your research more efficient and free up time to do other things. Some of these techniques take time to learn, so weigh the time required to learn a method against the return on investment for that effort.

xkcd.com: Is it Worth the Time?
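One rough way to weigh that return on investment is a break-even calculation: the time an automated method saves over the life of a project is an upper bound on the time worth spending to learn and build it. The sketch below uses base R; the function name and the example numbers are hypothetical and chosen only for illustration.

```r
# Hypothetical helper: total hours saved by automating a repeated task.
# This is the most time you could invest in automation and still break even.
break_even_hours <- function(minutes_saved_per_run, runs_per_week, horizon_weeks) {
  total_minutes_saved <- minutes_saved_per_run * runs_per_week * horizon_weeks
  total_minutes_saved / 60
}

# Example: a script saves 10 minutes on a task done 5 times a week,
# and the project runs for two years (104 weeks).
budget <- break_even_hours(minutes_saved_per_run = 10,
                           runs_per_week = 5,
                           horizon_weeks = 104)
round(budget, 1)  # about 86.7 hours
```

If learning the method is likely to take much less time than this budget, automation is probably worth it; if it is likely to take more, doing the task by hand may be the better choice.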

Tip: AI and Automated Data Extraction

Automated data extraction is a rapidly changing field due to advances in machine learning and other ‘AI’ tools. The list below will change frequently to reflect these advances, but working through some of these methods by hand is an excellent way to evaluate what these tools can and can’t do for your research.

13.1 APIs

  1. Isabella Velásquez: Creating APIs for Data Science With plumber: [link] (and the link to the plumber package)
  2. SESYNC: Data APIs in R Lesson [link]
  3. The Carpentries: APIs with R [link]
  4. Md Sakhawat Hossen: Working with the APIs (Application Programming Interfaces) in R [link]
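Many web APIs return their results as JSON, so a large part of working with an API in R is turning that JSON into a data frame. The sketch below assumes the jsonlite package is installed and uses a made-up JSON string in place of a real response; the field names are hypothetical. In practice the text would come from an HTTP request to the API's documented endpoint, for example with the httr2 package.

```r
library(jsonlite)

# Made-up JSON standing in for the body of an API response.
response_text <- '[
  {"site": "A", "year": 2023, "count": 12},
  {"site": "B", "year": 2023, "count": 7}
]'

# fromJSON() simplifies an array of records into a data frame,
# with one row per record and one column per field.
records <- fromJSON(response_text)
str(records)
```

Check each API's documentation for its endpoints, query parameters, rate limits, and whether it requires a key before sending requests.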

13.2 Optical Character Recognition

  1. Video Primer: What is OCR?

Online OCR Tools (text & data from .pdf to .csv, .txt, etc.)

  1. Google Drive - Video Primer: OCR with Google Drive

  2. Free online sites for small batches (can upgrade for larger numbers of files)

  3. Amazon Textract

  4. Mathpix Snip digitizes handwritten or printed text, and copies outputs to the clipboard that can be pasted into LaTeX editors like Overleaf, Markdown editors like Typora, Microsoft Word, and more.

OCR with R

  1. R package pdftools
  2. R package tabulapdf
  3. Detailed Blog Post & Tutorial
  4. Other Written Tutorials: tutorial 1, tutorial 2, tutorial 3
  5. Video Tutorials: Video Tutorial 1, Video Tutorial 2
  6. Convert PDF to text in R - OCR pdftools
  7. PDFtools in R
  8. More advanced but more powerful from the Programming Historian: OCR with Google Vision API and Tesseract
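Before running OCR it is worth checking whether a PDF already has an embedded text layer, since reading that layer is faster and usually more accurate than recognizing characters from an image. The sketch below assumes the pdftools package is installed; `read_pdf_pages()` is a hypothetical helper, and its OCR fallback additionally needs the tesseract package. The sample PDF is generated with base R graphics only so that the example is self-contained.

```r
library(pdftools)

# Hypothetical helper: use the embedded text layer when there is one,
# otherwise fall back to OCR (the fallback needs the tesseract package).
read_pdf_pages <- function(path) {
  pages <- pdf_text(path)               # one string per page
  if (all(!nzchar(trimws(pages)))) {    # no text layer, e.g. a scanned page
    pages <- pdf_ocr_text(path)
  }
  pages
}

# Build a small one-page PDF with base R graphics to try the helper on.
pdf_path <- tempfile(fileext = ".pdf")
pdf(pdf_path)
plot.new()
text(0.5, 0.5, "plot 7 height 132")
invisible(dev.off())

pages <- read_pdf_pages(pdf_path)
cat(pages[1])
```

With your own files, replace `pdf_path` with the path to the PDF. OCR output from scans should always be spot-checked against the original pages.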

13.3 Extracting tables from images with R

  1. R package magick (this package actually includes several very powerful tools for image processing; this is just one of the things you can do with it)
  2. Detailed Blog Post / Tutorial

13.4 Extracting Data from Published Figures

  1. Ankit Rohatgi’s WebPlotDigitizer
  2. Alternative 1: R package magick
  3. Alternative 2: GetData extracts data automatically from scanned images (~$30).
  4. Alternative 3: R package digitize will extract data from scatterplots within the R environment. Note that this is no longer on CRAN due to dependency issues, but you can still install it directly from the GitHub repository and try to work with it.
  5. A more comprehensive overview: Brauckhoff, et al., “Exploring Image Analysis in R: Applications and Advancements”, The R Journal, 2025 [link]

13.5 Text Mining

  1. Text Mining with R by Julia Silge and David Robinson

  2. gutenbergr: Download and Process Public Domain Works from Project Gutenberg. Tutorial can be found here
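The core workflow in Text Mining with R is to reshape text into one token per row, drop common stop words, and then count or otherwise summarize what remains. The sketch below assumes the dplyr and tidytext packages are installed; the two short texts are made up for illustration.

```r
library(dplyr)
library(tidytext)

# Made-up example texts, one row per document.
abstracts <- tibble(
  doc  = c("a", "b"),
  text = c("Seed dispersal shapes forest regeneration.",
           "Forest fragments lose seed dispersers.")
)

word_counts <- abstracts |>
  unnest_tokens(word, text) |>           # one lowercase word per row
  anti_join(stop_words, by = "word") |>  # drop common English stop words
  count(word, sort = TRUE)               # most frequent words first

word_counts
```

The same pipeline works on longer texts, such as books downloaded with gutenbergr, because each step operates on an ordinary data frame.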

Useful reading on text mining

  1. Atanassova I, Bertin M and Mayr P (2019) Editorial: Mining Scientific Papers: NLP-enhanced Bibliometrics. Front. Res. Metr. Anal. 4:2. doi: 10.3389/frma.2019.00002

  2. Westergaard D, Stærfeldt H-H, Tønsberg C, Jensen LJ, Brunak S (2018) A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts. PLoS Comput Biol 14(2): e1005962. https://doi.org/10.1371/journal.pcbi.1005962

  3. Salloum, S.A., Al-Emran, M., Monem, A.A., Shaalan, K. (2018). Using Text Mining Techniques for Extracting Information from Research Articles. In: Shaalan, K., Hassanien, A., Tolba, F. (eds) Intelligent Natural Language Processing: Trends and Applications. Studies in Computational Intelligence, vol 740. Springer, Cham. https://doi.org/10.1007/978-3-319-67056-0_18

  4. Simon, C., Davidsen, K., Hansen, C. et al. BioReader: a text mining tool for performing classification of biomedical literature. BMC Bioinformatics 19, 57 (2019). https://doi.org/10.1186/s12859-019-2607-x

  5. Benchimol, J., Kazinnik, S., & Saadon, Y. (2022). Text mining methodologies with R: An application to central bank texts. Machine Learning with Applications, 8, 100286. link

  6. Yu, C., Zhang, C., & Wang, J. (2020). Extracting Body Text from Academic PDF Documents for Text Mining. Proceedings of the 12th International Conference on Knowledge Discovery and Information Retrieval (KDIR 2020). https://www.scitepress.org/Papers/2020/101314/101314.pdf

  7. Gulo, C. A., & Rúbio, T. R. (2015, January). Text Mining Scientific Articles using the R. In Doctoral Symposium in Informatics Engineering. link

13.6 Collecting & Processing data from the Web of Science and Scopus

  1. R package refsplitr

  2. R package bibliometrix

  3. jstor: An R package for Analysing Scientific Articles, and a link to the JSTORr package repository on GitHub. The package is no longer on CRAN, but you may still find it useful to browse the repo.

13.7 Scraping websites

  1. Library Carpentry Lesson on Webscraping. Note: this lesson is no longer updated and is a bit out of date, but it is still a very useful introduction.
  2. Start Here: Introduction to webscraping
  3. Video: Scraping WebData in R with rvest
  4. Video: Practical Introduction to Web Scraping using R
  5. Very nice written tutorial
  6. ...and another one, this time from the UC Business Analytics R Programming Guide
  7. SelectorGadget is useful for identifying CSS selectors.
  8. Noortje Marres & Esther Weltevrede (2013) Scraping the Social?, Journal of Cultural Economy, 6:3, 313-335, DOI:10.1080/17530350.2013.772070
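Most of the rvest tutorials above follow the same pattern: read a page, select elements with CSS selectors, then pull out their text or tables. The sketch below assumes the rvest package is installed and R 4.1 or later (for the `|>` pipe). It parses a small made-up page with `minimal_html()` so that it runs without a network connection; for a live site you would use `read_html()` with the page's URL instead.

```r
library(rvest)

# Made-up page standing in for a real website.
page <- minimal_html('
  <h1 class="title">Field stations</h1>
  <table id="stations">
    <tr><th>station</th><th>elevation_m</th></tr>
    <tr><td>North</td><td>120</td></tr>
    <tr><td>South</td><td>85</td></tr>
  </table>
')

# Select elements with CSS selectors (the kind SelectorGadget helps you find).
title    <- page |> html_element("h1.title") |> html_text2()
stations <- page |> html_element("#stations") |> html_table()

title
stations
```

Before scraping a real site, check its terms of use and robots.txt, and prefer an official API when one exists.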

13.8 Cell Phone Data

  1. Exploratory analyses Part 1 and Part 2

13.9 Social Media Data

  1. How to extract Biodiversity Data from Facebook: Chowdhury, S., Ahmed, S., Alam, S., Callaghan, C. T., Das, P., Di Marco, M., Di Minin, E., Jarić, I., Labi, M. M., Rokonuzzaman, Md., Roll, U., Sbragaglia, V., Siddika, A., & Bonn, A. (2024). A protocol for harvesting biodiversity data from Facebook. Conservation Biology, 38, e14257. https://doi.org/10.1111/cobi.14257

  2. Di Minin, E., Fink, C., Hausmann, A., Kremer, J. and Kulkarni, R. (2021), How to address data privacy concerns when using social media data in conservation science. Conservation Biology, 35: 437-446. https://doi.org/10.1111/cobi.13708

  3. Correia, R.A., Ladle, R., Jarić, I., Malhado, A.C.M., Mittermeier, J.C., Roll, U., Soriano-Redondo, A., Veríssimo, D., Fink, C., Hausmann, A., Guedes-Santos, J., Vardi, R. and Di Minin, E. (2021), Digital data sources and methods for conservation culturomics. Conservation Biology, 35: 398-411. https://doi.org/10.1111/cobi.13706

  4. Fox, Nathan, Tom August, Francesca Mancini, Katherine E. Parks, Felix Eigenbrod, James M. Bullock, Louis Sutter, and Laura J. Graham. “‘photosearcher’ package in R: An accessible and reproducible method for harvesting large datasets from Flickr.” SoftwareX 12 (2020): 100624. https://www.sciencedirect.com/science/article/pii/S235271102030337X

  5. Ergon Cugler de Moraes Silva: TelegramScrap: A comprehensive tool for scraping Telegram data. https://arxiv.org/abs/2412.16786

  6. Batrinca, B., Treleaven, P.C. Social media analytics: a survey of techniques, tools and platforms. AI & Soc 30, 89–116 (2015). https://doi.org/10.1007/s00146-014-0549-4

13.10 Automated Image Analysis

  1. Pennekamp, F. and Schtickzelle, N. (2013), Implementing image analysis in laboratory‐based experimental systems for ecology and evolution: a hands‐on guide. Methods Ecol Evol, 4: 483-492. https://doi.org/10.1111/2041-210X.12036

  2. How to build your own image recognition app with R! Part 1 and Part 2

  3. LinkedIn Learning Course: Deep Learning - Image Recognition (requires UF login)

  4. UF Practicum AI courses include one on Image recognition models.

13.11 Wearable Devices & RFID tags

  1. What is an RFID tag?

  2. Rafiq, K., Appleby, R. G., Edgar, J. P., Radford, C., Smith, B. P., Jordan, N. R., Dexter, C. E., Jones, D. N., Blacker, A. R. F., & Cochrane, M. (2021). WildWID: An open-source active RFID system for wildlife research. Methods in Ecology and Evolution, 12, 1580–1587. https://doi.org/10.1111/2041-210X.13651

  3. Build your own RFID device

  4. Izmailova, E.S., Wagner, J.A. and Perakslis, E.D. (2018), Wearable Devices in Clinical Trials: Hype and Hypothesis. Clin. Pharmacol. Ther., 104: 42-52. https://doi.org/10.1002/cpt.966

  5. Loncar-Turukalo T, Zdravevski E, Machado da Silva J, Chouvarda I, Trajkovik V. Literature on Wearable Technology for Connected Health: Scoping Review of Research Trends, Advances, and Barriers. J Med Internet Res 2019;21(9):e14017. doi: 10.2196/14017

  6. Why Should Sociologists Care about Wearable Tech?

  7. Harari, G. M., Lane, N. D., Wang, R., Crosier, B. S., Campbell, A. T., & Gosling, S. D. (2016). Using Smartphones to Collect Behavioral Data in Psychological Science: Opportunities, Practical Considerations, and Challenges. Perspectives on Psychological Science, 11(6), 838-854. https://doi.org/10.1177/1745691616650285

  8. Seifert, A., Hofer, M., & Allemand, M. (2018). Mobile Data Collection: Smart, but Not (Yet) Smart Enough. Frontiers in Neuroscience, 12. https://www.frontiersin.org/article/10.3389/fnins.2018.00971

13.12 Building automated data collectors

  1. Calipers that dump data directly to Excel link

  2. Morris BI, Kittredge MJ, Casey B, Meng O, Chagas AM, Lamparter M, et al. (2022) PiSpy: An affordable, accessible, and flexible imaging platform for the automated observation of organismal biology and behavior. PLoS ONE 17(10): e0276652. https://doi.org/10.1371/journal.pone.0276652

  3. Jolles, J. W. (2021). Broad-scale applications of the Raspberry Pi: A review and guide for biologists. Methods in Ecology and Evolution, 12, 1562–1579. https://doi.org/10.1111/2041-210X.13652

13.13 Online data archives

Overview: Correia, R.A., Ladle, R., Jarić, I., Malhado, A.C.M., Mittermeier, J.C., Roll, U., Soriano‐Redondo, A., Veríssimo, D., Fink, C., Hausmann, A., Guedes‐Santos, J., Vardi, R. and Di Minin, E. (2021), Digital data sources and methods for conservation culturomics. Conservation Biology, 35: 398-411. https://doi.org/10.1111/cobi.13706

Government data

  1. Data.gov (the open data portal of the US Government) and Using Data.gov APIs in R
  2. The rOpenGov Project
  3. Open Fiscal Data Package
  4. educationdata: Retrieve data from the Urban Institute’s Education Data API as a data.frame for easy analysis. See also here
  5. A huge list of data sources for social scientists available with R tools
  6. Accessing World Bank data with R

US & World Census Data

  1. A Guide to Working with US Census Data in R
  2. R Package tidycensus
  3. Tutorial 1
  4. Tutorial 2
  5. R package ipumsr: The ipumsr package helps import IPUMS extracts from the IPUMS website into R. IPUMS provides census and survey data from around the world integrated across time and space.

Education Data

  1. edbuildr: import EdBuild’s master dataset of school district finance, student demographics, and community economic indicators for every school district in the United States.

  2. Building R and Stata packages for the Education Data Portal

Other Online Data Portals

  1. Giant compendium of open datasets #1
  2. Data on Amazonia
  3. R package bdc: toolkit for gathering & cleaning biodiversity data

Software for gathering data from online archives

  1. EcoRetriever: automates the tasks of finding, downloading, and cleaning up publicly available ecological data, and then stores them in a local database or csv files.
  2. litsearchr: an R package to facilitate quasi-automatic search strategy development for systematic reviews