Optical Character Recognition (OCR) and Working with Messy Text Data
May 19 @ 2:00 pm - 4:00 pm
Optical Character Recognition (OCR) is a set of computational techniques for converting scanned images of printed or handwritten text into computer-readable formats. OCR makes documents more searchable and enables analyses such as text mining and natural language processing (NLP). This workshop will provide an overview of existing and emerging tools for extracting text from images of printed documents, and will demonstrate practical techniques for OCR with Python using the Tesseract OCR engine. The workshop will also include a discussion and practical examples of evaluating OCR viability, as well as tips for using OCR-extracted data in NLP pipelines. This workshop qualifies as an elective for the Text Mining and NLP micro-credential through UC Davis GradPathways.
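As a rough illustration of the kind of Python and Tesseract workflow described above, the sketch below runs OCR on a single image and applies a few common clean-up steps before the text goes into an NLP pipeline. It is not taken from the course notebook. It assumes the third-party packages pytesseract and Pillow are installed and that the Tesseract binary is on the system PATH; the function names `ocr_image` and `clean_ocr_text` and the file name `scan.png` are hypothetical.

```python
import re


def ocr_image(path, lang="eng"):
    """Run Tesseract on one image file and return the recognized text.

    Assumption: pytesseract and Pillow are installed and the Tesseract
    binary is available on the system PATH.
    """
    # Imported lazily so the cleaning helper below works without them.
    from PIL import Image
    import pytesseract

    with Image.open(path) as image:
        return pytesseract.image_to_string(image, lang=lang)


def clean_ocr_text(text):
    """Apply a few common fixes to raw OCR output before NLP steps."""
    # Rejoin words hyphenated across a line break: "recog-\nnition" -> "recognition".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Treat single line breaks as spaces, but keep blank lines as paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


# Example call (placeholder file name, not a workshop file):
# text = clean_ocr_text(ocr_image("scan.png"))
```

The hyphen rule is deliberately simple: it also joins genuine compounds that happen to break at a line end (for example "well-" followed by "known"), so a real pipeline would likely check candidates against a word list first.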
After this workshop, learners should be able to:
– Define “OCR”
– Describe an example of when OCR has aided “distant” (computational) reading and analysis
– List potential off-the-shelf solutions for simple OCR
– Identify possible technical challenges for performing OCR on a given document
– Describe an OCR workflow
– Use the course notebook to perform OCR on provided documents
– Assess OCR output and propose solutions for increasing accuracy
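One common way to assess OCR accuracy is the character error rate (CER): the edit distance between the OCR output and a hand-corrected reference transcription, divided by the length of the reference. The sketch below is a minimal standard-library version, offered as an illustration and not as the workshop's own method; the function names are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: fewest insertions, deletions, and substitutions."""
    # previous[j] holds the distance between the reference prefix seen so far
    # and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # reference character missing from OCR
                current[j - 1] + 1,      # extra character in OCR output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the number of reference characters."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# A classic OCR confusion: "m" misread as "rn".
cer = character_error_rate("modern", "rnodern")
print(f"CER: {cer:.2f}")
```

Computing CER on a small hand-transcribed sample before and after a change (for example, different image preprocessing) gives a concrete way to tell whether a proposed fix actually improves accuracy.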
Instructor: Arthur Koehl. TA: Tyler Shoemaker.
Arthur Koehl is a research data scientist. He graduated from UC Davis with degrees in history, economics, and computer science. Before joining DataLab, he worked for several years as a scientific computing intern at the Center for BioImaging Sciences at the National University of Singapore, where he learned the basics of Linux system administration. His interests include natural language processing, computer vision, and web programming. At DataLab he develops tools and provides technical expertise on interdisciplinary research projects, with an emphasis on the humanities and social sciences.
Location: Zoom. Click the link below to apply to this workshop.
Cost: Free of charge.