This event has passed.
Workshop – Text Analysis and Natural Language Processing for Data Science
June 10 @ 10:00 am - June 14 @ 4:00 pm
Learn how to get started with textual data analysis and natural language processing with Python in this week-long intensive workshop. Sessions run 10 AM – 4 PM daily from Monday, June 10 through Friday, June 14. A special session on Friday will cover the basics of Large Language Models (LLMs; no separate registration required).
This week-long workshop series covers the basics of text mining and natural language processing (NLP) with Python. We will focus primarily on unstructured text data, discussing how to format and clean text to enable the discovery of significant patterns in collections of documents. Sessions will introduce participants to core terminology in text mining/NLP and will walk through methods that range from tokenization and dependency parsing to text classification, topic modeling, and word embeddings. Basic familiarity with Python is required. We welcome students, postdocs, faculty, and staff from a variety of research domains, ranging from health informatics to the humanities. Note: this series concludes with a special session on large language models, “The Basics of Large Language Models.”
This is an in-person workshop.
By the end of this series, you will be able to:
- Clean and structure textual data for analysis
- Recognize and explain how these cleaning processes impact research findings
- Explain key concepts and terminology in text mining/NLP, including tokenization, dependency parsing, and word embeddings
- Use special data structures such as document-term matrices to efficiently analyze multiple texts
- Use statistical measures (pointwise mutual information, tf-idf) to identify significant patterns in text
- Classify texts on the basis of their features
- Produce statistical models of the topics in a collection of texts
- Produce models of word meanings from a corpus
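To give a rough sense of two items above (document-term matrices and tf-idf), here is a minimal sketch using only the Python standard library. The three toy documents, the whitespace tokenizer, and the `tfidf` helper are assumptions made for illustration; they are not workshop materials, and the sessions may well use different libraries and a different tf-idf weighting.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Naive tokenization: lowercase, then split on whitespace
tokenized = [doc.lower().split() for doc in docs]

# Vocabulary: every distinct token, sorted so column order is stable
vocab = sorted({tok for toks in tokenized for tok in toks})

# Document-term matrix: one row per document, one column per vocabulary term
counts = [Counter(toks) for toks in tokenized]
dtm = [[c[term] for term in vocab] for c in counts]

# Document frequency: number of documents that contain each term
df = {term: sum(1 for c in counts if term in c) for term in vocab}


def tfidf(term, doc_index):
    """One common tf-idf variant: relative term frequency times log(N / df)."""
    tf = counts[doc_index][term] / len(tokenized[doc_index])
    idf = math.log(len(docs) / df[term])
    return tf * idf


# "the" is frequent in the first document but appears in two of three
# documents, so it scores lower there than the rarer word "cat"
print(tfidf("the", 0), tfidf("cat", 0))
```

Each row of `dtm` lines up with `vocab`, so a single column can be read as the counts of one term across all documents.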
Prerequisites
Instructors will distribute a zipped directory of notebooks and files prior to the workshop. Participants are required to load this data into their Google Drive account before our first session. A basic knowledge of Python is required. Specifically, participants should be able to:
- Load text data into Python
- Load Python libraries
- Work with different Python data structures (strings, lists, dictionaries)
- Implement control flow with for loops
- Use Pandas DataFrames (primarily: indexing and subsetting)
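As an informal self-check, participants who can read the following sketch comfortably likely meet most of the prerequisites above. It is an illustrative example only, not part of the workshop materials: the in-memory text stands in for a real file, and the Pandas item is not covered here.

```python
import io
import string  # loading a standard library module

# Stand-in for a text file; with a real file you would use
# open(path, encoding="utf-8") instead of io.StringIO
fake_file = io.StringIO("It was the best of times,\nit was the worst of times.\n")
lines = fake_file.read().splitlines()  # a list of strings, one per line

# Count words with a dictionary and nested for loops
word_counts = {}
for line in lines:
    for word in line.lower().split():
        word = word.strip(string.punctuation)  # drop leading/trailing punctuation
        word_counts[word] = word_counts.get(word, 0) + 1

print(word_counts)
```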
We will be using Google Colab for this workshop.