Workshop Descriptions and Materials
Fundamentals
Overview
A research project is reproducible if a different researcher can carry out the same analysis with the same data and produce the same overall result. To do so, they need transparent, detailed documentation about all of the steps in the research process and access to the tools — especially code — with which the steps were carried out. Reproducibility enables independent verification, a touchstone for all research.
There are myriad practices, often accompanied by software tools, that can help ensure research projects are reproducible. This overview workshop will help you decipher which to adopt and when to adopt them. The workshop also highlights additional benefits many of these practices confer, such as making it easier to collaborate with others. As an overview, this workshop is relatively non-technical, but provides technical references, including other DataLab workshops, for all of the practices covered.
Prerequisites
This workshop is intended for learners at all experience levels, and may benefit learners at different experience levels in different ways. There are no prerequisites, and no prior programming experience is necessary.
Resources
Overview
Technical advancements in data science, combined with the exponential increase in data, have led to research breakthroughs across domains and generated entirely new industries. But our understanding of the evolving socio-technical landscape, and our ability to predict the indirect consequences of our work, lag behind this growth. While laws determine the legal parameters governing data use, data science approaches that are technically legal can still be used unethically and irresponsibly, with disastrous consequences ranging from loss of revenue to human rights violations. Through case studies and interactive sessions, this workshop provides an overview of how to practice responsible data science by incorporating considerations of ethics, equity, and justice. We will discuss FACT-based (fairness, accuracy, confidentiality, and transparency) approaches to increasing the integrity of our work in data science.
Prerequisites
There are no prerequisites for this workshop.
Resources
Overview
Learn and practice how to talk directly to your computer via the command line. The Unix shell is a powerful tool for using scientific software, working with large datasets, and controlling remote servers. It is primarily used to manage files and run programs, and it allows for automation of repetitive tasks.
This workshop is a prerequisite for many of DataLab’s workshops, including:
- Introduction to Version Control with Git
- Reproducible Research for Teams with GitHub
- The Remote Computing series
Prerequisites
No prior experience is necessary.
Resources
Overview
This workshop covers the fundamentals of using version control for reproducible research. Topics covered include installing the Git version control software locally, initiating a local Git repository, managing file versions, basic branching and merging, and, time permitting, intermediate topics such as working with remote repositories and resolving conflicts. At the end of this workshop learners should be able to create new repositories and begin using Git for version control of their individual projects.
Prerequisites
The workshop is suitable for participants with little to no previous Git experience. Familiarity with basic command line syntax is required. If you have not taken DataLab’s “Introduction to Command Line” workshop, please work through those materials in advance of this session.
Resources
Overview
GitHub is an online platform for software development using Git for version control. During this hands-on workshop we’ll practice setting up, sharing, and collaboratively working on a repository for a research project. We’ll explore different features for improving your workflows, whether you’re working by yourself or with others on a data-driven project using GitHub.
Prerequisites
Familiarity with the command line and the Git version control software is required.
Resources
Overview
This 4-part workshop series provides an introduction to using the Python programming language for reproducible data analysis and scientific computing. The focus of this workshop is programming basics and working with tabular data. Along the way, we’ll learn a little bit about how to break down programming problems and organize code for clarity and reproducibility.
After this workshop, learners will be able to load tabular data sets into Python, compute simple summaries and visualizations, do common data-tidying tasks, write reusable functions, and identify where to go to learn more.
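For a taste of where the series ends up, here is a minimal sketch using the pandas library; the table and column names are made up purely for illustration and are not workshop materials.

```python
import pandas as pd

# In practice you would load a file, e.g. surveys = pd.read_csv("surveys.csv");
# a tiny made-up table keeps this sketch self-contained.
surveys = pd.DataFrame({
    "species": ["deer", "deer", "hare", "hare"],
    "weight": [61.0, 58.5, 3.1, None],
})

# A simple summary: mean weight for each species.
print(surveys.groupby("species")["weight"].mean())

# A small reusable function for a common tidying task: keep only rows
# with complete data in the columns we care about.
def complete_rows(df, columns):
    return df.dropna(subset=columns)

print(complete_rows(surveys, ["species", "weight"]))
```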
Prerequisites
No prior programming experience is necessary.
Resources
Overview
This 4-part workshop series provides an introduction to using the R programming language for reproducible data analysis and scientific computing. Topics include programming basics, how to work with tabular data, how to break down programming problems, and how to organize code for clarity and reproducibility.
After this workshop, learners will be able to load tabular data sets into R, compute simple summaries and visualizations, do common data-tidying tasks, write reusable functions, and identify where to go to learn more.
Prerequisites
No prior programming experience is necessary. All learners will need access to an internet-connected computer and the latest versions of Zoom, R, and RStudio.
Resources
Overview
Structured Query Language (SQL) is a programming language for interacting with relational databases. This workshop covers basic SQL keywords to view, filter, aggregate, and combine tables in a database. SQL is supported by many different database systems. The examples in this workshop use a SQLite database, but most of the keywords are applicable to other database systems as well. The workshop also covers how to use SQLiteStudio, an integrated development environment for SQL code.
We’ll focus on querying data to get to know a database and answer questions, and joining data from separate tables.
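The workshop itself uses SQLiteStudio, but the same keywords can be sketched from Python’s built-in sqlite3 module. The tables and values below are invented solely to illustrate a filtered, aggregated join.

```python
import sqlite3

# Build a tiny in-memory database with made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE species (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sightings (species_id INTEGER, site TEXT, count INTEGER);
    INSERT INTO species VALUES (1, 'oak titmouse'), (2, 'acorn woodpecker');
    INSERT INTO sightings VALUES (1, 'arboretum', 3), (2, 'arboretum', 5), (1, 'quad', 2);
""")

# View, filter, aggregate, and combine: total sightings per species at one site.
query = """
    SELECT species.name, SUM(sightings.count) AS total
    FROM sightings
    JOIN species ON species.id = sightings.species_id
    WHERE sightings.site = 'arboretum'
    GROUP BY species.name;
"""
for row in conn.execute(query):
    print(row)
```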
Prerequisites
No prior experience is necessary.
Resources
Overview
This workshop provides a broad overview of the various technologies for storing and organizing different collections of data. We will discuss how data structure and data types impact your storage options, when you should use a database, and which platforms you might consider for your research. This workshop is a general lecture with case studies and Q&A (no laptops necessary).
Prerequisites
No prior programming experience is necessary. It is designed for researchers with active or planned data projects.
Resources
Intermediate & Advanced Topics
Overview
Data visualization is a powerful tool for exploring and communicating our research findings. A good plot helps us uncover and share the patterns in our data, but creating good plots takes skill and practice. This workshop introduces concepts of design principles, visual perception, and storytelling with data. We’ll discuss the graphical elements of a plot, when to use different statistical plot types, and how to make your plots more understandable and accessible. We’ll explore some historic and more recent data visualizations as we deconstruct the principles of plotting. Learners are encouraged to bring a draft plot that they are working on.
Prerequisites
No prior experience with data visualization is necessary. This workshop is intended for researchers who are actively working with data. It covers material in a tool-agnostic way, with the intent that researchers may use it to improve their visualizations regardless of the software they use.
Resources
Overview
Take your data visualization skills to the next level in Python! In this intermediate Python workshop we’ll cover the fundamentals of Matplotlib, which is the foundation for most Python visualization packages. Then we’ll discuss how to choose appropriate visualizations for your data and how to critically assess your visualizations for potential problems. We’ll then put it all together to practice using Matplotlib to fix and improve visualizations made with Python packages including Plotnine, Seaborn, and Pandas.
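As a preview of the Matplotlib fundamentals covered, here is a minimal sketch of the object-oriented interface; the data are synthetic and only for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Small synthetic dataset so the example runs on its own.
x = np.linspace(0, 10, 50)
y = np.sin(x)

# A Figure holds one or more Axes; each Axes owns the plotted elements.
fig, ax = plt.subplots(figsize=(5, 3))
ax.plot(x, y, label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A minimal Matplotlib figure")
ax.legend()
fig.tight_layout()
plt.show()
```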
Prerequisites
Participants must have taken DataLab’s “Python Basics” workshop series and/or have prior experience using Python, be comfortable with basic Python syntax, and have it pre-installed and running on their laptops. Learners should also take the “Principles of Data Visualization from Perception to Statistical Graphics” workshop in advance of this session.
Resources
Overview
Take your data visualization skills to the next level in R! In this intermediate R workshop we’ll dive into the grammar of graphics to uncover how R’s popular ggplot2 library works and learn how to create high-quality, reproducible plots that make sense. We’ll cover why aesthetics, which map data columns onto abstract dimensions, are kept separate from geometries, which turn those dimensions into concrete shapes and lines, and we’ll walk through the steps for using ggplot2 to create beautiful data visualizations.
Prerequisites
Participants must have taken DataLab’s “R Basics” workshop series and/or have prior experience using R, be comfortable with basic R syntax, and have it pre-installed and running on their laptops. Learners should also take “Principles of Data Visualization from Perception to Statistical Graphics” workshop in advance of this session.
Resources
Overview
This intermediate workshop is a primer on creating dynamic visualizations. We’ll discuss what it means for a visualization to be “dynamic” and the advantages and disadvantages of dynamic visualizations compared to static visualizations. We’ll also explore the ecosystem of packages for creating dynamic visualizations with R and Python, as well as the JavaScript libraries that underpin these. To make the ideas concrete and get you started on building your own dynamic visualizations, we’ll implement a simple dynamic visualization in R and Python.
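The workshop implements its own examples in both languages. Purely as an illustration of the idea on the Python side, a sketch with the plotly package (one member of this ecosystem, built on a JavaScript library) might look like the following, using plotly’s bundled gapminder sample data.

```python
import plotly.express as px

# Hovering shows country details; the slider animates the plot over years.
df = px.data.gapminder()
fig = px.scatter(
    df, x="gdpPercap", y="lifeExp",
    size="pop", color="continent", hover_name="country",
    animation_frame="year", log_x=True,
)
fig.show()
```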
Prerequisites
Participants must have taken DataLab’s “R Basics” or “Python Basics” workshop series and/or have prior experience using Python or R. Learners should also take the “Principles of Data Visualization” workshop in advance of this session.
Resources
Overview
Network science approaches are increasingly being used to explore complex interactions and the general connectivity among entities, from friends in a social network to the spread of a disease in a population. Due to its complexity, network data is often explored and communicated using data visualizations.
In this intermediate R workshop we will cover how to tell useful stories with network data, primarily using the statnet suite of packages and the ggraph plotting package, which is compatible with much of the ggplot2 framework. In this interactive and hands-on workshop we’ll practice using these packages in R to plot one-mode and two-mode networks. As we introduce functions unique to these packages we will discuss what visualization features best suit different types of network data and research communication goals. Along the way we will cover basic data preparation steps and how to calculate (or assign) key network descriptives, including centrality measures, edge attributes, and community clusters for your plots.
Prerequisites
Participants must have taken DataLab’s “R Basics” workshop series and/or have prior experience using R, be comfortable with basic R syntax, and have it pre-installed and running on their laptops. We also strongly encourage participants to have either taken DataLab’s Intermediate R Data Visualization workshop or to have general familiarity with how to make plots in R using the ggplot2 package.
Resources
Overview
Virtual reality (VR) is now widely available as a medium for scientific data analysis, specifically analysis of large and/or complex 3D data such as the results of high-resolution 3D scanning, tomography, or advanced numerical simulation. This workshop will teach the practical aspects of setting up and operating a VR system based on commodity technology, and of using DataLab-developed open-source VR software applications to analyze several types of scientific data.
Prerequisites
Participants should have a basic understanding of how to use the UNIX command line (such as having completed DataLab’s “Introduction to the Command Line” workshop), and how to view/edit scientific data using desktop tools such as text editors or spreadsheet software.
Overview
This workshop unpacks the subjective process of data visualization and its relationship to concepts of diversity, equity, and inclusion. We’ll critically explore how data can be used to uphold and perpetuate, or to quantify and demonstrate, structural oppression. Through this workshop learners will practice the technique of “data visceralization,” the process of experiencing differences in data and understanding them viscerally.
Prerequisites
This workshop is aimed at researchers with prior experience collecting and working with data who are ready and willing to engage in understanding how data can be diverse, equitable, and inclusive.
Resources
Overview
A digital (online) portfolio complements your CV (or resume) and helps demonstrate your skills to potential collaborators, employers, and/or funders. Digital portfolios expand upon the skills in your other application materials for jobs or grants, allowing you to showcase your abilities. In this workshop we will discuss why you should create and maintain a digital portfolio of your work, various methods for creating a digital portfolio, and considerations for carefully curating and presenting your work in an engaging manner.
Prerequisites
No prior experience is necessary. Participants should have access to a web browser (for example, Chrome or Firefox).
Resources
Coming soon!
Overview
This workshop is an introduction to the Julia programming language for people familiar with R, Python, or MATLAB. Workshop topics include a concise overview of Julia’s syntax and features; an end-to-end introduction to using built-in functions and contributed packages to read, summarize, and visualize tabular data; and real-world examples where we’ve found Julia beneficial.
Prerequisites
Participants must be proficient at programming in a high-level language such as R, Python, or MATLAB. Before the workshop, participants must install the latest version of Julia on their computer (https://julialang.org/). This workshop is NOT designed for entry-level programmers.
Resources
Overview
These four standalone workshops aim to help Python users understand language features, packages, and programming strategies that will enable them to write more efficient code, be more productive when writing code, and debug code more effectively. This is not an introduction to Python and is appropriate for motivated intermediate to advanced users who want a better understanding of working with Python for their research.
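As one small illustration of the kind of strategy covered (not workshop material), the sketch below uses timeit to compare a pure-Python loop with a vectorized NumPy equivalent.

```python
import timeit
import numpy as np

values = list(range(100_000))
array = np.arange(100_000)

# Pure-Python loop: accumulate the sum of squares one element at a time.
def loop_sum():
    total = 0
    for v in values:
        total += v * v
    return total

# Vectorized NumPy equivalent: the work happens in compiled code.
def vectorized_sum():
    return int(np.sum(array * array))

print("loop:      ", timeit.timeit(loop_sum, number=50))
print("vectorized:", timeit.timeit(vectorized_sum, number=50))
```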
Prerequisites
Participants are expected to have taken DataLab’s Python Basics workshop series and/or have prior experience using Python, be comfortable with basic Python syntax, and have it pre-installed and running on their laptops.
Resources
Overview
This is the reader for all of UC Davis DataLab’s Intermediate R workshop series. The curriculum currently has four parts:
- Cleaning & Reshaping Data
- Writing & Debugging R Code
- Data Visualization & Analysis in R
- Thinking in R, which is about understanding how R works, how to diagnose and fix bugs in code, and how to estimate and measure performance characteristics of code.
Each part is independent and consists of approximately 2 workshop sessions.
After completing these workshops, students will have a better understanding of language features, packages, and programming strategies, which will enable them to write more efficient code, be more productive when writing code, and debug code more effectively.
Prerequisites
Participants are expected to have prior experience using R, be comfortable with basic R syntax, and to have it pre-installed and running on their laptops. This series is appropriate for motivated intermediate to advanced users who want a better understanding of base R.
Resources
Overview
Are you working on a research project but finding that your data is too big for your laptop, or that your code must run for a long time? You need more compute power! During this session we’ll discuss the differences and advantages of various remote and networked computing options, from servers in your lab to institutional high performance computing (HPC) and cloud services. We’ll give an overview of HPC terminology, architecture, and general workflows. We’ll also provide information about UC-specific computing resources and contacts. This workshop is a prerequisite for DataLab’s Introduction to Remote Computing series, where you’ll learn how to access and work efficiently on the UC Davis HPC.
Prerequisites
No previous experience required.
Resources
Overview
This workshop series provides an introduction to accessing and computing on remote servers such as UC Davis’ “Farm” cluster. The series covers everything you need to know to get started: how to set up and use SSH to log in and transfer files, how to install software with conda, how to reserve computing time and run programs with SLURM, and shell commands that are especially useful for working with servers.
Prerequisites
Participants should have taken DataLab’s “Overview of Remote and High Performance Computing (HPC)” workshop and the “Introduction to the Command Line” workshop series, or have equivalent prior experience. Participants must be comfortable with basic Linux shell syntax.
Resources
Overview
In this workshop, attendees will be introduced to building an interactive web map to display spatial data using the Leaflet JavaScript library.
Prerequisites
Participants should have a basic understanding of spatial data formats such as vector and raster data. Participants who have experience with coding in HTML and JavaScript will have an easier time learning, but these skills are not required.
Resources
Overview
In this workshop, participants will learn how to make map figures for academic publications such as journal articles and books. Participants will learn how to find and understand art specifications from publishers and how to make maps that fit these specifications. They will also learn strategies for communicating effectively with small maps and limited color palettes.
Prerequisites
No previous background in figure creation or design is required. Introductory experience with GIS software and concepts will be helpful but is not required.
Resources
Overview
In this workshop, participants will learn about projected coordinate reference systems (CRS, commonly called “projections”) and how to apply them in R to spatial data. We will discuss the components of a CRS, how to apply them, how to translate your data into a different CRS, and how to choose a CRS.
Prerequisites
Participants should have a basic understanding of R (for example, be able to create variables and load common data formats like a CSV) and a basic understanding of GIS data formats (e.g., raster and vector data). Participants should also install R and RStudio.
Resources
Overview
In this workshop, we’ll discuss the concepts needed to geocode data, understand options for working with personally identifiable data and non-identifiable data, and gain some hands-on experience with geocoding address data using QGIS, a GIS with a graphical user interface.
Prerequisites
An introductory-level understanding of GIS and experience with graphical GIS software is recommended.
Resources
Overview
This introductory-level workshop will focus upon the fundamental concepts and skills needed to explore and analyze data using Geographic Information Systems (GIS) software with examples using the QGIS platform.
Prerequisites
No prior experience with QGIS or other GIS software is needed, though attendees should be comfortable learning new computer applications, working with the basics of spreadsheets, and managing, organizing, and moving computer files on their operating system.
Resources
Overview
This workshop is intended to give participants an introduction to working with spatial data using SQL. We will work with a graphical user interface (GUI) and explore some examples of common analysis processes as well as present participants with resources for continued learning. This workshop will give participants a solid foundation on which to build further learning.
Prerequisites
An introductory understanding of SQL is recommended, but not mandatory. For example, knowing how to compose a SELECT statement using SQL and the general concept of joining tables will serve learners well. For learners without a foundation in SQL, we recommend attending or reviewing the materials for DataLab’s Introduction to Databases and Data Storage Technologies, which introduces the concept of databases, and Intro to SQL for Querying Databases, which teaches the basics of querying data using SQL.
Resources
Overview
This workshop explores basic, practical applied statistics using the R statistical programming language. On the first day we’ll focus on common procedures like assessing the distribution of your data and calculating differences between groups. On the second day we’ll focus on common linear models (linear regression, ANOVA, etc.) in R. We will also calculate the power of a simple study.
Prerequisites
Learners should be proficient at using R and have the latest versions of R, RStudio, and Zoom installed on their computers.
Resources
Overview
We often see that values observed in closer spatial proximity are more alike than those from distant locations, and thus the data may not be independent. This can cause problems, and opportunities, for our analyses. In this workshop, we will discuss how spatial data can break the assumptions of common statistical methods, and work towards identifying and implementing appropriate methods in R. Specifically, this workshop will focus on the uncertainty of spatial interpolation and regression.
Prerequisites
Participants should have a basic understanding of R (for example, understand how to create variables and load common data formats like a CSV) and a basic understanding of GIS data formats (e.g., raster and vector data).
Resources
Overview
In this workshop, we will:
- Discuss the basics of creating, comparing, and validating predictive models using a case study from the health sciences
- Demonstrate categorical prediction with logistic regression, and numerical predictions with a regression tree approach
- Calculate measurements of accuracy that are applicable to the different types of models, and use cross-validation to find the model parameters that generate the best predictions
- Interpret the results for insights about the real-world process being modeled
While this workshop features working with health data, the conceptual framework and principles discussed should be generalizable to research in other domains.
Prerequisites
This workshop is open to learners at all levels, but prior experience with R is required in order to fully participate in this interactive, hands-on workshop.
Resources
Overview
Regression modeling, which uses input variables to predict or model the value of a response, is widely used in nearly every field of research. Yet many graduate programs don’t include formal training in statistical modeling, and DataLab’s office hours indicate widespread anxiety about using regression models in practice. This workshop is intended to help address that anxiety by teaching the fundamentals of regression modeling. The emphasis is on practice and intuition, with only a small amount of math.
Prerequisites
Participants must have taken DataLab’s “R Basics” workshop series and/or have prior experience using R, be comfortable with basic R syntax, and have the latest version of R and RStudio pre-installed and running on their laptops. We also strongly encourage learners to have either taken DataLab’s Intermediate R Data Visualization workshop, or have general familiarity with how to make plots in R using the ggplot2 package.
Resources
Overview
This workshop provides an overview of contemporary machine learning methods. We’ll cover important terminology and popular methods so that you can determine whether machine learning is relevant to your research and what to learn more about if it is.
This is a concept-focused, non-technical workshop. No laptops needed.
Prerequisites
No prior experience is necessary. This workshop is designed for researchers from all domains and backgrounds and does not involve any coding.
Resources
Overview
This two-part workshop series provides an introduction to using R for two popular machine learning techniques: clustering and classification. Clustering involves identifying groups of similar observations (called clusters) within data. Clustering can be an effective tool for finding patterns and an important part of exploratory data analysis. Classification refers to modeling categorical variables. Classification models can provide insight into the relationship between the predictors and response, as well as a way to make predictions about new observations.
Prerequisites
This workshop is designed for researchers who have data that they are already working with in R. Participants must have taken DataLab’s “Overview of Statistical Machine Learning,” “R Basics,” and “Regression in R” workshop series, or have equivalent prior experience. Completion of DataLab’s “Intermediate R” series is recommended but not required.
Resources
Overview
This week-long workshop series covers the basics of text mining and natural language processing (NLP) with Python. We will focus primarily on unstructured text data, discussing how to format and clean text to enable the discovery of significant patterns in collections of documents. Sessions will introduce participants to core terminology in text mining/NLP and will walk through methods that range from tokenization and dependency parsing to text classification, topic modeling, and word embeddings. Basic familiarity with Python is required. We welcome students, postdocs, faculty, and staff from a variety of research domains, ranging from health informatics to the humanities. Note: this series concludes with a special session on large language models, “The Basics of Large Language Models.”
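For a sense of the starting point, here is a hedged sketch that tokenizes a toy corpus and builds a document-term matrix with scikit-learn; the workshop’s own tools and datasets may differ.

```python
from sklearn.feature_extraction.text import CountVectorizer

# A tiny invented corpus; the workshop works with real document collections.
docs = [
    "The valley oaks are budding early this spring.",
    "Early spring rains flooded the valley trails.",
    "Text mining finds patterns across many documents.",
]

# Tokenize and build a document-term matrix (rows = documents, columns = terms).
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(dtm.toarray())
```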
Prerequisites
A basic knowledge of Python is required. Specifically, participants should be able to:
- Load text data into Python;
- Load Python libraries;
- Work with different Python data structures (strings, lists, dictionaries);
- Implement control flow with for loops;
- Use Pandas dataframes (primarily: indexing and subsetting).
Resources
Overview
This workshop focuses on the basics of working with Large Language Models (LLMs) as part of the research pipeline, including understanding and interrogating the models themselves as well as interacting with their generative capabilities. Specific topics will include: setting up your own open-source LLM, fine-tuning LLMs, interacting with LLMs programmatically via an API, and the basics of prompt engineering.
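As one possible illustration of running a small open-source model locally (the workshop’s choice of libraries and models may differ), a minimal sketch with the Hugging Face transformers pipeline might look like this.

```python
from transformers import pipeline

# Load a small open model; gpt2 is used here purely for illustration.
generator = pipeline("text-generation", model="gpt2")

prompt = "In one sentence, a language model is"
result = generator(prompt, max_new_tokens=40, do_sample=False)
print(result[0]["generated_text"])
```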
Prerequisites
This workshop is designed for researchers from all domains with prior programming experience. Basic Python and command-line (shell) skills are required. Specifically, participants should be able to:
- Load text data into Python
- Perform basic text cleaning actions
- Generate data structures like document-term matrices
- Conduct preliminary counting processes on corpora
Resources
Archive
These workshops were offered in the past but are not currently being maintained as part of our regular workshop offerings. Resources such as curriculum readers and recordings of the workshop sessions remain available for you to learn from on your own.
Overview
This workshop introduces learners to best practices for data entry and organization for data-driven projects. With interactive, hands-on examples we’ll practice using data validation tools including filters, controlled vocabularies, and flags in Microsoft Excel. These skills are also applicable to working in Google Sheets, LibreOffice, and other spreadsheet software. At the end of the workshop, learners should be able to identify the best practices for designing a spreadsheet and entering data, compare different data formats, and be able to use spreadsheet software for basic data validation.
Prerequisites
This workshop is designed for learners who have basic familiarity with spreadsheet programs. No coding experience is required. All participants will need a computer with Microsoft Excel 2016 or newer, and the latest version of Zoom.
Resources
Overview
Optical Character Recognition (OCR) involves computational techniques for converting scanned images of printed or handwritten text into computer-readable formats. OCR helps make documents more searchable and can allow for analyses including text mining and natural language processing.
This workshop will provide an overview of existing and emerging tools for unlocking the text in printed images, and will demonstrate practical techniques for OCR with Python using the Tesseract OCR engine. Additionally, this workshop will include a discussion and practical examples of evaluating OCR viability, as well as tips for using OCR-extracted data in NLP pipelines.
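For illustration, one common Python route to Tesseract is the pytesseract package; the sketch below assumes Tesseract is already installed and uses a placeholder image file name.

```python
from PIL import Image
import pytesseract

# Open a scanned page image (placeholder file name).
page = Image.open("scanned_page.png")

# Run the Tesseract engine on the image and return plain text.
text = pytesseract.image_to_string(page, lang="eng")
print(text)
```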
Prerequisites
For this course you will need Python, Tesseract, and several Python packages. While it is possible to install each of these individually, we recommend using Miniconda.
Resources
Overview
This workshop covers best practices for organizing and documenting your digital projects for robust, reproducible research. Topics covered include data documentation, forms of metadata at both the data and project level, and best practices for organizing file directories and recording metadata. After this workshop learners will be able to create a README for their data-driven project directory, design a data dictionary (or codebook) for a dataset, and describe how to use a workflow diagram to represent their data gathering, cleaning, and analysis methodology. This workshop includes a studio portion where learners are guided through putting principles into practice to begin documenting their own research projects.
Prerequisites
This workshop is designed for learners with prior experience collecting or otherwise working with research data. No coding experience is required.