DataLab Projects

Current Projects

  • BML

    Activating Ocean Acidification Research

    DataLab is partnering with Professor Tessa Hill at the Bodega Marine Laboratory to support efforts to develop an oceanographic synthesis of publicly available datasets to understand how changing ocean conditions will impact coastal habitats along the western USA. This effort aims to provide detailed interpretations of how ocean acidification and related stressors can be managed by decision makers along the U.S. West Coast.

    Read More>>

  • Monarch butterfly (Jim Hudgins/USFWS)

    Bibliographic approach to the role of science in policy making

    Although science-informed policymaking is frequently touted as a solution to policy design and implementation dilemmas (e.g., Howlett 2009; Cairney 2016; Parkhurst 2017), there are few empirical studies of how scientific information informs policymaking (Desmarais and Hird 2014; Newman et al. 2017). DataLab is working with researchers in Environmental Science and Policy to help quantify and characterize the use of science in federal agencies’ environmental assessments.

    Read More>>

  • CDC Covid 19 render


    The CovidDocs project aims to collect and catalog all official state communications related to COVID-19, such as executive orders, emergency declarations, public health orders, and guidance documents. These documents are tagged with relevant metadata, such as the restrictions they call for. CovidDocs provides data for analyses by DataLab’s data scientists and collaborating UC Davis faculty. The goal of the study is to create a well-documented dataset that can inform research into the pandemic and the public health response.

    Read More>>

  • AVA_image

    Digitizing American Viticultural Areas (AVAs)

    The UC Davis Library, in conjunction with UC Santa Barbara, Virginia Tech, and contributions from the general public, is creating a publicly accessible geospatial dataset of American Viticultural Area boundaries. The AVA Project empowers researchers to study emerging environmental questions, evaluate wine production and marketing data, compare wine aesthetics by geography, and otherwise enrich science related to different wine-growing environments.

    Read More>>

  • Digital Humanities

    English Broadside Ballad Archive

    The English Broadside Ballad Archive (EBBA) was created to catalog and showcase all surviving ballads from 17th-century England, currently around 10,000 unique ballads. EBBA was started in 2003 at the University of California, Santa Barbara, its institutional home, by Dr. Patricia Fumerton, who continues to serve as the director of the Archive. DataLab’s Executive Director, Carl Stahmer, has served as the archive’s Associate Director since 2008 and is responsible for overseeing the archive’s technical development. As EBBA’s collection of ballads has grown, the DataLab has worked to expand the capabilities of the archive by providing functionality that allows users to apply computational methods to perform advanced analysis of the materials archived in the collection.

    Read More>>

  • vr_colab_geodata

    Geodynamics Collaborative VR

    Analysis of large 3D datasets is difficult on traditional desktop software, as the limited perspective often clutters the view or hides important detail. Previous experience has demonstrated that these visualization challenges can be overcome with interactive and immersive virtual reality (VR) tools. Building on the research done at the UC Davis KeckCAVES, the DataLab will extend the capability of our current 3D data visualization software, 3D Visualizer, to read and visualize emerging hierarchical mesh data structures such as those used by the community-developed ASPECT simulation code.

    Read More>>

  • Aerial view of the Dutch Slough Tidal Marsh Restoration Project construction site in the Sacramento-San Joaquin Delta near Oakley, California (Ken James/California Department of Water Resources)

    Informatics for CA Water Data

    This collaboration between DataLab and researchers in the UC Davis Department of Environmental Science and Policy is establishing data management workflows to develop and implement a database architecture for water data. The architecture is meant to assemble water data at different levels of aggregation, extend to new datasets, visualize and map data in different ways for policy stakeholders, and eventually become available to other researchers and government agencies. This Start-Up project focuses on sustainable groundwater management datasets, specifically those related to California's 2014 Sustainable Groundwater Management Act (SGMA).

    Read More>>

  • EarthModel


    The KeckCAVES is a unique visualization collaboration that is developing software to interact with three-dimensional data in real time: moving, rotating, coloring, and manipulating datasets with ease using a wide range of visualization and interaction hardware. Our software is built to run on anything from standard computers to fully immersive virtual reality systems such as CAVEs or VR headsets. At the DataLab we continue to advance data visualization research and develop new tools to address new types of data and data science problems for an expanding group of users.

    Read More>>

  • Dna

    Molecular Eugenics

    Dr. Emily Klancher Merchant’s Molecular Eugenics project seeks to identify the intellectual trajectory of eugenics across the twentieth and twenty-first centuries. This project investigates how the contents of eugenics journals (including journals in such related fields as behavior genetics and sociobiology) changed over time, particularly as those journals dropped the word “eugenics” from their titles. Dr. Merchant suspects that the journals may have adopted a more technical vocabulary, particularly as behavior geneticists began to use molecular methods after the completion of the Human Genome Project, but continued to reflect hereditarian assumptions about the origins of socioeconomic inequality.

    Read More>>

  • Quintessence text topics


    Quintessence seeks to add state-of-the-art data analysis and dynamic corpus exploration to the study of Early Modern period English texts. This project currently uses a corpus of approximately sixty thousand texts from the Early English Books Online (EEBO) Text Creation Partnership. Each text is standardized using Northwestern University’s MorphAdorner, which accounts for spelling changes over time. Any scholar interested in the archive can use Quintessence to run analyses ranging from individual word meanings to broad textual themes. The ability to add more collections of texts is under active development.

    Read More>>

  • Getty-trust-logo

    Shared Cataloging of Early Printed Images

    Through the generous support of The Getty Foundation, DataLab is working to develop an infrastructure that leverages Content-Based Image Recognition (CBIR) to facilitate shared cataloging of printed images from the early modern period. Our vision is to develop an environment in which a cataloger or archivist who is describing an image can use CBIR to search across collections and institutions for copies of the same or similar images, retrieve the cataloging records for matched images, and easily ingest retrieved cataloging data into the local datastore. In short, we intend to provide an infrastructure that allows image catalogers to quickly and easily ask, “Has anyone else described an image like this?” and, if so, “How was it described?” Such a system would improve the quality and interoperability of descriptive metadata and speed up image cataloging efforts, thereby improving access to collections worldwide.

    Read More>>

  • Photo by Element5 Digital on Unsplash

    Strategy & Democracy Project

    After a half-century of deregulatory and market-centered politics, markets and democracy now appear to be on separate, divergent tracks. The Strategy & Democracy Project, headed by Dr. Stephanie Mudge, seeks to historicize and account for this state of affairs. Why, after almost a century of democratic political development—giving rise, by the year 2000, to what many characterized as an age of triumphant democratic capitalism—are democratic institutions failing while markets thrive? How might we have foreseen the coming of the current democratic crisis?

    Read More>>

  • Photo by Kym Ellis on Unsplash

    Wine Chemistry

    The site characteristics often cited as likely contributors to the flavor of a wine include factors such as soil type, soil moisture, air temperature, solar exposure, and elevation. Both the site characteristics and the grape juice or wine characteristics can be measured and quantified, which means they lend themselves to exploration with quantitative and statistical methods of investigation. The DataLab will be working with Professor Ron Runnebaum to build a data infrastructure capable of comparing how these quantifiable growing conditions impact the characteristics of the resulting grape juice.

    Read More>>

Past Projects

  • Arch-V


    Archive-Vision (archv or arch-v) is a collection of computer vision programs written in C++ that use functions from the OpenCV library to analyze large image sets. Its primary function is to locate recurring patterns across the images in a set: arch-v locates features from a given seed image within an image set and outputs the image(s) with the most similarities. The first program, processImages.cpp, generates text files containing the keypoints and their mathematical descriptors; with the keypoints, analysis can be done to compare images and find matches. The second program, scanDatabase.cpp, finds the images that are most similar to a given seed image. The third program, drawMatches.cpp, compares two images, locates their matches based on homography, then draws the keypoints and their relative matches; this is most useful when the best matches have already been found.

    Read More>>

  • Datalab-image

    Assessing data on services utilization of children with Autism

    Lack of access to combined mental health, educational, and developmental disabilities services data limits our ability to understand how essential services provided by these systems can affect outcomes for children. While limited research to date suggests that services in one sector may affect utilization in another, identifying cross-system patterns of care that lead to better outcomes for children with Autism Spectrum Disorder (ASD) is even more complex due to differences in classification processes and eligibility definitions. In particular, thus far neither researchers nor community agencies have leveraged educational outcomes, which are key for all children, including those with ASD, to help understand how we can improve coordinated care for children with complex mental health concerns.

    Read More>>

  • aspect_author_plot_no_labels

    Assessing Impact of Outreach through Software Citation in Geodynamics

    The Computational Infrastructure for Geodynamics (CIG) is a community of software users and user-developers who model physical processes in the Earth and planetary interiors. From 2010 to 2018, the community of researchers published upward of 638 peer-reviewed papers in more than 124 venues. We analyzed this corpus of publications to understand the impact of CIG workshops and tutorials, measured through software citation. We automated article analysis using text extraction and tokenization techniques. Patterns in co-mentioned software suggest that usage of some tools cuts across many domains.

    Read More>>

  • Shields


    BIBFLOW is a two-year project funded by the Institute of Museum and Library Services. The purpose of this project is to investigate the future of library services, including cataloging and related workflows, new data models, and new encoding and exchange formats. At the end of the two-year timetable, there will be a roadmap for the academic and library communities that serves as a guide to the changes occurring in academia.

    Read More>>

  • library_of_congress

    Chronicling the rise of “creativity”

    The Creativeness Digital Scholarship Group (CDSG) is a team of researchers uncovering and exploring the forgotten sources, meanings, and social worlds of creativeness prior to the meteoric rise of a scientific “creativity” in the 1970s. The CDSG focuses on applying a range of Natural Language Processing and Machine Learning techniques to perform an archaeology of discourses of creativeness and related concepts, unearthing new finds, making new connections, and interpreting their cultural and political relevance for the time periods in which they were embedded. Most of our sources are from the post-Civil War period to the end of the Space Race, roughly the century from 1870 to 1970. This was a period in which the noun “creativity” rarely appeared and took its current form only toward that century’s end, especially during the 1950s.

    Read More>>

  • Capture_Cities

    City General Plans Topic Modeling & Mapping

    This project was a collaboration between Catherine Brinkley, professor in the UC Davis Department of Human Ecology, and DataLab. Catherine sought to understand the general plan documents for the cities in the state of California through topic modeling. Catherine and her team assembled the many general plan documents, and DataLab staff performed a topic modeling analysis on the text of the documents and joined the resulting table to spatial vector data containing city boundaries so that the dataset could be easily mapped in a GIS.

    Read More>>

  • kellog_interactive_scrot

    Creating Co-Author Networks in R

    A co-author network is a great way to get a snapshot view of the breadth and depth of an individual’s body of research. I created such graphs and corresponding visualizations to highlight and celebrate the work of UC Davis scholars.

    In this post I will describe the packages I used to do this, common roadblocks, and ways around them. I will highlight the use of interactive and dynamic co-author networks, which are especially useful for visualizing large co-author networks. I will assume some familiarity with R and experience working with data structures like lists and vectors, but no prior familiarity with packages for working with networks.

    Read More>>

  • Grass Valley Fire Districts Map from 1908

    Digitized Maps Demonstration

    Special Collections is undertaking a project to identify and digitize unique maps in our collection. Library volunteer Scott Sibbett is working with Map Assistant Dawn Collings to identify which of the library’s holdings are unique in the University of California system. The pilot focuses on out-of-copyright maps. After the list of maps is complete, high-quality scanning will begin, starting with the smaller maps that can be scanned on our existing scanning equipment, followed by larger maps that will be scanned off site.

    Read More>>

  • bioportal_2

    Disease BioPortal

    The Disease BioPortal dashboard provides data to researchers, veterinarians, and farmers interested in tracking and analyzing disease outbreaks in livestock. Currently, researchers at BioPortal are interested in expanding the data they collect and provide through their platform, particularly with a view toward making predictive assessments of outbreak events.  The DataLab worked with project partners Beatriz Martinez (Vet Medicine) and Xin Liu (Computer Science) to incorporate two new capabilities into BioPortal: the first, regularly updated weather data for selected geographies to check for potentially outbreak-inducing weather conditions, and the second, live monitoring of social media posts to watch for early warnings of developing outbreaks.

    Read More>>

  • 2009-5602

    English Short Title Catalogue

    This project was originally intended to create a “machine-readable catalogue of books, pamphlets and other ephemeral material printed in English-speaking countries from 1701-1800.”

    Read More>>

  • wine_featured_image

    Extracting wine price data from historical catalogs

    This project was a collaboration between the DataLab and the UC Davis Library, funded by the Sloan Foundation, to extract historical price data from an archive of wine catalogs published by Sherry-Lehmann. The primary goal of the project was to create a database of historical price information that could help wine economists study wine markets over time. Secondary goals included developing open-source table-extraction software for images, built upon the Rtesseract package (an R interface to the Tesseract optical character recognition, or OCR, system), and hosting hackathons promoting authentic data science skills for UC Davis students.

    Read More>>

  • Brinkley-story-magnified2-Oct-2017-02-960x600-c-center

    Gender and Citation Disparities

    Leveraging bibliometrics to measure the impact of scholarly publications and explore under-representation and attribution in science. Citation counts help a research community understand the importance of a given scholarly work, but implicit bias can affect how researchers cite one another. By employing bibliometrics and text mining, we helped researchers in the social sciences explore the disparity between citation counts and scholarly influence for two pivotal case studies: Rachel Carson’s Silent Spring and Jane Jacobs’ The Death and Life of Great American Cities.

    Read More>>

  • collector_summary

    Identifying minimum infrastructure needs for comfortable bicycling

    We implemented Bayesian models with random effects to determine which features of streets and individuals had the strongest relationships with comfort ratings. Not surprisingly, we found a mix of street-level and individual characteristics to be important predictors. We found random effects to be important for controlling for individual tendencies to rank low or high, and for interactions between street-level variables that we couldn’t put explicitly in our models.

    Read More>>

  • Capture_InternetNewsArchive

    Immigration in the Media: TV News Archive Scraping

    This project was a collaboration between DataLab and Professor Caitlin Patler and postdoctoral scholar Robin Savinar in the UC Davis Department of Sociology to scrape the Internet Archive’s TV News database for metadata on TV news programs with keywords related to immigration. While the Internet Archive has an API for searching some parts of its databases, at the time of this research and the publication of this story there was no way to use the available APIs to search the transcripts of the news stories. We solved this problem by using traditional web scraping methods to first search the captions database and then scrape the needed metadata from the links appearing in the search results. We combined the scraped metadata with data from the FCC about station locations to assess the possibility of mapping the results, and determined that local TV schedule data would be needed to complete the mapping in a comprehensive way.

    Read More>>

  • The Pioneering Punjabis Digital Archive

    The Pioneering Punjabis Digital Archive offers a window into the story of South Asian immigrants from the Punjab region in north India to California since the turn of the twentieth century. Explore over 700 video interviews, speeches, diaries, photographs, articles, and letters in which Punjabi Americans share their life stories, values, and contributions to California’s history over the last hundred and twenty years.

    Read More>>

  • WaltWhitman2

    Places in Walt Whitman

    Merging text mining and the geospatial sciences to map the poetry of Walt Whitman. The American poet Walt Whitman worked during the period of transition from transcendentalism to realism and, due to this, many of his writings are rooted in physical spaces. Uncovering those spatial relationships provides another lens by which to understand American literature. This project used text mining to extract all locations mentioned in Whitman’s works, which were then assembled into a visual map for further exploration.

    Read More>>

  • Medical_Center3

    Predicting Length of Hospital Stays

    One of the most significant problems that hospitals across the country currently face is predicting how long each patient will remain in the hospital. This project is attempting to build a better predictive model by taking into account both quantitative and qualitative data from hospitals. The main source of information comes from classifying and mining doctors’ and nurses’ notes and using that information to create a model that provides a better estimate of each patient’s length of stay.

    Read More>>

  • Play the Knave

    Play the Knave Modlab

    The project, in coordination with the DSI, involves the creation of a gaming environment in which students recreate scenes from many works of Shakespeare. Movement and vocal data are gathered as participants act out a given scene. The data are then used to create a video of the production that can be shared with others. This is an exploratory project in which the researchers are trying not only to bring about a better understanding of Shakespeare’s works but also to recognize speech and movement patterns.

    Read More>>

  • Screen Shot 2017-01-24 at 2.03.14 PM

    Social Networks of Citation

    Tracing scholarly influence in medicine. The purpose of this project was to create a peer network of all publications and collaborations that stem from a single faculty member. The network was successfully created by mining MEDLINE data.

    Read More>>

  • STEM

    STEM Portal

    UC Davis is world renowned for its teaching and research in STEM (Science, Technology, Engineering, and Math), and in 2016 Forbes magazine ranked UC Davis as the “best value college for women in STEM.” Through a combination of undergraduate and graduate experimental research education opportunities, DataLab collaborated with UC Davis STEM Strategies to leverage data science tools and techniques to better understand these strengths and share them with various stakeholder communities.

    Read More>>