
Kang, Daniel (2022) Efficient and Accurate Systems for Querying Unstructured Data. PhD Dissertation, Computer Science, Stanford University. 

Year Published: 2022
Abstract: 

Volumes of unstructured, non-tabular data (e.g., videos, audio, and text) have been increasing exponentially. This data is exciting to scientific researchers, business analysts, and data scientists for downstream analyses. For example, video can be used by urban planners to analyze traffic, ecologists to understand hummingbird-bacteria microcosms, and data scientists to analyze customer behavior in stores. However, such analysis is impossible to do manually at scale: exabytes of data are generated per day, outstripping manual processing capacity.

In recent years, automatic analysis over this unstructured data has become possible via machine learning (ML). Analysts can use ML to extract structured information from these unstructured sources, such as object types and location from a video. The structured information can subsequently be used in downstream analysis, e.g., the urban planner can count the number of cars that passed by an intersection.

Unfortunately, using ML for these analyses is challenging. Deploying ML is prohibitively expensive for many organizations: naively analyzing a year of video from a small town can cost millions in cloud compute credits. ML methods are also unreliable, returning incorrect results, which can lead to downstream errors. Finally, deploying ML for analytics requires knowledge of deep learning, data systems, programming, and other technical skills.

In light of these challenges, we make two observations. First, many applications can tolerate approximations if there are guarantees on accuracy. Second, methods for answering unstructured data queries vary by up to 10 orders of magnitude in cost.

In this dissertation, we develop systems and algorithms for efficient and reliable unstructured data analytics, leveraging the two observations. Instead of returning exact answers, we return approximate answers generated by cheap approximations to expensive ML methods. Our systems can return statistically valid answers on a wide range of query types, including selection, aggregation, and limit queries. Furthermore, our systems can be up to orders of magnitude cheaper than standard methods of answering queries.
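As a rough illustration of answering an aggregation query approximately, the sketch below estimates a mean (e.g., the average number of cars per frame) by running an expensive model on only a random sample of records and using a cheap proxy score on every record to reduce variance. It uses control variates, a standard variance-reduction technique, and is not claimed to be the dissertation's algorithm. The names `estimate_mean`, `proxy_scores`, and `oracle` are hypothetical, and the reported interval relies on a normal approximation, so it is only approximately valid for small samples.

```python
import math
import random
import statistics


def estimate_mean(proxy_scores, oracle, sample_size, seed=0):
    """Estimate the mean oracle value over all records.

    proxy_scores: cheap per-record scores, available for every record.
    oracle: hypothetical expensive function mapping a record index to the
            true value (e.g., a large object detector's car count).
    Returns (estimate, half_width) for an approximate 95% interval.
    """
    rng = random.Random(seed)
    idx = rng.sample(range(len(proxy_scores)), sample_size)

    y = [oracle(i) for i in idx]         # expensive, sampled records only
    x = [proxy_scores[i] for i in idx]   # cheap, same sampled records
    mean_x_all = statistics.fmean(proxy_scores)

    # Control-variate coefficient c = Cov(x, y) / Var(x), from the sample.
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (sample_size - 1)
    var = sum((xi - mx) ** 2 for xi in x) / (sample_size - 1)
    c = cov / var if var > 0 else 0.0

    # Correct each sampled value using the proxy's known population mean.
    adjusted = [yi - c * (xi - mean_x_all) for xi, yi in zip(x, y)]
    estimate = statistics.fmean(adjusted)
    half_width = 1.96 * statistics.stdev(adjusted) / math.sqrt(sample_size)
    return estimate, half_width
```

The better the proxy correlates with the expensive model, the narrower the interval for a given number of expensive calls; with an uninformative proxy the estimate falls back to plain random sampling.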

We further develop systems for monitoring and quality assurance over ML pipelines. In addition to being deployed for analytics, ML is increasingly being deployed in mission-critical settings, such as in autonomous vehicles. Despite being deployed in these settings, models are often unmonitored and the training data is often not vetted.

To address this, we propose abstractions for monitoring and quality assurance of ML deployments: model assertions and learned observation assertions. These assertions allow domain experts to specify errors, both at deployment time and over the data used to train these models. Assertions can find errors with both high recall (75%) and high precision (100%) in real-world autonomous vehicle, video analytics, and medical datasets.
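To make the idea of an assertion concrete, here is a minimal sketch of one possible check over a video model's outputs: it flags frames where an object label is present in the frames immediately before and after but missing in between, which often indicates a missed detection. The function name `flicker_assertion` and the input format (one set of detected labels per frame) are assumptions for illustration, not the dissertation's API.

```python
def flicker_assertion(detections_per_frame):
    """Return indices of frames where a label vanishes for exactly one frame.

    detections_per_frame: list of sets of object labels, one set per frame
    (a hypothetical, simplified stand-in for real detector output).
    """
    flagged = []
    for t in range(1, len(detections_per_frame) - 1):
        prev_labels = detections_per_frame[t - 1]
        cur_labels = detections_per_frame[t]
        next_labels = detections_per_frame[t + 1]
        # Labels seen both before and after, but not in the current frame.
        missing = (prev_labels & next_labels) - cur_labels
        if missing:
            flagged.append(t)
    return flagged
```

For example, `flicker_assertion([{"car"}, set(), {"car"}, {"car"}])` flags frame 1. Flagged frames could then be routed to a human for review or relabeling.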

The systems and abstractions in this dissertation have been deployed in a variety of real-world settings, including autonomous vehicles and ecological analysis. [link to publication]

Article Title: 
Efficient and Accurate Systems for Querying Unstructured Data.