A new tool makes it easier for database users to perform complex statistical analyses on table data without having to know what’s happening behind the scenes.
GenSQL, a generative AI system for databases, can help users make predictions, detect anomalies, guess missing values, fix errors, or generate synthetic data with just a few keystrokes.
For example, if the system is used to analyze medical data from a patient who has always had high blood pressure, it might detect a blood pressure reading that is low for that specific patient, but otherwise within the normal range.
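The blood pressure example above can be sketched in a few lines of Python. This is only an illustration of the idea of judging a reading against one patient's own history. It uses a plain z-score as a stand-in, not the probabilistic models GenSQL relies on, and the function name, the threshold, and the "normal range" numbers are hypothetical values chosen for the example, not clinical guidance.

```python
import statistics

# Hypothetical population-level "normal" systolic range, for illustration only.
NORMAL_SYSTOLIC_RANGE = (90, 120)


def is_unusual_for_patient(history, reading, z_threshold=3.0):
    """Flag a reading that is far from this patient's own past readings.

    A simple z-score against the patient's history; an assumed stand-in
    for the per-patient reasoning a probabilistic model would perform.
    """
    mean = statistics.mean(history)
    spread = statistics.stdev(history)
    z_score = (reading - mean) / spread
    return abs(z_score) > z_threshold


def in_population_range(reading, normal_range=NORMAL_SYSTOLIC_RANGE):
    """Check the reading against the population-level range only."""
    low, high = normal_range
    return low <= reading <= high


# A patient whose systolic pressure has always been high.
patient_history = [150, 155, 148, 152, 158, 151]
new_reading = 115

looks_normal_in_general = in_population_range(new_reading)
unusual_for_this_patient = is_unusual_for_patient(patient_history, new_reading)
typical_for_this_patient = not is_unusual_for_patient(patient_history, 150)
```

A check against the population range alone would pass the new reading, while the per-patient comparison flags it.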
GenSQL automatically integrates a tabular dataset and a generative probabilistic AI model, which can account for uncertainty and adapt decision making based on new data.
In addition, GenSQL can be used to produce and analyze synthetic data that mimics real data in a database. This can be especially useful in situations where sensitive data cannot be shared, such as patient records, or where real data is scarce.
This new tool is built on SQL, a programming language for creating and manipulating databases that was introduced in the late 1970s and is used by millions of developers worldwide.
“Historically, SQL taught the business world what a computer could do. They didn’t have to write custom programs, they just had to ask questions of a database in a high-level language. We think that as we move from just querying data to asking questions of models and data, we need an analogous language that teaches people the coherent questions you can ask a computer with a probabilistic model of the data,” says Vikash Mansinghka ’05, MEng ’09, PhD ’09, senior author of a paper introducing GenSQL and a principal investigator and leader of the Probabilistic Computing Project in the MIT Department of Brain and Cognitive Sciences.
When the researchers compared GenSQL to popular AI-based data analysis approaches, they found that it was not only faster but also produced more accurate results. Importantly, the probabilistic models used by GenSQL are explainable, allowing users to read and edit them.
“If you look at the data and try to find meaningful patterns using only a few simple statistical rules, you might miss important interactions. You really want to capture the correlations and dependencies of the variables, which can be quite complex, in a model. With GenSQL, we want to enable a large group of users to query their data and their model without having to know all the details,” adds lead author Mathieu Huot, a researcher in the Department of Brain and Cognitive Sciences and member of the Probabilistic Computing Project.
They are joined on the paper by Matin Ghavami and Alexander Lew, MIT graduate students; Cameron Freer, a research scientist; Ulrich Schaechtel and Zane Shelby of Digital Garage; Martin Rinard, an MIT professor in the Department of Electrical Engineering and Computer Science and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL); and Feras Saad ’15, MEng ’16, PhD ’22, an assistant professor at Carnegie Mellon University. The research was recently presented at the ACM Conference on Programming Language Design and Implementation.
Combining models and databases
SQL, which stands for Structured Query Language, is a programming language for storing and manipulating information in a database. In SQL, people can ask questions about data using keywords, for example to sum, filter, or group database records.
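The kind of keyword-based query described above can be sketched with Python's built-in `sqlite3` module. This is ordinary SQL, not GenSQL, and the table name, column names, and rows are made up for the illustration.

```python
import sqlite3


def salary_totals_by_city(rows, min_salary):
    """Sum salaries per city, keeping only rows at or above min_salary."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute("CREATE TABLE developers (name TEXT, city TEXT, salary REAL)")
    conn.executemany("INSERT INTO developers VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT city, SUM(salary)   -- aggregate
        FROM developers
        WHERE salary >= ?          -- filter
        GROUP BY city              -- group
        ORDER BY city
        """,
        (min_salary,),
    ).fetchall()
    conn.close()
    return result


# Hypothetical records, invented for this example.
rows = [
    ("Ada", "Seattle", 120000.0),
    ("Bo", "Seattle", 95000.0),
    ("Cy", "Boston", 110000.0),
    ("Di", "Boston", 60000.0),
]
totals = salary_totals_by_city(rows, 90000)
```

A query like this answers questions about the stored records themselves, which is the part of the workflow that a probabilistic model is meant to complement.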
However, querying a model can yield deeper insights because models can capture what data means to an individual. For example, a female developer wondering if she is underpaid is likely more interested in what salary data means to her individually than in trends from database records.
The researchers noted that SQL did not provide an effective way to integrate probabilistic AI models, but at the same time, approaches that use probabilistic models to make inferences did not support complex database queries.
To fill this gap, they built GenSQL, which allows users to query both a dataset and a probabilistic model using a simple yet powerful formal programming language.
A GenSQL user uploads their data and probabilistic model, which the system automatically integrates. They can then run queries on the data that also draw on the probabilistic model running behind the scenes. This not only allows for more complex queries, but can also yield more accurate answers.
For example, a query in GenSQL might be something like, “How likely is it that a developer from Seattle knows the Rust programming language?” Simply looking at a correlation between columns in a database can miss subtle dependencies. Incorporating a probabilistic model can capture more complex interactions.
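To make the Seattle and Rust question concrete, here is a minimal sketch of what answering it from a model, rather than from raw rows, can look like. It assumes the model is a tiny, hand-written joint distribution over three variables. The probabilities, the extra `role` variable, and the function name are all invented for illustration, and none of this is GenSQL syntax or its actual modeling approach.

```python
def conditional_probability(joint, event, given):
    """Return P(event | given) for a discrete joint distribution.

    `joint` maps outcomes to probabilities; `event` and `given` are
    predicates over outcomes. Variables not mentioned by either
    predicate are summed out (marginalized).
    """
    p_given = sum(p for outcome, p in joint.items() if given(outcome))
    if p_given == 0:
        raise ValueError("conditioning event has zero probability")
    p_both = sum(
        p for outcome, p in joint.items() if given(outcome) and event(outcome)
    )
    return p_both / p_given


# Hypothetical joint distribution over (city, role, knows_rust).
joint = {
    ("Seattle", "systems", True): 0.12,
    ("Seattle", "systems", False): 0.08,
    ("Seattle", "web", True): 0.03,
    ("Seattle", "web", False): 0.17,
    ("Boston", "systems", True): 0.06,
    ("Boston", "systems", False): 0.09,
    ("Boston", "web", True): 0.04,
    ("Boston", "web", False): 0.41,
}

p_rust_given_seattle = conditional_probability(
    joint,
    event=lambda outcome: outcome[2],            # knows Rust
    given=lambda outcome: outcome[0] == "Seattle",
)
```

In this toy model the answer depends on the mix of roles in each city, the kind of dependency a comparison of two columns alone would not expose.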
In addition, the probabilistic models that GenSQL uses are auditable, so people can see what data the model is using to make decisions. These models also provide calibrated uncertainty measures along with each answer.
For example, if a user asks the model for the predicted outcomes of different cancer treatments for a patient from a minority group that is underrepresented in the dataset, GenSQL tells the user that it is uncertain, and how uncertain it is, rather than overconfidently advocating the wrong treatment.
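One way to see why less data should mean a less confident answer is a simple Beta-Bernoulli calculation, sketched below. This is a textbook model chosen for illustration, not the model GenSQL uses, and the function name, the uniform prior, and the patient counts are hypothetical.

```python
import math


def beta_posterior_summary(successes, failures, prior_a=1.0, prior_b=1.0):
    """Posterior mean and standard deviation of a success rate.

    Assumes a Beta(prior_a, prior_b) prior and independent yes/no outcomes,
    so the posterior is Beta(prior_a + successes, prior_b + failures).
    """
    a = prior_a + successes
    b = prior_b + failures
    mean = a / (a + b)
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(variance)


# Hypothetical treatment outcomes: many patients in one group, few in another.
well_represented = beta_posterior_summary(successes=600, failures=400)
underrepresented = beta_posterior_summary(successes=3, failures=2)
```

Both groups have the same observed success rate, but the posterior standard deviation for the small group is far larger, which is the signal that should keep a system from recommending a treatment with unwarranted confidence.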
Faster and more accurate results
To evaluate GenSQL, the researchers compared their system to popular baseline methods that use neural networks. GenSQL was between 1.7 and 6.8 times faster than these approaches, executing most queries in a few milliseconds and delivering more accurate results.
They also applied GenSQL in two case studies. In the first, the system identified mislabeled clinical trial data; in the second, it generated accurate synthetic data that captured complex relationships in genomics.
Next, the researchers want to apply GenSQL more broadly to large-scale modeling of human populations. With GenSQL, they can generate synthetic data to make inferences about things like health and salary, while controlling what information goes into the analysis.
They also want to make GenSQL easier to use and more powerful by adding new optimizations and automations to the system. In the long term, the researchers want to enable users to create natural language queries in GenSQL. Their goal is to eventually develop a ChatGPT-like AI expert that you can talk to about any database, basing its answers on GenSQL queries.
This research is funded in part by the Defense Advanced Research Projects Agency (DARPA), Google, and the Siegel Family Foundation.