MIT researchers have developed a data science tool that lets non-statisticians uncover trends and gain insights from massive datasets by writing just a few lines of code.

Rob Matheson explains the tool's approach in this article from MIT News:

The tool works using a modified version of a technique called “program synthesis,” which automatically creates computer programs given data and a language to work within. The technique is basically computer programming in reverse: Given a set of input-output examples, program synthesis works its way backward, filling in the blanks to construct an algorithm that produces the example outputs based on the example inputs.
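To make the idea of "programming in reverse" concrete, here is a minimal sketch of classic, non-probabilistic program synthesis by brute-force enumeration. It is not the MIT tool's method. The four-operation language (`add1`, `double`, `square`, `negate`) and the helper names `run` and `synthesize` are hypothetical, chosen only for illustration.

```python
from itertools import product

# Hypothetical toy language: each primitive maps an integer to an integer.
PRIMITIVES = {
    "add1": lambda x: x + 1,
    "double": lambda x: x * 2,
    "square": lambda x: x * x,
    "negate": lambda x: -x,
}


def run(program, x):
    """Apply each named primitive in order to the input value."""
    for name in program:
        x = PRIMITIVES[name](x)
    return x


def synthesize(examples, max_length=3):
    """Return the first (shortest) program consistent with every example.

    Candidates are enumerated by increasing length, so the search works
    backward from the input-output pairs to a program that explains them.
    Returns None if nothing of length <= max_length fits.
    """
    for length in range(1, max_length + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, i) == o for i, o in examples):
                return program
    return None


# Input-output examples consistent with f(x) = 2 * (x + 1).
examples = [(1, 4), (2, 6), (3, 8)]
program = synthesize(examples)
print(program)
```

Exhaustive enumeration only scales to tiny languages; it is used here because it shows the search direction (from examples to program) in a few lines.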

The approach is different from ordinary program synthesis in two ways. First, the tool synthesizes probabilistic programs that represent Bayesian models for data, whereas traditional methods produce programs that do not model data at all. Second, the tool synthesizes multiple programs simultaneously, while traditional methods produce only one at a time. Users can pick and choose which models best fit their application.
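The idea of returning several candidate models rather than one can be illustrated with a rough sketch. This is a simplified stand-in, not the researchers' Bayesian synthesis procedure: it fits three fixed model structures to a time series and weights them with the Bayesian information criterion (BIC), a common approximation to model evidence. The model set, the function names, and the synthetic series are all assumptions made for the example.

```python
import math


def gaussian_log_likelihood(residuals):
    """Log-likelihood under Gaussian noise, using the maximum-likelihood variance."""
    n = len(residuals)
    variance = max(sum(r * r for r in residuals) / n, 1e-9)  # floor avoids log(0)
    return -0.5 * n * (math.log(2 * math.pi * variance) + 1)


def fit_constant(ys):
    mean = sum(ys) / len(ys)
    return [mean] * len(ys), 1  # (predictions, number of parameters)


def fit_linear(ys):
    n = len(ys)
    t_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    slope = (sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ys))
             / sum((t - t_mean) ** 2 for t in range(n)))
    intercept = y_mean - slope * t_mean
    return [intercept + slope * t for t in range(n)], 2


def fit_seasonal(ys, period):
    # One mean per phase of the cycle.
    phase_means = [sum(ys[p::period]) / len(ys[p::period]) for p in range(period)]
    return [phase_means[t % period] for t in range(len(ys))], period


def rank_models(ys, period=4):
    """Return (model name, weight) pairs, most plausible first."""
    candidates = {
        "constant": fit_constant(ys),
        "linear": fit_linear(ys),
        "seasonal": fit_seasonal(ys, period),
    }
    n = len(ys)
    bic = {}
    for name, (predictions, k) in candidates.items():
        residuals = [y - p for y, p in zip(ys, predictions)]
        # k + 1 counts the noise variance as an extra parameter.
        bic[name] = (k + 1) * math.log(n) - 2 * gaussian_log_likelihood(residuals)
    best = min(bic.values())
    raw = {name: math.exp(-0.5 * (b - best)) for name, b in bic.items()}
    total = sum(raw.values())
    return sorted(((name, w / total) for name, w in raw.items()),
                  key=lambda pair: pair[1], reverse=True)


# Synthetic series: a repeating four-step pattern plus a small deterministic wobble.
base = [10, 14, 18, 12]
series = [base[t % 4] + 0.1 * ((t * 7) % 5 - 2) for t in range(24)]
ranked = rank_models(series)
for name, weight in ranked:
    print(f"{name}: {weight:.3g}")
```

Because each candidate has a readable structure, a user can see which one the data favors (here, the seasonal one) and still inspect or choose among the others.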

“When the system makes a model, it spits out a piece of code written in one of these domain-specific probabilistic programming languages … that people can understand and interpret,” Mansinghka says. “For example, users can check if a time series dataset like airline traffic volume has seasonal variation just by reading the code — unlike with black-box machine learning and statistics methods, where users have to trust a model’s predictions but can’t read it to understand its structure.”

Probabilistic programming is an emerging field at the intersection of programming languages, artificial intelligence, and statistics. This year, MIT hosted the first International Conference on Probabilistic Programming, which had more than 200 attendees, including leading industry players in probabilistic programming such as Microsoft, Uber, and Google.