Identifying causal effects is an integral part of scientific inquiry. It helps us understand everything from educational outcomes to the effects of social policies to risk factors for diseases. Questions of cause-and-effect are also critical for the design and data-driven evaluation of many technological systems we build today.
To help data scientists better understand and deploy causal inference, Microsoft researchers built a tool that implements the process of causal inference analysis from end to end. The ensuing DoWhy library has been doing just that since 2018 and has cultivated a community devoted to applying causal inference principles in data science. To broaden access to this critical knowledge base, DoWhy is migrating to an independent open-source governance model in a new PyWhy GitHub organization. As a first step toward this model, we are announcing a collaboration with Amazon Web Services (AWS), which is contributing new technology based on structural causal models.
What is causal inference?
The goal of conventional machine learning methods is to predict an outcome. In contrast, causal inference focuses on the effect of a decision or action, that is, the difference between the outcome if the action is taken and the outcome if it is not. For example, consider a public utility company seeking to reduce its customers’ water usage through a marketing and rewards program. The effectiveness of such a program is difficult to ascertain, because any decrease in water usage by participating customers is confounded with their choice to participate. If we observe that a rewards program member uses less water, how do we know whether the program is incentivizing the lower usage or whether customers who were already planning to reduce their water usage simply chose to join? Given information about the drivers of customer behavior, causal methods can disentangle confounding factors and identify the effect of the rewards program.
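To make the confounding problem concrete, here is a minimal simulation sketch of the water rewards scenario. Everything in it is an assumption made for illustration: the `eco_minded` confounder, the enrollment probabilities, the usage numbers, and the true effect of 5 fewer units are invented, not taken from any real utility data.

```python
import random
from statistics import mean


def simulate_customers(n=20000, seed=0):
    """Return (eco_minded, joined, usage) tuples for n hypothetical customers."""
    rng = random.Random(seed)
    customers = []
    for _ in range(n):
        # Confounder: was the customer already planning to cut water usage?
        eco_minded = rng.random() < 0.5
        # Assumption: such customers are much more likely to join the program.
        joined = rng.random() < (0.8 if eco_minded else 0.2)
        # Assumption: the program's true effect is 5 fewer units of water.
        usage = 100.0 - 20.0 * eco_minded - 5.0 * joined + rng.gauss(0, 5)
        customers.append((eco_minded, joined, usage))
    return customers


def naive_difference(customers):
    """Compare members with non-members, ignoring why they joined."""
    members = [u for _, joined, u in customers if joined]
    others = [u for _, joined, u in customers if not joined]
    return mean(members) - mean(others)


def adjusted_difference(customers):
    """Compare members with non-members within each confounder level,
    then average the per-level differences by level size."""
    total = 0.0
    for level in (True, False):
        stratum = [c for c in customers if c[0] == level]
        members = [u for _, joined, u in stratum if joined]
        others = [u for _, joined, u in stratum if not joined]
        total += len(stratum) / len(customers) * (mean(members) - mean(others))
    return total


customers = simulate_customers()
print(f"naive difference:    {naive_difference(customers):.1f}")
print(f"adjusted difference: {adjusted_difference(customers):.1f}")
```

Under these assumptions, the naive comparison overstates the program's benefit several times over, because members are disproportionately customers who would have used less water anyway. Comparing like with like within each group recovers a value close to the assumed effect of 5 units. In practice the confounders are rarely this simple or this fully observed, which is why the modeling assumptions matter so much.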
How do we know when we have the right answer? The effect of an action like signing up for a customer loyalty program is typically not an observable value. For any given customer, we see only one of the two possible outcomes (with the program or without it) and cannot directly observe the difference the program made. This means that the processes developed to validate conventional machine learning models, which compare predictions to observed ground truth, cannot be used. Instead, we need new processes to gain confidence in the reliability of causal inference. Most critically, we need to capture our domain knowledge, reason about our modeling choices, validate our core assumptions when possible, and analyze the sensitivity of our results to violations of those assumptions when validation is not possible.
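One common way to write this down is the potential-outcomes notation. The symbols below are not from this post and are introduced only as an illustration: $T_i$ is 1 if customer $i$ joined the program and 0 otherwise, and $Y_i(1)$ and $Y_i(0)$ are that customer's water usage with and without the program.

```latex
\begin{align*}
\tau_i &= Y_i(1) - Y_i(0)
  && \text{effect of the program for customer } i \\
Y_i^{\text{obs}} &= T_i \, Y_i(1) + (1 - T_i) \, Y_i(0)
  && \text{the single outcome we actually observe} \\
\text{ATE} &= \mathbb{E}\left[\, Y(1) - Y(0) \,\right]
  && \text{average effect across customers}
\end{align*}
```

Because only $Y_i^{\text{obs}}$ is ever recorded, $\tau_i$ has no observed ground truth to check against, and even the average effect can be recovered from data only under explicit assumptions.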
Four steps of causal inference analysis
Data scientists just beginning to explore causal inference are most challenged by the new modeling assumptions of causal methods. DoWhy can help them understand and implement the process. The library focuses on the four steps of an end-to-end causal inference analysis, which are discussed in detail in a previous paper, "DoWhy: an End-to-End Library for Causal Inference," and a related blog post:
Modeling: Causal reasoning begins with the creation of a clear model of the causal assumptions being made. This involves documenting what is known about the data generating process and mechanisms. To get a valid answer to our cause-and-effect questions, we must be explicit about what we already know.
Identification: Next, we use the model to decide whether the causal question can be answered, and we provide the required expression to be computed. Identification is the process of analyzing our model.
Estimation: Once we have a strategy for identifying the causal effect, we can choose from several different statistical and machine learning-based estimation methods to answer our causal question. Estimation is the process of analyzing our data.
Refutation: Once we have our answer, we must do everything we can to test our underlying assumptions. Is our model consistent with the data? How sensitive is the answer to the assumptions made? If the model missed an unobserved confounder, will that change our answer a little or a lot?
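The four steps above can be sketched in a few dozen lines of plain Python. This is a hand-rolled illustration under stated assumptions, not DoWhy's API: the function names (`identify`, `estimate_ipw`, `placebo_effect`), the `eco_minded` confounder, and the simulated data with a true effect of 5 fewer units are all hypothetical.

```python
import random

# Step 1, Modeling: write down what we assume about the data generating
# process. Assumption: `eco_minded` is the only common cause.
ASSUMPTIONS = {
    "treatment": "joined",
    "outcome": "usage",
    "common_causes": ["eco_minded"],
}


def make_data(n=20000, seed=1):
    """Simulated customers consistent with ASSUMPTIONS (true effect: -5)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        eco = rng.random() < 0.5
        joined = rng.random() < (0.8 if eco else 0.2)
        usage = 100.0 - 20.0 * eco - 5.0 * joined + rng.gauss(0, 5)
        rows.append({"eco_minded": eco, "joined": joined, "usage": usage})
    return rows


# Step 2, Identification: analyze the model, not the data. If every common
# cause is observed, adjusting for all of them (backdoor adjustment) works.
def identify(assumptions):
    return {"strategy": "backdoor",
            "adjust_for": list(assumptions["common_causes"])}


# Step 3, Estimation: analyze the data. Here, inverse propensity weighting
# with propensities estimated as the treated fraction in each stratum.
# Assumes every stratum contains both treated and untreated rows.
def estimate_ipw(rows, treatment, outcome, adjust_for):
    counts = {}
    for r in rows:
        key = tuple(r[v] for v in adjust_for)
        n, k = counts.get(key, (0, 0))
        counts[key] = (n + 1, k + int(r[treatment]))
    num1 = den1 = num0 = den0 = 0.0
    for r in rows:
        n, k = counts[tuple(r[v] for v in adjust_for)]
        p = k / n  # estimated probability of treatment in this stratum
        if r[treatment]:
            num1 += r[outcome] / p
            den1 += 1.0 / p
        else:
            num0 += r[outcome] / (1.0 - p)
            den0 += 1.0 / (1.0 - p)
    return num1 / den1 - num0 / den0


# Step 4, Refutation: replace the treatment with a random permutation of
# itself. A sound analysis should then report an effect close to zero.
def placebo_effect(rows, treatment, outcome, adjust_for, seed=2):
    shuffled = [r[treatment] for r in rows]
    random.Random(seed).shuffle(shuffled)
    placebo_rows = [{**r, treatment: t} for r, t in zip(rows, shuffled)]
    return estimate_ipw(placebo_rows, treatment, outcome, adjust_for)


rows = make_data()
plan = identify(ASSUMPTIONS)
effect = estimate_ipw(rows, "joined", "usage", plan["adjust_for"])
placebo = placebo_effect(rows, "joined", "usage", plan["adjust_for"])
print(f"estimated effect: {effect:.2f}")
print(f"placebo effect:   {placebo:.2f}")
```

Note how the first two steps never touch the data, while the last two never change the model. A placebo check is only one kind of refutation; it cannot detect an unobserved confounder, which calls for a separate sensitivity analysis.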
This focus on the four steps of the end-to-end causal inference process differentiates the DoWhy library from prior causal inference toolkits. DoWhy complements other libraries—which focus on individual steps—and offers users the benefits of those libraries in a seamless, unified API. For example, for estimation, DoWhy offers the ability to call out to Microsoft’s EconML library for its advanced estimation methods.