YLearn, a pun on "learn why", is a Python package for causal inference. It supports several aspects of causal inference, including causal effect identification, estimation, and causal graph discovery.

Documentation website: https://ylearn.readthedocs.io

Chinese documentation: https://ylearn.readthedocs.io/zh_CN/latest/

Installation

Pip

The simplest way to install YLearn is with pip:

pip install ylearn

Note that Graphviz is required to plot causal graphs in notebooks, so install it before running YLearn. See https://graphviz.org/download/ for details on installing Graphviz.

Conda

YLearn can also be installed with conda. Install it from the channel conda-forge:

conda install -c conda-forge ylearn

This will install YLearn and all requirements including Graphviz.

Docker

We also publish an image on Docker Hub that can be downloaded directly and includes the following components:

Python 3.8
YLearn and its dependent packages
JupyterLab

Download the docker image:

docker pull datacanvas/ylearn

Run a docker container:

docker run -ti -e NotebookToken="your-token" -p 8888:8888 datacanvas/ylearn

Then visit http://<ip-addr>:8888 in a browser and enter the token to start.

Overview of YLearn

Machine learning has made great progress in recent years, mainly in prediction tasks such as classifying pictures of cats and dogs. However, machine learning cannot answer some questions that arise naturally in many scenarios. One example is the counterfactual question in policy evaluation: what would have happened if the policy had changed? Because these counterfactuals cannot be observed, purely predictive machine learning models cannot be applied directly. These limitations partly explain the growing use of causal inference.

Causal inference directly models the outcomes of interventions and formalizes counterfactual reasoning. With the aid of machine learning, causal inference can now draw causal conclusions from observational data in various ways, rather than relying on carefully designed experiments.

A typical complete causal inference procedure consists of three parts. First, it learns causal relationships using a technique called causal discovery. These relationships are then expressed as either Structural Causal Models or Directed Acyclic Graphs (DAGs). Second, it expresses the causal estimands, which are determined by the causal questions of interest such as the average treatment effect, in terms of the observed data. This process is known as identification. Finally, once the causal estimand is identified, it is estimated from observational data. Policy evaluation problems and counterfactual questions can then also be answered.

YLearn, equipped with many techniques developed in the recent literature, supports the whole causal inference pipeline from causal discovery to causal estimand estimation with the help of machine learning. This is especially promising when observational data are abundant.

Concepts in YLearn

There are 5 main concepts in YLearn corresponding to the causal inference pipeline.

CausalDiscovery. Discovering the causal relationships in the observational data.
CausalModel. Representing the causal relationships in the form of CausalGraph and doing other related operations such as identification with CausalModel.
EstimatorModel. Estimating the causal estimand with various techniques.
Policy. Selecting the best policy for each individual.
Interpreter. Explaining the causal effects and policies.

These components are connected to give a full causal inference pipeline, which is also encapsulated in a single API, Why.

Pipeline in YLearn

Figure: The pipeline of causal inference in YLearn.

Starting from the training data:

One first uses CausalDiscovery to reveal the causal structures in the data, which usually outputs a CausalGraph.
The causal graph is then passed into the CausalModel, where the causal effects of interest are identified and converted into statistical estimands.
An EstimatorModel is then trained on the training data to model the relationships between causal effects and other variables, i.e., to estimate causal effects in the training data.
One can then use the trained EstimatorModel to predict causal effects in a new test dataset, and either evaluate the policy assigned to each individual or interpret the estimated causal effects.
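
The identification and estimation steps above can be sketched on simulated data without YLearn. The following is a minimal illustration using only the standard library, not YLearn's implementation. It assumes the graph is already known (so discovery is skipped), that there are no unobserved confounders, and that all variables are simulated; the function names `draw`, `diff_in_means`, and `adjusted_ate` are hypothetical.

```python
import random
from statistics import mean

random.seed(0)

# Discovery is skipped: the graph W -> X, W -> Y, X -> Y is assumed known.
# Same convention as YLearn's dict: each key is a child, each value its parents.
causation = {'X': ['W'], 'W': [], 'Y': ['X', 'W']}

# Identification: with no unobserved confounders, the parents of the
# treatment form a valid backdoor adjustment set.
adjustment = causation['X']


def draw():
    """Simulate one row; the true average treatment effect of X on Y is 2."""
    w = int(random.random() < 0.5)
    x = int(random.random() < (0.8 if w else 0.2))
    y = 2 * x + 3 * w + random.gauss(0, 1)
    return {'W': w, 'X': x, 'Y': y}


data = [draw() for _ in range(20000)]


def diff_in_means(rows):
    """Mean outcome of treated rows minus mean outcome of control rows."""
    treated = [r['Y'] for r in rows if r['X'] == 1]
    control = [r['Y'] for r in rows if r['X'] == 0]
    return mean(treated) - mean(control)


def adjusted_ate(rows, adjustment):
    """Estimation: average within-stratum differences, weighted by stratum size."""
    strata = {}
    for r in rows:
        strata.setdefault(tuple(r[a] for a in adjustment), []).append(r)
    return sum(len(s) / len(rows) * diff_in_means(s) for s in strata.values())


naive = diff_in_means(data)
ate = adjusted_ate(data, adjustment)
print(f"naive: {naive:.2f}, adjusted: {ate:.2f}")
```

In this simulation the naive difference in means is biased upward because W raises both the chance of treatment and the outcome (its expectation is 2 + 3 × 0.6 = 3.8), while the stratified estimate should land close to the true effect of 2.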

The following flow chart is also helpful in many causal inference tasks:

Figure: Helpful flow chart when using YLearn.

Quick Start

In this part, we first show several simple example usages of YLearn. These examples cover the most common functionalities. Then we present a case study with Why to unveil the hidden causal relations in data.

Example usages

This section presents several essential example usages of YLearn, covering how to define a causal graph, identify a causal effect, and train an estimator model. Please see the corresponding documentation for more details.

Representation of the causal graph

Given a set of variables, YLearn represents their causal graph with a Python dict that encodes the causal relations among the variables. Each key of the dict is a child node, and the corresponding value is usually a list of the names of its parents. For instance, in the simplest case, for the causal graph X <- W -> Y, we first define a Python dict for the causal relations and then pass it to CausalGraph as a parameter:
```
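    from ylearn.causal_model.graph import CausalGraph

    # each key is a child node; its value lists the parents of that node
    causation = {'X': ['W'], 'W': [], 'Y': ['W']}
    cg = CausalGraph(causation=causation)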
```
cg is now the causal graph encoding the causal relation X <- W -> Y in YLearn. If there are unobserved confounders in the causal graph, then, in addition to the dict for the observed variables, we should also define a Python list describing the confounding they induce. For example, a causal graph with unobserved confounders (green nodes)
<img src="./fig/graph_expun.png" width="400">
is first converted into a graph with latent confounding arcs (black dotted lines with two directions)
<img src="./fig/graph_un_arc.png" width="500">
To represent such a causal graph, we should

(1) define a python dict to represent the observed parts, and

(2) define a list to encode the latent confounding arcs where each element in the list includes the names of the start node and the end node of a latent confounding arc:
```
     from ylearn.causal_model.graph import CausalGraph
     
     # define the dict to represent the observed parts
     causation_unob = {
         'X': ['Z2'],
         'Z1': ['X', 'Z2'],
         'Y': ['Z1', 'Z3'],
         'Z3': ['Z2'],
         'Z2': [], 
     }
     
     # define the list to encode the latent confounding arcs for unobserved confounders
     arcs = [('X', 'Z2'), ('X', 'Z3'), ('X', 'Y'), ('Z2', 'Y')]

     cg_unob = CausalGraph(causation=causation_unob, latent_confounding_arcs=arcs)
```
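
The child-to-parents dict convention can also be explored without YLearn. The following is a minimal sketch using only the standard library (it assumes Python 3.9+ for `graphlib`). The helpers `to_edges` and `check_arcs` are hypothetical names introduced for illustration and are not part of YLearn's API.

```python
from graphlib import TopologicalSorter

# Same convention as above: each key is a child, each value lists its parents.
causation_unob = {
    'X': ['Z2'],
    'Z1': ['X', 'Z2'],
    'Y': ['Z1', 'Z3'],
    'Z3': ['Z2'],
    'Z2': [],
}
arcs = [('X', 'Z2'), ('X', 'Z3'), ('X', 'Y'), ('Z2', 'Y')]


def to_edges(causation):
    """Hypothetical helper: list the directed edges as (parent, child) pairs."""
    return [(parent, child)
            for child, parents in causation.items()
            for parent in parents]


def check_arcs(causation, arcs):
    """Hypothetical helper: both ends of every latent arc must be observed nodes."""
    nodes = set(causation)
    return all(a in nodes and b in nodes for a, b in arcs)


edges = to_edges(causation_unob)
arcs_ok = check_arcs(causation_unob, arcs)

# TopologicalSorter expects node -> predecessors, which matches this dict.
# It raises graphlib.CycleError if the directed part is not acyclic.
order = list(TopologicalSorter(causation_unob).static_order())
print(edges)
print(order)
```

Listing parents before children in `order` is a quick sanity check that the observed part of the graph is a DAG before it is passed on to any causal model.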
Identification of causal effect

Before a causal effect can be estimated from data, it must be identified. The first step is to identify the causal estimand, which is easily done in YLearn. For instance, suppose that we are interested in identifying the causal estimand P(Y|do(X=x)) in the causal graph cg defined above. We can then simply define an instance of CausalModel and call its identify() method:
```
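     from ylearn.causal_model.model import CausalModel

     # cg is the CausalGraph for X <- W -> Y defined above
     cm = CausalModel(causal_graph=cg)
     cm.identify(treatment={'X'}, outcome={'Y'}, identify_method=('backdoor', 'simple'))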
```
Here we use the backdoor adjustment method by specifying identify_method=('backdoor', 'simple'). YLearn also supports front-door adjustment, finding instrumental variables, and, most importantly, the general identification method developed in [1], which can identify any causal effect that is identifiable. For example, given the following causal graph,
<img src="./fig/id_fig.png" width="300">
if we want to identify P(Y1, Y2|do(X)), recalling that black dotted lines with two directions are latent confounding arcs (i.e., there is an unobserved confounder pointing to the two end nodes of each black dotted line), we can apply YLearn as follows
```
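    from ylearn.causal_model import graph, model

    # observed part: each key is a child, each value lists its parents
    causation = {
        'W1': [],
        'W2': [],
        'X': ['W1'],
        'Y1': ['X'],
        'Y2': ['W2']
    }
    # latent confounding arcs between pairs of observed nodes
    arcs = [('W1', 'Y1'), ('W1', 'W2'), ('W1', 'Y2'), ('W1', 'Y1')]
    cg = graph.CausalGraph(causation, latent_confounding_arcs=arcs)
    cm = model.CausalModel(cg)
    # identify P(Y1, Y2 | do(X)) with the general identification method
    p = cm.id({'Y1', 'Y2'}, {'X'})
    p.show_latex_expression()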
```
which will give us the identified causal effect P(Y1, Y2|do(X)) as follows
<img src="./fig/latex_exp.png" width="400">
and calling the method p.parse() will give us the LaTeX expression
```
\sum_{W2}\left[\left[P(Y2|W2)\right]\right]\left[\sum_{W1}\left[P(W1)\right]\left[P(Y1|X, W1)\right]\right]\left[P(W2)\right]
```
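
Once the conditional distributions are available, an identified expression like the one above can be evaluated numerically. The following is a minimal sketch that does not use YLearn. It assumes binary variables, and every number in the probability tables is made up for illustration only; `bern` and `p_do` are hypothetical helper names.

```python
from itertools import product

# Hypothetical probability tables for binary variables (illustrative numbers only).
p_w1 = {0: 0.6, 1: 0.4}            # P(W1)
p_w2 = {0: 0.7, 1: 0.3}            # P(W2)
p_y2_given_w2 = {0: 0.2, 1: 0.9}   # P(Y2=1 | W2)
p_y1_given_x_w1 = {                # P(Y1=1 | X, W1), keyed by (x, w1)
    (0, 0): 0.1, (0, 1): 0.5,
    (1, 0): 0.4, (1, 1): 0.8,
}


def bern(p_one, value):
    """Probability that a binary variable equals `value` when P(1) = p_one."""
    return p_one if value == 1 else 1 - p_one


def p_do(y1, y2, x):
    """Evaluate sum_{W2} P(y2|W2) [sum_{W1} P(W1) P(y1|x, W1)] P(W2)."""
    inner = sum(p_w1[w1] * bern(p_y1_given_x_w1[(x, w1)], y1) for w1 in (0, 1))
    return sum(bern(p_y2_given_w2[w2], y2) * inner * p_w2[w2] for w2 in (0, 1))


# For each intervention x, the values over (y1, y2) should sum to one.
totals = {x: sum(p_do(y1, y2, x) for y1, y2 in product((0, 1), repeat=2))
          for x in (0, 1)}
print(totals)
```

Because the inner sum over W1 does not depend on W2, the expression factorizes into P(Y2) multiplied by an adjustment over W1 for Y1. In these tables, changing x therefore only changes the Y1 part of the result.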
Instrumental variables

Instrumental variables are an important technique in causal inference. The approach of using
