Explanation
Infrastructure for explaining the outputs of differential dataflow computations
Explaining the outputs of data-parallel computations
This project demonstrates how one can use differential dataflow to explain the outputs of differential dataflow computations, as described in the paper "Explaining outputs in modern data analytics". There is also an earlier blog post discussing the material and trying to explain how it works.
In our context, programs run on input collections and produce output collections. An "explanation" for a subset of the output is a subset of the input such that, if you ran the computation on just this subset, it would produce at least the output subset you asked for.
Concise explanations of this form can be super helpful, both for debugging your program and for understanding / communicating why parts of the output are the way they are.
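To make the defining property concrete, here is a minimal sketch in plain Rust (not using differential dataflow, and not this project's code): a toy minimum-label propagation computation, run once on the full input and once on a hand-picked explanatory subset, checking that the subset reproduces the queried output. The graph, seeds, and function names are all invented for illustration.

```rust
use std::collections::BTreeMap;

/// Toy minimum-label propagation: each node repeatedly adopts the smallest
/// label seen among itself and its neighbors, until a fixed point.
fn propagate(edges: &[(u32, u32)], seeds: &[(u32, u32)]) -> BTreeMap<u32, u32> {
    let mut labels: BTreeMap<u32, u32> = seeds.iter().cloned().collect();
    loop {
        let mut changed = false;
        for &(a, b) in edges {
            // Labels flow across each edge in both directions.
            for (src, dst) in [(a, b), (b, a)] {
                let src_label = labels.get(&src).copied();
                if let Some(l) = src_label {
                    let improves = match labels.get(&dst) {
                        Some(&cur) => l < cur,
                        None => true,
                    };
                    if improves {
                        labels.insert(dst, l);
                        changed = true;
                    }
                }
            }
        }
        if !changed {
            return labels;
        }
    }
}

fn main() {
    // Full input: a path 0-1-2-3 plus an unrelated component, label 0 at node 0.
    let edges = [(0, 1), (1, 2), (2, 3), (7, 8)];
    let seeds = [(0, 0), (7, 7)];
    let full = propagate(&edges, &seeds);
    assert_eq!(full.get(&2), Some(&0));

    // An "explanation" for node 2's label: the seed (0, 0) and the two edges
    // on the path to node 2. Rerunning on just this input subset must
    // reproduce at least the queried output.
    let partial = propagate(&[(0, 1), (1, 2)], &[(0, 0)]);
    assert_eq!(partial.get(&2), Some(&0));
    println!("explanation reproduces node 2's label: {:?}", partial.get(&2));
}
```

The explanatory subset is much smaller than the full input, which is exactly what makes explanations useful on large datasets.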
A quick taste
To give a taste, we can explain the outputs of "prioritized label propagation", an algorithm for determining connected components in a graph, by tracking down which graph edges and labels were important to arrive at each of the individual output records, even as the graph changes. Here we load the livejournal graph dataset, ask for an explanation of node 123456's label, and then remove one of the edges on the explanatory path, causing the explanation to change:
Echidnatron% cargo run --release --example interactive-cc -- ~/Projects/Datasets/livejournal
Running `target/release/examples/interactive-cc /Users/mcsherry/Projects/Datasets/livejournal`
initialization elapsed: Duration { secs: 55, nanos: 237042089 }
> query + 123456 0
label_must: ((0, 0), 1)
graph_must: ((123456, 133128), 1)
graph_must: ((0, 19), 1)
graph_must: ((19, 133128), 1)
round 1 elapsed: Duration { secs: 0, nanos: 44231288 }
> graph - 123456 133128
graph_must: ((123456, 133128), -1)
graph_must: ((73495, 123456), 1)
graph_must: ((64, 47681), 1)
graph_must: ((47681, 73495), 1)
graph_must: ((0, 64), 1)
round 2 elapsed: Duration { secs: 0, nanos: 922563085 }
For the present, the project is a library and some examples, demonstrating how to explain outputs rather than automatically doing so for all computations. The process does seem automatable, but the host frameworks don't lend themselves to easily rewriting dataflow computations.
An example: connected components
In the examples directory you can find the interactive-cc example. This runs a connected components computation, via prioritized label propagation. The inputs are graph edges and node labels, and the outputs are labels assigned to nodes after propagation.
If you run it with no further arguments, it allows you to interactively add and remove graph edges and node labels, and to ask queries of the output labels. For example,
Echidnatron% cargo run --example interactive-cc
Running `target/debug/examples/interactive-cc`
initialization elapsed: Duration { secs: 0, nanos: 856767 }
> graph + 0 1
round 1 elapsed: Duration { secs: 0, nanos: 6668874 }
> graph + 1 2
round 2 elapsed: Duration { secs: 0, nanos: 6224347 }
> label + 0 0
round 3 elapsed: Duration { secs: 0, nanos: 21278949 }
> query + 2 0
label_must: ((0, 0), 1)
graph_must: ((0, 1), 1)
graph_must: ((1, 2), 1)
round 4 elapsed: Duration { secs: 0, nanos: 151156905 }
This is ... neat! We added some edges, from 0 to 1 to 2, then added a label at 0, and then asked about the label reaching 2. The infrastructure tells us that to explain node 2 receiving label 0, we need the label for 0 and the two edges. This probably isn't a huge surprise, so let's do a bit more.
> graph + 1 3
round 5 elapsed: Duration { secs: 0, nanos: 13959840 }
> query + 3 0
graph_must: ((1, 3), 1)
round 6 elapsed: Duration { secs: 0, nanos: 77679489 }
This is similar. We've added an edge to node 3 and asked for an explanation. Or, more precisely, we have added 3 to the set of nodes that require explanation. Node 2 is still there, so we only get told about adding the edge (1,3) to the required input, as this is all we need to add to explain the labels of both 2 and 3.
> graph + 2 3
round 7 elapsed: Duration { secs: 0, nanos: 8583967 }
Nothing interesting here. We've added an edge from 2 to 3, but it isn't used as part of an explanation, yet.
> graph - 1 2
graph_must: ((2, 3), 1)
graph_must: ((1, 2), -1)
round 8 elapsed: Duration { secs: 0, nanos: 87361952 }
Whoa, we got rid of the edge from 1 to 2 which was crucial for explaining why 2 gets label 0. It still does get label 0, but now we need to bring in that edge between 2 and 3.
Larger graph data
If you go and snag the livejournal graph dataset and wrap it up in the graphmap format, you can run the computation with an optional argument that loads a bunch of graph data for you. I'm also running it in --release mode now because otherwise it would be very slow.
Echidnatron% cargo run --release --example interactive-cc -- ~/Projects/Datasets/livejournal
Running `target/release/examples/interactive-cc /Users/mcsherry/Projects/Datasets/livejournal`
initialization elapsed: Duration { secs: 55, nanos: 237042089 }
We can now interact with this, asking questions of a much larger computation:
> query + 123456 0
label_must: ((0, 0), 1)
graph_must: ((123456, 133128), 1)
graph_must: ((0, 19), 1)
graph_must: ((19, 133128), 1)
round 1 elapsed: Duration { secs: 0, nanos: 44231288 }
Here we have asked the computation to explain the label 0 for node 123456. The "explanation" is given by the indicated input data. The label (0,0) propagates over edges (0, 19), then (19, 133128), then (123456, 133128). The computation does not necessarily need these edges, as there may be other ways for 0 to reach 123456, but these were the edges observed to provide that label when the computation executed.
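The idea that explanations report the edges observed to deliver a label, rather than every possible path, can be sketched in plain Rust (again, not this project's code, and much simpler than the differential dataflow version): a breadth-first propagation records, for each node, the edge that first delivered the label, and an explanation query walks that provenance back to the root. All names here are invented for illustration.

```rust
use std::collections::{HashMap, VecDeque};

/// Propagate a root label by breadth-first search, recording for each
/// reached node the edge that first delivered the label to it.
fn propagate_with_provenance(edges: &[(u32, u32)], root: u32) -> HashMap<u32, (u32, u32)> {
    // Symmetrize the edge list into an adjacency map, remembering the
    // original edge alongside each neighbor.
    let mut adj: HashMap<u32, Vec<(u32, (u32, u32))>> = HashMap::new();
    for &(a, b) in edges {
        adj.entry(a).or_default().push((b, (a, b)));
        adj.entry(b).or_default().push((a, (a, b)));
    }
    let mut delivered: HashMap<u32, (u32, u32)> = HashMap::new();
    let mut queue = VecDeque::from([root]);
    while let Some(node) = queue.pop_front() {
        for &(next, edge) in adj.get(&node).into_iter().flatten() {
            if next != root && !delivered.contains_key(&next) {
                delivered.insert(next, edge);
                queue.push_back(next);
            }
        }
    }
    delivered
}

/// Walk provenance back from `query` to the root, collecting the set of
/// edges observed to deliver the label (panics if `query` was unreached).
fn explain(delivered: &HashMap<u32, (u32, u32)>, root: u32, mut query: u32) -> Vec<(u32, u32)> {
    let mut must = Vec::new();
    while query != root {
        let &(a, b) = &delivered[&query];
        must.push((a, b));
        // Step to the other endpoint of the delivering edge.
        query = if query == a { b } else { a };
    }
    must
}

fn main() {
    // Label 0 reaches node 3 along 0-1-3, even though 0-2-3 also exists.
    let edges = [(0, 1), (0, 2), (1, 3), (2, 3)];
    let delivered = propagate_with_provenance(&edges, 0);
    let must = explain(&delivered, 0, 3);
    // Exactly one delivering path is reported, not every path.
    assert_eq!(must.len(), 2);
    println!("graph_must: {:?}", must);
}
```

The real system does this incrementally and over changing inputs, but the shape of the answer is the same: one witnessed path per queried output, not an exhaustive enumeration.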
Let's make life a bit more difficult, removing that last edge from the reported path and requiring a second path from 0 to 123456, assuming they are still connected.
> graph - 123456 133128
graph_must: ((123456, 133128), -1)
graph_must: ((73495, 123456), 1)
graph_must: ((64, 47681), 1)
graph_must: ((47681, 73495), 1)
graph_must: ((0, 64), 1)
round 2 elapsed: Duration { secs: 0, nanos: 922563085 }
We see that the set of required edges changes; the edge (123456, 133128) no longer exists and is removed. Several new edges are added which reveal another path from 0 to 123456.
One may (and should) wonder why the other explanatory edges along the shorter path are not removed. The reason is that explanations are currently for the full history of the queried node. It is reasonable to want only the current explanation, and it seems likely that the same type of techniques could provide this behavior as well. At the same time, this introduces some ambiguities that we don't yet know how to resolve generally.
The amount of time taken, 923ms, doesn't seem so great. It turns out this is just timely dataflow shaking out some dust when we first perform updates to the graph (and should probably be fixed). Now that it is sorted out, we can repeat the type of update and see more brisk responses.
Let's remove that edge to 123456 again.
> graph - 73495 123456
graph_must: ((73495, 123456), -1)
graph_must: ((118715, 123456), 1)
graph_must: ((64, 13112), 1)
graph_must: ((13112, 118715), 1)
round 3 elapsed: Duration { secs: 0, nanos: 12515838 }
We see here again that removing the edge removes it from the explanations, and then another path is identified. This path appears shorter than above, but only because it has the ability to start at 64, which is already required to explain the history of the label for 123456.
The time to perform this update is just 13ms, which is relatively interactive.
What about a more aggressive change? Let's snip the edge just after 64, cutting off more of the path.
> graph - 64 13112
graph_must: ((64, 13112), -1)
graph_must: ((117977, 118715), 1)
graph_must: ((64, 117977), 1)
round 4 elapsed: Duration { secs: 0, nanos: 38432881 }
Here the explanation routes around the damage, going through 117977 rather than 13112 now that the latter is unavailable. This decision is made because when the computation runs without the edge (64, 13112) the zero label now arrives first through node 117977.
How about a "catastrophic" change: cutting the link to zero itself!
> graph - 0 64
graph_must: ((0, 64), -1)
graph_must: ((0, 81), 1)
graph_must: ((81, 117977), 1)
round 5 elapsed: Duration { secs: 0, nanos: 852422661 }
This is also a relatively simple change in explanation, but a substantially longer time to update: 852ms. The reason is that there is now much more work to do updating the actual computation; the zero label that previously flowed through 64 must be undone, and labels to other nodes recomputed, not just in explanation but in the computation itself. In principle we could return earlier once the explanation has settled (it does not depend on the other labels); in practice this is hard to see because the label outputs feed back into the explanation infrastructure.
A second example: stable matching
For a more exotic and technically challenging example, we also have an implementation of stable matching. Stable matching takes a bipartite graph as input, where each node on one side of the graph has a rank-ordered preference for neighbors on the other side of the graph. The computation repeatedly has one side propose to the other, each node choosing its most appealing neighbor who has not yet rejected their proposal. The recipients collect their proposals and reject any that are not the best they've received.
This is an iterative computation, in the same spirit as the connected components example above.
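The proposal-and-rejection process described above is the classical deferred-acceptance (Gale-Shapley) algorithm. For reference, here is a plain sequential Rust sketch of it, unrelated to the incremental differential dataflow implementation in this project; the function name and data layout are invented for illustration.

```rust
use std::collections::HashMap;

/// Deferred-acceptance stable matching. `prefs[p]` lists proposer p's
/// neighbors in preference order; `rank[r]` maps recipient r's neighbors
/// to their rank (lower is better). Returns, for each recipient, the
/// proposal it holds at the end.
fn stable_match(prefs: &[Vec<usize>], rank: &[HashMap<usize, usize>]) -> Vec<Option<usize>> {
    let mut next = vec![0usize; prefs.len()]; // next preference each proposer will try
    let mut held: Vec<Option<usize>> = vec![None; rank.len()]; // proposal each recipient holds
    let mut free: Vec<usize> = (0..prefs.len()).collect();
    while let Some(p) = free.pop() {
        if next[p] >= prefs[p].len() {
            continue; // proposer p has been rejected by everyone
        }
        let r = prefs[p][next[p]];
        next[p] += 1;
        match held[r] {
            // r holds nothing: accept p provisionally.
            None => held[r] = Some(p),
            // r prefers p to its current proposal: reject q for p.
            Some(q) if rank[r][&p] < rank[r][&q] => {
                held[r] = Some(p);
                free.push(q);
            }
            // r keeps its current proposal and rejects p.
            Some(_) => free.push(p),
        }
    }
    held
}

fn main() {
    // Two proposers, two recipients. Both proposers prefer recipient 0;
    // recipient 0 prefers proposer 1, recipient 1 prefers proposer 0.
    let prefs = vec![vec![0, 1], vec![0, 1]];
    let rank = vec![
        HashMap::from([(1, 0), (0, 1)]), // recipient 0's ranking
        HashMap::from([(0, 0), (1, 1)]), // recipient 1's ranking
    ];
    let matching = stable_match(&prefs, &rank);
    assert_eq!(matching, vec![Some(1), Some(0)]);
    println!("matching: {:?}", matching);
}
```

Because recipients only ever trade up, the result is stable: no proposer-recipient pair would both prefer each other over their assigned partners.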