SkillAgentSearch skills...

Optimizer

Boolean query optimizer in Clojure.

Install / Use

/learn @sritchie/Optimizer
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

#+STARTUP: showall indent #+STARTUP: hidestars #+PROPERTY: header-args :noweb yes :cache yes :padline yes :tangle no :mkdirp yes

This post is based on a challenge that [[https://twitter.com/peterseibel][Peter Seibel]] gave our team back in 2013 when I worked at Twitter. He'd been screwing around with Pig and Scalding and was thinking about how some optimizer could rewrite a Pig query to save time on some of the tremendous joins our "customers" were running. Here's the original challenge:

  • The Challenge

Write a program whose input is a boolean expression (some combination of ANDs, ORs, NOTs, literal true and false values, and boolean variables) some of whose variables are cheap to access and some of which are expensive to access. (We'll in fact use the simplest cost model: cheap variables are free and expensive variables are infinitely expensive.) The output of this program is a new boolean expression that uses only the cheap variables and which returns true whenever the original would (i.e. for any specific set of values for the variables) and which returns false as often as it can when the original would.

(The motivation for this challenge is things like this: imagine a query that joins data sets and then filters the result. The filter predicate may access variables from both sides of the join but it may be a win to perform a pre-filter on each side of the join first to weed out rows before the join is performed. The pre-filter predicates obviously can only use the variables that are present on one side of the join.)

In slightly more formal terms, given a function f of n variables, the first k < n of which are cheap, you need to produce g such that:

#+BEGIN_EXAMPLE f(v1,v2,..vn) = g(v1,v2,...vk) && f(v1,v2,..vn) #+END_EXAMPLE

Or, to put it another way:

#+BEGIN_EXAMPLE !g(v1,v2,...vk) implies !f(v1,v2,...vn) #+END_EXAMPLE

For purposes of this challenge, you need to write a program that can parse the following grammar:

#+BEGIN_EXAMPLE formula := variable | literal | expression variable := cheap | expensive cheap := v[0-9]+ expensive := w[0-9]+ literal := "T" | "F" expression := conjunction | disjunction | negation conjunction := "(and" ws formula ws formula ")" disjunction := "(or" ws formula ws formula ")" negation := "(not" ws formula ")" ws := " "+ #+END_EXAMPLE

Then write a program that takes a file containing expressions in the form, one per line, and output a file containing for each input expression a pre-filter that uses only the cheap variables.

Your entry is disqualified for any of your pre-filters, g:

#+BEGIN_EXAMPLE input.exists { x => f(x) != (g(x) && f(x)) } // i.e. must be correct #+END_EXAMPLE

And correct pre-filters should minimize:

#+BEGIN_EXAMPLE input.count { x => !g(x) } #+END_EXAMPLE

All other things being equal, smaller pre-filters (measured by tree size of the expression) beat bigger ones. Style points are awarded for efficient computation and general cleverness. The grand prize is bragging rights and the satisfaction of a job well done.

  • The Solution

Simple enough. I'll start by saying that at the time Peter issued the challenge I had NO idea how to solve a problem like this. How does one rip a boolean expression apart into TWO sub-expressions? In true [[http://www.amazon.com/gp/product/069111966X/ref%3Das_li_tl?ie%3DUTF8&camp%3D1789&creative%3D390957&creativeASIN%3D069111966X&linkCode%3Das2&tag%3Dtheroato201-20&linkId%3D4676I2A4I5RWW7U4][Polya]] style, I deferred that question. The first step was to write a parser. That was easy. Next I'd have to apply boolean transformations like ~(not (not a)) => a~ to try and simplify the original equation, hoping that might kill expensive variables. Beyond that, not sure.

I knew Oscar would be using Scala, and Peter'd code up some Common Lisp monstrosity, so I dusted off Leiningen and got back to the Clojure.

To be extra sexy about it, I wrote the exercise up in an org-mode file. Literate programming, baby. Code Is Data! Documentation Is Code!

Here's the =project.clj= file. For testing I'm using [[https://github.com/clojure/test.check][test.check]], a Clojure port of Haskell's [[https://hackage.haskell.org/package/QuickCheck][QuickCheck]]. This let me run my optimizer on randomly generated boolean expressions and saved me the pain in the ass of trying to construct edge cases. Because I'd been writing a bunch of Scala at the time and couldn't bear to lose my pattern matching, [[https://twitter.com/swannodette][David Nolen's]] [[https://github.com/clojure/core.match][core.match]] makes an appearance as well.

#+BEGIN_SRC clojure :tangle ./project.clj (defproject io.samritchie/optimizer "0.1.0-SNAPSHOT" :description "Boolean Optimizer in Clojure." :url "https://github.com/sritchie/optimizer" :license {:name "Eclipse Public License" :url "http://www.eclipse.org/legal/epl-v10.html"} :dependencies [[org.clojure/core.match "0.3.0-alpha4"]] :profiles {:provided {:dependencies [[org.clojure/clojure "1.6.0"]]} :dev {:dependencies [[org.clojure/test.check "0.7.0"]]}}) #+END_SRC

#+BEGIN_SRC clojure :tangle src/optimizer/core.clj :exports none (ns optimizer.core (:require [clojure.core.match :refer [match]] [clojure.set :refer [subset? difference]])) #+END_SRC

#+BEGIN_SRC clojure :tangle test/optimizer/core_test.clj :exports none (ns optimizer.core-test (:use optimizer.core) (:require [clojure.core.match :refer [match]] [clojure.test :refer [deftest is]] [clojure.test.check :as tc] [clojure.test.check.clojure-test :refer [defspec]] [clojure.test.check.generators :as gen] [clojure.test.check.properties :as prop])) #+END_SRC

** Parsing

The first step is to parse that grammar. Surprise surprise, Peter's grammar looks like Lisp! This means that we get to use Clojure's reader as our parser, saving us a few lines of code over the solutions those Strongly Typed fellows will have to implement. I used symbols to represent symbols and literals and lists prefixed with the symbols =and=, =or= and =not= to represent the compound expressions.

Alongside the grammar, I'll be writing generators for each expression type using [[https://github.com/clojure/test.check][test.check]]. Then, instead of unit tests, we can write "laws" for each function in the optimizer. test.check will use the generators to create hundreds of random boolean expressions and make sure that these laws hold for every huge expression it can come up with. This was enormously helpful.

*** Variables

Cheap variables start with =v=, expensive variables start with =w=.

#+BEGIN_SRC clojure :tangle src/optimizer/core.clj (def prefixes "The set of valid variable prefixes." #{\v \w})

(def prefix "Returns the supplied symbol's first character." (comp first name))

(defn cheap "Generates a cheap variable using the supplied number." [n] (symbol (str \v n)))

(defn expensive "Generates an expensive variable using the supplied number." [n] (symbol (str \w n))) #+END_SRC

Now, as promised, we get to the test.check generators.

#+BEGIN_SRC clojure :tangle test/optimizer/core_test.clj (def cheap-v (gen/fmap cheap gen/nat)) (def expensive-v (gen/fmap expensive gen/nat)) #+END_SRC

=gen/nat= is a generator that produces natural numbers. =gen/fmap= takes a function and an existing generator and produces a NEW generator by applying the function all generated values. For example:

#+BEGIN_SRC clojure optimizer.core-test> (gen/sample gen/nat 10) (0 1 1 1 4 4 5 6 4 3) optimizer.core-test> (gen/sample cheap-v 10) (v0 v0 v1 v1 v1 v1 v1 v7 v3 v1) optimizer.core-test> (gen/sample expensive-v 10) (w0 w1 w1 w1 w0 w5 w5 w1 w4 w7) #+END_SRC

=gen/one-of= samples randomly between a list of supplied generators:

#+BEGIN_SRC clojure :tangle test/optimizer/core_test.clj (def variable (gen/one-of [cheap-v expensive-v]))

;; optimizer.core-test> (gen/sample variable 10) ;; (v0 w1 w0 v2 v4 w4 w5 w1 v3 v2) #+END_SRC

*** Literals

=true= and =false= are both represented as literals:

#+BEGIN_SRC clojure :tangle src/optimizer/core.clj (def literals #{'T 'F}) #+END_SRC

=gen/elements= creates a generator that chooses elements from some collection:

#+BEGIN_SRC clojure :tangle test/optimizer/core_test.clj (def literal-gen (gen/elements literals)) #+END_SRC

We can use =gen/frequency= to build up a generator that spits out variables and literals, preferring variables with a 3:1 ratio.

#+BEGIN_SRC clojure :tangle test/optimizer/core_test.clj (def non-compound (gen/frequency [[3 variable] [1 literal-gen]])) #+END_SRC

Let's round out variables and literals with a couple of validators, since we don't have a type system to help us out:

#+BEGIN_SRC clojure :tangle src/optimizer/core.clj (defn variable? "Returns true if the argument is a valid cheap or expensive variable, false otherwise." [x] (and (symbol? x) (contains? prefixes (prefix x))))

(def literal? "Returns true if passed a literal, false otherwise." (comp boolean literals)) #+END_SRC

*** Compound Expressions

A formula is a variable, a literal or an expression. Let's implement expression parsing. Conjunctions and disjunctions, or =AND=s and =OR=s, are both binary expressions. negation, or =NOT=, is unary. These validators help us distinguish those cases and peel apart lists:

#+BEGIN_SRC clojure :tangle src/optimizer/core.clj (defn unary? [exp] (and (coll? exp) (= 2 (count exp))))

(defn binary? [exp] (and (coll? exp) (= 3 (count exp))))

(def func "Returns the function of the supplied boolean expression." first)

(def args "Returns the arguments of the supplied boolean expression." rest) #+END_SRC

Next, some functions to build and validate the various compound

View on GitHub
GitHub Stars9
CategoryDevelopment
Updated2y ago
Forks0

Languages

Clojure

Security Score

55/100

Audited on Dec 19, 2023

No findings