DataGenerators.jl
DataGenerators is a data generation package. It can use techniques for search and optimization to find effective data for uses such as testing.
DataGenerators
DataGenerators is a data generation package. It enables the generators for structured data to be defined in a natural manner as Julia functions. It can apply search and optimisation techniques to find data that, for example, can improve software testing by generating more effective test data.
You can write your own data generators utilizing the full power of Julia, or use the DataGeneratorTranslators package to automatically create data generators from specifications such as Backus-Naur Form (BNF), XML Schema Definition (XSD), and regular expressions.
Installation
Install by cloning the package directly from GitHub -- including two packages it requires, DataGeneratorTranslators and BaseTestMulti -- from a Julia REPL:
julia> Pkg.clone("https://github.com/simonpoulding/BaseTestMulti.jl")
julia> Pkg.clone("https://github.com/simonpoulding/DataGeneratorTranslators.jl")
julia> Pkg.clone("https://github.com/simonpoulding/DataGenerators.jl")
Usage
Don't forget to load the package:
julia> using DataGenerators
Generators
A First Example
A generator consists of rules which are written as Julia functions, and are defined using the @generator macro:
julia> @generator NumXStrGen begin
           start() = join(plus(item()))
           item() = 'X'
           item() = choose(Int,0,9)
       end
Data generation begins by executing the start rule-function, which in turn calls other rule-functions and accepts the values they return. The value returned by the start rule-function is the value emitted when the generator is run.
Here the start rule-function uses plus, a special DataGenerator construct that creates a list of length 1 or more by repeatedly executing its argument. The length of the list is decided each time plus is executed.
The argument of plus is a call to the item rule-function. There are two item rule-functions defined in the generator, and which one of the two is executed is decided each time the rule-function is called.
The second item rule-function uses another DataGenerator construct, choose, to select a value from a data type: here an integer between 0 and 9. Again, which value is returned is decided each time choose is executed.
The macro creates the generator as a Julia type, in this case named NumXStrGen. To run the generator so that it emits a datum, first create an instance of the generator, and then call choose on that instance:
julia> g = NumXStrGen()
data generator NumXStrGen with 3 choice points using sampler choice model
julia> choose(g)
"8X37X"
julia> choose(g)
"X0"
(For convenience, it is also possible to apply choose directly to the generator type itself: choose(NumXStrGen).)
The NumXStrGen generator emits a string consisting of digits and the letter X. The length of the list returned by plus, which item rule-function is executed, and the digit returned by choose are called choice points. The default behaviour is that random choices are made at these choice points. (A powerful feature of DataGenerators is that this behaviour can be changed and refined: see the section 'Choice Models' below.) Therefore each time the generator is run using choose, the string returned will typically differ in length and in its combination of Xs and digits.
Choice Points
Rule Choice Points
Rule choice points occur when two or more rule-functions in the generator have the same name.
The default behaviour is to choose one of the rule-functions at random, with all having the same probability of being chosen.
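As a rough illustration of this default, the following plain-Julia sketch picks uniformly between two alternatives each time an item is needed. It does not use DataGenerators: the names item_x, item_digit and pick_rule are hypothetical, and the package's own mechanism may be implemented differently.

```julia
using Random

# Hypothetical stand-ins for two rule-functions that share the name `item`.
item_x(rng) = 'X'
item_digit(rng) = rand(rng, 0:9)

# Pick one alternative uniformly at random and execute it.
pick_rule(rng, rules) = rand(rng, rules)(rng)

rng = MersenneTwister(1)
rules = [item_x, item_digit]
items = [pick_rule(rng, rules) for _ in 1:10]

# Each item is either the character 'X' or a digit between 0 and 9.
println(join(items))
```

Because the alternative is picked afresh on every call, a single run can mix Xs and digits in any order.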
Sequence Choice Points
Sequence choice points are defined using one of the following constructs:
mult(x) - returns a list (Vector) of zero or more items
plus(x) - returns a list of one or more items
reps(x, a, b) - returns a list with length between a and b inclusive (if b is omitted, then there is no upper bound)
x is often a call to another rule-function, in which case either just the rule-function name (e.g. item) or the full call syntax (item()) may be used. x may also be any expression that returns a value, including a constant.
The default behaviour is to choose the length of list according to a Geometric distribution so that short lists are more likely than longer lists.
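To see why a Geometric distribution favours short lists, the sketch below samples list lengths in plain Julia. It is only an illustration: the function geometric_length, its lower keyword, and the success probability p = 0.5 are assumptions, not the parameters DataGenerators actually uses.

```julia
using Random

# Sample a length as `lower` plus the number of failures before the first
# success, i.e. a shifted Geometric(p) sample. p = 0.5 is an assumed value.
function geometric_length(rng; p = 0.5, lower = 0)
    n = lower
    while rand(rng) >= p
        n += 1
    end
    return n
end

rng = MersenneTwister(42)
mult_lengths = [geometric_length(rng) for _ in 1:10_000]             # like mult: zero or more
plus_lengths = [geometric_length(rng; lower = 1) for _ in 1:10_000]  # like plus: one or more

# With p = 0.5, about half of the mult-style lengths are zero.
println(count(==(0), mult_lengths) / length(mult_lengths))
println(minimum(plus_lengths))
```

Each additional item is half as likely as the previous one under this assumed p, so long lists are possible but increasingly rare.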
Value Choice Points
Value choice points are defined using one of the following constructs:
choose(Bool) - returns true or false (i.e. a value of type Bool). The default behaviour is to choose these values using a Bernoulli distribution such that true and false have the same probability of being chosen.
choose(T, a, b) where T is one of the following numeric types: Int, Int8, Int16, Int32, Int64, UInt, UInt8, UInt16, UInt32, UInt64, Float16, Float32, or Float64 - returns a value of that type that is between a and b. If both a and b are omitted, no bounds are placed on the value chosen. If b is omitted, no upper bound is placed on the value chosen. The default behaviour is to choose values according to a uniform distribution so that all values in the range have the same probability of being chosen.
choose(String, r) - returns a string that conforms to the regular expression r (r should be specified using a Julia String rather than a Regex type). If r is not specified, a variable length string of any characters is returned. (This construct uses the DataGeneratorTranslators package to construct additional generator rule-functions to return values that satisfy the regular expression.)
The Julia built-in function rand could be used instead of choose when random values are required, but choose is preferred since it enables finer control over how the values are chosen (see the section 'Choice Models' below). A call to rand is not identified as a choice point by the @generator macro.
Functions, not Production Rules
Although generator rule-functions resemble the production rules of a formal grammar, they differ in that they are functions written in a Turing complete language, here Julia. This is one of the distinguishing features of DataGenerators. It enables rule-functions to be much more expressive and compact than formal production rules since, like any other Julia functions, rule-functions can use all the features of the Julia standard library and installed packages. For example, the NumXStrGen generator above uses the function join from the standard library to concatenate the items in a list to form a string.
The rule-functions need not be limited to short-form function syntax such as item() = choose(Int,0,9). Longer-form syntax may also be used in generators, including functions with local variables:
julia> @generator FibStrGen begin
           start() = begin
               join(plus(fib), " ")
           end
           function fib()
               fn0 = 0
               fn1 = 1
               for i in 1:choose(Int,0,10)
                   fn2 = fn0 + fn1
                   fn0 = fn1
                   fn1 = fn2
               end
               fn0
           end
       end
julia> choose(FibStrGen)
"55 1 34"
julia> choose(FibStrGen)
"0 21 2 55"
Passing Arguments to Rules
Another difference from the production rules of a formal grammar is that arguments may be passed between rule-functions. A consequence of this is that the generator as a whole, or a subset of its rule-functions, can pass state between them. This mechanism enables constraints between elements within the datum emitted by the generator to be satisfied in a straightforward manner. For example:
julia> @generator DateGen begin
           start() = begin
               y = year()
               m = month()
               d = day(y, m)
               Date(y, m, d)
           end
           year() = choose(Int, 1583, 2999)
           month() = choose(Int, 1, 12)
           day(y, m) = choose(Int, 1, Dates.daysinmonth(Date(y, m)))
       end
(Dates.daysinmonth is a built-in Julia function)
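The constraint that DateGen satisfies can be seen in isolation with a plain-Julia analogue that does not use the package. The function random_date is hypothetical, and rand stands in for choose purely for illustration; the point is that the range for the day depends on the year and month already chosen.

```julia
using Dates, Random

# Plain-Julia analogue of DateGen: the upper bound for the day is computed
# from the year and month chosen earlier, so a date such as 30 February
# cannot be produced.
function random_date(rng)
    y = rand(rng, 1583:2999)
    m = rand(rng, 1:12)
    d = rand(rng, 1:Dates.daysinmonth(Date(y, m)))
    return Date(y, m, d)
end

rng = MersenneTwister(2024)
dates = [random_date(rng) for _ in 1:1000]
println(dates[1:3])
```

Had the day been chosen independently from 1 to 31, some combinations would make the Date constructor throw an error; passing y and m to the day rule avoids this.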
Subgenerators
Generators may call other generators. The generators that are called -- the subgenerators -- are declared as parameters in the generator definition, and then instances of the subgenerators are passed as arguments when an instance of the generator is created. For example:
julia> @generator DictGen(keyGen, valueGen) begin
           start() = Dict(plus(pair))
           pair() = choose(keyGen)=>choose(valueGen)
       end
julia> @generator ShortStringGen begin
           start() = choose(String, "[A-Z]{5,15}")
       end
julia> @generator SmallIntGen begin
           start() = choose(Int16)
       end
julia> sg = ShortStringGen()
data generator ShortStringGen with 2 choice points using sampler choice model
julia> ig = SmallIntGen()
data generator SmallIntGen with 1 choice points using sampler choice model
julia> g = DictGen(sg, ig)
data generator DictGen with 1 choice points using sampler choice model
julia> choose(g)
Dict{String,Int16} with 3 entries:
"DPFMV" => 16152
"MUFIOY" => -17445
"RSNWTD" => 5122
Generation Parameters
Additional named parameters can be passed to choose(g) (where g is a generator instance) to control the generation process:
startrule=<rulename> (default: :start) - generation begins from the specified rule name (which should be specified as a Symbol)
maxchoices=<integer> (default: 10017) - specifies the maximum number of choices to be made, above which an exception is raised
maxruledepth=<integer> (default: 11765) - specifies the maximum rule call depth, above which an exception is raised
maxseqreps=<integer> (default: 4848) - specifies the an upp
