DataGenerators.jl

DataGenerators is a data generation package. It can use techniques for search and optimization to find effective data for uses such as testing.

DataGenerators

DataGenerators is a data generation package. It enables the generators for structured data to be defined in a natural manner as Julia functions. It can apply search and optimisation techniques to find data that, for example, can improve software testing by generating more effective test data.

You can write your own data generators utilizing the full power of Julia, or use the DataGeneratorTranslators package to automatically create data generators from specifications such as Backus-Naur Form (BNF), XML Schema Definition (XSD), and regular expressions.

Installation

Install by cloning the package directly from GitHub -- including two packages it requires, DataGeneratorTranslators and BaseTestMulti -- from a Julia REPL:

julia> Pkg.clone("https://github.com/simonpoulding/BaseTestMulti.jl")
julia> Pkg.clone("https://github.com/simonpoulding/DataGeneratorTranslators.jl")
julia> Pkg.clone("https://github.com/simonpoulding/DataGenerators.jl")

Usage

Don't forget to load the package:

julia> using DataGenerators

Generators

A First Example

A generator consists of rules which are written as Julia functions, and are defined using the @generator macro:

julia> @generator NumXStrGen begin
	start() = join(plus(item()))
	item() = 'X'
	item() = choose(Int,0,9)
end

Data generation begins by executing the start rule-function, which in turn calls other rule-functions and accepts the values they return. The value returned by the start rule-function is the value emitted when the generator is run.

Here the start rule-function uses plus, a special DataGenerators construct that creates a list of length 1 or more by repeatedly executing its argument. The length of the list is decided each time plus is executed.

The argument of plus is a call to the item rule-function. There are two item rule-functions defined in the generator, and which one of the two is executed is decided each time the rule-function is called.

The second item rule-function uses another DataGenerators construct, choose, to select a value from a data type: here an integer between 0 and 9. Again, which value is returned is decided each time choose is executed.

The macro creates the generator as a Julia type, in this case named NumXStrGen. To run the generator so that it emits a datum, first create an instance of the generator, and then call choose on that instance:

julia> g = NumXStrGen()
data generator NumXStrGen with 3 choice points using sampler choice model

julia> choose(g)
"8X37X"
 
julia> choose(g)
"X0"
 

(For convenience, it is also possible to apply choose directly to the generator type itself: choose(NumXStrGen).)

The NumXStrGen generator emits a string consisting of digits and the letter X. The length of the list returned by plus, which item rule-function is executed, and the digit returned by choose are called choice points. The default behaviour is that random choices are made at these choice points. (A powerful feature of DataGenerators is that this behaviour can be changed and refined: see the section 'Choice Models' below.) Therefore each time the generator is run using choose, it may return a string of a different length with a different combination of Xs and digits.

Choice Points

Rule Choice Points

Rule choice points occur when two or more rule-functions in the generator have the same name.

The default behaviour is to choose one of the rule-functions at random, with all having the same probability of being chosen.
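To make this default concrete, the sketch below mimics a rule choice point in plain Julia, without using DataGenerators. The names item_rules and pick_rule are hypothetical, and the package's own mechanism may well differ; the sketch only illustrates choosing uniformly among alternatives that share a name.

```julia
using Random

# Two hypothetical alternatives standing in for the two `item` rule-functions.
item_rules = [
    rng -> 'X',             # first alternative: the constant 'X'
    rng -> rand(rng, 0:9),  # second alternative: a digit between 0 and 9
]

# Choose one alternative uniformly at random and execute it.
pick_rule(rng::AbstractRNG, rules) = rand(rng, rules)(rng)

rng = MersenneTwister(2024)
items = [pick_rule(rng, item_rules) for _ in 1:1000]

# With equal probabilities, roughly half of the items should be 'X'.
println(count(==('X'), items) / length(items))
```

With three same-named alternatives, each would be picked about a third of the time under the same scheme.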

Sequence Choice Points

Sequence choice points are defined using one of the following constructs:

  • mult(x) - returns a list (Vector) of zero or more items
  • plus(x) - returns a list of one or more items
  • reps(x, a, b) - returns a list with length between a and b inclusive (if b is omitted, then there is no upper bound)

x is often a call to another rule-function, in which case either just the rule-function name (e.g. item) or the full call syntax (item()) may be used. x may also be any expression that returns a value, including a constant.

The default behaviour is to choose the length of the list according to a geometric distribution so that short lists are more likely than longer lists.
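As a rough illustration of how a geometric distribution favours short lists, here is a plain-Julia sketch that does not use DataGenerators. The helper names (sample_length, sketch_mult, sketch_plus, sketch_reps), the stopping probability p = 0.5, and the way the upper bound is enforced are all assumptions made for the example, not documented behaviour of the package.

```julia
using Random

# Hypothetical helper: start at the lower bound `lo` and keep adding one more
# item with probability 1 - p, which gives a geometric number of extra items.
# If an upper bound `hi` is given, the length is simply capped there.
function sample_length(rng::AbstractRNG, lo::Int, hi::Union{Int,Nothing} = nothing; p::Float64 = 0.5)
    n = lo
    while (hi === nothing || n < hi) && rand(rng) >= p
        n += 1
    end
    return n
end

# Plain-Julia analogues of the three constructs; `f` is a zero-argument function.
sketch_mult(rng, f) = [f() for _ in 1:sample_length(rng, 0)]
sketch_plus(rng, f) = [f() for _ in 1:sample_length(rng, 1)]
sketch_reps(rng, f, a, b = nothing) = [f() for _ in 1:sample_length(rng, a, b)]

rng = MersenneTwister(7)
println(length.([sketch_plus(rng, () -> 'X') for _ in 1:10]))
```

Under these assumptions, each additional item is half as likely as the one before it, so very long lists are possible but rare.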

Value Choice Points

Value choice points are defined using one of the following constructs:

  • choose(Bool) - returns true or false (i.e. a value of type Bool). The default behaviour is to choose these values using a Bernoulli distribution such that true and false have the same probability of being chosen.
  • choose(T, a, b) where T is one of the following numeric types: Int, Int8, Int16, Int32, Int64, UInt, UInt8, UInt16, UInt32, UInt64, Float16, Float32, or Float64 - returns a value of that type that is between a and b. If both a and b are omitted, no bounds are placed on the value chosen. If b is omitted, no upper bound is placed on the value chosen. The default behaviour is to choose from values according to a uniform distribution so that all values in the range have the same probability of being chosen.
  • choose(String, r) - returns a string that conforms to the regular expression r (r should be specified using a Julia String rather than a Regex type). If r is not specified, a variable length string of any characters is returned. (This construct uses the DataGeneratorTranslators package to construct additional generator rule-functions to return values that satisfy the regular expression.)

The Julia built-in function rand could be used instead of choose when random values are required, but choose is preferred since it enables finer control over how the values are chosen (see the section 'Choice Models' below). A call to rand is not identified as a choice point by the @generator macro.

Functions, not Production Rules

Although generator rule-functions resemble the production rules of a formal grammar, they differ in that they are functions written in a Turing-complete language, here Julia. This is one of the distinguishing features of DataGenerators. It enables rule-functions to be much more expressive and compact than formal production rules since, like any other Julia functions, rule-functions can use all the features of the Julia standard library and installed packages. For example, the NumXStrGen generator above uses the function join from the standard library to concatenate the items in a list to form a string.

The rule-functions need not be limited to short-form function syntax such as item() = choose(Int,0,9). Longer-form syntax may also be used in generators, including functions with local variables:

julia> @generator FibStrGen begin
	start() = begin
		join(plus(fib), " ")
	end
	function fib()
		fn0 = 0
		fn1 = 1
		for i in 1:choose(Int,0,10)
			fn2 = fn0 + fn1
			fn0 = fn1
			fn1 = fn2
		end
		fn0
	end
end

julia> choose(FibStrGen)
"55 1 34"

julia> choose(FibStrGen)
"0 21 2 55"

Passing Arguments to Rules

Another difference from the production rules of a formal grammar is that arguments may be passed between rule-functions. A consequence of this is that the rule-functions of a generator, or a subset of them, can pass state between one another. This mechanism enables constraints between elements within the datum emitted by the generator to be satisfied in a straightforward manner. For example:

julia> @generator DateGen begin
	start() = begin
		y = year()
		m = month()
		d = day(y, m)
		Date(y, m, d)
	end
	year() = choose(Int, 1583, 2999)
	month() = choose(Int, 1, 12)
	day(y, m) = choose(Int, 1, Dates.daysinmonth(Date(y, m)))
end

(Dates.daysinmonth is a built-in Julia function)
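The same constraint can be sketched in plain Julia, without DataGenerators, to show why passing the year and month to the day rule matters. The function sample_date is a hypothetical stand-in for the generator above, and rand replaces choose purely for illustration.

```julia
using Dates, Random

# Hypothetical plain-Julia version of the DateGen rules: the day is drawn
# only after the year and month are known, so the range of days is always valid.
function sample_date(rng::AbstractRNG)
    y = rand(rng, 1583:2999)
    m = rand(rng, 1:12)
    d = rand(rng, 1:Dates.daysinmonth(Date(y, m)))
    return Date(y, m, d)
end

rng = MersenneTwister(3)
println(sample_date(rng))
```

If the day were instead drawn independently from 1:31, combinations such as 30 February would occasionally be produced, and the Date constructor would throw an ArgumentError.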

Subgenerators

Generators may call other generators. The generators that are called -- the subgenerators -- are declared as parameters in the generator definition, and then instances of the subgenerators are passed as arguments when an instance of the generator is created. For example:

julia> @generator DictGen(keyGen, valueGen) begin
	start() = Dict(plus(pair))
	pair() = choose(keyGen)=>choose(valueGen)
end

julia> @generator ShortStringGen begin
	start() = choose(String, "[A-Z]{5,15}")
end

julia> @generator SmallIntGen begin
	start() = choose(Int16)
end
	
julia> sg = ShortStringGen()
data generator ShortStringGen with 2 choice points using sampler choice model

julia> ig = SmallIntGen()
data generator SmallIntGen with 1 choice points using sampler choice model

julia> g = DictGen(sg, ig)
data generator DictGen with 1 choice points using sampler choice model

julia> choose(g)
Dict{String,Int16} with 3 entries:
  "DPFMV"  => 16152
  "MUFIOY" => -17445
  "RSNWTD" => 5122
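The idea of composing generators can also be sketched in plain Julia by passing functions as arguments. Everything in this sketch is hypothetical (short_string, small_int, sample_dict, and the fixed number of entries n); it does not use DataGenerators and only mirrors the shape of DictGen(sg, ig).

```julia
using Random

# Hypothetical sub-generators written as ordinary functions of a random number generator.
short_string(rng) = String(rand(rng, 'A':'Z', rand(rng, 5:15)))  # 5 to 15 capital letters
small_int(rng) = rand(rng, Int16)                                 # any Int16 value

# The composite receives its sub-generators as arguments, much as DictGen
# receives keyGen and valueGen. A fixed entry count `n` keeps the sketch short.
function sample_dict(rng, keygen, valuegen; n::Int = 3)
    return Dict(keygen(rng) => valuegen(rng) for _ in 1:n)
end

rng = MersenneTwister(11)
d = sample_dict(rng, short_string, small_int)
println(d)
```

Swapping in a different key or value function changes the kind of dictionary produced without touching sample_dict, which is the benefit the subgenerator parameters provide.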

Generation Parameters

Additional named parameters can be passed to choose(g) (where g is a generator instance) to control the generation process:

  • startrule=<rulename> (default: :start) - generation begins from specified rule name (which should be specified as a Symbol)
  • maxchoices=<integer> (default: 10017) - specifies the maximum number of choices to be made, above which an exception is raised
  • maxruledepth=<integer> (default: 11765) - specifies the maximum rule call depth, above which an exception is raised
  • maxseqreps=<integer> (default: 4848) - specifies the an upp