Eventsim
Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic.
Eventsim is a program that generates event data for testing and demos. It's written in Scala, because we are big data hipsters (at least sometimes). It's designed to replicate page requests for a fake music web site (picture something like Spotify); the results look like real usage data, but are totally fake. You can configure the program to create as much data as you want: data for just a few users for a few hours, or data for a huge number of users over many years. You can write the data to files, or pipe it out to Apache Kafka.
You can use the fake data for product development, correctness testing, demos, performance testing, training, or in any other place where a stream of real looking data is useful. You probably shouldn't use this data to research machine learning algorithms, and definitely shouldn't use it to understand how real people behave.
Statistical Model
I wrote this simulator based on observations about how real users behave. I wanted to make sure that data looked real: users would come and go randomly, some users would stay much longer than others, users would be more likely to use the service in the middle of the day than the middle of the night, and much less likely to use the service on weekends and holidays.
To make this work, I did the following:
- If you set the "damping" factors to zero, then users randomly arrive at the site according to a Poisson (memoryless) process, but with a minimum gap of 30 minutes between sessions.
- The time between events is given by a log-normal distribution.
- Once a session has started, the user will randomly traverse a set of states until the session ends. The probability of each state transition (including end of session) depends on the current state.
- On average, users will behave the same way in a session, regardless of the time of day or day of week
- If you enable damping for weekends and holidays, the probability that a user arrives on weekends and holidays drops. By default, the odds are scaled linearly over the course of a few hours around midnight.
- If you enable damping for nighttime, the probability that a user arrives in the middle of the night is lower than the probability that they arrive in the middle of the day. The odds roughly follow a sine wave.
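The undamped part of this model can be sketched in a few lines of Python. This is an illustration only, not Eventsim's actual code (which is written in Scala): the function names and the log-normal shape parameter `sigma` are assumptions, and only the 30-minute minimum gap comes from the description above.

```python
import math
import random

SESSION_GAP = 30 * 60  # minimum gap between sessions, in seconds (from the text)

def next_session_gap(rng, mean_extra_seconds):
    """Seconds until the next session: the minimum gap plus an exponential wait."""
    return SESSION_GAP + rng.expovariate(1.0 / mean_extra_seconds)

def next_event_gap(rng, mean_seconds, sigma=0.5):
    """Log-normally distributed seconds between events.

    sigma is an assumed shape parameter; mu is chosen so that the
    distribution's mean equals mean_seconds.
    """
    mu = math.log(mean_seconds) - sigma ** 2 / 2.0
    return rng.lognormvariate(mu, sigma)

rng = random.Random(1)  # seeded, so the "random" output is reproducible
session_gaps = [next_session_gap(rng, 3600.0) for _ in range(5)]
event_gaps = [next_event_gap(rng, 60.0) for _ in range(5)]
```

Every value in `session_gaps` is at least 1800 seconds, which is what the minimum gap between sessions means in practice.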
How the simulation works
When you run the simulator, it starts by generating a set of users with randomly picked properties. This includes attributes like names and location as well as usage characteristics, like user engagement. Eventsim uses a pseudo-random number generator: the generator is deterministic, but looks random.
You need to specify a configuration file (samples are included in examples). This file
specifies how sessions are generated and how the fake website works. The simulator will also load in a set of data
files that describe distributions for different parameters (like places, song names, and user agents).
The simulator works by creating a priority queue of user sessions, ordered by the timestamp of the next event in each session. The simulator picks each session off the queue, outputs the details of the event, then determines the next event in the session for each user (or creates a new session for the user), and puts the session back in the queue.
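Here is a rough Python sketch of that loop, using `heapq` as the priority queue. It is not the simulator's real code: the `simulate` function, the `(timestamp, user_id)` queue entries, and the exponential stand-in for "determine the next event" are all assumptions made to keep the example short.

```python
import heapq
import random

def simulate(user_ids, end_time, rng):
    """Return (timestamp, user_id) events in time order."""
    # Each queue entry is (timestamp of the next event, user id).
    queue = [(rng.uniform(0, 60), uid) for uid in user_ids]
    heapq.heapify(queue)
    events = []
    while queue:
        # Take the session whose next event is earliest.
        timestamp, uid = heapq.heappop(queue)
        if timestamp > end_time:
            break  # everything left in the queue is later still
        events.append((timestamp, uid))  # "output" the event
        # Stand-in for choosing the next event (or starting a new session),
        # then put the session back in the queue.
        heapq.heappush(queue, (timestamp + rng.expovariate(1 / 30.0), uid))
    return events

events = simulate(user_ids=[1, 2, 3], end_time=600.0, rng=random.Random(7))
```

Because the earliest entry is always handled first, events from different users come out interleaved in timestamp order without a separate sort.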
When the simulator figures out the next event in a session, it looks at all of the possible transitions from the current state to other states. If the total probability of all the transitions is 1.0, then there will always be a next page. However, if the total probability is n < 1.0, then with probability 1.0 - n the session will end, and the user will next be seen in a future session.
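To make the "leftover probability ends the session" rule concrete, here is a small Python sketch. The `next_state` function and the example pages and probabilities are hypothetical; they are not taken from the sample config.

```python
import random

def next_state(transitions, rng):
    """Pick the next state, or return None if the session ends.

    transitions is a list of (state, probability) pairs out of the current
    state, summing to n <= 1.0. The session ends with probability 1.0 - n.
    """
    draw = rng.random()  # uniform in [0.0, 1.0)
    cumulative = 0.0
    for state, probability in transitions:
        cumulative += probability
        if draw < cumulative:
            return state
    return None  # the draw fell in the leftover mass: end of session

# Hypothetical transitions out of a "Home" page. They sum to 0.9,
# so roughly 10% of draws end the session.
home = [("NextSong", 0.6), ("Settings", 0.2), ("Logout", 0.1)]
rng = random.Random(42)
outcomes = [next_state(home, rng) for _ in range(10000)]
ended_fraction = outcomes.count(None) / len(outcomes)
```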
Most of the time, the next event will occur at a non-deterministic, log-normally distributed time after the current event. But there are two exceptions: "nextSong" events and redirects. The next song events will typically occur after the duration of the current song. Redirects occur at a fixed time afterwards (we did this to model the action of submitting a form then being redirected to a new page).
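The three timing cases can be written as one small function. This is a sketch under assumptions: the one-second `REDIRECT_DELAY`, the `sigma` shape parameter, and the function signature are made up for illustration, and the real simulator may decide these cases differently.

```python
import math
import random

REDIRECT_DELAY = 1.0  # assumed fixed delay after a redirect, in seconds

def next_event_delay(next_page, is_redirect, current_song_duration,
                     mean_gap, rng, sigma=0.5):
    """Seconds until the next event, following the three cases in the text."""
    if is_redirect:
        return REDIRECT_DELAY  # fixed time after a form submission
    if next_page == "nextSong" and current_song_duration is not None:
        return current_song_duration  # wait for the current song to finish
    # Otherwise: a log-normally distributed gap whose mean is mean_gap.
    mu = math.log(mean_gap) - sigma ** 2 / 2.0
    return rng.lognormvariate(mu, sigma)

rng = random.Random(5)
after_song = next_event_delay("nextSong", False, 215.0, 60.0, rng)
after_redirect = next_event_delay("Home", True, None, 60.0, rng)
after_click = next_event_delay("Home", False, None, 60.0, rng)
```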
By default, song titles are picked randomly based on how popular they are. But optionally, the simulator can use
data on similar songs to pick sequences of similar songs. (To do this, you need the similar songs data file. That file
was too big to include, but we included the utility to generate it. Run eventsim with the generate-similars
option to create it.)
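Picking titles "based on how popular they are" amounts to weighted random sampling. A minimal Python sketch follows, with made-up listen counts standing in for the simulator's data files.

```python
import random

# Hypothetical listen counts; the real simulator loads its own data files.
listen_counts = {"Song A": 900, "Song B": 90, "Song C": 10}

def pick_song(counts, rng):
    """Pick one title with probability proportional to its listen count."""
    titles = list(counts)
    weights = [counts[title] for title in titles]
    return rng.choices(titles, weights=weights, k=1)[0]

rng = random.Random(3)
picks = [pick_song(listen_counts, rng) for _ in range(10000)]
```

With these weights, "Song A" should be picked about 90% of the time, so popular songs dominate the output while rare ones still show up occasionally.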
By the way: the current version of the simulator is hard-coded for a music web site. You can modify it to work for other types of sites, but doing so will probably require modifications to the code (and not just to the config files).
Config File
Take a look at the sample config file. It's a JSON file, with key-value pairs. Here is an explanation of the values (many of which match command line options):
- seed: Seed for the pseudo-random number generator. Changing this value will change the output (all other parameters being equal).
- alpha: The expected number of seconds between events for a user. This is randomly generated from a lognormal distribution.
- beta: The expected session interarrival time (in seconds). This is thirty minutes plus a randomly selected value from an exponential distribution.
- damping: Controls the depth of daily cycles (larger values yield stronger cycles, smaller values yield milder ones).
- weekend-damping: Controls the difference between weekday and weekend traffic volume.
- weekend-damping-offset: Controls when the weekend/holiday starts (relative to midnight), in minutes.
- weekend-damping-scale: Controls how long traffic tapering lasts, in minutes.
- session-gap: Minimum time between sessions, in seconds.
- start-date: Start date for data (in ISO8601 format).
- end-date: End date for data (in ISO8601 format).
- n-users: Number of users at start-date.
- first-user-id: User id assigned to the first user (ids are assigned sequentially).
- growth-rate: Annual growth rate for users.
- tag: Tag added to each line of the output.
You also specify the event state machine. Each state includes a page, an HTTP status code, a user level, and an authentication status. Use the status to describe where a user stands with the site: unregistered, logged in, logged out, cancelled, etc. Use the page to describe which page the user is on. Here is how you specify the state machine:
- Transitions. Describe the pair of page and status before and after each transition, and the probability of the transition.
- New user. Describes the page and status for each new user (and probability of arriving for the first time with one of those states).
- New session. Describes the page and status for each new session.
- Show user details. For each status, states whether or not users are shown in the generated event log.
When you run the simulator, you specify the mean values for alpha and beta and the simulator picks values for specific users.
Usage
To build the executable, run
$ sbt assembly
$ # make sure the script is executable
$ chmod +x bin/eventsim
(By the way, eventsim requires Java 8.)
The program can accept a number of command line options:
$ bin/eventsim --help
  -a, --attrition-rate <arg>   annual user attrition rate (as a fraction of
                               current, so 1% => 0.01) (default = 0.0)
  -c, --config <arg>           config file
      --continuous             continuous output
      --nocontinuous           run all at once
  -e, --end-time <arg>         end time for data
                               (default = 2015-08-12T14:56:25.006)
  -f, --from <arg>             from x days ago (default = 15)
      --generate-counts        generate listen counts file then stop
      --nogenerate-counts      run normally
      --generate-similars      generate similar song file then stop
      --nogenerate-similars    run normally
  -g, --growth-rate <arg>      annual user growth rate (as a fraction of
                               current, so 1% => 0.01) (default = 0.0)
      --kafkaBrokerList <arg>  kafka broker list
  -k, --kafkaTopic <arg>       kafka topic
  -n, --nusers <arg>           initial number of users (default = 1)
  -r, --randomseed <arg>       random seed
  -s, --start-time <arg>       start time for data
                               (default = 2015-08-05T14:56:25.040)
      --tag <arg>              tag applied to each line (for example, A/B test
                               group)
  -t, --to <arg>               to y days ago (default = 1)
  -u, --userid <arg>           first user id (default = 1)
      --help                   Show help message

 trailing arguments:
  output-file (not required)   File name
Only the config file is required.
Parameters can be specified in three ways: you can accept the default value, you can specify them in the config file, or you can specify them on the command line. Config file values override defaults; command line options override everything.
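The precedence rule is a layered lookup. Here is one way to picture it in Python, using `ChainMap`. The default values are the ones shown in the help text above; the config file and command line values are made up, and key names are simplified to the command line spellings.

```python
from collections import ChainMap

# Defaults as shown by --help.
defaults = {"nusers": 1, "growth-rate": 0.0, "from": 15, "to": 1}
# Hypothetical values read from a config file.
config_file = {"nusers": 500, "growth-rate": 0.05}
# Hypothetical values given on the command line.
command_line = {"nusers": 1000}

# ChainMap searches its maps left to right, so earlier maps win.
settings = ChainMap(command_line, config_file, defaults)

n_users = settings["nusers"]           # from the command line
growth_rate = settings["growth-rate"]  # from the config file
from_days = settings["from"]           # falls through to the default
```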
Example for about 2.5 M events (1000 users for a year, growing at 1% annually):
$ bin/eventsim -c "examples/site.json" --from 365 --nusers 1000 --growth-rate 0.01 data/fake.json
Initial number of users: 1000, Final number of users: 1010
Starting to generate events.
Damping=0.0625, Weekend-Damping=0.5
Start: 2013-10-06T06:27:10, End: 2014-10-05T06:27:10, Now: 2014-10-05T06:27:07, Events:2468822
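The user counts in that output are consistent with simple annual growth: 1000 users growing at 1% for one year gives 1010. A quick Python check follows. The `final_users` helper is hypothetical, and compounding over multiple years is an assumption, since the output above only covers one year with no attrition.

```python
def final_users(initial_users, annual_growth_rate, years):
    """Estimate the final user count, assuming compound annual growth."""
    return round(initial_users * (1.0 + annual_growth_rate) ** years)

estimate = final_users(1000, 0.01, 1.0)  # matches "Final number of users: 1010"
```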
Example for more events (30,000 users for a year, growing at 30% annually):
$ bin/eventsim -c "examples/site.json" --from 365 --nusers 30000 --growth-rate 0.30 data/fake.json
Building huge data sets in parallel
You can run multiple instances of this application simultaneously if you need to generate a lot of da
