Streaminer

A collection of algorithms for mining data streams, including frequent itemsets, quantiles, sampling, moving averages, set membership and cardinality.

Releases

Maven:

<dependency>
    <groupId>com.github.mayconbordin</groupId>
    <artifactId>streaminer</artifactId>
    <version>1.1.1</version>
</dependency>

API Documentation

Frequent Itemsets

Algorithms

CountSketch [1]
CountMinSketch [2]
LossyCounting [3]
Majority [4]
MisraGries [5]
SpaceSaving [6]
StickySampling [3]
RealCounting
SimpleTopKCounting
TimeDecayCountMinSketch
TimeDecayRealCounting
AMSSketch
CCFCSketch
CGT

Usage

Except for the CountMinSketchAlt class, all algorithms implement the IRichFrequency interface. Here's an example using the SpaceSaving algorithm:

Random r = new Random();
int counters = 20;
double support = 0.01;
double maxError = 0.1;

IRichFrequency<Integer> counter = new SpaceSaving<Integer>(counters, support, maxError);
for (int i=0 i<1000; i++) {
    counter.add(r.nextInt(100), 1);
}

// get the top 10 items
List<CountEntry<Integer>> topk = counter.peek(10);

// print the items
for (CountEntry<Integer> item : topk) {
    System.out.println(item);
}

// get the frequency of a single item
int item = 25;
long freq = counter.estimateCount(item);
System.out.println(item + ": " + freq);

Time Decaying Algorithms

TimeDecayRealCounting and TimeDecayCountMinSketch are algorithms that use a decay function to update the current values of their counts in order to give more importance to newer values, while older values will slowly fade away.

The decay function implements the DecayFormula interface. Currently there are three implementations: the exponential (ExpDecayFormula), the linear (LinDecayFormula), and the logarithmic (LogDecayFormula).

Those counting algorithms implement a different interface called ITimeDecayFrequency, as both methods for adding and estimating the frequency need an additional argument, the timestamp.

Top-K

Algorithms

StreamSummary [6]
ConcurrentStreamSummary
Frequent
StochasticTopper

Usage

The basic usage of a Top-K algorithm is basically the same as the frequent itemset, except that these algorithms do not support the estimateCount method.

ITopK<String> counter = new StreamSummary<String>(3);

String[] stream = {"X", "X", "Y", "Z", "A", "B", "C", "X", "X", "A", "C", "A", "A"};
for (String i : stream) {
    counter.add(i);
}

List<CountEntry<String>> topk = counter.peek(3);
for (CountEntry<String> item : topk) {
    System.out.println(item);
}

Quantiles

Algorithms

CKMSQuantiles [7]
Frugal2U [8]
GKQuantiles [9]
MPQuantiles [10]
QDigest [11]
WindowSketchQuantiles [12]
RSSQuantiles [13]
EnsembleQuantiles
ExactQuantiles
ExactQuantilesAll
SimpleQuantiles
SumQuantiles
TDigest

Usage

double[] quantiles = new double[]{0.05, 0.25, 0.5, 0.75, 0.95};
IQuantiles<Integer> instance = new Frugal2U(quantiles, 0);

RandomEngine r = new MersenneTwister64(0);
Normal dist = new Normal(100, 50, r);
int numSamples = 1000;
        
for(int i = 0; i < numSamples; ++i) {
    int num = (int) Math.max(0, dist.nextDouble());
    instance.offer(num);
}

for (double q : quantiles) {
    System.out.println(q + ": " + instance.getQuantile(q));
}

Cardinality

Algorithms

AdaptiveCounting [14]
LogLog [15]
HyperLogLog [16]
HyperLogLogPlus [17]
LinearCounting [18]
CountThenEstimate
BJKST [26]
FlajoletMartin [27]
KMinCount

Usage

ICardinality card = new LogLog(8);

for (int i=0; i<100; i++) {
    card.offer(Math.random()*100.0);
}

System.out.println("Cardinality: " + card.cardinality());

Average

Algorithms

MovingAverage
ExponentialMovingAverage
SimpleEWMA
VariableEWMA
TEWMA [25]

Usage

// create a EWMA with 15 seconds of age for the metrics in the period
IAverage avg = new VariableEWMA(15.0);

for (int i=0; i<100; i++) {
    avg.add(Math.random()*100.0);
    if (i%10 == 0)
        System.out.println("Average: " + avg.getAverage());
}

Membership

Algorithms

BloomFilter [22]
BloomFilterAlt (alternative implementation)
CountingBloomFilter [19]
VarCountingBloomFilter (with variable bucketsPerWord)
DynamicBloomFilter [20]
RetouchedBloomFilter [21]
StableBloomFilter [23]
TimingBloomFilter [24]
ODTDBloomFilter [28]

Usage

IFilter bloom = new BloomFilter(1000, 32, Hash.MURMUR_HASH);

for (int i = 0; i < 100; i++) {
    String val = UUID.randomUUID().toString();
    Key k = new Key(val.getBytes());
    bloom.add(k);
    System.out.println(val + " exists? " + bloom.membershipTest(k));
}

Sampling

Algorithms

BernoulliSampler
ChainSampler [29]
ReservoirSampler
SystematicSampler
WRSampler (With Replacement)
WeightedRandomSampler
L0Sampler [30]
SpaceSavingSampler
FrequentSampler

Usage

// Create a sampler with 30% probability
ISampler sampler = new BernoulliSampler(0.3);

Random rand = new Random();

// Create a dummy stream of ints
List<Integer> stream = new ArrayList<Integer>(1000);
for (int i=0; i<1000; i++)
    stream.add(rand.nextInt(100));

for (Integer tuple : stream) {
    if (sampler.next()) {
        // tuple was sampled, do something
    } else {
        // tuple was ignored, move on
    }
}

Classifiers

Algorithms

Perceptron
NaiveBayes
NaiveBayesWOP
BoundedBayes
LossyBayes
MultiBayes
MultiLossyBayes
MultiTopKBayes
SticySamplingBayes
TopKBayes
MajorityClass
RandomClassifier
MultiRandomClassifier
AROWClassifier (Adaptive Regularization of Weight Vectors) [32]
BWinnowClassifier (Balanced Winnow Classifier) [33]
PAClassifier, MultiClassPAClassifier [34]
WinnowClassifier

Usage

NaiveBayes nb = new NaiveBayes();
nb.setLabelAttribute("play");
            
ICsvListReader listReader = new CsvListReader(
        new FileReader("src/test/resources/golf.csv"), 
        CsvPreference.EXCEL_NORTH_EUROPE_PREFERENCE);

listReader.getHeader(true);

List<String> list;
while( (list = listReader.read()) != null ) {
    Data data = new DataImpl();
    data.put("outlook", list.get(0));
    data.put("temperature", Integer.parseInt(list.get(1)));
    data.put("humidity", Integer.parseInt(list.get(2)));
    data.put("wind", Boolean.parseBoolean(list.get(3)));
    data.put("play", list.get(4));

    nb.learn(data);
}

Data test = new DataImpl();
test.put("outlook", "sunny");
test.put("temperature", "cool");
test.put("humidity", "high");
test.put("windy", "TRUE");

String prediction = nb.predict(test);
System.out.println("Item is: " + test);
System.out.println("Prediction is: " + prediction);

Clustering

Algorithms

K-Means
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) [31]

References

[1] <a name="ref1"></a>Charikar, Moses, Kevin Chen, and Martin Farach-Colton. "Finding frequent items in data streams." Automata, Languages and Programming. Springer Berlin Heidelberg, 2002. 693-703.

[2] <a name="ref2"></a>Cormode, Graham, and S. Muthukrishnan. "An improved data stream summary: the count-min sketch and its applications." Journal of Algorithms 55.1 (2005): 58-75.

[3] <a name="ref3"></a>Manku, Gurmeet Singh, and Rajeev Motwani. "Approximate frequency counts over data streams." Proceedings of the 28th international conference on Very Large Data Bases. VLDB Endowment, 2002.

[4] <a name="ref4"></a>M. J. Fischer and S. L. Salzberg. "Finding a Majority Among N Votes: Solution to Problem 81-5(Journal of Algorithms, June 1981)", Journal of Algorithms, 3:4, December 1982, pp. 362-380.

[5] <a name="ref5"></a>Misra, Jayadev, and David Gries. "Finding repeated elements." Science of computer programming 2.2 (1982): 143-152.

[6] <a name="ref6"></a>Metwally, Ahmed, Divyakant Agrawal, and Amr El Abbadi. "Efficient computation of frequent and top-k elements in data streams." Database Theory-ICDT 2005. Springer Berlin Heidelberg, 2005. 398-412.

[7] <a name="ref7"></a>Cormode, Graham, et al. "Effective computation of biased quantiles over data streams." Data Engineering, 2005. ICDE 2005. Proceedings. 21st International Conference on. IEEE, 2005.

[8] <a name="ref8"></a>Ma, Qiang, S. Muthukrishnan, and Mark Sandler. "Frugal Streaming for Estimating Quantiles." Space-Efficient Data Structures, Streams, and Algorithms. Springer Berlin Heidelberg, 2013. 77-96.

[9] <a name="ref9"></a>Greenwald, Michael, and Sanjeev Khanna. "Space-efficient online computation of quantile summaries." ACM SIGMOD Record. Vol. 30. No. 2. ACM, 2001.

[10] <a name="ref10"></a>Munro, J. Ian, and Mike S. Paterson. "Selection and sorting with limited storage." Theoretical computer science 12.3 (1980): 315-323.

[11] <a name="ref11"></a>Shrivastava, Nisheeth, et al. "Medians and beyond: new aggregation techniques for sensor networks." Proceedings of the 2nd international conference on Embedded networked sensor systems. ACM, 2004.

[12] <a name="ref12"></a>Arasu, Arvind, and Gurmeet Singh Manku. "Approximate counts and quantiles over sliding windows." Proceedings of the twenty-third ACM SIGMOD-SIGACT-S

Streaminer

Install / Use

README

Streaminer

Releases

API Documentation

Frequent Itemsets

Algorithms

Usage

Time Decaying Algorithms

Top-K

Algorithms

Usage

Quantiles

Algorithms

Usage

Cardinality

Algorithms

Usage

Average

Algorithms

Usage

Membership

Algorithms

Usage

Sampling

Algorithms

Usage

Classifiers

Algorithms

Usage

Clustering

Algorithms

References