SkillAgentSearch skills...

Streaminer

A collection of algorithms for mining data streams

Install / Use

/learn @mayconbordin/Streaminer
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

Streaminer

Download

A collection of algorithms for mining data streams, including frequent itemsets, quantiles, sampling, moving averages, set membership and cardinality.

Releases

Maven:

<dependency>
    <groupId>com.github.mayconbordin</groupId>
    <artifactId>streaminer</artifactId>
    <version>1.1.1</version>
</dependency>

API Documentation

Frequent Itemsets

Algorithms

  • CountSketch [1]
  • CountMinSketch [2]
  • LossyCounting [3]
  • Majority [4]
  • MisraGries [5]
  • SpaceSaving [6]
  • StickySampling [3]
  • RealCounting
  • SimpleTopKCounting
  • TimeDecayCountMinSketch
  • TimeDecayRealCounting
  • AMSSketch
  • CCFCSketch
  • CGT

Usage

Except for the CountMinSketchAlt class, all algorithms implement the IRichFrequency interface. Here's an example using the SpaceSaving algorithm:

Random r = new Random();
int counters = 20;
double support = 0.01;
double maxError = 0.1;

IRichFrequency<Integer> counter = new SpaceSaving<Integer>(counters, support, maxError);
for (int i=0 i<1000; i++) {
    counter.add(r.nextInt(100), 1);
}

// get the top 10 items
List<CountEntry<Integer>> topk = counter.peek(10);

// print the items
for (CountEntry<Integer> item : topk) {
    System.out.println(item);
}

// get the frequency of a single item
int item = 25;
long freq = counter.estimateCount(item);
System.out.println(item + ": " + freq);

Time Decaying Algorithms

TimeDecayRealCounting and TimeDecayCountMinSketch are algorithms that use a decay function to update the current values of their counts in order to give more importance to newer values, while older values will slowly fade away.

The decay function implements the DecayFormula interface. Currently there are three implementations: the exponential (ExpDecayFormula), the linear (LinDecayFormula), and the logarithmic (LogDecayFormula).

Those counting algorithms implement a different interface called ITimeDecayFrequency, as both methods for adding and estimating the frequency need an additional argument, the timestamp.

Top-K

Algorithms

  • StreamSummary [6]
  • ConcurrentStreamSummary
  • Frequent
  • StochasticTopper

Usage

The basic usage of a Top-K algorithm is basically the same as the frequent itemset, except that these algorithms do not support the estimateCount method.

ITopK<String> counter = new StreamSummary<String>(3);

String[] stream = {"X", "X", "Y", "Z", "A", "B", "C", "X", "X", "A", "C", "A", "A"};
for (String i : stream) {
    counter.add(i);
}

List<CountEntry<String>> topk = counter.peek(3);
for (CountEntry<String> item : topk) {
    System.out.println(item);
}

Quantiles

Algorithms

  • CKMSQuantiles [7]
  • Frugal2U [8]
  • GKQuantiles [9]
  • MPQuantiles [10]
  • QDigest [11]
  • WindowSketchQuantiles [12]
  • RSSQuantiles [13]
  • EnsembleQuantiles
  • ExactQuantiles
  • ExactQuantilesAll
  • SimpleQuantiles
  • SumQuantiles
  • TDigest

Usage

double[] quantiles = new double[]{0.05, 0.25, 0.5, 0.75, 0.95};
IQuantiles<Integer> instance = new Frugal2U(quantiles, 0);

RandomEngine r = new MersenneTwister64(0);
Normal dist = new Normal(100, 50, r);
int numSamples = 1000;
        
for(int i = 0; i < numSamples; ++i) {
    int num = (int) Math.max(0, dist.nextDouble());
    instance.offer(num);
}

for (double q : quantiles) {
    System.out.println(q + ": " + instance.getQuantile(q));
}

Cardinality

Algorithms

  • AdaptiveCounting [14]
  • LogLog [15]
  • HyperLogLog [16]
  • HyperLogLogPlus [17]
  • LinearCounting [18]
  • CountThenEstimate
  • BJKST [26]
  • FlajoletMartin [27]
  • KMinCount

Usage

ICardinality card = new LogLog(8);

for (int i=0; i<100; i++) {
    card.offer(Math.random()*100.0);
}

System.out.println("Cardinality: " + card.cardinality());

Average

Algorithms

  • MovingAverage
  • ExponentialMovingAverage
  • SimpleEWMA
  • VariableEWMA
  • TEWMA [25]

Usage

// create a EWMA with 15 seconds of age for the metrics in the period
IAverage avg = new VariableEWMA(15.0);

for (int i=0; i<100; i++) {
    avg.add(Math.random()*100.0);
    if (i%10 == 0)
        System.out.println("Average: " + avg.getAverage());
}

Membership

Algorithms

  • BloomFilter [22]
  • BloomFilterAlt (alternative implementation)
  • CountingBloomFilter [19]
  • VarCountingBloomFilter (with variable bucketsPerWord)
  • DynamicBloomFilter [20]
  • RetouchedBloomFilter [21]
  • StableBloomFilter [23]
  • TimingBloomFilter [24]
  • ODTDBloomFilter [28]

Usage

IFilter bloom = new BloomFilter(1000, 32, Hash.MURMUR_HASH);

for (int i = 0; i < 100; i++) {
    String val = UUID.randomUUID().toString();
    Key k = new Key(val.getBytes());
    bloom.add(k);
    System.out.println(val + " exists? " + bloom.membershipTest(k));
}

Sampling

Algorithms

  • BernoulliSampler
  • ChainSampler [29]
  • ReservoirSampler
  • SystematicSampler
  • WRSampler (With Replacement)
  • WeightedRandomSampler
  • L0Sampler [30]
  • SpaceSavingSampler
  • FrequentSampler

Usage

// Create a sampler with 30% probability
ISampler sampler = new BernoulliSampler(0.3);

Random rand = new Random();

// Create a dummy stream of ints
List<Integer> stream = new ArrayList<Integer>(1000);
for (int i=0; i<1000; i++)
    stream.add(rand.nextInt(100));

for (Integer tuple : stream) {
    if (sampler.next()) {
        // tuple was sampled, do something
    } else {
        // tuple was ignored, move on
    }
}

Classifiers

Algorithms

  • Perceptron
  • NaiveBayes
  • NaiveBayesWOP
  • BoundedBayes
  • LossyBayes
  • MultiBayes
  • MultiLossyBayes
  • MultiTopKBayes
  • SticySamplingBayes
  • TopKBayes
  • MajorityClass
  • RandomClassifier
  • MultiRandomClassifier
  • AROWClassifier (Adaptive Regularization of Weight Vectors) [32]
  • BWinnowClassifier (Balanced Winnow Classifier) [33]
  • PAClassifier, MultiClassPAClassifier [34]
  • WinnowClassifier

Usage

NaiveBayes nb = new NaiveBayes();
nb.setLabelAttribute("play");
            
ICsvListReader listReader = new CsvListReader(
        new FileReader("src/test/resources/golf.csv"), 
        CsvPreference.EXCEL_NORTH_EUROPE_PREFERENCE);

listReader.getHeader(true);

List<String> list;
while( (list = listReader.read()) != null ) {
    Data data = new DataImpl();
    data.put("outlook", list.get(0));
    data.put("temperature", Integer.parseInt(list.get(1)));
    data.put("humidity", Integer.parseInt(list.get(2)));
    data.put("wind", Boolean.parseBoolean(list.get(3)));
    data.put("play", list.get(4));

    nb.learn(data);
}

Data test = new DataImpl();
test.put("outlook", "sunny");
test.put("temperature", "cool");
test.put("humidity", "high");
test.put("windy", "TRUE");

String prediction = nb.predict(test);
System.out.println("Item is: " + test);
System.out.println("Prediction is: " + prediction);

Clustering

Algorithms

  • K-Means
  • BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) [31]

References

[1] <a name="ref1"></a>Charikar, Moses, Kevin Chen, and Martin Farach-Colton. "Finding frequent items in data streams." Automata, Languages and Programming. Springer Berlin Heidelberg, 2002. 693-703.

[2] <a name="ref2"></a>Cormode, Graham, and S. Muthukrishnan. "An improved data stream summary: the count-min sketch and its applications." Journal of Algorithms 55.1 (2005): 58-75.

[3] <a name="ref3"></a>Manku, Gurmeet Singh, and Rajeev Motwani. "Approximate frequency counts over data streams." Proceedings of the 28th international conference on Very Large Data Bases. VLDB Endowment, 2002.

[4] <a name="ref4"></a>M. J. Fischer and S. L. Salzberg. "Finding a Majority Among N Votes: Solution to Problem 81-5(Journal of Algorithms, June 1981)", Journal of Algorithms, 3:4, December 1982, pp. 362-380.

[5] <a name="ref5"></a>Misra, Jayadev, and David Gries. "Finding repeated elements." Science of computer programming 2.2 (1982): 143-152.

[6] <a name="ref6"></a>Metwally, Ahmed, Divyakant Agrawal, and Amr El Abbadi. "Efficient computation of frequent and top-k elements in data streams." Database Theory-ICDT 2005. Springer Berlin Heidelberg, 2005. 398-412.

[7] <a name="ref7"></a>Cormode, Graham, et al. "Effective computation of biased quantiles over data streams." Data Engineering, 2005. ICDE 2005. Proceedings. 21st International Conference on. IEEE, 2005.

[8] <a name="ref8"></a>Ma, Qiang, S. Muthukrishnan, and Mark Sandler. "Frugal Streaming for Estimating Quantiles." Space-Efficient Data Structures, Streams, and Algorithms. Springer Berlin Heidelberg, 2013. 77-96.

[9] <a name="ref9"></a>Greenwald, Michael, and Sanjeev Khanna. "Space-efficient online computation of quantile summaries." ACM SIGMOD Record. Vol. 30. No. 2. ACM, 2001.

[10] <a name="ref10"></a>Munro, J. Ian, and Mike S. Paterson. "Selection and sorting with limited storage." Theoretical computer science 12.3 (1980): 315-323.

[11] <a name="ref11"></a>Shrivastava, Nisheeth, et al. "Medians and beyond: new aggregation techniques for sensor networks." Proceedings of the 2nd international conference on Embedded networked sensor systems. ACM, 2004.

[12] <a name="ref12"></a>Arasu, Arvind, and Gurmeet Singh Manku. "Approximate counts and quantiles over sliding windows." Proceedings of the twenty-third ACM SIGMOD-SIGACT-S

View on GitHub
GitHub Stars207
CategoryDevelopment
Updated3d ago
Forks57

Languages

Java

Security Score

95/100

Audited on Mar 28, 2026

No findings