AdaOja
This repository contains the Python code that produces all of the experimental results from the paper "AdaOja: Adaptive Learning Rates for Streaming PCA". AdaOja is a variant of Oja's method with an adaptive learning rate that performs comparably to other state-of-the-art methods and outperforms Oja's method for standard learning-rate choices such as eta_i = c/i and eta_i = c/sqrt(i). The file <code>streaming_subclass.py</code> provides the framework for several streaming principal component analysis algorithms, including AdaOja, and can easily be applied to a wider set of problems and datasets than those presented here.
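To give a sense of the update rule, here is a minimal NumPy sketch of an AdaOja-style iteration: Oja's update with an AdaGrad-style step size that accumulates gradient norms. This is an illustrative sketch, not the repository's implementation; the function name, the default <code>b0</code>, and the scalar (rather than per-vector) step size are assumptions for the example.

```python
import numpy as np

def adaoja_sketch(X, k, b0=1e-5):
    """Illustrative AdaOja-style streaming PCA (not the repository's code):
    Oja's update with an AdaGrad-style adaptive step size.

    X  : (n, d) data matrix, streamed one row at a time
    k  : number of principal components to estimate
    b0 : initial step-size denominator (hypothetical default)
    """
    n, d = X.shape
    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))   # random orthonormal start
    b = b0
    for i in range(n):
        x = X[i:i + 1]                                 # one sample, shape (1, d)
        G = x.T @ (x @ Q)                              # stochastic gradient x x^T Q
        b = np.sqrt(b**2 + np.linalg.norm(G)**2)       # accumulate gradient norms
        Q, _ = np.linalg.qr(Q + G / b)                 # Oja step + re-orthonormalize
    return Q
```

The key point is that the learning rate 1/b adapts to the observed gradient magnitudes instead of being hand-tuned through a constant c.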
Dependencies
- Python: tested with version 3.5.2
- Jupyter Notebook
- NumPy: tested with version 1.13.1
- SciPy: tested with version 0.19.1
- Matplotlib: tested with version 2.0.2
Note that all of these packages can most easily be installed using Anaconda as follows:
<code>conda install (package-name)</code>
The Anaconda distribution can be downloaded from the Anaconda website.
Streaming PCA Objects
The key code containing our streaming PCA objects is found in <code>streaming_subclass.py</code>. The main functionality for our PCA objects is found in <code>StreamingPCA</code>. Additionally, several subclasses are defined for specific algorithms:
- AdaOja <sup>1</sup>
- Oja: Oja's method <sup>2</sup> for learning rates c/t and c/sqrt(t).
- HPCA: History Principal Component Analysis <sup>3</sup>
- SPM: Streaming Power Method. <sup>4, 5</sup>
The file <code>data_strm_subclass.py</code> provides several examples of how to stream data into these classes. Current functionality runs AdaOja, HPCA, and SPM simultaneously, streaming data from a list of blocks (<code>run_sim_blocklist</code>), from an array already loaded fully into memory (<code>run_sim_fullX</code>), or directly from a bag-of-words file (<code>run_sim_bag</code>).
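The block-streaming pattern these functions rely on can be sketched generically: partition the rows of a dataset into consecutive blocks of size B and feed them to the algorithm one at a time. The function below is an independent illustration, not the repository's <code>run_sim_fullX</code>, whose exact signature may differ.

```python
import numpy as np

def iter_blocks(X, B):
    """Yield consecutive row-blocks of size B from an in-memory array.

    A generic sketch of block streaming; the last block may be smaller
    than B when B does not divide the number of rows evenly.
    """
    n = X.shape[0]
    for start in range(0, n, B):
        yield X[start:start + B]
```

A streaming PCA object would consume each yielded block in turn, so the full dataset never needs to reside in memory when the source is a file or generator instead of an array.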
Plotting and Comparing AdaOja to other Algorithms
Datasets
We run AdaOja against several other streaming algorithms on three different kinds of datasets.
Synthetic Data
The functions to generate synthetic data are found in <code>simulated_data.py</code>.
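A common synthetic model for testing streaming PCA is the spiked covariance model: signal drawn from a low-dimensional subspace plus isotropic noise. The sketch below is a generic version of that model; the repository's generators in <code>simulated_data.py</code> may use different names and parameterizations.

```python
import numpy as np

def spiked_covariance_samples(n, d, k, sigma=0.25, seed=0):
    """Draw n samples in R^d from a spiked covariance model: signal in a
    random k-dimensional subspace plus isotropic Gaussian noise.

    Returns the (n, d) data matrix and the (d, k) planted orthonormal basis.
    A generic sketch; parameter names and defaults are assumptions.
    """
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((d, k)))  # planted subspace
    Z = rng.standard_normal((n, k))                   # signal coefficients
    E = rng.standard_normal((n, d))                   # ambient noise
    return Z @ U.T + sigma * E, U
```

Because the planted basis U is returned alongside the data, the subspace recovered by a streaming algorithm can be checked directly against the ground truth.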
Bag-of-words
These sparse, real-world bag-of-words datasets are available on the UCI Machine Learning Repository. Note that in order to run <code>ExpVar_Comparison.ipynb</code> your working directory must contain the following files:
- docword.kos.txt
- docword.nips.txt
- docword.enron.txt
- docword.nytimes.txt
- docword.pubmed.txt
The file <code>data_strm_subclass.py</code> contains functions for parsing these bag-of-words text files in python.
For example, for small bag-of-words datasets the dimensions n, d, the number of non-zeros, the density, the dataset (as a sparse n x d CSR matrix), and the squared norm of the dataset are computed by running:
<code>n, d, nnz, dense, SpX, norm2 = dssb.get_bagX('docword.kos.txt')</code>
Alternatively, a list of the first m sparse blocks of size B can be returned by running the following:
<code>n, d, nnz, dense, SpX, norm2 = dssb.get_bagXblocks('docword.nytimes.txt', B, block_total=m)</code>
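For reference, the UCI docword files begin with three header lines (number of documents, vocabulary size, number of nonzeros) followed by <code>docID wordID count</code> triples with 1-indexed IDs. A minimal independent parser along those lines looks like this; it is a sketch of the format, not the repository's <code>get_bagX</code>.

```python
from scipy.sparse import csr_matrix

def parse_docword(lines):
    """Parse UCI bag-of-words 'docword' text lines into an n x d CSR matrix.

    Header: n_docs, vocab_size, nnz; body: 'docID wordID count' (1-indexed).
    An independent sketch of the format, not the repository's get_bagX.
    """
    it = iter(lines)
    n, d, _ = int(next(it)), int(next(it)), int(next(it))
    rows, cols, vals = [], [], []
    for line in it:
        doc, word, count = map(int, line.split())
        rows.append(doc - 1)                 # convert to 0-indexed
        cols.append(word - 1)
        vals.append(count)
    X = csr_matrix((vals, (rows, cols)), shape=(n, d))
    norm2 = X.power(2).sum()                 # squared Frobenius norm
    density = X.nnz / (n * d)
    return n, d, X.nnz, density, X, norm2
```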
CIFAR-10
The CIFAR-10 dataset is available online. It is a subset of the considerably larger Tiny Images Dataset. Note that in order to run <code>ExpVar_Comparison.ipynb</code>, you must download the following files and include them in your working directory:
- data_batch_1
- data_batch_2
- data_batch_3
- data_batch_4
- data_batch_5
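Each <code>data_batch</code> file in the CIFAR-10 python version is a pickled dictionary whose <code>b'data'</code> key holds a uint8 array of flattened 32x32x3 images (one row per image). A minimal loader sketch, assuming that standard format and not taken from the repository's code:

```python
import pickle
import numpy as np

def load_cifar_batch(path):
    """Load one CIFAR-10 python-version batch file: a pickled dict whose
    b'data' key holds a uint8 array of flattened images. Returns the data
    as float64 for downstream numerics. A standard loader sketch."""
    with open(path, 'rb') as f:
        batch = pickle.load(f, encoding='bytes')   # Python-2-era pickle
    return np.asarray(batch[b'data'], dtype=np.float64)
```

Loading the five batches and stacking the resulting arrays row-wise yields the full training set as one matrix suitable for streaming.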
Running Experiments
We generate our comparison plots in <code>ExpVar_Comparison.ipynb</code>. These plots largely draw on two files: <code>data_strm_subclass.py</code> and <code>plot_functions.py</code>. To run this file, download the CIFAR-10 dataset and Bag-of-Words datasets as outlined in the section above and make sure the necessary files are in your working directory.
The file <code>plot_functions.py</code> compares and visualizes the end explained variance achieved by Oja's method varying over c for learning rates eta_i = c / i, c / sqrt(i) compared to the end explained variance achieved by AdaOja. These methods are stored in the class <code>compare_lr</code>. It also plots HPCA, AdaOja, and SPM against each other using the function <code>plot_hpca_ada</code> in conjunction with the streaming methods from <code>data_strm_subclass.py</code>.
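The explained variance these comparisons report is the standard subspace quality metric: the fraction of the data's squared Frobenius norm captured by the estimated subspace, ||X Q||_F^2 / ||X||_F^2 for orthonormal Q. A one-line sketch (independent of the repository's plotting code):

```python
import numpy as np

def explained_variance(X, Q):
    """Fraction of the data's squared norm captured by the subspace spanned
    by the orthonormal columns of Q: ||X Q||_F^2 / ||X||_F^2. Sketch only."""
    return np.linalg.norm(X @ Q)**2 / np.linalg.norm(X)**2
```

Values lie in [0, 1], with 1 meaning the subspace captures all of the data's variance; this makes end-of-stream scores directly comparable across algorithms and learning rates.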
The class <code>compare_time</code> contains preliminary functionality to compare the time costs of these methods (AdaOja, HPCA, and SPM).
Sources
1. Amelia Henriksen and Rachel Ward. AdaOja: Adaptive Learning Rates for Streaming PCA. arXiv e-prints, arXiv:1905.12115, May 2019.
2. Erkki Oja. Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3):267-273, Nov 1982.
3. P. Yang, C.-J. Hsieh, and J.-L. Wang. History PCA: A New Algorithm for Streaming PCA. arXiv e-prints, February 2018.
4. Ioannis Mitliagkas, Constantine Caramanis, and Prateek Jain. Memory limited, streaming PCA. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2886-2894. Curran Associates, Inc., 2013.
5. Moritz Hardt and Eric Price. The noisy power method: A meta algorithm with applications. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2861-2869, Cambridge, MA, USA, 2014. MIT Press.
License and Reference
This repository is licensed under the 3-clause BSD license, see <code>LICENSE.md</code>.
To reference this code base, please cite:
Amelia Henriksen and Rachel Ward. AdaOja: Adaptive Learning Rates for Streaming PCA. arXiv e-prints, arXiv:1905.12115, May 2019.
