AdaOja
This repository contains the Python code that produces all of the experimental results from the paper "AdaOja: Adaptive Learning Rates for Streaming PCA". AdaOja is a variant of Oja's method with an adaptive learning rate that performs comparably to other state-of-the-art methods and outperforms Oja's method for standard learning-rate choices such as eta_i = c/i and eta_i = c/sqrt(i). The file <code>streaming_subclass.py</code> provides the framework for several streaming principal component analysis algorithms, including AdaOja, and can easily be applied to a wider set of problems and datasets than those presented here.
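To give a sense of the update rule, here is a minimal NumPy sketch of an AdaOja-style iteration: Oja's update with an AdaGrad-style step size that accumulates gradient norms. This is an illustrative sketch, not the repository's implementation; the function name, the default <code>b0</code>, and the scalar (rather than per-vector) step size are assumptions for the example.

```python
import numpy as np

def adaoja_sketch(X, k, b0=1e-5):
    """Illustrative AdaOja-style streaming PCA (not the repository's code):
    Oja's update with an AdaGrad-style adaptive step size.

    X  : (n, d) data matrix, streamed one row at a time
    k  : number of principal components to estimate
    b0 : initial step-size denominator (hypothetical default)
    """
    n, d = X.shape
    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))   # random orthonormal start
    b = b0
    for i in range(n):
        x = X[i:i + 1]                                 # one sample, shape (1, d)
        G = x.T @ (x @ Q)                              # stochastic gradient x x^T Q
        b = np.sqrt(b**2 + np.linalg.norm(G)**2)       # accumulate gradient norms
        Q, _ = np.linalg.qr(Q + G / b)                 # Oja step + re-orthonormalize
    return Q
```

The key point is that the learning rate 1/b adapts to the observed gradient magnitudes instead of being hand-tuned through a constant c.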
Dependencies
- Python: tested with version 3.5.2
- Jupyter Notebook
- NumPy: tested with version 1.13.1
- SciPy: tested with version 0.19.1
- Matplotlib: tested with version 2.0.2
Note that all of these packages can most easily be installed using Anaconda as follows:
<code>conda install (package-name)</code>
The Anaconda distribution can be downloaded from the Anaconda website.
Streaming PCA Objects
The key code containing our streaming PCA objects is found in <code>streaming_subclass.py</code>. The main functionality for our PCA objects is found in <code>StreamingPCA</code>. Additionally, several subclasses are defined for specific algorithms:
- AdaOja <sup>1</sup>
- Oja: Oja's method <sup>2</sup> for learning rates c/t and c/sqrt(t).
- HPCA: History Principal Component Analysis <sup>3</sup>
- SPM: Streaming Power Method. <sup>4, 5</sup>
The file <code>data_strm_subclass.py</code> provides several examples of how to stream data into these classes. Current functionality runs AdaOja, HPCA, and SPM simultaneously, streaming data from a list of blocks (<code>run_sim_blocklist</code>), from an array already loaded fully into memory (<code>run_sim_fullX</code>), or directly from a bag-of-words file (<code>run_sim_bag</code>).
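The block-streaming pattern these functions rely on can be sketched generically: partition the rows of a dataset into consecutive blocks of size B and feed them to the algorithm one at a time. The function below is an independent illustration, not the repository's <code>run_sim_fullX</code>, whose exact signature may differ.

```python
import numpy as np

def iter_blocks(X, B):
    """Yield consecutive row-blocks of size B from an in-memory array.

    A generic sketch of block streaming; the last block may be smaller
    than B when B does not divide the number of rows evenly.
    """
    n = X.shape[0]
    for start in range(0, n, B):
        yield X[start:start + B]
```

A streaming PCA object would consume each yielded block in turn, so the full dataset never needs to reside in memory when the source is a file or generator instead of an array.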
Plotting and Comparing AdaOja to other Algorithms
Datasets
We run AdaOja against several other streaming algorithms on three different kinds of datasets.
Synthetic Data
The functions to generate synthetic data are found in <code>simulated_data.py</code>.
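A common synthetic model for testing streaming PCA is the spiked covariance model: signal drawn from a low-dimensional subspace plus isotropic noise. The sketch below is a generic version of that model; the repository's generators in <code>simulated_data.py</code> may use different names and parameterizations.

```python
import numpy as np

def spiked_covariance_samples(n, d, k, sigma=0.25, seed=0):
    """Draw n samples in R^d from a spiked covariance model: signal in a
    random k-dimensional subspace plus isotropic Gaussian noise.

    Returns the (n, d) data matrix and the (d, k) planted orthonormal basis.
    A generic sketch; parameter names and defaults are assumptions.
    """
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((d, k)))  # planted subspace
    Z = rng.standard_normal((n, k))                   # signal coefficients
    E = rng.standard_normal((n, d))                   # ambient noise
    return Z @ U.T + sigma * E, U
```

Because the planted basis U is returned alongside the data, the subspace recovered by a streaming algorithm can be checked directly against the ground truth.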
Bag-of-words
These sparse, real-world bag-of-words datasets are available on the UCI Machine Learning Repository. Note that in order to run <code>ExpVar_Comparison.ipynb</code> your working directory must contain the following files:
- docword.kos.txt
- docword.nips.txt
- docword.enron.txt
- docword.nytimes.txt
- docword.pubmed.txt
The file <code>data_strm_subclass.py</code> contains functions for parsing these bag-of-words text files in python.
For example, for small bag-of-words datasets the dimensions n, d, the number of non-zeros, the density, the dataset (as a sparse n x d CSR matrix), and the squared norm of the dataset are computed by running:
<code>n, d, nnz, dense, SpX, norm2 = dssb.get_bagX('docword.kos.txt')</code>
Alternatively, a list of the first m sparse blocks of size B can be returned by running the following:
<code>n, d, nnz, dense, SpX, norm2 = dssb.get_bagXblocks('docword.nytimes.txt', B, block_total=m)</code>
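For reference, the UCI docword files begin with three header lines (number of documents, vocabulary size, number of nonzeros) followed by <code>docID wordID count</code> triples with 1-indexed IDs. A minimal independent parser along those lines looks like this; it is a sketch of the format, not the repository's <code>get_bagX</code>.

```python
from scipy.sparse import csr_matrix

def parse_docword(lines):
    """Parse UCI bag-of-words 'docword' text lines into an n x d CSR matrix.

    Header: n_docs, vocab_size, nnz; body: 'docID wordID count' (1-indexed).
    An independent sketch of the format, not the repository's get_bagX.
    """
    it = iter(lines)
    n, d, _ = int(next(it)), int(next(it)), int(next(it))
    rows, cols, vals = [], [], []
    for line in it:
        doc, word, count = map(int, line.split())
        rows.append(doc - 1)                 # convert to 0-indexed
        cols.append(word - 1)
        vals.append(count)
    X = csr_matrix((vals, (rows, cols)), shape=(n, d))
    norm2 = X.power(2).sum()                 # squared Frobenius norm
    density = X.nnz / (n * d)
    return n, d, X.nnz, density, X, norm2
```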
CIFAR-10
The CIFAR-10 dataset is available online. It is a subset of the considerably larger Tiny Images Dataset. Note that in order to run <code>ExpVar_Comparison.ipynb</code>, you must download the following files and include them in your working directory:
- data_batch_1
- data_batch_2
- data_batch_3
- data_batch_4
- data_batch_5
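Each <code>data_batch</code> file in the CIFAR-10 python version is a pickled dictionary whose <code>b'data'</code> key holds a uint8 array of flattened 32x32x3 images (one row per image). A minimal loader sketch, assuming that standard format and not taken from the repository's code:

```python
import pickle
import numpy as np

def load_cifar_batch(path):
    """Load one CIFAR-10 python-version batch file: a pickled dict whose
    b'data' key holds a uint8 array of flattened images. Returns the data
    as float64 for downstream numerics. A standard loader sketch."""
    with open(path, 'rb') as f:
        batch = pickle.load(f, encoding='bytes')   # Python-2-era pickle
    return np.asarray(batch[b'data'], dtype=np.float64)
```

Loading the five batches and stacking the resulting arrays row-wise yields the full training set as one matrix suitable for streaming.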
Running Experiments
We generate our comparison plots in <code>ExpVar_Comparison.ipynb</code>. These plots largely draw on two files: <code>data_strm_subclass.py</code> and <code>plot_functions.py</code>. To run this file, download the CIFAR-10 dataset and Bag-of-Words datasets as outlined in the section above and make sure the necessary files are in your working directory.
The file <code>plot_functions.py</code> compares and visualizes the end explained variance achieved by Oja's method varying over c for learning rates eta_i = c / i, c / sqrt(i) compared to the end explained variance achieved by AdaOja. These methods are stored in the class <code>compare_lr</code>. It also plots HPCA, AdaOja, and SPM against each other using the function <code>plot_hpca_ada</code> in conjunction with the streaming methods from <code>data_strm_subclass.py</code>.
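The explained variance these comparisons report is the standard subspace quality metric: the fraction of the data's squared Frobenius norm captured by the estimated subspace, ||X Q||_F^2 / ||X||_F^2 for orthonormal Q. A one-line sketch (independent of the repository's plotting code):

```python
import numpy as np

def explained_variance(X, Q):
    """Fraction of the data's squared norm captured by the subspace spanned
    by the orthonormal columns of Q: ||X Q||_F^2 / ||X||_F^2. Sketch only."""
    return np.linalg.norm(X @ Q)**2 / np.linalg.norm(X)**2
```

Values lie in [0, 1], with 1 meaning the subspace captures all of the data's variance; this makes end-of-stream scores directly comparable across algorithms and learning rates.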
The class <code>compare_time</code> contains preliminary functionality to compare the time costs of these methods (AdaOja, HPCA, and SPM).
Sources
1. Amelia Henriksen and Rachel Ward. AdaOja: Adaptive Learning Rates for Streaming PCA. arXiv e-prints, arXiv:1905.12115, May 2019.
2. Erkki Oja. Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3):267-273, Nov 1982.
3. P. Yang, C.-J. Hsieh, and J.-L. Wang. History PCA: A New Algorithm for Streaming PCA. arXiv e-prints, February 2018.
4. Ioannis Mitliagkas, Constantine Caramanis, and Prateek Jain. Memory limited, streaming PCA. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2886-2894. Curran Associates, Inc., 2013.
5. Moritz Hardt and Eric Price. The noisy power method: A meta algorithm with applications. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2861-2869, Cambridge, MA, USA, 2014. MIT Press.
License and Reference
This repository is licensed under the 3-clause BSD license, see <code>LICENSE.md</code>.
To reference this code base, please cite:
Amelia Henriksen and Rachel Ward. AdaOja: Adaptive Learning Rates for Streaming PCA. arXiv e-prints, arXiv:1905.12115, May 2019.
