C++ TfidfVectorizer

Convert raw documents to a matrix of TF-IDF features.

Requirements:

Armadillo, g++, boost

sudo apt install g++ libboost-all-dev libarmadillo-dev

Compiling and running the example in main.cc:

g++ main.cc src/tfidf_vectorizer.cc -larmadillo -std=c++11 && ./a.out

Features:

Tokenizes raw documents.
Works with both tf-idf and binary values.
Can restrict the vocabulary to a selected number of features (the ones with the highest idf).
Similar interface to sklearn: fit, transform and fit_transform methods, as well as idf_ and vocabulary_ members. This is not a port of sklearn's TfidfVectorizer, but it tries to mimic its behavior. The example given here produces the same tf-idf matrix as the sklearn example at https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html.
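
As a rough illustration of what "the same tf-idf matrix as sklearn" implies for the idf_ values, the sketch below computes the smoothed idf that sklearn uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1. It is not taken from this library's source: split_ws and smooth_idf are hypothetical names, and the whitespace-only tokenizer is an assumption that may differ from the library's tokenizer.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits on whitespace only (assumption, not the library's tokenizer).
std::vector<std::string> split_ws(const std::string& doc) {
  std::istringstream in(doc);
  std::vector<std::string> tokens;
  for (std::string t; in >> t;) tokens.push_back(t);
  return tokens;
}

// Smoothed idf with sklearn's default formula:
//   idf(t) = ln((1 + n) / (1 + df(t))) + 1
// where n is the number of documents and df(t) the number of documents containing t.
std::map<std::string, double> smooth_idf(const std::vector<std::string>& docs) {
  assert(!docs.empty());
  std::map<std::string, std::size_t> df;
  for (const auto& doc : docs) {
    const std::vector<std::string> tokens = split_ws(doc);
    // Count each term at most once per document.
    const std::set<std::string> unique(tokens.begin(), tokens.end());
    for (const auto& t : unique) ++df[t];
  }
  const double n = static_cast<double>(docs.size());
  std::map<std::string, double> idf;
  for (const auto& kv : df) {
    idf[kv.first] = std::log((1.0 + n) / (1.0 + static_cast<double>(kv.second))) + 1.0;
  }
  return idf;
}
```

With this formula, a term that appears in every document gets an idf of exactly 1, and rarer terms get larger values.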

Notes:

Features are in rows, documents (objects) are in columns.
This is the opposite of the usual Python convention, where documents are in rows, but it is the default in C++ libraries such as mlpack.
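
The layout above can be pictured with a small sketch. Armadillo stores matrices in column-major order, so with documents in columns each document's feature vector is contiguous in memory. FeatureMatrix below is a hypothetical stand-in written with std::vector only; it is not a type from this library or from Armadillo.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal container illustrating "features in rows, documents in columns".
struct FeatureMatrix {
  std::size_t n_features;    // number of rows
  std::size_t n_docs;        // number of columns
  std::vector<double> data;  // column-major storage, zero-initialized

  FeatureMatrix(std::size_t features, std::size_t docs)
      : n_features(features), n_docs(docs), data(features * docs, 0.0) {}

  // Element access: row = feature index, column = document index.
  double& at(std::size_t feature, std::size_t doc) {
    assert(feature < n_features && doc < n_docs);
    return data[feature + doc * n_features];
  }

  // Copy out one document's feature vector, i.e. one whole column.
  std::vector<double> document(std::size_t doc) const {
    assert(doc < n_docs);
    const auto first = data.begin() + static_cast<std::ptrdiff_t>(doc * n_features);
    return std::vector<double>(first, first + static_cast<std::ptrdiff_t>(n_features));
  }
};
```

In this layout, reading all features of one document touches a single contiguous block, whereas the Python convention (documents in rows, row-major storage) gets the same property from the transposed shape.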

Optional: unit tests

Install Catch2:

git clone https://github.com/catchorg/Catch2.git # somewhere else
cd Catch2
cmake -Bbuild -H. -DBUILD_TESTING=OFF
sudo cmake --build build/ --target install

Run the tests:

cd tests/
g++ t1.cc -larmadillo -std=c++11 -o tests
./tests
