C++ TfidfVectorizer

Convert raw documents to a matrix of TF-IDF features.

Requirements:

  • Armadillo, g++, Boost
sudo apt install g++ libboost-all-dev libarmadillo-dev

Compiling and running the example in main.cc:

g++ main.cc src/tfidf_vectorizer.cc -larmadillo -std=c++11 && ./a.out

Features:

  • Tokenizes raw documents.
  • Works with both tf-idf and binary values.
  • Can keep only a selected number of features (the ones with the highest idf).
  • Similar interface to sklearn: fit, transform and fit_transform methods, as well as idf_ and vocabulary_ members. This is not a port of sklearn's TfidfVectorizer, but it tries to mimic its behavior. The example given here produces the same tf-idf matrix as the sklearn example at https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html.
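
To make the sklearn comparison above concrete, here is a minimal, standard-library-only sketch of the computation that sklearn's TfidfVectorizer performs with its default settings (smoothed idf, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document). It is not this library's code and does not use Armadillo: the names tokenize, TfidfSketch and fit_transform_sketch are hypothetical, and the tokenizer only approximates sklearn's default token pattern.

```cpp
#include <cassert>
#include <cctype>
#include <cmath>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: lowercase the text and split on non-alphanumeric
// characters, dropping tokens shorter than two characters (roughly what
// sklearn's default token pattern does).
std::vector<std::string> tokenize(const std::string& doc) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : doc) {
        if (std::isalnum(c)) {
            current += static_cast<char>(std::tolower(c));
        } else {
            if (current.size() >= 2) tokens.push_back(current);
            current.clear();
        }
    }
    if (current.size() >= 2) tokens.push_back(current);
    return tokens;
}

// Hypothetical result type; matrix[i][j] is the weight of term i in document j,
// i.e. features in rows and documents in columns.
struct TfidfSketch {
    std::map<std::string, std::size_t> vocabulary;  // term -> row index
    std::vector<double> idf;                         // one weight per term
    std::vector<std::vector<double>> matrix;         // features x documents
};

TfidfSketch fit_transform_sketch(const std::vector<std::string>& docs) {
    assert(!docs.empty());
    const std::size_t n_docs = docs.size();

    std::vector<std::vector<std::string>> tokenized;
    for (const auto& d : docs) tokenized.push_back(tokenize(d));

    // Document frequency: in how many documents each term appears.
    std::map<std::string, std::size_t> df;
    for (const auto& toks : tokenized) {
        std::set<std::string> unique(toks.begin(), toks.end());
        for (const auto& t : unique) ++df[t];
    }

    // std::map iterates in sorted order, so row indices are alphabetical.
    TfidfSketch out;
    std::size_t next_index = 0;
    for (const auto& kv : df) {
        out.vocabulary[kv.first] = next_index++;
        // Smoothed idf, as in sklearn's default settings.
        out.idf.push_back(std::log((1.0 + n_docs) / (1.0 + kv.second)) + 1.0);
    }

    out.matrix.assign(out.idf.size(), std::vector<double>(n_docs, 0.0));
    for (std::size_t j = 0; j < n_docs; ++j) {
        // Raw term counts for document j.
        for (const auto& t : tokenized[j]) out.matrix[out.vocabulary[t]][j] += 1.0;
        // Weight by idf, then L2-normalize the column.
        double squared_norm = 0.0;
        for (std::size_t i = 0; i < out.idf.size(); ++i) {
            out.matrix[i][j] *= out.idf[i];
            squared_norm += out.matrix[i][j] * out.matrix[i][j];
        }
        if (squared_norm > 0.0) {
            const double norm = std::sqrt(squared_norm);
            for (std::size_t i = 0; i < out.idf.size(); ++i) out.matrix[i][j] /= norm;
        }
    }
    return out;
}
```

Calling fit_transform_sketch on the four-sentence corpus from the sklearn documentation page yields nine terms and four documents. A term that occurs in every document, such as "is", gets an idf of exactly 1 under this formula.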

Notes:

  • Features are in rows, documents (objects) are in columns.
  • This is the opposite of the usual Python convention (documents in rows), but it is the default in C++ libraries such as mlpack.
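
The layout described in the notes above can be illustrated with a small sketch that reads one document out of a features-by-documents matrix and converts the whole matrix to the documents-in-rows layout that Python users expect. It uses nested std::vector rather than Armadillo so that it compiles on its own; the names Matrix, document_column and to_documents_in_rows are hypothetical. With an Armadillo matrix X, the corresponding operations would be X.col(j) and X.t().

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical alias: outer index = feature (row), inner index = document (column).
using Matrix = std::vector<std::vector<double>>;

// Collect the feature vector of document j, i.e. column j of the matrix.
std::vector<double> document_column(const Matrix& features_by_docs, std::size_t j) {
    std::vector<double> column;
    column.reserve(features_by_docs.size());
    for (const auto& row : features_by_docs) {
        assert(j < row.size());
        column.push_back(row[j]);
    }
    return column;
}

// Transpose into the documents-in-rows layout.
Matrix to_documents_in_rows(const Matrix& features_by_docs) {
    if (features_by_docs.empty()) return Matrix();
    const std::size_t n_features = features_by_docs.size();
    const std::size_t n_docs = features_by_docs.front().size();
    Matrix out(n_docs, std::vector<double>(n_features, 0.0));
    for (std::size_t i = 0; i < n_features; ++i) {
        for (std::size_t j = 0; j < n_docs; ++j) {
            out[j][i] = features_by_docs[i][j];
        }
    }
    return out;
}
```

For a matrix with three features and two documents, document_column(X, 1) returns three values, and to_documents_in_rows(X) returns two rows of three values each.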

Optional: unit tests

  • Install Catch2
git clone https://github.com/catchorg/Catch2.git # somewhere else
cd Catch2
cmake -Bbuild -H. -DBUILD_TESTING=OFF
sudo cmake --build build/ --target install 
  • Run tests
cd tests/
g++ t1.cc -larmadillo -std=c++11 -o tests
./tests