
Docem

This repository contains the code and notebooks for tutorials on extracting embedding features from pictures using the ResNeXt model. The quality and effectiveness of the techniques are demonstrated by clustering in the embedding space and by the correlation of the clusters with their corresponding labels.


Install / Use

/learn @gm-spacagna/Docem


Document Embedding (DocEm)

The repository contains the code and notebooks for tutorials that show how to:

Extract embedding features from COCO pictures using the ResNeXt model developed by Facebook AI.
Visualize the picture embedding vectors in 3D space using PCA and t-SNE.
Find the nearest neighbors of each picture based on cosine distance.
Reduce the dimensionality of the embedding space while preserving manifold structure using UMAP.
Find the optimal number of GMM clusters using the BIC elbow method and silhouette analysis.
Visualize the pictures closest to each centroid to identify the cluster topic.
Apply an adapted version of the p-SIF (partition averaging) algorithm to produce document embeddings from the bag-of-words model and the original picture embedding vectors.
Test the effectiveness of the proposed method against baseline document-averaging methods (weighted averaging and TF-IDF).
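
As an illustration of the nearest-neighbor step in the list above, here is a minimal sketch using only NumPy. It is not the notebook's code: the function name `cosine_nearest_neighbors` is hypothetical, and the toy embeddings are random numbers standing in for real ResNeXt features.

```python
import numpy as np


def cosine_nearest_neighbors(embeddings: np.ndarray, k: int = 5):
    """Return (indices, distances) of the k nearest neighbors of every row.

    Distances are cosine distances, i.e. 1 - cosine similarity.
    """
    # L2-normalize rows so that a dot product equals the cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    distances = 1.0 - unit @ unit.T
    # Exclude each item from its own neighbor list.
    np.fill_diagonal(distances, np.inf)
    # Sort each row by increasing distance and keep the first k columns.
    order = np.argsort(distances, axis=1)[:, :k]
    return order, np.take_along_axis(distances, order, axis=1)


# Toy stand-in for picture embeddings (assumption: random data, not real features).
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(6, 4))
toy_embeddings[1] = 2.0 * toy_embeddings[0]  # same direction as item 0

neighbor_idx, neighbor_dist = cosine_nearest_neighbors(toy_embeddings, k=2)
print(neighbor_idx[0], neighbor_dist[0])
```

Because cosine distance ignores vector length, item 1 (a scaled copy of item 0) comes out as the closest neighbor of item 0. For large collections the full pairwise matrix becomes expensive, and an approximate nearest-neighbor index is usually a better fit.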

Overview of the p-SIF algorithm

Original paper: P-SIF: Document Embeddings Using Partition Averaging, V. Gupta et al.

[Figure: p-SIF algorithm overview diagram]
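
The core idea of partition averaging can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the repository's adapted implementation: the soft partition memberships are taken as given (for example, posterior probabilities from a clustering model, which is an assumption here, whereas the original paper obtains them through sparse dictionary learning), the weighting uses the smooth inverse frequency form a / (a + p(w)), and all function and variable names are hypothetical.

```python
import numpy as np


def sif_weights(word_prob: np.ndarray, a: float = 1e-3) -> np.ndarray:
    """Smooth inverse frequency weight a / (a + p(w)) for each word."""
    return a / (a + word_prob)


def partition_average(doc_word_ids, word_vectors, memberships, word_prob, a=1e-3):
    """Embed one document as a concatenation of per-partition weighted averages.

    doc_word_ids: indices of the words (here, visual "words") in the document
    word_vectors: (vocab_size, dim) embedding matrix
    memberships:  (vocab_size, n_partitions) soft assignment of words to partitions
    word_prob:    (vocab_size,) unigram probability of each word in the corpus
    """
    ids = np.asarray(doc_word_ids)
    weights = sif_weights(word_prob[ids], a)        # (n_words,)
    scaled = memberships[ids] * weights[:, None]    # (n_words, n_partitions)
    # For each partition k: sum over words of weight(w) * membership(w, k) * vector(w)
    per_partition = scaled.T @ word_vectors[ids]    # (n_partitions, dim)
    per_partition = per_partition / len(ids)        # average over document length
    return per_partition.reshape(-1)                # (n_partitions * dim,)


# Toy vocabulary: 4 words, 2-dimensional vectors, 2 partitions (hard memberships).
word_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 2.0]])
memberships = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
word_prob = np.full(4, 0.25)

doc_vec = partition_average([0, 2], word_vectors, memberships, word_prob, a=0.25)
print(doc_vec)  # -> [0.25 0.   0.5  0.  ]
```

The resulting vector has `n_partitions * dim` entries, so each partition contributes its own slice instead of being blended into a single average. That is what distinguishes this approach from plain weighted averaging, which yields one `dim`-sized vector per document.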

Read more

Articles of the "Embed, Cluster, Average" series:

Extracting rich embedding features from COCO pictures using PyTorch and ResNeXt-WSL
Manifold clustering in the embedding space using UMAP and GMM
A novel approach to Document Embedding using Partition Averaging on Bag of Words (soon to be published)

Experiment yourself

You can view and execute the development notebook in Colab:


gm-spacagna/docem

Languages

Jupyter Notebook
