Docem
The repository contains the code and notebooks for the tutorials on how to extract embedding features from pictures using the ResNeXt model. The quality of the techniques is demonstrated by the clustering obtained in the embedding space and by the correlation of the clusters with their corresponding labels.
Document Embedding (DocEm)
The repository contains the code and notebooks for the tutorials on:
- Extracting embedding features from COCO pictures using the ResNeXt model developed by Facebook AI.
- Visualizing the picture embedding vectors in a 3D space using PCA and t-SNE.
- Finding the nearest neighbors of each picture based on the cosine distance.
- Reducing the dimensionality of the embedding space while preserving manifold structures using UMAP.
- Finding the optimal number of GMM clusters using the BIC elbow method and silhouette analysis.
- Visualizing the pictures closest to each centroid to identify the cluster topic.
- Applying an adapted version of the p-SIF (partition averaging) algorithm to produce document embeddings from the bag-of-words model and the original picture embedding vectors.
- Testing the effectiveness of the proposed method against the baseline methods for document averaging (weighted averaging and TF-IDF).
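Two of the steps listed above, the cosine nearest-neighbor lookup and the PCA projection to 3D, can be sketched with NumPy alone. This is a minimal illustration and not the code from the notebooks: the function names `cosine_nearest_neighbors` and `pca_project` are hypothetical, and the embeddings are assumed to be already extracted into an array with one row per picture.

```python
import numpy as np


def cosine_nearest_neighbors(embeddings, k=5):
    """Return the indices and cosine distances of the k nearest neighbors
    of every row (hypothetical helper, brute-force over all pairs)."""
    x = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows unscaled instead of dividing by zero
    unit = x / norms
    # Cosine distance = 1 - cosine similarity of the unit-length rows.
    dist = 1.0 - unit @ unit.T
    np.fill_diagonal(dist, np.inf)  # a picture is not its own neighbor
    idx = np.argsort(dist, axis=1)[:, :k]
    return idx, np.take_along_axis(dist, idx, axis=1)


def pca_project(embeddings, n_components=3):
    """Project the rows onto their first principal components using an SVD
    of the mean-centered data (hypothetical helper)."""
    x = np.asarray(embeddings, dtype=float)
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


if __name__ == "__main__":
    # Toy stand-in for picture embeddings: 100 vectors of dimension 16.
    rng = np.random.default_rng(0)
    toy_embeddings = rng.normal(size=(100, 16))
    neighbor_idx, neighbor_dist = cosine_nearest_neighbors(toy_embeddings, k=3)
    coords_3d = pca_project(toy_embeddings, n_components=3)
    print(neighbor_idx.shape, coords_3d.shape)
```

The brute-force distance matrix grows quadratically with the number of pictures, so for a large collection an approximate or tree-based neighbor index would likely be a better fit.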
Overview of the p-SIF algorithm
Original paper: P-SIF: Document Embeddings Using Partition Averaging, V. Gupta et al.
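As a rough illustration of the partition-averaging idea, the sketch below builds one document vector by concatenating, for each cluster, the membership-weighted average of the item vectors in the document. It is a simplified sketch under stated assumptions, not the adaptation used in this repository and not a full implementation of the paper: the function name `partition_average` and all of its parameters are hypothetical, the soft cluster memberships are taken as given (for example, posterior probabilities from a mixture model), the optional per-item weights stand in for whatever weighting scheme is chosen, and no further post-processing of the averaged vectors is applied.

```python
import numpy as np


def partition_average(doc_counts, item_vectors, memberships, item_weights=None):
    """Simplified partition averaging (hypothetical helper).

    doc_counts:   (n_docs, n_items) bag-of-words style counts
    item_vectors: (n_items, d) embedding of each item
    memberships:  (n_items, K) soft cluster assignment of each item
    item_weights: optional (n_items,) importance weight of each item
    returns:      (n_docs, K * d) document embeddings
    """
    counts = np.asarray(doc_counts, dtype=float)
    vectors = np.asarray(item_vectors, dtype=float)
    member = np.asarray(memberships, dtype=float)
    n_items, d = vectors.shape
    k = member.shape[1]
    if item_weights is None:
        weights = np.ones(n_items)
    else:
        weights = np.asarray(item_weights, dtype=float)

    # For each item, scale its vector by each cluster membership and
    # concatenate the K scaled copies into one vector of length K * d.
    topic_vectors = (member[:, :, None] * vectors[:, None, :]).reshape(n_items, k * d)

    # Weighted average of the concatenated vectors over the items in each document.
    weighted_counts = counts * weights[None, :]
    totals = weighted_counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # empty documents map to the zero vector
    return (weighted_counts @ topic_vectors) / totals


if __name__ == "__main__":
    # Two items with 2-dimensional vectors, each fully assigned to its own cluster.
    counts = np.array([[1, 1], [2, 0]])
    doc_vectors = partition_average(counts, np.eye(2), np.eye(2))
    print(doc_vectors)
```

Compared with a plain weighted average, which yields a vector of length d, the concatenation keeps the contribution of each cluster in its own block of the document vector, at the cost of a K times larger representation.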
Algorithm overview diagram:

Read more
Articles of the "Embed, Cluster, Average" series:
- Extracting rich embedding features from COCO pictures using PyTorch and ResNeXt-WSL
- Manifold clustering in the embedding space using UMAP and GMM
- A novel approach to Document Embedding using Partition Averaging on Bag of Words (soon to be published)
Experiment yourself
You can view and execute the development notebook in Colab:
