
ArXivAtlas

The ArXiv Atlas is a web-based tool that visualizes research papers from the ArXiv repository, enabling users to explore and discover connections between papers through an interactive map and receive recommendations based on their queries.

Install / Use

/learn @Jaluus/ArXivAtlas
About this skill

  • Quality Score: 0/100
  • Supported Platforms: Universal

README

ArXiv Atlas


This project visualizes the ArXiv Atlas, letting users explore research papers and their relationships visually. It also provides RAG-style recommendations based on user queries.


Project Overview

The Arxiv Atlas is a web-based tool that provides an interactive map of research papers from the Arxiv repository. Users can navigate through different categories, discover connections between papers, and get recommendations based on their interests.

Features

  • Interactive Map: Explore an interactive map of Arxiv papers created using LLM embeddings of the papers.
  • Search Functionality: Search for specific papers or use semantic search to find relevant papers based on free text.
  • Visualization: View relationships and connections between papers based purely on their content and resulting embeddings.

How does the Recommendation System work?

The recommendation system relies on embeddings of the papers. These embeddings are vector representations of the abstracts and titles, created using a BERT-type language model. While it's possible to use GPT-type language models for contextualized embeddings, they typically require more computational resources.

Embeddings capture the semantic meaning of the papers, enabling the calculation of similarity between papers based on their embeddings. The similarity is measured using cosine similarity, which compares the angles between the vectors representing the papers. A higher cosine similarity score indicates a greater semantic similarity between two papers.
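As a minimal sketch of that similarity calculation, cosine similarity can be computed with NumPy; the two short vectors here are made-up stand-ins for real paper embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional stand-ins for real paper embeddings.
paper_a = np.array([0.2, 0.7, 0.1])
paper_b = np.array([0.25, 0.65, 0.05])
score = cosine_similarity(paper_a, paper_b)  # close to 1.0 for similar papers
```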

I calculated the similarity between all[^1] papers in the dataset and stored the results in a similarity matrix. When a user selects a paper, we use this similarity matrix to retrieve and recommend the most similar papers.
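With unit-normalized embeddings the whole similarity matrix reduces to a single matrix product. A small sketch with random stand-in embeddings (shapes and names are illustrative, not the project's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 256)).astype(np.float32)  # one row per paper

# Normalizing rows makes the dot product equal to cosine similarity.
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarity = normed @ normed.T  # (5, 5) similarity matrix

# Most similar papers to paper 0, excluding paper 0 itself.
ranked = np.argsort(similarity[0])[::-1]
recommendations = ranked[ranked != 0][:3]
```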

Additionally, the system can provide recommendations based on a semantic query, such as "What are the challenges when using GAN-based classification methods?". The query is embedded, and the system retrieves the most similar papers based on the closest match. The search results are further improved by reranking them using a reranking model.
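A sketch of that query path, with a placeholder `embed` function standing in for the real embedding model (all names here are illustrative, and the reranking stage is only indicated in a comment):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedder; the real system calls a language model here."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.normal(size=256)
    return v / np.linalg.norm(v)

titles = [f"paper-{i}" for i in range(100)]
matrix = np.stack([embed(t) for t in titles])  # corpus embeddings, one row each

query = "What are the challenges when using GAN-based classification methods?"
scores = matrix @ embed(query)                 # cosine scores (unit vectors)
candidates = [titles[i] for i in np.argsort(scores)[::-1][:10]]
# A reranking model (e.g. a cross-encoder) would now rescore these 10 candidates.
```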

What model was used for the embeddings?

For internal testing, I used the AnglE-optimized Text Embeddings[^2] model with some fine-tuning on ArXiv data. A good library for fine-tuning is Ragas.

But for the ArXiv Atlas I'm using the OpenAI text-embedding-3-large model. It strikes a good balance between performance and cost, and means I don't have to run a GPU on my server 24/7 to serve requests. The model also has the very nice property of being trained with Matryoshka Representation Learning! This means it compresses most of its information into the first 256 dimensions[^3], so I can discard the remaining dimensions, saving a lot of memory and keeping the system extremely performant.
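As a sketch of what that truncation looks like, keep the first 256 dimensions and renormalize so cosine similarity still behaves (the 3072-dim vector below is random stand-in data, not a real embedding):

```python
import numpy as np

rng = np.random.default_rng(42)
full = rng.normal(size=3072)          # stand-in for a text-embedding-3-large vector
full /= np.linalg.norm(full)

short = full[:256]                    # keep only the information-dense prefix
short /= np.linalg.norm(short)        # renormalize so cosine similarity still works
short = short.astype(np.float16)      # shrink further for storage
```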

How does the visualization work?

Since the embedding vectors are typically 256 to 3072 dimensional and we are three-dimensional beings, we need to somehow project the embeddings down to 2 or 3 dimensions to visualize them. There are several ways to do this: Random Projections, PCA, t-SNE, PaCMAP, etc.
I'm using UMAP for this task because it's very performant, gives good results, and I'm a bit biased towards it. Under the hood I also use an autoencoder to compute the UMAP projections with a parametric approach. It's not in this repo because the code is messy, but I may publish it in the future.
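UMAP itself lives in the third-party umap-learn package; as a dependency-light illustration of the same idea, here is a PCA projection to 2-D, one of the alternatives listed above (random stand-in embeddings):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 256))       # 500 papers, 256-dim embeddings

# PCA: project onto the two directions of largest variance.
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
coords_2d = X_centered @ Vt[:2].T     # (500, 2) map coordinates
```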

If you are generally interested in how these methods work, there is a nice paper that explains a lot of them, but beware, it's biased towards PaCMAP.

Can I get the Data?

Sure, but I can't provide the similarity matrix because it's way too big.
But I can provide a script to calculate the N closest papers to a given paper.
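A sketch of what such a script could look like, assuming the DataFrame format described below; the function name and the commented-out file name are hypothetical, and a default RangeIndex is assumed:

```python
import numpy as np
import pandas as pd

def n_closest(df: pd.DataFrame, arxiv_id: str, n: int = 5) -> pd.DataFrame:
    """Return the n papers whose abstract embeddings are closest to the given paper."""
    emb = np.stack(df["abstract_embedding"].to_numpy()).astype(np.float32)
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)
    idx = df.index[df["arxiv_id"] == arxiv_id][0]  # assumes default RangeIndex
    scores = emb @ emb[idx]                        # cosine similarity to the paper
    order = np.argsort(scores)[::-1]
    order = order[order != idx][:n]                # drop the paper itself
    return df.iloc[order][["arxiv_id", "title"]]

# df = pd.read_pickle("computer_science.pkl")     # hypothetical file name
```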

| Name                | Number of Papers | Size    | Last Updated | Link     |
| ------------------- | ---------------- | ------- | ------------ | -------- |
| Quantum Physics     | 68,548           | 59.3 MB | 21 July 2024 | Download |
| High Energy Physics | 114,218          | 99.4 MB | 21 July 2024 | Download |
| Physics             | 134,741          | 126 MB  | 21 July 2024 | Download |
| Astro Physics       | 160,252          | 169 MB  | 21 July 2024 | Download |
| Condensed Matter    | 171,503          | 155 MB  | 21 July 2024 | Download |
| Computer Science    | 485,772          | 452 MB  | 21 July 2024 | Download |
| Combined            | 1,238,980        | 1.12 GB | 21 July 2024 | Download |

Data Format

The data is stored in a Pandas DataFrame and saved as a Pickle file. The DataFrame has the following columns:

| Column Name        | Type                    | Description                                                                                 | Example                                            |
| ------------------ | ----------------------- | ------------------------------------------------------------------------------------------- | -------------------------------------------------- |
| title              | str                     | Title of the paper                                                                           | "GPT-4 Technical Report"                           |
| arxiv_id           | str                     | Arxiv ID of the paper                                                                        | "2303.08774"                                       |
| abstract           | str                     | Abstract of the paper                                                                        | "We report the development of GPT-4, a large-sc.." |
| main_category      | str                     | Main category of the paper                                                                   | "cs.CL"                                            |
| categories         | list                    | Categories of the paper                                                                      | ["cs.CL", "cs.AI"]                                 |
| revision           | str                     | Revision of the paper                                                                        | "6"                                                |
| published          | datetime                | Date of publication                                                                          | "2023-03-15 17:15:04"                              |
| updated            | datetime                | Date of last update                                                                          | "2024-03-04 06:01:33"                              |
| authors            | list                    | Authors of the paper                                                                         | ["OpenAI", "J. Achiam", "S. Adler", ...]           |
| journal_ref        | str                     | Journal reference                                                                            | "J. Mach. Learn. Res. 22 (2021) 1-21" or "<NA>"    |
| doi                | str                     | DOI of the paper                                                                             | "10.1234/5678" or "<NA>"                           |
| arxiv_comment      | str                     | Arxiv comment                                                                                | "Submitted to ICLR 2023" or "<NA>"                 |
| arxiv_DOI          | str                     | Arxiv DOI of the paper                                                                       | "10.1234/5678" or "<NA>"                           |
| abstract_embedding | np.ndarray (np.float16) | Embedding of the abstract (256 dimensions) computed using the text-embedding-3-large model   | [-0.011314, -0.0605, -0.02097, -0.004242, ...]     |
| arxiv_year         | int32                   | The first part of the arxiv id, used for sorting                                             | 2303                                               |
| arxiv_number       | int32                   | The second part of the arxiv id, used for sorting                                            | 8774                                               |
| x_umap             | float32                 | UMAP projection of the embedding in the x dimension (not available in the Combined dataset)  | 0.1234                                             |
| y_umap             | float32                 | UMAP projection of the embedding in the y dimension (not available in the Combined dataset)  |                                                    |
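A minimal round-trip illustrating this storage format with a single toy row (values taken from the example column above; an in-memory buffer stands in for a file on disk):

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "title": ["GPT-4 Technical Report"],
        "arxiv_id": ["2303.08774"],
        "main_category": ["cs.CL"],
        "categories": [["cs.CL", "cs.AI"]],
        "abstract_embedding": [np.zeros(256, dtype=np.float16)],
    }
)

buf = io.BytesIO()
df.to_pickle(buf)                 # same storage format as the datasets
buf.seek(0)
loaded = pd.read_pickle(buf)
```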

View on GitHub

  • GitHub Stars: 16
  • Forks: 1
  • Category: Education
  • Updated: 2 months ago
  • Languages: JavaScript

Security Score

95/100 (audited on Jan 7, 2026; no findings)