SkillAgentSearch skills...

DRIFT

DRIFT is a tool for Diachronic Analysis of Scientific Literature.

Install / Use

/learn @rajaswa/DRIFT
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

<p align="center"> <img src="./misc/images/logo_svg.svg" alt="Logo" height=12.5% width=12.5%/> </p> <div align='center'>

GitHub license GitHub stars GitHub forks GitHub issues

</div> <div align='center'>

Open in Streamlit arXiv

</div>

About

DRIFT is a tool for Diachronic Analysis of Scientific Literature. The application offers user-friendly and customizable utilities for two modes: Training and Analysis. Currently, the application supports customizable training of diachronic word embeddings with the TWEC model. The application supports a variety of analysis methods to monitor trends and patterns of development in scientific literature:

  1. Word Cloud
  2. Productivity/Frequency Plot
  3. Acceleration Plot
  4. Semantic Drift
  5. Tracking Clusters
  6. Acceleration Heatmap
  7. Track Trends with Similarity
  8. Keyword Visualisation
  9. LDA Topic Modelling

NOTE: The online demo is hosted using Streamlit sharing. This is a single-instance single-process deployment, accessible to all the visitors publicly (avoid sharing sensitive information on the demo). Hence, it is highly recommended to use your own independent local deployment of the application for a seamless and private experience. One can alternatively fork this repository and host it using Streamlit sharing.

We would love to know about any issues found on this repository. Please submit an issue for any query, or contact us here. If you use this application in your work, you can cite this repository and the paper here.


Setup

Clone the repository:

git clone https://github.com/rajaswa/DRIFT.git
cd DRIFT

Install the requirements:

make install_req

Data

The dataset we have used for our demo, and the analysis in the paper was scraped using the arXiv API (see script). We scraped papers from the cs.CL subject. This dataset is available here.

The user can upload their own dataset to the DRIFT application. The unprocessed dataset should be present in the following format (as a JSON file):

{
   <year_1>:[
      <paper_1>,
      <paper_2>,
      ...
   ],
   <year_2>:[
      <paper_1>,
      <paper_2>,
      ...
   ],
   ...,
   <year_m>:[
      <paper_1>,
      <paper_2>,
      ...
   ],
}

where year_x is a string (e.g., "1998"), and paper_x is a dictionary. An example is given below:

{
   "url":"http://arxiv.org/abs/cs/9809020v1",
   "date":"1998-09-15 23:49:32+00:00",
   "title":"Linear Segmentation and Segment Significance",
   "authors":[
      "Min-Yen Kan",
      "Judith L. Klavans",
      "Kathleen R. McKeown"
   ],
   "abstract":"We present a new method for discovering a segmental discourse structure of a\ndocument while categorizing segment function. We demonstrate how retrieval of\nnoun phrases and pronominal forms, along with a zero-sum weighting scheme,\ndetermines topicalized segmentation. Futhermore, we use term distribution to\naid in identifying the role that the segment performs in the document. Finally,\nwe present results of evaluation in terms of precision and recall which surpass\nearlier approaches.",
   "journal ref":"Proceedings of 6th International Workshop of Very Large Corpora\n  (WVLC-6), Montreal, Quebec, Canada: Aug. 1998. pp. 197-205",
   "category":"cs.CL"
}

The only important key is "abstract", which has the raw text. The user can name this key differently. See the Training section below for more details.


Usage

Launch the app

To launch the app, run the following command from the terminal:

streamlit run app.py

Train Mode

Preprocesing

The preprocessing stage takes a JSON file structured as shown in the Data section. They key for raw text is provided on which preprocessing takes place. During the preprocessing of the text, year-wise text files are created in a desired directory. During the preprocessing:

  • All html tags are removed from the text.
  • Contractions are replaced (e.g. 'don't' is converted to 'do not')
  • Punctuations, non-ascii characters, stopwords are removed.
  • All verbs are lemmatized.

After this, each processed text is stored in the respective year file separated by a new-line, along with all the data in a single file as compass.txt

Training

<p align="center"> <img src="./misc/images/training_demo.gif" alt="training_demo" height=65% width=65%/> </p>

The training mode uses the path where the processed text files are stored, and trains the TWEC model on the given text. The TWEC model trains a Word2Vec model on compass.txt and then the respective time-slices are trained on this model to get corresponding word vectors. In the sidebar, we provide several options like - whether to use Skipgram over CBOW, number of dynamic iterations for training, number of static iterations for training, negative sampling, etc. After training, we store the models at the specified path, which are used later in the analysis.

Analysis Mode

Word Cloud

<p align="center"> <img src="./misc/images/word_cloud_usage.gif" alt="word_cloud_usage" height=65% width=65%/> </p>

A word cloud, or tag cloud, is a textual data visualization which allows anyone to see in a single glance the words which have the highest frequency within a given body of text. Word clouds are typically used as a tool for processing, analyzing and disseminating qualitative sentiment data.

References:

Productivity/Frequency Plot

<p align="center"> <img src="./misc/images/prod_freq_usage.png" alt="prod_freq_usage" height=65% width=65%/> </p>

Our main reference for this method is this paper. In short, this paper uses normalized term frequency and term producitvity as their measures.

  • Term Frequency: This is the normalized frequency of a given term in a given year.
  • Term Productivity: This is a measure of the ability of the concept to produce new multi-word terms. In our case we use bigrams. For each year y and single-word term t, and associated n multi-word terms m, the productivity is given by the entropy:
<p align="center"> <img src="https://latex.codecogs.com/svg.latex?%5Csmall%20e%28t%2Cy%29%20%3D%20-%20%5Csum_%7Bi%3D1%7D%5E%7Bn%7D%20%5Clog_%7B2%7D%28p_%7Bm_%7Bi%7D%2Cy%7D%29.p_%7Bm_%7Bi%7D%2Cy%7D" alt="prod_plot_1"/> </p> <p align="center"> <img src="https://latex.codecogs.com/svg.latex?%5Csmall%20p_%7Bm%2Cy%7D%20%3D%20%5Cfrac%7Bf%28m%29%7D%7B%5Csum_%7Bi%3D1%7D%5E%7Bn%7Df%28m_%7Bi%7D%29%7D" alt="prod_plot_2"/> </p>

Based on these two measures, they hypothesize three kinds of terms:

  • Growing Terms: Those which have increasing frequency and productivity in the recent years.
  • Consolidated Terms: Those that are growing in frequency, but not in productivity.
  • Terms in Decline: Those which have reached an upper bound of productivity and are being used less in terms of frequency.

Then, they perform clustering of the terms based on their frequency and productivity curves over the years to test their hypothesis. They find that the clusters formed show similar trends as expected.

NOTE: They also evaluate quality of their clusters using pseudo-labels, but we do not use any automated labels here. They also try with and without double-counting multi-word terms, but we stick to double-counting. They suggest it is more explanable.

Acceleration Plot

<p align="center"> <img src="./misc/images/acceleration_plot_usage.png" alt="acceleration_plot_usage" height=65% width=65%/> </p>

This plot is based on the word-pair acceleration over time. Our inspiration for this method is this paper. Acceleration is a metric which calculates how quickly the word embeddings for a pair of word get close together or farther apart. If they are getting closer together, it means these two

View on GitHub
GitHub Stars126
CategoryDevelopment
Updated1mo ago
Forks14

Languages

Python

Security Score

100/100

Audited on Feb 20, 2026

No findings