TwitterSentimentAndCryptocurrencies

Analysis of Twitter Sentiment to discover correlations with Bitcoin and other cryptocurrencies

Correlation of Twitter sentiments with the evolution of cryptocurrencies

Context and goal of the project

This project is part of the Web Mining course at the HEIG-VD for the MSE HES-SO. The students carrying out the project are Antoine Drabble & Sébastien Richoz.

A cryptocurrency is a controversial digital asset designed to work as a medium of exchange that uses strong cryptography to secure financial transactions, control the creation of additional units, and verify the transfer of assets. Cryptocurrencies are extremely volatile. Four years of volatility in the stock market can be covered in a month of pricing movements in the cryptocurrency markets.

Twitter is an online news and social networking service on which users post and interact with messages known as "tweets". It is also the primary channel of communication for cryptocurrencies. Important news is often relayed (retweeted) by thousands of users and can reach a very large audience. For example, John McAfee, who has 828'338 followers at the time of writing, is a famous cryptocurrency user. Because of his huge audience, his tweets can move the price of a cryptocurrency by more than 100%. There are many other influential Twitter users.

The goal of this project is to propose a tool to visualize the correlation between cryptocurrencies' price in USD and a score based on the sentiment analysis of tweets, the number of followers of the user, the number of retweets and the number of likes.

In the first part we will make a historical analysis of the correlation.

In the second part we will propose a tool for the real-time visualisation of the evolution of tweet scores alongside the cryptocurrency's price in USD.

Planning and work distribution

Here are the steps/milestones of the project:

  1. Retrieval of the tweets from the Twitter API.
  2. Retrieval of historical cryptocurrency-to-USD prices at 1-minute intervals from the CryptoCompare API.
  3. Sentiment analysis of the tweets using the VADER algorithm.
  4. Computation of a score based on the sentiment, the number of likes, the number of tweets in a time range and the number of followers of the person who tweeted.
  5. Computation and visualisation of a cross-correlation score, for different lag values, between the tweet scores and the cryptocurrency's price change in USD.
  6. Creation of a real-time visualisation tool of the correlation between the tweets and the cryptocurrency's price change.
  7. This step is optional and not implemented yet. Its goal is to train a recurrent neural network (LSTM or GRU) to predict the price of Bitcoin from the tweet scores and the cryptocurrency's price change, and then to automate the purchase and sale of the cryptocurrency with the trained network.

Below is the list of tasks that each member of the team worked on most. Note that we worked together on many tasks and on the analysis of the problems.

Antoine:

  • Extraction/Preprocessing of tweets
  • Cross-correlation analysis

Sébastien:

  • Extraction of the currencies
  • Real-time correlation viewer

Data sources and quantity

Two data sources were needed to retrieve the information: one for the tweets and one for the cryptocurrency prices.

Twitter

The Twitter API is the source for all the tweets. With the App login, it is limited to 450 requests per 15 minutes, each returning at most 100 tweets. It can only retrieve tweets that are at most 7 days old. We retrieved around 20 days of tweets, which represents ~1'120'000 tweets for Bitcoin (BTC). We also retrieved around 10 days of tweets for two other cryptocurrencies: ~2'500 tweets for Nexo (NEXO) and 15'000 for Zilliqa (ZIL).

To make searches in the Twitter API you must use the query operators to match on multiple keywords. Here is an example of a query for the Bitcoin related tweets.

  • bitcoin OR #BTC OR #bitcoin OR $BTC OR $bitcoin

For each tweet we extracted the following information:

  • ID
  • Text
  • Username
  • Number of followers of the user
  • Number of retweets
  • Number of likes
  • Creation date

At this step we also filtered out the non-English tweets by specifying the language in the Twitter API request.
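As a rough illustration of the query and the language filter described above, the sketch below builds the search parameters in Python. The endpoint URL and the parameter names (`q`, `lang`, `count`) are assumptions based on the standard search endpoint of the Twitter API v1.1, and `build_query` and `build_search_params` are hypothetical helpers, not code from this project.

```python
# Assumed endpoint of the Twitter standard search API (v1.1); not taken from this project.
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"

REQUESTS_PER_WINDOW = 450  # App login limit per 15 minutes (from the text above)
TWEETS_PER_REQUEST = 100   # maximum number of tweets returned per request


def build_query(name, symbol):
    """Join keyword variants with the OR operator, as in the Bitcoin example."""
    terms = [name, f"#{symbol}", f"#{name}", f"${symbol}", f"${name}"]
    return " OR ".join(terms)


def build_search_params(name, symbol):
    """Hypothetical helper: parameters for one search request, English tweets only."""
    return {
        "q": build_query(name, symbol),
        "lang": "en",
        "count": TWEETS_PER_REQUEST,
    }


params = build_search_params("bitcoin", "BTC")
# Upper bound on the number of tweets retrievable in one 15-minute window.
max_tweets_per_window = REQUESTS_PER_WINDOW * TWEETS_PER_REQUEST
print(params["q"])
print(max_tweets_per_window)
```

Under these limits, at most 45'000 tweets can be fetched per 15-minute window, which is why collecting a large history takes repeated runs over several days.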

Cryptocurrencies

Bitcoin is a cryptocurrency and worldwide payment system. It is the first decentralized digital currency, as the system works without a central bank or single administrator. The system was designed to work as a peer-to-peer network, a network in which transactions take place between users directly, without an intermediary. Bitcoin currently has a market cap of 28'256'866'432 and a price of ~7500$ per Bitcoin, making it the largest cryptocurrency in the world.

Zilliqa is a next-generation, high-throughput blockchain platform designed to scale. It uses sharding technology, which allows transaction rates to increase as the network expands and the number of miners grows. It was introduced on 27 December 2017.

Nexo is a lending platform for the world’s first instant crypto-backed loans, which gives investors and businesses access to instant cash while they retain ownership of their digital assets and thus keep all upside potential, a much-needed service for the crypto community. It was introduced on 27 February 2018.

The CryptoCompare API (https://min-api.cryptocompare.com/) was used to retrieve the prices of the cryptocurrencies that we analyzed. Different endpoints are available: historical data can be retrieved per minute for up to the last 7 days, after which only hourly or daily data is available.

About 150 cryptocurrencies are available from the CryptoCompare API. We focused on Bitcoin, Zilliqa and Nexo as a starting point. The data retrieved contains the following fields:

  • time: when the data was emitted
  • close: the price at the end of the time frame (minute, hour or day, depending on the targeted endpoint)
  • high: the highest price reached during the time frame
  • low: the lowest price reached during the time frame
  • open: the price at the opening of the time frame
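A minimal sketch of how the closing prices could be pulled out of such a response is shown below. The `histominute` endpoint path, the `fsym`/`tsym`/`limit` parameters and the `Data` wrapper key are assumptions about the CryptoCompare API that should be checked against its documentation, and the sample payload uses made-up values for illustration only.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Assumed endpoint path for per-minute historical data.
BASE_URL = "https://min-api.cryptocompare.com/data/histominute"


def build_url(symbol, limit=60):
    """Build a request URL for minute data of `symbol` priced in USD (assumed parameters)."""
    return BASE_URL + "?" + urlencode({"fsym": symbol, "tsym": "USD", "limit": limit})


def closing_prices(payload):
    """Return (UTC datetime, close) pairs, sorted by time, from a decoded JSON payload."""
    rows = sorted(payload["Data"], key=lambda row: row["time"])
    return [
        (datetime.fromtimestamp(row["time"], tz=timezone.utc), row["close"])
        for row in rows
    ]


# Illustrative payload with made-up prices, deliberately out of order.
sample = {
    "Data": [
        {"time": 1526000460, "close": 101.0, "high": 102.0, "low": 99.5, "open": 100.0},
        {"time": 1526000400, "close": 100.0, "high": 100.5, "low": 99.0, "open": 99.5},
    ]
}

prices = closing_prices(sample)
for moment, close in prices:
    print(moment.isoformat(), close)
```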

The smallest time unit we have is one minute. We did not find any free API offering a smaller time unit, such as one second, even though we considered others like Binance/Coinbase/Bitstamp/Kraken/ITBIT. We did find CSV files on this platform (https://www.cryptodatasets.com/platforms/Bitfinex/BTC/), but at the time we visited the website they had not been updated recently enough to match the retrieved tweets.

With the CryptoCompare API we were able to retrieve per-minute prices for the three cryptocurrencies over the following periods:

  • Bitcoin : 11.05.18 - 06.06.18
  • Zilliqa : 18.05.18 - 06.06.18
  • Nexo : 22.05.18 - 06.06.18

Although the prices are grouped per minute in OHLCV (Open High Low Close Volume) form, the closing price alone is sufficient to study the correlation.

Preprocessing

For the preprocessing, we remove all of the useless data from the tweets, such as HTTP links, @pseudo tags, images and videos, and we strip the # from hashtags (#happy -> happy). We then store the cleaned tweets in a CSV file. These files are located in /data/twitter/<CurrencySymbol>/<CurrencyName>_tweets_clean.csv.
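The cleaning step can be sketched with a few regular expressions. This is only an illustration under simple assumptions: the patterns and the `clean_tweet` helper are hypothetical, not the project's actual code, and images and videos are assumed to appear in the text as links.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")  # links, including media links
MENTION_RE = re.compile(r"@\w+")               # @pseudo tags
HASHTAG_RE = re.compile(r"#(\w+)")             # hashtags, capturing the word


def clean_tweet(text):
    """Remove links and @mentions, keep hashtag words without '#', collapse whitespace."""
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    text = HASHTAG_RE.sub(r"\1", text)
    return " ".join(text.split())


print(clean_tweet("@alice Bitcoin is up! #happy https://example.com/chart"))
# prints: Bitcoin is up! happy
```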

Once the cleaned files are obtained, we run the sentiment analysis on the text of each tweet to obtain a sentiment score named compound. This compound is then multiplied by the sum of the user's number of followers and the tweet's number of likes, to weight the sentiment by its audience. Here is the calculation made on the compound to obtain the tweet's score:

tweet's score = (#like + #follower) * compound

If the tweet comes from an influencer, the number of followers will be high, and so will the score. If it is retweeted by an ordinary user with a dozen followers, the score will be small. We do not use the number of retweets because it is the same for the original tweet and for each of its retweets: including it would let an ordinary user have a large impact on the score simply by retweeting a popular tweet.
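The formula above translates directly into a small function. The sketch assumes the compound value has already been computed by VADER, whose compound score lies between -1 and 1; the function name and the sample numbers are made up for illustration.

```python
def tweet_score(likes, followers, compound):
    """Weight the sentiment compound (in [-1, 1]) by the audience of the tweet."""
    return (likes + followers) * compound


# Same sentiment, very different audiences (illustrative numbers).
influencer = tweet_score(likes=1200, followers=800000, compound=0.5)
ordinary = tweet_score(likes=2, followers=12, compound=0.5)
print(influencer, ordinary)
# prints: 400600.0 7.0
```

With the same compound of 0.5, the influencer's tweet weighs several orders of magnitude more than the ordinary user's, which is the intended effect of the weighting.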

Finally, after the tweets have been fully processed, we end up with two features:

  1. the creation date of the tweet
  2. the score of the tweet, where a negative value indicates a bad sentiment, a positive value a good sentiment and zero a neutral sentiment.
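To compare these two features with per-minute prices, the scores have to be brought onto the same time grid. The sketch below sums the scores of all tweets created within the same minute. Summing per minute is an assumption made for illustration, and `scores_per_minute` is a hypothetical helper; the project may aggregate differently.

```python
from collections import defaultdict
from datetime import datetime


def scores_per_minute(tweets):
    """Sum tweet scores per minute; `tweets` is an iterable of (datetime, score) pairs."""
    buckets = defaultdict(float)
    for created_at, score in tweets:
        # Truncate the creation date to the start of its minute.
        minute = created_at.replace(second=0, microsecond=0)
        buckets[minute] += score
    return dict(sorted(buckets.items()))


# Illustrative tweets: two in the first minute, one in the next.
tweets = [
    (datetime(2018, 5, 11, 10, 0, 5), 7.0),
    (datetime(2018, 5, 11, 10, 0, 40), -2.0),
    (datetime(2018, 5, 11, 10, 1, 10), 3.5),
]
per_minute = scores_per_minute(tweets)
for minute, total in per_minute.items():
    print(minute, total)
```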

Techniques and algorithms

This section lists the techniques and algorithms we used to find a correlation and to analyze the sentiment of the tweets.

Cross-correlation analysis

Computing a plain correlation between the two series (tweet scores and cryptocurrency price) is not enough, which is why we use cross-correlation. The difference is that cross-correlation adds a lag, which shifts one of the time series left or right to look for a better correlation. This fits our problem, since we expect price changes to follow the tweets' sentiment rather than coincide with it.

The correlation method can be Pearson, Kendall or Spearman. We tried all of them and they give broadly equivalent results. However, Spearman gives better results overall because it is rank-based and therefore captures monotonic relationships, whether linear or not.
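The idea of a lagged Spearman correlation can be sketched in plain Python as follows. This is an illustration under simple assumptions (two equally long, evenly spaced series), not the project's implementation; all function names and the toy data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            result[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def cross_correlation(scores, prices, max_lag):
    """Map each lag to the Spearman correlation of scores[t] with prices[t + lag]."""
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = scores[: len(scores) - lag], prices[lag:]
        else:
            a, b = scores[-lag:], prices[: len(prices) + lag]
        result[lag] = spearman(a, b)
    return result


# Toy data: the price reacts to the score two steps later, non-linearly.
scores = [1, 5, 2, 8, 3, 9, 4, 7]
prices = [10, 10] + [s ** 2 for s in scores[:-2]]
by_lag = cross_correlation(scores, prices, max_lag=3)
best_lag = max(by_lag, key=by_lag.get)
print(best_lag)
```

In this toy example a positive lag means the price trails the tweet scores. The relationship is quadratic rather than linear, yet the rank-based correlation still peaks at the true lag of 2.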

Sentiment analysis - vaderSentiment

For the analysis of the sentiment we use the VADER algorithm. There is a great implementation in Python called vaderSentiment (https://github.com/cjhutto/vaderSentiment).

Here is a description of the 3 sentiment analysis algorithms that we considered.

Polarity classification

Since the rise of social media, a large part of the current research has been focused on classifying natural language as either positive or negative sentiment. Polarity classification has been found to achieve high accuracy in predicting change or trends in public sentiment, for a myriad of domains (e.g. stock price prediction).

Lexicon-based approach

A lexicon is a collection of features (e.g. words and their sentiment classification). The lexicon-based approach is a common method used in sentiment analysis where a piece of text is compared to a lexicon and attributed sentiment classifications. Lexicons can be complex to create, but once created require little res
