AI-Powered Cryptocurrency Sentiment Analysis

Anthony Rakotovanoelson
8 min read · Oct 20, 2023


The Rise of Cryptocurrency Investing

As a data scientist and cryptocurrency expert, I’ve noticed that more and more people are investing in cryptocurrency; even people with no prior investing experience are drawn to the potential for high returns. However, the volatility of cryptocurrency makes it difficult to decide when to buy or sell. One way to get insights into the market is to read tweets about cryptocurrency, but doing this manually is time-consuming and impractical: roughly 500 million tweets are posted every day, and even the crypto-related portion is far more than anyone could analyze by hand. To overcome this problem, I’m creating an AI tool that analyzes the sentiment of tweets about cryptocurrency and extracts the information relevant to my use case.

How Cryptocurrency Works

To fully understand why it is so important to capture the sentiment of a tweet in the context of cryptocurrency, we need to first understand what a cryptocurrency is.

A cryptocurrency is a digital or virtual currency that uses cryptography for security, which makes it difficult to counterfeit or double-spend. One of the key features of cryptocurrencies is that they are decentralized: they are not issued by any central authority, such as a government or bank, but are created and managed by a network of users. This makes them more resistant to government interference or manipulation. Cryptocurrencies also use blockchain technology to record transactions. A blockchain is a distributed ledger that is secure and transparent: all transactions are recorded on a public database, which makes it difficult to cheat or steal. Finally, many cryptocurrencies have a limited supply, meaning there is a finite number of coins that will ever be created, which helps to prevent inflation.

AI tool development

Now that we know what a cryptocurrency is and why a basic understanding of the field matters before investing in it, the next step is to create a tool that measures the sentiment of a tweet to help us make better decisions. Our AI tool will be a machine learning model that has been pre-trained on a large dataset of tweets and then fine-tuned on a dataset of tweets specifically about cryptocurrency. This allows the model to learn the nuances of the language used in cryptocurrency tweets and to identify their sentiment more accurately. However, before fine-tuning a pre-trained model, we first need to collect a dataset of cryptocurrency tweets.
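As a quick sanity check of our starting point, here is a minimal sketch that runs the off-the-shelf checkpoint we will later fine-tune (cardiffnlp/twitter-roberta-base-sentiment) through the Hugging Face pipeline API; the example tweet is made up purely for illustration.

from transformers import pipeline

# Off-the-shelf Twitter sentiment checkpoint (the one we fine-tune later).
# Its labels map to LABEL_0 = negative, LABEL_1 = neutral, LABEL_2 = positive.
sentiment_pipe = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment",
)

# A made-up crypto tweet, purely for illustration.
print(sentiment_pipe("BTC just broke 30k, time to buy the dip!"))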

Data preprocessing

For this project, we need a dataset of tweets about a cryptocurrency; the BTC tweets sentiment dataset is a good fit. It comes from data.world, which is a great place to find public datasets for building AI tools like ours. So let’s analyze the data.

First, we need to clean the tweets. To do that, we will:

  1. Remove all hashtags (#).
  2. Remove all URLs.
  3. Remove all mentions (@username).
def preprocess_tweet(data, column_text):
    """Remove hashtags, URLs, and @mentions from each tweet in `column_text`.

    Rows whose text is not a string (e.g. NaN) are dropped. Assumes `data`
    has a default RangeIndex, so the positional index `i` matches the row label.
    """
    texts = []
    for i, text in enumerate(tqdm(data[column_text], position=0)):
        try:
            text = text.replace("#", "")  # strip hashtag symbols
            text = re.sub(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', '', text, flags=re.MULTILINE)  # strip URLs
            text = re.sub(r'@\w+ *', '', text, flags=re.MULTILINE)  # strip @mentions
        except AttributeError:
            # text is not a string (e.g. NaN): drop the row and skip it
            data.drop(i, inplace=True)
            continue
        texts.append(text)
    data[column_text] = np.array(texts)
    return data

The purpose of this function is to preprocess the tweets in order to remove any unwanted elements that could affect the sentiment analysis. By removing hashtags, URLs, and mentions, the function ensures that the sentiment analysis is focused on the actual text of the tweet.

btc_sentiment = pd.read_csv("./BTC_tweets_daily_example.csv")
btc_sentiment.drop("Unnamed: 0", inplace=True, axis=1)
btc_sentiment = preprocess_tweet(btc_sentiment, 'Tweet')
btc_sentiment.dropna(axis=0, inplace=True)
btc_sentiment.drop_duplicates(inplace=True)
btc_sentiment.head()

The Tweet and sent_score columns are the two main columns needed for fine-tuning a model. The last two columns (New_sentiment_score, New_sentiment_state) were added by an NLP model to compare against the original sentiments, as described on the dataset’s page. Since our goal is to create an AI tool that analyzes the sentiment of a tweet ourselves, we do not need these columns.
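Since only Tweet and sent_score are needed, one way to drop the two extra columns explicitly (column names as listed above) is:

# Drop the model-generated sentiment columns from data.world; we only need
# the raw tweet text and the original sentiment score for fine-tuning.
btc_sentiment = btc_sentiment.drop(columns=["New_sentiment_score", "New_sentiment_state"])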

We will use the PyTorch Dataset API to load this dataset. This helps us train the model efficiently: the data is loaded in batches, so RAM is not overloaded, and tokenization is performed only when a batch is actually loaded into memory.

class RobertaTweetDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.tokenizer = tokenizer
        self.features_text = data['Tweet'].values
        self.labels = data['sent_score'].values

    def __len__(self):
        return len(self.features_text)

    def __getitem__(self, idx):
        inputs = self.tokenizer(self.features_text[idx], padding=True)
        # sent_score is -1/0/1; shift to 0/1/2 so it can be used as a class index
        labels = self.labels[idx] + 1
        return {"input_ids": inputs["input_ids"],
                "attention_mask": inputs["attention_mask"],
                "labels": labels}

Now it’s time to train our model. After that, we will analyze tweets from a larger, more realistic dataset.

def load_training(train_dataloader, model, criterion, optimizer):
    total_train_loss = 0
    total_correct = 0

    model.train()
    size = len(train_dataloader.dataset)
    num_batches = len(train_dataloader)

    for batch, data in enumerate(train_dataloader):
        # Move the whole batch to the training device
        for k, v in data.items():
            data[k] = v.to(device)

        preds = model(input_ids=data['input_ids'], attention_mask=data['attention_mask'])
        preds = preds.type(torch.FloatTensor).to(device)
        loss = criterion(preds, data['labels'].type(torch.LongTensor).to(device))
        correct = (torch.argmax(preds, axis=1) == data['labels']).type(torch.float).sum().item()

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        total_train_loss += loss.item()
        total_correct += correct

        if batch % 10 == 0 and batch != 0:
            current = (batch + 1) * len(data['input_ids'])
            print(f"loss: {loss.item():>7f} - accuracy: {correct/BATCH_SIZE} [{current:>5d}/{size:>5d}]")

    return total_train_loss / num_batches, total_correct / size

def load_validation(valid_dataloader, model, criterion):
    model.eval()
    size = len(valid_dataloader.dataset)
    num_batches = len(valid_dataloader)
    val_loss, correct = 0, 0

    with torch.no_grad():
        for data in valid_dataloader:
            for k, v in data.items():
                data[k] = v.to(device)
            preds = model(input_ids=data['input_ids'], attention_mask=data['attention_mask'])
            preds = preds.type(torch.FloatTensor).to(device)
            val_loss += criterion(preds, data['labels'].type(torch.LongTensor).to(device)).item()
            correct += (torch.argmax(preds, axis=1) == data['labels']).type(torch.float).sum().item()

    val_loss /= num_batches
    correct /= size
    print(f"Validation: Accuracy: {(100*correct):>0.1f}%, Avg loss: {val_loss:>0.8f}")

    return val_loss, correct

def training_loop(train_dataloader, valid_dataloader, model, criterion,
                  optimizer, epochs):
    training_stats = []
    min_val_loss = float("inf")
    for t in range(epochs):
        print(f"Epoch {t+1}\n------------------------------")
        train_loss, train_accuracy = load_training(train_dataloader, model, criterion, optimizer)
        val_loss, val_accuracy = load_validation(valid_dataloader, model, criterion)

        # Keep the checkpoint with the lowest validation loss
        if val_loss < min_val_loss:
            min_val_loss = val_loss
            print("Saving the model")
            torch.save(model.state_dict(), model_dir)

        training_stats.append({
            "epoch": t + 1,
            "Training Loss": train_loss,
            "Valid. Loss": val_loss,
            "Training Accuracy": train_accuracy,
            "Valid. Accuracy": val_accuracy,
        })
    print("Done!")
    return training_stats

Notice that the pretrained model we will fine-tune here is the cardiffnlp/twitter-roberta-base-sentiment, a RoBERTa model that has been fine-tuned on a dataset of tweets for sentiment analysis. It is available on the Hugging Face Hub.

class RobertaBitcoinTweetSentiment(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = AutoModel.from_pretrained(model)
        self.fc1 = nn.Linear(768, 64)   # 768 = RoBERTa-base hidden size
        self.fc2 = nn.Linear(64, 3)     # 3 sentiment classes
        self.dropout = nn.Dropout(0.2)

    def forward(self, input_ids, attention_mask):
        # Use the pooled representation returned by the RoBERTa backbone
        _, features = self.model(input_ids, attention_mask, return_dict=False)
        x = self.fc1(self.dropout(features))
        x = F.relu(x)
        x = self.fc2(self.dropout(x))
        return x

This architecture allows us to fine-tune a pretrained model on a new dataset: the pretrained backbone provides a good starting point, the two fully connected layers adapt the model to our specific task, and the dropout layer helps prevent overfitting. Note that I have provided a notebook here. Before looking at the final result of the training step, let’s see how the pieces fit together.
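The glue code that wires everything together lives in the notebook. Below is a minimal sketch of how the tokenizer, datasets, dataloaders, loss, and optimizer might be set up before calling training_loop; the train/validation split, batch size, learning rate, number of epochs, and checkpoint path are illustrative assumptions, not values taken from this article.

from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, DataCollatorWithPadding
import torch
import torch.nn as nn

MODEL_NAME = "cardiffnlp/twitter-roberta-base-sentiment"
BATCH_SIZE = 32                              # assumed value
model_dir = "./roberta_btc_sentiment.pt"     # assumed checkpoint path
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
collator = DataCollatorWithPadding(tokenizer)

# Assumed 90/10 train/validation split of the cleaned dataset.
train_df, valid_df = train_test_split(btc_sentiment, test_size=0.1, random_state=42)

train_loader = DataLoader(RobertaTweetDataset(train_df, tokenizer),
                          batch_size=BATCH_SIZE, shuffle=True, collate_fn=collator)
valid_loader = DataLoader(RobertaTweetDataset(valid_df, tokenizer),
                          batch_size=BATCH_SIZE, shuffle=False, collate_fn=collator)

model = RobertaBitcoinTweetSentiment(MODEL_NAME).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)   # assumed learning rate

training_stats = training_loop(train_loader, valid_loader, model, criterion,
                               optimizer, epochs=3)          # assumed number of epochs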

The final model will be the one with the lowest validation loss, since the training loop saves a checkpoint whenever the validation loss improves.
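At inference time, the saved weights need to be loaded back into the model. A minimal sketch, reusing the MODEL_NAME, model_dir, and device names assumed in the setup sketch above:

# Reload the checkpoint with the lowest validation loss for inference.
sentiment_model = RobertaBitcoinTweetSentiment(MODEL_NAME)
sentiment_model.load_state_dict(torch.load(model_dir, map_location=device))
sentiment_model = sentiment_model.to(device)
sentiment_model.eval()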

Applying Sentiment Analysis to Bitcoin Tweets

Now that we have a model that can perform sentiment analysis on Bitcoin tweets, let’s apply it to a larger dataset that is more closely aligned with real-world scenarios. The dataset comes from the Kaggle community and contains tweet data about BTC that was last updated six months before this writing. However, we only read the first 10,000,000 rows due to resource constraints and performance considerations.

tweets = pd.read_csv(TWEET_DIR, nrows=10000000)
tweets.info()
<ipython-input-9-53338e1948f6>:1: DtypeWarning: Columns (1,2,3,4,5,6,7,8,9,10,11,12) have mixed types. Specify dtype option on import or set low_memory=False.
tweets = pd.read_csv(TWEET_DIR, nrows=10000000)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 13 columns):
# Column Dtype
--- ------ -----
0 user_name object
1 user_location object
2 user_description object
3 user_created object
4 user_followers object
5 user_friends object
6 user_favourites object
7 user_verified object
8 date object
9 text object
10 hashtags object
11 source object
12 is_retweet object
dtypes: object(13)
memory usage: 991.8+ MB

Upon reviewing the dataset containing tweet data related to Bitcoin, several data issues and necessary data type transformations have been identified. These observations are crucial for ensuring data accuracy and compatibility with subsequent sentiment analysis tasks.

  • The user_friends and user_favourites columns should be numeric types, but they are currently object types. This means that they are not being stored efficiently and could be taking up more memory than necessary.
  • The date column should be a datetime type, but it is currently an object type. This means that it is not being stored in a way that is easy to parse and analyze.
  • The user_verified column should be a boolean type, but it is currently an object type. This means that it is not being stored in a way that is easy to use for machine learning algorithms.

So let’s fix that.

# Convert numeric and date columns, coercing malformed values to NaN/NaT
tweets[['user_friends', 'user_favourites']] = tweets[['user_friends', 'user_favourites']].apply(
    pd.to_numeric, errors='coerce', downcast='float')
tweets['date'] = pd.to_datetime(tweets['date'], errors='coerce')
tweets['user_verified'] = tweets['user_verified'].map({'True': True, 'False': False})  # replace strings with booleans

tweets.describe().apply(lambda s: s.apply('{0:.3f}'.format))

Now that our dataset is clean, we are ready to analyze the sentiment of the 10,000,000 tweets we loaded. As in the fine-tuning step, we will use the PyTorch Dataset API, which speeds up inference and improves memory efficiency. I have also preprocessed the tweets with the preprocess_tweet function, as shown below.
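The preprocessing call itself is not shown in the original snippet; assuming the tweet text lives in the text column (as in the info() output above), it would look like this:

# Reuse the cleaning function from the fine-tuning step on the Kaggle dataset.
tweets = preprocess_tweet(tweets, 'text')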

def inference(dataloader, model):
    model.eval()
    tweet_sentiments = []
    with torch.no_grad():
        for data in tqdm(dataloader):
            for k, v in data.items():
                data[k] = v.to(device)
            preds = model(input_ids=data['input_ids'], attention_mask=data['attention_mask'])
            # Keep the predicted class index for each tweet in the batch
            tweet_sentiments.extend(torch.argmax(preds, axis=1).to('cpu').tolist())
    return tweet_sentiments

class TweetDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.tokenizer = tokenizer
        self.features_text = data['text'].values

    def __len__(self):
        return len(self.features_text)

    def __getitem__(self, idx):
        inputs = self.tokenizer(self.features_text[idx], padding=True)
        return {"input_ids": inputs["input_ids"], "attention_mask": inputs["attention_mask"]}


tweet_dataset = TweetDataset(tweets, tokenizer)
data_collator = DataCollatorWithPadding(tokenizer)
tweet_dataloader = DataLoader(tweet_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=data_collator)

tweet_sentiments = inference(tweet_dataloader, sentiment_model)
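The inference function returns class indices rather than readable labels. A small sketch of how the predictions might be mapped back to sentiment names and attached to the DataFrame, assuming the 0 = negative, 1 = neutral, 2 = positive ordering implied by the sent_score + 1 shift used during fine-tuning:

# Map class indices back to sentiment names and attach them to the tweets.
label_names = {0: "negative", 1: "neutral", 2: "positive"}
tweets["predicted_sentiment"] = [label_names[s] for s in tweet_sentiments]

# Share of each sentiment across the analyzed tweets
print(tweets["predicted_sentiment"].value_counts(normalize=True))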

Conclusion

We have shown that applying a model pretrained on a generic tweet dataset to sentiment analysis of crypto tweets is not straightforward, because the language used in crypto tweets often differs from the language used in regular tweets. To address this, we fine-tuned the pretrained twitter-roberta-base-sentiment model on a dataset consisting only of crypto tweets. We achieved good results on the sentiment analysis task, even without sophisticated training methods or a large number of epochs. However, our work has some limitations: we used only a small dataset of crypto tweets, and we did not evaluate the model on a held-out test set. Future work could address these limitations by using a larger dataset of crypto tweets and by testing whether sequences of tweets can influence the price of Bitcoin in US dollars.
