Whether it is about smart homes, cancer detection or the day when machines will take over our world, it is all connected to one buzzword: machine learning. But as much as machine learning is often depicted as some kind of black magic, playing around with it is actually easy. (The hard part is getting it to perfection, and I am still working on that.)
I have been using mostly scikit-learn, a well-documented library for Python. Other applications and libraries that you might want to look at are:
- TensorFlow for Python, from Google
- WEKA for Java
- CRAN, the package archive for R
The reason I started using scikit-learn was the topic of my master's thesis. I wanted to build a machine-learning-based application that monitors the German Twittersphere and estimates which party is most heavily pushed by bots on Twitter. But before I built my application, I asked myself two questions:
Is this a case for supervised learning or unsupervised learning?
Supervised learning means that the algorithm is trained on a dataset with defined characteristics. Essentially, it is like giving a kid a basket full of fruits and some other items. There is a sticker attached to each item telling whether it is a fruit or not. You let the kid learn from this basket and then give it a second basket and check how well it can spot the items in the second basket that are not fruits.
Unsupervised learning essentially means that you skip step one. You do not give the algorithm prior training but ask it right away to find outliers, in our example the non-fruit items.
In my case, I used supervised learning. I classified a tweet as a clear bot tweet if it came from an account that tweeted more than 24 times per day, and I classified a tweet as a non-bot tweet if it came from an account that tweeted only once per day. I then trained an algorithm on these classified tweets so that it could learn to distinguish between the two.
Do I need to interpret the outcome of the algorithm?
There are models in machine learning for which you can easily understand the reasoning the algorithm has used. One example is a regression analysis showing that in areas with higher unemployment, fewer people voted for the Social Democrats. Another is a decision tree that first asks whether you are a boy or a girl and then asks whether you are a third grader or not to estimate how likely it is that you will kill a plant.
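To show what "readable reasoning" means in practice, here is a minimal sketch using scikit-learn's decision tree on made-up toy data (the feature names and labels are illustrative, not from my actual project):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (made up): two yes/no answers per person,
# label 1 = the plant probably dies.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Unlike a neural network, the learned rules are human-readable:
print(export_text(tree, feature_names=["is_boy", "is_third_grader"]))
```

The printed output is a plain-text list of if/then splits, which is exactly why such models count as interpretable.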
But there are also more complicated models like support vector machines and neural networks. And I have to say that I struggle to imagine a 15-dimensional space with a 14-dimensional surface that divides one group of dots from another. However, if such a method does the job and I do not need to understand the solution, that is fine.
This is why I picked a so-called random forest classifier: it turned out to be the most accurate algorithm, and I did not need to interpret its reasoning. It is essentially a method in which you build different decision trees on subsamples of the dataset and then create an average of all these decision trees (I have no idea what this average of trees looks like).
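In scikit-learn, the whole ensemble lives behind one class. A minimal sketch on made-up toy data (the numbers here are invented just to show the mechanics):

```python
from sklearn.ensemble import RandomForestClassifier

# Toy data (made up): two features per sample, binary labels
# determined entirely by the first feature.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0, 0], [1, 1]]
y = [0, 0, 1, 1, 0, 1]

# Ten decision trees, each trained on a bootstrap subsample.
forest = RandomForestClassifier(n_estimators=10, random_state=42)
forest.fit(X, y)

# The individual trees are accessible...
print(len(forest.estimators_))  # 10
# ...but predictions come from the whole ensemble, which averages
# the trees' predicted class probabilities.
print(forest.predict([[1, 0]]))
```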
Now, after answering these two questions I can build my application.
Step 1: Create a dataset for bots and non-bot tweets
I take a dataset of the last 24 hours of tweets that I monitored. Then I simply run a for loop to check which account IDs appear in the dataset more than 24 times and which appear only once, and save both lists. I then use these lists to build a dataset of 1,000 bot tweets and 1,000 non-bot tweets.
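A minimal sketch of that counting step, assuming each monitored tweet is a dict with an "account_id" key (the real dataset's field names may differ):

```python
from collections import Counter

# Assumed structure (illustrative): one bot-like account with 30 tweets
# in 24 hours, one normal account with a single tweet.
tweets_last_24h = [{"account_id": "bot_account", "text": "..."}] * 30 + [
    {"account_id": "normal_account", "text": "..."}
]

counts = Counter(t["account_id"] for t in tweets_last_24h)

# More than 24 tweets per day -> treat the account as a bot;
# exactly one tweet per day -> treat it as a non-bot.
bot_ids = [acc for acc, n in counts.items() if n > 24]
nonbot_ids = [acc for acc, n in counts.items() if n == 1]

print(bot_ids, nonbot_ids)
```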
Step 2: Convert the tweets into a set of numbers
In the next step, I converted these 2,000 tweets into lists of numbers by counting properties like the number of characters, the number of unique words, the number of exclamation marks etc. This is called a stylometric approach.
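A minimal sketch of such a feature extractor; the three counts below are just the ones named above, and the function name is my own:

```python
def stylometric_features(text):
    """Turn one tweet into a small vector of stylometric counts."""
    words = text.split()
    return [
        len(text),                           # number of characters
        len(set(w.lower() for w in words)),  # number of unique words
        text.count("!"),                     # number of exclamation marks
    ]

print(stylometric_features("Vote now! Vote NOW!"))  # [19, 2, 2]
```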
Step 3: Prepare the dataset for being read in by the algorithm
Now we have to take care of a couple of things. First, we want to make sure that the dependent variable (the one we want to predict), the bot-or-not criterion, is stored separately from the rest of the dataset. Then we split the dataset into two parts: the training dataset and the test dataset, on which we will check how accurate the algorithm is.
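With scikit-learn, both steps fit in a few lines. The feature vectors and labels below are made-up stand-ins:

```python
from sklearn.model_selection import train_test_split

# Assumed layout (illustrative): feature vectors in one list, the
# bot-or-not labels (1 = bot, 0 = non-bot) stored separately.
features = [[10, 5, 0], [200, 40, 7], [15, 6, 1], [180, 35, 5]] * 10
labels = [0, 1, 0, 1] * 10

# Hold out a quarter of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=42
)

print(len(X_train), len(X_test))  # 30 10
```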
Step 4: Build the machine learning model, let it predict and check its accuracy
With two lines of code, we train the algorithm on the training dataset and test its performance by running it against the test dataset. The performance stats we look at are the accuracy, the mean absolute error and the confusion matrix. The documentation is here, but essentially, the more values lie on the diagonal from the upper left to the lower right of the confusion matrix, the better your algorithm.
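A minimal sketch of this train-and-evaluate step, again on made-up numbers standing in for the real split:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, mean_absolute_error

# Toy stand-ins for the real training/test split (made-up values).
X_train = [[10, 5, 0], [200, 40, 7], [15, 6, 1], [180, 35, 5]] * 5
y_train = [0, 1, 0, 1] * 5
X_test = [[12, 4, 0], [190, 38, 6]]
y_test = [0, 1]

# The two essential lines: train, then predict.
clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(mean_absolute_error(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))  # diagonal = correct predictions
```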
Step 5: Use the model to make predictions for a new, unknown dataset
Finally, we run the random forest classifier against a new dataset to predict whether each tweet is a bot tweet or not. In my case, I actually used a combined approach: a tweet was classified as a bot tweet if the account had tweeted more than 24 times per day or if the algorithm determined that it was a bot.
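The combined rule boils down to a simple "or"; the function name and arguments below are my own illustration of it:

```python
def is_bot_tweet(tweets_per_day, model_says_bot):
    """Combined rule: frequency threshold OR classifier verdict."""
    return tweets_per_day > 24 or model_says_bot

print(is_bot_tweet(30, False))  # True: frequency alone triggers the flag
print(is_bot_tweet(3, True))    # True: the model alone triggers it too
print(is_bot_tweet(3, False))   # False: neither criterion is met
```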
Conclusion: What could be done better?
In this code, I look at the stylometric features of tweets. However, it might have been more fruitful to look for distinctive features on the accounts' profile pages rather than at each individual tweet. This is essentially the approach SRF Data took in its Instagram influencer story, which also used machine learning to determine how many of the influencers' followers were probably fake.

---
Want to try Halukas’s example yourself? Download the code and datasets here.