Premier League Twitter Analysis with Python Tweepy
In this article, we'll build a dataset from the ground up by collecting real-time Tweets from around the globe. Welcome to Python Data Science December #6.
Social media has become an essential platform in today's sports arena, allowing clubs to connect and engage with their fans. In this analysis, we will delve into the Twitter interactions of the six leading Premier League football clubs in England.
This piece is part of my ongoing series, Python — Data Science December. Comprehensive resources, datasets, and the necessary Python libraries and installations can be found at the end in the Summary & Resources section.
Creating a Twitter App
Warning: This article was written prior to Elon Musk's acquisition of Twitter. There have been numerous reports regarding chaotic events at Twitter recently. While my code remains functional, I cannot guarantee that every step will work seamlessly for you.
To fetch Tweets from Twitter using their API, you must:
- Have a Twitter account (register at Twitter.com).
- Create an application on the Twitter Developer Portal. Let's walk through this process together.
Head over to the Developer Portal and initiate a new project.
Twitter will assist you throughout this setup. You need to configure three items:
- Project Name: Code&Dogs
- Use Case: Exploring the API
- Project Description: Exploring Twitter API with Python
Once this is complete, you can establish an App within your project, again following three steps:
- App Environment: Development (alternatively, you could select Staging or Production)
- App Name: DataScienceDecember
- Keys & Tokens: You will receive an API Key, API Key Secret, and Bearer Token. Make sure to copy these or write them down.
With the Project and App set up, we can utilize them with a monthly limit of 2,000,000 Tweets.
In addition to your API Key, API Key Secret, and Bearer Token, you'll require two more credentials found in the Authentication Tokens section of your App: the Access Token and Access Token Secret. I recommend storing all keys and tokens in a dedicated credentials.py file.
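A minimal credentials.py might look like the sketch below; the variable names are just a convention assumed here, not anything prescribed by Twitter, so adapt them to your liking.

```python
# credentials.py -- assumed variable names; keep this file out of version control
API_KEY = 'your-api-key'
API_KEY_SECRET = 'your-api-key-secret'
BEARER_TOKEN = 'your-bearer-token'
ACCESS_TOKEN = 'your-access-token'
ACCESS_TOKEN_SECRET = 'your-access-token-secret'
```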
Exploring Tweepy
Let's dive into Tweepy and learn how to authenticate our Twitter App using Python; a sketch of the code follows the steps below.
- Import the tweepy library and the previously created credentials.py (lines 1-2).
- Load all keys and secrets from credentials.py (lines 4-8).
- Establish a tweepy OAuthHandler with your credentials (lines 10-11) and connect to the Twitter API (line 12).
- As a preliminary test, send a new tweet to your timeline (lines 14-15) and verify it on Twitter.
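Putting the steps above together, a minimal sketch of the authentication code could look like this (it assumes the credentials.py variable names from the previous section):

```python
import tweepy
import credentials

# Load all keys and secrets from credentials.py
api_key = credentials.API_KEY
api_key_secret = credentials.API_KEY_SECRET
access_token = credentials.ACCESS_TOKEN
access_token_secret = credentials.ACCESS_TOKEN_SECRET

# Set up OAuth 1.0a user authentication and connect to the Twitter API
auth = tweepy.OAuthHandler(api_key, api_key_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# Preliminary test: post a Tweet to your own timeline (requires Read/Write access)
api.update_status('Hello Twitter, greetings from Python!')
```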
Tip: By default, your Twitter App is set to Read-only access. You can modify this in the App settings to Read/Write, but you may need to regenerate your access token and secret. For the next steps, you can skip this if you do not require write access.
Next, let's investigate how to read Tweets from any user's timeline.
To begin, we select Elon Musk's Twitter account (line 1) and read his timeline with the following parameters, storing the results in the tweets variable (lines 2-7):
- count=10: Specifies the number of Tweets to retrieve.
- include_rts=False: Excludes Re-Tweets from the timeline.
- exclude_replies=True: Prevents replies from appearing in the results.
- tweet_mode='extended': Returns the full text of Tweets longer than 140 characters instead of a truncated version.
Once we have all the Tweets, we loop through them and print details such as text and creation date (lines 9-13).
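As a rough sketch, the timeline request and the print loop described above could look like this (it reuses the api object from the authentication step):

```python
userID = 'elonmusk'

# Fetch the 10 most recent original Tweets (no retweets, no replies, full text)
tweets = api.user_timeline(
    screen_name=userID,
    count=10,
    include_rts=False,
    exclude_replies=True,
    tweet_mode='extended'
)

# Print the creation date and text of each Tweet
for tweet in tweets:
    print(tweet.created_at)
    print(tweet.full_text)
    print('---')
```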
The results confirm that we accurately retrieved data from Elon Musk's Twitter profile.
Building Our Dataset
Now that we understand how to use Tweepy to obtain Tweets, let's focus on building a dataset to compare the Twitter activities of the top six Premier League football clubs.
Starting with Manchester United, we can break the code into manageable sections:
- We already know how to read Tweets from a user's timeline. We designate userID = 'ManUtd', which is the official Twitter account for Manchester United (line 1), and increase the count to 200 (line 3). The rest remains unchanged (lines 1-6).
- We store all Tweets in the TweetCollector list (lines 8-9) and save the id of the last fetched Tweet (line 10).
Next, we will continuously request more Tweets (lines 12-26) until no more are available (lines 20-22); the full loop is sketched after the steps below. Keep in mind that Twitter's rate limits restrict how many requests you can make within a given time window.
- Begin the while loop (line 12).
- Request Tweets from the user ManUtd, starting from where we last stopped using the max_id parameter (line 18). If there are no more Tweets, we exit the loop (lines 20-22).
- Append the Tweets to the TweetCollector list, save the id of the last fetched Tweet, and display the total number of Tweets collected so far (lines 24-26).
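Here is a sketch of the whole collection loop under these assumptions; userID, TweetCollector, and the request parameters are taken from the description above, the rest is a reconstruction:

```python
userID = 'ManUtd'

# Initial request: the 200 most recent original Tweets
tweets = api.user_timeline(
    screen_name=userID,
    count=200,
    include_rts=False,
    exclude_replies=True,
    tweet_mode='extended'
)

TweetCollector = list(tweets)
oldest_id = tweets[-1].id  # id of the last fetched Tweet

while True:
    # Continue where we stopped; max_id is inclusive, hence the -1
    tweets = api.user_timeline(
        screen_name=userID,
        count=200,
        include_rts=False,
        exclude_replies=True,
        max_id=oldest_id - 1,
        tweet_mode='extended'
    )
    if len(tweets) == 0:
        break  # no more Tweets available, leave the loop
    TweetCollector.extend(tweets)
    oldest_id = tweets[-1].id
    print(f'{len(TweetCollector)} Tweets downloaded so far')
```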
The loop will repeat, and it appears that ~3,000 Tweets is the maximum we can obtain.
Once the loop is complete, we process the full list of Tweets stored in TweetCollector (lines 1-7) and split each Tweet into the following components:
- The club name ‘Manchester United’ (line 1).
- The Id, creation date, favorite count, and retweet count (lines 2-5).
- The Tweet text (line 6), ensuring to remove any line breaks.
We then save this information in tweetsHelper.
Finally, we will save tweetsHelper into a pandas DataFrame, add the necessary headers, and export it as a CSV file.
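A possible sketch of this processing step is shown below; the column headers and the output file name are assumptions for illustration, not necessarily the ones used in the original script:

```python
import pandas as pd

# Split each Tweet into club name, id, creation date, favorites, retweets and text
tweetsHelper = [
    [
        'Manchester United',
        tweet.id,
        tweet.created_at,
        tweet.favorite_count,
        tweet.retweet_count,
        tweet.full_text.replace('\n', ' '),  # remove line breaks from the text
    ]
    for tweet in TweetCollector
]

# Save everything in a DataFrame, add headers and export it as a CSV file
df = pd.DataFrame(
    tweetsHelper,
    columns=['Club', 'Id', 'Created', 'Favorites', 'Retweets', 'Text'],
)
df.to_csv('ManUtd.csv', index=False)
```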
Let's take a brief look at the generated CSV file.
Now, we will replicate the process for Liverpool F.C. by simply changing the userID variable to 'LFC' and running the script again. Once completed, we will have a CSV file with Tweets from Liverpool F.C.
Next, we will perform the same steps for the remaining four clubs of the Premier League's top six:
- Arsenal (userID = 'Arsenal')
- Chelsea (userID = 'ChelseaFC')
- Manchester City (userID = 'ManCity')
- Tottenham (userID = 'SpursOfficial')
This will result in six distinct CSV files, one for each club.
The final task is to combine these datasets into a single file. We can easily read all CSV files into separate DataFrames (lines 1-6), merge them into one dataset, and save it as a single large CSV file.
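As a sketch, the merging step could look like the following; the file names are assumed from the userID values used above:

```python
import pandas as pd

# One CSV file per club, written by the collection script (assumed file names)
files = ['ManUtd.csv', 'LFC.csv', 'Arsenal.csv',
         'ChelseaFC.csv', 'ManCity.csv', 'SpursOfficial.csv']

# Read every file into its own DataFrame, concatenate them and save the result
frames = [pd.read_csv(f) for f in files]
combined = pd.concat(frames, ignore_index=True)
combined.to_csv('PremierLeagueTop6.csv', index=False)
```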
Let's quickly examine the structure and value counts of the combined data.
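For example, assuming the combined DataFrame and the Club column from the sketch above:

```python
combined.info()                         # structure: columns, dtypes, row count
print(combined['Club'].value_counts())  # number of Tweets per club
```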
We observe that we have a total of 16,598 Tweets organized across six columns. Each Twitter account has a roughly comparable number of Tweets (between ~2,700 and ~3,000), with the exception of Manchester City, which has only ~2,000 Tweets.
The discrepancy is likely explained by Twitter's documentation: the user_timeline method only returns a user's 3,200 most recent Tweets, and replies and retweets count against that limit even though they are filtered out of the results with exclude_replies=True and include_rts=False.
That's all for today; see you tomorrow!
Summary & Resources
This marks the sixth installment of Python Data Science December. We constructed our dataset by extracting Tweets from the top six Premier League football clubs in England.
To stay updated with my stories and support me, consider registering on Medium. If you have questions or need assistance, feel free to leave a comment—I'll be sure to respond.
You can access the complete Python code along with the datasets (totaling 16,598 rows) for free on GitHub. Additionally, I have prepared an advanced dataset containing Tweets from ALL Premier League clubs (totaling 53,123 rows), which will be shared exclusively on Patreon for a small donation.
- GitHub (free) — full code & datasets (16,598 rows total)
- Patreon ($3/month for regular & advanced content) — advanced dataset with Tweets from ALL Premier League clubs (53,123 rows total)