Make America Tweet Again Part 1

Part 1 | Getting Our Data

A Tutorial on Text Analysis, Twitter Mining, and Automated Twitter Bots.

In the spirit of election season, I figured a good tutorial to demonstrate would be a Donald Trump-style Twitter bot.  To do this we’ll touch on Twitter mining, build a system to generate random Trump-like sentences, and set up an automated process to post those sentences to Twitter.  No matter what your political beliefs may be, we can all agree that Donald Trump is certainly unique.  This especially applies to his tweets, which is what makes him such a fun subject for this program.

At the end of this tutorial we will have done the following:

  1. Set up an appropriate Virtual Environment for data collection and analysis.
  2. Access the Twitter API through Python.
  3. Save the data in a format that is easy to use and analyze.
  4. Create a system to randomly generate new data that is recognizable as being a “Trump-like” tweet.
  5. Create an automated process to post this randomly generated data to Twitter.

What You’ll Need

  • Python 3
  • Virtual Env/Pip
  • A bash shell

Source Code will also be posted to my GitHub Page for MakeAmericaTweetAgain.


Getting Started

First let’s get our initial setup going.  Open your shell and create a home directory for the project.
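Something like the following works (the directory name is just a suggestion):

```shell
# create a project directory in your home folder and move into it
mkdir "$HOME/makeAmericaTweetAgain"
cd "$HOME/makeAmericaTweetAgain"
```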

This is where we will keep all of our files.  Now we will use Virtualenv to create a virtual environment where we can install the tools that will help us access Twitter.

This will create an environment called (env).  For more information about what Virtualenv does and how it can help you, check out the Virtualenv site.
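Activate it with:

```shell
# activate the environment; your prompt should now start with (env)
source env/bin/activate
```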

You should see (env) to the left of your bash username.  This means you are now in the Virtualenv environment and can install libraries for this project that won’t conflict with any other versions of Python or Python libraries on your computer.
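With the environment active, run:

```shell
# list every library installed in the active environment
pip freeze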

This command lets you see all the libraries that are installed in our environment.  There shouldn’t be much there right now.  One of the cool features of Virtualenv is that we can save the output of this command so that other users can recreate the same environment on their computer.
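Now install Tweepy and save a snapshot of the environment:

```shell
# install the Tweepy library into the active environment
pip install tweepy
# record everything installed so others can recreate this environment
pip freeze > requirements.txt
```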

This will install Tweepy into our environment.  It will handle a lot of the messy Twitter stuff, like authentication, for us.  The second command will execute the freeze command and save the output in a file called requirements.txt for use in recreating our environment on other machines.  Check that the file was created correctly by executing:
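```shell
cat requirements.txt
```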

You can see that tweepy and some other modules have been installed to your environment.


Accessing Twitter

Setting up a Twitter Application is very quick and easy.  Create an account or log into whatever account you would like to have your program tweet from.  You can go to the Twitter Application page to start the process of creating an application that is allowed to use your account.  Click on the “Create A New App” button and fill out the form.  Make sure your app has read and write permissions.  Once you fill out the form, go to the tab at the top that says “Keys and Access Tokens.”  This is what lets our app access Twitter.  Don’t give these keys out!  Anybody who has them can access your application.

At the bottom of this page click the “Create my access tokens” button.  Make note of your Consumer Key, Consumer Secret Key, Access Token, and Access Secret Token.  Create a Python file called twitterKeys.py  and add the following code:
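A sketch of what twitterKeys.py can look like (the function name getKeys and the placeholder strings are assumptions; paste in your own keys):

```python
# twitterKeys.py
# Paste your own values from the "Keys and Access Tokens" tab here.
CONSUMER_KEY = 'your-consumer-key'
CONSUMER_SECRET = 'your-consumer-secret'
ACCESS_TOKEN = 'your-access-token'
ACCESS_SECRET = 'your-access-secret'


def getKeys():
    # return all four keys at once so callers can unpack them in one line
    return (CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_SECRET)
```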

Here we add in our keys and provide a method to return them as a tuple.  I chose to do this because we can easily paste new keys into the global variables above without changing any other code.  It also lets me use the keys in my app and keep them hidden away at the same time!

We’re going to use these keys to get as many tweets as we can from @realDonaldTrump, to create a system that randomly generates Trump-isms for our Twitter bot.  Launch the Python shell and type the following code:
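A sketch of that shell session, matching the walkthrough below (getKeys is the assumed name of the function from twitterKeys.py):

```python
import tweepy
import twitterKeys

# unpack the tuple of keys from our twitterKeys module
CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_SECRET = twitterKeys.getKeys()

# load the keys into an OAuthHandler, then build the API object
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
api = tweepy.API(auth)
```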

This is all we need to interact with Twitter in Python.  First we import Tweepy to handle authentication and interaction with Twitter.  Second we unpack the tuple from the code we wrote in “twitterKeys” to load the keys into our variables.  Finally we create a Tweepy OAuthHandler object to load in our keys, and use it to create our Tweepy API object.  We will access Twitter exclusively through this newly created api object.  To read more about authentication with Tweepy, check out the Tweepy Documentation.

Let’s test out the API to make sure it’s working.
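For example, still in the same Python shell (these calls hit the live API, so they need the valid keys and api object from the previous step):

```python
# the timeline of the account the keys belong to
for tweet in api.home_timeline():
    print(tweet.text)

# a sample of recent tweets from a user of our choosing
for tweet in api.user_timeline(screen_name='realDonaldTrump'):
    print(tweet.text)
```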

These are a few examples of how we can use the api object to access Twitter: we can see the timeline of the account associated with the keys, or pull a sample of tweets from a user of our choosing.  The data won’t look very pretty.  It will be a nasty mess of nested JSON.  We will deal with that later.

So we don’t have to repeat ourselves, I’m going to move the code we typed into the Python shell into a file called api.py.  Add the following code to this file:
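api.py can look like this (again assuming the getKeys function name from twitterKeys.py):

```python
# api.py
import tweepy
import twitterKeys


def getAPI():
    """Return an authenticated Tweepy API object."""
    consumer_key, consumer_secret, access_token, access_secret = twitterKeys.getKeys()
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    return tweepy.API(auth)
```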

Now we have this getAPI()  function ready to go anytime we need to use our Twitter API.  This will certainly come in handy later.


Getting Our Tweets

If this is your first time working with third-party APIs, you’ll quickly realize that these companies guard their data fiercely.  They place limits on how much you can access and how often you can ask for it.  So we’ll have to be careful about what we request from the API and how often we do it.  Create a file called getTweets.py and enter the following code:
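A sketch of this first version of getTweets.py (the exact wording of the messages is a guess):

```python
# getTweets.py
import sys


def main():
    try:
        # argv[0] is the file name; argv[1] is the Twitter handle
        userName = sys.argv[1]
        print(userName)
    except IndexError:
        print('Please provide a Twitter handle, e.g. python getTweets.py realDonaldTrump')
    except Exception as error:
        # if anything else breaks, show what went wrong
        print(error)


if __name__ == '__main__':
    main()
```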

I wanted this program to accept arguments from the command line so that people can use it for their own purposes, whether they are doing this tutorial or something of their own.   The user will run the program with the following command:
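```shell
python getTweets.py realDonaldTrump
```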

The sys module stores the file name in argv[0], and any other command-line arguments in the slots after it.

Because we are working with an API that could possibly break or return errors, I put the main statement in a try/except block.  If something breaks, we’ll get to see what error occurred because of the print statement.  Go ahead and try it out.  You should see the program print out whatever string you entered.  If you try to execute the program without an argument, you’ll see the IndexError message.  If you’re curious about the if __name__ == '__main__':  part, see this Stack Overflow Link.

We now have Twitter access, and we have a Twitter handle from the command line.  Now we have to combine the two and use our API to get as many tweets as possible.  We have a significant challenge in front of us: the API will only take so many requests and return so much data per 15-minute window, according to the Twitter Documentation on Rate Limiting.  We can also see the rate limits for the specific user_timeline() method that we’re going to be using.

According to my understanding of the documentation, our application gets 300 requests per 15-minute window.  15 minutes / 300 requests = 0.05 minutes per request, and 0.05 minutes * 60 = one request every 3 seconds.  So to be safe we’ll delay each call by 5 seconds, and we should be able to avoid having our requests denied due to rate limits.  Open up getTweets.py and make the following changes:
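Here is a sketch of the updated file, laid out so that the line numbers referenced below match up.  The DELAY and MAX_REQUESTS values are guesses, and the - 1 on the index update reflects the fix pointed out in the comments:

```python
import sys
import time
import api


DELAY = 5           # seconds to wait between requests
MAX_REQUESTS = 10   # how many times to call the API


def main():
    try:
        twitterAPI = api.getAPI()
        tweetResult = []
        userName = sys.argv[1]
        tweetIndex = twitterAPI.user_timeline(screen_name=userName, count=1)[0].id
        time.sleep(DELAY)
        for request in range(MAX_REQUESTS):
            tweets = twitterAPI.user_timeline(screen_name=userName, count=200, max_id=tweetIndex, include_rts=False)
            for tweet in tweets:
                tweetResult.append(tweet.text)
            tweetIndex = tweet.id - 1
            time.sleep(DELAY)
        print(tweetResult)
    except IndexError:
        print('Please provide a Twitter handle')
    except Exception as error:
        print(error)


if __name__ == '__main__':
    main()
```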

I’ll go through the changes I’ve made line by line.

  • Line 2:  We import the time library.  This allows us to delay between requests to avoid rate limits.
  • Line 6 – 7:  I’m setting our delay time, and our number of times to loop our program as global variables.  Doing this ensures we can easily change them in one clear spot in our program quickly.  If we’re running into problems later or if we need to test different parameters this will come in handy.
  • Line 13:  This is the variable that will hold the final list of tweets.
  • Line 15:  We get the id of the last tweet in our list.  We only requested one tweet (the most recent), so we can get its id.  Each time we loop we’ll update this value so we’re not getting the same tweets over and over.  Because we get a list of length 1, we can access the Tweepy object inside by indexing to 0, then extract the id through its named member.
  • Line 16 & 22:  We call this method to stall the program the required amount of seconds to avoid rate limits.
  • Line 18:  We make another request, this time getting the maximum number of tweets in return.  We exclude retweets so we don’t have a lot of duplicates in our results.  This is the call where we use our tweetIndex variable as the max_id argument.
  • Line 19 & 20:  We loop through our results and store the text of the tweet in our tweetResult  variable.
  • Line 21:  Update tweetIndex to reflect the last tweet we saved.  (Subtracting 1 from the id keeps max_id from returning that same tweet again on the next request.)

Go ahead and test out the program!  Did you get a list of tweets to print out?  How many results did you get?  Everything seems to be working well except for one thing.  Our program gathers all the data, but we have no way of using this data later.  Let’s fix that.


Saving Data For Analysis

There are a lot of choices when it comes to saving data for later.  For this project I chose JSON.  You might have worked with JSON in the past because of Javascript, or you might recognize its similarity to Python dictionaries.  Python also has a built-in json module.  Make the following changes to your code:
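The updated file, again laid out to match the line numbers discussed below (tweets.json is an assumed file name):

```python
import sys
import time
import json
import api

DELAY = 5           # seconds to wait between requests
MAX_REQUESTS = 10   # how many times to call the API


def main():
    try:
        twitterAPI = api.getAPI()
        tweetResult = []
        userName = sys.argv[1]
        tweetIndex = twitterAPI.user_timeline(screen_name=userName, count=1)[0].id
        time.sleep(DELAY)
        for request in range(MAX_REQUESTS):
            tweets = twitterAPI.user_timeline(screen_name=userName, count=200, max_id=tweetIndex, include_rts=False)
            for tweet in tweets:
                tweetResult.append(tweet.text)
            tweetIndex = tweet.id - 1
            time.sleep(DELAY)
    except IndexError:
        print('Please provide a Twitter handle')
    except Exception as error:
        print(error)
    # whether we crashed or finished, save whatever we collected
    finally:
        with open('tweets.json', 'w') as saveFile:
            json.dump(tweetResult, saveFile)


if __name__ == '__main__':
    main()
```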

A couple things that have changed:

  1. Line 3:  We import the JSON module.
  2. Line 23:  I removed the print statements we used for testing.
  3. Line 28 – 30:  We added a finally block.  Now whether our code breaks during the collection of data or finishes just fine, it will still save whatever data we received.  The with statement is a handy way of simplifying the opening and closing of files when you’re done with them.

After you run your program, you can retrieve the data again by doing the following in a Python shell:
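For example (assuming the data was saved as tweets.json):

```python
import json

with open('tweets.json') as tweetFile:
    tweets = json.load(tweetFile)

print(len(tweets))   # how many tweets we collected
print(tweets[0])     # the most recent one
```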

What do you notice about the data?  How many tweets can we collect by editing the global variables?  Hopefully this gets you thinking a little bit and you’re coming up with your own fun things to do with your twitter collection application.  I suggest setting the MAX_REQUESTS  global variable to something a lot higher and let it run for a while so you have a lot of data to work with for Part 2.


What Next?

From this point forward we can use the programs we created to collect tweets given any Twitter User Name we want and have a system to access the Twitter API in other programs we create.  The next installment will focus on data exploration and developing a system to use the Tweets we collected to randomly generate Tweets in the style of @realDonaldTrump.

Thanks!


Part 2 of Make America Tweet Again

Comments

    1. Yes, you’re very correct!

      I’m kind of writing this as I go through the process, so I expect to make some mistakes. Thanks for pointing that out.

  1. Great blog post Joe, this was fun to do. I attempted to pull all of Trump’s tweets but seem to not be able to get any ones before Dec. 6th, 2015, any idea why that is? Looking forward to your next write up!

    1. FWIW I ran this script which was incredibly fast, but ultimately ended up stopping at the same point that yours did: https://gist.github.com/yanofsky/5436496

      Any idea why that script was able to go so much faster than yours? I’m inexperienced with API calls in general so hopefully you can shed some light as to why that script was able to make so many calls without being stopped by Twitter.

    2. It all depends on the value of MAX_REQUESTS.

      The loop will iterate that many times, and goes progressively back in time. This might have to do with the date issue. Twitter places limits on what you can get and how fast. Trump specifically tweets several times almost every day, so there’s a lot of source material, too!

      As far as the speed issue, it probably has to do with the fact that I pause between requests in order to avoid having Twitter block my API calls. I’m not an expert on the Twitter API, but it seemed like a reasonable precaution.

      The next write up is coming soon. I finished the code, just need to clean up the blog post first. You can see some of the sample output here.

      1. Yeah so I did some more investigation and it seems like Twitter has instituted a roughly 3200 tweet limit to be pulled from a user’s history and does not make any tweets publicly available via the API before that. Disappointing.

    1. If you look at the output of your error, you are getting NameError: global name 'ACCESS_TOKEN' is not defined.

So somewhere your code expects a global variable named ACCESS_TOKEN.  Your error says there’s a conflict on line 9 of your code.  Look at line 9, or the return statement.  What do you think the problem is?

In other words, it expects something named ACCESS_TOKEN but it can’t find it.  If you look up the definition of a Python NameError (name not defined) and look at your code, you should figure it out pretty quickly.

      Getting in the habit of analyzing these error statements is going to be the best tool you have in quickly overcoming problems like this.

  2. Great tutorial.
    One small bug: tweetIndex = tweet.id should be tweetIndex = tweet.id - 1.  Otherwise, you’ll have a duplicate tweet every 20 tweets.  From the API: “Returns only statuses with an ID less than (that is, older than) or equal to the specified ID” (“or equal” being the key here).
    One formatting issue: “for tweet in tweets:” and trailing clause shouldn’t be indented as far as it is.

    1. Thanks for pointing that out. I’m noticing while balancing writing these tutorials, coding on my local machine, and testing on my server there are some code inconsistencies. I’m not used to coding/writing about it for an audience, so thanks for your insights. I’m learning, too!

      I’ll look into fixing these as soon as I get a chance.
