Make America Tweet Again Part 2

Part 2 | Data Analysis and Text Generation

A Tutorial on Text Analysis, Twitter Mining, and Automated Twitter Bots.

This is a continuation of  Part 1 of Make America Tweet Again.  In Part 1 we created a system to access the Twitter API to collect tweets.   Source code can be found on my GitHub Page for MakeAmericaTweetAgain.


What Your Project Folder Should Look Like

  1. env/:  This is the directory for our Virtual Environment Files
  2. requirements.txt:  These are the installed modules to our environment
  3. twitterKeys.py:  This is the Python module used for holding our Twitter Keys.
  4. api.py:  A small module we made to give us an authenticated Twitter API
  5. getTweetsFromUser.py:  A program that takes in a command line argument for a Twitter Screen Name and extracts tweets from their timeline.

Looking at Data From Part 1

After I completed Part 1, I set MAX_REQUESTS = 150, and ran the program with the argument realDonaldTrump.  I left it to run before I went to bed because I wasn’t sure how long it was going to take.  I awoke to find a file named realDonaldTrumpTweets, so it appears our program is working!  Run your program on a Twitter Screen Name.  Now open up your Python shell.

Keep in mind getTweetsFromUser() takes an argument for the Twitter Screen Name.  Whatever name you give it, the data is saved in a file called [twitterName]Tweets.  If you aren’t taking tweets from @realDonaldTrump then you’ll have to change the file name accordingly.
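A minimal sketch of that shell session, assuming Part 1 saved the Tweets as a JSON list of strings.  Swap in your own file name if you collected from a different account:

```python
import json
import os

# Assumption: Part 1 wrote a JSON list of tweet strings to [twitterName]Tweets.
fileName = "realDonaldTrumpTweets"

tweets = []
if os.path.exists(fileName):   # stays an empty list if the file isn't there yet
    with open(fileName) as f:
        tweets = json.load(f)

print(len(tweets))             # how many tweets were saved
```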

So my saved data recorded 2999 tweets.  After reading a few of the tweets here are some things that stick out to me about the data:

  • We have a lot of links in our tweets.  These will be randomly inserted into Tweets as well.  For our purposes this might help with creating the feel of a @realDonaldTrump Tweet.
  • There are a lot of Hashtags as well.  We might find some use in saving these to add into Tweets later for authenticity.

I imagine we’ll be doing a lot of data editing, so we’ll need a system to save and load different versions of data.  In the Python shell enter the following:
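One way to write these helpers, as a sketch.  The names loadJSON() and saveJSON() are the ones used later in this tutorial; the exact signatures shown here are an assumption:

```python
import json


def saveJSON(fileName, data):
    """Write any JSON-serializable object (list, dict, ...) to fileName."""
    with open(fileName, "w") as f:
        json.dump(data, f)


def loadJSON(fileName):
    """Read fileName and return the Python object stored in it."""
    with open(fileName) as f:
        return json.load(f)
```

With these in place, tweets = loadJSON("realDonaldTrumpTweets") reads the Part 1 data, and saveJSON() writes any edited version back out under whatever new name you choose.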

Now we have a way to easily save new versions of our data under new file names as we create them.  Go back to your Python shell.

A set is defined as a collection of elements that does not allow duplicate members.  My list shrank down from 2999 to 2843.  There were some duplicates, but not too many.  We still have plenty of data to work with.  We turn the data structure back into a list for consistency and for compatibility with JSON.  Save your uniqueTweets:
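The de-duplication and save steps can be sketched like this.  The three sample Tweets and the output file name uniqueTweets are stand-ins, and the file goes to a temporary folder here only so the sketch leaves your project untouched:

```python
import json
import os
import tempfile

# Stand-in for the list loaded from Part 1; note the repeated tweet.
tweets = ["Thank you Denver!", "Same old stuff!", "Thank you Denver!"]

# set() drops the duplicates; list() turns the result back into a
# JSON-friendly list.
uniqueTweets = list(set(tweets))
print(len(tweets), len(uniqueTweets))   # 3 2

# Hypothetical file name; in your own shell, save next to your scripts instead.
outputPath = os.path.join(tempfile.gettempdir(), "uniqueTweets")
with open(outputPath, "w") as f:
    json.dump(uniqueTweets, f)
```

Keep in mind that a set has no order, so uniqueTweets will not necessarily be in the order the Tweets were collected.  That doesn’t matter for what we do with them here.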

We’ll use uniqueTweets as the baseline data from which to generate new Trump-like Tweets.


Filtering Data With Regular Expressions

Regular expressions are one of the easiest ways to capture specific bits of text from a string.  If you’re new to the idea of regular expressions, I recommend looking through the Python re module.  Here’s an example of how we can use them to get all of our hashtags out of the Tweets.  Return to the Python shell.
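Here is a sketch of that session, using the pattern that the breakdown below walks through.  The two sample Tweets are made-up stand-ins; in your shell you would loop over your own uniqueTweets list:

```python
import re

# Stand-in data so the sketch runs on its own.
uniqueTweets = [
    "Same old stuff, our country needs change! #TrumpPence16",
    "Join me in #Colorado!",
]

hashtags = []
for tweet in uniqueTweets:
    # findall() returns every non-overlapping match in the string
    hashtags.extend(re.findall(r'#\w+[?.!]?', tweet))

print(hashtags)   # ['#TrumpPence16', '#Colorado!']
```

Notice that the optional punctuation is captured as part of the match, so a hashtag at the end of a sentence keeps its trailing mark.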

You should see a list of the hashtags found in your Tweets.

Let’s walk through our regular expression.  Again, if you’re confused please look at the Documentation for the module.  It is very helpful in guiding you toward building a Regular Expression for any situation.  Here’s a breakdown of the Regular Expression.

  • r'...':  The r prefix marks a raw string.  It tells Python to leave backslashes (like the one in \w) alone so they reach the regular expression engine untouched.  You might have noticed similar notation, the u prefix, with our Unicode tweets.
  • #:  Not surprisingly, the hashtag character will match a hashtag character.
  • \w+:  \w is a special sequence that stands for any Unicode letter, digit, or underscore.  The plus in regular expressions means 1 or more times.
  • [?.!]:  The brackets say, anything in this set of characters qualifies as a match for this part of the regular expression.  This is so we can match a hashtag if it’s used at the end of a sentence with a period, question, or exclamation mark.
  • ?:  The question mark after the set specifies that whatever it follows is optional for the match.  We could have a match ending with a punctuation mark, or we could have a hashtag that is followed by a space only.

All together you have an expression that matches any part of a string that starts with the hashtag mark, is followed by a series of letters, numbers, or underscores, and possibly ends with a punctuation mark.  Trump tends to use a very specific set of hashtags, so saving them for later and putting them into our Tweets will add an extra bit of authenticity to our Tweet.  Return to the Python shell:
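What you do with the matches is up to you.  One possibility, sketched here as an assumption rather than a record of the original session, is to strip the optional trailing punctuation and count how often each hashtag shows up.  The sample Tweets are made up:

```python
import re
from collections import Counter

# Stand-in data so the sketch runs on its own.
tweets = [
    "Thank you Denver! #TrumpPence16",
    "Big rally tonight #TrumpPence16!",
    "Join me in #Colorado.",
]

hashtagCounts = Counter()
for tweet in tweets:
    for tag in re.findall(r'#\w+[?.!]?', tweet):
        # Drop the optional trailing punctuation so "#Tag" and "#Tag!"
        # are counted as the same hashtag.
        hashtagCounts[tag.rstrip('?.!')] += 1

print(hashtagCounts.most_common(1))   # [('#TrumpPence16', 2)]
```

A frequency count like this is one way to later pick hashtags in proportion to how often they are actually used.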

We’re going to use this logic later in our program, but for now we’ll move on.  Now that we’ve gotten familiar with our data and filtered through it for the hashtags, let’s start building our text generation system.


Our TweetGenerator Class Skeleton

We will create a system inside of a Class to manage functionality for Tweet Generation.  Our class needs to have the following format:
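A bare skeleton along these lines would do.  The method names are the ones used throughout this tutorial; the parameter names are assumptions:

```python
class TweetGenerator(object):

    def loadTweets(self, sourceFile):
        """Load the raw Tweet JSON collected in Part 1."""
        pass

    def processTweets(self):
        """Build the data structure used to generate Tweets."""
        pass

    def saveTweetGen(self, fileName):
        """Save the generator's data structure to a file."""
        pass

    def loadTweetGen(self, fileName):
        """Load a previously saved data structure."""
        pass

    def generateTweet(self):
        """Return a novel Tweet."""
        pass
```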

This looks pretty straightforward so far.  We can load in raw Tweet JSON, build the data structure we’ll use to generate our Tweets, load or save an existing data structure, and of course get a novel Tweet.


A Note On Markov Chains

There are many ways in which we can design our data structure to handle Tweet Generation.  One method I think is very fun is Markov Chains.  Say you have these 3 Sentences:

  1. I like dogs.
  2. I like to eat ice cream.
  3. Sometimes I buy hot dogs to BBQ on the weekend.

Think of the set of all initial starting states, or {I, Sometimes}.  The word “I” is in there twice, and we want to preserve this idea of frequency in our Data Structure.  This means the word “I” has a 2/3 chance of selection and the word “Sometimes” has a 1/3 chance of selection.  So we randomly select one.  If the word “I” is selected, we look at all the possible states that follow it, or {“like”, “buy”}.  Again “like” has a 2/3 chance of selection and “buy” has a 1/3 chance.

If we repeat this process until we reach an ending state, or word that ends in a period, we have generated a new sentence.  Possible output could be “Sometimes I like dogs.”, “I like dogs to BBQ on the weekend.” or “Sometimes I buy hot dogs.”
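To make the idea concrete, here is a small standalone sketch that builds the chain for the three sentences above.  The names followers and generateSentence() are made up for this example and are not part of the class we build later:

```python
import random

sentences = [
    "I like dogs.",
    "I like to eat ice cream.",
    "Sometimes I buy hot dogs to BBQ on the weekend.",
]

initialStates = []   # first word of every sentence, repeats kept
followers = {}       # word -> list of words seen right after it

for sentence in sentences:
    words = sentence.split()
    initialStates.append(words[0])
    for current, following in zip(words, words[1:]):
        followers.setdefault(current, []).append(following)


def generateSentence():
    # random.choice() on a list with repeats preserves the frequencies.
    word = random.choice(initialStates)
    sentence = [word]
    while not word.endswith("."):   # ending state: a word that ends in a period
        word = random.choice(followers[word])
        sentence.append(word)
    return " ".join(sentence)


print(initialStates)    # ['I', 'I', 'Sometimes']
print(followers["I"])   # ['like', 'like', 'buy']
print(generateSentence())
```

Because this sketch splits on whitespace only, “dogs.” and “dogs” count as two different words, so it produces slightly fewer combinations than the idealized description above.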

The problem with this method is that it does not preserve sentence structure and can sometimes be nonsensical.  Some users on Reddit suggested a mad-lib style approach.  One drawback of this solution is that I would have to do a lot more data analysis on all the tweets to effectively figure out the template for a @realDonaldTrump tweet.

A compromise I think would be more effective would be to use this template:
[One or two Markov Chain Sentences] + [Optional Trump-Like Exclamatory Statement!] + [Optional Hashtag]

This would create some silly output from our Tweet data and add on an ending to the sentence that is a hallmark of @realDonaldTrump.  We can add a hashtag at the end by some method of statistical selection.  @realDonaldTrump peppers in enough hashtags already in the middle of the sentences, so it might not always be necessary.


Creating A System To Generate Tweets

Let’s create a file called tweetGenerator.py and add the following code:
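A first version of the file might look like the sketch below.  The member and method names follow the rest of this tutorial, while the signatures and the stubbed-out bodies are assumptions:

```python
import json


class TweetGenerator(object):

    def __init__(self):
        self.rawTweets = []          # Tweet text loaded from Part 1
        self.initialStates = []      # words that can start a sentence
        self.markovDictionary = {}   # word -> list of words that follow it

    def loadJSON(self, fileName):
        with open(fileName) as f:
            return json.load(f)

    def saveJSON(self, fileName, data):
        with open(fileName, "w") as f:
            json.dump(data, f)

    def loadTweets(self, sourceFile):
        # Read the Tweets saved in Part 1 into rawTweets.
        self.rawTweets = self.loadJSON(sourceFile)

    # Placeholders for the methods that get filled in later.
    def processTweets(self):
        pass

    def saveTweetGen(self, fileName):
        pass

    def loadTweetGen(self, fileName):
        pass

    def generateTweet(self):
        pass
```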

Take some time to examine our class.  We take the same methods that we described earlier and add them to our file.  We added class members rawTweets to hold our processed Tweet data, initialStates to hold the possible starting states in our Markov Chains, and markovDictionary that will be the same data structure we described earlier.

Notice that we also added in the JSON handling methods we created earlier, loadJSON() and saveJSON().  In our loadTweets() method, we can add data we got from Twitter to our class members.  The saveTweetGen() method will be responsible for saving our data structure used to build Markov Chain Tweets.  The loadTweetGen() method will just add back in any data saved already by a previous call to saveTweetGen().

Loading and saving Tweet Generator data will be slightly more complicated, as we need to save data for markovDictionary, initialStates, and rawTweets.  Add the following methods to your TweetGenerator class:
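Here is a sketch of the two methods, shown inside a pared-down class so it runs on its own.  Bundling the three members into one dictionary, and the key names used for it, are assumptions:

```python
import json


class TweetGenerator(object):

    def __init__(self):
        self.rawTweets = []
        self.initialStates = []
        self.markovDictionary = {}

    def loadJSON(self, fileName):
        with open(fileName) as f:
            return json.load(f)

    def saveJSON(self, fileName, data):
        with open(fileName, "w") as f:
            json.dump(data, f)

    def saveTweetGen(self, fileName):
        # Bundle the three class members into one dictionary and save it.
        data = {
            "rawTweets": self.rawTweets,
            "initialStates": self.initialStates,
            "markovDictionary": self.markovDictionary,
        }
        self.saveJSON(fileName, data)

    def loadTweetGen(self, fileName):
        # Pull the class members back out of the saved dictionary.
        data = self.loadJSON(fileName)
        self.rawTweets = data["rawTweets"]
        self.initialStates = data["initialStates"]
        self.markovDictionary = data["markovDictionary"]
```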

As you can see, all we’re doing here is using our loadJSON() and saveJSON() methods to add our class members to JSON and saving it, or extracting the class members back out from loaded JSON.

Now let’s focus on our __init__() method.  Again, I’m trying to make this flexible for other people to use and for purposes of debugging.  I want the method to be able to take in an argument for a Tweet source file, or no argument if the user would like to build the TweetGenerator one step at a time or load in previously generated JSON.  Edit your __init__() method:
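A sketch of such an __init__() method, laid out so that its lines (counting from the class line) follow the breakdown below.  The output file name passed to saveTweetGen() is a guess, and the three stubs exist only so the sketch runs by itself:

```python
class TweetGenerator(object):
    def __init__(self, sourceFile=None):
        self.rawTweets = []
        self.initialStates = []
        self.markovDictionary = {}
        if sourceFile is not None:
            self.loadTweets(sourceFile)
            self.processTweets()
            self.saveTweetGen(sourceFile + "Gen")   # hypothetical output name

    # Stubs standing in for the real methods described in this tutorial.
    def loadTweets(self, sourceFile):
        pass

    def processTweets(self):
        pass

    def saveTweetGen(self, fileName):
        pass


emptyGenerator = TweetGenerator()                        # build it by hand later
builtGenerator = TweetGenerator("realDonaldTrumpTweets") # build it automatically
```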

Here is a line by line breakdown of this method:

  • Line 2:  Our method has a default argument.  If there is no sourceFile provided, None is used as a default.  This way we can handle both scenarios.
  • Line 3:  We saved our JSON data as a list.  So we’ll make rawTweets type list.
  • Line 4:  initialStates is also a list.  A word that starts several sentences appears in it several times, which preserves its higher chance of being selected.
  • Line 5:  Because we’ll need to look up all sets of words following a specific word, we’ll use a Dictionary to store our Markov Chain data structure.  We’ll use a lot of key value lookups to generate our sentences.
  • Line 6:  If sourceFile is provided, it will not be the default None.  Therefore, we can automate the building of our class if this statement is True so the user doesn’t have to do it by hand.
  • Line 7 – 9:  Load, process, and save our Tweet Data.

Let’s shift to our processTweets() method.  This is probably the most difficult part of this Tutorial.  We need to know when we’re starting a sentence so we can add the word to our initialStates list.  We need to know when we reach the end of a sentence so that sentence generation knows where to stop.  We need to keep track of each new word and each previous word so we can accurately build our markovDictionary.  Examine the processTweets() method below:
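Here is a sketch of processTweets(), laid out so that its lines (counting from the def line) follow the walkthrough below branch by branch.  The value chosen for END_OF_SENTENCE, the exact set of endingCharacters, and the clean-up line at the very end are assumptions:

```python
# Assumed sentinel. It contains spaces, so it can never collide with a
# word produced by split().
END_OF_SENTENCE = "<END OF SENTENCE>"


class TweetGenerator(object):

    def __init__(self):
        self.rawTweets = []
        self.initialStates = []
        self.markovDictionary = {}

    def updateDictionary(self, key, value):
        if key not in self.markovDictionary:
            self.markovDictionary[key] = [value]
        else:
            self.markovDictionary[key].append(value)

    def processTweets(self):
        endingCharacters = ".?!"

        priorWord = None
        for tweet in self.rawTweets:
            for word in tweet.split() + [None]:
                if priorWord is None:
                    self.initialStates.append(word)
                    priorWord = word
                elif word is None:
                    self.updateDictionary(priorWord, END_OF_SENTENCE)
                    priorWord = None
                elif word[-1] in endingCharacters:
                    self.updateDictionary(priorWord, word)
                    self.updateDictionary(word, END_OF_SENTENCE)
                    priorWord = None
                else:
                    self.updateDictionary(priorWord, word)
                    priorWord = word
        # Assumed clean-up: a tweet that ends in punctuation leaves a stray
        # None in initialStates (the end marker arrives right after
        # priorWord was reset), so filter those out.
        self.initialStates = [w for w in self.initialStates if w is not None]


gen = TweetGenerator()
gen.rawTweets = ["I like dogs. I like to eat ice cream.", "TERRIBLE!"]
gen.processTweets()
print(gen.initialStates)           # ['I', 'I', 'TERRIBLE!']
print(gen.markovDictionary["I"])   # ['like', 'like']
```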

Again, this is a big, somewhat complicated method, so we’ll walk through it:

  • Line 2: endingCharacters are the characters that we assume are going to end a sentence.  We could make our Markov Chain for a whole Tweet composed of multiple sentences, but I’m choosing to generate specific sentences.
  • Line 4: priorWord will keep track of what came before each iteration of words in our Tweets.  We’ll set it to None to signal when we’re starting a new sentence.
  • Line 5:  We’re going to iterate through rawTweets to process one tweet at a time.
  • Line 6:  We’re going to iterate through words in tweet.split() + [None].  Split gives us a list of individual words, and adding None to the end gives us a handy way to know when we’re at the end of the list.
  • Lines 7-9:  We’re at the beginning of a sentence.  This means we need to add a word to our initial states.  After this, we set priorWord to word.
  • Lines 10-12:  If our word is None, we’re at the end of a sentence.  So we update our dictionary to reflect that priorWord can terminate with our special value END_OF_SENTENCE.
  • Lines 13-16:  We check to see if our last character in the word is in our special set of endingCharacters.  This also means we’ve finished a sentence.  So now we update our priorWord entry in our dictionary, as well as update it for the current word with our special value END_OF_SENTENCE.  Remember, any time we reach the end of a sentence we set priorWord to None.
  • Lines 17-19:  If we get here this means we’re still processing our sentence somewhere in the middle.  We just have to update our dictionary and set priorWord to word.

This gives us a system where we can pick out a value from initialStates, pick randomly from the set of following states in our markovDictionary, and repeat the process until we hit an END_OF_SENTENCE value.  However, we used a special function updateDictionary() to make our code cleaner by removing a lot of key/value checking.  Add the method updateDictionary() below to your code:
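A minimal sketch of such a helper, assuming markovDictionary maps each word to a list of the words that follow it (repeats kept, to preserve frequency).  It is shown inside a pared-down class so it runs on its own:

```python
class TweetGenerator(object):

    def __init__(self):
        self.markovDictionary = {}   # word -> list of words seen after it

    def updateDictionary(self, key, value):
        # First time we see key: start a new list. Otherwise extend the old one.
        if key not in self.markovDictionary:
            self.markovDictionary[key] = [value]
        else:
            self.markovDictionary[key].append(value)


gen = TweetGenerator()
gen.updateDictionary("I", "like")
gen.updateDictionary("I", "like")
gen.updateDictionary("I", "buy")
print(gen.markovDictionary)   # {'I': ['like', 'like', 'buy']}
```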

If you’re wondering why I made another function for this process, imagine for a minute that we hadn’t.  We would have had 4 lines of duplicate code in 4 sections of that one method.  We removed 16 lines of duplicate code from our function in the spirit of Don’t Repeat Yourself.

Your class definition should now look like the code below.

At this point, we know exactly what’s going to be in our class, specifically rawTweets, initialStates, and markovDictionary.  Let’s make methods to save this processed data.  That way if we so choose we can easily load in fresh class members without having to process the data all over again.  Add these methods to your class:

So all we need to do now is generate a sentence from our Tweets.  This is a straightforward process because all we need to do is key value lookups until we come up with a value of END_OF_SENTENCE.  Add in the following code to your generateTweet() method:
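Here is a sketch of generateTweet(), laid out so that its lines (counting from the def line) follow the breakdown below.  The END_OF_SENTENCE value and the tiny hand-built data in __init__() are assumptions so the sketch runs without a Tweet file:

```python
from random import randint

END_OF_SENTENCE = "<END OF SENTENCE>"   # assumed sentinel value


class TweetGenerator(object):

    def __init__(self):
        # Hand-built stand-in for the output of processTweets().
        self.initialStates = ["I", "I", "Sometimes"]
        self.markovDictionary = {
            "Sometimes": ["I"],
            "I": ["like", "like"],
            "like": ["dogs.", "cats."],
            "dogs.": [END_OF_SENTENCE],
            "cats.": [END_OF_SENTENCE],
        }

    def generateTweet(self):
        word = self.initialStates[randint(0, len(self.initialStates) - 1)]
        tweet = [word]
        while word != END_OF_SENTENCE:
            randomIndex = randint(0, len(self.markovDictionary[word]) - 1)
            word = self.markovDictionary[word][randomIndex]
            if word != END_OF_SENTENCE:
                tweet.append(word)

        return " ".join(tweet)


print(TweetGenerator().generateTweet())
```

One thing to watch for: if initialStates is empty, the first randint() call raises a ValueError, so make sure the Tweets have been loaded and processTweets() has run before calling generateTweet().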

Let’s look at a line by line breakdown of the function.

  • Line 2:  We use the randint() function to give us an index that falls within the range of initialStates and take the value corresponding to that index as our first word.
  • Line 3:  We store our word in a list in variable tweet.  This is to make spacing between words easier as you’ll see later.
  • Line 4:  We’ll loop until we find the END_OF_SENTENCE value.
  • Line 5:  Randomly select an index in the range of the set of all words that follow the word variable in markovDictionary.
  • Line 6:  Select a new word from the list stored in markovDictionary under the current word, using randomIndex as the position in that list.
  • Line 7-9:  If word isn’t our END_OF_SENTENCE value, we keep going.
  • Line 10:  This function is a handy way of putting all the values of our list together with spacing in between.  If we just added everything to one string, we might get confused about where to put the spacing.

Now that we have our complete class, we need to test our functionality before moving forward.


Taking A Look At Our TweetGenerator Class

Go to your Python shell and type the following commands:

It seems everything is functioning properly.  What did you think about the Tweets that came out?  Some sound more like nonsense than others, and others might pass for a sentence out of a Trump Tweet.  You might be getting weird characters and other relics in your Tweets.  While you might be able to figure out systems to refine this method, I’ll leave it for you to do that on your own.


Trumpifying Our Class With Inheritance

[Image: Taco Bowl Trump.  Caption: Taco Bowls Are Not Mexican Food.]

We now have a class that can generate Tweets based on a bank of a user’s prior Tweets.  However, my goal is to emulate Trump specifically.  To do this we’re going to use Inheritance to build off of the methods we just created and modify them as necessary.

Again, the format I’ve chosen for a Trump Tweet is [One or two Markov Chain Sentences] + [Optional Trump-Like Exclamatory Statement!]+[Optional Hashtag].  Because of this we need new class members for hashtags and Trump-like exclamatory words.  We’ll reflect this in the __init__() method of our code.

Below the last method of your TweetGenerator class in tweetGenerator.py, add the following Code:
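A sketch of the subclass and its __init__() method.  The TweetGenerator class here is a pared-down stand-in so the example runs on its own, and the entries in theBestWords are placeholders; build your own list by hand:

```python
class TweetGenerator(object):
    """Pared-down stand-in for the class built earlier in this tutorial."""

    def __init__(self, sourceFile=None):
        self.rawTweets = []
        self.initialStates = []
        self.markovDictionary = {}
        # (the real class would load and process sourceFile here)


class TrumpTweetGenerator(TweetGenerator):

    def __init__(self, sourceFile=None):
        # New members, set before the parent runs in case it needs them.
        self.theBestHashtags = []                    # filled in from the Tweets later
        self.theBestWords = ["TERRIBLE!", "Sad!"]    # placeholder entries
        if sourceFile is None:
            TweetGenerator.__init__(self)
        else:
            TweetGenerator.__init__(self, sourceFile)


trumpGen = TrumpTweetGenerator()
print(isinstance(trumpGen, TweetGenerator))   # True
print(trumpGen.markovDictionary)              # {}
```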

If you’re new to Inheritance, it is a key principle of Object Oriented Programming.  To the right of the Class name you can place the class you want to inherit from in parentheses.  Please read up more if you’re curious, but for now I’ll just give you a rough idea of what’s happening.

We need the __init__() method to say how we’re going to handle a TrumpTweetGenerator object differently from our TweetGenerator object.  First we add in new members theBestHashtags and theBestWords.  I added in the word set by hand after looking through some more of @realDonaldTrump’s more notorious Tweets.  We’ll leave the hashtag member empty for now and use code we wrote earlier to analyze hashtags to help us out.  We also need to handle both cases of creating a class with the sourceFile argument or not.  We can do this with a simple If/Else statement.  TweetGenerator.__init__(self) passes the object that is calling it to the TweetGenerator.__init__() method.  This ensures that we have all the same class members available to us in the TrumpTweetGenerator as we do in TweetGenerator.

Because we have new class member variables, we need to handle how we save them in loadTweetGen() and saveTweetGen().  Add the following methods to the TrumpTweetGenerator class:

When we redefine a method from a class that we inherit from, it overrides any code from TweetGenerator.  Everything else stays as is.  So we override our new methods by using the same code we did for loadTweetGen() and saveTweetGen(), except we add the new class variables.

Now we’re going to add code to collect our hashtags from our rawTweets variable.  Add the following methods to your TrumpTweetGenerator class:
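A sketch of the two methods.  The parent class is again a pared-down stand-in, and the message printed when no Tweets are loaded is made up:

```python
import re


class TweetGenerator(object):
    """Pared-down stand-in for the class built earlier in this tutorial."""

    def __init__(self):
        self.rawTweets = []

    def processTweets(self):
        pass   # the real method builds initialStates and markovDictionary


class TrumpTweetGenerator(TweetGenerator):

    def __init__(self):
        TweetGenerator.__init__(self)
        self.theBestHashtags = []

    def collectHashtags(self):
        if self.rawTweets:
            hashtags = []
            for tweet in self.rawTweets:
                hashtags.extend(re.findall(r'#\w+[?.!]?', tweet))
            self.theBestHashtags = hashtags
        else:
            # Precaution: nothing has been loaded into rawTweets yet.
            print("No Tweets loaded yet.")

    def processTweets(self):
        # Run the parent's version first, then pull out the hashtags.
        TweetGenerator.processTweets(self)
        self.collectHashtags()


trumpGen = TrumpTweetGenerator()
trumpGen.rawTweets = ["Thank you Denver! #TrumpPence16", "No tags here."]
trumpGen.processTweets()
print(trumpGen.theBestHashtags)   # ['#TrumpPence16']
```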

This shouldn’t look new.  We use the same Regular Expression we did before and collect the hashtags in the same way.  The only differences are that we (1) use the class variable theBestHashtags to store our filtered out data, and (2) wrap the work in an If/Else statement.  This statement is just a precaution in case somebody calls this method by hand before we’ve loaded anything into our rawTweets class member.

You’ll notice that we also override the processTweets() method.  All we’re doing here is calling the same method from TweetGenerator, but immediately calling our collectHashtags() method once it’s done.  This just makes sure that any time processTweets() is called on a TrumpTweetGenerator object, it immediately extracts the hashtags in rawTweets.

Our very last step in the process is to generate our Trump Tweet.  In your TrumpTweetGenerator class add the following function:
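Here is a sketch of that function, laid out so that its lines (counting from the def line) follow the breakdown below.  The method name generateTrumpTweet(), the MIN_LENGTH value, and the canned sentences returned by the stand-in parent class are all assumptions:

```python
import random
from random import randint


class TweetGenerator(object):
    """Stand-in parent: returns canned sentences instead of Markov output."""

    def generateTweet(self):
        return random.choice(["Thank you Denver, Colorado!",
                              "Our country needs change!"])


class TrumpTweetGenerator(TweetGenerator):

    MAX_LENGTH = 140   # Twitter's character limit when this was written
    MIN_LENGTH = 20    # arbitrary "too short" mark (assumed value)

    def __init__(self):
        self.theBestWords = ["TERRIBLE!"]          # placeholder entry
        self.theBestHashtags = ["#TrumpPence16"]

    def generateTrumpTweet(self):
        finalTweet = []
        while finalTweet == [] or len(text) > self.MAX_LENGTH or len(text) < self.MIN_LENGTH:
            finalTweet = []
            for _ in range(randint(1, 3)):
                finalTweet.append(self.generateTweet())
            if self.theBestWords and randint(0, 1) == 1:
                finalTweet.append(random.choice(self.theBestWords))
            if self.theBestHashtags and randint(0, 1) == 1:
                finalTweet.append(random.choice(self.theBestHashtags))
            text = " ".join(finalTweet)
        return " ".join(finalTweet)


print(TrumpTweetGenerator().generateTrumpTweet())
```

The while condition only works because finalTweet == [] is checked first.  On the first pass that test is True, so Python never looks at text before it has been assigned.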

This function is described line by line below:

  • Line 2: finalTweet will be the output of this function.  You’ll see why I chose to set it to [] below
  • Line 3:  We want to repeat this process if finalTweet is larger than the 140 character Twitter Maximum, is too short, or if we’re entering this loop for the first time (finalTweet == []).  I chose the too short mark arbitrarily.  After testing the output of this function I felt it helped generate better data.
  • Line 4:  Reset finalTweet to an empty list.  Again, we’re using a list to eliminate misplaced spacing between words.
  • Line 5-6:  Randomly get 1, 2, or 3 sentences and add them to finalTweet.
  • Line 7-8:  Add an element of theBestWords with 50% probability.
  • Line 9-10: Add an element of theBestHashtags with 50% probability.
  • Line 11:  Convert output to a string for length comparison.
  • Line 12:  Return all the elements in finalTweet joined into a single string, separated by spaces.

We’re finally done!  Let’s go back to our Python shell to test some of the output:

What do you think about the output?  Feel free to share any particularly interesting Tweets in the comments section.  I can already think of a few improvements, but I think it’s good enough for the time being.

And remember:

Join me last night, failed badly in Colorado! Thank you Denver, Colorado! Same old stuff, our country needs change! TERRIBLE! #TrumpPence16 


What Next?

We now have a reusable system to gather a large set of Tweets from a Twitter User that we built in Part 1.  We have a class that is capable of using that data and processing it in order to generate random Tweet sentences.  We also have a class specific to Trump that generates a more authentic @realDonaldTrump-like Tweet.  If you want to see the complete source code for part 2, you can find it on the JoeDevelops GitHub Page.

Next in Part 3 we’ll focus on creating an automated system to post our Tweets to Twitter.  I always appreciate any comments, criticisms, or ideas that you might have to make the program function better.

Thanks for reading.

14 thoughts on “Make America Tweet Again Part 2”

  1. since im doing this through a terminal, i was wondering if for part 3 you could go through the steps of re-entering the virtualenv?

    like i exit my shell session few days pass (like the break inbtwn part 2 and 3) and when i am to return to it, will just cd-ing into the directory put me back in the virtualenv? or do i need to go through the steps in part 1 of setting it up to rejoin it (or is it just that command:

    source env/bin/activate

    to get back in to it?

    1. the command you need to enter is:
      source [path to your folder]/env/bin/activate

      You can tell because it will add (env) to the left of your prompt.

  2. Thanks for this great tutorial!
    Unfortunately if I’m trying to generate a tweet I get this error:

    Traceback (most recent call last):
    File "", line 1, in
    File "tweetGenerator.py", line 52, in generateTweet
    word = self.initialStates[randint(0, len(self.initialStates) - 1)]
    File "/usr/lib/python2.7/random.py", line 242, in randint
    return self.randrange(a, b+1)
    File "/usr/lib/python2.7/random.py", line 218, in randrange
    raise ValueError, "empty range for randrange() (%d,%d, %d)" % (istart, istop, width)
    ValueError: empty range for randrange() (0,0, 0)

    It’s the same with Python3. Do you know where the problem is?

    1. So the error is ValueError: empty range for randrange... This means that they expected a range, but there was nothing to select from. Perhaps you got an empty list as a value. You might be trying to access initialStates before there’s anything there. This could hint at the class not properly being initialized before you’re trying to access the data inside. Are you assigning values to initialStates before trying to access them again?

      1. Thanks for your reply. I tried it once again with the project files you uploaded on GitHub, but I’m getting still the same error.

        1. Yes, I’m still getting used to writing these tutorials. While trying to write out a thought process, come up with the program and sync the files on GitHub, and testing code on my server, there’s inconsistencies.

          Look at how you’re initializing the class. It could be how you’re saving and reloading JSON. Walk through the code on anything related to creating/initializing your tweet generators. You’ll find the problem in there somewhere.

          For example, if you create an Object of type TweetGenerator, but don’t pass it source data, you have to manually load in the JSON. Otherwise, the methods cannot create data and they’ll try to access empty lists and you get an error like the one you’re seeing.

          When I get some time I’ll see if I can’t track down the error you get. Put your code and the complete error on a pastebin and I’ll check it out.

          1. I was getting the same error, in walking through your last example in idle. It looks like processTweets to load the initialStates and other info from the tweet file is never run so after the t1.loadTweets("realDonaldTrumpTweets") add
            t1.processTweets() and you should be gtg
