In text analysis using Twitter data, crawling is an essential first step. There are many ways to do it: to crawl Twitter data we can use the official Twitter API from many programming languages. Python 3 comes with many useful libraries that make a lot of things easier, and Tweepy is one of the Python libraries that can be used to crawl Twitter data. I assume the reader has basic knowledge of Python, so I won't explain it from the basics and will focus on the Tweepy parts.
If you are new to Tweepy and want comprehensive knowledge about it, go to http://docs.tweepy.org/en/latest/getting_started.html and read the Tweepy documentation. After that, install Tweepy using pip; go to http://docs.tweepy.org/en/latest/install.html for the installation steps.
Here is the full Python code:
Now I will explain the code above.
First of all, we must import the Tweepy library to use it.
Save your consumer key, consumer secret, access token, and access secret in variables; this makes it easier for us to change the keys later.
consumer_key = 'change with your consumer key'
consumer_secret = 'change with your consumer secret'
access_token = 'change with your access token'
access_secret = 'change with your access secret'
tweetsPerQry = 100
maxTweets = 1000000
hashtag = "#mencatatindonesia"
tweetsPerQry is the number of results we retrieve per request, maxTweets is the maximum number of tweets we want to retrieve, and the hashtag variable is the keyword we want to search for. I use the #mencatatindonesia hashtag as the search query; "Mencatat Indonesia" is the tagline of Indonesia's 2020 census, which is held every 10 years.
The next block handles Twitter authentication. Besides the authentication handler, two parameters are passed when constructing the API object: wait_on_rate_limit and wait_on_rate_limit_notify, which tell Tweepy to sleep automatically (and print a notification) whenever the rate limit of the Twitter API is hit.
We use a while loop to request all the available tweets for the #MencatatIndonesia hashtag. An if statement inside the loop decides how to build each request: when maxId is less than 1 (the first request), the program runs the search query starting from the latest tweet; it then saves the id of the last tweet in the batch as maxId, so the next request can continue from that point instead of starting again from the latest tweet.
The loop then iterates over the tweets returned by each request, and for every tweet we print out the text content. When a request returns no more tweets, the program prints "Tweet habis" ("Tweet habis" means "there are no more tweets" in Indonesian) right before it breaks out of the loop, ending our Python Twitter crawler.
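To see this paging behaviour without hitting the real API, the loop can be isolated into a function and exercised against a stub. The FakeAPI and FakeTweet classes below are hypothetical stand-ins for illustration only; only the crawl function mirrors the logic described above:

```python
def crawl(api, query, tweetsPerQry=100, maxTweets=1000000):
    """Page backwards through search results until no tweets remain."""
    tweetCount = 0
    maxId = -1
    while tweetCount < maxTweets:
        if maxId <= 0:
            # First request: start from the latest tweet
            newTweets = api.search(q=query, count=tweetsPerQry)
        else:
            # max_id is inclusive, so subtract 1 to avoid re-fetching a tweet
            newTweets = api.search(q=query, count=tweetsPerQry,
                                   max_id=str(maxId - 1))
        if not newTweets:
            print("Tweet habis")  # "no more tweets" in Indonesian
            break
        for tweet in newTweets:
            print(tweet.text)
        tweetCount += len(newTweets)
        maxId = newTweets[-1].id  # id of the oldest tweet in this batch
    return tweetCount

class FakeTweet:
    def __init__(self, tweet_id):
        self.id = tweet_id
        self.text = "tweet %d" % tweet_id

class FakeAPI:
    """Stub that serves three tweets on the first request, then none."""
    def search(self, q, count, max_id=None):
        if max_id is None:
            return [FakeTweet(i) for i in (103, 102, 101)]
        return []

print(crawl(FakeAPI(), "#mencatatindonesia"))  # -> 3
```

The stub shows the termination condition clearly: the second request comes back empty, so the loop prints "Tweet habis" and returns the total count.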
This is my command-line screenshot of our twitterCrawler.py. There are 1082 available tweets containing the #mencatatindonesia hashtag. If you want to print the number of tweets, you can print the tweetCount variable that already exists in our program.
That’s all for this tutorial. Thank you!