How to choose keywords in Twitter for data mining

Keyword cloud as extracted from Twitter

Keywords play an important role on Twitter. People often separate art from science, and the power of the two working together is often ignored. Choosing one or more seed keywords on Twitter is an art, but using science to expand a seed keyword into a larger list of keywords is just as important. The code below starts from a seed keyword and collects all the other keywords (hashtags) that appear in the same tweets.

'''
Takes tweets as input
'''
import re
import sys
from operator import itemgetter

hash_l = []  # every hashtag occurrence, in order of appearance
hash_t = {}  # hashtag -> frequency

# Take the filename from the command line. This file contains all the
# tweets collected with the seed keyword #nuclearenergy.
filename = sys.argv[-1]
with open(filename) as f:
    tweets = f.read()

# Collect every hashtag (without the leading '#').
hash_l.extend(re.findall(r"#(\w+)", tweets))

# Count how often each hashtag occurs.
for tag in hash_l:
    if tag not in hash_t:
        hash_t[tag] = 1
    else:
        hash_t[tag] += 1

# Uncomment to print the raw hashtag list:
# for tag in hash_l:
#     print(tag)

# Print the hashtags sorted by frequency, most frequent first.
for key, value in sorted(hash_t.items(), key=itemgetter(1), reverse=True):
    print(str(key) + ' : ' + str(value))

For example, for the seed keyword #nuclearenergy, the following tweet yields the secondary keywords #nuclear, #nuclearforclimate and #USElection.

Nuclear safety case engineer jobs with Atkins UK. Full details at #nuclear #nuclearenergy #nuclearforclimate #USElection …
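
As a quick check of the extraction step, the snippet below applies the same regular expression used in the script to this tweet and then drops the seed keyword. The variable names `seed` and `secondary` are introduced here for illustration only.

```python
import re

# The example tweet, copied as shown above (including the trailing ellipsis).
tweet = ("Nuclear safety case engineer jobs with Atkins UK. Full details at "
         "#nuclear #nuclearenergy #nuclearforclimate #USElection …")

seed = "nuclearenergy"  # seed keyword, without the leading '#'

# Same pattern as the script: capture the word characters after each '#'.
hashtags = re.findall(r"#(\w+)", tweet)

# Every hashtag other than the seed is a secondary keyword.
secondary = [tag for tag in hashtags if tag.lower() != seed]

print(secondary)
```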

The keywords can be arranged by frequency and a cut-off chosen. The cut-off should be decided after plotting all the frequencies; the shape of the plot depends on the number of tweets collected and the hashtags used. Hashtags repeated within a single tweet, if any, should be removed before plotting. The plot helps in deciding a cut-off window, after which tweet collection can continue with the selected keywords.
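
The counting and cut-off steps described above might be sketched as follows. This is a minimal illustration under stated assumptions: the helper name `hashtag_frequencies`, the sample tweets, the lower-casing of hashtags and the cut-off value of 2 are all choices made for the example, and in practice the cut-off would be read off the frequency plot.

```python
import re
from collections import Counter


def hashtag_frequencies(tweets):
    """Count hashtags, counting each hashtag at most once per tweet."""
    counts = Counter()
    for tweet in tweets:
        # A set drops hashtags repeated within the same tweet.
        # Lower-casing treats #Nuclear and #nuclear as the same keyword.
        unique_tags = {tag.lower() for tag in re.findall(r"#(\w+)", tweet)}
        counts.update(unique_tags)
    return counts


# Made-up sample tweets, for illustration only.
sample_tweets = [
    "Jobs at #nuclear #nuclearenergy #nuclear",
    "New reactor announced #nuclearenergy #nuclear",
    "Vote today #USElection #nuclearenergy",
]

freq = hashtag_frequencies(sample_tweets)

# Hypothetical cut-off: keep hashtags seen in at least 2 tweets.
cutoff = 2
kept = sorted(tag for tag, n in freq.items() if n >= cutoff)

for tag, n in freq.most_common():
    print(tag + ' : ' + str(n))
print(kept)
```

Counting per tweet rather than per occurrence keeps a single tweet that repeats one hashtag many times from distorting the frequencies.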

Hashtags were supposed to make data scientists' lives easier, but the noise makes tweets a lot more difficult to analyze. Suppose someone is searching for #USElection: should the above tweet be a candidate for analysis? The answer is no.

Unfortunately, users use hashtags as a tool to get their tweets more views, irrespective of the tweet's topic. Removing such noisy tweets requires manual intervention.
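
Manual review can at least be narrowed down. One possible approach, sketched below under the assumption that rarely co-occurring hashtags are more likely to be off-topic, is to flag tweets that carry a hashtag seen fewer than a chosen number of times in the collected set. The function name `flag_for_review`, the threshold and the sample data are hypothetical, and a flagged tweet still needs a human decision.

```python
import re
from collections import Counter

TAG_PATTERN = re.compile(r"#(\w+)")


def flag_for_review(tweets, min_count=2):
    """Return the tweets containing at least one rarely used hashtag."""
    # Frequency of each hashtag over the whole collection.
    counts = Counter(
        tag.lower() for tweet in tweets for tag in TAG_PATTERN.findall(tweet)
    )
    flagged = []
    for tweet in tweets:
        tags = [tag.lower() for tag in TAG_PATTERN.findall(tweet)]
        # A hashtag below the threshold marks the tweet as a candidate
        # for manual inspection; it is not removed automatically.
        if any(counts[tag] < min_count for tag in tags):
            flagged.append(tweet)
    return flagged


# Made-up sample tweets, for illustration only.
sample_tweets = [
    "Engineer jobs #nuclear #nuclearenergy #USElection",
    "New reactor announced #nuclearenergy #nuclear",
    "Plant safety review #nuclear #nuclearenergy",
]

for tweet in flag_for_review(sample_tweets):
    print(tweet)
```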

# Collects tweets matching the chosen keywords from the Twitter streaming API.
# Written for tweepy 3.x (StreamListener was removed in tweepy 4.0).
import tweepy
from tweepy import OAuthHandler
from tweepy import Stream
from tweepy.streaming import StreamListener

# Replace the placeholders with your own Twitter API credentials.
Consumer_Key = "xxx"
Consumer_Secret = "yyy"
Access_Token = "aaa"
Access_Token_Secret = "bbb"

auth = tweepy.OAuthHandler(Consumer_Key, Consumer_Secret)
auth.set_access_token(Access_Token, Access_Token_Secret)
api = tweepy.API(auth)


class StdOutListener(StreamListener):
    def on_data(self, data):
        # Print each raw message (a JSON string) as it arrives.
        print(data)
        return True


if __name__ == '__main__':
    l = StdOutListener()
    auth = OAuthHandler(Consumer_Key, Consumer_Secret)
    auth.set_access_token(Access_Token, Access_Token_Secret)
    stream = Stream(auth, l)
    # This line filters the Twitter stream to capture data by the keywords:
    stream.filter(track=['List', 'of', 'your', 'keywords'])
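
The `data` argument passed to `on_data` is a raw JSON string. To feed collected tweets back into the hashtag counting, the hashtags can be read from the tweet's `entities` field, as in the sketch below. The helper `hashtags_from_json` and the trimmed sample payload are made up for illustration; real payloads carry many more fields, and some stream messages (such as delete notices) are not tweets at all.

```python
import json


def hashtags_from_json(data):
    """Extract hashtag texts from one raw JSON message of the stream."""
    message = json.loads(data)
    # Non-tweet messages have no 'entities' key, so fall back to empty.
    entities = message.get("entities", {})
    return [tag["text"] for tag in entities.get("hashtags", [])]


# Trimmed, made-up payload in the shape of a v1.1 tweet object.
sample = json.dumps({
    "text": "Full details at #nuclear #nuclearenergy",
    "entities": {
        "hashtags": [
            {"text": "nuclear", "indices": [16, 24]},
            {"text": "nuclearenergy", "indices": [25, 39]},
        ]
    },
})

print(hashtags_from_json(sample))
```

Reading hashtags from `entities` avoids re-parsing the tweet text with a regular expression, although the regular-expression approach still works on plain-text dumps.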

Applications of proper keyword targeting on Twitter:

  1. Language Targeting
  2. Gender Targeting
  3. Interest Targeting
  4. Follower Targeting
  5. Behavior Targeting
  6. Geography Targeting
  7. Tailored Audience Targeting

Note:

  1. A keyword need not be a single word, but its words should not be separated by a space (for example, #nuclearforclimate).
  2. Do not trust the hashtags blindly.

Feel free to give us your feedback to help fine-tune this article.