Extract Facebook and Twitter data from any page
You may need data from social media like Facebook and Twitter for a variety of reasons. I use it for statistical analysis: I collect the reactions on posts from a given page and put them into a spreadsheet for easy analysis.
To extract publicly available data with Python code, you need to register as a developer and then get your app’s access tokens.
The provided APIs are no longer open to anonymous use; every request requires authentication via an access token.
- Create a Facebook Developer Account
- Go to “My apps” drop down in the top right corner and select “add a new app”. Choose a display name and a category and then “Create App ID”.
- Go to your app dashboard from the side-menu. There, you’ll find your App ID and App Secret.
- To avoid security risks, always create a new app for the sole purpose of scraping, and never share your App Secret or access tokens.
- Create a new Twitter App with your login credentials
- Fill out the required form information and accept the Developer Agreement at the bottom of the page, then click the button labeled “Create your Twitter application”.
- After successfully creating your application, you will be redirected to your application’s settings page. Before you create your application keys, you will need to first modify the access level permissions in order to allow your application to post on your behalf.
- Click on the link labeled modify app permissions. You will then be able to choose which permissions to allow. Select Read and Write; Read alone is enough if you only extract tweets, as this post does.
- After updating your application’s permissions to allow posting, click the tab labeled Keys and Access Tokens. This will take you to a page that lists your Consumer Key and Consumer Secret, and also will allow you to generate your Access Token and Access Token Secret.
Importing Python dependencies
import urllib2
import json
import datetime
import csv
import time
import tweepy
from tweepy import OAuthHandler
Accessing Facebook page data requires an access token.
Since a user access token expires within an hour, we instead combine the app ID and app secret generated above for our dummy application, made solely for scraping. Neither of them ever expires.
# Facebook credentials
app_id = "your_facebook_app_id"
app_secret = "your_facebook_app_secret"  # DO NOT SHARE WITH ANYONE!
access_token_fb = app_id + "|" + app_secret  # NEVER EXPIRES

# Twitter credentials
consumer_key = 'your_twitter_consumer_key'
consumer_secret = 'your_twitter_consumer_secret'
access_token_tw = 'your_twitter_access_token'
access_secret = 'your_twitter_access_secret'

# Authenticate with Twitter and create the API client
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token_tw, access_secret)
api = tweepy.API(auth)
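Hard-coding secrets in a script makes them easy to leak when the script is shared. One common alternative, sketched below, is to read them from environment variables. The variable names `FB_APP_ID` and `FB_APP_SECRET` and the helper `build_app_token` are hypothetical; only the `app_id|app_secret` token format comes from the code above.

```python
import os


def build_app_token(env=os.environ):
    """Build a Facebook app access token from environment variables.

    FB_APP_ID and FB_APP_SECRET are hypothetical names chosen for this
    sketch; use whatever names fit your setup.
    """
    app_id = env.get("FB_APP_ID")
    app_secret = env.get("FB_APP_SECRET")
    if not app_id or not app_secret:
        raise RuntimeError("Set FB_APP_ID and FB_APP_SECRET before running")
    # Same "app_id|app_secret" format as access_token_fb above.
    return app_id + "|" + app_secret


# A plain dict stands in for the real environment in this demo.
demo_token = build_app_token({"FB_APP_ID": "123", "FB_APP_SECRET": "abc"})
print(demo_token)  # 123|abc
```

The same approach would work for the four Twitter values; the point is only that the secrets live outside the file you might share.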
Define Page ID
Now we can access public Facebook and Twitter data, subject to each platform's rate limits. Let’s do our analysis on the Manchester United Facebook and Twitter pages, which are popular enough to yield good data.
fb_page = "manchesterunited"
twitter_page = "@manutd"
Construct URL string (Facebook only)
Set num_statuses, used in the parameters string, to the number of statuses you want to extract from the page.
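num_statuses = 10  # number of statuses to request (example value)

base = "https://graph.facebook.com/v2.11"
# The "/posts" edge is assumed here: the "data" list read later comes from
# an edge response, not from the bare page node.
node = "/" + fb_page + "/posts"
parameters = "/?fields=message,link,created_time,type,name,id,likes.limit(1).summary(true),comments.limit(1).summary(true),shares&limit=%s&access_token=%s" % (num_statuses, access_token_fb)
url = base + node + parameters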
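Building the query string by hand works, but it is easy to mistype and does not escape special characters such as the `|` in the access token. As a hedged alternative, the sketch below lets the standard library assemble the query. It is written for Python 3 (on Python 2, `urlencode` lives in `urllib`), and the helper name `build_graph_url` is hypothetical.

```python
from urllib.parse import urlencode  # Python 3; Python 2 has urllib.urlencode

GRAPH_BASE = "https://graph.facebook.com/v2.11"


def build_graph_url(node, token, limit, fields):
    """Return a Graph API URL for `node` with an escaped query string.

    `node` is the path after the API version, e.g. "/manchesterunited".
    """
    query = urlencode({
        "fields": ",".join(fields),
        "limit": limit,
        "access_token": token,
    })
    return GRAPH_BASE + node + "?" + query


demo_url = build_graph_url(
    "/manchesterunited",
    "app_id|app_secret",
    10,
    ["message", "created_time", "likes.limit(1).summary(true)"],
)
print(demo_url)
```

The commas, parentheses and `|` come out percent-encoded, which the server decodes back to the original values.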
When scraping large amounts of data from public APIs, there’s a high probability that you’ll hit an HTTP Error 500 (Internal Error) at some point. There is no way to avoid that on our end.
Instead, we’ll use a helper function to catch the error and try again after a few seconds, which usually works. This helper function also consolidates the data retrieval code, so it kills two birds with one stone.
def request_until_succeed(url):
    # Keep requesting the URL until the server answers with HTTP 200.
    req = urllib2.Request(url)
    success = False
    while success is False:
        try:
            response = urllib2.urlopen(req)
            if response.getcode() == 200:
                success = True
        except Exception as e:
            # Log the error, wait five seconds, then try again.
            print(e)
            time.sleep(5)
            print("Error for URL %s: %s" % (url, datetime.datetime.now()))
    return response.read()
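One caveat with retrying forever: a permanent failure, such as an invalid access token, keeps the loop running indefinitely. A possible variation, sketched here under assumptions, caps the number of attempts and doubles the wait between them. The name `fetch_with_retries` and its parameters are hypothetical, and `fetch` stands for any function that returns the response body or raises on failure.

```python
import time


def fetch_with_retries(fetch, url, max_attempts=5, base_delay=1.0,
                       sleep=time.sleep):
    """Call fetch(url) until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception as exc:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # Wait 1x, 2x, 4x ... the base delay before the next attempt.
            delay = base_delay * 2 ** (attempt - 1)
            print("Attempt %d failed for %s: %s; retrying in %.0fs"
                  % (attempt, url, exc, delay))
            sleep(delay)
```

The `sleep` argument exists only so the waiting can be swapped out, for example when trying the function without real delays.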
Extracting Facebook Status
# "data" holds a list of statuses, one dictionary per status.
test_status = json.loads(request_until_succeed(url))["data"]
print(json.dumps(test_status, indent=4, sort_keys=True))
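A single request returns only one page of results. Graph API edge responses normally carry a `paging` object whose `next` field holds the URL of the following page, so collecting more statuses means following that link. The generator below is a rough sketch of that idea; `iter_statuses` is a hypothetical name, and `fetch` is assumed to be a function like `request_until_succeed` that returns the raw JSON text for a URL.

```python
import json


def iter_statuses(first_url, fetch, max_statuses):
    """Yield up to max_statuses status dicts, following paging links."""
    url = first_url
    yielded = 0
    while url and yielded < max_statuses:
        page = json.loads(fetch(url))
        for status in page.get("data", []):
            yield status
            yielded += 1
            if yielded >= max_statuses:
                return
        # No "next" link means this was the last page.
        url = page.get("paging", {}).get("next")
```

With the helper above this would be called as `iter_statuses(url, request_until_succeed, 500)`.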
Processing Facebook Status
Each status in the returned list is now a Python dictionary, so for top-level items, we can simply call the key.
Additionally, some items may not always exist, so we must check for existence first.
def processFacebookPageFeedStatus(status):
    # Optional top-level items fall back to an empty string when missing.
    status_id = status['id']
    status_message = '' if 'message' not in status else status['message'].encode('utf-8')
    link_name = '' if 'name' not in status else status['name'].encode('utf-8')
    status_type = status['type']
    status_link = '' if 'link' not in status else status['link']

    # Time needs special care since a) it's in UTC and
    # b) it's not easy to use in statistical programs.
    status_published = datetime.datetime.strptime(status['created_time'], '%Y-%m-%dT%H:%M:%S+0000')
    status_published = status_published + datetime.timedelta(hours=-5)  # EST (fixed offset, ignores daylight saving)
    status_published = status_published.strftime('%Y-%m-%d %H:%M:%S')  # best time format for spreadsheet programs

    # Nested items require chaining dictionary keys.
    num_likes = 0 if 'likes' not in status else status['likes']['summary']['total_count']
    num_comments = 0 if 'comments' not in status else status['comments']['summary']['total_count']
    num_shares = 0 if 'shares' not in status else status['shares']['count']

    # Return a tuple of all processed data.
    return (status_id, status_message, link_name, status_type, status_link,
            status_published, num_likes, num_comments, num_shares)

# test_status is a list of statuses, so process the first one.
processed_test_status = processFacebookPageFeedStatus(test_status[0])
print(processed_test_status)
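Since the goal is a spreadsheet, the processed tuples still need to be written out. A minimal sketch of that last step, using the `csv` module, could look like the following. It is written for Python 3, where the text fields should be left as strings rather than UTF-8 encoded, and the helper name `write_statuses_csv` is hypothetical. The column names simply mirror the tuple returned above.

```python
import csv
import io

HEADER = ["status_id", "status_message", "link_name", "status_type",
          "status_link", "status_published", "num_likes", "num_comments",
          "num_shares"]


def write_statuses_csv(rows, out):
    """Write a header line and one line per processed status to `out`."""
    writer = csv.writer(out)
    writer.writerow(HEADER)
    for row in rows:
        writer.writerow(row)


# Demo with an in-memory buffer and one made-up row.
buffer = io.StringIO()
write_statuses_csv(
    [("1_2", "Hello, world", "", "photo", "http://x",
      "2018-01-01 07:00:00", 10, 2, 1)],
    buffer,
)
print(buffer.getvalue())
```

To write a real file, pass an object opened with `open("statuses.csv", "w", newline="")` instead of the buffer.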
# Fetch the most recent tweet from the page's timeline.
for x in tweepy.Cursor(api.user_timeline, screen_name=twitter_page).items(1):
    tweet = x.text
    print(tweet)
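To analyze tweets the same way as Facebook statuses, each tweet can be flattened into a tuple as well. The sketch below assumes the Twitter API v1.1 attributes that tweepy exposes on a status object (`id_str`, `text`, `created_at`, `favorite_count`, `retweet_count`); the function name `tweet_to_row` is hypothetical, and the demo uses a stand-in object instead of a live API call.

```python
from datetime import datetime
from types import SimpleNamespace


def tweet_to_row(tweet):
    """Flatten one tweet object into a tuple for a spreadsheet row."""
    return (
        tweet.id_str,
        tweet.text,
        tweet.created_at.strftime("%Y-%m-%d %H:%M:%S"),  # UTC, as delivered
        tweet.favorite_count,
        tweet.retweet_count,
    )


# Stand-in for a tweepy status, with made-up values.
sample_tweet = SimpleNamespace(
    id_str="1",
    text="Hello",
    created_at=datetime(2018, 1, 1, 12, 0, 0),
    favorite_count=5,
    retweet_count=2,
)
sample_row = tweet_to_row(sample_tweet)
print(sample_row)
```

In the loop above, `tweet_to_row(x)` would take the place of reading `x.text` alone.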
Analyzing post data can help you quantify the growth and success of your own page, or that of your competitors. Or, as you’ll see in the next post, it can be used to build a WhatsApp bot.
The data is easy to get and is very useful.
Originally published at ashishkhan.com.