FBLYZE: a full-service Facebook scraping engine in an easy-to-use Docker container.

PaddleSoft
2 min read · Apr 30, 2017


Social media analysis is becoming increasingly important for a wide variety of companies. As social media sources continue to grow, the need for tools that can easily scrape and analyze the social sphere becomes more and more apparent.

Every social media site is different, and some are easier to scrape data from than others. Twitter, for instance, provides an easily accessible streaming API which connects quite nicely with many applications. Facebook, however, is a bit more difficult. Yet for our purposes, extracting meaningful data from Facebook is crucial. There are hundreds of open groups and whitewater-dedicated pages across Facebook with frequent discussion of paddling-related events. These groups contain data that is necessary for our services, whether it is quickly summarizing what paddlers are discussing, generating flow information for streams without gages, or building river recommendation engines. We played around with a few different ideas, including the scraper from minimaxir, but none of them fulfilled our needs entirely.

Enter FBLYZE. FBLYZE is an ongoing open-source project aimed at creating a continuous Facebook scraping and analysis engine. We want to make it really simple to scrape a variety of Facebook groups and pages on a regular basis (e.g. every ten minutes, hour, or day) and then save or process the scraped content. We are trying to make FBLYZE as robust as possible in order to accommodate a variety of scraping needs. So far we have created a Docker image, which you can pull with the command docker pull paddlesoft/fb_scraper. This image can then be used in conjunction with Apache Airflow to continuously scrape data and save it to your desired location.

So far we have added several output modes, including saving posts directly to ElasticSearch, uploading CSV and JSON data to S3, and shipping posts directly to Kafka. We plan on adding more ‘connectors’ shortly, including HDFS, MySQL, and Neo4j, as well as improving our current file save formats.

How to get started

  1. Pull our Docker image: docker pull paddlesoft/fb_scraper
  2. Create a variables.list file and include the necessary environment variables. See here for configuration instructions.
  3. Now you can test by running: docker run --env-file variables.list paddlesoft/fb_scraper
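
For readers who prefer to script steps 2 and 3, here is a minimal Python sketch. It writes an env file in the KEY=value-per-line format that docker run --env-file reads, then builds the run command as an argument list. The helper names and the variable names in the example (EXAMPLE_ACCESS_TOKEN, EXAMPLE_GROUP_ID) are placeholders invented for illustration, not FBLYZE's actual settings; use the names from the configuration instructions instead.

```python
import shlex
from pathlib import Path

IMAGE = "paddlesoft/fb_scraper"  # image name from the steps above


def write_env_file(path, variables):
    """Write one KEY=value pair per line, the format `docker run --env-file` reads."""
    lines = [f"{key}={value}" for key, value in variables.items()]
    env_path = Path(path)
    env_path.write_text("\n".join(lines) + "\n")
    return env_path


def docker_run_command(env_file, image=IMAGE):
    """Return the argument list for step 3; nothing is executed here."""
    return ["docker", "run", "--env-file", str(env_file), image]


if __name__ == "__main__":
    # Placeholder names and values: replace them with the variables
    # listed in the project's configuration instructions.
    env_path = write_env_file(
        "variables.list",
        {"EXAMPLE_ACCESS_TOKEN": "changeme", "EXAMPLE_GROUP_ID": "changeme"},
    )
    # Print the command so it can be reviewed before running it by hand.
    print(shlex.join(docker_run_command(env_path)))
```

To actually launch the container from Python, the returned list can be passed to subprocess.run(..., check=True). Keeping the command as a list rather than a single string avoids shell-quoting problems if the env file path contains spaces.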

Using Apache Airflow to schedule jobs

4. Download our template DAG or create your own from scratch.

5. Modify the DAG based on your requirements.

6. Then start up and trigger the DAG.
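
If you write your own DAG from scratch, it might look roughly like the sketch below. This is not our template DAG; it is a minimal illustration that assumes Airflow 2.x (for the BashOperator import path and the schedule_interval argument) and a host where the Airflow worker can run docker. The DAG id, task id, env file path, and hourly interval are all hypothetical choices.

```python
from datetime import datetime, timedelta

IMAGE = "paddlesoft/fb_scraper"
ENV_FILE = "/path/to/variables.list"  # hypothetical location of your env file


def scrape_command(env_file=ENV_FILE, image=IMAGE):
    """Shell command for the task; --rm removes the container once it exits."""
    return f"docker run --rm --env-file {env_file} {image}"


def build_dag():
    """Build the DAG. Airflow is imported here so this module loads without it."""
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    dag = DAG(
        dag_id="fblyze_scrape_example",        # hypothetical name
        start_date=datetime(2017, 4, 30),
        schedule_interval=timedelta(hours=1),  # e.g. scrape once an hour
        catchup=False,                         # do not backfill missed runs
    )
    BashOperator(
        task_id="run_fb_scraper",
        bash_command=scrape_command(),
        dag=dag,
    )
    return dag


try:
    # Airflow discovers DAG objects defined at module level.
    dag = build_dag()
except ImportError:
    # Airflow is not installed; the command helper can still be used alone.
    dag = None
```

Changing the interval to timedelta(minutes=10) or timedelta(days=1) gives the other schedules mentioned earlier. A BashOperator is used here only because it maps directly onto the docker run command from step 3; other operators may suit your setup better.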

Analyze your results

We mainly leave this step up to you! However, in the future we will be publishing some articles on how we employ FBLYZE in our production pipeline. We do have a Spark tutorial on analyzing the CSV data that you can check out as well!
