Lately, I have been addicted to opening ProductHunt (PH) from time to time to view trending tech products. Sometimes the list is so beautifully curated that hours just fly by as I read post after post. Something new in tech, hitting the market, surely deserves at least a glance. That’s what I like most about PH: a platform where one can hunt for good technological products.
I posted one of my own side projects, Json2Html, on PH, which luckily got traction and gave me a good surprise this New Year 🙌. It seriously feels awesome when people are talking about your product (especially positively :P) and there’s a direct question-and-answer session. For those few days, my daily time spent on PH shot up. Umm, it was slightly a bad habit for some time… 😉😉
No doubt PH has a good user interface, but a well-organized list of posts on a weekly / monthly basis is certainly missing. Thus, my search for one began, and I landed on some existing projects built with the PH API. Phlics is, I would say, a good use of the public API. But my hunt still continued…
Hands-on with different APIs for data collection
While I was playing a little with the PH API, I got to know that it doesn’t provide much about a particular user, but it does provide the Twitter handle, which in itself is a good source of a lot of valuable data. Of course, my next step was to jump into the Twitter API documentation. Analyzing a sample user’s Twitter data, the first thing that caught my attention was the location attribute, fundamentally the most common segment to analyze first. Geographical distribution of the user base is a common metric that helps in future product planning and in targeting the correct audience smartly. 😎
So the hunt slowed down as the coding session started absorbing me. Apart from the data-set, I was mostly concerned with representing the location-based data visually in a subtler way than just a bar chart. Woot! I recalled a Google experiment demonstrating a geographical depiction of a location-based data-set.
The WebGL Globe: Google’s open platform experiment for geographic data visualization.
The good part of the WebGL Globe is that it scales to huge data-sets and gives a better 3D model of the data, which a user can interact with nicely. The library expects the data as one flat Array of repeating triples, i.e. (latitude, longitude, magnitude). For example, in [23.45678, -123.98765, 0.006428] the geographic coordinates (23.45678, -123.98765) represent a particular location and the magnitude (0.006428) tells the library the normalized weight for that location. What I mean by normalized weight is this: the raw data-set values mostly cover a wide range, so to represent them it is necessary to club the duplicates and then normalize the magnitude for each location. Normalizing simply means mapping the input range of the data onto the desired output range. The library expects the magnitude values to be in the [0, 1] range. A simple way of normalizing the values is as follows:
// assuming the magnitude Array to be [69, 20, 46, 1, 8, 15, 96, 100, 2]
1. Find the largest value (in our case, it’s 100).
2. Divide each value (magnitude) of the Array by the largest value.
3. So, the normalized Array as per the basic maths would look something like: [0.69, 0.20, 0.46, 0.01, 0.08, 0.15, 0.96, 1, 0.02]
To make sense of these values, one can relate them to the population density at a particular location, taking the data-set to be the number of users there.
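The three steps above fit in a few lines of Python. This is only a minimal sketch: the function name `normalize` is made up for illustration and is not part of the WebGL Globe library, and it assumes the magnitudes are non-negative counts.

```python
def normalize(magnitudes):
    """Scale non-negative magnitudes into [0, 1] by dividing by the largest value."""
    if not magnitudes:
        return []
    largest = max(magnitudes)
    if largest == 0:
        # All counts are zero, so there is nothing to scale.
        return [0.0 for _ in magnitudes]
    return [value / largest for value in magnitudes]


counts = [69, 20, 46, 1, 8, 15, 96, 100, 2]
print([round(value, 2) for value in normalize(counts)])
```

Dividing by the maximum keeps the ratios between locations intact, which is all the globe needs to draw bars of comparable height.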
Again, to get the geocodes for the different users’ locations from the Twitter data, I had to use some public API. What could be better than the Google Maps Geocoding API for this purpose?
The Product Hunt geographic distribution of users is now also listed on the Google Experiments site.
Technical Challenges faced
The most common limitation when dealing with third-party public APIs is rate limiting, all the more so when different services each have their own limits. Below are the rate-limiting criteria for the three third-party public APIs I used to generate the final plottable data-set.
- ProductHunt API (PHAPI) -
Rate limiting is applied per application. You can make up to 900 requests every 15 minutes.
- Twitter API (TAPI) -
It allows a maximum of 180 API requests per 15-minute window for both user-auth and app-auth.
- Google Maps Geocoding API (GMGAPI) -
Users of the standard API can make up to 2,500 free requests per day, at no more than 10 requests per second; beyond that, a paid plan is needed.
Now one can easily imagine the overhead of handling each API’s limits separately. Below are the response times, along with the size of the list returned by each API, as tested on a micro Amazon EC2 instance.
- PHAPI returns a list of 100 users in approx. 1 sec / request
- TAPI returns a list of 100 users in less than 1 sec / request
I was able to send 180 synchronous requests and get their data back in less than a minute. But, to be precise, one then has to sit idle for the next 14.5 minutes.
- GMGAPI returns a geocode in less than 1 sec / request.
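The "burst, then sit idle" behaviour described for TAPI can be wrapped in a small helper. Below is a minimal sketch of a fixed-window limiter, not the code used for this project: the class name `FixedWindowLimiter` and its parameters are made up for illustration, and it assumes the service's window starts roughly when the limiter is created, which a real API does not guarantee.

```python
import time


class FixedWindowLimiter:
    """Allow at most max_calls per window; sleep out the rest of the window when exhausted."""

    def __init__(self, max_calls, window_seconds, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._clock = clock      # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._window_start = clock()
        self._calls = 0

    def acquire(self):
        now = self._clock()
        if now - self._window_start >= self.window_seconds:
            # The previous window has already expired: start a fresh one.
            self._window_start = now
            self._calls = 0
        if self._calls >= self.max_calls:
            # Budget used up: wait for the remainder of the current window.
            self._sleep(self.window_seconds - (now - self._window_start))
            self._window_start = self._clock()
            self._calls = 0
        self._calls += 1


# 180 requests per 15-minute window, as in the TAPI limit quoted above.
limiter = FixedWindowLimiter(max_calls=180, window_seconds=15 * 60)
limiter.acquire()  # call this right before every API request
```

Keeping one limiter object per service lets each API run at its own pace without the limits getting tangled together.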
Let’s do some basic maths here:
- PHAPI returns approx. (100*60*15 = 90,000) users per 15-min window.
- TAPI returns approx. (180*100 = 18,000) users per 15-min window.
- GMGAPI returns approx. (2,500) free geocodes per day.
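To compare the three on the same scale, the same arithmetic can be written out in Python, with the daily geocoding quota spread over the 96 fifteen-minute windows in a day. The variable names are made up for illustration; the figures are the ones quoted above.

```python
USERS_PER_REQUEST = 100   # both PHAPI and TAPI returned 100 users per request
WINDOW_MINUTES = 15

# PHAPI: roughly one request per second for the whole window.
phapi_users = USERS_PER_REQUEST * 60 * WINDOW_MINUTES

# TAPI: capped at 180 requests per window.
tapi_users = 180 * USERS_PER_REQUEST

# GMGAPI: 2,500 free geocodes per day, averaged over the windows in a day.
windows_per_day = 24 * 60 // WINDOW_MINUTES
gmgapi_per_window = 2500 / windows_per_day

print(phapi_users, tapi_users, round(gmgapi_per_window, 1))
# prints: 90000 18000 26.0
```

Seen per window, the geocoding quota is smaller than the other two by roughly three orders of magnitude.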
One can notice the significant difference in the amount of data returned by the three APIs. This was really a pain in the ass. Oh, God!
What could be done to overcome the GMGAPI limit?
Challenges Accepted, My Lord !!
Workarounds / Hacks for the Challenges faced
PHAPI is good enough: I was able to get a list of 4 lakh (400,000) users in a reasonable time span of about 1 hour 10 minutes (100*60*60*1.1 ≈ 4,00,000). This was technically a good response time for my service, indeed.
Through TAPI, I was able to gather a list of roughly 4 lakh users in approximately 5 hours 30 minutes (180*100*4*5.55 ≈ 4,00,000). Quite a time-consuming task. To tackle this, I created different apps from the same account, each of which gets its own security keys and tokens to access the API, and voilà! The overall time dropped by a factor of n, where n is the number of different apps created.
But interestingly, not all Twitter users have set their locations. Out of 4 lakh valid Twitter accounts, only 2,53,796 users have set their location and made it publicly accessible. For the rest of them, either the location is not set / not publicly available or it’s not a valid address (e.g. I live in a cool place, X Planet, etc.). Human beings are crazy creatures, aren’t they? :D
Google’s geocoder cannot make sense of these kinds of locations, and hence they fall into the untracked category for the service.
Another surprising fact was that only 60,953 locations came out to be unique after clubbing the data based on location. This was required to avoid unnecessary API calls and to save me from writing my own caching algorithm. Wow, analyzing the data beforehand and pre-processing it cut the number of geocoding API requests by a factor of about 4 (2,53,796 / 60,953 ≈ 4.2). This is great for my eccentric mind :P and saved a lot of valuable time for planning the next iterations.
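Clubbing the locations before geocoding can be sketched with a `Counter`. This is an illustration rather than the code used for this project: the helper names and the sample profile locations are made up, and lower-casing plus whitespace collapsing is just one assumed way of deciding that two location strings are "the same".

```python
from collections import Counter


def canonical(location):
    """Collapse case and extra whitespace so near-identical strings match."""
    return " ".join(location.lower().split())


def club_locations(raw_locations):
    """Count users per distinct, non-empty location string."""
    cleaned = (canonical(loc) for loc in raw_locations if loc and loc.strip())
    return Counter(cleaned)


# Made-up sample of profile locations, including empty and unset ones.
sample = ["Pune, India", "pune,  india", "San Francisco, CA", "",
          None, "San Francisco, CA ", "X Planet"]
counts = club_locations(sample)
print(len(sample), "profiles ->", len(counts), "geocoding lookups")
```

Each key of `counts` needs only one geocoding request, and its value is the raw magnitude for that location before normalizing.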
Now, the next big challenge was to make approx. 61,000 GMGAPI calls with a hard limit of 2,500 requests per day. No one can wait (61,000 / 2,500 ≈ 25 days) to get such a small task done. Also, for the initial testing period, I wasn’t in the mood to spend money to get the task done quickly. So I needed 24 more Google Maps Geocoding API enabled apps to get the work done within a few hours. 10 apps can be created per account, and luckily I had 3 personal Google email IDs, enough to have 25 such apps in all. Using the 25 different access tokens, one provided by each app, I managed to get the geocodes within a few hours. A real time-saver trick and, seriously, what a drastic change. This is admittedly a very bad hack, but I call it an intelligent act for testing out initial releases. Please don’t misuse such Google hacks, I being an exception :)
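The numbers in this step are plain ceiling division, shown here with the exact count of unique locations instead of the rounded 61,000. The variable names are made up for illustration.

```python
import math

unique_locations = 60953
free_requests_per_app_per_day = 2500

# Days needed with a single app, which is also the number of apps needed for a single day.
days_with_one_app = math.ceil(unique_locations / free_requests_per_app_per_day)
spare_requests = days_with_one_app * free_requests_per_app_per_day - unique_locations

print(days_with_one_app, spare_requests)
# prints: 25 1547
```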
The challenges, scary but accepted, were now successfully tackled and executed. Cheers ☕️ !! (Sorry, I don’t drink 🍺) :P
La La La LaLaLa La La . . . . .
Now moving on to my last task — the data representation, the most awaited and interesting part.
Visual Representation of data
The final task left was to plot the data points correctly on the globe. The data, gathered and processed accordingly, was now ready to be served, but of course only after passing it through the normalizing step which, as I mentioned earlier, is a requirement of the WebGL Globe.
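Putting the geocodes and counts into the flat (latitude, longitude, magnitude) layout described earlier might look like the sketch below. The function name `to_globe_series` and the sample points are made up for illustration, and how the resulting array is handed to the globe (for example as a JSON file) is an assumption, not something shown here.

```python
import json


def to_globe_series(points):
    """Flatten (latitude, longitude, user_count) tuples into [lat, lng, magnitude, ...]."""
    points = list(points)
    largest = max((count for _, _, count in points), default=0)
    flat = []
    for lat, lng, count in points:
        # Normalize each count into [0, 1] against the busiest location.
        magnitude = count / largest if largest else 0.0
        flat.extend([lat, lng, magnitude])
    return flat


# Made-up sample: one tuple per unique, geocoded location.
sample_points = [(23.45678, -123.98765, 50), (18.5204, 73.8567, 100)]
series = to_globe_series(sample_points)
print(json.dumps(series))
```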
The process of data visualization consists of a series of sub-tasks, as explained beautifully by Ben Fry. They are:
Acquire -> Parse -> Filter -> Mine -> Represent -> Refine -> Interact
Yay, Yay and Yay !!!
After following the necessary steps mentioned above, everything was ready to be tasted 😋. I was so excited to see the hard work of many days finally speaking out loud in the form of beautiful and colourful pixels on my colourful Mac OS X Retina screen. 🤓
It’s being featured on ProductHunt now :)
The source code for the plotting part has been open-sourced and is accessible on GitHub.
If you like the post, please hit recommend / share on Twitter and keep your comments flooding in.