Apps on apps on apps

Exploring the App Store with tSNEJS & Scikit-learn

Veeral Patel
3 min read · Dec 15, 2015


About a week ago, I randomly read Andrej Karpathy’s post about visualizing top Twitter users based on what they tweet about and wanted to do the same thing, but for iOS apps and their descriptions. So, I did.

With help from Andrej’s post, I used d3, tSNEJS, and Scikit-learn to represent the similarities and differences between the top 500 free apps on the App Store.

Play around with it here (on a desktop): http://veeralpatel.github.io/app-clustering/

After scraping the names of the top 500 apps from App Annie using Kimono, I used Apple’s iTunes Search API to extract the description for each app. Before sending the descriptions through TfidfVectorizer, I did some preprocessing to remove sentences that would make unrelated apps look similar. For example, the descriptions for Tinder, UFC, and AutoRap all have the same disclaimer copy regarding subscription-based in-app purchases (Apple presumably requires it).

After that, as explained by Andrej, TfidfVectorizer iterates through each description and “takes note of all words (unigrams) and word bigrams (i.e. series of two words).” It creates “a dictionary out of all the unigram/bigrams” and records how often the app description contains each one. The similarity between each pair of apps was then computed as the dot product of their description vectors.
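To make those steps concrete, here is a minimal standard-library sketch of the idea. It is not the code behind the visualization and not scikit-learn's implementation: the three toy descriptions, the helper names, and the `max_apps` threshold are all made up for illustration. With scikit-learn itself, the vectorizing step would roughly correspond to `TfidfVectorizer(ngram_range=(1, 2))`.

```python
import math
import re
from collections import Counter

def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def drop_shared_sentences(descriptions, max_apps=1):
    # Drop any sentence that appears verbatim in more than max_apps descriptions
    # (hypothetical rule for catching copy-pasted disclaimer text).
    counts = Counter()
    for text in descriptions.values():
        counts.update(set(split_sentences(text)))
    return {
        name: " ".join(s for s in split_sentences(text) if counts[s] <= max_apps)
        for name, text in descriptions.items()
    }

def ngrams(text):
    # Unigrams plus word bigrams, lowercased.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return words + [" ".join(pair) for pair in zip(words, words[1:])]

def tfidf_vectors(docs):
    # Returns {name: {term: weight}} with L2-normalized tf-idf weights.
    term_counts = {name: Counter(ngrams(text)) for name, text in docs.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    n = len(docs)
    vectors = {}
    for name, counts in term_counts.items():
        # Smoothed idf: terms found in fewer descriptions get more weight.
        weights = {
            term: count * (math.log((1 + n) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors[name] = {term: w / norm for term, w in weights.items()}
    return vectors

def dot(u, v):
    # Dot product of two sparse vectors stored as dicts.
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())

# Toy data, invented for this example.
descriptions = {
    "ChatA": "Send messages to friends for free. Subscriptions renew automatically unless cancelled.",
    "ChatB": "Free messages and calls with friends. Subscriptions renew automatically unless cancelled.",
    "EditC": "Trim and edit video clips with filters.",
}

cleaned = drop_shared_sentences(descriptions)
vectors = tfidf_vectors(cleaned)
names = sorted(vectors)
similarity = {(a, b): dot(vectors[a], vectors[b]) for a in names for b in names}
```

Because every vector is normalized to unit length, the dot product here is the cosine similarity: an app compared with itself scores 1, and two apps with no words in common score 0. In this toy set the two chat apps share "messages", "friends", and "free", so they score higher with each other than either does with the video editor.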

Next, I used d3 and tSNEJS to present everything. If two apps have similar descriptions, “there is a strong attractive force between the apps in the embedding.” If their descriptions have little in common, tSNEJS and its cost function place them relatively freely in the embedding.
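The "attractive force" can be seen in the gradient of the standard t-SNE cost. The sketch below is a plain Python illustration of that gradient, not tSNEJS's code; the function name `tsne_gradient`, the target probabilities `P`, and the starting positions `Y` are invented for the example, and how similarities get turned into `P` is left out.

```python
def tsne_gradient(P, Y):
    # P[i][j]: target pair probabilities (symmetric, summing to 1).
    # Y[i]: current (x, y) position of item i in the embedding.
    n = len(Y)

    # Student-t kernel between every pair of embedded points.
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                dx = Y[i][0] - Y[j][0]
                dy = Y[i][1] - Y[j][1]
                W[i][j] = 1.0 / (1.0 + dx * dx + dy * dy)
    Z = sum(map(sum, W))

    grad = []
    for i in range(n):
        gx = gy = 0.0
        for j in range(n):
            if i == j:
                continue
            q = W[i][j] / Z  # probability the embedding currently assigns to the pair
            coeff = 4.0 * (P[i][j] - q) * W[i][j]
            gx += coeff * (Y[i][0] - Y[j][0])
            gy += coeff * (Y[i][1] - Y[j][1])
        grad.append((gx, gy))
    return grad

# Hypothetical setup: items 0 and 1 are very similar, 0 and 2 only slightly,
# 1 and 2 not at all.
P = [
    [0.0, 0.4, 0.1],
    [0.4, 0.0, 0.0],
    [0.1, 0.0, 0.0],
]
Y = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]

grad = tsne_gradient(P, Y)
# Gradient descent moves each point along the negative gradient.
step = [(-gx, -gy) for gx, gy in grad]
```

When a pair's target probability is larger than what the embedding currently gives it, the term for that pair pulls the two points together; when it is smaller, the term pushes them apart. In this setup, the descent step for item 0 points toward item 1 (positive x) and away from item 2 (negative y).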

Overall, the result is a nice visualization of the top 500 free apps on the App Store. Similar categories of apps end up clustering together, while others hover around.

Some interesting clusters I saw are shown below:

Messaging apps on messaging apps & stream city
Snapchat related apps & video editors

Some next steps:

  • Compare multiple features (ratings, screenshots, color schemes)
  • Build a recommendation engine based on those features
  • Add to the dataset by including apps posted on ProductHunt (and realize how many products do pretty much the same thing)
  • White space analysis?
  • Visualize other things, like similarities between Fortune 500 stocks. Features could include market trends from the past 6 months, backers, future plans, and product release dates.

Thanks for reading and let me know (@vral) if you have questions, find any bugs, or see anything interesting!

Thanks to Divyahans Gupta (@divyahansg), Yonatan Oren (@yonatano_), and Aneesh Pappu (@AneeshPappu).
