Searching for examples
After I got Blockbuilder working I realized that we were going to have a lot more blocks on our hands, and in the months since it has existed, Blockbuilder users have added at least 3,000 blocks to the ecosystem. With this flourishing of blocks, people started to notice it was harder and harder to keep track of new and interesting examples. Enter search:
How it works
The journey to block search started with the amazing bl.ocksplorer project by Irene Ros following an excited discussion at the first d3.unconf. The project showed me that there was a way to index blocks, something non-trivial if you know how bl.ocks.org works.
Blocks are powered by GitHub’s gists service. Each block is essentially a mini git repo, and the only requirement is that it has an index.html. Each gist has a unique id, which is used to view it in the browser like: https://gist.github.com/enjalot/21dd49a90c55348484f3
To view this gist in bl.ocks.org you just change the domain like so:
This is Mike Bostock’s genius: by storing code in gists, bl.ocks.org doesn’t need a database. GitHub is the database. When you go to a user’s bl.ocks page:
bl.ocks.org is making a request to the GitHub gist API to get all of that user’s latest blocks.
The thing is, there is no easy way to get just a user’s blocks because GitHub doesn’t differentiate blocks from other gists. To do that, bl.ocks.org makes sure there is an index.html file before displaying it in the user’s profile. This works well when you are just showing the latest 50–100 blocks of an individual user, but it doesn’t scale to showing any block made by all users.
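The index.html check described above can be sketched in a few lines of Python. This is only an illustration, not the code bl.ocks.org actually runs: it assumes gist objects shaped like the entries returned by GitHub's gist listing API, where each gist carries a `files` object keyed by filename. The function names and the sample ids are made up for the example.

```python
from typing import Any


def is_block(gist: dict[str, Any]) -> bool:
    """Treat a gist as a block when it contains an index.html file."""
    return "index.html" in gist.get("files", {})


def filter_blocks(gists: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only the gists that qualify as blocks, preserving order."""
    return [gist for gist in gists if is_block(gist)]


# Hypothetical sample data, trimmed to the fields the check needs.
sample_gists = [
    {"id": "abc123", "files": {"index.html": {}, "README.md": {}}},
    {"id": "def456", "files": {"notes.txt": {}}},
    {"id": "ghi789", "files": {"index.html": {}, "data.csv": {}}},
]

block_ids = [gist["id"] for gist in filter_blocks(sample_gists)]
```

The check itself is cheap. The cost is in fetching every gist for every user first, which is why it works for a single profile page but not for a global listing.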
To do that, Irene’s bl.ocksplorer project started collecting usernames of people who made blocks. You can submit one using this form and the username will start being tracked. Every few hours a script runs that pulls down all of the gists for each user in the list and filters them to just blocks. These blocks are then processed and the d3 API functions used in each one are extracted so they can be searched.
I started playing with the blockscanner repo, but I have an allergic reaction to Redis. I also determined that I wanted to download the contents of each gist to disk so I could explore them further, as well as be able to decouple the processing step from the downloading of the gists (to avoid GitHub API rate limits). Finally, I wanted to inject a bunch more users into the list using everyone who has logged into blockbuilder.org. (Keep in mind we only index public gists using publicly available information from the GitHub API. Blockbuilder does not store any GitHub credentials outside of the sessions, which have their own database.)
After playing around in my own sandbox for a while, I continued to iterate on the process for compiling users, getting the list of the latest blocks and downloading their content, which culminated in this repository: blockbuilder-search-index
At the time of this writing there are about 11,200 blocks being indexed from 1,068 users. The total disk space occupied by all the downloaded files (a subset of the gists, we only download text files such as .html, .md, .js, .csv etc.) is 1.8GB. This is completely manageable, and should continue to be as the number of blocks increases.
I have deployed the blockbuilder-search-index scripts to a server that runs them every 15 minutes with a cronjob, ensuring the search stays up-to-date. I have also updated Blockbuilder to reindex a block whenever it is saved. If you save a private block it will attempt to remove that block from the index (in case you decide that a previously public block should become private).
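The save-time rule above (index public blocks, drop private ones) can be sketched as follows. This is a minimal illustration, not Blockbuilder's actual code: a plain dict stands in for the search index, `reindex_on_save` is a hypothetical name, and the only thing taken from GitHub is that gist objects carry a boolean `public` field.

```python
from typing import Any


def reindex_on_save(index: dict[str, dict[str, Any]], gist: dict[str, Any]) -> None:
    """Upsert a public gist into the index; remove it if it is private."""
    if gist.get("public", False):
        index[gist["id"]] = gist
    else:
        # pop with a default is a no-op when the gist was never indexed,
        # so saving a brand-new private block is safe.
        index.pop(gist["id"], None)


# Hypothetical walk-through: a block is saved as public, then made private.
search_index: dict[str, dict[str, Any]] = {}
reindex_on_save(search_index, {"id": "abc123", "public": True, "description": "bar chart"})
indexed_after_public_save = "abc123" in search_index
reindex_on_save(search_index, {"id": "abc123", "public": False, "description": "bar chart"})
indexed_after_private_save = "abc123" in search_index
```

Removing on every private save, rather than only when a block is known to have been public, keeps the rule stateless: the save handler never needs to know the block's previous visibility.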
Finally, I decided to implement this feature in its own repo: blockbuilder-search
This allowed me to develop the front-end as well as the API in a sandbox without cluttering or getting tangled in Blockbuilder’s current codebase. I’ve learned a good bit of React since I started and I didn’t want to refactor the whole app just to add this. This also has the added benefit of allowing the search to be an optional add-on. It requires quite a bit more backend setup than the base Blockbuilder project, and I don’t want to impose that on people who are just interested in client-side editing of code.
Anatomy of an indexed block
To get a sense for the output of all that indexing work, let’s take a look at how a single block is stored in elasticsearch:
You can see the basic metadata we would expect: userId, description and the created_at and updated_at dates. During processing we pull out any function call that starts with d3 (from index.html, or anything ending in .js or .coffee), resulting in the api array. We also pull out any hex codes we see (from *.html, *.js, *.coffee and *.css), resulting in the colors array, and we pull out the filenames from the gist into their own filterable field. The readme.md is put into readme while the index.html is put into code. Both of these fields, along with the description, are analyzed by elasticsearch and used in the full text search. If you are familiar with elasticsearch you can see the mapping here.
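To make the api and colors extraction concrete, here is a rough Python sketch using regular expressions. It is not the code from blockbuilder-search-index. The patterns and the function names `extract_api` and `extract_colors` are assumptions chosen for illustration, and the matching is heuristic: for example, an id selector like `#decade` would be mistaken for a color.

```python
import re

# "d3" followed by one or more ".name" segments, immediately before "(".
D3_CALL = re.compile(r"\bd3(?:\.[A-Za-z_$][\w$]*)+(?=\s*\()")

# "#" followed by exactly 6 or exactly 3 hex digits.
HEX_COLOR = re.compile(r"#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\b")


def extract_api(source: str) -> list[str]:
    """Return the distinct d3.* function calls found in the source, sorted."""
    return sorted(set(D3_CALL.findall(source)))


def extract_colors(source: str) -> list[str]:
    """Return the distinct hex color codes found in the source, lowercased and sorted."""
    return sorted({color.lower() for color in HEX_COLOR.findall(source)})


# Hypothetical snippet standing in for a block's index.html script.
sample_source = """
var color = d3.scale.category10();
var svg = d3.select("body").append("svg");
svg.style("fill", "#FF0000").style("stroke", "#333");
d3.select("#chart");
"""

api = extract_api(sample_source)        # ['d3.scale.category10', 'd3.select']
colors = extract_colors(sample_source)  # ['#333', '#ff0000']
```

Chained calls such as `.append` are skipped because they do not start with d3, and `#chart` is not treated as a color because it is not made of hex digits.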
Just the beginning
Now that we are able to gather all the blocks in one place, we have a lot of work still to do. First, we need to find all the cool examples no one knew about and share them! Second, we have all kinds of potential for visualizing what’s happening with blocks.
I’ve been playing with this idea with a few friends, and there have already been some really cool blocks coming out of that:
Now that we can dynamically query subsets of blocks, these meta-visualizations can become richer and more tailored.
Something else I’m interested in investigating is files. Starting with file names and file types, we could find out more about what kind of data people are sharing with their examples. We can take it even further: since we download all the data (408MB zipped), we could analyze the types, the columns and the content of the data to see if there are any interesting patterns. I leave this as an exercise for the reader, but if you do, let me know on twitter or email!
Irene Ros showed me that this would be possible. Her pioneering work on bl.ocksplorer.org inspired me greatly and this would not have been possible without it. Furthermore, her leadership in the d3 and greater datavis community is something I aspire to. She started OpenVisConf and we were fortunate enough to have her as the keynote at the first d3.unconf where she developed the idea for bl.ocksplorer.org and built it on the plane ride home.
Mike Bostock is the giant whose shoulders I stand on. He has made not one but 3 software libraries which I base my current career on, and I know I’m not the only one. His genius insight to use gists as the backing for bl.ocks continues to inspire me. My first attempt was with Tributary, and now I’m trying to contribute to his momentum with Blockbuilder.
I’d like to thank Zeus Lalkaka for providing key feedback during this whole process and listening to me complain about everything from React to elasticsearch. He also helped me load test with siege, which was easier than I thought it would be.
Additional shoutouts go to Brian Smith, who scraped StackOverflow for bl.ocks.org links so that we could index them; Micah Stubbs, who has a big collection of blocks shared in the Knight d3 course; and Christophe Viau, who pulled out the block users from his database of examples that weren’t already being indexed. They took our collection of users to be indexed from 600 to 1000!