Ever try acceleration?

Mayank Chandak
7 min read · Apr 9, 2015

I am a co-founder of CURRENT, which I'm building with my friend. We are making software that intelligently reads and makes sense of any text content. We feed it as many URLs as it can digest, and it neatly organizes all the pages by assigning them to generated categories. We then feed all of this into a search index available to everyone. See for yourself here

Our short-term goal is simple: to increase the serendipity of finding great writing (discussions, blog posts, news, opinions, reviews), both long and short (140 characters?), in the big haystack of the internet. But it's not so simple once we actually get to work.

Here's the recipe I told my grandma when she asked what I was making:

1. Find and index all the articles in the internet ocean

2. Build software that reads all of it and determines content quality, all while teaching itself

3. Help everyone discover something new

I overheard her asking my brother if I had been drinking lately.

X days to launch a search engine.

I keep wondering why I still sometimes read the newspaper, even though I spend most of my waking hours in front of some sort of screen. The newspaper never gets that much screen time.

The newspaper is a dying breed, they say, but why isn't it dead yet? My guess: it has none of the 'highly recommended' stories that fill every modern social channel. That's what keeps it relevant, and there are no privacy violations either.

The habit of reading a newspaper has now been turned into a feed, mostly of what my network recommends and likes. Another good way to consume content is subscribing to sources of my own preference (Twitter lists, Flipboard, Nuzzel, Feedly; all excellent at what they do). The only problem is that I am always in a universe of my own creation, which has made access to different viewpoints very slow.

I wanted a modern version of that recycled bunch of pages. But all I got were some social newspapers and a feed of what my followers like and recommend. There's lots to discover, lots to uncover.

A compromise solution would be regional curators who simply kick the patterns of low-quality content out of a Twitter stream of selected publishers, leaving behind a stream of informative content for the reader.

But my co-founder at CURRENT told me he would like to take a shot at replacing those nicely dressed content editors with intelligent software, and at making that data accessible to everyone. That is how we took up the challenge of building a search engine focused primarily on literary quality. And just for kicks, we upped the difficulty level by setting our deadline 10 days before the S15 Y Combinator application closed.

Another Search Engine?

Building a search engine is a tough bet, especially in a market heavily weighted in favor of a giant (we now truly appreciate what it takes to get there). A new player starting from zero needs a lot of smart decisions and out-of-the-box thinking to get up to speed in a near-perfect monopoly. And not to forget a lot of compute. Lots of it.

The state of the internet is such that there are standards, and then there are more standards. Publishers don't give a shit about web standards, and that same fact makes the internet a beautiful place to live in. But it makes it tough for software to reliably extract useful information from heaps of HTML tags.
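To make the problem concrete, here is a toy text-density heuristic of the kind extractors lean on: pick the container holding the most paragraph text. This is only a sketch assuming BeautifulSoup; our real pipeline is considerably more involved.

```python
# A toy text-density heuristic: pick the block with the most paragraph text.
# Illustrative only; assumes BeautifulSoup (pip install beautifulsoup4).
from bs4 import BeautifulSoup

def extract_main_text(html):
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements that never carry article text.
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()
    best, best_len = None, 0
    # Score every container by how much <p> text it directly holds.
    for node in soup.find_all(["div", "article", "section"]):
        text = " ".join(
            p.get_text(" ", strip=True)
            for p in node.find_all("p", recursive=False)
        )
        if len(text) > best_len:
            best, best_len = text, len(text)
    return best or soup.get_text(" ", strip=True)
```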

We can't be creating an index of the web today; that game was won long ago. But with mountains of content everywhere, there is still room for many players who make all that knowledge accessible.

In the land of Google vs. the others, search is differentiated by a few features, but mostly by presentation. We approach search from a very different angle: why provide (or try to provide) a zillion results when most people find their answer within the first few?

So how are things looking?

Sometime in January — As an experiment, we start receiving Twitter data from pre-filtered sources I had picked over time. A week later we start adding mixed-quality streams. We run language processing over real-time sources feeding us content, on the excess compute of one of India's premier universities.
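To give a flavor of that first step: a tweet arrives as JSON, and the first job is pulling the expanded article links out of it. A minimal sketch over the standard v1.1 tweet format; enqueue_for_processing is a hypothetical downstream hook, not our actual API.

```python
import json

def urls_from_tweet(raw):
    """Pull expanded article links out of one raw tweet (v1.1 JSON)."""
    tweet = json.loads(raw)
    entities = tweet.get("entities", {})
    return [u["expanded_url"]
            for u in entities.get("urls", [])
            if u.get("expanded_url")]

# Feeding a line-delimited dump of tweets into the pipeline:
# with open("stream.jsonl") as f:
#     for line in f:
#         for url in urls_from_tweet(line):
#             enqueue_for_processing(url)  # hypothetical downstream hook
```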

Sometime in February — Varun revives the web crawler he built back in high school. We seed it with all the Twitter links we have accumulated.
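In spirit, the crawler is just a queue of seed URLs: fetch a page, harvest its links, repeat. A toy breadth-first sketch assuming requests and BeautifulSoup; the real crawler handles politeness, retries, and robots.txt, which this deliberately omits.

```python
# A minimal breadth-first crawler seeded with accumulated tweet links.
# Toy sketch: no politeness delays, retries, or robots.txt handling.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seeds, max_pages=100):
    queue, seen, pages = deque(seeds), set(seeds), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=5)
        except requests.RequestException:
            continue  # skip unreachable pages
        pages[url] = resp.text
        # Harvest outgoing links and push the unseen ones onto the queue.
        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```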

March 15 — 6 million stories processed in 40 days. Still not good enough.

March 23 — We restart from scratch, discarding all the data amassed so far, implementing improvements and correcting earlier efficiency issues. We add a message queue to keep processes in sync, and the downloader now map/reduces over huge URL lists. Speed receives a significant upgrade.
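The map/reduce idea on a huge URL list looks roughly like this: map a fetcher over chunks in a worker pool, then reduce the per-chunk tallies into one summary. A stdlib-only sketch, not our actual downloader.

```python
# "Map" fetches over chunks of the URL list, "reduce" merges the tallies.
from multiprocessing import Pool
from urllib.request import urlopen

def fetch_chunk(urls):
    # Map step: one worker downloads one chunk and returns its tallies.
    ok, failed = 0, 0
    for url in urls:
        try:
            urlopen(url, timeout=5).read()
            ok += 1
        except Exception:
            failed += 1
    return {"ok": ok, "failed": failed}

def chunks(items, n):
    for i in range(0, len(items), n):
        yield items[i:i + n]

def download_all(urls, workers=8, chunk_size=1000):
    with Pool(workers) as pool:
        partials = pool.map(fetch_chunk, chunks(urls, chunk_size))
    # Reduce step: fold per-chunk stats into one summary.
    return {
        "ok": sum(p["ok"] for p in partials),
        "failed": sum(p["failed"] for p in partials),
    }
```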

March 26, morning — We can't reach our machines. It was beyond our wildest imagination, and we started speculating on causes: maybe our network had been isolated because of the high-volume crawling, or worse, an IP blacklisting. The actual reason was rather trivial: to handle heavy streaming traffic for the India vs. Australia Cricket World Cup match, there were routing changes on the fiber backbone that the university was never notified about, rendering the network unreachable from any outside subnet. India lost the match.

March 26, late night — The network is back up. We submit the Y Combinator application and post to 'Show HN'. We stay up all night watching for activity. URLs processed: 7 million in 3 days.

Murphy’s law

It is an experience common to all men to find that, on any special occasion, such as the production of a magical effect for the first time in public, everything that can go wrong will go wrong. Whether we must attribute this to the malignity of matter or to the total depravity of inanimate things, whether the exciting cause is hurry, worry, or what not, the fact remains.

Nevil Maskelyne

March 28 — Each component in our stack is designed for a single task in its own container (thanks, Docker), and they all talk to a central MQ. We were keeping too much data in the message queue, so we ran out of RAM: system panic at night (IST), downtime till morning. One simple queue brought the seemingly resilient platform to its knees. We redesigned it to use the queue for messages only.
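The rule we landed on: the queue carries pointers, the store carries bytes. A minimal sketch of the pattern, assuming RabbitMQ via pika (the post doesn't name our broker) and using local disk as a stand-in document store.

```python
# Queue carries tiny references; heavy payloads live in a document store.
import json
import os
import uuid

import pika

STORE = "/tmp/docstore"

def enqueue_document(channel, html):
    os.makedirs(STORE, exist_ok=True)
    doc_id = uuid.uuid4().hex
    with open(os.path.join(STORE, doc_id), "w") as f:
        f.write(html)  # heavy payload goes to the store, not the MQ
    channel.basic_publish(
        exchange="",
        routing_key="documents",
        body=json.dumps({"doc_id": doc_id}),  # only a small reference
    )

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="documents")
enqueue_document(channel, "<html>...</html>")
connection.close()
```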

April foo1 — Thanks to poor power planning at the data centre, a UPS failure kills our servers. No heartbeat. The mains power was uninterrupted; it was the 'Uninterruptible Power Supply' that failed. The disks were XFS-formatted, and the corrupted partition made recovery impossible.

Moving fast(er)

April 2 — We migrate our code to Google Cloud and set up the best instances 'free credits' can buy. Given the compute requirements, we decide not to run the web crawl and to stay put with Twitter stories; the Twitter data accumulated since January is left behind. A news feed for any topic you wish to explore is public at http://crrnt.is/

April 7 — After pulling the servers back to life, we bake in further learnings. Even though this is our most stable release yet, we're still calling it a beta because of unforeseen third-world problems. Speed is now 10x our last production run: 1,000 URLs discovered per second, and articles processed at 70,000 per hour.

If you managed to read this far

http://beta.crrnt.is/

We should have been in panic mode. Instead, we took the opportunity to rethink everything, from our stack to our strategy. But one thing is clear: what doesn't kill you makes you stronger.

Another learning from building this startup is pretty monumental for us: starting out with a clean slate opens up so many possibilities. Not being attached to previous achievements (and data) has given us an edge and made room for fast-paced improvements.

Coming Up

One of the unique features of our product is that we do not use (nor plan to use) any kind of dictionary, language or otherwise. We would rather spend our time improving the machine learning behind the product.
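For a sense of what 'no dictionary' means in practice: every term weight is derived from the corpus's own statistics, with no external word lists. An illustrative TF-IDF sketch, emphatically not our actual model.

```python
# Term weights from corpus statistics alone: no dictionaries, no word lists.
import math
import re
from collections import Counter

def tokens(text):
    return re.findall(r"[^\W\d_]+", text.lower())

def top_terms(docs, k=5):
    # Document frequency: in how many docs does each term appear?
    df = Counter()
    for doc in docs:
        df.update(set(tokens(doc)))
    n = len(docs)
    results = []
    for doc in docs:
        tf = Counter(tokens(doc))
        # TF-IDF: frequent in this doc, rare across the corpus.
        scored = {t: c * math.log(n / df[t]) for t, c in tf.items()}
        results.append(sorted(scored, key=scored.get, reverse=True)[:k])
    return results
```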

You'll be seeing some amazing new tricks soon. We are excited to show you face recognition and automatic tagging of public images on the web.

Our current language processing stack is Python. We are gradually shifting towards C, and the performance improvements are unparalleled. Check out https://github.com/varunmittal91/newerahpc
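A taste of that migration: keep the orchestration in Python and push the hot loop into a shared library via ctypes. Both libnlp.so and count_tokens here are hypothetical stand-ins; the real C code lives in the newerahpc repo linked above.

```python
# Python orchestration, C hot loop. "libnlp.so" and count_tokens are
# hypothetical stand-ins for illustration.
import ctypes

# Assumed C signature: size_t count_tokens(const char *text);
lib = ctypes.CDLL("./libnlp.so")
lib.count_tokens.argtypes = [ctypes.c_char_p]
lib.count_tokens.restype = ctypes.c_size_t

def count_tokens(text):
    return lib.count_tokens(text.encode("utf-8"))
```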

We are already experimenting with a headless WebKit that mimics user flow while downloading a webpage. Very shortly we will have a visual scraper that identifies the information on a webpage with much better accuracy.
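The experiment looks roughly like this: render the page in a headless browser first, so content built by JavaScript is actually in the DOM before extraction. A sketch assuming Selenium driving PhantomJS, one common headless-WebKit setup at the time; the post doesn't pin down our exact tool.

```python
# Render the page in headless WebKit before extraction, so JavaScript-built
# content is present in the DOM. Assumes Selenium + PhantomJS.
from selenium import webdriver

driver = webdriver.PhantomJS()
try:
    driver.get("http://example.com/article")
    rendered_html = driver.page_source  # post-JavaScript DOM, ready to parse
finally:
    driver.quit()
```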

While doing all this, we are working towards having our software place all the information on the internet into one large graph. Search engines will no longer talk about hits and pages.

There is no destination, only progressive goals: to make computers intelligent and to assist humans in tasks where machines are designed to excel (compute, analysis, research). Our plan is crazy, and hopefully it will keep us on our toes for the rest of our lives.

It has been a busy few months for our little startup chasing a tight deadline. The days have been truly fast-paced, caffeine-fueled, and fun, and we would have it no other way. What a wonderful high! I can't imagine the places we'd go with some real acceleration.

Mayank & Varun
