Infographic: How OnCrawl works

We decided it was time to share with you how OnCrawl works. We have developed some cool features since we launched our first crawl, and this infographic introduces the technologies we used to build OnCrawl and how the SEO crawler works.

[Infographic: the technologies behind OnCrawl]

Infographic transcription: How OnCrawl works

Our SEO crawler OnCrawl has been built on top of open source applications, with love and a few secret ingredients. Here is a brief presentation of how our crawler works and which technologies we use.

CRAWL

Apache Hadoop Map/Reduce jobs: web crawling based on Apache Nutch.
1 Inject start URL
2 Crawl iterations (sketched in the code below):

  • Generate fetch lists
  • Fetch URLs & parse HTML pages
  • Update crawl database
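
To make those iterations concrete, here is a minimal single-process sketch of a Nutch-style crawl cycle in Python. The real pipeline runs as distributed Hadoop Map/Reduce jobs, so everything below (crawl_db, top_n, the use of requests, …) is illustrative and is not OnCrawl's actual code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

import requests


class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags while parsing HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, iterations=3, top_n=50):
    # The crawl database maps each known URL to its fetch status.
    crawl_db = {start_url: "unfetched"}              # 1. inject start URL
    for _ in range(iterations):                      # 2. crawl iterations
        # Generate a fetch list from the URLs not fetched yet.
        fetch_list = [u for u, s in crawl_db.items() if s == "unfetched"][:top_n]
        for url in fetch_list:
            crawl_db[url] = "fetched"
            try:
                response = requests.get(url, timeout=10)   # fetch URL
            except requests.RequestException:
                continue
            parser = LinkExtractor()
            parser.feed(response.text)                     # parse HTML page
            # Update the crawl database with newly discovered links.
            for link in parser.links:
                crawl_db.setdefault(urljoin(url, link), "unfetched")
    return crawl_db
```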

ANALYSIS

In-house Hadoop and Spark jobs to mine information from the gathered data
3 Data analysis:

  • Link enrichment: backport the HTTP status of the source and target onto every single internal link
  • N-grams (keyword extraction and high-level analysis)
  • Identification of duplicated important metadata (title, h1, meta description, …)
  • Near duplicates (SimHash-based approach): clustering & introspection via pairwise similarity scoring (a toy version is sketched after this list)
  • Access logs analysis (SEO visits, bot activity, orphan pages detection); a minimal orphan-page sketch follows below
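
For a flavour of the near-duplicates step, here is a toy SimHash in pure Python: each page's text is reduced to a 64-bit fingerprint, and two pages count as near duplicates when the Hamming distance between their fingerprints is small. The feature weighting and clustering OnCrawl actually uses are not shown here; this only illustrates the general technique.

```python
import hashlib


def simhash(text, bits=64):
    """Toy SimHash: a 64-bit fingerprint built from word features."""
    votes = [0] * bits
    for word in text.lower().split():
        # Hash each feature to a stable 64-bit integer.
        h = int.from_bytes(hashlib.md5(word.encode()).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # Each fingerprint bit is the sign of the summed votes.
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


# Pairwise similarity scoring: a small Hamming distance means near duplicates.
page_a = simhash("cheap red running shoes for men free shipping")
page_b = simhash("cheap red running shoes for men fast shipping")
print(hamming(page_a, page_b))  # low value: the two pages are near duplicates
```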

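Orphan page detection from access logs reduces, in its simplest form, to a set difference: URLs that receive visits in the logs but that the crawler never reached through an internal link. A minimal sketch, with a hypothetical log format and field names:

```python
import re

# Hypothetical combined-log format; real-world log parsing is messier.
REQUEST = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+"')


def orphan_pages(log_lines, crawled_paths):
    """Paths seen in the access logs but never reached by the crawler."""
    visited = set()
    for line in log_lines:
        match = REQUEST.search(line)
        if match:
            visited.add(match.group("path"))
    return visited - set(crawled_paths)


logs = ['66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] '
        '"GET /old-landing-page HTTP/1.1" 200 512']
print(orphan_pages(logs, {"/", "/products", "/blog"}))
# {'/old-landing-page'}: a page getting traffic yet orphaned in the site structure
```
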
EXPLORE

Python/Flask API and JavaScript web client
4 Interactive reporting:

  • OQL (OnCrawl Query Language) over Elasticsearch for live reporting / querying / aggregations (a rough sketch follows this list)
  • C3.js-based interactive charts mapped onto OQL presets
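
OQL's real syntax is not reproduced here, but the idea of turning a filter expression into a live Elasticsearch query plus aggregation looks roughly like this. The index name, field names, and query shape are all assumptions, not OnCrawl's actual schema.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Hypothetical translation of an OQL-style filter,
# "status_code = 404 AND depth <= 3, grouped by depth",
# into a native Elasticsearch query with an aggregation.
body = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"status_code": 404}},
                {"range": {"depth": {"lte": 3}}},
            ]
        }
    },
    "aggs": {"pages_by_depth": {"terms": {"field": "depth"}}},
    "size": 0,  # aggregations only, no individual hits
}

result = es.search(index="crawled_pages", body=body)
for bucket in result["aggregations"]["pages_by_depth"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```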

Elasticsearch: a search engine that has helped us create our URL details.
Graphs building: our talented front-end developers built our dataviz using the JavaScript library C3.js.

And if you want to know which features we have developed since the beginning, you can still have a look at our previous infographic!
