Infographic: How OnCrawl works
We decided it was time to share how OnCrawl works. We have developed some cool features since we launched the first crawl. This infographic introduces the technologies we used to build OnCrawl and how the SEO crawler works.
Infographic retranscription: How OnCrawl works
Our SEO crawler OnCrawl is built on top of open source applications, with love and a few secret ingredients. Here is a brief presentation of how our crawler works and which technologies we use.
Apache Hadoop Map/Reduce jobs: web crawling based on Apache Nutch.
1 Inject start URL
2 Crawl iterations:
- Generate fetch lists
- Fetch URLs & Parse HTML pages
- Update crawl database
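The inject / generate / fetch / update cycle above can be sketched in a few lines of Python. This is a minimal illustration of the loop, not OnCrawl's or Nutch's actual code; the in-memory link graph and function names are assumptions made for the example.

```python
# Toy link graph standing in for the web (illustrative data, not real pages).
WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": [],
}

def crawl(start_url, iterations=3):
    """Nutch-style crawl loop: inject, then repeat generate/fetch/update."""
    crawl_db = {start_url: "unfetched"}          # 1. inject the start URL
    for _ in range(iterations):                  # 2. crawl iterations
        # Generate the fetch list: every known URL not yet fetched.
        fetch_list = [u for u, s in crawl_db.items() if s == "unfetched"]
        if not fetch_list:
            break
        for url in fetch_list:                   # fetch URLs & parse pages
            outlinks = WEB.get(url, [])
            crawl_db[url] = "fetched"            # update the crawl database
            for link in outlinks:                # discovered links join the
                crawl_db.setdefault(link, "unfetched")  # next iteration
    return crawl_db
```

Each iteration only fetches URLs discovered in earlier rounds, which is why the real crawler runs the cycle as repeated batch (MapReduce) jobs rather than one continuous pass.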
In-house Hadoop and Spark jobs to mine information from the gathered data
3 Data analysis:
- Link enrichment: backport the HTTP status of the source and target pages onto every single internal link
- N-grams (keyword extraction and high-level analysis)
- Identification of duplicated important metadata (title, h1, meta description, …)
- Near-duplicate clustering & introspection (simhash-based approach, pairwise similarity scoring)
- Access log analysis (SEO visits, bot activity, orphan page detection)
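The near-duplicate step mentioned above rests on simhash: each page is reduced to a short fingerprint, and two pages are compared by the Hamming distance between fingerprints. Here is a minimal sketch of the idea; the tokenization, hash function and 64-bit size are assumptions for illustration, not OnCrawl's production parameters.

```python
import hashlib

def simhash(text, bits=64):
    """Tiny simhash: hash each token, sum signed bit votes, keep the sign."""
    votes = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a, b):
    """Pairwise similarity scoring: count of differing fingerprint bits."""
    return bin(a ^ b).count("1")
```

Because similar token sets vote the same way on most bits, two nearly identical pages end up with fingerprints only a few bits apart, while unrelated pages tend to differ on roughly half the bits; clustering then only needs cheap Hamming-distance comparisons instead of full-text diffs.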
4 Interactive reporting:
- OQL (OnCrawl Query Language) over Elasticsearch for live reporting, querying and aggregations
- C3.js-based interactive charts mapped onto OQL presets
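To give a flavor of the reporting layer, here is what an OQL-style question such as "count the pages returning a 404, broken down by crawl depth" might look like once translated into the Elasticsearch query DSL. The field names (`status_code`, `depth`) and the mapping are hypothetical; OQL's actual syntax and the production index schema are not shown in the infographic.

```python
# Hypothetical translation of an OQL filter into the Elasticsearch
# query DSL: filter pages on their HTTP status, then aggregate by depth.
query = {
    "query": {
        "term": {"status_code": 404}        # assumed field name
    },
    "aggs": {
        "by_depth": {                        # bucket 404 pages per crawl depth
            "terms": {"field": "depth"}      # assumed field name
        }
    },
}
```

A chart preset can then bind the `by_depth` buckets directly to a C3.js bar chart, which is what makes the reports interactive: changing the filter re-runs the aggregation live instead of recomputing a batch report.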
Elasticsearch: the search engine that has helped us create our URL details
And if you want to know which features we have developed since the beginning, take a look at our previous infographic!