Make it Real Elite — Sixth week — Search engine - first two approaches
This week we defined a micro project: building a Google-like search engine. The idea was to start crawling some web pages and extract their links so we could build a huge database with the body of each page; then, when someone searches for a word or phrase, we can look it up in the database and return some results.
To tackle this project, we first had to define what we needed, so this was our list:
1. Crawl a page and get its body.
2. Save the info in a database. Which type of database should we use, and which one?
3. How do we get more pages?
4. Which pages should we return when somebody searches for something?
Approach 1.0
In this first approach, we decided to use MongoDB to save each page with its body, along with the links pending to be crawled. To get the body, we made an HTTP request with the URL and received the body in the response. We saved that body in MongoDB and then used a library called htmlparser2, which helped us find the `a` tags; from those we read attributes like `href` and saved the URLs so they could be crawled in turn.
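A minimal sketch of this flow. The real project used htmlparser2 to parse the HTML and MongoDB to store it; here a small regex stands in for the parser and a plain object stands in for the database, so the flow is easier to follow. All names are illustrative, not the project's actual code.

```javascript
// Extract the href attribute of every <a> tag in a page body.
// (htmlparser2 does this robustly; the regex is only for illustration.)
function extractLinks(body) {
  const links = [];
  const re = /<a\b[^>]*\bhref\s*=\s*["']([^"']+)["']/gi;
  let match;
  while ((match = re.exec(body)) !== null) {
    links.push(match[1]);
  }
  return links;
}

// Save a crawled page the way approach 1.0 did: store the body and
// queue every link found in it as pending to be crawled.
function savePage(db, url, body) {
  db.pages[url] = body;
  db.pending.push(...extractLinks(body));
}

const db = { pages: {}, pending: [] };
savePage(db, 'http://example.com', '<p><a href="http://example.com/about">About</a></p>');
```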
The problems:
1. There will be an infinite loop crawling the same pages: after we get the links on a page and crawl them, if any of those pages links back to the first page, the first page will be crawled again.
2. If we try to check whether a URL already exists to avoid crawling it, we hit the same problem, because there is no queue or promise preventing us from saving the data before the database has responded.
3. If the data is stored in MongoDB, what is the best way to check whether the words used in a search appear in a body?
4. Which algorithm should we use to rank the results from the database, so that they are the best ones and the ones the user is expecting?
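Problem 2 is worth seeing in code. This is a hypothetical sketch, not the project's actual code: the `setTimeout` calls simulate database round trips, and a `Set` stands in for the MongoDB collection. Two concurrent crawls of the same URL can both pass the "does it already exist?" check before either insert has finished.

```javascript
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const stored = new Set(); // stands in for the MongoDB collection

async function dbHas(url) {
  await delay(10); // simulated read latency
  return stored.has(url);
}

async function dbInsert(url) {
  await delay(10); // simulated write latency
  stored.add(url);
}

// Looks safe, but the check and the insert are not atomic.
async function saveIfNew(url) {
  if (await dbHas(url)) return false;
  await dbInsert(url);
  return true; // true means "go crawl this page"
}

async function demo() {
  // Both checks run before either insert completes, so both callers
  // are told the URL is new and the page gets crawled twice.
  return Promise.all([
    saveIfNew('http://example.com'),
    saveIfNew('http://example.com'),
  ]);
}
```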
Approach 1.1
In this approach, we decided to use promises so that everything runs at the right time and in the right order. The first step was to crawl the URL and get the body; when that promise resolved, we passed the body to a method that saved it in the database; after saving, another method parsed the body, extracted all the links it contained, and started the whole process again with those links, using `Promise.all` so each step continues only after all its promises have resolved. With one link, it worked perfectly: the first link yielded 39 new links to be crawled and saved in the database. Crawling those 39 links and extracting the links inside them took a little more time, but everything was still working, a little slowly but fine. That is where the problems started: those 39 links produced 1440 new links to be crawled.
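A hypothetical sketch of this pipeline: crawl, save, extract links, then `Promise.all` over the whole next level. A tiny in-memory "web" replaces real HTTP requests, and a depth limit is added purely so the sketch terminates; the real code had no such limit, which is exactly why it revisits pages forever and fans out so fast.

```javascript
// Three fake pages; 'a' and 'b' link back to each other.
const web = {
  'http://a': ['http://b', 'http://c'],
  'http://b': ['http://a', 'http://c'],
  'http://c': [],
};

const saved = []; // stands in for the MongoDB collection

async function fetchBodyLinks(url) {
  return web[url] || []; // simulated "request the body, parse the links"
}

async function crawl(url, depth) {
  if (depth === 0) return; // only here so the sketch terminates
  const links = await fetchBodyLinks(url);
  saved.push(url); // "save it on the database"
  // Start again with all the links, waiting for the whole level to
  // finish. Nothing stops us from revisiting 'http://a', so pages
  // get crawled over and over.
  await Promise.all(links.map((link) => crawl(link, depth - 1)));
}
```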
The problems:
1. There will be an infinite loop crawling the same pages: after we get the links on a page and crawl them, if any of those pages links back to the first page, the first page will be crawled again.
2. With 1440 links to crawl using promises, waiting for all of them to finish before continuing to the next step takes too much time. It can take hours or days.
3. Suppose we do wait for all those promises to resolve: how many new links do you think we would get from 1440 pages? At roughly 37 links per page, maybe 54,000 or more.
4. If the data is stored in MongoDB, what is the best way to check whether the words used in a search appear in a body?
5. Which algorithm should we use to rank the results from the database, so that they are the best ones and the ones the user is expecting?
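The back-of-envelope math behind problems 2 and 3: if every page yields roughly as many links as the first batch did, each round of `Promise.all` multiplies the work by the same factor. The 39 and 1440 come from the run described above; the extrapolation is just an estimate.

```javascript
// 39 pages produced 1440 links, so each page yields ~37 links.
const linksPerPage = 1440 / 39;                    // ≈ 36.9 links per page
// One more level of Promise.all over 1440 pages:
const nextRound = Math.round(1440 * linksPerPage); // ≈ 53,169 new links
```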
These approaches were not the best fit for this project.
