Make it Real Elite — seventh week — Search engine — final approach
After running into some trouble with the first solutions, I got a recommendation to use AWS Lambda and AWS SNS. My first thought was “what the f**k is that, how does it work?”, and in that exact moment I started reading about those two strange things: how I could use them and why they matter for solving this problem. The first thing I noticed, and maybe the most important one, is that AWS Lambda lets you run JavaScript code, dependencies included, and that Lambda functions can be triggered by AWS SNS, a service that sends notifications to its subscribers, which can include Lambda functions.
The solution:
The first thing we did was split the code into three Lambda functions. The first one listens to an SNS topic called URL; when something is published to this topic, it crawls the url and gets the body, and once it has the body it posts to an SNS topic called CRAWLED-BODY, passing the url, the body and the host. The second Lambda function listens to the CRAWLED-BODY topic and uses the information posted there to save it in Elasticsearch (this solves the problems we had with MongoDB when querying the data to find the best results for a search, and it gives each result a score so we know which one is best); after it saves the info, it posts the message on to a new SNS topic called SAVED-INFO. The last Lambda function listens to this topic and parses the body with htmlparser2 to find the links in it, and each time it finds a link it passes this new link to the first SNS topic, which is URL.
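To make the flow easier to picture, here is a rough sketch of the three handlers. It is not the real project code: `fetchBody`, `saveDocument` and `publish` are hypothetical helpers that I pass in so the sketch runs without AWS, I assume each message is a JSON object with `url`, `body` and `host`, I assume the second function forwards that same message to SAVED-INFO, and a simple regular expression stands in for the htmlparser2 step. The only AWS detail it relies on is that an SNS-triggered Lambda receives the message text in `event.Records[i].Sns.Message`.

```javascript
// Pull the JSON payloads out of an SNS-triggered Lambda event.
function readMessages(event) {
  return event.Records.map((record) => JSON.parse(record.Sns.Message));
}

// Stand-in for the htmlparser2 step: collect the http(s) links of a page,
// resolved against the page url and with the #fragment removed.
function extractLinks(body, baseUrl) {
  const links = new Set();
  const hrefPattern = /<a\s[^>]*href=["']([^"']+)["']/gi;
  let match;
  while ((match = hrefPattern.exec(body)) !== null) {
    try {
      const resolved = new URL(match[1], baseUrl);
      resolved.hash = "";
      if (resolved.protocol === "http:" || resolved.protocol === "https:") {
        links.add(resolved.href);
      }
    } catch (err) {
      // Ignore hrefs that cannot be resolved to a URL.
    }
  }
  return [...links];
}

// fetchBody, saveDocument and publish are hypothetical, injected helpers.
function createHandlers({ fetchBody, saveDocument, publish }) {
  return {
    // 1. Listens to URL, posts to CRAWLED-BODY.
    async crawl(event) {
      for (const { url } of readMessages(event)) {
        const body = await fetchBody(url);
        await publish("CRAWLED-BODY", { url, body, host: new URL(url).host });
      }
    },
    // 2. Listens to CRAWLED-BODY, saves the page, posts to SAVED-INFO.
    async save(event) {
      for (const page of readMessages(event)) {
        await saveDocument(page);
        await publish("SAVED-INFO", page);
      }
    },
    // 3. Listens to SAVED-INFO, posts every link it finds back to URL.
    async findLinks(event) {
      for (const { url, body } of readMessages(event)) {
        for (const link of extractLinks(body, url)) {
          await publish("URL", { url: link });
        }
      }
    },
  };
}
```

In a real deployment each handler would be exported from its own Lambda function, and `publish` would wrap the SNS publish call of the AWS SDK. Because the third function feeds the first one again, the loop keeps running for as long as new links show up.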
Benefits:
1. We can let this process run as a background job on AWS, in the cloud.
2. Using Elasticsearch, we can write queries that search by word across the whole body.
3. It returns the best-matching results and assigns a score to each of them.
4. This combination of Lambda and SNS helps control the code by making it work like a queue: when it ends a task, it continues with the next one.
5. It is an easy way to check if a url exists without waiting for the promise to resolve.
6. It crawls hundreds of pages per minute.
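Points 2 and 3 can be sketched like this. It is only an illustration, not the project's actual query: I assume the saved documents have `url` and `body` fields (matching the message passed between the topics), and the helper names `buildSearchBody` and `topResults` are made up. What it does rely on is Elasticsearch's standard `match` query and the usual response shape, where every hit carries a `_score` and the stored document in `_source`.

```javascript
// Build the request body for a full-text search over the saved page bodies.
// The field name "body" is an assumption about how the pages were stored.
function buildSearchBody(words, size = 10) {
  return {
    size,
    query: { match: { body: words } },
  };
}

// Reduce a search response to url + score pairs. Elasticsearch sorts hits by
// relevance score by default, so the order of the response is kept as is.
function topResults(response) {
  return response.hits.hits.map((hit) => ({
    url: hit._source.url,
    score: hit._score,
  }));
}
```

The object returned by `buildSearchBody("some words")` would be sent as the body of a search request against the index where the pages live, and `topResults` would then be applied to the response.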
After I finished this project, I made some small changes to turn it into the start of a bigger project to spot fake news, just by searching for some keywords.
