Building a Web Search Engine with Django: A Comprehensive Guide
Introduction:
In today’s digital age, information is abundant and easily accessible online. However, the large volume of data can sometimes make it challenging to find specific information quickly. This is where search engines come to the rescue, helping us sift through the vast ocean of data and locate what we need with just a few keystrokes.
Have you ever wondered how these search engines work under the hood? How do they crawl the web, index web pages, and provide relevant search results? If you’ve ever been curious about building your own web search engine, you’re in the right place.
In this article, we’ll take you on a journey to create a fully functional web search engine using the power of Django, a high-level Python web framework. We’ll leverage my open-source project available on GitHub, called “search_engine_spider” as our starting point. This project provides the essential tools and infrastructure needed to crawl web pages, extract information, and store the results in a database.
Ready to take your web development skills to the next level? Our GitHub repository https://github.com/Eng-Elias/search_engine_spider offers a hands-on opportunity to explore web crawling, search engine development, and more.
Whether you’re an aspiring developer looking to dive into web crawling and search engine development or a seasoned Django enthusiast eager to expand your skill set, this guide has something for you. By the end of this article, you’ll have a solid understanding of how to build a web search engine from scratch, and you’ll be well-equipped to customize it to suit your specific needs.
Let’s embark on this exciting journey to unlock the world of web search engines with Django!
Project Prerequisites:
Before we dive into the nitty-gritty of building our web search engine with Django, let’s ensure we have all the prerequisites in place. To follow along with this tutorial, you’ll need the following:
- Python 3.x: Django is a Python web framework, so make sure you have Python 3.x installed on your system.
- Django: our web framework of choice, which will provide the structure for our project.
- BeautifulSoup: We’ll be using BeautifulSoup to parse web page content.
- Requests: This library is essential for making HTTP requests to fetch web pages.
- Database: Decide on the database you want to use. We recommend PostgreSQL if you plan to enable parallel crawling due to its support for concurrent access. SQLite is an option too, but keep in mind that it limits crawling to a sequential process.
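Assuming a standard Python setup, the prerequisites above can be installed with pip. The package names below are the usual PyPI ones; `lxml` is the parser the project's spider hands to BeautifulSoup, and `psycopg2-binary` is only needed if you choose PostgreSQL:

```shell
# Create and activate a virtual environment first (recommended).
python -m venv venv
source venv/bin/activate

# Core dependencies for the project.
pip install django requests beautifulsoup4 lxml

# Only needed if you use PostgreSQL to enable parallel crawling.
pip install psycopg2-binary
```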
What We Will Build:
In this tutorial, we’ll start with a solid foundation — the “Search Engine Spider” available on GitHub. This project provides a pre-built Django application that includes a web crawling utility, a Django management command for initiating the crawling process, and a user-friendly web interface for searching the scraped data.
We will explore how to use the included `Spider` class to crawl web pages, extract information, and store the results in a database. You'll also learn how to configure your database settings and decide whether to enable parallel crawling based on your needs.
The web interface we’ll create allows users to enter search queries and retrieve search results from the database. By the end of this tutorial, you’ll have a functioning web search engine that you can customize and expand to suit your specific requirements. Whether you’re interested in web crawling, database management, or building user interfaces with Django, this project will provide valuable insights into each of these areas.
ScrapingResult Model:
The heart of our web search engine project is the `ScrapingResult` model. This Django model defines the structure in which we store the information we gather during web crawling. Let's take a closer look at the model's code and its significance:
```python
from django.db import models

class ScrapingResult(models.Model):
    title = models.CharField(max_length=200)
    content = models.TextField()
    url = models.URLField()

    def __str__(self):
        return self.title
```
- `ScrapingResult` is a Django model, and each instance of this model represents a single result obtained from crawling a web page.
- It has three main fields:
  - `title`: a `CharField` that stores the title of the web page, typically found within the HTML `<title>` tag.
  - `content`: a `TextField` where we store the text content extracted from the web page. This field captures the textual information from the entire page.
  - `url`: a `URLField` that stores the URL of the web page we crawled.
In essence, the `ScrapingResult` model acts as our structured data store, allowing us to save the titles, content, and URLs of the web pages we've crawled. This structured storage makes it easy to manage and retrieve the information we need for search functionality and for display to users in our web interface.
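To see how this model supports search, here is an illustrative Django ORM query of the kind a search view might run inside the project (the variable names here are my own, not necessarily the repository's exact code). The `icontains` lookup performs a case-insensitive substring match:

```python
from django.db.models import Q

from scraping_results.models import ScrapingResult

# Match the query against either the title or the page content,
# case-insensitively.
query = "django"
results = ScrapingResult.objects.filter(
    Q(title__icontains=query) | Q(content__icontains=query)
)

for result in results:
    print(result.title, result.url)
```

This kind of query is what makes the flat `title`/`content`/`url` layout convenient: everything needed to rank, display, and link to a result lives in one row.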
Understanding Views, Templates, and SearchForm:
In our web search engine project built with Django, Views, Templates, and the SearchForm are used to create a seamless user experience. Let’s break down each of these components:
- Views: In Django, views are responsible for processing user requests and returning appropriate responses. In our project, we have two key views. The `search_page` view renders a search form template where users can input their queries. The `search_results` view handles the search logic, querying the `ScrapingResult` model to find matching results and rendering them for display. Additionally, this view supports AJAX-based pagination, ensuring efficient navigation through search results.
- Templates: Templates in Django are used to generate HTML dynamically. In our project, we have several templates, including `layout.html`, `search_form.html`, `search_results.html`, and `search_result_item.html`. `layout.html` serves as the base template for all pages, providing a consistent structure. `search_form.html` presents the search input form to users, while `search_results.html` displays the search results along with pagination. `search_result_item.html` is a partial template used to format individual search result items. Together, these templates create a user-friendly interface for interacting with the search engine.
- SearchForm: The `SearchForm` is a Django form class that handles user input for search queries. It is defined in the code and used in the `search_page` view. This form ensures that user input is validated, and it simplifies the process of gathering query parameters. It's a crucial component for user interaction, as it enables users to submit their search queries efficiently.
In summary, views manage the logic behind our web pages, templates provide the visual representation, and the SearchForm streamlines user input handling. Together, they form the backbone of our web search engine, delivering a smooth and intuitive search experience to users.
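As a concrete illustration, a minimal `SearchForm` could look like the following sketch. The field name `query` and the widget attributes are assumptions for illustration; check the repository's `forms.py` for the actual definition:

```python
from django import forms

class SearchForm(forms.Form):
    # Hypothetical field name; the project's form may differ.
    query = forms.CharField(
        max_length=200,
        required=True,
        widget=forms.TextInput(attrs={'placeholder': 'Search...'}),
    )
```

In the `search_page` view, such a form would typically be validated with `form.is_valid()` before reading the query string from `form.cleaned_data['query']`.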
Spider (Crawler):
Let's break down the functionality of the `Spider` class step by step, explaining each part of the code:
```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

from scraping_results.models import ScrapingResult

class Spider:
    def crawl(self, url, depth, parallel=True):
        # Fetch the page; skip it if the request fails.
        try:
            response = requests.get(url)
        except requests.RequestException:
            return

        content = BeautifulSoup(response.text, 'lxml')

        # Extract the title and the text of every tag on the page.
        try:
            title = content.find('title').text
            page_content = ''
            for tag in content.findAll():
                if hasattr(tag, 'text'):
                    page_content += tag.text.strip().replace('\n', ' ')
        except AttributeError:
            return

        # Store the result, avoiding duplicate rows for the same URL.
        ScrapingResult.objects.get_or_create(
            url=url,
            defaults={'title': title, 'content': page_content},
        )
```
1. The `crawl` method initiates the crawling process. It takes three parameters: `url`, the URL to start crawling from; `depth`, the depth of crawling, determining how many levels of links to follow; and `parallel`, an optional parameter that enables parallel crawling.
2. Inside the method, it starts by making an HTTP GET request to the provided URL using the `requests` library. If there's an issue with the request, it returns early.
3. It then parses the HTML content of the web page using BeautifulSoup and stores it in the `content` variable.
4. The code extracts the title and textual content from the web page. It looks for the `<title>` tag to get the title and iterates through all tags on the page to extract and concatenate their text content.
5. The extracted `title` and `page_content` are used to create a new `ScrapingResult` instance, which is saved to the database using `get_or_create`. Note that `get_or_create` does not update existing rows: if the URL is already in the database, the existing record is left unchanged, which prevents duplicates. If you want re-crawls to refresh stored content, use `update_or_create` instead.
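To make the extraction step above concrete without any third-party dependencies, here is a small stand-in using only the standard library's `html.parser`. The real project uses BeautifulSoup; this sketch just illustrates the same idea of pulling out the `<title>` and the visible text:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the <title> text and all visible text content of a page."""

    def __init__(self):
        super().__init__()
        self.title = ''
        self.text_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        stripped = data.strip()
        if not stripped:
            return
        if self._in_title:
            self.title += stripped
        self.text_parts.append(stripped)

parser = TextExtractor()
parser.feed('<html><head><title>Hi</title></head>'
            '<body><p>Hello world</p></body></html>')
print(parser.title)                  # Hi
print(' '.join(parser.text_parts))   # Hi Hello world
```

BeautifulSoup does this (and far more, such as tolerating malformed markup) in two lines, which is why the project depends on it, but the underlying mechanism is the same event-driven parse.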
```python
        # The crawl method continues: follow the links found on the page.
        # Stop once the requested depth is exhausted.
        if depth == 0:
            return

        links = content.findAll('a')

        def crawl_link(link):
            try:
                href = link['href']
                if href.startswith('http'):
                    # Absolute link: crawl it directly.
                    self.crawl(href, depth - 1, parallel)
                else:
                    # Relative link: prefix it with the current page's
                    # scheme and domain to build an absolute URL.
                    parsed_url = urlparse(url)
                    protocol = parsed_url.scheme
                    domain = parsed_url.netloc
                    self.crawl(f'{protocol}://{domain}{href}', depth - 1, parallel)
            except KeyError:
                # <a> tag without an href attribute.
                pass

        if parallel:
            with ThreadPoolExecutor(max_workers=10) as executor:
                executor.map(crawl_link, links)
        else:
            for link in links:
                crawl_link(link)
```
- Next, the code checks whether the specified `depth` has been reached (depth equals 0). If so, it returns, effectively limiting the depth of the crawling process.
- It then extracts all the links (`<a>` tags) from the current web page and stores them in the `links` variable.
- The `crawl_link` function is defined to crawl individual links. It extracts the `href` attribute from the link, and if it starts with "http", it recursively calls the `crawl` method for that URL with a reduced depth. If the link is relative, it constructs an absolute URL using the current page's protocol and domain.
- Depending on the `parallel` flag, the code processes the links either in parallel using a thread pool or sequentially.
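The link-resolution logic in `crawl_link` can be isolated into a tiny helper to see exactly what it does (the function name `resolve_link` is mine, not the project's). Note that this scheme only handles root-relative paths like `/about`; the standard library's `urljoin` also resolves paths relative to the current page:

```python
from urllib.parse import urlparse, urljoin

def resolve_link(base_url, href):
    # Mirrors the spider's logic: absolute links pass through unchanged,
    # relative links are prefixed with the base page's scheme and domain.
    if href.startswith('http'):
        return href
    parsed = urlparse(base_url)
    return f'{parsed.scheme}://{parsed.netloc}{href}'

print(resolve_link('https://example.com/docs/index.html', '/about'))
# https://example.com/about

# urljoin handles the general case, including page-relative paths:
print(urljoin('https://example.com/docs/index.html', 'page2.html'))
# https://example.com/docs/page2.html
```

Swapping the manual concatenation for `urljoin` would let the spider follow links like `page2.html` that the current code turns into broken URLs.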
In summary, the Spider
class crawl
method retrieves web pages, extracts their title and content, and stores the results in the ScrapingResult
model. It then follows links to other web pages, either in parallel or sequentially, based on the specified depth. This recursive crawling process allows the spider to traverse multiple levels of web pages, collecting valuable data for our search engine.
Understanding the “crawl” Management Command:
In our Django web search engine project, we’ve implemented a custom management command named “crawl.” This command allows users to initiate the web crawling process with specific parameters. Let’s delve into the code, how to use the command, and its significance:
```python
from django.core.management.base import BaseCommand
from scraping_results.spiders.general_spider import Spider

class Command(BaseCommand):
    help = 'Crawl a URL using the Spider class'

    def add_arguments(self, parser):
        parser.add_argument('url', help='The URL to start crawling from')
        parser.add_argument('depth', type=int, help='The depth of crawling')
        parser.add_argument('--parallel', action='store_true',
                            help='Enable parallel crawling')

    def handle(self, *args, **options):
        url = options['url']
        depth = options['depth']
        parallel = options['parallel']
        spider = Spider()
        spider.crawl(url, depth, parallel=parallel)
```
- The "crawl" management command is implemented as a Django management command class. It extends `BaseCommand` and has a `help` attribute that provides a description of what the command does.
- The `add_arguments` method allows users to pass arguments and options when invoking the command. It defines three parameters:
  - `url`: the URL from which to start crawling.
  - `depth`: the depth of crawling, specifying how many levels of links to follow.
  - `--parallel`: an optional flag that enables parallel crawling.
- The `handle` method implements the command logic. It retrieves the values passed as arguments and options, namely the `url`, `depth`, and `parallel` flag.
- An instance of the `Spider` class, which is responsible for the actual crawling process, is created, and its `crawl` method is called with the provided parameters.
Using the “crawl” Management Command:
To use the “crawl” management command, you can run it from the command line as follows:
```shell
python manage.py crawl <url> <depth> [--parallel]
```
- `<url>`: replace this with the URL you want to start crawling from.
- `<depth>`: specify the depth of crawling, indicating how many levels of links to follow.
- `--parallel` (optional): include this flag if you want to enable parallel crawling. Note that parallel crawling only works with databases that support concurrent connections, such as PostgreSQL, and does not work with SQLite.
For example, you can initiate a crawl with the following command:
```shell
python manage.py crawl http://example.com 2 --parallel
```
This command starts the crawling process from "http://example.com" with a depth of 2, and because `--parallel` is included, parallel crawling is enabled for more efficient data retrieval.
In summary, the “crawl” management command is a user-friendly way to trigger web crawling in our search engine project. It allows users to specify the starting URL, depth of crawling, and whether to use parallel crawling using command line, providing flexibility and control over the crawling process.
Conclusion:
In the ever-expanding digital landscape, the ability to harness the vast web of information is an invaluable skill. Our “Search Engine Spider” offers you a powerful toolkit to dive into the world of web crawling and search engine development with Django. As you’ve seen in this comprehensive guide, the project comes packed with features, including a robust web crawling utility, a Django management command for easy initiation, and a user-friendly web interface for seamless searches.
But this journey doesn’t end here; it’s just the beginning. We invite you to explore, experiment, and, most importantly, contribute to this open-source project. If you like this kind of project and content, please support us by starring the repository and sharing the article.
Star, fork and contribute to:
https://github.com/Eng-Elias/search_engine_spider
Whether you’re a seasoned developer looking to enhance your skills, a web enthusiast with a passion for data exploration, or simply curious about the inner workings of search engines, your contributions are invaluable. You can add new features, improve existing ones, or help us refine our documentation to make the project more accessible to everyone.
Join us on this exciting quest to build and expand our web search engine with Django. By working together, we can unlock new possibilities in web crawling and search technology, making the digital world more accessible and manageable for everyone. So, star our repository, get involved, and let’s shape the future of web search engines together!