How SearchNEU Works

Da-Jin Chu
Published in sandboxnu
Feb 11, 2020 · 7 min read

Throughout the years, thousands of Northeastern students, advisors, and professors have used searchneu.com to find information on classes. Very few of them know what really happens when they type into the search bar. The software powering SearchNEU has changed rapidly since Sandbox took over development, and we aim to give a glimpse into how this tool and many other websites actually work. We’ll cover exactly how search results end up on your screen: from the frontend shown in your browser, to the webservers serving course information, to the databases and web scrapers running behind the scenes.

An oversimplification

Frontend: The Face of SearchNEU

The first thing that happens when you visit searchneu.com is something you’re already very familiar with — a nice, pretty user interface pops up in your browser. This is written with React, a framework for creating complex, interactive websites that renders into HTML and JavaScript, the languages of the web. React draws the search bar, course data, and everything else on the site. It figures out what to do when you type into the search bar, then retrieves and renders the resulting course data.

You may, however, be wondering how exactly this frontend code got into your browser. This leads us to the webserver.

Webserver: The Heavy Lifting

The job of the webserver is to serve our frontend and data. It’s a JavaScript program written with the Express framework running on a computer somewhere in an Amazon Web Services (AWS) data center. This program listens 24/7 for incoming HTTP requests and responds accordingly. When you visit searchneu.com, your browser makes a request to the webserver, which sees the request and sends a response with all of our React frontend code, which then renders in your browser.

In addition to serving the frontend, the webserver is also responsible for serving the course data. That is, when you search for programming, the frontend code running in your browser sends a request to the webserver asking for courses associated with programming. The webserver then sends back a response with a list of courses that mention programming in their titles or descriptions, ranked by relevance. When your browser receives the response, the frontend reads that data and draws all the courses accordingly.

You can see this raw data for yourself by visiting https://searchneu.com/search?query=programming&termId=202030&minIndex=0&maxIndex=5

It looks pretty ugly, right? Luckily, all that matters is that our webserver knows how to write this data and our frontend knows how to read and interpret it. In other words, since our frontend and webserver speak the same language, we’re in good shape. This method of exchanging data is called an Application Programming Interface, or API. Just as a user interface defines the interaction between a user and a program, the Application Programming Interface defines the interaction between two programs: the frontend and the webserver.
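In code, "speaking the same language" just means the frontend knows which fields to reach for. The response shape below (a `results` array of `class` objects) is a simplified, hypothetical version of the real payload, used only to illustrate the idea.

```javascript
// The frontend only needs to know how to read the agreed-upon shape.
function extractCourseTitles(apiResponse) {
  return apiResponse.results.map((result) => result.class.name);
}

// An example payload, roughly as the webserver might send it:
const response = {
  results: [
    { class: { subject: "CS", classId: "2500", name: "Fundamentals of Computer Science 1" } },
    { class: { subject: "CS", classId: "3500", name: "Object-Oriented Design" } },
  ],
};

extractCourseTitles(response);
// ["Fundamentals of Computer Science 1", "Object-Oriented Design"]
```

If the webserver renamed `class` to `course` without telling the frontend, this function would break — which is exactly why the API shape is treated as a contract.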

If you’ve taken Fundamentals of Computer Science 1 at Northeastern, you probably remember losing points for messing up a function signature. In Fundamentals 1, signatures were a contract that allowed functions to talk to one another and send data around. If you didn’t follow a signature, your code eventually blew up. In the exact same way, the API defines a contract between our frontend and webserver. If the webserver one day sends data that does not conform to the API, the frontend is sure to crash!

Next up, we dig another layer deeper and talk about databases, the underlying storage mechanism for course data.

Database: The Friend that Never Forgets

SearchNEU uses a PostgreSQL database to store all the course data that is sent out by the webserver. Some of you might be wondering what the point of a database is. After all, we could just store the data in a file, right? And you would be right! For many years, SearchNEU did exactly that. Course data was stored in files on the computer running the webserver. When the server started up, it would read that course data from the file, and instantiate some global variables holding all that data. When it received requests, it would simply look through the global variables for the course data it needed.
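The old approach looked something like the sketch below: load a JSON file into a global at startup, then answer every search with a linear scan. Field names here are illustrative, not the actual data format.

```javascript
// At startup, the old webserver read course data from a file into a
// global, roughly like:
//   const allCourses = JSON.parse(fs.readFileSync("courses.json", "utf8"));
// Inlined sample data stands in for the file here:
const allCourses = [
  { name: "Intro to Programming with Data", desc: "Programming in Python." },
  { name: "Organic Chemistry 1", desc: "Structure and reactivity." },
];

function searchCourses(query) {
  const q = query.toLowerCase();
  // Every request scans every course. Fine for a few thousand courses,
  // but too slow for filters, fuzzy matching, and relevance scoring.
  return allCourses.filter(
    (course) =>
      course.name.toLowerCase().includes(q) ||
      course.desc.toLowerCase().includes(q)
  );
}
```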

Though the simplicity of the system allowed the project to develop quickly and acquire its first few thousand users, there were a number of ways that it began to cause issues:

  • It was tricky to update the system with fresh data. Every night, a script would scrape new data from Northeastern, copy files over to the AWS server, and restart the webserver. This would shut down SearchNEU for five seconds every night.
  • Interacting with the file system made it tough to write good tests. Tests that modify the file system depend on the environment they run in and can fail from an unexpected environment configuration. These difficulties caused us to write fewer tests as a whole, which reduced our confidence in the system.
  • It prevented us from adding much-requested features like filters, autocomplete, did-you-mean, and fuzzy search. Iterating through ~5000 courses several times to run queries with filters and relevance-score calculations would be too slow to support these advanced search features.

With this in mind, we got a database. PostgreSQL is a popular relational database system. Relational databases operate in terms of tables and rows, just like an Excel spreadsheet. SearchNEU has tables for courses, sections, and professors. Each row in the courses table contains information about a specific course, just as each row in the sections table describes a specific section. What makes the sections table unique is its foreign key, a reference back to the course the section belongs to. Using this foreign key, Postgres is able to give us all the sections of a given course in less than a millisecond. Further articles on the design decisions made with Postgres, as well as the story of rebuilding much of the SearchNEU backend, are to come.
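A plain-JavaScript picture of the foreign key: each section row carries a course ID pointing back at its course, and an index on that column is roughly what lets Postgres answer "give me all sections of course X" instantly. Table, column, and field names here are illustrative, not our real schema.

```javascript
// Conceptually what Postgres resolves with a query like:
//   SELECT * FROM sections WHERE course_id = $1;
// (illustrative schema, not SearchNEU's actual one)
const courses = [
  { id: 1, subject: "CS", classId: "2500" },
  { id: 2, subject: "CS", classId: "3500" },
];
const sections = [
  { crn: "11111", courseId: 1, meetingTime: "MWF 9:15" },
  { crn: "11112", courseId: 1, meetingTime: "TF 13:35" },
  { crn: "22221", courseId: 2, meetingTime: "MW 14:50" },
];

// Index sections by their foreign key, much like a database index
// on the course_id column.
const sectionsByCourse = new Map();
for (const section of sections) {
  if (!sectionsByCourse.has(section.courseId)) {
    sectionsByCourse.set(section.courseId, []);
  }
  sectionsByCourse.get(section.courseId).push(section);
}

sectionsByCourse.get(1); // both sections of CS 2500
```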

With Postgres integrated, we are able to update the search engine with new course data by simply sending SQL queries to Postgres. The webserver does not need to restart, or even know about the update at all. The next time it receives a request for course data, it simply asks Postgres, which seamlessly sends over the new data. Additionally, since databases are widely supported in testing frameworks, we were able to write robust tests that properly interact with the database.
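The nightly update can then be a handful of SQL statements. The sketch below uses Postgres's "upsert" form to insert-or-update a course row; the table and column names are invented for illustration, and `pool` would be a database connection pool in real code.

```javascript
// Illustrative upsert: insert the course, or update it in place if a
// row with the same (subject, class_id) already exists. The webserver
// never restarts and never notices -- the next query just sees new data.
const upsertCourse = `
  INSERT INTO courses (subject, class_id, name, description)
  VALUES ($1, $2, $3, $4)
  ON CONFLICT (subject, class_id)
  DO UPDATE SET name = EXCLUDED.name, description = EXCLUDED.description;
`;

// In the real code, something like:
// await pool.query(upsertCourse, ["CS", "2500", "Fundamentals 1", "Intro course"]);
```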

Last of all, there were a host of new features we wanted to add, like did-you-mean and fuzzy search. Unfortunately, Postgres alone would not be enough. For the advanced search functionality that is still under development, we’d have to turn to Elasticsearch.

Elasticsearch: You Know, For Search

Where Postgres is built for storing and querying data, Elasticsearch is not meant for data storage. Instead, Elasticsearch specializes in analyzing and optimizing data to power rich search features. It’s what lets you search for anatmy and get results for PT 5131: Gross Anatomy as well as find CS 4700: Network Fundamentals just by typing fundam.

Before we do any searches, we send a slice of the data stored in Postgres to Elasticsearch, which runs analysis ahead of time, letting us perform operations that would normally be extremely computationally expensive and still return results immediately. In the example of searching for anatmy, we would normally have to look through 32,000 course titles and measure how different every word in every title is from anatmy to decide whether it could be a misspelling. With Elasticsearch, we can account for misspellings, find classes from just the first few letters, and search by course title, subject code, course number, and more. Not only that, we get a result back within tens of milliseconds.
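A query along these lines is what gets sent to Elasticsearch. `multi_match` with `fuzziness` is a standard Elasticsearch query form; the field names and boosts below are illustrative, not our exact index mapping.

```javascript
// Build an Elasticsearch query body for a user's search.
function buildSearchBody(userQuery) {
  return {
    query: {
      multi_match: {
        query: userQuery,
        // Boost title matches above description matches (the ^2).
        fields: ["class.name^2", "class.subject", "class.classId", "class.desc"],
        // "AUTO" lets Elasticsearch pick an edit distance based on term
        // length, so "anatmy" can still match "anatomy".
        fuzziness: "AUTO",
      },
    },
  };
}

buildSearchBody("anatmy");
```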

From Elasticsearch, we get the list of course IDs that match our relevance criteria, and then go to Postgres for the detailed data and associated sections to send back to the user. This process is often called hydration.
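Hydration boils down to a small step: Elasticsearch hands back relevance-ordered IDs, we fetch the full rows, and we keep the order Elasticsearch chose. In this sketch, `fetchCoursesByIds` stands in for the real Postgres query.

```javascript
// rankedIds: course IDs from Elasticsearch, best match first.
// fetchCoursesByIds: looks up full course rows (a Postgres query in
// the real code).
function hydrate(rankedIds, fetchCoursesByIds) {
  const rows = fetchCoursesByIds(rankedIds);
  // A database may return rows in any order, so restore the ranking.
  const byId = new Map(rows.map((row) => [row.id, row]));
  return rankedIds.map((id) => byId.get(id));
}
```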

One major question remains to be answered. Where does all this data come from?

Scrapers: The Unsung Hero

The final piece of the puzzle is the scrapers. As much as we wish it did, Northeastern does not provide us any special access to course data. Instead, SearchNEU gets course data the same way any student would: by visiting the Northeastern website. Every night, SearchNEU runs a set of web scrapers that act the same way your browser does, visiting each and every one of the publicly available web pages with Northeastern course information. The difference is that instead of rendering the pages on a screen, the scrapers skim through them, searching for data in the exact spots they’ve been programmed to look. After sending tens of thousands of requests, the scrapers have gathered hundreds of megabytes of data, which is then sent to Postgres and Elasticsearch.
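"Searching in the exact spot it was programmed to look" can be as simple as the sketch below: after fetching a page, pull the value out of the element it is known to live in. The HTML structure and pattern here are invented for illustration — the real scrapers target Northeastern's actual markup.

```javascript
// Extract a course title from raw HTML, assuming a hypothetical
// <h1 class="course-title"> element.
function extractCourseTitle(html) {
  const match = html.match(/<h1 class="course-title">([^<]+)<\/h1>/);
  return match ? match[1].trim() : null;
}

const page =
  '<html><body><h1 class="course-title"> CS 2500 - Fundamentals 1 </h1></body></html>';
extractCourseTitle(page); // "CS 2500 - Fundamentals 1"
```

The fragility the article alludes to lives right here: if Northeastern ever changes its markup, functions like this stop finding anything, and the scraper has to be updated to match.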

With that, the data is ready in the databases, the webserver is dutifully hosting the frontend and responding to queries for data, and the frontend is helping students on hundreds of laptops, computers, and phones.

Not Included…

There were a number of topics we did not cover or glossed over. These include our infrastructure, continuous integration systems, and analytics tools.

Still simplified… but better

The infrastructure includes a number of AWS services that provide the physical servers hosting our software, as well as other services for DNS and DDoS prevention. The SearchNEU continuous integration pipeline also plays a large part in our ability to roll out updates rapidly and seamlessly. Lastly, analytics from a number of sources help us make informed design choices and alert us when things go awry.

Special Thanks

The software powering SearchNEU has been built by many students throughout several years. It is through their hard work that Sandbox was able to inherit such an impactful and deeply educational project.

They are: Ryan Hughes, who started the project and committed over a million lines of code over three years, as well as Edward Shen, Jennings Zhang, and Sean Hughes.

The Sandbox team working on SearchNEU at time of writing includes Mitch Gamburg, Eddy Li, Daniel Wang, Amiel Monasterial, Rebekah Johnson, Eliza Huang, and myself, Da-Jin Chu.
