Content Quality Monitoring Automation in Shiksha

Shiksha Engineering
Aug 26, 2020

Author : Vidushi Bhadola

Source : http://www.innovins.com/

The Shiksha website is a repository of reliable and authentic information on 50,000+ colleges, 3 lakh+ courses and thousands of articles. Students can access comprehensive information on colleges, courses, scholarships and admission notifications, ask questions and get answers from users and experts, and view recommendations for colleges and courses.

Quality Content

For digital content, quality means information that is accurate, credible, useful, informative, comprehensible and engaging. It is the combination of these factors that makes digital content good: the content should be accurate, its sources should be credible, and it should be presented in a comprehensible manner so that the user finds it engaging.

On a website like Shiksha, the content on each web page should be displayed in a way that satisfies the above conditions in order to qualify as good quality content.

Need for Content Quality Monitoring

Shiksha is a content-driven website that provides extensive information about colleges, courses, exams and scholarships, along with thousands of articles, for institutes both in India and abroad. Shiksha is therefore expected to provide high quality content to its user base, which covers both the credibility of the content and, more importantly, the way it is presented to the user.

With the growing amount of content being published on Shiksha every day, it became imperative to find a way to monitor the quality of the content being posted on the site. Content Quality Monitoring thus became essential for ensuring a good user experience, which primarily depends on the user's interaction with the content on the website.

How was this achieved?

The Shiksha QA Team, along with the Shiksha Content Management Team, came up with a project to automate the monitoring of the quality of content published on the Shiksha website. The project, titled Content Quality Monitoring Automation, uses a framework written in Java to automate the task. The framework checks all the website pages stored in a database for inconsistencies in the content and logs them in another database, from where the results are picked up and presented in a dashboard in a comprehensible format. Every fortnight, an email summarizing the data is also sent to the concerned stakeholders.

Overview of the project -

The Shiksha Content Management team provided the QA team with a set of conditions identifying the elements of a web page that lead to a bad user experience. Some of these conditions were -

  1. Incorrect table template, e.g. tables with more than 3 columns or tables without a header
  2. More than 7 H2 tags on the page
  3. Iframe embedded on the page returning a 404
  4. Last updated date of the content is more than 2 years old
  5. Page consists of wiki content without infographics
  6. Too much whitespace in the text, e.g. more than 3 consecutive line breaks or more than 10 consecutive spaces

Each page on the website was checked against these conditions, and the pages whose content violated any of them were added to another database for display in the dashboard.
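To make a few of these conditions concrete, here is a minimal sketch of how three of them (H2 count, consecutive spaces and stale content) might be expressed in Java. It is not the project's actual code: the class name, method names and constants are hypothetical, and plain regular expressions stand in for the Jsoup parsing the project used, only so that the example has no external dependencies.

```java
import java.time.LocalDate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper illustrating three of the conditions listed above. */
final class ContentRules {
    // Thresholds come from the rule list; the constant names are illustrative.
    static final int MAX_H2_TAGS = 7;
    static final int MAX_CONSECUTIVE_SPACES = 10;
    static final int MAX_CONTENT_AGE_YEARS = 2;

    // Matches an opening <h2> tag (with or without attributes), not </h2>.
    private static final Pattern H2_OPEN =
            Pattern.compile("<h2[\\s>]", Pattern.CASE_INSENSITIVE);

    // Matches a run of more than MAX_CONSECUTIVE_SPACES spaces.
    private static final Pattern SPACE_RUN =
            Pattern.compile(" {" + (MAX_CONSECUTIVE_SPACES + 1) + ",}");

    /** Counts the opening H2 tags in the raw HTML string. */
    static int countH2(String html) {
        Matcher m = H2_OPEN.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    /** Rule 2: more than 7 H2 tags on the page. */
    static boolean hasTooManyH2(String html) {
        return countH2(html) > MAX_H2_TAGS;
    }

    /** Part of rule 6: more than 10 consecutive spaces in the text. */
    static boolean hasExcessiveSpaces(String text) {
        return SPACE_RUN.matcher(text).find();
    }

    /** Rule 4: content last updated more than 2 years before 'today'. */
    static boolean isStale(LocalDate lastUpdated, LocalDate today) {
        return lastUpdated.isBefore(today.minusYears(MAX_CONTENT_AGE_YEARS));
    }
}
```

Passing `today` in as a parameter, instead of calling `LocalDate.now()` inside the method, keeps the staleness check deterministic and easy to test.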

Technology Stack used -

  1. Front-end Technology : HTML, CSS, JavaScript
  2. Programming Language and APIs : Java (JDBC, Servlet, JSP, Java EE)
  3. Database : MySQL

Steps involved -

To achieve the required goal, the following steps were followed -

  1. The project began with creating a new Maven project in the Eclipse IDE and adding all the necessary dependencies. The framework was then designed with code performance and reusability in mind.
  2. The content was extracted from the database containing all the pages whose content was to be checked. The database connection and data extraction were handled with JDBC (Java Database Connectivity).
  3. The content was then parsed from strings into HTML documents using Jsoup, which made it easy to extract whichever HTML tags each condition required.
  4. All the conditions were applied to the parsed content in Java, and the automation script was run.
  5. A recurring email was sent to the stakeholders via SMTP.
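The core of step 4 can be pictured as a loop that applies every rule to every page and records a row for each violation. The sketch below illustrates that shape only, under several assumptions: the pages are held in an in-memory list instead of being read through JDBC, the rules work on the raw HTML string with regular expressions instead of a Jsoup document, and all class, field and rule names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Predicate;
import java.util.regex.Pattern;

/** Hypothetical sketch of one quality-check run over a set of pages. */
final class QualityCheckRun {

    /** A page to check; in the real project these rows came from a database. */
    static final class Page {
        final String url;
        final String pageType;
        final String html;

        Page(String url, String pageType, String html) {
            this.url = url;
            this.pageType = pageType;
            this.html = html;
        }
    }

    /** One detected issue; in the real project this was written to a result database. */
    static final class Violation {
        final String url;
        final String pageType;
        final String rule;

        Violation(String url, String pageType, String rule) {
            this.url = url;
            this.pageType = pageType;
            this.rule = rule;
        }
    }

    // More than 3 consecutive <br> tags, optionally separated by whitespace.
    private static final Pattern BR_RUN =
            Pattern.compile("(?:<br\\s*/?>\\s*){4,}", Pattern.CASE_INSENSITIVE);

    /** Two illustrative rules, keyed by the name shown for the violation. */
    static Map<String, Predicate<String>> sampleRules() {
        Map<String, Predicate<String>> rules = new LinkedHashMap<>();
        rules.put("Too many line breaks", html -> BR_RUN.matcher(html).find());
        rules.put("Table without header", html -> {
            // Crude check: any "<th" (including "<thead") counts as a header.
            String lower = html.toLowerCase(Locale.ROOT);
            return lower.contains("<table") && !lower.contains("<th");
        });
        return rules;
    }

    /** Applies every rule to every page and collects the violations in order. */
    static List<Violation> run(List<Page> pages, Map<String, Predicate<String>> rules) {
        List<Violation> violations = new ArrayList<>();
        for (Page page : pages) {
            for (Map.Entry<String, Predicate<String>> rule : rules.entrySet()) {
                if (rule.getValue().test(page.html)) {
                    violations.add(new Violation(page.url, page.pageType, rule.getKey()));
                }
            }
        }
        return violations;
    }
}
```

Keeping each rule as a named predicate means a new condition from the content team can be added by registering one more entry, without touching the loop.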

Data Quality Dashboard

A Data Quality dashboard was developed to showcase the results of the above steps in a comprehensible manner. The dashboard provides different ways to sort the data along with hyperlinks to every page where an inconsistency was detected.

The dashboard provides two views -

  1. Rule View: Under Rule View, the pages are sorted according to the conditions described above.

Fig : Rule View

  2. Page View: Under Page View, the data is sorted according to page type, such as Exam page, Article page etc.

Fig : Page View

The dashboard is hosted on a Tomcat 8.5 server. Building it involved installing the Tomcat server in Eclipse Enterprise Edition and then creating a new Dynamic Web Project from scratch. As the work progressed, we defined our servlets in the web.xml file, which maps each servlet to the URL pattern it handles.

The result database, which was populated while the content quality check program ran, was accessed through JDBC, and the results were summarized by rule or by page type depending on which view (Rule View or Page View) was selected.

Each row in the summary page is clickable and takes the user to a page with a detailed view for that row, along with the URLs of the pages that have the issues, as can be seen in the image below.
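The two summaries amount to counting result rows grouped by a different key. As a rough illustration, assuming the rows have already been loaded from the result database into memory, the grouping could look like the sketch below. The class and field names are hypothetical, and the same counts could equally be produced in SQL with a GROUP BY query.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical sketch of the Rule View and Page View summaries. */
final class DashboardSummary {

    /** One row of the result database: a page URL, the rule it broke, and its page type. */
    static final class ResultRow {
        final String url;
        final String rule;
        final String pageType;

        ResultRow(String url, String rule, String pageType) {
            this.url = url;
            this.rule = rule;
            this.pageType = pageType;
        }
    }

    /** Counts rows per key; a TreeMap keeps the summary sorted by key. */
    static Map<String, Long> countBy(List<ResultRow> rows, Function<ResultRow, String> key) {
        return rows.stream()
                .collect(Collectors.groupingBy(key, TreeMap::new, Collectors.counting()));
    }

    /** Rule View: number of issues per rule. */
    static Map<String, Long> ruleView(List<ResultRow> rows) {
        return countBy(rows, row -> row.rule);
    }

    /** Page View: number of issues per page type. */
    static Map<String, Long> pageView(List<ResultRow> rows) {
        return countBy(rows, row -> row.pageType);
    }
}
```

Because both views share `countBy`, switching between them only changes the key used for grouping.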

Fig : Detailed view of each row in Page View or Rule View

Hyperlinks were added to different sections of the pages to provide navigation. The title of the above page links to the homepage, and the row titles link to the corresponding web pages.

AJAX calls were used to connect the JavaScript on the front end with the servlets. They were necessary because the output of one servlet had to be rendered within a page served by another without reloading the whole page. With AJAX, JavaScript can send a request to a particular URL, and the servlet mapped to that URL handles it. The servlet returns the required data in the body of its response to the corresponding JavaScript call, and the data can then be displayed on the front end at run time.

Benefits reaped from this project -

  1. The content management team at Shiksha receives an email update at regular intervals, which helps them keep track of the pages that need to be fixed and updated.
  2. The project has improved content quality and readability on various pages of the Shiksha website, resulting in a better user experience.
  3. The results of the project are also used by various stakeholders to monitor the quality of content being published on the website.
  4. The project has also helped in laying down guidelines for the type of content to be published on the site in the future.
