How I Checked Hundreds of Answers for Plagiarism in Seconds

Rishang
Oct 22, 2020 · 6 min read

I recently had to grade short theory answers from a test that students took on Google Forms. Because the exam was theory-based, the checking was manual: you cannot set an answer key for free-text theory answers, so I had to read each student's answers one by one, and that work is very repetitive.

I also noticed that a lot of people had just googled the answers and hit Ctrl+C, Ctrl+V.

Let’s find those “Ctrl+C, Ctrl+V” people.

So I exported a CSV report of the Google Form responses and wrote a Python script to check for plagiarism. Here is how.


The manual checking logic.

First of all, I need to check whether a given answer has been copied from a Google search or not.

To do that, I used to copy the user’s answer and search for the same string directly on Google. If any of the search results is very similar to the user’s answer, using the exact same words, there is a high probability that the answer was copied.

That is what I have to do for every user and every answer submitted. Now imagine checking 500 or 600 users that way.


It is a boring, repetitive process, so we will try to automate the same logic using Python.

The code logic

If you want to go straight to the code and how to use it, I have linked my GitHub repo at the bottom of this article. To understand how the code works, and to learn a little about Selenium, keep reading :)

To perform the Google search, I am using a module called selenium with Python 3.

Setting up Selenium is easy: install the selenium package with pip from your terminal, and make sure you have a Chrome- or Firefox-based web driver. Here is a reference for setting up Selenium for Python.

Create a new file, let’s say “textGoogleMatch.py”, and import the Selenium webdriver.

from selenium import webdriver
# For handling selenium Exceptions.
from selenium.common import exceptions

webdriver gives us the functionality to automate browser-based tasks and scrape data from the browser’s responses.

Here is the code part…

Collecting google search results.

Create a function for collecting data from a Google search. It takes a string argument that is used as the search query.

def googleSearch(query):
    driver = webdriver.Chrome('chromedriver')

webdriver.Chrome() tells Selenium which browser web driver to use. “driver” now holds all the webdriver.Chrome functionality that we will use further on.

The rest of the code below is implemented inside this googleSearch() function.

Performing google search for a query.

    # search query
    search_engine = "https://www.google.com/search?q="
    query = query.replace(" ", "+")
    # driver.get opens the browser and loads the URL we pass as its argument
    driver.get(search_engine + query + "&start=" + "0")

driver.get() opens the browser and loads the URL we pass to it. In our case, that is the Google search query string we just built.
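Replacing spaces with "+" works for plain text, but a pasted answer may also contain characters such as "&", "?" or "#" that have a special meaning inside a URL. As a hedged alternative, here is a minimal sketch that builds the same kind of search URL with the standard library's urllib.parse.urlencode. The function name build_search_url and the constant SEARCH_ENGINE are made up for this illustration and are not part of the original script.

```python
from urllib.parse import urlencode

# Base address of the search page (same endpoint the article uses).
SEARCH_ENGINE = "https://www.google.com/search"


def build_search_url(query, start=0):
    """Return a search URL for `query` with special characters percent-encoded.

    `build_search_url` is a hypothetical helper, not part of the original script.
    """
    # urlencode turns spaces into "+" and escapes characters like "&" and "?".
    return SEARCH_ENGINE + "?" + urlencode({"q": query, "start": start})


print(build_search_url("what is an operating system?"))
```

The resulting string can be passed to driver.get() in place of the manually concatenated one.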

Next, we need to decide what to collect from the Google search results. In our case, we need the title, URL, and description of every web result on the first results page.

From the collected web results, we will compare the description string of each result to our user’s answer.

OK, clear. Let’s google something, open the browser dev-tools (Ctrl+Shift+I), and inspect the HTML elements to find the title, URL, and description of the search results, then copy the XPath of each.

To copy an XPath, select any element, right-click on it -> Copy -> Copy XPath.

[Screenshots: title element, URL element, description element]

The first result, which is from Wikipedia, has the following XPaths.

title = '//*[@id="rso"]/div[1]/div/div[1]/a/h3/span'
url = '//*[@id="rso"]/div[1]/div/div[1]/a'
description = '//*[@id="rso"]/div[1]/div/div[2]/div/span/span'

The next result, which is from Tutorialspoint in this case:

xpath_title = '//*[@id="rso"]/div[3]/div/div[1]/a/h3/span'
xpath_url = '//*[@id="rso"]/div[3]/div/div[1]/a'
xpath_description = '//*[@id="rso"]/div[3]/div/div[2]/div/span/span'

Looking carefully at the XPaths of Wikipedia and Tutorialspoint, there is only one difference:

'//*[@id="rso"]/div[1]' and '//*[@id="rso"]/div[3]'
But why "div[3]"? If the results are in series, it should be "div[2]"!
Because of the "People also ask" block, which has the XPath //*[@id="rso"]/div[2]/
[Screenshot: unwanted element]

The same goes for other kinds of results, such as maps results or YouTube video results, which we will ignore in our case through “try and except” in Python. Apart from that, the rest of the flow on the Selenium side is simple.

To collect HTML elements by XPath in Selenium, we use…

driver.find_element_by_xpath(): finds an HTML element by its XPath
.get_attribute("innerText"): collects the inner text of an element

Example:

xpath_title = '//*[@id="rso"]/div[3]/div/div[1]/a/h3/span'
title = driver.find_element_by_xpath(xpath_title)
# collecting innerText of the HTML element in variable "title"
title = title.get_attribute("innerText")

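Since in our case the search results are a series of div[ ] elements in the XPath, we will loop over range(0, 15) and plug each value into the XPath. I am taking 15 because there are generally around 10–13 results per page. XPath positions start at 1, so div[0] simply matches nothing and is skipped. The collected data is stored in a dictionary that starts out empty.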

    results = {}
    for s in range(15):
        try:
            block = {}
            xpath_title = f'//*[@id="rso"]/div[{s}]/div/div[1]/a/h3/span'

            title = driver.find_element_by_xpath(xpath_title)
            title = title.get_attribute("innerText")

            block["title"] = title
            results[f'{s}'] = block
        except exceptions.NoSuchElementException:
            # no regular web result at this position, so skip it
            continue
    return results

The loop looks up the XPath for each value in our range and stores what it finds in results. I use try and except because, if the XPath does not contain those elements, we get a NoSuchElementException; in that case we ignore it and just continue. This is useful for skipping the “people also ask”, “maps result”, and “youtube videos” elements, as they do not contain the title, link, or description XPath.

That’s it. The URL and description can be collected from their XPaths inside this loop, the same way we collect the title.
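As a rough sketch of that idea, the loop could collect all three fields as shown below. Several things here are assumptions rather than the original script: the names FIELD_SUFFIXES, field_xpaths and collect_results are made up, the URL is assumed to be the link's href attribute, and the loop starts at position 1. The sketch uses driver.find_element(By.XPATH, ...), because the find_element_by_xpath helper was removed in Selenium 4. The XPaths are the ones copied above and will stop matching whenever Google changes its page markup.

```python
# Part of each field's XPath below the result block //*[@id="rso"]/div[N]
FIELD_SUFFIXES = {
    "title": "/div/div[1]/a/h3/span",
    "url": "/div/div[1]/a",
    "description": "/div/div[2]/div/span/span",
}


def field_xpaths(position):
    """Return the title, url, and description XPaths for one result block."""
    base = f'//*[@id="rso"]/div[{position}]'
    return {name: base + suffix for name, suffix in FIELD_SUFFIXES.items()}


def collect_results(driver, max_blocks=15):
    """Collect title, url, and description from an already loaded results page."""
    # Imported here so that field_xpaths() can be used without Selenium installed.
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By

    results = {}
    for position in range(1, max_blocks + 1):  # XPath positions start at 1
        xpaths = field_xpaths(position)
        try:
            title = driver.find_element(By.XPATH, xpaths["title"])
            link = driver.find_element(By.XPATH, xpaths["url"])
            description = driver.find_element(By.XPATH, xpaths["description"])
        except NoSuchElementException:
            # "People also ask", maps, or video blocks lack these elements.
            continue
        results[str(position)] = {
            "title": title.get_attribute("innerText"),
            # Assumption: the link target is read from the href attribute.
            "url": link.get_attribute("href"),
            "description": description.get_attribute("innerText"),
        }
    return results
```

Inside googleSearch(), collect_results(driver) would be called right after driver.get(...) has loaded the results page.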

String Comparison

For comparing two strings I am using a module called difflib.

Create a function for comparing strings, which takes two string arguments.

difflib.SequenceMatcher() compares the two argument strings, and its ratio() gives the similarity as a float between 0 and 1, where 0 means no match and 1 means a full match. I multiply the ratio by 100 to get a 0–100% kind of look.

import difflib

def compareStr(str1, str2):
    return difflib.SequenceMatcher(None, str1, str2).ratio() * 100

Demo:

compareStr("hello world", "hello you")
output: 70.0
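SequenceMatcher compares characters exactly, so differences in letter case or extra spaces lower the score even when the wording is identical. A possible refinement, which is not part of the original script, is to normalize both strings before comparing them. The helpers normalize and compareNormalized below are hypothetical names used only for this sketch.

```python
import difflib


def compareStr(str1, str2):
    """Similarity of two strings as a percentage, as defined in the article."""
    return difflib.SequenceMatcher(None, str1, str2).ratio() * 100


def normalize(text):
    """Lowercase the text and collapse runs of whitespace into single spaces."""
    return " ".join(text.lower().split())


def compareNormalized(str1, str2):
    """Compare two strings after normalizing case and whitespace."""
    return compareStr(normalize(str1), normalize(str2))


raw = compareStr("Hello   World", "hello world")
cleaned = compareNormalized("Hello   World", "hello world")
# The normalized comparison scores higher because case and spacing no longer count.
print(f"raw: {raw:.1f}, normalized: {cleaned:.1f}")
```

Whether this normalization is appropriate depends on how strict the check should be; it only removes differences that a copy and paste followed by light editing tends to introduce.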

That’s it. Done.

In my case, I use those two functions in another script, where I read the CSV column that holds the answers, use googleSearch() to collect the Google search result data, and use compareStr() to compare the description data with the answers from the CSV file.

If a search result description and a user’s answer score more than 70 in compareStr(), there is a very high chance that it is a Ctrl+C, Ctrl+V guy, and they are caught. I ran the script and it did the rest of the job in a few seconds or minutes. This way I caught a lot of plagiarized answers easily :)
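The second script is not shown in this article, so the following is only a sketch of how the two functions might be wired together, under stated assumptions. The function findCopiedAnswers, the column name "answer", the sample CSV data and the stand-in fakeDescriptions are all invented for the illustration. In a real run, the description lookup would call googleSearch() instead of returning a fixed list.

```python
import csv
import difflib
import io


def compareStr(str1, str2):
    """Similarity of two strings as a percentage, as defined in the article."""
    return difflib.SequenceMatcher(None, str1, str2).ratio() * 100


def findCopiedAnswers(rows, column, fetchDescriptions, threshold=70):
    """Return (row, best_score) pairs whose answer scores above the threshold."""
    flagged = []
    for row in rows:
        answer = row[column]
        # Score the answer against every search result description.
        scores = [compareStr(answer, text) for text in fetchDescriptions(answer)]
        best = max(scores, default=0.0)
        if best > threshold:
            flagged.append((row, best))
    return flagged


# Made-up stand-in for the CSV report exported from the form.
sample_csv = io.StringIO(
    "name,answer\n"
    "A,An operating system is system software that manages computer hardware\n"
    "B,It is the thing that runs programs\n"
)


def fakeDescriptions(answer):
    """Stand-in for googleSearch(); returns fixed descriptions so the demo runs offline."""
    return ["An operating system is system software that manages computer hardware"]


for row, score in findCopiedAnswers(csv.DictReader(sample_csv), "answer", fakeDescriptions):
    print(row["name"], round(score, 1))
```

Passing the lookup in as a function keeps the comparison logic testable without opening a browser. Swapping in a function that reads the descriptions out of the googleSearch() result is then the only change needed for a real run.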

Check out my GitHub repo for it.

Here is the example demo for using the code
