Google Summer of Code: OpenMF Week 4

Swapnal Shahil
SCoRe Lab
Published in
4 min readJul 4, 2021
Google Summer of Code with SCoRe Lab: Week 4

Hey! I am back with another update on Google Summer of Code. In this blog, I am sharing my work which I did in the fourth week of the coding period. In case if you haven’t seen my proposal till now 🤔🤨 here is a link for that — My-GSoC-Proposal.

This week I worked on keyword search to get all the cases that have the searched keyword. Basically main purpose for this was to find a word from the database if the admin wants to search while analyzing the mobile forensic case. For this, I made two APIs that have some similar kind of function but have different use in their own place. One API search the keyword from the whole database so that analyst can relate one case with other cases and another API works within the case to get the files that have that keyword.

Is this a string matching problem? Yes, more or less I have to implement the same thing. Given are some of the algorithms which I read before implementing.

  1. Rabin Karp algorithm → Time complexity: average Θ(n + m), worst Θ((n−m)m) and Space O(1)
  2. KMP algorithm → Time complexity: Θ(n) and Space Θ(m)
  3. Boyer Moore string search algorithm → Time complexity: best Ω(n/m), worst O(mn) and Space Θ(k)
  4. Two-way string matching algorithm → Time complexity: O(n+m) and Space O(1)

where n → length of the searchable text and m → length of the pattern

But later I found from Python 3.10, the inbuilt algorithm was updated to use an enhanced version of Crochemore and Perrin’s Two-Way string matching algorithm which is the optimal and best way to do it. So I implemented my functions accordingly.

Keyword search from the whole database
Keyword search from a particular case

Crochemore and Perrin’s Two-Way string matching algorithm 🤔

In Crochemore and Perrin’s Two-Way string matching algorithm, there is a smaller string (the “pattern” or “needle”) that is searched in a longer string (the “text” or “haystack”), determining whether the needle is a substring of the haystack, and if so, at what index(es). This algorithm runs in O(len(needle) + len(haystack)) time and with O(1) space. Here is the sample code in C.

/* Computing of the maximal suffix for <= */
int maxSuf(char *x, int m, int *p) {
int ms, j, k;
char a, b;

ms = -1;
j = 0;
k = *p = 1;
while (j + k < m) {
a = x[j + k];
b = x[ms + k];
if (a < b) {
j += k;
k = 1;
*p = j - ms;
}
else
if (a == b)
if (k != *p)
++k;
else {
j += *p;
k = 1;
}
else { /* a > b */
ms = j;
j = ms + 1;
k = *p = 1;
}
}
return(ms);
}

/* Computing of the maximal suffix for >= */
int maxSufTilde(char *x, int m, int *p) {
int ms, j, k;
char a, b;

ms = -1;
j = 0;
k = *p = 1;
while (j + k < m) {
a = x[j + k];
b = x[ms + k];
if (a > b) {
j += k;
k = 1;
*p = j - ms;
}
else
if (a == b)
if (k != *p)
++k;
else {
j += *p;
k = 1;
}
else { /* a < b */
ms = j;
j = ms + 1;
k = *p = 1;
}
}
return(ms);
}


/* Two Way string matching algorithm. */
void TW(char *x, int m, char *y, int n) {
int i, j, ell, memory, p, per, q;

/* Preprocessing */
i = maxSuf(x, m, &p);
j = maxSufTilde(x, m, &q);
if (i > j) {
ell = i;
per = p;
}
else {
ell = j;
per = q;
}

/* Searching */
if (memcmp(x, x + per, ell + 1) == 0) {
j = 0;
memory = -1;
while (j <= n - m) {
i = MAX(ell, memory) + 1;
while (i < m && x[i] == y[i + j])
++i;
if (i >= m) {
i = ell;
while (i > memory && x[i] == y[i + j])
--i;
if (i <= memory)
OUTPUT(j);
j += per;
memory = m - per - 1;
}
else {
j += (i - ell);
memory = -1;
}
}
}
else {
per = MAX(ell + 1, m - ell - 1) + 1;
j = 0;
while (j <= n - m) {
i = ell + 1;
while (i < m && x[i] == y[i + j])
++i;
if (i >= m) {
i = ell;
while (i >= 0 && x[i] == y[i + j])
--i;
if (i < 0)
OUTPUT(j);
j += per;
}
else
j += (i - ell);
}
}
}
/*If you want to read more about this algorithm check out this link here.*/

In the coming week, I will be working more on the analytics part of this project and try to complete everything as per my timeline. I will keep updating you about my work through these blogs. So keep reading, sharing, and supporting. 😊

Till then if you haven’t read my previous blog here is a link for that: Google Summer of Code: OpenMF Week 3. 😊😀

See you soon!!😊

Github: swapnalshahil 👈

LinkedIn: swapnalshahil 🤘

Instagram: eulersgamma 😃

--

--

Swapnal Shahil
SCoRe Lab

GSoC’22 Mentor || GSoC’21 @SCoReLab || B.Tech at IIT Guwahati'23