The Need For A FOSS Academic Search Engine That Does Not Track You

Tanveer Salim
Nov 2 · 10 min read
Big Brother Watches You As You Struggle With Your Thesis (Photo by Bernard Hermant on Unsplash)

If you are interested in cybersecurity, then you may have used a robust academic search engine like Microsoft Academic to get access to the most up-to-date research in security.

You may also have used free and open source software (FOSS) technology that would be less likely to track you, like DuckDuckGo.

Replacing The Big G with a “Quack”!

For those of you who are unfamiliar with FOSS: it is software that the user is free to use, redistribute, read the source code of, reverse engineer, and modify as they please. It does not mean the software is guaranteed to be free of charge. You are free to do whatever you want with the software at any level, from the binary all the way to, yes, even the source code.

What makes FOSS more secure is that you can verify the security of the software yourself. Since the public is free to access the source code, pentesters can more easily spot a security bug in the software and report it to the publisher quickly.

In addition, FOSS is far less likely to get away with intrusive features that would upset its users. Due to the nature of the license, the publisher is required to make all the source code that makes up the software available to the public. This allows users to verify that the software does not contain anything malicious.

FREEEEEEEEEEEEEEDOOOOOOOOOOMMM!!!

“Intrusive features?” you ask, tilting your head to the side. “What do you mean ‘intrusive’?”

I mean the features that allow your lovely federal government to track your phone calls. Your emails. Your texts. At every hour. On the bus. At work. At home. Even while your phone is on while you sleep and code out loud in your dreams.

George Orwell seriously predicted the future.

“But WAIT!” you scream. “I am an old timer and don’t have a mobile phone.” Well, even then, you have probably used that search engine:

“But I am just looking for cat videos”. (Photo by Benjamin Dada on Unsplash)

Oh, and don’t forget the social network you use to post those cat videos.

But my 200+ remote friends have got to see Simba run after that laser.

Well, guess what. Those free services you get online treat you like the product. The NSA absolutely takes advantage of search engines and social services like these to figure out if you are on their wanted list. This technically breaks the Fourth Amendment, which protects our right to privacy as long as there is no search warrant.

The “free services” themselves earn their living spying on you, in turn. Google keeps a complete list of everything you have searched for over the past ten years. EVERYTHING!

They know exactly where you were every day because you use their maps feature to figure out where your next class is.

They know what videos you watch on YouTube.

They know what apps you use at midnight.

You can literally ask Google to send you all the data they have about you.

All that data typically amounts to a little under 10 GB. That’s enough information to fill 7,000,000 text documents!

What do you think they would do with all that information?

Use it to make a personalized ad profile of you of course! (You have the right to turn this feature off, by the way).

’Cause you’re valuable, Timmy!

Oh, and I was just talking about Google. Forget Facebook.

And to top it all off, Silicon Valley companies like Google recently came under fire for being involved in selling AI technology that is used in warfare. I am talking about those drones that spy on people to hunt them down:

I SEE YOU JARED! (“National Air and Space Museum” by Cargo Cult is licensed under CC BY 2.0)

Google employees have already protested strongly over this.

This is the actual reason we security developers prefer using secure search engines like DuckDuckGo:

Literally THE trusted search engine for deep web users.

The main selling point of search engines like these is that they DON’T stalk you every second. They even prevent everyone else from stalking you while you are on their site. And the results they give are about as good as the ones from engines that DO stalk you.

So when I visit academic search engines, it really bugs me when my Brave Browser tells me this:

Arrrggh!!! Even a nerdy site like this tracks you!

After Edward Snowden released his famous NSA leaks, many people switched to DuckDuckGo since it was a search engine that respects the user’s privacy by refusing to allow third-party sites to cross-track you from their website. The search engine is also Free (as in Libre) and Open Source, so users can quickly identify security bugs in the engine.

DuckDuckGo is now a suitable replacement for whatever search engine you are using — except for an academic search engine component:

DuckDuckGo has everything you would want from a FOSS search engine — except DuckDuckGo Scholar

Like other concerned internet users who are reading this, I too scoured trustworthy websites like reddit to find a suitable FOSS academic search engine that did not track us, only to wind up disappointed. The Fourth Amendment does guarantee our right to privacy. But it is our responsibility to see that it is exercised.

There is one search engine that I commend for being FOSS and for respecting the freedom academic researchers should have, and that is Scinapse.

One of Scinapse’s most important goals is to provide a free search engine that gives compensation based on the quality of the individual’s contribution to academia. Like many researchers, the creators of Scinapse are concerned that for-profit academic search engines unfortunately keep the price of quality research way too high (sometimes hundreds of dollars) and keep almost all the profits for themselves.

I was about to declare Scinapse my perfect academic search engine until the Brave Browser reported this:

D’oh! They-Who-Must-Not-Be-Named Found Us!

“Darn!” I thought. “They still follow us here.” It seems that there are no academic search engines free from the tracking problem.

Scinapse also shows fewer citations and reference stats than the big-tech academic search engines. Compare the results between Scinapse and Microsoft Academic for the search query “Return Oriented Programming”:

You can see that Scinapse’s citation count is much lower than that of the academic search engines you are used to. It also gives previews of figure diagrams from the paper, which does not help in my opinion.

P.S: The User Interface of Microsoft Academic looks more polished.

I admit it. Microsoft’s Academic Search Engine clearly is far more robust.

Its citation counts are more detailed, and it provides a link to a downloadable PDF of the paper, ranks the best researchers in the field, and even allows you to add a paper to a personal reading list.

Microsoft Academic honestly has features that would leave rival academic search engines with their mouths watering.

So what would a small group of coders have to do to provide the kind of academic search engine we security-paranoid developers are looking for? It would have to focus on quality of results, not quantity.

One of the main problems I find with academic search engines, while doing security-based research, is that many of the research papers fail to consider whether their security solution is deployable.

This diagram from the famous paper “SoK: Eternal War in Memory” will make it clear what I am talking about:

Eternal War in Memory: We’re losing.

Wow, that diagram has more X’s than the average biochemistry midterm. All the “Policy Types” are unique security techniques. Take a look at the “Dep.” column, short for Deployable. In the security world, “deployable” is a fancy way of saying that the security technique has become a standard business security practice.

It’s not looking pretty for just about all of the techniques. To see why, take a look at the “Perf. % (avg. max)” column. That is the percentage by which the computer slows down once the security technique is in place. In the business security world, it is a rule of thumb that any technique with a performance overhead greater than 10% is a big no-no.
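To make that rule of thumb concrete, here is a minimal Python sketch. The function names and the timing numbers in the usage lines are made up for illustration; only the 10% threshold comes from the rule of thumb above.

```python
def overhead_percent(baseline_seconds: float, protected_seconds: float) -> float:
    """Percent slowdown of the protected run relative to the unprotected baseline."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return (protected_seconds - baseline_seconds) / baseline_seconds * 100.0


def is_deployable(overhead: float, threshold_percent: float = 10.0) -> bool:
    """Apply the rule of thumb: anything above the threshold is a no-go."""
    return overhead <= threshold_percent


# Hypothetical measurements: a benchmark takes 8.0 s without the
# technique and 8.5 s with it enabled.
slowdown = overhead_percent(8.0, 8.5)
print(f"{slowdown:.2f}% overhead, deployable: {is_deployable(slowdown)}")
```

A real evaluation would look at both the average and the worst-case overhead, since the table reports both.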

It especially pains me that just about all the academic search engines I have used turned up techniques that would technically work, with little to no attention paid to whether they would be realistically deployable. More often than not, there are hidden performance costs to using the security technique in question. I wasted too much time believing that many of the security techniques discussed in Eternal War in Memory would solve our security problems… until I realized that many researchers seem to be out of touch with how critical it is to avoid slowdowns in the security world.

Perhaps researchers do not communicate with the businesses that are responsible for administering security techniques in influential software, like the Exec_Shield by Red Hat.

I think it would be great if an academic search engine actually highlighted companies that deployed the security techniques mentioned in a research paper.

Worse still, a single technique often solves only one specific problem. Serious security developers must look for an array of techniques that not only each do their job but can also all run alongside each other at the same time without decreasing overall computer speed. That way, the security techniques can run in parallel on the machine. Bear in mind many of the security techniques used in the business world are for Linux servers, and Linux’s main selling point as a server OS is its speed.

It would be nice if we had a general-purpose academic search engine that not only PageRanked papers, but also pointed out the other related techniques that make up for the deficiencies in the technique of interest, and then showed the overall performance costs and benefits of adopting that collection of techniques side by side.
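As a rough sketch of what such a side-by-side summary could compute, here is a small Python example. Everything in it is hypothetical: the `Technique` record, the attack class names, and the overhead numbers are invented, and it assumes slowdowns compound multiplicatively, which real techniques may not do when they interact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Technique:
    name: str
    overhead_percent: float  # measured slowdown when used on its own
    covers: frozenset        # attack classes this technique addresses


def combined_overhead_percent(techniques) -> float:
    """Estimate the total slowdown, ASSUMING overheads compound multiplicatively."""
    factor = 1.0
    for t in techniques:
        factor *= 1.0 + t.overhead_percent / 100.0
    return (factor - 1.0) * 100.0


def uncovered(attack_classes, techniques) -> set:
    """Attack classes that none of the chosen techniques addresses."""
    covered = set()
    for t in techniques:
        covered |= t.covers
    return set(attack_classes) - covered


# Made-up techniques and numbers, for illustration only.
chosen = [
    Technique("technique A", 3.0, frozenset({"code injection"})),
    Technique("technique B", 5.0, frozenset({"code reuse"})),
]
threats = {"code injection", "code reuse", "data-only attack"}
print(f"combined overhead: {combined_overhead_percent(chosen):.2f}%")
print("still uncovered:", uncovered(threats, chosen))
```

The point is the shape of the answer: one number for the total cost next to the list of gaps that remain.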

Academic search engines also seem to lack an option for showing snippets of relevant source code. For instance, if you type in the search query “Huffman Encoding”, not only would it be great if the academic engine provided relevant papers on the famous algorithm, but it would also be great if the search engine had an additional flag where it would display a preview of the most relevant implementations of Huffman encoding, especially those trusted by businesses everywhere.

To get an idea of what kind of source-code-based search engine I am talking about, take a look at krugle. It even provided a code-snippet preview of the Huffman encoding used in Brotli, the famous algorithm that compresses the data web servers send to their clients’ browsers.
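To give a feel for the kind of snippet such a preview could show, here is a minimal textbook Huffman coding sketch in Python. It is not Brotli’s implementation and not what krugle displays; the function names are my own, and it favors readability over speed.

```python
import heapq
from collections import Counter


def huffman_codes(text: str) -> dict:
    """Build a prefix-free bit string for every symbol that appears in text."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, {symbol: code_so_far}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # two lightest subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(text: str, codes: dict) -> str:
    return "".join(codes[ch] for ch in text)


def decode(bits: str, codes: dict) -> str:
    reverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:  # safe because the codes are prefix-free
            out.append(reverse[current])
            current = ""
    return "".join(out)


codes = huffman_codes("abracadabra")
bits = encode("abracadabra", codes)
print(codes, len(bits), decode(bits, codes))
```

Frequent symbols end up with shorter codes, which is the whole trick behind the compression.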

Finally, many academic search engines already rank results based on age. Even then, a timeline diagram where the engine clearly shows how up to date the security technique is relative to past and present security techniques in a neat little table would really help. I mean, if you are looking at papers in descending order of age, then this format:

As easy to read as written by random contributors

…is still easier to read than this format:

As harder to read as is built by better coders

“Hold the presses!” you scream. “No one will actually care about being tracked while using an academic search engine rather than a normal one for laypeople!”

My response:

Any Questions?

You can cast your own vote here.

“Wait!” you still say. “There is no way a pip-squeak student project like yours is going to compete against that of a company with a ton of money on its hands.”

A Hater’s Actual Response

In all fairness, users really do not care about how many search results you obtain. They care about quality. And by that, I mean: “Your search engine better give me what I want in the first five results on average or yours goes down the drain too.”

Seriously, out of all the millions of search results an engine gives, how many of them do you bother to even click — let alone view?

DuckDuckGo doesn’t even bother calculating the number of search results because:

  1. It’s difficult to do.

  2. They know you don’t care about how many search results you get.

You care about getting the right result within the first few ranked hyperlinks. This blog has been describing problems with the quality of search results, not the quantity, the entire time.
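One common way to put a number on “the right result in the first few hyperlinks” is precision at k. Here is a minimal Python sketch, assuming someone has already judged which results are relevant for a query; the paper IDs are made up.

```python
def precision_at_k(ranked_ids, relevant_ids, k: int = 5) -> float:
    """Fraction of the top-k ranked results that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    hits = sum(1 for result in ranked_ids[:k] if result in relevant)
    return hits / k


# Hypothetical ranking returned by an engine, and hypothetical relevance judgments.
ranking = ["p1", "p2", "p3", "p4", "p5", "p6"]
judged_relevant = {"p2", "p5", "p9"}
print(precision_at_k(ranking, judged_relevant))
```

An engine that returns a million hits but scores poorly here is exactly the kind that goes down the drain.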

There is one last caveat I need to make here. And that’s funding the search engine. A person experienced with the business of search engines posted this fair warning:

Note to Self: Do not make this project a startup.

As the fellow redditor admitted, it would be unwise to make the search engine project a startup. If you take a close look at even the academic search engines of powerful big-tech companies, you will quickly realize that they lack the advertising that is common in standard search engines. Scinapse itself makes it clear it is a FOSS non-profit project. Since it stays a non-profit, the search engine is permitted to use the data sources it now uses.

I believe it is best for academic search engines to remain non-profit. Researchers have become so tired of the outrageously high subscription fees for access to journals that websites have been made that illegally bypass these paywalls to gain free access to the papers. But that is a blog for another day.

Written by Tanveer Salim

Loves bit manipulation, Free and Open Source Software, and search engine technology.
