
TL;DR

SeekStorm is a Search as a Service.

The search API offers web-scale, real-time, full text, instant search for your data and documents.

SeekStorm is a Crawler as a Service.

A high-performance, focused crawler turns any website into JSON docs with structured data.

20x speed and 200x payload compared to Lucene

30x more queries and documents per dollar spent

Turnkey, affordable, scalable, high-performance search.

Why we built SeekStorm

Search is omnipresent in today’s Information Age. The giant amount of data produced makes searching a core part of almost every solution stack.

Whether your customers are searching for products or information on your website, or research papers, patents, court records or patient records need to be searched, the search solution decides whether your business or your research succeeds or becomes a frustrating experience.

Three options for search

  1. Build the search…


Photo by niko photos

The Pruning Radix Trie is a novel data structure, derived from a radix trie — but 3 orders of magnitude faster.

After I published SymSpell, a very fast spelling correction algorithm, I have been frequently asked whether it can be used for auto-completion as well. Unfortunately, despite its speed, SymSpell is a poor choice for auto-complete. The Radix Trie seemed to be a natural fit for auto-complete. But the lookup of a short prefix in a large dictionary, which results in a huge number of candidates, lacked speed. …
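To make the problem concrete, here is a minimal sketch of top-k auto-complete on a plain (uncompressed) trie, not the Pruning Radix Trie itself. The pruning rule used here, storing the highest term frequency of each subtree and skipping subtrees that cannot beat the current results, is an assumption about the kind of pruning the name suggests. The names `Node`, `Trie`, `add` and `top_k` are hypothetical, and each term is assumed to be added only once.

```python
import heapq


class Node:
    def __init__(self):
        self.children = {}   # next character -> child Node
        self.count = 0       # frequency if a term ends here, else 0
        self.max_count = 0   # highest frequency found anywhere in this subtree


class Trie:
    def __init__(self):
        self.root = Node()

    def add(self, term, count):
        """Insert a term (assumed to be added only once) with its frequency."""
        node = self.root
        node.max_count = max(node.max_count, count)
        for ch in term:
            node = node.children.setdefault(ch, Node())
            node.max_count = max(node.max_count, count)
        node.count = count

    def top_k(self, prefix, k):
        """Return up to k completions of prefix, most frequent first."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []

        results = []  # min-heap of (count, term), never larger than k

        def visit(current, term):
            # Pruning: nothing in this subtree can beat the weakest result.
            if len(results) == k and current.max_count <= results[0][0]:
                return
            if current.count:
                if len(results) < k:
                    heapq.heappush(results, (current.count, term))
                elif current.count > results[0][0]:
                    heapq.heapreplace(results, (current.count, term))
            # Visit the most promising children first so pruning kicks in early.
            ordered = sorted(current.children.items(),
                             key=lambda item: -item[1].max_count)
            for ch, child in ordered:
                visit(child, term + ch)

        visit(node, prefix)
        return [term for _, term in sorted(results, reverse=True)]
```

Without the `max_count` check, a short prefix such as a single letter forces a visit to every term below it. With the check, whole subtrees are skipped as soon as k better candidates are known.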


Exploring the implications of Artificial Intelligence, Consciousness and Free will

Photo by Franck V. on Unsplash

TL;DR

AI will replace most jobs. Human labor becomes worthless, the biggest devaluation in history. Democracy will crumble as people lose their negotiating power. Consciousness will spontaneously emerge in AI by Darwinian evolution, not by human engineering. Superintelligence will surpass humans and replace our species within this millennium.

NASA is working on project HAMMER to protect the earth from an asteroid that has a 1 in 2,700 chance of hitting us in 2175. …


Vergilius Vaticanus

TL;DR

Faster Word Segmentation by using a Triangular Matrix instead of Dynamic Programming. The integrated Spelling correction allows noisy input text. C# source code on GitHub.

For people in the West it seems obvious that words are separated by space, while in Chinese, Japanese, Korean (CJK languages), Thai and Javanese words are written without spaces between words.

Even Classical Greek and late Classical Latin were written without those spaces. This was known as Scriptio continua.

And it seems we haven’t yet lost our capabilities: we can easily decipher

thequickbrownfoxjumpsoverthelazydog

as

the quick brown fox jumps over the lazy dog

Our brain does this somehow intuitively and unconsciously. Our reading speed slows down just a bit, caused by all the background processing our brain has to do. How much that really is we will see if we attempt to do it programmatically. …
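As a point of reference, here is a minimal sketch of the conventional dynamic programming approach to word segmentation, not the triangular matrix method this post describes. It assumes a tiny hypothetical dictionary, no spelling correction, and "fewest words" as the scoring rule. The name `segment` and the parameter `max_word_length` are illustrative only.

```python
def segment(text, dictionary, max_word_length=20):
    """Split text into dictionary words, preferring the fewest words.

    Returns a list of words, or None if no segmentation exists.
    """
    # best[i] holds the best segmentation found for text[:i], or None.
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        # Only consider candidate words up to max_word_length characters.
        for start in range(max(0, end - max_word_length), end):
            word = text[start:end]
            if best[start] is not None and word in dictionary:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[-1]


# Hypothetical toy dictionary; a real one would carry word frequencies.
words = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}
print(" ".join(segment("thequickbrownfoxjumpsoverthelazydog", words)))
# prints "the quick brown fox jumps over the lazy dog"
```

Even this simple version checks every possible word ending at every position, which hints at how much background processing the brain does when reading such text.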



Conventional wisdom and textbooks say BK-trees are especially suited for spelling correction and fuzzy string search. But does this really hold true?

Also in the comments to my blog post on spelling correction the BK-tree has been mentioned as a superior data structure for fuzzy search.

So I decided to compare and benchmark the BK-tree to other options.

Approximate string search algorithms

Approximate string search looks up a string in a list of strings and returns those strings that are close according to a specific string metric.

There are many different string metrics, such as Levenshtein, Damerau-Levenshtein, Hamming distance, Jaro-Winkler and Strike a match. …
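The definition above can be sketched as a brute-force baseline: compute the Levenshtein distance between the query and every string in the list and keep those within a maximum distance. This is the naive linear scan, not a BK-tree or any of the faster structures compared in the post. The function names `levenshtein` and `fuzzy_lookup` are chosen for illustration.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_lookup(query, words, max_distance):
    """Linear scan: return (distance, word) pairs within max_distance."""
    matches = []
    for word in words:
        distance = levenshtein(query, word)
        if distance <= max_distance:
            matches.append((distance, word))
    return sorted(matches)


print(fuzzy_lookup("helo", ["hello", "help", "world"], 1))
# prints [(1, 'hello'), (1, 'help')]
```

The scan touches every entry for every query, which is the cost that specialized index structures try to avoid.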


Sub-millisecond compound aware automatic spelling correction

Source: https://www.flickr.com/photos/theredproject/3968278028

Recently I was pointed to two interesting posts about spelling correction. They applied a deep learning approach, the philosopher’s stone of modern times. It is really fascinating how universal deep learning is, from AlphaGo winning Go championships and Watson winning Jeopardy to fighting fake news and threatening mankind with the Singularity.

The question is whether the deep learning multi-tool will excel and replace highly specialized algorithms and data structures in every domain, whether both deserve their place, or whether they shine most when their complementary strengths are combined. …


Photo by Matt Artz

Introduction

This post explores Elias-Fano encoding, which allows very efficient compression of sorted lists of integers, in the context of information retrieval (IR).

Elias-Fano encoding is quasi-succinct, which means it is almost as good as the theoretically best possible compression scheme for sorted integers. While it can be used to compress any sorted list of integers, we will use it for compressing posting lists of inverted indexes.

While gap compression has been around for over 30 years, and some of the foundations of Elias-Fano encoding even date back to a 1972 publication by Peter Elias, Elias-Fano encoding itself was published in 2012. Because it is a rather recent development, there is not much actual implementation code available beyond the papers. …
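A minimal sketch of the basic idea, assuming the textbook formulation: each value is split into low bits stored at fixed width and high bits stored in a unary-coded bit vector. For readability the low bits are kept in a plain Python list and the high bits in a Python integer. A real implementation would pack both into bit arrays, which takes roughly 2 + log2(u/n) bits per element for n values below an upper bound u. The names `ef_encode` and `ef_decode` are hypothetical.

```python
def ef_encode(values, universe):
    """Encode a sorted list of non-negative ints, all below universe."""
    n = len(values)
    # Number of low bits per value: floor(log2(universe / n)), at least 0.
    low_bits = max(0, (universe // n).bit_length() - 1) if n else 0
    mask = (1 << low_bits) - 1
    lows = [v & mask for v in values]  # fixed-width low parts
    high = 0                           # Python int used as a bit vector
    for i, v in enumerate(values):
        # The i-th value sets the bit at position (high part + i).
        high |= 1 << ((v >> low_bits) + i)
    return low_bits, lows, high, n


def ef_decode(low_bits, lows, high, n):
    """Recover the original sorted list from the encoded parts."""
    out = []
    pos = 0
    i = 0
    while i < n:
        if (high >> pos) & 1:
            # The i-th set bit sits at position (high part + i).
            out.append(((pos - i) << low_bits) | lows[i])
            i += 1
        pos += 1
    return out


# A hypothetical posting list of document IDs.
postings = [2, 3, 5, 7, 11, 13, 24]
encoded = ef_encode(postings, universe=25)
assert ef_decode(*encoded) == postings
```

Because the high parts of a sorted list never decrease, adding the index i makes every bit position unique, so the decoder can recover each high part by counting set bits.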


Source: https://www.flickr.com/photos/theredproject/3968278028

The correction of product names, company names, street names & addresses is a frequent task of data cleaning and deduplication. Often those names are misspelled, either due to OCR errors or mistakes of the human data collectors.

The difference is that those names often consist of multiple words, white space and punctuation. For large-data or even big-data applications, speed is also very important.

The SymSpell algorithm supports both requirements and is up to 1 million times faster compared to conventional approaches (see benchmark). The C# source code is available as Open Source on GitHub. …


Source: https://www.flickr.com/photos/theredproject/3968278028

1 million times faster spelling correction for edit distance 3

After my blog post "1000x times faster spelling correction" got more than 50,000 views, I revisited both algorithm and implementation to see if it could be further improved.

While the basic idea of the Symmetric Delete spelling correction algorithm remains unchanged, the implementation has been significantly improved to unleash the full potential of the algorithm.

This results in 10 times faster spelling correction, 5 times faster dictionary generation and 2 to 7 times less memory consumption in v3.0 compared to v1.6.

Compared to Peter Norvig’s algorithm it is now 1,000,000 times faster for edit distance=3 and 10,000 times faster for edit distance=2. …


Source: https://unsplash.com/photos/S5XON9lNFvo

Lex Google from a search engine's perspective: a German law threatening the internet as we know it.

What it is about

The ancillary copyright for press publishers (Leistungsschutzrecht für Presseverlage), introduced by the Eighth Act Amending the Copyright Act (Achtes Gesetz zur Änderung des Urheberrechtsgesetzes)

Here are the key passages:

§ 87f (1) The producer of a press product (press publisher) has the exclusive right to make the press product or parts thereof publicly available for commercial purposes, unless these are individual words or the smallest text excerpts. If the press product was produced within an enterprise, the owner of the enterprise is deemed to be the producer.

§ 87g (2) The right expires one year after publication of the press product.

§ 87g (4) Making press products or parts thereof publicly available is permitted, provided this is not done by commercial providers of search engines or commercial providers of services that process content in a similar way. …

About

Wolf Garbe

Founder SeekStorm (Search-as-a-Service), FAROO (P2P Search) https://seekstorm.com https://github.com/wolfgarbe https://www.quora.com/profile/Wolf-Garbe
