Exposing CSV table data as a searchable microservice, including fuzzy matching

Florian Quirin
5Analytics
Aug 13, 2018

Sometimes you find yourself in a position where your company has a lot of data lying around in files, most prominently table data in the form of CSV spreadsheets, and you want to do more with it than just analyzing it locally and writing a static report about your insights.

In this article you will learn how to load the data via the 5Analytics AI Platform, make it available as a RESTful web service and extract information via a fuzzy search for terms inside the columns of your table.

If you haven’t installed the platform yet, I recommend using the Docker image of the freely available community edition.

To get started, first download the R script and the data for this tutorial. We will be using a table of soccer results from the German Bundesliga season 2017/2018 (public domain data):

Next we will upload the data to the 5Analytics AI Platform. Unzip the file and simply drag & drop the folder into the corresponding area on the scripts upload page located at:

http://localhost:5050/up/web/file/

If you are running the platform on a different server, replace localhost with the server’s IP address. You can already call the REST interface now to access the data and get a result in JSON format. The endpoint defined in our script is called fafun_fuzzy_search_csv, and by default it will search the table column “team1” for the term “Bremen” (a city in Germany), giving you all matches of the season where “Werder Bremen” was the home team.

Open a new tab in your browser and enter:

http://localhost:5050/if/json/R/v1/fafun_fuzzy_search_csv?_token=test_token 

You should see a nice JSON array popping up on the screen looking like this:

{
  "data": [
    ...,
    {
      "date": "(Sun) 19 Nov 2017",
      "team1": "Werder Bremen",
      "team2": "Hannover 96 ",
      "ht": "1-0",
      "ft": "4-0"
    },
    {
      "date": "(Sat) 2 Dec 2017",
      "team1": "Werder Bremen",
      "team2": "VfB Stuttgart ",
      "ht": "1-0",
      "ft": "1-0"
    },
    ...
  ]
}

This data can easily be imported into a website, for example to give users the option to browse historic league results by team name.

Thanks to the fuzzy search implementation, the result is fairly robust to spelling errors and also supports partial matches. You can try different search terms by adding the following URL parameter to your call: &search=[team name], for example:

http://localhost:5050/if/json/R/v1/fafun_fuzzy_search_csv?_token=test_token&search=Bromin
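
If you prefer to call the endpoint from code instead of the browser, a minimal R sketch could look like this (not part of the tutorial script; it assumes the httr and jsonlite packages are installed and the platform is running on localhost):

library(httr)
library(jsonlite)

# Call the fuzzy search endpoint with a deliberately misspelled team name
resp <- GET("http://localhost:5050/if/json/R/v1/fafun_fuzzy_search_csv",
            query = list("_token" = "test_token", "search" = "Bromin"))

# Parse the JSON response; result$data contains the matching rows
result <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
result$data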

How does it work?

When you open the R script you can see that it has just a few lines of code consisting of 3 functions. Let’s take a closer look at the first one, fafun_fuzzy_search_csv. This is the main function of our script and it defines the endpoint you called in the previous example. Because the name of the function starts with fafun_, the 5Analytics platform knows that you want to access it from outside. All the arguments of the function can be used as URL parameters, as we have seen before with the parameter &search=…. The function itself basically consists of 4 steps, which we will go through below.
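
Before walking through the steps, here is roughly what the declaration of such an exported function could look like. The parameter names and defaults below are assumptions inferred from the article, not the exact signature used in the tutorial script:

fafun_fuzzy_search_csv <- function(search = "Bremen",         # term to look for
                                   column_name = "team1",     # column to search in
                                   max_results = 0,           # 0 = return all matches
                                   file_name = "results.csv"  # placeholder file name
                                   ) {
  # steps 1 to 4 described below
}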

1. Load the CSV file and get the column to search:

df <- load_csv(file_name)
search_column <- df[, column_name]

In the first step we use the “private” function declared further down in the script to load the CSV file into a data frame. We then select the column of the data frame that we want to use for the term search.
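
The load_csv helper itself is not shown in the article, but a minimal sketch of such a “private” function could look like this (the read.csv options are an assumption; the real script may load the file differently). Because its name does not start with fafun_, the platform does not expose it as an endpoint:

load_csv <- function(file_name) {
  # Read the CSV into a data frame and keep text columns as plain character vectors
  read.csv(file_name, stringsAsFactors = FALSE)
}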

2. Get the generalized Levenshtein edit distance to the search term (partial matching allowed):

all_distances <- adist(search, search_column, partial = TRUE, ignore.case = TRUE)

The second step uses the R function adist to calculate the generalized Levenshtein edit distance of each column entry to the search term. You can learn more about it in the R documentation. For now just take it as a measure of similarity between two strings, namely the “minimal possibly weighted number of insertions, deletions and substitutions needed to transform one string into another”.
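
To get a feeling for what adist returns, you can try it directly in an R console. The example below compares the misspelled query from earlier against a couple of column entries:

# Two substitutions (o -> e, i -> e) turn "Bromin" into the substring "Bremen",
# so the partial edit distance is 2
adist("Bromin", "Werder Bremen", partial = TRUE, ignore.case = TRUE)

# A team name with little in common with the query yields a larger distance
# and gets filtered out by the threshold in the next step
adist("Bromin", "Borussia Dortmund", partial = TRUE, ignore.case = TRUE)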

3. Sort results and filter by threshold:

sorted_distances <- sort(all_distances, index.return = TRUE)
sd_v <- sorted_distances$x                  # values
sd_i <- sorted_distances$ix                 # indices
threshold <- ceiling(nchar(search) * 0.25)  # similarity threshold
sd_i <- sd_i[sd_v <= threshold]             # indices we want

In the third step we use the similarity to sort the data and then apply a threshold to remove entries that are simply too different from our search term. As the threshold we use 25% of the search term length (rounded up), meaning a term with 10 letters is allowed to have at most ceiling(10 * 0.25) = 3 insertions, deletions or substitutions and still make it into our results list.
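
Tying this back to the earlier “Bromin” example, the threshold works out as follows (a small illustrative calculation, not part of the script):

search <- "Bromin"
threshold <- ceiling(nchar(search) * 0.25)  # ceiling(6 * 0.25) = 2
# "Werder Bremen" has a partial edit distance of 2 to "Bromin" (see step 2),
# so 2 <= 2 holds and the Bremen matches survive the filter.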

4. Return the result:

# Return empty array if nothing survived the threshold
if (length(sd_i) == 0) {
  return(list())
}

# Get entries
top_entries <- df[sd_i, ]

# Get top N results
if (max_results <= 0 || max_results > length(sd_i)) {
  max_results <- length(sd_i)
}

# Get nicely formatted output (rows as JSON objects)
out <- convert_dataframe_to_array(top_entries[1:max_results, ])

The last step prepares the data for our web service. First it checks whether we have any matches at all; if not, we simply return an empty list. If we do have matches, we extract them from the data frame using the index vector calculated before. If the user has set a limit for the number of results, we apply it next and then convert our data frame into a more web-service-friendly format (grouping the rows of our table) using the function defined at the bottom of our script. Done :-)
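
The convert_dataframe_to_array function is not reproduced in the article, but a minimal sketch of such a helper could simply turn each row into a named list so that the platform can serialize it as a JSON object (this implementation is an assumption, not the actual script):

convert_dataframe_to_array <- function(df) {
  # One named list per row -> serialized as an array of JSON objects
  lapply(seq_len(nrow(df)), function(i) as.list(df[i, ]))
}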

Where to go from here?

We’ve learned how you can easily transform your CSV data files into an interactive web service using the 5Analytics AI Platform and how to extract approximate matches from the table columns, but this is just the beginning. You can add arbitrarily complex data processing steps to your script and use all the machine learning power offered by R (or any other supported language like Python or Scala) to improve your results. You can also add new parameters to your endpoint that take into account user-specific information submitted by the website that runs the service, or any other data source in your business logic.

Be sure to check out our other tutorials to learn more about the features of the 5Analytics AI platform and possibilities to integrate AI into your business.
