Contextual search for datasets in CDAP
When you’re in charge of managing large amounts of data, the data about that data — or, its metadata — can be just as important as the data itself.
Imagine, for instance, sifting through a large cloud drive for a paper you wrote around two summers ago. You might recall that you wrote it between June 2017 and August 2017, and that the title had something to do with grape vineyards. So, you search within those date boundaries, and for the keywords “grape” and “vineyard.” If your search engine of choice supports it, finding that long-lost file may be trivial — assuming you didn’t write very many papers of the same description.
Just as the metadata of a file consists of its name, creation time, and size, datasets in CDAP have metadata too, in the form of system- and user-annotated properties and tags. To let users retrieve datasets based on this metadata, CDAP offers a dedicated metadata search tool.
The preexisting search tool only allowed text searches on one or more terms, and it returned every entity with at least one match, ordered by relevance. Because of this, conducting a search like the one in the example above was challenging, if not impossible.
In other words, you could not specify two required criteria for a search. Additionally, you were unable to search by dates or numbers — just strings. So, over the course of our internship, we introduced features for search that allow users to have more control in the queries that they write.
The specific features implemented were:
- Search for one or more required terms
- Search for numeric values
- Search for date values
Users can utilize these features directly in the metadata search bar with some simple syntax. Required terms, for example, are preceded with a ‘+’.
There was no initial infrastructure, however, to easily add new syntax in the search bar. Syntax needs to be first detected, and then, based on the type of syntax, the relevant information must be extracted and stored.
The existing search code deals with space-separated tokens one at a time, which made parsing each term within the search method itself the intuitive choice. Originally, we planned to build upon this behavior within the search method, adding some new logic. However, given the complexity of query parsing, with all its unique conditions and syntax definitions, we thought it better to use a helper class that could parse queries elsewhere. This would abstract a lot of the complexity away from an already-complex method.
We then discovered that CDAP has two different implementations of metadata search: one based on NoSQL and one in Elasticsearch. Both of these implementations need the same information from the user’s query. So, to avoid repeating code, we designed and implemented a query parsing API, called QueryParser, that both implementations could access.
QueryParser takes the user’s entire raw search query and splits it, according to its own syntax rules, into individual search terms, each of which is stored in a QueryTerm object. A QueryTerm, then, can be thought of as a smarter String, containing relevant search information in an easy-to-access way.
Both implementations can send a raw search query to the QueryParser API, which will create corresponding QueryTerm objects to be sent back to the implementations. The implementations can then use the information stored in the QueryTerms in their unique ways.
Required Search Implementation
When a user conducts a search, they typically aim to filter a large source of information — like a cloud drive — down to a smaller one. To increase the specificity by which users can filter their datasets, our first task was to implement required search terms.
We decided quite early that a ‘+’ prefix would be the most straightforward way to denote a required term. Since the QueryParser API defines its own syntax, we adjusted it to check for a ‘+’ character at the beginning of each search term and to treat such terms as required.
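To make the idea concrete, here is a minimal sketch of what such a parser could look like. The class names QueryTermSketch and QueryParserSketch, along with their methods, are hypothetical stand-ins for illustration and are not CDAP’s actual classes; the sketch only assumes what is described above, namely whitespace-separated terms and a ‘+’ prefix for required ones.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a parsed search term (not CDAP's actual class). */
final class QueryTermSketch {
  enum Qualifier { OPTIONAL, REQUIRED }

  private final String term;
  private final Qualifier qualifier;

  QueryTermSketch(String term, Qualifier qualifier) {
    this.term = term;
    this.qualifier = qualifier;
  }

  String getTerm() {
    return term;
  }

  Qualifier getQualifier() {
    return qualifier;
  }
}

/** Hypothetical parser: splits a raw query on whitespace and classifies each token. */
final class QueryParserSketch {
  private QueryParserSketch() { }

  static List<QueryTermSketch> parse(String rawQuery) {
    List<QueryTermSketch> terms = new ArrayList<>();
    for (String token : rawQuery.trim().split("\\s+")) {
      if (token.isEmpty()) {
        continue; // a blank query yields a single empty token; skip it
      }
      if (token.startsWith("+") && token.length() > 1) {
        // Strip the '+' prefix and mark the remaining text as required.
        terms.add(new QueryTermSketch(token.substring(1), QueryTermSketch.Qualifier.REQUIRED));
      } else {
        // Everything else, including a lone "+", is treated as an optional term.
        terms.add(new QueryTermSketch(token, QueryTermSketch.Qualifier.OPTIONAL));
      }
    }
    return terms;
  }
}
```

Under these assumptions, parsing the query “+grape vineyard” would yield a required term “grape” followed by an optional term “vineyard”, and each search implementation can then decide how to act on the qualifier.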
Downstream, utilizing this API in the Elasticsearch implementation proved fairly simple; the Elasticsearch Java API offers its own helper methods specifically for conducting a required search, and the entire implementation required changing a single method. Before, search terms were indiscriminately considered optional — now, they were considered optional or required according to their QueryTerm representation. Handy!
Utilizing the API in the NoSQL implementation, which is backed by HBase, was a bit more involved. The HBase implementation contained no such helper methods, so we had to perform the filtration ourselves.
The search method itself remains the same in the HBase implementation and pays no regard to whether a search term is required or optional. Each entity returned by the search is paired with the set of search terms that matched it. These sets are then compared to the set of required search terms, which we extract using the QueryParser API. If a set does not contain every required search term, the corresponding entity is filtered out.
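The filtering step can be sketched in a few lines. This is an illustration only: the class RequiredTermFilterSketch, its filter method, and the use of plain strings for entities and terms are assumptions made for the example, not the actual CDAP code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical post-search filter for required terms. */
final class RequiredTermFilterSketch {
  private RequiredTermFilterSketch() { }

  /**
   * Keeps only the entities whose set of matched terms contains every required term.
   * The map key is an entity identifier; the value is the set of terms that matched it.
   */
  static Map<String, Set<String>> filter(Map<String, Set<String>> matchedTermsByEntity,
                                         Set<String> requiredTerms) {
    Map<String, Set<String>> kept = new LinkedHashMap<>();
    for (Map.Entry<String, Set<String>> entry : matchedTermsByEntity.entrySet()) {
      // containsAll is trivially true when there are no required terms,
      // so a query without '+' terms passes every result through unchanged.
      if (entry.getValue().containsAll(requiredTerms)) {
        kept.put(entry.getKey(), entry.getValue());
      }
    }
    return kept;
  }
}
```

One consequence of this design is that the search itself stays untouched; required terms only affect which of the already-found entities survive.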
Numeric Search Implementation
Of course, users may think of their metadata properties as more than just strings. For instance, a user may annotate a “priority” property to their datasets, with a numeric range from 1–10. Initially, if they wanted to search for, say, all datasets with a priority greater than 6, they would have to search for each number individually: “priority:7,” “priority:8,” and so on. You can imagine the tedium of trying to search through datasets with priorities ranging from 1–100! As such, it seemed important to implement numeric metadata search.
The format for a numeric search query looks like this:
<field>:<comparison><numeric-value>
Here, a field is any metadata property; a comparison is >, >=, <, <=, or = (corresponding to greater than, greater than or equal to, less than, less than or equal to, or equal to); and a numeric value is some integer or decimal value. That priority search from before would then look like “priority:>6”.
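As a rough illustration of how a term in this format might be recognized, the sketch below uses a regular expression to pull out the field, the comparison, and the value. The class NumericQuerySketch and the exact pattern are assumptions for the example; the real parser may well detect the syntax differently.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical representation of a numeric search term such as "priority:>6". */
final class NumericQuerySketch {
  // A field name, a colon, one comparison operator, then an integer or decimal value.
  // Two-character operators are listed first so ">=" is not read as ">" followed by "=".
  private static final Pattern NUMERIC_TERM =
      Pattern.compile("([^:\\s]+):(>=|<=|>|<|=)(-?\\d+(?:\\.\\d+)?)");

  final String field;
  final String operator;
  final double value;

  private NumericQuerySketch(String field, String operator, double value) {
    this.field = field;
    this.operator = operator;
    this.value = value;
  }

  /** Returns the parsed term, or empty if the text does not use the numeric syntax. */
  static Optional<NumericQuerySketch> parse(String term) {
    Matcher m = NUMERIC_TERM.matcher(term);
    if (!m.matches()) {
      return Optional.empty();
    }
    return Optional.of(
        new NumericQuerySketch(m.group(1), m.group(2), Double.parseDouble(m.group(3))));
  }

  /** Applies the comparison to a candidate property value. */
  boolean matches(double candidate) {
    switch (operator) {
      case ">":
        return candidate > value;
      case ">=":
        return candidate >= value;
      case "<":
        return candidate < value;
      case "<=":
        return candidate <= value;
      default: // "=" is the only remaining operator the pattern allows
        return candidate == value;
    }
  }
}
```

In this sketch a term without a comparison, such as “priority:7”, does not match the pattern and would fall through to ordinary text search.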
Implementing numeric search presented a new, interesting challenge: it demanded a change to the way CDAP stores its metadata properties. Finding out where this change needed to be made was its own technical adventure, an extended code-reading exercise. Once we reached the core of the issue, the Property class, we had to determine how to make the change in as few lines as possible, like surgeons operating on a digital patient. We knew that a Property was always string-based, and in some cases numeric. How do we represent this in code?
We concluded that a secondary numeric field, populated whenever the string representation is a valid number, would be the most valuable approach. After altering Elasticsearch’s index mappings and calling the proper Elasticsearch API methods, voilà! Numeric search was born.
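The shape of that change can be sketched as follows. PropertySketch is a hypothetical class written for this post, not the actual Property class in CDAP, and the choice of BigDecimal for validation is an assumption of the sketch.

```java
import java.math.BigDecimal;

/** Hypothetical property that is always string-based and sometimes also numeric. */
final class PropertySketch {
  private final String name;
  private final String value;
  private final Double numericValue; // null when the string value is not a valid number

  PropertySketch(String name, String value) {
    this.name = name;
    this.value = value;
    this.numericValue = tryParse(value);
  }

  private static Double tryParse(String text) {
    try {
      // BigDecimal rejects inputs such as "NaN" or "7d" that Double.valueOf would accept.
      // It does accept scientific notation such as "1E3".
      return new BigDecimal(text.trim()).doubleValue();
    } catch (NumberFormatException e) {
      return null;
    }
  }

  String getName() {
    return name;
  }

  String getValue() {
    return value;
  }

  /** The numeric view of the value, or null if the value is text only. */
  Double getNumericValue() {
    return numericValue;
  }
}
```

Because the string value is always kept, existing text search keeps working, while the optional numeric field gives the index something to run range comparisons against.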
One complication with numeric search, however, was that we were unable to add it to the NoSQL implementation of metadata search — the supporting functionality wasn’t available.
Date Search Implementation
The implementation of date search is similar to numeric search in many ways, but dealing with time proved itself to be a bit trickier.
The general format for a date search query is the same as for numeric search:
<field>:<comparison><date-value>
But this is actually where we run into our first issue: what is the format for date-value?
In the United States, most people write the month, day, and year separated by forward slashes: MM/DD/YYYY. In other parts of the world, the standard is to write the day first, followed by the month and year. Some people write out the full year; others write only the last two digits. Some separate the components with dashes instead of slashes, or sometimes even periods. There are so many possible ways to represent the same date!
So, as a start, the ISO 8601 standard for writing dates (YYYY-MM-DD) is the only format currently supported for date search and date definition. Support for other formats will be added in the future.
When the QueryParser detects a comparison operator following a field name, it checks whether the value that follows can be parsed as a date. If it can, the date is converted to a Unix timestamp, adjusted for the user’s time zone.
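A minimal sketch of that conversion, using the standard java.time classes, might look like the following. The class DateValueSketch and its method are hypothetical names, and passing the user’s time zone in as a parameter is an assumption about how that information is made available.

```java
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.format.DateTimeParseException;
import java.util.OptionalLong;

/** Hypothetical helper that turns a YYYY-MM-DD value into a Unix timestamp. */
final class DateValueSketch {
  private DateValueSketch() { }

  /**
   * Returns milliseconds since the Unix epoch for 00:00:00 on the given date
   * in the user's time zone, or empty if the value is not a YYYY-MM-DD date.
   */
  static OptionalLong toEpochMillis(String value, ZoneId userZone) {
    try {
      // LocalDate.parse uses the ISO-8601 calendar date format (YYYY-MM-DD) by default.
      LocalDate date = LocalDate.parse(value);
      return OptionalLong.of(date.atStartOfDay(userZone).toInstant().toEpochMilli());
    } catch (DateTimeParseException e) {
      // Not a supported date; the caller can fall back to numeric or text handling.
      return OptionalLong.empty();
    }
  }
}
```

Note that the same calendar date maps to different timestamps in different zones, which is why the zone has to be part of the conversion.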
Even though the unix timestamp can technically be treated as a number, we cannot simply apply numeric search for these date values. This is because of the way the code must deal with assumptions people make when making a date search with a comparison operator.
For most people, a search for objects created after (“>”) June 1st should not include objects created on June 1st at 4pm, but a search for objects created on or after (“>=”) June 1st should. Yet both queries use the same date value, 2019-06-01, which is converted to 00:00:00 on June 1st.
The challenge, therefore, is to build a query that does not consider 00:00:01 on June 1st to be after June 1st. We achieved this by adjusting the query values by 24 hours so that date searches depend on the day and not the time. In the future, if users are able to specify the exact hour or minute in their query, this format may need to change. Until then, however, these adjustments give users the results that they would intuitively expect to get.
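One way to picture the adjustment is to translate each date comparison into a timestamp range whose boundaries fall on day starts. The sketch below is our illustration of that idea and not the actual CDAP code: the class DateRangeSketch, the half-open range representation, and the per-operator mapping are assumptions. It also uses a fixed 24-hour day, so it ignores days shortened or lengthened by daylight saving time.

```java
import java.time.LocalDate;
import java.time.ZoneId;

/** Hypothetical half-open timestamp range [fromInclusive, toExclusive) for a date comparison. */
final class DateRangeSketch {
  private static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

  final long fromInclusive;
  final long toExclusive;

  private DateRangeSketch(long fromInclusive, long toExclusive) {
    this.fromInclusive = fromInclusive;
    this.toExclusive = toExclusive;
  }

  static DateRangeSketch forComparison(String operator, LocalDate date, ZoneId zone) {
    long dayStart = date.atStartOfDay(zone).toInstant().toEpochMilli();
    long nextDayStart = dayStart + MILLIS_PER_DAY; // the 24-hour adjustment
    switch (operator) {
      case ">":  // strictly after the day: start at the next day
        return new DateRangeSketch(nextDayStart, Long.MAX_VALUE);
      case ">=": // on or after the day: start at the day itself
        return new DateRangeSketch(dayStart, Long.MAX_VALUE);
      case "<":  // strictly before the day: stop at the day's start
        return new DateRangeSketch(Long.MIN_VALUE, dayStart);
      case "<=": // on or before the day: stop at the next day's start
        return new DateRangeSketch(Long.MIN_VALUE, nextDayStart);
      case "=":  // any time during the day
        return new DateRangeSketch(dayStart, nextDayStart);
      default:
        throw new IllegalArgumentException("Unsupported comparison: " + operator);
    }
  }

  boolean contains(long epochMillis) {
    return epochMillis >= fromInclusive && epochMillis < toExclusive;
  }
}
```

With this mapping, a timestamp for June 1st at 4pm falls outside the range for “>” 2019-06-01 but inside the ranges for “>=”, “<=”, and “=”, which matches the intuition described above.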
These rich search queries will be part of an upcoming CDAP release. Try out CDAP today, and if you’d like to explore such exciting challenges, consider contributing to CDAP.
The process of implementing these features was much more iterative than we first expected. What at first seemed like a simple task ended up requiring a whole new API and some code refactoring and overhaul. This meant that we had to scrap some of our first ideas and redesign our solution. Additionally, adhering to the preferred practices in industry meant revising our code several times for readability, documentation, and fine-tuning. Along the way, we grew as programmers, developing a finer sense of code health and design principles. This could not have been achieved without the encouragement and guidance of our internship hosts, Vinisha Shah and Andreas Neumann, and the CDAP community.
Authors: Jordan and Violetta