I don’t always attend conferences. But when I do, I prefer conferences full of search practitioners focused on solving real-world problems.
I just came back from the Lucidworks Activate conference on search and AI, where I had the privilege of participating in a panel on “AI In Practice: the Good, the Bad and the Ugly”. It was a fun and energetic discussion with GitHub’s Kavita Ganesan, Slack’s Josh Wills, and Reddit’s Anupama Joshi, led by Lucidworks CTO Grant Ingersoll. Hopefully they will post the videos and slides from the other sessions on their YouTube channel soon.
Activate was a lot of fun. I mused on Twitter that it’s like SIGIR but for people working in the real world. That was a bit snarky, and if you work on search and are not familiar with SIGIR, you should certainly check it out. Nonetheless, I appreciated the focus of speakers and attendees on practical concerns about implementing search applications, something that I often miss in gatherings of more academic information retrieval researchers.
Some takeaways from the conference:
- Everyone is trying to work with embeddings, whether word embeddings like word2vec and GloVe, paragraph embeddings like doc2vec, or character n-gram embeddings like fastText. The challenge is using these vector-based models in combination with an index representation designed for words. There’s a yawning gap between a Boolean query to retrieve documents containing a set of words and a nearest-neighbor query to retrieve the documents closest to a high-dimensional vector. Fortunately, there are practical tricks for bridging that gap, but they all involve trade-offs.
- Deep learning for search feels like the new teenage sex: everyone talks about it; nobody really knows how to do it; everyone thinks everyone else is doing it; so everybody claims they’re doing it. OK, it’s not that bad. But it does feel like deep learning is top-of-mind for almost everyone working on search. Deep learning is genuinely exciting, but some of the more senior folks there, myself included, urged caution and patience before diving into complex approaches that don’t optimize for explainability.
- Lots of folks were talking about entity recognition in documents and queries. One of my favorite presentations, delivered by Radu Gheorghe and Rafał Kuć from Sematext, showed how easy it was to use open-source entity recognition software with Solr through a rapid-fire series of live demos. In general, it was great to see how many people were focused on query understanding problems.
- The most intriguing takeaway for me came from a presentation by Lucidworks Chief Data Engineer Jake Mannix: he made a compelling case that working with character sequences instead of words reduces the risk of unexpected tokenization differences among the components of a search stack, which often use different languages and libraries. While I’m not ready to give up all the benefits of working with tokens, I see the merits of his argument. Indeed, I’ve already seen that query classification works very well on character sequences, e.g., using a tool like fastText.
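To make the gap from the embeddings takeaway concrete, here is a minimal Python sketch that puts the two kinds of query side by side. Everything in it is an assumption for illustration: the three toy documents, the hand-written 3-dimensional vectors (real embeddings would come from a trained model and have hundreds of dimensions), and the helper names `build_inverted_index`, `boolean_and`, and `nearest_neighbors`. The nearest-neighbor search is a brute-force scan, not one of the approximate tricks people use in production.

```python
import math
from collections import defaultdict

# Toy corpus. Documents and vectors are made up for illustration only.
DOCS = {
    "d1": "cheap red running shoes",
    "d2": "red leather handbag",
    "d3": "affordable crimson sneakers",
}
# Hypothetical document embeddings (hand-written, not from a real model).
DOC_VECTORS = {
    "d1": [0.9, 0.8, 0.1],
    "d2": [0.1, 0.7, 0.9],
    "d3": [0.8, 0.9, 0.2],
}


def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index


def boolean_and(index, words):
    """Return the documents that contain every query word."""
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings) if postings else set()


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_neighbors(vectors, query_vector, k):
    """Brute-force scan: rank every document by similarity to the query."""
    ranked = sorted(
        vectors,
        key=lambda doc_id: cosine(vectors[doc_id], query_vector),
        reverse=True,
    )
    return ranked[:k]


INDEX = build_inverted_index(DOCS)
```

With this toy data, `boolean_and(INDEX, ["red", "shoes"])` matches only `d1`, because `d3` shares no words with the query even though it describes much the same thing. A vector query near `d1`, such as `nearest_neighbors(DOC_VECTORS, [0.9, 0.8, 0.1], 2)`, ranks `d3` right behind `d1`. The catch is that the Boolean query only touches the posting lists for two words, while the brute-force vector query has to score every document, which is where the trade-offs come in.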
Overall, it was great seeing a snapshot of search in practice across a wide variety of domains and use cases. It’s an exciting time to be working in search, and I’m delighted to see how quickly ideas are spreading from research labs to real-world applications through easy-to-use open-source products.