Rebuilding the user typeahead
Devin Finzer, Jiacheng Hong and Kelsey Stemmler | Pinterest engineers, Discovery
A core part of Pinterest is giving people the ability to discover content related to their interests. Often, a Pin is discovered when people share it with one another through features like Group Boards and messages.
To make the experience of finding a person as seamless as possible, we recently rebuilt our user typeahead. The goal was to quickly surface the most relevant contacts for the user given what we know about that person’s social connections and following activity on Pinterest.
The legacy typeahead was often slow and unresponsive, and limited the number of contacts a user could store. Additionally, all of the logic for the typeahead layer resided in our internal API. We set out to not only make the typeahead faster, but to also split it into a separate service that could be deployed and maintained independently of our API and web application. Building our typeahead in line with our service-oriented architecture would improve testability, ease the process of adding and deploying new features, and make our code more maintainable.
Developing the Contact Book signal
To surface the most relevant contacts, we leveraged the Pinterest following graph, and social graphs from Pinners’ social networks, such as Twitter, Facebook, and Gmail. Still, many users don’t link their social networks to Pinterest, so additional signals based on mobile contacts were used.
Asking for permission to access contacts on mobile can be tricky if intentions aren’t clear. Simply showing users a native dialog can result in a confusing user experience and a low acceptance rate. To build a seamless permissions experience, we drew on lessons from other companies, including the finding that the best way to ask for permission is to explain the value of connecting up front, before showing the native permissions dialog.
We also leveraged the Experience Framework for our mobile permissions flow, which allowed us to swap out the text of the flow without changing client code. It also meant we could easily experiment with different flows to further optimize our success rate.
Building the new backend
We built a separate backend service (which we call Contacts Service) to store the data for the new user typeahead. The new system consisted of three major components:
- A modifiable, real-time online “index” for fast access of any user’s contacts from different sources.
- A Thrift server (which we call Contacts Service) exposing interfaces to manage this index. This allows clients to update the index and look up top contacts using prefix-matching.
- A set of PinLater tasks that keeps the index in sync with the contacts sources using the Thrift interface.
We chose HBase as the storage solution for the contacts index because it met all of our requirements:
- Speed: The primary requirement of the typeahead is to look up names quickly. If names are sorted, the lookup is a binary search on the prefix to locate the starting position, followed by a scan until we have enough results. This is a typical HBase scan operation. Our performance data shows that a scan of 20 rows in HBase takes less than two milliseconds, which is fast enough to meet our requirements.
- Scalability: HBase is horizontally scalable. Adding machines to a cluster requires minimal manual intervention and increases throughput linearly.
- Fault tolerance: HBase supports automatic failover. If a box dies, the HBase cluster moves the data to other live servers. The whole process takes a few minutes to finish and requires no change on the client side.
- Writable: To simplify implementation and maintenance, we chose to maintain a single updatable index instead of a two-layer system with a base index and a fresh index. HBase provided us with great write performance.
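The lookup described under Speed can be sketched in a few lines. This is only an illustration over an in-memory sorted list, not the HBase scan itself; the function name `prefix_scan` and the sample names are assumptions made for the example.

```python
from bisect import bisect_left


def prefix_scan(sorted_keys, prefix, limit=20):
    """Return up to `limit` keys that start with `prefix`.

    `sorted_keys` must be sorted ascending, the way row keys are
    ordered in a sorted store.
    """
    # Binary search for the first key that is >= prefix.
    start = bisect_left(sorted_keys, prefix)
    results = []
    # Scan forward until the prefix stops matching or we have enough rows.
    for i in range(start, len(sorted_keys)):
        key = sorted_keys[i]
        if len(results) >= limit or not key.startswith(prefix):
            break
        results.append(key)
    return results


# Hypothetical sample data.
names = sorted(["john smith", "joanna lee", "jon snow", "kelsey s", "johnny b"])
matches = prefix_scan(names, "joh")
```

Because every key sharing a prefix sits in one contiguous run of the sorted order, the cost is one binary search plus a scan bounded by `limit`.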
Supporting millions of contacts for one user
We support and aggregate contacts from various sources for each user in the contacts service. This makes the service extensible when we want to add new sources and flexible when we want to query contacts from specific sources. In most cases, the number of contacts from each source is under two hundred. However, some Pinners have millions of followers, so to tackle this challenge, we used two kinds of schemas to store contacts: wide schema and tall schema.
Wide schema: This is the default schema, which is expected to fit most sources. Contacts in one source for one user are stored together in one row. Each name token is stored as the prefix of a column name. With our own implementation of ColumnPaginationFilter (which provides a feature similar to Scan, but inside one row), we are able to batch the GET requests to all sources in the wide schema into one RPC to do a prefix lookup.
Tall schema: This schema is specifically designed for sources (e.g. Pinterest followers) with a potentially large number of contacts. The wide schema cannot support this use case because data in one row cannot be split across regions in HBase. In the tall schema, for each user we store contacts from one source in nearby rows, where the name token is part of the row key. Then for each source, one Scan request can achieve the prefix lookup among the contacts in this source.
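To make the difference between the two schemas concrete, here is a rough sketch of how keys might be laid out. The post does not give the actual key formats, so the `:` delimiter, the field order, and the function names below are all assumptions for illustration.

```python
def wide_cell(user_id, source, token, contact_id):
    """Wide schema: one row per (user, source); the token prefixes the column name."""
    row_key = f"{user_id}:{source}"
    column = f"{token}:{contact_id}"
    return row_key, column


def tall_row_key(user_id, source, token, contact_id):
    """Tall schema: one row per (user, source, token, contact)."""
    return f"{user_id}:{source}:{token}:{contact_id}"


def tall_scan_range(user_id, source, prefix):
    """Start (inclusive) and stop (exclusive) row keys for a prefix scan."""
    start = f"{user_id}:{source}:{prefix}"
    # Bump the last character to get the smallest key that sorts after
    # every key beginning with `start`.
    stop = start[:-1] + chr(ord(start[-1]) + 1)
    return start, stop


# Hypothetical user 42 searching followers for "jo".
start, stop = tall_scan_range("42", "follower", "jo")
```

In the tall layout, all rows for one user and source are adjacent in key order, so a single range scan covers the prefix lookup even when the rows span several regions.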
Ranking and de-duping contacts
Because we look up contacts from different sources, it’s important to display the most relevant contacts at the top. It would also be annoying to return the same contact multiple times just because it appears in several of a user’s sources. We found that accurate de-duping logic was highly beneficial to the typeahead experience. To improve both areas, we relied on many of our existing services to provide real-time data access, enabling Contacts Service to return de-duped contacts in a predefined ranking order.
- Ranking: The ranking order is based on contact sources and is configurable, so it can easily be replaced with another one. For example, mutual following is a strong signal of relevance, so we always boost mutual-follower contacts to the top.
- De-duping: Contacts Service talks to Friend Service for social network (including Facebook, Twitter, G+) id to Pinterest id lookup, and Data Service (with our own positive and negative caching) for email to Pinterest id lookup. With this information, we can easily tell whether these contacts are actually the same person based on their Pinterest id.
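The two steps above can be sketched as follows. This is a minimal illustration that assumes each contact has already been resolved to a Pinterest id (or `None`); the source names, the priority table, and `rank_and_dedupe` are hypothetical, not the service's actual configuration.

```python
# Hypothetical source priority: a lower number ranks higher.
SOURCE_PRIORITY = {"mutual_follower": 0, "facebook": 1, "twitter": 2, "email": 3}


def rank_and_dedupe(contacts, priority=SOURCE_PRIORITY):
    """Order contacts by source priority and keep one entry per Pinterest id.

    Each contact is a dict with "name", "source", and "pinterest_id"
    (None when the contact has no known Pinterest account).
    """
    # sorted() is stable, so ties keep their original relative order.
    ranked = sorted(contacts, key=lambda c: priority.get(c["source"], len(priority)))
    seen = set()
    result = []
    for contact in ranked:
        pid = contact["pinterest_id"]
        if pid is not None:
            if pid in seen:
                continue  # Same person already returned from a stronger source.
            seen.add(pid)
        result.append(contact)
    return result


contacts = [
    {"name": "John Smith", "source": "facebook", "pinterest_id": 1},
    {"name": "John Smith", "source": "mutual_follower", "pinterest_id": 1},
    {"name": "Jo Lee", "source": "email", "pinterest_id": None},
]
deduped = rank_and_dedupe(contacts)
```

Sorting before de-duping means the surviving copy of a duplicated contact is the one from the highest-ranked source. Swapping in a different `priority` mapping changes the order without touching the logic.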
To store the data for efficient lookups, we tokenized each contact’s name and stored each token with a reference to the original contact. We use the same algorithm to tokenize both the contact’s full name and the query string, which guarantees consistent results. This tokenization also lets us score each match based on the proximity of the terms within the contact name and query string. For example, searching for “John Smith” should yield “John Smith” as a stronger match (higher in the results) than “John Michael Smith.”
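As a rough sketch of the idea, the snippet below tokenizes the name and the query with the same function, then scores a match by how tightly the matched tokens cluster in the name. The scoring formula is an assumption chosen for illustration; the post does not describe the actual one.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def proximity_score(query, name):
    """Score a match; higher means the query terms sit closer together.

    Returns 0.0 when a query token does not prefix-match any name token.
    """
    q_tokens = tokenize(query)
    n_tokens = tokenize(name)
    positions = []
    for q in q_tokens:
        # Position of the first name token that starts with this query token.
        match = next((i for i, n in enumerate(n_tokens) if n.startswith(q)), None)
        if match is None:
            return 0.0
        positions.append(match)
    if not positions:
        return 0.0
    # Number of name tokens spanned by the matches, inclusive.
    span = max(positions) - min(positions) + 1
    return len(q_tokens) / span


exact = proximity_score("John Smith", "John Smith")
spread = proximity_score("John Smith", "John Michael Smith")
```

Here `exact` is greater than `spread`, because the extra middle name widens the span between the matched tokens.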
How to enable real-time results?
We considered running an update job daily, to refresh the index with all of the new connections, but decided real-time results were far more useful. For example, when you connect your Facebook account to your Pinterest account, you should be able to send pins to your Facebook friends almost immediately.
To get real-time results, we hooked updates to the new backend service into every place where a user’s contact information changes. There are a few places this can happen:
- Pinterest name change: When a Pinner changes their name, the change is reflected everywhere they appear as a contact. That way, others can search based on the new name, and the old name will no longer yield this contact as a result. This can get tricky for those with millions of connections for a particular source, though. We wrote a chained PinLater task that uses pagination to help fan out the updates without overloading the system.
- Connecting / updating social network source: When a user connects their Pinterest account to Facebook, Twitter, etc., we want them to be able to send Pins right away, so we create a new PinLater task for this user to look up their new connections and update the backend Contacts Service accordingly. We also hooked these updates into the existing social network refresh tasks.
- Updating follower relationships: When a Pinterest user follows or unfollows another Pinterest user, those changes should be reflected immediately.
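The chained, paginated fan-out used for name changes might look roughly like the sketch below. PinLater's real API is not shown in this post, so a plain in-memory queue stands in for it, and `rename_task`, `run_all`, and `PAGE_SIZE` are hypothetical names.

```python
from collections import deque

PAGE_SIZE = 2  # Tiny for illustration; a real page would be far larger.


def rename_task(queue, index, followers, new_name, offset=0):
    """Update one page of contacts, then enqueue a task for the next page."""
    page = followers[offset:offset + PAGE_SIZE]
    for follower_id in page:
        index[follower_id] = new_name  # Stand-in for a Contacts Service write.
    next_offset = offset + PAGE_SIZE
    if next_offset < len(followers):
        # Chain the next page instead of looping here, so each task stays small.
        queue.append((rename_task, (queue, index, followers, new_name, next_offset)))


def run_all(queue):
    """Drain the queue and return how many tasks ran."""
    ran = 0
    while queue:
        task, args = queue.popleft()
        task(*args)
        ran += 1
    return ran


queue = deque()
index = {}
followers = [1, 2, 3, 4, 5]
queue.append((rename_task, (queue, index, followers, "New Name", 0)))
tasks_run = run_all(queue)
```

Each task does a bounded amount of work and schedules its successor, so a user with millions of connections produces many small tasks rather than one long-running job.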
Getting all that data loaded
Since we wrote a new backend service for the new user typeahead, we had to populate it somehow. The initial set of data was not small by any means.
To upload the initial data, we wrote a task to upload all of a user’s connections for each supported source type. We ran this task for every Pinterest user, which took about three days to complete, even with a dedicated PinLater cluster.
Timing was critical. We needed the real-time update hooks in place before doing the initial upload, so that the corresponding updates would be applied on top of the base set of data. So we first added the real-time update logic, and then uploaded all of the data. Any action thereafter that changed a user’s contacts updated the backend automatically.
Introducing a faster typeahead
The new user typeahead is markedly faster than the original implementation, with a server-side p99 of 25ms. When we A/B tested our new typeahead implementation against the old version, we found that message sends and Pinner interactions increased significantly.
The next improvement is to compute second-order connections to expand the breadth of the typeahead. Stay tuned!
Devin Finzer, Jiacheng Hong, and Kelsey Stemmler are software engineers at Pinterest.
Acknowledgements: Dannie Chu, Xun Liu, Varun Sharma and Eusden Shing