Understanding Pinners through funnel analysis
Changshu Liu | Pinterest engineer, Data
Our number one value as a company is putting Pinners first, and putting that into practice with data requires us to deeply understand how Pinners navigate and use our product. A common analysis pattern for Pinner activity is sequentially analyzing their movement from one logged event to another, a process called funnel analysis. In addition to helping us better understand Pinners, we also use funnel analysis to inform engineering resources. Here we introduce Pinterest Funnels, a tool that enables interactive visual analysis of Pinner activity.
Our funnel analysis platform
We define a funnel as a series of ordered steps within a session of Pinner activity. Each step is made up of at least one action a Pinner takes, like viewing a Pin or sending a message.
For example, one might define a funnel (conceptually) like the following to understand invitation conversion ratio:
‘name’: ‘invitation conversion’,
‘creator’: ‘Changshu Liu’,
‘click invitation’: [‘email_invite_click’, ‘fb_invite_click’, ‘twitter_invite_click’],
‘visit landing page’: [‘langing_page_visit’],
‘registration’: [‘email_reg’, ‘gplus_reg’],
‘experiments’: [‘exp_name1’, ‘exp_name2’]
Funnel definition example
To simplify the task of creating funnels, we provide a web-based funnel composer with auto-completion for predefined action names. This lets non-engineers adopt the funnel system easily and reduces action name typos, which improves user productivity.
After defining a funnel, we use one of the two funnel analyzers (detailed later in this post) to generate results visualized in the web portal, including:
- How many Pinners reached each step defined in the funnel (Pinners at step N have already reached step 1, step 2 … step N-1).
- A segmentation feature that lets us divide the total count into different segments. (We now support six segmentations: gender, app, app version, country, browser and experiment group.)
- Results from different segment combinations.
- The history of the result.
Behind the scenes
There are three main subsystems powering the funnel analysis platform:
- Action sessionization pipeline, which collects the original data sources, annotates them with the proper meta information and groups them into per-user sessions that are then consumed by the following two analyzers.
- Hive funnel analyzer, which consumes the session data using a Hive UDF, generates a session count for each funnel step and each segment combination, and feeds this data into our Pinalytics backend.
- ElasticSearch funnel analyzer, which translates funnel definitions into ElasticSearch query operator trees and runs them against indexed session data to verify that a funnel definition is correct. It can also serve interactive ad-hoc funnel analysis requests.
Action sessionization pipeline
Pinner action data used in the funnel analysis platform comes from three sources. The first two sources are frontend events and backend events, which are logs with predefined schemas. The third source is freeform logs that any developer or team can add to. Each log entry may contain many fields, but here we mainly care about who (a unique_id representing a registered or unregistered Pinner) did what (a string representing the Pinner action) at what time (a standard timestamp).
To make Pinner action data easier for the funnel platform to use, we apply some special handling to the original action names.
- A short prefix is added to each action to avoid any conflicts. For example, we might use _f_ for frontend events and _b_ for backend events.
- We also track where (such as view, component and element info) the action happened for a frontend event. The final action of this kind would look like _f_action@view@component@element.
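The naming scheme above can be sketched in a few lines of Python. This is only an illustration: the `annotate_action` function and the `SOURCE_PREFIX` table are hypothetical names, and only the two example prefixes mentioned above are covered.

```python
from typing import Optional

# Hypothetical prefix table; the post only gives _f_ and _b_ as examples.
SOURCE_PREFIX = {"frontend": "_f_", "backend": "_b_"}


def annotate_action(source: str, action: str,
                    view: Optional[str] = None,
                    component: Optional[str] = None,
                    element: Optional[str] = None) -> str:
    """Prefix an action with its source; frontend actions also get location info."""
    name = SOURCE_PREFIX[source] + action
    if source == "frontend":
        # '@' separates the action from its view, component and element.
        name = "@".join([name, view or "", component or "", element or ""])
    return name


print(annotate_action("frontend", "click", "home", "nav", "invite_button"))
print(annotate_action("backend", "email_reg"))
```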
The three sources are then unioned together, spam filtered, sessionized into a series of Pinner sessions and annotated with experiment/segment information. There are multiple ways to group actions into sessions. In our case, we care most about actions Pinners did within one day or one week. We group daily action and weekly action tables by unique_id. A typical row in the final raw session table looks like:
(uuid_xyz, [‘action_1’, ‘action_2’, ‘action_3’ ...], [‘exp_1’,...], ‘Android’, ‘3.3’, ‘US’, ‘Chrome’, ‘Male’)
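To make the grouping step concrete, here is a minimal Python sketch of daily sessionization. It assumes events arrive as (unique_id, action, Unix timestamp) tuples and that a "day" is a UTC calendar day. Both are assumptions for illustration, and spam filtering and segment annotation are left out.

```python
from collections import defaultdict
from datetime import datetime, timezone


def sessionize_daily(events):
    """Group (unique_id, action, unix_ts) events into per-user, per-day action lists."""
    sessions = defaultdict(list)
    # Sort by timestamp so each session keeps its actions in the order they happened.
    for unique_id, action, ts in sorted(events, key=lambda e: e[2]):
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
        sessions[(unique_id, day)].append(action)
    return dict(sessions)


events = [
    ("uuid_xyz", "action_2", 60),
    ("uuid_xyz", "action_1", 0),
    ("uuid_xyz", "action_3", 86400 + 10),  # falls on the next UTC day
]
for key, actions in sessionize_daily(events).items():
    print(key, actions)
```

A weekly table could be built the same way by keying on the ISO week instead of the date.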
After the raw session tables are generated, the analyzer logic is applied to them. We have two analyzers for different scenarios: a Hive-based offline analyzer and an ElasticSearch-based online analyzer.
Hive funnel analyzer
In the Hive funnel analyzer, we built a Hive UDF to process the session table. For each Pinner session (a row in the raw session table), the UDF matches it against all funnels defined in our funnel repository and generates the total number of sessions that match the step actions defined in each funnel.
For example, suppose we have a Pinner session that looks like this:
(‘uuid_1234’, [‘fb_invite_click’, ‘action_1’, ‘langing_page_visit’, ‘action_2’], [‘exp_1’], ‘US’, ‘iPhone’, ‘5.0’, ‘Female’, ‘Safari’ )
The UDF will generate the following records after processing this session against the funnel we defined in the beginning:
(‘invitation_conversion.step_1’, ‘exp_1’, ‘US’, ‘iPhone’, ‘5.0’, ‘Female’, ‘Safari’, 1)
(‘invitation_conversion.step_2’, ‘exp_1’, ‘US’, ‘iPhone’, ‘5.0’, ‘Female’, ‘Safari’, 1)
As you can see, there’s no record for step three, since the given Pinner session doesn’t contain any of the actions defined in step three of the invitation conversion funnel.
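The matching logic in this example can be sketched in plain Python. This is not the Hive UDF itself: the function names are hypothetical, and the sketch assumes a step counts as reached when one of its actions appears after the action that matched the previous step. The funnel's experiment list is ignored for brevity.

```python
def steps_reached(funnel_steps, actions):
    """Return how many funnel steps, taken in order, the session's actions reach."""
    reached = 0
    position = 0
    for step_actions in funnel_steps:
        wanted = set(step_actions)
        # Scan forward for the earliest action belonging to this step.
        while position < len(actions) and actions[position] not in wanted:
            position += 1
        if position == len(actions):
            break  # this step was never reached, so later steps cannot be either
        reached += 1
        position += 1
    return reached


def emit_records(funnel_name, funnel_steps, session):
    """Emit one (step, experiment, segments..., 1) record per reached step."""
    unique_id, actions, experiments, *segments = session
    n = steps_reached(funnel_steps, actions)
    return [
        (f"{funnel_name}.step_{i}", exp, *segments, 1)
        for i in range(1, n + 1)
        for exp in experiments
    ]


steps = [
    ["email_invite_click", "fb_invite_click", "twitter_invite_click"],
    ["langing_page_visit"],
    ["email_reg", "gplus_reg"],
]
session = ("uuid_1234",
           ["fb_invite_click", "action_1", "langing_page_visit", "action_2"],
           ["exp_1"], "US", "iPhone", "5.0", "Female", "Safari")
for record in emit_records("invitation_conversion", steps, session):
    print(record)
```

Run against the session above, the sketch emits records for step one and step two only, mirroring the two records shown earlier.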
Next, these session counts are summed up, giving us the total session count for each step in each funnel for each segment combination. Most likely, though, a user only cares about the overall total or the sum over some segments. For instance, a user might want to know the session count for steps one and two of the invitation conversion funnel in the U.S. only. We use Pinalytics to implement on-the-fly “rolling up” functionality using an HBase coprocessor. This eliminates the need to pre-compute rolled-up numbers, which would cost a huge amount of space given our segment cardinality, and it makes backfilling relatively easy.
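The idea behind rolling up can be illustrated with a small in-memory Python sketch. It does not use HBase or Pinalytics. The `roll_up` function and the `SEGMENT_FIELDS` names are hypothetical, and the field order is assumed to follow the example records above.

```python
# Assumed order of the segment columns in each record (illustrative names).
SEGMENT_FIELDS = ("experiment", "country", "app", "app_version", "gender", "browser")


def roll_up(records, step, **filters):
    """Sum counts for a step, pinning only the segments named in filters."""
    total = 0
    for rec_step, *segments, count in records:
        if rec_step != step:
            continue
        seg = dict(zip(SEGMENT_FIELDS, segments))
        # Segments not mentioned in filters are summed over ("rolled up").
        if all(seg[name] == value for name, value in filters.items()):
            total += count
    return total


records = [
    ("invitation_conversion.step_1", "exp_1", "US", "iPhone", "5.0", "Female", "Safari", 2),
    ("invitation_conversion.step_1", "exp_1", "CA", "Android", "3.3", "Male", "Chrome", 5),
    ("invitation_conversion.step_2", "exp_1", "US", "iPhone", "5.0", "Female", "Safari", 1),
]
print(roll_up(records, "invitation_conversion.step_1"))
print(roll_up(records, "invitation_conversion.step_1", country="US"))
```

Because only the finest-grained counts are stored, any combination of segments can be aggregated at query time rather than being materialized in advance.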
As mentioned previously, we include view, component and element in frontend event actions. Sometimes users want to build a funnel based on a pattern of such actions. For example, a user might want to know how many Pinners did a ‘click’ action on an ‘invite_button’ element, no matter what the view or component is. We provide a special ‘pattern matching’ action syntax to express this: _f_click@*@*@invite_button. The two ‘*’ characters mean the view and component attributes can take any value, and the ‘@’ character is used as a field separator.
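A minimal Python sketch of this matching rule follows. The `matches_pattern` helper is a hypothetical name, and the sketch assumes ‘*’ stands for exactly one whole field rather than a partial string.

```python
def matches_pattern(pattern: str, action: str) -> bool:
    """Check an action against a pattern whose '*' fields match any value."""
    pattern_fields = pattern.split("@")
    action_fields = action.split("@")
    if len(pattern_fields) != len(action_fields):
        return False
    # Every field must be equal unless the pattern field is the '*' wildcard.
    return all(p == "*" or p == a for p, a in zip(pattern_fields, action_fields))


print(matches_pattern("_f_click@*@*@invite_button", "_f_click@home@nav@invite_button"))
print(matches_pattern("_f_click@*@*@invite_button", "_f_view@home@nav@invite_button"))
```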
ElasticSearch funnel analyzer
If you’re familiar with how a search engine works, you might have noticed the session/funnel matching logic is very similar to how a typical search engine matches documents against a query operator tree. If we model a Pinner session as a search document, each action as a positioned term and segments as document fields in ElasticSearch, then the relationship among actions in the same step of a funnel can be expressed using an OR operator, and the relationship between consecutive steps can be represented using an APPEAR AFTER operator (a NEAR operator with the in_order property set to true in ElasticSearch).
For instance, the funnel definition from the example at the beginning of this post can be translated into three query operator trees, one per step.
The returned doc count for these queries will be the session count of each step in the funnel definition.
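As a rough sketch of such a translation, the Python function below builds ElasticSearch span queries (`span_term`, `span_or`, `span_near`) as plain dictionaries. The field name `actions`, the `slop` value and the `step_query` helper are assumptions made for illustration; the post does not show the actual queries.

```python
import json


def step_query(funnel_steps, step_number, field="actions", slop=1000):
    """Build a span query matching sessions that reach the given funnel step."""

    def any_of(step_actions):
        # OR across the actions that make up a single step.
        return {"span_or": {"clauses": [{"span_term": {field: a}} for a in step_actions]}}

    clauses = [any_of(step) for step in funnel_steps[:step_number]]
    if len(clauses) == 1:
        return clauses[0]
    # APPEAR AFTER: each step must occur after the previous one, in order.
    return {"span_near": {"clauses": clauses, "slop": slop, "in_order": True}}


steps = [
    ["email_invite_click", "fb_invite_click", "twitter_invite_click"],
    ["langing_page_visit"],
    ["email_reg", "gplus_reg"],
]
print(json.dumps(step_query(steps, 2), indent=2))
```

The `slop` parameter bounds how many unrelated actions may sit between matched steps, so a real deployment would need a value large enough for its longest sessions.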
To support the “pattern matching” action syntax, such as _f_click@*@*@invite_button, in the ElasticSearch funnel analyzer, we use a technique from the search community called “query expansion.” We pre-build a trie from the concrete action dictionary and use it to expand the ‘pattern matching’ action into a list of concrete actions at query time. If there are too many expanded concrete actions, we weight them according to term frequency and choose the top K actions, where K is a configurable parameter.
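A trie-based expansion of this kind can be sketched in Python as follows. The sketch assumes the trie is keyed by ‘@’-separated fields and that term frequencies are known up front. The function names, action names and frequencies are all made up for illustration.

```python
def build_trie(action_frequencies):
    """Build a trie keyed by '@'-separated fields; leaves store term frequency."""
    root = {}
    for action, freq in action_frequencies.items():
        node = root
        for field in action.split("@"):
            node = node.setdefault(field, {})
        node[None] = freq  # the None key marks the end of a concrete action
    return root


def expand(trie, pattern, top_k):
    """Expand a pattern into at most top_k concrete actions, most frequent first."""
    fields = pattern.split("@")
    matches = []

    def walk(node, depth, prefix):
        if depth == len(fields):
            if None in node:
                matches.append(("@".join(prefix), node[None]))
            return
        want = fields[depth]
        for key, child in node.items():
            # '*' follows every branch; a literal field follows only its own branch.
            if key is not None and (want == "*" or want == key):
                walk(child, depth + 1, prefix + [key])

    walk(trie, 0, [])
    matches.sort(key=lambda m: (-m[1], m[0]))
    return [action for action, _ in matches[:top_k]]


# Hypothetical action dictionary with made-up frequencies.
dictionary = {
    "_f_click@home@nav@invite_button": 50,
    "_f_click@profile@header@invite_button": 30,
    "_f_click@board@footer@invite_button": 10,
    "_f_click@home@nav@follow_button": 99,
}
print(expand(build_trie(dictionary), "_f_click@*@*@invite_button", top_k=2))
```

Truncating to the top K actions is what introduces the approximation: sessions that only contain a dropped, low-frequency action would not be counted.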
As an example, _f_click@*@*@invite_button might be expanded to the following four concrete actions according to our action dictionary:
As you can see, the two analyzers have different characteristics.
- The Hive analyzer is slower than the ElasticSearch analyzer. If the funnel definition changes, we need to rerun Hive queries to update the results. However, it covers more historical session data, since there’s no need to load the data into an ElasticSearch cluster. The result is also more accurate, because there’s no approximation logic.
- The ElasticSearch analyzer is super fast (at the sub-second level), but it covers less data because we need to index the session table into an ElasticSearch cluster, which has limited capacity. It’s less accurate in some cases, since we ignore some expanded actions if there are too many candidate actions.
In practice, we use ElasticSearch for funnel preview, which helps verify that a funnel definition is what we want before materializing the definition into the funnel repo. We also use it for ad-hoc funnel analysis: interactive, on-demand analysis of the recent sessions available in ElasticSearch.
The process above enables our team to look at navigation along a predefined path. We’re currently considering how to allow easy comparisons across multiple funnel paths. To extend the example above, we might compare data between Pinners who sign up via different types of invite emails. We’ve also had to make tradeoffs, because of the data size, about what’s available in the ElasticSearch index. Ideally, we would give users greater flexibility to navigate live funnels rather than wait for the MapReduce funnel analyzer job to populate their data.
Acknowledgements: The funnel analysis project is a collaboration between the Data team and the Growth team. Thanks to Ludo Antonov and Dannie Chu on the Growth team for feature suggestions and implementation discussions and for the initial efforts on funnel analysis, as well as Suman Jandhyala, Shuo Xiang and Jeff Ferris on the Data team.