<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Upside Engineering Blog - Medium]]></title>
        <description><![CDATA[At Upside Engineering, we&#39;re building one of the world&#39;s most complex data systems powered by cutting-edge tech. Hear about our tips, tricks and challenges using AWS, Docker, Kubernetes, Node, Go, React, Redux, Swift, Kotlin and more. - Medium]]></description>
        <link>https://medium.com/upside-engineering?source=rss----d98d3544d2e9---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>Upside Engineering Blog - Medium</title>
            <link>https://medium.com/upside-engineering?source=rss----d98d3544d2e9---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Mon, 11 May 2026 16:51:24 GMT</lastBuildDate>
        <atom:link href="https://medium.com/feed/upside-engineering" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Five Key Security Considerations for Your Remote Workforce]]></title>
            <link>https://medium.com/upside-engineering/five-key-security-considerations-for-your-remote-workforce-83d3029431ce?source=rss----d98d3544d2e9---4</link>
            <guid isPermaLink="false">https://medium.com/p/83d3029431ce</guid>
            <category><![CDATA[remote-work]]></category>
            <category><![CDATA[engineering-mangement]]></category>
            <category><![CDATA[startup-lessons]]></category>
            <category><![CDATA[teamwork]]></category>
            <category><![CDATA[security]]></category>
            <dc:creator><![CDATA[Kris French]]></dc:creator>
            <pubDate>Thu, 09 Apr 2020 15:00:44 GMT</pubDate>
            <atom:updated>2020-04-09T15:00:44.329Z</atom:updated>
            <content:encoded><![CDATA[<h3>5 Key Security Considerations for Your Remote Workforce</h3><p>If you’re anything like us here at <a href="https://upside.com?utm_source=blog&amp;utm_medium=medium-engineering-blog&amp;utm_campaign=engineeringblog&amp;utm_content=remotesecurity">Upside</a>, this is the first time your organization has had such a large percentage of its workforce working remotely. Securing remote teams and (newly) remote assets comes with its own distinct set of challenges. Doing it during a pandemic — when most of the rest of the world is doing the same — brings even more.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/0*A0n5AGFJyHE49ynC" /></figure><p>In the spirit of getting through this crisis as a community, here are five things you and your security team should be considering right now.</p><p><strong>1) Make sure everyone knows where to go if they have problems.</strong></p><p>This was always important, but with a highly remote workforce, it’s essential. With most of your organization now remote, no one’s going to run into you at the coffee machine and let you know about that weird email they got. You’re not going to hear through the grapevine that Jake in Accounting downloaded some weird browser plug-in. And worst of all, if someone accidentally clicks on something and later realizes they shouldn’t have, then embarrassment, worry, and confusion could keep them from reaching out through an official channel.</p><p>To remedy this, you need to apply a foundational security principle in a new way: availability. <em>Be available</em>.</p><ul><li>Know which communication platforms your organization uses and make sure that the security team is easily reachable on all of them.</li><li>Make yourself visible.</li><li>Post reminders.</li><li>Respond quickly.</li></ul><p>Here at Upside, we use <a href="https://slack.com/l">Slack</a> for almost everything. To make sure we’re available, my team has created a security channel where we post security news relevant to everyone’s professional and personal lives, retell security stories, answer questions promptly from everyone in our organization, and, most importantly, <em>celebrate</em> members of our organization who bring us things that look weird to them — even if the reports turn out to be nothing at all. We’re also available by email and a number of other methods. All of these communication channels can be found on our internal wiki page and we post frequent reminders to highly populated channels.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/566/1*G-f_LZPTxbSiMKN-Dz8hcA.png" /></figure><p><strong>2) Reduce friction caused by security.</strong></p><p>“Security that’s hard to use gets turned off” is a mantra that’s served me well over the years. It’s held true in every security domain and environment I’ve ever worked in. If your security controls get in the way of work getting done, they’ll be bypassed or removed sooner rather than later. This is compounded by the first point — if people don’t know how to reach you, they’re just going to bypass whatever’s causing problems instead of reaching out for a solution. This is more relevant than ever, as the increased personal and professional stresses each member of your organization is feeling during this pandemic are only going to lower their tolerance for friction.</p><p>Security is a customer service job. Your customers are the people in your organization. If you’re not working in their best interest, you’re not doing your job properly. 
And working in their best interest means more than just providing the maximum security possible. It means finding solutions that work for your teammates — ones that don’t hinder employees’ work while still providing them with solid security. With every step you take on your security journey, you need to consider how it will affect your customers.</p><p>Whenever my team is planning a new project, we always consider what issues our customers (the rest of our team at Upside) could run into as a result of our actions. These friction points are always evaluated to see if they can be smoothed or removed entirely. We take a hard look at whether the security we’re adding is worth the friction that comes along with it. Both sides of the equation are important to a strong security program.</p><p><strong>3) Secure your remote communication platforms.</strong></p><p>In the last few weeks, it’s likely that remote communications platforms in your organization went from a nice-to-have supplementary tool to a mission critical tentpole. This is the case all over the world and with this explosion in popularity has come an equal expansion in scrutiny of those platforms. New vulnerabilities and new attack vectors are in the news every day. They’ve very quickly become a lucrative target in your infrastructure where they might have previously been ignored.</p><p>These platforms have undergone an unprecedented shift in risk. When a shift in risk happens to an asset in your infrastructure — especially at this speed and magnitude — you need to be ready and able to act. First in your ability to detect that shift in risk, and second in your ability to adjust security controls to compensate. In this instance, it means shifting resources to improve the security of these communication platforms which have suddenly become critical to business operations.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*WaemTQ_fu0iYYuU0mw2vmw.png" /></figure><p>In the last few weeks, the members of my security team have practically become experts on the inner workings and security features of our messaging platforms. We’ve worked to harden this now-essential piece of our business and we’ve done it while keeping the first two points in mind — being available and keeping friction low. But we were only able to do this because we have a robust understanding of our internal landscape and we were willing to make the hard choice to shift resources away from some long-planned projects to adapt to the new threat.</p><p><strong>4) Shore up inventorying and central management.</strong></p><p>Any introductory security text will tell you “you can’t secure what you don’t know exists.” Now, with your organization’s endpoints scattered to the four winds, this maxim holds particular relevance. The best time to have taken a robust inventory of your assets was as you acquired them. The second best time is now.</p><p>One of the keys to being able to adapt to risk level changes in your environment is to have a strong familiarity with it. In the example above, if you weren’t aware of the remote communications platforms your organization was leveraging, you wouldn’t be able to appropriately apply hardening measures and stay on top of new security information affecting your assets. 
Without this familiarity, it becomes more likely attackers will find some forgotten, under-protected corner in which to establish a foothold.</p><p>My team works to solve this by keeping a living inventory of both the hardware <em>and</em> the software we rely on. We do this with little in the way of expensive tooling, instead relying on a strong internal communication network and easy-to-use processes for the whole organization. It’s imperfect, but I’ve never seen an asset management program that was 100% accurate all of the time, and this one has got it where it counts. This up-to-date reckoning of our environment is one of the most important tools in our toolbox when it comes to our ability to respond to new and shifting threats.</p><p><strong>5) Stay on top of the news and keep communicating.</strong></p><p>Staying informed is a huge part of any security professional’s job. Threats emerge by the minute. Landscapes change. New exploits are published. Pandemic statistics and airport closures probably weren’t sources of information for you before this. I know they weren’t for me. But we adapt. We open our view to include new information that might be used to keep our organizations safe.</p><p>As with asset management, I’m of the belief that expensive tooling isn’t needed to stay competently on top of the news, but that you should find what works for you. Information gathering is probably one of the most flexible and personalizable parts of any security program. Find the sources that produce information that’s useful to you, in a format that’s useful to you, and then put the time in to actually consume it!</p><p>Personally, almost every piece of urgent threat information I’ve ever acted on in my career has come from the community, not from threat feeds. In fact, most times, the threat feeds (even the expensive ones) have been several (or more) hours behind the community when it comes to this information. More important than the gathering is how we react to the information. If a threat is serious enough, my team posts in a company-wide channel a brief summary of the issue (with references), what our users can expect to see from us in the way of a response, and how to get in contact with us with any questions or comments (even though they should already have that information). We also keep everyone up-to-date as the situation progresses so that they’re not left wondering.</p><p>What I’d like you to take from this is that what makes a security program truly successful has very little to do with technical security controls, which tools you choose, or which standard you follow. Security is a people job, and if you want to be truly successful you need to understand your environment, your business, your team, and remember to communicate.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=83d3029431ce" width="1" height="1" alt=""><hr><p><a href="https://medium.com/upside-engineering/five-key-security-considerations-for-your-remote-workforce-83d3029431ce">Five Key Security Considerations for Your Remote Workforce</a> was originally published in <a href="https://medium.com/upside-engineering">Upside Engineering Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building better machine learning feature validation using Pydantic]]></title>
            <link>https://medium.com/upside-engineering/building-better-machine-learning-feature-validation-using-pydantic-2fc99990faf0?source=rss----d98d3544d2e9---4</link>
            <guid isPermaLink="false">https://medium.com/p/2fc99990faf0</guid>
            <category><![CDATA[startup]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[validation]]></category>
            <category><![CDATA[feature-engineering]]></category>
            <category><![CDATA[python]]></category>
            <dc:creator><![CDATA[Ian Cassidy]]></dc:creator>
            <pubDate>Tue, 21 Jan 2020 14:42:29 GMT</pubDate>
            <atom:updated>2020-01-21T14:42:29.116Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/458/1*XHEwfffrM-VZhVAjkOBBUw.png" /></figure><p>In the 2+ years that I’ve been at <a href="https://upside.com/">Upside</a>, I’ve continually struggled with productionalizing machine learning models. Like most data scientists, I prefer to perform feature engineering and develop my models in a <a href="https://jupyter.org/">Jupyter</a> Notebook while relying heavily on <a href="https://pandas.pydata.org/">pandas</a> and <a href="https://scikit-learn.org/stable/">scikit-learn</a>. Depending on the application, the features that I generate could come from multiple external APIs and have multiple preprocessing steps. Rarely am I thinking, “How am I going to re-create this feature in a production environment and check to make sure that the value I’ve produced isn’t totally nonsensical?”</p><p>When building machine learning APIs to serve predictions, I’ve typically defined some variant of an <a href="https://docs.python.org/2/library/collections.html#collections.OrderedDict">OrderedDict</a>¹ to keep track of feature names and values during feature engineering. I refer to this object as a “features container” and use it to eventually feed the values in the form of a <a href="https://numpy.org/">NumPy</a> array to the .predict method of a pre-trained machine learning model. This is less than ideal for many reasons and has driven the Python engineers that I work with crazy on several occasions. Usually, I just wave my hands and say, “Look I have a <a href="https://docs.python.org/3/library/unittest.html">unit test</a> for the output of my feature engineering/model prediction pipeline and that is sufficient to say that everything is working as expected.”</p><p><strong>My statement above isn’t entirely wrong, but we as data scientists can do better!</strong></p><p>Recently, I’ve started to really embrace the concept of typed variables in Python — even when I’m developing models in a Jupyter Notebook! <a href="https://docs.python.org/3/library/typing.html#typing.NamedTuple">NamedTuples</a> and <a href="https://docs.python.org/3/library/dataclasses.html">dataclasses</a> (available in Python 3.7+) are two of my favorite things to use when defining the inputs and outputs of various functions that I write. In the past few weeks, I completely re-wrote a backend service that I built about a year ago to <a href="https://labs.upside.com/delay">predict flight delays</a> using typed variables. But, when I got to the part of my code that was creating and storing features, something was still bugging me. I converted my features container from an OrderedDict to a dataclass², but the process of typing out every field and datatype seemed overly tedious and verbose, especially if I ever need to go back and change anything about my model. It also lacked the ability to type check and validate the value assigned to the field out of the box.</p><p>Thankfully we can use <a href="https://pydantic-docs.helpmanual.io/">Pydantic</a>³ and some simple helper functions to alleviate the concerns I mentioned above. The <a href="https://engineering.upside.com/upside-labs-one-year-in-9ebb90f71815">Upside Labs</a> team has been using Pydantic for several months now to type check and validate data models in several backend services and it’s been another extremely useful library. 
Let’s work through a simple example of how we can use it to create a more robust features container.</p><h3><strong>Example</strong></h3><p>Let’s say I’ve trained a model using the dataframe below (it’s probably not a very good model). The features cat_1 and cat_2 are <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html">one-hot-encoded</a> categorical features while dog and bird are continuous features.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/192/1*xCR_ZgzpT7p7G2CyXCAxHQ.png" /></figure><p>The code snippet below shows how we can represent this dataframe as a features container using Pydantic.</p><p><a href="https://medium.com/media/e3dd273b1a78f2b79b3fc6689c3c12ef/href">https://medium.com/media/e3dd273b1a78f2b79b3fc6689c3c12ef/href</a></p><p>I’m not going to walk through every line of code above, but here are some of the main points I’d like to highlight:</p><ol><li>The Pydantic conint and confloat wrapper functions are used to create constrained integer and constrained float fields, respectively, using the inclusive bounds specified by the ge (greater than or equal to) and le (less than or equal to) parameters. If we attempt to assign a value to a field that does not match the defined datatype or is outside the specified bounds, a Pydantic ValidationError will be raised. The bound values I’ve chosen come from inspecting the original dataframe.</li><li>The fields are all defined as <a href="https://docs.python.org/3/library/typing.html#typing.Optional">T.Optional</a> with a default value of None so that the features container object can be instantiated and then passed from one function to another, and feature value(s) can be assigned sequentially⁴.</li><li>The set_categorical_features helper method can be used to one-hot-encode a set of categorical features all at once! All you need to do is specify the feature prefix and positive_category (see the code snippet below for an example of this in action). I often have categorical features with tens or hundreds of categories, so this method has been extremely handy.</li><li>The set_bulk_features helper method can also be used to set many feature values at once. Just pass it a mapping dictionary of field names and values and the method will take care of the rest (including the type checking and validation).</li><li>The numpy_array property can be used to create a NumPy array with the correct shape and order of field values specified in the original training dataframe. It has a check to make sure that all of the values have been assigned and will raise a custom FeatureIsNoneError if any of the fields are None. This property would be used to pass the feature data to the .predict (or .predict_proba) method of a pre-trained machine learning model.</li></ol><p>Next, we can write a simple features container and model prediction pipeline using the AnimalFeatures container we defined above.</p><p><a href="https://medium.com/media/c062d8f098bb6fc90d3df957447b68a8/href">https://medium.com/media/c062d8f098bb6fc90d3df957447b68a8/href</a></p><p>To protect against predictions made with erroneous feature values, I’ve added some try/except logic to catch any validation or missing feature errors, log a warning, and return a default prediction.</p>
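<p>For reference, here is a minimal sketch of the pattern (Pydantic v1-style syntax): a features container with constrained, optional fields and a guarded prediction step. The bounds below are illustrative values rather than the ones in the original snippets, and the set_categorical_features and set_bulk_features helpers are omitted for brevity.</p><pre>
import typing as T

import numpy as np
from pydantic import BaseModel, ValidationError, confloat, conint


class FeatureIsNoneError(Exception):
    """Raised when a feature was never assigned before prediction."""


class AnimalFeatures(BaseModel):
    """Features container mirroring the training dataframe (illustrative bounds)."""

    class Config:
        validate_assignment = True  # re-validate every time a field is assigned

    # one-hot-encoded categorical features, constrained to 0 or 1
    cat_1: T.Optional[conint(ge=0, le=1)] = None
    cat_2: T.Optional[conint(ge=0, le=1)] = None
    # continuous features, with bounds taken from inspecting the training data
    dog: T.Optional[confloat(ge=0.0, le=100.0)] = None
    bird: T.Optional[confloat(ge=0.0, le=10.0)] = None

    @property
    def numpy_array(self) -> np.ndarray:
        """Feature values in declaration order, ready for model.predict."""
        values = [getattr(self, name) for name in self.__fields__]
        if any(value is None for value in values):
            raise FeatureIsNoneError("not all features have been assigned")
        return np.array(values).reshape(1, -1)


def build_features_and_predict(model, raw: dict, default: float = 0.0) -> float:
    """Assign features one by one, then predict; fall back if anything looks wrong."""
    features = AnimalFeatures()
    try:
        features.cat_1 = 1 if raw.get("category") == "cat_1" else 0
        features.cat_2 = 1 if raw.get("category") == "cat_2" else 0
        features.dog = raw.get("dog")
        features.bird = raw.get("bird")
        return float(model.predict(features.numpy_array)[0])
    except (ValidationError, FeatureIsNoneError):
        # log a warning here; a bad feature value should not take the service down
        return default
</pre><p>Setting validate_assignment in the model config is what makes each individual assignment raise a ValidationError immediately, rather than deferring all checks to construction time.</p>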
<p>If this happens, I don’t want the code to blow up; I just don’t want the model to output a misleading prediction. Similar logic exists in the production services that I’ve recently worked on and it has been immensely helpful in finding bugs and features that aren’t working as expected.</p><p>Finally, we can convert the original dataframe to a features container using a simple script to automatically generate the field names, datatypes, and bounds by inspecting the values in each column of the data. I am constantly tweaking my models and don’t want to have to worry about how adding or moving around columns in my training dataframe will affect the production feature pipeline. The output of the function below can be used to produce the code that would become the features container object — the AnimalFeatures class defined in the above snippet was actually generated using this script.</p><p><a href="https://medium.com/media/0b0e8aa192e270d94bac755859a0758c/href">https://medium.com/media/0b0e8aa192e270d94bac755859a0758c/href</a></p><h3><strong>Closing</strong></h3><p>I would love some feedback from other data scientists, machine learning engineers, or anyone else about whether you think the above concepts might be useful for your own applications. Thinking through the process of taking a machine learning model from a Jupyter Notebook into a production microservice has been a great learning experience for me and I hope the tips and tricks outlined in this article are valuable for other developers.</p><p>Oh… and we’re hiring — check out our <a href="https://upside.com/company/careers?utm_source=blog&amp;utm_medium=medium-engineering-blog&amp;utm_campaign=engineering-blog&amp;utm_content=labs">careers page</a>!</p><p><strong>Footnotes:</strong></p><p>[1]: Using an ordered dictionary instead of just a standard dictionary has the advantage of returning the values of the dictionary in a <em>specific order</em>. This is critical in machine learning applications because the model expects the features in the order that was used during training.</p><p>[2]: I initially used a dataclass instead of a NamedTuple for my features container because dataclasses are mutable. This allowed me to pass the container object through a series of functions that assign values to the various fields I’ve defined.</p><p>[3]: Huge shout out to <a href="https://www.linkedin.com/in/ramichowdhury/">Rami Chowdhury</a> for introducing me to Pydantic and for helping me develop the code in this blog post!</p><p>[4]: This is not meant to be prescriptive, but is merely how I generally write production feature pipelines. Feel free to adapt any of this logic to fit your own needs.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=2fc99990faf0" width="1" height="1" alt=""><hr><p><a href="https://medium.com/upside-engineering/building-better-machine-learning-feature-validation-using-pydantic-2fc99990faf0">Building better machine learning feature validation using Pydantic</a> was originally published in <a href="https://medium.com/upside-engineering">Upside Engineering Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Playbook for practical product experimentation]]></title>
            <link>https://medium.com/upside-engineering/playbook-for-practical-product-experimentation-ee98e5e31860?source=rss----d98d3544d2e9---4</link>
            <guid isPermaLink="false">https://medium.com/p/ee98e5e31860</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[testing]]></category>
            <category><![CDATA[product-management]]></category>
            <category><![CDATA[startup]]></category>
            <category><![CDATA[experiment]]></category>
            <dc:creator><![CDATA[Chris Poirier]]></dc:creator>
            <pubDate>Fri, 20 Dec 2019 14:06:07 GMT</pubDate>
            <atom:updated>2019-12-20T15:19:16.864Z</atom:updated>
            <content:encoded><![CDATA[<h3><strong>A Playbook for Practical Product Experimentation</strong></h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*X7ByTTdINQFkKdzaY1fdPw.jpeg" /></figure><p><strong>Building an experimental engine inside your product.</strong></p><p>In our <a href="https://engineering.upside.com/building-a-culture-of-experimentation-875bb3ae183c">last article</a> on bringing a culture of experimentation to your startup, we talked about some techniques and strategies for using an iterative, experiment-centric approach to home in on product-market fit. Importantly, we discussed the process for hypothesis-driven development and the need to trust the process to hunt signal and optimize learning rate.</p><p>Here, we fill in the details of what that process looks like. We discuss the practical aspects of a culture of experimentation and offer some advice on when and when not to run an experiment, the considerations of how to set up the experiment for optimal data collection and customer experience, as well as how to select the audience and interpret the results.</p><p><strong>So, why might you want to run an experiment?</strong></p><p>There are many reasons why you might want to run an experiment. You have a question or curiosity, you believe something can be improved, or perhaps you think you can just make an aspect of your customer’s experience better. Regardless of the reason, it’s important to ground yourself in the hypothesis-driven mindset to set yourself up for success and maximize your learning opportunity. In order for you to formulate your hypothesis, it’s important for you to do your homework and learn as much about the problem space as possible. The ultimate goal of any experiment is to <strong><em>learn </em></strong>in a <strong><em>controlled</em> </strong>manner.</p><p><strong>When is a good time to run an experiment?</strong></p><p>Ideally, you’re down the ideation path and have already spent a lot of time in the problem space. Now you find yourself trying to figure out what levers your product has and how moving them impacts the customer. Experimentation is a great way to uncover those relationships. In the past, we’ve used experimentation to test simple things like how headline changes impact conversion rate or how tooltips increase sign-up rate, to far more complicated things like how using a personalized machine learning algorithm decreases the time it takes a user to select inventory. Note that in many of these cases, we’re testing things head to head.</p><p>Experimentation is commonly used to compare two variations of something to see which one performs better. This is called A/B testing and, as a formal method, dates back to the 1920s in agriculture, pitting two seed varieties against one another to measure yield. It was adapted to medicine in the 1950s and eventually to advertising mailers in the 1960s and 70s.
Today it forms the basis of optimization in e-commerce and product development.</p><p>A key consideration is whether it makes sense to run a test in parallel or if complications of the environment force you to run a test longitudinally (for example, with one treatment for a period of time followed by another). This approach makes sense when you need to reach entire populations or the treatment is so drastically different that it makes sense to completely switch over for a period of time.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nHgLvWxIjlSuIG-IiAlgcg.jpeg" /></figure><p><strong>When might you <em>not</em> want to run an experiment?</strong></p><p>In many cases, running an experiment just for the sake of an experiment is a waste of resources. If the change you’re proposing is net positive (e.g. adding more inventory), we’d be more inclined to say just do it. Likewise, if you want to experiment in an area of the product that doesn’t get a lot of traffic, consider different ways to get that feedback, like calling customers, issuing surveys, or even user testing. If the effect size (i.e. impact of change and number of users exposed) is small or the risk is low, it often makes sense to just do it. Remember, we’re using experimentation as an efficiency play in learning — often there are better ways to learn than investing the time and effort to conduct a meaningful test.</p><p><strong>What are some of the considerations when running an experiment?</strong></p><p>We use the following mental template as a way to get started:</p><ul><li>Why are you running this test?</li><li>What are you going to test?</li><li>Where are you going to run it?</li><li>Who is the audience for the test?</li><li>How are you going to measure it?</li><li>When are you finished?</li></ul><p>To put this into practice, let’s take an example that we’ll work through for the rest of this post — say we’re contemplating putting "strike-throughs" on our prices as a way to show people that we offer discounts.</p><p><strong>The Why.</strong></p><p>The reason we’re running this experiment is that we want to understand whether there is any relationship between telling our customers they are saving money and the rate at which they purchase discounted goods. To run a good experiment in this case, we need to understand a few things:</p><p>1) How many people visit our site and see the prices we’re displaying?</p><p>2) How often do we have a price that we’re able to mark down and discount?</p><p>3) When we’re able to offer a discount, how big is it?</p><p>4) Are there historical purchase patterns where people are more likely to convert on discounted inventory even if the discount is not explicit?</p><p>5) Are we tracking what we’re displaying to the customer and the eventual success outcome?</p><p>What we’ve done in the above exercise is to understand the impact of the experiment, what you know about the environment you’re going to experiment in, how much you might be able to move the needle, and how often your experiment might even be seen.</p><p><strong>The What.</strong></p><p>In the above case, we want to test the impact of adding strike-throughs to our prices when a discount is available. It’s best to be more specific and isolate the one thing you care about learning. A goal in experimentation is purity of signal, and when possible you don’t want to confound your experimental results with multiple effects.
Try and break down the hypothesis into a single question that tests an isolated concept when possible. Aiming for purity of signal now will help in both execution and measurement as it quickly limits scope and defines the success outcomes.</p><p><strong>The Where.</strong></p><p>Generally, we’re trying to formulate an <strong>if this, then that</strong>. It’s now time to formulate the hypothesis and this is a great place to bring in historical data. You want to answer the question "What do you think your change will do?" This is the time to do your homework and use your knowledge of the product, your customers and prior experiments to estimate the impact of your change. Historical data and experience are also useful here, and back-of-the-envelope math is perfectly ok. You’re getting to the point where you’ve defined the what and the why; now you need to figure out how you’re going to measure it and decide on the definition of a successful experiment.</p><p>In our example above, where we are considering adding strike-throughs, let’s say we know from past data that our current conversion rate is roughly 10%. We anticipate adding strike-throughs will result in a 20% increase in conversion rate. Back-of-the-envelope analysis would put us at around 3000 samples to <a href="https://www.optimizely.com/sample-size-calculator/">reach statistical significance</a>. Now let’s say our site gets 1000 unique customers per day, and that on average, we have the ability to show discounted prices to half of those. We’d need about 12 days of volume in an A/B test scenario to measure that estimated effect. Do you want to measure a smaller impact? Well, you’ll need more samples, and that means a longer run time or more customers. Statistical significance and its tradeoffs are covered much better elsewhere; here it’s just important to think about what variables we have in play and how that impacts our experimental design.</p><p>More important is how we arrived at that change estimate of 20%. In our example, we analyzed the historical purchase data and saw people were more likely to purchase discounted rates. We ran a historical experiment that saw a 10% increase in conversion rate when we badged things as a sale. The best experiments come when you use your corpus of historical data and experimental findings to bootstrap your hypothesis development.</p><p>Part of your experimental design will be around what the outcome is — really phrased as "what do you care about?" This is often where people go wrong, as it turns out the thing they care about is either 1) not measurable or 2) a slow-to-acquire metric like lifetime value or retention rate. These are okay if you’re willing to run an experiment for 90+ days, but since we’re talking to a lot of folks looking to experiment their way toward product-market fit, 90 days is an eternity. When we set up an experiment, we think it’s key to have a primary metric that you’re using to measure success. It’s fast-moving, measurable, and one that you really care about. It’s also ok to have secondary metrics, often good for checking adverse effects (e.g. your primary metric of time on site goes down as intended, but the change also drives your conversion rate and profit per sale down). It’s good to keep an eye on what you care about, even if it’s not your primary experimental focus.</p>
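<p>As a rough sketch of that back-of-the-envelope math, the standard two-proportion approximation below lands in the same ballpark; the exact figure depends on the significance level and power you pick (the calculator linked above uses its own defaults), so treat the output as directional.</p><pre>
# Rough two-proportion sample-size estimate (normal approximation).
from math import ceil

from scipy.stats import norm


def samples_per_variation(p_control, relative_lift, alpha=0.05, power=0.80):
    """Approximate samples needed per variation to detect a relative lift."""
    p_treatment = p_control * (1 + relative_lift)
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided significance
    z_power = norm.ppf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    n = (z_alpha + z_power) ** 2 * variance / (p_treatment - p_control) ** 2
    return ceil(n)


# The example above: ~10% baseline conversion, ~20% relative lift expected,
# ~1000 unique visitors per day with roughly half of them eligible to see a discount.
n_per_group = samples_per_variation(0.10, 0.20)
days_needed = ceil(2 * n_per_group / (1000 * 0.5))
print(n_per_group, days_needed)  # a few thousand per variation, on the order of two weeks
</pre>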
<p><strong>The Who.</strong></p><p>This is probably the single most important factor in what you learn from your test. Here we define who is going to be a test subject. Are we going to limit the test to returning customers? To new prospects? To mobile only? Generally you want to be as nonrestrictive as possible in order to maximize the generalization of your learning. However, we need to be careful as this is not only a numbers game but a bias game. We also need to determine how we’re going to assign people to the control (what we’re measuring against) and the treatment (what we’re testing) and if that’s every time they come to the site, or if they stay in a particular path for all time.</p><p>For the target audience, you want to selectively choose this group so as to understand the bias they bring in terms of point of view. The group you select will condition and limit the learning from the experiment going forward. It’s important to define up front and communicate your target audience and what biases you might have. The bias your audience has is going to directly impact what you learn. When you ultimately present your findings, you’ll present them as "when we test <strong>this</strong> on <strong>this group</strong>, we learned <strong>that</strong>…"</p><p>In our example, we’re going to enter the experiment every time we have a customer shopping for a hotel and we have at least one hotel we can show a discount for. The experimental population will be all shoppers who come to the site and are eligible to see a discounted hotel; the control group will be those who don’t see a strike-through, and the treatment group will be those who do. To maximize our learning, customers will get an equal chance of seeing either variant every time they shop.</p><p>When designing your experiment, the factors that determine your ability to run a good test are:</p><ul><li>Time</li><li>Audience size</li><li>Effect size</li><li>Observability of the dependent variable</li><li>At what level you do the randomization</li></ul><p>Time is crucial to consider when designing an experiment, which is why we included it in the back-of-the-envelope calculation above. Imagine you increase the specificity such that your audience size is halved — are you willing to wait twice as long? This assumes you can run things in parallel. What if it’s a pricing experiment and you need to run the test longitudinally? Do you have sufficient data to remove seasonality?</p><p>Audience size is also very important — the more people who see your experiment, the faster you can learn. You can tailor what you learn (audience bias) and at what rate (effect size) but you’ll be limiting the generalizability of your findings. That might be perfectly okay but at the end of the day realize this is a game of tradeoffs.</p><p>Effect size is the impact of your experiment. You’ll need fewer samples to prove a drastic change than a small improvement. It’s why we often suggest high-impact experiments for startups and smaller, more incremental tests for operational companies that have the impression volume and time to optimize.</p><p>To reiterate, make sure you can measure your success variable and that it can be measured in a timely manner. Don’t expect to learn much in 14 days if your outcome is B2B signups that occur at a very low conversion rate in a month-long cycle. Perhaps find a leading indicator like conversion rate of form completion that is correlated with a downstream, slower-moving signal.</p><p>Finally, at what level do you split people when testing in digital products? It’s tough as it is a volume vs. quality tradeoff. You should try and optimize for the customer experience. In some cases, to reach significance earlier you might activate the experiment on every new site visit, or even every new search. For other more drastic or visual changes, it might be keyed and unique to the user. While your goal is to maximize statistical power, you need to be a good experimentalist and treat others the way you’d want to be treated. Ultimately, these are your customers — so do right by them.</p>
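<p>The split itself can be as simple as hashing a stable identifier per experiment, so a returning customer always lands in the same arm, while keying on a session ID instead re-randomizes on every visit. This is a generic sketch (the experiment name is made up), not necessarily how our production routing works.</p><pre>
import hashlib


def assign_variant(unit_id, experiment_name, treatment_share=0.5):
    """Deterministically bucket a user (or session) into control or treatment."""
    digest = hashlib.sha256(f"{experiment_name}:{unit_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform value in [0, 1]
    return "control" if bucket >= treatment_share else "treatment"


# Key on user_id for a sticky, per-customer split; key on session_id to
# re-randomize every visit, as discussed above.
print(assign_variant("user-123", "hotel-price-strikethrough"))
</pre>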
<p><strong>The How.</strong></p><p>How are you going to measure the impact of your experiment? This is the most overlooked part of digital product experiments. In the case of our example, we’d need to build some technology to identify whether or not a user session is eligible for the experiment, which treatment the user will get, if the user saw it (i.e. above the fold) and what ultimate action the user took (clicked more info, added to cart, bought, etc.). We’re fortunate in that we’ve built a lot of telemetry into our product for this very reason and that allows us to select and route audiences, measure outcomes, and do some summary reporting on experimental performance. It’s important that when you get to this point you not only build but also test your telemetry to make sure you trust it. Once you’re confident, you can proceed with turning the experiment live. There are many great experimental platforms out there, both commercial and open source, that take care of many of the above tasks for you.</p>
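<p>To make that concrete, the sketch below shows the kind of event record that covers eligibility, assignment, exposure, and outcome; the field names are illustrative rather than the actual schema of our telemetry.</p><pre>
import json
import time


def log_experiment_event(sink, experiment_name, session_id, variant, event, detail=""):
    """Append one experiment event (eligible / exposed / outcome) as a JSON line."""
    record = {
        "ts": time.time(),
        "experiment": experiment_name,
        "session_id": session_id,
        "variant": variant,   # "control" or "treatment"
        "event": event,       # e.g. "eligible", "exposed", "added_to_cart", "purchased"
        "detail": detail,     # e.g. "above_the_fold" for exposure checks
    }
    sink.write(json.dumps(record) + "\n")


# Log "eligible" when a discounted hotel could be shown, "exposed" only when the
# strike-through actually rendered, and the outcome events afterwards, so the
# analysis can join everything on session_id and variant.
</pre>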
<p><strong>The When.</strong></p><p>How do you know if you have enough information to conclude the test? This is probably the question we as data scientists get asked most often. Sadly, the answer is usually "it depends." At scale, significance tests are great, but unfortunately we often just don’t have that kind of volume or time in the startup world. In those cases, we like to reframe the question as "when is your result directional?" so we can balance the rigor with the practical. Rephrase the question as the true question statistical significance seeks to answer — "how do we know what we’re measuring is not random or luck but actual signal?"</p><p>Cassie Kozyrkov does a wonderful job <a href="https://hackernoon.com/data-inspired-5c78db3999b2">outlining a mindset and rules around experimentation</a> that our team has really taken to heart and adopted. It revolves around doing your homework upfront, and resisting the urge to move the goalposts during the experiment (analyzing and fitting the data to your desired outcome and then feeling good about it). Doing your work up front as described above when setting the hypothesis of change forces you to make a guess and then wait on the data from the experiment to either prove or disprove your estimate. Either way, you’ll learn in a controlled and deliberate manner. In Cassie’s article, she cites the difference between data-driven and data-inspired decision making. Data-inspired decision making can cause you to miss the ultimate outcome of the process: the pure signal you are after. We can’t do her post justice, so we suggest just reading it <a href="https://hackernoon.com/data-inspired-5c78db3999b2">here</a>.</p><p>A number of factors influence significance — effect size, sample size, confidence and sample variability. A lot of blog posts have done a much better job than I ever will on covering this. Ultimately, it’s important to understand what levers you have and what tradeoffs you’re willing to make — trading confidence interval for time, trading effect size (i.e. big impact for reduced audience size), trading a larger, less specific audience size for time. There is no right answer when it comes to the grey area of taking results prior to significance — but for us, we’ll usually take trends and changes relative to baseline as signal when bootstrapping to the next experiment. We can reserve rigor for cases of operational optimization when we have the time and eyes.</p><p>In summary, we built a small playbook for running digital experiments:</p><ul><li>Identify the problem area</li><li>Learn as much as possible before you experiment</li><li>Identify an impactful experiment and hypothesize the impact</li><li>Determine the feasibility of the measurement</li><li>Understand what biases your audience might have</li><li>Wait and be patient for results</li><li>Try and remove your own personal bias</li><li>Focus on running a great experiment and don’t tie success to your result.</li></ul><p>Running experiments is difficult but I hope the above gives you some insight into a mental framework for how you might approach your own experimental design. We would love to hear from you about how your approach differs and what you’ve learned along the way.</p><p>Oh… and we’re hiring — check out our <a href="https://upside.com/company/careers?utm_source=blog&amp;utm_medium=medium-engineering-blog&amp;utm_campaign=engineering-blog&amp;utm_content=labs">careers page</a>!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ee98e5e31860" width="1" height="1" alt=""><hr><p><a href="https://medium.com/upside-engineering/playbook-for-practical-product-experimentation-ee98e5e31860">Playbook for practical product experimentation</a> was originally published in <a href="https://medium.com/upside-engineering">Upside Engineering Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Upside Labs: One Year In]]></title>
            <link>https://medium.com/upside-engineering/upside-labs-one-year-in-9ebb90f71815?source=rss----d98d3544d2e9---4</link>
            <guid isPermaLink="false">https://medium.com/p/9ebb90f71815</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[startup]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[startup-lessons]]></category>
            <category><![CDATA[experiment]]></category>
            <dc:creator><![CDATA[Chris Poirier]]></dc:creator>
            <pubDate>Wed, 13 Nov 2019 14:26:51 GMT</pubDate>
            <atom:updated>2019-11-14T17:22:26.686Z</atom:updated>
            <content:encoded><![CDATA[<p>It may have been a while since our last <a href="https://engineering.upside.com/back-to-the-lab-51e58d6fa258">blog post</a>, but we’ve definitely <a href="https://skift.com/2019/04/08/flight-centre-invests-in-jay-walkers-biz-travel-company-upside/">been busy</a> over here at <a href="https://upside.com/?utm_source=blog&amp;utm_medium=medium-team-blog&amp;utm_campaign=team-blog&amp;utm_content=labs">Upside Business Travel</a>!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*nGo6aljzGkgNPscklMKK4g.png" /></figure><p>A year ago, we launched Upside Labs to the world in the form of labs.upside.com with our first public-facing experiment, a flight delay predictor. We got some good press, and even made the front page of <a href="https://news.ycombinator.com/item?id=18112785">Hacker News</a> and <a href="https://www.producthunt.com/posts/delay-predictor">Product Hunt</a>. So far, our delay predictor has been used by thousands of people and continues to see active usage to this day. But, little did we know what was to come.</p><p>The launch was just the start and turned out to be a very valuable learning experience in how <strong><em>not</em></strong> to do experimentation. It proved to be a foundational example in the development and evolution of our experimental discipline at Upside.</p><p>Once we got down to business, we launched nearly 50 customer-facing experiments in six months with a small team of six people. That’s roughly a new experiment every three working days. Getting to 50 was no easy task and required us to be deliberate given the constrained resources we had. What emerged from the lofty goals set for us was the need to define a pipeline and framework that would enable us to scale and productionize experimentation. While we are normally averse to excess process at Upside, we had the stark realization that we weren’t going to fake or hustle our way to success alone. This was much bigger, needed a process stripped of excess, and forced us to execute to get signal.</p><p>Process alone wouldn’t get us there. Discipline to stick to the process along with the buy-in from the team and stakeholders was critical to our success. Before I dive deep into the lessons learned over the past six months, I’d like to describe the context around the environment and mission we were given.</p><p>The mission was simple: find ways to get novel travel concepts and features in front of business travelers and improve their lives. We were given generous support by our product and engineering organizations to experiment aggressively, even if it meant injecting our experiments into the product flow. For our first 10 experiments, we stayed largely independent of the e-commerce path, but ultimately decided the eye traffic and users were worth the investment in aligning and supporting our production system. This turned out to be a valuable learning experience — bolting experimentation onto a constantly growing and moving platform is hard.</p><p>Road bumps are inevitable, but we worked to set ourselves up for success from the beginning by drawing on past experiences. One of the biggest challenges of running a labs team inside an existing product engineering team is the interface. How do we work together? More importantly, how do we keep the experimental machine running while we continue to build our core product?
We were proactive and worked out a contract with the team prior to experimentation where they kindly offered up 20% of our web traffic in flow for experiment volume. We were often also able to negotiate for full-traffic experiments with individual product owners on a case-by-case basis. In return, our part of the deal was to be deliberate in how we communicated experimental performance, and I took it one step further by hosting a bi-weekly product team meeting where we reviewed the experimental roadmap and discussed interesting results, both positive and negative. We would transition concepts from successful experiments into the scrum team backlogs to flesh out and develop.</p><p>Earlier, I mentioned the need for process. The impetus for that came when our fearless leader, <a href="https://twitter.com/tscottcase?lang=en">Scott Case</a>, threw down a super aggressive goal of 25 experiments in the 4th quarter. Often, with big goals, I like to gamify the problem, and we went so far as to put up a scoreboard in the Labs team area. The moment a customer was exposed to one of our experiments, we would cross off that experiment’s name and number on our team whiteboard. This ended up being a rallying call for the team. Below are some of our key lessons learned in the process. For more insight into how you can take these lessons and apply them at your company, I recommend reading <a href="https://engineering.upside.com/building-a-culture-of-experimentation-875bb3ae183c"><em>Building a Culture of Experimentation.</em></a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*d4XoGBwPkPlcSx33" /></figure><p>Data Scientist Ian Cassidy crosses out his launch of the inflow hotel recommendation experiment. Being accountable and transparent were key contributors to our success in building a culture of experimentation. Note: our team has a strong opinion of Sir Sandford Fleming and his time zones.</p><p><strong>Experimentation is a mindset — now is good</strong></p><p>Experimentation is a valuable tool that can be deployed to determine what to do next — use it liberally and use it quickly. A few things we’ve learned along the way:</p><ul><li>State the hypothesis and focus on answering that question.</li><li>The only correct solution is the one that gets to the answer as efficiently as possible.</li><li>Remember that your goal is to focus on the signal — you can skip the polish if it doesn’t help you answer the hypothesis. This means that you’ll need to balance speed and lack of polish with the elimination of "distractions" that may contaminate the experiment. Our goal here is purity of signal in a realistic environment.</li></ul><p><strong>The Smaller the Better</strong></p><p>Remember, the more specific you can be in defining your experiment, the more work you can eliminate from scope. Experiments serve two valuable functions. First, they help validate and discover how to implement an idea (aka innovation) and can be used to test any idea. Second, experiments should not deliver a product or feature, but should be used as a tool to help you get unblocked on how best to proceed with a solution. You’re not looking for a final deliverable at this point.</p><p>Experiment to learn. Small experiments also allow you to be efficient and focused. Remember, experiments don’t necessarily mean writing code or building a product. It could simply be a survey or calling people. Force yourself to think about how to break a big problem into a number of smaller experiments or hypotheses.
The goal here is to find the most efficient way to validate a hypothesis. It’s best to iterate on successive experiments until you have the necessary answers rather than testing the complex system in one shot.</p><p><strong>Set Lofty Goals</strong></p><p>This doesn’t necessarily mean setting high expectations, but you’re going to want to catalyze the process by giving the team focus and goals to chase. It’ll also help force the definition of the experiment and limit scope by setting boundaries the team needs to adhere to in order for the goal to be achieved. In our experience, it really was a forcing function for us to decide what was needed to test the hypothesis and what was polish. Polish was the term we used to describe great ideas that might improve the experience but were not deemed necessary for the experiment to test its hypothesis.</p><p>Lofty goals may have adverse and undesired impacts on the team, so it’s important to coach your team on not settling into a solution before experimenting and to leave options on the table to solve problems in different ways. In other words, give them the room to work the learning process.</p><p>Another way to think about lofty goals is the way you would think about a financial portfolio. Portfolios have short-term and long-term investment and return horizons — with the hope that they will generate a consistent return. Your experiments should help you flesh out some short-term wins and longer-term lofty visions. These early wins really helped catalyze our momentum and helped us overcome the hurdle of some of our more ambitious experiments.</p><p><strong>Use the Hypothesis to Focus</strong></p><p>The hypothesis is your line in the sand. Your goal is to do sufficient work to gather enough information to prove or disprove the hypothesis. That’s it. It’s easy to get distracted with polish and increase the scope — but this works against your goal. Focus on testing the hypothesis or stop the experiment.</p><p>Patience is going to be important to you. You’ll need to understand the levers that go into deciding when you have enough experimental data. How many people show up? How are you measuring action? What are you measuring? Jumping to the next iteration can be costly if the hypothesis is eventually invalidated. Having a portfolio of experiments and running multiple experiments in parallel lets you keep learning while you wait.</p><p><strong>Lack Data? Get Creative and Get it!</strong></p><p>In our case, our customer volume changes over time, and during our launch of Labs, Upside had just transitioned from a B2C player to a B2B seller of business travel to SME companies. To generate statistically significant results, we need a lot of data or an experiment with outsized impact. A few ways around this can be to test outside of the product — both user surveys and back testing have been useful ways for us to learn or test an idea quickly. During customer testing, be as broad in defining the audience as you can be. In low volume, you’re going to have to wait and that’s OK.</p><p>We learned a lot about what it takes to bring a culture of experimentation to a startup. We tripped and fell a few times. We were forced to exercise our humility and, above all else, we found that it took the investment and buy-in of the whole company for us to be collectively successful.
We learned together, we advanced together, and we believe we became more efficient through the process.</p><p>I’m fortunate that I got to be the author of this article, but the true success of our initiative is a result of the team that did the grinding and built the process and learnings above. Michael Bellerose, Frank Abissi, Ian Cassidy, Gilbert Watson, Rami Chowdhury, Dan Riggs, and Boyang Huang were the core team that allowed us to scale. Special thanks to our leadership Emily Dresner, Adam Holmes, and Scott Case for supporting us and encouraging us to keep going.</p><p>Above are some of the lessons we learned around experimenting at Upside over the past 12 months. You’ll find a link below to a more comprehensive article on our approach for bringing a culture of experimentation to your company. Here, read about what we learned when bolting an experimentation engine onto a fast moving product — a challenge a lot of you startup founders are thinking about.</p><p>These days, the team is heads down building new recommendation and dynamic pricing capabilities for the travel industry. Experimentation is at the core of these products and we’ll update you on our progress when we can. Oh, and that flight delay predictor? It’s still chugging along and getting better by the day. Check out <a href="https://engineering.upside.com/applying-predictive-analytics-to-flight-delays-85413ca4939f?utm_source=blog&amp;utm_medium=medium-team-blog&amp;utm_campaign=team-blog&amp;utm_content=labs">Ian’s blog post</a> on how he built it. Turns out, we didn’t follow our process on that one and built something with a lot of polish before we even understood what people wanted. We probably wasted a few cycles of efficiency, but we’ll count those as tuition towards helping us learn and move quicker in the future. Onward!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=9ebb90f71815" width="1" height="1" alt=""><hr><p><a href="https://medium.com/upside-engineering/upside-labs-one-year-in-9ebb90f71815">Upside Labs: One Year In</a> was originally published in <a href="https://medium.com/upside-engineering">Upside Engineering Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building a Culture of Experimentation]]></title>
            <link>https://medium.com/upside-engineering/building-a-culture-of-experimentation-875bb3ae183c?source=rss----d98d3544d2e9---4</link>
            <guid isPermaLink="false">https://medium.com/p/875bb3ae183c</guid>
            <category><![CDATA[startup-lessons]]></category>
            <category><![CDATA[startup]]></category>
            <category><![CDATA[product-management]]></category>
            <category><![CDATA[experimentation]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Chris Poirier]]></dc:creator>
            <pubDate>Wed, 13 Nov 2019 14:25:49 GMT</pubDate>
            <atom:updated>2019-11-14T16:28:04.151Z</atom:updated>
            <content:encoded><![CDATA[<p>Most startups are not going to find product-market fit by luck. Heck, most startups are not going to find product-market fit at all. What can you do to increase your chances that you find product-market fit? Well, I’d offer growing startups this advice: in order to differentiate, adopt an experimental mindset early and run an incremental test, learn, and evolve model; it will help you home in on the signals that ultimately drive innovation and improve your likelihood of finding product-market fit.</p><p>The challenge with the incremental approach is that it is neither sexy nor fast. It is, however, reliable and, I’d argue, a great use of resources at a growing company. Being deliberate and chasing signal in an incremental way not only helps you identify customer usage and gaps, but also allows you to evolve the product into a shape that satisfies the customer in the long term.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*c_w5tjqWbFmXFZdyPQNcZQ.jpeg" /></figure><p>When bringing a culture of experimentation to your startup, it is important to realize that it’s an investment and you’ll need to trade it off against all of the other competing interests. Recognizing the investment you’re making in the process, it’s important to take it seriously for the best outcomes. Otherwise, you could be left chasing noise and never accomplish anything. When I was a practicing scientist, we had to be deliberate about the experiments we were running and think about them as bets we were placing on knowledge. Funding, timing and resources were always influential, but ultimately we could trade those off in pursuit of pure science. At a startup, you don’t have an infinite well of government grant money, and if you’re in the market, chances are other people are racing to get their ideas out and win share.</p><p>There are several things you can do when bringing an experimental mindset to your company.</p><p><strong>Smaller is better</strong></p><p>An experimentation mindset makes you form all questions into a hypothesis that has a measurable outcome. This is much more difficult than it may seem, but it is the first step toward embracing an experimental mindset. When forming your hypothesis, remember that the more specific you can be in your definition, the more work you can eliminate from scope. The less work, the quicker you can get it done. Experiments serve two very valuable functions in product development and problem solving. First, they help you validate and discover how to implement an idea. Second, the experiment is a means to an end, but not a product. A well-designed experiment should not deliver a product or feature, so don’t worry about the final deliverable, but instead focus on getting an answer to your hypothesis. The outcome of a well-run experiment is unlocking the next experiment in the incremental learning step.</p><p>Remember why you’re running an experiment — to learn in a controlled manner. Small experiments allow you to be efficient and focused, isolating the signal you’re looking for and measuring it in the most direct way possible. An important lesson that we learned in the process is that an experiment doesn’t need to be writing code or building a product. Send a survey, call people, send some emails. Force yourself to think about how to break a big problem into a number of smaller experiments or hypotheses, each building on the outcome of the last. 
Your goal here is efficiency, and you’re attempting to get there in the most direct way possible. The outcome is to answer all your unknowns in a deliberate, direct manner rather than trying to deconvolve the results of a complex system launched in one shot.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/0*7hiU0cZ77MPnLdWN" /></figure><p><strong>Use the hypothesis to focus</strong></p><p>Think about your hypothesis as your scope. As a product owner, you’re used to defining scope and defending it. Same play here. Your acceptance criterion is the minimum viable experiment — do only sufficient work to gather information to prove or disprove the hypothesis. That’s it. It’s easy to get distracted with polish and expand the scope — but that’s not why you’re here. You’re here to learn in the most direct and deliberate fashion. It’s at this stage that most experimental programs fall apart — people don’t exercise the restraint and discipline necessary to execute. Focus on testing the hypothesis or stop the experiment. It’s hard, but it’s the only way and it’s what you signed up for. Focus will ultimately be your catalyst to success.</p><p>Just as important as focus, though, is patience; you’ll need to balance the two. My dad always accused me of being impatient. “Good things come to those who wait” was a phrase uttered far too often to me as a child. Well, it turns out he was right about that and many other things. You’ll need to understand the levers that you have and when you have enough experimental data to proceed. The challenge is, you’re always fighting against time. So the question I keep asking is: what is sufficient information to make a decision? While it’s nice to have statistical significance, it’s often a luxury some startups don’t have. You can increase your odds by choosing impactful experiments. Changing the checkout button from gray to purple when you have 10 customers a day is going to take a long time. Perhaps instead, look further up the funnel and run an experiment on the 100 people visiting your site per day or the thousand people you email every week.</p><p>The calculus you’ll always need to be doing is to understand your data rate (the rate at which something you care about changes) and the estimated impact of the change you want to measure. It’s important to understand how many people are showing up. What action are you trying to measure and is it even measurable? Try to choose experiments that will have impact (large audience and a large projected change in behavior) — these will help you get to signal quicker. And for those impatient like me, I’d recommend having a portfolio of experiments and running multiple (non-overlapping) experiments in parallel to help you keep learning while you wait. Remember, jumping to the next iteration at this stage can be costly if the hypothesis is eventually invalidated. It’s worth investing the time now so as not to pay the cost of a dead end in the future.</p><p><strong>Set Lofty Goals</strong></p><p>This doesn’t mean you’re reaching for the stars, but be aggressive and set out a pipeline of questions to answer. This pipeline will force you to prioritize, and I hate to use the term, but be lean. What you’re really trying to do is catalyze the process by giving the team focus and goals to chase. 
It’ll force them early on to adopt the above lessons, focusing on the hypothesis and setting natural boundaries the team needs to adhere to in order to accomplish the stated goal.</p><p>In our experience, a big lofty goal (in our case, 25 experiments in a quarter) was really a good forcing function for us to determine what was truly needed to test the hypothesis, and what was polish. It’s a hard skill to learn, and saying no to extra polish is very challenging. It was often a point we had to compromise on when working with external teams to run experiments in certain parts of the product, but when possible, we kept to the bare minimum of work to test the hypothesis.</p><p>Use caution when stating lofty goals, as they may have adverse or undesired impacts on the team. It’s an important part of your job to state the problem and work with the team to define the hypothesis. Coach the team on not defining the solution prior to posing the problem and defining the hypothesis; jumping straight to a solution is a great way to incorporate bias. Engineers’ tendencies are often to start solving the problem prematurely, and you’ll need to channel this instinct into a problem-solving space where the solutions are part of the experimental and, ultimately, the learning process.</p><p>The way we’ve been best able to manage goals is to think about the work as an experimental portfolio. Portfolios have both short- and long-term investment and return horizons, with the hope that they generate an ever-increasing rate of return. Your experiments should help you flesh out some near-term “wins” and longer-term lofty visions. These early “wins” helped us catalyze our momentum and helped us overcome the hurdle of some of our more ambitious experiments; so it’s not a bad thing to front-load some of the easier work.</p><p><strong>Experimentation is a mindset — now is good</strong></p><p>I think about experimentation as a very powerful tool. It’s your sidekick that can be deployed to figure out what to do next — you should use it liberally, often and quickly. The biggest lessons we’ve learned along the way:</p><ul><li>State the hypothesis and focus on answering that question and only that question.</li><li>The only correct solution is the one that gets to the answer as efficiently as possible. It’s also OK not to be perfect.</li><li>Remember that your goal is to focus on and maximize the signal. You can skip the polish if it doesn’t help you answer the hypothesis.</li><li>Balance speed and a lack of polish with the goal of eliminating “distractions” — those things (UI, UX, functionality) that may contaminate your experiment and detract from an otherwise fair experiment. Optimize for purity of signal.</li></ul><p>Investing time in defining the hypothesis now will only reduce future work and uncertainty. If you’ve committed to embracing an experimental mindset, it’s the best investment you can make. Having the discipline to hold back and work on defining the problem and hypothesis before building the solution is going to be the hardest part. Trust the process.</p><p><strong>Lack data? Get creative and get it!</strong></p><p>We struggled early on with customer volume changes. When we launched Labs, we had just completed a pivot from B2C to B2B, becoming a seller of business travel to SME companies. To generate meaningful results against our hypothesis, we needed large audiences or experiments with outsized impact. In our case, we were sitting on a year of B2C data, so we leaned heavily on back testing for some of our experimental models. 
In the case of our flight delay predictor, we opened it up beyond our customer base and just shared it with the world. Oftentimes, we resorted to running experiments via surveys — not only do you learn, but you then open up beta testing groups to those people who respond favorably. When testing with customers, think about placing an experiment as far up in the funnel as you reasonably can. The more eyes, the more impact; and the bigger the impact, the quicker you get to an answer. It’s a good time to be creative, and necessity breeds invention — look around and see what’s there; my guess is it’s a lot more than you expected.</p><p><strong>Review the data, review it again and accept the results.</strong></p><p>This is often the most overlooked part of the experimentation process — the measurement and follow-up. It’s important to see if you’ve proven or disproven your hypothesis after you’ve run the experiment and collected the data. Think about the experiment you’ve run and the data you’ve collected. Have you introduced any bias? Has the way you have conducted the experiment cast doubt on your ability to accept the results? Oftentimes, this is the stage where excuses can creep in that cause us not to accept the results. It’s hard to be proven wrong, but part of the process is having faith in it. It’s your job to make sure that you conduct a fair experiment, as free of bias as possible and done in such a way that you can trust the results. There is a reason why peer review is essential to science: it helps you eliminate the doubt that you’ve done things correctly. In our case, we were fortunate to have several scientists on the team who did a nice job of keeping the experiments honest and the process on track. When in doubt, it doesn’t hurt to slow down and ask someone to double-check your work and make sure you’re not missing anything major.</p><p>Now that you have your answer, it’s time to decide what to do next. If you’ve set it up correctly, you’ll likely be running another experiment to keep pushing. If you failed to prove the hypothesis, well, you learned something. Document it, share it, learn from it, and take that new information and use it to re-evaluate your next move. Building things is hard. Experimenting is hard. Building things people want to use and buy is really hard. Trust the data and keep going.</p><p>Bringing experimentation to a startup is by no means an easy or quick task. It takes discipline and investment to do it right, but the results can help you leapfrog your learning curve and your competitors. Having a supportive team that buys into the mission, groks the problem and ultimately embraces the challenge and mindset is key to success. Your job is to set them up for success, give them big problems to go chase, help them define and scope the hypothesis, and let them run and build solutions. Moreover, when their data rates are low, be there to encourage them to be patient and wait for the results. Be there to help review the results and to nudge them when you feel bias may have crept in. Ultimately it’s your process: you’re the peer reviewer, and it’s up to you to build a process that works for you and moves the needle. 
I hope our findings shed some light on the challenges and that our advice helps you build and foster a culture of innovation.</p><p>I’d like to thank Michael Bellerose and the Labs team for their help in forging this process and for the opportunity for us all to learn together by solving some cool problems.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=875bb3ae183c" width="1" height="1" alt=""><hr><p><a href="https://medium.com/upside-engineering/building-a-culture-of-experimentation-875bb3ae183c">Building a Culture of Experimentation</a> was originally published in <a href="https://medium.com/upside-engineering">Upside Engineering Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Two Technologies, A Pun And A Desire to Give Back: graphene_pydantic]]></title>
            <link>https://medium.com/upside-engineering/two-technologies-a-pun-and-a-desire-to-give-back-graphene-pydantic-2b7df0a4e8cb?source=rss----d98d3544d2e9---4</link>
            <guid isPermaLink="false">https://medium.com/p/2b7df0a4e8cb</guid>
            <category><![CDATA[puns]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[open-source]]></category>
            <category><![CDATA[graphql]]></category>
            <category><![CDATA[startup]]></category>
            <dc:creator><![CDATA[Rami]]></dc:creator>
            <pubDate>Mon, 23 Sep 2019 15:22:07 GMT</pubDate>
            <atom:updated>2019-09-23T18:34:58.736Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/670/1*eAttbpbUDI6pmixo5l4tJg.jpeg" /></figure><p>There are a lot of things we love here at <a href="https://upside.com?utm_source=blog&amp;utm_medium=medium-engineering-blog&amp;utm_campaign=engineering-blog&amp;utm_content=2-technologies">Upside</a> — making our customers’ travel awesome, vanquishing bugs, using cutting-edge technological tools — the list goes on. But one of the things that we love the most on our Labs team is wordplay — and puns in particular. (No wonder that many of us like to work with the<a href="https://www.python.org/"> Python</a> programming language, named after a particularly punny<a href="https://en.wikipedia.org/wiki/Monty_Python"> comedy troupe</a>. Correlation? Causation? You tell us!).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/500/0*2227G3wkX0uZN5mF" /></figure><p>So when we came across a software library that would help us neatly define and precisely control our data, called<a href="https://github.com/samuelcolvin/pydantic"> Pydantic</a>, we were sold — we <em>had</em> to find a way to use it.</p><p>We’ve ended up using Pydantic to manage our data models in a prototype we’re currently working on, where we use Python’s<a href="https://pandas.pydata.org"> excellent</a> and<a href="https://scikit-learn.org"> varied</a> data analysis and machine learning tools to suggest alternate ways for customers to book the flights they want and save money. It’s been a joy to work with. For instance, the code below lets us model a simplified flight:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/aaac0aec6541fbceddd68d4bf9d276e5/href">https://medium.com/media/aaac0aec6541fbceddd68d4bf9d276e5/href</a></iframe><p>Here, we’re using<a href="https://mypy.readthedocs.io/en/latest/cheat_sheet_py3.html"> type annotations</a> (available in recent versions of Python) to clarify that we expect the origin and destination to be short strings like “<a href="https://en.wikipedia.org/wiki/Washington_Dulles_International_Airport">IAD</a>” or “<a href="https://en.wikipedia.org/wiki/Ronald_Reagan_Washington_National_Airport">DCA</a>”, but departure should be a date. The simplicity of the code means that developers, data scientists, and product managers alike are able to understand the model and get a sense of what we can do with it. 
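<p>A minimal sketch of such a model, assuming only the three fields discussed here, might look like this:</p><pre>
from datetime import date

from pydantic import BaseModel


class Flight(BaseModel):
    """A simplified flight: two airport codes plus a departure date."""

    origin: str       # e.g. "IAD"
    destination: str  # e.g. "DCA"
    departure: date


# Pydantic parses and validates on construction:
flight = Flight(origin="IAD", destination="DCA", departure="2019-10-31")
</pre>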
Plus, because we’re using Pydantic, our program will tell us with a descriptive error when a database query or microservice request returns data in an unexpected form:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/d12eaf754bc0ee594ad371f0d9bff8f1/href">https://medium.com/media/d12eaf754bc0ee594ad371f0d9bff8f1/href</a></iframe><p>With these assurances, we can safely put a bunch of Flights into a<a href="https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/#CreatingDataFramesfromscratch"> DataFrame</a> and do some statistical analysis without fretting about data errors.</p><p>In a similar vein, we’ve become big fans of<a href="http://graphql.github.io/"> GraphQL</a> for communicating between our servers and our web application “front-end.” While many others have explained why they like GraphQL (the blog post from<a href="https://github.blog/2016-09-14-the-github-graphql-api/"> the GitHub team</a> is particularly good, in my opinion) the biggest benefit to us is communication.</p><p>With a<a href="https://www.apollographql.com/docs/apollo-server/essentials/schema/"> defined schema</a> and powerful exploration tools like<a href="https://github.com/graphql/graphiql"> GraphiQL</a>, we can work on front-end and back-end tasks in parallel, and our team doesn’t need to perform a<a href="https://www.startrek.com/database_article/mind-meld-vulcan"> Vulcan mind-meld</a> to understand what data fields and functionality are available. For instance, here’s the GraphQL schema for our Flight model:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/37ecfcaa8a1a8eeb3934b826b087843e/href">https://medium.com/media/37ecfcaa8a1a8eeb3934b826b087843e/href</a></iframe><p>This led us to a problem, however. While it’s fairly straightforward to define a schema for Flight, it <em>is</em> repetitive and easy to forget. And once we had forgotten to update a field or two, the benefits of self-documenting GraphQL started to fade away, and we were left wondering whether it was worth it.</p><p>Enter<a href="https://pypi.org/project/graphene-pydantic/"> graphene_pydantic</a>. It started out as an internal side-project to avoid having to do extra work, but it’s rapidly become how we create our whole GraphQL schema — automatically creating the GraphQL definitions from the Pydantic models we were already using! Using the<a href="https://graphene-python.org/"> Graphene</a> toolkit, this is how simple it is to turn our Flight model into a fully-fledged GraphQL type:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/20df962931f60c1ab22b04606bccba3f/href">https://medium.com/media/20df962931f60c1ab22b04606bccba3f/href</a></iframe><p>At <a href="https://upside.com/?utm_source=blog&amp;utm_medium=medium-engineering-blog&amp;utm_campaign=engineering-blog&amp;utm_content=python">Upside</a>, we recognize that our business wouldn’t be possible without the enormous power of open-source software, and in the spirit of giving back what we can we’ve released our little library as open-source so that others can use it too — download it from PyPI <a href="https://pypi.org/project/graphene-pydantic/?utm_source=blog&amp;utm_medium=medium-engineering-blog&amp;utm_campaign=engineering-blog&amp;utm_content=open source">here</a>! We hope it’s useful!</p><p>Do you have a passion for puns, Pydantic and Python? We’re hiring! 
Check out our <a href="https://upside.com/company/careers?utm_source=blog&amp;utm_medium=medium-team-blog&amp;utm_campaign=team-blog&amp;utm_content=pydantic">careers page</a> to view our open positions</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=2b7df0a4e8cb" width="1" height="1" alt=""><hr><p><a href="https://medium.com/upside-engineering/two-technologies-a-pun-and-a-desire-to-give-back-graphene-pydantic-2b7df0a4e8cb">Two Technologies, A Pun And A Desire to Give Back: graphene_pydantic</a> was originally published in <a href="https://medium.com/upside-engineering">Upside Engineering Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[From LastPass to AWS Parameter Store]]></title>
            <link>https://medium.com/upside-engineering/from-lastpass-to-aws-parameter-store-f277c03dd8c9?source=rss----d98d3544d2e9---4</link>
            <guid isPermaLink="false">https://medium.com/p/f277c03dd8c9</guid>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[parameter-store]]></category>
            <category><![CDATA[aws-system-manager]]></category>
            <category><![CDATA[secrets]]></category>
            <category><![CDATA[cli]]></category>
            <dc:creator><![CDATA[Emmanuel Apau]]></dc:creator>
            <pubDate>Wed, 17 Apr 2019 12:56:03 GMT</pubDate>
            <atom:updated>2019-04-17T12:56:03.370Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*rmAxSzvEWNTxX0zXdBcccA.png" /></figure><p>As with every startup, you eventually reach a point of maturity where enhancing your security practices becomes a necessity. For us, LastPass was a great solution to store our application secrets to be <a href="https://engineering.upside.com/synchronizing-kubernetes-secrets-with-lastpass-584d564ba176">synced with our Kubernetes cluster</a>, but as time progressed, missing features such as change management, version control, and auditing became increasing pain points as the engineering team grew.</p><p>Our solution: <strong>AWS Parameter Store</strong></p><h3><strong>The Why</strong></h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KSCer3h3xp2dASQr91zlvw.png" /></figure><p>We were able to achieve our main goals of versioning and change management by using AWSPS. Some pros were:</p><ul><li>We currently host our application in AWS’ cloud, which allows us to integrate with the SDKs seamlessly.</li><li>Being a managed service meant we would not have to handle any maintenance or patching.</li><li>Lastly, secret storage is currently <em>free.</em></li></ul><h3>The Migration</h3><ol><li>Create a script to export all application secrets via <a href="https://github.com/lastpass/lastpass-cli">lpass cli</a> and ingest them into AWS SSM Parameter Store.</li></ol><ul><li>Now lpass doesn’t make it very easy to export all of your secret notes and the data they contain, but with some UNIX command footwork it is possible:</li></ul><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/8535725f8b5f31d2d148dfe1277136ae/href">https://medium.com/media/8535725f8b5f31d2d148dfe1277136ae/href</a></iframe><ul><li>Once we have exported all of our secrets in all of our environments, we use the boto3 SDK to put them into AWSPS following the filesystem convention of /&lt;secret_container&gt;/&lt;secret_keypair&gt;</li></ul><p>2. Create a trigger for an AWSPS update/create/delete secret event to call our internal Jenkins K<a href="https://engineering.upside.com/synchronizing-kubernetes-secrets-with-lastpass-584d564ba176">ubernetes secret synchronization job</a> (covered in a previous post), and notify us in Slack for all environments.</p><ul><li>SSM Parameter Store -&gt; Cloudwatch -&gt; Lambda -&gt; Jenkins/Slack -&gt; #devops-alerts</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*TvjV_cBD-tG2qJtj-7Kmpw.png" /></figure><h3>The Result</h3><p>We were able to migrate our secrets, maintain our k8s synchronization job (albeit with a few tweaks for the new secret source) and achieve our desired state of secret management and transparency.</p><p>We did encounter problems along the way, though. AWSPS has a 4096-character limit per parameter, which limited us in a few secret use cases. Day-to-day secret creation through the AWS Console also left much to be desired, and we didn’t want to rely on keeping a mental note of our secret naming conventions.</p><p>Therefore <strong>Enforcer</strong> was born. Built on the boto3 SDK, it allows us to upload secrets while sidestepping the 4096-character AWSPS limitation via <a href="https://github.com/upsidetravel/enforcer/blob/master/CHUNKS.md">chunking</a>, and it enforces our naming/tag conventions. 
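<p>For a sense of what the ingestion boils down to, a minimal sketch of writing a single secret into Parameter Store under that filesystem-style convention might look like this (the container and key names are illustrative):</p><pre>
import boto3

ssm = boto3.client("ssm")


def put_secret(container: str, key: str, value: str) -> None:
    """Store one secret as an encrypted parameter, e.g. /my-service/db-password."""
    ssm.put_parameter(
        Name="/{}/{}".format(container, key),  # the filesystem-style convention described above
        Value=value,
        Type="SecureString",
        Overwrite=True,
    )
</pre>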
Enforcer also provided the added benefit of giving engineering teams the means to manage their own application secrets through the CLI, without SRE intervention.</p><p>Even though we’ve been using it successfully at Upside for a few months now, Enforcer still has a lot of feature requests!</p><ul><li>Querying secrets</li><li>Deleting secrets</li><li>Syncing secrets with Kubernetes</li></ul><p>If you’re interested in learning more about the specifics of how Enforcer works, check out its source on <a href="https://github.com/upsidetravel/enforcer">GitHub</a>. If you have any questions or comments, feel free to reach out via the comments or a GitHub issue!</p><p><em>Want to spend your time innovating on hard problems? Check out </em><a href="https://upside.com/company/team"><em>the engineering team</em></a><em> here at Upside.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f277c03dd8c9" width="1" height="1" alt=""><hr><p><a href="https://medium.com/upside-engineering/from-lastpass-to-aws-parameter-store-f277c03dd8c9">From LastPass to AWS Parameter Store</a> was originally published in <a href="https://medium.com/upside-engineering">Upside Engineering Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Culture: It’s more than a buzzword]]></title>
            <link>https://medium.com/upside-engineering/culture-its-more-than-a-buzzword-68d0f44a61bb?source=rss----d98d3544d2e9---4</link>
            <guid isPermaLink="false">https://medium.com/p/68d0f44a61bb</guid>
            <category><![CDATA[startup]]></category>
            <category><![CDATA[culture]]></category>
            <dc:creator><![CDATA[Danielle Mitchell]]></dc:creator>
            <pubDate>Mon, 07 Jan 2019 21:44:13 GMT</pubDate>
            <atom:updated>2019-01-07T22:07:49.704Z</atom:updated>
            <content:encoded><![CDATA[<p>Company culture has become a leading factor in attracting and retaining top talent. Perks and compensation certainly help to attract talent, but what keeps people ultimately comes down to personal development and the people. Do you like your coworkers? Do you like what you’re doing? Do you believe in the company and the leadership? Overall, are you happy? Happy and engaged employees are the lifeblood of every business. They drive innovation and move the company forward.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*HBAEP1Koyj9rd4Mmfsex8w.png" /></figure><p>The culture at Upside is one where everyone is encouraged to be themselves, speak up and get things done. It’s a fast-paced environment, challenging but also rewarding. You have the support of your team, and frankly the entire company, behind you and your ideas, and the people are talented and wonderful! Upside works hard to give you the tools for success, and I enjoy working here because I’ve been allowed to develop this role into something that challenges me while utilizing my skill sets.</p><p>Since joining Upside, we’ve talked about working on a culture video for the company. The goal was to show the work environment and hear how Upsiders feel about working here. Nine Upsiders were asked what motivates them to come into work every day, what they like about working at Upside, and how they would describe the culture here. The results were organic and original, and they showed how much our team enjoys each other and the culture. Throughout the year, we’ve collected pictures and video clips from events, team days, hackathons, all-hands meetings, and everyday interactions to show that working at Upside is fun and energizing, and that we’re driven to develop our people. See what Upsiders think of working here and judge for yourself. (We’re hiring!)</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FGuThCCpUOB4%3Fstart%3D80%26feature%3Doembed%26start%3D80&amp;url=http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DGuThCCpUOB4&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FGuThCCpUOB4%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/ccea8738814916bb1d7a4e91f7a58ab8/href">https://medium.com/media/ccea8738814916bb1d7a4e91f7a58ab8/href</a></iframe><p>If you’d like to learn more about Upside Business Travel, click <a href="https://bit.ly/2ReiwIA">here</a>! Even better, if you’d like to join our team, visit our team page <a href="https://bit.ly/2LWTQhM">here</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=68d0f44a61bb" width="1" height="1" alt=""><hr><p><a href="https://medium.com/upside-engineering/culture-its-more-than-a-buzzword-68d0f44a61bb">Culture: It’s more than a buzzword</a> was originally published in <a href="https://medium.com/upside-engineering">Upside Engineering Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[From Content-Based Recommendations to Personalization: A Tutorial]]></title>
            <link>https://medium.com/upside-engineering/from-content-based-recommendations-to-personalization-a-tutorial-773c9903b521?source=rss----d98d3544d2e9---4</link>
            <guid isPermaLink="false">https://medium.com/p/773c9903b521</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[personalization]]></category>
            <category><![CDATA[recommendation-system]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[travel]]></category>
            <dc:creator><![CDATA[Ian Cassidy]]></dc:creator>
            <pubDate>Wed, 19 Dec 2018 16:22:14 GMT</pubDate>
            <atom:updated>2018-12-19T16:22:14.688Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*-3RpKexlYXmTyrKDw4Z-rg.jpeg" /></figure><h3>Introduction</h3><p>A primary goal of the data science team at <a href="https://home.upside.com/welcome_home/?utm_source=blog&amp;utm_medium=medium-blog&amp;utm_campaign=EngineeringBlog&amp;utm_content=recs_and_personalize">Upside</a> is to present customers with flights and hotels that they are most likely to purchase. Additionally, we want our customers to feel like we know them, so personalizing the presented inventory is a great way to demonstrate this! One technique to personalize inventory is through <a href="https://en.wikipedia.org/wiki/Recommender_system">recommender systems</a>, which can take the properties of the inventory along with customer preferences into account in order to present items that are most relevant.</p><p>There are many ways to capture customer preferences (like the ubiquitous “thumbs up/thumbs down” button), but we found that content-based recommender systems, which don’t require direct customer feedback, provide a “quick and dirty” approach to testing personalization in our product. In this blog post, we’ll walk through the steps we took to implement content-based recommendations and personalization. Even though this tutorial is specifically geared towards the travel industry, the techniques presented here can be applied more generally across almost any e-commerce industry.</p><h3>The Data</h3><p>For the remainder of this post, we’ll examine real hotel inventory in New York City for a stay between January 7–10, 2019. Both the data and an accompanying <a href="https://jupyter.org/">Jupyter</a> notebook can be found <a href="https://github.com/ianlcassidy/blog-posts/tree/master/20181217-recs-and-personalization">here</a>.</p><p>Let’s start by examining the hotel inventory data by loading it in as a Pandas DataFrame and performing a <a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html">.describe()</a> on it.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/563/1*dKoHh7651CIkMYtaOZgeSA.png" /><figcaption>Table 1. Summary of the hotel inventory data</figcaption></figure><p>A few things to note about the data:</p><ul><li>the hotel names have been removed and replaced by a hotel_id field, which is just an integer assigned to each row of data</li><li>the avg_rate field is the <em>minimum</em> average room rate per night returned by the hotel for the duration of the stay; i.e., if the hotel has multiple room rates available, this is the minimum rate</li><li>the distance field is in miles and is the distance to the location of the search performed to obtain this data, which is in Midtown Manhattan</li><li>the star_rating and user_rating fields are based on a 0–5 scale with .5 star increments and are a measure of the amenities and quality of the hotel</li></ul><p>Next, let’s perform some basic exploratory data analysis to visually inspect the data. Figure 1 illustrates histograms of the relevant fields — they seem normally distributed or slightly log-normal, which is good. Figure 2 shows that the avg_rate increases as star_rating and user_rating also increase — you get what you pay for!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/831/1*KUEiSbR6Tx18uDEyTcpwdg.png" /><figcaption>Figure 1. 
Histograms of the hotel inventory data</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/862/1*XCMxSpcPQfmP0Fy3bQprWw.png" /><figcaption>Figure 2. Average rate vs star rating (left) and user rating (right)</figcaption></figure><h3>Content-Based Similarity</h3><p>The next step is to determine how you want to surface content-based recommendations. A simple approach in this example is by embedding the content of the hotel inventory into an N-dimensional space and finding the items that are the most similar to each other. In this case let’s construct our feature space as [lat, lng, avg_rate, star_rating, user_rating] — so the feature space of hotel_id = 10 is [40.714995, -74.015777, 326.4, 5, 4.5].</p><p>Now, let’s say a customer clicks on a particular hotel and wants to learn more about the room types/rates, amenities, and location. While they are exploring this hotel, we can present the customer with similar hotels so they can take them into consideration while shopping. At <a href="https://home.upside.com/welcome_home/?utm_source=blog&amp;utm_medium=medium-blog&amp;utm_campaign=EngineeringBlog&amp;utm_content=recs_and_personalize">Upside</a>, we use a carousel of 4 options in our hotel rates page to expose the recommendations. This allows customers to view similar hotel options that they may not have the chance to explore otherwise.</p><p>Once the customer selects a particular hotel, which we define as the “anchor,” we can provide recommendations of similar hotels by following these three steps:</p><ol><li>normalize the feature space by converting each feature (i.e., column) into a <a href="https://en.wikipedia.org/wiki/Standard_score">standard score</a></li><li>compute the <a href="https://en.wikipedia.org/wiki/Euclidean_distance">Euclidean distances</a> between the anchor hotel and the other pieces of inventory</li><li>sort the hotel inventory in ascending order by the Euclidean distance</li></ol><p>Below is some code to sort the hotels by the similarity_distance given an anchor hotel_id.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/b5591b06d6a0c51098b286a1722af85e/href">https://medium.com/media/b5591b06d6a0c51098b286a1722af85e/href</a></iframe><p>A few things to note here:</p><ul><li>Normalizing the feature space is <em>very important </em>since we are dealing with features that have different units. The choice of normalization scheme depends on the problem and the data — in this case we are using a standard score approach because the data is normally distributed — but, min/max scaling or <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf">TF-IDF</a> (for comparing documents) may also be useful for other applications.</li><li>Choosing the right distance or similarity score can have a big impact on the quality of recommendations. We are using Euclidean distance because we are embedding geo-coordinates in our feature space. Using <a href="https://en.wikipedia.org/wiki/Cosine_similarity">Cosine similarity</a> instead would be a huge problem for hotels with the same heading that are far apart because they have the same angle in the [lat, lng] space (shoutout to <a href="https://www.linkedin.com/in/gilbertwatson/">Gilbert Watson</a> for pointing this out). 
Other types of similarity scores can be investigated using the convenient Scipy function <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html">cdist</a>.</li><li>It is important to backtest your recommendation algorithm to pick the best normalization scheme and similarity score and tune any other parameters. For example, looking at recent customer searches and purchases and using <a href="https://en.wikipedia.org/wiki/Evaluation_measures_%28information_retrieval%29#Precision_at_K">recall/precision rate at k</a> can help tune the hyperparameters to find the optimal algorithm configuration.</li></ul><p>Let’s see what happens when we make the anchor hotel_id = 10:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/742/1*gYLmepq2hY-v7nMjA9McwA.png" /><figcaption>Table 2. The 5 most similar hotels to hotel_id = 10</figcaption></figure><p>These results look promising! The algorithm found hotels of a similar price and rating to the anchor that are also nearby as measured by the distance_from_anchor field (in miles).</p><p>Next, let’s look at a second example where we set the anchor to be hotel_id = 21, which is much cheaper and has lower ratings than hotel_id = 10:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/737/1*pGxiNrJxnBjDpqKd9Nj_6w.png" /><figcaption>Table 3. The 5 most similar hotels to hotel_id = 21</figcaption></figure><p>Again, the algorithm is able to find similar and nearby hotels that are much different than the results in Table 2.</p><p>Regardless of the anchor, it is important to measure the goodness of the recommendations by tracking the number of customer views, clicks, and conversions and then iterating on the algorithm.</p><h3>Personalized Recommendations</h3><p>Building upon the content-based similarity algorithm, we can extend this methodology to create personalized recommendations for each customer. Instead of using the hotel that the customer clicked-on as the anchor and finding similar hotels, we can use purchase history to compute normalized features for each customer. The personalized anchor becomes the “virtual hotel” that we use to sort inventory in new markets based on a similarity score. This is akin to a look-a-like model where we try to recommend new hotels that are most similar to what customers have previously purchased.</p><p>Here is some example code of how we can modify the get_hotel_recommendations function from the previous section to provide personalized recommendations given a dictionary of pre-computed user features.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/3f5e7bfca665927f055e5c2cb9cba3ec/href">https://medium.com/media/3f5e7bfca665927f055e5c2cb9cba3ec/href</a></iframe><p>One lever we are using here to bias the model is to artificially set the distance feature equal to the minimum of the normalized distances for the set of available hotels. By doing this, we ensure that the algorithm penalizes hotels that are further away from the user specified search location. This technique can be extended to other concepts like recommending hotels that are more profitable (assuming we know or can calculate the amount of profit we can earn from each hotel). In that case, we can create a feature called profit and artificially set the value for the anchor equal to a high value.</p><p>Let’s look at the personalized hotel recommendations for 2 different customers. 
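<p>First, a condensed sketch of that personalized sort (the column names are assumed from Table 1, and the real implementation may differ):</p><pre>
import pandas as pd
from scipy.spatial.distance import cdist

# assumed feature columns, following Table 1
FEATURES = ["avg_rate", "star_rating", "user_rating", "distance"]


def get_personalized_recommendations(hotels: pd.DataFrame, user_features: dict) -> pd.DataFrame:
    # 1) normalize each feature column to a standard score
    normalized = (hotels[FEATURES] - hotels[FEATURES].mean()) / hotels[FEATURES].std()
    # 2) build the "virtual hotel" anchor from the customer's pre-computed (already normalized)
    #    features, pinning distance to the minimum normalized value to favor nearby hotels
    anchor = [user_features.get(f, 0.0) for f in ("avg_rate", "star_rating", "user_rating")]
    anchor.append(normalized["distance"].min())
    # 3) Euclidean distance from the anchor to every hotel, sorted ascending
    ranked = hotels.copy()
    ranked["similarity_distance"] = cdist([anchor], normalized.values)[0]
    return ranked.sort_values("similarity_distance")
</pre>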
The first customer has features that are all equal to 0 (exactly at the mean) while the second customer has features that are all equal to 1 (above the mean). As we can see in the two tables below, the sorts are quite different and we are showing the second customer more expensive and higher quality hotels. This is expected because the second customer demonstrated historical affinity for purchasing hotels with prices and ratings above the population mean.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*dzBp3Bwu6O0TR_FpvZx2cA.png" /><figcaption>Table 4. Personalized hotel sort for a customer with user_features = {“avg_rate”: 0, “star_rating”: 0, “user_rating”: 0}</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/599/1*l5a8AReF0FcMKQSm84yMdQ.png" /><figcaption>Table 5. Personalized hotel sort for a customer with user_features = {“avg_rate”: 1, “star_rating”: 1, “user_rating”: 1}</figcaption></figure><h3>Summary</h3><p>In this blog post we have shown how content-based recommendations can be extended to create lightweight personalized recommendations using historical customer search and purchase history. This approach was accomplished at <a href="https://home.upside.com/welcome_home/?utm_source=blog&amp;utm_medium=medium-blog&amp;utm_campaign=EngineeringBlog&amp;utm_content=recs_and_personalize">Upside</a> without a large data modeling or engineering effort — we were able to implement both techniques in our product in a matter of weeks! In addition, using the anchor concept and artificially setting or changing various features, we can weight or penalize different attributes of the items to fit any number of objective functions (like distance from search pin and expected profit).</p><p>Hopefully this tutorial provides some inspiration to get us all beyond the thumbs up/thumbs down button (but we’ll happily take your claps).</p><p>At Upside, we’re always looking to improve business travel. If you’d like to learn more about Upside Corporate, click <a href="https://home.upside.com/welcome_home/?utm_source=blog&amp;utm_medium=medium-blog&amp;utm_campaign=EngineeringBlog&amp;utm_content=recs_and_personalize">here</a>. Even better, if you’d like to join our team and work with Ian, visit our careers page <a href="https://upside.com/company/team?utm_source=blog&amp;utm_medium=medium-blog&amp;utm_campaign=EngineeringBlog&amp;utm_content=recs_and_personalize">here</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=773c9903b521" width="1" height="1" alt=""><hr><p><a href="https://medium.com/upside-engineering/from-content-based-recommendations-to-personalization-a-tutorial-773c9903b521">From Content-Based Recommendations to Personalization: A Tutorial</a> was originally published in <a href="https://medium.com/upside-engineering">Upside Engineering Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Applying Predictive Analytics to Flight Delays]]></title>
            <link>https://medium.com/upside-engineering/applying-predictive-analytics-to-flight-delays-85413ca4939f?source=rss----d98d3544d2e9---4</link>
            <guid isPermaLink="false">https://medium.com/p/85413ca4939f</guid>
            <category><![CDATA[startup]]></category>
            <category><![CDATA[travel]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[predictive-analytics]]></category>
            <dc:creator><![CDATA[Ian Cassidy]]></dc:creator>
            <pubDate>Mon, 01 Oct 2018 14:07:48 GMT</pubDate>
            <atom:updated>2018-10-01T14:07:47.902Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*MzWNA013cP6nbp-C.jpg" /></figure><h3>Motivation</h3><p>Flying for business is full of uncertainty. For travelers with a tight connection window or an arrival time close to an important meeting, even a short flight delay can cause serious anxiety. Nearly a third of <a href="http://bit.ly/2xEc5m1">Upside’s</a> business travelers encountered flight delays in the last 30 days alone. Of the delayed flights, 11% were delayed more than one hour and 4% were delayed more than two! Delays aren’t just a nuisance for the traveler either — in terms of lost productivity, companies could see losses into the hundreds of thousands, if not millions of dollars per year.</p><p>At Upside, we want to help our customers mitigate these scenarios by predicting flight delays prior to their trips and, when possible, providing alternate flight options to get them to their destinations on time.</p><p>According to the <a href="https://www.bts.gov/newsroom/december-2016-airline-on-time-performance">Bureau of Transportation Statistics</a>, domestic airlines reported an on time arrival performance of about 80% for 2015–2016. Of the delayed flights, anywhere between 25–35% of those flights (depending on the month of the year) were delayed due to bad weather. Another 5–10% of delayed flights were due to late arriving aircraft. If we can model these delays accurately, we’ll account for almost half of all delayed flights per year!</p><p><strong>This is how the Delay Predictor came to be.</strong></p><h3>Gathering the Data</h3><p>In order to produce monthly and yearly airline metrics, the Bureau of Transportation Statistics also publishes the underlying data that tracks the performance of every domestic flight operated by large air carriers. We downloaded this dataset for historical flights going back to 2012 and continuously append new data as it becomes available for more recent flights. A similar dataset was published by <a href="https://www.kaggle.com/usdot/flight-delays/home">Kaggle</a> for all flights in 2015. As of last count, we have over 40 million rows of on-time performance data stored in a <a href="https://engineering.upside.com/building-an-advanced-analytics-platform-with-snowflakes-cloud-data-warehouse-bba2e6d55485">Snowflake</a> table that is accessible to our entire data science team.</p><p>This dataset is amazingly clean in terms of having very few missing or extreme values. In addition to having expected fields such as flight number, flight duration, and scheduled departure/arrival times, it also has the delays broken out by type — like weather and late aircraft.</p><p>An obvious place to start the modeling effort was by predicting weather-induced flight delays. In order to do this, we chose to use <a href="https://darksky.net/dev">Dark Sky</a>’s API because it provides both historical <em>and</em> forecasted weather conditions using the same REST endpoint. At this point, I’d like to pause and bow down before whoever built the /forecast endpoint at Dark Sky. It’s super simple to use, relatively inexpensive, and almost always returns a valid response. 
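<p>As a rough illustration of that dual use (the key handling here is simplified), the same endpoint returns a forecast when called with coordinates alone and historical conditions when a Unix timestamp is appended:</p><pre>
import requests

DARKSKY_KEY = "YOUR_API_KEY"  # illustrative; keep real keys in config, not code


def get_conditions(lat: float, lng: float, unix_time: int = None) -> dict:
    """Forecasted conditions by default; historical conditions when a timestamp is given."""
    point = "{},{}".format(lat, lng)
    if unix_time is not None:
        point += ",{}".format(unix_time)  # "Time Machine" request: same endpoint, extra time component
    resp = requests.get("https://api.darksky.net/forecast/{}/{}".format(DARKSKY_KEY, point))
    resp.raise_for_status()
    return resp.json()
</pre>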
Also, our <a href="http://bit.ly/2xOhnut">team at Upside</a> is obsessed with the <a href="https://darksky.net/app">Dark Sky smartphone app</a> and I encourage everyone to download and use it!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/500/0*Z8vj8AuHCU6AmeT0.gif" /></figure><h3>Building the Model</h3><p>Originally, the model was constructed as a <a href="https://en.wikipedia.org/wiki/Binary_classification">binary classifier</a> — either we predict that a flight would be on time or that the flight would be delayed. This performed so well both in training as well as in testing on real-time data that we decided to turn it into a <a href="https://en.wikipedia.org/wiki/Multiclass_classification">multi-class</a> model in order to predict the magnitude of the delay.</p><p>Picking the delay classes was a bit tricky, but, with the help of the histogram below, we decided to go with 0–30 minutes, 30–60 minutes, 60–120 minutes, and 120+ minutes where the 0–30 minutes class is essentially “on time.” It’s interesting that there are clear dips in the histogram at 30, 45, 60, 80, and 110 minutes, which could suggest that the airlines are doing something to avoid being late by those exact durations. One thing we do know is most airlines issue travel waivers (whereby you can change your flight without paying a change fee) if your flight is delayed more<em> </em>than one hour. Thus including 60 minutes as one of the class boundaries made sense. Also, it’s worth noting that there aren’t any reported arrival delays under 15 minutes, which justifies the first delay class as being considered “on time.”</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kJc8kkmxUTI-6Y9kO7zlxw.png" /><figcaption>Histogram of arrival delays</figcaption></figure><p>I’m not going to go into the details of the features or models we are using because, well, that’s proprietary (but hey, we’re <a href="http://bit.ly/2NE0SMs">hiring</a>). However, I will point out a few non-trivial techniques that we are using to improve model performance, which I think can be generally applied to machine learning solutions:</p><ol><li><strong>Remove the effects of seasonality.</strong> Weather-induced airline delays are heavily influenced by seasonality; e.g., hurricanes in the fall (h/t <a href="https://en.wikipedia.org/wiki/Hurricane_Florence">Florence</a>) and snowstorms in the winter result in higher occurrences of delays. Machine learning models tend to perform better if you remove these effects and only train on homogeneous data. For the problem at hand, that means using a small window of flight dates to train the model, and then retraining and updating the model in production as time goes on. More about how we’re doing this in the next section.</li><li><strong>Collect multiple samples of time-dependent signals.</strong> Since the weather is dynamic, we want to include these effects as features to our model. For example, for each flight we query the Dark Sky endpoint at multiple times (say right at take-off and 3 hours before take-off) and derive features based on the temporal derivatives of the weather signals. Luckily, this was easy to do because the Dark Sky API is so awesome (did I mention that already?).</li><li><strong>Balance your classes</strong>. Machine learning models are highly susceptible to bias and one of the biggest causes of bias in a model has to do with class balance (or imbalance). 
Balancing the classes in your training data should be done on a case-by-case basis to reduce bias and overfitting. A great tool that I’ve been using lately for class balance is called <a href="https://imbalanced-learn.readthedocs.io/en/stable/">imbalanced-learn</a>.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/270/1*tOoeIMz9DwPZ52Ewog2W6w.gif" /></figure><p>To give some idea of how our model is performing, the figure below shows the results of the test data performance from a model that was trained using flights from late August. In the <a href="https://en.wikipedia.org/wiki/Confusion_matrix">confusion matrix</a>, the 0–3 labels correspond to our delay classes in ascending delay duration order; i.e., 0 = 0–30 minutes. A weighted <a href="https://en.wikipedia.org/wiki/F1_score">f1-score</a> of 0.62 is quite good for a 4-class problem, since random guessing would result in a score of about 0.25.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/316/1*xqVQcHy4lX27vza3qjBPmg.png" /><figcaption>Test performance of the multi-class flight delay model using late August data</figcaption></figure><p>The above metrics provide an idea of how good the model is at predicting the magnitude of the delay. If we “collapse” the delay classes (1–3) into a single delay class and present the above results as if they were from a binary classifier, we can examine the ability of the model to predict any delay. A <a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">ROC-AUC</a> score of 0.83 for a binary classification problem of this complexity is pretty good! However, looking at the confusion matrix, there are many more false negatives than false positives. This may not be ideal when it comes to predicting flight delays, as we’d like to be overly aggressive in notifying a customer of a possible delay (skew towards having more false positives). In the future, it may make sense to optimize the model for a <a href="http://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score">weighted recall</a> that puts a higher penalty on getting the on-time flights incorrect.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/287/1*v3wVedl4l8Qo-eodJpyZYg.png" /><figcaption>“Collapsed” test performance of the multi-class flight delay model using late August data</figcaption></figure><p>In testing the model on real-time data where we don’t know the exact cause of the delay, we have seen precision and recall scores around 0.4–0.5. In addition, we have been able to predict delays as far as 24 hours prior to the scheduled departure time! This is because we are relying on Dark Sky’s ability to forecast the weather, which is often very accurate.</p><h3>Retraining Pipeline</h3><p>As previously mentioned, we are handling the seasonality effects of weather-induced flight delays by only training the model on small windows of flight dates. As such, the model that is used in our production API for predicting flight delays must be retrained constantly. With the help of our amazing<a href="https://en.wikipedia.org/wiki/Site_Reliability_Engineering"> SRE</a> team, we built a worker that is scheduled using a cron job to automatically retrain the model and store the best result as a <a href="https://docs.python.org/3/library/pickle.html">pickle</a> file in an AWS S3 bucket. 
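<p>The store step itself is tiny; a minimal sketch (bucket and key names are illustrative) looks something like this:</p><pre>
import pickle

import boto3


def store_best_model(model, bucket: str = "example-model-bucket", key: str = "flight-delay/latest.pkl") -> None:
    """Serialize the winning estimator so the prediction API can load it from S3."""
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=pickle.dumps(model))
</pre>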
The retraining workflow and model API look roughly like the block diagram pictured below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*YopyjfsVuF8GF--XD10JTQ.png" /><figcaption>System architecture describing the model training worker and flight delay model API</figcaption></figure><p>One of the most important steps in this workflow is hyperparameter tuning of the different model architectures. We tune several different types of models using <a href="http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html">RandomizedSearchCV</a> or <a href="https://hyperopt.github.io/hyperopt/">hyperopt</a> (depending on the type of model) and then pick the one that gives the best performance. It’s important to note that with any automated machine learning pipeline, logging the inputs and outputs of each step of the process is crucial to monitoring the overall health of the system. We log our results in <a href="https://www.splunk.com/">Splunk</a> and also post the outcome of the training process to a <a href="https://slack.com/">Slack</a> channel.</p><h3>Ongoing Work</h3><p>We are actively working to improve the flight delay model to include other types of delays. Late-arriving aircraft delays and <a href="http://aspmhelp.faa.gov/index.php/Types_of_Delay#NAS_Delay">National Airspace System (NAS) delays</a> are two types of delays that we believe can have a big impact on the performance of our model. Combining the on-time performance data with the <a href="https://flightaware.com/commercial/flightxml/v3/content.rvt">FlightAware API</a> is the approach we are pursuing to build separate models to predict these types of delays. Once built, we plan to ensemble them with the weather model with the goal of not just predicting a delay magnitude, but also explaining the cause of the predicted delay.</p><p><strong>Interested in trying out the Delay Predictor?</strong> Check it out <a href="http://bit.ly/2Oapr2R">here</a> and let us know what you think!</p><p>If you’d like to learn more about Upside Corporate, click <a href="http://bit.ly/2N0tSJ9">here</a>! Even better, if you’d like to work with Ian &amp; join our team, visit our team page <a href="http://bit.ly/2xOhnut">here</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=85413ca4939f" width="1" height="1" alt=""><hr><p><a href="https://medium.com/upside-engineering/applying-predictive-analytics-to-flight-delays-85413ca4939f">Applying Predictive Analytics to Flight Delays</a> was originally published in <a href="https://medium.com/upside-engineering">Upside Engineering Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>