A Reinforcement Learning approach to dynamic pricing in an airline simulation competition

Feb 11, 2022


I have been working at Transavia for four years now, and there have always been projects that I would like to work on, or techniques that I would like to apply in specific domains, to improve as a Data Scientist while delivering more complex and valuable products. Since completing the Coursera specialization on Reinforcement Learning a while back, I have been eager to test it out in the real world. But testing this type of machine learning in a business environment can be risky and challenging to get right. I had been looking for simple projects or simulation environments (other than games) to get started, when I read the ad for the Dynamic Pricing Challenge (dpc) in Data Science Weekly. A sandbox environment hitting close to home.

Table of contents

  1. What is Reinforcement Learning (RL)
  2. Key concepts in RL
  3. The dynamic pricing challenge
  4. Baseline strategies
  5. DQN agent (using tf_agents)
  6. Local simulation environment
  7. Experimentation setup
  8. Experiments
  9. Results
  10. Takeaways & resources

What is Reinforcement Learning?

Let’s start with Machine Learning (ML). In general terms, ML means using algorithms that learn from data to make predictions or decisions without being explicitly programmed to do so. There are a couple of types of ML, the most common being:

Supervised learning: Defined by labeled data, in which historical data is used to train a ML model, with the goal to generalize to unseen data and predict the target variable for the future.
Examples: Classifying spam, predicting sales, image recognition.

Unsupervised learning: Discovering patterns in unlabeled datasets in order to create groups or clusters based on similarities or differences between instances.
Examples: Customer segmentation, recommender systems, anomaly detection.

Reinforcement learning: Taking sequential actions in an environment to optimize reward over time. There are no labels for right or wrong, but feedback is received after each action. RL learns from trial and error.
Examples: Playing games like pong, autonomous driving, investment portfolio management.

Key concepts in RL

Bandits are arguably one of the simplest implementations of RL: a one-step RL problem. So I will start there.

Every A/B-test that a company performs to optimize their website can potentially increase or decrease the conversion rate, and it is common practice to run a test for a couple of weeks in order to get significant results. Perhaps we are experimenting with offering guaranteed cabin space for hold-luggage as an ancillary, and we’d like to try four different visuals to promote the product. It could be that some versions of the test decrease the conversion rate, while others improve the conversion rate (or keep it stable while increasing ancillary revenue). Multi-armed bandits can help to reduce the loss on worse-performing versions of the test, by gradually moving a larger chunk of visitors to the winning variants. Fewer versions will battle it out, and at the final stage of the test, 100% of traffic is directed to the winning variant. By trying out the different options and accumulating enough feedback, the bandit learns which options perform best and starts exploiting that information.

Netflix does something similar but uses contextual bandits to recommend you movies and series. The number of products that they can recommend is a lot larger, and in this scenario, there is not one winner like in the A/B-test, but the winner(s) might be different for every user. The context consists of your watch history, perhaps combined with your demographic information and other data. With the continuous inflow of new titles, this is a way for Netflix to distribute the testing of the titles over their large customer base, gather data, and recommend titles to the ones with similar contexts as the people that watched the show. Typically your recommendations will be very different from your partner’s, and occasionally new items will appear in your feed that might be just outside of your regular binge list. The latter also helps to overcome that funnel of one-sidedness, instead helping you to find new genres. Some bandit/RL definitions:

Action: The different options that can be chosen at each point in time. Versions of the A/B-test or titles for Netflix.

Exploration: The (initial) data gathering that is required to make informed decisions. E.g. in the A/B-test, all options are tried out for some amount of time first, before we start directing more traffic to the seemingly better alternatives.

Exploitation: The counterpart of exploration in which the acquired knowledge is utilized to pick the better action. Typically exploitation increases over time, but some continuous exploration is common, especially in changing environments like Netflix and their new shows.

Policy: The process of selecting actions. To ensure a good balance between exploration and exploitation, a policy is a selection process on top of the known information about the environment.
Example: An epsilon-greedy policy with an epsilon value of 0.2 picks a random action 20% of the time, and what is believed to be the best action otherwise.
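To make the epsilon-greedy example concrete, here is a minimal sketch (the function name and signature are my own; any list of estimated action values will do):

```python
import random

def epsilon_greedy(q_values, epsilon=0.2, rng=random):
    """With probability epsilon pick a random action (explore),
    otherwise pick the action with the highest estimated value (exploit)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# With epsilon=0 the policy is fully greedy: it picks action 1 here.
epsilon_greedy([1.0, 5.0, 3.0], epsilon=0.0)
```

With epsilon=0.2, roughly one in five selections is random, which keeps the bandit gathering data on the seemingly worse options.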

Agent: The one behind the wheel picking the actions, based on what is learned from the gathered data and the policy.

Environment: This is what the agent interacts with. The action is the input for the environment, a version of a webpage is shown or a movie is recommended. The environment then returns some feedback, a click or view in this case, which is feedback for the agent to learn from and update its understanding of the environment.

Now when it’s not a single action but a sequence of actions that need to be performed, RL’s strength truly comes to light. To stay in the e-commerce and recommendations space, the next-best-action (NBA) problem is a good example. It is like the contextual bandit but with an extra time dimension.

When you visit the website of an airline or any other online business for that matter, browse flights, and leave without buying, you will typically see a lot of retargeting banners. Maybe you are a loyal customer and subscribed to the newsletter. The e-mails that you will receive are usually a mix of generic newsletters, sales, and personalized recommendations. And if you really are a frequent flyer, you might also have the app with push notifications enabled.
At each point in time, a combination of the right channel, the right product, and the right message needs to be picked (assuming price is a combination of timing and product). We all know companies that push way too much and become so annoying you unsubscribe or become averse to the brand. So, based on the current state a customer is in (orienting, actively looking, preparing for a holiday, or perhaps just returned), the next best action can be very different, and sometimes the best action is no action at all.

Testing a NBA-engine is not as simple as an A/B-test. Here you strive to increase revenue over the long run, perhaps even lifetime value (LTV). Thus a direct spike in conversion might not imply an overall improved result, but if you know when to approach a customer for their next summer holiday, city trip, or winter sport and any upsell interactions in between, a higher LTV might just be achieved. Transforming customer context into the right features and having a good policy can be the right ingredients for an agent-based NBA to learn when to take which action.

Transavia customer journey

State: The context about the environment that is required to make a good next action step. In the NBA example, the state should include information about the customer for which an action is provided, i.e. the history of purchases, interactions, etc. But the state also captures info about the environment, e.g. if a customer is not signed up for the newsletter, sending an e-mail is not a valid action.

Reward: The reward is the feedback that an agent needs in order to learn whether a step was good. This could be the revenue of a purchase in the NBA setup. But the reward structure can be customized to speed up learning, e.g. by giving penalties for every e-mail sent or banner shown. This way an agent could learn quicker that balance is important.

Markov Decision Process (MDP): A sequence of time steps in which an interaction with the environment takes place. At each point in time, the current state is used to take an action step, after which the environment provides feedback in the form of a reward about that step and the new state.

Episode: Some MDPs are continuing tasks, like the NBA problem, but others are limited by time, a number of steps, or have some natural ending. An episode can have a fixed or variable length. E.g. a chess game ends at checkmate, which happens after a variable number of actions.

Discount: In RL an agent aims for an optimal policy, a set of steps that results in a trajectory (strategy) that delivers the most reward. It does so by estimating the direct reward of each next step plus an estimation of the reward of all future steps when taking that action. When it’s a continuing task, calculating all future rewards is impossible, and estimating the reward for a large number of steps in the future often comes with uncertainty. Therefore a discount percentage will ensure that more weight is attributed to expected rewards in the near future, and less to the more uncertain rewards in the distant future.
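The discounting idea can be sketched in a few lines (a minimal illustration, not code from the competition agent):

```python
def discounted_return(rewards, gamma=0.95):
    """Work backwards through the episode: G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A reward of 10 received now is worth 10; the same reward 20 steps
# in the future is discounted to roughly 10 * 0.95**19, about 3.77.
discounted_return([10.0] + [0.0] * 19)
discounted_return([0.0] * 19 + [10.0])
```

The closer gamma gets to 1, the more weight distant rewards keep; a gamma of 0 makes the agent purely myopic.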

Transition & termination: In order for an agent to train its network and learn, it is important that the trajectories are captured. When you move from one state to the next, it's called a transition. Training data is then provided to the agent as state -> action -> reward -> next state, which will tell us how the action affected the environment. When the final state is reached, it is called termination. At this point, you truly know the accumulated reward for the episode, and if you don’t provide the agent with the information that this was the final state, the expected future rewards will not be correct and just continue to grow.

In Summary
RL is particularly useful when a problem can be set up as an MDP, and it holds up well when the environment/reward is non-stationary, i.e. it changes over time. Exploration comes at a cost, but some form of exploration is required to arrive at, and stay at, an optimal policy. The agent interacts with an environment to utilize and continuously update its understanding thereof. The state is the agent’s input for determining what the next best action might be, with the goal of optimizing reward over a sequence of steps.

The dynamic pricing challenge

The dynamic pricing challenge is actually three different competitions: airline, retail, and e-commerce, each with a slightly different set-up. The airline challenge is the only one in which you have to deal with limited stock, and it is set up as a duopoly. The idea is as follows:

Each time period, each player posts a price, which serves as input to a simulation environment whose demand mechanism is unknown to the players. After serving a price to the environment, each player receives their sales and the competitor's price for the past time period. Based on this information you can then serve a price for the next time period. This process is repeated for 100 selling periods and is called a season (episode). You have 80 stock (seats) per season; if you sell out, no more revenue is generated in the remainder of that season.

Players are randomly selected to play against each other in small competition matches. Each match, the demand mechanism is a bit different; in general, there is a slight increase in demand during the season. If you are priced lower than your competitor, you will generate more revenue. But if you are both overpriced, neither of you will generate any sales. It is your job to get a feeling for the demand curve in the current season plus the strategy of the competitor, and act accordingly. The player who generates the most revenue over all selling seasons wins.

One time period in a selling season (source: official DPC website)
A selling season consists of 100 selling periods (source: official DPC website)

Baseline strategies

As with any Data Science solution, it’s a good idea to test simple implementations first so that you have a benchmark to compare more complex solutions to. I tried three baseline strategies:

  • Fixed price: Always use a price of 50 euros.
  • Linear price increase: I’ve tried a couple of start and endpoint combinations, and found that starting at a price of 40 euros and gradually increasing the price to 60 euros at ‘departure day’, worked best.
  • Linear load factor: Start with a price of 50. At every time period, check if we are selling our stock too quickly or too slowly. If we have sold more than time_period * 0.8 (80 stock / 100 selling periods), increase the price by 2 euros, or decrease the price otherwise. I’ve added a 5 euro step increase for when the competitor sold out, and experimented with some middle ground to keep the price equal within some boundaries.
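The three baselines above can be sketched as simple pricing functions (a minimal sketch; the signatures and defaults are my own, not the submitted code):

```python
def fixed_price(t, sold, price):
    """Baseline 1: always 50 euros."""
    return 50.0

def linear_increase(t, sold, price, start=40.0, end=60.0, periods=100):
    """Baseline 2: ramp linearly from 40 at t=0 to 60 at 'departure day'."""
    return start + (end - start) * t / (periods - 1)

def linear_load_factor(t, sold, price, stock=80, periods=100, step=2.0):
    """Baseline 3: raise the price when ahead of the linear sales target
    (0.8 seats per period), lower it when behind."""
    target = t * stock / periods
    return price + step if sold > target else price - step
```

For example, at period 50 with 50 seats sold we are ahead of the 40-seat target, so the load-factor baseline raises the price by 2 euros.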

After each night of simulations, the results are provided in CSV format, which enables the players to test and evaluate their strategies. Below is an example of one selling season for the linear price increase baseline strategy. In this example the competitor chose a follower strategy, using our price from t-1. In the image below you can see that this strategy works fairly well in some seasons; in other seasons it either sold out too quickly (spill) or remained with a large stock (spoil).

On the x-axis, the selling period, starting at 0 ending at 99.
Load factor is the ratio of passengers to available seats, starting at 0 and in this case, ending at 100%.
The black lines are the competitor's sell-out period and our sell-out period. In this example, we sell out close to the departure day which is pretty good.
The total revenue is the sum of seats sold * selling price in each selling period. Although we don't know the exact demand at each selling period for the competitor, we can tell our revenue in this season is a lot higher, since we fill a large part of our plane at higher prices after our competitor sold out.

Visualization of one example selling season


DQN agent (using tf_agents)

Now to the fun part. If technicalities and code gists aren’t for you, please skip to the experiments & results. The RL implementation that I have used during the competition is a Deep Q-Network (DQN) agent, which uses a neural network for value function approximation. But let’s start with a Q-table.

To start simple, let’s assume we only fly to Alicante, on Saturdays, during the summer months. Demand is very stable, and there is no Covid (wouldn’t that be nice). Our only competitor on that route, CheapFly (not a real airline), is not so clever and always uses the same price points. We have 40 different price points which we can choose from at each time period, from 20 to 98 euros in steps of 2 euros.

A Q-table is in essence a lookup table that tells us the value of an action in a given state. Since we are interested in rewards over a sequence of steps, or perhaps an entire episode, the Q-value represents the expected reward of taking an action in the current state plus a discounted value for expected rewards in successor states. The Q-values can be initialized arbitrarily, and are updated after each time step.

The agent applies a policy (e.g. epsilon-greedy) on top of the Q-values. The policy determines which action is taken: it could be the action with the highest Q-value (exploitation), or it could be some other action (exploration). After taking a step in the environment and determining the next action in the new timestep, we get a new Q-value. This Q-value is then used, together with a learning rate, to update the Q-value of the previous step (we now know a little more about the environment and whether that previous Q-value made sense).
The closer we get to the terminal state, the more accurate the Q-values become, and this slowly trickles down to the state-action pairs further away from the terminal state. When repeating this for many episodes, our understanding of the environment becomes better, which results in more relevant trajectories picked by the policy, which in turn results in better data to learn from. The Q-values and policy thus strengthen each other and work together to converge to an optimal policy. By using this strategy, we don’t have to try out all possible trajectories (which would be 40 (actions) to the power of 100 (time periods) in this example), and can still learn an optimal policy.
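The update described above can be written out as a few lines of tabular Q-learning (a generic sketch of the textbook rule; the constants and the dict-of-lists layout are my own, not the competition code):

```python
from collections import defaultdict

N_ACTIONS = 40                               # one Q-value per candidate price point
ALPHA, GAMMA = 0.1, 0.95                     # learning rate and discount
Q = defaultdict(lambda: [0.0] * N_ACTIONS)   # Q[state] -> value per action

def q_update(state, action, reward, next_state, done):
    """Nudge Q(s, a) towards reward + discounted best value of the next state.
    `done` stops the bootstrap at the terminal state."""
    target = reward if done else reward + GAMMA * max(Q[next_state])
    Q[state][action] += ALPHA * (target - Q[state][action])
```

Each call moves the old estimate a fraction ALPHA towards the new target, which is how the terminal-state accuracy slowly trickles back to earlier states.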

Q-values and policy go hand in hand to gradually improve to an optimal policy

However, this utopian world where demand is always stable, our competitor is stupid and Covid doesn't exist, is unfortunately not a good reflection of the real world nor the challenge’s environment. The demand curve is always a bit different, our competitors will use different pricing strategies, and in the real world, there are a lot more variables that need to be considered to reflect the state properly. With these additional variables, the number of state-action pairs increases exponentially, and a Q-value table no longer makes sense.

We can however try to estimate the Q-values using a supervised learning technique. This can be any supervised learning algorithm, but neural networks generally work well here, and so the deep Q-network is a common implementation in many open-source frameworks. Since you got this far reading the article, I’m assuming you have a basic understanding of neural networks, otherwise find a basic explanation here: Understanding Neural networks

The architecture of a Q-network is very similar to that of a multi-class classification neural network. The features that represent the state are on the left-hand side; then there are hidden layers with activation functions like ReLU, which allow for non-linear relations and complex interactions; and on the right-hand side is the output layer. However, instead of using the softmax function to transform the outputs into a probability distribution, the outputs remain absolute (Q-)values. So, you could say it is actually more of a multi-output regression setup.
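A dependency-free sketch of that forward pass makes the point about the output layer (toy weights and dimensions of my own choosing; a real Q-network would be trained, not hand-set):

```python
def relu(xs):
    return [max(0.0, x) for x in xs]

def dense(xs, weights, biases):
    # One row of weights per output neuron.
    return [sum(w * x for w, x in zip(row, xs)) + b
            for row, b in zip(weights, biases)]

def q_forward(state, hidden, out):
    """State features -> ReLU hidden layer -> one raw Q-value per action.
    Note: no softmax on the output, the Q-values stay absolute."""
    return dense(relu(dense(state, *hidden)), *out)

# Toy weights: 2 state features, 2 hidden units, 3 actions.
hidden = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
out = ([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]], [0.0, 0.0, 0.0])
q_forward([2.0, -3.0], hidden, out)   # one value per action
```

The output list has one entry per action, and nothing forces the entries to sum to one, which is exactly the multi-output regression setup described above.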

The Q-network of an agent can be a Recurrent Neural Network (RNN). If it isn't, the state features should represent the entire context well. The agent doesn't directly access data from previous steps in the trajectory, so if recent states, actions, and rewards are relevant for a good value function, they should be included in the features.

There are many implementations of the DQN agent, and like other open-source ML libraries, the high-level RL APIs consist of building blocks to create your training flow. The main ingredients that I will use are:

  • Environment: A class that can take a step based on a provided action and return an observation (new state), reward, step type, and discount. It is what the agent will interact with.
  • Agent: The agent class itself.
  • Network: A custom neural network, used for function approximation.
  • Policy: Maps the observation of a timestep to an action.
  • Replay buffer: Experience in the form of trajectories is stored in a replay buffer to sample later to train the weights of the network. Batched and shuffled training data improves network convergence and ensures better generalization to unseen data.
  • Driver: Helps in running loops of data collection, performing a specified number of steps or episodes, continuously going back and forth between policy providing action to the environment, and reusing the returned observation. The driver can parallelize and optimize this loop for faster data collection and learning.
  • Checkpointer: The weights of the agent’s network, the replay buffer’s collected data, and state of the policy all change over time and might need to be used at a later stage or in a different environment. A checkpointer saves and restores the training state and makes deployment easier.

Knowing these separate concepts, this is how you can set up a DQN agent using tf_agents, a high-level API built on top of TensorFlow.

Local simulation environment

In the code snippet above you can see I’ve imported a custom environment. The environment expects an action from the agent (and its policy) and returns a time step with the reward, observation, and step type. In this scenario, the custom environment actually consists of a few components in order to provide all that information.

The AirlineEnvironment class inherits from PyEnvironment and is wrapped in a TFPyEnvironment to make use of TensorFlow's efficiencies. It has the step and reset functionality, which organizes the transition and termination of an episode. It thus holds the main functionality to create trajectories that the agent can learn from.

When taking a step, the reward needs to come from somewhere. In a real-world scenario, the agent might provide actions in the form of a batch process and get feedback on its actions the next day through an operational system. Or, in a more real-time setup, the interaction between the agent and e.g. a frontend that users interact with would happen via API requests. To test and improve the RL system that was being created for the dpc quickly, I created a simulator that hopefully works similarly to the competition environment. It determines a customer demand curve, the price sensitivity of the market, and a competitiveness score, and has a function that returns demand for both ourselves and our competitor based on both of our price points. It resets to a new season with different dynamics after 100 selling periods or when the reset function is called (e.g. when our stock ran out). The demand is sampled from a Poisson distribution to ensure some randomness (like in the real world). At each time step the AirlineEnvironment calls the demand function of the AirlineSimulation to calculate the reward for that step.

To do that without actual competition, a third class was created that selects different competitor pricing strategies, which are initiated according to a probability distribution per season. They include random curve, linear increase, fixed price, follower strategy, random ranges, and high-start random ranges, and are based on competitor strategies that were found while testing our baseline submission. The competitor strategy is different for each selling season, as we sample the statistics that generate e.g. the polynomial curves or the ranges and slope of the linear increase.

I have created one more class, a game object that keeps track of the current state and saves all of the competition results in a dataframe to access later for analysis. I chose to keep this separate from the AirlineEnvironment as it then remains easily accessible after wrapping the AirlineEnvironment in a TFPyEnvironment. Together, the whole flow looks as follows:

The flow of interactions between different components

The code for the custom environments can be found below.
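The original gists are not embedded here, so as a stand-in, here is a minimal, self-contained sketch of what the AirlineSimulation could look like. The class name comes from the text; the curve shape, parameter ranges, and the undercutting rule are my own illustrative assumptions:

```python
import math
import random

class AirlineSimulation:
    """Toy demand simulator: a season-specific demand level with a slight
    upward trend, price sensitivity, and a bonus for undercutting the
    competitor, with Poisson noise on top."""

    def __init__(self, periods=100, rng=None):
        self.rng = rng or random.Random()
        self.periods = periods
        self.reset_season()

    def reset_season(self):
        # Sample new dynamics for each selling season.
        self.base = self.rng.uniform(0.5, 1.5)              # base demand level
        self.trend = self.rng.uniform(0.0, 0.01)            # demand rises during the season
        self.sensitivity = self.rng.uniform(0.01, 0.03)     # price sensitivity of the market
        self.competitiveness = self.rng.uniform(1.0, 2.0)   # bonus for the cheaper airline

    def demand(self, t, our_price, comp_price):
        """Sampled sales for us and for the competitor at time step t."""
        level = self.base + self.trend * t

        def expected(own, other):
            lam = level * math.exp(-self.sensitivity * own)
            return lam * self.competitiveness if own < other else lam

        return (self._poisson(expected(our_price, comp_price)),
                self._poisson(expected(comp_price, our_price)))

    def _poisson(self, lam):
        # Knuth's algorithm; fine for the small rates used here.
        threshold, k, p = math.exp(-lam), 0, 1.0
        while p > threshold:
            k += 1
            p *= self.rng.random()
        return k - 1
```

A wrapping PyEnvironment would call `demand(...)` each step, convert our sales into a reward, and reset the simulation after 100 periods or a sell-out.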

Experimentation setup

ML is often an iterative process with exploratory data analysis, data preprocessing, feature engineering, experiment design, testing models, and learning from their performance to improve one of the earlier steps until satisfying results are achieved. With RL the flow looks a bit different, but the iterative process is surely there. With many hyperparameters to tune, different strategies to try, and long training times, there is a clear need for a decent experimentation setup.

At Transavia we have built our experimentation and production flow on top of Azure Machine Learning (AML) and Azure DevOps. We have created templates for both infrastructure as code deployment as well as CI/CD, making it easy and fast to start experimenting on a new project like dpc.
AML enables us to do dataset versioning, save trained models in a model store, submit and schedule training and scoring pipelines, use different types of cloud compute, save and reuse Docker images, deploy API endpoints, and run and compare experiments. That last one being specifically useful for an experimentation project like the dpc.

It takes a significant amount of time before the agent gets to useful trajectories, with experiments ranging from a few hours up to 2 days, running on GPUs. Thanks to AML we could distribute the experiments over a couple of machines and run a separate set of hyperparameters on each node of each machine. Every experiment has its own snapshot of the code that was used, outputs like performance plots, a CSV with saved trajectories for further analysis, and logging of the main metrics, to ensure reproducibility and serve as a record of all experiments. This helped a lot in keeping everything neat as the number of experiments and the size of the codebase grew, enabling us to move back and forth between experiments and make sense of what works well and what doesn't.

An overview of different experiment categories that were used throughout time. Each experiment folder contains the experiments on a certain topic, with each individual experiment containing child runs with different hyperparameter combinations.


Experiments

After having a go with the tf_agents library for a while, starting with the famous OpenAI Gym CartPole example and step by step transforming that into our custom environment with different state variables, added competitor strategies, etc., it was time to move away from notebooks and start submitting experiments to AML.

Since every season has a different demand mechanism that is unknown beforehand, it is important that the agent learns this mechanism as well as possible by interacting with it. The state variables should therefore represent these interactions with the environment in a way that lets the agent quickly learn how to best approach the current season. Based on some EDA, I chose to include interaction variables from the 10 most recent time steps: our price, our competitor's price, our demand, and whether our competitor still has capacity (all t-1 until t-10), and combine those with the current selling period and our current load factor. This should give the agent a good sense of the demand trend under circumstances like competitor price and selling period. The idea was to also predict the recent sales and current load factor of our competitor and add those to the state variables, but I didn't really get to that (and it worked well without them).
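A sketch of how such a state vector could be assembled (the zero-padding for young seasons and the scaling are my own assumptions; the exact encoding used in the competition isn't shown here):

```python
HISTORY = 10   # number of recent time steps kept in the state

def build_state(history, t, load_factor, periods=100):
    """Flatten the last 10 interactions plus current period and load factor.

    `history` is a list of (our_price, comp_price, our_demand, comp_has_stock)
    tuples, most recent last; shorter histories are zero-padded."""
    padding = [(0.0, 0.0, 0.0, 0.0)] * max(0, HISTORY - len(history))
    recent = padding + list(history)[-HISTORY:]
    features = [value for step in recent for value in step]
    features += [t / periods, load_factor]
    return features

state = build_state([(50.0, 48.0, 2.0, 1.0)], t=5, load_factor=2 / 80)
len(state)   # 42 features: 10 steps x 4 signals + selling period + load factor
```

This fixed-length vector is what a plain (non-recurrent) Q-network needs: all relevant recent context is baked into the features themselves.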

The way I initially set up the reward structure was to solely attribute the generated revenue as reward, naively assuming the agent would figure it out. But since the Q-values represent the expected reward of the next step plus the expected reward of all consecutive steps, and are initialized randomly, the agent started to learn that choosing lower prices would result in more direct revenue. The Q-values shifted towards lower prices from the start, as direct revenue + some average of future steps (because of random initialization) is higher than no direct revenue + some average of future steps. So I soon learned I needed to add some penalties to help the agent learn more rewarding behaviour.

Since the first problem was that the agent sold out too quickly, I added a penalty of 10 (euros) per time period between the sell-out day and the ‘departure date’. E.g. if the agent sold out at selling period 40, this would mean a penalty of 600 euros, which is pretty large if you sold all your stock at 20 euros each.
This, however, shifted the agent to do the exact opposite and avoid selling out as much as possible by setting very high prices. So another penalty was added, of 50 euros per unsold seat.

Left: Total seats sold per selling season. Right: Sellout period per selling season

After adding these two penalties and switching from an epsilon-greedy to a Boltzmann temperature policy (simply put: instead of exploring random alternatives, explore options that have relatively high Q-values), the average reward per season increased during training and the agent learned to sell out close to the end of the season or remain with only a few seats left. However, it did so by consistently picking a high price and only every now and then dropping the price drastically to pick up load factor. It is questionable whether this would result in near-optimal performance, but it certainly is not customer-friendly behaviour that you would like to apply in the real world. Therefore, a price difference penalty was added of 0.5 euros for every euro of price difference between time periods. Again, this resulted in undesirable behaviour, as the agent now tried to pick one price point (after some initial interaction with the env, ~10 selling periods) and kept that throughout the season. Depending on the competitor strategy, this would result in selling few seats early on in the season and more seats later on (when demand and willingness to pay are generally higher). This might work in a simulation environment, but IRL competitors would take notice and increase their price a bit early on to not sell out too quickly either.
So, more penalties? You see now how this is an iterative process?

Agent starting to learn and increase reward per selling period/episode
Spiky pricing behaviour of agent after adding early sell-out and stock remainder penalties

We don’t want to push the agent into filling the load factor linearly throughout the season, as different pricing curves might be optimal for each flight/season, but we do want the agent to sell the stock gradually and increase its price marginally throughout the season (although maybe not linearly). So, by now adding an exponentially increasing penalty for deviating from a linear load factor, the agent could determine whether keeping a bit more stock for later (or vice versa) might be worth it. Finally, trajectories started to look like something that you could see happening in a real scenario, and performance increased some more.
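Pulling the shaping ideas together, the per-step reward could look roughly like this (the 0.5, 10, and 50 coefficients come from the text; the exponential base and scaling of the load-factor penalty are illustrative assumptions):

```python
def shaped_reward(revenue, price, prev_price, t, sold,
                  stock=80, periods=100, sellout_period=None, final_step=False):
    """Revenue minus the shaping penalties described in the experiments."""
    reward = revenue
    reward -= 0.5 * abs(price - prev_price)          # price-jump penalty
    deviation = abs(sold / stock - t / periods)      # distance from linear load factor
    reward -= 1.5 ** (10 * deviation) - 1            # grows exponentially with deviation
    if final_step and sellout_period is not None:
        reward -= 10 * (periods - sellout_period)    # sold out too early
    if final_step and sellout_period is None:
        reward -= 50 * (stock - sold)                # unsold seats at departure
    return reward

# On target, price unchanged, mid-season: no penalties apply.
shaped_reward(100.0, 50.0, 50.0, t=50, sold=40)
# Selling out at period 40 costs 10 * 60 = 600 at the end of the season.
shaped_reward(0.0, 50.0, 50.0, t=100, sold=80, sellout_period=40, final_step=True)
```

Note how the exponential term is mild near the linear target but ramps up quickly, which leaves the agent free to deviate a little when it expects that to pay off.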

Pricing pattern after applying exponential loadfactor difference penalty

The agent now learned to deal with different competition strategies and, funnily enough, also learned to do last-minute sales when we unexpectedly neared the end of a season with a bit of stock left. Sometimes it even waited for the competitor to sell out before swooping in, thus finding the boundaries of the complexity of the simulation environment. So I felt the agent was now ready to move from our experimentation environment to the dynamic pricing environment.


Results

Doing all of the above experiments while figuring out RL best practices took me the better part of December. In the meantime, I even moved abroad, but one week before the deadline the agent was ready to be submitted. Just enough time to get in some online learning and let the agent learn from real competition interactions. Unfortunately, the environment wasn't ready for us (read: our agent). One issue that I already knew about from submitting the baseline scripts was that nothing is returned after a simulation run: any online learning is 'forgotten' by the agent and has to be recreated from the official competition results. There also isn't any feedback for selling period 100 in each season. This is not ideal, but also not a dealbreaker.

What I came to realize during the submission of the agent, though, is that loading TensorFlow takes ~10s, while there is a time cap of 0.2s per price request with a max of 5s. The organizers were celebrating Christmas with their dear ones and couldn't update the environment settings remotely so shortly before the final deadline. So, I worked my way around that by loading packages, the pre-trained agent, etc. in a separate Python thread while returning a constant price of 50 for the first season. After that hack, however, there were no competition results for any player in the last days leading up to the deadline. The submission of my agent passed the submission tests, but this meant no online training and, more importantly, no testing of the submission script before the final competition simulations (every simulation before that was just for testing and didn't count for the final ranking).
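A minimal reconstruction of that workaround, with a sleep standing in for the heavy imports and checkpoint restore (names and timings are illustrative, not the submitted script):

```python
import threading
import time

_agent = None

def _load_agent():
    """Stand-in for the slow part: importing TensorFlow and restoring the
    checkpointed agent took ~10s, far over the 0.2s-per-request cap."""
    global _agent
    time.sleep(0.2)        # pretend this is `import tensorflow` + checkpoint restore
    _agent = object()      # pretend this is the restored DQN agent

threading.Thread(target=_load_agent, daemon=True).start()

def price_request(observation):
    """Serve the fallback price until the agent has finished loading."""
    if _agent is None:
        return 50.0        # constant price while loading (the first season)
    return 48.0            # placeholder for agent.policy picking a price
```

Because the loader runs in a daemon thread, each price request returns well within the cap, and once `_agent` is set the requests switch over to the policy.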

With all of these excuses you probably feel it coming: the final submission did not work in the competition environment. After trying to debug it remotely with one of the organizers, we concluded that the environment needed some changes to work well with an RL solution, and that I was simply a bit too late to join the competition with mine.

It would have been nice to benchmark RL against the operations research (OR) solutions that tend to dominate the competition, but of course, the learning experience weighed heavier than that. It was a great way to start testing RL in a sandbox environment, and who knows, this might just be the first step toward implementing it at Transavia to determine our ticket prices.

Takeaways to implement in real life

The experiments show the potential of RL in scenarios where the goal is to optimize reward over a sequence of steps, just as airlines aim to optimize revenue between go-live and departure day. They also show (once again) that neural networks are black boxes and that a deep RL solution must be very stable before being put into production. To get there, I see a couple of next steps:

Improve state variables: In the real world we might not know the exact demand curve, but a lot is known in advance about booking behaviour for a specific flight. Even by just using dummy variables for route, departure month, day of the week and departure time (bucketed), we might already give the agent useful information to pick price points on. And there is much more information the agent could base a pricing strategy on (think market share, internal/external alternatives, public holidays, promos, etc.).
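Such a state vector could be sketched like this. The category values (routes, time buckets) and the scaling constants are illustrative assumptions, not real Transavia data:

```python
# Illustrative categories; real ones would come from the flight schedule.
ROUTES = ["AMS-BCN", "AMS-LIS", "EIN-FAO"]
TIME_BUCKETS = ["morning", "afternoon", "evening"]

def one_hot(value, categories):
    """Dummy-encode a categorical value as a list of 0/1 floats."""
    return [1.0 if value == c else 0.0 for c in categories]

def encode_state(route, departure_month, weekday, time_bucket,
                 days_to_departure, stock_left):
    """Concatenate dummy variables and roughly scaled numeric features."""
    return (
        one_hot(route, ROUTES)
        + one_hot(time_bucket, TIME_BUCKETS)
        + one_hot(departure_month, list(range(1, 13)))
        + one_hot(weekday, list(range(7)))
        + [days_to_departure / 100.0, stock_left / 80.0]  # crude normalization
    )

state = encode_state("AMS-BCN", 7, 4, "evening", 30, 55)
```

In practice these features would feed straight into the Q-network's input layer alongside the in-season observations.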

Learn from real experience: All of the above state variables would be very complex to integrate into a simulation environment. But, we have a long history of pricing and booking data which we can feed as trajectories to the agent to learn from as a way to kickstart the agent before having it interact with either a simulator or the real world. This way, we overcome the cold-start problem and have it start with reasonable exploration that wouldn't cost too much. From there on, it can try and optimize the human pricing strategies that have been applied in recent years.
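A minimal sketch of that kickstart idea, assuming the historical pricing and booking logs have already been reconstructed into (state, action, reward, next state) transitions (the record layout and buffer size here are hypothetical):

```python
from collections import deque
import random

# Illustrative historical records, not real data: each row is one pricing
# decision reconstructed from past booking logs.
historical_log = [
    {"state": [0.90, 1.00], "action": 2, "reward": 120.0, "next_state": [0.89, 0.95]},
    {"state": [0.89, 0.95], "action": 1, "reward": 80.0,  "next_state": [0.88, 0.91]},
]

# Fill a replay buffer with historical transitions before any live interaction.
replay_buffer = deque(maxlen=100_000)
for record in historical_log:
    replay_buffer.append((record["state"], record["action"],
                          record["reward"], record["next_state"]))

# The agent can now pre-train on sampled minibatches, sidestepping the
# cold-start problem before it ever sets a live price.
batch = random.sample(list(replay_buffer), k=min(2, len(replay_buffer)))
```

With tf_agents specifically, the same idea amounts to writing these transitions as trajectories into the agent's replay buffer and running training steps on them before deployment.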

Improve the simulation environment: Although I just mentioned that I believe it is too complex to mimic real-world market dynamics, I do think it would still be useful to get as close as possible. This would help us figure out what levels of exploration result in acceptable trajectories, whether we should add more penalties to our reward structure to prevent undesired behaviour, and how the agent performs when competing against a clone of itself. It could also serve as a tool to demo to business stakeholders how the agent would behave in a real-world scenario.

In summary
RL is a powerful tool that could help optimize business strategies in many domains, but it is harder to train, as it learns from interaction. Simulation environments can help, but some gap will likely remain between simulations and the real world, so care must be taken when designing reward structures and exploration strategies. Perhaps even more so when historical data is used for training and validation on a simulation environment is skipped or not possible.

I learned a lot and hope to apply it in a use case at Transavia soon. If you liked the code gists but would like to see a bit more, have a look at my GitHub repo; you can find the link in the resources below.

If you like the topic and want to see more of it in action while watching a good (and even slightly emotional) documentary, I recommend the AlphaGo documentary created by Google, freely available on YouTube; find the link below.

Thanks for reading.

1) My GitHub repo on the DPC: Github link
2) RL specialization by the University of Alberta: Coursera link
3) A useful tutorial (full example with code) on tf_agents: Medium link
4) Tensorflow series of guides & tutorials on tf_agents: Tensorflow link
5) Useful article series on RL concepts: Towardsdatascience link
6) AlphaGo movie: YouTube link