Simulation of Incentive Design: What is the Most Appropriate Reward System? Part 2

DECON Simulation · Jun 17, 2019

Simulation of Incentive Design

Part 1: The Problem of Reward System Design and the Simulation Environment
Part 2: Simulation Result Overview via Heatmap
Part 3: Simulation Result Analysis

Simulation Result Overview via Heatmap

In the previous article, we discussed the problems that reward systems face and what we set out to resolve through simulation. We also introduced the settings and the agent and environment functions used in the simulation. In this article, we look into the simulation results.

Learning and Convergence of Agents

Before going into a detailed analysis of the gain distribution methods, let us look at the overall landscape through a heatmap. The heatmap visualizes the agents’ learning process and confirms that the agents converge in a direction consistent with the configured values.

The following heatmap displays the probability of each action for every agent.

  • The x-axis represents the 100 agents, numbered 0 to 99 and lined up in order of asset possession.
  • The y-axis represents actions as integer values from 0 to 9: 0 means the agent did not write a review, 1 means the agent invested only that much effort in writing a review, and 9 indicates maximum effort.
  • Each cell’s color shows the probability, between 0 and 1, of an agent taking a given action; the probabilities of all actions an agent can take sum to 1. The darker the color, the closer the probability is to 0; the brighter the color, the closer it is to 1.
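To make the layout concrete, here is a minimal sketch of how such a heatmap could be drawn with matplotlib. This is not the original simulation code; the `policy` matrix, its shape, and the random placeholder values are assumptions for illustration only.

```python
# Minimal sketch of the heatmap layout described above (not the original code).
# `policy` is an assumed (10, 100) array: policy[a, i] is the probability that
# agent i (0 = wealthiest, 99 = poorest) takes action a (effort level 0-9).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
policy = rng.dirichlet(np.ones(10), size=100).T  # placeholder: each column sums to 1

fig, ax = plt.subplots(figsize=(10, 3))
im = ax.imshow(policy, origin="lower", aspect="auto", vmin=0.0, vmax=1.0)
ax.set_xlabel("agent (sorted by asset possession)")
ax.set_ylabel("action (effort level 0-9)")
fig.colorbar(im, label="probability")
plt.show()
```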

In the above GIF, the time axis represents the progression of episodes. Since the colors eventually stop changing, we can see that each agent converges to a specific pattern once enough episodes have elapsed. The analysis below is based on this converged state, after the episodes have progressed sufficiently.

Result of the Proportional Method

The agents learned and converged in the following manner.

We could see that each agent’s effort level (action) varied with its asset size, because the return value of the cost function changes with the asset possession ratio. In other words, for the wealthiest agent, agent #0, the most probable action was 0, and indeed the agent did not write a review. Meanwhile, agents with smaller assets were more inclined to write reviews and, though subtly, showed a higher probability of investing more effort.
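For reference, a simple way to read the proportional method is that the pool is split in proportion to each agent’s effort. The sketch below is a hypothetical reading, not the exact function from Part 1.

```python
import numpy as np

def proportional_gain(efforts, reward_pool):
    """Split the pool in proportion to each agent's effort level.

    Hypothetical sketch of the proportional method; the exact gain function
    used in the simulation is defined in Part 1.
    """
    efforts = np.asarray(efforts, dtype=float)
    total = efforts.sum()
    if total == 0:  # nobody wrote a review, nothing to distribute
        return np.zeros_like(efforts)
    return reward_pool * efforts / total

# Example: three agents investing effort levels 0, 3 and 9 share a pool of 120.
print(proportional_gain([0, 3, 9], 120))  # -> [ 0. 30. 90.]
```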

Result of the Exponential Method

However, if we switch the distribution method from proportional to exponential, the heatmap changes as shown above.

The two heatmaps differ significantly. The exponential method elicits more effort, because relatively more gain can be earned by investing effort than under the proportional method (notice the brighter color on the lower-right side).
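One plausible reading of the exponential method is that shares grow exponentially in the effort level rather than linearly. The sketch below is illustrative only; the `base` parameter and the exclusion of non-reviewers are assumptions, not values from the original simulation.

```python
import numpy as np

def exponential_gain(efforts, reward_pool, base=2.0):
    """Split the pool in proportion to base**effort instead of raw effort.

    Hypothetical sketch: `base` is an assumed parameter. Higher effort earns a
    disproportionately larger share, which is consistent with agents converging
    to higher effort levels under this method.
    """
    efforts = np.asarray(efforts, dtype=float)
    weights = np.where(efforts > 0, base ** efforts, 0.0)  # no reward without a review
    total = weights.sum()
    if total == 0:
        return np.zeros_like(efforts)
    return reward_pool * weights / total

# Same three agents as before: the gap between effort 3 and effort 9 widens sharply.
print(exponential_gain([0, 3, 9], 120))  # -> [0., ~1.85, ~118.15]
```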

Result of the Uniform Method

The uniform method shows a dramatic pattern. Though there is a slight variation based on wealth, all agents invested an effort level of 1.

Because the reward received does not depend on effort, agents wrote reviews with the minimum effort. While it is meaningful that even wealthy agents were driven to participate, this method does not guarantee the quality of the reviews.
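Under this reading, the uniform method ignores the effort level entirely and only rewards participation, which is exactly why minimum effort becomes the rational choice. Again, the sketch below is an assumed formulation, not the original code.

```python
import numpy as np

def uniform_gain(efforts, reward_pool):
    """Split the pool equally among every agent that wrote a review at all.

    Hypothetical sketch: effort level is ignored, only participation counts,
    so writing a review with minimum effort (1) maximizes net payoff.
    """
    efforts = np.asarray(efforts, dtype=float)
    participants = efforts > 0
    if participants.sum() == 0:
        return np.zeros_like(efforts)
    return np.where(participants, reward_pool / participants.sum(), 0.0)

# Effort 1 and effort 9 earn exactly the same share of a pool of 120.
print(uniform_gain([0, 1, 9], 120))  # -> [ 0. 60. 60.]
```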

Manipulating the Reward Pool and Number of Agents

The rewards that agents receive are distributed from the reward pool, which is provided by the entities that seek to promote activity (restaurants or hotels that need reviews). We were curious whether we could steer agent participation in the direction we wanted by adjusting the reward pool. For instance, could we ensure a certain level of quality in the reviews?

If so, providing a generous reward pool would benefit both the agents and the entities. Let’s use a heatmap to see how the agents learn and converge when the reward pool is tweaked.

Doubling the Reward Pool

The above heatmap shows the result when the reward pool is doubled in size. Notice that the color on the lower-right side has become brighter: the probability of investing effort has grown significantly.

Increasing the reward pool expands the gain being distributed. Therefore, the more effort an agent invests in writing a review, the more gain it receives. The cost is also affected by effort, so the action converges at an appropriate level instead of growing without bound. By setting a reward pool of a suitable size, entities can elevate the quality of reviews.
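To see why a larger pool pushes the optimal effort up even when effort is costly, here is a toy payoff calculation. The quadratic cost, the fixed total effort of the other agents, and the pool sizes are all illustrative assumptions, not values from the simulation.

```python
import numpy as np

def cost(effort):
    # Assumed quadratic cost of writing a review; the real cost function is in Part 1.
    return 0.5 * effort ** 2

def best_effort(reward_pool, others_effort=300.0):
    # One agent picks an effort level 0-9; the rest of the population is assumed
    # to contribute a fixed total effort. Gain is the agent's proportional share.
    efforts = np.arange(10)
    gains = reward_pool * efforts / (efforts + others_effort)
    return int(np.argmax(gains - cost(efforts)))

print(best_effort(1000))   # -> 3: smaller pool, lower optimal effort
print(best_effort(2000))   # -> 6: doubled pool, higher optimal effort
```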

Because the current reward system operates on a reward pool set by the entities, we can achieve a similar effect by limiting the number of participants. Next, let’s look at a case where the number of agents is reduced to 50.

Halving the Number of Agents

We can see a convergence pattern similar to the one in the doubled reward pool heatmap. In other words, the reward pool has to be adjusted along with the number of agents to achieve the desired effect. If the number of agents is halved from the original 100, the reward pool also has to be cut in half to maintain the original convergence pattern.
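As a quick sanity check on that relationship, the toy calculation below (assuming a proportional split and equal effort by every agent) shows that the per-agent share, and hence the incentive, stays constant only when the pool is scaled together with the number of agents.

```python
def per_agent_share(reward_pool, n_agents, effort=1.0):
    # Assumed proportional split with everyone investing the same effort:
    # each agent's share reduces to reward_pool / n_agents.
    total_effort = n_agents * effort
    return reward_pool * effort / total_effort

print(per_agent_share(reward_pool=1000, n_agents=100))  # 10.0
print(per_agent_share(reward_pool=500,  n_agents=50))   # 10.0 -> same incentive
print(per_agent_share(reward_pool=1000, n_agents=50))   # 20.0 -> like doubling the pool
```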

Interim Conclusion

In this article, we looked at the overall simulation results using heatmaps as a visual aid. We observed the learning pattern and convergence results of agents and analyzed the different patterns per gain distribution method.

We learned that the exponential method is more effective than the proportional method in boosting the agents’ level of effort. The uniform method drives agent participation but displayed the lowest level of effort in writing reviews.

If high participation is favored over quality reviews, the uniform method is the way to go. If the quality of each review is more important, the exponential method is the optimal choice. The proportional method sits in the middle of this trade-off.

We also learned that adjusting the reward pool or limiting the number of participating agents can influence the effort level. When all other conditions were the same, the probability of investing more effort rose as the reward pool was increased. Because the reward system ultimately divides up a fixed pie, capping the number of agents had the same effect as increasing the reward pool. System designers should take these points into consideration when deciding which reward distribution method to use, how big the reward pool should be, and how many agents should be allowed to participate. In our next article, we will conduct a detailed analysis of each gain distribution method using graphs and numbers.
