Orchestrating the Refunds Process at Zulily

By: Nanya Origbo, Senior Software Engineer, Shopping Experience, Zulily

Zulily Tech News
Zulily Tech Blog
Sep 1, 2021

As Zulily scaled to serve millions of customers, we needed to modernize our refunds process. The initial refunds system was highly distributed and complex, which made it unreliable and created gaps in customers’ experiences. In this post, we share how our team evolved our refunds technology stack to create a shopping experience that customers trust.

Problem Statement

The organic evolution of Zulily’s refunds system led to increased complexity in its architecture, with multiple teams owning different pieces of the refunds process.

The primary consequence of this evolution was reduced reliability, which we define as the rate at which refund requests are successfully processed. For example, the distributed nature of the system made it difficult to determine when and where refunds failed. In certain cases, customers had to contact customer service to move their refunds along.

Furthermore, the system’s complexity reduced developers’ productivity. Debugging issues and adding new features became difficult and time-consuming. We also lacked robust retry mechanisms, which meant that retrying certain failed refunds was manual and expensive.

Proposed Solution

Addressing the above challenges required us to build a service that reliably serves as a single source of truth for anything related to refunds at Zulily.

To accomplish this, we needed to:

  • Define service endpoints that allow clients to create a refund request as well as retrieve details about a refund request.
  • Implement a reliable asynchronous workflow execution mechanism that maintains a history of workflow events and proactively catches and resolves transient errors.
  • Implement a messaging system to alert subscribed consumers of predetermined events relevant to their use case (e.g., workflow success or failure).
  • Implement robust monitoring mechanisms that provide visibility into the overall system health.

At the heart of our solution was the asynchronous workflow execution mechanism. The most suitable “out of the box” implementation that matched our use case was Amazon Simple Workflow Service (SWF), an AWS service that provides workflow orchestration and state persistence and that we programmed through the AWS Flow Framework library.

Other implementations of workflow orchestration and management considered include:

  • Homegrown orchestration engine: Building a custom solution would have taken longer and required more effort. Leveraging an existing solution within a cloud environment that Zulily already used was therefore more appealing and cost-effective. This approach of “not reinventing the wheel” allowed us to focus on what mattered, which included building a reusable workflow library for other microservices at Zulily.
  • AWS Step Functions: We considered Step Functions for the workflow execution mechanism. However, our use case had two requirements that Step Functions could not meet. First, we needed to implement business logic in our workflows using code instead of JSON. This was easily achieved with the AWS Flow Framework, which provided a better developer experience. Second, our refund workflows had a parent-child relationship in which a parent workflow could launch one or more child workflows that return a result to the parent. At the time of implementation, this parent-child feature was not available in Step Functions.
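
The parent-child relationship described above can be pictured with a small sketch. This is plain Python using the standard library's thread pool, not SWF or the Flow Framework. The function names, the one-child-per-item split, and the field names are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def child_refund_workflow(item):
    """Hypothetical child workflow: refunds one item and returns a result."""
    return {"item_id": item["item_id"], "refunded_cents": item["price_cents"]}


def parent_refund_workflow(order_items):
    """Hypothetical parent workflow: launches one child per item and
    combines the results that the children return."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # map() preserves input order, so results line up with order_items.
        child_results = list(pool.map(child_refund_workflow, order_items))
    total = sum(r["refunded_cents"] for r in child_results)
    return {"children": child_results, "total_refunded_cents": total}


if __name__ == "__main__":
    items = [
        {"item_id": "A1", "price_cents": 1999},
        {"item_id": "B2", "price_cents": 500},
    ]
    print(parent_refund_workflow(items))
```

The point of the sketch is only the shape of the control flow: the parent cannot finish until every child has returned its result, and the whole thing is ordinary code rather than a JSON state definition.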

Solution Architecture

Fig 1: High Level Architecture Diagram

a. Service Endpoints

For the service endpoints, we defined RESTful APIs for a variety of client use cases. These APIs shield clients and consumers from the implementation details of the refund workflow and database. We chose REST because it was sufficient for our use case and is an architectural style that is well understood within Zulily.

b. Workflow Execution

When refund requests are received, they are processed in two steps:

  • Request data are persisted in the refunds database.
  • Execution of the Refund workflow is initiated.
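
A minimal sketch of these two steps, assuming an in-memory dictionary in place of the refunds database and a background thread in place of the workflow service. The function names, field names, and status values are hypothetical and are not taken from the actual service.

```python
import threading
import uuid

# In-memory stand-in (assumption): the real system uses a database and a
# workflow service, neither of which is modeled here.
refunds_db = {}


def run_refund_workflow(refund_id):
    """Hypothetical workflow body: marks the stored request as completed."""
    refunds_db[refund_id]["status"] = "COMPLETED"


def create_refund_request(order_id, amount_cents):
    """Step 1: persist the request. Step 2: start the workflow asynchronously."""
    refund_id = str(uuid.uuid4())
    refunds_db[refund_id] = {
        "refund_id": refund_id,
        "order_id": order_id,
        "amount_cents": amount_cents,
        "status": "PENDING",
    }
    worker = threading.Thread(target=run_refund_workflow, args=(refund_id,))
    worker.start()
    # The thread is returned only so a caller or test can wait for it.
    return refund_id, worker


def get_refund_request(refund_id):
    """Retrieve details of a refund request, or None if it is unknown."""
    return refunds_db.get(refund_id)
```

Persisting before starting the workflow means that a request which never gets as far as execution still leaves a record that can be found and retried.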

Refunds Database: This is the single source of truth for Zulily refunds. That is, business and engineering teams can query this database for details regarding a refund request. The base entity in this database is a refund workflow, which could be modeled as a single denormalized table. Therefore, we decided to adopt NoSQL as the database technology for our service. DynamoDB was our chosen NoSQL solution. We picked DynamoDB for the following reasons:

  • DynamoDB is an AWS service and integrates seamlessly with other AWS services used across Zulily.
  • It is a fully managed solution with capabilities such as redundancy and replication provided out of the box.
  • It is a highly available service that offers three-way replication, which suited our refunds use case.
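
To illustrate what a single denormalized refund workflow record could look like, here is a sketch using a Python dataclass. Every attribute name and value is an assumption made for the example. It does not reflect the actual table schema, and it does not use the DynamoDB API.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class RefundWorkflowRecord:
    """Hypothetical shape of one denormalized refund workflow item.

    Everything about the refund lives in this one record, so a reader
    needs a single lookup by refund_id and no joins.
    """
    refund_id: str  # assumed lookup key
    order_id: str
    customer_id: str
    amount_cents: int
    status: str = "PENDING"
    # Line items are embedded rather than stored in a separate table.
    items: list = field(default_factory=list)


record = RefundWorkflowRecord(
    refund_id="r-123",
    order_id="o-456",
    customer_id="c-789",
    amount_cents=2499,
    items=[
        {"item_id": "A1", "amount_cents": 1999},
        {"item_id": "B2", "amount_cents": 500},
    ],
)
item = asdict(record)  # a plain dict, the kind of document a NoSQL store holds
```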

Refunds workflow: At Zulily, refunds are a multi-step process that we modeled as a workflow. When a refund request is received, the refunds workflow is started asynchronously. A workflow consists of one or more tasks, and its execution involves the coordination and management of these tasks. To make refund processing simple and reliable, we chose Amazon SWF as our orchestration service. For our use case, the benefits of SWF include:

  • Maintenance of a workflow execution history, which makes debugging issues much easier for engineers.
  • Automatic retries for failed tasks using exponential backoff which helps proactively resolve transient errors.
  • Quick and easy definitions of workflows and their associated activities in code using libraries provided with the AWS Flow Framework.
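
The first two benefits, an execution history and retries with exponential backoff, can be sketched generically. The code below is a plain-Python illustration and not the Flow Framework API. The function names, event names, and default delays are assumptions.

```python
import time


def run_with_retries(task, max_attempts=4, base_delay=0.01, sleep=time.sleep):
    """Run task(), retrying on failure with exponentially growing delays.

    Returns (result, history), where history is a list of event dicts.
    Re-raises the last error if every attempt fails.
    """
    history = []
    for attempt in range(1, max_attempts + 1):
        try:
            result = task()
        except Exception as exc:
            history.append(
                {"attempt": attempt, "event": "FAILED", "error": str(exc)}
            )
            if attempt == max_attempts:
                raise
            # Delay doubles each time: base, 2 * base, 4 * base, ...
            sleep(base_delay * 2 ** (attempt - 1))
        else:
            history.append({"attempt": attempt, "event": "COMPLETED"})
            return result, history


def make_flaky_task(failures):
    """Hypothetical task that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def task():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise RuntimeError("transient error")
        return "refund issued"

    return task
```

Calling `run_with_retries(make_flaky_task(2))` succeeds on the third attempt and returns a three-event history. That history is the kind of trail that makes it possible to see when and where a refund failed.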

c. Workflow Completion

The completion of a workflow involves updating the workflow status in the refunds database and notifying clients. For client notification, we use AWS Simple Notification Service (SNS) to publish the refund workflow entity to an SNS topic. Clients can subscribe to this topic to track the status of their refund request.
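
As a rough sketch of this completion step, the following uses an in-memory topic class in place of SNS. The class, function, and status names are hypothetical, and nothing here calls the real SNS API.

```python
class InMemoryTopic:
    """Stand-in for a notification topic (assumption; not the SNS API)."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, message):
        for callback in self._subscribers:
            callback(message)


def complete_workflow(refunds_db, topic, refund_id, status):
    """Update the stored status, then publish the refund entity."""
    entity = refunds_db[refund_id]
    entity["status"] = status
    # Publish a copy so subscribers cannot mutate the stored record.
    topic.publish(dict(entity))
    return entity


if __name__ == "__main__":
    db = {"r-1": {"refund_id": "r-1", "status": "PENDING"}}
    topic = InMemoryTopic()
    topic.subscribe(print)
    complete_workflow(db, topic, "r-1", "SUCCEEDED")
```

Publishing the whole entity rather than only an identifier lets a subscriber act on the notification without calling back into the refunds service.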

d. Workflow Monitoring

Monitoring of our refund requests involved implementing a reliability metric to ensure that refunds are being processed successfully. We also implemented additional metrics such as availability, latency, and throughput to track the overall health of the system.

All metrics were visualized using Grafana dashboards. We also used Grafana to implement alerts that catch spikes or dips in our metrics.
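
The reliability metric, defined earlier as the rate at which refund requests are successfully processed, reduces to a simple ratio. The sketch below assumes outcomes are available as status strings. The status names and the handling of the empty case are our assumptions for illustration, not details of the production metric.

```python
def reliability(outcomes):
    """Fraction of refund requests that were processed successfully.

    `outcomes` is an iterable of status strings. Returns None when there
    are no requests, since a rate over zero requests is undefined.
    """
    outcomes = list(outcomes)
    if not outcomes:
        return None
    succeeded = sum(1 for status in outcomes if status == "SUCCEEDED")
    return succeeded / len(outcomes)


if __name__ == "__main__":
    print(reliability(["SUCCEEDED", "SUCCEEDED", "SUCCEEDED", "FAILED"]))
```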

Zulily takes customer trust seriously. By evolving the technology stack that powers Zulily refunds, we were able to significantly improve the reliability of our refund process, and consistently deliver shopping experiences that our customers trust.