AWS AppSync — the unexpected

As a GraphQL fan (though not exclusively) working mostly on the backend and occasionally on the frontend, I was curious about the AppSync service by AWS.

This article is about AppSync and how it tremendously changed our schema, the way we worked with GraphQL, and how we thought about usability, all because of the limitations AppSync has.

Update October 1st, 2018: we actually hit another 2 (undocumented) limitations (listed here as 4 and 5) that diverted us from AppSync back to managing our own GraphQL lambdas. It’s a pity, but keep reading: our case is specific, and you may be happy with AppSync.

I’m migrating one of our larger projects to AppSync. Previously we had a GraphQL endpoint as a lambda behind the API Gateway. We decided to stay serverless as much as possible and, at the same time, we needed subscriptions. Having websocket communication and staying serverless is… well… not straightforward.

Therefore AppSync with subscriptions built-in was a logical choice.

I won’t dive into the details of AppSync; I’ll mention only the experiences I had and the limitations I hit. YMMV.

Our GraphQL lambda used shared libraries and was a little bit “thick”. Schema and resolver mappings and most of the resolvers were there. We went the “shared code” way instead of invoking another lambda from this GraphQL lambda (personally I think this synchronous invoking is an antipattern in most cases).

The project has thousands of entities, and each of these has hundreds and sometimes thousands of related entities (like devices and their types, or their data points for a specified interval). The schema we designed was “semantically” correct: a nice tree with the possibility of deeply nested queries. We heavily (and happily) used Data Loader.

Our goals with AppSync were:

  • decouple resolving to specialized lambdas (less code sharing)
  • easier subscriptions
  • split the infrastructure into multiple AWS Stacks (the project is quite large)

The first (but easy) hurdle was defining AppSync in our serverless YAML, because we use it for managing our infrastructure. In the end, we defined everything as CloudFormation resources managed by serverless.yaml. This enables separate deployment of data sources (in our case, services). Another reason why AppSync lives in a separate stack is that the AppSync endpoint is our central endpoint for multiple different services (used in one frontend UI). The only catch is that we needed to export the required values from the other stacks so that we could import them here to properly define the data sources.

It’s not easy to get a working CF template (especially because of all the roles and policies), but it is doable. The schema and mapping templates live in the CF resources, so the file may become quite big and messy. I’m a little bit conservative with serverless plugins; therefore, we harnessed the power of serverless YAML and arrived at a simple solution based on importing files/scripts into serverless.yml. Now we have a modularized schema defined in multiple .graphql files (for syntax checking in the IDE) and mapping templates in separate files. Serverless puts all that together at deploy time without plugin magic.

Another easy hurdle was the Apache Velocity templates. AppSync uses them to map schema fields to the data sources and to transform the result into a GraphQL response. Our lambdas are designed to return (in most cases) exactly what we want to send, and therefore the basic `$util.toJson` helper was good enough. At first, I did not like having “another-weird-language” in the stack, but it’s really simple (and powerful) and it doesn’t hurt.

It looked like we did all to make it work. Well, nearly…

In fact — it worked BUT the “unexpected” limitations struck back.

Note: By the “unexpected” I mean that I did not see enough mentions about them in the documentation (I’d say I did not see the mentions at all but I may be wrong).

Here are the limitations that became really important for us: the original three (two documented and one undocumented) plus the two undocumented ones from the update above (all AppSync limitations are here).

1) Maximum GraphQL query execution time: 10 seconds

Yes, that’s true. Your data sources must respond in less than 10 seconds, and if some resolvers run one after another… all of them have to fit in.

Some of our data sources obtain real-time data from real devices located in many different places (yes, it is an IoT project). And this may take longer than 10 seconds.

Here I want to mention usability for the first time. GraphQL is mostly about communication between a frontend client and a server. Within this context, it is a really, really good idea to be sure that the answer arrives within 10 seconds. Probably even much faster.

The fact that our use case is specific is… an excuse. Everyone can say that. The key is to solve the problems, not to find excuses.

We will solve our real-time data problem with subscriptions. Real-time data was not supposed to use subscriptions in the first phase, but this solution removed many trade-offs we had planned to make.

In the other cases, where a user query was so time-consuming that it could potentially fall outside the 10-second time frame, we revised every query and either optimized it or found a better way to get the data faster. In some cases we redesigned our data flow; in others we send data in chunks, etc.

2) Maximum number of iterations in `#foreach…#end` loop in mapping templates: 1000

This means that if your lambda returns an array of more than 1000 items and you need to transform the values in an Apache Velocity template, you are out of luck: it doesn’t come through. Note: previously we thought that the limitation applied to any array, even when you only want to use `$util.toJson`, but this is probably not the case. I wish I did not have to say “probably” in the previous sentence, but I wasn’t able to get an answer.

Usability here, again. You probably should not send so many data points to the client. Does it really need it? (Or you simply don’t want to implement pagination? 😉)

Our example: we wanted to display a graph with data per minute during a week. 10 080 data points. The data are not so big, but the client has to deal with that amount of data points (cache, transform, draw a graph). You cannot even display such data — the screen resolution is not high enough and your graph will not be useful. We accepted the limitation to 1000 data points and it’s usually good enough. We had to ask ourselves how to select the right data points from the larger set (you may want to see spikes or not to see them for example).
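
The choice between showing and hiding spikes can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the function names `downsample_mean` and `downsample_min_max` are made up for this example, and Python is used only for brevity. Averaging each bucket flattens spikes, while keeping the minimum and maximum of each bucket preserves them.

```python
from typing import List, Tuple

Point = Tuple[int, float]  # (timestamp in minutes, value); an assumed shape


def downsample_mean(points: List[Point], max_points: int) -> List[Point]:
    """Average each bucket: the curve gets smoother and spikes are flattened."""
    if len(points) <= max_points:
        return list(points)
    bucket_size = -(-len(points) // max_points)  # ceiling division
    result = []
    for start in range(0, len(points), bucket_size):
        bucket = points[start:start + bucket_size]
        mean = sum(value for _, value in bucket) / len(bucket)
        result.append((bucket[0][0], mean))
    return result


def downsample_min_max(points: List[Point], max_points: int) -> List[Point]:
    """Keep the min and max of each bucket so spikes survive.

    Assumes max_points >= 2, because every bucket may emit two points.
    """
    if len(points) <= max_points:
        return list(points)
    buckets = max(1, max_points // 2)
    bucket_size = -(-len(points) // buckets)  # ceiling division
    result = []
    for start in range(0, len(points), bucket_size):
        bucket = points[start:start + bucket_size]
        lowest = min(bucket, key=lambda p: p[1])
        highest = max(bucket, key=lambda p: p[1])
        # Emit in time order; a flat bucket yields a single point.
        result.extend(sorted({lowest, highest}))
    return result
```

With a week of per-minute data (10 080 points) and `max_points=1000`, both functions return at most 1000 points; only the second one keeps a one-minute spike at its original value.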

This limitation can be overcome in many ways. Use server-managed pagination (the server doesn’t send all the data the client wants, only a chunk and the last cursor). Or you can segment the data: it (should be) OK for AppSync to send back 10 items where each item has 1000 subitems.

But again, the first thought should not be how to bypass this limitation but how to make your API and queries more efficient by sending less data. After all, this is one of the selling points of GraphQL.
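
Server-managed pagination could look roughly like the sketch below. It is only an illustration: the names `get_page`, `encode_cursor` and `decode_cursor`, the `nextCursor` field, and the offset-based cursor are all assumptions made for this example (a real data source such as DynamoDB would more likely use its own last-evaluated key as the cursor).

```python
import base64
import json
from typing import Any, Dict, List, Optional

MAX_PAGE_SIZE = 1000  # stays within the #foreach limit discussed above


def encode_cursor(offset: int) -> str:
    """Pack the position into an opaque string the client just passes back."""
    raw = json.dumps({"offset": offset}).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_cursor(cursor: Optional[str]) -> int:
    """Return the start offset; no cursor means the first page."""
    if not cursor:
        return 0
    return int(json.loads(base64.urlsafe_b64decode(cursor))["offset"])


def get_page(items: List[Any], cursor: Optional[str] = None,
             limit: int = MAX_PAGE_SIZE) -> Dict[str, Any]:
    """Return one chunk plus the cursor for the next chunk (None at the end)."""
    limit = max(1, min(limit, MAX_PAGE_SIZE))  # the server caps the page size
    start = decode_cursor(cursor)
    chunk = items[start:start + limit]
    end = start + len(chunk)
    next_cursor = encode_cursor(end) if end < len(items) else None
    return {"items": chunk, "nextCursor": next_cursor}
```

The client keeps calling with the returned `nextCursor` until it comes back as `None`; no single response carries more than 1000 items, whatever limit the client asks for.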

3) BatchQuery (this is serious, really)!

I mentioned Data Loader. For GraphQL way of resolving data, this is essential. AWS tells you AppSync has “something like Data Loader”. You can define `BatchInvoke` on lambda for example.

But the totally undocumented fact is — it does this only in batches up to 5 items!

What does it mean? Say your response has 100 devices, each device has a DeviceType, and you query DeviceType.name. Easy: you set up the resolver from Device to DeviceType as BatchInvoke. But hey… the batched query for DeviceTypes will NOT be one query (one invocation of your lambda) with an array of 100 items. It will be 20 queries (invocations), each with an array of 5 items.

Yes, you read that correctly. This may ruin your lambda pool (and budget), the RCUs of your DynamoDB tables, or whatever else you use as a data source.
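
The arithmetic, and the general shape of a batch resolver, can be sketched as follows. The assumptions are labeled in the comments: the payload shape (`source.deviceTypeId`), the in-memory `DEVICE_TYPES` table, and the function names are hypothetical, and the batch size of 5 is simply the value observed above. As far as I understand BatchInvoke, the lambda receives a list of payloads and is expected to return a list of results of the same length and in the same order.

```python
import math
from typing import Any, Dict, List, Optional

OBSERVED_BATCH_SIZE = 5  # the undocumented limit described in this article

# Hypothetical in-memory stand-in for the real data source.
DEVICE_TYPES: Dict[str, Dict[str, str]] = {
    "t1": {"id": "t1", "name": "Thermostat"},
    "t2": {"id": "t2", "name": "Gateway"},
}


def invocations_needed(item_count: int,
                       batch_size: int = OBSERVED_BATCH_SIZE) -> int:
    """How many lambda invocations it takes to resolve item_count items."""
    return math.ceil(item_count / batch_size)


def handler(event: List[Dict[str, Any]],
            context: Any = None) -> List[Optional[Dict[str, str]]]:
    """Batch resolver sketch: one list of payloads in, one list of results out."""
    # Load every distinct type once per invocation (the Data Loader idea).
    wanted_ids = {payload["source"]["deviceTypeId"] for payload in event}
    loaded = {type_id: DEVICE_TYPES.get(type_id) for type_id in wanted_ids}
    # Results must line up with the incoming payloads, one per payload.
    return [loaded[payload["source"]["deviceTypeId"]] for payload in event]
```

Here `invocations_needed(100)` gives 20, while a true Data Loader style batch (`invocations_needed(100, batch_size=100)`) would give 1. The deduplication inside `handler` only helps within a single batch of 5, which is exactly why the limit hurts.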

This may be a really big blocker for a lot of schemas.

In our case, we decided to refactor GraphQL schema. We made it much flatter than before (one would say more RESTish). We allowed only shallow and carefully selected nesting.

Interestingly this had some positive effects:

  • much faster responses
  • no need to do query complexity analysis (because a frontend programmer may, and ultimately will, ask a very, very expensive query)
  • progressive rendering

Let me elaborate a little on the last one.

GraphQL still doesn’t have partial responses. So the client sends one complex query and waits (longer) to get the (bigger) data.

In our case, instead of one query, you sometimes have to send two or three. The client can use the first result not only for the second query but for rendering, too. It may render an empty data grid with the row and column headers filled in from the first result, and the actual data in the grid arrives with the second query. From the user’s perspective, this is positive.

4) Size limitation

This was the real stopper for us. Undocumented. Your lambda cannot supply more than 1MB of data to AppSync (this figure comes from experimental measurement and may not be exactly 1MB).

That means: once lambda returns more than 1MB, no matter how AppSync will transform the data or what fields are asked by the user, you are doomed.
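
A lambda can at least guard against this before it answers. The sketch below is an assumption-heavy illustration: `ASSUMED_LIMIT_BYTES` is only the rough value we measured, AppSync may count bytes differently than compact JSON does, and the helper names `serialized_size` and `fit_to_limit` are invented for this example.

```python
import json
from typing import Any, List, Tuple

# Assumption: roughly 1MB, measured experimentally and not documented.
ASSUMED_LIMIT_BYTES = 1_000_000


def serialized_size(payload: Any) -> int:
    """Size in bytes of the payload as compact JSON."""
    return len(json.dumps(payload, separators=(",", ":")).encode("utf-8"))


def fit_to_limit(items: List[Any],
                 limit_bytes: int = ASSUMED_LIMIT_BYTES) -> Tuple[List[Any], bool]:
    """Return the longest prefix that fits, plus a flag saying it was cut."""
    kept: List[Any] = []
    size = 2  # the surrounding "[" and "]"
    for item in items:
        # One extra byte for the comma between array elements.
        item_size = serialized_size(item) + (1 if kept else 0)
        if size + item_size > limit_bytes:
            return kept, True
        kept.append(item)
        size += item_size
    return kept, False
```

When the flag comes back as `True`, the lambda can return the shortened list together with a cursor or an explicit error instead of letting the whole query fail.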

Imagine you have a lambda that returns a list of entities X, each with 15 fields. You may ask for a filtered list of entities X, and most of the time you ask for only a small subset of the fields (like IDs). When the lambda returns an array of entities X that is bigger than 1MB, you will get a GraphQL error, regardless of the fact that the user wants only IDs, which would yield a GraphQL reply of 100KB. I cannot come up with a good reason why the AppSync input from the lambda is measured instead of the output. If AWS wants to be sure that resources are not wasted, it could add a limit on the input too, but a reasonable one (like 3–5 MB; lambda itself limits a synchronous response payload to 6MB, which is not documented on the limitations page either).

You may think that you can do the filtering in the lambda. Well, you cannot. Neither in the Velocity templates nor in the lambda do you have a (documented) way to know which fields the user is asking for. Yes, I mean the “info” param that a GraphQL server supplies to the resolver.

It would be sufficient for the context object in the Velocity template to have access to the query itself.
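
To show how little is missing: if the list of requested fields did reach the lambda, trimming the payload would be a few lines. The sketch below assumes a hypothetical `fields` list (for example, passed by the client as an explicit query argument, which is a workaround I am only imagining here, not something AppSync provides), and the function name `project` is made up.

```python
from typing import Any, Dict, Iterable, List


def project(entities: List[Dict[str, Any]],
            fields: Iterable[str]) -> List[Dict[str, Any]]:
    """Keep only the requested top-level fields of every entity."""
    wanted = set(fields)
    return [
        {key: value for key, value in entity.items() if key in wanted}
        for entity in entities
    ]
```

With 15 fields per entity and a client that wants only `id`, a call like `project(entities, ["id"])` would shrink the lambda output to a fraction of its original size before AppSync ever measures it.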

5) Compression

With API Gateway you can turn on compression of the requests/responses. With AppSync you cannot. So using AppSync for data-intensive cases is problematic. If your response is 500kB of data, you have to transport 500kB of data instead of a fraction of it, because you cannot turn compression on. I suppose that internally AppSync uses API Gateway, so I can’t understand why this is not possible to configure in AppSync.
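
To get a feeling for what is lost, you can measure how well a typical payload compresses. This is a rough sketch with made-up sample data: `sample_points` and `compression_report` are hypothetical helpers, and the ratio you get depends entirely on how repetitive your real responses are.

```python
import gzip
import json
from typing import Any, Dict, List


def sample_points(count: int) -> List[Dict[str, Any]]:
    """Made-up per-minute data points, similar in shape to IoT readings."""
    return [
        {
            "deviceId": "device-1",
            "timestamp": 1538352000 + 60 * i,
            "value": 20 + (i % 7) * 0.5,
        }
        for i in range(count)
    ]


def compression_report(payload: Any) -> Dict[str, float]:
    """Compare the raw JSON size with its gzip-compressed size."""
    raw = json.dumps(payload).encode("utf-8")
    packed = gzip.compress(raw)
    return {
        "raw_bytes": len(raw),
        "gzip_bytes": len(packed),
        "ratio": len(packed) / len(raw),
    }
```

Calling `compression_report(sample_points(5000))` shows the raw and compressed sizes side by side. JSON with repeated keys usually compresses very well, which is exactly the saving you cannot switch on.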

The result for us?

AppSync is great.

Update (October 1st, 2018): AppSync is great for small and not data-intensive applications. Once you need more control, I would not recommend AppSync in its current state. I believe that AWS will improve AppSync in the future and remove most of these limitations, but for now they are too serious for us. And I cannot understand why the limitations are undocumented. I’m sure that we are not the only ones who had to discover them by ourselves, only to find out that they are stoppers.

AppSync limitations are mostly reasonable. In many cases, limitations are good. Really. They may push you toward better usability, like AppSync does by limiting request duration and the data sent to the client. Limitations force you to think different. They do not give you space for “technical debt”. You have to solve it better than “it’s good enough for now”.

AppSync lacks Data Loader functionality and I would like to see this clearly stated in the documentation. It may ruin your day and/or your budget. In the end, even this limitation has a positive side. You have to think about the data more thoroughly. And you may see benefits in partial rendering, too.

Some final notes:

  1. We will probably hit more limitations along the path so I’ll update this article.
  2. I know that some AWS limitations are based not only on usability but also on the billing scheme. Yet they can be used to optimize your application.
  3. The “ask for lifting limits” form doesn’t have AppSync listed yet. This means that all limits are probably hard limits for now.

Update (August 2018):

We asked to increase limits on BatchInvoke and here is (part of) the reply:

I understand that you use “BatchInvoke” operation for Lambda Datasource extensively and the current batch size limit of 5 is a real blocker for you. I sincerely apologize for this information not being updated in the public documentation. I have brought your request to the attention of the AppSync Service Team. This is one of the most requested feature AppSync customers. The AppSync service team is already tracking it as a PFR. I have done my best to explain this pain point of yours to the AppSync Service Team. Unfortunately, I cannot give you ETA at the moment.