Logging and monitoring

AppSync is integrated with CloudWatch, both Logs and Metrics so it can automatically publish debugging and operational information there. And CloudWatch comes with a rich toolset to analyze and monitor this data, this makes it relatively easy to gain insight of what is happening with the API. And as CloudWatch is a managed serverless solution, you can't overwhelm it with too much data. Provided, of course, that you are willing to pay for that.

In this chapter we'll take a look into how logging and monitoring works for an AppSync API.

Logging

The primary purpose of logging is debugging, and AppSync can publish a good range of information (if configured to do so) to the logs. But, of course, there are edge cases you should be looking out for.

Setup logging

CloudWatch Logs uses a 2-layered system for logs. The top level is the Log Group which is the container for a single entity, such as an AppSync API. It's name is determined by the AppSync service: /aws/appsync/apis/<apiid>.

The lower level is the Log Streams. AppSync automatically creates these streams according to some internal logic, and they contain the actual log messages. In practice, you query the stream with the latest log message to see the most recent log events.

AppSync follows the usual AWS way in regards to permissions: the service needs an IAM role to gain write access to CloudWatch Logs as by default it has no permissions on its own.

The minimal set of permissions give write access to the groups, streams, and messages:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }
  ]
}
Tip

Don't give logs:CreateLogGroup permission to AppSync.

If you use the Management Console, AWS creates the role with these permissions.

This is good for getting started, but if you use some Infrastructure-as-Code solution, such as CloudFormation, Terraform, or the CDK, this is not the way you want to manage logging. The problem with letting AppSync to manage the log group is that if you delete the API, the log group will still be there. And since, by default, the log messages have no expiration, you'll pay for storage even when the API is long gone.

A better solution is to create the log group with the IaC tool and don't give AppSync the logs:CreateLogGroup permission. This way, the API is still able to manage the streams and push the messages, but deleting the stack clears the logs too. And, as an extra benefit, you'll be able to define the retention too.

For example, this is how to create the log group for the AppSync API with Terraform:

resource "aws_cloudwatch_log_group" "appsync_loggroup" {
  name =
    "/aws/appsync/apis/${aws_appsync_graphql_api.appsync.id}"
  retention_in_days = 14
}

Then give a limited set of permissions to the AppSync API:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }
  ]
}
Log levels

AppSync supports various settings for what is logged. There are two controls for this:

  • Whether to include verbose content
  • What field resolvers are logged
AppSync logging settings

When both are turned off, only the time, id, duration, and the amount of tokens consumed (see the Request tokens chapter):

Without verbose content or resolver logging, very little information is included

Including verbose content adds the GraphQL query and the request/response headers:

Verbose logging includes the query, and headers

To get more insight of what is happening during a request, you can enable resolver logging. This shows which resolvers are run, and what their request templates were:

Resolver logging shows what resolvers were run during a request

Then verbose resolver logging also shows the context for each resolver. Since the context is the central part of what a resolver does, this is especially useful for debugging. Usually, it is possible to recreate what happened during an execution with this information.

Verbose resolver logging shows the value of the context
Tip

Verbose resolver logging allows visibility into what AppSync does during a request. Since GraphQL uses its own logic when it runs the resolvers it can be extremely useful.

Analyze logs

Each log entry has a request ID that makes it easy to filter log entries that belong to the same request. This is a UUID that is unique enough that it's not likely that it appears accidentally on an unrelated entry. In practice, whenever you find an entry, you can quickly switch to seeing the whole flow. Just don't forget to use "" in the search field for exact match.

Each log entry has a request ID to allow easy filtering

As these logs are available in CloudWatch Logs, you can also write complex queries to visualize various aspects of the API. This is done through CloudWatch Logs Insights, and AWS even provides some example queries, such as most invoked resolvers, slowest fields, and most errors.

CloudWatch Logs Insights allows complex queries agains AppSync logs
More ideas

For more ways to analyze logs, see this blog post.

Log costs

As we've seen, when verbose logging for all resolvers are enabled, AppSync generates a lot of messages. While it's not a scalability problem, CloudWatch bills for all these messages and the overal cost can be significant. It can even be more than the cost of the API itself.

For example, a simple query with a single level of nesting generated 12 KB of logs. That's $6/million requests at the current price of $0.5/GB. Worse still, without a limited log retention, these messages are stored forever, adding an ever-increasing amount to the monthly bill.

Unfortunately, AppSync does not support log sampling. You can implement something like that by enabling logging for short periods of time and disable it outside these sampling periods. For more info, see this blog post.

Tip

Watch out for logging costs. Don't keep full resolver logging enabled in production.

Metrics

AppSync also pushes various metrics to CloudWatch. They are free of charge and you don't need to enable them. These metrics are a good data source for monitoring, and you can get some visiblity that your API is humming in the background.

The AppSync dashboard shows a few of these metrics, but this is just a sample of all the data published. At this moment, there are 22 different metrics you can watch, such as request latency, messages published to subscriptions, active connections, and more. The documentation has the full list with description of what the numbers mean.

4xx and 5xx errors

The first two metrics AppSync publishes are the 4xx and the 5xx errors. These are HTTP status codes, and codes in the 400-499 range indicates client-side errors, while ones in the 500-599 range indicates server-side errors. At first look, these metrics seem useful: for an HTTP-based server, you could monitor critical and non-critical errors, setting up alarms for each of them.

But this does not apply to GraphQL. The specification defines how to handle when a resolver throws an error: return null for that field and add an entry to the errors structure. Because of this, 4xx and 5xx errors are extremely rare and they provide almost no visibility of the problems with the API or the calls.

Errors in GraphQL still return a 200 status code

Because of this, I don't think monitoring these errors provide much value.

Tip

4xx and 5xx errors are HTTP status codes and they don't apply to GraphQL.

Request latency

AppSync publishes request latency which is a useful metric to monitor. On the AppSync console you can see the 10th, 50th, and 90th percentiles, and you can graph additional statistics.

Percentiles

The "nth percentile" means that the n% of the requests is below that number. So the 10th percentile means the slowest 10% of the requests is below slower than that, the 50th percentile (also called median) means half of the requests are slower and half faster, the 90th percentile is that only 10% is faster than that.

Percentiles are better than averages as a few extreme outliers won't skew it.

Number of requests

AppSync publishes the number of requests for all APIs in a region. This is another useful metric as a sudden rise of requests might indicate a problem.

This metric is region-level. If you have a single API deployed to a region it's not a problem, but with more than one it's hard to get an accurate number per API.

Regional metric

The number of requests is aggregated per region, so you can't get an accurate number for a specific API.

Request tokens

Another regional metric is the number of consumed tokens. A token is a specified amount of processing done to serve a request, so the consumed tokens is based on the number of requests and how much resources each of them requires. Note that it does not take into account other services, such as databases and Lambda functions, this is only for the AppSync-side processing.

Tokens are a way to limit AppSync scalability for the AWS account. It has nothing to do with pricing, as that is still per-request, only to put a cap on the processing power available.

It seems a limitation, but it is actually a good feature from AWS. By default, you get 2000 tokens, which is 2000 units of processing per second, so even if you accidentally have an infinite loop or under attack it puts a limit on your costs. This value is adjustable, so you can open a support ticket and AWS increases it for you.

This metric is per region, so you won't see which API consumes the most tokens.

Tip

Set up an alarm for this metric so you'll know in advance when you are about to reach it.

Subscription metrics

Most of the metrics AppSync publishes are related to real-time subscription updates. The most useful are ActiveConnections and PublishDataMessageSuccess. There are a lot of error metrics too.

Monitoring tips

My recommendation is to set up an alarm for the request tokens, and monitor the number of requests and the 50th percentile for latency.

Unfortunately, there is no good metric to monitor errors. Also, AppSync resolvers do not have a concept of consistency or transactions, so it can happen that a query throws an error but a retry is successful. On top of this, GraphQL converts errors into 200 responses with an error field.

As a result, there is no metric to monitor client-side or server-side errors and alarm ops built-in.

X-Ray

If enabled, AppSync pushes the flow of request execution to X-Ray. This is a service that integrates with many other AWS services and provides visibility into how a request is handled, such as what functions are called, what databases were involved, and other things.

X-Ray shows each service that processed the request

In practice, X-Ray is useful only in a few situations, but since it doesn't need any configuration besides turning it on, it's a nice addition from AWS. Just make sure you don't keep it enabled in production as it might come with a heavy price tag.

A log of the processing is also available
Master AppSync and GraphQL
Support this book and get all future updates and extra chapters in ebook format.