Following up on the post I wrote about prepping your app before going live, today we’re going to dive a bit deeper into monitoring your applications. Monitoring is crucial for understanding how your application is behaving and for getting notified of issues before they become apparent to your users. Launching an application without monitoring is like driving a car without a fuel gauge. Sure, you can drive the car for some time, but at some point you’re going to run out of gas. Similarly, your app can run out of resources, or other issues may arise, and you won’t know until it’s too late. There are three areas of monitoring we’re going to cover: infrastructure, external, and application performance monitoring. Let’s dive in!
In order to get visibility into the health of your application, you need to monitor the underlying components within your infrastructure. This assumes you’re using a cloud infrastructure provider like AWS; if you’re using a Platform as a Service (PaaS) such as Heroku or Supabase, you may not have visibility into some or all of the underlying systems.
The first step to monitoring your infrastructure is to identify all of the components supporting your application and make a list of key metrics you should be watching. Typically, SaaS applications have some form of compute instances, which could be individual virtual servers, like EC2 on AWS or Compute Engine on GCP, or a container orchestration platform like Kubernetes. In any case, there are core metrics for those instances you’ll want to keep a close eye on. CPU utilization, memory usage, and disk space are critical, but you should also consider network-related metrics like bandwidth usage and latency. Most applications also have some sort of database. In addition to the server metrics above, you’ll want to track things like connection count, write latency, and replication lag (you do have a database replica, right?). Identify the other components in your system, see what metrics your cloud provider exposes for them, and add some of those to your list too.
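As a concrete example, here’s a minimal sketch that pulls the last hour of CPU utilization for an EC2 instance using AWS’s boto3 SDK. The instance ID and region are placeholders for your own resources:

```python
from datetime import datetime, timedelta, timezone

import boto3

# Region and instance ID are placeholders; substitute your own.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

now = datetime.now(timezone.utc)
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,  # one datapoint per 5 minutes
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f'{point["Average"]:.1f}%')
```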
Once you have a list of what you want to monitor, you’ll need to create alerts. All of the major cloud vendors have alerting tools: AWS has CloudWatch Alarms, GCP lets you create Alerting Policies, and Azure has Monitoring Alerts. All of these allow you to pick a metric on a specific resource and create an alerting rule. Almost all let you set a static threshold or range, like “alert when CPU utilization is greater than 80%” or “outside of 20% and 80%.” Some also offer anomaly detection, which alerts you when a metric deviates from its normal operating range by some number of standard deviations. In any case, you should be able to configure these alerts to deliver an email or post a message into a Slack channel so your team gets notified when a monitor goes into an alert state. Make sure you also configure notifications for when a monitor returns to a normal state, so you can see how long an alarm lasted without having to go back into the cloud console to investigate.
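Here’s a sketch of that 80% CPU rule as a CloudWatch Alarm created with boto3. The instance ID and the SNS topic ARN (which would be wired up to email or Slack) are placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Alert when average CPU stays above 80% for three consecutive
# 5-minute periods; the SNS topic ARN is a placeholder for a topic
# that forwards to email or Slack.
cloudwatch.put_metric_alarm(
    AlarmName="web-1-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
    # Also notify on recovery, so you can see how long the alarm lasted.
    OKActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```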
Now that you know all of your underlying services are working, you’ll need to make sure your applications are actually available to your end users. While you could host something inside your network to do this, you really want these checks to run outside your network, so you’re seeing the same thing your users are. It’s possible that, due to network issues or incorrect DNS settings, your application works just fine internally but isn’t available from the outside world.
There are a number of tools that help you do this, often referred to as synthetics. We’ve used Uptime Robot and Pingdom, but there are many others, and most include a free tier that might be just enough for you. When setting up an external monitor, don’t just check that your web page loads or that an API endpoint accepts a request. Instead, exercise several critical user paths, like submitting a login request or creating a record in the system, so you have a robust understanding of performance. Doing this verifies that underlying services are functioning correctly in addition to the public-facing application.
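To give you a feel for what these tools do under the hood, here’s a minimal sketch of a login-path check in Python. The URL, endpoint, and credentials are hypothetical; in practice, a hosted tool runs checks like this on a schedule from multiple locations:

```python
# A minimal external check that exercises a critical user path (login)
# rather than just loading a page. BASE_URL and the credentials are
# placeholders for a dedicated synthetic-monitoring account.
import sys

import requests

BASE_URL = "https://app.example.com"

def check_login() -> bool:
    session = requests.Session()
    resp = session.post(
        f"{BASE_URL}/api/login",
        json={"email": "synthetic@example.com", "password": "monitor-only"},
        timeout=10,
    )
    return resp.status_code == 200 and "token" in resp.json()

if __name__ == "__main__":
    if not check_login():
        print("login check failed", file=sys.stderr)
        sys.exit(1)  # non-zero exit lets a scheduler trigger an alert
    print("login check passed")
```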
Set up notifications to go to the same place as your infrastructure alerts. Having this all consolidated in one place can help your team correlate issues and identify root causes much faster.
Application Performance Monitoring (APM) tools provide detailed insight into the performance of your application. Generally, these tools are installed as agents inside your codebase, where they measure the time it takes to execute parts of your code. For example, given an API request to fetch a user profile, an agent can measure how long it takes to get the data out of your database, or how long it takes to fetch a profile image from storage. It then adds up all the different actions taken and gives you a total time for that request. Most of these tools can give you an average time across many requests, but they also retain detailed information about specific requests.
In addition to measuring the performance of your back-end services, these tools can also be installed in your client web or mobile applications, where they record what your users actually experience, like how long pages take to load or how long certain tasks take. These tools also track what’s called an Apdex score, an industry-standard way to measure how a specific page on your site is perceived by users. Let’s say your target for satisfied users is a page load under 2 seconds. According to Apdex, users will tolerate up to 4 times that duration and be frustrated by anything longer. Your Apdex score is then (satisfied requests + 0.5 × tolerating requests) divided by the total number of requests, including frustrated ones. This gives you a number between 0 and 1, the higher the better. Usually you can set a different target duration for different endpoints or pages; for example, an API endpoint might get half a second versus 2 seconds for a full web page. Apdex gives you a common score across different types of applications and functionality that is consistent and easy to understand.
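Here’s the calculation in code form, a small sketch using a hypothetical list of request durations and the 2-second target from above:

```python
# Apdex with target T: satisfied <= T, tolerating <= 4T, frustrated beyond.
def apdex(durations_seconds: list[float], target: float = 2.0) -> float:
    satisfied = sum(1 for d in durations_seconds if d <= target)
    tolerating = sum(1 for d in durations_seconds if target < d <= 4 * target)
    return (satisfied + 0.5 * tolerating) / len(durations_seconds)

# 7 satisfied, 2 tolerating, 1 frustrated -> (7 + 0.5 * 2) / 10 = 0.80
print(apdex([0.4, 1.1, 1.8, 0.9, 1.5, 2.0, 3.2, 6.5, 9.0, 1.2]))
```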
Most APM tools also provide distributed tracing, which paints a complete picture of a request from your web or mobile client through all the backend systems it touches. Often they provide a visual representation of how the data flows, even when numerous back-end services call other services, as long as they’re all being monitored by the same tool. This really helps you understand how things are connected and where problems are occurring.
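Vendor agents typically create these traces automatically, but to illustrate the idea, here’s a sketch using the open-source OpenTelemetry SDK, with stand-ins for the database and storage calls from the profile example above:

```python
# Nested spans with the open-source OpenTelemetry SDK (vendor APM agents
# typically create these automatically). Spans print to the console here;
# in production an exporter would ship them to your observability tool.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("profile-service")

def get_user_profile(user_id: str) -> dict:
    with tracer.start_as_current_span("GET /users/profile"):
        with tracer.start_as_current_span("db.query"):
            profile = {"id": user_id, "name": "..."}  # stand-in for a DB call
        with tracer.start_as_current_span("storage.fetch_avatar"):
            profile["avatar"] = b""  # stand-in for a storage call
    return profile

get_user_profile("42")
```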
Similar to the other types of monitoring, you definitely want to set performance alarms for key requests in your application and route them to the same channels you’ve already created. APM is useful not only for getting alerted about problems, but also for diagnosing where they are and how to fix them. Debugging performance issues in your code without these tools is hard: the alternative is to add timers in your code and log the output manually, which is time consuming and will make resolving issues much slower.
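For reference, that manual alternative usually looks something like this sketch, a context-manager timer that logs how long a block takes. It works, but you have to sprinkle it everywhere and aggregate the logs yourself:

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("timing")

@contextmanager
def timed(label: str):
    """Log how long the wrapped block takes, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        log.info("%s took %.1f ms", label, elapsed_ms)

with timed("fetch_user_profile"):
    time.sleep(0.05)  # stand-in for real work
```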
I didn’t mention any specific APM tools, and that’s because it’s hard to find a standalone APM tool these days. Most are bundled with lots of other features, including infrastructure monitoring, external monitoring, and much more, and the category is shifting toward being called observability tools to encompass everything they provide. The three big players in this space are New Relic, Datadog, and Dynatrace. All of them can perform all of these tasks, but each has a different pricing model, and costs can get complicated and very high depending on how big your infrastructure is. New Relic has a decent free tier that can get you set up with APM for one full-access user and up to 100 GB/mo of data ingestion. It used to be a simple tool to learn, but as more and more features have been added over the years, it has become quite daunting. There are some open source projects, such as Apache SkyWalking and SigNoz, which may be more cost effective in the long run but might take longer to set up and could be less reliable.
Regardless of which tools you choose, with monitoring in place you’ll be in a much better position to provide your users with a flawless experience.