Lies, Networks and Response Times – Why ‘average’ doesn’t cut it
Response times play a central role in performance monitoring and analysis. They are measured to confirm that they fall within expected boundaries and that a service is behaving normally. When measurements show that times fall beyond those boundaries, steps are taken to find the root cause, or applications are tweaked and optimised.
It’s useful to understand how long a request takes to be answered, as this gauges how well a client or server is keeping up with the demands being made. In particular, measuring and understanding performance helps service providers to:
- Recognise issues before they impact business
- Keep an eye on Service Level Agreements (SLAs)
- Inform bandwidth management, server load balancing, and general network planning
- Assess the results of any configuration changes
When measuring response times, measurements are often aggregated over a time interval (per second or per minute). The problem with averages is that they often do not reflect what is actually happening, especially when measurement values fluctuate. So how should response times be viewed to get a better real-world picture?
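To see how an average can mislead, consider a minimal sketch with made-up numbers: a one-minute sample of response times in which most requests are fast but a few are very slow. The arithmetic mean lands nowhere near what any request actually experienced.

```python
import statistics

# Hypothetical one-minute sample of response times in ms: mostly fast,
# with a short burst of slow responses.
samples_ms = [120, 130, 110, 125, 135, 140, 115, 2400, 2600, 130]

mean = statistics.mean(samples_ms)
median = statistics.median(samples_ms)
print(f"mean:   {mean:.1f} ms")   # pulled up by the slow tail: 600.5 ms
print(f"median: {median:.1f} ms") # what a typical request saw: 130.0 ms
```

The mean (600.5 ms) describes neither the typical fast requests (~130 ms) nor the slow ones (~2,500 ms), which is exactly the fluctuation problem described above.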
Triometric’s network-based monitoring technology can provide some very sophisticated business insight by reverse engineering network packets into Business Intelligence, but along the way it provides a great deal of IT operational “speeds and feeds” type data.
One of the most interesting “speeds and feeds” capabilities is measuring end-to-end response times to effectively rate the customer experience – tracking not just a booking platform’s server response time but also the network round-trip transit time between the requesting system and the server. This is only possible through monitoring of low-level network packets, and certainly isn’t something that comes from a server-based agent using time stamping and logging. Relying on server response times alone can be very misleading.
Consider the chart below, which shows end to end response times by minute interval:
What we can see is a breakdown of end-to-end response times into server and network time. It is very clear that network performance can vary widely whilst server response levels remain unchanged. A poor customer experience for some clients can be down to poor network performance. Whilst the booking platform operator can’t directly control network performance, there is a need to understand how it can swamp highly responsive servers and be a significant cause of time-outs. Accurate metrics allow the platform operator to influence a client’s choice of network type and ISP/provider – something else that Triometric’s technology can report on.
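The decomposition above can be sketched with a toy model. The figures and field names here are illustrative, not real measurements: in packet-level monitoring the network round trip is typically estimated from the TCP handshake, and server time from the gap between the last request packet and the first response packet.

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    rtt_ms: float      # network round-trip time (e.g. from TCP handshake timing)
    server_ms: float   # time the server spent producing the response

    @property
    def end_to_end_ms(self) -> float:
        # What the client actually experiences is the sum of both parts.
        return self.rtt_ms + self.server_ms

traces = [
    RequestTrace(rtt_ms=40, server_ms=180),   # fast network path
    RequestTrace(rtt_ms=900, server_ms=175),  # same server, slow network path
]

for t in traces:
    network_share = 100 * t.rtt_ms / t.end_to_end_ms
    print(f"end-to-end {t.end_to_end_ms:.0f} ms, network share {network_share:.0f}%")
```

In the second trace the server is just as fast as in the first, yet the client sees a response nearly five times slower – a pattern that server-side logging alone would never reveal.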
In much the same way that tracking only server response times leads to a false sense of security and potential frustration when considering time-outs, looking at average response times (the arithmetic mean) can also be very misleading.
The following diagram, showing a typical distribution of response times overlaid with booking probability, explains why relying on a mean response time that is below a client’s time-out level ignores a large number of time-outs and therefore lost revenue:
The critical lesson here is to work with percentiles. It is often much better to use the 90th percentile response time – the time within which 90% of responses have been sent – as the representative metric, rather than a misleading average value. If the business model is designed to allow for losing up to 10% of opportunities to time-outs, then ensuring the 90th percentile end-to-end response time remains below clients’ cut-off times is the only way to go.
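A short sketch with synthetic data makes the point concrete. The distribution parameters and the 2,000 ms cut-off below are invented for illustration; a log-normal shape is simply a reasonable stand-in for the long-tailed distributions seen in practice.

```python
import random
import statistics

random.seed(1)
# Synthetic long-tailed response times in ms (log-normal shape).
times_ms = [random.lognormvariate(6, 0.8) for _ in range(1000)]

mean = statistics.mean(times_ms)
p90 = statistics.quantiles(times_ms, n=10)[-1]  # 90th percentile cut point

cutoff_ms = 2000  # hypothetical client time-out
timed_out = sum(t > cutoff_ms for t in times_ms) / len(times_ms)
print(f"mean {mean:.0f} ms, p90 {p90:.0f} ms, timed out {timed_out:.1%}")
```

With this data the mean sits comfortably below the 2,000 ms cut-off, yet a few percent of requests still time out – precisely the lost revenue the average conceals, and exactly what tracking the 90th percentile exposes.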
Understanding why some searches take longer than others is also important. The reason could very likely be the product being requested. Consider the searches for hotel rooms in Los Angeles, which take much longer to process than searches for other destinations in the following chart:
This is where the capability to use business/product information (contained within the XML data of search requests) to diagnose issues can be very important. In general, augmenting IT operations metrics with product/business data can dramatically reduce the MTTR – mean time to resolution.
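The idea of correlating response times with product data extracted from request XML can be sketched as below. The payload structure and element names here are invented for illustration – real booking XML schemas differ per platform – but the grouping step is the essence of the technique.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from statistics import mean

# Hypothetical (search request XML, response time in ms) pairs.
requests = [
    ('<Search><Destination>LAX</Destination></Search>', 5200),
    ('<Search><Destination>LAX</Destination></Search>', 4800),
    ('<Search><Destination>LON</Destination></Search>', 600),
    ('<Search><Destination>PAR</Destination></Search>', 750),
]

# Group response times by the product attribute pulled out of the XML.
by_destination = defaultdict(list)
for xml_payload, response_ms in requests:
    dest = ET.fromstring(xml_payload).findtext('Destination')
    by_destination[dest].append(response_ms)

for dest, times in sorted(by_destination.items()):
    print(f"{dest}: mean {mean(times):.0f} ms over {len(times)} searches")
```

Slicing the same response-time data by destination immediately isolates the slow product (here, the LAX searches) instead of leaving it buried in an overall average.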