When load parameters increase, a scalable system is expected to maintain its performance. Amazon's S3 service level agreement (SLA), for example, promises a certain level of service to its users: customers are entitled to service credits if S3's monthly uptime falls below 99.9%.
Let's consider another example. Instagram has an internal SLA to load a user's home page in less than X milliseconds; let's say 5 milliseconds. However, the team can't guarantee that every home page will load in exactly 5ms, so performance is measured in percentiles. Instagram engineers can set standards such as: 99.9% of users will have their home pages loaded in under 10ms, 70% in under 7ms, and the median load time will be 5ms. A median of 5ms means that half of Instagram's users see their home screens load in under 5ms. There could still be a small minority that sees load times greater than 10ms due to random events outside the developers' control. Generally, optimizing for very high percentiles, e.g., the 99.99th percentile (1 in 10,000), yields diminishing returns and is usually not worth the effort.
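The percentile targets above can be sketched with a simple nearest-rank percentile calculation. This is a minimal illustration, not Instagram's actual measurement pipeline: the simulated latency distribution and the numbers it produces are hypothetical.

```python
import math
import random

def percentile(samples, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    ordered = sorted(samples)
    # Nearest rank: the smallest value such that at least p% of samples are <= it.
    k = math.ceil(p / 100 * len(ordered)) - 1
    return ordered[max(0, k)]

# Simulated page-load latencies in ms: most loads are fast, a few are slow outliers.
random.seed(42)
latencies = [random.lognormvariate(1.6, 0.3) for _ in range(10_000)]

p50 = percentile(latencies, 50)     # median: half the loads are at least this fast
p70 = percentile(latencies, 70)
p999 = percentile(latencies, 99.9)  # tail latency: 1 in 1,000 loads is slower
print(f"p50={p50:.1f}ms  p70={p70:.1f}ms  p99.9={p999:.1f}ms")
```

Note how the tail percentiles sit well above the median: the p99.9 value is dominated by the rare slow outliers, which is exactly why chasing ever-higher percentiles gets expensive.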