Usually, our service is visited by about one thousand users per hour. To serve the usual demand, we bought a modern computer and ran the Tomcat web server as well as configuring Tomcat's thread pool with 500 allocated threads. The average response time for the majority of user requests is about 250 milliseconds. By doing a naive calculation of the capacity for that configuration, we can be sure that the system can handle about 2,000 user requests per second. According to statistics, the number of users previously mentioned produced around 1,000 requests per second on average. Consequently, the current system's capacity will be enough for the average load.
To summarize, we configured our application with the margin regarding capacity. Moreover, our web store had been working stably until the last Friday in November, which is Black Friday.
Black Friday is a valuable day for both customers and retailers. For the customer, it is a chance to buy goods at discounted prices. And for retailers, it is a way to earn money and popularize products. However, this day is characterized by an unusual influx of clients, and that may be a significant cause of failure in production.
And, of course, we failed! At some point in time, the load exceeded all expectations. There were no vacant threads in the thread pool to process user requests. In turn, the backup server was not able to handle such an unpredictable invasion, and, in the end, this caused a rise in the response time and periodic service outage. At this point, we started losing some user requests, and, finally, our clients became dissatisfied and preferred dealing with competitors.
In the end, a lot of potential customers and money were lost, and the store's rating decreased. This was all a result of the fact that we couldn't stay responsive under the increased workload.