Imagine watching your favourite show on Netflix, and it stops playing due to increased system load. Not only would you be furious, but Netflix would also lose a lot of its user base. 📉
Netflix has about 200 million subscriptions and accounts for 6 billion collective watch hours per month. 🤯
A failure in any Netflix system can cause unexpected load and hurt the users' viewing experience. Failures happen for many reasons: misbehaving clients that trigger a retry storm, an under-scaled service in the backend, a bad deployment, a network blip, or issues with the cloud provider.
Despite tons of requests and unexpected load, how does Netflix make sure you still get a smooth viewing experience?
It uses Prioritized Load Shedding: it prioritizes requests and sorts traffic into three buckets: critical, non-critical and degraded. An API gateway service computes a priority score for each request and assigns it to the matching bucket.
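To make the scoring step concrete, here is a minimal sketch in Python. It is not Netflix's actual gateway code: the Request fields, the score values and the bucket thresholds are all made up for illustration, and "degraded" is assumed to mean requests whose loss degrades the experience without stopping playback.

```python
from dataclasses import dataclass
from enum import IntEnum


class Bucket(IntEnum):
    # Bucket names mirror the ones in the text; lower value = more important.
    CRITICAL = 0
    DEGRADED = 1
    NON_CRITICAL = 2


@dataclass
class Request:
    path: str
    affects_playback: bool = False  # hypothetical flag, assumed for this sketch
    is_background: bool = False     # hypothetical flag, e.g. logs or prefetch


def priority_score(req: Request) -> int:
    """Return a score from 0 (most important) to 100 (least important).

    The rules and numbers are assumptions, not Netflix's real formula.
    """
    if req.affects_playback:
        return 0
    if req.is_background:
        return 90
    return 50


def bucket_for(score: int) -> Bucket:
    """Map a score to a bucket using assumed thresholds (30 and 70)."""
    if score < 30:
        return Bucket.CRITICAL
    if score < 70:
        return Bucket.DEGRADED
    return Bucket.NON_CRITICAL


play = Request("/play", affects_playback=True)
log = Request("/log", is_background=True)
print(bucket_for(priority_score(play)).name)  # CRITICAL
print(bucket_for(priority_score(log)).name)   # NON_CRITICAL
```

Keeping a numeric score alongside the bucket means a gateway could shed traffic gradually within a bucket instead of all at once.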
So, when things go wrong, such as under high load or when a threshold is exceeded, Netflix drops traffic, starting with the lowest priority. These are mainly logging and background requests that are non-critical and can be retried later. This technique ensures that the playback experience remains uninterrupted and you enjoy your show. 😄
Image source: Netflix
As you can see from the graph, the API gateway performs a progressive load shedding based on request priority during the incident. The different colours in the graph represent requests with different priorities being throttled.
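The progressive shedding described above can be sketched roughly like this. It assumes each request carries a priority score from 0 (most important) to 100 (least important) and that load is a utilisation value between 0.0 and 1.0. The constants and the linear cutoff are assumptions chosen for illustration, not Netflix's real configuration.

```python
SHED_START = 0.60  # assumed utilisation at which shedding begins
SHED_FULL = 0.95   # assumed utilisation at which only top-priority traffic is served
MAX_SCORE = 100    # least important score in this sketch
MIN_CUTOFF = 10    # scores below this are never shed in this sketch


def cutoff_for(load: float) -> int:
    """Return the score at or above which requests are dropped at this load."""
    if load <= SHED_START:
        return MAX_SCORE + 1  # above every valid score, so nothing is shed
    if load >= SHED_FULL:
        return MIN_CUTOFF
    # Lower the cutoff linearly as load climbs from SHED_START to SHED_FULL.
    fraction = (load - SHED_START) / (SHED_FULL - SHED_START)
    return round(MAX_SCORE - fraction * (MAX_SCORE - MIN_CUTOFF))


def should_shed(score: int, load: float) -> bool:
    """True if a request with this priority score is dropped at this load."""
    return score >= cutoff_for(load)


for load in (0.50, 0.70, 0.99):
    shed = [score for score in (0, 50, 90) if should_shed(score, load)]
    print(load, "sheds scores", shed)
```

As load rises, the cutoff moves down through the scores, so background traffic goes first and playback-critical requests go last. In a real gateway, a shed request would typically be rejected with an error response so that the client can retry later.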
However, Netflix changes quickly, and non-critical requests can unexpectedly become critical. To make sure dropping non-critical requests does not impact users, Netflix uses a platform to capture live use cases and measure the impact on users' playback experience. It schedules these checks to run periodically.
It's interesting to see how all of this happens without disrupting your binge time! 🍿