How Netflix Does A/B Testing (And You Can, Too)

Netflix is big. Your company is—most likely—not quite as big. So, when I tell you that Netflix uses A/B testing in numerous ways to improve their service and enhance their business, you might say, “That’s nice for them, but how does that help me and my decidedly smaller business?”

If you said that, I’m glad you did, because the ways that Netflix does A/B testing are things you can do, too, or can at least inspire you to do something similar.

Hey, real quick: what is A/B testing?

A/B testing is a live experiment that compares two versions of a thing, to find out which version works better. Version A is the one your users normally experience, and Version B is the one you think might more effectively accomplish a particular goal. You run the A/B test on your live product by diverting a portion of your users to Version B, while the rest of your users continue to use Version A. You collect the results from both groups during the test, and then use that data to determine which version performed better.
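
To make that concrete: the splitting itself is usually just deterministic bucketing on a user ID. Here’s a minimal sketch in Python of what that might look like (the hashing scheme, function names, and percentages are illustrative assumptions, not anyone’s production code):

    import hashlib

    def assign_variant(user_id, experiment_name, percent_in_b=50):
        """Deterministically assign a user to 'A' or 'B'.

        Hashing the user ID together with the experiment name keeps the
        assignment stable, so the same user sees the same version for
        the whole test.
        """
        key = f"{experiment_name}:{user_id}".encode("utf-8")
        bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
        return "B" if bucket < percent_in_b else "A"

    # Example: divert 10% of visitors to the new version of a page.
    variant = assign_variant("user-123", "signup-browse-test", percent_in_b=10)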

You’ll see what I mean in the examples below.

1. A/B testing to increase user sign-ups (and other macro goals)

Netflix wants to add new paying customers to their service (of course), so they consider modifications to their website’s registration page. Whenever their product design team comes up with an improvement that they predict will lead to more user sign-ups, they prepare an A/B test to confirm their hypothesis.

One such hypothesis went like this (paraphrased for our purposes): “If we allow non-members to browse our content before signing up, then we’ll get more people to join the service.” The idea came from actual user feedback, and it made sense to designers at Netflix.

So, in 2013, the product design team tested their hypothesis using an A/B test. They built an alternate working version of the registration page that allowed content browsing (Version B). When they started their test, some visitors to the live site were unknowingly directed to Version B, while the rest saw the original design (Version A), which didn’t allow content browsing. Throughout the test, Netflix automatically collected data on how effective Version A and Version B were in achieving the desired goal: getting new user sign-ups.

Netflix ran this test five times, each time pitting a different content-browsable Version B design against the original non-browsable design. All five Version Bs lost to the original design. The design team’s hypothesis turned out to be incorrect. However, Netflix learned valuable lessons from the tests. They also saved themselves bigger problems by testing their assumptions first, instead of simply rolling the change out to everyone. As Netflix Product Designer Anna Blaylock stated at a talk in 2015, “The test may have failed five times, but we got smarter five times.”

How does this apply to me?

Trying to get more users to sign up for your service is an example of a “macro conversion” goal, and this is perhaps the most common application of A/B testing. Macro conversion goals are the “big” goals that usually represent the primary purpose of the thing you’re testing, whether it’s a website, app, or marketing email.

Examples of macro conversion goals:

  • Get users to use your site’s Contact form
  • Get users to complete a product purchase
  • Get users to respond to your sales email

This kind of A/B testing is something you can do, too. Even though Netflix’s example involved testing a design with a whole new feature (browsing), the design changes that you test will often be much simpler—the text label on your call-to-action button, for example.

Sure, your modest app doesn’t get anywhere near the traffic that Netflix does, but you can still set up and run A/B tests of your own. Just like Netflix, you’ll run your test as long as necessary to collect enough data to determine which design is more effective. For Netflix, that’s often several weeks. With less traffic, you may need to let your test run for several months.
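
How long is “long enough”? A common rule of thumb is to estimate the required sample size for a two-proportion test before you start. The numbers below (a 5% baseline sign-up rate, a hoped-for lift to 6%, and roughly 500 visitors per day) are made-up assumptions just to show the arithmetic:

    from math import ceil

    def sample_size_per_variant(p_a, p_b, z_alpha=1.96, z_power=0.84):
        """Rough sample size per variant for a two-proportion test
        (defaults: 95% confidence, 80% power)."""
        variance = p_a * (1 - p_a) + p_b * (1 - p_b)
        effect = (p_b - p_a) ** 2
        return ceil((z_alpha + z_power) ** 2 * variance / effect)

    # Assumed numbers: 5% baseline conversion, hoping for 6%,
    # about 500 visitors per day split evenly across both versions.
    n = sample_size_per_variant(0.05, 0.06)
    days = ceil(2 * n / 500)
    print(f"~{n} users per variant, roughly {days} days of traffic")

With those made-up numbers, the estimate lands at around 8,000 users per variant, or about a month of traffic. The smaller the lift you’re trying to detect, the longer the test has to run.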

2. A/B testing to improve content effectiveness (and other micro goals)

Netflix also runs A/B tests to optimize “micro” conversion goals, like improving the usability of a particular UI interaction.

For example, Netflix ran a series of A/B tests to determine which title artwork images (a.k.a. “thumbnails”) would be most effective at getting users to view their content.

First, Netflix ran a small test for just one of their documentary titles, The Short Game. The test involved assigning a different artwork variant to each experimental test group, and then analyzing which variant performed best—and by how much.

As Netflix’s Gopal Krishnan wrote:

“We measured the engagement with the title for each variant — click through rate, aggregate play duration, fraction of plays with short duration, fraction of content viewed (how far did you get through a movie or series), etc. Sure enough, we saw that we could widen the audience and increase engagement by using different artwork.”

With this initial A/B test, Netflix established that significant positive results were possible by optimizing title artwork. Netflix then went on to run more elaborate tests with larger sets of content titles. These tests measured additional factors as well, to verify that the success of the optimized artwork was actually increasing total viewing time, and not simply shifting hours away from other titles.
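
You don’t need Netflix’s infrastructure to do this kind of analysis. If you log which variant each user saw and whether they clicked, comparing variants is mostly counting plus a significance check. Here’s a simplified sketch; the log format and the chi-squared test are my assumptions, not Netflix’s actual pipeline:

    from collections import Counter
    from scipy.stats import chi2_contingency

    # Hypothetical impression log: (artwork_variant_shown, user_clicked)
    impressions = [("default", True), ("variant_1", False),
                   ("variant_2", True), ("default", False)]  # ...and so on

    shown = Counter(variant for variant, _ in impressions)
    clicked = Counter(variant for variant, hit in impressions if hit)

    for variant in shown:
        print(f"{variant}: {clicked[variant] / shown[variant]:.1%} click-through")

    # Is the difference real, or just noise? A chi-squared test on the
    # clicked / not-clicked counts gives a quick sanity check.
    table = [[clicked[v], shown[v] - clicked[v]] for v in shown]
    chi2, p_value, _, _ = chi2_contingency(table)
    print(f"p-value: {p_value:.3f}")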

I can A/B test like that, too?

Yes! If you have a hypothesis about a design change that might improve your users’ experience, then you can set up an A/B test to try it out.

You’ll want to be careful to:

  • Minimize the differences between Version A and Version B. If there are multiple or unrelated design changes on the same screen, you won’t be sure which design element was responsible for better performance.
  • Make sure you can measure the results before you start the test. Determine what data would indicate success, and make sure you can automatically collect that data. Micro conversion goals are often not as trivial to track as macro conversion goals: you may need to set up a custom Goal in Google Analytics, or whatever your analytics tool calls it (see the sketch after this list).
  • Consider the bigger picture. Your redesigned button might get more click conversions, but perhaps users are also leaving your site sooner. Think about how your design change may affect your other goals, and make sure you’re collecting that data, too. Consider running additional tests.
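
If your analytics tool can’t track a particular micro conversion out of the box, even a plain event log will do, as long as every event is tagged with the variant the user was in. A minimal sketch, where the JSON-lines log file and the event names are just assumptions for illustration:

    import json, time

    def log_event(user_id, variant, event_name, value=None):
        """Append one experiment event as a line of JSON.

        Tagging every event with the variant is what lets you compute
        both your primary metric (e.g. button clicks) and your
        bigger-picture metrics (e.g. session length) after the test.
        """
        record = {"ts": time.time(), "user": user_id,
                  "variant": variant, "event": event_name, "value": value}
        with open("experiment_events.jsonl", "a") as log_file:
            log_file.write(json.dumps(record) + "\n")

    # The micro conversion you're testing for...
    log_event("user-123", "B", "cta_button_click")
    # ...plus a guardrail metric, so you can spot unwanted side effects.
    log_event("user-123", "B", "session_seconds", value=412)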

3. A/B testing custom algorithms

A core part of the Netflix user experience is that much of the UI is customized for each user. Most of that customization is accomplished via algorithms. Algorithms can be rather simple or extremely complex, but their output can still be A/B tested.

Remember a few paragraphs ago, when Netflix tested which thumbnails would work best for all their users? That’s already old news. In 2017, they moved to a system that selects thumbnails for each user based on their personal preferences. For example, if Netflix knows you like Uma Thurman movies, you’re more likely to get the Pulp Fiction thumbnail with Uma Thurman staring back at you, making it statistically more likely you’ll watch it. This personalization is driven by an algorithm that Netflix can improve over time.
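
To picture how a personalized picker might work (this is a toy illustration, nothing like Netflix’s actual models), imagine scoring each artwork option against the tags a user has shown affinity for and keeping the best match:

    def pick_thumbnail(artwork_options, user_tastes):
        """Pick the artwork whose tags best match the user's tastes.

        artwork_options: list of (image_name, set_of_tags) pairs
        user_tastes: set of tags the user has shown affinity for
        A real system would learn scores from viewing data; this toy
        version just counts overlapping tags.
        """
        return max(artwork_options,
                   key=lambda option: len(option[1] & user_tastes))[0]

    pulp_fiction_art = [
        ("uma_thurman.jpg", {"uma thurman", "crime", "drama"}),
        ("travolta_jackson.jpg", {"john travolta", "samuel l jackson"}),
        ("dance_scene.jpg", {"dance", "comedy"}),
    ]
    print(pick_thumbnail(pulp_fiction_art, {"uma thurman", "kill bill"}))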

Any time Netflix wants to adjust one of their personalization algorithms (or their adaptive video streaming algorithms, or their content encoding algorithms, or anything) they A/B test it first.

Do I need to test algorithms?

You may not actually be employing any custom algorithms in your app or website. If you are, however, you should seriously consider running an A/B test for any algorithm adjustments you make.

Let’s say your company makes a news aggregator app that recommends articles for the user to read. Currently, each article recommendation is based on the user’s pre-set preferences and their overall reading habits.

If you had reason to believe that your users prefer reading articles of the same type in batches, you could modify the algorithm to give higher priority to articles that are similar to the article the user just read. You could run an A/B test to see if your new algorithm is more effective at achieving your goal of increasing app usage.

Before you can run such a test, you’ll need to actually implement the new version of the algorithm (while not interfering with the current algorithm). This may be trickier than your average design change, so tread carefully.
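
One workable pattern is to keep the old and new ranking functions side by side and route each user to exactly one of them for the duration of the test. The sketch below is a hypothetical, simplified version of that setup; the scoring, the 20% rollout, and the data shapes are all assumptions:

    import hashlib

    def similarity(article_a, article_b):
        """Toy similarity: how many topic tags two articles share."""
        return len(article_a["tags"] & article_b["tags"])

    def recommend_v1(articles, user_prefs):
        """Current algorithm: rank purely by preference overlap."""
        return sorted(articles,
                      key=lambda a: len(a["tags"] & user_prefs),
                      reverse=True)

    def recommend_v2(articles, user_prefs, just_read):
        """Candidate algorithm: also boost articles similar to the one
        the user just finished, to encourage reading in batches."""
        return sorted(articles,
                      key=lambda a: len(a["tags"] & user_prefs)
                      + 2 * similarity(a, just_read),
                      reverse=True)

    def recommend(user_id, articles, user_prefs, just_read):
        """Route each user to exactly one algorithm for the whole test."""
        bucket = int(hashlib.sha256(f"recs-test:{user_id}".encode())
                     .hexdigest(), 16) % 100
        if bucket < 20:  # 20% of users get the new algorithm
            return "B", recommend_v2(articles, user_prefs, just_read)
        return "A", recommend_v1(articles, user_prefs)

Because the routing function also reports which variant it used, you can tag the downstream engagement data with it and compare the two algorithms the same way you’d compare any other A/B test.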

More things to consider:

  • Make sure your Version B implementation has been thoroughly tested. Bugs in the new version will skew the results and may force you to throw out the test.
  • Try to ensure that your Version B latency is very similar to the original’s. If your new algorithm takes noticeably longer to process than the original, then your A/B test will largely be testing the user experience implications of the slowdown, and not the output of the new algorithm vs. the old.

Conclusion

Netflix is a monster at A/B testing. They’ve got data scientists, design teams, and engineers working on potential service improvements all the time, and they A/B test everything before it becomes the default experience for their massive user base. They know it’s too risky not to.

You and your company probably do not have the resources to build the impressive experimentation platform that Netflix has. Not yet, anyway.

But you, too, can reap the benefits of A/B testing for your products, at your own scale and at your own pace. And you can take Netflix’s core lesson to heart: always be looking for ways to optimize your products.
