Avoid sample pollution
1. Before starting your experiment, determine the current conversion rate of the page you want to test, your minimal detectable effect (MDE), significance, and power, then calculate your sample size from them.
You can use the CXL sample size calculator as an easy way to calculate sample size. MDE is the smallest uplift you want to be able to detect, expressed as a percentage, for example, a 20% relative change in conversion. Significance and power usually default to 95% and 80%, respectively.
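To make the relationship between these inputs concrete, here is a minimal sketch of one common approximation: the sample size for a two-sided, two-proportion z-test. It is an assumption that this matches what any particular calculator does (the CXL calculator may use a different formula and return a somewhat different number), and the function name and the 5% baseline are illustrative only.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_mde,
                            significance=0.95, power=0.80):
    """Approximate visitors needed per variant (two-sided z-test, equal split).

    Assumption: the standard normal-approximation formula for comparing two
    proportions. Other calculators may use different formulas.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)  # rate you hope to detect
    alpha = 1 - significance
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Hypothetical inputs: 5% baseline conversion, 20% relative MDE
print(sample_size_per_variant(0.05, 0.20))
```

Note how quickly the requirement grows as the MDE shrinks: because the difference between the two rates is squared in the denominator, halving the MDE roughly quadruples the sample you need.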
2. Segment your users by traffic type before setting a test timeframe to capture as many unique types of customers as possible.
For example, reduce bias by comparing purchasing behavior and volume across weekday vs. weekend traffic, traffic sources, and new vs. returning visitors.
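As a rough illustration of this kind of segmentation, the sketch below groups visit records by one attribute and reports volume and conversion rate per segment. The record fields (`day_type`, `source`, `visitor_type`, `converted`) and the sample data are hypothetical; a real analytics export will have its own schema.

```python
from collections import defaultdict


def conversion_by_segment(visits, key):
    """Return visitor count, conversions, and conversion rate per segment value."""
    totals = defaultdict(lambda: {"visitors": 0, "conversions": 0})
    for visit in visits:
        bucket = totals[visit[key]]
        bucket["visitors"] += 1
        bucket["conversions"] += int(visit["converted"])
    return {
        segment: {**counts, "rate": counts["conversions"] / counts["visitors"]}
        for segment, counts in totals.items()
    }


# Hypothetical visit records, for illustration only
visits = [
    {"day_type": "weekday", "source": "organic", "visitor_type": "new", "converted": False},
    {"day_type": "weekday", "source": "paid", "visitor_type": "returning", "converted": True},
    {"day_type": "weekend", "source": "organic", "visitor_type": "new", "converted": False},
    {"day_type": "weekend", "source": "email", "visitor_type": "returning", "converted": True},
    {"day_type": "weekday", "source": "organic", "visitor_type": "returning", "converted": False},
]

for key in ("day_type", "source", "visitor_type"):
    print(key, conversion_by_segment(visits, key))
```

If one segment converts very differently from the others or dominates the volume, that is a signal to make sure the test timeframe covers it.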
3. Run your test for no longer than two full business cycles so that your sample is representative, avoids length pollution, and limits sample cookie pollution.
Two business cycles take a minimum of 1 week, or better yet, 2-4 weeks. Four weeks is even better for the sake of validity, especially for more complicated and expensive products. Consistent data is extremely important. If your sample includes anomalies such as holidays or seasonal exceptions, you are acting on inconsistent behavior, and your winning variant likely won't produce the same results next month. Only run tests for as long as you need to. Once you have collected data from a statistically valid sample size, call the test. The longer it runs beyond that point, the higher the risk of pollution or changes in context. Avoid running A/B tests during major holidays that might skew the results. During those times, it's better to run holiday-specific test campaigns, most likely as bandit tests.
4. Use random selection and run tests double-blind to avoid personal bias, favoritism, and sample bias.
Additionally, have someone else on the team change variant names and numbers, so your analysis will also be unbiased.
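The two ideas above can be sketched as follows. Assignment here uses deterministic hash-based bucketing as a stand-in for random selection (each visitor always lands in the same variant), and a teammate can run the blinding helper to replace variant names with neutral labels before the analyst sees the data. All names are hypothetical, and a testing platform will normally handle assignment for you.

```python
import hashlib
import random


def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Bucket a visitor by hashing their ID, so assignment is stable and
    independent of anyone's judgment about who should see what."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]


def blind_labels(variants, seed=None):
    """Map real variant names to neutral labels in shuffled order.

    The person who runs this keeps the mapping private until the
    analysis is finished.
    """
    shuffled = list(variants)
    random.Random(seed).shuffle(shuffled)
    return {name: f"Variant {chr(ord('A') + i)}" for i, name in enumerate(shuffled)}


print(assign_variant("user-123", "checkout-button"))
print(blind_labels(["control", "treatment"]))
```

Including the experiment name in the hash keeps bucketing independent across experiments, so the same visitor is not always in the treatment group of every test.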
5. Run separate tests for each device type to limit sample device pollution.
Do not pool all traffic into a single test. Instead, test each device type separately, for example, mobile, desktop, and tablet. You can use Universal Analytics in Google Analytics to track the same person from device to device. Additionally, you can use known Twitter and Facebook IDs to track visitors across multiple devices.
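To illustrate analyzing each device type on its own, the sketch below runs a two-sided, two-proportion z-test per device instead of on pooled traffic. The choice of test is an assumption (your testing platform may use a different statistical method), and the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates
    (pooled z-test; assumes both groups have some, but not all, conversions)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical per-device results: (conversions, visitors) for A and B
results = {
    "desktop": {"a": (100, 1000), "b": (150, 1000)},
    "mobile": {"a": (80, 2000), "b": (85, 2000)},
    "tablet": {"a": (12, 300), "b": (15, 300)},
}

for device, arms in results.items():
    p = two_proportion_p_value(*arms["a"], *arms["b"])
    print(f"{device}: p-value = {p:.4f}")
```

A variant can win clearly on one device and show no detectable difference on another, which pooled numbers would hide.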
6. Open your testing platform and analyze the data personally to look for inconsistencies and high variance that might render the test invalid.
Also check for a lack of inflection points, uncharacteristic surges in conversion, and whether you have at least 1 week of consistent data. If your data lacks consistency, run the test again.
7. Use a standard deviation calculator and analyze your variance.
Low variance means your data points stay close to the average, which suggests a low risk of sample pollution.
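If you would rather compute this yourself than use an online calculator, a minimal sketch with Python's standard library looks like this. The daily conversion rates are hypothetical, and the coefficient of variation (standard deviation divided by the mean) is an added convenience for comparing spread across pages with different baseline rates, not something the steps above require.

```python
from statistics import mean, stdev


def rate_variability(daily_rates):
    """Summarize how tightly daily conversion rates cluster around their mean."""
    avg = mean(daily_rates)
    sd = stdev(daily_rates)  # sample standard deviation
    return {"mean": avg, "std_dev": sd, "variance": sd ** 2, "cv": sd / avg}


# Hypothetical daily conversion rates for one variant over two weeks
steady = [0.050, 0.052, 0.049, 0.051, 0.050, 0.048, 0.051,
          0.050, 0.049, 0.052, 0.051, 0.050, 0.049, 0.050]
erratic = [0.050, 0.090, 0.020, 0.051, 0.110, 0.015, 0.051,
           0.050, 0.095, 0.018, 0.051, 0.050, 0.012, 0.050]

print("steady: ", rate_variability(steady))
print("erratic:", rate_variability(erratic))
```

What counts as low depends on your traffic and baseline, so compare the figure against earlier tests on the same page rather than against a fixed threshold.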