Estimate required sample size for an experiment for difference in means
1. Decide what past period will be used to estimate the baseline and standard deviation of the primary test metric.
Select a reference period that is neither too long nor too short; otherwise it may capture information with little predictive value, which can distort the estimate. 2-8 weeks of data is usually sufficient. Ideally, the chosen period should include the most recent data, exclude major shifts in the key metric of interest, and be long enough to capture sufficient variability and some seasonality.
2. Estimate the baseline of the primary test metric using historical data from the chosen reference period.
The most straightforward way is to simply predict that the metric will remain the same as it was in the past period used for reference. For instance, if the primary metric had an average value of X during the reference period, you could predict that its average will also be X during the expected test duration. If the metric exhibits seasonality, then more complex time series prediction methods such as ARIMA forecasting may need to be employed to get a good estimate. If the seasonality is minimal, the adjustment for it can be skipped as it will have little effect on the final sample size estimate. Note that individual values of the metric for each user or session over the entire period need to be used. For example, it would be wrong to just take an average of daily average values.
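The difference between pooling individual values and averaging daily averages can be seen in a minimal Python sketch. The data and the function names below are hypothetical and serve only as an illustration; a real analysis would read per-user or per-session values from your own data source.

```python
from statistics import mean

# Hypothetical per-user metric values, grouped by day.
# The second day has twice as many users as the first.
daily_values = {
    "day 1": [10.0, 12.0],
    "day 2": [20.0, 22.0, 24.0, 26.0],
}

def pooled_baseline(values_by_day):
    """Average over every individual value in the reference period."""
    all_values = [v for day in values_by_day.values() for v in day]
    return mean(all_values)

def mean_of_daily_means(values_by_day):
    """Average of daily averages: gives each day equal weight (not recommended)."""
    return mean(mean(day) for day in values_by_day.values())

print(pooled_baseline(daily_values))      # 19.0
print(mean_of_daily_means(daily_values))  # 17.0
```

The two numbers differ because the average of daily averages weights every day equally regardless of how many users it had, whereas the pooled average weights every user equally.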
3. Choose the significance threshold for the test.
This is the p-value threshold that will determine how you act once the test has completed. If the p-value is below the threshold, you would act as if the variant is better than control and implement it. Otherwise, you would stick with the current experience. A few guidelines that should help: the more expensive it would be to make the wrong choice, the lower the threshold should be; the more difficult it would be to reverse the decision, the lower the threshold should be; and the larger the pushback against the proposed change, the lower the threshold should be.
4. Estimate the standard deviation of the primary test metric using historical data from the chosen reference period.
If the variance of the data differs significantly over the year, seasonality adjustments may need to be performed, just as for the baseline of the primary test metric. Note that individual values of the metric for each user or session over the entire period need to be used. For example, it would be wrong to compute daily standard deviations and arrive at an estimate by averaging those.
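A small Python sketch, using hypothetical data and function names, shows why averaging daily standard deviations is misleading: it ignores the variation between days.

```python
from statistics import mean, stdev

# Hypothetical per-user metric values, grouped by day.
daily_values = {
    "day 1": [10.0, 12.0],
    "day 2": [20.0, 22.0, 24.0, 26.0],
}

def pooled_sd(values_by_day):
    """Sample standard deviation over every individual value in the period."""
    all_values = [v for day in values_by_day.values() for v in day]
    return stdev(all_values)

def mean_of_daily_sds(values_by_day):
    """Average of per-day standard deviations (not recommended)."""
    return mean(stdev(day) for day in values_by_day.values())

print(round(pooled_sd(daily_values), 2))          # about 6.54
print(round(mean_of_daily_sds(daily_values), 2))  # about 2.0
```

In this made-up example the averaged daily figure is less than a third of the pooled one. Because the required sample size grows with the variance, an understated standard deviation would lead to an underpowered test.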
5. Choose the minimum effect of interest by answering the question: "What is the minimum effect we'd be happy to detect as statistically significant?"
The minimum effect of interest is the difference you would not like to miss, if it truly existed. It is related to the false negative rate of the testing procedure and only plays into post-test analysis if the test ends up not statistically significant. It helps to think in terms of the trade-off between risk and reward, since the fixed and variable risks and rewards of running the test and of making a decision within a given time frame feed back into each other. The minimum effect of interest has to be within what can realistically be expected; use your expertise and experience to judge that. For example, a small change is unlikely to have a large impact, so setting the minimum effect of interest unrealistically high might doom the experiment to a false negative if there is a true effect that is much lower than the minimum effect of interest. The minimum effect of interest may need to be reconsidered a few times if the first one or two values lead to a prohibitively long test duration. For example, a 0.5% effect might sound exciting for a given test, but if it would take a year to run the test with that parameter, a compromise might need to be reached by selecting a minimum effect of 2% in order to run the test in 3-4 months instead.
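To explore this trade-off before opening a calculator, a rough rescaling can help. The Python sketch below assumes a fixed-sample test under the usual normal approximation, where the required sample size is inversely proportional to the square of the minimum effect. The function names and all numbers are hypothetical, and the result is only a first approximation, not a substitute for a proper calculation.

```python
import math

def rescale_sample_size(n_ref, mde_ref, mde_new):
    """Rescale a known sample size requirement to a different minimum effect.

    Assumes n is proportional to 1 / mde**2 (normal approximation, all other
    parameters unchanged). Both effects must be in the same units, e.g. percent.
    """
    return math.ceil(n_ref * (mde_ref / mde_new) ** 2)

def weeks_needed(n_users, users_per_week):
    """Test duration in whole weeks, rounded up."""
    return math.ceil(n_users / users_per_week)

# Hypothetical starting point: 120,000 users needed for a 2% minimum effect,
# with 20,000 users per week reaching the tested experience.
for mde in (1.0, 2.0, 4.0):
    n = rescale_sample_size(120_000, 2.0, mde)
    print(f"{mde}% -> {n} users, {weeks_needed(n, 20_000)} weeks")
```

Under these assumptions, halving the minimum effect roughly quadruples the required sample size, which is why small minimum effects so quickly become impractical.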
6. Use a sample size calculator suited to your statistical model, such as the GIGA Calculator Power & Sample Size Calculator or the Analytics Toolkit Sample Size Calculator, to compute the required sample size, then derive the estimated test duration from it using your preferred tool.
Things to look out for in a sample size calculator: Proper support for more than one variant versus control. Whether it computes the sample size for a relative difference or only for an absolute difference (a calculator that supports only absolute differences can still be used by adding a conservative upward adjustment of 5% to the estimated sample size). The computation should be for a one-sided (one-tailed) p-value, unless the particular test actually calls for a two-sided alternative. You can then estimate the expected test duration from the calculated sample size. For example, if the part of the website the test would affect sees 20,000 users per week, a sample size requirement of 120,000 users would mean the test should last six weeks. Always round up the number of weeks or days to allow for overestimation of the expected number of users per week.
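For orientation, the kind of computation such a calculator performs can be sketched in Python. This is a simplified illustration and not the method used by either calculator named above. It assumes one variant versus control, equal allocation, the same standard deviation in both groups, a normal approximation, a one-sided test, and 80% power (the power level is an assumption, as are the function names and example inputs). The relative minimum effect is converted to an absolute difference, and the 5% upward adjustment mentioned above is applied.

```python
import math
from statistics import NormalDist

def sample_size_per_group(baseline, sd, relative_mde, alpha=0.05, power=0.80):
    """Approximate users per group for a one-sided difference-in-means test."""
    delta = baseline * relative_mde            # absolute minimum effect
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * sd ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2
    # Conservative 5% upward adjustment for using an absolute-difference formula
    return math.ceil(n * 1.05)

def weeks_required(total_users, users_per_week):
    """Test duration in whole weeks, always rounded up."""
    return math.ceil(total_users / users_per_week)

# Hypothetical inputs: baseline mean 50, standard deviation 30, 5% relative effect
n_per_group = sample_size_per_group(baseline=50, sd=30, relative_mde=0.05)
total = 2 * n_per_group  # control plus one variant
print(n_per_group, total)

# Duration example from the text: 120,000 users at 20,000 users per week
print(weeks_required(120_000, 20_000))  # 6
```

Tests with more than one variant, unequal allocation, or sequential monitoring need different calculations, so this sketch is best used as a sanity check on a calculator's output.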