Translate a substantive hypothesis into a statistical hypothesis
1. Write down your substantive hypothesis in plain language to serve as your test's alternative hypothesis.
An example would be the hypothesis, "Removing elements E1 and E2 from the checkout page will streamline the user flow and result in an increase in average revenue per user (ARPU)." The substantive hypothesis should be measurable by one or more metrics that serve as business or website/app performance indicators, which is ARPU in this case. Include at least one of the terms "increase," "decrease," or "remain the same" to indicate the direction of the hypothesized effect or the lack of one.
2. Choose a primary test metric that has a clear connection to the business’s bottom line, such as profit, revenue, number of sales, or number of products or services sold.
Whatever your test affects should have a clear connection to the business’s bottom line. Examples include profit, revenue, number of sales, number of products or services sold, number of leads, and various engagement metrics, listed here from the most direct to the least direct effect on the bottom line. Review your hypothesis from the previous step to check whether it can be reframed using a metric with a more direct effect. For example, use revenue if profit is not relevant. If that’s not relevant either, use number of sales, and so on. Test metrics are typically averages per user or per session, with common examples being average revenue per user and conversion rate per user.
3. Choose secondary metrics that supply the additional information needed to interpret the test outcome, to guard against unintended consequences and to increase explanatory power.
For example, conversion rate and average order value (AOV) are often used as secondary metrics in tests that aim to improve a primary metric like ARPU. Keeping an eye on secondary metrics helps ensure that ARPU isn’t being increased at the cost of losing customers or of focusing on cheap, low-margin products. Observing an increased conversion rate, a higher average order value, or both reveals how the test is achieving its effect: by making a larger proportion of users purchase, by making users purchase items worth more in each order, or both. A secondary metric such as the percentage of unsubscribes following an email campaign can also guard against being too aggressive, that is, converting visitors at the cost of reduced long-term revenue as users unsubscribe from your email list.
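To make the relationship between ARPU, conversion rate, and average order value concrete, here is a minimal sketch in Python. The function name `summarize_group` and the input format (one list of order values per user) are assumptions made for illustration, not part of any particular analytics tool.

```python
def summarize_group(orders_by_user):
    """Return ARPU, per-user conversion rate, and average order value.

    orders_by_user: one list of order values per user; an empty list
    means the user did not purchase (hypothetical input format).
    """
    n_users = len(orders_by_user)
    if n_users == 0:
        raise ValueError("at least one user is required")

    # Users with at least one order count as converted.
    n_converted = sum(1 for orders in orders_by_user if orders)
    n_orders = sum(len(orders) for orders in orders_by_user)
    revenue = sum(sum(orders) for orders in orders_by_user)

    return {
        "arpu": revenue / n_users,
        "conversion_rate": n_converted / n_users,
        # Average order value is undefined without orders; report 0.0 here.
        "aov": revenue / n_orders if n_orders else 0.0,
    }


# Illustrative data only: four users, two of whom placed one order each.
control = [[], [20.0], [], [40.0]]
summary = summarize_group(control)
print(summary)
```

When every converting user places exactly one order, ARPU equals conversion rate multiplied by average order value, which is why looking at the two secondary metrics shows where a change in ARPU comes from.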
4. Decide whether the relative (percentage) difference or the absolute difference is of interest when measuring the primary metric.
For most purposes, the test will be evaluated on the relative, or percentage, difference in the primary metric. This is the easiest to communicate to interested parties, as it translates readily into an effect on revenue, profit, or other business metrics. In some cases, however, you might be better off measuring the absolute difference. This may be true when the baseline is well established, although the absolute difference remains harder to communicate even then. Your choice here affects the statistical model you use, including the way you later calculate the expected random error.
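The two ways of expressing a difference can be sketched as follows. The function names and the ARPU figures are hypothetical and chosen only to illustrate the arithmetic.

```python
def absolute_difference(control_value, variant_value):
    """Difference in the metric's own units (for example, currency per user)."""
    return variant_value - control_value


def relative_difference_pct(control_value, variant_value):
    """Difference expressed as a percentage of the control value."""
    if control_value == 0:
        raise ValueError("relative difference is undefined for a zero baseline")
    return (variant_value - control_value) / control_value * 100.0


# Illustrative numbers only: ARPU of 20.00 in control and 21.00 in the variant.
arpu_a, arpu_b = 20.0, 21.0
print(absolute_difference(arpu_a, arpu_b))      # difference in currency units
print(relative_difference_pct(arpu_a, arpu_b))  # difference in percent
```

The same absolute difference corresponds to a different relative difference whenever the baseline changes, which is one reason the choice needs to be made before the statistical model is fixed.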
5. Add a superiority or non-inferiority margin to the alternative hypothesis if it only makes sense to implement a solution when the change in the primary metric clears a specific threshold.
If you need an increase of 1% or more in the primary metric, then 1% is your superiority margin. You would then reframe your alternative hypothesis to include the margin, which is the minimum effect that needs to be proven. For example, "Removing elements E1 and E2 from the checkout page will streamline the user flow and result in increased average revenue per user" becomes "Removing elements E1 and E2 from the checkout page will streamline the user flow and result in at least a 1% increase in average revenue per user." Similarly, if you’re testing for non-inferiority, you need to show that any decrease in the primary metric is no larger than a given, usually small, negative margin before implementing a winning variant.
6. Determine the type of random error the primary test metric is subject to, for example whether the metric is user-based or session-based, and choose an appropriate model or solution.
The standard normal model applies to differences in user-based metrics: observations are assumed to be independent and identically distributed, and the difference in means is treated as normally distributed (NIID). This applies to both binomial and non-binomial data regardless of the underlying distribution, because with a sufficiently large sample the sampling distribution of the mean is approximately normal. If the primary KPI is session-based, the independence assumption is violated whenever the typical user has multiple sessions, since sessions from the same user are correlated. Such KPIs might require more advanced statistical approaches, like bootstrapping, before reliable estimates can be obtained.
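As one illustration of the bootstrapping idea for a session-based KPI, the sketch below resamples whole users rather than individual sessions, so that correlated sessions from the same user stay together. This user-level resampling scheme, the function names, and the `(sessions, conversions)` input format are assumptions made for this example; the text above does not prescribe a specific bootstrap procedure.

```python
import random


def session_rate(users):
    """Conversions per session across a list of (sessions, conversions) tuples."""
    sessions = sum(s for s, _ in users)
    conversions = sum(c for _, c in users)
    return conversions / sessions


def bootstrap_relative_diff(control, variant, n_resamples=2000, seed=0):
    """Bootstrap distribution of the relative difference (in percent).

    Users, not sessions, are resampled with replacement. Each user is
    assumed to have at least one session.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        sample_a = rng.choices(control, k=len(control))
        sample_b = rng.choices(variant, k=len(variant))
        rate_a = session_rate(sample_a)
        if rate_a == 0:
            continue  # relative difference undefined for this resample
        rate_b = session_rate(sample_b)
        diffs.append((rate_b - rate_a) / rate_a * 100.0)
    diffs.sort()
    return diffs


def percentile_interval(sorted_values, lower=0.05, upper=0.95):
    """Crude percentile interval taken from an already sorted list."""
    last = len(sorted_values) - 1
    return sorted_values[int(lower * last)], sorted_values[int(upper * last)]
```

A typical use would be to call `bootstrap_relative_diff` on the per-user data of each group and then read an interval for the relative difference from `percentile_interval`. The number of resamples and the percentile levels here are placeholders to be set according to the test's requirements.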
7. Write down the full statistical alternative hypothesis as a mathematical expression, and include the model assumptions.
Taking the example above:
H1: ((ARPU(B) - ARPU(A)) / ARPU(A) x 100)% > 1%, NIID
where A is the control group and B is the variant.
8. Invert the inequality sign in the statistical alternative hypothesis to get a statistical null hypothesis you can use to perform sample size calculations and statistical estimation.
Inverting the inequality sign causes the null hypothesis to exhaust all other possible values of the primary metric of interest. Continuing with the above example:
H0: ((ARPU(B) - ARPU(A)) / ARPU(A) x 100)% ≤ 1%, NIID
The statistical null hypothesis can then be tested in an A/B test and possibly rejected, which makes the alternative hypothesis the established claim.
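One possible way to test a null hypothesis of this form is sketched below, under the NIID assumptions and with a large sample. It uses the fact that, for a positive control mean, "relative lift ≤ 1%" is equivalent to "mean(B) - 1.01 x mean(A) ≤ 0", and applies a one-sided z-test to that linear combination. The function name, the input format (one revenue value per user), and the choice of this particular test are assumptions made for illustration, not a prescribed method.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def superiority_test(control, variant, margin_pct=1.0):
    """One-sided z-test of H0: relative lift <= margin_pct.

    control, variant: per-user metric values (e.g. revenue per user).
    Assumes a positive control mean and non-constant data in each group.
    Returns the z statistic and the one-sided p-value.
    """
    k = 1.0 + margin_pct / 100.0
    n_a, n_b = len(control), len(variant)

    # Estimate of mean(B) - k * mean(A), which is <= 0 under H0.
    estimate = mean(variant) - k * mean(control)

    # Standard error of the linear combination of two independent means.
    std_error = sqrt(variance(variant) / n_b + k ** 2 * variance(control) / n_a)

    z = estimate / std_error
    p_value = 1.0 - NormalDist().cdf(z)
    return z, p_value
```

A small p-value is evidence against H0, and therefore in favor of the alternative that the lift exceeds the margin. The threshold it is compared against is set when planning the test.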