Correctly applying the hypothesis testing process is essential for drawing valid conclusions from A/B tests. In this post, I explain the statistical concepts behind conducting a successful A/B test.
A hypothesis is a simple, testable statement that helps compare the two variants. The assumption of a statistical test is called the null hypothesis (H0), and a violation of that assumption is called the alternative hypothesis (H1). The null hypothesis is often called the default assumption, or the assumption that nothing has changed. The testing procedure begins with the assumption that the null hypothesis is true. For instance, if you’re conducting an A/B test to determine whether a color change you made to your website resulted in a higher end-of-visit conversion rate, your null hypothesis would state that the change had no effect on the conversion rate.
To lay down a well-formed hypothesis, include the following parameters in your statement.
a) Population (who is being affected by the change?)
b) Intervention (what change are you testing?)
c) Comparison (what are you comparing it against?)
d) Result (what is the resulting metric that you’re measuring?)
e) Time (when are you measuring the resulting metric?)
In the previous example, a well-formed hypothesis would be:
H0: Website X visitors (Population) who visit the page with the change in color (Intervention) will not have higher end-of-visit (Time) conversion rates (Result) when compared to visitors who visit the default page (Comparison).
H1: Website X visitors who visit the page with the change in color will have higher end-of-visit conversion rates when compared to visitors who visit the default page.
A statistical hypothesis test returns a value called the p-value, which is used to interpret the statistical significance of the test. The p-value determines whether you reject the null hypothesis or fail to reject it: you compare it to a threshold called the significance level, which is set before starting the experiment. The significance level (alpha), which is also the probability of a Type I error, is usually set at 5% or 0.05. The choice of alpha depends on the business scenario and the costs associated with Type I and Type II errors.
The textbook definition of the p-value is the probability of observing a result at least as extreme as the one computed, given that the null hypothesis is true. In simpler terms, the p-value can be seen as the probability of getting the observed result (or a more extreme one) purely by chance. For example, if we observed a p-value of 0.02 for our scenario, then the probability of observing a conversion rate this much higher (say X%) for the treatment group (visitors who received the page with the color change) by chance alone is just 2%. Since this p-value is less than the significance level, we reject the null hypothesis and conclude that the color change indeed resulted in a higher conversion rate than the default page.
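To make this concrete, here is a minimal sketch of how the p-value for such a test could be computed. It uses a one-sided two-proportion z-test with a pooled rate under the null hypothesis; the conversion counts (100/1000 for control, 130/1000 for treatment) are hypothetical numbers chosen for illustration, not figures from the post.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """One-sided z-test: is the treatment rate (B) higher than the control rate (A)?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate assumed under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 1 - NormalDist().cdf(z)                 # P(result at least this extreme | H0)
    return z, p_value

# Hypothetical data: control converts 100/1000 visits, treatment 130/1000
z, p = two_proportion_z_test(100, 1000, 130, 1000)
```

If the resulting p-value falls below the pre-set significance level (0.05), you would reject the null hypothesis, exactly as described above.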
While conducting A/B tests, the two types of errors you should care about are Type I and Type II errors. A Type I error is rejecting the null hypothesis when it is actually true; a Type II error is failing to reject the null hypothesis when it is actually false.
The probability of a Type II error is denoted by β. Statistical power is then defined as its complement, 1 − β. So, if the Type II error rate is 0.1 (10%), the statistical power of that test is 0.9, or 90% (1 − 0.1 = 0.9).
The statistical power of a test depends on three parameters of the A/B test’s design: the minimum effect size of interest, the significance level (alpha) set as the threshold, and the sample size.
a. The lower the significance level (alpha), the lower the statistical power, given that all other factors are constant.
b. The smaller the minimum effect size of interest, the lower the statistical power, given that all other factors are constant.
c. The larger the sample size, the higher the statistical power, given that all other factors are constant.
Choosing an adequate level of statistical power is highly important while designing an A/B test in order to avoid wasting resources.
In your A/B test, after determining whether the variant had an impact, it is also important to measure how large that impact is. The effect size is the absolute difference between the two groups’ resulting metrics (the conversion rates in our case). It can be expressed either in percentage points or in units of standard deviation.
Estimating the effect size at the start of the test helps in determining the sample size and the statistical power of the test.
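Both ways of expressing the effect size can be computed directly. As a standardized measure for two proportions I use Cohen’s h here (a common choice, though the post itself does not name one); the 10% and 13% rates are hypothetical.

```python
from math import asin, sqrt

def effect_sizes(p_control, p_treatment):
    """Effect size as (a) absolute difference in percentage points and
    (b) Cohen's h, a standardized effect size for two proportions."""
    diff_pp = (p_treatment - p_control) * 100
    cohens_h = 2 * asin(sqrt(p_treatment)) - 2 * asin(sqrt(p_control))
    return diff_pp, cohens_h

# Hypothetical rates: 10% control vs 13% treatment
diff, h = effect_sizes(0.10, 0.13)
```

A standardized effect size like h is what power-analysis tools typically expect as input when computing the required sample size.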
Confidence intervals give the range of values that is likely to contain the true value of the observed metric. You would need to construct separate confidence intervals for the metric for both the control and the treatment group. For our example, we could say that we are 95% confident that the true conversion rate for the colored page is X% +/- 3%. The 3% here is the margin of error.
One thing to keep in mind is that if the 95% confidence interval for the control group’s conversion rate overlaps with the 95% confidence interval for the treatment group’s conversion rate, then you would need to keep testing to arrive at a statistically significant result.
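A minimal sketch of constructing these intervals, using the normal-approximation (Wald) interval for a proportion; the conversion counts are the same hypothetical 100/1000 and 130/1000 figures used earlier.

```python
from math import sqrt
from statistics import NormalDist

def conversion_ci(conversions, n, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a conversion rate."""
    p = conversions / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # two-sided critical value
    margin = z * sqrt(p * (1 - p) / n)                   # the margin of error
    return p - margin, p + margin

# Hypothetical data: control 100/1000 conversions, treatment 130/1000
ci_control = conversion_ci(100, 1000)
ci_treatment = conversion_ci(130, 1000)
overlap = ci_control[1] >= ci_treatment[0]
```

Worth noting: the overlap check is conservative, so two intervals can overlap slightly even when a direct two-sample test on the same data is significant; treating overlap as a signal to keep collecting data is therefore a cautious rule of thumb rather than a formal test.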