I’ve been reading an article on the GetElastic blog, one of my favorite resources on e-commerce marketing. At the end of the article “A/B Test Case Study: Can Split Test Results Be Trusted?” they present a case study in which two identical variations were tested against each other, and one of them performed 4.97% better than the other.
The possibility of statistical error in optimization tests is a pretty big issue. I’ve seen case studies all over the internet with ridiculous conclusions and ridiculously big improvements, all because whoever ran the test made a mistake (sometimes I believe they even do it intentionally to show better results to their clients).
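To get a feel for how two identical variations can end up several percent apart, here is a minimal simulation sketch. The 2% conversion rate, the 5 000 visitors per variation and the function names are my own assumptions for illustration, not figures from the GetElastic case study.

```python
import random


def simulate_aa_test(visitors_per_arm, true_rate, rng):
    """Run one A/A test: both arms share the same true conversion rate.

    Returns the relative difference of arm B over arm A,
    or None when arm A had no conversions at all.
    """
    conv_a = sum(rng.random() < true_rate for _ in range(visitors_per_arm))
    conv_b = sum(rng.random() < true_rate for _ in range(visitors_per_arm))
    if conv_a == 0:
        return None
    return (conv_b - conv_a) / conv_a


def share_of_large_gaps(runs, visitors_per_arm, true_rate, threshold, seed=0):
    """Fraction of A/A tests whose arms differ by more than `threshold`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    large = 0
    valid = 0
    for _ in range(runs):
        diff = simulate_aa_test(visitors_per_arm, true_rate, rng)
        if diff is None:
            continue
        valid += 1
        if abs(diff) > threshold:
            large += 1
    return large / valid if valid else 0.0


if __name__ == "__main__":
    # Hypothetical store: 2% conversion rate, 5 000 visitors per variation.
    share = share_of_large_gaps(runs=200, visitors_per_arm=5000,
                                true_rate=0.02, threshold=0.05)
    print(f"A/A tests with a gap above 5%: {share:.0%}")
```

With only about a hundred conversions per variation, a large share of these simulated A/A tests shows a gap above 5% even though nothing differs between the two arms.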
There are several things that can go wrong with these tests and here are the most common mistakes:
Small test sample
During the latest presidential elections here in Croatia, some smart people calculated that the margin of error for a sample of 10 000 randomly selected people is ~3%.
In all the years these polls have been run, they haven’t been wrong; the actual results always landed within that ~3% deviation.
Please note that this is a sample of 10 000 people who performed the desired action. If we transfer this knowledge to the world of e-commerce, it means that our sample should be 10 000 transactions, not 10 000 visitors.
There aren’t many stores in the world that can get 10 000 transactions in a reasonable time period, so I’m led to believe that most of the conversion rate optimization split tests performed out there are not really accurate.
This calculation was done for an electorate of 4 402 045 people. The size of your target audience does enter the formula, but its effect is negligible once the audience is far larger than the sample: a B2C e-commerce store with a wide market of, let’s say, 45 000 000 potential customers needs practically the same number of transactions as one with 4 000 000 to reach the same margin of error. What really changes the requirement is the precision you want; cutting the margin of error in half takes roughly four times the sample.
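As a rough cross-check of the figures above, here is the textbook margin-of-error formula for a proportion under simple random sampling, including the finite population correction. The 95% confidence level (z = 1.96) and the worst-case proportion p = 0.5 are my assumptions, not values taken from the Croatian polls. Under these assumptions a sample of 10 000 comes out closer to ±1% than ±3%, and the population size barely moves the result; real polls may report wider margins depending on how they sample.

```python
import math


def margin_of_error(sample_size, population=None, p=0.5, z=1.96):
    """Margin of error for a proportion under simple random sampling.

    p=0.5 is the worst case; z=1.96 corresponds to ~95% confidence.
    When `population` is given, the finite population correction is applied.
    """
    standard_error = math.sqrt(p * (1 - p) / sample_size)
    if population is not None:
        standard_error *= math.sqrt((population - sample_size) / (population - 1))
    return z * standard_error


def required_sample_size(target_margin, p=0.5, z=1.96):
    """Sample needed for a target margin, ignoring the population size."""
    return math.ceil(z ** 2 * p * (1 - p) / target_margin ** 2)


if __name__ == "__main__":
    for population in (4_402_045, 45_000_000):
        moe = margin_of_error(10_000, population)
        print(f"population {population:>10,}: margin of error {moe:.2%}")
    print("sample needed for a 3% margin:", required_sample_size(0.03))
```

Whether 10 000 transactions is the right target for a particular store depends on the margin it can live with, so it is worth running the numbers for your own case rather than copying a figure from an election poll.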
Short test period
Let’s say you have a store that could actually get a relevant test sample of 10 000 transactions within days. You still need to run the test for longer than a few days.
What could happen is that you test variations only during working days, when a certain population with certain behavior comes to your site. Your test then doesn’t capture the behavior of people visiting your store during the weekend, and that population might behave completely differently from your test sample.
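One simple guard against the weekday problem is to run the test in whole weeks, so every day of the week is represented equally. A minimal sketch of that idea follows; the helper names and the example dates are hypothetical.

```python
import math
from datetime import date, timedelta


def full_week_duration(days_needed):
    """Round a planned duration up to a whole number of weeks (in days)."""
    return max(1, math.ceil(days_needed / 7)) * 7


def weekday_counts(start, days):
    """Count how often each weekday (Monday=0 .. Sunday=6) falls in the test."""
    counts = [0] * 7
    for offset in range(days):
        counts[(start + timedelta(days=offset)).weekday()] += 1
    return counts


if __name__ == "__main__":
    start = date(2024, 1, 1)  # hypothetical start date, a Monday
    planned = 4               # sample size reached after four days
    print(weekday_counts(start, planned))                      # weekend missing
    print(weekday_counts(start, full_week_duration(planned)))  # every day once
```

Running full weeks does not cover seasonal effects or campaigns, but it removes the most obvious gap between the test sample and your regular visitors.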
Unrepresentative test sample
Choose your methods wisely. I’ve explained this in an earlier article: you could increase the conversion rate of a store while actually decreasing its revenue. It’s highly recommended to read that article to understand why increasing the conversion rate (the percentage) is not the actual goal of conversion rate optimization (I know, it sounds crazy, but just read it).
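The arithmetic behind that claim is easy to sketch. The numbers below are made up purely for illustration: variation B converts more visitors but with a smaller average order, so revenue per visitor goes down.

```python
def summarize(visitors, orders, revenue):
    """Return the conversion rate and revenue per visitor for one variation."""
    return {
        "conversion_rate": orders / visitors,
        "revenue_per_visitor": revenue / visitors,
    }


if __name__ == "__main__":
    # Hypothetical data, not taken from any real test.
    variation_a = summarize(visitors=10_000, orders=200, revenue=20_000)
    variation_b = summarize(visitors=10_000, orders=250, revenue=17_500)
    for name, stats in (("A", variation_a), ("B", variation_b)):
        print(f"{name}: conversion {stats['conversion_rate']:.2%}, "
              f"revenue per visitor {stats['revenue_per_visitor']:.2f}")
```

Judged by conversion rate alone B wins; judged by revenue per visitor A wins, which is why the metric you pick matters as much as the sample.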