A/B Testing Forecast: A Solution for Statistical Significance and Small Sample Sizes

A while back, the conversion team at my company came to our analytics team about running tests on our website. We had a trial sign-up form that was very long, requiring 8–10 data points including name, email, industry, etc. That's overkill compared to a lot of digital products today, which want to get people into their UI as fast as possible.

The conversion team wanted to test a much shorter trial form and see what that would do to our conversion rate. We were happy to help them out, but there was one big challenge: they were hoping to run a series of 5–6 tests over the next 3 months.

That was extremely difficult to do with our business model. Not only did we usually need to let tests run for a few weeks to get a large enough sample size, we also offered a 60-day trial period. To get complete, statistically significant test results, it could take us almost 3 months to get a read on 1 test, never mind 5 or 6. To solve this problem, I built a testing forecast that would help us make decisions on test results in 3–4 weeks.

To provide some background on the metrics being measured: the conversion team typically ran tests that affected our website sign-ups, which we measured by Visitor to Trial Rate (V:T), as well as tests that affected our customer conversion rates, measured by Trial to Pay Rate (T:P). For this forecast, the primary measurement combined the two into Visitor to Pay Rate (V:P), which captures the net effect when the two metrics move in opposite directions. For the shorter trial form test, our hypothesis was that a shorter form would lead to a higher V:T rate, but we were concerned that the T:P rate might decrease because those trialers may not be as invested. V:P let us see the net effect of that, and would help us declare a winner.
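To make the relationship concrete, here's a quick sketch with made-up numbers (they aren't ours); the only point is that V:P is just V:T multiplied by T:P, so it nets out movement in either metric.

```python
# Illustrative numbers only -- how the three rates relate.
visitors = 10_000          # website visitors in the test period
trials   = 400             # trial sign-ups
paid     = 60              # trials that converted to paid

v_t = trials / visitors    # Visitor to Trial rate  (4.0%)
t_p = paid / trials        # Trial to Pay rate      (15.0%)
v_p = paid / visitors      # Visitor to Pay rate    (0.6%)

assert abs(v_p - v_t * t_p) < 1e-12   # V:P is V:T * T:P
```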

As far as the forecast went, the V:T part was pretty straightforward. We would let the test run for a few weeks and track V:T rates for the test group(s) and the control. I included inputs that let people add X additional days the test could run for, with Y total visitors being added each day. The forecast would then take those inputs, split them appropriately between the test and control groups, and apply the historical V:T rates we had seen so far. This forecasted visitor data would be added to the historical visitor data, giving us a much bigger sample size.
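The actual forecast wasn't code, but a rough Python sketch of this step might look like the following; the function name, the even traffic split, and all of the numbers are illustrative rather than taken from the real model.

```python
def forecast_visitors(hist, extra_days, visitors_per_day, split=None):
    """Extend each group's visitor/trial counts with a simple forecast.

    hist:  dict of group -> {"visitors": int, "trials": int} (historical data)
    split: dict of group -> share of future traffic (defaults to an even split)
    """
    groups = list(hist)
    if split is None:
        split = {g: 1 / len(groups) for g in groups}

    out = {}
    for g in groups:
        # Future visitors for this group, from the "X more days, Y per day" inputs
        future_visitors = extra_days * visitors_per_day * split[g]
        # Apply the V:T rate observed so far to the forecasted visitors
        hist_vt = hist[g]["trials"] / hist[g]["visitors"]
        future_trials = future_visitors * hist_vt
        out[g] = {
            "visitors": hist[g]["visitors"] + future_visitors,
            "trials": hist[g]["trials"] + future_trials,
        }
    return out

# Example: 21 more days at ~1,500 visitors/day, split evenly between groups
combined = forecast_visitors(
    {"control":    {"visitors": 9_800, "trials": 390},
     "short_form": {"visitors": 9_700, "trials": 520}},
    extra_days=21, visitors_per_day=1_500,
)
```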

The T:P part of the model was a little more involved. Now that we had our total visitor and V:T data (historical + forecasted), the model next needed to trend out what the T:P data would look like. To do this, I captured T:P rate snapshots at 7-day increments up to 56 days (not a perfect 60, but close enough for my purposes). Because most of our conversions came in the first few days of the trial period, our cumulative conversion rate followed a logarithmic curve. A logarithmic formula was fit to the few snapshots we had, so we could predict what our 56-day T:P rate would be with only a few weeks of data. Although this was great for the forecast, we were still happy to see the actual data come in as people aged through their 60-day trials: the more actual data we had, the more accurate the prediction became.
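As a rough illustration of that curve fit (not the actual model, and the snapshot values below are made up), fitting a logarithmic curve of the form a·ln(day) + b to a few weekly snapshots and reading off day 56 could look something like this.

```python
import numpy as np
from scipy.optimize import curve_fit

# Cumulative T:P snapshots taken every 7 days (illustrative values)
days = np.array([7, 14, 21, 28])
tp   = np.array([0.060, 0.095, 0.115, 0.128])

def log_curve(day, a, b):
    # Conversions pile up early in the trial, so cumulative T:P ~ a*ln(day) + b
    return a * np.log(day) + b

(a, b), _ = curve_fit(log_curve, days, tp)

tp_day56 = log_curve(56, a, b)   # predicted end-of-trial T:P rate
```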

At this point, we had the total visitor data I mentioned before, as well as a predicted 56-day T:P rate. The last piece of the forecast was a set of optional adjustment inputs for future T:P performance, letting the team be either more conservative or more aggressive. If the conversion team thought T:P might be better than what we’d seen so far, they could increase the forecasted T:P by 10%; if they wanted to be more conservative, they could decrease it by 10%.
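For illustration, here's one way the pieces could come together in Python: combine the total visitor counts, apply the forecasted T:P (with the optional adjustment), and check significance on the resulting V:P rates with a standard two-proportion z-test. The actual model may have used a different significance test, and every name and number here is made up.

```python
from math import sqrt
from statistics import NormalDist

def forecasted_vp_significance(control, test, tp_adjust=1.0):
    """Forecasted V:P for control vs. test, with a two-proportion z-test.

    control / test: {"visitors": ..., "trials": ...} are historical + forecasted
    totals; "tp" is the predicted 56-day T:P rate. tp_adjust is the optional
    +/-10% style multiplier (e.g. 1.10 or 0.90), applied here to both groups,
    which is one interpretation of the adjustment input.
    """
    def vp(g):
        # Forecasted paying customers / total visitors = forecasted V:P
        return (g["trials"] * g["tp"] * tp_adjust) / g["visitors"]

    p_c, p_t = vp(control), vp(test)
    n_c, n_t = control["visitors"], test["visitors"]

    # Pooled two-proportion z-test on the forecasted rates
    pooled = (p_c * n_c + p_t * n_t) / (n_c + n_t)
    se = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_c, p_t, p_value

# Example: a conservative read on a test that lifted V:T but may have hurt T:P
control_vp, test_vp, p = forecasted_vp_significance(
    {"visitors": 41_300, "trials": 1_638, "tp": 0.135},
    {"visitors": 41_200, "trials": 2_205, "tp": 0.122},
    tp_adjust=0.90,
)
```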

To recap, the total visitor data (historical + forecasted), the forecasted T:P data, and the optional T:P adjustment were combined to produce a forecasted V:P rate and a statistical significance reading. A forecast like this is never going to be 100% accurate, but it definitely saved us time, gave our conversion team more comfort in making decisions on tests, and allowed us to run more tests in a shorter time span than we previously could.