7 Experimentation Pitfalls
“Experimentation” refers to a set of practices used by companies and institutions to make data-driven decisions on products, features, and processes. It typically involves the administration of different experiences (a.k.a. treatments) to groups of users. The groups are then statistically compared on key metrics of interest to pick the best experience that will be rolled out to all users.
Experimentation requires the integration of several engineering systems and the coordination of business and data functions.
In this article, we focus on 7 common pitfalls that are often overlooked when planning and running experiments. How a team addresses them goes a long way toward defining its experimentation culture.
1 The results do not generalize
The goal of an experiment is to replicate and compare, at a small scale, the real-world scenarios under the different user experiences that need to be tested. However, there are often practical limitations that make those small-scale replicas deviate substantially from the real experiences.
Example. A credit card company would like to test a lower interest rate and evaluate the impact on the average usage of its cards. Upon full roll-out, the company will be able to lower the interest rate by 30 basis points. However, to facilitate the technical implementation of the experiment, the experiment plan proposes to charge the standard interest rate and give cash back to some users. In particular, for the subset of users in the experiment, 50% of them will be assigned to the treatment group, charged the standard interest rate, and later given cash back in an amount equivalent to a 30 basis point drop in interest rate. Their card usage will then be compared to the usage of the other 50% of users, who will be assigned to the control group and will not receive cash back. This is a typical A/B test design. However, this experiment does not exactly replicate the experience that should be tested. Although financially equivalent, the cash back solution still requires the company to first charge users the standard interest rate and then partially reimburse them. This experience could have unexpected effects on the behavior of the users, which might differ from the real effect of a lower interest rate.
A valid experiment should minimize as much as possible the discrepancies between the experiment conditions and the conditions generated by a full roll-out of the features or processes that are being tested. This can be achieved by a rigorous experiment design that selects the right type of test (e.g. A/B test), unit of randomization (e.g. should treatments be administered to users, groups of users, or web sessions?), and avoids deviating from the desired experiences under consideration.
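For illustration, here is a minimal sketch of how deterministic assignment at a chosen unit of randomization (users, in this case) might be implemented, so that the same unit always receives the same experience across sessions. The salt, the bucket count, and the 50/50 split are assumptions made for this example, not a prescribed implementation.

```python
import hashlib

def assign(unit_id: str, experiment_salt: str = "lower-rate-test", treatment_share: float = 0.5) -> str:
    """Deterministically assign a randomization unit (e.g. a user id) to a group.

    Hashing the unit id with an experiment-specific salt keeps the assignment
    stable across sessions and independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment_salt}:{unit_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000  # map the hash to one of 10,000 buckets
    return "treatment" if bucket < treatment_share * 10_000 else "control"

print(assign("user_42"))  # the same user always lands in the same group
```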
2 Contamination
Contamination (a.k.a. network effect) is a particular case of the previous problem. Contamination occurs when different treatments of an experiment interfere with each other.
Example. A delivery app wants to test a new incentive structure to attract couriers during peak hours. An A/B experiment on a subset of couriers in a city proposes to expose 50% of them to the new incentive (the treatment group) and compare their platform engagement to the engagement of the other 50% of couriers, who are exposed to the old incentive structure (the control group). This experiment would not provide valid results, because the entire market can be affected by the new incentives. If the incentives are successful at attracting more couriers during peak hours, most of the couriers in the treatment group will rush to fill the available orders. As a consequence, the couriers in the control group would become less engaged, because they would observe lower demand than usual. In other words, the behavior of the treatment group affects the behavior of the control group. This experiment does not successfully compare a scenario with new incentives and a scenario with old incentives. Instead, it compares 2 hybrid scenarios.
A common solution to contamination problems is to run a switchback experiment: in its simplest form, all the users in the experiment are exposed to the same experience at any given time, and the different experiences are switched on and off over successive time intervals. The metrics of interest are then averaged across time intervals. This solution works well when users are not explicitly aware of the different experiences that they are being exposed to.
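To make the mechanics concrete, here is a minimal sketch of a switchback schedule and the per-interval aggregation described above. The 4-hour interval, the synthetic data, and the column names are assumptions made for this example.

```python
import pandas as pd

# Hypothetical event-level data: one row per order with a timestamp and a metric.
events = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=1_000, freq="15min"),
    "delivery_minutes": 30 + pd.Series(range(1_000)).mod(7),
})

# Switchback design: the whole market alternates between the two incentive
# structures in fixed 4-hour windows.
interval = events["timestamp"].dt.floor("4h")
events["variant"] = ["new_incentive" if i % 2 == 0 else "old_incentive"
                     for i in interval.rank(method="dense").astype(int)]

# Aggregate within each interval first, then average the intervals per variant.
per_interval = events.groupby([interval, "variant"])["delivery_minutes"].mean()
print(per_interval.groupby(level="variant").mean())
```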
3 Running too few/many experiments
Humans tend to make decisions based on personal preferences and observations, overlooking bias and alternative opinions. We might think that collecting data and waiting for the results of an experiment is a waste of resources and time. In those instances, we fail to recognize that experiments serve as a tool to make our decisions more objective.
Does this mean we should run an experiment before making every single decision? No. In fact, we should not run an experiment if its results will not impact our decisions. For example, if a regulation requires a company to roll out a new process or feature, there is no value in running a costly experiment to test it. In this case, the impact of the roll-out can still be measured, albeit less accurately, with a retrospective analysis.
Then what is the right number of experiments? There is no universal answer. Data teams and business units should agree on a set of principles to determine when it is appropriate to run experiments to improve the quality of their decisions. This can be thought of as a tradeoff between speed (making a decision quickly) and rigor (being more confident about the validity of a decision).
Good reasons to run an experiment include:
There is a need to choose between 2 or more experiences, and it is unclear which one is the better alternative.
A new feature needs to be released, but there is a possibility it could negatively affect key business metrics. A non-inferiority experiment can be used to test whether key business metrics do not deteriorate under the new feature (a minimal sketch of such a test follows these lists).
Good reasons NOT to run an experiment include:
A system or process is broken and needs to be fixed quickly, with a full roll-out.
The results of an experiment will have no impact on the decision making process.
There is a need to make a simple change to a feature or a process that will not impact business metrics (e.g. legal/compliance updates, small copy changes, etc.)
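Referring back to the non-inferiority point above, here is a minimal sketch of a one-sided non-inferiority z-test on a conversion metric. The counts and the 1-percentage-point margin are illustrative assumptions; in practice, the margin must be agreed upon before the experiment runs.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical results: conversions / users in each group.
conv_t, n_t = 9_850, 100_000   # treatment (new feature)
conv_c, n_c = 9_940, 100_000   # control
margin = 0.01                  # pre-specified non-inferiority margin (1 pp)

p_t, p_c = conv_t / n_t, conv_c / n_c
se = np.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)

# H0: p_t - p_c <= -margin (treatment is worse by at least the margin)
# H1: p_t - p_c >  -margin (treatment is non-inferior)
z = (p_t - p_c + margin) / se
p_value = 1 - norm.cdf(z)
print(f"diff = {p_t - p_c:+.4f}, z = {z:.2f}, p = {p_value:.4f}")
```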
4 Multiple testing
The multiple testing problem is a well-known issue in statistics: the more hypothesis tests that are run at a fixed false positive rate, the larger the expected number of false discoveries. In experiments, this problem arises when too many metrics are used to compare treatments, or when the dataset is sliced by too many dimensions, dividing the sample over and over until some signal is eventually found.
Example. We want to test whether a coin is fair, that is, the probability that it lands on either side is 50%. Each day, for 30 days, we toss it 100 times. Suppose that these are the number of heads obtained in each of the 30 days: 49 57 49 49 49 56 54 54 47 51 51 49 43 53 53 53 47 57 56 55 50 56 48 48 46 60 47 48 51 46.
In total, the proportion of heads is 51%. We run a statistical test to determine whether the difference between 51% and 50% is significant, and we obtain a p-value of 0.24. We conclude that the difference is not significant, because it is higher than the commonly used 0.05 threshold for p-values. We have no reason to believe that our coin is unfair. Let’s dig a bit deeper. If we restrict our attention to the first 15 days, the proportion of heads is 50.9%, with a p-value of 0.47, again not significant. In the last 15 days, the proportion of heads is 51.2%, with a p-value of 0.35, not significant. Similarly, if we look only at odd days of the month, we see a proportion of heads of 49.3%, with a p-value of 0.61. However, when we look at even days, the proportion of heads is 52.8%, with a p-value of 0.03, finally a significant result. We just found out that our coin is not fair on even days of the month! Or is it? The problem is that we have been fishing for evidence. We sliced the data so many times that eventually we found a result that is significant, but in reality it is simply due to randomness - a false positive result.
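The calculations above can be reproduced in a few lines of code. Here is a sketch using an exact binomial test; the resulting p-values come out close to the rounded figures quoted above (small differences are possible depending on which test is used).

```python
from scipy.stats import binomtest

heads = [49, 57, 49, 49, 49, 56, 54, 54, 47, 51, 51, 49, 43, 53, 53,
         53, 47, 57, 56, 55, 50, 56, 48, 48, 46, 60, 47, 48, 51, 46]
tosses_per_day = 100

slices = {
    "all 30 days": heads,
    "first 15 days": heads[:15],
    "last 15 days": heads[15:],
    "odd days": heads[0::2],
    "even days": heads[1::2],
}

for name, days in slices.items():
    k, n = sum(days), tosses_per_day * len(days)
    result = binomtest(k, n, p=0.5)  # exact two-sided test against a fair coin
    print(f"{name:>14}: {k / n:.3f} heads, p = {result.pvalue:.3f}")
```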
Experiments are used to make decisions, often binary in nature (roll out a new feature or not) and should be based on the minimum amount of metrics needed to make those informed decisions. This best practice reduces the number of false discoveries. Outside of experiments, it is still possible to analyze historical data for research purposes, and explore different subpopulations and other dimensions to come up with new hypotheses.
There are also statistical solutions to correct for the multiple testing problem. For example, the Bonferroni correction lowers the p-value threshold that is needed to call a result significant. In our coin example, the corrected p-value threshold is 0.05/5 = 0.01, where 5 is the number of tests that we performed. Based on the new threshold, none of the performed tests is significant and we have no reason to believe our coin is unfair. However, the Bonferroni correction tends to be conservative, and makes it harder to detect small but real signals in the data. Another popular alternative to control for false positives is the Benjamini–Hochberg procedure.
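As a sketch, both corrections can be applied to the five p-values from the coin example using statsmodels; with either Bonferroni or Benjamini–Hochberg, the even-days result is no longer significant.

```python
from statsmodels.stats.multitest import multipletests

# The five (rounded) p-values from the coin-slicing example above.
pvals = [0.24, 0.47, 0.35, 0.61, 0.03]

for method in ("bonferroni", "fdr_bh"):
    reject, adjusted, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(method, [f"{p:.2f}" for p in adjusted], "any significant:", reject.any())
```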
5 Peeking & early stopping
When running an experiment, we are often tempted to look at preliminary results while the experiment is still running, and to stop it as soon as we see significant results. This is called peeking, and it introduces bias. We typically cap the false positive rate at 5%; if we stop experiments immediately after seeing significant results, we will make the wrong call more than 5% of the time across multiple experiments.
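A small simulation illustrates the point: when there is truly no difference between the groups, stopping at the first “significant” interim look pushes the false positive rate well above the nominal 5%. The number of looks and the batch size below are arbitrary choices made for this example.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_experiments, n_looks, batch = 2_000, 10, 200  # 10 interim looks of 200 users per group

false_positives = 0
for _ in range(n_experiments):
    a = b = n = 0.0
    for _ in range(n_looks):
        # Both groups draw from the SAME distribution: any "significant" result is false.
        a += rng.normal(0, 1, batch).sum()
        b += rng.normal(0, 1, batch).sum()
        n += batch
        z = (a / n - b / n) / np.sqrt(2 / n)
        if 2 * (1 - norm.cdf(abs(z))) < 0.05:  # peek: stop at the first significant look
            false_positives += 1
            break

print(f"false positive rate with peeking: {false_positives / n_experiments:.1%}")  # well above 5%
```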
According to best practices, we should determine the length of the experiment before it runs, using power analysis. Data Scientists and stakeholders should agree on the tradeoff between speed and rigor, and if they determine that an experiment is needed, they should run it for its planned duration. Of course some business exceptions are possible, for example when legal needs override statistical rigor.
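For completeness, here is a minimal power-analysis sketch with statsmodels for sizing an A/B test on a conversion rate before it starts; the baseline rate, minimum detectable effect, power, and daily traffic figure are all assumptions made for this example.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10  # current conversion rate (assumption)
mde = 0.11       # minimum conversion rate we want to be able to detect (+1 pp, assumption)

effect_size = proportion_effectsize(mde, baseline)  # Cohen's h
n_per_group = NormalIndPower().solve_power(effect_size=effect_size,
                                           alpha=0.05, power=0.8,
                                           ratio=1.0, alternative="two-sided")

daily_users_per_group = 2_000  # expected traffic per group per day (assumption)
print(f"required users per group: {n_per_group:,.0f}"
      f" (about {n_per_group / daily_users_per_group:.1f} days"
      f" at {daily_users_per_group:,} users per group per day)")
```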
When peeking or running tests at multiple points in time is genuinely needed, it is best to use methodologies designed for that purpose, such as sequential testing, Bayesian approaches, and multi-armed bandits.
6 Novelty effect
The novelty effect is the tendency of new features to initially skew key business metrics. Users are typically curious about new features and processes, and that initial interest might translate into improved business metrics. However, the novelty effect vanishes after some time. With this in mind, it is often a good idea to establish a “burn-in period” in experiments and ignore the data collected in the first X days.
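In practice, applying a burn-in period can be as simple as filtering out each user’s first days of exposure before computing the experiment metrics. The sketch below assumes event-level data with hypothetical column names and an illustrative 7-day window.

```python
import pandas as pd

BURN_IN_DAYS = 7  # ignore each user's first week of exposure (illustrative choice)

# Hypothetical event-level data: one row per user-day with the experiment group,
# the date the user was first exposed, and the metric of interest.
events = pd.DataFrame({
    "group": ["treatment", "treatment", "control", "control"],
    "first_exposed": pd.to_datetime(["2024-03-01"] * 4),
    "event_date": pd.to_datetime(["2024-03-03", "2024-03-12", "2024-03-03", "2024-03-12"]),
    "sessions": [5, 3, 4, 4],
})

# Keep only events that occur after the burn-in period for each user.
mature = events[events["event_date"] >= events["first_exposed"] + pd.Timedelta(days=BURN_IN_DAYS)]
print(mature.groupby("group")["sessions"].mean())
```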
7 Practical limitations
Finally, every organization faces additional practical limitations in their ability to run experiments. Here is a non-exhaustive list:
Running experiments during holidays is risky, because the results typically do not generalize to the rest of the year.
It is not always possible to test new processes in every location. Platform limitations, legal constraints, and/or language barriers make it difficult to roll out treatments everywhere. Synthetic control studies are often used as a solution to this problem.
Multiple experiments on similar features could interfere with each other: experiment A, running at the same time as experiment B, might influence the findings of experiment B, and vice versa. To minimize this issue, coordination and transparency are needed across the teams that run experiments.
Conclusion
Setting up and running experiments is not a simple task. The decision of a company to invest in experimentation platforms and techniques involves many factors:
Which decisions need to be guided by experimental results?
Which types of experimentation techniques does the company need?
Which platform(s) and features are needed to run experiments?
Which systems, tools, and data need to be coordinated to run experiments?
Which metrics and statistics should be measured?
How should results be presented, interpreted, and leveraged?
How should metrics and results be standardized to increase speed?
Each of these questions represents a separate topic of discussion, as it can be answered in different ways depending on the needs of each company.
In this article, we covered 7 common pitfalls that are often overlooked, even by experienced teams.
Are you developing your experimentation platform? Are you trying to improve the quality of experimentation at your organization? Data Captains can help you develop a technical playbook to run experiments and contribute to improving your overall experimentation culture. Get in touch with us at info@datacaptains.com or schedule a free exploratory call.