Lyft product manager John Kirn recently published an article about the challenges the company faces when conducting experiments. Existing experimentation techniques did not fully suit Lyft’s real-time business nature or mitigate network effects. Lyft’s Experimentation team deployed new ones, such as time and region split testing, and improved internal experimentation norms and techniques.
A/B testing is a well-known technique consisting of a randomised experiment in which two variants are compared. For example, a user-interface change may be evaluated by rolling out one version to one subset of users and a second version to another. Users’ behaviour is then analysed to determine the more effective alternative.
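The user split described above is typically implemented with deterministic bucketing, so a user always sees the same variant across sessions. A minimal sketch (the function name and hashing scheme are illustrative, not Lyft's actual implementation):

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically bucket a user into a variant by hashing
    the experiment name together with the user id."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]
```

Because the assignment depends only on the user id and experiment name, no assignment table is needed and the split stays stable for the lifetime of the experiment.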
However, Lyft’s team concluded that the ride-sharing model is subject to strong interference effects. Given its nature, a test may change one user’s behaviour in a way that interferes with other users’ choices, effectively violating the statistical assumption that experimental units are independent of each other.
Lyft’s experiments are mostly user split tests, but the team also uses other experimentation techniques to mitigate undesired network effects.
The second most common test type at Lyft is the time split test. In this test, users in the same geographical area receive the same experience during a specific period.
Time split testing is a powerful way to establish causation in the face of network effects, as it reduces interference between users.
Since it can cause inconsistent user experiences, Lyft tries to reserve this type of test for experiments affecting parts of its system that are not visible to users.
Early versions of time split tests were vulnerable to interference from force majeure events (e.g. storms, outages). Lyft is currently applying new causal inference methodologies to mitigate these interferences and time split variance reduction techniques to accelerate this type of test.
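In a time split test, the variant is a function of the time window rather than the user, so everyone in a region shares the same experience at any given moment. A minimal sketch of window-based assignment (the alternating scheme and function name are illustrative assumptions; real deployments typically randomise window assignment):

```python
from datetime import datetime

def time_split_variant(now: datetime, window_hours: int = 1,
                       variants=("control", "treatment")) -> str:
    """All users in a region share one variant during each time window.
    This simplified sketch alternates variants between consecutive windows."""
    window_index = int(now.timestamp() // (window_hours * 3600))
    return variants[window_index % len(variants)]
```

Because assignment depends only on the clock, interference between users within a window is eliminated; the cost, as noted above, is sensitivity to time-correlated shocks such as storms or outages.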
Another test type in use at Lyft is region split testing. As its name implies, region split testing presents the experiment to all users within specific geographic boundaries.
For control [in region split testing], [Lyft] uses trends from before the test begins to develop counterfactual predictions of what would have happened had [Lyft] not launched a treatment.
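One simple way to build such a counterfactual from pre-launch trends is to fit a trend to the metric before the treatment begins, extrapolate it over the post-launch window, and take the observed-minus-predicted gap as the effect estimate. The linear-trend sketch below is an illustrative assumption, not Lyft's actual methodology, which likely uses more sophisticated causal-inference models:

```python
def counterfactual_effect(pre_period: list, post_period: list) -> float:
    """Fit a linear trend to the pre-launch metric, extrapolate it over the
    post-launch window as the counterfactual, and return the average
    observed-minus-predicted treatment effect."""
    n = len(pre_period)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(pre_period) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, pre_period))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    predicted = [intercept + slope * (n + i) for i in range(len(post_period))]
    effects = [obs - pred for obs, pred in zip(post_period, predicted)]
    return sum(effects) / len(effects)
```

For example, if the pre-period metric rises steadily by one unit per day, the fitted trend predicts that rise continuing; any lift above the extrapolation is attributed to the treatment.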
In addition, Lyft built a set of norms and techniques designed to enhance its experiment’s quality and the validity of its results.
One is a guided hypothesis workflow. This workflow ensures that the hypothesis is registered before the experiment is launched, reducing the phenomenon of hypothesising after the results are known (also known as HARKing). The association between the experiment, the hypothesis, and its components is also recorded, enabling quick hypothesis verification and discoverability.
To address the multiple comparisons problem, the Experimentation team built a “Multiple Hypothesis Testing correction” based on the Benjamini-Hochberg method, adjusting p-values when multiple metrics are evaluated in one experiment. The technique is, however, still under evaluation.
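The Benjamini-Hochberg procedure itself is standard and can be sketched compactly; the implementation below computes BH-adjusted p-values (in the original metric order) and is an illustration of the method, not Lyft's code:

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, preserving input order.

    Step-up procedure: sort p-values ascending, scale each by m/rank,
    then enforce monotonicity from the largest rank downwards.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):       # walk from largest p-value down
        i = order[rank]
        running_min = min(running_min, p_values[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted
```

A metric is then flagged as significant when its adjusted p-value falls below the chosen false-discovery-rate threshold, rather than comparing each raw p-value to 0.05.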
Lyft also built a “results and decision tracking” mechanism to improve alignment regarding decisions made and their trade-offs.
[Lyft] aims to use this results log to generate consensus on proper investment trade-offs (…) and then feed these back into the hypothesis workflow tool to help coordinate teams working in similar areas.
In response to the highly dynamic nature of the markets where it operates, Lyft wishes to “test widely and move quickly”. It invests in adaptive experimentation platforms and reinforcement learning approaches to achieve this goal.