Catching experiment curiosity questions
A lightweight framework to triage stakeholder engagement without derailing experiment momentum
I geared my last post toward data leaders with qualitative teams. My hope was to encourage stronger engagement with their product experimentation programs. But while I was writing, the Data Science Project Manager part of my brain was firing off protests. "Experimentation teams won't have time to go deep on all these one-off questions!"
I do feel this concern deeply. But I also feel that a healthy data culture is one that encourages and supports data curiosity. That inevitably means more questions coming in to the Data Science team.
So today I'm coming at the same topic from a different angle, exploring how DS leaders can triage the curiosity questions I teed up in my last post. I'll share the approach I take for these types of questions, and then walk through a popular experimentation case study — Google’s 41 shades of blue test — to apply it.
Triaging experiment curiosity questions
At their peak, experimentation teams aim to run multiple experiments concurrently, enabling A/B testing on the lion's share of changes made to a product. Efficiency is the name of the game.
So how do we cultivate data curiosity without sacrificing efficiency? Here's my 4-part triage framework for Data Science leaders to do exactly that.
(1) Understand your partners' motivations
A shared goal between data teams and their stakeholders is to build stronger, data-driven intuition over time. The more experiments that are run, the more we learn about what supports strong outcomes for users, and what doesn't.
The challenge comes in interpreting the experiment results. A causal link isn't always a helpful why. It's great when an experiment goes against our preconceived notions — it means we learned something new! But if stakeholders can’t adjust their mental model to account for the experiment, future experiments won't benefit from that learning. It’s the data leader’s job to support that mental model adjustment, not just run the experiments and report the results.
(2) Follow-ups default to discussion
Experimentation team leads quickly build intuition around when an experiment will need a follow-up beyond the standard reporting. Results surprisingly good? Twyman's Law suggests something is likely wrong, and more analysis is needed.
Data curiosity questions, by comparison, may not feel as critical, but they still need your attention. My practice is to default to addressing them with a discussion. To decide whether my team should go further with an analysis, I ask myself these questions:
Will the added context around experiment findings meaningfully unlock future ideation and innovation?
Will this experiment benefit from added context for posterity, such as an entry in an insights repository?
Will the extra analysis notably improve stakeholder relationships, or reduce thrash?
The bar will be different for each team and circumstance; ultimately, it's down to the data leader to set a tone that supports data curiosity without overwhelming their team.
(3) Articulate hypotheses with stakeholder help
Nothing randomizes a Data Scientist faster than an open-ended, unstructured analysis question. So once the call is made to go deeper on a curiosity question, it needs structure, fast. I view it this way: Since I'm stretching my team to help a stakeholder, the quid pro quo is that they need to be deeply involved up front in setting the boundaries on the work.
Here's how I break this down:
Identify the previously unconsidered variable at the core of each curiosity question. Then, determine whether that variable is available, accessible, and measurable. If you don't measure it today, is there a proxy that's workable? Nix any questions that can't be feasibly answered today.
Articulate hypotheses around any remaining questions. The hypotheses should be clear enough to scope the analysis you've signed up for. Clarify any ambiguity to make the work as turn-of-the-crank as possible.
Prioritize hypotheses, and commit to the top 2-3. The goal is to have work that takes one DS no more than half a day, full day max. If the follow-up lasts a week, it's not scoped enough.
(4) Stop after one round
It's possible that none of these hypotheses will bear fruit. The why behind the experiment may continue to be unsatisfying. That doesn't mean analysis continues indefinitely. My stake in the ground is that we stop after one round.
If the outcome causes the stakeholder to thrash even harder — maybe they are stuck on the experiment outcomes feeling unintuitive — consider rerunning the experiment down the road to rule out a false positive result.
I'm on a mission to improve corporate data culture for data professionals of all stripes, in Data Science, UX Research, Analytics, and beyond. If you’d like to join me, you can help by sharing this newsletter with the data culture drivers in your network.
Case study: Putting the framework to use
Let's apply this framework to a frequently discussed experiment: Google's 41 shades of blue test.1 For those unfamiliar, here is a brief rundown:
In 2009, Google was discussing what shade of blue to use for hyperlinks in its search engine results page. Rather than make the call, they ran an experiment where users randomly saw the page with one of 41 different shades of blue based on their condition assignment.
One shade of blue won out in the study. The resulting lift in clickthrough rate (CTR) translated to a $200M/year bump in revenue for Google.
This minute level of data-driven design (dictating specific color choices) led the project's Lead Visual Designer to resign publicly, a story that was subsequently picked up by multiple media outlets, including CBS, The New York Times, and Fast Company.
While this case is a quintessential example of using an experimentation program to unlock notable gains for your company, I think it can also serve as an important case study in data culture. It's rarely used that way, however. Data Scientists gloss over the Design side of this story as "unfortunate," if they address it at all. Designers point to their role in driving innovation, which can be hard to achieve through the marginal gains of A/B testing, and suggest that letting data drive such minute decisions takes away from Design's ability to innovate.
I have to think there's a middle ground here. Surely we can have data-driven design that yields gains in revenue while also supporting an active Design team with the agency to do their best work. Think about how you'd approach this, if you were kicking off this experiment today. Then, read on to see how I'd approach it using the above framework.
Understand Design's motivations
I don't know the folks involved in this case personally, but we can still seek to empathize with Design. Here's my first pass: Dictating a single color value undermines the coherence of the broader design system. It undercuts the holistic product design that Design is responsible for. Color choices aren't different dessert options on a buffet, available to be swapped out and paired with any meal. They're an integral part of a larger whole.
Being told to use a particular blue "because it's the winning blue" runs counter to the Design process. In other words, the causal link isn't a helpful why. For me, it'd be the equivalent of being told my team can only use Linear Regression for ML work. Saying it's for revenue may be a good enough "why" for the business, but it doesn't help me or my team effectively understand and extend that ask for future work, or grow as a discipline.
Gauge the follow-up
You probably won't have the foresight to know that someone will choose to leave the company on the heels of your experiment findings, or that their leaving will trigger national media attention. But as a DS leader, you should be attuned to whether a group of your stakeholders are notably thrashing in the wake of your findings.2
While I'd opt to go deeper than a discussion in this case, even pursuing a discussion proactively can go a long way. Think of it like a change management problem: the change you want to bring about is a more data-driven perspective in your stakeholder, and the work is understanding how best to structure your analyses to support that journey. Any engagement is better than leaving your stakeholder to thrash.
Articulate hypotheses
This step is a great one to work closely with Design to understand the problem space better. While colors are easy to change and experiment with, that level of one-off change doesn't mesh with Design's process.
My last post can be a good way to start the conversation. In this case, I think a Mediator analysis is warranted:
While there was a winning shade of blue, I doubt that the blue hex value itself is directly causing the change.3
Brainstorming with Design can yield some interesting Mediators that are worth exploring. Off the cuff, I'd consider the contrast ratio of the blue against a white background, as well as the color difference between the blue and other text colors around it. These are values that probably don't exist in the experiment dataset by default, but they can be calculated with minimal effort (see the sketch below).
Contrast ratio and color difference will, in turn, depend on the end user's monitor and its color profile. Depending on the telemetry you collect, you may or may not have signal here. If you do, articulate a scoped hypothesis. If not, move on without it.
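To make that concrete, here's a minimal sketch of how those candidate mediator features could be derived, assuming each experiment row simply records the assigned shade as a hex string. The contrast ratio follows the WCAG relative-luminance formula, and a plain RGB distance stands in as a rough proxy for color difference (a perceptual metric like CIEDE2000 would be more faithful). The names and values below are illustrative, not taken from the original experiment.

```python
# Sketch: deriving candidate mediator features from the assigned link color.
# Assumes each experiment row records the assigned shade as a hex string,
# e.g. "#2200CC". Names and values are illustrative, not from Google's data.

def hex_to_rgb(hex_color: str) -> tuple[float, float, float]:
    """Convert '#RRGGBB' to sRGB channel values in [0, 1]."""
    h = hex_color.lstrip("#")
    return tuple(int(h[i:i + 2], 16) / 255.0 for i in (0, 2, 4))

def relative_luminance(rgb: tuple[float, float, float]) -> float:
    """WCAG relative luminance of an sRGB color."""
    def linearize(c: float) -> float:
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (linearize(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg_hex: str, bg_hex: str = "#FFFFFF") -> float:
    """WCAG contrast ratio between a foreground and a background color."""
    l1 = relative_luminance(hex_to_rgb(fg_hex))
    l2 = relative_luminance(hex_to_rgb(bg_hex))
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)

def rgb_distance(hex_a: str, hex_b: str) -> float:
    """Euclidean distance in sRGB space, a rough proxy for color difference."""
    a, b = hex_to_rgb(hex_a), hex_to_rgb(hex_b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Example: one hypothetical candidate shade vs. the surrounding body text.
shade = "#2200CC"
print(contrast_ratio(shade))           # contrast against a white background
print(rgb_distance(shade, "#000000"))  # difference from black body text
```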
You could build a longer list, but this is already a good start. We’ve evolved the problem from one of choosing individual color values, to one of optimizing contrast. Being data-driven for optimal contrast fits more cleanly into Design’s process, and leaves space for future design work with that constraint dialed in.
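Once those features exist per experiment row, checking the mediation story can stay lightweight too. Here's a rough sketch under some loud assumptions: a hypothetical row-level dataset with a clicked flag, the assigned shade, and the contrast ratio derived above; a simple linear probability model; and, because contrast is a deterministic function of the assigned shade, a nested-model comparison rather than a textbook mediation analysis.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Sketch: a lightweight mediation check on hypothetical row-level data.
# Column names are illustrative, not from the original experiment:
#   clicked        -> 1 if the user clicked the link, else 0
#   shade_hex      -> assigned condition, one of the 41 hex values
#   contrast_ratio -> derived feature from the previous snippet
df = pd.read_parquet("experiment_rows.parquet")  # hypothetical path

# Model 1: CTR explained by the raw condition (shade fixed effects).
shade_model = smf.ols("clicked ~ C(shade_hex)", data=df).fit()

# Model 2: CTR explained by the candidate mediator alone. Because contrast
# ratio is constant within each shade, this model is nested inside Model 1.
contrast_model = smf.ols("clicked ~ contrast_ratio", data=df).fit()

# Compare explanatory power: how much do the shade dummies add on top of
# the contrast ratio feature?
print(f"R^2, shade fixed effects: {shade_model.rsquared:.5f}")
print(f"R^2, contrast ratio only: {contrast_model.rsquared:.5f}")
print(sm.stats.anova_lm(contrast_model, shade_model))
```

A result where the shade dummies contribute little beyond contrast ratio is exactly the kind of added context that moves the conversation from "this hex value won" to "higher contrast won."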
Only one turn of the crank
The initial round of hypotheses may or may not yield supplementary findings to explain why a particular shade of blue won out in the experiment. If nothing noteworthy pops, the analysis still stops.
The goal isn't to keep going until we get to that why. Rather, the process of engaging with Design stakeholders is, in and of itself, valuable. Both sides learn more about how the other operates, and future designs — and experiments — will be structured with that shared knowledge embedded.
While the shades of blue experiment may not be representative of experiments in your organization, it functions well as a case study on data culture. Critically, it highlights the role outside of managing the experiment that the data leader should play to bring the full organization along while building a culture of experimentation.
So next time your experiment results cause your stakeholders to thrash, switch gears into change management mode and give this framework a try. The impact of the resulting conversations on your data culture might surprise you.
Color swatch post photo by Clay Banks on Unsplash
This experiment has spawned numerous summaries, think pieces, and discussions over the last 15 years. My goal today isn't to write yet another summary, but if you're curious to learn more, a quick search will get you to them.
The New York Times shared Marissa Mayer's initial approach to Design's thrash: choosing a blue color value between the one proposed by Design and the experiment-winning value. While this approach is a literal meeting in the middle, it neither secures the data-driven gains nor resolves the underlying Design frustration. Digging into a more useful "why" might have been a more productive way to meet in the middle.
If there were an intrinsically "most clickable" shade of blue, you'd expect that blue to win out each time, across contexts. But that hasn't been the case: On the heels of Google's experiment, Microsoft ran a similar one for Bing. They also found a set of colors that caused a revenue bump. But their optimal blue was different from Google's.