Product Experimentation for Collaborative AI Features: Cluster Randomization for LLM-Based Tools in Python

Rudrendu Paul

Every product experimentation team running causal inference on LLM-based collaborative features eventually hits the same wall: your users aren't independent.

Your team ships an AI meeting summarizer to half the enterprise accounts on your platform. The rollout's clean, half on and half off, and you wait for the control group's task completion to stay flat while the treated group's creeps up. Two weeks in, the control group's numbers are moving too. Not as much, but visibly. The feature's confirmed off for those accounts, and you've checked the rollout config twice. Something's still contaminating your control.

You know what it is before you dig into the logs. The AI meeting summaries land in shared Slack channels, the AI-drafted docs show up in shared Google Drive folders, and the AI code review suggestions appear in pull requests that both treated and control engineers read. Behavior changes for the treated users, and a slice of that behavior bleeds back into your control group through the collaboration graph.

This is the collaborator contamination trap. It shows up in every generative AI product that touches shared artifacts: AI meeting notes that teammates read, AI-drafted documents that coworkers edit, AI code suggestions that reviewers evaluate, AI-generated email threads that the whole team replies to.

User-level randomization assumes one user's treatment assignment leaves every other user's outcome alone. In a collaborative workspace, that assumption is wrong by design, and the product experiment folds the feature's real effect together with the spillover it creates inside the control group. Running a collaborative AI feature behind a user-level A/B test is a product experiment that violates the Stable Unit Treatment Value Assumption (SUTVA). The fix is cluster randomization: flip the coin at the workspace level, so entire teams are in or out together, then model the cross-workspace spillover directly.

This tutorial walks through the full pipeline (cluster assignment, a biased, naive user-level OLS, cluster-weighted least squares for honest standard errors, a two-exposure decomposition that identifies direct and spillover effects separately, and cluster-bootstrap confidence intervals) on a 50,000-user synthetic SaaS dataset in which the ground-truth causal effects are known. You'll estimate them, quantify uncertainty, and see where the approach silently breaks.

Companion code: every code block runs end-to-end in the companion notebook at github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/05_cluster_randomization. The notebook (cluster_randomization_demo.ipynb) has all outputs pre-executed, so you can read along on GitHub before running anything locally.

Table of Contents

Why user-level A/B randomization breaks under collaboration

What cluster randomization actually does

Prerequisites

Setting up the working example

Step 1: Build the cluster assignment and spillover exposure

Step 2: Naive user-level OLS (biased and overconfident)

Step 3: Cluster-weighted least squares (honest standard error)

Step 4: Two-exposure decomposition (unbiased direct and spillover)

Step 5: Cluster-bootstrap confidence intervals

When cluster randomization fails

What to do next

Why User-Level A/B Randomization Breaks Under Collaboration

The math of an A/B test is elegant because one user's treatment assignment has no bearing on another user's outcome. Flip a coin; half your users get the AI feature, and the coin flip breaks every possible confound by construction. Collaboration breaks that guarantee in three ways.

Shared artifacts travel. The AI summary lands in a channel every teammate reads, the AI-drafted doc goes into a folder every teammate edits, and the AI code review suggestion sits on a pull request every reviewer evaluates. Control users consume those artifacts, whether or not the feature is switched on for them, and the behavioral effects of reading AI-assisted content leak into their outcomes.

Shared workflows create interference. A treated user who relies on the AI summarizer writes shorter follow-up notes, assuming teammates have read the summary. A control user on the same team receives those shorter notes and spends less time reading them, which changes their session length. That means the treated user's assignment has shifted the control user's outcome, which is exactly what SUTVA forbids.

Network adoption follows collaboration. Power users on treated teams experiment with the feature first, then nudge teammates in other workspaces through cross-team channels. If your treated group produces AI-assisted content that your control group reads and copies, the control group is partially treated without ever flipping a switch.

All three mechanisms produce the same symptom: the raw user-level comparison understates the feature's direct effect because the control group is no longer a pure counterfactual. On the synthetic dataset in this tutorial, the ground-truth direct effect is +0.80 min of session time for treated users, and the ground-truth spillover effect is +0.20 min for control users who collaborate across workspaces. A naive user-level OLS recovers +0.6723, a 16 percent underestimate of the direct effect, and reports a standard error that is roughly 19 times too small because it treats 50,000 users as independent, even though the treatment was randomized only across 50 clusters. That's not a small error. It's the kind that ships a broken feature launch decision.

What Cluster Randomization Actually Does

Cluster randomization flips the assignment coin at the workspace level so entire teams land in the same arm, confining most interference to where it belongs and making the residual cross-workspace leakage something you can model directly.
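To make the workspace-level coin flip concrete, here is a minimal sketch on toy data. It is not the companion dataset or the author's code: the cluster sizes, the workspace-level shock, and the noise scale are all assumptions chosen for illustration, so the ratio it prints will not match the "roughly 19 times" figure quoted above. It assigns treatment once per workspace, then compares a standard error that treats every user as independent with one computed from a single mean per workspace.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 50 workspaces of 1,000 users each (50,000 users total).
n_clusters, users_per_cluster = 50, 1000

# Flip the coin once per workspace: exactly half the workspaces are treated.
treated_clusters = rng.permutation(n_clusters) < n_clusters // 2

# Every user inherits the assignment of their workspace.
cluster_id = np.repeat(np.arange(n_clusters), users_per_cluster)
treated = treated_clusters[cluster_id]

# Toy outcome (assumed, not the tutorial's data): baseline + direct effect
# + a shock shared by everyone in the workspace + user-level noise.
workspace_shock = rng.normal(0.0, 0.5, n_clusters)[cluster_id]
y = 5.0 + 0.80 * treated + workspace_shock + rng.normal(0.0, 1.0, cluster_id.size)

def diff_in_means_se(values, arm):
    """Difference in means and its SE, treating each row as independent."""
    a, b = values[arm], values[~arm]
    diff = a.mean() - b.mean()
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return diff, se

# Naive view: 50,000 independent users.
naive_diff, naive_se = diff_in_means_se(y, treated)

# Cluster view: collapse to one mean per workspace, so n = 50.
cluster_means = np.bincount(cluster_id, weights=y) / users_per_cluster
cluster_diff, cluster_se = diff_in_means_se(cluster_means, treated_clusters)

print(f"naive   diff={naive_diff:.3f}  se={naive_se:.4f}")
print(f"cluster diff={cluster_diff:.3f}  se={cluster_se:.4f}")
print(f"naive SE is too small by a factor of about {cluster_se / naive_se:.1f}")
```

With equal-sized workspaces, the two point estimates coincide and only the standard error changes. The user-level calculation counts 50,000 observations, while the randomization supplied only 50 independent coin flips.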

Figure 1: Schematic of the SUTVA violation that cluster randomization targets. Every user in a treated workspace (top row, red) sees the AI feature. Every user in a control workspace (bottom row) should see nothing, but collaborators (orange) read AI artifacts that travel through shared Slack, documents, and code reviews. Those spillover-exposed users are partially treated. Cluster randomization doesn't make interference disappear; it confines it to within workspace boundaries, leaving the remaining cross-workspace leakage as an identifiable component that a two-exposure model can estimate directly.

If a workspace is treated, every user inside it gets the feature. If it's a control workspace, nobody inside it does. Interference within a workspace is fine because all teammates share the same assignment, and the workspace-level mean captures the full treatment package. The design aims to control interference across workspaces.

The estimator works under a stack of assumptions, and each one has a name worth knowing because the failure modes at the end of this tutorial map directly to specific violations.

Cluster-level random assignment. Treatment is assigned at the cluster level by a genuinely random mechanism. Which workspaces land in the treated arm is independent of workspace-level potential outcomes.

Partial interference. Interference happens inside clusters but not across them (Hudgens et al.). A treated user in workspace A can affect her teammate in workspace A, but can't affect a user in workspace B. This is the assumption cluster randomization is built around.

Cluster-level SUTVA. A workspace's treatment is a single, well-defined package. There's one version of the feature, and within-cluster heterogeneity in exposure is absorbed into the cluster-level effect.

Exchangeability of clusters. Before the coin flip, the treated and control workspaces are exchangeable. Randomization achieves this by construction.

Sufficient cluster count. Cluster-robust inference relies on a central limit theorem across clusters. Practitioners often use K ≥ 30 as a working floor, though the appropriate threshold depends on cluster-size heterogeneity and the choice of test statistic. Fewer clusters demand a different inference tool, such as randomization inference or a cluster wild bootstrap.
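The small-K fallback named in the last assumption, randomization inference, can be sketched in a few lines. This is an illustrative example on made-up cluster means, not the companion notebook's method: the 12-workspace experiment, the effect size, and the noise scale are all hypothetical. It tests the sharp null of no effect for any workspace by enumerating every assignment the coin flips could have produced and asking how often the re-labelled difference in cluster means is at least as extreme as the observed one.

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical small experiment: 12 workspaces, 6 treated.
n_clusters, n_treated = 12, 6
treated_idx = rng.choice(n_clusters, size=n_treated, replace=False)
is_treated = np.zeros(n_clusters, dtype=bool)
is_treated[treated_idx] = True

# Toy cluster-level outcomes (assumed): one mean session time per workspace.
cluster_means = 5.0 + 0.80 * is_treated + rng.normal(0.0, 0.3, n_clusters)

def diff_in_means(means, treated_mask):
    """Treated-minus-control difference in workspace-level means."""
    return means[treated_mask].mean() - means[~treated_mask].mean()

observed = diff_in_means(cluster_means, is_treated)

# Enumerate all C(12, 6) = 924 possible assignments under the sharp null,
# holding the outcomes fixed and re-labelling which workspaces were treated.
null_diffs = []
for combo in combinations(range(n_clusters), n_treated):
    mask = np.zeros(n_clusters, dtype=bool)
    mask[list(combo)] = True
    null_diffs.append(diff_in_means(cluster_means, mask))
null_diffs = np.array(null_diffs)

# Two-sided exact p-value; the small tolerance guards against float ties.
p_value = np.mean(np.abs(null_diffs) >= abs(observed) - 1e-12)

print(f"observed difference: {observed:.3f}")
print(f"assignments enumerated: {null_diffs.size}")
print(f"exact two-sided p-value: {p_value:.4f}")
```

Because the p-value comes from the assignment mechanism itself, it does not lean on a central limit theorem across clusters. With more clusters, full enumeration becomes infeasible and a random sample of re-assignments is the usual substitute.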

Partial interference is the load-bearing assumption here. The whole point of cluster randomization is that cross-cluster spillover is smaller and slower than within-cluster spillover, so treating an entire team contains most of the interference where it's supposed to be (Ugander et al.). When cross-cluster spillover is meaningful, a two-exposure model directly identifies and estimates that leakage.

Prerequisites

You'll need Python 3.11 or newer, comfort with pandas and linear regression, and rough familiarity with ordinary least squares. Install the packages for this tutorial:

    pip install numpy pandas statsmodels scipy matplotlib

Here's what's happening: five packages cover the full pipeline. Pandas loads the data and builds the cluster assignment. NumPy handles array arithmetic and bootstrap draws. Statsmodels fits every regression: naive OLS, cluster-weighted least squares, and the two-exposure model with cluster-robust standard errors. Scipy supports the kernel density diagnostic plot, and matplotlib renders it.

Clone the companion repo to get the synthetic dataset:

    git clone https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm.git
    cd product-experimentation-causal-inference-genai-llm
    python data/generate_data.py --seed 42 --n-users 50000 --out data/synthetic_llm_logs.csv

Here's what's happening: the clone pulls the companion repo, and generate_data.py produces the shared 50,000-user dataset used across the series. Seed 42 keeps the data reproducible. The 50,000-user
