Research

For working papers and current projects, see Projects. Full list on Google Scholar and arXiv.

Digital Market Power and Architectures

The Attribution Crisis in LLM Search Results: Estimating Ecosystem Exploitation

with Jangho Yang, Tim O'Reilly, Sruly Rosenblat, and Isobel Moure

Data for Policy

Forthcoming

Web-enabled LLMs frequently answer queries without crediting the web pages they consume, creating an "attribution gap"—the difference between relevant URLs visited and those actually cited. Analyzing ~14,000 real-world queries across OpenAI, Perplexity, and Google systems, we document three exploitation patterns: (1) No search: 34% of Gemini and 24% of GPT-4o responses skip web retrieval entirely; (2) No citation: Gemini provides no clickable sources in 92% of answers; (3) High-volume, low-credit: Perplexity visits ~10 relevant pages per query but cites only 3-4. Using a negative binomial hurdle model, we show that citation efficiency—extra citations per additional webpage consumed—varies from 0.19 to 0.45 across models, demonstrating that retrieval design, not technical limits, shapes ecosystem impact. We recommend transparent LLM search telemetry through standardized observability protocols.
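The two quantities in this abstract can be illustrated with a small sketch. The query log below is made up, and the least-squares slope is only a crude stand-in for "extra citations per additional webpage consumed"; it is not the negative binomial hurdle model used in the paper. The names `attribution_gap` and `ols_slope` are hypothetical helpers introduced for illustration.

```python
from statistics import mean


def attribution_gap(relevant_visited, cited):
    """Relevant URLs visited minus URLs cited, for a single query."""
    return relevant_visited - cited


def ols_slope(xs, ys):
    """Least-squares slope of ys on xs: extra y per extra unit of x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical query log: (relevant pages visited, pages cited).
# The last row stands for a response that skipped web retrieval entirely.
queries = [(10, 4), (8, 3), (12, 4), (6, 3), (0, 0)]

gaps = [attribution_gap(v, c) for v, c in queries]
mean_gap = mean(gaps)

visited = [v for v, _ in queries]
cited = [c for _, c in queries]
slope = ols_slope(visited, cited)  # rough "citations per extra page" figure

print("per-query gaps:", gaps)
print("mean gap:", mean_gap)
print("slope:", round(slope, 3))
```

A hurdle model would additionally separate the decision to cite at all from the number of citations given that at least one is made, which a single slope cannot capture.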

Real-World Gaps in AI Governance Research: AI Safety and Reliability in Everyday Deployments

with Isobel Moure, Tim O'Reilly, and Sruly Rosenblat

IEEE Access, June 2026

Analyzing 1,178 safety and reliability papers drawn from a corpus of 9,439 generative AI papers (2020–2025) across leading AI companies and universities, we find that corporate AI research increasingly concentrates on pre-deployment areas (model alignment and testing), while attention to deployment-stage issues like model bias has waned. Only 4% of corporate papers tackle high-risk deployment domains (healthcare, finance, misinformation, persuasion, hallucinations, copyright) despite these being material business risks. Google DeepMind alone has more citations for AI safety research than the top four academic institutions combined, and research findings are increasingly kept internal. Without improved observability into deployed AI systems, growing corporate concentration will deepen knowledge deficits. We recommend expanding external researcher access to deployment telemetry data and model artifacts.

Beyond Public Access in LLM Pre-Training Data

with Sruly Rosenblat and Tim O'Reilly

AI and Ethics

Forthcoming

Using the DE-COP membership inference method on 34 legally obtained copyrighted O'Reilly Media books—each containing both publicly accessible preview content and paywalled sections—we test whether OpenAI's models were trained on non-public copyrighted content. GPT-4o shows strong recognition of paywalled book content (82% AUROC) versus public samples (64% AUROC), the opposite of what we'd expect if training used only publicly available data. In contrast, the earlier GPT-3.5 Turbo shows the reverse pattern (54% non-public vs. 64% public), suggesting that the role of non-public data in OpenAI's training has increased significantly over time. Testing models with identical cutoff dates (GPT-4o vs. GPT-4o Mini) rules out temporal language shifts as an explanation. These findings highlight the urgent need for corporate transparency on pre-training data sources and formal licensing frameworks for AI content training.
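For readers unfamiliar with the metric, AUROC can be read as the probability that a randomly chosen item the model is suspected to have seen scores higher than a randomly chosen item it could not have seen, with 0.5 corresponding to chance. The sketch below computes it directly from that definition. The scores are made up, and the DE-COP quiz procedure that produces real scores is not reproduced here; `auroc` is a hypothetical helper.

```python
def auroc(pos_scores, neg_scores):
    """Share of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win, which matches the usual rank-based definition.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical per-book recognition scores (higher = more "recognized").
suspected_seen = [0.9, 0.7, 0.6, 0.4]   # e.g. published before the cutoff
known_unseen = [0.5, 0.3, 0.2, 0.6]     # e.g. published after the cutoff

score = auroc(suspected_seen, known_unseen)
print(f"AUROC = {score:.3f}")
```

Comparing this value across subsets (here, paywalled versus public passages) is what lets the abstract contrast 82% with 64%.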

Coverage: TechCrunch | Fast Company | The Register

"Rich-Get-Richer"? Platform Attention and Earnings Inequality Using Patreon Data

with Jangho Yang and Mariana Mazzucato

Industrial and Corporate Change

Forthcoming

Using monthly Patreon earnings data for 104,719 creators (2018–2024), we quantify how platform attention algorithms shape earnings concentration across creator economies. Because Patreon offers little native distribution, its earnings are a good proxy for attention captured on external platforms (Instagram, YouTube, Twitch, Twitter/X). We find a Pareto exponent of α ≈ 2, placing creator earnings closer to concentrated capital income (α ≈ 1.5) than to labor income, consistent with a compounding "rich-get-richer" dynamic. Platforms with stronger power laws (YouTube, Instagram) exhibit weaker creator "middle classes": gains are drawn disproportionately from mid-tier creators. Over time, we observe convergence toward steeper inequality across platforms, with Instagram's α falling from 2.1 (2021) to 1.84 (2024), where a lower α indicates a heavier top tail, as algorithmic recommendations rise in importance relative to user-filtered social graphs.
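As an illustration of what a Pareto exponent measures, the sketch below applies the standard maximum-likelihood estimator, α̂ = n / Σ ln(xᵢ / x_min), to synthetic data. It assumes α is the tail index of the survival function, P(X > x) ∝ x^(-α); the abstract does not say which estimator the paper uses, and the earnings here are simulated rather than Patreon data. Both function names and the choice of `x_min` are hypothetical.

```python
import math
import random


def pareto_alpha_mle(values, x_min):
    """Maximum-likelihood estimate of the Pareto tail index above x_min."""
    tail = [v for v in values if v >= x_min]
    log_sum = sum(math.log(v / x_min) for v in tail)
    return len(tail) / log_sum


def sample_pareto(alpha, x_min, n, seed=0):
    """Draw n Pareto(alpha, x_min) values by inverse-transform sampling."""
    rng = random.Random(seed)
    # 1 - u lies in (0, 1], so the power below is always well defined.
    return [x_min * (1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(n)]


# Simulated monthly earnings with a true exponent of 2 and a $100 floor.
earnings = sample_pareto(alpha=2.0, x_min=100.0, n=20_000)
alpha_hat = pareto_alpha_mle(earnings, x_min=100.0)
print(f"estimated alpha = {alpha_hat:.2f}")
```

In practice the result is sensitive to the cutoff `x_min`, since real earnings data follow a power law only in the upper tail.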

M&A as Capability Building: Patent Evidence from Big Tech

with Jangho Yang

Applied Economics Letters

Forthcoming

Using a unique dataset covering all patents ultimately owned by Big Tech, including those obtained through M&A, we describe the dynamic capabilities acquired and developed by Apple, Amazon, Alphabet, Meta, and Microsoft (2000–2022). Our analysis reveals that M&A has been crucial to Big Tech's development of integrated hardware-software ecosystems: at least 10–13% of total patent counts were acquired externally (a lower bound, since Big Tech also innovates internally based on external technologies). Acquired patents receive more forward citations than internally developed ones, suggesting acquisitions target highly productive technology. We show how Big Tech's evolving capabilities closely track their competitive potential and market entry strategies, offering an empirical framework for applying dynamic capabilities analysis to antitrust oversight of serial acquisitions.

Amazon's Algorithmic Rents: The Economics of Information on Amazon

with Tim O'Reilly and Mariana Mazzucato

UC Law Science and Technology Journal, Vol. 15, 2024: 203

Amazon's $37.7 billion advertising business marks a departure from its original 'flywheel' strategy of growth through better user experience. We show how Amazon shifted to a 'quest for profit' by compelling merchants to pay for product visibility that was previously earned through relevance and price competition. This exploits the informationally complex online environment where users rely on algorithmic curation for decision-making—when Amazon places 'Amazon's Choice' badges, sales increase 25%; 'Best Seller' badges boost page views 45%. We challenge the Chicago School assumption that 'competition is just a click away': in practice, users satisfice rather than optimize, enabling Amazon to persistently degrade results quality while maintaining clicks. We define platform dominance as the ability to disregard information relevance within its ecosystem and still profit, with implications for both antitrust and consumer protection frameworks.

Coverage: Cory Doctorow | Financial Times

Behind the Clicks: Can Amazon Allocate User Attention as It Pleases?

with Rachel Rock, Tim O'Reilly, and Mariana Mazzucato

Information Economics and Policy, Vol. 69, 101115, 2024

Can platforms persistently allocate user attention to inferior, paid search results? This empirical companion to our theoretical work tests whether Amazon can profitably degrade its search algorithm through advertising. Analyzing Amazon's product search results, we find that the first position has an 80% probability of being a paid advertisement—yet products in this 'attention-dominant' position still receive disproportionate clicks regardless of relevance. A product in the bottom 10% for relevance but the highest attention position is as likely to be clicked as a top-5 most relevant product in an average position. These findings support our theory that platforms with sufficient market power can extract 'algorithmic attention rents' by substituting paid results for organic ones, capturing value from both users (through inferior results) and merchants (through mandatory advertising fees).

Algorithmic Attention Rents: A Theory of Digital Platform Market Power

with Tim O'Reilly and Mariana Mazzucato

Data & Policy, Vol. 6, e6, 2024

We develop a theory of 'algorithmic attention rents' in digital aggregator platforms. The scarce factor of production that enables platform market power is not data but user attention. Platforms extract rents by deviating from their own best-possible ('organic') algorithmic results to favor paid placements, self-preferencing, or engagement-maximizing content. We show that 78% of Amazon's most-clicked products appear in the first two rows of search results, and 32% of shoppers simply buy the first product listed—giving platforms enormous power over commercial outcomes. This algorithmic authority over attention enables platforms to convert non-pecuniary rents (user time, inferior results) into pecuniary rents from suppliers and advertisers. Our framework shifts regulatory focus from privacy and data to the fairness of attention allocations, and calls for mandatory disclosure of the operating metrics platforms use to allocate user attention.

Regulating Big Tech: The Role of Enhanced Disclosures

with Mariana Mazzucato, Tim O'Reilly, and Josh Ryan-Collins

Oxford Review of Economic Policy, Vol. 39, Issue 1, pp. 47–69, Spring 2023

Current disclosure frameworks (10-K financial reports) were designed for industrial economies based on physical assets, not digital platforms. Big Tech's annual reports reveal little about their globally dominant 'free' services, platform user numbers, or monetization practices. We show how these disclosure gaps hobble antitrust investigations—the FTC had to rely on third-party estimates of WhatsApp and Instagram users in its Facebook filings—impede competitor entry, and distort capital allocation. We propose mandatory SEC-style disclosures covering: (i) operating metrics like monthly active users that underpin market share; (ii) 'free' products that escape profit/loss reporting; (iii) monetization processes connecting users to revenue; and (iv) product-by-product segment reporting. Such disclosures would empower investors, regulators, researchers, and competitors to understand these systemically important firms.

Coverage: Financial Times | Australian Financial Review