On healthy data flow in the AI age
by Nick Vincent @nickmvincent.bsky.social
Why the AI evaluation crisis could force a reckoning on dataset provenance, attribution, and consent.
April 29, 2026
A proposal for interoperable attestation objects that connect training data, evaluation labor, and AI-generated outputs across the AI supply chain.
April 08, 2026
Reacting to a wide-ranging set of policy ideas from OpenAI.
April 05, 2026
AI progress means the "polish" of a figure or website no longer proxies for quality. Can we try to turn this into a good thing for curation, attention allocation, and even AI progress itself?
March 31, 2026
Making an "if you like X, you might want to support Y" argument for data-focused policy
March 08, 2026
Back to the basics of data leverage.
March 02, 2026
How we can understand, and react to, the complicated impacts of AI systems on online communities and knowledge commons
February 16, 2026
On user data control, coding agents as retrievers, and the value of your coding transcripts
January 11, 2026
Sharing an early reaction to recent coding agent discourse and two relevant projects
January 04, 2026
Another recap post for the Data Leverage newsletter! (and a test of using Leaflet for blogging)
December 23, 2025
In fact, anyone who doesn't think they will be a "big winner" long term benefits from clear rules, even if it means training data costs more in the short term.
November 25, 2025
Another recap post for the Data Leverage newsletter!
October 10, 2025
New model releases keep (re)sparking discussions about training data. What can we assume is upstream in the data river, and what do we want to see happen?
September 23, 2025
This post was written by Aditya Karan, with support from Nick Vincent and Karrie Karahalios to accompany a FAccT 2025 paper. It was originally published on Jun 19, 2025 via the Crowd Dynamics Lab blog.
June 19, 2025
Reacting to a fresh wave of discussion about AI's impact on the economy and power concentration, and reiterating the potential role of collective bargaining.
June 04, 2025
Connecting evaluation and dataset documentation via the lens of "AI as ranking".
May 29, 2025
It's ranking information all the way down.
May 27, 2025
Google and others solve our attentional problem by ranking discrete bundles of information, whereas ChatGPT ranks more granular chunks. This lens can help us reason about AI policy.
May 26, 2025
Commenting on recent coverage of, and discussion about, Meta's arguments about training data value quantification.
April 20, 2025
A consortium of Public AI labs can substantially improve data pricing, which may also help to concretize debates about the ethics and legality of training practices.
April 02, 2025
Research agents and increasingly general reasoning models open the door for immense "evaluation data leverage".
March 01, 2025
Our AI design choices in 2024 could preclude "Powerful AI" in 2030.
February 11, 2025
There's still incredible tension in the current data paradigm, but sharing "data protection" technologies, like those used by OpenAI to accuse DeepSeek of model theft, can help cut a path forward.
January 30, 2025
There's deep tension in the current ask-for-forgiveness, free-for-all approach to acquiring data for model training. Will "open" models cause this tension to reach a breaking point?
January 27, 2025
The race to produce premier AI products with high price tags might change the standards around data disclosure.
December 11, 2024
The idea that data-dependent AI systems are ready and willing to crush any leverage from knowledge workers is unlikely to make the AI industry look good to the public.
November 08, 2024
Examining the Meta CEO's claim that the "individual work of most creators isn’t valuable enough for it to matter" in the context of AI training.
September 27, 2024
Interacting with many models and harnessing the power of `diff`
August 07, 2024
Focusing on feedback loops -- connecting modern AI to early cybernetics-style thinking -- could help solve looming challenges and support democratic inputs to AI.
November 12, 2023
How can we start thinking about how opt-out decisions by content-producing organizations will affect LLMs?
September 27, 2023
The New York Times is trying to remove its content from OpenAI models, surfacing tensions around copyright, economic harms, privacy, and the distribution of AI benefits.
August 24, 2023
Could Upcoming Data Legislation Enable a "Right to Data Strike"?
May 04, 2023
Once again, we’ve had an eventful few weeks in the space of data-dependent computing!
April 30, 2023
The Last Three Months in Review: What's New and What's Next
April 17, 2023
The plants in the Gardens by the Bay evoke a sense of flourishing-by-design; photo by Victor from Unsplash.
March 29, 2023
Measuring the Alignment of AI Systems Based on their Data Pipelines
March 01, 2023
Much of my work is in pursuit of “data dignity”, an idea that stems in part from scholars arguing that we should sometimes think of “data as labor”.
February 02, 2023
The public debate over AI has seriously heated up in the wake of new advances in the design and deployment of large generative AI models.
December 15, 2022
More on why you're an expert language model trainer
December 03, 2022
Background
December 01, 2022
The much-celebrated GPT-3 that can answer questions, write poems, and more wouldn’t be possible without content written by millions of people around the world. Shouldn’t they get some credit?
September 21, 2020