
AI & open data


Luis Villa notes, with some sadness, the closing of yet another door to the open web, occasioned this time by creators’ reluctance to make their work available for training AI:

[The open web] was inarguably the greatest repository of knowledge the world had ever seen. Among other reasons, this was in large part because the combination of fair use and technical accessibility had rendered it searchable. That accessibility enabled a lot of good things too—everything from language frequency analysis to the Wayback Machine, one of the great archives of human history.

But in any case it’s clear that those labels, if they ever applied, very much merit the past tense. Search is broken; paywalls are rising; and our collective ability to learn from this is declining. It’s a little much to say that this paper is like satellite photos of the Amazon burning… but it really does feel like a norm, and a resource, are being destroyed very quickly, and right before our eyes.

Perhaps that’s for the best—I really am open to the idea that this particular village needs to be destroyed to save the villagers—but nevertheless it triggers in me a sense of mourning; a window that is passing.

Please do read the whole thing. I am somewhat sympathetic to those closing their sites off from automated crawling… but only somewhat. I have a few reactions:

  1. None of this will stop the rise of AI. I think most of these creators understand that and are pursuing this path as an expressive act.
  2. There are indications that legal restrictions on data collection are having an effect on training data availability. But these should be understood as commercial plays by entities in control of large corpora, who hope to use them to extract some value from the AI wave. Reddit and the New York Times are the most famous examples. This is distinct from the normative shift among creators that Luis describes.
  3. AI disruption of creative industries will be real, though surely different than we imagine. I respect creators who are restricting access to their content out of a strong desire not to be complicit in that change, even though each individual’s instrumental importance to the change is negligible.
  4. While I respect that rationale, I join Luis in lamenting it, in large part because I think it sacrifices potential benefits, such as those he describes, while being unlikely to achieve much.
  5. As is often the case with retreats from openness, much of the impetus for this normative change seems to stem from discomfort with who is benefiting from openness. I believe this is because many advocates conceive of open data as a revolutionary project to reallocate social power rather than a commitment flowing from moral and practical judgments about how knowledge can and should be restricted.
  6. I empathize, having once held that perspective. But I’ve come to think it is ultimately a juvenile ideology, or at least one that’s been proven to be unproductive. For one thing, people underestimate how quickly, if they did create a new set of winners and losers, they would come to resent the winners. And the perspective is also badly entangled with a press-led narrative about tech companies that frequently edges into hysteria.
  7. But turnabout is fair play: some of the FAANG (AMAGO?) entities on the other side of this are responsible for strangling the open web while building ever-taller legal and technical palisades around the UGC they control.
  8. It’s a little sad to lose fellow open data travelers. On the one hand, it might be for the best: if I’m right and their revolutionary project will never bear fruit, they probably should hop off the bus. On the other hand, I suspect the majority of people on board that bus are there because of an inchoate revolutionary rationale. Those of us riding for abstruse reasons may get lonely.
  9. To the extent that a mass movement to limit the availability of training data has any effect, it will be to entrench the advantage of early movers who have already built their models (though these include open models like Llama).
  10. If the movement succeeds, online culture will still be used for training by those who don’t respect robots.txt. That means rogue actors: scofflaws without commercial ambitions, gray-market open source projects, hostile foreign powers. This is superficially aligned with the revolutionary outcomes discussed above. But the practical reality will be chaotic and unproductive, with noncommercial aesthetics as the main thing that recommends it over the counterfactual.
  11. All of this may soon be moot, as some analysts estimate that frontier models’ training needs are already on the cusp of expanding beyond the corpus of written language. Video data transcription (custody of which is highly concentrated due to hosting cost) and synthetic data are expected to be the next frontiers.
  12. Declining enthusiasm for openness seems to me to be aligned with a general turn toward conservatism and neuroticism among rising generations.
  13. I remain hopeful that the pendulum will swing back during my lifetime. Will the web bloom again? I suppose I wouldn’t bet much on that. But something will.

About the author

Tom Lee