Since the beginning of 2024, the demand for the content created by the Wikimedia volunteer community – especially for the 144 million images, videos, and other files on Wikimedia Commons – has grow…
Doesn’t make any sense. Why would you crawl Wikipedia when you can just download a dump as a torrent?
AI bros aren’t that smart.
Apparently the dump doesn’t include media, though there’s ongoing discussion within Wikimedia about changing that. It also seems likely to me that AI scrapers don’t care about externalizing costs onto others if it might mean a competitive advantage (e.g. having the most recent data, or not having to spend time and resources developing dedicated ingestion systems for specific sites).
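For context, the dumps being discussed are the text-only snapshots published at dumps.wikimedia.org (torrents for the same files are listed on Meta-Wiki). A minimal sketch of fetching one, assuming the standard `latest` URL layout; the wiki name is a parameter you’d swap out:

```shell
# Sketch: fetch the latest English Wikipedia article dump (text only --
# media files on Commons are NOT included, as noted above).
WIKI="enwiki"   # which wiki's dump to fetch, e.g. enwiki, dewiki, frwiki
DUMP_URL="https://dumps.wikimedia.org/${WIKI}/latest/${WIKI}-latest-pages-articles.xml.bz2"
echo "Would fetch: ${DUMP_URL}"
# Uncomment to actually download (roughly 20 GB compressed for enwiki):
# curl -L -O "${DUMP_URL}"
```

This is exactly the point upthread: a single bulk download replaces millions of page requests, but it only covers article text, not the media that scrapers apparently also want.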
I want to stress this: it’s not that “tech bros” are just stupid (even though a lot of them are revoltingly unappreciative of the giants whose shoulders they stand on); it’s that they don’t care.
To have the most recent data?
Wanting the most recent data within a reasonable time frame is one thing. AI companies are more like “I must have every single article within 5 minutes of it being updated, or I’ll throw my pacifier out of the pram.” No regard for the considerations of the source sites.
There’s a chance this isn’t being done by someone who only wants Wikipedia’s data. As the number of websites you scrape increases, the appeal of site-specific easy tools loses out to building the most general tool that can handle most webpages.