• PhilipTheBucket@piefed.social · 1 day ago

    I feel like at some point it needs to be an active response. Phase 1 is a teergrube type of slowness to muck up the crawlers, with warnings in the headers and response body, and then phase 2 is a DDoS in response, or maybe just a drone strike to cut out the middleman. Once you’re actively evading Anubis, fuckin’ game on.

    • TurboWafflz@lemmy.world · 1 day ago

      I think the best thing to do is not to block them when they’re detected but to poison them instead. Feed them tons of text generated by tiny old language models; it’s harder to detect, and it messes up their training and makes the models less reliable. Of course you’d want to do that on a separate server so it doesn’t slow down real users, but you probably don’t need much power, since the scrapers don’t seem to care about speed.
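
      A minimal sketch of that hand-off, assuming a naive user-agent check and a pool of decoy pages generated offline (both hypothetical stand-ins, not anyone’s actual setup):

      ```python
      # Hypothetical sketch: poison detected scrapers instead of blocking them.
      import random
      from pathlib import Path

      from flask import Flask, request

      app = Flask(__name__)
      DECOYS = list(Path("decoys").glob("*.txt"))  # junk text generated offline
      SCRAPER_MARKERS = ("GPTBot", "CCBot", "Bytespider")  # toy detection only

      @app.route("/", defaults={"page": "index"})
      @app.route("/<path:page>")
      def serve(page):
          ua = request.headers.get("User-Agent", "")
          if any(marker in ua for marker in SCRAPER_MARKERS):
              # Don't reject: hand back a random decoy page instead.
              return random.choice(DECOYS).read_text(), 200
          return f"real content for {page}"  # stand-in for the actual site
      ```

      Running it on a separate box, as suggested, keeps the decoy traffic off the real server.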

      • xthexder · 1 day ago

        I love catching bots in tarpits; it’s actually quite fun

      • phx@lemmy.ca · 22 hours ago

        Yeah, that was my thought. Don’t reject them; that’s obvious, and they’ll work around it. Feed them shit data, but not too obviously shit, and they’ll not only swallow it but let it build up to levels that compromise them.

        I’ve suggested the same for plain old non-AI data stealing. Make the data useless to them, make separating good from bad cost more than it’s worth, and they’ll eventually either sod off or die.

        A low-power AI actually seems like a good way to generate a ton of believable but bad data that can be used to fight the bad AIs. It doesn’t need to be done in real time, either, since datasets can be generated in advance.
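
        As a toy version of that offline step, even a word-level Markov chain (about the tiniest “language model” there is) yields text that parses locally but is statistically junk. A sketch, with seed.txt and decoys/ as made-up names:

        ```python
        # Hypothetical sketch: pre-generate decoy pages in advance with a
        # word-level Markov chain standing in for a tiny old language model.
        import random
        from collections import defaultdict
        from pathlib import Path

        def build_chain(corpus: str) -> dict:
            """Map each word to the words observed following it."""
            chain = defaultdict(list)
            words = corpus.split()
            for a, b in zip(words, words[1:]):
                chain[a].append(b)
            return chain

        def generate(chain: dict, length: int = 300) -> str:
            """Walk the chain to emit plausible-looking nonsense."""
            word = random.choice(list(chain))
            out = [word]
            for _ in range(length - 1):
                followers = chain[word]
                word = random.choice(followers) if followers else random.choice(list(chain))
                out.append(word)
            return " ".join(out)

        chain = build_chain(Path("seed.txt").read_text())  # real-looking seed text
        Path("decoys").mkdir(exist_ok=True)
        for i in range(1000):  # batch-generate the decoy pool ahead of time
            Path(f"decoys/decoy_{i}.txt").write_text(generate(chain))
        ```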

        • SorteKanin@feddit.dk · 9 hours ago

          A low-power AI actually seems like a good way to generate a ton of believable but bad data that can be used to fight the bad AIs.

          Even “high power” AIs would produce bad data. It’s well established by now that feeding AI-generated data back into an AI model degrades its quality, and repeating the loop makes it worse and worse. So yeah, this is definitely viable.

      • sudo@programming.dev · 23 hours ago (edited)

        The problem is primarily the resource drain on the server, and tarpitting tactics usually increase that burden by keeping connections open.
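
        How heavy that is depends on how the connections are held, though; with an event loop, each parked crawler costs roughly a sleeping coroutine and a socket rather than a thread. A minimal sketch (port and drip rate are arbitrary choices):

        ```python
        # Hypothetical sketch of a low-overhead tarpit: asyncio parks each
        # crawler as a sleeping coroutine, dripping bytes so it never finishes.
        import asyncio

        async def tarpit(reader, writer):
            writer.write(b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n")
            try:
                while True:
                    writer.write(b"<p>loading...</p>\n")  # drip a few bytes
                    await writer.drain()
                    await asyncio.sleep(10)  # ...then make the crawler wait
            except ConnectionError:
                pass  # crawler gave up; let the coroutine end
            finally:
                writer.close()

        async def main():
            server = await asyncio.start_server(tarpit, "0.0.0.0", 8080)
            async with server:
                await server.serve_forever()

        asyncio.run(main())
        ```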

        • SorteKanin@feddit.dk · 9 hours ago

          The idea is that they’d eventually stop scraping you because the data is bad or huge. But it’s a long-term thing; it doesn’t help in the moment.

          • Monument@lemmy.sdf.org · 4 hours ago

            The promise of money, even with diminishing returns, is too great. There’s a new scraper spending big on resources every day while websites are under assault.

            In the paraphrased words of the finance industry: AI can stay stupid longer than most websites can stay solvent.

    • traches@sh.itjust.works · 1 day ago

      These crawlers come from random people’s devices via shady apps. Each request comes from a different IP address.

      • AmbitiousProcess (they/them)@piefed.social · 1 day ago

        Most of these AI crawlers are run by major corporations operating out of datacenters with known IP ranges, which is why sites block by IP range. That’s why Codeberg’s response mentions that the crawling stopped once they fixed the configuration issue that had applied those IP range blocks only on non-Anubis routes.

        For example, OpenAI publishes a list of IP ranges that their crawlers can come from, and also displays user agents for each bot.

        Perplexity also publishes IP ranges, but Cloudflare later caught them bypassing no-crawl directives with undeclared crawlers. Those crawlers did use different IPs, but not ones from “shady apps”: they simply rotated ASNs and requested new IPs.

        The reason they do it that way is that it’s still legal. Rotating ASNs, and IPs within an ASN, is not a crime. Maliciously using apps installed on people’s devices to route network traffic they’re unaware of, however, is. It also carries much higher latency and could even open them up to man-in-the-middle attacks, which they clearly don’t want.
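
        The range check itself is cheap once those published lists are loaded. A sketch, using RFC 5737 documentation addresses as placeholders for the real vendor lists:

        ```python
        # Hypothetical sketch of blocking by published crawler IP ranges.
        # A real deployment would fetch and periodically refresh the lists.
        import ipaddress

        CRAWLER_NETS = [
            ipaddress.ip_network(cidr)
            for cidr in ("192.0.2.0/24", "198.51.100.0/24")  # placeholders
        ]

        def is_known_crawler(ip: str) -> bool:
            addr = ipaddress.ip_address(ip)
            return any(addr in net for net in CRAWLER_NETS)

        print(is_known_crawler("192.0.2.17"))   # True: inside a listed range
        print(is_known_crawler("203.0.113.5"))  # False: not in any list
        ```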

        • PhilipTheBucket@piefed.social · 1 day ago

          Honestly, man, I get what you’re saying, but also at some point all that stuff just becomes someone else’s problem.

          This is what people forget about the social contract: it goes both ways; it was an agreement for the benefit of all. The old way was that if you had a problem with someone, you showed up at their house with a bat, or with some friends. That wasn’t really the way, so we arrived at this deal where no one had to do that. But then people start to fuck over the other people involved in the system, thinking the “no one will show up at my place with a bat, whatever I do” arrangement is a law of nature. It’s not.

        • sudo@programming.dev · 23 hours ago

          Here’s one example of a proxy provider offering to pay developers to inject its proxies into their apps (“100% ethical proxies”, because the users technically agreed to a ToS). Another is BrightData, which proxies traffic through users of its free HolaVPN.

          IoT devices and smart TVs are also obvious suspects.

    • NuXCOM_90Percent@lemmy.zip · 1 day ago

      Yes. A nonprofit organization in Germany is going to be launching drone strikes globally. That is totally a better world.

      It’s also important to understand that a significant chunk of these botnets are just normal people with viruses/compromised machines. And the fastest way to launch a DDoS attack is to… rent the same botnet from the same blackhat org and point it at itself. And while that would be funny, I’d rather the orgs I donate to not give that money to blackhat orgs. But that is just me.