On fighting AI bots
Manu Moreale writes about fighting AI bots:
I guess there are only two options left:
- Accept the fact that some dickheads will do whatever they want because that’s just the world we live in
- Make everything private and only allow actual human beings access to our content
And Molly White responds:
One advantage to working on freely-licensed projects for over a decade is that I was forced to grapple with this decision far before mass scraping for AI training.
In my personal view, option 1 is almost strictly better. Option 2 is never as simple as “only allow actual human beings access” because determining who’s a human is hard.
Molly’s perspective really resonates with me. I like the comparison to open source software, where a freely licensed project could always be used by companies in for-profit products. However, what’s missing from the open web are standards around licensing with regard to AI models. Just because something is free to read doesn’t mean it’s free to use in any way, including ingestion into an LLM.
I’m wondering, what if there was something like the GPL, but for text and images used as training data? This might work something like: it’s ok to use my content to train your model, as long as the model you produce is freely shared back with the public. It’s not ok to train on my content if your model remains private.
Of course, there are content producers who would not be ok with their data included in any sort of AI training, and that’s totally fine. It’s their work, they can decide how to license it. What’s missing is a standard way for them to declare this (and for LLM builders to respect that).
The other aspect of Molly’s post that resonated with me is trying to fight bots while letting all the humans view your content. In my last job, I worked for a few years on our “traffic” team, which was responsible for, among other things, some technology to block bots and other crawlers. We ran these systems against things we found abusive (like DDoS attacks or bots that put too much load on our servers), rather than trying to limit access to just humans. But the fundamental problem is shared: trying to separate legit traffic from stuff that should be blocked. It’s really not easy to do this, especially if you consider some non-human traffic important. (Like Google’s crawler!) So, you’ll most likely want to err on the side of letting things through if you aren’t absolutely sure it’s something you want to block. Which means that any AI crawler that isn’t playing nicely (respecting robots.txt, providing a clear user-agent, etc.) will probably slip through your defenses anyway.
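To make “playing nicely” concrete, here is a minimal sketch of the check a well-behaved crawler would run before fetching a page, using Python’s standard `urllib.robotparser`. The bot name `ExampleAIBot`, the robots.txt contents, and the `may_fetch` helper are all made up for illustration; they don’t describe any real crawler or the systems I worked on.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: "ExampleAIBot" is an invented crawler name.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/some-post"
    print(may_fetch("ExampleAIBot", url))  # the bot we asked to stay away
    print(may_fetch("SomeBrowser", url))   # everyone else
```

The catch is that this check runs on the crawler’s side. Nothing forces a bot to run it, or to send an honest user-agent in the first place, which is why robots.txt only stops the crawlers that were already willing to be stopped.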