Gongol.com Archives: June 2025

Brian Gongol


June 7, 2025

Computers and the Internet: No cheating off others' work

The way artificial intelligence work has been treated like a gold rush shouldn't go unexamined. Reporting suggests that major technology companies could spend more than $300 billion on hardware alone, just in this calendar year. Everyone involved seems to be haunted by the possibility that someone else will achieve some kind of insurmountable dominance first. ■ The high-stakes approach has also led to widespread disregard for the nature of the data being collected to "train" the artificial intelligence models. It has gone so far that in one (pre-publication) report, the US Copyright Office notes, "making commercial use of vast troves of copyrighted works to produce expressive content that competes with them in existing markets, especially where this is accomplished through illegal access, goes beyond established fair use boundaries." ■ Into this melee enters an effort called "Common Pile", which seeks to train an artificial intelligence model exclusively on text that is either in the public domain or licensed for open use. Philosophically, the team behind it makes the case that "One of the core tenets of the open source movement is that people should have the right to understand how the technologies they use -- and are subject to -- function and why. Training data disclosure is a key component of this." ■ On an encouraging note, the Common Pile team reports that their model "performs comparably to leading models trained in the same regime on unlicensed data". If work in this hot field can be done without violating long-established principles of intellectual property protection, then Common Pile may provide a highly valuable proof of concept that good results are achievable within the bounds of the rules.

