Large language models, like Google's Bard AI, feed on vast amounts of data, much of which seems to have been collected without the explicit knowledge or consent of the original creators. Now, there's a method for web content owners to opt out of being part of this vast data trove.
Simply put, if you don't want your content to contribute to Google's Bard AI or its successors, you can block the "Google-Extended" user agent in your website's robots.txt file. This file tells automated web crawlers which parts of a site they may and may not access.
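In practice, the opt-out is a standard robots.txt rule naming Google-Extended as the user agent. A minimal example, disallowing it across the whole site:

```
# Block Google's AI-training crawler token from the entire site
User-agent: Google-Extended
Disallow: /
```

Google-Extended is a control token rather than a separate crawler: Googlebot continues to crawl and index the site for Search as normal, so adding this rule should not affect how the site appears in search results.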
Google insists it takes an ethical and inclusive approach to its AI development, but the reality is that scraping data for AI training is vastly different from indexing the web with crawlers. Indexing is inherently beneficial to the website owners, as it allows them to be discovered by search engines.
Scraping that data for AI training means that Google is profiting off the back of that website owner's work, while the owner would have to pay for an AI tool to see any 'benefit' for themselves.
Danielle Romain, Google's vice-president of Trust, mentions in a blog post that many web publishers expressed a desire for more autonomy regarding how their content is utilised in the context of advanced AI. Yet, she doesn't directly mention that this data serves as a foundation to "train" these machine learning models.
Instead, the message leans into a more collaborative tone, suggesting web owners should want to "help enhance Bard and Vertex AI generative APIs", essentially making them more efficient over time.
At its core, the request frames the situation as a benevolent collaboration, emphasising the significance of consent. While seeking proactive consent is the right approach, the reality is that models like Bard have already ingested enormous volumes of data without explicit permission. This after-the-fact request gives the impression that Google might be trying to retroactively make its practices appear more ethically sound.
It's hard to ignore the fact that Google took advantage of the wide-reaching access it had to web data, accumulated the necessary resources, and is now seeking permission as a means to show commitment to ethical standards. Had consent been a top priority, such measures would have been implemented much earlier.
On a related note, Medium has just announced its decision to prevent such crawlers from accessing its platform until a more detailed and transparent solution emerges. It is not alone: sites such as Reddit and X (formerly Twitter) have pushed back harder against their data being used to train somebody else's AI, changing their data use policies and raising API pricing.
Telegram stands its ground on content moderation
Major social platforms such as X (formerly Twitter), Meta, and TikTok are now grappling with the scrutiny of regulators and the public gaze regarding their management of provocative content, misinformation, and various forms of media surrounding the present conflict between Israel and Hamas.
Pavel Durov, Telegram's CEO, has provocatively defended his messaging application's policy of not removing certain sensitive materials related to the conflict. He asserts that these materials could serve as crucial conduits of information.
Durov differentiates his platform from mainstream social media, emphasising that on Telegram, users are exposed only to content they have actively chosen to follow, although he acknowledges this doesn't consider the ripple effect of content sharing beyond the app.
In a recent post on his platform, Durov, echoing the sophisticated rhetoric often used by executives of other social media giants, acknowledged that "Telegram’s moderators and AI mechanisms eradicate millions of pieces of content deemed overtly harmful from our public spaces." However, he swiftly pivoted to justify the continued presence of what he terms “war-related coverage”.
“Navigating the complexities of war-related content is rarely straightforward,” he wrote, without specifying the boundary between content that's "clearly harmful" and that which pertains to "war-related coverage”.
He argued against the simplistic solution of eradicating this kind of information, suggesting that doing so could potentially worsen an already critical situation. He referenced instances where, according to him, Hamas utilised Telegram to forewarn civilians in Ashkelon of impending missile attacks. “Is silencing these channels a strategy that saves lives, or might it inadvertently place more in jeopardy?” Durov posited in his communiqué.
These remarks were made in the wake of Telegram finding itself at the centre of information dissemination to a global audience. Yet, this role extends beyond just the confines of Telegram channels. In the immediate aftermath of the October 7th attack, graphic content, often unfiltered, was disseminated by Hamas and its affiliates via Telegram.
This swiftly propelled the app into the spotlight, becoming a primary source for mainstream media and individuals alike, who either linked back to Telegram or redistributed the content on other platforms. This pivotal role in information propagation has not been without its critics.
Sceptics might argue that Telegram is capitalising on the crisis, experiencing a surge in traffic as a consequence. Durov, commenting on the influx, highlighted that “hundreds of thousands” of new users from Israel and the Palestinian Territories had flocked to the app following the attacks, prompting the introduction of Hebrew support in its interface.
“In these grave times, it's imperative that those impacted have unfettered access to news and confidential means of communication,” Durov has stated.
James Browning is a freelance tech writer and local music journalist.