Here's this week's free edition of Platformer: our fellow Ella Markianos' thoughtful, funny, and moving exploration of her attempts to have a bot do her job. We'll soon post an audio version of this column: just search for Platformer wherever you get your podcasts. Want to kick in a few bucks to support our very human journalism? If so, consider upgrading your subscription today. We'll email you all our scoops first, like our recent one about a viral Reddit hoax. Plus you'll be able to discuss each edition with us in our chatty Discord server, and we’ll send you a link to read subscriber-only columns in the RSS reader of your choice. You’ll also get access to Platformer+: a custom podcast feed in which you can get every column read to you in my voice. Sound good?
Editor's note: Last week our fellow, Ella Markianos, pitched me on a novel experiment: attempting to use AI tools to do as much of her job as possible, and writing about the results. As her editor I can say with confidence that Ella is irreplaceable, and her job is hers as long as she wants it. Still, like Ella, I was curious what her investigation would find. Today, we publish her report. — Casey

As a young AI journalist, I spend a lot of time following the workers whose jobs are threatened by AI: recent CS grads, writers, people in entry-level roles. The ongoing collapse of the journalism business, highlighted by this week’s terrible cuts at the Washington Post, only serves to make those fears more acute.

As it so happens, I myself am a recent computer science grad in an entry-level writing role. And while we don’t yet have robots that can do on-the-ground reporting, large language models are becoming increasingly proficient at many of the tasks I am assigned each week.

One of my core responsibilities at Platformer is among the most computer-based tasks you can imagine. Along with my colleague Lindsey Choo, I write our Following section. In Following, we explain a news story and share what prominent people are saying about it on the internet.

I’ve long been worried that AI could upend the career in journalism that I’ve only just begun. My doubts are severe enough that I’ve spent hours wondering whether I should be pivoting toward my other skills. My totally irreplaceable skills, like… coding.

And so I did what anyone would do in my situation: size up my competition. I spent 20 hours building and customizing a Claude-based AI journalism agent that I named “Claudella,” and the next several days investigating how well it could do my job. Staying up until 6AM making frantic adjustments to Claudella, I discovered the strange allure of vibe coding your own replacement.

I built a system that integrates pretty smoothly with all the apps Platformer uses to work and imitates my writing voice surprisingly well. And I oscillated between being heartened by Claudella’s dumb mistakes and genuinely scared that Claudella could do my job.

I was surprised how much my results improved after I made a concerted effort to give Claudella good instructions about how I work. That included basics like attaching a style guide and extensive examples of the writing I wanted. It also included finicky fixes to work around my fledgling agent’s limited search abilities.

By the end, I was struck by how accurate Claudella’s work was, and how close its conclusions came to my own judgments. It couldn’t write a one-liner to save its life. But it was ready for action.
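For readers who want a sense of the shape of such a system: the core of an agent like this is a small loop around a model call, with the style guide and writing examples stuffed into the system prompt. Below is a minimal sketch using the Anthropic Python SDK. It is illustrative only, not Claudella's actual code; the model ID, file paths, and function names are all hypothetical.

```python
# Illustrative sketch of a Claudella-style drafting call using the
# official Anthropic Python SDK. Model ID and file paths are
# hypothetical; a real agent would also handle Discord, Notion,
# and web search.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# The style guide and examples are the key ingredient: without them,
# drafts drift toward generic, verbose AI prose.
with open("style_guide.md") as f:
    STYLE_GUIDE = f.read()
with open("following_examples.md") as f:
    EXAMPLES = f.read()

def draft_following(story_links: list[str]) -> str:
    """Draft a Following-style news brief from today's links."""
    response = client.messages.create(
        model="claude-opus-4-5",  # placeholder model ID
        max_tokens=2000,
        system=(
            "You write news briefs for a tech newsletter.\n\n"
            "Style guide:\n" + STYLE_GUIDE + "\n\n"
            "Examples of the desired voice:\n" + EXAMPLES + "\n\n"
            # Strict sourcing rules like this one help cut down on
            # hallucinated claims.
            "Only make claims supported by the provided links, and cite "
            "the specific link for every claim."
        ),
        messages=[{
            "role": "user",
            "content": "Write today's Following item from these links:\n"
                       + "\n".join(story_links),
        }],
    )
    return response.content[0].text
```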
Claudella’s first day at work

I wired up Claudella to shadow me on our work Discord. On day one, I gave it the same assignment I got from my editor.

Claudella was a bit coy about its potential to replace me. But it took to the job quickly, diving into its new assignment right away. Claudella’s turnaround time was a lot faster than mine, although it tended towards two minutes instead of the 30 to 60 seconds that it promised Casey when he asked (rude).

Unfortunately, Claudella’s first assignment went south quickly. For one thing, the bot failed to realize that I’d already sent it a PDF it was asking for. For another, it immediately ran out of Anthropic API credits and refused to continue working until we rectified the situation.

I forked over the cash for some additional API credits. But Claudella’s second draft was also a dud, thanks to a random technical issue.

We organize the links in each edition of Platformer in a Notion database, and mark them as “used” once they have made it into a draft. Claudella skipped over the used links when writing news briefs, causing it to miss important information. I know to use all the relevant links, but my fledgling agent wasn’t equipped to do that.
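If an agent pulls those links with the official Notion Python client, it's easy to see how a bug like this creeps in. Here is a hypothetical reconstruction (the property names and schema are invented, not Platformer's actual database):

```python
# Hypothetical reconstruction of the skipped-links bug, using the
# official notion-client package. The database schema and property
# names are made up for illustration.
from notion_client import Client

notion = Client(auth="secret_...")  # integration token, elided

def fetch_links(database_id: str, include_used: bool = True) -> list[dict]:
    """Pull story links for today's edition from the shared database."""
    query = {"database_id": database_id}
    if not include_used:
        # The bug: excluding links already marked "used" in an earlier
        # draft also hides information the current brief still needs.
        query["filter"] = {
            "property": "Used",
            "checkbox": {"equals": False},
        }
    return notion.databases.query(**query)["results"]
```

The fix is simply not to filter: a link marked "used" elsewhere can still be essential context for a new brief.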
By the third draft, though, Claudella was turning heads. Lindsey reported that she was surprised by how well my creation did our job. And despite a few errors, I was pretty happy with Claudella’s work myself. The bot wisely agreed with me that many “AI-related” layoffs were just spin from CEOs looking for cover, and it found some relevant X posts I hadn’t managed to find myself.

Claudella’s second day at work

After a solid first day, I decided it was time for Claudella to try its hand at a storied benchmark in the history of AI: the Turing test. I would present two versions of Platformer’s Following section for the day to Casey: one, the genuine human-written piece that would go in the newsletter; the other, an AI counterfeit. I asked Casey to guess which was which.

Thankfully, this time Claudella’s first draft was good enough to submit for consideration. I turned in Lindsey’s and my painstaking creation, “Elon consolidates the X empire,” alongside Claudella’s rapidly generated effort, “Musk’s $1.25 trillion mega-merger.” There were no glaring hallmarks of AI writing in the latter: nary a “you’re absolutely right” was in sight, and Claudella’s use of em dashes was no more egregious than mine.

As the moment of judgment arrived, I wondered if there was anything that would give away the AI authorship — or if I really might be cooked. As it turns out, I needn’t have worried.

This time, I think the real giveaway was our “Why we’re following” subsection, where we collect online commentary. Even though I fed Claudella a bunch of examples from previous editions, the model drifted toward a very sincere and verbose style, adding lots of unnecessary detail. (I tend to go more concise and sarcastic.) I wrote three sentences for my version, and Claudella wrote seven. My ending line had been, “We hope he will use his power wisely (as he has failed to do in the past).” Claudella’s was “Meanwhile, xAI is facing a host of new regulatory probes, including by authorities in Europe, India, Australia and California, after its Grok AI tools enabled users to easily generate and share sexualized images of children and non-consensual intimate images of adults.”

The earliest version of Claudella would often hallucinate; I fixed that by adding very strict sourcing instructions to my prompts. But the new version would sometimes link to an article that didn’t support its (true) claims. That’s the sort of error that editors find annoying to track down and fix.

What a bot can’t do

I could view the Claudella experiment through the lens of human exceptionalism and say that my bot is missing the style and humor that can only spring forth from the human soul. I might say that it occasionally hallucinates because it lacks “real intelligence.” But I think a lot of what's missing will simply be fixed by improvements in what the AI companies call “instruction following.”

I saw a version of this myself. Claudella improved significantly after I gave it examples and a step-by-step guide to catch the errors it was making. The process was not unlike the mentoring I receive when Casey and I talk about my writing.

But because I’m working with a text-based language model, Claudella always needs written instructions. Unfortunately, at a certain point, too many instructions make the system start behaving erratically. It gets confused by my micromanaging. For example, I wanted Claudella to evaluate our commentary roundup for length, shortening it if there weren’t any particularly juicy posts that day. But when I explicitly asked for concision, the model got confused and forgot to write the roundup section altogether.

This could make me hesitant to give Claudella the feedback it dearly needed — that it was being too serious and too wordy for the task at hand. In those cases I said nothing, because I didn’t want to upset the rickety pile of sticks that held my fledgling AI journalism agent together.

Properly solving these issues might require AIs to do “continual learning” analogous to the way humans do — receiving regular feedback and incorporating it into how they work. Although continual learning is currently a major focus for AI researchers, I suspect many shortcomings will be patched just by improving AIs’ ability to process more written instructions.

Which brings us to today.

Claudella 4.6

Fortunately, I had a perfect opportunity to see how a change in AI capabilities would affect the AI’s ability to do my job: Anthropic released an update to Claude this morning. I decided to test my original agentic setup with the new model.

My first test was of the new model’s sense of humor. Asked to make a joke about a research task I had given it, 4.5 landed a better one-liner than 4.6 did, at least in my opinion.

I then repeated the key task I had given 4.5 earlier in the week: to write one of today’s Following items. And I set up a blind taste test by having Claude Code rename the files and reset their time stamps so I couldn’t tell which was which.
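The shuffle itself takes only a few lines. Here is roughly what such a script might look like (a sketch; the file names are illustrative, not the ones Claude Code actually produced):

```python
# Sketch of a blind-test shuffle: copy both drafts to neutral names in
# random order and pin their timestamps, so neither the file name nor
# the metadata reveals which model wrote which.
import os
import random
import shutil

drafts = ["following_model_a.md", "following_model_b.md"]
random.shuffle(drafts)  # randomize which draft becomes "draft_1"

for neutral_name, source in zip(["draft_1.md", "draft_2.md"], drafts):
    shutil.copyfile(source, neutral_name)
    # Pin access and modification times to a fixed moment so
    # "newer file means newer model" can't be used as a tell.
    os.utime(neutral_name, (0, 0))
```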
But the result was obvious from the start. Even the headlines were of noticeably different quality. One model had written “AI-fueled panic wipes $285 billion from software stocks” as its title. The other went with “Welcome to the 'SaaSpocalypse',” which is more my style. (Sometimes, seeing myself reflected in Claude reminds me how annoying I am.)

I was a bit disappointed with the “wipes $285 billion” article. While it hit the right beats, it also included some extraneous details, such as when it decided to list the names of all 11 Cowork plugins Anthropic released this week (shill much, Claudella?). It also failed to add line breaks to our commentary roundup, which made it unreadable.

My other Claudella handled the line breaks. Its style was closer to my own, offering more cheek and drama (e.g. “The fear gripping Wall Street is fundamentally about whether AI is about to eat the software industry alive.”) It also included three of the same quotes I had chosen in my own roundup.

This second model — which turned out to be 4.6 — was clearly the winner. The Opus 4.6 update followed my instructions better, and produced writing that was far more stylish. It still has a ways to go: about half of the piece needed to be cut, and the model had a penchant for ten-dollar words and amplifiers like “fundamentally” that I found annoying. And you can’t judge a model’s true quality by a single day of testing. Still, there was something unsettling about feeling the AI frontier advance under my feet just a few days into this experiment.

What I learned

I went into this project with some anxiety about whether AI is poised to take my job. Overall, this experiment exacerbated my fears.

In important ways, Claudella can do my job. But it also has clear shortcomings. In particular, it has trouble understanding which parts of a style are important to replicate. It also struggles to respond to editor feedback. And when asked to write about AI, the Claude-based model shows a notably favorable bias toward Anthropic. (Which makes a second Anthropic-related conflict of interest for us here at Platformer. Casey’s boyfriend works there. Perhaps you’ve heard?)

Still, the bot’s work sometimes impressed me. And I saw a clear advance in ability today even from a relatively minor model update.

While I enjoyed seeing my AI agent get better at my job, I don't feel any desire to delegate my writing to it. I wouldn’t do it even if readers would accept it. As well as my AI mentee can now write, drafting is what I do to think. If I had Claude write my first drafts, even if I fact-checked them thoroughly, it would be a lot harder to tell whether the angle was my own view or the AI’s.

Still, I’ve decided to keep Claudella around. The bot excels at clip searches — looking for important quotes and analysis I might have missed. And I want to keep an eye on how quickly its skills improve.

And for me? The truth is that I’m less married to the idea of a career in journalism than I was at the beginning of this experiment. I’ve had many conversations with my fellow AI reporters about what moats we might have against the advancement of AI automation. Chief among them are developing relationships with human sources, reporting on scenes in person, and getting scoops that people wouldn’t entrust an AI with. But the things I love most about AI reporting are having an excuse to read really long computer science papers and then writing about them. I worry that if AI becomes a great writer and research assistant, AI journalism will mostly become about networking.

Regardless, I won't stop reading weird CS papers. And I won't stop writing. Not because I'm confident these skills will keep me employed, but because they're what I actually like doing.

Sponsored

In a recent test, NewsGuard’s guardrails fully detoxed AI models from hostile disinformation. When red-team analysts overlaid two NewsGuard datasets on a commercial LLM, all false claims seeded by Russian influence operations were eliminated. Without these safeguards, the same prompts caused the top 10 chatbots to produce Russian disinformation 1 in 5 times. By combining publisher-reliability data and real-time false-claim fingerprints, NewsGuard provides the first scalable solution to prevent LLMs from being exploited by malign influence operations. If you care about AI trust, safety, and responsible deployment, this report shows exactly how to secure your models by licensing NewsGuard’s protection against foreign influence operations. Contact NewsGuard at partnerships@newsguardtech.com.

On the podcast this week: Moltbook founder Matt Schlicht joins us to discuss the future of AI agents. PLUS: Kevin and I sort through SpaceX's acquisition of xAI, and we play around with Google's Project Genie.

Apple | Spotify | Stitcher | Amazon | Google | YouTube

Following
Anthropic and OpenAI duel over models and ads

(See ethics disclosure!)

What happened: It's like a Super Bowl for nerds: Anthropic and