The AI Apologized. The Tokens Kept Running.

What this moment means for HR.

· AI Models, AI in HR, Agentic Teams

Jensen Huang called it. I went and proved it.

Jensen Huang, Nvidia’s CEO and a central figure in the AI revolution, said executives using AI as a justification for layoffs are being “too lazy” and “blaming it to sound smart.” In a recent CNA interview he asked plainly: “AI has just arrived. How is it possible they’re already losing jobs?” He’s been saying versions of this across multiple appearances: cutting people to fund AI doesn’t make companies leaner. It makes them less imaginative. At GTC he put it simply: “For companies with imagination, you will do more with more.”

I wanted to find out what doing more with more requires. So I built something.

The Project

I enrolled in AI Build Lab’s foundations course, a five-week program that takes you from basic prompting to building a full agentic fleet. I have taken several AI courses over the years, and this has been one of the most rigorous learning experiences I’ve encountered on the topic. The instructional design is serious, the frameworks are solid, and they had us building something real and complex.

The assignment: a customer service team of six agents, each with a distinct role. While many in the course kept close to AI Build Lab’s use case, I went in a slightly different direction. I just didn’t realize how different until I was knee-deep in the build.

I chose Bluemercury, a luxury skincare retailer. I kept the use case customer service-oriented but gave it a product advisory focus: ingredient interactions; contraindications for pregnant customers or those with serious dermatological conditions; what to tell someone who simply asks for a product recommendation when the wrong answer could hurt them. These are the kinds of questions where bad advice doesn’t just disappoint. It causes harm, and possibly legal liability.

I chose this territory deliberately, and not only because HR deals in similarly high-stakes decisions. I am also an avid and well-researched skincare enthusiast. I know which ingredients don’t belong together. I know when a product claim is marketing hype and when it’s real. I know what the research says about retinol in the first trimester because I was once in that trimester. I know why retinol and retinal are different, yet similar. I know that, contrary to popular guidance, some people can use vitamin C and retinol at the same time if the products are applied properly, in the right sequence, and the user is not sensitive to either. Used that way, the combination accelerates skin brightening and cellular turnover.

That domain knowledge wasn’t incidental to the build. I was the human testing layer—the thing that could look at an agent’s output, independent of any AI test result, and know whether it was right. It is a human role needed to ensure AI’s trustworthiness and utility. That is not a role most organizations building AI have yet filled.

The Build

Here’s what I built: a fleet of specialized agents for Bluemercury, each with a defined role.

“Blue Cinnamon” determines the sentiment of the customer request so the response is appropriate in tone and the customer leaves feeling heard, not sold.
“Blue Holler” is the orchestrator, owning routing, sequencing, and every handoff to ensure all agents stay in their lanes.
“Blue Hatch” holds the internal knowledge base: products, ingredients, claims, usage instructions, and price points scraped directly from the Bluemercury site.
“Blue Scratch” serves as the external research layer, comparing internal product claims against actual dermatological research and flagging the hype.
“Blue Kanza” is the QA agent, making sure all outputs are correct and calling out discrepancies when they arise.
“Blue Sugar” handles customer communication as the knowledgeable beauty adviser, built from a blend of Bluemercury’s brand voice and my own writing style.
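
For readers who think in code, the division of labor above can be sketched roughly as follows. This is a minimal illustration, not the actual Cassidy build: the `Ticket` structure, the function bodies, the placeholder strings, and the order of the handoffs are all assumptions made for the example, and each agent is a stub standing in for a real model call.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    """State handed from agent to agent (hypothetical structure)."""
    question: str
    sentiment: str = ""
    internal_facts: list[str] = field(default_factory=list)
    research_notes: list[str] = field(default_factory=list)
    draft: str = ""
    approved: bool = False


def blue_cinnamon(t: Ticket) -> Ticket:
    # Sentiment: a keyword stub standing in for a model call.
    t.sentiment = "worried" if "pregnant" in t.question.lower() else "neutral"
    return t


def blue_hatch(t: Ticket) -> Ticket:
    # Internal knowledge base lookup, stubbed with a placeholder entry.
    t.internal_facts.append("placeholder internal product fact")
    return t


def blue_scratch(t: Ticket) -> Ticket:
    # External research layer, stubbed; a real agent would attach citations.
    t.research_notes.append("placeholder research note with citation")
    return t


def blue_sugar(t: Ticket) -> Ticket:
    # Customer-facing draft built only from what earlier agents gathered.
    t.draft = f"[{t.sentiment}] " + "; ".join(t.internal_facts + t.research_notes)
    return t


def blue_kanza(t: Ticket) -> Ticket:
    # QA: approve only if the draft is backed by both knowledge sources.
    t.approved = bool(t.draft and t.internal_facts and t.research_notes)
    return t


def blue_holler(question: str) -> Ticket:
    """Orchestrator: owns the order of the handoffs (assumed order)."""
    t = Ticket(question)
    for agent in (blue_cinnamon, blue_hatch, blue_scratch, blue_sugar, blue_kanza):
        t = agent(t)
    return t
```

The point of the sketch is the shape: one orchestrator controls sequencing, every other agent does one job, and nothing reaches the customer without passing QA.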

AI Build Lab provided solid baselines for the sentiment and QA agents, as well as their proven testing methodology and AI and human teaching assistants to help when we got stuck. Everything else I built, broke, and rebuilt: the architecture, the agent builds, the workflow stage prompts, the conditional routing paths, the knowledge base, and every workaround.

I held each agent to 100% accuracy. Skincare advice that gets ingredient interactions wrong for a pregnant customer is a liability, not an 80% problem. The standard you set in testing is the standard you’re deploying.
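
In code terms, a 100% standard amounts to a release gate with no tolerance. The sketch below is only an illustration of that idea: the function name `passes_gate` and the `required` parameter are hypothetical, and it is not the testing methodology AI Build Lab or Cassidy uses.

```python
def passes_gate(results: list[bool], required: float = 1.0) -> bool:
    """Return True only if the share of passing tests meets the threshold."""
    if not results:
        return False  # no test evidence means no release
    return sum(results) / len(results) >= required
```

With the default threshold, a single failed case blocks the agent. Lowering `required` to 0.8 would let the same agent through, which is exactly the trade-off the 80% framing hides.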

Getting to 100% was anything but clean.

Week 3: The First Major Wall

Cassidy AI, the LLM-agnostic platform I am building in, was testing one of my agent builds improperly. My Blue Scratch agent was producing required research citations correctly. Every test said it wasn’t, although I could see them with my own eyes. I rewrote the instructions, restructured the prompts, burned through tokens on variations that still worked but showed up as failed. At one point the tool did something I didn’t expect: it admitted the failure was its own and not mine.

That was a strange moment. I had been fighting the machine for two days and it finally said: it’s me, not you.

I pulled garbled AI-produced code into VS Code to clean up what Cassidy couldn’t. It still failed testing when I could clearly see the citations working. Then I brought in Wade, our human TA at AI Build Lab. Ninety minutes later we had a workaround. When the final test ran, Cassidy praised the solution as one that successfully worked around its own broken system.

Week 4: The Second Major Wall

This week was about wiring my fleet together. Two full workflows: an email handler with 16 steps and four conditional revision paths, and a human feedback loop bridging both workflows through Slack, Gmail, and Google Sheets.

The course leaders recommended building and testing long workflows in discrete chunks—a reasonable approach for first-timers learning these new concepts—before wiring everything into a complete workflow. It would have worked, if the AI had. When it came time to connect everything, the AI locked onto a sequencing I couldn’t talk it out of, no matter how precisely I corrected it. It would confirm understanding, then do the wrong thing again. Every time. With an apology.

Eight hours of wasted tokens and energy later, I threw out the plan, ignored the AI’s warnings, and built the entire workflow myself, in one shot, from scratch. Four hours later, the AI celebrated my “tenacity” in finding a solution despite having wildly failed me along the way. Seriously.

During that painful session, I at one point asked it to write the kind of prompt it would need to produce a correct answer, so I could test it in a fresh chat. The fresh chat failed too. More apologies. More groveling. I went from frustrated to genuinely concerned that I remained the smartest one in a room where I am not supposed to be.

At the end of a frustrating 20+ hours, the AI wrote me a detailed postmortem build report. It cataloged how long every build and iteration took, how many QA runs it needed, every failure, every place it made things harder instead of easier. It accused itself in plain English of being the reason behind my many headaches, ones that also cost tokens.

The Realization

AI is only as capable as the expertise of the person overseeing it. Think of someone who knows how to make drip coffee getting a job at a craft coffee shop. You can’t expect them to know every drink on the menu unless someone trains them, watches them, and corrects their mistakes—repeatedly. The difference is that eventually the barista doesn’t need oversight.

I don’t think AI is there yet.

When a human team member makes a mistake, their pay doesn’t change during the situation. They hopefully learn and keep their job. When AI makes a mistake, there are no refunds for the errors, only additional charges to get it to the right outcome.

The meter ran the entire time I was fighting broken outputs. Every apology, every confirmed-but-wrong iteration, every garbage draft I caught before it could go live—all billed at the same rate as the outputs that worked.

The Math CEOs Are Not Doing

The calculation being made: replace human labor cost with AI tool cost.

The calculation being skipped: the human expertise required to supervise the AI, the token cost of every wrong output, the rework cost when no one catches the errors, and the liability cost when something consequential goes live unchecked.
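
The skipped calculation can be written down in a few lines. This is a rough sketch under stated assumptions: every field name below is hypothetical, the inputs are whatever figures an organization supplies for itself, and treating liability as a single expected-cost number is a simplification.

```python
from dataclasses import dataclass


@dataclass
class AiCostInputs:
    """Hypothetical inputs, all in the same currency."""
    token_spend: float          # total billed, including failed attempts
    supervision_hours: float    # expert time spent reviewing and correcting
    expert_hourly_rate: float   # cost of that expert time per hour
    rework_cost: float          # fixing errors nobody caught in time
    expected_liability: float   # probability-weighted cost of a harmful output
    correct_outputs: int        # outputs that actually passed human review


def total_cost(c: AiCostInputs) -> float:
    # Sum of the four costs named above, not just the tool bill.
    return (c.token_spend
            + c.supervision_hours * c.expert_hourly_rate
            + c.rework_cost
            + c.expected_liability)


def cost_per_correct_output(c: AiCostInputs) -> float:
    # Failed attempts are billed too, so divide by correct outputs only.
    if c.correct_outputs <= 0:
        raise ValueError("no correct outputs to divide the cost across")
    return total_cost(c) / c.correct_outputs
```

Comparing `cost_per_correct_output` against the fully loaded cost of the human work it replaces is the comparison that matters, rather than comparing `token_spend` alone against salary.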

Most recent layoffs trace to restructuring and cost control, not to AI doing the actual work, because AI is just not there yet. When companies have explicitly cited AI agent capability, some have since admitted their shortsightedness. What’s really happening is OPEX being converted to compute spend. Headcount traded for tokens. People traded for an infrastructure bet on technology that, in my direct experience, still requires significant human expertise to produce reliable output.

Paying non-beta prices for a beta technology is an experiment being funded, in part, by the people being let go.

Why I Built This

I built this project to be a better adviser. Not to check a box, but to know—from inside the machine—what this technology can and can’t do when the stakes are real.

What I know now: AI can do some remarkable things and the potential is genuine. That said, it requires domain expertise to direct it, real skill to fix it when it breaks, risk awareness to know what wrong looks like before it causes harm, and the willingness to do it yourself—the old-fashioned way—when the tool fails you.

For HR specifically—the field being asked to govern AI’s impact on hiring, performance, compensation, and workforce decisions—that expertise can’t be borrowed from the vendor or delegated to IT. It must be built by people who understand what’s at stake when the output is wrong and can tell when it is.

I, a skincare enthusiast, was the human testing layer that caught what the AI couldn’t produce or fix. CEOs deploying AI need functional and HR experts in the room. A vendor contract, an implementation timeline, and a headcount reduction target do not add up to innovation, whatever they call it. Jensen Huang said it plainly: cutting people to fund AI doesn’t make companies leaner. It makes them less imaginative. I’d add one more thing: it makes them a liability waiting to surface.

Next in my AI journey: smoke testing the full fleet with all agents running in tandem.