AI Governance
Does prompt engineering still matter in late 2025?

YES! But the skillset has evolved.
I just completed flow creation and testing for a new use case in my AI feature family, "Audit AI." Here's what's different about prompt engineering circa October 2025.
- You need a curated, high-quality, real-world dataset. If you're using RAG, web search, agent tool calls, or any other kind of grounding for your prompts, you need to feed that EXACT context into your prompt-writing environment. Context engineering is everything in 2025, and if you test on overly simplified context, you will not find the real failure cases in your prompts (or in your entire pipeline). We found multiple failures in the retrieval stage that showed up as incorrect responses, but only because the model had been given the wrong context.
- You need to balance cost vs. intelligence and understand how that impacts your prompts. You also need to know how to adapt to the inference/model provider you use; I use Bedrock. This prompt chain will run over tons of documents, so it needs to be fast and cheap, which means I'm looking at small language models for every step in the chain (see the sketch after this list).
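Here's a minimal sketch of what those two points look like together in practice: replay the exact retrieved context captured from the real pipeline against a small model on Bedrock via the Converse API. The prompt template, file layout, and model ID are illustrative assumptions, not my exact setup.

```python
# Minimal sketch: replay captured production retrieval context against a small
# model on Bedrock. File layout, prompt template, and model ID are assumptions.
import json

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

PROMPT_TEMPLATE = """You are an audit assistant. Answer using ONLY the context below.

<context>
{context}
</context>

Question: {question}"""


def run_case(case: dict, model_id: str) -> str:
    """Replay one captured case (the exact retrieved context) against a model."""
    prompt = PROMPT_TEMPLATE.format(
        context=case["retrieved_context"], question=case["question"]
    )
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.0},
    )
    return response["output"]["message"]["content"][0]["text"]


if __name__ == "__main__":
    # Each line: {"question": ..., "retrieved_context": ..., "expected": ...}
    with open("captured_cases.jsonl") as f:
        cases = [json.loads(line) for line in f]
    for case in cases:
        answer = run_case(case, model_id="anthropic.claude-3-5-haiku-20241022-v1:0")
        print(answer[:120])
```

The point of the harness is that the model ID is just a parameter, so swapping models (or turning reasoning on) doesn't disturb the captured context or the prompts under test.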
Here's the trick, though: when I used the reigning champ of cheap small language models from 2024 and early 2025, Claude Haiku 3.5, I kept getting incorrect results. In my test set of 43 documents, roughly half of my answers (21) had at least one error in them. Of those 21 failing answers, 19 were failures of the model and instructions (not retrieval).
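That retrieval-vs-model split only falls out if each failing case is labeled by the stage that caused it. A minimal sketch of that tally; the field names are illustrative assumptions:

```python
# Minimal sketch: tally failures by pipeline stage so you know whether to fix
# retrieval or the prompt/model. Field names ("grade", "failure_stage") are
# illustrative assumptions.
from collections import Counter


def summarize(results: list[dict]) -> Counter:
    """Count failing cases by the stage that caused them ('retrieval' or 'model')."""
    return Counter(r["failure_stage"] for r in results if r["grade"] == "fail")


# Example with the numbers above: 21 of 43 graded answers failed,
# 19 of them on the model/instructions and 2 on retrieval.
results = (
    [{"grade": "pass", "failure_stage": None}] * 22
    + [{"grade": "fail", "failure_stage": "model"}] * 19
    + [{"grade": "fail", "failure_stage": "retrieval"}] * 2
)
print(summarize(results))  # Counter({'model': 19, 'retrieval': 2})
```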
The typical move in 2022-2024-era prompt engineering would be to iterate over the failure cases, adding instructions to cover each one. That only solved 2 or 3 of my failures. I found that Haiku 3.5 lacked the capacity to follow the instructions as they grew longer and accumulated more exceptions. This is 2025 prompt engineering: I knew I needed a reasoning model that could reflect on the instructions, and I knew it had to be just as cheap and fast as Haiku 3.5. I tried Amazon Nova Pro to see if it would do better, and it actually did worse. I swapped in Claude Sonnet 4, which improved things somewhat but didn't fix every failure, and it added a lot more cost and latency. Turning on reasoning in Sonnet 4 and 3.7 immediately solved the problem, but it added yet more cost and latency. And because I'm deeply immersed in this space, I had other ideas.
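For reference, turning on reasoning is a request-level change on Bedrock rather than a prompt change. A minimal sketch via the Converse API, assuming Anthropic's native "thinking" passthrough field and an illustrative model ID (verify both against the current Bedrock docs):

```python
# Minimal sketch of enabling extended thinking ("reasoning") for Claude Sonnet
# through the Bedrock Converse API. The model ID and the exact passthrough
# field are assumptions based on Anthropic's native "thinking" parameter.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="us.anthropic.claude-sonnet-4-20250514-v1:0",  # assumed inference profile ID
    messages=[{"role": "user", "content": [{"text": "...same prompt and captured context as before..."}]}],
    inferenceConfig={"maxTokens": 4096},  # must exceed the thinking budget
    additionalModelRequestFields={"thinking": {"type": "enabled", "budget_tokens": 2048}},
)

# With thinking on, the reply contains a reasoning block plus the normal text block.
blocks = response["output"]["message"]["content"]
answer = next(b["text"] for b in blocks if "text" in b)
print(answer)
```

Every token the model spends thinking is a token you pay for and wait on, which is exactly the cost/latency tradeoff described above.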
I knew GPT-OSS-120b and 20b were exactly what I needed. If I hadn't been watching model developments like a hawk, I would've had no idea where to go next once the traditional instruction-iteration method failed. But I knew GPT-OSS-120b was a native reasoning model, and I knew it would be cheap and fast because of OpenAI's innovative MoE + MXFP4 approach.
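The swap itself is just a different model ID in the same Converse call. A minimal sketch; the GPT-OSS model IDs and region are assumptions, so check the Bedrock model catalog for what's available to you:

```python
# Minimal sketch of the swap: same Converse call, different model ID.
# The GPT-OSS model IDs and region below are assumptions.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")

response = bedrock.converse(
    modelId="openai.gpt-oss-120b-1:0",  # assumed ID; "openai.gpt-oss-20b-1:0" for the smaller model
    messages=[{"role": "user", "content": [{"text": "...same prompt and captured context as before..."}]}],
    inferenceConfig={"maxTokens": 2048},
)

# Native reasoning models may return a reasoning block before the answer text.
blocks = response["output"]["message"]["content"]
print(next(b["text"] for b in blocks if "text" in b))
```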
In my testing, the combination of improved instructions, the model's ability to actually follow those instructions, and the overall jump in model quality resolved every single failure in my test set. From 21 failures to 0.
The lesson: keep up and stay nimble!