links from February 2025
Introducing GPT-4.5 - hallucinations down, accuracy up, non-reasoning. Rolling out to Pro + API. It doesn't look like anyone will be coding with it any time soon at this API pricing:
Input: $75.00 / 1M tokens
Cached input: $37.50 / 1M tokens
Output: $150.00 / 1M tokens
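To put that pricing in perspective, a quick sketch of what a single call costs at these list prices (the helper name and token counts are mine, not OpenAI's):

```python
# Estimate the USD cost of one GPT-4.5 API call from the
# per-million-token prices quoted above.
PRICES = {  # USD per 1M tokens
    "input": 75.00,
    "cached_input": 37.50,
    "output": 150.00,
}

def call_cost(input_tokens, output_tokens, cached_tokens=0):
    """Cost of one request; cached_tokens is the cached share of input."""
    uncached = input_tokens - cached_tokens
    return (
        uncached * PRICES["input"]
        + cached_tokens * PRICES["cached_input"]
        + output_tokens * PRICES["output"]
    ) / 1_000_000

# A typical coding-assistant turn: 10k tokens of context in, 1k out.
print(round(call_cost(10_000, 1_000), 2))  # → 0.9
```

Ninety cents per turn adds up fast over an agentic coding session, which is the point being made above.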
And then in a tweet from sama:
this isn’t a reasoning model and won’t crush benchmarks. it’s a different kind of intelligence and there’s a magic to it i haven’t felt before. really excited for people to try it!
What We’ve Learned From A Year of Building with LLMs is a huge overview of findings building LLM applications from:
starting with prompting when prototyping new applications
all the way through:
what is a completely infeasible floor demo or research paper today will become a premium feature in a few years and then a commodity shortly after
Emerging Patterns in Building GenAI Products - a look at a number of different gen-ai patterns across evals, embeddings, RAG, guardrails, and fine-tuning.
Fuck you, show me the prompt is an investigation into extracting the actual prompt that is sent to a model by llm abstraction libraries.
There are many libraries that aim to make the output of your LLMs better by re-writing or constructing the prompt for you. The prompts sent by these tools to the LLM are a natural language description of what these tools are doing, and are the fastest way to understand how they work.
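The article intercepts traffic with a proxy, but the same idea can be sketched in-process: wrap whatever function your abstraction library ultimately calls and log the outgoing prompt. Everything here is hypothetical (`FakeClient` stands in for a real SDK client):

```python
# Sketch of the "show me the prompt" idea: monkey-patch the client's
# completion method so every prompt is printed before it is sent.
class FakeClient:
    def complete(self, prompt):
        return f"response to: {prompt}"

def log_prompts(client):
    """Wrap client.complete to print each outgoing prompt."""
    original = client.complete
    def wrapper(prompt):
        print("PROMPT SENT >>>", prompt)
        return original(prompt)
    client.complete = wrapper
    return client

client = log_prompts(FakeClient())
client.complete("Summarise this ticket in one line.")
```

Pointing this at a real client object shows you exactly what a prompt-rewriting library built on top of it is actually doing.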
The Novice’s LLM Training Guide - a look at fine-tuning LLMs using Low-Rank Adaptation (LoRA)
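The core trick is small enough to sketch in plain Python: instead of updating a full d×d weight matrix W, you train two small matrices B (d×r) and A (r×d) and use W + (alpha/r)·BA. The shapes and scaling follow the usual LoRA convention; the numbers are illustrative, not from the guide:

```python
d, r, alpha = 4, 2, 8  # toy sizes; real layers have d in the thousands

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen
B = [[0.1] * r for _ in range(d)]   # d×r, trainable
A = [[0.2] * d for _ in range(r)]   # r×d, trainable

delta = matmul(B, A)                # d×d, but only rank r
W_eff = [[W[i][j] + (alpha / r) * delta[i][j] for j in range(d)]
         for i in range(d)]

# The saving only shows up at realistic sizes: a 4096×4096 layer is
# ~16.8M weights to fine-tune in full, versus ~65k with r=8.
print(4096 * 4096, "vs", 2 * 4096 * 8)
```

That parameter-count gap is why LoRA makes fine-tuning feasible on consumer hardware, which is the guide's whole premise.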
Claude 3.7 Sonnet and Claude Code - hybrid reasoning model, same price as Claude 3.5, improved accuracy. Also a terminal-based agentic coding tool, though it requires an API key.
ChatGPT Deep Research hallucinates
it claimed again to produce a complete dataset but in fact only produced ~7 lines, with a placeholder for the other ~3000.
Grok 3 set to launch, though after the “launch” it appears that:
Not all the models and related features of Grok 3 are available yet (some are in beta), but they began rolling out on Monday.
Introducing Perplexity Deep Research - Perplexity undercuts OpenAI by releasing their own Deep Research, for free.
Building a SNAP LLM eval - the first write-up in a series about our process of building an “eval” — evaluation — to assess how well AI models perform on prompts
Your AI product needs evals - How to construct domain-specific LLM evaluation systems to improve AI by iterating quickly.
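Both eval write-ups reduce to the same loop: a set of prompts, a grader per prompt, a pass rate you can iterate against. A minimal sketch, with a stubbed-out model and made-up cases (nothing here comes from either post):

```python
# Minimal eval harness: run each prompt through a model, grade the
# answer with a simple check, and report the pass rate.
def fake_model(prompt):
    """Stand-in for a real API call."""
    return "yes" if "SNAP" in prompt else "no"

EVAL_CASES = [  # (prompt, grader) pairs; graders return True/False
    ("Does SNAP cover groceries?", lambda out: out == "yes"),
    ("Is the sky green?",          lambda out: out == "no"),
]

def run_eval(model, cases):
    results = [grader(model(prompt)) for prompt, grader in cases]
    return sum(results) / len(results)

print(f"pass rate: {run_eval(fake_model, EVAL_CASES):.0%}")
```

Swapping `fake_model` for a real API call gives you a number to compare across prompt changes and model upgrades, which is the quick-iteration loop both posts advocate.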
OpenAI roadmap update for GPT-4.5 and GPT-5 from sama, which indicates that the model they’ve been cooking for some time can no longer be considered GPT-5.
We will next ship GPT-4.5, the model we called Orion internally, as our last non-chain-of-thought model.
Convert a Figma design to code - After theoretically setting up Claude to read/write from Jira & GitHub, I remarked that my only job left would be to copy a screenshot from Figma into the prompt and ask it to build the UI, but it looks like that can be integrated too.
Deepseek vs Claude PR Reviews - also demonstrates the value of being able to quickly switch between models.
The End of Programming as We Know It is another argument against AI replacing programmers and for AI extending programmer capability.
OpenAI’s Deep Research: Novel User Applications and Community Insights - I prompted Deep Research to research itself
Investigate latest community news of OpenAI’s Deep research function and what novel approaches people are finding it useful for.
LLM Cost Analysis 2023-2026 - I asked ChatGPT deep research to generate a report that investigates the $/million tokens over time across providers and predict the price of tokens in 2026
Prepare a report that investigates the cost per million token of LLMs since 2023, with estimations on what the cost will be in 2026.
GitHub Copilot: The agent awakens - Just when you thought it was Cursor/Claude Desktop/Roo/Cline, m$ reminds you they’ll eat your lunch.
Open-source DeepResearch – Freeing our search agents - after the release of OpenAI’s Deep Research, Hugging Face delivered an open source alternative in 24 hours.
Getting AI-powered features past the post-MVP slump
The non-negotiable first step in systematically improving your AI systems is establishing a solid feedback loop.
Beyond the AI MVP: What it really takes
almost no one is talking about how to integrate this stuff into a normal software development lifecycle. There’s a reason no one is talking about this: it’s because most teams, even those at billion-dollar companies, just haven’t built this yet.
OpenAI o3-mini released.
This model continues our track record of driving down the cost of intelligence—reducing per-token pricing by 95% since launching GPT‑4—while maintaining top-tier reasoning capabilities.
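Taking that 95% claim at face value, a rough back-of-envelope for the annualized rate of decline (the ~22-month gap from GPT-4's March 2023 launch to o3-mini is my approximation, not OpenAI's figure):

```python
# If prices fell 95% over ~22 months, what is the equivalent
# annual price drop?
months = 22
remaining = 0.05                      # 95% reduction leaves 5%
annual_factor = remaining ** (12 / months)
print(f"~{1 - annual_factor:.0%} price drop per year")  # ~80% per year
```

Sustained, that rate would cut per-token prices by another order of magnitude well before 2027, which is consistent with the "premium feature today, commodity shortly after" trajectory quoted at the top of this post.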