llm
links
Claude now has web search, but it’s currently only available in feature preview for paid Claude users in the United States. Support for users on the free plan and more countries is coming soon.
OpenAI release o1-pro, which costs $150 per million input tokens and $600 per million output tokens.
Currently it’s only available to select developers: those who’ve spent at least $5 on OpenAI API services.
Evalite - a Vitest-based eval runner by Matt Pocock.
Introducing GPT-4.5 - hallucinations down, accuracy up, non-reasoning. Rolling out to Pro + API. Doesn’t look like anyone will be coding with it any time soon with API pricing like this (quick cost sketch below):
Input: $75.00 / 1M tokens
Cached input: $37.50 / 1M tokens
Output: $150.00 / 1M tokens
And then in a tweet from sama:
this isn’t a reasoning model and won’t crush benchmarks. it’s a different kind of intelligence and there’s a magic to it i haven’t felt before. really excited for people to try it!
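For a sense of what those prices mean per request, here’s some rough arithmetic in plain Python. The prices are the ones quoted above (plus o1-pro from the earlier entry); the request sizes are invented for illustration.

```python
# Rough $/request maths for the API prices quoted above.
# Prices are USD per 1M tokens; the request sizes below are made up.
PRICES = {
    "gpt-4.5": {"input": 75.00, "output": 150.00},
    "o1-pro": {"input": 150.00, "output": 600.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A typical "review this file" style request: ~4k tokens in, ~1k tokens out.
for model in PRICES:
    print(model, f"${request_cost(model, 4_000, 1_000):.2f}")
# gpt-4.5: $0.45 per request, o1-pro: $1.20 per request -- it adds up fast in a coding loop.
```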
What We’ve Learned From A Year of Building with LLMs is a huge overview of findings from building LLM applications, from:
starting with prompting when prototyping new applications
all the way through:
what is a completely infeasible floor demo or research paper today will become a premium feature in a few years and then a commodity shortly after
Fuck you, show me the prompt is an investigation into extracting the actual prompt that is sent to a model by LLM abstraction libraries.
There are many libraries that aim to make the output of your LLMs better by re-writing or constructing the prompt for you. The prompts sent by these tools to the LLM are a natural language description of what these tools are doing, and are the fastest way to understand how they work.
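One low-tech way to do this kind of extraction yourself is to wrap the client call and print the outgoing messages before they are sent. A minimal sketch assuming the official openai Python SDK; this is my own illustration, not code from the article.

```python
# Minimal "show me the prompt" hack: wrap the OpenAI client's create() call
# and dump whatever messages an abstraction library actually sends.
import json
from openai import OpenAI

client = OpenAI()
_original_create = client.chat.completions.create

def logging_create(*args, **kwargs):
    # Print the real prompt (messages) before the request goes out.
    print(json.dumps(kwargs.get("messages", []), indent=2))
    return _original_create(*args, **kwargs)

client.chat.completions.create = logging_create
# Now hand `client` to whatever prompt-rewriting library you're inspecting.
```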
The Novice’s LLM Training Guide - a look at fine-tuning LLMs using Low-Rank Adaptation (LoRA).
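As a taste of what LoRA looks like in practice, here’s a minimal adapter setup with Hugging Face’s peft library. The base model and hyperparameters are placeholder choices of mine, not taken from the guide.

```python
# Minimal LoRA setup with transformers + peft.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # any causal LM

lora = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which weight matrices get adapters
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the small adapter weights are trained
```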
Claude 3.7 Sonnet and Claude Code - a hybrid reasoning model at the same price as Claude 3.5, with improved accuracy. Claude Code is a terminal-based agentic coding tool, though it requires an API key.
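“Hybrid reasoning” here means the same model can answer directly or spend tokens thinking first, toggled per request. A minimal sketch against the Anthropic Python SDK; the model alias and token budgets are my assumptions.

```python
# Toggling Claude 3.7 Sonnet's extended thinking on a single request.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-latest",                     # assumed model alias
    max_tokens=2048,
    thinking={"type": "enabled", "budget_tokens": 1024},  # omit this for a direct answer
    messages=[{"role": "user", "content": "How many weekdays are there in March 2025?"}],
)

for block in response.content:
    print(block.type)  # "thinking" blocks first, then the final "text" block
```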
ChatGPT Deep Research hallucinates
it claimed again to produce a complete dataset but in fact only produced ~7 lines, with a placeholder for the other ~3000.
Grok 3 set to launch, though after the “launch” it appears that:
Not all the models and related features of Grok 3 are available yet (some are in beta), but they began rolling out on Monday.
Introducing Perplexity Deep Research - Perplexity undercuts OpenAI by releasing their own Deep Research, for free.
Building a SNAP LLM eval - the first write-up in a series about our process of building an “eval” — evaluation — to assess how well AI models perform on prompts
Your AI product needs evals - How to construct domain-specific LLM evaluation systems to improve AI by iterating quickly.
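Both eval write-ups boil down to the same loop: a small set of domain-specific cases, a scorer, and a pass rate you re-run on every change. A toy sketch of that shape, with made-up cases and a deliberately crude keyword scorer.

```python
# A tiny domain-specific eval loop: cases, a scorer, and a pass rate.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    prompt: str
    must_contain: str  # crude keyword assertion; real scorers are richer

# Illustrative cases only.
CASES = [
    Case("What is 2 + 2?", "4"),
    Case("Name the capital of France.", "Paris"),
]

def run_eval(ask_model: Callable[[str], str]) -> float:
    """Return the fraction of cases whose output contains the expected text."""
    passed = sum(c.must_contain.lower() in ask_model(c.prompt).lower() for c in CASES)
    return passed / len(CASES)

# `ask_model` is whatever function calls your LLM; change the prompt or the
# model, re-run, and watch whether the pass rate moves.
```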
OpenAI roadmap update for GPT-4.5 and GPT-5 from sama, which indicates that the model they’ve been cooking for some time can no longer be considered GPT-5.
We will next ship GPT-4.5, the model we called Orion internally, as our last non-chain-of-thought model.
Convert a figma design to code - After theoretically setting up Claude to read/write from Jira & GitHub, I remarked that my only job left would be to copy a screenshot from Figma into the prompt and ask it to build the UI, but it looks like that can be integrated too.
Deepseek vs Claude PR Reviews - also demonstrates the value of being able to quickly switch between models.
The End of Programming as We Know It is another argument against AI replacing programmers and for AI extending programmer capability.
OpenAI’s Deep Research: Novel User Applications and Community Insights - I prompted Deep Research to research itself
Investigate latest community news of OpenAI’s Deep research function and what novel approaches people are finding it useful for.
LLM Cost Analysis 2023-2026 - I asked ChatGPT Deep Research to generate a report investigating $/million tokens over time across providers and predicting the price of tokens in 2026.
Prepare a report that investigates the cost per million token of LLMs since 2023, with estimations on what the cost will be in 2026.
Open-source DeepResearch – Freeing our search agents - after the release of OpenAI’s Deep Research, Hugging Face deliver an open source alternative in 24 hours.
Getting AI-powered features past the post-MVP slump
The non-negotiable first step in systematically improving your AI systems is establishing a solid feedback loop.
On DeepSeek and Export Controls - the CEO of Anthropic shares his take on DeepSeek: it’s not as good as everyone says it is, but China needs to be further restricted from chips anyway.
Nvidia releases a 72B multimodal LLM. The article claims it’s open source, but it appears to only have open weights and is otherwise commercially restricted.
Introducing OpenAI o1-preview, a thinking/reasoning model.
As an early model, it doesn’t yet have many of the features that make ChatGPT useful, like browsing the web for information and uploading files and images. For many common cases GPT‑4o will be more capable in the near term.
Mistral announce Mistral Large 2
Mistral Large 2 has a 128k context window and supports dozens of languages
Prompting Fundamentals and How to Apply them Effectively has some really good prompting guidance.
I pondered whether LLMs would be any good at solving the Vehicle Routing Problem - thankfully I don’t need to investigate as arxiv.org once again delivers. TL;DR - yes, as long as you’re happy with it being wrong 30-40% of the time.
Microsoft releases Phi-3 vision
a 4.2B parameter multimodal model with language and vision capabilities.
I’ve been running koboldcpp in WSL, but the Tcl/Tk UI is tiny. This looks interesting though, and it’s already in a container.
Marc Andreessen on navigating a model’s latent space via prompting.
We’re announcing GPT‑4o, our new flagship model that can reason across audio, vision, and text in real time.
Introducing the next generation of Claude
The family includes three state-of-the-art models in ascending order of capability: Claude 3 Haiku, Claude 3 Sonnet, and Claude 3 Opus.
GPT4All runs large language models privately on everyday desktops & laptops
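GPT4All also ships Python bindings, so running a model locally really is only a few lines. A minimal sketch; the model filename is one of their downloadable GGUF builds and may change.

```python
# Run a local model via the gpt4all Python bindings: no API key, and no
# network needed after the one-time model download.
from gpt4all import GPT4All

model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")  # downloaded on first use

with model.chat_session():
    print(model.generate("Explain what a context window is, in two sentences.",
                         max_tokens=128))
```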