Will Smidlein's Blog

Extracting Value From Forums With LLMs

I have so many forum threads in my bookmarks that I desperately want to consume, but when I sit down to read them, I very quickly get overwhelmed by crosstalk, multiple threads of conversation, weird forum conventions, etc. I know there are nuggets of genuine brilliance buried in these things, and my TikTok-addled attention span is incapable of filtering through to it.

And then it hit me… this would be a great use case for LLMs! In fact, it would not surprise me in the slightest if forum data was a large basis of the training data used for these models in the first place.

I decided to pick a thread that has been on my list for years, Jeff Sponaugle’s INSANE Tony Stark lair build, but I approached the problem generically enough that I should be able to apply the approach to other forum threads.

Step 1: Scrape the thread

I wasn’t super comfortable with the idea of fully automating data collection from forums since many are maintained by hobbyists who shoulder the cost themselves and are already getting slammed by AI scrapers. I instead chose a compromise; I turned off images in my browser to minimize bandwidth consumption and manually paged through the thread, saving each page as HTML and clicking a few ads along the way. I think feeding images into vision models would be a fascinating next step, but it’s not something I’m comfortable doing without explicit opt-in from the forum owners.

Once I had my directory of 22 HTML pages, I needed to extract them into structured data. I initially started vibe-coding a parser myself, but very quickly discovered that the HTML structures were downright horrendous. Luckily I stumbled onto forumscraper, which helpfully dumped the thread to a 27,288-line (590,618-token) JSON file.

Step 2: Token Reduction

Luckily, much of the data is unnecessary for our use case, and much of the necessary data is quite repetitive.

The first thing that stood out to me was the repeated user objects. If someone posted in a thread 100 times, there were 100 user objects, each with sub-objects for arbitrary key-value pairs like Location and Join date. This is hundreds of thousands of characters that can be condensed.

Additionally, I noticed there was still quite a bit of HTML markup scattered in the text: things like paragraph tags, img tags with lots of attributes, and weird handling of quotes. I also discovered there were 4+ different ways a blockquote/reply could be formatted, presumably depending on the poster’s client.

I split the output into 2 arrays, posts and users.

The post schema ended up looking roughly like this (a sketch; only the fields referenced later in this post are shown):
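
// A sketch of the condensed post record. The real schema may carry a few more fields.
interface Post {
  id: number;      // used later as the batch custom_id ("post_<id>")
  user: string;    // username, pointing at an entry in the users array
  text: string;    // cleaned plaintext body, with [in-reply-to: ...] markers
}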

The user schema, again roughly:
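
// A sketch of the deduplicated user record: a name plus whatever arbitrary
// key-value pairs the forum exposes (Location, Join date, and so on).
interface ForumUser {
  name: string;
  [field: string]: string;
}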

With Claude’s assistance, I was able to whip up this horrific set of regex replaces to turn all known blockquote/reply formats into plaintext:

// Handle XenForo blockquotes: extract data-quote (author) and the inner text content.
// [\s\S] is used so the match spans newlines.
s = s.replace(/<blockquote[^>]*data-quote=\"([^\"]+)\"[^>]*>([\s\S]*?)<\/blockquote>/gi, (_m, author: string, inner: string) => {
  // Prefer content within expandContent if present
  const matchExpand = inner.match(/<div[^>]*class=\"[^\"]*bbCodeBlock-expandContent[^\"]*\"[^>]*>([\s\S]*?)<\/div>/i);
  let contentHtml = matchExpand ? matchExpand[1] : inner;
  // Drop any "Click to expand..." links
  contentHtml = contentHtml.replace(/<div[^>]*class=\"[^\"]*bbCodeBlock-expandLink[^\"]*\"[\s\S]*?<\/div>/gi, "");
  // Convert br to newlines early
  contentHtml = contentHtml.replace(/<br\s*\/?\s*>/gi, "\n");
  // Convert images to alt + URL when available
  contentHtml = contentHtml.replace(/<img[^>]*>/gi, (imgTag: string) => {
    const altMatch = imgTag.match(/\balt=\"([^\"]*)\"/i);
    const srcMatch = imgTag.match(/\b(?:src|data-src|data-url)=\"([^\"]+)\"/i);
    const alt = altMatch ? altMatch[1] : "";
    const src = srcMatch ? srcMatch[1] : "";
    if (src) return alt ? `${alt} (${src})` : src;
    return alt || "[image]";
  });
  // Convert anchors to their text
  contentHtml = contentHtml.replace(/<a[^>]*>([\s\S]*?)<\/a>/gi, "$1");
  // Strip remaining HTML
  let contentText = contentHtml.replace(/<[^>]*>/g, "");
  // Decode common entities
  contentText = contentText
    .replace(/&amp;/g, "&")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .trim();
  // Format clearly as a reply attribution for LLMs
  // 1) Explicit in-reply-to header
  // 2) Markdown-style blockquote for quoted text
  const quotedBlock = contentText
    .split("\n")
    .map(line => `> ${line}`)
    .join("\n");
  return `[in-reply-to: ${author}]\n${quotedBlock}\n`;
});

I also wrote everything out as YAML rather than JSON, which provides more free token savings.
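
To make that concrete, here’s the same made-up post record both ways; YAML drops most of the braces, quotes, and commas that JSON spends tokens on:

{
  "id": 123,
  "user": "example_user",
  "text": "Example post body goes here."
}

id: 123
user: example_user
text: Example post body goes here.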

Down to 190,147 tokens - a 67.8% reduction! That’s now technically within the 200K context window the Claude models support, meaning we could theoretically feed the entire doc in as a single prompt. However, I have anecdotally found the models do much better when you give them less noise - even the 1M-context models.

Step 3: Reducing The Noise

Browsing the YAML file, it pretty quickly became apparent that many of the posts were providing literally zero value to the thread. Stuff like “very cool! subscribed!”, cross-talk about irrelevant topics, meta commentary on the forum and its members, etc. Which, duh, that’s how we ended up here in the first place.

We could probably write a deterministic filter to get most of these, but that’s boring. Let’s use an LLM!

I vibe-coded another script-


anthropic.messages.create({
    model: DUMB_MODEL,
    max_tokens: 5,
    temperature: 0,
    system: [
      {
        type: "text",
        text: `You are a forum summarizer's gatekeeper. Your single job is to prevent your boss from wasting time on posts that add zero benefit to understanding the thread. Decide whether this post is PURELY filler with no informational value for comprehension of the topic.

  Return ONLY one word with no other text:
  - "FILTER" if the post is pure filler that should be filtered out
  - "KEEP" if the post should be kept (has any informational value)

  Filter out (return "FILTER") ONLY if the post offers no signal that improves understanding:
  - thanks/agree/like/emoji-only/cheers
  - following/bump/subscribed  
  - off-topic small talk or jokes with no content
  - signature-only or attachments-only
  - quote-only with no added commentary
  - one-liners with no new info, no clarification, no data, no question

  Keep (return "KEEP") if the post contains ANY non-trivial detail, clarification, question, correction, measurement, trade-off, hypothesis, link WITH context, or actionable pointer. When uncertain, return "KEEP".`,
        cache_control: { type: "ephemeral" }
      }
    ],
    messages: [{
      role: "user",
      content: `User: ${post.user || 'Unknown'}

  Post content:\n${post.text}`
    }]
  })

I’ll be completely honest: I’m not totally sure I trust the whole “max_tokens: 5, ask for a single-word response” approach. I somewhat suspect you would get better results by asking the model to include a reason for filtering and paying the marginally higher cost for a few more output tokens, even if you’re then going to throw away that info. But I have seen people literally ask for “true/false”- which fundamentally misunderstands the strengths/weaknesses of a language model - and they seem to be getting fine results? Ultimately, the cost to run this job as a batch request on an older model was so laughably cheap that if I were to do it again, I would not worry at all about clamping down on output tokens.

Anyway… 37.6% of posts removed in my test thread, which also removed several posters from our users array. Resulting in a new total of… 156,858 tokens.

Step 4: More LLMs

At this point we could probably feed the YAML doc straight into an LLM and call it a day. However, replies pose a bit of a conundrum, as they require the model to keep track of multiple (sometimes overlapping) trains of thought, which (understandably) trips up the reasoning and can frequently lead to false conclusions. There are also still a lot of tokens wasted on filler. So let’s feed the posts into another LLM!


// Build one batch request per post, routing longer posts to the smarter model.
const summaryRequests = posts.map(post => {
  const sentenceCount = countSentences(post.text);
  const selectedModel = sentenceCount > 3 ? SMART_MODEL : DUMB_MODEL;
  const { targetSentences, maxTokens } = calculateSummaryLength(sentenceCount);
  
  return {
    custom_id: `post_${post.id}`,
    params: {
      model: selectedModel,
      max_tokens: maxTokens,
      system: [
        {
          type: "text",
          text: `You are an assistant that extracts the full substance of a forum post (including any inline replies). Put a specific focus on insights or findings. 
Your job is to produce a structured, detailed summary that captures every meaningful point while omitting pleasantries, signatures, or generic forum chatter.  

Guidelines:
- Identify the main topic or question.  
- List all technical details, arguments, experiences, or data points mentioned.  
- Include any advice, solutions, or proposed actions.  
- Capture disagreements or alternative viewpoints in replies.  
- Ignore filler (e.g., "thanks," "great post," "following," jokes).  
- Preserve nuance: if there are uncertainties, caveats, or edge cases, state them explicitly.  
- Use concise prose or bullet points as appropriate, but ensure nothing substantive is lost.  

Output should feel like a detailed set of notes a researcher would take, not a casual recap.`,
          cache_control: { type: "ephemeral" }
        }
      ],
      messages: [{
        role: "user",
        content: `Target length: Aim for approximately ${targetSentences} sentences (this post has ${sentenceCount} sentences originally).

User: ${post.user}

Post content:
${post.text}`
      }]
    }
  };
});
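
Those mapped request objects then get submitted in one go via the Batches API. Here’s a minimal sketch of the submission (result polling and error handling omitted):

// Submit every per-post summarization request as a single batch.
const batch = await anthropic.messages.batches.create({ requests: summaryRequests });
console.log(`Submitted batch ${batch.id} with ${summaryRequests.length} requests`);
// Poll until the batch's processing_status is "ended", then match each result
// back to its post via custom_id ("post_<id>").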

An Aside On Batch Mode

I think people are criminally underusing the batch mode that all the providers have. Unless your use case is a literal realtime back-and-forth chatbot, you should be reaching for batch mode by default. When using the Anthropic API, I pretty consistently see requests processed in under 10 minutes and don’t think I’ve ever seen them take longer than an hour. Counterintuitively, it seems larger batches process faster than multiple small batches run in parallel. And then finally, and I have absolutely nothing to back this up… I think batch mode performs better for the same prompt against the same model. If they’re quantizing or degrading performance under load on API requests, they don’t appear to be doing that in batch requests.

Step 5: Final Summary

And now for the moment we’ve all been waiting for. Down to a measly 67,808 tokens, ready for analysis and summary.


const prompt = `ROLE
You are a meticulous research analyst and editor. Transform a MASSIVE multi-post forum thread (provided as concatenated summaries) into a single, long-form, publication-ready blog post.

OUTPUT GOAL
Produce a logically organized Markdown article with clear headings, deep synthesis (not just summary), and evidence-aware claims. Prioritize correctness and internal consistency.

STYLE
- Voice: clear, journalistic, neutral-to-practical. No forum niceties or chit-chat.
- Structure: H2 for major sections, H3/H4 for subsections. Include a linked Table of Contents.
- Precision: preserve concrete numbers, units, part names, model numbers. Convert units when useful (imperial + metric).
- Evidence: cite minimally with inline markers like (§12) that reference the source block index if provided; short quotes ≤20 words only when necessary.
- No hallucinations. If extrapolating, mark as [inferred] and explain basis.
- Collapse duplicates; resolve contradictory claims explicitly.

WHEN INPUT IS HUGE
If content seems truncated or exceeds context:
1) Prioritize: Executive Summary → Key Specs → Decisions/Tradeoffs → Timeline → Costs → Lessons → Open Questions.
2) Deduplicate aggressively; keep unique details.
3) Note truncation explicitly at the end.

REQUIRED SECTIONS (in this order)
---
Front Matter (YAML)
- title, subtitle, tags, estimated_read_time_minutes, last_updated_utc

# Table of Contents
(autogenerated list of H2 anchors)

## TL;DR (5–9 bullets)
Concise, concrete facts and outcomes.

## Context & Scope
What this thread is about, constraints, objectives, assumptions.

## System / Build Overview
High-level architecture or project outline. One diagram-friendly list if applicable.

## Key Technical Details
Bullet list(s) of specs, dimensions, materials, tools, processes, models, part numbers, settings. Normalize units and naming.

## Problems & Constraints
Enumerate issues, blockers, edge cases. Note root causes when stated; mark unknowns.

## Solutions, Techniques & Tradeoffs
Pair each problem with adopted solutions/alternatives, with pros/cons and preconditions.

## Timeline & Progress
Chronological milestones. Table with: date/phase | action/decision | rationale | outcomes (§refs).

## Costs & Resources
Any prices, budgets, BOM items, time/people effort. Simple subtotal(s) if computable from given numbers.

## Lessons Learned
What worked, what didn’t, gotchas, decision rules. Make them portable/generalizable.

## Community Dynamics
Roles (experts/beginners), notable contributors, consensus vs. dissent, knowledge gaps.

## Patterns & Trends
Recurring themes, failure modes, best-practice patterns distilled from multiple posts.

## Notable Insights
Unique tips, counterintuitive results, rules of thumb worth highlighting.

## Open Questions & Risks
Ambiguities, pending decisions, risks with likelihood/impact if stated.

## Practical Checklist
Actionable next steps or verification list a practitioner could follow.

## Glossary
Acronyms/jargon with short definitions.

## Source Map
One compact list mapping major claims to (§#) markers; include contradictions noted.

## Red-Team Review
Short list of what might be wrong or overfit in this synthesis and what evidence would reduce uncertainty.

OUTPUT RULES
- Use Markdown only (tables allowed). No HTML.
- Do not echo the input.
- Keep numbers exact; round only when specified or obviously a display nicety.
- If two figures conflict, present both and state which has stronger support (with § markers).
- If links appear in input, include them in-context; otherwise don’t invent references.

PROCESS
1) Skim for scope, then pass to extract entities (specs, models, dimensions, prices, dates).
2) Cluster by topic; deduplicate; note contradictions.
3) Synthesize sections above; add § markers where evidence appears to originate (if indices exist in the summaries).
4) Run a final consistency pass: units, names, dates, totals, and section ordering.

=== INPUT ===
${summaries}`;

const batchRequest = {
 requests: [
   {
     custom_id: requestId,
     params: {
       model: model,
       max_tokens: 20000,
       temperature: 0.2,
       messages: [
         {
           role: "user" as const,
            content: prompt
         }
       ]
     }
   }
 ]
};

const batch = await anthropic.messages.batches.create(batchRequest);

I threw this at Opus 4.1 with a big token budget and was pretty happy with the outcome. I will also benchmark across some other providers, but I don’t expect wildly different results.

What’s next

One thing I like about my approach (and this was not accidental) is that it’s purely additive - all the original data stays on my disk, so as new models come out, I can re-run parts of the pipeline and hopefully get even better results without consuming any more bandwidth from the forum. The most obvious step to retry is the final one, but I am also eager to try redoing the individual post summarizations (or at least some of them) with open-weight self-hosted models.

One potential area of optimization would be to pull the “users” array out of the prompt and instead give the model a get_post_info(post_id) tool, but I am also not entirely convinced it would ever reach for it (beyond the OP). I am also interested in the idea of using an LLM enrichment step to pull out the distinct threads of conversation and then feeding each of those in as its own conversation rather than as part of one broader thread.
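
If I ever try the tool idea, the definition would be something simple like this (just a sketch of what I have in mind, using Anthropic’s tool format):

const tools = [
  {
    name: "get_post_info",
    description:
      "Look up the author and their user metadata (join date, location, etc.) for a post in this thread.",
    input_schema: {
      type: "object",
      properties: {
        post_id: { type: "string", description: "The post ID, e.g. \"123\"" },
      },
      required: ["post_id"],
    },
  },
];

// Passed alongside the final summarization request:
// anthropic.messages.create({ model, max_tokens, tools, messages });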

As mentioned above, I am extremely excited at the prospect of using vision models to help the LLM gain visual context from image-heavy threads like this, but I want to be careful about bandwidth consumption. One of the reasons I picked this specific thread is that Jeff had the foresight to host the images himself, on his own domain, and embed them, so in this specific case he is actually footing the bill. I will be reaching out to see if he’s comfortable with me downloading all the images for this use case - given his proclivity for technical experimentation, I get the sense he will be very on board :)

I also want to investigate using embeddings for various stages of the pipeline.

Reading The Tea Leaves

A lot of forums are barely hanging on by a thread, started many years ago by hobbyists passionate about a specific niche. Some are just as passionate as they were 10-20 years ago; many are not. The overhead required to maintain and moderate a public forum is enormous, especially with the advent of LLMs, not to mention the skyrocketing bandwidth costs. Many admins are looking for an out.

On the flip side, you have model companies, flush with VC cash, that are desperate for large swaths of text written by real humans to train on.

It would surprise me 0% to see intermediaries quietly buying up forums purely to turn around and sell the post data. In fact, it would surprise me 0% if that had already happened or is currently happening. Yes they could scrape for free, but if the cost is low enough and the value is high enough, why even bother?

EV Accounting

I recently leased a Hyundai Ioniq 5. I did not (and do not) need another car, I did not want another car, I just lowballed the hell out of a dealer and was utterly shocked when he said he’d do the deal.

It’s quite nice.

EVs have come a long way since I had my 2019 E-Golf. Public charging infrastructure has somehow gotten… worse? Which I truly did not believe was possible. But alas it’s not something I have to worry about (as much) this time around because I have a home EV charger.

I went with an Emporia mostly because I already have their whole-home energy monitoring and figured it would tie in nicely. It was also $120 cheaper than the ChargePoint, which helped. These things are remarkably simple, so it seemed dumb to spend more than the bare minimum.

I have it configured to only charge during my utility’s off-peak hours so the cost is absurdly low- less than 1/10th the cost to drive the equivalent distance in my Subaru.

Alas, it’s not nothing. And therefore it must be accounted for. With gas it’s easy: every time I swipe my credit card, I categorize the transaction as “Gas- {car}” in YNAB, and at the end of the month I get a nice tally of how much I spent on gas per vehicle in my fleet.

With EVs, it’s part of the electric bill, and it’s not easily separated out by the utility company (they don’t know if you used your power to charge your EV or to run your oven).

I already have a script that extracts the line items from my utility company’s PDF bills and creates split transactions, so I figured I could simply use the Emporia API to figure out how much I spent on EV charging in that time period and add another line item.

Turns out Emporia doesn’t have an API. Or at least, they don’t have a public one. Luckily, somebody has already done the heavy lifting, which made adding the EV Charging Report endpoint quite easy.

Now every time I run my script, it will reach out to the Emporia API, ask “how much of this electric bill was EV charging?”, and properly categorize it.


# `vue` is the authenticated Emporia client from the unofficial library mentioned
# above, extended with my EV Charging Report endpoint. EVSE_ID and the bill dates
# come from the rest of the script.
report = vue.get_ev_charging_report(EVSE_ID, billStartDate, billEndDate)
total_cost = report.charging_cost

# e.g. "123.45 kWh", used as the memo on the split transaction
description = f"{report.energy_kwhs:.2f} kWh"
What will I do with this information? Absolutely nothing.

Early Career Advice From John Siracusa

If you just graduated with a computer science degree and you have any interest whatsoever in being involved in any kind of startup, or being one of a few people in a small company with a lot of responsibility, now is the time to do it - when you’re young.

If you don’t have any interest in that, don’t do it - don’t do it just because you think it’s what people do. But if you’re like, “Oh, I always wanted to be one of five programmers working on a project” or something, “I always wanted to be involved in a startup” or whatever, now is the time to do it.

It will only get harder to do that later. Having a job like that early in your career, where you were one of a small number of people, will force you to learn how to do a whole bunch of stuff, and that will make you a much more valuable employee when you get tired of the startup world or when you want to go to a company that’s not going to go under. Later, you will have so much more real-world experience and knowledge than people who went to work for IBM right out of school - to throw IBM under the bus - or went to work for some big company.

When you go to a big company, you do learn things on the job, but it’s a much more stable environment in terms of what’s expected of you. You don’t go in there and suddenly have seventeen jobs and have to learn them all now. It’s going to be more sustainably paced, let’s say, but you will learn less; it will take you longer, and your skills will be more narrow because there are 75,000 other people who do their own specialized jobs.

That’s not to say, “Do startups when you’re young because that’s the time to burn you out”. You shouldn’t have burnout even when you’re young. What I am saying is that if you are in a company with a small number of people, it can still be a healthy work environment, and you will still be required to learn how to do way more jobs simply because there just aren’t enough people. Someone’s got to figure out how to administer this Linux machine - congratulations, you’re the sysadmin now. Someone’s got to learn this new API or this new language - congratulations, it’s you and one other person.

You will learn so much and you will be battle-tested. And when your company inevitably goes under - because that’s what happens to most startups - when you apply for that job at Apple or Google or whatever, you should look a lot better than the other candidates because you will literally know how to do more stuff.

- John Siracusa on ATP episode 647 ▸ 1:34:07

Great discussion from the ATP folks, highly recommend listening to the entire segment.

What Makes A Good Manager

A good manager balances high empathy with high expectations and knows when to pull which lever.

- George Mandis

Unit Tests In Markdown

I was reading this great post comparing Pyrefly and Ty and the appendix stood out-

I just wanted to call out that ty’s tests are written in… MARKDOWN! How cool is that?

- Edward Li

Very cool indeed!

I did a bit of digging and found the Readme as well.

Engineering Advice From SuperfastMatt

As you work your way through an engineering career, you will meet people who are so good at their job, they will be able to tell you if a design will work or not just by looking at it.

And then, as you progress further, you’ll realize that these people are terrible engineers.

- SuperfastMatt

A closer look at the details behind the Go port of the TypeScript compiler

Set VS Code as Default Program for All Code Files (Mac)

Every goddamn time I set up a new Mac, I have to dig around to find that one Terminal CLI command that sets default programs (it’s called duti). Then I loop through every file type I can think of and manually execute the command to change it. And every time, I forget an extension and end up cursed—accidentally opening Xcode for that file type for the rest of time. And every time, I swear next time will be the time I automate this.

That time has finally come.
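
The shape of the automation is dead simple: loop over a list of extensions and hand each one to duti. A minimal sketch (the bundle ID and extension list here are just examples):

import { execFileSync } from "node:child_process";

// Visual Studio Code's bundle identifier.
const VSCODE = "com.microsoft.VSCode";

// Whatever extensions you actually open in an editor.
const extensions = [".ts", ".js", ".json", ".py", ".go", ".md", ".sh", ".yml", ".toml"];

for (const ext of extensions) {
  // Equivalent to running: duti -s com.microsoft.VSCode .ts all
  execFileSync("duti", ["-s", VSCODE, ext, "all"], { stdio: "inherit" });
}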

H3: Uber’s Hexagonal Hierarchical Spatial Index

H3 indexes points and shapes into a hexagonal grid. Coordinates can be indexed to cell IDs that each represent a unique cell.

Indexed data can be quickly joined across disparate datasets and aggregated at different levels of precision.

H3 enables a range of algorithms and optimizations based on the grid, including nearest neighbors, shortest path, gradient smoothing, and more.

- h3geo.org

Only 8 years late to the draw on this one.

Very neat, and looks like it has continued to evolve over time as well.

Hat tip to Simon Willison.
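
To get a feel for the API, here’s a minimal sketch using the h3-js bindings (v4 function names; the coordinates are just an example point in San Francisco):

import { latLngToCell, gridDisk, cellToBoundary } from "h3-js";

// Index a lat/lng into a resolution-9 hexagon (cells of roughly 0.1 km²).
const cell = latLngToCell(37.7749, -122.4194, 9);

// The ring of neighboring cells around it, handy for nearest-neighbor style joins.
const neighbors = gridDisk(cell, 1);

// The hexagon's boundary as [lat, lng] pairs.
const boundary = cellToBoundary(cell);

console.log(cell, neighbors.length, boundary.length);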

Comparing LLMs For <img> alt text

In an earlier post I mentioned that I might want to make LLM alt tag generation part of my build process. I went down that rabbit hole this evening- more to come on that in a future post- but I came across this delightful post- as well as his followup with findings after running his 9k images through- and figured I should share.

Scotty Peeler Label & Sticker Remover

Bought this many years ago thinking I would use it once or twice. I use it once or twice a week if not more. You don’t realize how much you rely on this thing until you go to a friend’s house and they don’t have one. Extremely high ROI purchase. Not an ad, not an affiliate link, I have no relationship with the company beyond spending $8 on one of their products once.

uv: It's Really Good

Basically, they took what was working in pip, rye and poetry, and discarded all the stuff that didn’t work. Then they spent months killing tickets to bring it to an insane level of quality.

This cannot be understated, as such a level of quality and dedication is so extremely rare in software that I usually associate it with things like VLC or sqlite. This is the league I consider uv in.

Always try uv first. If it doesn’t work (which is very rare), go back to what you did before or find a workaround.

If you haven’t played with uv yet, I can’t recommend it enough. Python ecosystem tools come and go but this one feels like it’s got some staying power. Lots of tremendously well thought-out decisions, many of which are laid out in the linked article. No magic, just good abstractions. Such a joy.

I will say I am somewhat confused about who is paying for all of this. Like obviously VCs but uh… why? Definitely not out of the goodness of their hearts. Similar uneasiness around Bun, which I also love.

Harper Reed's LLM codegen workflow

For some reason I say “over my skis” a lot when talking about LLMs. I don’t know why. It resonates with me. Maybe it’s because it is beautiful smooth powder skiing, and then all of a sudden you are like “WHAT THE FUCK IS GOING ON!” and are completely lost and suddenly fall off a cliff.

I find that using a planning step […] can help keep things under control. At least you will have a doc you can double-check against. I also do believe that testing is helpful - especially if you are doing wild style aider coding. Helps keep things good, and tight.

Broad strokes, this is very similar to my workflow, but there are lots of nuggets of wisdom in this post. He also goes into a lot of interesting detail about using LLMs in non-Greenfield projects. Just a great read all around.

First time hearing about Aider and repomix, excited to try them out.

Where The Linear Chat Paradigm Starts To Fall Apart

Expanding on this comment from my last post:

I have yet to find a UI that lets me tag a chunk of response (eg a specific bullet point) to come back to, or what I really want which is a waterfall of the different conversations that split out of a long response. Typically I want to respond and ask more (or provide more detail) about 2-5 bullet points but in a purely linear structure I’m constantly scrolling back up and trying to remember things I want to loop back and ask about.

I put together a (totally fake) example diagram to illustrate what I mean. Flattened into text, the branching conversation looks roughly like this:

Human: How can I improve this code?

LLM:

- Consistent Error Handling and Wrapping
- Context-Aware Concurrency
- More Idiomatic Data Structures
- Reduce Nesting & Improve Legibility
- Centralized Logging
- Test Coverage & Benchmarks
- Optional: Configurable Retries / Intervals

The branches that split off those bullets, flattened (my turns in quotes):

- "How do you suggest we solve?" → creates its own logger → "Can you just use slog?" → uses slog → "Okay, apply & can you check the whole codebase?" → finds ~30% → "Okay fix; any more?" → finds another ~20% → "Okay fix, I'll do the rest with find-and-replace"
- "Why go beyond Go's built-in primitives?" → explanation → "Hm, makes sense—apply" → Client in client.go (code) → "Try pseudocode?" → {code} → "Great, apply it"
- "Which data structures?" → bindingManager in internal/client/binding.go → "Great, apply it. Any others?" → TransactionMap in internal/client/transaction_manager.go → "Great, apply it"
- Intentional; idiomatic Go
- (No changes needed)
- Duplicate; ignore
- "Expand on the benchmarks" → big verbose response → "It's fine, we use an external test suite"
- "Expand on test coverage" → Transaction Testing in internal/client/transaction.go → "Is that the right place for the test?" → No, (fix) → "Great, apply it" → Binding Manager Testing in internal/client/binding_test.go → "Can we collocate these?" → Yes, {code} → "Great, apply it"
- "Are these retries/intervals RFC-defined or other standard?" → RFC 5766, 8656, 6062, 8489 → "Okay, they're standardized, no need to make them configurable"

As we get further into the “reasoning era” I think this problem will only become more pronounced. It’s surprising to me that none of the major LLM providers have explored branching logic (at least to my knowledge).

The current solution of “scroll back up and reference earlier in the conversation” falls apart as soon as you get past a few messages. You almost need a mechanism that says “pick back up with my state from here”.

More to come…

How I Quickly Build Complex Side Projects With LLMs (in Feb 2025)

I’m a programmer both professionally and, somewhat begrudgingly, as a hobby. I am constantly building dumb hyper-specific side projects for an audience of 1. I have a shell script that orders a coffee at my local coffee drive-through. I have a Chrome extension that scrapes and syncs my Amazon purchases to YNAB. I have a neural network constantly checking the IP cam in my garage on trash day to make sure I took the trash out. These are things I build- almost compulsively- upon encountering the tiniest annoyance or inconvenience. And it’s been that way for as long as I can remember.

LLMs have hypercharged this. At first I was able to do projects in 1 or 2 iterations with a chatbot, but as these things’ capabilities have grown, so has my appetite. I’m building dumber projects at grander scales than I’ve ever built before. I figured I should share my approach and see how it compares to others.

This workflow particularly shines for data-heavy automation projects that would normally be tedious to build and maintain. Some examples I’ve built using this approach:


Step 1: Lay out the requirements

This stage usually happens in a chatbot UI, and I’m generally reaching for Sonnet or 4o unless the project has complex data parsing/structures, at which point I’m usually going for o1.

First specify that the goal of the conversation is to create a comprehensive bullet point list of requirements “in the style of a Jira ticket”. Next specify that you want to have a conversation about the problem and the requirements prior to generating the list- ask it to be extremely detailed about its requirement gathering and verbose in its responses. Instruct it to remain language-agnostic, reaching for pseudocode where necessary. Finally, the fun part… just word-dump the problem. Explain what you want to do in plain English. If you have example input data, provide it. If you have a HAR file from a reverse-engineered API, (turn it into Markdown and) provide it. API docs? Provide them.

I have yet to find a UI that lets me tag a chunk of response (eg a specific bullet point) to come back to, or what I really want which is a waterfall of the different conversations that split out of a long response. Typically I want to respond and ask more (or provide more detail) about 2-5 bullet points but in a purely linear structure I’m constantly scrolling back up and trying to remember things I want to loop back and ask about.

At the end you’ll get a nice big text blob with requirements. Save this as a markdown doc - initial-reqs.md. Assume that this requirements list is missing about 30% of what you think you told it. You also cannot assume any logic whatsoever for what it decides to drop and at what part of the process it decides to drop it. But it’s okay! We have a solution…

Step 2: Do it again

Usually still in a chatbot UI, usually reaching for a “smarter” model from a different provider. So if I used Sonnet in the last step, this time I’m probably going for o1. If I used a GPT model, I’m reaching for Gemini 2.0. I have not seen any evidence that this actually matters, but anecdotally it feels right, so ¯\_(ツ)_/¯. I’m using llm more and more for this step.

First specify that the goal is to find any gaps in the requirements document “so that a talented but new Junior engineer can complete it without interruption”. Specify that you’re writing Python (even if the end goal is not Python- these things seem to be trained on a lot of Python… and it’s very easy to iterate on). Specify that you require small, well-defined methods with verbose commentary. Ask an open-ended question like “am I missing anything?” and answer any questions you feel are relevant. At the end, dump that entire conversation to a markdown file - detailed-reqs.md.

Step 3: Unit tests!

I’m generally in an “LLM-native” code editor like Cursor, VS Code with GH Copilot, or Windsurf for this step. I can’t say there is any distinct pattern of which model I’m reaching for- in fact I’m generally hot swapping between them just for funsies.

I’ve found I generally ask for the intended output language at this point. For me that’s either Python (I need to iterate on this quickly and run it once), Node.js (I need to run this on a regular basis and can quickly deal with it when something breaks), or Go (I need this to run exactly the same today as it will run 5 years from now). However on a few projects- specifically one parsing really complex CSV structures using Go and another transcoding G.711->Opus with libopus- I found the models choked a bit and got stuck in a loop writing code that would never execute. The trick is to ask it to write the unit tests and code in Python and then ask it to turn that code into Go/Swift/whatever. For the sake of a (marginally) interesting post, I’m going to do that.

Specify that you want a comprehensive set of unit tests that check for conformity against the specifications attached. Attach both initial-reqs.md and detailed-reqs.md. Ask it to include excerpts of the requirements alongside the relevant tests.

This is the part of the process that requires the most brainpower. You need to go through all of the generated tests and ensure they

  1. Make sense
  2. Are comprehensive enough
  3. Cover every iota of intended functionality

This is made even easier by the ability to chat with the models about specific individual tests. I am finding that I generally 3-5x each test size from the initial generation. This may be out of habit but it’s likely because it’s so goddamn easy.

Make sure you’re breaking tests into individual files; I have completely unscientifically settled on <1k lines/file. This is really where an LLM-native editor starts to show its perks vs copying and pasting from a chatbot UI.

It is absolutely crucial that you’re reviewing the diffs on a line-by-line basis rather than just blindly accepting the “full file” results. The LLMs will drop comments to an almost comical degree- even for lines they haven’t touched the logic on. This is probably the biggest flaw of these models (or rather, the “code apply” models the IDEs are using) at the moment.

By the end of this step you want a comprehensive test suite. It’s on you, the human, to make sure this test suite actually encompasses everything you want your final program to do. You won’t (or at least I don’t) but you should try really really really hard to. Actually read every single one and question if it makes sense. This is literally the first time in the process that this is happening, so it’s critical that you take your time here.

Step 4: Write the code

If you did your job right in the previous steps, the model should be able to generate code that gets to the expected results in 1-3 steps. If you did not do a good job, it has probably quickly become clear to you what gaps exist in your unit tests. Do not fix the code. Fix the unit tests. It is extremely tempting to be like “oh duh I just forgot to tell it Fahrenheit instead of Celsius, if I correct it I’ll fix this one remaining bug with the program and be done.” It’s tempting because it’s probably true! For simple enough corrections, the LLM will get it right on the first iteration. But for more complex implementations, you may need to cycle back and forth for many messages in order to get it to do what you want in the way you want. These conversations sometimes go totally off the rails into a loop where it’s just suggesting the same 3 solutions that will never work over and over and over again. Those conversations are where the unit tests really shine, because you can close them and start fresh ones with zero fear of losing context. You’re only losing conversation-time context, which in this case is actually quite useful to shed. I have yet to encounter a problem I could not solve in this way- although it does sometimes require several attempts to guide the model through the exact implementation path I want. I always, always “save the state” of the requirements back into the unit tests.

Even if you get the thing to run in a way that seems like it’s working, it’s very possible some (or even many) unit tests will fail. Again, that’s the beauty of unit tests. Now you’ve just gotta run through them and figure out if it’s the code that’s wrong, the unit test that’s wrong, or both- and then fix. I find the cycle time on this to be fast enough (and fun enough) that I haven’t tried anything “agentic” to automate this process beyond a little bit of experimentation with Block’s Goose.

I cannot hammer this home enough… the code is the byproduct of the unit tests. In 6 months or 2 years or 10 years, when the languages change or the dependencies change or your tastes change, you can simply ask the most cutting-edge model of the time to write you new code against those unit tests, and then you can objectively evaluate (in a single shot) whether it did. Sure sounds like a practical implementation of those “reward functions” I hear the thinkfluencers talking about.

Step 5: Write a Readme

I’m not actually sure if this helps the LLMs at all, but it definitely helps me when I’ve gotta go back and figure out what the hell I built and why I built it that way.

I will usually feed both the initial-reqs.md and detailed-reqs.md and ask it something along the lines of “Generate a comprehensive Readme with sections outlining goals, data structures, dependencies, how to run locally, how to run unit tests” which pretty much any model can do in one shot. Sometimes I’m feeding the unit tests and the code in (which definitely improves the quality of the Readme) but sometimes I’m limited by context size. On my next project I want to try this stage with Gemini 2.0 Pro’s insane 2M context window.


Please email me with feedback and ideas! I am excited to see how this process evolves (and likely continues to get simplified by better tools and better models) over time.

LLM Trick: Improve Project Directory Structure With tree


❯ tree -I 'node_modules|dist' | llm "How can I improve this project \
structure to follow best practices in 2025? Be verbose but format \
your response so that each line is less than 80 characters. Use as \
many newlines as necessary."

Improving your project structure involves adhering to best practices for organization,
maintainability, scalability, and clarity. Below are recommendations to enhance your
existing structure in line with current trends and methodologies.

### 1. Directory Organization

#### **Logical Grouping**
- Ensure that files are organized based on their functionality or purpose.
- Keep components, layouts, and utilities grouped logically and consistently.

#### **Feature-based Structure**
- Consider organizing files based on features or modules.
- For example, group related pages, components, and styles together.

#### **Example Structure**
src
├── features
│   ├── blog
│   │   ├── components
│   │   ├── pages
│   │   ├── utils
│   │   └── styles
│   └── about
│       ├── components
│       ├── pages
│       └── styles

### 2. Dedicated Folder for Hooks and Contexts

- If using React, create a dedicated folder for hooks and context providers. This
  enhances maintainability and encourages reusability.

src
├── hooks
│   ├── useAuth.ts
│   └── useFetch.ts
└── contexts
    ├── AuthContext.tsx
    └── ThemeContext.tsx

### 3. Asset Management

#### **Centralized Assets Folder**
- Move all assets (images, videos, fonts) into a designated directory for clarity.

src
├── assets
│   ├── images
│   ├── videos
│   └── fonts

### 4. Naming Conventions

- Use consistent naming conventions for files and directories. Opt for lowercase
  with hyphens for filenames (e.g., `header.astro`, `post-list.astro`).

- Ensure that folder names clearly describe the contents or purpose (e.g., use
  `components` instead of `component`).

### 5. Documentation

- Update the `README.md` to contain information on project setup,
  contributing guidelines, and architecture overview.

- Consider adding a **Contributing** folder to document coding standards, PR
  guidelines, and issue templates.

### 6. Typescript Organization

- Organize TypeScript files into dedicated folders depending on whether they
  are types, interfaces, or enums, enhancing structure and discoverability.

src
├── types
│   ├── global.d.ts
│   └── post.ts
└── interfaces
    └── user.ts

### 7. Tests Directory

- If not already present, create a `tests` directory to contain all test-related
  files, enhancing testing organization.

tests
├── unit
└── integration

### 8. Configuration Management

- Keep configuration files (like `tsconfig.json`) in a clearly labeled folder,
  perhaps under a `config` directory.

### 9. Enhance Version Control

- Consider adding `.gitignore` and `.editorconfig` files if not present
  to manage ignored files and maintain coding styles.

### 10. Update `public` folder organization

- Consolidate assets in the `public` directory and structure them for easy access.

public
├── assets
│   ├── images
│   └── styles
└── favicon.ico

### Conclusion

By adopting these organizational strategies, your project structure will be more
robust, easier to navigate, and maintainable as it grows over time. Furthermore,
these best practices align with modern development trends, fostering a better
developer experience and improving collaboration within teams.

Samuel Colvin On AI Abstractions

If you’re running a customer service business and you have loads of people sitting answering telephones, the less well trained they are, the less that you trust them, the more that you need to give them a script to go through. […] If you’re doing high net worth banking, you just employ people who you think are going to be charming to other rich people and set them off to go and have coffee with people. […] And the same is true of models. The more intelligent they are, the less we need to tell them, like structure what they go and do and constrain the routes in which they take.

If models are getting faster as quickly as you say they are, then we don’t need agents and we don’t really need any of these abstraction layers. We can just give our model […] access to the internet, cross our fingers and hope for the best. Agents, agent frameworks, graphs, all of this stuff is basically making up for the fact that right now the models are not that clever.

Samuel Colvin [~00:26:32]

One of many great tidbits from Samuel in this podcast.

I am generally not a fan of Python (in favor of the clearly far superior JavaScript) but I am a superfan of Pydantic. I was thrilled when Pydantic AI was announced and have continued to follow its developments and iterations closely. I have a strong feeling it will continue to define mental models in the AI SDK space for many years to come.

[Video] Sidewalk Chalk Robot

Taylorator: Flood the FM Broadcast Band with Taylor Swift

Anomalous Tokens in DeepSeek-V3 and r1

Fascinating. Only a matter of time before somebody writes an LLM fuzzer.