Extracting Value From Forums With LLMs
I have so many forum threads in my bookmarks that I desperately want to consume, but when I sit down to read them, I very quickly get overwhelmed by crosstalk, multiple threads of conversation, weird forum conventions, etc. I know there are nuggets of genuine brilliance buried in these things, and my TikTok-addled attention span is incapable of filtering through to it.
And then it hit me… this would be a great use case for LLMs! In fact, it would not surprise me in the slightest if forum data was a large part of the training data used for these models in the first place.
I decided to pick a thread that has been on my list for years, Jeff Sponaugle’s INSANE Tony Stark lair build, but I approached the problem generically enough that I should be able to apply the approach to other forum threads.
Step 1: Scrape the thread
I wasn’t super comfortable with the idea of fully automating data collection from forums since many are maintained by hobbyists who shoulder the cost themselves and are already getting slammed by AI scrapers. I instead chose a compromise: I turned off images in my browser to minimize bandwidth consumption and manually paged through the thread, saving each page as HTML and clicking a few ads along the way. I think feeding images into vision models would be a fascinating next step, but it’s not something I’m comfortable doing without explicit opt-in from the forum owners.
Once I had my directory of 22 HTML pages, I needed to extract them into structured data. I initially started vibe-coding a parser myself, but very quickly discovered that the HTML structures were downright horrendous; luckily, I discovered forumscraper, which helpfully dumped everything to a 27,288-line (590,618-token) JSON file.
Step 2: Token Reduction
Luckily, much of the data is unnecessary for our use case, and much of the necessary data is quite repetitive.
The first thing that stood out to me was the repeated user objects. If someone posted in a thread 100 times, there were 100 user objects, each with sub-objects for arbitrary key-value pairs like Location and Join date. This is hundreds of thousands of characters that can be condensed.
Additionally, I noticed there was still quite a bit of HTML markup scattered in the text: paragraph tags, img tags with lots of attributes, and weird handling of quotes. I also discovered there were four-plus different ways a blockquote/reply could be formatted, presumably depending on the poster’s client.
I split the output into two arrays, posts and users; a sample of the condensed output follows the schemas below.
The post schema is:
- id (int)
- user (string)
- date (iso 8601 string)
- text (string)
The user schema is:
- user_id (int)
- user (string)
- user_link (string)
- join date (string)
- location (string)
- post_count (int)
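For illustration, the condensed output (eventually written out as YAML, more on that below) looks roughly like this; the values here are made up rather than pulled from the actual thread:
posts:
  - id: 1
    user: example_user
    date: "2016-03-14T20:15:00Z"
    text: "Plaintext of the post, with quoted replies flattened inline."
users:
  - user_id: 42
    user: example_user
    user_link: "https://forum.example.com/members/example_user.42/"
    join date: "Jan 2015"
    location: "Pacific Northwest, USA"
    post_count: 1234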
With Claude’s assistance, I was able to whip up this horrific set of regex replaces to turn all known blockquote/reply formats into plaintext:
// Handle XenForo blockquotes: extract data-quote (author) and inner text content.
// Note: [\s\S] matches across newlines, unlike "."
s = s.replace(/<blockquote[^>]*data-quote=\"([^\"]+)\"[^>]*>([\s\S]*?)<\/blockquote>/gi, (_m, author: string, inner: string) => {
  // Prefer content within expandContent if present
  const matchExpand = inner.match(/<div[^>]*class=\"[^\"]*bbCodeBlock-expandContent[^\"]*\"[^>]*>([\s\S]*?)<\/div>/i);
  let contentHtml = matchExpand ? matchExpand[1] : inner;
  // Drop any "Click to expand..." links
  contentHtml = contentHtml.replace(/<div[^>]*class=\"[^\"]*bbCodeBlock-expandLink[^\"]*\"[\s\S]*?<\/div>/gi, "");
  // Convert br to newlines early
  contentHtml = contentHtml.replace(/<br\s*\/?\s*>/gi, "\n");
  // Convert images to alt + URL when available
  contentHtml = contentHtml.replace(/<img[^>]*>/gi, (imgTag: string) => {
    const altMatch = imgTag.match(/\balt=\"([^\"]*)\"/i);
    const srcMatch = imgTag.match(/\b(?:src|data-src|data-url)=\"([^\"]+)\"/i);
    const alt = altMatch ? altMatch[1] : "";
    const src = srcMatch ? srcMatch[1] : "";
    if (src) return alt ? `${alt} (${src})` : src;
    return alt || "[image]";
  });
  // Convert anchors to their text
  contentHtml = contentHtml.replace(/<a[^>]*>([\s\S]*?)<\/a>/gi, "$1");
  // Strip remaining HTML
  let contentText = contentHtml.replace(/<[^>]*>/g, "");
  // Decode common entities (decode &amp; last to avoid double-decoding)
  contentText = contentText
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&#0?39;/g, "'")
    .replace(/&amp;/g, "&")
    .trim();
  // Format clearly as a reply attribution for LLMs:
  // 1) Explicit in-reply-to header
  // 2) Markdown-style blockquote for quoted text
  const quotedBlock = contentText
    .split("\n")
    .map(line => `> ${line}`)
    .join("\n");
  return `[in-reply-to: ${author}]\n${quotedBlock}\n`;
});
I also wrote everything out as YAML rather than JSON, which provides more free token savings.
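For reference, the JSON-to-YAML conversion itself is basically a one-liner; a sketch using js-yaml (any serializer would do):
import { writeFileSync } from "node:fs";
import yaml from "js-yaml";
// Serialize the condensed arrays as YAML; lineWidth: -1 disables line wrapping
// so long post bodies aren't folded.
writeFileSync("thread.yml", yaml.dump({ posts, users }, { lineWidth: -1 }));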
Down to 190,147 tokens, a 67.8% reduction! That’s technically within the 200K context window the Claude models support, meaning we could theoretically feed the entire doc in as a single prompt. However, I have anecdotally found that models do much better when you give them less noise, even the 1M-context models.
Step 3: Reducing The Noise
Browsing the YAML file, it pretty quickly became apparent that many of the posts were providing literally zero value to the thread. Stuff like “very cool! subscribed!”, cross-talk about irrelevant topics, meta commentary on the forum and its members, etc. Which, duh, is how we ended up here in the first place.
We could probably write a deterministic filter to get most of these, but that’s boring. Let’s use an LLM!
I vibe-coded another script:
const result = await anthropic.messages.create({
  model: DUMB_MODEL,
  max_tokens: 5,
  temperature: 0,
  system: [
    {
      type: "text",
      text: `You are a forum summarizer's gatekeeper. Your single job is to prevent your boss from wasting time on posts that add zero benefit to understanding the thread. Decide whether this post is PURELY filler with no informational value for comprehension of the topic.
Return ONLY one word with no other text:
- "FILTER" if the post is pure filler that should be filtered out
- "KEEP" if the post should be kept (has any informational value)
Filter out (return "FILTER") ONLY if the post offers no signal that improves understanding:
- thanks/agree/like/emoji-only/cheers
- following/bump/subscribed
- off-topic small talk or jokes with no content
- signature-only or attachments-only
- quote-only with no added commentary
- one-liners with no new info, no clarification, no data, no question
Keep (return "KEEP") if the post contains ANY non-trivial detail, clarification, question, correction, measurement, trade-off, hypothesis, link WITH context, or actionable pointer. When uncertain, return "KEEP".`,
      cache_control: { type: "ephemeral" }
    }
  ],
  messages: [{
    role: "user",
    content: `User: ${post.user || 'Unknown'}
Post content:\n${post.text}`
  }]
});
I’ll be completely honest: I’m not totally sure I trust the whole “max_tokens: 5 and ask for a single-word response” approach. I somewhat suspect you would get better results by asking the model to include a reason for its decision and paying the marginally higher cost for a few more output tokens, even if you then throw that info away. But I have seen people literally ask for “true/false” (which fundamentally misunderstands the strengths and weaknesses of a language model) and they seem to get fine results? Ultimately, the cost to run this job as a batch request on an older model was so laughably cheap that if I were to do it again, I would not worry at all about clamping down on output tokens.
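Applying the verdicts is then just a filter over the posts array plus a prune of the users array; a minimal sketch, assuming the one-word replies were collected into a verdicts map keyed by post id:
// Sketch: `verdicts` maps post id -> "KEEP" | "FILTER", populated from each
// gatekeeper response (e.g. result.content[0].type === "text" ? result.content[0].text.trim() : "KEEP")
const keptPosts = posts.filter(post => verdicts.get(post.id) !== "FILTER");
// Posters with no surviving posts can be dropped from the users array too.
const activeUsers = new Set(keptPosts.map(post => post.user));
const keptUsers = users.filter(u => activeUsers.has(u.user));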
Anyway… 37.6% of posts were removed in my test thread, which also dropped several posters from our users array. Resulting in a new total of… 156,858 tokens.
Step 4: More LLMs
At this point we could probably feed the YAML doc straight into an LLM and call it a day. However, replies pose a bit of a conundrum: they require the model to keep track of multiple (sometimes overlapping) trains of thought, which (understandably) trips up the reasoning and can frequently lead to false conclusions. There are also still a lot of tokens wasted on filler. So let’s feed the posts into another LLM!
const requests = posts.map(post => {
  const sentenceCount = countSentences(post.text);
  const selectedModel = sentenceCount > 3 ? SMART_MODEL : DUMB_MODEL;
  const { targetSentences, maxTokens } = calculateSummaryLength(sentenceCount);
  return {
    custom_id: `post_${post.id}`,
    params: {
      model: selectedModel,
      max_tokens: maxTokens,
      system: [
        {
          type: "text",
          text: `You are an assistant that extracts the full substance of a forum post (including any inline replies). Put a specific focus on insights or findings.
Your job is to produce a structured, detailed summary that captures every meaningful point while omitting pleasantries, signatures, or generic forum chatter.
Guidelines:
- Identify the main topic or question.
- List all technical details, arguments, experiences, or data points mentioned.
- Include any advice, solutions, or proposed actions.
- Capture disagreements or alternative viewpoints in replies.
- Ignore filler (e.g., "thanks," "great post," "following," jokes).
- Preserve nuance: if there are uncertainties, caveats, or edge cases, state them explicitly.
- Use concise prose or bullet points as appropriate, but ensure nothing substantive is lost.
Output should feel like a detailed set of notes a researcher would take, not a casual recap.`,
          cache_control: { type: "ephemeral" }
        }
      ],
      messages: [{
        role: "user",
        content: `Target length: Aim for approximately ${targetSentences} sentences (this post has ${sentenceCount} sentences originally).
User: ${post.user}
Post content:
${post.text}`
      }]
    }
  };
});
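The countSentences and calculateSummaryLength helpers aren’t shown above; they’re simple heuristics. Something in this spirit, with illustrative thresholds:
// Illustrative heuristics, not the exact originals.
function countSentences(text: string): number {
  // Rough count: split on terminal punctuation followed by whitespace or end of string.
  return text.split(/[.!?]+(?:\s+|$)/).filter(s => s.trim().length > 0).length;
}
function calculateSummaryLength(sentenceCount: number): { targetSentences: number; maxTokens: number } {
  // Aim for roughly a third of the original length, clamped to a sane range,
  // and budget ~40 output tokens per summary sentence plus some headroom.
  const targetSentences = Math.min(10, Math.max(1, Math.ceil(sentenceCount / 3)));
  return { targetSentences, maxTokens: targetSentences * 40 + 100 };
}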
An Aside On Batch Mode
I think people are criminally underusing the batch mode that all the providers have. Unless your use case is a literal realtime back-and-forth chatbot, you should be reaching for batch mode by default. When using the Anthropic API, I pretty consistently see requests processed in under 10 minutes and don’t think I’ve ever seen them take longer than an hour. Counterintuitively, it seems larger batches process faster than multiple small batches run in parallel. And then finally, and I have absolutely nothing to back this up… I think batch mode performs better for the same prompt against the same model. If they’re quantizing or degrading performance under load on API requests, they don’t appear to be doing that in batch requests.
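For reference, wiring the per-post requests from Step 4 into a batch with the Anthropic TypeScript SDK looks roughly like this (a simplified sketch: the polling interval and result handling are arbitrary, and requests is the array built above):
import Anthropic from "@anthropic-ai/sdk";
const anthropic = new Anthropic();
// Submit every per-post request in one shot.
const batch = await anthropic.messages.batches.create({ requests });
// Poll until the batch finishes processing.
let status = batch;
while (status.processing_status === "in_progress") {
  await new Promise(resolve => setTimeout(resolve, 60_000)); // check once a minute
  status = await anthropic.messages.batches.retrieve(batch.id);
}
// Results stream back as JSONL entries keyed by custom_id.
const postSummaries: Record<string, string> = {};
for await (const entry of await anthropic.messages.batches.results(batch.id)) {
  if (entry.result.type === "succeeded") {
    const block = entry.result.message.content[0];
    if (block?.type === "text") postSummaries[entry.custom_id] = block.text;
  }
}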
Step 5: Final Summary
And now for the moment we’ve all been waiting for. Down to a measly 67,808 tokens, ready for analysis and summary.
const prompt = `ROLE
You are a meticulous research analyst and editor. Transform a MASSIVE multi-post forum thread (provided as concatenated summaries) into a single, long-form, publication-ready blog post.
OUTPUT GOAL
Produce a logically organized Markdown article with clear headings, deep synthesis (not just summary), and evidence-aware claims. Prioritize correctness and internal consistency.
STYLE
- Voice: clear, journalistic, neutral-to-practical. No forum niceties or chit-chat.
- Structure: H2 for major sections, H3/H4 for subsections. Include a linked Table of Contents.
- Precision: preserve concrete numbers, units, part names, model numbers. Convert units when useful (imperial + metric).
- Evidence: cite minimally with inline markers like (§12) that reference the source block index if provided; short quotes ≤20 words only when necessary.
- No hallucinations. If extrapolating, mark as [inferred] and explain basis.
- Collapse duplicates; resolve contradictory claims explicitly.
WHEN INPUT IS HUGE
If content seems truncated or exceeds context:
1) Prioritize: Executive Summary → Key Specs → Decisions/Tradeoffs → Timeline → Costs → Lessons → Open Questions.
2) Deduplicate aggressively; keep unique details.
3) Note truncation explicitly at the end.
REQUIRED SECTIONS (in this order)
---
Front Matter (YAML)
- title, subtitle, tags, estimated_read_time_minutes, last_updated_utc
# Table of Contents
(autogenerated list of H2 anchors)
## TL;DR (5–9 bullets)
Concise, concrete facts and outcomes.
## Context & Scope
What this thread is about, constraints, objectives, assumptions.
## System / Build Overview
High-level architecture or project outline. One diagram-friendly list if applicable.
## Key Technical Details
Bullet list(s) of specs, dimensions, materials, tools, processes, models, part numbers, settings. Normalize units and naming.
## Problems & Constraints
Enumerate issues, blockers, edge cases. Note root causes when stated; mark unknowns.
## Solutions, Techniques & Tradeoffs
Pair each problem with adopted solutions/alternatives, with pros/cons and preconditions.
## Timeline & Progress
Chronological milestones. Table with: date/phase | action/decision | rationale | outcomes (§refs).
## Costs & Resources
Any prices, budgets, BOM items, time/people effort. Simple subtotal(s) if computable from given numbers.
## Lessons Learned
What worked, what didn’t, gotchas, decision rules. Make them portable/generalizable.
## Community Dynamics
Roles (experts/beginners), notable contributors, consensus vs. dissent, knowledge gaps.
## Patterns & Trends
Recurring themes, failure modes, best-practice patterns distilled from multiple posts.
## Notable Insights
Unique tips, counterintuitive results, rules of thumb worth highlighting.
## Open Questions & Risks
Ambiguities, pending decisions, risks with likelihood/impact if stated.
## Practical Checklist
Actionable next steps or verification list a practitioner could follow.
## Glossary
Acronyms/jargon with short definitions.
## Source Map
One compact list mapping major claims to (§#) markers; include contradictions noted.
## Red-Team Review
Short list of what might be wrong or overfit in this synthesis and what evidence would reduce uncertainty.
OUTPUT RULES
- Use Markdown only (tables allowed). No HTML.
- Do not echo the input.
- Keep numbers exact; round only when specified or obviously a display nicety.
- If two figures conflict, present both and state which has stronger support (with § markers).
- If links appear in input, include them in-context; otherwise don’t invent references.
PROCESS
1) Skim for scope, then pass to extract entities (specs, models, dimensions, prices, dates).
2) Cluster by topic; deduplicate; note contradictions.
3) Synthesize sections above; add § markers where evidence appears to originate (if indices exist in the summaries).
4) Run a final consistency pass: units, names, dates, totals, and section ordering.
=== INPUT ===
${summaries}`;
const batchRequest = {
  requests: [
    {
      custom_id: requestId,
      params: {
        model: model,
        max_tokens: 20000,
        temperature: 0.2,
        messages: [
          {
            role: "user" as const,
            content: prompt
          }
        ]
      }
    }
  ]
};
const batch = await anthropic.messages.batches.create(batchRequest);
I threw this at Claude Opus 4.1 with a big token budget and was pretty happy with the outcome. I will also benchmark across some other providers, but I don’t expect wildly different results.
What’s next
One thing I like about my approach (and this was not accidental) is that it’s purely additive: all the original data stays on my disk, so as new models come out, I can re-run parts of the pipeline and hopefully get even better results without consuming any more bandwidth from the forum. The most obvious step to retry is the final one, but I am also eager to try redoing the individual post summarizations (or at least some of them) with open-weight, self-hosted models.
One potential area of optimization would be to pull the “users” array out of the prompt and instead give the model a get_post_info(post_id) tool (sketched below), though I am not entirely convinced it would ever reach for it (beyond the OP). I am also interested in the idea of using an LLM enrichment step to pull out distinct threads of conversation and then feeding each one to the model as its own conversation rather than as part of the broader thread.
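If I did go the tool route, the definition might look something like this (a sketch, not something I’ve actually wired up):
// Hypothetical tool: lets the model pull poster metadata on demand instead of
// carrying the whole users array in the prompt.
const tools = [
  {
    name: "get_post_info",
    description:
      "Look up metadata for a forum post: author, author's join date, location, and total post count.",
    input_schema: {
      type: "object" as const,
      properties: {
        post_id: { type: "number", description: "The id of the post to look up" }
      },
      required: ["post_id"]
    }
  }
];
// Passed alongside the prompt: anthropic.messages.create({ model, max_tokens, tools, messages })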
As mentioned above, I am extremely excited at the prospect of using vision models to help the LLM gain visual context from image-heavy threads like this, but I want to be careful about bandwidth consumption. One of the reasons I picked this specific thread is that Jeff had the foresight to host the images himself, on his own domain, and embed them, so in this specific case he is actually footing the bill. I will be reaching out to see if he’s comfortable with me downloading all the images for this use case - given his proclivity for technical experimentation, I get the sense he will be very on board :)
I also want to investigate using embeddings for various stages of the pipeline.
Reading The Tea Leaves
A lot of forums are barely hanging on by a thread, started many years ago by hobbyists passionate about a specific niche. Some are just as passionate as they were 10-20 years ago; many are not. The overhead required to maintain and moderate a public forum is enormous, especially with the advent of LLMs, not to mention the skyrocketing bandwidth costs. Many admins are looking for an out.
On the flip side, you have model companies, flush with VC cash, that are desperate for large swaths of text written by real humans to train on.
It would surprise me 0% to see intermediaries quietly buying up forums purely to turn around and sell the post data. In fact, it would surprise me 0% if that had already happened or is currently happening. Yes, they could scrape for free, but if the acquisition cost is low enough and the value is high enough, why bother?