📖 How chanalyse works
A complete, indexed reference to the whole stack — what it does, every term, every formula and threshold, and the technical architecture. chanalyse continuously reads public 4chan boards (/biz/, /pol/, /g/), classifies the discussion with LLMs, detects what's surging, scores authenticity/coordination, and writes balanced articles about the stories that matter.
1. Overview & data flow
chanalyse reads public discussion, makes sense of it with language models, and publishes a static dashboard. Each cycle does roughly this:
1. scrape 4chan catalogs → deep-fetch active threads (+detect deletions)
2. classify each thread with an LLM (theme / story / entities / stance / novelty)
3. consolidate (merge duplicate stories, relabel themes)
4. rebuild hourly buckets → detect spikes → write article triggers
5. generate 3-perspective articles
6. export the dashboard to static HTML + JSON and publish it
2. Glossary of terms
| Term | Meaning |
|---|---|
| Post | A single 4chan message. Carries its own VADER sentiment. |
| Thread / OP | A discussion; the OP is the opening post. Thread sentiment = the OP's sentiment. |
| Entity | A named thing the LLM pulled out of a thread (a person, company, country, ticker…). Aliases are collapsed (BTC/XBT→bitcoin). |
| Story | A specific event/headline (“$300B Iran reconstruction deal”). Many threads map to one story; near-duplicate stories are merged. |
| Theme | A durable 2–4-word bucket a story sits in (“ai regulation”, “middle east war”). |
| Novelty | 0–1, how new/breaking a development is (1 = breaking news, 0 = evergreen chatter). LLM-assigned. |
| Spike / trigger | A story whose hourly volume jumped far above its own baseline — what becomes an article. |
| z-score | How many standard deviations the current hour's volume is above the story's normal hourly volume. |
| Signal | An entity's market-style read: tone (sentiment ×100), buzz, momentum, bullish/bearish. |
| Copypasta | The same text re-posted across many threads — a coordination signal. |
| Cui bono | “Who benefits” — the article's incentive/conflict-of-interest analysis. |
3. Scraping
Boards polled by default: /biz/, /pol/, /g/ (configurable). Each board is fetched from https://a.4cdn.org/{board}/catalog.json using If-Modified-Since — a 304 Not Modified costs nothing. Active threads are then deep-fetched for all their posts (kept forever as raw snapshots).
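The conditional fetch can be sketched as below. This is a minimal illustration, not the project's code: the helper names `catalog_url`, `conditional_headers` and `catalog_changed` are hypothetical, and no network call is made, so the sketch can be paired with any HTTP client.

```python
from typing import Optional

# URL pattern taken from the text above.
CATALOG_URL = "https://a.4cdn.org/{board}/catalog.json"


def catalog_url(board: str) -> str:
    """Build the catalog URL for a board written as 'biz' or '/biz/'."""
    return CATALOG_URL.format(board=board.strip("/"))


def conditional_headers(last_modified: Optional[str]) -> dict:
    """Send If-Modified-Since only when an earlier Last-Modified value is known."""
    return {"If-Modified-Since": last_modified} if last_modified else {}


def catalog_changed(status: int) -> bool:
    """304 means nothing changed, so parsing is skipped; 200 means re-parse."""
    if status == 304:
        return False
    if status == 200:
        return True
    raise ValueError(f"unexpected HTTP status {status}")
```

The value passed to `conditional_headers` would be the `Last-Modified` header saved from the previous successful response for that board.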
Per-board poll intervals (adaptive; only “due” boards are polled):
| /pol/ | /biz/ | /g/ | /sci/ /his/ /lit/ |
|---|---|---|---|
| 5 min | 15 min | 30 min | 60 min |
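Selecting the "due" boards from the intervals above might look like the following sketch. It treats the intervals as fixed (the adaptive adjustment is not described here, so it is not modelled), and `due_boards` plus the `last_polled` mapping are hypothetical names.

```python
from datetime import datetime, timedelta

# Poll intervals in minutes, copied from the table above.
POLL_MINUTES = {"pol": 5, "biz": 15, "g": 30, "sci": 60, "his": 60, "lit": 60}


def due_boards(last_polled: dict, now: datetime) -> list:
    """Return boards whose interval has elapsed; never-polled boards are always due."""
    due = []
    for board, minutes in POLL_MINUTES.items():
        last = last_polled.get(board)
        if last is None or now - last >= timedelta(minutes=minutes):
            due.append(board)
    return due
```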
Archive links are board-correct (different archives host different boards):
| Board | Archive host |
|---|---|
| /pol/, /sp/, /tv/, /x/, /adv/… | archive.4plebs.org |
| /biz/, /vt/ | warosu.org |
| /g/, /sci/, /his/, /lit/, /int/, /mu/… | desuarchive.org |
4. Sentiment & novelty
Sentiment
Computed with VADER (a lexicon + rule-based model) on the cleaned OP text. The stored value is VADER's compound score in [-1.0, +1.0].
In the Signals tab, sentiment is rescaled to a tone in [-100, +100] (GDELT-style): tone = compound × 100.
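As a small sketch of the rescaling, together with the ±0.05 label cutoff listed in the cheat sheet: the function names and the label strings "positive" / "negative" / "neutral" are assumptions for illustration.

```python
def tone(compound: float) -> float:
    """Rescale a VADER compound score in [-1, 1] to a tone in [-100, 100]."""
    if not -1.0 <= compound <= 1.0:
        raise ValueError("compound must be within [-1.0, 1.0]")
    return compound * 100.0


def sentiment_label(compound: float, cutoff: float = 0.05) -> str:
    """Label a compound score using the +/-0.05 cutoff (label names are assumed)."""
    if compound >= cutoff:
        return "positive"
    if compound <= -cutoff:
        return "negative"
    return "neutral"
```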
Novelty
Novelty is not a formula — it's assigned by the classification LLM, 0.0–1.0: 1.0 = a new development / breaking news, 0.0 = evergreen general chatter. A story's novelty_avg is the running mean of every contributing thread's novelty.
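A running mean can be maintained without storing every thread's value. The sketch below shows the standard incremental update; `update_running_mean` is a hypothetical helper, and how the real schema stores the count is an assumption.

```python
def update_running_mean(current_avg: float, count: int, new_value: float) -> tuple:
    """Fold one more novelty value into a mean taken over `count` earlier values."""
    if not 0.0 <= new_value <= 1.0:
        raise ValueError("novelty must be within [0.0, 1.0]")
    new_count = count + 1
    # Incremental form of the mean: move the average toward the new value by 1/n.
    new_avg = current_avg + (new_value - current_avg) / new_count
    return new_avg, new_count
```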
5. Classification (LLM extraction)
Unclassified threads are sent to an LLM in batches of 20. For each thread the model returns:
| Field | Meaning |
|---|---|
| theme | durable 2–4-word bucket (default “general discussion”) |
| story | specific headline phrase |
| entities | named things (max 12) |
| key_claim | the thread's core claim, one sentence |
| stance_word | supportive / critical / neutral … |
| novelty, confidence | 0–1 each |
Backend chain: classification tries several LLM providers in turn and falls back to a fast offline heuristic if none are available, so a thread is never dropped. A cheaper/faster model is used for classification and a stronger one for writing articles.
Safety & robustness
- Purpose-framed prompt: the model is told it's a “trust-&-safety / academic discourse-analysis” component, so it labels hateful/extreme /pol/ content as data instead of refusing.
- Prompt-injection defang: thread text is wrapped in ⟦THREAD-DATA⟧ markers, and instruction-like phrases (“ignore previous instructions”, “you are now”…) get a zero-width space inserted so they can't hijack the model.
- Refusal handling: if a batch is refused, it's retried once with stronger framing; if it still refuses, the batch is split in half recursively to isolate the one toxic thread, so the rest still get full extraction. Only a genuinely un-processable single thread falls back to the heuristic.
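The defang and refusal-isolation steps above can be sketched as follows. Everything named here is hypothetical: the phrase list is only the two examples quoted above, the position of the zero-width space and the use of the same marker on both sides are assumptions, and `classify` stands in for an LLM call that returns `None` when it refuses.

```python
import re
from typing import Callable

ZWSP = "\u200b"  # zero-width space
# Illustrative patterns only; the full phrase list is not given in the text.
INJECTION_PATTERNS = [r"ignore previous instructions", r"you are now"]


def defang(text: str) -> str:
    """Break up instruction-like phrases, then wrap the text in data markers."""
    for pattern in INJECTION_PATTERNS:
        text = re.sub(
            pattern,
            lambda m: m.group(0)[0] + ZWSP + m.group(0)[1:],
            text,
            flags=re.IGNORECASE,
        )
    return f"⟦THREAD-DATA⟧{text}⟦THREAD-DATA⟧"


def classify_batch(threads: list, classify: Callable, heuristic: Callable) -> list:
    """Classify a batch; on refusal retry once, then halve until one thread is isolated."""
    if not threads:
        return []
    result = classify(threads, False)
    if result is None:  # refused: retry once with stronger framing
        result = classify(threads, True)
    if result is not None:
        return result
    if len(threads) == 1:  # a single thread that still refuses
        return [heuristic(threads[0])]
    mid = len(threads) // 2
    return (classify_batch(threads[:mid], classify, heuristic)
            + classify_batch(threads[mid:], classify, heuristic))
```

Halving keeps the number of extra LLM calls roughly logarithmic in the batch size when a single thread is responsible for the refusal.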
6. Stories & themes
Each extracted thread is matched to an existing story/theme by a blended similarity score:
| Action | Threshold |
|---|---|
| Assign a thread to an existing story (at ingest) | ≥ 0.34 |
| Assign to an existing theme | ≥ 0.30 |
| Retro-merge two existing stories (consolidation) | ≥ 0.55 (stricter) |
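One way to picture the blended score uses the weights from the cheat sheet (0.45 token + 0.40 entity + 0.15 fuzzy). The component measures below are assumptions, since the text does not define them: Jaccard overlap for tokens and entities, and `difflib.SequenceMatcher` for the fuzzy part.

```python
import re
from difflib import SequenceMatcher

# Thresholds from the table above.
STORY_MATCH, THEME_MATCH, STORY_MERGE = 0.34, 0.30, 0.55


def jaccard(a: set, b: set) -> float:
    """Set overlap |A & B| / |A | B|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def blended_similarity(label_a: str, entities_a: list, label_b: str, entities_b: list) -> float:
    """Weighted blend of token, entity and fuzzy-string similarity, in [0, 1]."""
    a, b = label_a.casefold(), label_b.casefold()
    token_sim = jaccard(set(re.findall(r"\w+", a)), set(re.findall(r"\w+", b)))
    entity_sim = jaccard({e.casefold() for e in entities_a},
                         {e.casefold() for e in entities_b})
    fuzzy_sim = SequenceMatcher(None, a, b).ratio()
    return 0.45 * token_sim + 0.40 * entity_sim + 0.15 * fuzzy_sim
```

Under these assumptions a thread would join an existing story when the score reaches `STORY_MATCH`, while two existing stories would merge only at the stricter `STORY_MERGE`.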
A rolling consolidation runs every cycle (time-budgeted to ~20s) to merge the duplicate stories that a single event has fragmented into, so that, for example, a dozen “Fable 5…” variants collapse into one canonical story that can actually accumulate spike mass. Merged stories keep a merged_into_id pointer (nothing is destroyed), and their hourly buckets/spikes are reassigned to the canonical story.
story_hourly.mention_count for a (board, story, hour) bucket = the number of distinct storied threads attributed to that story in that hour.
7. Spikes & article triggers
Every hour, each story's volume in the last complete hour is compared to its own behaviour over the prior 7 days of hourly buckets.
z-score
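The z-score measures how many standard deviations the last complete hour sits above the story's usual hourly volume. A minimal sketch, assuming a population standard deviation and assuming that too little history or a flat baseline yields no score (the text does not say how either case is handled); `hourly_zscore` is a hypothetical name.

```python
from statistics import mean, pstdev
from typing import Optional, Sequence

MIN_PRIOR_HOURS = 6  # "Min prior hours for a real z-score" from the cheat sheet


def hourly_zscore(current: float, prior_hours: Sequence[float]) -> Optional[float]:
    """Standard deviations by which `current` exceeds the mean of `prior_hours`.

    Returns None with too little history or a flat baseline (an assumption).
    """
    if len(prior_hours) < MIN_PRIOR_HOURS:
        return None
    mu = mean(prior_hours)
    sigma = pstdev(prior_hours)
    if sigma == 0:
        return None
    return (current - mu) / sigma
```

Here `prior_hours` would hold the story's hourly mention counts from the 7-day baseline window, excluding the hour being tested.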
Per-board thresholds (slow tech boards get an easier bar):
| Board | min mentions | z-threshold | new-story min |
|---|---|---|---|
| /g/, /sci/ (tech) | 2 | 1.0 | 2 |
| all others (/pol/, /biz/…) | 3 | 2.0 | 4 |
A story qualifies as a spike when it has ≥2 specific entities, is novel or “sticky” (present in ≥2 of the last 3 hours), and either clears its z-threshold while sustained/new, or is a fresh breakout above the new-story count.
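The qualification rule can be read as a boolean check like the sketch below. Several details are assumptions: "sustained" is treated as the same condition as "sticky", the minimum-mentions bar is applied on the z-score path only, and novelty arrives as a precomputed flag because its cutoff is not stated. `Candidate` and `qualifies_as_spike` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

TECH_BOARDS = {"g", "sci"}
# (min mentions, z-threshold, new-story min) from the table above.
THRESHOLDS = {"tech": (2, 1.0, 2), "default": (3, 2.0, 4)}


@dataclass
class Candidate:
    board: str
    mentions: int                  # distinct threads in the last complete hour
    z: Optional[float]             # None when there is not enough history
    entity_count: int              # specific entities attached to the story
    is_novel: bool
    hours_present_of_last_3: int
    is_new_story: bool


def qualifies_as_spike(c: Candidate) -> bool:
    """Apply the entity, novelty/sticky and z-or-breakout conditions."""
    key = "tech" if c.board in TECH_BOARDS else "default"
    min_mentions, z_threshold, new_story_min = THRESHOLDS[key]
    sticky = c.hours_present_of_last_3 >= 2
    if c.entity_count < 2 or not (c.is_novel or sticky):
        return False
    z_path = (c.z is not None and c.z >= z_threshold
              and c.mentions >= min_mentions
              and (sticky or c.is_new_story))
    breakout_path = c.is_new_story and c.mentions >= new_story_min
    return z_path or breakout_path
```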
article score
Each spike gets a 0–1 score that ranks it for article-writing; its component weights are listed in the cheat sheet (section 14).
An article is written when article_score ≥ 0.55. A spike won't re-fire for the same story within 6 hours, and near-duplicate spikes (same theme + entity/label overlap) are suppressed.
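A rough sketch of the scoring and gating, using the article-score weights from the cheat sheet. How each component is normalised is not stated, so the sketch assumes every component is already in [0, 1]; because the listed weights add up to more than 1, it also assumes the sum is capped at 1.0. The function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

# Weights from the cheat sheet.
WEIGHTS = {"z": 0.30, "novelty": 0.30, "entity": 0.20, "sustained": 0.12,
           "new": 0.08, "cross": 0.15, "tech": 0.14}
PUBLISH_THRESHOLD = 0.55
REFIRE_WINDOW = timedelta(hours=6)


def article_score(components: dict) -> float:
    """Weighted sum of components in [0, 1], capped at 1.0 (cap is an assumption)."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = components.get(name, 0.0)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"component {name!r} must be within [0, 1]")
        total += weight * value
    return min(total, 1.0)


def should_write_article(score: float, last_fired: Optional[datetime], now: datetime) -> bool:
    """Require the publish threshold and no spike for this story in the last 6 hours."""
    if score < PUBLISH_THRESHOLD:
        return False
    return last_fired is None or now - last_fired >= REFIRE_WINDOW
```

Suppression of near-duplicate spikes (same theme plus entity/label overlap) is not modelled here.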
8. Signals (the ticker)
A separate, GDELT-style market read over entity × sentiment × time. Buckets are hourly for windows ≤ 48h, otherwise daily.
tone & dispersion
buzz index (0–100 heat)
bullish / bearish classification
Category is inferred from curated keyword sets (crypto / commodity / political / tech / company / ticker / person / other). Only crypto/ticker/company/commodity appear in the “market” scan. A confidence 0–1 rewards more posters, more time-buckets and lower dispersion.
9. Topic velocity
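Shown on the Analysis tab: a concrete per-topic heat signal (more reliable than aggregate sentiment). For each top story's hourly series, the window is split into a prior two-thirds and a recent one-third, and the two are compared; the cheat sheet lists the acceleration bands (≥1.3 / ≤0.7).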
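A sketch of the 2/3 versus 1/3 comparison, using the acceleration bands from the cheat sheet (≥1.3 / ≤0.7). The acceleration is assumed here to be the ratio of the recent mean hourly rate to the prior mean hourly rate, and the band names and zero-baseline handling are illustrative choices.

```python
from typing import Optional, Sequence, Tuple

ACCEL_HOT, ACCEL_COLD = 1.3, 0.7  # bands from the cheat sheet


def topic_velocity(hourly: Sequence[float]) -> Tuple[Optional[float], str]:
    """Compare the recent third of an hourly series with the prior two thirds."""
    split = (2 * len(hourly)) // 3
    prior, recent = hourly[:split], hourly[split:]
    if not prior or not recent:
        return None, "insufficient data"
    prior_rate = sum(prior) / len(prior)
    recent_rate = sum(recent) / len(recent)
    if prior_rate == 0:
        # The ratio is undefined with an empty baseline (handling is assumed).
        return None, ("accelerating" if recent_rate > 0 else "steady")
    accel = recent_rate / prior_rate
    if accel >= ACCEL_HOT:
        band = "accelerating"
    elif accel <= ACCEL_COLD:
        band = "cooling"
    else:
        band = "steady"
    return accel, band
```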
10. Authenticity / coordination
copypasta (repeated text)
Each post's text is normalised (collapse whitespace, casefold) and used as a cluster key. Texts shorter than 15 chars are ignored. A cluster is surfaced only if it appears in ≥2 distinct threads AND ≥3 posts. Each cluster carries every individual occurrence (board / thread / post / time) with a live link and an archive link so you can verify the post is real.
OP share = fraction of a cluster's posts that are thread OPs. A high OP-share means a recurring “general” thread template, not manufactured reply-sentiment — these can be excluded with the toggle.
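The clustering rule is simple enough to sketch directly. The post dictionaries with `text`, `thread_id` and `is_op` keys are a hypothetical shape, and measuring the 15-character minimum after normalisation is an assumption.

```python
from collections import defaultdict

MIN_CHARS, MIN_THREADS, MIN_POSTS = 15, 2, 3  # values from the text above


def normalise(text: str) -> str:
    """Collapse runs of whitespace and casefold, giving the cluster key."""
    return " ".join(text.split()).casefold()


def copypasta_clusters(posts: list) -> list:
    """Group posts by normalised text; keep groups seen in >=2 threads and >=3 posts."""
    groups = defaultdict(list)
    for post in posts:
        key = normalise(post["text"])
        if len(key) >= MIN_CHARS:
            groups[key].append(post)
    clusters = []
    for key, members in groups.items():
        threads = {p["thread_id"] for p in members}
        if len(threads) >= MIN_THREADS and len(members) >= MIN_POSTS:
            clusters.append({
                "text": key,
                "posts": len(members),
                "threads": len(threads),
                # Fraction of the cluster's posts that are thread OPs.
                "op_share": sum(1 for p in members if p["is_op"]) / len(members),
            })
    return clusters
```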
substance score
target & stance
A cluster is matched to a narrative target (Israel, Iran, Trump, China…) via keyword sets, then assigned a stance toward that target.
Short keywords use word-boundary matching, and a false-context filter suppresses geographic false-positives (e.g. “British Indian Ocean Territory” no longer tags a post as pro-Indian). When enabled, an LLM re-reads the top clusters and overrides the heuristic (clearing the ~ flag).
link spam
External URLs (4chan renders them as plain text, often split by <wbr>) are extracted and counted. A link is flagged “spam” if it appears in ≥3 posts; the table shows how many distinct threads it spanned.
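Link extraction and the ≥3-post rule could be sketched as below. The URL pattern is a deliberately simple assumption (it may keep trailing punctuation), and the post dictionaries with `id`, `thread_id` and `html` keys are a hypothetical shape.

```python
import re
from collections import defaultdict

URL_RE = re.compile(r"https?://[^\s<>\"']+")  # simplistic pattern, an assumption
SPAM_MIN_POSTS = 3


def extract_urls(html: str) -> set:
    """Drop the <wbr> tags that split long URLs, then collect http(s) links."""
    joined = re.sub(r"<wbr\s*/?>", "", html, flags=re.IGNORECASE)
    return set(URL_RE.findall(joined))


def link_spam(posts: list) -> dict:
    """Map each URL seen in >=3 posts to (post count, distinct thread count)."""
    post_ids = defaultdict(set)
    thread_ids = defaultdict(set)
    for post in posts:
        for url in extract_urls(post["html"]):
            post_ids[url].add(post["id"])
            thread_ids[url].add(post["thread_id"])
    return {url: (len(ids), len(thread_ids[url]))
            for url, ids in post_ids.items() if len(ids) >= SPAM_MIN_POSTS}
```

Counting posts through a set means a URL repeated inside one post still counts once for that post.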
influence campaigns
Substantive, non-template clusters (substance ≥ 0.18) that share a (target, stance) are grouped into a campaign. distinct_messages = how many differently-worded clusters push the same agenda; the timeline shows posts-per-hour for that agenda.
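Grouping clusters into campaigns might look like this sketch. The cluster fields (`target`, `stance`, `substance`, `is_template`, `posts`) are a hypothetical shape, and clusters without a matched target are assumed to be skipped.

```python
from collections import defaultdict

SUBSTANCE_MIN = 0.18  # threshold from the text above


def group_campaigns(clusters: list) -> dict:
    """Group substantive, non-template clusters by their (target, stance) pair."""
    grouped = defaultdict(list)
    for cluster in clusters:
        if (cluster["substance"] >= SUBSTANCE_MIN
                and not cluster["is_template"]
                and cluster["target"]):
            grouped[(cluster["target"], cluster["stance"])].append(cluster)
    return {
        key: {
            # Number of differently-worded clusters pushing the same agenda.
            "distinct_messages": len(members),
            "posts": sum(m["posts"] for m in members),
        }
        for key, members in grouped.items()
    }
```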
Summary metrics
| Metric | Definition |
|---|---|
| dup_rate_pct | duplicate posts ÷ total posts × 100 |
| unique_texts | distinct normalised texts (≥15 chars) |
| named_operators | posts with a non-anonymous name or a tripcode |
| spam_links | URLs posted in ≥3 posts |
11. Moderation
When a post seen on a previous fetch is absent on the next fetch (and hasn't reappeared), it's recorded as deleted, with its lifespan (time from first-seen to deletion). The Recent-deletions feed dedups identical removed text into one row carrying a ×N count and how many threads it spanned, so a copypasta deleted across many threads doesn't flood the list. deletion_rate_pct = deletions ÷ posts × 100.
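Deletion detection reduces to a set difference between two fetches of the same live thread. The sketch below uses hypothetical names and data shapes, assumes the thread itself is still alive on the second fetch, and does not model the "hasn't reappeared" check.

```python
from datetime import datetime


def detect_deletions(previous: dict, current_ids: set, now: datetime) -> dict:
    """Map each post missing from the new fetch to its lifespan.

    `previous` maps post id -> first-seen time from earlier fetches.
    """
    return {pid: now - first_seen
            for pid, first_seen in previous.items() if pid not in current_ids}


def dedup_feed(deleted: list) -> list:
    """Collapse identical removed text into one row with a count and thread span."""
    rows = {}
    for post in deleted:
        row = rows.setdefault(post["text"],
                              {"text": post["text"], "count": 0, "threads": set()})
        row["count"] += 1
        row["threads"].add(post["thread_id"])
    return [{"text": r["text"], "count": r["count"], "threads": len(r["threads"])}
            for r in rows.values()]


def deletion_rate_pct(deletions: int, posts: int) -> float:
    """deletions / posts x 100, defined as 0.0 when there are no posts."""
    return 100.0 * deletions / posts if posts else 0.0
```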
12. Article generation
When a story spikes, an article is written by the standalone magazine generator (model: Claude Opus). The ordered pipeline:
- Fact verification FIRST: a live web search (Anthropic web_search tool) establishes what's actually confirmed in the real world before anything is written. This is why the system states confirmed events as fact instead of calling real news “unverified”.
- Verified sources: candidate URLs (the web-search news first, then thread links, library reuse, model-proposed) are each fetched and checked to really exist; only confirmed-real ones can be cited.
- Factbrief: the shared factual core (who/what/when/where/how) grounded in those sources, plus a key_terms glossary explaining jargon (what “SillyTavern”, “GLM”, “AUR” actually are).
- Three perspectives on the same facts: Critical (“The Prosecutor”), Neutral (“The Analyst”), Supportive (“The Advocate”). If any perspective comes back as a refusal/error, the whole article is rejected (never published broken).
- Cui bono: beneficiaries, losers, rivalry & conflicts of interest (who is harmed and which competitor gains, and any decision-maker's stake in that rival), ramification chains, plus an intentional vs structural reading.
- Hero image: an LLM first turns the facts into a concrete, literal real-world scene (so the picture is on-topic, not a fantasy interpretation of a product name), which is rendered by flux-dev on Replicate. Images are reused, never regenerated for an existing story.
Articles can also be commissioned manually on a chosen topic, which runs the same fact-checked pipeline.
lead-story selection
The featured “Top story” is chosen purely by live activity (re-ranked every build), never by an editorial pick. The lead-story weights are listed in the cheat sheet (section 14).
13. Tech stack
The tools behind chanalyse:
| Layer | Tech |
|---|---|
| Language | Python |
| Web API + pages | FastAPI + Jinja2 |
| Database | SQLAlchemy over SQLite |
| Sentiment | VADER |
| Charts | Chart.js |
| Classification | Claude (with a heuristic fallback) |
| Article writing | Claude + a live web-search tool for fact-checking |
| Hero images | Replicate flux-dev |
| Hosting | Cloudflare Pages (a static export of the dashboard) |
The collection/analysis runs continuously on one machine; the public site is a separate static snapshot, so the live dashboard never exposes the engine directly. The data is durable (the raw discussion is kept and the database is backed up), and the methods are deliberately shown openly so the output can be judged.
14. All constants (cheat sheet)
| Constant | Value |
|---|---|
| Spike z-threshold (default / tech) | 2.0 / 1.0 |
| Min spike mentions (default / tech) | 3 / 2 |
| New-story min count (default / tech) | 4 / 2 |
| Spike baseline window | 7 days |
| Min prior hours for a real z-score | 6 |
| Article-score weights | z .30, novelty .30, entity .20, sustained .12, new .08, cross .15, tech .14 |
| Article publish threshold | 0.55 |
| Spike re-fire window | 6 hours |
| Signals “abnormal” z | 1.5 |
| Signals bullish/bearish tone gate | ±8 |
| Signals momentum gate | ±6 |
| Buzz-index weights | z 55, size 30, momentum 15 |
| Story / theme match | 0.34 / 0.30 |
| Story consolidation (merge) | 0.55 |
| Similarity blend | 0.45 token + 0.40 entity + 0.15 fuzzy |
| Copypasta min length / cluster emit | 15 chars / ≥2 threads & ≥3 posts |
| Substance weights | 0.45 length + 0.30 opinion + 0.25 spread (×0.25 template) |
| Link-spam threshold | ≥3 posts |
| Velocity window split / bands | 2/3 vs 1/3 · accel ≥1.3 / ≤0.7 |
| Lead-story weights | novelty .34, traction .26, recency .18, volume .12, base .10, cross .12 |
| VADER sentiment label | ±0.05 |
| Classify batch size | 20 threads |
15. Caveats & honesty
- Sentiment is noisy. VADER is lexicon-based and misreads irony, slang and 4chan in-jokes. Treat tone as a rough signal.
- Target/stance labels are heuristic. They can mislabel subtle, ironic or geographic text; the Authenticity disclaimer says so, and an LLM pass corrects the surfaced ones where enabled.
- Repetition ≠ payment. Copypasta and link-spam are coordination signals; generals reuse boilerplate and memes spread organically. Investigate, don't conclude.
- The dashboard is a ~3-hour snapshot. The engine collects continuously, but the public view only refreshes when a new export is published.
- This is imageboard data. It is unverified, manipulable, and explicitly not financial advice.
chanalyse is an experiment in reading the live pulse of anonymous forums and surfacing stories before the mainstream — with the methods shown openly so you can judge them yourself.