📖 How chanalyse works

A complete, indexed reference to the whole stack — what it does, every term, every formula and threshold, and the technical architecture. chanalyse continuously reads public 4chan boards (/biz/, /pol/, /g/), classifies the discussion with LLMs, detects what's surging, scores authenticity/coordination, and writes balanced articles about the stories that matter.

1. Overview & data flow

chanalyse reads public discussion, makes sense of it with language models, and publishes a static dashboard. Each cycle does roughly this:

1.  scrape 4chan catalogs  →  deep-fetch active threads (+detect deletions)
2.  classify each thread with an LLM (theme / story / entities / stance / novelty)
3.  consolidate (merge duplicate stories, relabel themes)
4.  rebuild hourly buckets  →  detect spikes  →  write article triggers
5.  generate 3-perspective articles
6.  export the dashboard to static HTML + JSON and publish it

The published dashboard is a snapshot rebuilt roughly every 3 hours (see the clock in the header). The scraper and analysis run continuously underneath; the public view refreshes each build.

2. Glossary of terms

Term	Meaning
Post	A single 4chan message. Carries its own VADER sentiment.
Thread / OP	A discussion; the OP is the opening post. Thread sentiment = the OP's sentiment.
Entity	A named thing the LLM pulled out of a thread (a person, company, country, ticker…). Aliases are collapsed (BTC/XBT→`bitcoin`).
Story	A specific event/headline (“$300B Iran reconstruction deal”). Many threads map to one story; near-duplicate stories are merged.
Theme	A durable 2–4-word bucket a story sits in (“ai regulation”, “middle east war”).
Novelty	0–1, how new/breaking a development is (1 = breaking news, 0 = evergreen chatter). LLM-assigned.
Spike / trigger	A story whose hourly volume jumped far above its own baseline — what becomes an article.
z-score	How many standard deviations the current hour's volume is above the story's normal hourly volume.
Signal	An entity's market-style read: tone (sentiment ×100), buzz, momentum, bullish/bearish.
Copypasta	The same text re-posted across many threads — a coordination signal.
Cui bono	“Who benefits” — the article's incentive/conflict-of-interest analysis.

3. Scraping

Boards polled by default: /biz/, /pol/, /g/ (configurable). Each board is fetched from https://a.4cdn.org/{board}/catalog.json using If-Modified-Since, so an unchanged board returns a bodiless 304 Not Modified and costs almost nothing. Active threads are then deep-fetched for all their posts, which are kept forever as raw snapshots.
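
As a rough illustration of the conditional fetch described above, the sketch below builds the catalog request with the standard library. The function names and the 30-second timeout are hypothetical, not chanalyse's own code; only the URL pattern and the If-Modified-Since header come from the text.

```python
import json
from typing import Optional, Tuple
from urllib.error import HTTPError
from urllib.request import Request, urlopen

CATALOG_URL = "https://a.4cdn.org/{board}/catalog.json"


def build_catalog_request(board: str, last_modified: Optional[str] = None) -> Request:
    """Build the catalog request; `board` may be "g" or "/g/"."""
    req = Request(CATALOG_URL.format(board=board.strip("/")))
    if last_modified:
        # Value is the Last-Modified string returned by the previous fetch.
        req.add_header("If-Modified-Since", last_modified)
    return req


def fetch_catalog(board: str, last_modified: Optional[str] = None) -> Tuple[Optional[list], Optional[str]]:
    """Return (catalog, last_modified); catalog is None when nothing changed."""
    try:
        with urlopen(build_catalog_request(board, last_modified), timeout=30) as resp:
            return json.load(resp), resp.headers.get("Last-Modified")
    except HTTPError as err:
        if err.code == 304:  # urllib surfaces 304 Not Modified as an HTTPError
            return None, last_modified
        raise
```

A caller would store the returned Last-Modified value per board and pass it back on the next poll.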

Per-board poll intervals (adaptive; only “due” boards are polled):

/pol/	/biz/	/g/	/sci/ /his/ /lit/
5 min	15 min	30 min	60 min

Archive links are board-correct (different archives host different boards):

Board	Archive host
/pol/, /sp/, /tv/, /x/, /adv/…	`archive.4plebs.org`
/biz/, /vt/	`warosu.org`
/g/, /sci/, /his/, /lit/, /int/, /mu/…	`desuarchive.org`

desuarchive does not host /pol/ or /biz/ (it returns 404); 4plebs does not host /biz/. The mapping above is the verified-correct routing.
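
The routing table above could be expressed as a simple lookup. This is a minimal sketch: the names `ARCHIVE_HOSTS` and `archive_host` are hypothetical, and only the boards listed explicitly in the table are included (the elided ones are left out rather than guessed).

```python
from typing import Optional

# Boards copied from the table above; the "…" entries are deliberately omitted.
ARCHIVE_HOSTS = {
    "archive.4plebs.org": {"pol", "sp", "tv", "x", "adv"},
    "warosu.org": {"biz", "vt"},
    "desuarchive.org": {"g", "sci", "his", "lit", "int", "mu"},
}


def archive_host(board: str) -> Optional[str]:
    """Return the archive host for a board, or None if the board is not mapped."""
    name = board.strip("/")
    for host, boards in ARCHIVE_HOSTS.items():
        if name in boards:
            return host
    return None
```

Returning None for unmapped boards lets the caller omit the archive link instead of emitting one that would 404.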

4. Sentiment & novelty

Sentiment

Computed with VADER (a lexicon + rule-based model) on the cleaned OP text. The stored value is VADER's compound score in [-1.0, +1.0].

compound >= +0.05 → positive
compound <= -0.05 → negative
otherwise → neutral

In the Signals tab, sentiment is rescaled to a tone in [-100, +100] (GDELT-style): tone = compound × 100.
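
The labelling and rescaling rules above are small enough to sketch directly. The function names are hypothetical; the cut-offs and the ×100 scaling are the ones stated in this section.

```python
def sentiment_label(compound: float) -> str:
    """Map a VADER compound score in [-1, 1] to a coarse label."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"


def tone(compound: float) -> float:
    """Rescale a compound score to the Signals tab's [-100, +100] tone."""
    return compound * 100
```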

Novelty

Novelty is not a formula — it's assigned by the classification LLM, 0.0–1.0: 1.0 = a new development / breaking news, 0.0 = evergreen general chatter. A story's novelty_avg is the running mean of every contributing thread's novelty.

5. Classification (LLM extraction)

Unclassified threads are sent to an LLM in batches of 20. For each thread the model returns:

Field	Meaning
`theme`	durable 2–4-word bucket (default “general discussion”)
`story`	specific headline phrase
`entities`	named things (max 12)
`key_claim`	the thread's core claim, one sentence
`stance_word`	supportive / critical / neutral …
`novelty`, `confidence`	0–1 each

Backend chain: classification tries several LLM providers in turn and falls back to a fast offline heuristic if none are available, so a thread is never dropped. A cheaper/faster model is used for classification and a stronger one for writing articles.

Safety & robustness

Purpose-framed prompt: the model is told it's a “trust-&-safety / academic discourse-analysis” component, so it labels hateful/extreme /pol/ content as data instead of refusing.
Prompt-injection defang: thread text is wrapped in ⟦THREAD-DATA⟧ markers and instruction-like phrases (“ignore previous instructions”, “you are now”…) get a zero-width space inserted so they can't hijack the model.
Refusal handling: if a batch is refused, it's retried once with stronger framing; if it still refuses, the batch is split in half recursively to isolate the one toxic thread — the rest still get full extraction. Only a genuinely un-processable single thread falls back to the heuristic.
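
The defang and refusal-handling steps above might look roughly like the sketch below. Everything here is illustrative: the phrase list holds only the two examples quoted in the text, the position of the zero-width space and the exact marker layout are assumptions, and `classify` / `heuristic` are hypothetical callables standing in for the real LLM call and the offline fallback. Whether sub-batches are retried with stronger framing is also an assumption.

```python
import re
from typing import Callable, List, Optional, Sequence

ZWSP = "\u200b"  # zero-width space
SUSPECT_PHRASES = ("ignore previous instructions", "you are now")


def defang(text: str) -> str:
    """Break up instruction-like phrases by inserting a zero-width space."""
    for phrase in SUSPECT_PHRASES:
        pattern = re.compile(re.escape(phrase), re.IGNORECASE)
        text = pattern.sub(lambda m: m.group(0)[0] + ZWSP + m.group(0)[1:], text)
    return text


def wrap_thread(text: str) -> str:
    """Fence defanged thread text between data markers (layout is assumed)."""
    return f"⟦THREAD-DATA⟧\n{defang(text)}\n⟦THREAD-DATA⟧"


# classify(batch, strong_framing) returns one result per thread, or None on refusal.
Classifier = Callable[[Sequence[str], bool], Optional[List[str]]]


def classify_batch(batch: Sequence[str], classify: Classifier,
                   heuristic: Callable[[str], str]) -> List[str]:
    """Try normally, retry once with stronger framing, then bisect the batch."""
    if not batch:
        return []
    for strong_framing in (False, True):
        result = classify(batch, strong_framing)
        if result is not None:
            return result
    if len(batch) == 1:
        # A single thread that still refuses falls back to the heuristic.
        return [heuristic(batch[0])]
    mid = len(batch) // 2
    return (classify_batch(batch[:mid], classify, heuristic)
            + classify_batch(batch[mid:], classify, heuristic))
```

The bisection keeps the number of extra LLM calls logarithmic in the batch size when only one thread triggers the refusal.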

6. Stories & themes

Each extracted thread is matched to an existing story/theme by a blended similarity score:

similarity = 0.45·tokenJaccard(labels) + 0.40·Jaccard(entities) + 0.15·fuzzyRatio(text)

Action	Threshold
Assign a thread to an existing story (at ingest)	≥ 0.34
Assign to an existing theme	≥ 0.30
Retro-merge two existing stories (consolidation)	≥ 0.55 (stricter)
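
One way to read the blended score is sketched below. The function names and argument layout are hypothetical, labels are tokenised by whitespace for simplicity, and `difflib.SequenceMatcher` is used as a stand-in for fuzzyRatio, since the text does not say which fuzzy matcher is used.

```python
from difflib import SequenceMatcher
from typing import Iterable


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two sets; two empty sets count as no overlap."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity(label_a: str, label_b: str,
               entities_a: Iterable[str], entities_b: Iterable[str],
               text_a: str, text_b: str) -> float:
    """Blend label-token, entity and fuzzy-text similarity with the stated weights."""
    token = jaccard(set(label_a.lower().split()), set(label_b.lower().split()))
    entity = jaccard({e.lower() for e in entities_a}, {e.lower() for e in entities_b})
    fuzzy = SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio()
    return 0.45 * token + 0.40 * entity + 0.15 * fuzzy


STORY_MATCH = 0.34   # assign a thread to an existing story
THEME_MATCH = 0.30   # assign to an existing theme
STORY_MERGE = 0.55   # retro-merge two existing stories
```

A thread would join the best-scoring existing story if that score reaches `STORY_MATCH`, and otherwise start a new one.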

A rolling consolidation runs every cycle (time-budgeted ~20s) to merge duplicate stories that the same event fragmented into — so e.g. a dozen “Fable 5…” variants collapse into one canonical story that can actually accumulate spike mass. Merged stories keep a merged_into_id pointer (nothing is destroyed) and their hourly buckets/spikes are reassigned to the canonical.

story_hourly.mention_count for a (board, story, hour) bucket = the number of distinct storied threads attributed to that story in that hour.

7. Spikes & article triggers

Every hour, each story's volume in the last complete hour is compared to its own behaviour over the prior 7 days of hourly buckets.

z-score

base = the story's prior hourly mention counts (last 7 days)
mean = average(base)
std = population standard deviation(base)
z = (count − mean) / std   (if std > 0.3)
z = (count − mean)         (if std is tiny)
If fewer than 6 prior hours exist → treat as a NEW story, z = raw count.
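
The z-score rules above can be sketched with the standard library. The function name and return shape are hypothetical; "tiny" is read here as std ≤ 0.3, the complement of the stated std > 0.3 branch.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def spike_z(count: float, base: Sequence[float]) -> Tuple[float, bool]:
    """Return (z, is_new) for the last complete hour.

    `base` holds the story's prior hourly mention counts (last 7 days).
    """
    if len(base) < 6:
        # Too little history for a real z-score: treat as a NEW story.
        return float(count), True
    m = mean(base)
    s = pstdev(base)  # population standard deviation
    if s > 0.3:
        return (count - m) / s, False
    # Near-constant baseline: fall back to the raw difference.
    return float(count - m), False
```

The std guard avoids dividing by a near-zero deviation, which would otherwise turn a single extra mention into an enormous z.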

Per-board thresholds (slow tech boards get an easier bar):

Board	min mentions	z-threshold	new-story min
/g/, /sci/ (tech)	2	1.0	2
all others (/pol/, /biz/…)	3	2.0	4

A story qualifies as a spike when it has ≥2 specific entities, is novel or “sticky” (present in ≥2 of the last 3 hours), and either clears its z-threshold while sustained/new, or is a fresh breakout above the new-story count.

article score

Each spike gets a 0–1 score that ranks it for article-writing:

z_norm = min(1, z / 6)
article_score = 0.30·z_norm + 0.30·novelty + 0.20·entity_specificity
              + 0.12·(sustained?1:0) + 0.08·(new?1:0)
              + 0.14 (tech-board boost)
              + 0.08 (tech multi-thread)
              + 0.15 (cross-board: same entity spiking on >1 board)

An article is written when article_score ≥ 0.55. A spike won't re-fire for the same story within 6 hours, and near-duplicate spikes (same theme + entity/label overlap) are suppressed.
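
A minimal sketch of the scoring formula and publish threshold follows. The function and parameter names are hypothetical. Note that the listed terms can sum above 1; whether the real code clamps the result is not stated, so this sketch leaves it unclamped.

```python
ARTICLE_THRESHOLD = 0.55


def article_score(z: float, novelty: float, entity_specificity: float,
                  sustained: bool, new: bool, tech_board: bool = False,
                  tech_multi_thread: bool = False, cross_board: bool = False) -> float:
    """Weighted sum of spike features, using the weights listed above."""
    z_norm = min(1.0, z / 6)
    score = 0.30 * z_norm + 0.30 * novelty + 0.20 * entity_specificity
    score += 0.12 if sustained else 0.0
    score += 0.08 if new else 0.0
    score += 0.14 if tech_board else 0.0          # tech-board boost
    score += 0.08 if tech_multi_thread else 0.0   # tech multi-thread
    score += 0.15 if cross_board else 0.0         # same entity spiking on >1 board
    return score


def should_write_article(score: float) -> bool:
    """An article is written when the score reaches the threshold."""
    return score >= ARTICLE_THRESHOLD
```

For example, a sustained spike with z = 6, novelty 1 and entity specificity 1 scores about 0.92 and is published, while z = 3 with novelty and specificity of 0.5 and no bonuses scores about 0.40 and is not.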

8. Signals (the ticker)

A separate, GDELT-style market read over entity × sentiment × time. Buckets are hourly for windows ≤ 48h, otherwise daily.

tone & dispersion

tone = mean(sentiments) × 100 → [-100, +100]
dispersion = stdev(sentiments) × 100 (how contested the mood is)
momentum = mean(recent third of tone) − mean(earlier tone)
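
These three quantities can be sketched as below. The names are hypothetical, and two details are assumptions because the text does not specify them: the population standard deviation is used for dispersion, and series shorter than three buckets get a momentum of 0.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def tone_and_dispersion(sentiments: Sequence[float]) -> Tuple[float, float]:
    """Return (tone, dispersion), both scaled by 100."""
    return mean(sentiments) * 100, pstdev(sentiments) * 100


def momentum(tone_series: Sequence[float]) -> float:
    """Mean tone of the most recent third minus mean tone of the earlier part."""
    n = len(tone_series)
    if n < 3:
        return 0.0
    cut = n - n // 3  # index where the recent third begins
    return mean(tone_series[cut:]) - mean(tone_series[:cut])
```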

buzz index (0–100 heat)

volume z-score (cur bucket vs prior buckets) → z_part (weight 55)
raw discussion size log10(total+1)/2.5 → size_part (weight 30)
tone momentum → mom_part (weight 15)
buzz_index = round(z_part + size_part + mom_part)

bullish / bearish classification

volume z ≥ 1.5 AND tone ≥ +8 → bullish_spike
volume z ≥ 1.5 AND tone ≤ −8 → bearish_spike
volume z ≥ 1.5 (mixed tone) → buzz_spike
momentum ≥ +6 → warming(_bullish)
momentum ≤ −6 → cooling
otherwise → quiet
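
The rule ladder above reads naturally as a chain of checks evaluated top to bottom. The function name is hypothetical, and the plain label "warming" is used for the momentum ≥ +6 case since the exact label spelling is not given.

```python
def classify_signal(volume_z: float, tone: float, momentum: float) -> str:
    """Apply the bullish/bearish rules in the order listed above."""
    if volume_z >= 1.5:
        if tone >= 8:
            return "bullish_spike"
        if tone <= -8:
            return "bearish_spike"
        return "buzz_spike"  # abnormal volume, mixed tone
    if momentum >= 6:
        return "warming"
    if momentum <= -6:
        return "cooling"
    return "quiet"
```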

Category is inferred from curated keyword sets (crypto / commodity / political / tech / company / ticker / person / other). Only crypto/ticker/company/commodity appear in the “market” scan. A confidence 0–1 rewards more posters, more time-buckets and lower dispersion.

Clicking an entity opens its mentions-over-time (bars) and sentiment-over-time (line), timescale-gated by the window buttons.

9. Topic velocity

On the Analysis tab — a concrete per-topic heat signal (more reliable than aggregate sentiment). For each top story's hourly series, the window is split 2/3 prior, 1/3 recent:

recent_rate = avg(posts/hour in the recent third)
prior_rate = avg(posts/hour in the earlier two-thirds)
accel = recent_rate / prior_rate
trend: accel ≥ 1.3 → accelerating · ≤ 0.7 → cooling · else steady
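
A small sketch of the velocity calculation, with a hypothetical function name. The handling of very short series and of a zero prior rate is an assumption, since the text does not cover those cases.

```python
from statistics import mean
from typing import Sequence, Tuple


def topic_velocity(hourly: Sequence[float]) -> Tuple[float, str]:
    """Return (accel, trend) for a story's hourly post counts, oldest first."""
    n = len(hourly)
    if n < 3:
        return 1.0, "steady"  # assumption: too short to split
    cut = (2 * n) // 3        # first two-thirds are "prior"
    prior_rate = mean(hourly[:cut])
    recent_rate = mean(hourly[cut:])
    if prior_rate == 0:
        # Assumption: activity from a silent baseline counts as unbounded acceleration.
        accel = float("inf") if recent_rate > 0 else 1.0
    else:
        accel = recent_rate / prior_rate
    if accel >= 1.3:
        return accel, "accelerating"
    if accel <= 0.7:
        return accel, "cooling"
    return accel, "steady"
```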

10. Authenticity / coordination

⚠️ Everything in this section is AI/heuristic-derived and deliberately rough — treat it as a lead to investigate, not a verdict. Target/stance labels marked ~ are keyword guesses; ✓ means an LLM reviewed them.

copypasta (repeated text)

Each post's text is normalised (collapse whitespace, casefold) and used as a cluster key. Texts shorter than 15 chars are ignored. A cluster is surfaced only if it appears in ≥2 distinct threads AND ≥3 posts. Each cluster carries every individual occurrence (board / thread / post / time) with a live link and an archive link so you can verify the post is real.
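
The clustering rule above can be sketched as a dictionary keyed by normalised text. The function names and the `(thread_id, post_id, text)` tuple shape are hypothetical; the 15-character floor and the ≥2 threads / ≥3 posts rule are the ones stated.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

MIN_CHARS = 15


def normalise(text: str) -> str:
    """Collapse runs of whitespace and casefold."""
    return " ".join(text.split()).casefold()


def copypasta_clusters(posts: Iterable[Tuple[int, int, str]]) -> Dict[str, List[Tuple[int, int]]]:
    """Group posts by normalised text; keep clusters in >=2 threads and >=3 posts."""
    clusters: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for thread_id, post_id, text in posts:
        key = normalise(text)
        if len(key) < MIN_CHARS:
            continue
        clusters[key].append((thread_id, post_id))
    return {
        key: hits for key, hits in clusters.items()
        if len(hits) >= 3 and len({thread for thread, _ in hits}) >= 2
    }
```

Exact-match keys are cheap but will miss lightly edited variants; that trade-off is inherent to the normalise-and-hash approach described.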

OP share = fraction of a cluster's posts that are thread OPs. A high OP-share means a recurring “general” thread template, not manufactured reply-sentiment — these can be excluded with the toggle.

substance score

L = min(1, len(text)/400) (length)
op = min(1, |avg sentiment|) (opinion intensity)
spread = min(1, distinct_threads/20) (reach)
substance = 0.45·L + 0.30·op + 0.25·spread
            × 0.25 if it's a template (≥3 links or ≥2 bullet glyphs)
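
A sketch of the substance score, under the reading that the ×0.25 template penalty applies to the whole weighted sum. The function name is hypothetical, and template detection is left to the caller because the text does not define which glyphs count as bullets.

```python
def substance(text: str, avg_sentiment: float, distinct_threads: int,
              is_template: bool = False) -> float:
    """Score how substantive a cluster is, in [0, 1]."""
    length = min(1.0, len(text) / 400)
    opinion = min(1.0, abs(avg_sentiment))
    spread = min(1.0, distinct_threads / 20)
    score = 0.45 * length + 0.30 * opinion + 0.25 * spread
    # Assumption: the template penalty scales the whole score.
    return score * 0.25 if is_template else score
```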

target & stance

A cluster is matched to a narrative target (Israel, Iran, Trump, China…) via keyword sets, then a stance:

more negative-lexicon hits than positive → negative
more positive than negative → positive
tie → break by VADER sentiment (<−0.1 neg, >+0.1 pos, else mixed)
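
The stance rule reduces to a comparison with a sentiment tie-break. The function name and arguments are hypothetical; the lexicon hit counts are assumed to be computed elsewhere.

```python
def stance(negative_hits: int, positive_hits: int, compound: float) -> str:
    """Pick a stance from lexicon hit counts, breaking ties with VADER compound."""
    if negative_hits > positive_hits:
        return "negative"
    if positive_hits > negative_hits:
        return "positive"
    if compound < -0.1:
        return "negative"
    if compound > 0.1:
        return "positive"
    return "mixed"
```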

Short keywords use word-boundary matching, and a false-context filter suppresses geographic false-positives (e.g. “British Indian Ocean Territory” does not tag a post as pro-Indian). When enabled, an LLM re-reads the top clusters and overrides the heuristic (clearing the ~ flag).

link spam

External URLs (4chan renders them as plain text, often split by <wbr>) are extracted and counted. A link is flagged “spam” if it appears in ≥3 posts; the table shows how many distinct threads it spanned.
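
A rough sketch of the link-spam count follows. The URL regex is a deliberately simple assumption, not the project's actual pattern, and the function name and tuple shape are hypothetical. Each post counts a given URL once.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Tuple

WBR_RE = re.compile(r"<wbr\s*/?>", re.IGNORECASE)
URL_RE = re.compile(r"https?://[^\s<>\"']+")  # simplistic; an assumption


def spam_links(posts: Iterable[Tuple[int, int, str]], min_posts: int = 3) -> Dict[str, Dict[str, int]]:
    """Return {url: {"posts": n, "threads": m}} for URLs seen in >= min_posts posts."""
    seen = defaultdict(set)  # url -> {(thread_id, post_id)}
    for thread_id, post_id, html in posts:
        text = WBR_RE.sub("", html)  # rejoin URLs split by <wbr>
        for url in set(URL_RE.findall(text)):
            seen[url].add((thread_id, post_id))
    return {
        url: {"posts": len(hits), "threads": len({t for t, _ in hits})}
        for url, hits in seen.items()
        if len(hits) >= min_posts
    }
```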

influence campaigns

Substantive, non-template clusters (substance ≥ 0.18) that share a (target, stance) are grouped into a campaign. distinct_messages = how many differently-worded clusters push the same agenda; the timeline shows posts-per-hour for that agenda.

Summary metrics

Metric	Definition
`dup_rate_pct`	duplicate posts ÷ total posts × 100
`unique_texts`	distinct normalised texts (≥15 chars)
`named_operators`	posts with a non-anonymous name or a tripcode
`spam_links`	URLs posted in ≥3 posts

11. Moderation

When a post seen on a previous fetch is absent on the next fetch (and hasn't reappeared), it's recorded as deleted, with its lifespan (time from first-seen to deletion). The Recent-deletions feed dedups identical removed text into one row carrying a ×N count and how many threads it spanned, so a copypasta deleted across many threads doesn't flood the list. deletion_rate_pct = deletions ÷ posts × 100.
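
The deletion check is essentially a set difference between two fetches of the same thread. This sketch uses hypothetical names and numeric timestamps, and it does not model the "hasn't reappeared" grace check, which would need state across more than two fetches.

```python
from typing import Dict, Iterable


def detect_deletions(previous_ids: Iterable[int], current_ids: Iterable[int],
                     first_seen: Dict[int, float], now: float) -> Dict[int, float]:
    """Return {post_id: lifespan} for posts present before and absent now."""
    missing = set(previous_ids) - set(current_ids)
    return {post_id: now - first_seen[post_id] for post_id in missing}


def deletion_rate_pct(deletions: int, posts: int) -> float:
    """deletions ÷ posts × 100, or 0 when there are no posts."""
    return 100.0 * deletions / posts if posts else 0.0
```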

12. Article generation

When a story spikes, an article is written by the standalone magazine generator (model: Claude Opus). The ordered pipeline:

Fact verification FIRST — a live web search (Anthropic web_search tool) establishes what's actually confirmed in the real world, before anything is written. This is why the system states confirmed events as fact instead of calling real news “unverified”.
Verified sources — candidate URLs (the web-search news first, then thread links, library reuse, model-proposed) are each fetched and checked to really exist; only confirmed-real ones can be cited.
Factbrief — the shared factual core (who/what/when/where/how) grounded in those sources, plus a key_terms glossary explaining jargon (what “SillyTavern”, “GLM”, “AUR” actually are).
Three perspectives on the same facts: Critical (“The Prosecutor”), Neutral (“The Analyst”), Supportive (“The Advocate”). If any perspective comes back as a refusal/error, the whole article is rejected (never published broken).
Cui bono — beneficiaries, losers, rivalry & conflicts of interest (who is harmed and which competitor gains, and any decision-maker's stake in that rival), ramification chains, plus an intentional vs structural reading.
Hero image — an LLM first turns the facts into a concrete, literal real-world scene (so the picture is on-topic, not a fantasy interpretation of a product name), which is rendered by flux-dev on Replicate. Images are reused, never regenerated for an existing story.

Articles can also be commissioned manually on a chosen topic, which runs the same fact-checked pipeline.

lead-story selection

The featured “Top story” is chosen purely by live activity (re-ranked every build), never an editorial pick:

selection = 0.34·novelty + 0.26·traction(z/6) + 0.18·recency(decays over 72h) + 0.12·volume(count/40) + 0.10·article_score + 0.12·(cross-board?)
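
The ranking formula could be sketched as below. Several details are assumptions not stated in the text: recency is modelled as a linear decay from 1 to 0 over 72 hours, and traction and volume are capped to [0, 1]. All names are hypothetical. As written, the weights sum to 1.12 with the cross-board bonus.

```python
def lead_story_score(novelty: float, z: float, age_hours: float, count: int,
                     article_score: float, cross_board: bool) -> float:
    """Rank a story for the featured "Top story" slot."""
    traction = min(1.0, max(0.0, z / 6))
    recency = max(0.0, 1.0 - age_hours / 72)   # assumption: linear decay
    volume = min(1.0, count / 40)
    return (0.34 * novelty + 0.26 * traction + 0.18 * recency
            + 0.12 * volume + 0.10 * article_score
            + (0.12 if cross_board else 0.0))
```

Re-running this over all candidate stories at each build and taking the maximum would give the "re-ranked every build" behaviour described.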

13. Tech stack

The tools behind chanalyse:

Layer	Tech
Language	Python
Web API + pages	FastAPI + Jinja2
Database	SQLAlchemy over SQLite
Sentiment	VADER
Charts	Chart.js
Classification	Claude (with a heuristic fallback)
Article writing	Claude + a live web-search tool for fact-checking
Hero images	Replicate `flux-dev`
Hosting	Cloudflare Pages (a static export of the dashboard)

The collection/analysis runs continuously on one machine; the public site is a separate static snapshot, so the live dashboard never exposes the engine directly. The data is durable (the raw discussion is kept and the database is backed up), and the methods are deliberately shown openly so the output can be judged.

14. All constants (cheat sheet)

Constant	Value
Spike z-threshold (default / tech)	2.0 / 1.0
Min spike mentions (default / tech)	3 / 2
New-story min count (default / tech)	4 / 2
Spike baseline window	7 days
Min prior hours for a real z-score	6
Article-score weights	z .30, novelty .30, entity .20, sustained .12, new .08, cross .15, tech .14, tech multi-thread .08
Article publish threshold	0.55
Spike re-fire window	6 hours
Signals “abnormal” z	1.5
Signals bullish/bearish tone gate	±8
Signals momentum gate	±6
Buzz-index weights	z 55, size 30, momentum 15
Story / theme match	0.34 / 0.30
Story consolidation (merge)	0.55
Similarity blend	0.45 token + 0.40 entity + 0.15 fuzzy
Copypasta min length / cluster emit	15 chars / ≥2 threads & ≥3 posts
Substance weights	0.45 length + 0.30 opinion + 0.25 spread (×0.25 template)
Link-spam threshold	≥3 posts
Velocity window split / bands	2/3 vs 1/3 · accel ≥1.3 / ≤0.7
Lead-story weights	novelty .34, traction .26, recency .18, volume .12, article_score .10, cross .12
VADER sentiment label	±0.05
Classify batch size	20 threads

15. Caveats & honesty

Sentiment is noisy. VADER is lexicon-based and misreads irony, slang and 4chan in-jokes. Treat tone as a rough signal.
Target/stance labels are heuristic. They can mislabel subtle, ironic or geographic text; the Authenticity disclaimer says so, and an LLM pass corrects the surfaced ones where enabled.
Repetition ≠ payment. Copypasta and link-spam are coordination signals; generals reuse boilerplate and memes spread organically. Investigate, don't conclude.
The dashboard is a ~3-hour snapshot. The engine collects continuously, but the public view refreshes each publish.
This is imageboard data. It is unverified, manipulable, and explicitly not financial advice.

chanalyse is an experiment in reading the live pulse of anonymous forums and surfacing stories before the mainstream — with the methods shown openly so you can judge them yourself.