Where the data comes from

HanziDB is assembled from established, openly published Chinese-language datasets. This page explains those sources, what makes a character or word appear, and how everything is ranked.

what's included

What's in the database

Two tables sit at the core: characters and words. Every entry earns its page by appearing in at least one trusted list. Nothing is invented to fill gaps.

Characters · about 4,800

The characters come from Taiwan's standard common-character list, the Ministry of Education's 常用國字標準字體表 (~4,800 traditional-script characters). A character earns its page by appearing on this list.

Words

Every word in Taiwan's official TOCFL vocabulary (華語八千詞), plus the most frequent multi-character words from the COCT traditional-script corpus. A word appears if it's a TOCFL word or common enough in the COCT corpus.
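The inclusion rule for words can be sketched roughly as follows. This is an illustration only, not HanziDB's actual code: the function name `select_words`, the `top_n` cutoff, and the sample data are all hypothetical, and the page does not say where "common enough" is drawn.

```python
def select_words(tocfl_words, coct_by_frequency, top_n):
    """Return the set of words that would get a page.

    tocfl_words: iterable of TOCFL vocabulary items.
    coct_by_frequency: COCT words ordered from most to least frequent.
    top_n: hypothetical cutoff for "common enough" (not stated by the source).
    """
    # Only multi-character COCT words qualify through the frequency route.
    multi_char = [w for w in coct_by_frequency if len(w) > 1]
    # A word is included if it is a TOCFL word OR among the top multi-character words.
    return set(tocfl_words) | set(multi_char[:top_n])


# Made-up sample data for illustration.
tocfl = ["我", "學生"]
coct = ["的", "我們", "一", "臺灣", "因為"]
selected = select_words(tocfl, coct, top_n=2)
```

With this sample, single-character TOCFL entries such as 我 are kept because they are on the TOCFL list, while single-character COCT entries such as 的 are not added by the frequency route.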

sources

Where each piece comes from

Each kind of information has its own source.

Frequency ranks: Taiwan usage frequency from the COCT 通用詞頻表, the word-frequency list of a traditional-script corpus from Taiwan's National Academy for Educational Research (國教院). Word frequencies are tallied down to the characters that make them up.
Character facts: stroke count and Kangxi radical from Unihan, the Unicode Han database.
Readings: Taiwan-standard pinyin and bopomofo from Moedict (萌典); additional pinyin from an open hanzi compilation.
Proficiency: TOCFL (華語八千詞) levels — Taiwan's official 8,000-word graded vocabulary, from the official list published by the Ministry of Education.
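As an illustration of the character-facts source, here is a minimal sketch of reading stroke count and radical number from Unihan-style data lines. It assumes the tab-separated `U+XXXX<TAB>field<TAB>value` layout of the Unihan data files and the `kTotalStrokes` and `kRSUnicode` fields; the function name and the output shape are hypothetical, and HanziDB's own import step may work differently.

```python
import re

# kRSUnicode values look like "2.3" (radical 2, 3 additional strokes);
# apostrophes after the radical number mark variant radical forms.
RS_PATTERN = re.compile(r"^(\d+)'*\.(-?\d+)$")


def parse_unihan_lines(lines):
    """Collect stroke counts and Kangxi radical numbers per character."""
    facts = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        codepoint, field, value = line.split("\t", 2)
        char = chr(int(codepoint[2:], 16))  # "U+4E2D" -> "中"
        if field == "kTotalStrokes":
            # The field may list more than one value, so keep them all.
            facts.setdefault(char, {})["strokes"] = [int(v) for v in value.split()]
        elif field == "kRSUnicode":
            match = RS_PATTERN.match(value.split()[0])
            if match:
                facts.setdefault(char, {})["radical"] = int(match.group(1))
    return facts


sample = [
    "# sample lines in Unihan layout",
    "U+4E2D\tkRSUnicode\t2.3",
    "U+4E2D\tkTotalStrokes\t4",
]
facts = parse_unihan_lines(sample)
```

For the sample lines, 中 ends up with radical number 2 and a total stroke count of 4.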

methodology

How we rank

Ranks reflect real usage frequency — how often a character or word actually appears in a large body of text — not stroke order, teaching order, or any official notion of importance.
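Turning raw counts into ranks can be sketched like this. The function name, the sample counts, and the tie-breaking rule (code point order) are assumptions made for the example; the page does not describe how ties are handled.

```python
def rank_by_frequency(counts):
    """Map each item to its rank, where rank 1 is the most frequent item."""
    # Sort by descending count; break ties by the item itself (an assumption).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {item: rank for rank, (item, _count) in enumerate(ordered, start=1)}


# Made-up counts for illustration.
counts = {"的": 100, "是": 60, "我": 80}
ranks = rank_by_frequency(counts)
```

Here 的 receives rank 1, 我 rank 2, and 是 rank 3, purely because of how often each appears in the sample counts.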

Taiwan's ranks come from a Taiwan-native corpus. Word frequencies from COCT are tallied down to the characters that make them up; the ranks are never translated or borrowed from other regional data.
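One plausible reading of "tallied down to the characters" is that every character in a word is credited with that word's count. The sketch below follows that reading; the function name and the sample counts are hypothetical, and details such as how a character repeated within one word is counted are assumptions rather than documented behavior.

```python
from collections import Counter


def tally_characters(word_counts):
    """Sum word frequencies into per-character frequencies."""
    char_counts = Counter()
    for word, count in word_counts.items():
        for ch in word:
            # Each occurrence of a character in the word adds the word's count.
            char_counts[ch] += count
    return char_counts


# Made-up word counts for illustration.
word_counts = {"學生": 50, "學校": 30, "生": 20}
char_counts = tally_characters(word_counts)
```

In this sample 學 totals 80 (50 from 學生 plus 30 from 學校), 生 totals 70, and 校 totals 30.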