Where the data comes from
HanziDB is assembled from established, openly published Chinese-language datasets. This page explains those sources, what makes a character or word appear, and how everything is ranked.
What's in the database
Two tables sit at the core: characters and words. Every entry earns its page by appearing in at least one trusted list; nothing is invented to fill gaps.
Characters: every character on Taiwan's standard common-character list, the Ministry of Education's 常用國字標準字體表 (~4,800 traditional-script characters). A character earns its page by appearing on this list.
Words: every word in Taiwan's official TOCFL vocabulary (華語八千詞), plus the most frequent multi-character words from the COCT traditional-script corpus. A word earns its page if it is a TOCFL word or is common enough in the COCT corpus.
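The inclusion rules above can be sketched in a few lines. This is only an illustration, not HanziDB's actual code: the function names, the `max_coct_rank` cutoff, and the sample ranks are all hypothetical, since this page does not say how "common enough" is defined.

```python
def include_character(char, moe_common_chars):
    """A character gets a page only if it is on the MOE common-character list."""
    return char in moe_common_chars


def include_word(word, tocfl_words, coct_rank, max_coct_rank=20000):
    """A word gets a page if it is a TOCFL word, or if it is a
    multi-character word ranked high enough in the COCT corpus.

    max_coct_rank is a hypothetical cutoff chosen for this sketch.
    """
    if word in tocfl_words:
        return True
    rank = coct_rank.get(word)
    return len(word) > 1 and rank is not None and rank <= max_coct_rank


# Made-up sample data for illustration only; these are not real COCT ranks.
tocfl_words = {"學生", "喜歡"}
coct_rank = {"學生": 150, "喜歡": 210, "因此": 320, "饕餮": 45000}

for candidate in ["學生", "因此", "饕餮", "某詞"]:
    print(candidate, include_word(candidate, tocfl_words, coct_rank))
```

A word that is in neither source is simply left out, which matches the rule that nothing is invented to fill gaps.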
Where each piece comes from
Each kind of information has its own source.
- Frequency ranks: Taiwan frequency from the COCT 通用詞頻表, a traditional-script corpus from Taiwan's National Academy for Educational Research (國教院). Each word's frequency is tallied onto the characters that make it up.
- Character facts: stroke count and Kangxi radical from Unihan, the Unicode database.
- Readings: Taiwan-standard pinyin and bopomofo from Moedict (萌典); additional pinyin from an open hanzi compilation.
- Proficiency: TOCFL (華語八千詞) levels, taken from Taiwan's official 8,000-word graded vocabulary as published by the Ministry of Education.
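As an illustration of the character-facts source, the sketch below reads stroke count and Kangxi radical from lines in the tab-separated format used by the Unihan data files (`kTotalStrokes` and `kRSUnicode` are real Unihan fields). The `parse_unihan` function and the output key names are assumptions made for this example; how HanziDB actually loads Unihan is not described here.

```python
# Two sample lines for 好 (U+597D) in Unihan's "codepoint<TAB>field<TAB>value" format.
SAMPLE_UNIHAN = """\
# Lines starting with '#' are comments in Unihan data files.
U+597D\tkRSUnicode\t38.3
U+597D\tkTotalStrokes\t6
"""


def parse_unihan(text):
    """Return {character: {"strokes": int, "kangxi_radical": int}}."""
    facts = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        codepoint, field, value = line.split("\t", 2)
        char = chr(int(codepoint[2:], 16))  # "U+597D" -> "好"
        entry = facts.setdefault(char, {})
        if field == "kTotalStrokes":
            # The field may list several values; take the first one.
            entry["strokes"] = int(value.split()[0])
        elif field == "kRSUnicode":
            # "38.3" means Kangxi radical 38 plus 3 residual strokes.
            radical = value.split()[0].partition(".")[0]
            # Apostrophes mark variant radical forms; drop them.
            entry["kangxi_radical"] = int(radical.rstrip("'"))
    return facts


print(parse_unihan(SAMPLE_UNIHAN))
```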
How we rank
Ranks reflect real usage frequency, meaning how often a character or word actually appears in a large body of text. They do not reflect stroke order, teaching order, or any official notion of importance.
Taiwan's ranks come from a Taiwan-native corpus. COCT word frequencies are tallied onto the characters that make up each word, and the resulting ranks are never translated or borrowed from other regional data.
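A minimal sketch of the tallying and ranking described above, under stated assumptions: the word counts are made-up numbers rather than real COCT data, a character that occurs twice in a word is counted twice, and ties are broken by code point. None of those details is specified on this page, and the function names are hypothetical.

```python
from collections import Counter


def character_counts(word_counts):
    """Tally each word's count onto every character it contains."""
    totals = Counter()
    for word, count in word_counts.items():
        for char in word:
            totals[char] += count
    return totals


def rank_by_frequency(counts):
    """Map each key to its rank, where 1 is the most frequent."""
    # Sort by descending count; break ties by code point (an assumption).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {key: position for position, (key, _) in enumerate(ordered, start=1)}


# Made-up word counts for illustration only.
word_counts = {"學生": 50, "學校": 30, "生活": 20}

char_totals = character_counts(word_counts)
print(dict(char_totals))               # 學: 80, 生: 70, 校: 30, 活: 20
print(rank_by_frequency(char_totals))  # 學: 1, 生: 2, 校: 3, 活: 4
```

Here 學 ends up with 80 because it inherits 50 from 學生 and 30 from 學校, so a character's rank depends only on the words it appears in within the same corpus.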