HanziDB

Where the data comes from

HanziDB is assembled from established, openly published Chinese-language datasets. This page explains those sources, what makes a character or word appear, and how everything is ranked.

what's included

What's in the database

Two tables sit at the core — characters and words. Each one earns its page by appearing in at least one trusted list. Nothing is invented to fill gaps.

Characters · about 4,800

The character table follows Taiwan's standard common-character list, the Ministry of Education's 常用國字標準字體表 (about 4,800 traditional-script characters). A character gets a page only if it is on this list.

Words

Every word in Taiwan's official TOCFL vocabulary (華語八千詞), plus the most frequent multi-character words from the COCT traditional-script corpus. A word appears if it's a TOCFL word or common enough in the COCT corpus.
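
The inclusion rule above can be sketched as a small predicate. This is only an illustration: the function name, the rank lookup, and the `max_coct_rank` cutoff are hypothetical, since the page does not state how frequent a COCT word must be to qualify.

```python
def include_word(
    word: str,
    tocfl_words: set[str],
    coct_rank: dict[str, int],
    max_coct_rank: int,
) -> bool:
    """Return True if a word would get a page under the rule sketched here."""
    # Route 1: every TOCFL word is included, whatever its length.
    if word in tocfl_words:
        return True
    # Route 2: a multi-character word that is frequent enough in COCT.
    # The cutoff is an assumption; the real threshold is not published here.
    rank = coct_rank.get(word)
    return len(word) > 1 and rank is not None and rank <= max_coct_rank
```

A word missing from both sources is rejected, which matches the statement that nothing is invented to fill gaps.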

sources

Where each piece comes from

Each kind of information has its own source.

Frequency ranks: Taiwan usage frequency from the COCT 通用詞頻表, the word-frequency list of a traditional-script corpus from Taiwan's National Academy for Educational Research (國教院). Character frequencies are derived from it by crediting each word's count to the characters that make it up.
Character facts: stroke count and Kangxi radical from Unihan, the Unicode character database.
Readings: Taiwan-standard pinyin and bopomofo from Moedict (萌典); additional pinyin from an open hanzi compilation.
Proficiency: TOCFL (華語八千詞) levels, taken from Taiwan's official 8,000-word graded vocabulary list published by the Ministry of Education.

methodology

How we rank

Ranks reflect real usage frequency — how often a character or word actually appears in a large body of text — not stroke order, teaching order, or any official notion of importance.
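
As a rough sketch of what ranking by usage frequency means, the function below turns raw counts into ranks, with rank 1 going to the most frequent item. The function name and the tie-breaking rule (code point order) are assumptions made for this example; the page does not say how ties are resolved.

```python
def rank_by_frequency(counts: dict[str, int]) -> dict[str, int]:
    """Map each item to its 1-based rank, most frequent first."""
    # Sort by descending count; break ties by the item itself so the
    # result is deterministic (an assumption, not a documented rule).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {item: pos for pos, (item, _) in enumerate(ordered, start=1)}
```

Nothing besides the counts enters the ordering, so stroke data and teaching levels cannot affect a rank.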

Taiwan's ranks come from a Taiwan-native corpus. Character frequencies are derived from COCT word frequencies by crediting each word's count to the characters it contains; they are never converted or borrowed from another region's data.
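
The step from word frequencies to character frequencies might look like the sketch below. It assumes each occurrence of a character inside a word receives that word's full count, so a doubled character is credited twice; the page does not spell out this detail. The function name and the sample counts are made up for illustration.

```python
from collections import Counter


def tally_characters(word_counts: dict[str, int]) -> Counter:
    """Credit every word's count to each character occurrence in it."""
    char_counts: Counter = Counter()
    for word, count in word_counts.items():
        for ch in word:
            # Assumption: a character repeated in a word counts each time.
            char_counts[ch] += count
    return char_counts


# Illustrative counts only, not real COCT figures.
sample = tally_characters({"台灣": 10, "台北": 5, "爸爸": 3})
```

With these sample counts, 台 collects 15 (10 from 台灣 plus 5 from 台北) and 爸 collects 6.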