15 March 2015

Unicode Han Database (Unihan)

CEDICT

CEDICT, CEDICT Format, Hanzi to Pinyin Online Tool

Chinese to Pinyin

Cjklib

Make sure SQLite 3+ and SQLAlchemy 0.5+ are installed.

The following code:

# -*- coding: utf-8 -*-
import cjklib
from cjklib.characterlookup import CharacterLookup

c = u'好'

cjk = CharacterLookup('T')
readings = cjk.getReadingForCharacter(c, 'Pinyin')
for r in readings:
    print r

produces:

hāo
hǎo
hào

The stroke order of a character can be looked up as well:

>>> from cjklib import characterlookup
>>> cjk = characterlookup.CharacterLookup('C')
>>> cjk.getStrokeOrder(u'说')
[u'\u31d4', u'\u31ca', u'\u31d4', u'\u31d2', u'\u31d1', u'\u31d5', u'\u31d0', u'\u31d3', u'\u31df']

The code under "Access a dictionary" on the PyPI page does not work:

>>> from cjklib.dictionary import EDICT
>>> d = EDICT()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/cjklib-0.3.2-py2.7.egg/cjklib/dictionary/__init__.py", line 271, in __init__
    % self.DICTIONARY_TABLE)

Stack Overflow has a solution:

>>> from cjklib.dictionary import CEDICT
>>> from cjklib.dbconnector import getDBConnector
>>> db = getDBConnector({'sqlalchemy.url': 'sqlite://', 'attach': ['cjklib']})
>>> d = CEDICT(dbConnectInst=db)
>>> it = d.getFor(u'朋友')
>>> it
<itertools.imap object at 0x7fa2d30cb050>
>>> for x in it:
...   print x
... 
EntryTuple(HeadwordTraditional=u'\u670b\u53cb', 
HeadwordSimplified=u'\u670b\u53cb', Reading=u'p\xe9ng you', 
Translation=u'/friend/CL:\u500b|\u4e2a[ge4],\u4f4d[wei4]/')

Encoding Conversion

JSON uses UTF-8 by default. A code point in the range U+0000 to U+FFFF can be written as a single \uXXXX escape. A code point in the range U+10000 to U+10FFFF needs two escapes, one for each half of its UTF-16 surrogate pair.
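The escaping rule can be checked with a small sketch using Python's standard json module. This assumes Python 3 (unlike the Python 2 sessions earlier), and json_escape is just a helper name made up for illustration.

```python
import json

def json_escape(text):
    """Return the JSON string literal for text, escaping all non-ASCII."""
    return json.dumps(text, ensure_ascii=True)

bmp = json_escape("\u597d")          # 好, U+597D, inside U+0000..U+FFFF
astral = json_escape("\U0001F602")   # 😂, U+1F602, above U+FFFF

print(bmp)     # "\u597d"        (one escape)
print(astral)  # "\ud83d\ude02"  (two escapes, a surrogate pair)

# Decoding the two escapes gives back a single code point.
decoded = json.loads('"\\ud83d\\ude02"')
print(len(decoded), hex(ord(decoded)))  # 1 0x1f602
```

Python writes the hex digits in lowercase; JSON accepts either case.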

Face with tears of joy: 😂
Unicode code point: U+1F602
UTF-16:
  encoding: D83D DE02
  escapes: \uD83D\uDE02
  shown in xxd (little-endian): 3DD8 02DE
  shown in xxd with a BOM: FFFE 3DD8 02DE (the BOM is U+FEFF, stored as bytes FF FE in little-endian order)
UTF-8:
  encoding: F09F9882
  shown with xxd -u:
    0000000: F09F 9882
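
The byte values above can be reproduced with a minimal Python 3 sketch. The helper names hexdump and surrogate_pair are made up for this example; the codec names are the standard ones shipped with Python.

```python
def hexdump(data):
    """Return the bytes as an uppercase hex string, like xxd -u -p."""
    return data.hex().upper()

def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high/low surrogates."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

ch = "\U0001F602"
high, low = surrogate_pair(ord(ch))
print("%04X %04X" % (high, low))           # D83D DE02

utf16_be = hexdump(ch.encode("utf-16-be"))
utf16_le = hexdump(ch.encode("utf-16-le"))
bom_le = hexdump("\ufeff".encode("utf-16-le"))
utf8 = hexdump(ch.encode("utf-8"))

print(utf16_be)  # D83DDE02
print(utf16_le)  # 3DD802DE  (byte order seen in the xxd dump)
print(bom_le)    # FFFE      (U+FEFF written little-endian)
print(utf8)      # F09F9882
```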