-------------------------------------------------------------------------------
Hints and Tips for handling Unicode (UTF-8).

Also see "locale.hints" in this same directory

Remember, a plain ASCII file (no bytes with the high bit set) is already valid
UTF-8

-------------------------------------------------------------------------------
Testing Unicode

You can test whether a file is valid UTF-8 by running it through the "iconv"
program.  For example...

  iconv -f UTF-8 -t UTF-16  unicode_file > /dev/null  &&  echo "Valid UTF-8"

NOTE: "recode" can be used instead of iconv
    recode utf8..utf16

BUT these just stop with a character offset, which is not typically useful in
fixing a problem (such as a file that "gedit" will not load)


This reports the actual line and column of the problem (non-UTF-8) character
("isutf8" is part of the "moreutils" package)

  isutf8 unicode_file
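
If "isutf8" is not installed, the iconv test above can be run one line at a
time to at least get line numbers.  This is only a rough sketch: the function
name "bad_utf8_lines" is made up for this example, and it is slow on large
files as it starts one iconv per line.

```shell
# Print the number of every line in a file that is not valid UTF-8.
# "bad_utf8_lines" is a made-up helper name; it needs only "iconv".
bad_utf8_lines() {
  local n=0 line
  while IFS= read -r line || [ -n "$line" ]; do
    n=$((n + 1))
    # iconv fails (non-zero exit) on the first invalid byte sequence
    printf '%s\n' "$line" | iconv -f UTF-8 -t UTF-16 >/dev/null 2>&1 ||
      echo "$n"
  done < "$1"
}
```

Usage would be something like   bad_utf8_lines unicode_file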

-------------------------------------------------------------------------------
Looking at Code

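To print the 16 bit codes of a UTF-16 file use "od -t x2"
(od reads the words in the byte order of the machine it runs on)

   od -t x2 unicode.txt

For UTF-8, convert it to UTF-16 first
  iconv -f utf8 -t utf16 unicode.txt | od -t x2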
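
The UTF-16 dump above shows a byte order mark and depends on machine byte
order.  A possible alternative is to go through UTF-32BE, which has a fixed
byte order and one code per character.  The helper name "codepoints" is an
invention for this sketch, and it assumes an iconv that knows "UTF-32BE".

```shell
# List the code points of UTF-8 text read from standard input.
# "codepoints" is a made-up helper name.
codepoints() {
  iconv -f UTF-8 -t UTF-32BE |
    od -An -v -t x1 |
    awk '{ # every 4 bytes is one code point, the first byte is always 00
           for (i = 1; i + 3 <= NF; i += 4)
             print "U+" toupper($(i+1) $(i+2) $(i+3)) }' |
    sed 's/^U+00/U+/'      # trim to 4 digits for the common case
}
```

For example   codepoints < unicode.txt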

-------------------------------------------------------------------------------
Bash and Unicode

   echo $'\u00b6 \u2020 \u2021 a\u00b3 \u2192 \u221e \u275d \u263a \u275e'
   ¶ † ‡ a³ → ∞ ❝ ☺ ❞

But you can give the individual UTF-8 bytes in hex too

   echo $'\xE2\x98\xB0'
   ☰

Store them in an associative array with their names for readability
  unset u; declare -A u    # declare it as an associative array
  u[script_E]='\u2130'
  u[Rx]='\u211E'
  u[trademark]='\u2122'
  echo -e "${u[script_E]}  ${u[Rx]}  ${u[trademark]}"
  ℰ  ℞  ™

-------------------------------------------------------------------------------
VIM and Unicode

In VIM, type Ctrl-V u followed by a hexadecimal number. Example: Ctrl-V u 20ac

To see the actual character encoding in the file use..
  ga   for the character's code value  (and what to type to get that character)
  g8   for the hexadecimal bytes of the UTF-8 encoding (but not the code point)

If you are using a UTF-8 font you can force UTF-8 on the file using
  :set encoding=utf-8 fileencodings=
But you will need a terminal that understands UTF-8  (or gvim)

You can type a unicode character (in input mode) using
   <c-v>{decimal}
or <c-v>u{4 digit hex code}

For example  <c-v>160  or  <c-v>u00A0  types a non-break space --> <--

NOTE there are two no-break spaces..
    U+00A0    no-break space (160 decimal), UTF-8 bytes  c2 a0
    U+202F    narrow no-break space, UTF-8 bytes  e2 80 af
and similar looking spaces that do allow a line break..
    U+2005    four-per-em space
    U+2009    thin space
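
As these spaces look just like a normal space, it can help to search a file
for their UTF-8 bytes directly.  A small sketch, assuming GNU grep (for the
"-a" option); the function name "find_nbsp" is made up.

```shell
# Show the lines (with line numbers) that contain a no-break space.
# "find_nbsp" is a made-up helper name; "-a" is a GNU grep option.
find_nbsp() {
  local nbsp nnbsp
  nbsp=$(printf '\302\240')        # U+00A0  no-break space         (c2 a0)
  nnbsp=$(printf '\342\200\257')   # U+202F  narrow no-break space  (e2 80 af)
  LC_ALL=C grep -a -n -e "$nbsp" -e "$nnbsp" "$@"
}
```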

You can also set the current locale to do this automatically

How else?

-------------------------------------------------------------------------------
Interactive Character finder

Gnome Character Map

   gucharmap

But remember, if a specific font does not have a character it will
display the character from a different font!

-------------------------------------------------------------------------------
Unicode Display

You can display UTF-8 output (say from the perl examples below)
using a Gnome Terminal, if you first set...
  Terminal -> Set Character Encoding -> Unicode (UTF-8)

XTerm should be able to handle it but you need to set a locale during login.

-------------------------------------------------------------------------------
X window Selections store UTF-8 as "\u" encoded strings

   xselection PRIMARY
   \u6d4b\u8bd5\u7528\u7684\u6c49\u5b57

To return it to utf-8 use...

   env LC_CTYPE=en_AU.utf8 printf `xselection PRIMARY`'\n'
or env LC_CTYPE=en_AU.utf8 printf '\u6d4b\u8bd5\u7528\u7684\u6c49\u5b57\n'


-------------------------------------------------------------------------------
Perl and Unicode

chr() will convert a unicode code point into a character, which is then
written out as UTF-8.  However a warning about 'wide' characters may also be
generated unless prevented by output settings.

   perl -e 'binmode(STDOUT, ":utf8"); print chr(0x015C)' | od -t x1
   0000000 c5 9c
   0000002
Or use the -C option to set the input and output encodings (-CO makes STDOUT UTF-8)
   perl -CO -e 'print "\x{6d4b}\x{8bd5}\x{7528}\x{7684}\x{6c49}\x{5b57}\n";'

   perl -CO -e \
     'print pack("U*", 0x6d4b, 0x8bd5, 0x7528, 0x7684, 0x6c49, 0x5b57), "\n";'

Convert UTF-16 (big-endian) to UTF-8
   utf-16_source | perl -CO -ne 'print pack("U*",unpack("n*", $_)), "\n"'

Convert UTF-8 to UTF-16
   utf-8_source | iconv -f UTF-8 -t UTF-16 - | od -t x1

NOTE: "recode" can be used instead of iconv
    recode utf8..utf16

Char Conversion in perl...
   use Encode;
   $text = 'Текст кириллица';    # assumed to be held as cp1251 bytes
   $text = encode("utf8", decode("cp1251", $text));
   print "$text\n";
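
The same conversion can probably be done from the shell with iconv, as used
earlier.  A minimal sketch; the wrapper name "cp1251_to_utf8" and the file
names in the usage line are made up.

```shell
# Convert cp1251 (Windows Cyrillic) text to UTF-8.
# Reads the named files, or standard input if none are given.
# "cp1251_to_utf8" is a made-up helper name.
cp1251_to_utf8() {
  iconv -f CP1251 -t UTF-8 "$@"
}
```

For example   cp1251_to_utf8 old.txt > new.txt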

-------------------------------------------------------------------------------
X windows and Unicode


If you ever get a message about
  Warning: Missing charsets in String to FontSet conversion
  Warning: Unable to load any usable fontset
then that application uses Xaw widgets, which do not handle unicode fonts.
The solution is to set the environment variable "LANG=C" before running it.
Example applications include  "xmessage"


To add unicode keys to your X window keyboard
add the following to an xmodmap file like  ".Xmodmap"

! Unicode modifications
! The first line sets the key Alt-Right as the 3rd & 4th 'ModeSwitch' control
! See  http://www.cl.cam.ac.uk/~mgk25/unicode.html#input
!
! NOTEs: Right-Alt and these keys produce...
!   []     typographic single quotes
!   {}     typographic double quotes
!   23     superscript 2 and 3
!   d      degree symbol
!   -nm    minus sign (U+2212), n-dash, m-dash
!   M      micro symbol
!   *      multiply
!   /      divide
!   $      euro symbol
!  space   no-break or shifted space
!
keycode 113 = Mode_switch Mode_switch
keysym    d     =    d      NoSymbol    degree         NoSymbol
keysym    m     =    m      NoSymbol    emdash         mu
keysym    n     =    n      NoSymbol    endash         NoSymbol
keysym    2     =    2      quotedbl    twosuperior    NoSymbol
keysym    3     =    3      numbersign  threesuperior  NoSymbol
keysym    4     =    4      dollar      EuroSign       NoSymbol
keysym  space   =  space    NoSymbol    nobreakspace   NoSymbol
keysym  minus   =  minus    underscore  U2212          NoSymbol
keysym  slash   =  slash    NoSymbol    division       NoSymbol
keysym asterisk = asterisk  NoSymbol    multiply       NoSymbol
keycode  34  = bracketleft  braceleft  leftsinglequotemark  leftdoublequotemark
keycode  35  = bracketright braceright rightsinglequotemark rightdoublequotemark

-------------------------------------------------------------------------------
UTF-8 Encoding...

For a summary of how UTF-8 came to be (which made unicode practical in modern
systems), see
  http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

It was invented by Ken Thompson and Rob Pike, September 1992
and immediately put to use in Plan-9 and IBM X/Open

It also gives the actual original proposal, and the original UTF-8 to UTF-16
translation subroutines, though the encoding there is very slightly different,
a change made so that it is slightly easier to determine where you are in a
sequence.
It also can be extended a little more, though there is not much need for it.


See RFC3629

 * A byte without the high bit set is the same in ASCII and UTF-8
   EG:   5A -> 5A     (uppercase Z)

 * Every byte of a multi-byte sequence has the high bit set

 * The first byte of any multi-byte sequence has the two highest bits set,
   while all other bytes in the sequence have 10 as their high bits

 * Byte values  C0,  C1,  F5 - FF  will never appear in UTF-8

   For a more detailed summary of valid byte values see...
     http://www.phrack.org/phrack/62/p62-0x09_UTF8_Shellcode.txt
   (This is actually for the use by crackers, but good reading)

 * Searching in UTF-8 works as normal (unless you want all e's to match,
   including the accented ones)

 * Number of bytes in an encoding is defined by the number of high
   bits set in the first byte
   (EG   C5 9B  gives  C or 2 high bits, thus two bytes for the character)

   As such...

     Unicode Character   valid bits    UTF String
         00 -     7F           7       0xxxxxxx
       0080 -   07FF          11       110xxxxx 10xxxxxx
       0800 -   FFFF          16       1110xxxx 10xxxxxx 10xxxxxx
     010000 - 10FFFF          21       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
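
One use of the bit patterns above: as every continuation byte looks like
10xxxxxx (hex 80 to BF, octal 200 to 277), the number of characters in valid
UTF-8 text is just the number of bytes that are not continuation bytes.
A small sketch; the name "utf8_count" is made up.

```shell
# Count the characters (code points) in UTF-8 text on standard input
# by deleting all continuation bytes and counting what is left.
# Newlines are counted too.  "utf8_count" is a made-up helper name.
utf8_count() {
  LC_ALL=C tr -d '\200-\277' | wc -c | tr -d ' '
}
```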

Example encoding and decoding

    Unicode  "Latin Small Letter s with Acute"
       015B  ->  0000 0001 0101 1011     convert to binary
             ->  xxx0 0101 xx01 1011     re-organise bits (shifts)
             ->  1100 0101 1001 1011     add top level bits
             -> C59B                     final UTF-8 string
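
The same shifts and masks can be sketched with shell arithmetic.  The function
name "utf8_encode" is made up for this example.  It follows the table above
but does not reject the surrogate range D800 - DFFF, which a strict encoder
should.

```shell
# Print the UTF-8 bytes (in hex) for one code point.
# "utf8_encode" is a made-up helper name.
utf8_encode() {
  local cp
  cp=$(( $1 ))                       # accepts decimal or 0x.. hex
  if   [ "$cp" -lt 0 ] || [ "$cp" -gt $(( 0x10FFFF )) ]; then
    return 1                         # outside the Unicode range
  elif [ "$cp" -lt $(( 0x80 )) ]; then
    printf '%02X\n' "$cp"                                  # 0xxxxxxx
  elif [ "$cp" -lt $(( 0x800 )) ]; then
    printf '%02X %02X\n' \
      $(( 0xC0 | (cp >> 6) )) $(( 0x80 | (cp & 0x3F) ))    # 110xxxxx 10xxxxxx
  elif [ "$cp" -lt $(( 0x10000 )) ]; then
    printf '%02X %02X %02X\n' \
      $(( 0xE0 | (cp >> 12) )) $(( 0x80 | ((cp >> 6) & 0x3F) )) \
      $(( 0x80 | (cp & 0x3F) ))
  else
    printf '%02X %02X %02X %02X\n' \
      $(( 0xF0 | (cp >> 18) )) $(( 0x80 | ((cp >> 12) & 0x3F) )) \
      $(( 0x80 | ((cp >> 6) & 0x3F) )) $(( 0x80 | (cp & 0x3F) ))
  fi
}
```

For example   utf8_encode 0x15B   should print   C5 9B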

   Reverse
    UTF-8:  E2 80 9C -> 1110 0010 1000 0000 1001 1100
                     ->      0010   00 0000   01 1100
                     -> 0010 0000 0001 1100   regroup into 4 bit groups
                     -> 201C  in Unicode
                     -> "Left Double Quotation Mark"
                     or "Double Turned Comma Quotation Mark"
                     or (in my words) "Opening Double Quote"
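
The reverse direction as a sketch in shell arithmetic.  The function name
"utf8_decode" is made up; it takes the bytes of one character as separate hex
arguments.  It rejects the never-valid lead bytes listed earlier, but does not
check for every overlong or surrogate encoding.

```shell
# Print the code point for one UTF-8 sequence given as hex bytes.
# "utf8_decode" is a made-up helper name.
utf8_decode() {
  local b0 b cp n
  b0=$(( 0x$1 ))
  if   [ "$b0" -lt $(( 0x80 )) ]; then cp=$b0;              n=0  # 0xxxxxxx
  elif [ "$b0" -lt $(( 0xC2 )) ]; then return 1   # continuation, or C0 / C1
  elif [ "$b0" -lt $(( 0xE0 )) ]; then cp=$(( b0 & 0x1F )); n=1  # 110xxxxx
  elif [ "$b0" -lt $(( 0xF0 )) ]; then cp=$(( b0 & 0x0F )); n=2  # 1110xxxx
  elif [ "$b0" -lt $(( 0xF5 )) ]; then cp=$(( b0 & 0x07 )); n=3  # 11110xxx
  else return 1                                   # F5 - FF never appear
  fi
  shift
  [ "$#" -eq "$n" ] || return 1      # wrong number of continuation bytes
  for b in "$@"; do
    b=$(( 0x$b ))
    [ $(( b & 0xC0 )) -eq $(( 0x80 )) ] || return 1   # must be 10xxxxxx
    cp=$(( (cp << 6) | (b & 0x3F) ))                  # append 6 more bits
  done
  printf 'U+%04X\n' "$cp"
}
```

For example   utf8_decode E2 80 9C   should print   U+201C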

-------------------------------------------------------------------------------
