Commit 44594c2f authored by Olaf Weber, committed by Theodore Ts'o

unicode: introduce code for UTF-8 normalization



Supporting functions for UTF-8 normalization are in utf8norm.c with the
header utf8norm.h. Two normalization forms are supported: nfdi and
nfdicf.

  nfdi:
   - Apply unicode normalization form NFD.
   - Remove any Default_Ignorable_Code_Point.

  nfdicf:
   - Apply unicode normalization form NFD.
   - Remove any Default_Ignorable_Code_Point.
   - Apply a full casefold (C + F).

For the purposes of the code, a string is valid UTF-8 if:

 - The values encoded are 0x1..0x10FFFF.
 - The surrogate codepoints 0xD800..0xDFFF are not encoded.
 - The shortest possible encoding is used for all values.
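
The three rules above can be sketched as a standalone check. This is a
minimal illustration only, not the code in utf8norm.c: the function name
utf8_valid_sketch is hypothetical, and the real implementation is
trie-driven rather than decoding each sequence like this.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of utf8norm.c: returns 1 if the first
 * len bytes of s are valid UTF-8 under the three rules above, else 0.
 */
static int utf8_valid_sketch(const unsigned char *s, size_t len)
{
	size_t i = 0;

	while (i < len) {
		unsigned char c = s[i];
		unsigned long cp, min;
		size_t n, k;

		/* Lead byte: sequence length and smallest value it may encode. */
		if (c < 0x80) {
			cp = c;        n = 1; min = 0x1;	/* 0x0 is excluded */
		} else if ((c & 0xE0) == 0xC0) {
			cp = c & 0x1F; n = 2; min = 0x80;
		} else if ((c & 0xF0) == 0xE0) {
			cp = c & 0x0F; n = 3; min = 0x800;
		} else if ((c & 0xF8) == 0xF0) {
			cp = c & 0x07; n = 4; min = 0x10000;
		} else {
			return 0;	/* stray continuation or invalid lead byte */
		}

		if (n > len - i)
			return 0;	/* truncated sequence */

		for (k = 1; k < n; k++) {
			if ((s[i + k] & 0xC0) != 0x80)
				return 0;	/* not a continuation byte */
			cp = (cp << 6) | (s[i + k] & 0x3F);
		}

		if (cp < min)
			return 0;	/* not the shortest encoding */
		if (cp > 0x10FFFF)
			return 0;	/* outside 0x1..0x10FFFF */
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return 0;	/* surrogate */

		i += n;
	}
	return 1;
}
```

For instance, the two-byte sequence C0 80 is rejected as a non-shortest
encoding, and ED A0 80 is rejected because it encodes the surrogate
0xD800.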

The supporting functions work on null-terminated strings (utf8 prefix)
and on length-limited strings (utf8n prefix).
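
The naming convention can be illustrated with a pair of hypothetical
helpers (sketch_utf8n_count and sketch_utf8_count are made-up names, not
functions from utf8norm.c). The sketch assumes its input is already
valid UTF-8 and simply counts code points, with the null-terminated
variant delegating to the length-limited one.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical length-limited variant ("utf8n" style): counts code
 * points in the first len bytes of s. Assumes s is valid UTF-8.
 */
static size_t sketch_utf8n_count(const char *s, size_t len)
{
	size_t i, count = 0;

	for (i = 0; i < len; i++) {
		/* Every byte that is not a continuation byte starts a code point. */
		if (((unsigned char)s[i] & 0xC0) != 0x80)
			count++;
	}
	return count;
}

/*
 * Hypothetical null-terminated variant ("utf8" style): same operation,
 * with the length taken from the terminating NUL.
 */
static size_t sketch_utf8_count(const char *s)
{
	return sketch_utf8n_count(s, strlen(s));
}
```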

Relative to the original SGI patch, and for conformity with coding
standards, the utf8data_t typedef was dropped, since it was just masking
the struct keyword.  In other cases, namely utf8leaf_t and utf8trie_t, I
decided to keep the typedefs, since they are simple pointers to memory
buffers, and using uchars here wouldn't provide any more meaningful
information.

Relative to the original submission, we also converted from the
compatibility form (NFKD) to the canonical form (NFD).

Changes made by Gabriel:
  Rebase to Mainline
  Fix up checkpatch.pl warnings
  Drop typedefs
  Move out of libxfs
  Convert from NFKD to NFD

Signed-off-by: Olaf Weber <olaf@sgi.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
parent 955405d1