Codex

User:Hakre/UTF8

Validating UTF8 by Regex

Ticket #5998 drew attention to a flawed regular expression, offered by the W3C, for validating input that may be UTF-8 encoded:

"In the process Jacques Distler discovered that the W3C page [REF] which recommends a regular expression for checking for proper XML/UTF-8 allows through UTF-8 encodings of U+FFFE and U+FFFF." Sam Ruby Wed 02 Jan 2008 at 19:23

"The other obvious problem with the aforementioned regexp (well, perhaps obvious to everyone but me) is that one needs to expand NCRs to utf-8 before applying it." Jacques Distler Thu 03 Jan 2008 at 21:45

Source of both quotes: intertwingly (blog)
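
Distler's point can be illustrated with a short sketch: a numeric character reference such as &#xFFFF; consists of plain ASCII bytes, so a byte-level check cannot object to it until the reference has been expanded. The helper names codepoint_to_utf8() and expand_ncrs() are hypothetical and not taken from the ticket or the blog; the sketch handles numeric references only (no named entities) and does no validation of its own:

```php
<?php
// Hypothetical helper: encode one code point (0..0x10FFFF) as UTF-8 bytes.
// It deliberately does NOT reject surrogates or noncharacters.
function codepoint_to_utf8(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: replace &#NNN; and &#xHHH; with their UTF-8 bytes.
function expand_ncrs(string $text): string
{
    return preg_replace_callback(
        '/&#(?:[xX]([0-9A-Fa-f]{1,6})|([0-9]{1,7}));/',
        function (array $m): string {
            // Group 1 is the hex form, group 2 the decimal form.
            $cp = $m[1] !== '' ? (int) hexdec($m[1]) : (int) $m[2];
            // Leave out-of-range references untouched.
            return $cp <= 0x10FFFF ? codepoint_to_utf8($cp) : $m[0];
        },
        $text
    );
}

echo bin2hex(expand_ncrs('a&#xFFFF;b')), "\n"; // 61efbfbf62
echo bin2hex(expand_ncrs('caf&#233;')), "\n";  // 636166c3a9
```

Only after this step do the bytes EF BF BF of U+FFFF become visible to a validating regex.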

The regex from the W3C page under discussion is (Perl code):

$field =~
  m/\A(
     [\x09\x0A\x0D\x20-\x7E]            # ASCII
   | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
   |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
   |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
   |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
   | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
   |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
  )*\z/x;
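
To see why U+FFFE and U+FFFF slip through, it helps to encode them by hand and test them against the "straight 3-byte" branch alone. The following is a minimal PHP sketch; the function name utf8_encode_3byte() is hypothetical, and the branch is copied from the pattern above into PHP's preg syntax:

```php
<?php
// Hypothetical helper: UTF-8 encode a code point in the 3-byte range
// (U+0800..U+FFFF). No validity checks are performed.
function utf8_encode_3byte(int $cp): string
{
    return chr(0xE0 | ($cp >> 12))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

// The "straight 3-byte" branch of the W3C pattern, anchored on its own.
$branch = '/\A[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}\z/';

foreach ([0xFFFE, 0xFFFF] as $cp) {
    $bytes = utf8_encode_3byte($cp);
    printf(
        "U+%04X = %s, accepted: %s\n",
        $cp,
        strtoupper(bin2hex($bytes)),
        preg_match($branch, $bytes) === 1 ? 'yes' : 'no'
    );
}
// U+FFFE = EFBFBE, accepted: yes
// U+FFFF = EFBFBF, accepted: yes
```

Both code points use the lead byte EF followed by two ordinary continuation bytes, so a branch that allows EF with any two continuation bytes cannot exclude them.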

In a patch, schiller then suggested the following regex, which should not have that problem (PHP code):

'(' .
  '[\x09\x0A\x0D\x20-\x7E]' .        # ASCII
  '|[\xC2-\xDF][\x80-\xBF]' .        # non-overlong 2-byte
  '|\xE0[\xA0-\xBF][\x80-\xBF]' .    # excluding overlongs
  '|[\xE1-\xEC\xEE][\x80-\xBF]{2}' . # 3-byte, but exclude U+FFFE and U+FFFF
  '|\xEF[\x80-\xBE][\x80-\xBF]' .
  '|\xEF\xBF[\x80-\xBD]' .
  '|\xED[\x80-\x9F][\x80-\xBF]' .    # excluding surrogates
  '|\xF0[\x90-\xBF][\x80-\xBF]{2}' . # planes 1-3
  '|[\xF1-\xF3][\x80-\xBF]{3}' .     # planes 4-15
  '|\xF4[\x80-\x8F][\x80-\xBF]{2}' . # plane 16
')';
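
The fragment above only describes a single character. One possible way to use it is sketched below; the wrapper name is_strict_utf8(), the non-capturing group, the delimiters and the \A ... \z anchors are assumptions and not part of the patch:

```php
<?php
// Hypothetical wrapper; only the alternation itself comes from the patch.
function is_strict_utf8(string $bytes): bool
{
    $char = '(?:'
        . '[\x09\x0A\x0D\x20-\x7E]'         // ASCII
        . '|[\xC2-\xDF][\x80-\xBF]'         // non-overlong 2-byte
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'     // excluding overlongs
        . '|[\xE1-\xEC\xEE][\x80-\xBF]{2}'  // 3-byte without lead byte EF
        . '|\xEF[\x80-\xBE][\x80-\xBF]'     // U+F000..U+FFBF
        . '|\xEF\xBF[\x80-\xBD]'            // U+FFC0..U+FFFD
        . '|\xED[\x80-\x9F][\x80-\xBF]'     // excluding surrogates
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'  // planes 1-3
        . '|[\xF1-\xF3][\x80-\xBF]{3}'      // planes 4-15
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2}'  // plane 16
        . ')';
    // Anchor at both ends so every byte must belong to an accepted sequence.
    return preg_match('/\A' . $char . '*\z/', $bytes) === 1;
}

var_dump(is_strict_utf8("Gr\xC3\xBC\xC3\x9Fe")); // bool(true)
var_dump(is_strict_utf8("\xEF\xBF\xBD"));        // bool(true),  U+FFFD
var_dump(is_strict_utf8("\xEF\xBF\xBE"));        // bool(false), U+FFFE
var_dump(is_strict_utf8("\xEF\xBF\xBF"));        // bool(false), U+FFFF
```

The pattern is applied without the u modifier, so PCRE compares raw bytes instead of trying to decode the subject as UTF-8 first.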

Another regex, which matches malformed byte sequences instead of valid ones and still lets invalid code points through (for example 5- and 6-byte forms), can be found in Building Scalable Web Sites by Cal Henderson (O'Reilly 2006) on pages 94 and 95 (Chapter 5 as PDF); again PHP code:

'[\xC0-\xDF]([^\x80-\xBF]|$)' .
'|[\xE0-\xEF].{0,1}([^\x80-\xBF]|$)' .
'|[\xF0-\xF7].{0,2}([^\x80-\xBF]|$)' .
'|[\xF8-\xFB].{0,3}([^\x80-\xBF]|$)' .
'|[\xFC-\xFD].{0,4}([^\x80-\xBF]|$)' .
'|[\xFE-\xFE].{0,5}([^\x80-\xBF]|$)' .
'|[\x00-\x7F][\x80-\xBF]' .
'|[\xC0-\xDF].[\x80-\xBF]' .
'|[\xE0-\xEF]..[\x80-\xBF]' .
'|[\xF0-\xF7]...[\x80-\xBF]' .
'|[\xF8-\xFB]....[\x80-\xBF]' .
'|[\xFC-\xFD].....[\x80-\xBF]' .
'|[\xFE-\xFE]......[\x80-\xBF]' .
'|^[\x80-\xBF]'
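
Because this pattern matches malformed input, a successful match means the string is not structurally valid. A minimal sketch of how it might be applied follows; the function name has_malformed_utf8(), the delimiters and the s modifier (so that the dot also matches a newline byte) are assumptions and are not taken from the book:

```php
<?php
// Hypothetical wrapper: returns true when the pattern finds a malformed
// sequence (truncated multi-byte form or stray continuation byte).
function has_malformed_utf8(string $bytes): bool
{
    $rx = '[\xC0-\xDF]([^\x80-\xBF]|$)'
        . '|[\xE0-\xEF].{0,1}([^\x80-\xBF]|$)'
        . '|[\xF0-\xF7].{0,2}([^\x80-\xBF]|$)'
        . '|[\xF8-\xFB].{0,3}([^\x80-\xBF]|$)'
        . '|[\xFC-\xFD].{0,4}([^\x80-\xBF]|$)'
        . '|[\xFE-\xFE].{0,5}([^\x80-\xBF]|$)'
        . '|[\x00-\x7F][\x80-\xBF]'
        . '|[\xC0-\xDF].[\x80-\xBF]'
        . '|[\xE0-\xEF]..[\x80-\xBF]'
        . '|[\xF0-\xF7]...[\x80-\xBF]'
        . '|[\xF8-\xFB]....[\x80-\xBF]'
        . '|[\xFC-\xFD].....[\x80-\xBF]'
        . '|[\xFE-\xFE]......[\x80-\xBF]'
        . '|^[\x80-\xBF]';
    return preg_match('/' . $rx . '/s', $bytes) === 1;
}

var_dump(has_malformed_utf8("caf\xC3\xA9"));  // bool(false), well-formed
var_dump(has_malformed_utf8("\xC3"));         // bool(true),  truncated 2-byte form
var_dump(has_malformed_utf8("a\x80"));        // bool(true),  stray continuation byte
var_dump(has_malformed_utf8("\xEF\xBF\xBF")); // bool(false), U+FFFF is not flagged
var_dump(has_malformed_utf8("\xC0\x80"));     // bool(false), overlong form is not flagged
```

The last two calls show that this pattern only checks the lead-byte and continuation-byte structure; it does not reject U+FFFF or overlong encodings.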

Further Resources

The php-utf8 library contains modified variants of W3C patterns to identify valid and invalid UTF-8 sequences as well as routines to identify bad sequences.