Ticket #5998 raised awareness of a flawed regular expression that the W3C offers for validating potentially UTF-8 encoded input:
"In the process Jacques Distler discovered that the W3C page [REF] which recommends a regular expression for checking for proper XML/UTF-8 allows through UTF-8 encodings of U+FFFE and U+FFFF." Sam Ruby Wed 02 Jan 2008 at 19:23
"The other obvious problem with the aforementioned regexp (well, perhaps obvious to everyone but me) is that one needs to expand NCRs to utf-8 before applying it." Jacques Distler Thu 03 Jan 2008 at 21:45
Source of both quotes: intertwingly (blog)
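Distler's point is that a numeric character reference (NCR) such as &#xFFFE; is plain ASCII, so a byte-level pattern accepts it unless the reference is expanded first. A minimal sketch of that expansion step in Python follows; the names NCR and expand_ncrs are hypothetical and not taken from the ticket or the W3C page.

```python
import re

# Hypothetical helper pattern: decimal (&#65534;) and hex (&#xFFFE;) NCRs.
NCR = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")

def expand_ncrs(text: str) -> bytes:
    """Replace NCRs with their code points and return the UTF-8 bytes."""
    def repl(match):
        hex_digits, dec_digits = match.groups()
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Leave out-of-range references untouched rather than guessing.
        return chr(cp) if cp <= 0x10FFFF else match.group(0)
    # surrogatepass keeps lone surrogates visible to a later byte-level check.
    return NCR.sub(repl, text).encode("utf-8", "surrogatepass")

# The raw input is pure ASCII; only the expanded bytes expose U+FFFE.
print(expand_ncrs("a&#xFFFE;b"))
```

The resulting bytes, not the original markup, are what a UTF-8 validation pattern would then need to be applied to.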
The regex from the W3C page under discussion is (Perl code):
    $field =~
      m/\A(
         [\x09\x0A\x0D\x20-\x7E]            # ASCII
       | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
       |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
       | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
       |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
       |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
       | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
       |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
      )*\z/x;
In a patch, schiller then suggested the following regex, which should not have that problem (PHP code):
    '(' .
        '[\x09\x0A\x0D\x20-\x7E]' .          # ASCII
        '|[\xC2-\xDF][\x80-\xBF]' .          # non-overlong 2-byte
        '|\xE0[\xA0-\xBF][\x80-\xBF]' .      # excluding overlongs
        '|[\xE1-\xEC\xEE][\x80-\xBF]{2}' .   # 3-byte, but exclude U-FFFE and U-FFFF
        '|\xEF[\x80-\xBE][\x80-\xBF]' .
        '|\xEF\xBF[\x80-\xBD]' .
        '|\xED[\x80-\x9F][\x80-\xBF]' .      # excluding surrogates
        '|\xF0[\x90-\xBF][\x80-\xBF]{2}' .   # planes 1-3
        '|[\xF1-\xF3][\x80-\xBF]{3}' .       # planes 4-15
        '|\xF4[\x80-\x8F][\x80-\xBF]{2}' .   # plane 16
    ')';
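To see what the patch changes, here is a minimal sketch that runs both patterns against the encodings of U+FFFD, U+FFFE and U+FFFF. It is written in Python purely for illustration; the byte patterns are copied from the Perl and PHP snippets. The \A( ... )*\Z wrapper around the patched alternatives and the names W3C, PATCHED and is_valid are assumptions of this sketch, not part of the patch.

```python
import re

# W3C pattern; Python's \Z plays the role of Perl's \z here.
W3C = re.compile(rb"""
    \A(
      [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*\Z
""", re.VERBOSE)

# Patched alternatives; the anchoring and repetition are assumed.
PATCHED = re.compile(rb"""
    \A(
      [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
    | [\xE1-\xEC\xEE][\x80-\xBF]{2}      # 3-byte, but exclude U+FFFE and U+FFFF
    | \xEF[\x80-\xBE][\x80-\xBF]
    | \xEF\xBF[\x80-\xBD]
    | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*\Z
""", re.VERBOSE)

def is_valid(pattern, data: bytes) -> bool:
    """True if the whole byte string is accepted by the pattern."""
    return pattern.match(data) is not None

for label, data in [("U+FFFD", b"\xEF\xBF\xBD"),
                    ("U+FFFE", b"\xEF\xBF\xBE"),
                    ("U+FFFF", b"\xEF\xBF\xBF")]:
    print(label, is_valid(W3C, data), is_valid(PATCHED, data))
```

Under these assumptions the W3C pattern accepts all three inputs, while the patched pattern accepts only U+FFFD. The only difference is that the single \xEF case is split into the two narrower \xEF alternatives.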
Another regex, this time one that matches invalid byte sequences rather than valid ones, can be found in Building Scalable Web Sites by Cal Henderson (O'Reilly, 2006) on pages 94 and 95 (Chapter 5 as PDF); again PHP code:
    '[\xC0-\xDF]([^\x80-\xBF]|$)' .
    '|[\xE0-\xEF].{0,1}([^\x80-\xBF]|$)' .
    '|[\xF0-\xF7].{0,2}([^\x80-\xBF]|$)' .
    '|[\xF8-\xFB].{0,3}([^\x80-\xBF]|$)' .
    '|[\xFC-\xFD].{0,4}([^\x80-\xBF]|$)' .
    '|[\xFE-\xFE].{0,5}([^\x80-\xBF]|$)' .
    '|[\x00-\x7F][\x80-\xBF]' .
    '|[\xC0-\xDF].[\x80-\xBF]' .
    '|[\xE0-\xEF]..[\x80-\xBF]' .
    '|[\xF0-\xF7]...[\x80-\xBF]' .
    '|[\xF8-\xFB]....[\x80-\xBF]' .
    '|[\xFC-\xFD].....[\x80-\xBF]' .
    '|[\xFE-\xFE]......[\x80-\xBF]' .
    '|^[\x80-\xBF]'
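Unlike the two patterns before it, this one is meant to be searched for: a hit anywhere in the input signals a malformed sequence. A minimal Python sketch of that usage, with the pattern copied byte for byte, follows; the names INVALID_UTF8 and looks_invalid are hypothetical, and how the book's own code invokes the pattern is not shown here.

```python
import re

# Alternatives copied from the PHP snippet; adjacent literals are concatenated.
INVALID_UTF8 = re.compile(
    rb"[\xC0-\xDF]([^\x80-\xBF]|$)"
    rb"|[\xE0-\xEF].{0,1}([^\x80-\xBF]|$)"
    rb"|[\xF0-\xF7].{0,2}([^\x80-\xBF]|$)"
    rb"|[\xF8-\xFB].{0,3}([^\x80-\xBF]|$)"
    rb"|[\xFC-\xFD].{0,4}([^\x80-\xBF]|$)"
    rb"|[\xFE-\xFE].{0,5}([^\x80-\xBF]|$)"
    rb"|[\x00-\x7F][\x80-\xBF]"
    rb"|[\xC0-\xDF].[\x80-\xBF]"
    rb"|[\xE0-\xEF]..[\x80-\xBF]"
    rb"|[\xF0-\xF7]...[\x80-\xBF]"
    rb"|[\xF8-\xFB]....[\x80-\xBF]"
    rb"|[\xFC-\xFD].....[\x80-\xBF]"
    rb"|[\xFE-\xFE]......[\x80-\xBF]"
    rb"|^[\x80-\xBF]"
)

def looks_invalid(data: bytes) -> bool:
    """True if the pattern finds a malformed lead/continuation sequence."""
    return INVALID_UTF8.search(data) is not None

print(looks_invalid("h\u00e9llo \u20ac".encode("utf-8")))  # well-formed text
print(looks_invalid(b"caf\xe9"))        # Latin-1 byte, continuation missing
print(looks_invalid(b"\xc0\x80"))       # overlong encoding of NUL
print(looks_invalid(b"\xef\xbf\xbe"))   # U+FFFE
```

In this sketch the pattern flags the truncated Latin-1 input but not the overlong form or U+FFFE. It checks only that lead bytes are followed by the right number of continuation bytes, so it is not a substitute for the stricter patterns above.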
The php-utf8 library contains modified variants of W3C patterns to identify valid and invalid UTF-8 sequences as well as routines to identify bad sequences.