Ticket #5998 drew attention to a flawed regular expression that the W3C offers for validating potentially UTF-8 encoded input:
"In the process Jacques Distler discovered that the W3C page [REF] which recommends a regular expression for checking for proper XML/UTF-8 allows through UTF-8 encodings of U+FFFE and U+FFFF." Sam Ruby Wed 02 Jan 2008 at 19:23
"The other obvious problem with the aforementioned regexp (well, perhaps obvious to everyone but me) is that one needs to expand NCRs to utf-8 before applying it." Jacques Distler Thu 03 Jan 2008 at 21:45
Source of both quotes: intertwingly (blog)
The regex from the W3C page under discussion is (Perl code):
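Distler's point about numeric character references (NCRs) can be illustrated with a small sketch. A reference such as &#xFFFF; consists only of ASCII bytes, so any byte-level check accepts it unless the reference is expanded to UTF-8 first. The helper names codepoint_to_utf8 and expand_ncrs are made up for this illustration and do not come from the ticket or the blog.

```php
<?php
// Hypothetical helper: encode one code point as UTF-8 bytes without
// filtering, so that bad code points stay visible to a later byte check.
function codepoint_to_utf8(int $cp): string {
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: expand decimal and hexadecimal NCRs in $text.
function expand_ncrs(string $text): string {
    return preg_replace_callback(
        '/&#(?:[xX]0*([0-9A-Fa-f]{1,6})|0*([0-9]{1,7}));/',
        function (array $m): string {
            // $m[1] is non-empty only when the hexadecimal branch matched.
            $cp = $m[1] !== '' ? hexdec($m[1]) : (int) $m[2];
            // Assumption of this sketch: references beyond U+10FFFF are
            // left untouched; a real validator would reject them.
            return $cp > 0x10FFFF ? $m[0] : codepoint_to_utf8($cp);
        },
        $text
    );
}

var_dump(bin2hex(expand_ncrs('&#xFFFF;'))); // string(6) "efbfbf"
```

After expansion the three bytes EF BF BF are present in the input and can be caught by a byte-level pattern.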
$field =~
m/\A(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*\z/x;
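To make the problem concrete, the W3C pattern can be ported to PHP and run against the UTF-8 encodings of U+FFFE (EF BF BE) and U+FFFF (EF BF BF). This is only a sketch: the function name w3c_is_utf8 is made up for this example, and the port drops the /x flag and the inline comments of the Perl original.

```php
<?php
// Hypothetical helper: a PHP port of the W3C pattern quoted above.
function w3c_is_utf8(string $str): bool {
    // No /u modifier: the pattern has to see raw bytes.
    return preg_match('/\A(?:'
        . '[\x09\x0A\x0D\x20-\x7E]'           // ASCII
        . '|[\xC2-\xDF][\x80-\xBF]'           // non-overlong 2-byte
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'       // excluding overlongs
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}' // straight 3-byte
        . '|\xED[\x80-\x9F][\x80-\xBF]'       // excluding surrogates
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'    // planes 1-3
        . '|[\xF1-\xF3][\x80-\xBF]{3}'        // planes 4-15
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2}'    // plane 16
        . ')*\z/', $str) === 1;
}

var_dump(w3c_is_utf8("\xEF\xBF\xBE")); // bool(true): U+FFFE slips through
var_dump(w3c_is_utf8("\xEF\xBF\xBF")); // bool(true): U+FFFF slips through
var_dump(w3c_is_utf8("\xC0\xAF"));     // bool(false): overlong form is rejected
```

Both noncharacters are accepted because the byte \xEF is part of the generic 3-byte alternative, which allows any two continuation bytes after it.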
In a patch, schiller then suggested the following regex, which excludes the encodings of U+FFFE and U+FFFF (PHP code):
'(' .
'[\x09\x0A\x0D\x20-\x7E]' . # ASCII
'|[\xC2-\xDF][\x80-\xBF]' . # non-overlong 2-byte
'|\xE0[\xA0-\xBF][\x80-\xBF]' . # excluding overlongs
'|[\xE1-\xEC\xEE][\x80-\xBF]{2}' . # straight 3-byte; \xEF is handled separately
'|\xEF[\x80-\xBE][\x80-\xBF]' . # U+F000 to U+FFBF
'|\xEF\xBF[\x80-\xBD]' . # U+FFC0 to U+FFFD, excluding U+FFFE and U+FFFF
'|\xED[\x80-\x9F][\x80-\xBF]' . # excluding surrogates
'|\xF0[\x90-\xBF][\x80-\xBF]{2}' . # planes 1-3
'|[\xF1-\xF3][\x80-\xBF]{3}' . # planes 4-15
'|\xF4[\x80-\x8F][\x80-\xBF]{2}' . # plane 16
')';
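A minimal sketch of how this alternation might be used for a whole-string check. Only the group itself is quoted above, so the \A(?:...)*\z anchoring and the function name is_valid_xml_utf8 are assumptions of this example, not part of the patch.

```php
<?php
// Hypothetical wrapper around the alternation quoted above.
function is_valid_xml_utf8(string $str): bool {
    $char = '(?:'
        . '[\x09\x0A\x0D\x20-\x7E]'        // ASCII
        . '|[\xC2-\xDF][\x80-\xBF]'        // non-overlong 2-byte
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'    // excluding overlongs
        . '|[\xE1-\xEC\xEE][\x80-\xBF]{2}' // straight 3-byte
        . '|\xEF[\x80-\xBE][\x80-\xBF]'    // U+F000 to U+FFBF
        . '|\xEF\xBF[\x80-\xBD]'           // U+FFC0 to U+FFFD
        . '|\xED[\x80-\x9F][\x80-\xBF]'    // excluding surrogates
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}' // planes 1-3
        . '|[\xF1-\xF3][\x80-\xBF]{3}'     // planes 4-15
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2}' // plane 16
        . ')';
    // Assumed anchoring: the whole string must consist of valid characters.
    return preg_match('/\A' . $char . '*\z/', $str) === 1;
}

var_dump(is_valid_xml_utf8("\xEF\xBF\xBD")); // bool(true): U+FFFD is allowed
var_dump(is_valid_xml_utf8("\xEF\xBF\xBE")); // bool(false): U+FFFE is rejected
var_dump(is_valid_xml_utf8("\xEF\xBF\xBF")); // bool(false): U+FFFF is rejected
```

Splitting the \xEF lead byte into two alternatives is what removes exactly the two byte sequences EF BF BE and EF BF BF while keeping the rest of the range.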
Another regex, this one matching invalid byte sequences rather than valid ones, can be found in Building Scalable Web Sites by Cal Henderson (O'Reilly, 2006) on pages 94 and 95 (Chapter 5 as PDF); again PHP code:
'[\xC0-\xDF]([^\x80-\xBF]|$)' .
'|[\xE0-\xEF].{0,1}([^\x80-\xBF]|$)' .
'|[\xF0-\xF7].{0,2}([^\x80-\xBF]|$)' .
'|[\xF8-\xFB].{0,3}([^\x80-\xBF]|$)' .
'|[\xFC-\xFD].{0,4}([^\x80-\xBF]|$)' .
'|[\xFE-\xFE].{0,5}([^\x80-\xBF]|$)' .
'|[\x00-\x7F][\x80-\xBF]' .
'|[\xC0-\xDF].[\x80-\xBF]' .
'|[\xE0-\xEF]..[\x80-\xBF]' .
'|[\xF0-\xF7]...[\x80-\xBF]' .
'|[\xF8-\xFB]....[\x80-\xBF]' .
'|[\xFC-\xFD].....[\x80-\xBF]' .
'|[\xFE-\xFE]......[\x80-\xBF]' .
'|^[\x80-\xBF]'
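Because this pattern describes what is wrong rather than what is right, a match means the input is broken. The sketch below shows one way it could be applied; the delimiters, the preg_match call and the function name has_invalid_utf8 are assumptions of this example and are not taken from the book.

```php
<?php
// Hypothetical helper: true if $bytes contains a structurally invalid
// sequence according to the pattern quoted above.
function has_invalid_utf8(string $bytes): bool {
    $rx = '[\xC0-\xDF]([^\x80-\xBF]|$)'
        . '|[\xE0-\xEF].{0,1}([^\x80-\xBF]|$)'
        . '|[\xF0-\xF7].{0,2}([^\x80-\xBF]|$)'
        . '|[\xF8-\xFB].{0,3}([^\x80-\xBF]|$)'
        . '|[\xFC-\xFD].{0,4}([^\x80-\xBF]|$)'
        . '|[\xFE-\xFE].{0,5}([^\x80-\xBF]|$)'
        . '|[\x00-\x7F][\x80-\xBF]'
        . '|[\xC0-\xDF].[\x80-\xBF]'
        . '|[\xE0-\xEF]..[\x80-\xBF]'
        . '|[\xF0-\xF7]...[\x80-\xBF]'
        . '|[\xF8-\xFB]....[\x80-\xBF]'
        . '|[\xFC-\xFD].....[\x80-\xBF]'
        . '|[\xFE-\xFE]......[\x80-\xBF]'
        . '|^[\x80-\xBF]';
    // No /u modifier, so "." stands for a single byte.
    return preg_match('/(?:' . $rx . ')/', $bytes) === 1;
}

var_dump(has_invalid_utf8("caf\xC3\xA9")); // bool(false): well-formed
var_dump(has_invalid_utf8("\xE2\x82"));    // bool(true): truncated 3-byte sequence
var_dump(has_invalid_utf8("a\x80"));       // bool(true): stray continuation byte
var_dump(has_invalid_utf8("\xC0\xAF"));    // bool(false): overlong form is not detected
```

As the last call shows, the pattern only checks that lead bytes are followed by the right number of continuation bytes; overlong forms and surrogates are not flagged.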
The php-utf8 library contains modified variants of the W3C patterns for identifying valid and invalid UTF-8 sequences, as well as routines for locating bad sequences.