Avoid wcwidth(), wcrtomb() and mbrtowc() on ASCII/ISO8859-1 characters.

The ASCII <-> UTF-8 mapping is trivial, so wcrtomb() and mbrtowc() can be skipped for ASCII characters. ISO-8859-1 is all narrow characters and cheap to test for, so wcwidth() can be skipped there as well. It might be possible to cheaply test other popular Unicode blocks and/or planes too. Together, these two changes make input processing 2-3x faster on Linux and FreeBSD. The performance improvement in actual usage is more modest but still significant.
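
As a rough illustration of the two shortcuts described above, here is a minimal standalone sketch. It is not the code from this commit: the helper names decode_ascii_fast and fast_width and the sentinel kNeedSlowPath are hypothetical, and treating only the printable ISO-8859-1 ranges (0x20-0x7e and 0xa0-0xff) as width 1 is an assumption made for the example.

```cpp
#include <cwchar>

// Hypothetical sentinel: the caller should fall back to wcwidth().
const int kNeedSlowPath = -2;

// Hypothetical helper: a byte <= 0x7f is a complete 1-byte UTF-8
// sequence whose code point equals the byte value, so mbrtowc() can
// be skipped. Returns false for lead or continuation bytes.
inline bool decode_ascii_fast( char c, wchar_t &out )
{
  const unsigned char u = static_cast<unsigned char>( c );
  if ( u <= 0x7f ) {
    out = static_cast<wchar_t>( u );
    return true;
  }
  return false;
}

// Hypothetical helper: printable ISO-8859-1 characters are assumed to
// occupy one column, so wcwidth() can be skipped for them. Control
// characters and everything above 0xff take the slow path.
inline int fast_width( wchar_t ch )
{
  if ( ( ch >= 0x20 && ch <= 0x7e ) || ( ch >= 0xa0 && ch <= 0xff ) ) {
    return 1;
  }
  return kNeedSlowPath;
}
```

A caller would try decode_ascii_fast() only when no multibyte sequence is in progress (the diff below checks buf_len == 0 for that reason), and would call wcwidth() whenever fast_width() returns kNeedSlowPath.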
2014-09-28 02:48:32 -04:00
parent f5d814a9c4
commit e4a99256cb
3 changed files with 32 additions and 5 deletions
@@ -80,10 +80,15 @@ void Parser::UTF8Parser::input( char c, Actions &ret )
 {
  assert( buf_len < BUF_SIZE );

+  /* 1-byte UTF-8 character, aka ASCII?  Cheat. */
+  if ( buf_len == 0 && static_cast<unsigned char>(c) <= 0x7f ) {
+    parser.input( static_cast<wchar_t>(c), ret );
+    return;
+  }
+
  buf[ buf_len++ ] = c;

  /* This function will only work in a UTF-8 locale. */
-
  wchar_t pwc;
  mbstate_t ps = mbstate_t();