Edits: 2015-01-10 Added another offending character, NO-BREAK SPACE

My recent focus has been loading additional content into SHARE to enrich our offerings, such as TAUG and draft SDTM publications published in 4Q2014. I spent a lot of time having to clean up the obstreperous characters in my Word document sources (i.e., the documents used to render the PDF). I feel compelled to write this short blog entry, hoping it will give you a jump start if you happen to perform similar tasks.

The most significant offenders are the so-called smart quotes. By default, Microsoft Word auto-formats straight quotes into curly quotes, meaning it automatically replaces them every time you hit the single- or double-quote key. So, what's wrong with smart quotes, and why am I mucking around with them? First of all, our volunteer authors do not use them consistently, since a user can disable the AutoCorrect feature. Second, these characters are not ASCII and can only be understood by software applications that support UTF. Until our industry is more acquainted with XML technologies, forcing UTF would introduce an unnecessary burden of data transport incompatibilities.

Sidebar:

SOA Semantics Manager is a web-based application and supports UTF. For SHARE, we use the default configuration option for character set displays, which is UTF-8. The backend database uses UTF for character encoding.

Hyphens. Apart from the one on the keyboard, I have detected four other flavors, none of which is part of ASCII.

Perhaps the worst kind of obstreperous characters are the non-printables. You know they are there, but you can't see them. Like household parasites, they are hard to detect, tagging along in the copy-paste buffer.
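One way to flush out characters you cannot see is to report every non-ASCII code point in a piece of text by its Unicode name. The following is a minimal sketch, not part of my original tooling: `find_non_ascii` is a hypothetical helper name, and it assumes the input has already been decoded into a Perl character string.

```perl
use strict;
use warnings;
use charnames ();    # charnames::viacode maps a code point to its Unicode name

# Hypothetical helper: tally every non-ASCII character in a decoded string
# and return one report line per distinct code point, sorted by code point.
sub find_non_ascii {
    my ($text) = @_;
    my %count;
    $count{ ord $1 }++ while $text =~ /([^\x00-\x7F])/g;
    return map {
        sprintf 'U+%04X %s (%d)', $_, charnames::viacode($_) // 'UNKNOWN', $count{$_}
    } sort { $a <=> $b } keys %count;
}

# Sample text with curly quotes, a zero width space, a no-break space and an en dash
my $sample = "\x{201C}SDTM\x{201D}\x{200B} v3.2\x{00A0}\x{2013} draft";
print "$_\n" for find_non_ascii($sample);
```

Each report line shows the code point, its name, and how many times it occurs, so invisible characters such as ZERO WIDTH SPACE show up by name.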

...

Unicode Name (Code Point) | Replacement ASCII Character (Decimal) | Remarks

LEFT SINGLE QUOTATION MARK (U+2018) | ' (39)
  • Commonly referred to as the left curly single quote.
  • This character is part of Microsoft Word's default AutoFormat setting.
RIGHT SINGLE QUOTATION MARK (U+2019) | ' (39)
  • Commonly referred to as the right curly single quote.
  • This character is part of Microsoft Word's default AutoFormat setting.
LEFT DOUBLE QUOTATION MARK (U+201C) | " (34)
  • Commonly referred to as the left curly double quote.
  • This character is part of Microsoft Word's default AutoFormat setting.
RIGHT DOUBLE QUOTATION MARK (U+201D) | " (34)
  • Commonly referred to as the right curly double quote.
  • This character is part of Microsoft Word's default AutoFormat setting.
NON-BREAKING HYPHEN (U+2011) | - (45)
FIGURE DASH (U+2012) | - (45)
  • When its UTF-8 bytes are viewed with the wrong encoding, e.g., the Windows-1252 code page, this character appears as â€’
EN DASH (U+2013) | - (45)
  • This character is part of Microsoft Word's default AutoCorrect setting.
  • When its UTF-8 bytes are viewed with the wrong encoding, e.g., the Windows-1252 code page, this character appears as â€“
EM DASH (U+2014) | - (45)
  • When its UTF-8 bytes are viewed with the wrong encoding, e.g., the Windows-1252 code page, this character appears as â€”
HORIZONTAL ELLIPSIS (U+2026) | ... (46, three times)
  • This character is part of Microsoft Word's default AutoCorrect setting.
  • When its UTF-8 bytes are viewed with the wrong encoding, e.g., the Windows-1252 code page, this character appears as â€¦
ZERO WIDTH SPACE (U+200B) | empty string (the character is removed)
  • When its UTF-8 bytes are viewed with the wrong encoding, e.g., the Windows-1252 code page, this character appears as â€‹
  • This is a non-printable character. Microsoft Word uses it to represent optional breaks. It is visible after enabling the Show Formatting Symbols option:

    [image]

    An example with text:

    [image]

NO-BREAK SPACE (U+00A0) | space (32)
  • When its UTF-8 bytes are viewed with the wrong encoding, e.g., the Windows-1252 code page, this character appears as Â followed by a blank.
  • Although it is a printable character, Microsoft Word uses it as its nonbreaking space and displays it like a regular space. It is visible after enabling the Show Formatting Symbols option:

    [image]

    An example with text:

    [image]

Image credit: FileFormat.info
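The garbled sequences that turn up when text is viewed in the wrong code page can be reproduced on purpose, which helps when you need to recognize them in the wild. The sketch below assumes the text was stored as UTF-8 and then misread as Windows-1252; `misread_as_cp1252` is a hypothetical helper name. It prints code points rather than the characters themselves so the output stays plain ASCII.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: encode a character as UTF-8, then decode those bytes
# as if they were Windows-1252, which is what a misconfigured viewer does.
sub misread_as_cp1252 {
    my ($char) = @_;
    my $bytes = encode('UTF-8', $char);
    return decode('cp1252', $bytes);
}

# Figure dash, en dash, em dash, horizontal ellipsis, zero width space
for my $cp (0x2012, 0x2013, 0x2014, 0x2026, 0x200B) {
    my $garbled = misread_as_cp1252(chr $cp);
    printf "U+%04X -> %s\n", $cp,
        join ' ', map { sprintf 'U+%04X', ord } split //, $garbled;
}
```

Every character in this range turns into three characters, always starting with U+00E2 and U+20AC, so that pair is a handy signature to search for.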

Lastly, here is a snippet of Perl code that implements the above table:

Transform Unicode
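# Replace the non-ASCII characters listed in the table with ASCII equivalents.
# The input must already be decoded into Perl character strings
# for the \x{...} escapes to match.
sub transformUnicode {
    my @input = @_;    # work on copies so the caller's strings are untouched

    for (@input) {
        s/\x{2018}/'/g; s/\x{2019}/'/g; # left and right curly single-quote
        s/\x{201c}/"/g; s/\x{201d}/"/g; # left and right curly double-quote

        # all kinds of hyphens/dashes
        s/\x{2011}/-/g; # Non-breaking hyphen
        s/\x{2012}/-/g; # Figure dash
        s/\x{2013}/-/g; # En dash
        s/\x{2014}/-/g; # Em dash

        s/\x{2026}/.../g; # Horizontal ellipsis

        # all kinds of spaces
        s/\x{00a0}/ /g; # No-break space, a.k.a. Microsoft Word's nonbreaking space
        s/\x{200b}//g;  # Zero width space, a.k.a. Microsoft Word's optional break
    }

    # list context: every transformed string; scalar context: the first one only
    return wantarray ? @input : $input[0];
}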
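If the list of offenders keeps growing, a lookup table may be easier to maintain than one substitution per character. What follows is an alternative sketch rather than the code I actually run: `%ASCII_FOR` and `transform_unicode_table` are hypothetical names, and the sample bytes are made up for illustration. It also shows the decoding step that raw UTF-8 input needs before any of the substitutions can match.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical lookup table: offending character => ASCII replacement
my %ASCII_FOR = (
    "\x{2018}" => "'",  "\x{2019}" => "'",    # curly single quotes
    "\x{201C}" => '"',  "\x{201D}" => '"',    # curly double quotes
    "\x{2011}" => '-',  "\x{2012}" => '-',    # non-breaking hyphen, figure dash
    "\x{2013}" => '-',  "\x{2014}" => '-',    # en dash, em dash
    "\x{2026}" => '...',                      # horizontal ellipsis
    "\x{00A0}" => ' ',                        # no-break space
    "\x{200B}" => '',                         # zero width space is dropped
);

# One character class that matches any key of the table
my $class    = join '', keys %ASCII_FOR;
my $OFFENDER = qr/([$class])/;

sub transform_unicode_table {
    my @input = @_;                           # copies, so callers keep their originals
    s/$OFFENDER/$ASCII_FOR{$1}/g for @input;
    return wantarray ? @input : $input[0];
}

# Raw UTF-8 bytes, as read from a file without an encoding layer
my $bytes = "\xE2\x80\x9CTAUG\xE2\x80\x9D \xE2\x80\x93 draft\xE2\x80\xA6";
my $clean = transform_unicode_table(decode('UTF-8', $bytes));
print "$clean\n";    # prints: "TAUG" - draft...
```

Adding a new offender then means adding one line to the table, and the regular expression is rebuilt from the keys automatically.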