This page is Chapter 2 of the "Unicode and Cyrillic: problems & solutions" section of my site.
Unicode and Cyrillic: Copy/Paste and other problems
Problem: No readable Cyrillic or just question marks (????)
while working with MS Word 97 and newer or other Unicode-based programs
(Internet Explorer, Outlook Express, MS Outlook, Netscape 7/Mozilla, etc.):
Terminology Note.
Most of these problems do not exist in the 'Russian Windows' environment.
When I write below, "Russian version of Windows",
I do not mean only this special, localized version
where the word "Start" is in Russian.
What I mean is any Windows installation (even with English interface)
where a system code page (System Default locale)
is Russian ("Cyrillic" code page 1251).
(the system code page issue is described in detail in the "Full Russification"
section of my site)
So, to avoid this long description - "...when system code page..." -
I call such a Windows installation a "Russian Windows".
The reason for the appearance of these new problems in the modern software
(such as Word 97/2000, Internet Explorer, etc.) is that
the modern applications use a new type of data encoding - Unicode.
Many Windows applications are still non-Unicode programs and use
legacy encodings such as
"Western European, Code Page 1252" or "Cyrillic, Code Page 1251".
Examples of the non-Unicode programs where you can type some Cyrillic text:
- Text input windows of Netscape ver. 3 and 4, for example, an e-mail preparation window
(Composition window)
- UltraEdit by Ian D. Mead, that I use
all the time for preparing my Web pages and other text-related and programming work.
To work with Russian, I just need to choose - via View/SetFont - say
"Courier New", Script="Cyrillic" there.
The majority of 3rd party plain text editors (those working with .TXT files)
are non-Unicode programs.
- Macromedia Dreamweaver - text input window
(I have not seen it myself, but I have read that Macromedia software is also of a non-Unicode type)
When you want to move Cyrillic texts (perform Copy and Paste)
between Word 97 (or another Unicode-based program) and some
non-Unicode program, or want to work with a plain text file (.TXT)
along with Word 97/2000, you may face some problems:
working simultaneously with both types of Cyrillic texts -
texts that use Unicode and those that do not use Unicode -
often leads (under a non-Russian Windows) to unreadable, gibberish text
or just a set of question marks instead of Cyrillic letters.
Below you will find the solutions for these problems.
Note.
I assume that you already know how to enable Cyrillic fonts and Cyrillic
keyboard tools in your Windows.
If that is not the case, then do it before reading any further here.
To enable Cyrillic fonts and keyboard, read the "Cyrillic in Windows"
section of my site.
Table of Contents
- Copy/Paste between Unicode and non-Unicode programs:
- Unicode program ---> non-Unicode program
From Unicode program (f.e. Word 97 and newer, Internet Explorer,
Outlook Express, MS Outlook 2000, Netscape 7/Mozilla)
to a non-Unicode one (f.e. Netscape 4.79, UltraEdit, Dreamweaver) -
question marks (???) instead of Cyrillic in the non-Unicode program's window
- non-Unicode program ---> Unicode program
From non-Unicode program (f.e. Netscape 4.79 or UltraEdit)
to Unicode one (f.e. Word 97 and newer,
Internet Explorer, Outlook Express, MS Outlook, Netscape 7/Mozilla) -
unreadable (gibberish) text instead of Cyrillic in the Unicode program's window
- Specific to MS Word 97 and newer: Cyrillic plain text (.TXT) file -
- reading such file: unreadable (gibberish) text instead of Cyrillic
- saving your Cyrillic document as .TXT: question marks instead of Cyrillic
Copy/Paste:
Unicode program ---> non-Unicode program
You try to copy some Cyrillic text from a Unicode program
(f.e. Word 97 and newer, Internet Explorer, Outlook Express, MS Outlook 2000, Netscape 7/Mozilla)
to a non-Unicode one (f.e. Netscape 4.79, UltraEdit, Dreamweaver)
and see just question marks (???) instead of Cyrillic as a result.
This usually happens under a non-Russian version of Windows (that is, where System Code Page
is not "Cyrillic, CP-1251").
Conversion from Unicode text to non-Unicode text is usually based on the System Code Page,
and thus under a "Western" Windows installation (where the system code page is
"Western European, CP-1252") the following happens:
- Unicode contains Cyrillic letters while "Western" code page (encoding)
does not contain any.
- Therefore, for each Cyrillic letter the result of such conversion is a
question mark ('?'),
which is the designated symbol meaning
"character is not found in the target code page".
By the way, it's a real, regular question mark and nothing more, i.e. the text
with question marks no longer contains any Cyrillic letters.
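To make this concrete, here is a small Python sketch of the same kind of conversion. It is only an illustration under my own assumptions: the sample string is made up, and Python's "replace" error handler stands in for whatever a particular Windows program does internally.

```python
# Sketch: what a Unicode -> "Western" (CP-1252) conversion does to Cyrillic.
text = "Привет, world"

# CP-1252 has no Cyrillic letters, so each one is replaced by '?'.
western = text.encode("cp1252", errors="replace")

# CP-1251 does contain the Cyrillic letters, so nothing is lost.
cyrillic = text.encode("cp1251")

print(western)  # b'??????, world'

# The question marks are ordinary 0x3F bytes: decoding them again
# gives back '?' characters, never the original letters.
assert western.decode("cp1252") == "??????, world"
assert cyrillic.decode("cp1251") == text
```

The Latin part of the string survives in both cases, which is why only the Cyrillic letters turn into question marks.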
Solution: use an intermediate window - a program that understands Unicode
and also lets you specify that you are dealing with "Cyrillic" and not "Western" encoding.
I suggest using one of the following programs of this type
(click on the corresponding link below to read the instruction):
Netscape 4 can help to solve the problem during Copy/Paste from a Unicode program
(f.e. Internet Explorer or Word 97/2000) to a non-Unicode program
(f.e. plain text editor or Dreamweaver):
Netscape Communicator 4.x has a built-in HTML editor - Composer,
that is good for this - it understands Unicode and also
lets us specify that we are dealing with Cyrillic text and not "Western":
- Start Netscape
- Open Composer window via the menu - Communicator/Composer
- Switch to Cyrillic(Windows) encoding:
- in Netscape 4.x: View/CharacterSet/Cyrillic(Windows-1251)
- in Netscape 4.0x: View/Encoding/Cyrillic(Windows-1251)
Now you can use this window as an intermediate one:
- Copy the text from your Unicode program to Netscape Composer first
(where the current encoding is Cyrillic(Windows)!)
- Select the text that was copied to Composer (f.e. via Ctrl/A or Edit/SelectAll)
and copy it now to the needed non-Unicode program.
This will produce normal Cyrillic text, not question marks, because
the system now 'knows' that the text was in "Cyrillic" encoding and not
in the encoding of the system code page ("Western").
UniPad
(freeware for personal use)
can help to solve the problem during Copy/Paste from a Unicode program
(f.e. Internet Explorer or MS Word 97/2000) to a non-Unicode program
(f.e. plain text editor or Dreamweaver):
- Download and install UniPad editor. Here is "UniPad Home Page"
- Open UniPad and do File/New in it to have a new document window
Now you can use this window as an intermediate one:
Copy/Paste:
non-Unicode program ---> Unicode program
You try to copy some Cyrillic text from a non-Unicode program
(f.e. Netscape 4.79 or UltraEdit or Macromedia Dreamweaver)
to a Unicode program
(f.e. Word 97 and newer, Internet Explorer,
Outlook Express, MS Outlook 2000, Netscape 7/Mozilla)
and see just unreadable (gibberish) text instead of Cyrillic as a result.
This usually happens under a non-Russian version of Windows (that is, where System Code Page
is not "Cyrillic, CP-1251").
The Unicode program does not know that the incoming text is Cyrillic and uses the system
code page by default during the conversion from non-Unicode text to Unicode text.
For example, under a "Western" installation of Windows it treats the incoming bytes as a sequence
of "Western" encoding bytes and performs the conversion
"Western European, CP-1252" ---> Unicode
For example:
The Cyrillic small letter 'd' (д) contained in that original non-Unicode text has a byte value of 228
in the "Cyrillic, CP-1251" code page. But the Unicode program assumes that the incoming data
belong to the "Western" encoding! In the "Western, CP-1252" code page the value 228 is a German
a-umlaut (ä), so the following conversion takes place:
non-Unicode German a-umlaut ---> Unicode German a-umlaut
and you'll see a German a-umlaut in that Unicode program instead of the Russian 'd'
after you paste the text there.
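The same byte 228 example can be reproduced in a few lines of Python. This is just a sketch of the idea; the variable names are mine, and the codec names "cp1251" and "cp1252" are Python's names for the two code pages.

```python
# Sketch: one byte, two meanings, depending on the assumed code page.
raw = "д".encode("cp1251")    # the non-Unicode text: a single byte

print(raw[0])                 # 228

# A "Western" system assumes CP-1252 and gets a-umlaut.
wrong = raw.decode("cp1252")

# Telling the converter the real source encoding gives the Russian letter.
right = raw.decode("cp1251")

print(hex(ord(wrong)))        # 0xe4  (U+00E4, LATIN SMALL LETTER A WITH DIAERESIS)
print(hex(ord(right)))        # 0x434 (U+0434, CYRILLIC SMALL LETTER DE)
```

Nothing is wrong with the byte itself; only the assumption about its code page differs, and that is what both solutions below correct.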
There are 2 possible solutions to this situation. Some non-Unicode programs let you use
very simple Solution 1, so just try it first, but if it does not work, then
use Solution 2.
Note. Word 2000/XP has its own solution for text copied into a Word window -
see the "MS macro Eefonts for Word 2000/XP" section below.
Solution 1
Use the following approach while copying the text from a non-Unicode program
(f.e. Netscape 4.79 or UltraEdit or Dreamweaver, etc.) to the Windows Clipboard:
- Select the text you want to copy
- Before you choose Edit/Copy in that program's menu (or press Ctrl/C),
you need to switch your keyboard to the needed Cyrillic mode, say, "Russian" if
it's a Russian text.
(The activation of Russian keyboard is covered in the
"Russian Keyboard: standard and phonetic" section of my site)
- Now, having "RU" on your Taskbar keyboard language indicator, do Edit/Copy.
This kind of 'tells' the system that you are trying to do Copy/Paste with a
Cyrillic text and not "Western"
- When you paste the text to a Unicode program now (f.e. to Word 97/2000 or
Internet Explorer),
you should see normal Cyrillic - the conversion from non-Unicode to Unicode
is correctly assuming that incoming text belongs to "Cyrillic, CP-1251" code page.
This Solution 1 (switching the keyboard to Cyrillic mode before copying) may not work for each and every
non-Unicode program.
In such a case, use Solution 2 below or, if you are pasting into Word 2000/XP, the macro described next.
MS macro Eefonts for Word 2000/XP
Microsoft offers a free macro that solves the problem of a non-readable text copied from
some non-Unicode program to a Word 2000/XP document.
The same macro also helps to make readable an old Cyrillic .doc created in the past with,
for example, non-Unicode Word 6.
Go to the Microsoft page (Knowledge base article Q260162)
"Incorrect Characters Appear When You Open Document in Earlier Eastern European Version of Word".
Find there a link to download Eefonts.exe.
Download and install it. Now in your Word 2000/XP you will have a new
option under the Tools menu:
Tools / Fix Broken Text
When you copy some Cyrillic from a non-Unicode program to Word 2000/XP,
you will first see some gibberish text (as explained above).
You need to select that text and then:
- Choose Tools / Fix Broken Text
- Choose "Russian" in the list (if the text you are copying is Russian).
Now you will have readable Cyrillic!
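I do not know how the Microsoft macro works inside, but the general idea of such a repair can be sketched in Python: turn the gibberish back into the original bytes using the wrongly assumed code page, then read those bytes with the correct one. The function name fix_broken_text and its parameters are hypothetical, chosen only for this illustration.

```python
# Sketch of a "fix broken text" repair. This is NOT the Eefonts macro,
# only an illustration of the same idea under my own assumptions.
def fix_broken_text(gibberish: str,
                    assumed: str = "cp1252",
                    actual: str = "cp1251") -> str:
    # Step 1: undo the wrong conversion to recover the original bytes.
    original_bytes = gibberish.encode(assumed)
    # Step 2: convert those bytes again, this time with the right code page.
    return original_bytes.decode(actual)

# Simulate pasting Russian text into a Unicode program on "Western" Windows.
broken = "Привет".encode("cp1251").decode("cp1252")
print(broken)   # Ïðèâåò

fixed = fix_broken_text(broken)
assert fixed == "Привет"
```

This round trip works for the Russian alphabet. A few non-Russian Cyrillic letters use byte values that Python's cp1252 codec leaves undefined, so for them this simple sketch would raise an error instead of repairing the text.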
Solution 2
for non-Unicode --> Unicode copying case
The universal solution for the successful copy of Cyrillic text from a non-Unicode program
(Netscape 4.79, Dreamweaver, plain text editor, etc.) to a Unicode one
(Internet Explorer, Outlook, etc.) is the following:
Use an intermediate window such as a program that understands Unicode
and also lets you specify that you are dealing with "Cyrillic" and not "Western" encoding.
I suggest using the freeware (for personal use) editor
UniPad as such an intermediate program:
- Download and install UniPad editor:
"UniPad Home Page"
- open UniPad and do File/New in it to have a new document window
Now you can use this UniPad window while copying Cyrillic from
a non-Unicode program (f.e. Netscape 4.79 or Dreamweaver or UltraEdit)
to some Unicode program
(MS Word 97/2000, Internet Explorer, Outlook Express, etc.):
- In your non-Unicode program select and copy the text that contains Cyrillic letters to
Windows Clipboard (Edit/Copy in your program's menu or Ctrl/C)
- Go to UniPad and do Edit / Paste As
In the list choose the needed encoding -
"Windows CP-1251 (Cyrillic)" and click Ok.
Now you should see normal Cyrillic text in this UniPad window.
That was the conversion from non-Unicode text to Unicode text (UniPad is a Unicode editor)
where instead of using System Code Page (say, "Western") as a source encoding,
we explicitly specified that the source encoding is "Cyrillic"!
- Now you can safely select and copy the text from this UniPad window (which is already
a Unicode text now) to any Unicode-based program -
you'll see normal Cyrillic as a result
Cyrillic in MS Word 97 and newer:
working with .TXT files
- opening such file in Word: unreadable (gibberish) text instead of Cyrillic
- saving your Cyrillic document as .TXT: question marks instead of Cyrillic
The above happens under a non-Russian Windows, i.e. when system code page is not
"Cyrillic, CP-1251".
Plain text files (.TXT) contain non-Unicode text, so when Unicode-based Word 97/2000
deals with such files, it performs the conversion between Unicode text and non-Unicode text.
By default, this conversion uses the system code page and therefore we see the above
problems if the system code page is, say, "Western, CP-1252" and not "Cyrillic".
The solution is to specify that the content of the plain text (.TXT) file
belongs to "Cyrillic" encoding and not to system code page.
MS Word 2000 and newer have their own way to specify that, while Word 97
requires an intermediate program.
Here are the solutions for the two cases where plain text (.TXT) Cyrillic files are involved:
Opening a Cyrillic plain text (.TXT) file in MS Word 97 and newer
Let's assume that you have some plain text (.TXT) Russian file that contains
the text in "Cyrillic CP-1251" encoding
(a.k.a. "Cyrillic(Windows)" or "Windows-1251").
Word 2000 (and newer versions) allows you to specify that this file is really a Cyrillic
one, while Word 97 requires a more complex approach.
MS Word 2000 and newer
- Go to Tools/Options/General and check the box
"Confirm Conversion at Open" (i.e. show the conversion details dialog)
- Do File / Open and then in the "Files of Type" list choose
"Encoded text files (.txt)"
(in Word 2003 - choose "Text file (*.txt)")
- Point to your Cyrillic plain text file and click "Open"
- Word 2000 presents you with another list called "Convert File". Choose
"Encoded Text" there.
Word 2003 offers "Encoded text" right away (it already 'knows' that this is the case)
- Now Word asks you to specify the encoding of that text:
- Click on "Other Encoding"
- In the list, choose "Cyrillic(Windows)"
- Click on Ok
You should now see normal Cyrillic text in your Word 2000 window.
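Choosing "Cyrillic(Windows)" in that dialog amounts to naming the file's encoding explicitly instead of relying on the system code page. A rough Python equivalent is sketched below; the helper name read_cyrillic_txt and the temporary sample file are my own, made up for the example.

```python
import os
import tempfile

# Hypothetical helper: read a plain text file with an explicitly named encoding.
def read_cyrillic_txt(path: str, encoding: str = "cp1251") -> str:
    with open(path, "r", encoding=encoding) as f:
        return f.read()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")

    # Create a small Windows-1251 file to stand in for a real Cyrillic .TXT.
    with open(path, "wb") as f:
        f.write("Привет".encode("cp1251"))

    loaded = read_cyrillic_txt(path)                       # correct encoding named
    misread = read_cyrillic_txt(path, encoding="cp1252")   # "Western" assumed

assert loaded == "Привет"
assert misread == "Ïðèâåò"
```

The second call shows the gibberish you get when the file is read as "Western" text, which is what happens by default on a non-Russian Windows.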
MS Word 97
There are several possible solutions for loading Cyrillic .TXT file into Word 97,
let's look at two of them:
Netscape-based method for loading Cyrillic .TXT into Word 97
In Netscape, do File/Open and choose "Text (.TXT)" as the "Files of Type".
Your Cyrillic .txt opens in Netscape. Change encoding to Cyrillic(Windows-1251):
- in Netscape 6 - View/Character Encoding/Cyrillic(Windows-1251)
- in Netscape 4.5+ - View/Character Set/Cyrillic(Windows-1251)
- in Netscape 4.0x: View/Encoding/Cyrillic(Windows-1251)
Now you should see normal Cyrillic text and can safely copy it to Word 97.
Plain Text non-Unicode editor-based method of loading Cyrillic .TXT file into Word 97.
Instead of opening your Cyrillic plain text (.TXT) file directly in Word 97,
you need to open it in any non-Unicode plain text editor and then use Copy/Paste
methods of this page to place this text into Word 97.
I am using the shareware plain text editor UltraEdit,
so you can download it too, or use your favorite plain text editor that works with Cyrillic.
Let's use UltraEdit as an example:
- In the UltraEdit menu, go to View / Set Font and there:
- select "Courier New".
- in the Script list at the lower right, choose "Cyrillic"
- Do File / Open to load your Cyrillic .TXT file into UltraEdit. You should see normal
Cyrillic text.
- To copy the text from non-Unicode UltraEdit to Unicode-based Word 97
use the method explained above on this page:
Copy/Paste technique non-Unicode --->Unicode
Saving Word 97+ Cyrillic text as a plain text (.TXT) file
Let's assume that you want to save your document opened in MS Word,
as a plain text (.TXT) Russian file.
Word 2000 (and newer versions) allows you to specify that this file is really a Cyrillic
one, while Word 97 requires a more complex approach.
MS Word 2000 and newer
- File / Save As, type the name of that new file and then in the "Save as Type" list choose
"Encoded text (.txt)"
(in Word 2003 choose "Plain Text (*.txt)")
- Word presents you with another window - "File Conversion" - where
it asks you to specify the encoding of that text:
- Click on "Other Encoding"
- In the list, choose "Cyrillic(Windows)"
- Click on Ok
That newly created plain text file contains normal Windows-1251 Cyrillic text and not
question marks :)
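The saving direction can be sketched the same way in Python: name the target encoding when writing the file. The helper name save_as_txt and the file names are hypothetical, and the "replace" error handler is my stand-in for the question-mark substitution described on this page.

```python
import os
import tempfile

# Hypothetical helper: write text to a plain text file in a named encoding.
# Characters missing from that encoding are written as '?'.
def save_as_txt(text: str, path: str, encoding: str = "cp1251") -> None:
    with open(path, "w", encoding=encoding, errors="replace") as f:
        f.write(text)

with tempfile.TemporaryDirectory() as tmp:
    good_path = os.path.join(tmp, "cyrillic.txt")
    bad_path = os.path.join(tmp, "western.txt")

    save_as_txt("Привет", good_path)                    # "Cyrillic(Windows)"
    save_as_txt("Привет", bad_path, encoding="cp1252")  # "Western" code page

    with open(good_path, "rb") as f:
        good_bytes = f.read()
    with open(bad_path, "rb") as f:
        bad_bytes = f.read()

print(good_bytes)  # b'\xcf\xf0\xe8\xe2\xe5\xf2'
print(bad_bytes)   # b'??????'
```

The first file holds one Windows-1251 byte per letter; the second holds only question marks, and no later conversion can bring the letters back.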
MS Word 97
So you have some Cyrillic text in your open MS Word 97 window and want to save it as
a plain text file.
Instead of creating this Cyrillic plain text (.TXT) file using Word 97,
you need to copy the text to any non-Unicode plain text editor and then
do Save As there.
I am using the shareware plain text editor UltraEdit,
so you can download it too, or use your favorite plain text editor that works with Cyrillic.
Let's use UltraEdit as an example:
- In the UltraEdit menu, go to View / Set Font and there:
- select "Courier New".
- in the Script list at the lower right, choose "Cyrillic"
- To copy the text from Unicode-based Word 97
to non-Unicode UltraEdit, you need to
use the method explained above on this page:
Copy/Paste technique Unicode --->non-Unicode
- Now, in UltraEdit, when you see normal Cyrillic text copied from MS Word 97,
you can create this plain text (.txt) file - via
File / Save As menu of UltraEdit.