M is for MACRO: ASCII to Unicode strings
Ernest Murphy ernie@surfree.com
Get the L.inc file
Abstract:
------------------------------------------------------------------------------
Documentation on several compile-time macros
is lacking or unclear from the
usual sources. This is a report on some
macro functions I have been using
as of late. It demonstrates a macro that converts an
ASCII string to Unicode at compile
time.
Introduction:
------------------------------------------------------------------------------
Lately I've been doing lots of work with COM methods from assembly. Scattered
inside these methods is a mixture of our well-known friend, the ASCII text
string, and our new arch-nemesis, the Unicode wide string. As MASM has no
built-in functions to define Unicode strings, I was constrained to define ASCII
strings, then use a buffer and a MultiByteToWideChar API call to get Unicode
strings. Inefficient, annoying, dreadful, but a quick and dirty way to keep
going where I wanted to go.
I spent a recent weekend just playing around
in MASM with wide strings, and
made a happy discovery. One can define a
Unicode string like so:
wszSomeString WORD "H","E","L","L","O"," ","W","O","R","L","D",0
The way Unicode is defined, an ASCII character maps to the same Unicode
character, with just the size of the data being different. You can try this out
yourself in a simple test app by using MessageBoxW instead of just MessageBox.
MessageBox equates to MessageBoxA, the ASCII version of this API. MessageBoxW
expects Unicode strings. Just add a prototype for it in your code (MASM32
includes the correct library, but only defines ASCII protos).
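The "same character, bigger cell" claim is easy to check outside the assembler. The Python sketch below is not part of the original article, and `word_line` is a hypothetical helper name. It assumes 16-bit little-endian wide characters (what the W APIs expect) and plain ASCII input with no double quotes. It checks that widening only adds a zero high byte, then prints the WORD line you would otherwise type by hand.

```python
def word_line(label, text):
    """Build a MASM data line defining `text` as zero-terminated 16-bit characters."""
    if not text.isascii() or '"' in text:
        raise ValueError("this sketch handles plain ASCII without double quotes only")
    items = ['"%s"' % ch for ch in text] + ["0"]   # one quoted item per character, then 0
    return "%s WORD %s" % (label, ",".join(items))

# Each ASCII byte keeps its value; it is simply stored in two bytes, low byte first.
ascii_bytes = "HELLO".encode("ascii")
wide_bytes = "HELLO".encode("utf-16-le")
assert wide_bytes[0::2] == ascii_bytes      # low bytes are the ASCII codes
assert wide_bytes[1::2] == b"\x00" * 5      # high bytes are all zero

print(word_line("wszSomeString", "HELLO WORLD"))
```

The printed line is the same definition shown above, which is exactly the typing chore the rest of this article tries to automate.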
Now that's useful! But it's damn annoying
too... very prone to errors, hard
to read, even harder to type. It just cries
out for a macro to do the conversion.
How shall we proceed?
Well, when doing something new, I like to
do it the same way as something old
or already known. When writing in C++, one
defines a string like this:
wchar_t wszSomeString[] = L"MyString";
It would be great if we could make our macro function look the same. Well,
we can't. MASM wants a macro function's parameters inside parentheses, and
it also wants text enclosed in angle brackets. So the best we can do is:
wszSomeString wchar L(<MyString>)
Still, not half bad. By now you should be used
to doing all your TEXTEQU's
like this anyway, with those surrounding
angle brackets. You're not doing this?
This is a great way to define constants
in your own .inc files because you only
make .data entries for the constants you
use. It works like this: Inside your
.inc file, you define a text constant like
this:
sRadiusOfEarthInMiles TEXTEQU <3959>
Then, inside your source code .data area you use it like:
RadiusOfEarthInMiles DWORD sRadiusOfEarthInMiles
Which MASM will expand into:
RadiusOfEarthInMiles DWORD 3959
Hey! That's what we wanted, and we didn't have
to do any EXTERNs or anything.
Simple, bulletproof. I use this a lot to
define GUIDs (a huge structure of
differently typed numbers).
So... back to ASCII to Unicode. Let's make
a simple text macro to surround
ASCII characters with quotes so we can equate
them to Unicode Strings. "FORC"
is a macro command that loops through once
for each letter in a text equate.
Whoops... we send it text, will that work?
Almost, if we surround our sText
variable with angle brackets. AND... add
an "&" in front of the variable; this
directs MASM to look up the value, not use
the value's name.
wchar TYPEDEF WORD

L MACRO sText:REQ
    LOCAL str, chr
    FORC chr, <&sText>
        str CATSTR str, <">, <&chr>, <",>   ; surround each char with quotes, and
                                            ; add trailing comma for the next character
    ENDM
    str CATSTR str, <0>                     ; almost done, just add
                                            ; the terminating zero
    EXITM str
ENDM
Simple, direct, and has a BIG PROBLEM. If you compile:
wszSomeString wchar L(<Hello World>)
That works fine. But what if we try:
wszSomeString wchar L(<Hello World!>)
Whoops... an error. Easy you say, knowing that
an exclamation point has a
special meaning in text macros. It means
take the next character as a literal,
just in case you wanted to include angle
brackets inside your text equate (and
you will at times). Well... just doing this
also fails:
wszSomeString wchar L(<Hello World!!>)
Why would that fail? It all works fine for a
while, the correct string gets
passed to the macro, and eventually the
"!" character is parsed. Then trouble
happens when we try to do the CATSTR, because <&chr> will expand to: <!> And
that's an unbalanced text literal. We need an odd number of !'s to get the
CATSTR to work, but need to send an even number in the macro function
invocation line...
No matter what you do, it ain't gonna work. No problem if you never use an
exclamation point, but dang, I sure want to.
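To see why <!> is the troublemaker, here is a rough Python model of how an angle-bracket text literal is read. It is a simplified sketch and not MASM's real parser: it assumes only that "!" makes the next character literal, and it ignores nested brackets. The name `read_text_literal` is hypothetical.

```python
def read_text_literal(src):
    """Return the text inside a leading <...> literal, honoring '!' escapes."""
    if not src.startswith("<"):
        raise ValueError("expected '<'")
    out = []
    i = 1
    while i < len(src):
        ch = src[i]
        if ch == "!":
            if i + 1 >= len(src):
                raise ValueError("dangling '!'")
            out.append(src[i + 1])   # the escaped character is taken literally
            i += 2
        elif ch == ">":
            return "".join(out)      # an unescaped '>' closes the literal
        else:
            out.append(ch)
            i += 1
    raise ValueError("unbalanced text literal")

assert read_text_literal("<Hello World!!>") == "Hello World!"

# <!> never closes: the '!' swallows the '>' as a literal character.
try:
    read_text_literal("<!>")
except ValueError as err:
    print("error:", err)
```

Under this model a doubled "!!" survives as one exclamation point, while a lone "!" in front of the closing bracket eats the bracket, which is the unbalanced case described above.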
So we are left with doing some sort of alias
for "!". C++ uses a backslash to
do this, so we will too. Let's define "\|"
as an exclamation point. All we have
to do is compare the chr value in the loop
to "\" and we can... but wait a sec.
Compare chr? To what? The implementation
of IF in MASM is pretty lame. There
is no way it can do that comparison. It
just wants numbers. But wait...
In the "good old days," MASM was a real product and sold on shelves and
had... REAL BOOKS. (MS still sells MASM direct to MSVC and Studio owners for 20
bucks; I do not know if it still ships with books. Well worth the call.) Inside
the Programmer's Guide to MASM come a few more conditional directives you will
find useful. These are:
The Directive               Grants Assembly If
===================================================================
IF {expression}             {expression} is true
IFE {expression}            {expression} is false
IFDEF {name}                {name} has been previously defined
IFNDEF {name}               {name} has not been previously defined
IFB {argument}              {argument} is blank
IFNB {argument}             {argument} is not blank
IFIDN[I] {arg1}, {arg2}     {arg1} equals {arg2}
IFDIF[I] {arg1}, {arg2}     {arg1} does not equal {arg2}

The optional [I] in IFIDN and IFDIF makes comparisons insensitive to
differences in case.
Wow. Some of these look very useful. IF looks
good, except after trying it I
can tell you it wants numeric only args.
The expressions in IF are of the form
"IF num1 EQ num2" where num1 & num2
are numeric constants. IFIDN (the best
acronym I can come up for this command is
"IF IS DIFFERENT NOT," I hate
senseless command names) actually does what
we want, compare two text values.
Each value must be a text string or a text
equate variable.
Since we're looping through character by
character, we need some sort of state
information from loop to loop to remember
we're processing multi-character
information. Here again, we can use text
variables to do this for us. Let's try
a revised macro:
L MACRO sText:REQ
    LOCAL str, chr, flag
    flag TEXTEQU < >
    FORC chr, <&sText>
        IFDIF flag, <\>                 ; flag is clear: we're processing a normal char
            IFIDN <&chr>, <\>           ; see if char is a backslash
                flag CATSTR <\>         ; and remember it in flag
            ELSE
                str CATSTR str, <">, <&chr>, <",>   ; just add the character normally
            ENDIF
        ELSE                            ; flag is set: we're processing a command
            str CATSTR str, <"!!",>     ; add the exclamation point
            flag CATSTR < >             ; clear the flag
        ENDIF
    ENDM
    str CATSTR str, <0>
    EXITM str
ENDM
Well, this works a little better. We get exclamation
points back, but we lost
the backslash at the same time. We need
to check the 2nd character! Let's fix that.
L MACRO sText:REQ
    LOCAL str, chr, flag
    flag TEXTEQU < >
    FORC chr, <&sText>
        IFDIF flag, <\>
            IFIDN <&chr>, <\>
                flag CATSTR <\>
            ELSE
                str CATSTR str, <">, <&chr>, <",>
            ENDIF
        ELSE
            IFIDN <&chr>, <|>               ; check the 2nd command char
                str CATSTR str, <"!!",>     ; add the exclamation point
            ELSE
                str CATSTR str, <">, <&chr>, <",>
            ENDIF
            flag CATSTR < >
        ENDIF
    ENDM
    str CATSTR str, <0>
    EXITM str
ENDM
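The flag logic is a tiny two-state machine, and it can be easier to follow away from MASM's syntax. The Python sketch below mirrors the macro above arm for arm. It is only a model for checking the expected operand list, not a replacement for the macro, and `expand_bang` is a hypothetical name.

```python
def expand_bang(text):
    """Model of the macro above: returns the operand list it would build."""
    out = []
    flag = " "                          # plays the role of the macro's flag variable
    for ch in text:
        if flag != "\\":                # normal state
            if ch == "\\":
                flag = "\\"             # remember the backslash, emit nothing yet
            else:
                out.append('"%s",' % ch)
        else:                           # previous character was a backslash
            if ch == "|":
                out.append('"!",')      # \| stands for an exclamation point
            else:
                out.append('"%s",' % ch)
            flag = " "                  # clear the flag
    out.append("0")                     # automatic terminating zero
    return "".join(out)

print(expand_bang("Hi\\|"))   # the Python literal "Hi\\|" is the text Hi\|
```

Note that a backslash followed by anything other than a pipe just prints that character, so `\\` is the way to get a real backslash in this model.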
Now we're getting somewhere... but not too far.
MASM has a single line text
limit of 256 characters. This means we can
have a string of 57 characters
maximum before this technique bombs out
on us. One quick fix would be to take
out the automatic trailing zero, then we
can define lots of strings in a row,
and they all become one string until that
final terminating zero is met. Let's
add that:
L MACRO sText:REQ
    LOCAL str, chr, flag
    str CATSTR < >                      ; define the initial str
    flag TEXTEQU < >
    FORC chr, <&sText>
        IFDIF str, < >
            str CATSTR str, <,>         ; add a trailing comma ONLY to
                                        ; non-null strings
        ENDIF
        IFDIF flag, <\>
            IFIDN <&chr>, <\>
                flag CATSTR <\>
            ELSE
                str CATSTR str, <">, <&chr>, <">    ; no trailing comma
            ENDIF
        ELSE
            IFIDN <&chr>, <|>
                str CATSTR str, <"!!">              ; no trailing comma
            ELSE
                str CATSTR str, <">, <&chr>, <">    ; no trailing comma
            ENDIF
            flag CATSTR < >
        ENDIF
    ENDM
    ; no trailing zero here either
    EXITM str
ENDM
Pretty neat, just use the macro function with
a trailing zero like this:
L(<Hello World>),0
and we get a single string, or put them together
for longer strings. But... as long
as we made all this bother just so we can
insert an exclamation point, why not
keep going and make this function really
work for us?
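Chopping a long string into pieces by hand gets old fast, so here is a small Python sketch that does the splitting. It is an illustration only: `split_for_masm` is a hypothetical name, the default limit of 57 is simply the figure quoted above, and the text is assumed to contain no backslash commands (a split that lands between "\" and "|" would break the pair).

```python
def split_for_masm(label, text, limit=57):
    """Return source lines defining `text` with several L(<...>) calls."""
    if not text:
        raise ValueError("the macro requires a non-empty argument")
    chunks = [text[i:i + limit] for i in range(0, len(text), limit)]
    lines = []
    for n, chunk in enumerate(chunks):
        # Only the first line carries the label; the rest continue the data.
        name = label if n == 0 else " " * len(label)
        # Only the last piece gets the terminating zero.
        tail = ",0" if n == len(chunks) - 1 else ""
        lines.append("%s wchar L(<%s>)%s" % (name, chunk, tail))
    return lines

for line in split_for_masm("wszLong", "A long caption that needs splitting", limit=12):
    print(line)
```

The unlabeled continuation lines rely on MASM placing consecutive data definitions back to back, so the pieces read as one string up to the final zero.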
Let's add two more commands: a newline (\n) command, and a terminating zero
(\0) command. We'll keep things simple by not checking that the terminating
zero is at the end. And as we think of more functions, these get easy to add.
Before we launch into this, one thing has to be worked out, since there is no
matching ELSEIF to IFDIF. We need an "ELSE" clause to make non-command
characters print lest we lose our backslash character. To do this, let's make
the flag variable do something else: in the command code arm, flag has already
done its job of remembering the previous character. So we can re-define it.
Here is the final macro:
L MACRO sText:REQ
    LOCAL str, chr, flag
    ;; generates a wide character string
    ;; usage:     sztext wchar L(<Hello World\|\|\0>)
    ;; generates: sztext WORD "H","e","l","l","o"," ",
    ;;            "W","o","r","l","d","!","!",0
    ;; max string length is 57 chars (MASM line length limit)
    ;; use multiple non-zero term strings in sequence for longer strings
    ;; (zero term the last of course)
    str TEXTEQU < >
    flag TEXTEQU <.>
    FORC chr, <&sText>
        IFDIF flag, <\>
            IFDIF str, < >
                str CATSTR str, <,>
            ENDIF
        ENDIF
        IFDIF flag, <\>
            IFIDN <&chr>, <\>
                flag CATSTR <\>
            ELSE
                str CATSTR str, <">, <&chr>, <">
            ENDIF
        ELSE
            flag CATSTR <X>
            ;; check for a pipe (exclamation point)
            IFIDN <&chr>, <|>
                str CATSTR str, <"!!">
                flag CATSTR < >
            ENDIF
            ;; check for an "n" (new line)
            IFIDN <&chr>, <n>
                str CATSTR str, <13,10>
                flag CATSTR < >
            ENDIF
            ;; check for an "0" (terminating zero)
            IFIDN <&chr>, <0>
                str CATSTR str, <0>
                flag CATSTR < >
            ENDIF
            ;; now check if no special chars were issued
            IFIDN <&flag>, <X>
                str CATSTR str, <">, <&chr>, <">
            ENDIF
            flag CATSTR < >
        ENDIF
    ENDM
    EXITM str
ENDM
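For checking what a given string should expand to, the final macro can be modeled in Python with a lookup table in place of the chain of IFIDN tests. This is a sketch, not a translation the assembler can use; `ESCAPES` and `expand_final` are hypothetical names, and adding another command to the model is one more table entry.

```python
# Command character -> operand text it contributes (mirrors the macro's three tests).
ESCAPES = {
    "|": '"!"',     # \| -> exclamation point
    "n": "13,10",   # \n -> carriage return, line feed
    "0": "0",       # \0 -> terminating zero
}

def expand_final(text):
    """Model of the final macro: returns the comma-separated operand list."""
    items = []
    escaped = False
    for ch in text:
        if escaped:
            # Unknown commands fall through and print the character itself,
            # so \\ yields a single backslash.
            items.append(ESCAPES.get(ch, '"%s"' % ch))
            escaped = False
        elif ch == "\\":
            escaped = True              # wait for the command character
        else:
            items.append('"%s"' % ch)
    return ",".join(items)

print(expand_final("Hello World\\|\\|\\0"))
# prints: "H","e","l","l","o"," ","W","o","r","l","d","!","!",0
```

Joining the items at the end plays the part of the macro's "add a comma only if str is non-empty and we are not mid-command" test.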
Well, here it is. It works very well; the only drawback is the HUGE amount of
code (over 1,150 just to translate "Hello World!") added to your listing file.
We'll just have to live with that; one can't have everything. The only way
around that would be to write a pre-processor, which is a messy affair anyway.