

Binary 101

Here's a little background on memory, the hexadecimal numbering system, and binary--the root of it all.

Computers are by nature (and design) binary beasts, with chips containing millions of circuits, and each circuit representing one of two states: on or off. Here's a table showing the entire binary numbering system and the complete range:

0  1

No, you didn't miss anything, that's it! Binary digits, like light switches and the flag on your mail box, can be either on or off. The word bit is a contraction of binary digit.

For more complex signals to the letter carrier, we could add flags to the mail box: one for each class of message (now, what did that chartreuse flag mean?). The postal service usually only recognizes one flag, so I guess we can hang a note on the box.

Individual bits are about as useful in day-to-day life as subatomic particles. Tying notes on bits won’t help; the computer can't read anyway. We need to lump bits together to make useful quantities--and extend the range. Each additional bit doubles the range binary numbers can represent. The least significant digit is on the right--like decimal numbers. In the computer world, two-bits is not a quarter dollar.

Two-bit Binary Numbers
Bit 1  Bit 0  Result
0      0      = 0. Both bits are false.
0      1      = 1.
1      0      = 2.
1      1      = 3. Both bits are true.

Now we have a range from 0 through 3, still using only two binary digits. The only problem with this scheme is that even if the number we want is zero, we still need both bits. This is always true--memory is always reserved for the greatest possible magnitude allowed--sometimes we type the number "1" even though it's really "01".
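
Here is a rough Python sketch of both points: each extra bit doubles the range, and every value occupies the full width even when the leading bits are zero. The helper name bit_patterns is made up for this example.

```python
def bit_patterns(width):
    """Return every value that fits in `width` bits as a zero-padded binary string."""
    # An n-bit number has 2**n possible values: 0 through 2**n - 1.
    return [format(value, "0{}b".format(width)) for value in range(2 ** width)]

print(bit_patterns(1))       # ['0', '1']
print(bit_patterns(2))       # ['00', '01', '10', '11']
print(len(bit_patterns(4)))  # 16
```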

Two bits still aren't enough for serious work; let's double the size again.

Four-Bit Hexadecimal Numbers
2^3  2^2  2^1  2^0
8    4    2    1

As we add values of set bits from each position, we find that 0101 binary is equivalent to 5 (2^2 + 2^0) decimal. When we get past 1001 (9 decimal), things get out of hand: you can't represent these values as a single decimal digit.
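
Adding up the place values of the set bits can be sketched in a few lines of Python. The function name bits_to_decimal is an invention for this illustration, not anything standard.

```python
def bits_to_decimal(bits):
    """Add the place value of every set bit in a binary string such as '0101'."""
    total = 0
    # Walk from the rightmost (least significant) digit, position 0, leftward.
    for position, digit in enumerate(reversed(bits)):
        if digit == "1":
            total += 2 ** position
    return total

print(bits_to_decimal("0101"))  # 5  (4 + 1)
print(bits_to_decimal("1111"))  # 15 (8 + 4 + 2 + 1)
```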

We've progressed (out of need, not necessarily choice) to the base 16 numbering system, called Hexadecimal, where digits don't stop at 9, but continue on through F, which is just big enough to represent four binary digits. Talking about numbers in binary is as fluid as talking about them in Roman numerals, which is why we use hexadecimal: it's a compact way to express binary values--the numbers are always binary, regardless of the number base we use to represent them.

Binary to Hexadecimal Conversion
Hex  Binary
0    0000
1    0001
2    0010
3    0011
4    0100
5    0101
6    0110
7    0111
8    1000
9    1001
A    1010
B    1011
C    1100
D    1101
E    1110
F    1111
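
The table works as a lookup: split a binary number into groups of four bits, and each group becomes one hex digit. Here is a minimal Python sketch of that idea; binary_to_hex and HEX_DIGITS are names I made up for the example.

```python
HEX_DIGITS = "0123456789ABCDEF"

def binary_to_hex(bits):
    """Convert a binary string to hex, one four-bit group per hex digit."""
    # Pad on the left so the length is a multiple of four.
    padded = bits.zfill((len(bits) + 3) // 4 * 4)
    groups = [padded[i:i + 4] for i in range(0, len(padded), 4)]
    return "".join(HEX_DIGITS[int(group, 2)] for group in groups)

print(binary_to_hex("1010"))      # A
print(binary_to_hex("01000001"))  # 41
print(binary_to_hex("111111"))    # 3F
```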

Hexadecimal is usually abbreviated as Hex. The fact that hex means six must have gone unnoticed in computer circles. A computer software wizard would be surprised to find that hex bolts don't have sixteen sides (not to worry, it's a hardware problem, anyway).

Byte-Size Hexadecimal Numbers

We combine hex digits to create larger binary numbers. A four-bit quantity--a single hex digit--is called a Nibble, or occasionally Nybble (abbreviated as Nib or Nyb).

Bytes are tidy units of eight bits, the smallest useful unit for storage, representing values from 0 through 255.

Binary Byte
7 6 5 4 3 2 1 0
0 0 0 0 0 0 0 0

The letter "A" takes one byte of memory. In memory, the letter "A" could be interpreted as the decimal number 65, the hexadecimal number 41, the binary number 01000001, or even part of a larger value. It all depends on how you look at it.
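
To see those different views of one byte, here is a short Python sketch. The pairing with a second byte ("AB") is just my example of "part of a larger value," and reading it high byte first is an arbitrary choice for the illustration.

```python
letter = "A"
code = ord(letter)            # the numeric value stored in the byte

print(code)                   # 65
print(format(code, "02X"))    # 41
print(format(code, "08b"))    # 01000001

# The same byte can also be half of a larger value: the two bytes 41h 42h
# ("AB") read as one 16-bit number, with the first byte as the high half.
print(int.from_bytes(b"AB", "big"))  # 16706 (4142h)
```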

The most common binary units are Nibble, Byte, Word, and Double Word:

Common Data Sizes
Name         Basic         Size
Nibble       -             4-bits
Byte         Character     8-bits
Word         Word          16-bits
Double Word  Long Integer  32-bits

These data sizes usually reflect the size of microprocessor data registers. Internally, the 8086 works easiest with 8-bit and 16-bit quantities--the 8-bit byte and the 16-bit word--because the 8086 is a 16-bit microprocessor.

While 16-bits is the most common word size--it's the word size used by Intel 8086 family microprocessors--it is by no means universal. Hewlett-Packard, for instance, uses a 20-bit word size in their HP-48 handheld computers, while later computers often use 32-bit words. The word size often reflects microprocessor memory address calculation units.

The Intel-compatible 80386 and later microprocessors' registers can hold 32-bit long integers. Because of this, long integer calculations on the 386 can be much faster if you use the compiler's 80386 switch (for Microsoft compilers, the switch is /G3); otherwise, even the mighty Pentium family, Cyrix M1/6x86 and AMD K6/Athlon act like very (very) fast 8086s.

The proper name for the first Intel 32-bit processor is "80386" (that's what it says on top of the chip). Most people use its first name, "386." The same is true for 80286 (286) and 80486 (486). We often call Pentiums by their nickname of "586," although that makes them angry.

Moving from right to left, each position in the binary number doubles the magnitude of the number. This table begins with 1 in the least significant bit position. As we shift the bit to the left, the value doubles:

Shifting Binary Digits
7 6 5 4 3 2 1 0  Result
0 0 0 0 0 0 0 1  = 1
0 0 0 0 0 0 1 0  = 2
0 0 0 0 0 1 0 0  = 4
0 0 0 0 1 0 0 0  = 8
0 0 0 1 0 0 0 0  = 16
0 0 1 0 0 0 0 0  = 32
0 1 0 0 0 0 0 0  = 64
1 0 0 0 0 0 0 0  = 128

Look at the table a little sideways, and you'll see that you can add bit values. If bit 0 and bit 1 are set (00000011), the value is 1+2=3. If bit 2 and bit 4 are set (00010100), the value is 4+16=20. And if all the bits are set (11111111), the sum is 1+2+4+8+16+32+64+128=255.
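
The shifting table and the sums above are easy to reproduce. This Python sketch uses the left-shift operator (<<) for the doubling; byte_value is a made-up helper name.

```python
# Shifting a single set bit left doubles its value each time.
for position in range(8):
    value = 1 << position
    print(format(value, "08b"), "=", value)

def byte_value(set_bits):
    """Add the place values of the given bit positions (0 is the rightmost bit)."""
    return sum(1 << position for position in set_bits)

print(byte_value([0, 1]))    # 3
print(byte_value([2, 4]))    # 20
print(byte_value(range(8)))  # 255
```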


Words

It's not until we get to 16-bits that binary numbers are big enough to use as, er, numbers. Numbers of bytes in a file, memory addresses, distances, bank check numbers, and so forth begin to fit in words, with unsigned values up to 65535 decimal.

Binary Word
15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

16-bit words--called Integers or Short Integers--often represent signed values. The leftmost (most significant) bit is the sign bit (true for negative, false for positive), leaving 15 bits for the magnitude. Thus, signed words can have values of -32768 (8000h) through 32767 (7FFFh). Single bytes can also be signed, but they rarely are.

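You might think the smallest number would be -32767, but the 8086 (and the high-level languages that run on it) uses two's complement negation, where we NOT the number, then increment by one, so FFFFh in hex is -1 decimal. Since there is only one zero this way, the leftover bit pattern, 8000h, becomes -32768.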
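
Here is a small Python sketch of that NOT-then-increment rule. Python integers aren't 16 bits wide, so the code masks results with FFFFh to imitate a word; negate16 and to_signed16 are names invented for the example.

```python
def negate16(value):
    """Two's complement negation of a 16-bit word: NOT, then add one."""
    return ((~value) + 1) & 0xFFFF   # the mask keeps the result to 16 bits

def to_signed16(word):
    """Interpret a 16-bit pattern as a signed value (top bit is the sign)."""
    return word - 0x10000 if word & 0x8000 else word

print(format(negate16(1), "04X"))  # FFFF
print(to_signed16(0xFFFF))         # -1
print(to_signed16(0x8000))         # -32768
print(to_signed16(0x7FFF))         # 32767
```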

Hexadecimal numbers frequently represent quantities like file sizes and number of bytes left on a disk. You cannot have negative file sizes, so there is usually no sign bit reserved. To avoid confusion between unsigned and signed numbers, high-level programming languages frequently use a larger variable than is necessary to hold file size information.

These powers of two tables compare 8- and 16-bit data:

8-Bit Bytes
2^7  2^6  2^5  2^4  2^3  2^2  2^1  2^0
128  64   32   16   8    4    2    1

16-bit Words add 8-bits; the low 8-bits are identical to Bytes.

16-Bit Words
2^15   2^14   2^13  2^12  2^11  2^10  2^9  2^8
32768  16384  8192  4096  2048  1024  512  256

For even larger values, we can combine bits into 32-, 64- and even 80-bit units. We use strings of ASCII codes in 8-bit units to represent text. Lately, however, as we'll soon see, 8-bits are not enough.

Beyond Bytes

Bytes and kilobytes are the ounces and pounds of the computer world—consistent, but not very intuitive. Computer companies sell memory chips by the Kilobyte (abbreviated KB or just K), not the pound, so the kilobyte is the standard bartering unit. Instead of making the storage unit larger, we just use more of them.

Kilobytes are expressions of powers of two, so instead of 1000, a K is 1024 (that's 2^10 bytes). The next size up is Megabyte (MB). This is not the villain from an old shark movie, but a K's-worth of Ks, or 1024*1024 bytes.

One common misconception is that a kilobyte is 1,000 bytes, and 512 kilobytes is 512,000 bytes. Since 1K is 1024 bytes, 512K is really 512*1024 bytes, or 524,288 bytes. It figures that a megabyte isn't one million bytes, but 1024*1024 bytes, or 1,048,576 bytes.

Kilobytes and Megabytes
Name      Formula      Bytes
Kilobyte  1*1024       1,024
Megabyte  1*1024*1024  1,048,576
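
If you'd rather let the computer do the multiplying, the same arithmetic looks like this in Python. The constant names are mine.

```python
KILOBYTE = 2 ** 10            # 1,024 bytes
MEGABYTE = KILOBYTE * 1024    # 1,048,576 bytes

print(format(512 * KILOBYTE, ","))  # 524,288
print(format(MEGABYTE, ","))        # 1,048,576
```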

Then we have Gigabytes and Terabytes, but we'll leave those units for another story.


Character Codes

The ASCII (American Standard Code for Information Interchange) coding system assigns a number to each letter, digit, punctuation mark and special symbol, and each code is stored in one 8-bit byte. Each character in this sentence, including spaces, requires one byte for storage on the disk, and one byte will have to be allocated as each character is loaded into memory.

The ASCII standard, developed by the American National Standards Institute (ANSI), assures that the character represented by ASCII 65 is "A" regardless of the machine. Using ASCII codes for communications, weather map computers on television, automatic teller machines, and even otherwise combative computer systems, could have reasonable conversations. Although it's hard to imagine what they'd talk about.

The ASCII standard applies only to codes 0 through 127, using only the lower seven bits; the eighth bit is reserved for parity checking. That parity bit is a one-bit checksum, tagged onto the data to assure that when the byte gets to its destination, it is still intact. In the last twenty-five or so years, data transmission has become much more reliable, so the parity bit is seldom needed anymore. This left it open--in the '70s--for engineers at each computer company to decide how to use the extra bit.
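
To make the parity idea concrete, here is a hedged Python sketch. It assumes even parity (the sender sets the eighth bit so the total count of set bits is even); the text above doesn't say which flavor was used, and odd parity was just as common. The function names are made up.

```python
def add_even_parity(code):
    """Put an even-parity bit in bit 7 of a 7-bit code."""
    ones = bin(code & 0x7F).count("1")
    parity = ones % 2                 # 1 if the count of set bits is odd
    return (parity << 7) | (code & 0x7F)

def parity_ok(byte):
    """True if the whole byte has an even number of set bits."""
    return bin(byte & 0xFF).count("1") % 2 == 0

sent = add_even_parity(ord("C"))   # "C" is 1000011, three set bits
print(format(sent, "08b"))         # 11000011
print(parity_ok(sent))             # True
print(parity_ok(sent ^ 0x04))      # False: one bit flipped in transit
```

A single parity bit catches any one flipped bit, but two flipped bits cancel out and slip through.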

ASCII codes above 127, called high-order characters, are today defined by a few more standards. By far the most common is the character set used by most PC-compatibles, known as Codepage 437. ANSI has stepped in and assigned another, and character sets used by European countries lie somewhere in between--the most common European character set is Codepage 850, which is similar to Codepage 437, but with accented letters replacing some box-drawing symbols.

Wide Characters And Unicode

Computers are traditionally English-speaking devices. Internationalization means that 8-bit ASCII character codes are not enough (there are lots of Kanji characters), so the Unicode Consortium developed a 16-bit wide-character standard, called Unicode, which Microsoft adopted for later versions of Microsoft Windows. It takes the original ASCII standard and adds a high-byte; for low-order characters, the high-byte is zero.

There is little speed penalty using 16-bit characters because Intel microprocessors handle 16-bit data nearly as fast as 8-bit codes. There is a size penalty--text files are twice as large, and programs are noticeably larger.
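
The size penalty is easy to see in Python. This sketch uses Python's "utf-16-le" codec as a stand-in for 16-bit wide characters; that choice is my assumption for the illustration, not a description of how any particular version of Windows stores text.

```python
text = "Binary 101"

narrow = text.encode("ascii")      # one byte per character
wide = text.encode("utf-16-le")    # two bytes per character, low byte first

print(len(narrow))   # 10
print(len(wide))     # 20
print(wide[:4])      # b'B\x00i\x00'
```

Notice the zero after each letter: for these low-order characters, the high-byte is zero.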

With 65536 unique characters possible, we may have finally reached a standard that will remain standard for a generation. Don't be surprised, though, if someone introduces 32-bit characters, which overcome the limitations of Unicode.


Memory Addresses

Members of Intel's 8086 family, from the humble 8088, to the latest super-pipelined, fire-breathing generations, are Segmented Little Endian microprocessors. This means the computer stores multiple-byte values with the least significant bytes (the little ends) lower in memory, and it calculates real-mode memory addresses with segments and offsets.

The Apple Macintosh family uses Motorola microprocessors and PowerPC processors, which are flat big endians. This may explain some of the communications problems.

Address calculations are often difficult to grasp, even for experienced programmers. The complexity of using 8086 segments is probably the reason most commercial applications are written in high-level languages like Basic and C++, instead of Assembly language. High-level languages automatically take care of segmentation hassles. When an otherwise civilized bit of code suddenly goes bonkers, it's usually the result of incorrect memory pointers. Let's have a go at sorting out pointers, anyway.

We express addresses in hexadecimal as segment:offset, in the format 0000:0000. This address represents segment zero, at the first address (offset zero). Looking at these values in memory, the low byte of the offset will physically come first, followed by the high byte of the offset. Next come the low then high bytes of the segment.
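
Here is a Python sketch of that byte order, using the standard struct module to lay out a segment:offset pair the way it would sit in memory. The address 0040:0017 is borrowed from the keyboard example later in this section; the variable names are mine.

```python
import struct

segment, offset = 0x0040, 0x0017

# "<" selects little-endian order, "H" is an unsigned 16-bit word.
# The offset is stored first, then the segment, each with its low byte first.
stored = struct.pack("<HH", offset, segment)

print(" ".join(format(byte, "02X") for byte in stored))  # 17 00 40 00
```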

Address offsets increase first, so 0000:0001 immediately follows 0000:0000. Note that adding and subtracting integers works the same whether values are signed or unsigned--as long as both integers are treated the same way. Addresses are always unsigned.

Segments and offsets are unsigned words that can have values of 0000 through FFFF, so each segment can be up to 64K bytes (FFFF in hex equals 65535 decimal). The computer calculates addresses as twenty-bit (five hex digits) values, so the maximum address space is 16*64K, or 1024K (that's 1,048,576 bytes).

Shift the 16-bit segment address left by four bits (one hex digit); this is the same as multiplying by 16 decimal (10 hex), since every left-shift multiplies by two and each right-shift divides by two. The lower four bits of the twenty bit result are zeroes.

 1234   16-bit segment

Shift the segment register left 4-bits (multiply by 16):

12340   20-bit segment

Add the 16-bit offset:

12340   20-bit segment
 0567   Add this 16-bit offset
128A7   20-bit physical address
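
The same shift-and-add can be sketched in Python. The function name physical_address is made up; the numbers are the ones from the worked example above.

```python
def physical_address(segment, offset):
    """Combine a 16-bit segment and 16-bit offset into a physical address."""
    return (segment << 4) + offset   # shifting left 4 bits multiplies by 16

print(format(0x1234 << 4, "05X"))                       # 12340
print(format(physical_address(0x1234, 0x0567), "05X"))  # 128A7
```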

Any combination of segment and offset can be combined, though since segments are multiplied by 16, they always begin on 16-byte boundaries (the lower nibble is always zero), called paragraphs. Memory is often rounded down or up to paragraph boundaries by BIOS and operating system functions.

Let's look at a real memory location. The current keyboard shift state is stored at address 0040:0017; here's how to calculate the 20-bit address:

 0040   Start with a 16-bit segment
00400   Shift into 20-bit segment (multiply by 16)
 0017   Add the 16-bit offset
00417   The result is a 20-bit physical address

The calculated 20-bit physical address of 00417 will always point to the same location. We can represent this address in any way that results in 00417; for instance 0000:0417 or even 0041:0007.
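
A quick Python check shows that these different segment:offset pairs really do land on the same byte. As before, physical_address is just a name chosen for the sketch.

```python
def physical_address(segment, offset):
    """Shift the segment left four bits, then add the offset."""
    return (segment << 4) + offset

aliases = [(0x0040, 0x0017), (0x0000, 0x0417), (0x0041, 0x0007)]
for segment, offset in aliases:
    address = physical_address(segment, offset)
    print("{:04X}:{:04X} -> {:05X}".format(segment, offset, address))
```

All three lines end in 00417.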


Segment Wrap

Segment registers can hold values up to FFFF. Matched with offsets just as large, we exceed the one megabyte address space, so the 8086 has to be able to deal with these peculiar numbers. It does this by a method known as segment wrap. When the calculated address grows past FFFFF, it wraps around to zero again, so FFFF:0010 lands on the same byte as 0000:0000. This is an interesting feature, but don't depend on it.

Intel processors that address more than one megabyte (like the Intel 80286) can disable segment wrap to address the next 64K beyond the one megabyte limit--using 21-bit calculations and 21 address lines! Starting with MS-DOS version 5.0, much of the DOS kernel is placed in the first 64K above the one megabyte block, freeing additional base-memory for applications and data.
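
Here is a rough Python model of the difference. It is a simulation of the arithmetic only, not a way to poke at real hardware; the a20_enabled parameter is a hypothetical switch I added to stand in for "segment wrap disabled."

```python
def physical_address(segment, offset, a20_enabled=False):
    """Real-mode address; wraps at one megabyte unless the 21st bit is enabled."""
    address = (segment << 4) + offset
    mask = 0x1FFFFF if a20_enabled else 0xFFFFF   # 21 or 20 address bits
    return address & mask

print(format(physical_address(0xFFFF, 0x0010), "05X"))        # 00000 (wrapped)
print(format(physical_address(0xFFFF, 0x0010, True), "06X"))  # 100000
print(format(physical_address(0xFFFF, 0xFFFF, True), "06X"))  # 10FFEF
```

With the wrap in effect, FFFF:0010 falls back to the very bottom of memory; without it, the same pointer reaches the first byte above one megabyte.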

Note: One method to tell if segment wrap is disabled is to compare low memory with memory at segment FFFF. If both are the same, the segment is wrapping; if they are different, the processor is using 21-bit addresses (known as "enabling the A20 line," which works in real mode on 80286 or later processors). Better yet, 80386 and newer processors can use 32-bit addresses as one huge linear segment, which you address with a simple 32-bit offset. Ah, life in flatland, no more segments to worry about.

A Bit More

Whether stored on a disk, in memory, or in processor registers, bytes are the same. We use bytes to represent binary, hexadecimal, decimal or ASCII codes--or groups of bytes strung-together to form paragraphs of text and long integers--the values are the same to the computer.

This information is borrowed from my Assembly language book. An early version of this book is currently being sold by a software company as a programming library owner's manual.

Note: Teachers find this document on the Internet as often as students do.

© Copyright 2007 R. E. Harvey, All rights reserved.