Simplify. If you have a mission-critical application that takes hours to run, you may be looking at it as a high-level project -- too high. Here are some suggestions.
For best floating point performance, get the fastest processor and the fastest memory you can afford.
Clock-for-clock, the Pentium-III or Pentium-M was faster than 2005-vintage Pentium-4 processors, especially for floating point calculations. However, the P-III tops out at 1.4GHz, while the P-4 offers faster clock and memory speeds. You can argue that a 1.4GHz P-III/M is faster than a 1.7GHz P-4; however, the P-4 is available at speeds above 3GHz.
The AMD Athlon offers perhaps the best floating point performance, but at slower memory performance than Pentium-4 processors.
If it still takes too long, try multiple processors and an operating system that supports them. Or try dividing the tasks among multiple computers.
Install only as much memory as you need. Use memory instead of a hard drive if you have lots of data, but only if the data will be read more than once.
Run plain DOS (32-bit DOS, if possible) or another text-based operating system, not Windows. Floating point performance can be twice as high without extra drivers or operating system overhead. You can use 32-bit CPU instructions in a 16-bit operating system, but pointers must still be 16 bits.
Use a C or Fortran compiler, not a spreadsheet or any macro language.
If you must use Basic, use an old DOS-based Basic compiler, not Visual Basic or VBA.
Don't limit functionality and accuracy by using single-precision variables instead of double-precision. Calculations using either data type offer about the same performance. The only performance gain from single-precision is from smaller memory footprint and perhaps less memory I/O. The AMD Athlon uniquely offers more than twice the performance when performing single-precision calculations.
Use all data and code alignment switches available to your programming tools.
Unaligned data causes a terrible performance hit. Dynamic arrays are usually paragraph (16-byte) aligned, but scalar variables are often not. Group the declarations for floating point scalar variables together, then run the compiled program and time it. Now declare a short integer before the floating point variables and time it again. If performance improves, leave the integer there; if it goes down, remove the integer. Use the same method with local (stack) variables declared in subroutines.
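As a complement to the trial-and-error method above, a program can report whether a variable landed on an aligned address. This is a minimal sketch; the helper name is_aligned is hypothetical, and it assumes the compiler provides uintptr_t in stdint.h.

```c
#include <assert.h>
#include <stdint.h>

/* Return nonzero if p sits on an address that is a multiple of n bytes.
   is_aligned is a hypothetical helper, not part of any library. */
int is_aligned(const void *p, uintptr_t n)
{
    assert(n != 0);
    return ((uintptr_t)p % n) == 0;
}
```

Calling is_aligned(&x, sizeof(double)) for each floating point scalar x, before and after adding the short integer, shows whether the padding actually moved the variables onto 8-byte boundaries.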
Read and write data files in big chunks, not one array-element at a time.
Read and write arrays in ascending order. Today's processors are much faster than memory, and caching hides much of the difference by reading ahead; if, for instance, the program reads an 8-byte value, the processor will also fetch the next 8 bytes, taking the chance that they will be the next data the program needs. If the program instead requires the 8 bytes that precede the previous read, the cached data is discarded, and the processor has to fetch the new data (as well as at least 8 more bytes that will not be used). Try summing the values in an array in both directions, and you will likely see significant differences.
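The summing experiment can be sketched in C as two functions that differ only in direction. The names sum_ascending and sum_descending are hypothetical, and the size of any difference depends on the processor and its cache.

```c
#include <assert.h>
#include <stddef.h>

/* Sum n doubles, walking from the lowest address to the highest. */
double sum_ascending(const double *a, size_t n)
{
    double total = 0.0;
    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Same sum, walking from the highest address down to the lowest. */
double sum_descending(const double *a, size_t n)
{
    double total = 0.0;
    assert(a != NULL || n == 0);
    for (size_t i = n; i > 0; i--)
        total += a[i - 1];
    return total;
}
```

To compare the two, time each call with clock() from time.h on an array much larger than the processor cache, and repeat the measurement several times.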
Avoid multi-dimensional arrays. There is extra overhead created by the compiler to find its place in the array. Use multiple, single-subscript arrays.
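A minimal sketch of the single-subscript layout, assuming a two-column table that would otherwise be declared as double table[n][2]. The column names price and qty and the function name weighted_total are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Each column lives in its own single-subscript array, so every access
   is a plain a[i] with no row-times-width arithmetic. */
double weighted_total(const double *price, const double *qty, size_t n)
{
    double total = 0.0;
    assert((price != NULL && qty != NULL) || n == 0);
    for (size_t i = 0; i < n; i++)
        total += price[i] * qty[i];
    return total;
}
```

Whether this beats the two-dimensional form depends on the compiler, so compare the Assembly listings or time both versions.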
If you are using a 16-bit operating system, avoid huge arrays (larger than 64KB). Index fix-ups and segment overrides add a tremendous amount of overhead, even though the source code may look perfectly innocent.
Don't use Currency or Variant data types. In addition to being slow, they are not supported by newer programming languages (not even Microsoft Visual Basic).
Use a high-quality floating point library.
If you must maintain the data in a spreadsheet, write a custom DLL to perform the tough tasks.
Isolate critical formulas. Have your compiler produce an Assembly language listing. If the code looks sloppy, and it likely will, consider rewriting the formula in Assembly language (or farm it out to a third party).
Display progress indicators during program development. When you have the process well defined and you know how long it will take, get rid of extraneous I/O; just show "Calculating... will be done at about 14:30".
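One way to produce the "will be done at about" time is to extrapolate from the work finished so far. This is a sketch only: the function name estimate_finish is hypothetical, and it assumes time_t counts whole seconds, which is true on common systems but not guaranteed by the C standard.

```c
#include <assert.h>
#include <time.h>

/* Estimate the finish time from the progress so far.
   start: when the run began; now: the current time;
   done and total: units of work completed and units overall.
   Assumes time_t is an integer count of seconds. */
time_t estimate_finish(time_t start, time_t now, long done, long total)
{
    double elapsed, per_unit;
    assert(done > 0 && total >= done);
    elapsed = difftime(now, start);          /* seconds spent so far */
    per_unit = elapsed / (double)done;       /* seconds per unit of work */
    return now + (time_t)(per_unit * (double)(total - done));
}
```

Pass the result through localtime() and strftime() with the format "%H:%M" to print the estimate once, rather than updating a progress display inside the calculation loop.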
If you need to display graphical results, save the results in a file, then run Windows.
Note: Don't depend on floating point and integer benchmark tests. The integer tests are generally larger, and they exercise memory performance. Floating point tests are relatively small, so they usually fit within the processor cache. If, for instance, you compare a processor running at 400MHz on a 66MHz bus to the same microprocessor running 400MHz on a 100MHz bus, the integer performance will appear to improve with the faster bus, but the floating point test will be little different. For the best guess of overall performance, especially if you will be using large floating point arrays, look at integer performance as an indication of memory I/O speed.
I/O Bound
Many small programs read lots of data; they are as dependent on file reading and writing (I/O) speed as processor performance. I've rewritten several 16-bit DOS Basic utilities as 32-bit command line programs in C, and found significant performance gains, even though I did nothing special in the C programs, while their Basic predecessors went to extremes to improve performance.
One simple Basic program reads a file of any size, and writes an Assembly language source code file containing DB statements for every byte. The Basic program was terribly slow, so I replaced the file I/O functions with a custom Assembly language library; performance improved about 270%.
A new test, running on a fast hard drive and a slow CPU, converts a 5.5MB DLL in 37 seconds using the optimized 16-bit Basic program, while the simple 32-bit C version completes the task in under four seconds. A brute-force method (using a look-up table instead of division) saved more than 300 million CPU clock cycles of processing (0.6 seconds on a 500MHz CPU!), but it increased the output file from 19.5MB to about 27MB (because the look-up table had fixed-length fields). Run time increased from 3.7 seconds to over five seconds, because of the time required to write the extra 7.5MB--and of course the next program that uses this data will also take longer to read a larger file. My final solution was multiple, variable-length look-up tables instead of division, which further improved performance by 20 percent without increasing the file size.
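The look-up idea can be sketched as follows. This is not the original program: the table layout, the names dec_text, dec_len, init_dec_table and put_byte, and the choice of decimal output are all assumptions made for illustration. Each byte value maps to pre-formatted text plus its length, so the conversion loop does no division and emits no padding.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

static char   dec_text[256][4];  /* "0" through "255", plus terminator */
static size_t dec_len[256];      /* length of each entry, 1 to 3 */

/* Build the table once; the only division happens here, inside sprintf. */
void init_dec_table(void)
{
    for (int v = 0; v < 256; v++)
        dec_len[v] = (size_t)sprintf(dec_text[v], "%d", v);
}

/* Copy the decimal text for one byte to dst (no terminator is written).
   Returns the number of characters copied. */
size_t put_byte(char *dst, unsigned char value)
{
    assert(dst != NULL);
    memcpy(dst, dec_text[value], dec_len[value]);
    return dec_len[value];
}
```

Storing the length beside each entry is what keeps the output variable-length; a table of fixed three-character fields would be simpler but would pad the output, which is the file growth described above.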
Another important optimization is buffering, if the input and output files are on the same hard drive. Reading 64KB chunks is significantly faster than reading 8KB, while reading more than 64KB may not improve performance. The goal is to minimize disk head movement (seeking), which occurs if you constantly read a small amount of data from one file, and write processed data to another. An optimized programming library and operating system will cache some reads and writes, but a large amount of data will overwhelm the buffers if we do not minimize hard disk seeking.
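A minimal sketch of chunked reading and writing with the standard C library, using the 64KB figure from the paragraph above. The function name copy_in_chunks and the constant CHUNK_SIZE are hypothetical, and a real program would process each chunk between the read and the write.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK_SIZE (64u * 1024u)  /* 64KB per read and per write */

/* Copy everything from in to out in CHUNK_SIZE pieces.
   Returns the number of bytes copied, or -1 on failure. */
long copy_in_chunks(FILE *in, FILE *out)
{
    unsigned char *buf;
    size_t got;
    long total = 0;

    assert(in != NULL && out != NULL);
    buf = malloc(CHUNK_SIZE);
    if (buf == NULL)
        return -1;

    while ((got = fread(buf, 1, CHUNK_SIZE, in)) > 0) {
        if (fwrite(buf, 1, got, out) != got) {
            free(buf);
            return -1;
        }
        total += (long)got;
    }
    free(buf);
    return ferror(in) ? -1 : total;
}
```

Open both files in binary mode ("rb" and "wb") before calling it. Changing CHUNK_SIZE to 8KB or 256KB and timing the run is a quick way to check the 64KB observation on a particular drive.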
This table shows averaged results for a 22MB source file. Output files are the same length, avoiding extra spaces, since we have discovered that extra spaces cost time.
Time (seconds)   Program
424              Unoptimized 16-bit DOS Basic
157              Optimized 16-bit DOS Basic with Assembly
18               Unoptimized 32-bit C Console application
15               Optimized 32-bit C Console application with tables instead of division
~13.5            Optimized 32-bit C Console with optimized Assembly
Most of the performance improvement is from switching to 32-bit file I/O, not improved code. If your program reads a massive amount of data with simple calculations, consider a 32-bit Console program.