http://www.scl.ameslab.gov/Projects/Rabbit/menu12.c
 
* You can see that "imull -1012(%ebp),%eax" uses 1 data memory reference, 3 cycles, 1 integer multiply operation, 2 micro-ops (load and integer multiply), and will stall the processor for 3 cycles if the result is needed in the next instruction.

* You can see that "fmull -1040(%ebp)" uses 1 data memory reference, 4 cycles, 1 floating-point multiply operation, 2 micro-ops (load and fp multiply), and will stall the processor for 2 cycles if the result is needed in the next instruction.
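Those stall numbers only bite when one instruction needs the previous result. A rough C illustration (not from the Rabbit source; the function names are made up): a chain of dependent multiplies pays the full latency every iteration, while independent multiplies can overlap in the pipeline:

    #include <stddef.h>

    /* Dependent chain: every multiply waits for the previous
     * product, so each iteration pays the full multiply latency
     * (the "stall" described above). */
    double product_chained(const double *a, size_t n) {
        double p = 1.0;
        for (size_t i = 0; i < n; i++)
            p *= a[i];              /* next step must wait for p */
        return p;
    }

    /* Two independent accumulators: the two multiplies per
     * iteration do not depend on each other, so the pipeline can
     * overlap them and hide most of the latency. */
    double product_unrolled(const double *a, size_t n) {
        double p0 = 1.0, p1 = 1.0;
        size_t i;
        for (i = 0; i + 1 < n; i += 2) {
            p0 *= a[i];
            p1 *= a[i + 1];
        }
        if (i < n)
            p0 *= a[i];
        return p0 * p1;
    }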
http://lua-users.org/wiki/FloatingPoint

Big CPUs

Regarding performance, most serious modern desktop CPUs can process double-precision floating point as fast as or faster than integers: e.g. the modern MIPS R5000, the modern PPC 700, and better. Common FPU operations (add, subtract, compare, multiply) have one-clock throughput. Better still, a multiply-add instruction may well achieve one-clock throughput in the FPU alone, making it faster than an integer multiply-add in the ALU; superscalar execution and SIMD can improve floating point even further, though that is probably not relevant to Lua. Floating-point multiply is often faster than integer multiply, because floating-point multiply is used more often and CPU designers therefore spend more effort optimising that path. Floating-point divide may well be slow (often more than 10 cycles), but then so is integer divide.
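As a concrete (and hedged) illustration of the multiply-add point: C99 exposes the fused operation as fma() in <math.h>, and on hardware with a fused multiply-add unit each call can compile to a single instruction. The dot() function below is an invented example, not anything from Lua:

    #include <math.h>   /* fma(), C99; link with -lm */

    /* Dot product phrased so each element costs one fused
     * multiply-add: acc = x[i]*y[i] + acc with a single rounding. */
    double dot(const double *x, const double *y, int n) {
        double acc = 0.0;
        for (int i = 0; i < n; i++)
            acc = fma(x[i], y[i], acc);
        return acc;
    }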

Admittedly, Intel's Pentium design comes in a poor third behind those (because it has too few floating-point registers), but Intel is catching up.

The only remaining performance concern is floating-point-to-integer conversion. Like it or not, memory load instructions operate on an address and a byte offset (i.e., two integer values), so any performance saving from using floating point instead of integers is for naught if the CPU's float-to-int conversion is slow. Some reports claim that float-to-int conversion can have a devastating effect on performance (http://mega-nerd.com/FPcast/).
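The usual C-level advice (and the point of the FPcast page above) is to avoid the plain truncating cast, which on x87 hardware forces a rounding-mode change on every conversion, and to use C99's lrint()/lrintf(), which round in the FPU's current mode. A minimal sketch; the function names are invented:

    #include <math.h>   /* lrint(), C99; link with -lm */

    /* Truncates toward zero, as C requires; on x87 this means a
     * slow change of the FPU rounding mode on every call. */
    long index_by_cast(double d)  { return (long)d; }

    /* Rounds in the current FPU mode and can compile to a single
     * conversion instruction. */
    long index_by_lrint(double d) { return lrint(d); }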

http://people.csail.mit.edu/devadas/6.004/Lectures/lect17/tsld019.htm

Latencies for Typical Modern Processor

[slide: latency figures for the Central Processing Unit (CPU) and the on-chip Floating Point Unit (FPU)]

Museum Waalsdorp: Computer history

Initially, there was no integer multiply instruction. Integer multiply was added to the instruction set pretty early in the game, though, when CDC engineers ...
museumwaalsdorp.nl/computer/en/6400hwac.html

[Comp-arch] Kids can do math better than x86 cpu's.

SPARC v9 doesn't have 64x64->128 integer multiply. I don't think PA-RISC 2.0 has 64x64->128 multiply (I'm not even sure they have integer multiply at all, ...
https://www.gelato.unsw.edu.au/archives/comp-arch/2007-March/007721.html
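For context on what a 64x64->128 multiply looks like from C: GCC and Clang provide the nonstandard __int128 type on 64-bit targets, and the compiler either uses the ISA's widening multiply or expands it into several narrower ones. A hedged sketch, names invented:

    #include <stdint.h>

    /* Full 64x64->128 product. On x86-64 this compiles to a single
     * widening MUL; on ISAs without one, to a short multiply/add
     * sequence. */
    void mul_64x64_128(uint64_t a, uint64_t b,
                       uint64_t *hi, uint64_t *lo) {
        unsigned __int128 p = (unsigned __int128)a * b;
        *lo = (uint64_t)p;
        *hi = (uint64_t)(p >> 64);
    }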

SPARC Options - Using the GNU Compiler Collection (GCC)

The only difference from v7 code is that the compiler emits the integer multiply and integer divide instructions which exist in SPARC v8 but not in SPARC v7 ...
www-h.eng.cam.ac.uk/help/tpl/languages/C++/gcc/SPARC-Options.html

http://developer.amd.com/articles.jsp?id=73&num=2

64-bit Integer Math
The AMD64 General Purpose Registers (GPRs) are 64 bits wide, as shown above in Figure 2. These registers support 64-bit integer math operations such as addition and multiplication. The native width of a pointer in AMD64 is 64 bits, which requires 64-bit arithmetic for pointer calculations. Many applications will use 32-bit integers as data, both for ease of migration to 64 bits and to minimize the code size of integer literals. Two applications that can take advantage of native 64-bit arithmetic are:

  1. Encryption algorithms such as SSL require 64-bit multiplies. Performing a single 64-bit integer multiply using 32-bit arithmetic requires several multiplies and additions to compute the final result; doing the same 64-bit integer multiply on AMD64 takes a single instruction (see the first sketch after this list).
  2. Algorithms that perform integer bit packing, such as the Huffman coding used in video compression, can be implemented more efficiently with 64-bit registers: more data can be packed into a single register than when only 32-bit registers are available (see the second sketch after this list).
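To make item 1 concrete, here is a sketch (names invented) of the schoolbook decomposition a 32-bit machine must perform to get even the low 64 bits of a 64-bit product; on AMD64 the same result is a single multiply instruction:

    #include <stdint.h>

    /* Low 64 bits of a*b, computed with 32-bit halves: three
     * 32x32->64 multiplies plus shifts and adds. The a_hi*b_hi
     * term only affects bits 64 and up, so it is dropped here. */
    uint64_t mul64_via_32(uint64_t a, uint64_t b) {
        uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
        uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

        uint64_t lo    = (uint64_t)a_lo * b_lo;
        uint64_t cross = (uint64_t)a_lo * b_hi
                       + (uint64_t)a_hi * b_lo;
        return lo + (cross << 32);
    }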
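And for item 2, a sketch of the bit-packing pattern (the BitWriter type and names are invented, not AMD's code): with a 64-bit accumulator, variable-length codes are flushed to memory half as often as with a 32-bit one:

    #include <stdint.h>

    /* Minimal LSB-first bit writer. Codes (e.g. Huffman codes, at
     * most 32 bits each) accumulate in a 64-bit register and are
     * written out 32 bits at a time. */
    typedef struct {
        uint64_t acc;    /* pending bits                 */
        unsigned nbits;  /* number of valid bits in acc  */
        uint32_t *out;   /* output stream                */
    } BitWriter;

    void put_bits(BitWriter *w, uint32_t code, unsigned len) {
        w->acc |= (uint64_t)code << w->nbits;  /* nbits < 32 here */
        w->nbits += len;
        while (w->nbits >= 32) {               /* flush full words */
            *w->out++ = (uint32_t)w->acc;
            w->acc >>= 32;
            w->nbits -= 32;
        }
    }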