Where can I find the world's fastest atof implementation?

Question!

I'm looking for an extremely fast atof() implementation on IA32 optimized for US-en locale, ASCII, and non-scientific notation. The windows multithreaded CRT falls down miserably here as it checks for locale changes on every call to isdigit(). Our current best is derived from the best of perl + tcl's atof implementation, and outperforms msvcrt.dll's atof by an order of magnitude. I want to do better, but am out of ideas. The BCD related x86 instructions seemed promising, but I couldn't get it to outperform the perl/tcl C code. Can any SO'ers dig up a link to the best out there? Non x86 assembly based solutions are also welcome.

Clarifications based upon initial answers:

Inaccuracies of ~2 ulp are fine for this application.
The numbers to be converted will arrive in ascii messages over the network in small batches and our application needs to convert them in the lowest latency possible.



Answers

This implementation I just finished coding runs twice as fast as the built in 'atof' on my desktop. It converts 1024*1024*39 number inputs in 2 seconds, compared 4 seconds with my system's standard gnu 'atof'. (Including the setup time and getting memory and all that).

UPDATE: Sorry I have to revoke my twice as fast claim. It's faster if the thing you're converting is already in a string, but if you're passing it hard coded string literals, it's about the same as atof. However I'm going to leave it here, as possibly with some tweaking of the ragel file and state machine, you may be able to generate faster code for specific purposes.

https://github.com/matiu2/yajp

The interesting files for you are:

https://github.com/matiu2/yajp/blob/master/tests/test_number.cpp

https://github.com/matiu2/yajp/blob/master/number.hpp

Also you may be interested in the state machine that does the conversion:

Number parsing state machine diagram

By : matiu


It seems to me you want to build (by hand) what amounts to a state machine where each state handles the Nth input digit or exponent digits; this state machine would be shaped like a tree (no loops!). The goal is to do integer arithmetic whereever possible, and (obviously) to remember state variables ("leading minus", "decimal point at position 3") in the states implicitly, to avoid assignments, stores, and later fetch/tests of such values. Implement the statemachine with plain old "if" statements on the input characters only (so your tree gets to be a set of nested ifs). Inline accesses to buffer characters; you don't want a function call to "getchar" to slow you down.

Leading zeros can simply be surpressed; you might need a loop here to handle rediculuously long leading zero sequences. The first nonzero digit can be collected without zeroing an accumulator or multiplying by ten. The first 4-9 nonzero digits (for 16 bit or 32 bits integers) can be collected with integer multiplies by constant value ten (turned by most compilers into a few shifts and adds). [Over the top: zero digits don't require any work until a nonzero digit is found and then a multiply 10^N for N sequential zeros is required; you can wire all this in into the state machine.] Digits following the first 4-9 may be collected using 32 or 64 bit multiplies depending on the word size of your machine. Since you don't care about accuracy, you can simply ignore digits after you've collected 32 or 64 bits worth; I'd guess that you can actually stop when you have some fixed number of nonozero digits based on what your application actually does with these numbers. A decimal point found in the digit string simply causes a branch in the state machine tree; that branch knows the implicit location of the point and therefore later how to scale by a power of ten appropriately. With effort, you may be able to combine some state machine subtrees if you don't like the size of this code.

[Over the top: keep the integer and fractional parts as separate (small) integers. This will require an additional floating point operation at the end to combine the integer and fraction parts; probably not worth it].

[Over the top: collect 2 characters for digit pairs into a 16 bit value; lookup the 16 bit value. This avoids a multiply in the registers in trade for a memory access; probably not a win on modern machines].

On encountering "E", collect the exponent as an integer as above; look up accurately-precomputed/scaled powers of ten up in a table of precomputed multiplier (reciprocals if "-" sign present in exponent) and multiply the collected mantissa. (don't ever do a float divide). Since each exponent collection routine is in a different branch (leaf) of the tree, it has to adjust for the apparant or actual location of the decimal point by offsetting the power of ten index.

[Over the top: you can avoid the cost of "ptr++" if you know the characters for the number are stored linearly in a buffer and do not cross the buffer boundary. In the kth state along a tree branch, you can access the the kth character as "*(start+k)". A good compiler can usually hide the "...+k" in an indexed offset in the addressing mode.]

Done right, this scheme does roughly one cheap multiply-add per nonzero digit, one cast-to-float of the mantissa, and one floating multiply to scale the result by exponent and location of decimal point.

I have not implemented the above. I have implemented versions of it with loops; they're pretty fast.



This video can help you solving your question :)
By: admin