Module 0267: Base-10 Scientific Notation to Double Precision Floating Number

Tak Auyeung, Ph.D.

March 1, 2017

1 About this module

2 Parsing

The syntax of a base-10 scientific notation is as follows:

[\+|-]<d>[.[<d>*]][e[\+|-]<d>+]

In this syntax description, the square brackets denote optional components, and a vertical bar separates alternatives. <d> is a single base-10 digit.
+ means an actual plus symbol, as opposed to a single + that means at least one of the item right before it. A single asterisk * means any number of the item right before it.

For convenience, it is best to parse a base-10 scientific notation by its components. Tracking the “cursor” and results can be tedious. A structure like the following can be used so that a single pointer to such a structure is needed for all the subroutines.

struct Base10Parse 
{ 
  const char ptr; // tracks the next char to parse 
  int s; // remembers the sign of the number, 1 or 1 
  uint64_t m; // the mantissa but without the decimal point 
  int32_t e; // the exponent in base 10 
};

Using a structure like this, the parser can be broken up into smaller components, each one using a function. The value of the base-10 scientific notation is as follows:

v = sm 10e

The parser must pay attention to when the decimal point is encountered when parsing the mantissa so that e is adjusted accordingly. Let us consider some examples.

The number 1582 is parsed so that s=1, m=1582 and e=0.

The number -4.78e3 is parsed so that s=-1, m=478 and e=1.

The number 2.5e-1 is parsed so that s=1, m=25 and e=-2.

3 Conversion

Assuming a scientific notation is parsed using the method outlined in the previous section, we can convert the base-10 oriented representation to a base-2 oriented representation in steps.

Upper case letters E is used to represent the base-2 exponent, M represents the base-2 exponent. The sign is the same regardless of the base.

The idea is to get the mantissa right, so that 1 M < 2. We can adjust E to make this happen.

Also, the value being represented cannot change, meaning that

v = sm 10e = sM 2E

It is given that m is an unsigned integer.

3.1 When e < 0

The problem is when e < 0.

Integer division by 10 is not a solution because it loses precision in the process unless the remainder is 0 in the division.

If the division results in a non-zero remainder, we can multiply m by 2 (and adjust that with E) and try again. Ultimately, we are looking for a value p so that

(m 2p)mod10 = 0

For example, if m = 25, then p 1 will do the trick.

However, there are many natural number values for which such a value of p does not exist. For example, if m = 3, then there is no natural number p that can satisfy the equality (m 2p)mod10 = 0.

The error of the conversion is the exact amount of δ = (m 2p)mod10. p can be chosen as a large value so that this error is smaller. To adjust for bias error, if δ 5, increment the quotient. This makes sure that the conversion does not accumulate errors that keep making the result smaller than it should be.

In practice, p cannot exceed a certain number because the product m 2p must be represented in an integer with a fixed width.

To summarize, to handle the case when e < 0, division by 10 is not avoidable. However, for the ith divsion by 10, we can find a maximum pi > 0 such that di = mi1 2pi < 2w where w is the width of unsigned integer (for example, 64).

In this notation m0 = m, this is the base-10 mantissa that we start off with.

Then, if di mod10 5, ci = 1, otherwise ci = 0. The mantissa being converted to binary is then mi = di10 + ci.

These steps repeat n = e times. mn is almost M. Let us denote P = i=1npi, this is the total of exponents of 2 introduced to handle all the divisions by 10.

3.2 When e 0

This may seem like an easy case, but in general, it is not! This is because e can be a large value so that m 10e cannot be represented in a fixed width integer.

In the event that m 10e is too large, then we need to adjust the mantissa by tracking a compensating exponent of 2.

The algorithm is as follows. Denote m0 = m. For iteration i, find the largest pi 0 such that x = mi1 2pi 10 < 2w, mi = x + c. Note that pi can be zero. c is a compensation term that is 0 or 1 depending on whether (mi1 10)mod2pi < 2pi1. If so, c = 0, otherwise, c = 1. In practice, no division is really needed because division by 2 can be done by bit shifting. c becomes the value of the last bit shifted out from the right hand side.

There are n = e iterations. At the end of all iterations, compute P = i=1npi as the total power of 2 introduced to handle all the multiplications by 10.

3.3 With mn and P computed

At this point, we have the mantissa converted to binary, but it can be large number and not between 1 and 2.

P is the exponent of 2 introduced to balance the division or multiplication by 10. Because these apply to the mantissa, the corresponding correction of the exponent of 2 is e2 = P.

Each bit shift is compensated by adjusting e2. A right shift is the same as division by 2, so e2 needs to be incremented. A left shift is the same as multiplication by 2, so e2 needs to be decremented. The mantissa of a double precision floating point number has 52 bits to the right of the binary point. We need to shift mn so that it becomes a binary number so that the most significant bit of 1 is at bit 52 (not 51!).

Once M is a 53-bit unsigned integer, we are almost done. Bit 0 to 51 of the double representation is the last significant 52 bits of M. The actual exponent of the double representation needs to express the value e2 + 52 because the implied binary point is 52 bits from the least significant bit of M.

Note that the exponent of a double representation is e 1023. This means e2 + 52 = e 1023, or e = 1023 + e2 + 52. e is a 11-bit number taking up bit 52 to bit 62 of a double representation.

The sign bit of a double is bit 64. If s = 1, then the sign bit has a 0, if s = 1, then the sign bit has a 1.