Module 0267: Base-10 Scientific Notation to Double Precision Floating Number

Tak Auyeung, Ph.D.

March 1, 2017

1 About this module

2 Parsing

In this syntax description, the square brackets denote optional components, and a vertical bar separates alternatives. <d> is a single base-10 digit.
Where + stands for itself, it means an actual plus symbol; a + that follows an item means at least one of the item right before it. An asterisk * means any number (zero or more) of the item right before it.

For convenience, it is best to parse a base-10 scientific notation by its components. Tracking the “cursor” and results can be tedious. A structure like the following can be used so that only a single pointer to such a structure needs to be passed to all the subroutines.

#include <stdint.h>

struct Base10Parse
{
  const char *ptr; // tracks the next char to parse
  int s;           // remembers the sign of the number, 1 or -1
  uint64_t m;      // the mantissa but without the decimal point
  int32_t e;       // the exponent in base 10
};

Using a structure like this, the parser can be broken up into smaller components, each implemented as a function. The value of the base-10 scientific notation is as follows:

$$s \cdot m \cdot 10^{e}$$

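The parser must note where the decimal point occurs while parsing the mantissa so that $e$ is adjusted accordingly: every digit consumed after the decimal point decreases $e$ by 1. Let us consider some examples. For instance, 12.5 parses to $m = 125$ and $e = -1$, and 0.001 parses to $m = 1$ and $e = -3$.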
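The following is a minimal sketch of how the parser might be split into functions around this structure. The grammar itself is not reproduced in this text, so the accepted syntax here (an optional sign, digits with an optional decimal point, and an optional e or E followed by a signed integer exponent) is an assumption, as are the function names `parse_sign`, `parse_digits` and `parse_base10`. Overflow of `m` and `e` is deliberately not checked.

```c
#include <assert.h>
#include <stdint.h>

struct Base10Parse
{
  const char *ptr; // tracks the next char to parse
  int s;           // sign of the number, 1 or -1
  uint64_t m;      // the mantissa but without the decimal point
  int32_t e;       // the exponent in base 10
};

// Hypothetical helper: consume an optional '+' or '-' and return 1 or -1.
static int parse_sign(struct Base10Parse *p)
{
  int s = 1;
  if (*p->ptr == '+' || *p->ptr == '-')
  {
    if (*p->ptr == '-') s = -1;
    ++p->ptr;
  }
  return s;
}

// Hypothetical helper: accumulate digits into p->m. When after_point is
// nonzero, each digit also decrements p->e. Returns the number of digits.
static int parse_digits(struct Base10Parse *p, int after_point)
{
  int count = 0;
  while (*p->ptr >= '0' && *p->ptr <= '9')
  {
    p->m = p->m * 10 + (uint64_t)(*p->ptr - '0'); // overflow not checked
    if (after_point) --p->e;
    ++p->ptr;
    ++count;
  }
  return count;
}

// Returns 1 on success, 0 if the mantissa or the exponent has no digits.
int parse_base10(struct Base10Parse *p)
{
  assert(p && p->ptr);
  p->m = 0;
  p->e = 0;
  p->s = parse_sign(p);
  int digits = parse_digits(p, 0);
  if (*p->ptr == '.')
  {
    ++p->ptr;
    digits += parse_digits(p, 1);
  }
  if (digits == 0) return 0;
  if (*p->ptr == 'e' || *p->ptr == 'E')
  {
    ++p->ptr;
    int es = parse_sign(p);
    int32_t ev = 0;
    int ed = 0;
    while (*p->ptr >= '0' && *p->ptr <= '9')
    {
      ev = ev * 10 + (*p->ptr - '0'); // overflow not checked
      ++p->ptr;
      ++ed;
    }
    if (ed == 0) return 0;
    p->e += es * ev; // combine with the decimal-point adjustment
  }
  return 1;
}
```

With `ptr` pointing at "-12.5e3", this sketch leaves `s = -1`, `m = 125` and `e = 2`, so the value is $-1 \cdot 125 \cdot 10^{2} = -12500$.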

3 Conversion

Assuming a scientific notation is parsed using the method outlined in the previous section, we can convert the base-10 oriented representation to a base-2 oriented representation in steps.

The upper-case letter $E$ represents the base-2 exponent, and $M$ represents the base-2 mantissa. The sign is the same regardless of the base.

The idea is to get the mantissa right, so that $1 \leq M < 2$. We can adjust $E$ to make this happen.

3.1 When $e < 0$

Integer division by 10 is not a solution because it loses precision in the process unless the remainder is 0 in the division.

If the division results in a non-zero remainder, we can multiply $m$ by 2 (and compensate by adjusting $E$) and try again. Ultimately, we are looking for a value $p$ so that

$$(m \cdot 2^{p}) \bmod 10 = 0$$

However, there are many natural number values of $m$ for which such a value of $p$ does not exist. For example, if $m = 3$, then no natural number $p$ satisfies this equality.
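To see why $m = 3$ never works, it may help to list the remainders for successive values of $p$. After the first term they cycle through 6, 2, 4, 8 and never reach 0:

```latex
3 \cdot 2^{p} \bmod 10 \;=\; 3,\ 6,\ 2,\ 4,\ 8,\ 6,\ 2,\ 4,\ 8,\ \ldots \qquad (p = 0, 1, 2, \ldots)
```

A multiple of 10 must contain the prime factor 5, and neither 3 nor any power of 2 supplies it. By contrast, $m = 5$ works with $p = 1$, since $5 \cdot 2^{1} = 10$.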

The error of the conversion is determined by the remainder

$$\delta = (m \cdot 2^{p}) \bmod 10$$

Because $\delta$ is always less than 10, $p$ can be chosen as a large value so that this error is small relative to $m \cdot 2^{p}$. To adjust for bias error, if $\delta \geq 5$, increment the quotient. This ensures that the conversion does not accumulate errors that keep making the result smaller than it should be.

In practice, $p$ cannot exceed a certain number because the product $m \cdot 2^{p}$ must be represented in an integer with a fixed width.

To summarize, to handle the case when $e < 0$, division by 10 is unavoidable. However, for the $i$th division by 10, we can find the maximum $p_{i} \geq 0$ such that

$$d_{i} = m_{i - 1} \cdot 2^{p_{i}} < 2^{w}$$

where $w$ is the width of the unsigned integer type (for example, 64). $p_{i}$ is zero only when $m_{i - 1}$ already occupies all $w$ bits.

In this notation, $m_{0} = m$ is the base-10 mantissa that we start off with.

Then, if $d_{i} \bmod 10 \geq 5$, $c_{i} = 1$; otherwise $c_{i} = 0$. The mantissa being converted to binary is then

$$m_{i} = \lfloor d_{i} / 10 \rfloor + c_{i}$$

These steps repeat $n = -e$ times. $m_{n}$ is almost $M$. Let us denote $P = \sum_{i = 1}^{n} p_{i}$; this is the total of the exponents of 2 introduced to handle all the divisions by 10.
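The steps summarized above can be sketched in C as follows, assuming $w = 64$ and a non-zero mantissa. The function name `scale_down` and its parameters are hypothetical and not part of the original module.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper: performs n = -e rounds of "shift left as far as
// possible, then divide by 10 with rounding". Returns m_n and stores the
// sum of all p_i in *P.
uint64_t scale_down(uint64_t m, int32_t n, int32_t *P)
{
  assert(m != 0 && n >= 0 && P);
  *P = 0;
  for (int32_t i = 0; i < n; ++i)
  {
    // Largest p_i such that d_i = m * 2^p_i still fits in 64 bits.
    while ((m >> 63) == 0)
    {
      m <<= 1;
      ++*P;
    }
    uint64_t c = (m % 10 >= 5) ? 1 : 0; // c_i: round up when remainder >= 5
    m = m / 10 + c;                     // m_i = floor(d_i / 10) + c_i
  }
  return m;
}
```

For example, 0.5 parses to $m = 5$, $e = -1$. Calling `scale_down(5, 1, &P)` returns $2^{60}$ with `P` set to 61, and $2^{60} \cdot 2^{-61} = 0.5$.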

3.2 When $e \geq 0$

This may seem like an easy case, but in general, it is not! This is because $e$ can be so large that $m \cdot 10^{e}$ cannot be represented in a fixed-width integer.

In the event that $m \cdot 10^{e}$ is too large, we need to adjust the mantissa by tracking a compensating exponent of 2.

The algorithm is as follows. Denote $m_{0} = m$. For iteration $i$, find the largest $p_{i} \leq 0$ such that

$$x = \lfloor m_{i - 1} \cdot 2^{p_{i}} \cdot 10 \rfloor < 2^{w}$$

and set $m_{i} = x + c$. Note that $p_{i}$ can be zero. $c$ is a compensation term that is 0 or 1 depending on whether

$$(m_{i - 1} \cdot 10) \bmod 2^{-p_{i}} < 2^{-p_{i} - 1}$$

If so, $c = 0$; otherwise, $c = 1$. In practice, no division is really needed because division by a power of 2 can be done by bit shifting. $c$ becomes the value of the last bit shifted out from the right-hand side.

There are $n = e$ iterations. At the end of all iterations, compute $P = \sum_{i = 1}^{n} p_{i}$ as the total power of 2 introduced to handle all the multiplications by 10.
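The iteration can be sketched in C as follows, assuming $w = 64$. Standard C has no 128-bit integer type, so this sketch holds the product $m_{i-1} \cdot 10$ in two 64-bit halves; that is a choice made here, not something the module prescribes. The function name `scale_up` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper: performs n = e rounds of "multiply by 10, then shift
// right just enough to fit in 64 bits, rounding with the last bit shifted
// out". Returns m_n and stores the sum of all p_i (zero or negative) in *P.
uint64_t scale_up(uint64_t m, int32_t n, int32_t *P)
{
  assert(n >= 0 && P);
  *P = 0;
  for (int32_t i = 0; i < n; ++i)
  {
    // 128-bit product m * 10 = (m << 3) + (m << 1), kept as hi:lo.
    uint64_t lo8 = m << 3;
    uint64_t lo2 = m << 1;
    uint64_t lo = lo8 + lo2;
    uint64_t hi = (m >> 61) + (m >> 63) + (lo < lo8 ? 1 : 0);
    // k = -p_i: how many bits the product needs beyond 64 (at most 4).
    int k = 0;
    while ((hi >> k) != 0) ++k;
    if (k == 0)
    {
      m = lo; // p_i = 0: nothing is shifted out, so c = 0
      continue;
    }
    uint64_t c = (lo >> (k - 1)) & 1;          // last bit shifted out
    uint64_t x = (hi << (64 - k)) | (lo >> k); // floor(m * 10 / 2^k)
    m = x + c;
    if (m == 0) // defensive: rounding carried out of 64 bits
    {
      m = UINT64_C(1) << 63;
      ++k;
    }
    *P -= k;
  }
  return m;
}
```

For example, `scale_up(1, 3, &P)` returns 1000 with `P` equal to 0, because the product never outgrows 64 bits. Starting from $2^{63}$, a single iteration returns $5 \cdot 2^{61}$ with `P` equal to -3, and $5 \cdot 2^{61} \cdot 2^{3} = 10 \cdot 2^{63}$.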

3.3 With $m_{n}$ and $P$ computed

At this point, we have the mantissa converted to binary, but it can be a large number and not between 1 and 2. $P$ is the exponent of 2 introduced to balance the division or multiplication by 10. Because these apply to the mantissa, the corresponding correction of the exponent of 2 is

$$e_{2} = -P$$

so that the value being represented is $m_{n} \cdot 2^{e_{2}}$.

Each bit shift is compensated by adjusting $e_{2}$. A right shift is the same as division by 2, so $e_{2}$ needs to be incremented. A left shift is the same as multiplication by 2, so $e_{2}$ needs to be decremented. The mantissa of a double precision floating point number has 52 bits to the right of the binary point. We need to shift $m_{n}$ so that its most significant 1 bit is at bit 52 (not 51!).
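The shifting described above can be sketched as follows. The function name `normalize53` is hypothetical, and this sketch simply drops any bits shifted out on the right, since the text does not describe rounding at this step.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper: shift m until its most significant 1 is at bit 52,
// adjusting *e2 so that m * 2^(*e2) keeps the same value (up to the bits
// dropped by right shifts).
uint64_t normalize53(uint64_t m, int32_t *e2)
{
  assert(m != 0 && e2);
  while (m >= (UINT64_C(1) << 53)) // too wide: right shift, increment e2
  {
    m >>= 1;
    ++*e2;
  }
  while (m < (UINT64_C(1) << 52)) // too narrow: left shift, decrement e2
  {
    m <<= 1;
    --*e2;
  }
  return m;
}
```

For example, with $m_{n} = 2^{60}$ and $e_{2} = -61$ (the value 0.5), the result is $2^{52}$ with $e_{2} = -53$.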

Once $M$ is a 53-bit unsigned integer, we are almost done. Bits 0 to 51 of the double representation are the least significant 52 bits of $M$ (bit 52 of $M$, the leading 1, is implied and not stored). The actual exponent of the double representation needs to express the value $e_{2} + 52$ because the implied binary point is 52 bits from the least significant bit of $M$.

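Note that the double representation stores a biased exponent: an exponent field holding $e$ represents the exponent $e - 1023$ (here $e$ denotes the stored exponent field, not the base-10 exponent of the earlier sections). This means $e_{2} + 52 = e - 1023$, or

$$e = 1023 + e_{2} + 52$$

$e$ is an 11-bit number taking up bit 52 to bit 62 of a double representation.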

The sign bit of a double is bit 63. If $s = 1$, then the sign bit is 0; if $s = -1$, then the sign bit is 1.
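Putting the three fields together might look like the sketch below. The function name `pack_double` is hypothetical. The sketch assumes that the platform's `double` is an IEEE 754 binary64 value with the same byte order as `uint64_t`, and that the result is a normal number, so zero, subnormals, infinities and NaN are not handled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Hypothetical helper: assemble a double from s (1 or -1), a 53-bit M whose
// leading 1 is at bit 52, and e2 such that the value is s * M * 2^e2.
double pack_double(int s, uint64_t M, int32_t e2)
{
  assert((M >> 52) == 1);                        // exactly 53 significant bits
  int32_t field = 1023 + e2 + 52;                // biased exponent
  assert(field >= 1 && field <= 2046);           // normal numbers only
  uint64_t bits = M & ((UINT64_C(1) << 52) - 1); // bits 0 to 51
  bits |= (uint64_t)field << 52;                 // bits 52 to 62
  if (s < 0) bits |= UINT64_C(1) << 63;          // bit 63 is the sign
  double d;
  memcpy(&d, &bits, sizeof d); // reinterpret the bit pattern as a double
  return d;
}
```

For example, `pack_double(1, UINT64_C(1) << 52, -53)` yields 0.5, matching $M = 2^{52}$ and $e_{2} = -53$.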