The syntax of a base-10 scientific notation is as follows:
In this syntax description, the square brackets denote optional components, and a vertical bar separates alternatives. <d> is a
single base-10 digit.
+ means an actual plus symbol, as opposed to a single + that means at least one of the item right before it. A single asterisk *
means any number of the item right before it.
For convenience, it is best to parse a base-10 scientific notation by its components. Tracking the “cursor” and results can be tedious. A structure like the following can be used so that a single pointer to such a structure is needed for all the subroutines.
Using a structure like this, the parser can be broken up into smaller components, each one using a function. The value of the base-10 scientific notation is as follows:
The parser must pay attention to when the decimal point is encountered when parsing the mantissa so that e is adjusted accordingly. Let us consider some examples.
The number 1582 is parsed so that s=1, m=1582 and e=0.
The number -4.78e3 is parsed so that s=-1, m=478 and e=1.
The number 2.5e-1 is parsed so that s=1, m=25 and e=-2.
Assuming a scientific notation is parsed using the method outlined in the previous section, we can convert the base-10 oriented representation to a base-2 oriented representation in steps.
Upper case letters E is used to represent the base-2 exponent, M represents the base-2 exponent. The sign is the same regardless of the base.
The idea is to get the mantissa right, so that . We can adjust E to make this happen.
Also, the value being represented cannot change, meaning that
It is given that m is an unsigned integer.
The problem is when .
Integer division by 10 is not a solution because it loses precision in the process unless the remainder is 0 in the division.
If the division results in a non-zero remainder, we can multiply m by 2 (and adjust that with E) and try again. Ultimately, we are looking for a value so that
For example, if , then will do the trick.
However, there are many natural number values for which such a value of does not exist. For example, if , then there is no natural number that can satisfy the equality .
The error of the conversion is the exact amount of . can be chosen as a large value so that this error is smaller. To adjust for bias error, if , increment the quotient. This makes sure that the conversion does not accumulate errors that keep making the result smaller than it should be.
In practice, cannot exceed a certain number because the product must be represented in an integer with a fixed width.
To summarize, to handle the case when , division by 10 is not avoidable. However, for the th divsion by 10, we can find a maximum such that where is the width of unsigned integer (for example, 64).
In this notation , this is the base-10 mantissa that we start off with.
Then, if , , otherwise . The mantissa being converted to binary is then .
These steps repeat times. is almost . Let us denote , this is the total of exponents of 2 introduced to handle all the divisions by 10.
This may seem like an easy case, but in general, it is not! This is because can be a large value so that cannot be represented in a fixed width integer.
In the event that is too large, then we need to adjust the mantissa by tracking a compensating exponent of 2.
The algorithm is as follows. Denote . For iteration , find the largest such that , . Note that can be zero. is a compensation term that is 0 or 1 depending on whether . If so, , otherwise, . In practice, no division is really needed because division by 2 can be done by bit shifting. becomes the value of the last bit shifted out from the right hand side.
There are iterations. At the end of all iterations, compute as the total power of 2 introduced to handle all the multiplications by 10.
At this point, we have the mantissa converted to binary, but it can be large number and not between 1 and 2.
is the exponent of 2 introduced to balance the division or multiplication by 10. Because these apply to the mantissa, the corresponding correction of the exponent of 2 is .
Each bit shift is compensated by adjusting . A right shift is the same as division by 2, so needs to be incremented. A left shift is the same as multiplication by 2, so needs to be decremented. The mantissa of a double precision floating point number has 52 bits to the right of the binary point. We need to shift so that it becomes a binary number so that the most significant bit of 1 is at bit 52 (not 51!).
Once is a 53-bit unsigned integer, we are almost done. Bit 0 to 51 of the double representation is the last significant 52 bits of . The actual exponent of the double representation needs to express the value because the implied binary point is 52 bits from the least significant bit of .
Note that the exponent of a double representation is . This means , or . is a 11-bit number taking up bit 52 to bit 62 of a double representation.
The sign bit of a double is bit 64. If , then the sign bit has a 0, if , then the sign bit has a 1.