If I have a variable x
bound to a machine precision real in Mathematica, I can use
y = FromDigits[RealDigits[x]]
then y is bound to a completely equivalent rational number which has infinite precision.
What are the basic principles behind FromDigits
and RealDigits
? How could I implement a similar conversion using common machine operations (fixed-size integers and floating point operations) to coerce a double in a rational, or at least obtain the long integer denominator and numerator of a double?
Asked By : LCFactorization
Answered By : Wandering Logic
On most current machines a double-precision real number will be represented in IEEE 754's binary64
format.
63 62 x x x x x x 52 51 x x x x x x x x x x x x x x x x x x x x x 0 | s | exp | mantissa |
s
=0 implies positive, s
=1 implies negative. exp
is an 11 bit biased signed integer in the range [-1023, 1024] (take the unsigned number from the bits and subtract 1023). mantissa
stores the bottom 52 bits of the fraction.
Numbers with exponents in the range [-1022, 1023] are normalized (bit 52 of the mantissa is implicitly "1".) The number represented by a particular bit pattern is $-1^s (1 + (\mathrm{mantissa} / 2^{52})) 2^\mathrm{exp}$.
Numbers with exp=1024 are special: if the mantissa is 0 then they represent + or - Infinity (from overflow or divide by 0, for example). If the mantissa is non-zero then the represent NaN (for example sqrt(-1)).
Numbers with exp=-1023 are special: if the mantissa is 0 then the represent + or - 0. If the mantissa is non-zero then the represent denormalized numbers: $-1^s (\mathrm{mantissa}/2^{1075})$.
In C++ on most current machines the following code will probably do most of what you want (I adapted this from some binary32 code without testing, so test first.) We will use uint64_t
from the cstdint
header for the bit-level representation, and a union for extracting the bits. The following code does not deal with zero, denormals, NAN or positive or negative Infinity (HUGE_VAL).
#include <cstdint> // for uint64_t typedef union { double value; uint64_t bits; } ieee754_binary64_union; #define IEEE754_MANTISSA_BITS 52 #define IEEE754_EXPONENT_BITS 11 #define IEEE754_HIDDEN_BIT (((uint64_t)1) << IEEE754_MANTISSA_BITS) #define IEEE754_MANTISSA_MASK (IEEE754_HIDDEN_BIT - ((uint64_t)1)) #define IEEE754_EXPONENT_MASK ((1 << IEEE754_EXPONENT_BITS) - 1) #define IEEE754_EXPONENT_BIAS ((1 << (IEEE754_EXPONENT_BITS - 1)) - 1) static inline double ieee754_bits2float(uint64_t u) { ieee754_binary64_union fiu; fiu.bits = u; return fiu.value; } static inline uint64_t ieee754_float2bits(double f) { ieee754_binary64_union fiu; fiu.value = f; return fiu.bits; } static inline int ieee754_get_sign(double f) { return ieee754_float2bits(f) >> 63; } static inline int ieee754_get_exponent(double f) { uint64_t bits = ieee754_float2bits(f); return (((bits >> IEEE754_MANTISSA_BITS) & IEEE754_EXPONENT_MASK) - IEEE754_EXPONENT_BIAS); } static inline uint64_t ieee754_get_mantissa(double f) { uint64_t bits = ieee754_float2bits(f); return ((bits & IEEE754_MANTISSA_MASK) + IEEE754_HIDDEN_BIT); } static inline double ieee754_make_double(int sign, // 1 (neg) or 0 (pos) int exponent, // range -1023 -> 1024 uint64_t mantissa) // with or without hidden bit { uint64_t collected_bits = ((uint64_t)sign << (IEEE754_MANTISSA_BITS + IEEE754_EXPONENT_BITS)) | ((uint64_t)((exponent + IEEE754_EXPONENT_BIAS) & IEEE754_EXPONENT_MASK) << IEEE754_MANTISSA_BITS) | (mantissa & IEEE754_MANTISSA_MASK); return ieee754_bits2float(collected_bits); }
Best Answer from StackOverflow
Question Source : http://cs.stackexchange.com/questions/18365
0 comments:
Post a Comment
Let us know your responses and feedback