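Introduction to Floating-Point Numbers
Floating point is an approximate way of representing real numbers in binary inside a computer. It is essentially scientific notation in base 2, and each number has three parts: a sign bit, an exponent, and a significand (also called the mantissa).
IEEE 754 is the IEEE standard for floating-point representation in computers. It defines the following formats, among others: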
Type                 | Sign bits | Exponent bits | Significand field bits | Total bits | Exponent bias | Bits of precision | Decimal digits
Half (IEEE 754-2008) | 1         | 5             | 10                     | 16         | 15            | 11                | ~3.3
Single               | 1         | 8             | 23                     | 32         | 127           | 24                | ~7.2
Double               | 1         | 11            | 52                     | 64         | 1023          | 53                | ~15.9
x86 extended         | 1         | 15            | 64                     | 80         | 16383         | 64                | ~19.2
Quad                 | 1         | 15            | 112                    | 128        | 16383         | 113               | ~34.0
The most common formats are single precision (Single) and double precision (Double), which correspond to float and double in C++ and C#.
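Let's start with an example of representing a real number as a float. Take 100: in binary it is 1100100, which normalizes to 1.1001 × 2^6 (with the significand written in binary). So the sign bit is 0. The exponent is 6 plus the exponent bias (127), which gives 133, or 10000101 in binary. The significand field is 1001, padded with zeros to 23 bits; the leading 1 is implicit and is not stored. Put together, the result is 0 10000101 10010000000000000000000, which is 0x42C80000 in hexadecimal.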
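Special Values in Floating Point
Next, let's see what the special float values are according to the definition. The following C++ program converts binary representations into float values and prints them.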
    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // Reinterpret a 32-bit pattern as a float. memcpy is used because the
    // pointer cast *(float *)&bits violates strict aliasing (undefined behavior).
    static float bitsToFloat(std::uint32_t bits)
    {
        float value;
        std::memcpy(&value, &bits, sizeof value);
        return value;
    }

    int main()
    {
        std::printf("100: %g\n", bitsToFloat(0x42C80000u));
        std::printf("Positive ZERO: %g\n", bitsToFloat(0x00000000u));
        std::printf("Negative ZERO: %g\n", bitsToFloat(0x80000000u));
        std::printf("Positive Min Denorm Float: %g\n", bitsToFloat(0x00000001u));
        std::printf("Positive Max Denorm Float: %g\n", bitsToFloat(0x007fffffu));
        // 0x00800000 is the smallest normalized value (0x00800001 is the next one up).
        std::printf("Positive Min Normalize float: %g\n", bitsToFloat(0x00800000u));
        std::printf("Positive Max Normalize float: %g\n", bitsToFloat(0x7f7fffffu));
        std::printf("Positive infinite: %g\n", bitsToFloat(0x7F800000u));
        std::printf("Negative infinite: %g\n", bitsToFloat(0xFF800000u));
        return 0;
    }
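Output (the three-digit exponents and the 1.#INF spelling are how older Microsoft C runtimes format these values; other toolchains print them differently, for example e-45 and inf):
    100: 100
    Positive ZERO: 0
    Negative ZERO: -0
    Positive Min Denorm Float: 1.4013e-045
    Positive Max Denorm Float: 1.17549e-038
    Positive Min Normalize float: 1.17549e-038
    Positive Max Normalize float: 3.40282e+038
    Positive infinite: 1.#INF
    Negative infinite: -1.#INF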
Floating-Point Special Values in C++
Next, let's look at how the special floating-point values are defined in C++:
    #include <cstdio>
    #include <limits>

    int main()
    {
        // A float passed to printf is promoted to double, so %g is correct.
        std::printf("denorm_min: %g\n", std::numeric_limits<float>::denorm_min());
        std::printf("epsilon: %g\n", std::numeric_limits<float>::epsilon());
        std::printf("infinity: %g\n", std::numeric_limits<float>::infinity());
        std::printf("lowest: %g\n", std::numeric_limits<float>::lowest());
        std::printf("max: %g\n", std::numeric_limits<float>::max());
        std::printf("min: %g\n", std::numeric_limits<float>::min());
        std::printf("quiet_NaN: %g\n", std::numeric_limits<float>::quiet_NaN());
        std::printf("signaling_NaN: %g\n", std::numeric_limits<float>::signaling_NaN());
        std::printf("round_error: %g\n", std::numeric_limits<float>::round_error());
        return 0;
    }
Output:
    denorm_min: 1.4013e-045
    epsilon: 1.19209e-007
    infinity: 1.#INF
    lowest: -3.40282e+038
    max: 3.40282e+038
    min: 1.17549e-038
    quiet_NaN: 1.#QNAN
    signaling_NaN: 1.#QNAN
    round_error: 0.5
Floating-Point Special Values in C#
Now let's look at how the special floating-point values are defined in C#:
    using System;

    // Written as top-level statements (C# 9 or later); in older versions,
    // place these lines inside a Main method.
    Console.WriteLine("Epsilon:" + float.Epsilon);
    Console.WriteLine("MaxValue:" + float.MaxValue);
    Console.WriteLine("MinValue:" + float.MinValue);
    Console.WriteLine("NaN:" + float.NaN);
    Console.WriteLine("PositiveInfinity:" + float.PositiveInfinity);
    Console.WriteLine("NegativeInfinity:" + float.NegativeInfinity);
Output:
    Epsilon:1.401298E-45
    MaxValue:3.402823E+38
    MinValue:-3.402823E+38
    NaN:NaN
    PositiveInfinity:Infinity
    NegativeInfinity:-Infinity
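As you can see, C# and C++ define Epsilon and MinValue differently. In C#, float.Epsilon is the smallest positive denormalized value (1.401298E-45), which corresponds to denorm_min() in C++, whereas the C++ epsilon() is the gap between 1 and the next representable float (1.19209e-007). Likewise, float.MinValue in C# is the most negative finite value (-3.402823E+38), which corresponds to lowest() in C++, whereas the C++ min() is the smallest positive normalized value (1.17549e-038).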