about FLOATING POINT ARITHMETIC
-------------------------------
arithmetic operations on floating point numbers consist of
addition, subtraction, multiplication and division.
the operations are done with algorithms similar to those used
on sign magnitude integers (because of the similarity of
representation) -- for example, only add numbers of the same
sign; if the numbers are of opposite sign, a subtraction must be done.
ADDITION
example on decimal value given in scientific notation:
3.25 x 10 ** 3
+ 2.63 x 10 ** -1
-----------------
first step: align decimal points
second step: add
3.25 x 10 ** 3
+ 0.000263 x 10 ** 3
--------------------
3.250263 x 10 ** 3
(presumes use of infinite precision, without regard for accuracy)
third step: normalize the result (already normalized!)
example on fl. pt. values given in binary:
.25 = 0 01111101 00000000000000000000000
100 = 0 10000101 10010000000000000000000
to add these fl. pt. representations,
step 1: align radix points
shifting the mantissa LEFT by 1 bit DECREASES THE EXPONENT by 1
shifting the mantissa RIGHT by 1 bit INCREASES THE EXPONENT by 1
we want to shift the mantissa right, because the bits that
fall off the end should come from the least significant end
of the mantissa
-> choose to shift the .25, since we want to increase its exponent.
-> shift by 10000101
-01111101
---------
00001000 (8) places.
0 01111101 00000000000000000000000 (original value)
0 01111110 10000000000000000000000 (shifted 1 place)
(note that hidden bit is shifted into msb of mantissa)
0 01111111 01000000000000000000000 (shifted 2 places)
0 10000000 00100000000000000000000 (shifted 3 places)
0 10000001 00010000000000000000000 (shifted 4 places)
0 10000010 00001000000000000000000 (shifted 5 places)
0 10000011 00000100000000000000000 (shifted 6 places)
0 10000100 00000010000000000000000 (shifted 7 places)
0 10000101 00000001000000000000000 (shifted 8 places)
step 2: add (don't forget the hidden bit for the 100)
0 10000101 1.10010000000000000000000 (100)
+ 0 10000101 0.00000001000000000000000 (.25)
---------------------------------------
0 10000101 1.10010001000000000000000
step 3: normalize the result (get the "hidden bit" to be a 1)
it already is for this example.
result is
0 10000101 10010001000000000000000
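the fields above can be checked on any machine with IEEE single
precision floats.  a minimal C sketch (print_fields is a made-up
helper name, not part of any library):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* print the sign, biased exponent, and stored mantissa of a float */
    static void print_fields(float f)
    {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);       /* reinterpret the bits safely */
        uint32_t sign = bits >> 31;
        uint32_t exp  = (bits >> 23) & 0xFFu; /* 8-bit biased exponent */
        uint32_t man  = bits & 0x7FFFFFu;     /* 23 stored mantissa bits (no hidden bit) */
        printf("%10g = %u ", f, sign);
        for (int i = 7; i >= 0; i--)  putchar('0' + ((exp >> i) & 1));
        putchar(' ');
        for (int i = 22; i >= 0; i--) putchar('0' + ((man >> i) & 1));
        putchar('\n');
    }

    int main(void)
    {
        print_fields(0.25f);           /* 0 01111101 00000000000000000000000 */
        print_fields(100.0f);          /* 0 10000101 10010000000000000000000 */
        print_fields(100.0f + 0.25f);  /* the sum computed above */
        return 0;
    }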
SUBTRACTION
like addition as far as alignment of radix points;
then the algorithm for subtraction of sign mag. numbers takes over.
before subtracting,
compare magnitudes (don't forget the hidden bit!)
change sign bit if order of operands is changed.
don't forget to normalize number afterward.
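a side note on the magnitude comparison: because the exponent field
sits in front of the mantissa, the magnitudes of two IEEE values can
be compared by comparing their non-sign bits as unsigned integers.
a minimal C sketch (assumes IEEE single precision; NaNs ignored):

    #include <stdint.h>
    #include <string.h>

    /* return 1 if |a| > |b|, using only integer operations */
    static int mag_greater(float a, float b)
    {
        uint32_t ua, ub;
        memcpy(&ua, &a, sizeof ua);
        memcpy(&ub, &b, sizeof ub);
        ua &= 0x7FFFFFFFu;   /* clear the sign bits, leaving |a|, |b| */
        ub &= 0x7FFFFFFFu;
        return ua > ub;      /* exponent is in the high bits, so an
                                unsigned compare orders magnitudes */
    }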
MULTIPLICATION
example on decimal values given in scientific notation:
     3.0  x 10 ** 1
  x  0.5  x 10 ** 2
  -----------------

   algorithm:  multiply mantissas
               add exponents

     3.0  x 10 ** 1
  x  0.5  x 10 ** 2
  -----------------
     1.50 x 10 ** 3
example in binary: use a mantissa that is only 4 bits so that
I don't spend all day just doing the multiplication
part.
0 10000100 0100
x 1 00111100 1100
-----------------
mantissa multiplication: 1.0100
(don't forget hidden bit) x 1.1100
------
00000
00000
10100
10100
10100
---------
1000110000
becomes 10.00110000
add exponents:  always add TRUE exponents
                (otherwise the bias gets added in twice)

  biased addition (DON'T do this -- shown only to make the point):
      10000100
    + 00111100
    ----------

  instead, first recover the true exponents by subtracting the
  bias from each biased exponent:

      10000100      01111111    (switch the order of the second
    - 01111111    - 00111100     subtraction, so that we can get
    ----------    ----------     a negative value)
      00000101      01000011
      true exp      true exp
      is 5.         is -67
add true exponents 5 + (-67) is -62.
re-bias exponent: -62 + 127 is 65.
unsigned representation for 65 is 01000001.
put the result back together (and add sign bit).
1 01000001 10.00110000
normalize the result:
(moving the radix point one place to the left increases
the exponent by 1.)
1 01000001 10.00110000
becomes
1 01000010 1.000110000
this is the value stored (the hidden bit is not stored):
1 01000010 000110000
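the same steps can be mirrored with plain integer operations.  a
minimal C sketch of just this 4-bit example (constants taken from
the worked example above; sign handling, rounding, and special
cases are all ignored):

    #include <stdio.h>

    int main(void)
    {
        /* mantissas with the hidden bit attached: 1.0100 and 1.1100 */
        unsigned m1 = 0x14;              /* 10100 */
        unsigned m2 = 0x1C;              /* 11100 */
        unsigned prod = m1 * m2;         /* 1000110000, i.e. 10.00110000 */

        int e1 = 0x84 - 127;             /* biased 10000100 -> true exp   5 */
        int e2 = 0x3C - 127;             /* biased 00111100 -> true exp -67 */
        int e  = e1 + e2;                /* add TRUE exponents:         -62 */

        /* normalize: the product came out as 1x.xxxxxxxx, so move the
           radix point one place left and bump the exponent */
        if (prod & (1u << 9))            /* top bit of the 10-bit product */
            e += 1;                      /* -62 -> -61 */

        printf("true exp %d, rebiased %d\n", e, e + 127);  /* -61, 66 */
        return 0;
    }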
DIVISION
similar to multiplication.
true division:
do unsigned division on the mantissas (don't forget the hidden bit)
subtract TRUE exponents
The IEEE standard is very specific about how all this is done.
Unfortunately, the hardware to do all this is pretty slow.
Some comparisons of approximate times:
   2's complement integer add      1 time unit
   fl. pt. add                     4 time units
   fl. pt. multiply                6 time units
   fl. pt. divide                 13 time units
There is a faster way to do division.  It's called
division by reciprocal approximation.  It takes about the same
time as a fl. pt. multiply.  Unfortunately, the results are
not always the same as with true division.
Division by reciprocal approximation:
instead of doing a / b
they do a x 1/b.
figure out a reciprocal for b, and then use the fl. pt.
multiplication hardware.
example of a result that isn't the same as with true division
(using 8 decimal digits of precision):
  true division:        3/3 = 1  (exactly)
  reciprocal approx.:   1/3 = .33333333
                    3 x .33333333 = .99999999, not 1
It is not always possible to get a perfectly accurate reciprocal.
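this is easy to observe on real hardware.  a minimal C sketch that
compares true division against reciprocal multiplication (which
numerators actually produce a difference depends on the values and
the precision):

    #include <stdio.h>

    int main(void)
    {
        float a[] = { 1.0f, 7.0f, 13.0f };
        float b = 10.0f;
        for (int i = 0; i < 3; i++) {
            float q1 = a[i] / b;            /* true division: one rounding */
            float q2 = a[i] * (1.0f / b);   /* reciprocal: two roundings   */
            printf("%2g/%g: %.9g vs %.9g  %s\n", a[i], b, q1, q2,
                   q1 == q2 ? "same" : "DIFFERENT");
        }
        return 0;
    }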
ISSUES in floating point
note: this discussion only touches the surface of some issues that
people deal with. Entire courses could probably be taught on each
of the issues.
rounding
--------
arithmetic operations on fl. pt. values can compute results that
cannot be represented exactly in the given amount of precision.  So,
results must be rounded.
There are MANY ways of rounding. They each have "correct" uses, and
exist for different reasons. The goal in a computation is to have the
computer round such that the end result is as "correct" as possible.
There are even arguments as to what is really correct.
3 methods of rounding (a 4th, round toward nearest, is shown with
the binary examples below):
round toward 0 -- also called truncation.
figure out how many bits (digits) are available. Take that many
bits (digits) for the result and throw away the rest.
This has the effect of making the value represented closer
to 0.
   example:
     .7783     if 3 digits available,  .778
               if 2 digits available,  .77

round toward + infinity --
   regardless of sign, round towards +infinity.
   example:
      1.23     if 2 digits available,  1.3
     -2.86     if 2 digits available, -2.8

round toward - infinity --
   regardless of sign, round towards -infinity.
   example:
      1.23     if 2 digits available,  1.2
     -2.86     if 2 digits available, -2.9
in binary -- rounding to 2 digits after the radix point
-------------------------------------------------------
each value below falls between two values representable in 2 digits;
the rounding mode picks one of the two.

round toward + infinity --

      1.1101   (between 1.11 and 10.00)    ->  10.00
      1.001    (between 1.00 and 1.01)     ->   1.01

round toward - infinity --

      1.1101   (between 1.11 and 10.00)    ->   1.11
      1.001    (between 1.00 and 1.01)     ->   1.00

round toward zero (TRUNCATE) --

      1.1101   (between 1.11 and 10.00)    ->   1.11
      1.001    (between 1.00 and 1.01)     ->   1.00
     -1.1101   (between -10.00 and -1.11)  ->  -1.11
     -1.001    (between -1.01 and -1.00)   ->  -1.00

round toward nearest --

   if the bits beyond the digits to be kept are anything other
   than exactly 1000..., round to the nearer of the two values.
   if they are exactly 1000... (the tie case), the IEEE standard
   rounds so that the least significant bit kept is a zero
   ("round to even").

      1.1111   (nearer to 10.00)           ->  10.00
      1.1101   (nearer to 1.11)            ->   1.11
      1.001    (TIE CASE)                  ->   1.00
     -1.1101   (nearer to -1.11)           ->  -1.11
     -1.001    (TIE CASE)                  ->  -1.00
NOTE: this is a bit different from the "round to nearest" algorithm
(for the "tie" case, .5) learned in elementary school for decimal numbers.
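on machines that implement the IEEE modes, C exposes them through
<fenv.h>.  a minimal sketch (the FE_* macros are standard C99, but
each is defined only if the target supports that mode, and whether
the mode change is honored everywhere is implementation-dependent):

    #include <stdio.h>
    #include <fenv.h>
    #pragma STDC FENV_ACCESS ON

    int main(void)
    {
        int modes[]         = { FE_TONEAREST, FE_TOWARDZERO,
                                FE_UPWARD, FE_DOWNWARD };
        const char *names[] = { "to nearest", "toward zero",
                                "toward +inf", "toward -inf" };
        volatile double one = 1.0, three = 3.0;  /* defeat constant folding */
        for (int i = 0; i < 4; i++) {
            fesetround(modes[i]);
            /* 1/3 is inexact, so the mode decides which
               neighboring double the quotient rounds to */
            double q = one / three;
            printf("%-12s  1/3 = %.20f\n", names[i], q);
        }
        fesetround(FE_TONEAREST);   /* restore the default mode */
        return 0;
    }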
use of standards
----------------
--> allows all machines following the standard to exchange data
and to calculate the exact same results.
--> IEEE fl. pt. standard sets
parameters of data representation (# bits for mantissa vs. exponent)
--> Pentium architecture follows the standard
overflow and underflow
----------------------
Just as with integer arithmetic, floating point arithmetic operations
can cause overflow. Detection of overflow in fl. pt. comes by checking
exponents before/during normalization.
Once overflow has occurred, an infinity value can be represented and
propagated through a calculation.
Underflow occurs in fl. pt. representations when a number is
too small (close to 0) to be represented. (show number line!)
if a fl. pt. value cannot be normalized
(getting a 1 just to the left of the radix point would cause
the exponent field to be all 0's)
then underflow occurs.
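both cases are easy to trigger.  a minimal C sketch using the
limits from <float.h>:

    #include <stdio.h>
    #include <float.h>
    #include <math.h>

    int main(void)
    {
        float big = FLT_MAX;        /* largest finite single precision value  */
        float of  = big * 2.0f;     /* overflow: the result becomes +infinity */
        printf("overflow:  %g  (isinf = %d)\n", of, isinf(of));

        float tiny = FLT_MIN;       /* smallest normalized value, 2 ** -126   */
        float uf   = tiny / 1e30f;  /* underflow: too close to 0 to represent */
        printf("underflow: %g\n", uf);   /* flushes all the way to 0 here */
        return 0;
    }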
HW vs. SW computing
-------------------
floating point operations can be done by hardware (circuitry)
or by software (program code).
-> a programmer won't know which is occurring, without prior knowledge
   of the HW.
-> SW is much slower than HW, by approx. 1000 times.
A difficult (but good) exercise for students would be to design
a SW algorithm for doing fl. pt. addition using only integer
operations.  (a stripped-down sketch of this appears below.)
SW to do fl. pt. operations is tedious. It takes lots of shifting
and masking to get the data in the right form to use integer arithmetic
operations to get a result -- and then more shifting and masking to put
the number back into fl. pt. format.
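a stripped-down version of that exercise, for addition only.  this
sketch assumes both inputs are positive normalized IEEE single
precision values and truncates the lost bits instead of rounding,
so it only illustrates the shifting and masking involved -- it is
not a complete implementation:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* add two positive normalized floats using only integer ops */
    static float soft_add(float fa, float fb)
    {
        uint32_t a, b;
        memcpy(&a, &fa, sizeof a);
        memcpy(&b, &fb, sizeof b);
        if (a < b) { uint32_t t = a; a = b; b = t; }  /* a = larger operand */

        int32_t  ea = (a >> 23) & 0xFF;               /* biased exponents  */
        int32_t  eb = (b >> 23) & 0xFF;
        uint32_t ma = (a & 0x7FFFFFu) | 0x800000u;    /* attach hidden bits */
        uint32_t mb = (b & 0x7FFFFFu) | 0x800000u;

        /* step 1: align radix points -- shift the smaller mantissa right,
           which raises its effective exponent to match the larger one */
        int shift = ea - eb;
        mb = (shift > 31) ? 0 : (mb >> shift);

        /* step 2: add the mantissas */
        uint32_t m = ma + mb;
        int32_t  e = ea;

        /* step 3: normalize -- a carry out means the sum is 1x.xxx...,
           so shift right one place and increase the exponent */
        if (m & 0x1000000u) { m >>= 1; e += 1; }

        uint32_t r = ((uint32_t)e << 23) | (m & 0x7FFFFFu);
        float fr;
        memcpy(&fr, &r, sizeof fr);
        return fr;
    }

    int main(void)
    {
        /* the 100 + .25 example from the ADDITION section above */
        printf("%g  (hardware says %g)\n",
               soft_add(100.0f, 0.25f), 100.0f + 0.25f);
        return 0;
    }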
Manufacturers commonly used to offer 2 versions of the same
architecture: one with HW fl. pt. ops, and one with SW fl. pt. ops.