2012-11-28 Dick Grune
* newargs.c (recursive_args): Liqun Chen (liqun.chen@hp.com)
submitted a bug report noting that the separator / is expanded under
the -R option. Corrected.
2012-09-30 Dick Grune
* pass2.c (pass2_txt): Boyd Blackwell (Boyd.Blackwell@anu.edu.au)
submitted a bug report in which the line numbers (and run
representations) were way off (75 lines). The input files were
characterized by extremely long lines, hundreds of tokens (max. 521).
After 2.5 days of debugging the cause was found:
1. the mapping from token positions to line numbers is stored as the
difference of the token positions from one line to the next (see
text.c);
2. these differences are stored in unsigned chars to save space;
3. the nl_buff mechanism is switched off when one of these unsigned
chars overflows;
4. 521 tokens on one line overflowed this unsigned char, so the
nl_buff mechanism was shut off;
5. when there is no nl_buff information in pass2, pass2 resorts to
rereading the input file, calling yylex again;
6. the preceding file had few runs to find line numbers for, so it was
not read to the end, and the rest remained in flex's buffer; as a
result a portion of the preceding file seemed prefixed to the present
file, adding 75 lines to it.
Remedy: flushing flex's buffer explicitly in pass2_txt(); this is
simpler than using flex's YY_BUFFER_STATE mechanism.
Advice: get rid of the nl_buff mechanism; it is no longer relevant.
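The overflow described above can be illustrated with a minimal sketch.
It is not the code in text.c; the struct nl_info, its fixed capacity and
the function names are assumptions made for this example only.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for the nl_buff idea: one unsigned char per
   line, holding the number of tokens on that line. */
struct nl_info {
    unsigned char cnt[1024];   /* tokens per line (assumed capacity) */
    size_t n_lines;            /* number of lines recorded so far */
    int usable;                /* cleared once a count does not fit */
};

void nl_init(struct nl_info *nl) {
    nl->n_lines = 0;
    nl->usable = 1;
}

/* Record a finished line; returns 0 when the buffer is abandoned. */
int nl_add_line(struct nl_info *nl, size_t tokens_on_line) {
    if (!nl->usable) return 0;
    if (tokens_on_line > UCHAR_MAX || nl->n_lines >= sizeof nl->cnt) {
        nl->usable = 0;        /* e.g. 521 tokens on one line */
        return 0;
    }
    nl->cnt[nl->n_lines++] = (unsigned char)tokens_on_line;
    return 1;
}

/* Map a 0-based token position to a 1-based line number; 0 on failure. */
size_t nl_line_of(const struct nl_info *nl, size_t token_pos) {
    size_t line, first = 0;
    if (!nl->usable) return 0;
    for (line = 0; line < nl->n_lines; line++) {
        if (token_pos < first + nl->cnt[line]) return line + 1;
        first += nl->cnt[line];
    }
    return 0;
}
```

Once nl_add_line() has returned 0, the caller has to fall back on
rereading the file, which is the path that exposed the stale flex buffer.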
2012-06-09 Dick Grune
* lang.h:
The *lang.l files are unusual in two respects:
1. they present two interfaces to the rest of the system:
language.[ch], static data about the language, and lang.[ch], dynamic
data about the input file's content;
2. both interfaces come with multiple implementations, one for each
*lang.l file; i.e., they are abstract.
This has been sorted out with some difficulty.
2012-05-08 Dick Grune
* Changed to 16-bit tokens, for better resolution for sim_text and
on -F option, and for UTF-8 input.
It was not worth while to save the 8-bit token code: on serious
comparisons the increase in memory usage is about 10% (330 000 on a
maximum allocation of 3 030 976 for comparing the sources of MCD2).
2009-03-11 Dick Grune <dick@flits.few.vu.nl>
* newargs.c: added -R option to follow directories recursively.
See recursive_args().
2008-09-22 <Dick@ACER>
* added newargs.[ch], to supply file names from standard input,
for those compilers that do not have the @ facility. Implemented
without fixed limits.
2008-09-21 <Dick@ACER>
* changed default format back to original, and inverted the
-v(erbose) option into a -T(erse) option.
2008-03-31 Dick Grune <dick@flits.few.vu.nl>
* *.l: the following are not universally recognized; removed.
%option nounput
%option never-interactive
2008-03-31 <Dick@ACER>
Introduced aiso.* and Malloc.? as imported modules.
2007-11-21 Carlos Maziero <maziero@ppgia.pucpr.br>
- output format modified in order to facilitate "grep" filtering
- added option "-v" for a more verbose output
- added option "-tN" to define a threshold %N (only similarities
over N% are shown)
- fixed SEGV on writing to the output file
- the file list can be supplied through STDIN (one file per line,
accepts the "/" marker); this is useful for compilers that lack the
@ facility
2007-08-23 Dick Grune <dick@hydra.cs.vu.nl>
LICENSE.txt added.
2006-11-27 Dick Grune <dick@hydra.cs.vu.nl>
Removal of setbuff() for compatibility.
2005-01-17 Dick Grune <dick@blade014.cs.vu.nl>
Corrections by Jerry James <james@eecs.ku.edu>; ANSIizing, etc.
2004-08-05 Dick Grune <dick@blade014.cs.vu.nl>
Finished the 'percentage' option.
08-Nov-2001 Dick Grune
Began adding a 'percentage' option, which will express the
similarity between two files as a percentage.
27-Sep-2001 Dick Grune
Split add_run() off from compare.c into add_run.c, to accommodate
different add_run()s, for different types of processing.
27-Nov-1998 Dick Grune
Installed a Miranda version supplied by Emma Norling (ejn@cs.mu.oz.au)
23-Feb-1998 Dick Grune
Renamed text.l to textlang.l for uniformity and to make room for
a possible module text.[ch].
Isolated a module for handling the token array from buff.[ch] to
tokenarray.[ch], and renamed buff.[ch] to text.[ch].
23-Feb-1998 Dick Grune
There is probably not much point in abandoning the nl_buff list
when running out of memory for TokenArray[]: each token costs 1
byte for the token and 4 bytes for the entry in
forward_references[], a total of 5 bytes. There are about 3
tokens to a line, together requiring 15 bytes, plus 1 byte in
nl_buff yields 16 bytes. So releasing nl_buff frees only 1/16 =
6.25 % of memory.
Since the code is a bother, I removed it. Note that nl_buff is
still abandoned when the number of tokens in a line does not fit
in one unsigned char (but that is not very likely to happen).
21-Feb-1998 Dick Grune
Printing got into an infinite loop when the last line of the
input was not terminated by a newline AND contained tokens that
were included in a matching run.
This was due to a double bug: 1. the non-terminated line was not
registered properly in NextTextTokenObtained() / CloseText(),
and 2. the loop in pass 2 which sets the values of
pos->ps_nl_cnt was terminated prematurely when the file turned
out to be shorter than the list of pos-es indicated.
Both bugs were corrected, the first by supplying an extra
newline in CloseText() when one is found missing, and the second
by rewriting the list-parallel loop in pass 2.
02-Feb-1998 Dick Grune
Pascal does not differentiate between strings and characters
(strings of one character); this difference has been removed
from pascallang.l.
22-Jan-1998 Dick Grune
Detection of non-ASCII characters added. Since the lexical
analyser itself generates non-ASCII characters, the test must occur
earlier. We could replace the input routine of lex by a
checking routine, but with several lex-es going around, we want
a more lex-independent solution. To allow each language its own
restrictions about non-ASCII characters, the check is
implemented in the *lang.l files.
28-Nov-1997 Dick Grune
Changed the name of the C similarity tester 'sim' to 'sim_c', for
uniformity with sim_java, etc.
23-Nov-1997 Dick Grune
Java version finished; checked by Matty Huntjens and crew.
24-Jun-1997 Dick Grune
Started on a Java version, by copying the C version.
22-Jun-1997 Dick Grune
Modern lexical analysers, among which flex, read the entire input into
a buffer before they issue the first token. As a result, ftell() no
longer gives a usable indication of the position of a token in a file.
This pulls the rug from under the nl_buff mechanism in buff.c, which
is removed. We lose a valuable optimization this way, but there just
seems to be no way to keep it.
Note that this has nothing to do with the problem in MS-DOS of
character count and fseek position not being synchronized. That
problem has been solved on June 14, 1991 (which see) and the code has
been running OK since.
18-Jun-1997 Dick Grune
The thought has occurred to use McCreight's linear longest common
substring algorithm rather than the existing algorithm, which has a
small quadratic component. There are a couple of problems with this:
1. We need the longest >non-overlapping< common substring;
McCreight provides just the longest. It is not at all clear
how to modify the algorithm.
2. Once we have found our LCS, we want to find the
second-longest; it is far from obvious how to do that in
McCreight's algorithm.
3. Once we have found our LCS, we want to take one of its
copies out of the game, to suppress duplicate messages.
Again, it is difficult to see how to do that, without
redoing all the calculations.
4. McCreight's algorithm seems to require about two binary
tree nodes per token, say 8 bytes, which is double what we
use now.
17-Jun-1997 Dick Grune
Did some experimenting with the hash function; it is still
pretty bad: the simple-minded second sweep through
forward_references easily removes another 80-99% of false hits.
Next, a third sweep that does a full comparison will remove another
large percentage.
So I have left in the second sweep in all cases.
There are a couple of questions here:
1. Can we find a better hash function, or will we forever need a
second sweep?
2. Does it actually matter, or will we lose on more expensive
hashing what we gain by having a better set of forward
references in compare.c?
16-Jun-1997 Dick Grune
Cleaned up sim.h and renamed aiso.[ch] to runs.[ch] since they
are instantiations of the aiso module concerned with runs.
Aiso.[spc|bdy] stays aiso.[spc|bdy], of course.
16-Jun-1997 Dick Grune
Redid largest_function() in algollike.c.
Corrected bug in CheckRun; it now always removes NonFinals from
the end, even when it has first applied largest_function().
15-Jun-1997 Dick Grune
Reorganized the layers around the input file. There were and
still are three layers: lang, stream and buff.
Since the lex_X variables are hoisted unchanged through the levels
lang, stream, and buff, to be used by pass1, pass2, etc., they
have to be placed in a module of their own.
The token-providing module 'lang' has three interfaces:
- lang.h, which provides access to the lowest-level token
routines, to be used by the next level.
- lex.h, which provides the lex variables, to be used by
all and sundry.
- language.h, which provides language-specific info about
tokens, concerning their suitability as initial
and final tokens, to be used by higher levels.
This structure is not satisfactory, but it is also unreasonable
to combine them in one interface.
There is no single lang.c; rather it is represented by the
various Xlang.c files generated from the Xlang.l files.
14-Jun-1997 Dick Grune
Added a Makefile zip entry to parallel the shar entry.
13-Jun-1997 Dick Grune
A number of simplifications, in view of better software and bigger
machines:
- Removed good_realloc from hash.c; I don't think there are
any bad reallocs left.
- Removed the option to run without forward_references.
On a 16Mb machine this means you have at least 2M tokens;
using a quadratic algorithm will take 4*10^6 sec. at an
impossible rate of 1M actions/sec., which is some 50 days.
Forget it.
- Renamed lang() to print_stream(), and incorporated it in sim.c
- Removed the MSDOS subdirectory mechanism in the Makefile.
- Removed the funny and sneaky double parameter expansion in
the call of idf_in_list().
12-Jun-1997 Dick Grune
Converted to ANSI C. Removed cport.h.
09-Jan-1995 Dick Grune
Decided not to do directories: they usually contain extraneous
files and doing sim * is simple enough anyway.
09-Sep-1994 Dick Grune
Added system.h to cater for the (few) differences between Unix and DOS.
The #define int32 is also supplied there.
05-Sep-1994 Dick Grune
Added many prototype declarations using cport.h.
Added a depend entry to the Makefile.
31-Aug-1994 Dick Grune
All these changes require a 32 bit integer; introduced a #define
int32, set from the command line in the Makefile.
25-Aug-1994 Dick Grune
It turned out that one of the most often called routines was .rem,
from idf_hashed() in idf.c. Moving the % out of the loop shaved off
another 6% and reduced the time to 18.4 sec.
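A minimal sketch of this kind of change, not the actual idf_hashed():
the function names, the multiplier 97 and the table size are
assumptions chosen for illustration.

```c
#define HASH_SIZE 10639u   /* assumed table size, illustration only */

/* Before: one remainder operation per character. */
unsigned int idf_hash_slow(const char *s) {
    unsigned int h = 0;
    while (*s)
        h = (h * 97u + (unsigned char)*s++) % HASH_SIZE;
    return h;
}

/* After: let unsigned arithmetic wrap and reduce once at the end. */
unsigned int idf_hash_fast(const char *s) {
    unsigned int h = 0;
    while (*s)
        h = h * 97u + (unsigned char)*s++;
    return h % HASH_SIZE;
}
```

The two versions agree as long as the intermediate value does not wrap
around; for longer identifiers they may yield different (but still
valid) table indices, which is harmless provided one version is used
consistently.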
19-Aug-1994 Dick Grune
With very large files (e.g., concatenated /usr/man/man1/*) the fixed
built-in hash table size of 10639 is no longer satisfactory. Hash.c
now finds a prime about 8 times smaller than the text_size to use
for hash table size; this achieves optimal speed-up without gobbling
up too much memory. Reduced the time for the above file from 30.2
sec. to 19.6 sec.
For checking, the same test was run with all hashing off; it took
20h 27m 19s = 73639 sec. But it worked.
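A sketch of how such a table size might be chosen, under the assumption
that the search goes upward from text_size / 8; the entry above does not
say in which direction Hash.c looks, and the function names are
hypothetical.

```c
#include <stddef.h>

/* Trial division; adequate for a one-off computation at start-up. */
static int is_prime(size_t n) {
    size_t d;
    if (n < 2) return 0;
    for (d = 2; d <= n / d; d++)
        if (n % d == 0) return 0;
    return 1;
}

/* Smallest prime not below text_size / 8. */
size_t hash_table_size(size_t text_size) {
    size_t n = text_size / 8;
    if (n < 2) n = 2;
    while (!is_prime(n)) n++;
    return n;
}
```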
11-Aug-1994 Dick Grune
For large values of MinRunSize (>1000) a large part of the time
(>two-thirds) was spent in calculating the hash values for each
position in the input, since the cost of this calculation was
proportional to MinRunSize. We now sample a maximum of 24 tokens
from the input string to calculate the hash value, and avoid
overflow. On my workstation, this reduces the time for
sim_text -r 1000 -n /usr/man/man1/*
from 60 sec to 21 sec.
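The idea can be sketched as follows. Only the limit of 24 samples comes
from the entry above; the even spacing of the samples, the multiplier
31 and the function name are assumptions of this example.

```c
#include <stddef.h>

#define MAX_SAMPLES 24   /* from the entry above */

/* Hash a window of min_run_size tokens starting at tk, looking at no
   more than MAX_SAMPLES evenly spaced tokens, so that the cost no
   longer grows with min_run_size. */
unsigned long sampled_hash(const unsigned char *tk, size_t min_run_size) {
    size_t n = min_run_size < MAX_SAMPLES ? min_run_size : MAX_SAMPLES;
    size_t i;
    unsigned long h = 0;
    for (i = 0; i < n; i++) {
        /* spread the n samples over the window; the last hits the end */
        size_t pos = (n > 1) ? i * (min_run_size - 1) / (n - 1) : 0;
        h = h * 31ul + tk[pos];   /* unsigned arithmetic wraps safely */
    }
    return h;
}
```

Tokens that are not sampled do not influence the hash value, so more
false hits have to be weeded out by the later comparison; that is the
price of the speed-up.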
30-Jun-1992 Dick Grune, room R4.40, phone 5778
There was an amazing bug in buff.c where NextTextToken() for pass 2
omitted to set lex_token to EOL when retrieving newline info from
nl_buff. Worked until now!?!
23-Sep-1991 Dick Grune
Cport.h introduced, CONST and *.spc only.
17-Sep-1991 Dick Grune
The position-sorting routine in pass2.c has been made into a
separate generic module.
14-Jun-1991 Dick Grune (dick@cs.vu.nl) at dick.cs.vu.nl
Replaced the determination of the input position through counting
input characters by calls of ftell(); this is cleaner and the other
method will never work on MSDOS.
30-May-1989 Dick Grune (dick) at dick
Replaced the old top-100 module (which had been extended to top-10000
already anyway) by the new aiso (arbitrary-in sorted-out) module.
This caused a considerable speed-up on the Mod2 test bed:
%time  cumsecs  #call  ms/call  name
 17.9    99.20   7209    13.76  _InsertTop
  0.3     1.37   7209     0.19  _InsertAiso
It turns out that malloc() is not a serious problem, so no special
version for the aiso module is required.
23-May-1989 Dick Grune (dick) at dick
No more uncommented comment at the end of preprocessor lines, to
conform to ANSI C.
23-May-1989 Dick Grune (dick) at dick
Added code in the X.l files to (silently) reject characters over 0200.
This does not really help, since lex stops on null chars. Ah, well.
19-May-1989 Dick Grune (dick) at dick
Made the token as handled by sim into an abstract data type, for
aesthetic reasons. Sign extension is still a problem.
03-May-1989 Dick Grune (dick) at dick
Optimized lcs() by first checking from the end if a sufficiently long
run is present; if in fact only the first 12 tokens match, chances
are good that you can reject the run right away by first testing
the 20th token, then the 19th, and so on.
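A minimal sketch of the backwards test, assuming 8-bit tokens and a
hypothetical function name; the real lcs() does more than this.

```c
#include <stddef.h>

/* Returns 1 if a[0..min_run-1] equals b[0..min_run-1], testing the
   last token first and working backwards. */
int run_is_long_enough(const unsigned char *a, const unsigned char *b,
                       size_t min_run) {
    size_t i = min_run;
    while (i > 0) {
        i--;
        if (a[i] != b[i]) return 0;   /* early reject near the far end */
    }
    return 1;
}
```

With a minimum run size of 20 and only the first 12 tokens matching, a
forward scan needs 13 comparisons to fail, whereas the backward scan
often fails on the first one.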
21-Apr-1989 Dick Grune (dick) at dick
A run of sim_m2 finding 7209 similarities raised the question of
the appropriateness of the linear sort in sort_pos(). Profiling
showed that in this case sorting takes all of 7.5 % of the total
time. Putting the word 'register' in the right places in
sort_pos() lowered this number to 4.6 %.
20-Apr-1989 Dick Grune (dick) at dick
Moved the test for MayBeStartOfRun() from compare.c (where it is
done again and again) to hash.c, where its effect is incorporated in
the forward reference chain.
14-Apr-1989 Dick Grune (dick) at dick
Replaced elem_of() by bit tables, headers[] and trailers[], to be
prefilled from Headers[] and Trailers[] by a call of
InitLanguage(). This saves a few percent.
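A sketch of such a bit table, assuming 8-bit tokens; the type bitset
and the functions fill_bitset() and in_bitset() are hypothetical names,
not those used in sim.

```c
#include <string.h>

#define N_TOKENS 256                       /* assumed 8-bit tokens */
typedef unsigned char bitset[N_TOKENS / 8];

/* Prefill the table from a zero-terminated list of member tokens. */
void fill_bitset(bitset bits, const char *members) {
    memset(bits, 0, sizeof(bitset));
    for (; *members; members++) {
        unsigned int t = (unsigned char)*members;
        bits[t >> 3] |= (unsigned char)(1u << (t & 7));
    }
}

/* Constant-time membership test, replacing a linear search. */
int in_bitset(const bitset bits, unsigned int t) {
    return (bits[(t & 0xFF) >> 3] >> (t & 7)) & 1;
}
```

The table is filled once at start-up; each later test costs a shift and
a mask instead of a scan through the list.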
13-Apr-1989 Dick Grune (dick) at dick
Implemented the -e and the -S option, by putting yet another loop
in compare.c
13-Apr-1989 Dick Grune (dick) at dick
The -- option (displaying the tokens) will now handle more than one
file.
20-Jan-1989 Dick Grune (dick) at dick
After the modification of 19-Dec-88, 12% of the time went into
updating the positions in the chunks, as they were produced by the
matching process. This matching process identifies runs (matches)
by token position, which has to be recalculated to lseek positions
and line numbers. To this end the files are read again, and for
each line all positions found were checked to see if they applied
to this line; this was an awfully stupid algorithm, but since much
more time was spent elsewhere, it did not really matter. With all
the saving below, however, it had risen to second position, after
yylook() with 35%.
The solution was to sort the positions in the same order in which
they would be met by the reading of the files. The process is then
linear. This required some extensive hacking in pass2.c
06-Jan-1989 Dick Grune (dick) at dick
The modification below did indeed save 25%. The newline information
is now reduced to 2 shorts; 2 chars were not enough, since some
lines are longer than 127 bytes, and a char and a short together
take as much room as two shorts.
19-Dec-1988 Dick Grune (dick) at dick
To avoid reading the files twice (which is still taking 25% of the
time), the first pass will now collect newline information for the
second pass in a buffer called nl_buff[]. This buffer, and the
original token buffer now named TokenArray[], are managed by the file
buff.c, which implements a layer between stream.h and pass?.c. This
layer provides OpenText(), NextTextToken() and CloseText(), each
with a parameter telling which pass it is.
06-Dec-1988 Dick Grune (dick) at dick
As an introduction to removing the second pass altogether, the
first and second scan were unified, i.e., their input is identical.
This also means that the call sim -[12] has now been replaced by
one call: sim --.
23-Sep-1988 Dick Grune (dick) at dick
Dynamic allocation of line buffers in pass 3. This removes the
restriction on the page width.
22-Sep-1988 Dick Grune (dick) at dick
In order to give better messages on incorrect calls to sim, the
whole option handling has been concentrated in a file option.c and
separated from the options and their messages themselves. See sim.c
07-Sep-1988 Dick Grune (dick) at dick
For long text sequences (say hundreds of thousands of tokens),
the hashing is not really efficient any more since too many
spurious matches occur. Therefore, the forward reference table is
scanned a second time, eliminating from any chain all references to
runs that do not end in the same token. For the UNIX manuals this
reduced the number of matches from 91.9% to 1.9% (of which 0.06%
were genuine).
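The second scan can be sketched like this. The layout is an assumption
of the example: fwd[i] holds the next position with the same hash as
position i, 0 means "none", and position 0 itself is unused.

```c
#include <stddef.h>

/* Remove from every chain the references to runs that do not end in
   the same token as the run starting at i. Chains are assumed to
   contain only positions that leave room for a full window. */
void second_sweep(size_t *fwd, const unsigned char *tk,
                  size_t n_tokens, size_t min_run) {
    size_t i;
    for (i = 1; i + min_run <= n_tokens; i++) {
        size_t j = fwd[i];
        while (j != 0 && tk[i + min_run - 1] != tk[j + min_run - 1])
            j = fwd[j];            /* skip a false hit */
        fwd[i] = j;
    }
}
```

Comparing a single token per candidate is cheap, yet it filters out
most of the chain entries that only share a hash value.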
30-Aug-1988 Dick Grune (dick) at dick
For compatibility, NextTop has been rewritten to yield true or
false and to accept a pointer to a run as a parameter.
30-Aug-1988 Dick Grune (dick) at dick
When trying to find the line numbers and lseek positions of the
beginnings and ends of the runs found, the whole set of runs was scanned for each
line in each file. Now only the runs belonging to that file are
scanned; to this end another linked list has been braided through
the data structures (tx_chunk).
30-Aug-1988 Dick Grune (dick) at dick
The longest-common-substring algorithm was called much too often,
mainly because the forward references made by hashing suffered from
pollution. If you have say 1000 tokens and a hash range of say
10000, about 5 % of the hashings will be false matches, i.e. 50
matches, which is quite a lot on a natural number of 2 to 3 matches.
Improved by doing a second check in make_forw_ref().
12-Jun-1988 Dick Grune (dick) at dick
Installed a Lisp version supplied by Gertjan Akkerman.
15-Jan-1988 Dick Grune (dick) at dick
Added register declarations all over the place.
14-Jan-1988 Dick Grune (dick) at dick
It is often useful to match a piece of code exactly, especially
when function names (or, even more so, macro names) are involved.
What one would want is having all the letters in the text array,
but this is kind of hard, since each entry is one lexical item.
This means that under the -F option each letter is a lex item, and
normally each tag is a lex item; this requires two lex grammars in
one program; no good. So, on the -F flag we hash the identifier
into one lex item, which is hopefully characteristic enough. It
works.
30-Sep-1987 Dick Grune (dick) at dick
Some cosmetics.
31-Aug-1987 Dick Grune (dick) at dick
Moved the whole thing to the SUN (while testing on a VAX and a
MC68000)
16-Aug-1987 Dick Grune (dick) at dick
The test program lang.c is no longer a main program, but rather a
subroutine called in main() in sim.c, through the command line
option -1 or -2.
23-Apr-1987 Dick Grune (dick) at tjalk
Changed the name 'index' into 'elem_of', because of compatibility
problems on different Unices. Added a declaration for it in
the file algollike.c
10-Mar-1987 Dick Grune (dick) at tjalk
Changed the printing of the header of a run so that:
- long file names will no longer be truncated
- the run length is displayed
27-Jan-1987 Dick Grune (dick) at tjalk
Switched it right off again! Getting them in textual order is
still more unpleasant, since now you cannot find the important
ones if there are more than a few runs.
27-Jan-1987 Dick Grune (dick) at tjalk
Going to experiment with leaving out the sorting; just all the
runs, in the order we meet them. Should be as good or better.
Comparisons of more than 100 runs are very rare anyway, so the
fact that those over 100 are rejected is probably no great
help. Getting them in a funny order is a nuisance, however. Down
with featurism. Just to be safe, present version saved as
870127.SV
26-Dec-1986 Dick Grune (dick) at tjalk
Names of overall parameters in params.h changed to more uniformity.
26-Dec-1986 Dick Grune (dick) at tjalk
Since the top package and the instantiation system have grown
apart so much, I have integrated the old top package into sim,
i.e., done the instantiation by hand. This removes top.g and
top.p, and will save outsiders from wondering what is going on
here.
23-Dec-1986 Dick Grune (dick) at tjalk
Use setbuf to print unbuffered while reading the files (lex core
dumps, other mishaps) and print buffered while printing the real
output (for speed).
30-Nov-1986 Dick Grune (dick) at tjalk
Various small changes in *lang.l:
; ignored conditionally (!options['f'])
new format for tokens in struct idf
cosmetics: macro Layout, macro UnsafeComChar, no \n
in character denotations, more than one char
in a char denotation in Pascal, etc.
30-Nov-1986 Dick Grune (dick) at tjalk
Added a Modula-2 version.
29-Nov-1986 Dick Grune (dick) at tjalk
Restricting tokens to the ASCII95 character set is really too
severe: some languages have many more reserved words (COBOL!).
Corrected this by adding a couple of '&0377' in strategic places.
Added a routine for printing the 8-bit beasties: show_token().
15-Aug-1986 Dick Grune (dick) at tjalk
Since the ; is superfluous in both C and Pascal, it is now ignored
by clang.l and pascallang.l
15-Aug-1986 Dick Grune (dick) at tjalk
The code in CheckRun in Xlang.l was incorrect in that it used the
wrong criterion for throwing away trailing garbage. I've taken
CheckRun etc. out of the Xlang.l-s and turned them into a module
"algollike.c". Made a cleaner interface and avoided duplication of
code.
02-Jul-1986 Dick Grune (dick) at tjalk
Looking backwards in compare.c to see if we are in the middle of a
run is an atavism. You can be and still be all right, e.g., if
part of the run was rejected as not fitting for a function.
Removed from compare.c.
10-Jun-1986 Dick Grune (dick) at tjalk
The function hash_code() in hash.c could yield a negative value;
corrected.
09-Jun-1986 Dick Grune (dick) at tjalk
Changed the name of the file text.h to sim.h. Sim.h is more
appropriate and text.h sounds as if it belongs to text.l, with
which it has no connection.
04-Jun-1986 Dick Grune (dick) at tjalk
After having looked at a couple of hash functions and having done
some calculations on the number of duplicates normally encountered
in hash functions, I conclude that our function in hash.c is quite
good. Removed all the statistics-gathering stuff.
Actually, hash_table[] is not the hash table at all; it is a
forward reference table; likewise, the real hash table was called
last[]. Renamed both.
There is a way to keep the hash table local without putting it on
the stack: use malloc().
02-Jun-1986 Dick Grune (dick) at tjalk
Added a simple lex file for text: each word is condensed into a
hash code which is mapped on the ASCII95 character set. This
turns out to be quite effective.
01-Jun-1986 Dick Grune (dick) at tjalk
The macros cput(tk) and c_eol() both have a return in them, so any
code after them may not be executed -> they have to be last in an
entry. But they weren't, in many places; I can't imagine why it
all worked nevertheless. They have been renamed return_tk(tk) and
return_eol() and the entries have been restructured.
30-May-1986 Dick Grune (dick) at tjalk
Moved the string and character entries in clang.l and pascallang.l
to a place behind the comment entries, to avoid strings (and
characters) being recognized inside comments. I first thought
this would not happen, but as Maarten pointed out, if both
interpretations have the same length, lex will take the first
entry. Now this will happen if the string occupies the whole line
that would otherwise be taken as a comment. In short,
/*
"hallo"
*/
would return ".
28-May-1986 Dick Grune (dick) at tjalk
Added -d option, to display the output in diff(1) format (courtesy
of Maarten van der Meulen).
Rewrote the lexical parsing of comments (likewise courtesy Maarten
van der Meulen).
20-May-1986 Dick Grune (dick) at tjalk
Added a routine to convert identifiers to lower case in
pascallang.l .
19-May-1986 Dick Grune (dick) at tjalk
Added -a option, to quickly check antecedent of a file (courtesy
of Maarten van der Meulen).
18-May-1986 Dick Grune (dick) at tjalk
Brought everything under RCS/CVS.
18-Mar-1986 Dick Grune (dick) at tjalk
Added modifications by Paul Bame (hp-lsd!paul@hp-labs) to have an
option -w to set the page width.
21-Feb-1986 Dick Grune (dick) at tjalk
Took array last[N_HASH] out of make_hash() in hash.c, due to stack
overflow on the Gould (reported by George Walker
tekig4!georgew@mcvax.uucp)
16-Feb-1986 Dick Grune (dick) at tjalk
Corrected some subtractions that caused unsigned ints to turn
pseudo-negative. (Reported by jaap@mcvax)
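The pitfall is easy to reproduce. A small sketch with hypothetical
function names, not the code that was corrected:

```c
/* Wrong: if b > a the unsigned difference wraps around to a huge
   positive value, so this is true for every a != b. */
int longer_than_wrong(unsigned int a, unsigned int b) {
    return a - b > 0;
}

/* Right: compare before subtracting. */
int longer_than(unsigned int a, unsigned int b) {
    return a > b;
}

/* Right: clamp the difference at zero when it would go "negative". */
unsigned int diff_or_zero(unsigned int a, unsigned int b) {
    return a > b ? a - b : 0;
}
```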
11-Jan-1986 Dick Grune (dick) at tjalk
Touched up for distribution.
10-Jan-1986 Dick Grune (dick) at tjalk
Fill_line was not called for empty lines, which caused them to be
printed as repetitions of the previous line.
24-Dec-1985 Dick Grune (dick) at tjalk
Reduced hash table to a single array of indices; it is used only
in one place, which makes it very easy to make it (the hash table)
optional. General tune-up of everything. This seems to be
another stable "final" version.
14-Dec-1985 Dick Grune (dick) at tjalk
Some experiments with hash formulas:
h = (h OP CST) + *p++
OP   CST       yields:  right  wrong
*    96 - 32            205    562
*    96 - 2             205    560
*    96                 205    560
*    97                 205    559
<<   0                   66   3128
<<   1                  203    555
<<   2                  205    536
<<   7                  203    540
Conclusion: it doesn't matter, unless you do it wrong.
01-Oct-1983 Dick Grune (dick) at vu44
Oldest known files.
# This file is part of the software similarity tester SIM.
# Written by Dick Grune, Vrije Universiteit, Amsterdam.
# $Id: ChangeLog,v 2.21 2012-11-28 20:49:51 Gebruiker Exp $
#