Thoughts on changing ktcp timer resolution #71

Closed
ghaerr opened this issue Jul 27, 2024 · 71 comments
Labels
Discussion: Technical discussion related to other issues

Comments


ghaerr commented Jul 27, 2024

In previous comments, I made the statement:

I would treat making a single system call per ktcp cycle as very fast.
The system call overhead is almost identical to the hw interrupt processing overhead ... and of course the 100 timer calls/second are thought of as minuscule overhead.

After looking at the code generated for the kernel sys_gettimeofday routine, I would say "Oops - that's only part of the story, it's actually not fast". In order to prepare the returned system time into a struct timeval, a 32-bit divide (by calling __udivsi3) plus an additional 32-bit modulo (by calling __umodsi3) are made, along with a couple of other 32-bit adds. This is all slow as heck. (Unbundled sources for each are here and here, and they loop).

Then, after that, ktcp's time wrapper function timer_get_time, which returns a 32-bit time in 1/16 second resolution, goes through yet another slow call to __udivsi3 to convert microseconds into 1/16-second ticks by dividing by 62,500:

timeq_t timer_get_time(void)
{
    struct timezone tz;
    struct timeval  tv;

    gettimeofday(&tv, &tz);

    /* return 1/16 second ticks, 1,000,000/16 = 62500*/
    return (tv.tv_sec << 4) | ((unsigned long)tv.tv_usec / 62500U);
}

Besides being quite slow, 1/16 second = 62.5 msecs, over 6 times longer than the 10ms clock resolution and the 5-10 msecs for an ACK response discussed here.

(I also notice in the above code, as I'm writing this, that the second ("&tz") param to gettimeofday is never used, and returning it wastes more time - it should be NULL.)

In another post I stated:

Whether you choose 1, 10 or 100ms isn't probably a big deal if we stick with the jiffies-related (100ms) hw timer as base real time, but the lower the better

After thinking more on this and looking at the timer_get_time routine - that answer also gives only half the story. The problem is that for efficiency reasons, any "ktcp" time needs to fit in 32 bits, not a timeval struct, in order to do a quick subtraction and not end up calling the gcc long div/mod routines each time it's used.

Thus the tradeoff: the more resolution given to micro or milliseconds, the less bits given to the "seconds" portion of a timeq_t.

In the current timeq_t, 28 bits are seconds and 4 bits are 1/16 seconds. This allows for 2^28 = 256M/86,400 = ~3106 days of timer. (A timeq_t may have to be in effect for the total duration of a TCP connection, I believe). To move to microsecond resolution would require 20 bits (1M) for microseconds, leaving only 12 bits (=4096) for seconds, a little over an hour. So that won't work, and the kernel hw timer has max 10ms (=HZ = 1/100) resolution anyways.
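
For illustration, the current packing and the cheap elapsed-time math look roughly like this (a sketch only - the helper names are invented here, not taken from the ktcp source):

/* 28.4 fixed-point layout of timeq_t as described above (illustrative sketch) */
typedef unsigned long timeq_t;      /* high 28 bits: seconds, low 4 bits: 1/16 sec */

timeq_t timeq_make(unsigned long secs, unsigned int sixteenths)
{
    return (secs << 4) | (sixteenths & 0x0F);
}

/* elapsed time is a single 32-bit subtraction, no __udivsi3/__umodsi3 needed */
timeq_t timeq_elapsed(timeq_t now, timeq_t start)
{
    return now - start;
}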

Considering that we stay with an approximately 10ms timer, which would at least measure ACK responses favorably on a 286/12.5MHz or V20/8MHz, how would the math work, given that the gettimeofday/timer_get_time path is already quite slow?

If we considered 1/256 second resolution (=~3.9ms) or 1/128 resolution (=~7.8ms), this would help a lot and the timer_get_time routine could be made faster (see below). If 1/256sec is used, that leaves 24 bits for seconds (=16M/86400 = ~194 days). 1/128 would be 25 bits with 388 days. But we should be time sampling with less than half the timer resolution, so I would say 1/256 second resolution would be best, and TLVC/ELKS is likely just fine with an open TCP connection for 194 days before worrying about RTT miscalculations.

One might ask, why not just use 1/100 second (=10ms) resolution directly? Well, the original timer_get_time routine is somewhat magic in that it uses the fact that 1M/16 = 62,500, so a microsecond divide can exactly fit into 4 bits (=1/16 sec). Dividing by 100 will still be quite slow but won't fit neatly into 7 bits. And if we're going to use 10ms resolution, why not make a special new system call and just return jiffies directly, which would be quite a bit faster, and only use a compare and subtract or two to compute elapsed time.

Another thought would be to just use a right-shift-12 (=1M/256 = 2^20/2^8) for roughly converting struct timeval microseconds to 1/256 seconds, this would leave the last line of the routine as:

    return (tv.tv_sec << 8) | (((unsigned long)tv.tv_usec >> 12) & 0xFF);

I haven't yet figured out how to measure how long the __udivsi3 or __umodsi3 routines take, so I'm not fully sure just how important all this is, given the kernel's 10ms max resolution.

What do you think?

Mellvik added the Discussion label Jul 27, 2024

Mellvik commented Jul 27, 2024

Very interesting @ghaerr, and very timely indeed. Great discussion - and I think we're converging on a solution.

Unsurprisingly I've been struggling with this over the past few days as I've been experimenting with simple ways of implementing (and measuring) slow start & congestion avoidance. There are plenty of surprises in this mess, some actually pleasant in the sense that smart things have been done and smart choices made - such as the original 1/16s now obviously being there for a good reason. Also, there is a lot to learn about how ktcp actually works, very valuable as we head into weeding out the final (?) problems and ridding it of the double-ACKs, which has suddenly become an issue - if not a very practical one, then at least a load concern.

Re. timer: I started with microseconds, found it messy in the sense that lots of zeroes didn't contribute to anything, 10ms being the actual resolution anyway. Also, I got the obvious wraparound problem with 12 bit seconds messing me up. Then I moved to 1ms which improved things, but didn't eliminate the wraparound issue for seconds. And of course the convergence algorithm for RTT needed adjustment as the time resolution changed. As your message arrived I was back to 1/16s, postponing the timer resolution issue until the algorithms are under control.

That said, having increased resolution was helpful in providing a clearer picture of actual round trip times and delays, a desirable change.

Your research and analysis was very educational - and interestingly lands very close to where I was sitting before we started this discussion: Not being even close to the task of creating a new system call, I played with the idea of creating an ioctl in tcpdev to get the current jiffies value. The thinking was that the gettimeofday call was probably expensive, and I was (incorrectly) thinking that getting an updated time value more often than once every ktcp loop was useful.

Now we're here, and I strongly support the idea of getting jiffies via a system call. Simplicity at its best. And safe. I may be missing something, but from my perspective - here's the problem with using gettimeofday: There is always going to be a wraparound issue with seconds when we truncate - unless we subtract a constant from the seconds value before creating the combined 32-bit time value. Even with 28 bits of seconds ORed in, there is a chance that those 28 bits will go from a very big to a very small number between 2 calls to gettimeofday. jiffies are guaranteed to not have that problem.

IOW - jiffies have my vote, and I'll keep the 1/16 setup until we have a solution. This is good!


ghaerr commented Jul 28, 2024

Interesting @Mellvik, I figured you might be in the middle of a timer rewrite, glad to hear your comments!

here's the problem with using gettimeofday: There is always going to be a wraparound issue with seconds when we truncate

jiffies are guaranteed to not have that problem.

Well, yes when truncating the 52-bit gettimeofday result to 32 bits, but we have the same wrapping problem with jiffies, just in a different way: the 32-bit jiffies stores sub-second hw timer ticks (=10ms = 100/second) effectively in 6.5 (let's call it 7) bits (=128). So the upper 25 bits are pretty much seconds and the lower 7 bits are the count of 10ms ticks. This means that jiffies will wrap every 2^25 = 32M/86,400 = ~388 days. If you're worried about 1/256 second wrapping every 194 days, I'd be as worried about 388 days.

Nonetheless, getting jiffies directly for timing purposes saves lots of time, and is definitely more accurate, considering that __udivsi3/__umodsi3 divide by looping, thus adding a variable number of microseconds to each measurement. We would need to calculate how many instructions per second a slow 8086 executes and then count instructions in the __udivsi3 routine(s) in order to get a more accurate idea of how slow it all is. I would guess on the order of 50-200 microseconds; not much when our timer resolution is 10,000 microseconds. Still - it all adds up.

I played with the idea of creating an ioctl in tcpdev to get the current jiffies value.

Actually, I think this is a better idea than writing a new system call. Since ktcp already does some special ioctls, another ioctl would be very simple. And then code the ioctl to return the address of the kernel jiffies variable itself (a pointer), rather than the value. Then a far pointer could be used to read jiffies directly - how's that for speed?

The ioctl only needs to return the kernel address (DS offset) of jiffies, since MEM_GETDS is already implemented to get the kernel DS segment value.

Note that the kernel itself doesn't concern itself with jiffies wrapping - most kernel wait code adds a value to jiffies without checking for wraparound. However, if one really wanted to be correct, something like the following could be used:

    unsigned int kds, jaddr;
    unsigned long __far *jp;
    unsigned long lasttick = 0;
    int fd = open("/dev/kmem", O_RDONLY);
    ioctl(fd, MEM_GETDS, &kds);
    ioctl(fd, MEM_GETJIFFADDR, &jaddr);
    jp = _MK_FP(kds, jaddr);

    /* get current jiffies */
    unsigned long now = *jp;
    /* get jiffies difference */
    long diff = now - lasttick;
    /* jiffies wrap check */
    if (diff < 0)
        diff = 0xFFFFFFFFUL - lasttick + now;

    /* do something with diff */
   ...
   lasttick = now;

Another thought would be to just use a right-shift-12 (=1M/256 = 2^20/2^8) for roughly converting struct timeval microseconds to 1/256 seconds

Thinking further, my above suggestion of right-shifting microseconds by 12 (a quick shift instead of a slow divide) to get 1/256 second won't work properly - microseconds are 0-999,999 (1M), and 1M is not evenly divisible by 256 like it is by 16. Thus the resulting tick count would be incorrect for the computed range 1,048,576 (=2^20) above 1M (=48,576). Since 1M/256 = 3,906, then 48,576/3,906 = ~12.4, so there would be about 12 incorrect ticks every 256 ticks. No good.

However, 1M is exactly divisible by 64 (=15,625) like it is with 16, so a timer change to 1/64 second would work without error, even though it'd still be slow. A 1/64 second timer is ~15.6ms and wrap would occur at 2^32/2^6 = 26 bits for seconds, which is 64M/86400 = 776 days.
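
As a rough sketch (untested, and passing NULL for the unused timezone as noted above), a 1/64-second wrapper would only change the shift and the divisor:

timeq_t timer_get_time(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);    /* tz is unused, pass NULL */

    /* return 1/64 second ticks, 1,000,000/64 = 15,625 exactly,
     * so no rounding error - but the 32-bit divide is still slow */
    return (tv.tv_sec << 6) | ((unsigned long)tv.tv_usec / 15625U);
}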

A 15ms timer resolution using 1/64 second isn't nearly as nice as 10ms jiffies, and we still can explore the potential option of reprogramming the 8254 PIT to either timeout quicker or start another timer and read it directly. The PIT reprogramming could be turned on only when ktcp is running in debug mode for more accurate timing, on the PC platform.


Mellvik commented Jul 28, 2024

Thank you @ghaerr - very educational indeed, a much appreciated run-through.

To me, simplicity is key and ktcp's requirements are really 'unsophisticated'. Actually, now that the 1/16s metric is back in my code, I suddenly find it more workable - probably because I now understand more about the inner workings of the entire 'package'. That said, I'm still eager to increase resolution for the same reasons as before: Better understanding, easier reading, more leeway in tuning and more.

In fact there are likely many places in the system that would benefit from the availability of a 'reasonable'-resolution, performance-wise cheap timer value. IP and TCP were created to be lax about time, and some early RFCs refer to 0.5 seconds as a reasonable resolution. I had to smile when reading that.

Back to the case: your run-through and examples make me even more convinced that jiffies is the thing. I was thinking of using tcpdev for an ioctl to get it, since tcpdev is already open. That benefit obviously becomes irrelevant when the call is made only once - to return a pointer. So /dev/kmem definitely makes more sense. As to the wrapping, my concern has not been the timespan we can store in a long but what happens when there is a wrap between two subsequent calls to the timer (which I ran into when using microseconds and had only 12 bits for seconds). When the chance of that happening is minimal and the consequences non-catastrophic, my concern seems exaggerated. Also, jiffies being good enough for the kernel, they are good enough for ktcp - without the wraparound check. The simplicity of just accessing jiffies from a far memory location is unbeatable - and 10ms resolution is fine.


Mellvik commented Jul 28, 2024

... we still can explore the potential option of reprogramming the 8254 PIT to either timeout quicker or start another timer and read it directly. The PIT reprogramming could be turned on only when ktcp is running in debug mode for more accurate timing, on the PC platform.

This is a very interesting idea, @ghaerr - let's keep this one warm, even while changing ktcp to use jiffies.


Mellvik commented Jul 29, 2024

@ghaerr,
Let me elaborate on my last comment about making available another timer via the PIT, preferably one with microsecond resolution: The kernel - drivers in particular - contains many delay loops. However undesirable, these do-nothing-but-wait loops are required by physical interfaces, a requirement proportional with age. In most cases the delays occur in reset/initialization code outside what we can call 'normal operation', and as such they may seem insignificant. However, and again proportional with age, things go astray when dealing with physical stuff, whether networks or media, and recovery is required - resets, initialization, etc. - meaning the delay loops are used much more often than may be obvious. Curious fact: The EL3 NIC actually has a timer for this purpose - unused by our driver.

The delay loops in use are typically unscaled; the delay they deliver varies wildly between processors and systems, in most cases slowing down already slow systems while having negligible effect on faster iron. Which is why I introduced the idea of reviving BOGOMIPS a while back - the ability to scale delay loops to the iron.

So there are actually 2 issues here: one is to replace all the delay loops scattered all over the place with a library routine, and two is to enable exact delays instead of 'add-an-order-of-magnitude-to-be-safe-on-slow-iron'.

On the original issue, I'm adding the ioctl to the kmem driver per your example today, and changing ktcp accordingly, at first converting 10ms to 60ms. Then, when everything else is working, move the scaling into the code.
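
For reference, the kernel-side half would just be a small addition to the kmem driver's ioctl handler - something roughly like the sketch below (MEM_GETJIFFADDR follows the user-side example earlier in this thread; the copy-to-user helper used here is an assumption about the existing driver code):

/* sketch: return the kernel DS offset of the jiffies variable to user space */
extern unsigned long jiffies;

static int mem_get_jiffaddr(char *user_ptr)
{
    unsigned int jaddr = (unsigned int)&jiffies;

    memcpy_tofs(user_ptr, &jaddr, sizeof(jaddr));
    return 0;
}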


ghaerr commented Jul 29, 2024

Agreed that delay loops should not be "unscaled", but rather associated with a realtime timer so that delays are exact, at least to the resolution of the timer. I changed all the ELKS kernel delay loops to use a jiffie-based timer loop a while back, and IIRC the only non-scaled loops remaining are in the NIC drivers and the reset_floppy DF driver routine. The NIC drivers weren't changed since I'm unable to test them. The DF routine can be easily changed once the actual real time required for reset is known.
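
Those jiffie-based loops have the general shape sketched below (illustrative only, not copied from the actual kernel source):

extern volatile unsigned long jiffies;  /* incremented by the timer interrupt */

void jiffy_delay(unsigned int ticks)    /* delay for 'ticks' 10ms jiffies */
{
    unsigned long start = jiffies;

    while ((jiffies - start) < ticks)
        ;                               /* tied to real time, not CPU speed */
}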

Which is why I introduced the idea of reviving BOGOMIPS a while back - the ability to scale delay loops to the iron.

IMO, better to use reference to a realtime timer or volatile jiffies value than a scaled "bogomips-derived" timer delay count, as kernel interrupts can occur during a delay count down and use up more real time than is accounted for in bogomips loops.


Mellvik commented Jul 29, 2024

The reason your PIT reference caught my interest was the need for a high-res timer for some of these delay loops. Having access to a realtime timer with, say, 0.1ms resolution would be really valuable, including for performance measurement.


ghaerr commented Jul 30, 2024

I've researched the 8253/8254 PIT and have found a way to get fast precision timing, although only within a certain elapsed time window, due to 32-bit overflow issues and avoiding software divide routines.

Long story short, the kernel PIT has an input clock frequency of 1,193,181.8 Hz, which, divided by 100 to produce 1/100 second timer interrupts (=HZ), gives an initial timer countdown value of 11932. With the PIT you can read the current countdown value directly from the chip, and that means that the time between each counter decrement is 1/1,193,182 sec = 0.8381 usecs = 838.1 nanoseconds. (see elks/arch/i86/kernel/timer-8254.c for more detail).

The original IBM PC master oscillator of 14.31818 MHz was divided by 3 for the CPU and by 4 for the CGA, and combining the two with an AND gate yields the divide-by-12 PIT reference frequency above. For 4.77 MHz 8088s, which have a 4-cycle minimum instruction execution time, this somewhat conveniently gives a (4-cycle) base of 0.8381 usec/instruction as well. (However, I've been reading that the average instruction takes 13.67 cycles).

In order to not overflow 32-bit longs and use fast hardware integer multiplies only, the PIT countdown can be read and multiplied by 84 (100 times 0.8381 rounded) to get "nticks", my new term denoting 10ns intervals (84 nticks would equal 838 nsecs or .838 usecs = 4 8088 CPU cycles). The kernel itself gets 10ms interrupts which we'll call "jticks", which are 1,000,000 nticks. This allows jticks and nticks to be added together to produce a total from which an ntick difference can easily be calculated.

So, the precision timer resolution is ~838 nanoseconds, or 0.838 microseconds (or 84 nticks), and is 11,932 times finer than a jiffy. (This would also seem to be the max number of opcodes that could possibly be executed between hardware clock interrupts on a 4.77MHz 8088.)

The entire calculation can be performed in nticks as the following pseudocode (not tested):

#define MAX_TICK        11932       /* PIT reload value for 10ms (100 HZ) */
#define NANOSECS_TICK   838         /* nsecs per tick */

unsigned int count, lastcount = MAX_TICK;
unsigned int tickdiff;              /* elapsed PIT counts since last reading */
unsigned long jiffies;              /* kernel 10ms timer */
unsigned long lastjiffies;
unsigned long nticks;               /* real time elapsed in nticks (10ns) */

    /* latch timer value for reading */
    outb(0, TIMER_CMDS_PORT);
    /* read timer countdown */
    count = inb(TIMER_DATA_PORT);
    count |= inb(TIMER_DATA_PORT) << 8;
    tickdiff = lastcount - count;
    /* handle wrap */
    if ((int)tickdiff < 0)
        tickdiff = MAX_TICK - count + lastcount;

    nticks = (jiffies - lastjiffies) * 1000000L + (tickdiff * 84L);
    lastcount = count;
    lastjiffies = jiffies;

The code produced is quite small and pretty fast, but uses 4 integer multiply instructions.

Using 32-bit longs, there is the potential for overflow in only 4294 (=2^32/1M) jiffies, which is ~42.94 seconds. This should work well for precision timing, but I'm trying to think of way to turn on and off high precision without changing ktcp timing constants. A bit more thought on that needed.

I plan on playing around more with this and coming up with a working example.


Mellvik commented Jul 30, 2024

That's a great piece of research, @ghaerr - this is obviously right down your alley.
'nticks' it is. I'm really looking forward to the next instalment!

This will open up a previously unavailable tool chest for - like you say - precision timing! Combining it with jiffies is really smart. While ktcp is fine with jiffies in an operational sense, this will enable measurements and tuning along a completely new scale.

I'm curious - how do you get this (2 long multiplications) to become only 4 int mult's?

Also (BTW, thanks for the reference on instruction timings, an educational refresher on a lot more than timings :-) ), how do you get to the 4 clock cycle minimum instruction execution cycle? Is that the fetch+execution (mov reg,reg is 2 cycles IIRC)?


ghaerr commented Jul 30, 2024

how do you get this (2 long multiplications) to become only 4 int mult's?

GCC does a nice job with it. In the case of multiplying a short by a short constant to produce a long result, only a single multiply is needed (the 8086 MUL takes AX times a 16-bit reg/mem operand and produces DX:AX). For unsigned short tickdiff, ticknsecs = tickdiff * 838L compiles to:

   0:	56                   	push   %si
   1:	b8 46 03             	mov    $0x346,%ax
   4:	f7 26 00 00          	mulw   tickdiff
   8:	a3 00 00             	mov    %ax,ticknsecs
   b:	89 16 02 00          	mov    %dx,ticknsecs+2
   f:	5e                   	pop    %si
  10:	c3                   	ret

In the case of a short times a long constant, ticknsecs = tickdiff * 0x54321L compiles to:

   0:	56                   	push   %si
   1:	8b 1e 00 00          	mov    tickdiff,%bx
   5:	b8 05 00             	mov    $0x5,%ax
   8:	f7 e3                	mul    %bx
   a:	91                   	xchg   %ax,%cx
   b:	be 21 43             	mov    $0x4321,%si
   e:	93                   	xchg   %ax,%bx
   f:	f7 e6                	mul    %si
  11:	01 d1                	add    %dx,%cx
  13:	a3 00 00             	mov    %ax,ticknsecs
  16:	89 0e 02 00          	mov    %cx,ticknsecs+2
  1a:	5e                   	pop    %si
  1b:	c3                   	ret

For the above two cases, I kept tickdiff to be unsigned short as it is effectively the 8254 PIT elapsed count difference and saves some instructions. In the sample precision timer code above, we still save one multiply because tickdiff remains unsigned short even though the multiply by 84L isn't stored into a variable.

For the case of a long times a non-constant long, it's more complicated; ticknsecs = jiffies * nsecs compiles to:

00000000 <x>:
   0:	56                   	push   %si
   1:	a1 02 00             	mov    jiffies+2,%ax
   4:	f7 26 00 00          	mulw   nsecs
   8:	93                   	xchg   %ax,%bx
   9:	a1 02 00             	mov    nsecs+2,%ax
   c:	f7 26 00 00          	mulw   jiffies
  10:	91                   	xchg   %ax,%cx
  11:	01 d9                	add    %bx,%cx
  13:	a1 00 00             	mov    jiffies,%ax
  16:	f7 26 00 00          	mulw   nsecs
  1a:	01 d1                	add    %dx,%cx
  1c:	a3 00 00             	mov    %ax,ticknsecs
  1f:	89 0e 02 00          	mov    %cx,ticknsecs+2
  23:	5e                   	pop    %si
  24:	c3                   	ret

With the OpenWatcom compiler, the same thing is done but a helper routine __I4M is called.
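
For reference, the C fragments behind the three listings are simply the following (types inferred from the generated code):

unsigned short tickdiff;
unsigned long  ticknsecs, jiffies, nsecs;

void case1(void) { ticknsecs = tickdiff * 838L;     }  /* short * short constant: 1 MUL  */
void case2(void) { ticknsecs = tickdiff * 0x54321L; }  /* short * long constant:  2 MULs */
void case3(void) { ticknsecs = jiffies * nsecs;     }  /* long * long:            3 MULs */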


ghaerr commented Jul 30, 2024

how do you get to the 4 clock cycle minimum instruction execution cycle?

I read it somewhere. I also read that timing starts at 5 cycles for Base or Index effective address calculation and goes up to 12 cycles. The bare minimum is 4 when using register-to-register but this can sometimes change because there is some variety to instruction prefetching. (Of course all this varies wildly per processor variation). The interesting thing in the referenced link is that 4 more cycles are added for accessing words at odd addresses... much more a performance hit than I had thought. I plan on looking at various produced .o files to ensure that mixed char and int globals aren't stuffed together into possible odd addresses for ints. I know that "common" (.comm) variables aren't allocated that way (BSS only) but not sure about DATA section variables, based on their C declaration.

For our estimates of speed, I found it surprising that the average is 13.67 cycles/instruction. However, I'm inclined to believe it and am building the following "simple outlook" on imagining how time is spent in an 8086/8088 ELKS system:

With the kernel PIT running a 11932 countdown to produce 10ms interrupts, each countdown (stated previously) amounts to 0.838 microseconds, which is 4 cycles.

Thus, rounding to 12,000 countdowns (4 cycles each or one basic instruction) per 10ms clock interrupt would mean 12,000 instructions before a task switch. But with the real instruction cycle average being 12 (rounded down from 13.67), this amounts to 4,000 average instructions executed each time slice. If the average instruction is 12 cycles (for simplicity), this would mean each instruction is 3 * 4 cycles = 3 * 0.838 usecs = ~2.5 usecs/instruction.

Finally, at average ~2.5usecs/instruction, 4000 instructions is 10,000 usecs which equals, you got it - 10ms - one jiffie.

So there you have it. Precision timing samples starting at a very fast 4-cycle instruction, with three samples per average instruction. 4000 average instructions later we have a jiffie increment and possible task switch. This keener precision approach to imagining what is going on at the process time slice and kernel driver level should allow us to better understand what the real costs of interrupt service routines are and also quantify in which method a kernel busy-loop might be coded to allow other code to run or not (same for interrupt disabling).


Mellvik commented Jul 31, 2024

Again @ghaerr, this is both very interesting and very useful.

Thanks for the run-down on the mul instructions, now I understand. So we have 6 muls which is pretty good. This exercise immediately turned on the -O flag in my head and inspired me to take a closer look at the ktcp algorithm that adjusts the rtt for every ACK received - a simple 'slow convergence' algorithm that uses a weighting factor (TCP_RTT_ALPHA) to ease the changes. The algorithm uses both multiplication and division - on longs, which is why it caught my attention, the initial question being whether it can be changed to use shifts instead.

n->cb->rtt = (TCP_RTT_ALPHA * n->cb->rtt + (100 - TCP_RTT_ALPHA) * rtt) / 100;

That question is still open, but I did change the rtt and rto variables from timeq_t (u32) to unsigned- they never exceed 100 anyway, which changed the code dramatically.

Old code (TCP_RTT_ALPHA is currently set to 40, 0x28):

454:   e8 fe ff                call   455 <tcp_retrans_expire+0xb5>
                        455: R_386_PC16 fprintf
 457:   83 c4 08                add    $0x8,%sp
 45a:   8b 5e fa                mov    -0x6(%bp),%bx
 45d:   89 d8                   mov    %bx,%ax
 45f:   8b 4e fe                mov    -0x2(%bp),%cx
 462:   09 c8                   or     %cx,%ax
 464:   74 4b                   je     4b1 <tcp_retrans_expire+0x111>
 466:   8b 7c 12                mov    0x12(%si),%di
 469:   b8 28 00                mov    $0x28,%ax
 46c:   f7 65 16                mulw   0x16(%di)
 46f:   89 46 fe                mov    %ax,-0x2(%bp)
 472:   b8 28 00                mov    $0x28,%ax
 475:   f7 65 14                mulw   0x14(%di)
 478:   89 46 fa                mov    %ax,-0x6(%bp)
 47b:   89 56 fc                mov    %dx,-0x4(%bp)
 47e:   8b 46 fe                mov    -0x2(%bp),%ax
 481:   01 46 fc                add    %ax,-0x4(%bp)
 484:   b8 3c 00                mov    $0x3c,%ax
 487:   f7 e1                   mul    %cx
 489:   91                      xchg   %ax,%cx
 48a:   93                      xchg   %ax,%bx
 48b:   bb 3c 00                mov    $0x3c,%bx
 48e:   f7 e3                   mul    %bx
 490:   01 ca                   add    %cx,%dx
 492:   8b 4e fa                mov    -0x6(%bp),%cx
 495:   01 c1                   add    %ax,%cx
 497:   8b 5e fc                mov    -0x4(%bp),%bx
 49a:   11 d3                   adc    %dx,%bx
 49c:   31 d2                   xor    %dx,%dx
 49e:   52                      push   %dx
 49f:   ba 64 00                mov    $0x64,%dx
 4a2:   52                      push   %dx
 4a3:   53                      push   %bx
 4a4:   51                      push   %cx
 4a5:   e8 fe ff                call   4a6 <tcp_retrans_expire+0x106>
                        4a6: R_386_PC16 __udivsi3
 4a8:   83 c4 08                add    $0x8,%sp
 4ab:   89 45 14                mov    %ax,0x14(%di)
 4ae:   89 55 16                mov    %dx,0x16(%di)
 4b1:   56                      push   %si
 4b2:   e8 fe ff                call   4b3 <tcp_retrans_expire+0x113>
                        4b3: R_386_PC16 rmv_from_retrans
 4b5:   89 c6                   mov    %ax,%si

New code:

401:   e8 fe ff                call   402 <tcp_retrans_expire+0x9d>
                        402: R_386_PC16 fprintf
 404:   83 c4 06                add    $0x6,%sp
 407:   85 ff                   test   %di,%di
 409:   74 1b                   je     426 <tcp_retrans_expire+0xc1>
 40b:   8b 5c 10                mov    0x10(%si),%bx
 40e:   b8 28 00                mov    $0x28,%ax
 411:   f7 67 14                mulw   0x14(%bx)
 414:   91                      xchg   %ax,%cx
 415:   b8 3c 00                mov    $0x3c,%ax
 418:   f7 e7                   mul    %di
 41a:   01 c8                   add    %cx,%ax
 41c:   b9 64 00                mov    $0x64,%cx
 41f:   31 d2                   xor    %dx,%dx
 421:   f7 f1                   div    %cx
 423:   89 47 14                mov    %ax,0x14(%bx)
 426:   56                      push   %si
 427:   e8 fe ff                call   428 <tcp_retrans_expire+0xc3>
                        428: R_386_PC16 rmv_from_retrans
 42a:   89 c6                   mov    %ax,%si

... which seems like a pretty dramatic improvement right there. Actually, this along with the use of jiffies (the complete elimination of calls to idiv?? and several MULs at about 115 cycles each) probably shaves off nearly 2000 cycles per main loop in ktcp.
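
On the still-open question of replacing the multiply/divide smoothing with shifts: the classic fixed-point form would be something like the sketch below, though it assumes a weighting of 7/8 rather than the current TCP_RTT_ALPHA of 40/100, so it changes the convergence rate and not just the code:

/* sketch only: srtt = 7/8*srtt + 1/8*rtt using a shift, no MUL or DIV needed
 * (assumes signed int values in the same time units; >> of a negative value
 * is arithmetic on ia16 gcc) */
int smooth_rtt(int srtt, int rtt)
{
    srtt += (rtt - srtt) >> 3;
    return srtt;
}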

how do you get to the 4 clock cycle minimum instruction execution cycle?

I read it somewhere. I also read that timing starts at 5 cycles for Base or Index effective address calculation and goes up to 12 cycles. The bare minimum is 4 when using register-to-register but this can sometimes change because there is some variety to instruction prefetching. (Of course all this varies wildly per processor variation). The interesting thing in the referenced link is that 4 more cycles are added for accessing words at odd addresses... much more a performance hit than I had thought.

Yes, all this matches up with my impressions (and I agree with you, there are some surprises here, in particular the one about alignment) except for the minimum cycles per instruction - unless the counting of cycles varies. The way I read it is that the fastest instructions - like reg-reg-move - execute in 2 cycles. If you add the fetching, it may be 3 or 4 - or zero if already in the prefetch queue. I'm clipping in a couple of pages from the ultimate source (!) on the issue, Hummel's Technical Reference, an extremely valuable resource these days (got it on eBay when I got involved with ELKS some 7 years ago) - recommended.

[Attachments: IMG_9930, X86-timing.pdf]

For our estimates of speed, I found it surprising that the average is 13.67 cycles/instruction. However, I'm inclined to believe it and am building the following "simple outlook" on imagining how time is spent in an 8086/8088 ELKS system:

Agreed, this value seems reasonable and the method they used to determine it, very convincing - actually pretty smart! Valid for 8086/8088, less so for later architectures. Then again, that's what we're talking about - 4.77MHz and 86/88.

With the kernel PIT running a 11932 countdown to produce 10ms interrupts, each countdown (stated previously) amounts to 0.838 microseconds, which is 4 cycles.

Thus, rounding to 12,000 countdowns (4 cycles each or one basic instruction) per 10ms clock interrupt would mean 12,000 instructions before a task switch. But with the real instruction cycle average being 12 (rounded down from 13.67), this amounts to 4,000 average instructions executed each time slice. If the average instruction is 12 cycles (for simplicity), this would mean each instruction is 3 * 4 cycles = 3 * 0.838 usecs = ~2.5 usecs/instruction.

Finally, at average ~2.5usecs/instruction, 4000 instructions is 10,000 usecs which equals, you got it - 10ms - one jiffie.

So there you have it. Precision timing samples starting at a very fast 4-cycle instruction, with three samples per average instruction. 4000 average instructions later we have a jiffie increment and possible task switch. This keener precision approach to imagining what is going on at the process time slice and kernel driver level should allow us to better understand what the real costs of interrupt service routines are and also quantify in which method a kernel busy-loop might be coded to allow other code to run or not (same for interrupt disabling).

This is indeed enlightening. The 12-cycle average matters more than whether the fastest instruction takes 2 or 4 cycles. 4000 instructions per time slice really doesn't get much done, does it? It may seem that this is part of the challenge with very slow machines: so little gets done per time slice. And suddenly the difference between my XT and my V20/8MHz is big - twice as much gets done per slice. It would be very interesting to find the actual price of a task switch (and, like you mention, the interrupt overhead). No wonder ktcp is slow on the old iron. The challenge now is to get it to work (well) anyway.

Thank you.


ghaerr commented Jul 31, 2024

except for the minimum cycles per instruction - unless the counting of cycles varies. The way I read it is that the fastest instructions - like reg-reg-move - execute in 2 cycles. If you add the fetching, it may be 3 or 4 - or zero if already in the prefetch queue.

I'm not sure, but I think you're going in the right direction when counting instruction fetch. IIRC a word access requires 4 cycles, so perhaps a reg/reg move is completed within the 4 cycles of its instruction fetch. I'd like to hear what you figure out - I'll leave this one for you to dig into more fully.

4000 instructions per time slice really doesn't get much done, does it?

Yes, that's not much, is it? If we can figure out the average execution time per byte of code emitted, that might allow us to quickly estimate how long a function takes, without loops being taken into account. And yes, the point about 8MHz being (obviously) twice as fast hits home when considering the small number of instructions being executed in the first place. It's nice to be getting conceptually closer to the metal in these timing discussions.

It would be very interesting to find the actual price of a task switch

That's easily done: just count instructions in _irqit (not cycles). Use ~2.5usecs as average instruction time. Note that all system calls and hardware interrupts (except CONFIG_FAST_IRQ4 serial I/O) go through a nearly identical _irqit experience. When first writing the fast serial driver 4+ years ago, I learned that the interrupt/syscall/trap overhead is too high for 38.4K serial baud.

So we have 6 muls which is pretty good

Yes, that would seem great - except that after this discussion it might still be too slow, as MUL is very slow on 8088's, right? (Of course, 8088's just can't do much which is why few run them anymore lol; the long divisions required to do the timing printf's will really knock us out!)

I did change the rtt and rto variables from timeq_t (u32) to unsigned- they never exceed 100 anyway, which changed the code dramatically.

Very interesting! I seem to remember a quick look at the timeq_t declarations when you first brought this up, but didn't think about the code required to support it. I would say that if your assumptions are correct about RTT/RTO remaining a small number, they should definitely be changed to unsigned short for ktcp!!

Actually, this along with the use of jiffies (the complete elimination of calls to idiv?? and several MULs at about 115 cycles each) probably shaves off nearly 2000 cycles per main loop in ktcp.

Wow. Getting a handle on ktcp's main loop elapsed time seems important to get fast back-to-back NIC packets sent w/o buffering. 2000 cycles is a lot (wow, MUL is super slow, 118-143 cycles!!), even though there are 4000 * 12 = 48,000 cycles per hardware timer interrupt. So that's the equivalent of 2000/12 = 166 instructions out of 4000 saved in ktcp's main loop per task switch = 4%, huge improvement with no change in functionality!


ghaerr commented Jul 31, 2024

It would be very interesting to find the actual price of a task switch

That's easily done: just count instructions in _irqit (not cycles). Use ~2.5usecs as average instruction time.

This has got me thinking: I'll bet the _irqit overhead is larger than the 5% that was my previous WAG, possibly significantly so once C routines and signal processing are taken into account (but not including actual task switch code). When there's no other task to run, the hardware interrupt handling code for the system timer does nothing but update jiffies.

So perhaps a 10ms timer is too fast for old 8088s; less interrupt overhead may allow 10% more to get done when running a single application. I've added all this to my list to look further into. It would be quite easy to change the HZ frequency, although there is a bit of a mess in checking that the clock/date code still works and doesn't make any assumptions.


ghaerr commented Aug 1, 2024

I spent a while counting instructions and tracing _irqit, and here are the results! Depending on exactly what is going on, lots of instructions can be executed. In particular, I traced IRQ 0 in detail. Tracing was done by counting instructions in routines disassembled using ia16-elf-objdump -D -r -Mi8086 and following possibly called functions.

ELKS overhead by instruction count (approximate)

Routine                                     Instruction count

_irqit (syscall)                            54 
    stack_check                             15+
    syscall                                 5+syscall
    do_signal                               16+
total                                       90+syscall

_irqit (hw interrupt, IRQ 0 for example)    85
    do_IRQ
        timer_tick (irq handler)
            do_timer                        23
                run_timer_list
                    cli/sti                 *
                    if timers
                        del_timer               27
                        sti/cli                 *
                        call timer_func         ??
            calc_cpu_usage                  14
                if count>200                    68 every 200x
            if disk activity and dirconsole
                if count>8
                    pokeb                       7 every 8x
        if count>5
            BIOS IRQ 0  x5                      29 on ibmpc, every 5x
    if was user mode:
        schedule                            93
            cli/sti
            del_from_runqueue ?                 32
            add_timer       ?                   29
            tswitch                         13
            del_timer       ?                   27
        do_signal                           16+
            cli                                 *
            call user mode handler ?
total                                       128+
                                                106+ if sched

Average interrupt overhead 128+ instructions = 320usecs
Average interrupt w/reschedule 234+ instructions = 585usecs
Basic 8088 runs ~12 cycles/instruction, ~2.5 usecs/instruction
Basic 8088 executes 4000 instructions/timer interrupt (@100HZ = 10ms = 10,000 usecs)

Precision 8254 timer is 0.838 usecs per countdown = 4 cycles = 1/3 average instruction
8088 MUL instruction takes ~120 cycles = ~25 usecs, 10 times longer than avg instruction

8088 Kernel Interrupt overhead 3.2%
8088 Kernel Interrupt w/reschedule overhead 5.6%
    When multiple processes running, reschedule is slower
    Reschedule only occurs with timer interrupt from user mode
    Any kernel timers need processing check each timer interrupt
    Various interrupt handlers/reschedule and kernel timers disable interrupts
    All other kernel drivers may take up time (DF, HD, NIC) away from ktcp and app

Add to all of this the thought that the average MUL instruction is, say, 120 cycles (10 times an average instruction time), so my previous comments about a "fast" precision timer with only 5 multiplies mean that a precision time sampling takes ~125+ usecs minimum, which is a full ~150 of the 0.838 usec countdowns of the 8254 precision timer itself!! Basically, a "precision" timer calculation takes something less than half as long as the kernel interrupt overhead on 8088.

I think MULs are heck of a lot faster on 386+ CPUs so perhaps this isn't a big deal except on real 8088's themselves.


Mellvik commented Aug 1, 2024

Wow, that's indeed a lot to devour :-) - and again both interesting and useful, not the least to understand system behaviour and possible bottlenecks.

Some observations:

  • Cycles and instructions: The instruction timings from Intel (I'm using the iAPX86/88 User's Manual (1985) and Hummel's book) - like the ~120 you refer to for MUL - are the numbers used to calculate execution time, even in Intel's timing diagrams. Since fetch and execution are parallel operations, fetch is generally ignored - an approximation, but not by much. The instruction timings themselves vary wildly depending on factors we've already discussed, addressing modes etc. This means that the objdump listings can be used to determine timings exactly if needed.
  • The average instruction time from the paper you mentioned early on, approx. 12 cycles, comes from a collection of applications and is probably too high for an OS like TLVC/ELKS. Not that it's much different, but my guess is that 10 cycles is closer to reality.

... the thought that the average MUL instruction is, say, 120 cycles (10 times an average instruction time), so my previous comments about a "fast" precision timer with only 5 multiplies mean that a precision time sampling takes ~125+ usecs minimum, which is a full ~150 of the 0.838 usec countdowns of the 8254 precision timer itself!! Basically, a "precision" timer calculation takes something less than half as long as the kernel interrupt overhead on 8088.

That's indeed a good point. Maybe we need to rethink this one completely in order to make it viable on slow iron.

I think MULs are heck of a lot faster on 386+ CPUs so perhaps this isn't a big deal except on real 8088's themselves.

That's also true. mul reg,reg is as you said almost 120 on 86/88, while 21 on the 286, 12-25 on the 386. Then again, 86/88 is the target and the challenge...

When first writing the fast serial driver 4+ years ago, I learned that the interrupt/syscall/trap overhead is too high for 38.4K serial baud.

If this is on the 86/88, it's actually pretty good. And since ELKS and TLVC have their own drivers, we're able to take advantage of the UART FIFOs when available, which at least theoretically should make it possible to exceed this limit, right? Process several chars per interrupt.

On a general note, I think we should stick with the 10ms clock even on the slow iron, proclaiming it 'good enough' and continuing along the line this thread started, optimizing where it really matters. Actually, off the top of my head, 3.2% interrupt overhead isn't all that bad, and neither is 5.6% for interrupt & resched. I would be surprised if a PDP11/20 with Unix v6 was much better. Of course they had more I/O offloading, but then again they were supporting many real and concurrent users too ...

Apropos timing, the ktcp update is getting there, the magic key turned out to be (yes, timing) an improved RTT. And quite a few other things, but a 'truer' RTT turned out to be the most important one. In hindsight completely unsurprising. More about that when the PR comes.

BTW - I'm inclined to head into _irqit and actually calculate the cycles from objdump. It would be interesting to compare to the estimate to get a feel for how close we really are.


ghaerr commented Aug 1, 2024

The instruction timings themselves vary wildly depending on factors

Yes, good point, lots of factors. I was reading another timing reference that showed a register word INC to take only 2 cycles, not including instruction fetch. So I'm not sure which table(s) to use for instruction cycles, and when one needs to apply instruction fetch timing (after every jump)? Perhaps best to actually try to measure it using "precision" timing (more on that below).

approx. 12 cycles, comes from a collection of applications and is probably too high for an OS like TLVC/ELKS.

I tend to agree, I was thinking about that as I calculated the irqit timing and those instructions did not seem like ~12 average. However, one must consider that an "average" instruction timing would have to include MUL, so that likely pulls the average a lot higher. Perhaps we should research and calculate an ELKS/TLVC average instruction that does not include high-time instructions, and watch for and flag the high-time instructions separately (including the C code that generates them).

Maybe we need to rethink this one completely in order to make it viable on slow iron.

Indeed. The 5 MULs were the result of creating a virtual "ntick" and converting PIT countdowns and jiffies to the same base units so they could be added. A better approach might be just to store the 4-byte jiffies and the 2-byte countdown as a "measurement", and delay multiplications, or possibly just display jiffies and countdown separately. Also, for small measurements, the jiffies diff might be 0 and just countdown (wrap handling still needed). So some fast code could perhaps calculate jiffie and countdown diff values and just store them. Of course, once it comes time to actually see the result, the printf/printk routine will be so slow it'll wreck all runtime timing.
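
A store-now-convert-later sample might look roughly like this (names invented here; the port constants are the ones from the pseudocode earlier, and the expensive multiplies are deferred to display time):

struct ptime_sample {
    unsigned long jiffies;              /* kernel 10ms tick count at sample time */
    unsigned int  count;                /* raw 8254 countdown at sample time */
};

extern unsigned long jiffies;           /* kernel tick counter */

void ptime_read(struct ptime_sample *s)
{
    s->jiffies = jiffies;
    outb(0, TIMER_CMDS_PORT);           /* latch counter 0 for reading */
    s->count = inb(TIMER_DATA_PORT);
    s->count |= inb(TIMER_DATA_PORT) << 8;
    /* conversion to nticks (and the MULs) happens later, when printed */
}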

Which brings me to my next idea - kernel or application profiling. I am thinking that a super tool would be to store the CS:IP of executing code every timer tick into a far memory area of 64K bytes. Then, process or kernel symbol tables could be consulted to display the function being executed at each clock tick and a profile generated. The same approach could possibly be used for precision timing, where the jiffies/PITcount would just be stored somewhere and analyzed later.
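
A minimal sketch of the sampling half, assuming the timer tick handler can hand us the interrupted CS:IP (all names here are hypothetical):

#define PROF_SLOTS  16384                   /* 64K byte arena / 4 bytes per sample */

static unsigned long __far *prof_buf;       /* far memory sample buffer, allocated elsewhere */
static unsigned int prof_idx;

void prof_sample(unsigned int cs, unsigned int ip) /* called once per timer tick */
{
    if (prof_buf) {
        prof_buf[prof_idx] = ((unsigned long)cs << 16) | ip;
        prof_idx = (prof_idx + 1) & (PROF_SLOTS - 1);   /* wrap in the 64K buffer */
    }
}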

I'm inclined to head into _irqit and actually calculate the cycles from objdump. It would be interesting to compare to the estimate to get a feel for how close we really are.

That'd be great - go for it. I'm interested in a better "average" instruction time for sequences that don't include very slow instructions. For reference, I divided irqit into two streams, syscall and hardware interrupt, and traced the function calls, but didn't try to trace the probable execution path, instead just counting all instructions. So the most-used code path is likely smaller than my research suggested.


Mellvik commented Aug 1, 2024

The instruction timings themselves vary wildly depending on factors

Yes, good point, lots of factors. I was reading another timing reference that showed a register word INC to take only 2 cycles, not including instruction fetch. So I'm not sure which table(s) to use for instruction cycles, and when one needs to apply instruction fetch timing (after every jump)? Perhaps best to actually try to measure it using "precision" timing (more on that below).

Actually, all the sources I have show the same timings - the Intel manual, the Hummel book, the one you referenced which I incidentally looked at this morning. So we do have the exact timings, even for every possible addressing mode etc. Still, like Hummel points out in the pdf I included yesterday, there are more variables: a DMA going on, a fetch colliding with an execution unit memory reference, an instruction exception, etc. So, what Hummel concludes is this (also in the pdf, I think):

"Due to all the factors that cannot be predicted with certainty, the average program will take 5-10% [corrected] longer to execute than would be indicated by the timings." Sounds good to me.

approx. 12 cycles, comes from a collection of applications and is probably too high for an OS like TLVC/ELKS.

I tend to agree, I was thinking about that as I calculated the irqit timing and those instructions did not seem like ~12 average. However, one must consider that an "average" instruction timing would have to include MUL, so that likely pulls the average a lot higher. Perhaps we should research and calculate an ELKS/TLVC average instruction that does not include high-time instructions, and watch for and flag the high-time instructions separately (including the C code that generates them).

It would indeed be interesting to see a top 25 list of used instructions in the kernel, which wouldn't be all that hard to do with objdump and an AWK script ...

Maybe we need to rethink this one completely in order to make it viable on slow iron.

Indeed. The 5 MULs were the result of creating a virtual "ntick" and converting PIT countdowns and jiffies to the same base units so they could be added. A better approach might be just to store the 4-byte jiffies and the 2-byte countdown as a "measurement", and delay multiplications, or possibly just display jiffies and countdown separately. Also, for small measurements, the jiffies diff might be 0 and just countdown (wrap handling still needed). So some fast code could perhaps calculate jiffie and countdown diff values and just store them. Of course, once it comes time to actually see the result, the printf/printk routine will be so slow it'll wreck all runtime timing.

Which brings me to my next idea - kernel or application profiling. I am thinking that a super tool would be to store the CS:IP of executing code every timer tick into a far memory area of 64K bytes. Then, process or kernel symbol tables could be consulted to display the function being executed at each clock tick and a profile generated. The same approach could possibly be used for precision timing, where the jiffies/PITcount would just be stored somewhere and analyzed later.

I think this is a great approach, much akin to what they did in the instruction timing document!

I'm inclined to head into _irqit and actually calculate the cycles from objdump. It would be interesting to compare to the estimate to get a feel for how close we really are.

That'd be great - go for it. I'm interested in a better "average" instruction time for sequences that don't include very slow instructions. For reference, I divided irqit into two streams, syscall and hardware interrupt, and traced the function calls, but didn't try to trace the probable execution path, instead just counting all instructions. So the most-used code path is likely smaller than my research suggested.

OK; sounds like a printout, a pencil, a book and a sunny day on the terrace! It's on.


ghaerr commented Aug 1, 2024

The average program will take 510% longer to execute than would be indicated by the timings.

Wow, kind of unbelievable!!! But then again, perhaps ~13.67 cycles/instruction does factor correctly when 510% is considered.

I found copy of Hummel's book on Amazon and ordered it last night. Looking forward to reading it, thanks.


ghaerr commented Aug 4, 2024

@Mellvik, I've created a precision timer library in ghaerr/elks#1953 based on the discussions here. Let me know what you think of it; it should work pretty well for measuring elapsed time in the kernel, but I haven't tested it on real hardware. If you can find time to test it out, that would be great.

Coming after this will be getting the routines working outside the kernel, e.g. in ktcp. I would like to know how useful this is before going through that effort.


Mellvik commented Aug 5, 2024

Very exciting @ghaerr - thank you.
That's some creative ways to get around the limitations imposed by reality (well, technology).
Looking forward to test it.

-M


Mellvik commented Aug 7, 2024

@ghaerr,
not much screen time recently and I admit this has been a very fast one, so I may have misunderstood how to use the precision timer routines.
For convenience I'm timing a simple outb loop in init/main.c during boot. The code looks like this:

#ifdef CONFIG_PREC_TIMER
void INITPROC testloop(unsigned timer)
{
        unsigned pticks, x = timer;

        get_time_10ms();
        while (x--) outb(x, 0x80);
        pticks = get_time_10ms();
        printk("hptimer %u: %k\n", timer, pticks);
}       
#endif 

Weirdly gcc puts the outb instruction outside the loop, at the end of the routine and jumps to it from the loop, then back, but that's a different issue.

I've calculated the expected timing of the loop itself to ~775ns on the 40MHz 386. Which means that running 10,000 loops may exceed the 10ms limit, but I tried anyway and went down from there. Here's what I got; the first number is the loop count.

hptimer 10000: 5723ms
hptimer 3000: 4731ms
hptimer 500: 790234us

I haven't looked at the code so I'm not attempting any analysis at this point. What am I missing?


ghaerr commented Aug 7, 2024

Hello @Mellvik,

Thanks for the testing. And, you will find, it is very easy to get confused with timing and timing results, I sure have been! It looks though that your code is correct. I have the following comments to try to understand what's going on, as the output timing does look way off, possibly due to the potential variations described below. Of course, the PIT timer should be operating at the same input frequency on your 40Mhz 386 as an 8088, right?

  • Where exactly is this in init/main.c? I suggest moving the test to after all initialization is complete, and just before the idle task runs, which would be as follows:
    /*
     * We are now the idle task. We won't run unless no other process can run.
     */
    timer_test();  // display test printf %k results
    testloop(10000);
    testloop(3000);
    testloop(500);
  • Remove INITPROC in the declaration of testloop. This will remove the long-winded thunking code when calling get_time_10ms (a near proc) from a far proc. This shouldn't change the loop timing too much, but does add time to the overall loop and also makes the disassembly much easier to read to see what's going on.

  • We need to ensure that the %k formatting is correctly ported, as it's quite tricky internally. For that, I suggest making the following change to the printk in testloop:

    printk("hptimer %u: %k (%u)\n", timer, pticks, pticks);

What this does is display pticks as an unsigned decimal. In this way, you can multiply the pticks directly by 0.838 to get microseconds. If the pticks value is >= 11932, then the measurement is invalid for get_time_10ms because the PIT has counted down more than once past its 10ms (=11932 pticks * 0.838 = 10,000 usecs) limit. The multiply result should always be correct. More on this below on the results of timer_test().

[EDIT: something definitely strange is going on here, as get_time_10ms should never return a value > 11931, as it just compares two PIT values and compares the difference. The only way it could be greater is the PIT start countdown value is > 11932, but that should not be the case as irq_init() initializes the PIT countdown to 11932 for a 10ms interrupt.]

  • Remember that the timed results may be wildly different, as the time tests are done with interrupts on, and a hardware timer interrupt may occur, which will add significant delay. If you want to test with interrupts disabled (this isn't quite supported yet for various technical problems), you can disable interrupts beforehand and enable after only if get_time_10ms is called. If get_time_50ms or get_time are called, the set_irq must be removed in those routines also, which otherwise protect a critical section for reading the tick/jiffie counts.

I added your test code, with the printk mod above and ran it on QEMU, and got the following results, different each time because of the timer interrupt delay: (serial console on):

hptimer 10000: 998.058us (1191)
hptimer 3000: 253.076us (302)
hptimer 500: 48.604us (58)

hptimer 10000: 662.858us (791)
hptimer 3000: 195.254us (233)
hptimer 500: 31.844us (38)

hptimer 10000: 1.139ms (1360)
hptimer 3000: 183.522us (219)
hptimer 500: 31.844us (38)

With the above numbers, the parenthesized number are the pticks, and multiplying them by 0.838usecs does yield the displayed time value. I don't know the extent as to QEMU's emulation of the 8254 PIT nor its relation to real-time so it might be interesting to see your QEMU results from the above. I'm testing on a 3.1Ghz MacBook Pro.

Finally, here is the output from timer_test, which should be exactly the same on yours. It displays various ptick values and the %k result:

1 = .838us
2 = 1.676us
3 = 2.514us
100 = 83.800us
500 = 419.000us
1000 = 838.000us
1192 = 998.896us
1193 = 999.734us
1194 = 1.000ms
10000 = 8.380ms
59659 = 49.994ms
59660 = 49.995ms
59661 = 49.995ms
100000 = 83.800ms
5*59660 = 249.975ms
359953 = 301.640ms
19*59660 = 949.906ms
20*59660 = 999.901ms
21*59660 = 1.049s
60*59660 = 2.999s
84*59660 = 4.199s
600*59660 = 30.06s
3000000 = 2.514s
30000000 = 25.20s
35995300 = 30.23s
36000000 = 30.24s
51000000 = 42.84s
51130563 = 42.94s
51130564 = 0us

@Mellvik commented Aug 8, 2024

... the PIT timer should be operating at the same input frequency on your 40Mhz 386 as an 8088, right?

Right, otherwise the ISA bus wouldn't work (and BTW, the clock would be off...)

  • Where exactly is this in init/main.c?

I put it at the end of kernel_init() which is just before the idle task init, so we're fine on that one.

  • We need to ensure that the %k formatting is correctly ported, as its quite tricky internally.

printk has been (re)ported verbatim. Turns out the decimal point was missing from the first runs. That's confusing as hell.

The printk test output matches your list exactly. That's a good sign ...

[EDIT: something definitely strange is going on here, as get_time_10ms should never return a value > 11931, as it just compares two PIT values and compares the difference. The only way it could be greater is the PIT start countdown value is > 11932, but that should not be the case as irq_init() initializes the PIT countdown to 11932 for a 10ms interrupt.]

Makes sense - I'm getting 'sane' values back now.

  • Remember that the timed results may be wildly different, as the time tests are done with interrupts on, and a hardware timer interrupt may occur, which will add significant delay.

Actually, the numbers I'm getting have minimal variations - which of course may be because they're still wrong :-)

I added your test code, with the printk mod above and ran it on QEMU, and got the following results, different each time because of the timer interrupt delay: (serial console on):

hptimer 10000: 998.058us (1191)
hptimer 3000: 253.076us (302)
hptimer 500: 48.604us (58)

hptimer 10000: 662.858us (791)
hptimer 3000: 195.254us (233)
hptimer 500: 31.844us (38)

hptimer 10000: 1.139ms (1360)
hptimer 3000: 183.522us (219)
hptimer 500: 31.844us (38)

With the above numbers, the parenthesized number are the pticks, and multiplying them by 0.838usecs does yield the displayed time value. I don't know the extent as to QEMU's emulation of the 8254 PIT nor its relation to real-time so it might be interesting to see your QEMU results from the above. I'm testing on a 3.1Ghz MacBook Pro.

My M1 MB Pro says:

hptimer 10000: 409.782us (489)
hptimer 3000: 118.158us (141)
hptimer 500: 18.436us (22)

hptimer 10000: 387.156us (462)
hptimer 3000: 113.968us (136)
hptimer 500: 18.436us (22)

... sensible numbers that, given the variance in your case, are surprisingly stable - and they indicate the machine is about twice your speed when doing Intel emulation.

The new 'real' numbers:

(40MHZ 386SX)

hptimer 10000: 6.900ms (8234)
hptimer 3000: 5.064ms (6044)
hptimer 500: 848.894us (1013)

hptimer 10000: 6.899ms (8233)
hptimer 3000: 5.064ms (6043)
hptimer 500: 848.056us (1012)

12.5MHz 286:

hptimer 10000: 2.506ms (2991)
hptimer 3000: 9.767ms (11656)
hptimer 500: 1.707ms (2037)

The 286 numbers for 3000 and 500 loops make some sense given the loop in question (the 10000 exceeds 10ms anyway so it's invalid).

In order to get better control over expectations, I added an assembly loop (conveniently lifted from the el3 driver, temporarily 'hidden' at the end of irqtab.S):

test_udelay:

        push    %bp
        mov     %sp,%bp
        mov     4(%bp),%cx      // count
dly_loop:
        outb    $0x80
        loop    dly_loop

        pop     %bp
        ret

The loop itself has only two instructions consuming

out: 11,12,3
loop: 13,17,10

cycles - for 386, 8088, 286 respectively (interesting how fast the out instruction is on the 286, I should look it up to make sure it isn't a typo).
So loop time is 600ns[386], 6.1us[8088], 1.04us[286], leaving expectations at something like

386: 600ns x 3000 + 10% =  2ms
286: 1us x 3000 + 10% = 3.3ms
88: 6.1 x 3000 + 10% = 20.1ms

What we get is this - first QEMU:

hptimer 10000: 31.006us (37)
hptimer 3000: 4.190us (5)
hptimer 500: 4.190us (5)

[386]

hptimer 10000: 118.996us (142)
hptimer 3000: 121.510us (145)
hptimer 500: 120.672us (144)

[286]

hptimer 10000: 167.600us (200)
hptimer 3000: 168.438us (201)
hptimer 500: 169.276us (202)

Bottom line is that except for QEMU, the numbers don't make much sense. What say you @ghaerr?

@ghaerr commented Aug 8, 2024

Bottom line is that except for QEMU, the numbers don't make much sense. What say you

Wow, I say this is very frustrating!

After reading your post and all the data in it, I would say it seems that reading and/or interpreting the 8254 PIT countdown register on real hardware isn't working with my code, and that QEMU possibly doesn't do a good job of emulating the PIT real time.

I checked the 80286 Intel reference manual for OUT and LOOP and it says 3 cycles for OUT and either 8 (jump taken) or 4 for LOOP, so there's some possible discrepancy with the instruction timing estimates (instead of your 10 for LOOP). But these are quite minor considering the radical difference in timings reported by QEMU, which runs faster and slower on different hosts, and your last two real hardware tests of test_udelay, where there's almost no difference between 500, 3000 and 10000 loops in terms of PIT timer countdowns returned (!!!).

So let's take the latter - the 286 and 386 results running test_udelay: all ~200 for the 286 and ~144 for the 386. It would seem to me that this is saying that reading the PIT timer somehow returns a constant difference between two readings - very strange. Except that when running your original testloop beforehand, the results do vary. It's also extremely strange that QEMU doesn't vary between 500 and 3000 loops using test_udelay. I must admit I can't figure that out.

Can you post your main.c, printk.c and prectimer.c files used for the test_udelay? I would like to see if I'm missing something as I can't see the actual code you're executing.

Other thoughts: the timer loop may somehow be overflowing; perhaps change to a different instruction (like XCHG or XOR instead of OUT) to keep the system bus out of the equation. Another thought would be to just display the PIT count value in an infinite loop, to check that it's decrementing. It should be decrementing every 0.838 usecs on all systems using a real 8254:

    outb(0, TIMER_CMDS_PORT);       /* latch timer value */
    lo = inb(TIMER_DATA_PORT);
    hi = inb(TIMER_DATA_PORT) << 8;
    count = lo | hi;

I checked the ptick <-> us/ms conversion and that's accurate. So we're getting a bad ptick difference between reading the PIT countdown register, if that's even working on real hardware. Another thought would be to write the above PIT read code in ASM, save it, run a known loop and read it again to see what the different values are. The PIT timer is pretty basic - you send a zero to the commands port to "latch" the current value and then the next two reads from the data port read the low and high bytes of the countdown. When the count transitions from 1 to 0, an IRQ 0 is fired and 11932 is reloaded into the countdown register by the hardware.

Do both your 286 and 386 systems use a real 8254 PIT? I'm kind of grasping at straws here trying to come up with something that may be amiss...

Thanks for your testing. I'm out of town or I'd test this on my real hardware, which I can do next week.

@Mellvik commented Aug 11, 2024

I'm suggesting you boot the enclosed TLVC image in your QEMU version without the new options. It should deliver erratic results. If not, there is still a rabbit at large.

I'm not quite sure what this is testing, but here are the results without the HVF option:

This simply verifies that both systems/kernels are delivering the same results in your QEMU version, eliminating the chance that there are other elements involved. Thanks for testing. The results are very conclusive.

I do, however, suggest you upgrade to a more recent version

I'm looking at doing that using homebrew but have found various statements from people finding QEMU not working at all after an update. Combined with not knowing what version I'll be getting and Catalina not supported anymore by homebrew, I may just leave well enough alone for the time being 🤷‍♂️.

It seems a new qemu update would work fine and will give you (at least) version 7.1. I just started such an update on my Mini, which is running an older macOS than yours, and it seems to run fine (the update, that is).

 elks git:(master)$ brew upgrade qemu
Warning: You are using macOS 10.14.
We (and Apple) do not provide support for this old version.
You will encounter build failures with some formulae.
Please create pull requests instead of asking for help on Homebrew's GitHub,
Twitter or any other official channels. You are responsible for resolving
any issues you experience while you are running this
old version.

==> Upgrading 1 outdated package:
qemu 7.0.0 -> 7.1.0

It's repeatedly complaining about Xcode and its command tools needing to be updated, but it's not a requirement (I'll update the command tools manually later).

The process takes hours (zillions of packages that need upgrades/updates), and it's not done yet. I'll let you know how it ends. The good thing is, brew doesn't break anything along the way. If the upgrade fails, the old should be fine (my experience).

How do you suggest we add a microsecond exact delay loop using this code?

Getting a true precision delay loop to microsecond accuracy will be hard, given what we've learned with the various huge speed differences between processors, the need to disable interrupts in a driver for accuracy (bad!), and while executing the setup code required for the actual timing.

The idea would be to calculate the number of 8254 PIT countdowns (= 0.838us pdiffs) needed by first dividing the microseconds wanted by 0.838. Then reading the current countdown, taking into account wraps, until the count difference >= pdiffs. Dividing by 0.838 will have to be done as multiplying the microseconds by 1000 first, then dividing by 838. This should be able to be done with a single 16x16=32 multiply and 32/16 divide, but I'm not sure how to take into account the execution of the MUL/DIV instruction timings into the equation, unless the timeout is a constant and can be pre-calculated. I suppose the cycles counts of each the setup instructions could be calculated for 8086/286/386+ and then subtracted from the required microseconds requested based on the running cputype.

There is obviously a limit to the real/practical accuracy such a function can deliver, the key goal would simply be 'as close as possible' given the constraints - one being simplicity. This would still be a hell of a lot better than the delay loops we have today, which - between the XT and the 40MHz 386 - vary by orders of magnitude. Also it would have to be a goal to avoid both MULs and DIVs, and I have an idea about how to do that. I'll take a stab and see if I can get it to work.
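For reference, here's a minimal sketch of the countdown-based approach described above (hypothetical code, not an actual TLVC/ELKS routine; read_pit(), pit_udelay() and PIT_RELOAD are made-up names, and it assumes the kernel's outb(value, port)/inb(port) helpers plus the 11932 reload programmed by irq_init):

    /* Sketch only: busy-wait roughly 'usecs' microseconds (well below 10000)
     * by watching the 8254 channel 0 countdown, which reloads every 10ms.
     */
    #define PIT_RELOAD 11932U               /* pticks per 10ms period */

    static unsigned int read_pit(void)
    {
        unsigned int lo, hi;

        outb(0, TIMER_CMDS_PORT);           /* latch current channel 0 count */
        lo = inb(TIMER_DATA_PORT);
        hi = inb(TIMER_DATA_PORT) << 8;
        return lo | hi;                     /* counts down 11931 -> 0 */
    }

    void pit_udelay(unsigned int usecs)
    {
        /* pticks = usecs / 0.838 = usecs * 1000 / 838: one MUL, one DIV */
        unsigned int pticks = (unsigned int)((unsigned long)usecs * 1000UL / 838UL);
        unsigned int start = read_pit();
        unsigned int elapsed = 0;

        while (elapsed < pticks) {
            unsigned int now = read_pit();
            /* wrap-adjust across (at most) one reload */
            elapsed = (start >= now) ? start - now : start + PIT_RELOAD - now;
        }
    }

The setup MUL/DIV themselves take time on an 8088, of course, which is part of what makes the very short end of the range hard to honor.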

@ghaerr commented Aug 11, 2024

Above that (and actually quite a bit below), milliseconds wouldn't make sense because of all the other stuff going on on the system

I finally realized that we can have 0.838us accuracy, all the way from 0.838us to 42s. That is, the elapsed timer reads the wrap-adjusted PIT countdown (0-11931), then adds the jiffies difference times 11932, since jiffies is always incremented on counter reload. I got confused initially thinking we needed a very accurate, low-overhead sub-10ms timer and created three routines, which turns out is not needed when the new single routine returns an unsigned long, and sub-10ms accuracy is tricky when interrupts are enabled and on fast machines.
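In code, the scheme is roughly the following (a sketch only - the actual get_ptime in the PR differs in detail, and the struct/field names here are made up):

    /* Sketch of the idea above, not the PR's code: absolute time in pticks
     * is jiffies * 11932 + (11932 - countdown), since jiffies increments on
     * every countdown reload.
     */
    struct psample {
        unsigned long jiffies;      /* snapshot of the kernel jiffies counter */
        unsigned int countdown;     /* latched 8254 channel 0 count, 0..11931 */
    };

    unsigned long elapsed_pticks(struct psample *start, struct psample *end)
    {
        long count_diff = (long)start->countdown - (long)end->countdown;

        /* the MUL only matters when the jiffies actually differ (>10ms) */
        return (end->jiffies - start->jiffies) * 11932UL + count_diff;
    }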

I'm assuming that would be in addition to the get_time_50ms

No, we just need one routine, and I now have it working. The get_time_50ms name is now wrong, and get_time isn't a great name either because of confusion with other C library clock time routines like gettimeofday. I was thinking time_elapsed, or maybe get_time_elapsed, might be a better name?

A related question is what is the lowest/shortest meaningful measurement

I'm thinking of taking 3-4 adjacent elapsed time samples (with interrupts disabled) and then calculating an average, which could then be used as the average time the time_elapsed function itself needs to run. This would then be the shortest meaningful measurement, and could be subtracted from each elapsed time result in order to get more precision. The C library version will need an init routine (time_elapsed_init?, to get the kernel jiffies pointer), and this could calculate the average and return it.

I have a working time_elapsed function working in the kernel now, and am thinking of submitting a PR when I get it all working and tested for the C library so we don't have to go through (yet) another revision of the routines...

@Mellvik commented Aug 12, 2024

How do you suggest we add a microsecond exact delay loop using this code?

Getting a true precision delay loop to microsecond accuracy will be hard, given what we've learned with the various huge speed differences between processors, the need to disable interrupts in a driver for accuracy (bad!), and while executing the setup code required for the actual timing.

Agreed, true precision at microsecond level is impossible given the factors you mention and the fact that an XT/8088 cannot complete even the simplest loop in 1us: Clock cycle 210ns, loop instruction requiring 17 cycles. So we're talking 10us resolution at best and semi-precision - which will have to do. Being orders of magnitude better than what we've got, it's OK.

@Mellvik commented Aug 12, 2024

I finally realized that we can have 0.838us accuracy, all the way from 0.838us to 42s. That is, the elapsed timer reads the wrap-adjusted PIT countdown (0-11931), then adds the jiffies difference times 11932, since jiffies is always incremented on counter reload. I got confused initially thinking we needed a very accurate, low-overhead sub-10ms timer and created three routines, which turns out is not needed when the new single routine returns an unsigned long, and sub-10ms accuracy is tricky when interrupts are enabled and on fast machines.

I believe your initial thinking was better than you give yourself credit for. There is a good reason for staying away from longs when possible - they are extremely expensive on pre-286 machines. Which is why I'm concerned that a single version - long only - is not going to work well/be all that useful on older systems when measuring short intervals.

Also there is the concern about complexity in the time_elapsed routines, which - since I haven't seen them - may of course be entirely superfluous. For example, while latching the timer as early as possible when getting the time elapsed is good, it works the opposite way when creating the starting point using the same routine.

A related question is what is the lowest/shortest meaningful measurement

I'm thinking of using taking 3-4 adjacent elapsed time samples (with interrupts disabled) and then calculating an average, which could then be used as the average time the time_elapsed function needs to run. This would then be the shortest meaningful measurement, and could be subtracted from each elapsed time result in order to get more precision. The C library version will need an init routine (time_elapsed_init?, to get the kernel jiffies pointer), this could calculate the average and return it.

Yes, I think this is an important step - incidentally my calibrated delay loop does the same, although no multiple calls to get an average yet. The calibration happens at boot time, but it may still be good (even necessary) to make multiple calls to even out 'noise'. I get differences between two runs just calling get_time_50ms twice (on the 286) exceeding 100us.

As to naming, I agree that the current naming is not ideal. Given that using a timer always requires two steps, set and get, I suggest two different calls, even if they end up calling the exact same routine. Like 'set_ptimer()' and 'get_ptimer()' - 'p' meaning/indicating 'precision'. And finally 'init_ptimer()'. If set_ptimer() and get_ptimer() were indeed different routines, latching the timer at the optimal point would be possible for both cases.

@ghaerr commented Aug 13, 2024

@Mellvik: I've been traveling again, but managed to push PR ghaerr/elks#1962 showing the working refactored code for combining the previous 3 precision timer routines into a single routine: get_ptime. I think this gives us the best of all the variations we're looking for (more discussion on that below).

Which is why I'm concerned that a single version - long only - is not going to work well/be all that useful on older systems when measuring short intervals.

Good point. In get_ptime, for measurements < 10ms (that is, less than a clock interrupt), all arithmetic is done in 16-bit, and no longs are manipulated or stored. If you know the measurement will be small, then you can just cast the result and get a very fast measurement: unsigned int ptick = (unsigned int)get_ptime().

If you're concerned about elapsed times between 10ms and 50ms, there's always going to be a clock interrupt (at least in real execution) within the measurement, and the kernel IRQ 0 handling has been shown to vary elapsed times by quite a bit, so there wouldn't be a super-accurate measurement being made anyway. In the new routine, there is only a single MUL, which is executed only when the elapsed time is > 10ms (up to 42s).

So, the same routine can be used for sub-10ms or 0-42s measurements with no extra overhead for < 10ms. The only thing to watch for is that %lk now needs to be used when displaying an unsigned long ptick (vs plain %k before).
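To make the usage concrete, something along these lines (a sketch; do_fast_work/do_slow_work are placeholders, not real code):

    /* Usage sketch for get_ptime() as described above. */
    unsigned int fast;          /* sub-10ms result fits in 16 bits */
    unsigned long slow;         /* anything up to ~42s needs the full long */

    get_ptime();                        /* set the start reference */
    do_fast_work();
    fast = (unsigned int)get_ptime();   /* no long arithmetic below 10ms */
    printk("fast: %u pticks (%k)\n", fast, fast);

    get_ptime();
    do_slow_work();
    slow = get_ptime();
    printk("slow: %lk\n", slow);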

For example, while latching the timer as early as possible when getting the time elapsed is good, it works the opposite way when creating the starting point using the same routine.

In general, moving the latch code to be first or last complicates matters and would necessitate having two almost-identical routines. That would also have the disadvantage of not being usable to compute a more reliable average execution time, and would double the amount of kernel code, just for a few cycles' difference.

If measuring elapsed time really really matters (something I have yet to see as really important given the accuracy of our varied results so far as well as the fact that ELKS/TLVC are not real time systems), I suggest another way: writing the small portion of the 8254 counter-reading routine as a GCC static inline routine, which would then be inserted directly into the code being measured. This would remove all procedure call overhead and read the PIT directly from the measured function. More on that later should we really want/need it.
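Something along these lines is what I have in mind (a sketch - the inline body is just the latch/read sequence shown earlier, and the name is illustrative):

    /* Static-inline sketch: the latch/read expands directly into the
     * measured function, so there is no call overhead at all.
     */
    static inline unsigned int pit_countdown(void)
    {
        unsigned int lo, hi;

        outb(0, TIMER_CMDS_PORT);       /* latch channel 0 */
        lo = inb(TIMER_DATA_PORT);
        hi = inb(TIMER_DATA_PORT) << 8;
        return lo | hi;
    }

The measured code would then sample pit_countdown() before and after the section of interest and take the difference of the two readings itself.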

I suggest two different calls, even if they end up calling the exact same routine.

I opted against that idea, as it adds another function call to every measurement and there's no real need for a separately named start and stop function, especially if the elapsed time function is used in a loop, where this idea wouldn't work.

@Mellvik commented Aug 13, 2024

@ghaerr,

what an educational and interesting journey it has been since I asked this question about 3 weeks ago - in #69 (comment):

As to the time a get_packet call inside a driver takes, jiffies don't have the resolution to measure that. @ghaerr, what's the smart way to get that ...

... including - of all things - a short tutorial on system startup in #71 (comment) :-)

And don't worry about timing (pun intended). This side project is not holding up anything and I'm having very limited screen time these days too.

It's easy - and we've been there several times - to lose perspective on reality when digging around at low level like this. As you point out, there are severe limits to the usable accuracy we can get given the way the system operates - clock, interrupts, scheduling etc. - and not least the capabilities of the systems at hand. An XT has a cycle time of 210ns, which means that even the tightest loop - a nop and a loop instruction - will take (3+17)*210ns = 4.2us (measured @ 1000 loops: 5.6us). Also, a minimum call/return without even looping is 26us (measured: 30us). Which makes the idea of a microsecond semi-exact delay loop 'unviable', to say the least. Even a u10delay() is a stretch if the delay count is low, but I may be going for that anyway, thinking that the XT cannot dictate the choice. Different discussion for a different place though.

That said, my work on the delay idea has turned into a great testing ground for the precision timer tools - for several reasons, including the variations between runs and time consumed by the tools themselves, which must be recorded and corrected for when establishing the metric for the delay loop. A great baseline when testing the 'old' 50ms routine against the new get_ptime(). I have recorded comparable numbers for call duration and various loop times for all 4 system types - XT/4.77, XT-V20/8MHz, 286/12.5MHz, 386SX/40MHz. It's indeed very educational. And I'm getting a 'somewhat' accurate u10delay function across the platforms, which for obvious reasons becomes more accurate the longer the delay. A boot time calibration generates a speed-index for the inner loop of the delay routine.
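Roughly, the calibration amounts to something like this (a much-simplified sketch of the idea, not my actual code; CAL_LOOPS, delay_inner_loop() and the scaling are illustrative):

    /* Sketch: time a known number of inner-loop iterations with get_ptime()
     * and derive a speed index = iterations per 10us, without floats.
     */
    #define CAL_LOOPS 1024U             /* power of two keeps the math cheap */

    static unsigned int speed_index;    /* inner-loop iterations per 10us */

    void u10delay_calibrate(void)
    {
        unsigned int pticks;

        get_ptime();                    /* start reference */
        delay_inner_loop(CAL_LOOPS);    /* the loop being calibrated */
        pticks = (unsigned int)get_ptime();

        /* 10us is ~11.93 pticks, so iterations per 10us =
         * CAL_LOOPS * 11.93 / pticks, scaled by 100 to stay in integers */
        speed_index = (unsigned int)((unsigned long)CAL_LOOPS * 1193UL
                                     / ((unsigned long)pticks * 100UL));
    }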

If measuring elapsed time really really matters (something I have yet to see as really important given the accuracy of our varied results so far as well as the fact that ELKS/TLVC are not real time systems), I suggest another way: writing the small portion of the 8254 counter-reading routine as a GCC static inline routine, which would then be inserted directly into the code being measured. This would remove all procedure call overhead and read the PIT directly from the measured function. More on that later should we really want/need it.

This is a good idea - somewhat similar to what I was thinking when getting frustrated with the variations in the delay function, but discarded for lack of insight into how to do it properly.

Back to where I started, we're getting close to having reliable and reasonably accurate timing for the likes of get_packet(). That's something!

@Mellvik commented Aug 13, 2024

@ghaerr, an update on updating QEMU:

It was ultimately unsuccessful - and it didn't destroy anything except the symlinks in /usr/local/bin.

Your system is newer, it may succeed and it's relatively risk free. I can send you a tar file with the 7.0.0 tree which should work if you run brew update to get the rest of the system as up to date as possible.

@ghaerr commented Aug 14, 2024

@Mellvik, I've added user mode precision timing in ghaerr/elks#1963, and it all seems to work. Initial testing was done by timing ktcp's main loop. It will be very interesting to see what this looks like on real NIC hardware!

@Mellvik commented Aug 16, 2024

Thank you @ghaerr - looking forward to it. I haven't pulled across the final get_timer yet, so that comes first.

I kinda went to town on the scaled delay idea, encountering so many interesting things along the way, and lost track of time (pun intended) in the process. With 4 different processors and clock speeds to test and compare on, there is indeed a lot to be learned when diving all the way down to the clock cycle level (not to mention how to do the math without floats and with the least possible # of MUL/DIVs in the calibration, plus the effect on timings when using INITPROC vs local for the calibration).

Anyway, here's the one that hit me hardest in that it took quite some time to figure out. Again it's QEMU, and my version is 7.2.0. The following C statement works - temp is unsigned long, d0 is unsigned:

        temp = d0*838UL/10000UL;

while this one does not - that is, it fails on QEMU but works fine on real HW, all types:

        temp = d0*838L/10000L;

No comments at all from gcc, and here's the thing: The difference in the object file:

➜  TLVC git:(misc) ✗ diff /tmp/fails.out /tmp/works.out     
124c124
< 			d3: R_386_PC16	__divsi3
---
> 			d3: R_386_PC16	__udivsi3
➜  TLVC git:(misc) ✗ 

Clearly the problem is in QEMU, and just as clearly, this is good to know for a 'heavy' QEMU user like yourself (the symptom is a stack screwup). It might be of interest to 'our' compiler guru too...

@Mellvik commented Aug 18, 2024

@ghaerr - there is apparently more to the 'div problem' described above than initially met the eye - and I'm coming back to that later. What led to the problem in the first place is equally interesting:
There are important experiences to be made when using get_ptime on QEMU. Unsurprisingly - as we've discovered before - they don't go together well. In fact there may be cases in which they don't go together at all except in the context of loops - getting averages over many runs. The same piece of code will have vastly different execution times in different contexts - apparently related to caching.

In short, the caching effects in QEMU (possibly in the x86 processors being emulated) are nothing short of amazing. We can time the same piece of code twice and get results that differ by a factor of 4. Not every once in a while, but consistently. Example: I have placed this piece of code at 3 different places in a routine, the latter 2 being adjacent:

        get_ptime();
        u10delay(0);
        d0 = (unsigned) get_ptime();
        printk("d0=%u (%k)\n", d0, d0);

The values I get are 46, 14 and 9 pticks.
When running the following code:

        //-----
        get_ptime();
        u10delay(0);
        d0 = (unsigned) get_ptime();
        printk("d0=%u (%k)\n", d0, d0);
        //-----
        i = 0;
        for (d0 = 0; d0 < 3; d0++) {
                get_ptime();
                i += (unsigned)get_ptime();
        }
        //printk("l=%u (%k)\n", i, i);
        adj = i/3;
        printk("adj %u -- ", adj);

first without, then with the part delineated by //-------, the value of adj is 10 and 2 pticks respectively.
Attempting to pre-cache the calls didn't have much effect. Testing on the Intel Mac w QEMU 7.0 gave similar results, sometimes with even higher difference between runs - order of magnitude (using -accel hvf).

These seemingly random differences in measured times were the root cause of the div problem reported above, causing div by zero, which apparently is not tested for in the long div routines, whether signed or unsigned.

My conclusion from this experience is that precision time measurement on QEMU can never be reliable - and that's OK. QEMU is not the environment in which such measurements are meaningful anyway. The same goes for the delay stuff I've been working on. I just have to catch the presence of QEMU (from the speed) to avoid the weird values screwing up the calculations. The actual delays don't matter on QEMU.

@ghaerr commented Aug 18, 2024

I'm in agreement with you about QEMU: the differences between execution times with and without HVF or KVM acceleration for its internal JITter, along with what you've seen, show that "elapsed" time measurements using the 8254 timer are unreliable at best - I would say +/- ~10 pticks (sometimes way more). But I think the real hardware measurements are pretty accurate (more on this below), right?

With regards to trying to accurately measure delays on real hardware, agreed, an average over several loops should be made. Interrupts have to be disabled around get_ptime or the kernel timer interrupt code might be counted on one of the loops, wildly throwing off the results. I would also recommend using a power of two (4 or 8) for the loop count so that a simple right shift can be used instead of a possibly much slower DIV.
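For example (a sketch, assuming the kernel's clr_irq()/set_irq() pair is used around each sample):

    /* Average 8 back-to-back get_ptime() overhead samples with interrupts
     * off around each one; the divide is just a shift.
     */
    unsigned int i, sum = 0, overhead;

    for (i = 0; i < 8; i++) {
        clr_irq();                          /* keep the timer IRQ out of the sample */
        get_ptime();                        /* start reference */
        sum += (unsigned int)get_ptime();   /* pticks consumed by the call itself */
        set_irq();
    }
    overhead = sum >> 3;                    /* divide by 8 with a shift, no DIV */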

were the root cause of the div problem reported above, causing div by zero, which apparently is not tested for in the long div routines, whether signed or unsigned.

Are you talking about the i/3 or something else? That should be a 16-bit divide. For 32-bit divides signed divides are slower than unsigned, and Line 11 says nothing is done about divide-by-zero. Was the divide by zero occurring because of integer overflow on the averaging routine? Not sure how this could happen otherwise.

@Mellvik commented Aug 18, 2024

Are you talking about the i/3 or something else? That should be a 16-bit divide. For 32-bit divides signed divides are slower than unsigned, and Line 11 says nothing is done about divide-by-zero. Was the divide by zero occurring because of integer overflow on the averaging routine? Not sure how this could happen otherwise

It's as simple as

unsigned long i, j, k;
i = 80000UL;  // any value really
j = 0;
k = i/j;

Which either causes a hang or sends the kernel into some interesting places.

The code above has no relation to this problem - I tested the div_by_zero separately after discovering the huge timing discrepancies which cause 'guaranteed' positive values to become negative - and divisors to become 0.

I'm not sure what to make of this. On the one hand, the compiler library (at the gcc level) should handle divide by zero; on the other, the programmer (which would be me in this case) should make damned sure it cannot happen.

@ghaerr commented Aug 18, 2024

It's as simple as
Which either cause a hang or sends the the kernel into some interesting places.

I see what you mean now.

I'm looking at divmodsi3.s function __udivsi3 (which is called from the above code). It essentially pulls the arguments off the stack and prepares to divide BX:AX by CX:DI in ldivmod.s function ludivmod. Go ahead and look at it. On lines 72-75, CX:DI is tested for zero, and if so it jumps to divzero, where the input sign bit (stored in DX) is tested (it will be 0, or NS, not signed), and it jumps to the return.

So the GCC library routine (if it's the same as this one - they were once in the kernel but are now in a separate GCC library) does check for dividing by zero (and returns the input as output).

You might test this by adding ldivmod.s and divmodsi3.s back into the kernel build, to get to the very bottom of this.

@ghaerr commented Aug 18, 2024

to get to the very bottom of this.

Guess what!? Turns out we're no longer running the soft-divide I referenced above. Further digging shows that the updated ia16-elf-gcc uses a different routine which uses actual DIV instructions. So you've just identified another kernel problem - apparently the 8086 hardware divide-by-zero interrupt vector isn't initialized by the BIOS (and our kernel doesn't set a trap vector), and bad stuff is happening!!!

There is _irqit special handling for hardware traps like this, but the kernel doesn't write the divide by zero vector on startup, so yep - on your system the system is crashing on hardware divide by zero.

I'll look into a fix.

@ghaerr commented Aug 20, 2024

@Mellvik: I added ghaerr/elks#1969 to catch user and kernel mode DIV/IDIV divide by zero and overflow hardware fault handling. It is able to precisely determine the CS:IP of the offending instruction, although that probably isn't necessary, as fault recovery isn't very useful.

@Mellvik commented Aug 20, 2024

That's great news, @ghaerr - another 'cavity' filled, another step toward adulthood for both systems. I'll surely panic a lot tomorrow!
Thanks!

@ghaerr commented Sep 9, 2024

On the business of displaying precision timer values (or any other long or short decimal value) in the kernel: I've been deep into comparing software vs hardware DIV routines as a result of the rabbit hunt in #76 as well as some DIV/MUL speed issues we're looking into now that someone has ported Doom to ELKS.

When GCC sees a 32-bit divide or modulo, it calls a signed or unsigned long divide routine (__divsi3 or __udivsi3), and when a modulo is needed, the corresponding __modsi3 or __umodsi3, to get the job done. This is for both the kernel and user mode. As you know, these routines used to be soft with no DIV/MULs but have been replaced by versions executing one or two DIVs and MULs each, depending on the input. And the kernel and libc source have the old versions lying around but not used, adding some confusion.

In many cases, including our own printk and libc printf, a number-to-ascii routine will be written taking the modulo base 10, then a division base 10, in order to generate the ascii digits in a number:

    unsigned long val = something;
    do {
        int c = val % 10;   
        val = val / 10;
        output(c + '0');   // in real life more complicated because this generates in reverse order
    } while (val != 0);

As it turns out, real HW DIV instructions generate both the division result as well as the remainder in one instruction. But GCC generates two separate calls for the above code, one to __umodsi3, then another to __udivsi3. What would be nice would be a single __divmod routine that calculates both in a single call, thus saving lots of CPU time. What would be even nicer would be a __divmod10 routine that is optimized for dividing/modulo 10. I'm working on that, but GCC won't generate calls to them directly, although I have found that when compiling with -Ofast instead of -Os, it can produce a long divide by 10 not using any DIVs or MULs at all. Research continues.

For cases where we want maximum speed, e.g. precision timing output, we might want something like this, with __divmod10 optimized in ASM:

    unsigned long val = something;
    do {
        int rem;
        val = __divmod10(val, &rem);  // return division and store remainder in rem
        output(rem + '0'); 
    } while (val != 0);

So, looking at the kernel printk - and guess what?? The way it's written (it looks like it dates all the way back to the 1990's), it operates in the worst possible case and is guaranteed slow: to avoid reversing the decimal output as pointed out above, it uses an algorithm that starts by dividing by 1,000,000,000 and then does div 10, mod 10 and so on in a for loop TEN TIMES, calling both __udivsi3 and __umodsi3 every time!!!
This same approach is also used to display 16-bit hex numbers, using a variable for base = 16 vs base = 10, thus disallowing any optimizations and also running worst possible case. It is taking as long to display the number 0 or 0x1234 as a number with value 4GB and 10 digits.

So, the kernel is taking the absolute worst time to display a calculated number in any base. No wonder printk calls change the timing of the kernel - they are taking ridiculously long every time and we never even knew it!!!

I'm working on a rewrite for this as well as the user mode printf, although the latter is not nearly as bad. However, %k ptick output is subject to this massive slowdown in printk as well.

I thought you'd get a kick out of this.

@Mellvik commented Sep 9, 2024

Incredible. Yes you're right, this is up my alley (I'm still tweaking the delay calibration, unbelievable). This is research plus creativity. Looking forward to the next episode! Thanks.

@ghaerr commented Sep 10, 2024

I have written a very short, custom ASM __divmod10 which runs as fast as possible, and executes one DIV if it can, otherwise two, to compute both the 32-bit quotient and 16-bit remainder of a 32-bit number divided by 10 in the same routine. The next step is testing it and then rewriting printk followed by printf, ultostr, ltostr and ltoa so they're as fast as possible.
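For the record, the quotient/remainder math looks roughly like this in C (an illustration only - the real __divmod10 is hand-written ASM where each step is a single DX:AX DIV and the first DIV is skipped when the high word is already below 10; in plain C the 32-bit divides below would still call the gcc helpers):

    /* C illustration of the two-step divide-by-10 of a 32-bit value. */
    unsigned long divmod10_sketch(unsigned long val, unsigned int *rem)
    {
        unsigned int hi = (unsigned int)(val >> 16);
        unsigned int lo = (unsigned int)val;
        unsigned int qhi = hi / 10;                     /* first DIV */
        unsigned long t = ((unsigned long)(hi % 10) << 16) | lo;
        unsigned int qlo = (unsigned int)(t / 10);      /* second DIV: rem:lo / 10 */

        *rem = (unsigned int)(t % 10);                  /* final remainder */
        return ((unsigned long)qhi << 16) | qlo;
    }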

This will speed up printk immensely, which is probably very good for older systems, although we may lose the delays previously used to affect timing-specific system behavior - we'll see just how slow it was and whether that mattered.

In particular, I have figured out how to combine the %k processing (which required comma output for thousands grouping) with the mainline numeric decimal conversion routine. This should reduce code size and also give us thousands grouping for other applications - something I've wanted for a while.
