Thoughts on changing ktcp timer resolution #71
Very interesting @ghaerr, and very timely indeed. Great discussion - and I think we're converging on a solution. Unsurprisingly I've been struggling with this over the past few days as I've been experimenting with simple ways of implementing (and measuring) slow start & congestion avoidance. There are plenty of surprises in this mess, some actually pleasant in the sense that smart things have been done, smart choices made - such as the original 1/16s now obviously being there for a good reason. Also, there is a lot to learn about how it all fits together.

Re. timer: I started with microseconds, found it messy in the sense that lots of zeroes didn't contribute to anything, 10ms being the actual resolution anyway. Also, I got the obvious wraparound problem with 12 bit seconds messing me up. Then I moved to 1ms, which improved things but didn't eliminate the wraparound issue for seconds. And of course the convergence algorithm for RTT needed adjustment as the time resolution changed. As your message arrived I was back to 1/16s, postponing the timer resolution issue until the algorithms are under control. That said, having increased resolution was helpful in providing a clearer picture of actual round trip times and delays, a desirable change.

Your research and analysis was very educational - and interestingly lands very close to where I was sitting before we started this discussion: not being even close to the task of creating a new system call, I played with the idea of creating an ioctl instead. Now we're here, and I strongly support the idea of getting at jiffies directly.
Interesting @Mellvik, I figured you might be in the middle of a timer rewrite, glad to hear your comments!
Well, yes, when truncating the 52-bit gettimeofday result to 32 bits - but we have the same wrapping problem with jiffies, just in a different way: the 32-bit jiffies stores sub-second hw timer ticks (=10ms = 100/second) effectively in 6.5 (let's call it 7) bits (=128). So the upper 25 bits are pretty much seconds and the lower 7 bits are the count of 10ms ticks. This means that jiffies will wrap every 2^25 = 32M/86,400 = 388 days. If you're worried about 1/256 second wrapping every 194 days, I'd be just as worried about 388 days. Nonetheless, getting jiffies directly for timing purposes saves lots of time, and is definitely more accurate, considering that
Actually, I think this is a better idea than writing a new system call. Since ktcp already does some special ioctls, another ioctl would be very simple. And then code the ioctl to return the address of the kernel jiffies variable itself (a pointer), rather than the value. Then a far pointer could be used to read jiffies directly - how's that for speed? The ioctl only needs to return the kernel address (DS offset) of jiffies, since MEM_GETDS is already implemented to get the kernel DS segment value. Note that the kernel itself doesn't concern itself with jiffies wrapping - most kernel wait code adds a value to jiffies without checking for wraparound. However, if one really wanted to be correct, something like the following could be used:
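Something along these lines (a sketch; jiffies_ptr here stands for the far pointer obtained via the proposed ioctl plus MEM_GETDS):

```c
/* Sketch only - jiffies_ptr is assumed to hold the far address of the
 * kernel jiffies variable, obtained via the ioctl described above. */
static unsigned long __far *jiffies_ptr;

/* wrap-safe "has 'timeout' passed?" test: the signed difference stays
 * correct across a 32-bit wrap as long as the interval is less than
 * half the counter range */
#define jiffies_after(now, timeout)  ((long)((now) - (timeout)) >= 0)

int timed_out(unsigned long timeout)
{
    return jiffies_after(*jiffies_ptr, timeout);
}
```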
Thinking further, my above suggestion of dividing microseconds by 256 (quickly using a right shift instead of a slow divide) to get 1/256 second won't work properly - microseconds are 0-999,999 (1M) and not evenly divisible by 256, like 1M is by 16. Thus the result tick count would be incorrect for the computed range 1,048,576 (=2^20) above 1M (=48,576). Since 1M/256 = 3,906 and 48,576/3,906 = 12.4, there would be about 12 incorrect ticks every 256 ticks. No good. However, 1M is exactly divisible by 64 (=15,625) like it is with 16, so a timer change to 1/64 second would work without error, even though it'd still be slow. A 1/64 second timer is ~15.6ms and wrap would occur at 2^32/2^6 = 26 bits for seconds, which is 64M/86400 = 776 days. A 15ms timer resolution using 1/64 second isn't nearly as nice as 10ms jiffies, and we can still explore the potential option of reprogramming the 8254 PIT to either time out quicker or start another timer and read it directly. The PIT reprogramming could be turned on only when ktcp is running in debug mode for more accurate timing, on the PC platform.
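For illustration, the 1/64 second conversion might look something like this (a sketch, not actual ktcp code; the timeq_t typedef is assumed):

```c
#include <sys/time.h>

/* Sketch: pack a timeval into a 32-bit count of 1/64 second ticks.
 * 1,000,000 / 64 = 15,625 exactly, so unlike 1/256 there is no error;
 * seconds go in the upper 26 bits, the 1/64ths in the lower 6. */
typedef unsigned long timeq_t;      /* as in ktcp (assumption) */

timeq_t timeval_to_64ths(struct timeval *tv)
{
    return ((timeq_t)tv->tv_sec << 6) | (timeq_t)(tv->tv_usec / 15625);
}
```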
Thank you @ghaerr - very educational indeed, a much appreciated run-through. To me, simplicity is key and ktcp's requirements are really 'unsophisticated'. Actually, now that the 1/16s metric is back in my code, I suddenly find it more workable - probably because I now understand more about the inner workings of the entire 'package'. That said, I'm still eager to increase resolution for the same reasons as before: better understanding, easier reading, more leeway in tuning and more. In fact there are likely many places in the system that would benefit from the availability of a 'reasonable' resolution and performance-wise cheap timer value. IP and TCP were created to be lax about time and some early RFCs refer to 0.5 seconds as a reasonable resolution. I had to smile when reading that. Back to the case, your run-through and examples make me even more convinced that
This is a very interesting idea, @ghaerr - let's keep this one warm, even while changing
@ghaerr, The delay loops in use are typically unscaled; the delays they deliver vary wildly between processors and systems, in most cases slowing down already slow systems while having negligible effect on faster iron. Which is why I introduced the idea of reviving BOGOMIPS a while back - the ability to scale delay loops to the iron. So there are actually 2 issues here: one is to replace all occurrences of delay loops scattered all over the place with a library routine, and two: enabling exact delays instead of 'add-an-order-of-magnitude-to-be-safe-on-slow-iron'. On the original issue, I'm adding the ioctl to the
Agreed that delay loops should not be "unscaled", but rather associated with a realtime timer so that delays are exact, at least to the resolution of the timer. I changed all the ELKS kernel delay loops to use a jiffie-based timer loop a while back, and IIRC the only non-scaled loops remaining are in the NIC drivers and the reset_floppy DF driver routine. The NIC drivers weren't changed since I'm unable to test them. The DF routine can be easily changed once the actual real time required for reset is known.
IMO, it's better to use a reference to a realtime timer or the volatile jiffies value than a scaled "bogomips-derived" delay count, as kernel interrupts can occur during a delay countdown and use up more real time than is accounted for in bogomips loops.
The reason your PIT reference caught my interest was the need for a high res timer for some of these delay loops. Having access to a realtime timer with, say, .1ms resolution would be really valuable, including for performance measurement.
I've researched the 8253/8254 PIT and have found a way to get fast precision timing, although only within a certain elapsed time window, due to 32-bit overflow issues and avoiding software divide routines.

Long story short, the kernel PIT has an input clock frequency of 1,193,181.8 Hz, which, divided by 100 to produce 1/100 second timer interrupts (=HZ), uses an initial timer countdown value of 11932. With the PIT you can read the current countdown value directly from the chip, and that means that the time between each counter decrement is 1/1,193,182 sec = 0.8381 usecs = 838.1 nanoseconds (see elks/arch/i86/kernel/timer-8254.c for more detail). The original IBM PC master oscillator of 14.31818 MHz was divided by 3 for the CPU and by 4 for the CGA, and ANDing those two gives the divide-by-12 frequency above for the PIT reference. For 4.77 MHz 8088s, which have a 4-cycle minimum instruction execution, this somewhat conveniently gives a (4-cycle) base 0.8381 usec/instruction timing as well. (However, I've been reading that the average instruction takes 13.67 cycles.)

In order to not overflow 32-bit longs and use fast hardware integer multiplies only, the PIT countdown can be read and multiplied by 84 (100 times 0.8381, rounded) to get "nticks", my new term denoting 10ns intervals (84 nticks would equal 838 nsecs or 0.838 usecs = 4 8088 CPU cycles). The kernel itself gets 10ms interrupts, which we'll call "jticks"; each is 1,000,000 nticks. This allows jticks and nticks to be added together to produce a total from which an ntick difference can easily be calculated.

So, the precision timer resolution is ~838 nanoseconds, or 0.838 microseconds (or 84 nticks), and is 11,932 times smaller than a jiffy. (11,932 would also seem to be the max number of opcodes that could possibly be executed between hardware clock interrupts on a 4.77 MHz 8088.) The entire calculation can be performed in nticks as in the following pseudocode (not tested):
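Roughly along these lines (an untested sketch; read_pit_count() is a stand-in for latching and reading the 8254 countdown):

```c
#define PIT_RELOAD        11932U      /* countdown reload value for 10ms (HZ=100) */
#define NTICKS_PER_COUNT  84UL        /* one countdown = 0.838us = 84 x 10ns "nticks" */
#define NTICKS_PER_JTICK  1000000UL   /* one 10ms jtick (jiffy) = 1,000,000 nticks */

/* read_pit_count() is assumed to latch and read the 8254 channel 0 countdown */
extern unsigned int read_pit_count(void);
extern volatile unsigned long jiffies;

/* total nticks: whole jiffies plus the part of the current jiffy already
 * counted down by the PIT (ignoring the reload race for simplicity) */
unsigned long nticks_now(void)
{
    unsigned long j = jiffies;
    unsigned int count = read_pit_count();      /* counts down from 11932 */

    return j * NTICKS_PER_JTICK
         + (PIT_RELOAD - count) * NTICKS_PER_COUNT;
}

/* an elapsed ntick difference is then just a subtraction of two totals */
#define nticks_elapsed(start, end)  ((end) - (start))
```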
The code produced is quite small and pretty fast, but uses 4 integer multiply instructions. Using 32-bit longs, there is the potential for overflow in only 4294 (=2^32/1M) jiffies, which is ~42.94 seconds. This should work well for precision timing, but I'm trying to think of a way to turn high precision on and off without changing ktcp timing constants. A bit more thought on that needed. I plan on playing around more with this and coming up with a working example.
That's a great piece of research, @ghaerr - this is obviously right down your alley. This will open up a previously unavailable tool chest for - like you say - precision timing! Combining it with jiffies is really smart. While I'm curious - how do you get this (2 long multiplications) to become only 4 int mults? Also (BTW, thanks for the reference on instruction timings, an educational refresher on a lot more than timings :-) ), how do you get to the 4 clock cycle minimum instruction execution? Is that the fetch+execution (
GCC does a nice job with it. In the case of multiplying a short by a short constant to produce a long result, only a single multiply is needed (as the 8086 takes AX times a 16-bit reg/mem and produces DX:AX). For short tickdiff
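something along these lines will do (a sketch with assumed names; the 84 is the ntick conversion factor from above):

```c
/* short x short constant -> long result: GCC can emit a single MUL,
 * since the 8086 MUL takes AX times a 16-bit reg/mem and leaves the
 * 32-bit product in DX:AX. */
unsigned long scale_tickdiff(unsigned int tickdiff)
{
    return (unsigned long)tickdiff * 84u;
}
```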
In the case of a short times a long constant,
For the above two cases, I kept tickdiff to be unsigned short as it is effectively the 8254 PIT elapsed count difference and saves some instructions. In the sample precision timer code above, we still save one multiply because tickdiff remains unsigned short even though the multiply by 84L isn't stored into a variable. For the case of a long times a non-constant long, it's a more complicated
With the OpenWatcom compiler, the same thing is done but a helper routine
I read it somewhere. I also read that timing starts at 5 cycles for Base or Index effective address calculation and goes up to 12 cycles. The bare minimum is 4 when using register-to-register, but this can sometimes change because there is some variety to instruction prefetching. (Of course all this varies wildly per processor variation.) The interesting thing in the referenced link is that 4 more cycles are added for accessing words at odd addresses... much more of a performance hit than I had thought. I plan on looking at various produced .o files to ensure that mixed char and int globals aren't stuffed together into possible odd addresses for ints. I know that "common" (.comm) variables aren't allocated that way (BSS only) but not sure about DATA section variables, based on their C declaration.

For our estimates of speed, I found it surprising that the average is 13.67 cycles/instruction. However, I'm inclined to believe it and am building the following "simple outlook" on imagining how time is spent in an 8086/8088 ELKS system: With the kernel PIT running a 11932 countdown to produce 10ms interrupts, each countdown (stated previously) amounts to 0.838 microseconds, which is 4 cycles. Thus, rounding to 12,000 countdowns (4 cycles each, or one basic instruction) per 10ms clock interrupt would mean 12,000 instructions before a task switch. But with the real instruction cycle average being 12 (rounded down from 13.67), this amounts to 4,000 average instructions executed each time slice. If the average instruction is 12 cycles (for simplicity), this would mean each instruction is 3 * 4 cycles = 3 * 0.838 usecs = ~2.5 usecs/instruction. Finally, at an average of ~2.5 usecs/instruction, 4000 instructions is 10,000 usecs which equals, you got it - 10ms - one jiffie.

So there you have it. Precision timing samples starting at a very fast 4-cycle instruction, with three samples per average instruction. 4000 average instructions later we have a jiffie increment and possible task switch. This keener precision approach to imagining what is going on at the process time slice and kernel driver level should allow us to better understand what the real costs of interrupt service routines are, and also quantify in which manner a kernel busy-loop might be coded to allow other code to run or not (same for interrupt disabling).
Again @ghaerr, this is both very interesting and very useful. Thanks for the run-down on the
That question is still open, but I did change the RTT calculation.

Old code (TCP_RTT_ALPHA is currently set to 40, 0x28):
New code:
... which seems like a pretty dramatic improvement right there. Actually, this along with the use of
Yes, all this matches up with my impressions (and I agree with you, there are some surprises here, in particular the one about alignment) except for the minimum cycles per instruction - unless the counting of cycles varies. The way I read it is that the fastest instructions - like reg-reg-move - execute in 2 cycles. If you add the fetching, it may be 3 or 4 - or zero if already in the prefetch queue. I'm clipping in a couple of pages from the ultimate source (!) on the issue, Hummel's Technical Reference, an extremely valuable resource these days (got it on eBay when I got involved with ELKS some 7 years ago) - recommended.
Agreed, this value seems reasonable and the method they used to determine it, very convincing - actually pretty smart! Valid for 8086/8088, less so for later architectures. Then again, that's what we're talking about - 4.77MHz and 86/88.
This is indeed enlightening. The 12 cycles average being more important than whether the fastest instruction takes 2 or 4 cycles. 4000 instructions per time slice really doesn't get much done, does it? It may seem that this is part of the challenge with very slow machines: so little gets done per time slice. And suddenly the difference between my XT and my V20/8MHz is big - twice as much gets done per slice. It would be very interesting to find the actual price of a task switch (and, like you mention, the interrupt overhead). No wonder. Thank you.
I'm not sure, but I think you're going in the right direction when counting instruction fetch. IIRC a word access requires 4 cycles, so perhaps a reg/reg move is done within the 4-cycle memory fetch. I'd like to hear more about what you find - I'll leave this for you to figure out more fully.
Yes, that's not much, is it? If we can figure out the average execution time per byte of code emitted, that might allow us to quickly estimate how long a function takes, without loops being taken into account. And yes, the point about 8MHz being (obviously) twice as fast hits home when considering the smaller number of instructions being executed in the first place. It's nice to be getting conceptually closer to the metal in these timing discussions.
That's easily done: just count instructions in
Yes, that would seem great - except that after this discussion it might still be too slow, as MUL is very slow on 8088's, right? (Of course, 8088's just can't do much which is why few run them anymore lol; the long divisions required to do the timing printf's will really knock us out!)
Very interesting! I seem to remember a quick look at the timeq_t declarations when you first brought this up, but didn't think about the code required to support it. I would say that if your assumptions are correct about RTT/RTO remaining a small number that they should definitely be changed to unsigned short for ktcp!!
Wow. Getting a handle on ktcp's main loop elapsed time seems important to get fast back-to-back NIC packets sent w/o buffering. 2000 cycles is a lot (wow, MUL is super slow, 118-143 cycles!!), even though there are 4000 * 12 = 48,000 cycles per hardware timer interrupt. So that's the equivalent of 2000/12 = 166 instructions out of 4000 saved in ktcp's main loop per task switch = 4%, a huge improvement with no change in functionality!
This has got me thinking: I'll bet the irqit overhead is larger than 5%, which was my previous WAG - possibly significantly larger if C routines and signal processing are taken into account (but not including actual task switch code). The hardware interrupt handling code for the system timer only updates jiffies if there's no other task to run. So perhaps a 10ms timer is too fast for old 8088s; less interrupt overhead might allow 10% more to get done when running a single application. I've added all this to my list to look further into. It would be quite easy to change the HZ frequency, although there is a bit of a mess checking that the clock/date code still works and doesn't make any assumptions.
I spent a while counting instructions and tracing
Add to all of this the thought that the average MUL instruction is, say, 120 cycles (10 times an average instruction time), so my previous comments about a "fast" precision timer with only 5 multiplies means that a precision time sampling takes ~125+ usecs minimum, which is a full ~150 of the 0.838 usec countdowns of the 8254 precision timer itself!! Basically, a "precision" timer calculation takes something less than half as long as the kernel interrupt overhead on 8088. I think MULs are a heck of a lot faster on 386+ CPUs, so perhaps this isn't a big deal except on real 8088's themselves.
Wow, that's indeed a lot to devour :-) - and again both interesting and useful, not least for understanding system behaviour and possible bottlenecks. Some observations:
That's indeed a good point. Maybe we need to rethink this one completely in order to make it viable on slow iron.
That's also true.
If this is on the 86/88, it's actually pretty good. And since ELKS and TLVC have their own drivers, we're able to take advantage of the UART FIFOs when available, which at least theoretically should make it possible to exceed this limit, right? Process several chars per interrupt.

On a general note, I think we should stick with the 10ms clock even on the slow iron, proclaiming it 'good enough', and continue along the line that this thread started: optimizing where it really matters. Actually, off the top of my head, 3.2% interrupt overhead isn't all that bad, and neither is 5.6% for interrupt & resched. I would be surprised if a PDP11/20 with Unix v6 was much better. Of course they had more IO offloading, but then again they were supporting many real - and concurrent - users too ...

Apropos timing, the

BTW - I'm inclined to head into
Yes, good point, lots of factors. I was reading another timing reference that showed a register word INC to take only 2 cycles, not including instruction fetch. So I'm not sure which table(s) to use for instruction cycles, and when one needs to apply instruction fetch timing (after every jump?). Perhaps it's best to actually try to measure it using "precision" timing (more on that below).
I tend to agree; I was thinking about that as I calculated the irqit timing, and those instructions did not seem like ~12 average. However, one must consider that an "average" instruction timing would have to include MUL, so that likely pulls the average a lot higher. Perhaps we should research and calculate an ELKS/TLVC average instruction that does not include high-time instructions, and watch for and flag the high-time instructions separately (including the C code that generates them).
Indeed. The 5 MULs were the result of creating a virtual "ntick" and converting PIT countdowns and jiffies to the same base units so they could be added. A better approach might be just to store the 4-byte jiffies and the 2-byte countdown as a "measurement", and delay multiplications, or possibly just display jiffies and countdown separately. Also, for small measurements, the jiffies diff might be 0 and just countdown (wrap handling still needed). So some fast code could perhaps calculate jiffie and countdown diff values and just store them. Of course, once it comes time to actually see the result, the printf/printk routine will be so slow it'll wreck all runtime timing. Which brings me to my next idea - kernel or application profiling. I am thinking that a super tool would be to store the CS:IP of executing code every timer tick into a far memory area of 64K bytes. Then, process or kernel symbol tables could be consulted to display the function being executed at each clock tick and a profile generated. The same approach could possibly be used for precision timing, where the jiffies/PITcount would just be stored somewhere and analyzed later.
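Something like this, conceptually (all names made up; prof_buf would be the far 64K sample area):

```c
/* Sketch of the profiling idea: on every timer tick, store the
 * interrupted CS:IP into a far 64K buffer for later symbol lookup.
 * All names here are hypothetical. */
#define PROF_SAMPLES  (65536L / sizeof(unsigned long))

static unsigned long __far *prof_buf;   /* far 64K sample area */
static unsigned int prof_next;

/* called from the timer interrupt with the interrupted CS and IP */
void prof_sample(unsigned int cs, unsigned int ip)
{
    if (prof_buf && prof_next < PROF_SAMPLES)
        prof_buf[prof_next++] = ((unsigned long)cs << 16) | ip;
}
```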
That'd be great - go for it. I'm interested in a better "average" instruction time for sequences that don't include very slow instructions. For reference, I divided irqit into two streams, syscall and hardware interrupt, and traced the function calls, but didn't try to trace the probable execution path, instead just counting all instructions. So the most-used code path is likely smaller than my research suggested.
Actually, all sources I have show the same timings - the Intel manual, the Hummel book, the one you reference which I incidentally looked at this morning. So we do have the exact timings, even with every possible addressing mode etc. Still, like Hummel points out in the pdf I included yesterday, there are more variables: a DMA transfer going on, a fetch colliding with an execution unit memory reference, an instruction exception, etc. So, what Hummel concludes is this (also in the pdf I think): "Due to all the factors that cannot be predicted with certainty, the average program will take 5-10% [corrected] longer to execute than would be indicated by the timings." Sounds good to me.
It would indeed be interesting to see a top 25 list of used instructions in the kernel, which wouldn't be all that hard to do with objdump and an AWK script ...
I think this is a great approach, much akin to what they did in the instruction timing document!
OK; sounds like a printout, a pencil, a book and a sunny day on the terrace! It's on.
Wow, kind of unbelievable!!! But then again, perhaps ~13.67 cycles/instruction does factor correctly when the 5-10% is considered. I found a copy of Hummel's book on Amazon and ordered it last night. Looking forward to reading it, thanks.
@Mellvik, I've created a precision timer library in ghaerr/elks#1953 based on discussions here. Let me know what you think of it; it should work pretty well for measuring elapsed time in the kernel, but I haven't tested it on real hardware. If you can find time to test it out, that would be great. Coming after this will be getting the routines working outside the kernel, e.g. in ktcp. I would like to know how useful this is before going through that effort.
Very exciting @ghaerr - thank you. -M
@ghaerr,
Weirdly, I've calculated the expected timing of the loop itself to ~775ns on the 40MHz 386. Which means that running 10,000 loops may exceed the 10ms limit, but I tried anyway and went down from there. Here's what I got; the first number is the loop count.
I haven't looked at the code so I'm not attempting any analysis at this point. What am I missing?
Hello @Mellvik, Thanks for the testing. And, you will find, it is very easy to get confused with timing and timing results - I sure have been! It does look like your code is correct, though. I have the following comments to try to understand what's going on, as the output timing does look way off, possibly due to the potential variations described below. Of course, the PIT timer should be operating at the same input frequency on your 40MHz 386 as on an 8088, right?
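Also, try printing the raw value, something along these lines (a sketch; assumes your test calls get_time_10ms directly):

```c
/* Sketch: show the raw ptick count as an unsigned decimal so it can be
 * converted by hand (pticks * 0.838 = microseconds); assumes
 * get_time_10ms() returns the ptick difference discussed below. */
void show_raw_pticks(void)
{
    unsigned int pticks = (unsigned int)get_time_10ms();

    printk("%u pticks\n", pticks);
}
```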
What this does is display pticks as an unsigned decimal. In this way, you can multiply the pticks directly by 0.838 to get microseconds. If the pticks value is >= 11932, then the measurement is invalid for get_time_10ms, because the PIT has counted down more than once past its 10ms (=11932 pticks * 0.838 = 10,000 usecs) limit. The multiply result should always be correct. More on this below on the results of

[EDIT: something definitely strange is going on here, as get_time_10ms should never return a value > 11931, since it just takes two PIT readings and computes the difference. The only way it could be greater is if the PIT start countdown value is > 11932, but that should not be the case, as irq_init() initializes the PIT countdown to 11932 for a 10ms interrupt.]
I added your test code, with the printk mod above, and ran it on QEMU, and got the following results, different each time because of the timer interrupt delay (serial console on):
With the above numbers, the parenthesized numbers are the pticks, and multiplying them by 0.838 usecs does yield the displayed time value. I don't know the extent of QEMU's emulation of the 8254 PIT nor its relation to real time, so it might be interesting to see your QEMU results from the above. I'm testing on a 3.1GHz MacBook Pro. Finally, here is the output from timer_test, which should be exactly the same on yours. It displays various ptick values and the %k result:
Right, otherwise the ISA bus wouldn't work (and BTW, the clock would be off...)
I put it at the end of
The
Makes sense - I'm getting 'sane' values back now.
Actually, the numbers I'm getting have minimal variations - which of course may be because they're still wrong :-)
My M1 MB Pro says:
... sensible numbers that, given the variance in your case, are surprisingly stable. And indicating the machine is about twice your speed when doing Intel emulation. The new 'real' numbers (40MHz 386SX):
12.5MHz 286:
The 286 numbers for 3000 and 500 loops make some sense given the loop in question (the 10000 exceeds 10ms anyway so it's invalid). In order to get better control over expectations, I added an assembly loop (conveniently lifted from the el3 driver, temporarily 'hidden' at the end of
The loop itself has only two instructions consuming
cycles - for 386, 8088, 286 respectively (interesting how fast the out instruction is on the 286, I should look it up to make sure it isn't a typo).
What we get is this - first QEMU:
[386]
[286]
Bottom line is that except for QEMU, the numbers don't make much sense. What say you @ghaerr?
Wow, I say this is very frustrating! After reading your post and all the data in it, I would say it seems that reading and/or interpreting the 8254 PIT countdown register on real hardware isn't working with my code, and that QEMU possibly doesn't do a good job of emulating the PIT in real time. I checked the 80286 Intel reference manual for OUT and LOOP and it says 3 cycles for OUT and either 8 (jump taken) or 4, so there's some possible discrepancy with instruction timing estimates (instead of your 10). But these are quite minor considering the radical difference in timings reported by QEMU, which runs faster and slower on different hosts, and your last two real hardware tests of test_udelay, where there's almost no difference between 500, 3000 and 10000 loops in terms of PIT timer countdowns returned (!!!).

So let's take the latter - the 286 and 386 results running test_udelay: all ~200 for 286 and ~144 for 386. It would seem to me that this is saying that reading the PIT timer somehow returns a constant difference between two readings - very strange. Except that when running your original testloop beforehand, the results do vary. Extremely strange that QEMU doesn't vary between 500 and 3000 using test_udelay. I must admit I can't figure that out. Can you post your main.c, printk.c and prectimer.c files used for the test_udelay? I would like to see if I'm missing something, as I can't see the actual code you're executing.

Other thoughts are that somehow the timer loop may be overflowing; perhaps change to using a different instruction (like XCHG or XOR instead of OUT) to keep the system bus out of the equation. Another thought would be to just display the PIT count value in an infinite loop, to check that it's decrementing. It should be decrementing every 0.838 usecs on all systems using a real 8254:
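Something like this (a sketch; assumes Linux-style outb(value, port)/inb(port) helpers and the standard 8254 ports 0x43/0x40):

```c
/* Sketch: latch and read the 8254 channel 0 countdown in an endless
 * loop to verify that it is decrementing. */
void pit_watch(void)
{
    unsigned int lo, hi, count;

    for (;;) {
        outb(0, 0x43);              /* latch current channel 0 count */
        lo = inb(0x40);             /* low byte of latched count */
        hi = inb(0x40);             /* high byte of latched count */
        count = (hi << 8) | lo;
        printk("PIT count %u\n", count);
    }
}
```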
I checked the ptick <-> us/ms conversion and that's accurate. So we're getting a bad ptick difference between readings of the PIT countdown register, if that's even working on real hardware. Another thought would be to write the above PIT read code in ASM, save the value, run a known loop and read it again to see what the different values are. The PIT timer is pretty basic - you send a zero to the command port to "latch" the current value, and then the next two reads from the data port return the low and high bytes of the countdown. When the count transitions from 1 to 0, an IRQ 0 is fired and 11932 is reloaded into the countdown register by the hardware. Do both your 286 and 386 systems use a real 8254 PIT? I'm kind of grasping at straws here trying to come up with something that may be amiss... Thanks for your testing. I'm out of town or I'd test this on my real hardware, which I can do next week.
This simply verifies that both systems/kernels are delivering the same results in your QEMU version, eliminating the chance that there are other elements involved. Thanks for testing. The results are very conclusive.
It seems a
It's repeatedly complaining about Xcode and its command tools needing to be updated, but it's not a requirement (I'll update the command tools manually later). The process takes hours (zillions of packages that need upgrades/updates), and it's not done yet. I'll let you know how it ends. The good thing is,
There is obviously a limit to the real/practical accuracy such a function can deliver, the key goal would simply be 'as close as possible' given the constraints - one being simplicity. This would still be a hell of a lot better than the delay loops we have today, which - between the XT and the 40MHz 386 - vary by orders of magnitude. Also it would have to be a goal to avoid both MULs and DIVs, and I have an idea about how to do that. I'll take a stab and see if I can get it to work. |
I finally realized that we can have 0.838us accuracy, all the way from 0.838us to 42s. That is, the elapsed timer reads the wrap-adjusted PIT countdown (0-11931), then adds the jiffies difference times 11932, since jiffies is always incremented on counter reload. I got confused initially thinking we needed a very accurate, low-overhead sub-10ms timer and created three routines, which turns out is not needed when the new single routine returns an unsigned long, and sub-10ms accuracy is tricky when interrupts are enabled and on fast machines.
No, we just need one routine, and I now have it working. The get_time_50ms name is now bad, and get_time isn't a great name given the confusion with other C library clock time routines like gettimeofday. I was thinking time_elapsed, or maybe get_time_elapsed, might be a better name?
I'm thinking of taking 3-4 adjacent elapsed time samples (with interrupts disabled) and then calculating an average, which could then be used as the average time the time_elapsed function itself needs to run. This would then be the shortest meaningful measurement, and could be subtracted from each elapsed time result in order to get more precision. The C library version will need an init routine (time_elapsed_init?, to get the kernel jiffies pointer); this could calculate the average and return it. I have a working time_elapsed function in the kernel now, and am thinking of submitting a PR when I get it all working and tested for the C library, so we don't have to go through (yet) another revision of the routines...
Agreed, true precision at the microsecond level is impossible given the factors you mention and the fact that an XT/8088 cannot complete even the simplest loop in 1us: clock cycle 210ns,
I believe your initial thinking was better than you give yourself credit for. There is a good reason for staying away from longs when possible - they are extremely expensive on pre-286 machines. Which is why I'm concerned that a single version - long only - is not going to work well/be all that useful on older systems when measuring short intervals. Also there is the concern about complexity in the time_elapsed routines, which - since I haven't seen them - may of course be entirely superfluous. For example, while latching the timer as early as possible when getting the time elapsed is good, it works the opposite way when creating the starting point using the same routine.
Yes, I think this is an important step - incidentally my calibrated delay loop does the same, although with no multiple calls to get an average yet. The calibration happens at boot time, but it may still be good (even necessary) to make multiple calls to even out 'noise'. I get differences between two runs just calling get_time_50ms twice (on the 286) exceeding 100us. As to naming, I agree that the current naming is not ideal. Given that using a timer always requires two steps, set and get, I suggest two different calls, even if they end up calling the exact same routine. Like 'set_ptimer()' and 'get_ptimer()'.
@Mellvik: I've been traveling again, but managed to push PR ghaerr/elks#1962 showing the working refactored code for combining the previous 3 precision timer routines into a single routine:
Good point. In get_ptime, for measurements < 10ms (that is, less than a clock interrupt), all arithmetic is done in 16-bit, and no longs are manipulated or stored. If you know the measurement will be small, then you can just cast the result and get a very fast measurement.

If you're concerned about elapsed times between 10ms and 50ms, there's always going to be a clock interrupt (at least in real execution) within the measurement, and the kernel IRQ 0 handling has been shown to vary elapsed times by quite a bit, so there wouldn't be a super-accurate measurement being made anyway. In the new routine, there is only a single MUL, which is executed only when elapsed time > 10ms (through 42s). So, the same routine can be used for sub-10ms or 0-42s measurements with no extra overhead for < 10ms. The only thing to watch for is that
In general, moving the latch code to be first or last complicates matters and would necessitate having two almost-identical routines, which would also have the disadvantage of not being usable to compute a more reliable average execution time, and would double the amount of kernel code, just for a few cycles' difference. If measuring elapsed time really, really matters (something I have yet to see as really important, given the accuracy of our varied results so far, as well as the fact that ELKS/TLVC are not real time systems), I suggest another way: writing the small portion of the 8254 counter-reading routine as a GCC static inline routine, which would then be inserted directly into the code being measured. This would remove all procedure call overhead and read the PIT directly from the measured function. More on that later should we really want/need it.
I opted against that idea, as it adds another function call to every measurement and there's no real need for separately named start and stop functions, especially if the elapsed time function is used in a loop, where this idea wouldn't work.
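For reference, usage ends up looking something like this (a sketch based on the behaviour described above - get_ptime() returning the elapsed pticks since its previous call):

```c
/* Sketch: measure a code section with get_ptime(); each ptick is
 * ~0.838 usecs. do_something_to_measure() is a hypothetical stand-in. */
void measure_example(void)
{
    unsigned long elapsed;

    get_ptime();                    /* start: discard the return value */
    do_something_to_measure();      /* code under test (hypothetical) */
    elapsed = get_ptime();          /* pticks elapsed since the call above */

    printk("elapsed %lu pticks\n", elapsed);
}
```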
what an educational and interesting journey it has been since I asked this question about 3 weeks ago - in #69 (comment):
... including - of all things - a short tutorial on system startup in #71 (comment) :-) And don't worry about timing (pun intended). This side project is not holding up anything, and I'm having very limited screen time these days too.

It's easy - and we've been there several times - to lose perspective on reality when digging around low level like this. As you point out, there are severe limits to the useable accuracy we can get given the way the system operates - clock, interrupts, scheduling etc. - and not least the capabilities of the systems at hand. An XT has a cycle time of 210ns, which means that even the tightest loop - a

That said, my work on the delay idea has turned into a great testing ground for the precision timer tools - for several reasons, including the variations between runs and the time consumed by the tools themselves, which must be recorded and corrected for when establishing the metric for the delay loop. A great baseline when testing the 'old' 50ms routine against the new
This is a good idea - somewhat similar to what I was thinking when getting frustrated with the variations in the delay function, but discarded for lack of insight into how to do it properly. Back to where I started, we're getting close to having reliable and reasonably accurate timing for the likes of
@ghaerr, an update on updating QEMU: It was ultimately unsuccessful - and it didn't destroy anything except the symlinks in /usr/local/bin. Your system is newer, it may succeed and it's relatively risk free. I can send you a tar file with the 7.0.0 tree which should work if you run |
@Mellvik, I've added user mode precision timing in ghaerr/elks#1963, and it all seems to work. Initial testing was done by timing ktcp's main loop. It will be very interesting to see what this looks like on real NIC hardware!
Thank you @ghaerr - looking forward to it. I haven't pulled across the final

I kinda went to town on the scaled delay idea, encountering so many interesting things along the way, and lost track of time (pun intended) in the process. With 4 different processors and clock speeds to test and compare on, there is indeed a lot to be learned when diving all the way down to the clock cycle level (not to mention how to do the math without floats and with the least possible # of MUL/DIVs in the calibration, plus the effect on timings when using INITPROC vs local for the calibration). Anyway, here's the one that hit me hardest - it took quite some time to figure out. Again it's QEMU, and my version is 7.2.0. The following C statement works -
while this does not, that is, it fails on QEMU, works fine on real HW, all types:
No comments at all from gcc, and here's the thing: The difference in the object file:
Clearly the problem is in QEMU, and just as clearly, this is good to know for a 'heavy' QEMU user like yourself (the symptom is a stack screwup). It might be of interest to 'our' compiler guru too...
@ghaerr - there is apparently more to the 'div problem' described above than initially met the eye - and I'm coming back to that later. What led to the problem in the first place is equally interesting: In short, the caching effects in QEMU (possibly in the ia64 processors being emulated) are nothing short of amazing. We can time the same piece of code twice and get results that differ by a factor of 4. Not every once in a while, but consistently. Example: I have placed this piece of code at 3 different places in a routine, the latter 2 are adjacent:
The values I get are 46, 14 and 9 pticks.
first without, then with the part delineated by

These seemingly random differences in measured times were the root cause of the div problem reported above, causing a div by zero, which apparently is not tested for in the long div routines, whether signed or unsigned. My conclusion from this experience is that precision time measurement on QEMU can never be reliable - and that's OK. QEMU is not the environment in which such measurements are meaningful anyway. The same goes for the delay stuff I've been working on. I just have to catch the presence of QEMU (from the speed) to avoid the weird values screwing up the calculations. The actual delays don't matter on QEMU.
I'm in agreement with you about QEMU: the differences between execution times with and without HVF or KVM acceleration for its internal JITter, along with what you've seen, show that "elapsed" time measurements using the 8254 timer are unreliable at best - I would say +/- ~10 pticks (sometimes way more). But I think the real hardware measurements are pretty accurate (more on this below), right? With regards to trying to accurately measure delays on real hardware, agreed, an average over several loops should be made. Interrupts have to be disabled around get_ptime or the kernel timer interrupt code might be counted on one of the loops, wildly throwing off the results. I would also recommend using a power of two (4 or 8) for the loop count so that a simple right shift can be used instead of a possibly much slower DIV.
Are you talking about the
It's as simple as
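(illustrative only - not the exact statement, but the same shape:)

```c
unsigned long total = 123456L;
unsigned long elapsed = 0;                  /* 'guaranteed positive', but isn't */
unsigned long per_loop = total / elapsed;   /* 32-bit divide by zero */
```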
Which either causes a hang or sends the kernel into some interesting places. The code above has no relation to this problem - I tested the div-by-zero separately after discovering the huge timing discrepancies which cause 'guaranteed' positive values to become negative - and divisors to become 0. I'm not sure what to make of this. On the one hand, a compiler library at the gcc level should handle divide by zero; on the other, the programmer (which would be me in this case) should make damned sure it cannot happen.
I see what you mean now. I'm looking at the divmodsi3.s function. So the GCC library routine (if it's the same as this one; they were once in the kernel but are now in a separate GCC library) does check for dividing by zero (and returns the input as output). You might test this by adding ldivmod.s and divmodsi3.s back into the kernel build, to get to the very bottom of this.
Guess what!? Turns out we're no longer running the soft-divide I referenced above. Further digging shows that the updated ia16-elf-gcc uses a different routine which uses actual DIV instructions. So you've just identified another kernel problem - apparently the 8086 hardware divide-by-zero interrupt vector isn't initialized by the BIOS (and our kernel doesn't set a trap vector for it), and bad stuff is happening!!! There is

I'll look into a fix.
@Mellvik: I added ghaerr/elks#1969 to catch user and kernel mode DIV/IDIV divide-by-zero and overflow hardware faults. It is able to precisely determine the CS:IP of the offending instruction, although that probably isn't necessary, as fault recovery isn't very useful.
That's great news, @ghaerr - another 'cavity' filled, another step toward adulthood for both systems. I'll surely panic a lot tomorrow!
[kernel] Fixed missing div by zero handling reported in #71
On the business of displaying precision timer values (or any other long or short decimal value) in the kernel: I've been deep into comparing software vs hardware DIV routines as a result of the rabbit hunt in #76, as well as some DIV/MUL speed issues we're looking into now that someone has ported Doom to ELKS. When GCC sees a 32-bit divide or modulo, it calls a signed or unsigned long divide routine (__divsi3 or __udivsi3), and when modulo is needed, the same (__modsi3 or __umodsi3), to get the job done. This is for both the kernel and user mode. As you know, these routines used to be soft with no DIV/MULs, but have been replaced by versions executing one or two DIVs and MULs each, depending on the input. And the kernel and libc source have the old versions lying around but not used, adding some confusion. In many cases, including our own printk and libc printf, a number-to-ascii routine will be written taking the modulo base 10, then a division base 10, in order to generate the ascii digits in a number:
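Typically something along these lines (a generic sketch, not our exact printk code):

```c
/* Generic number-to-ASCII sketch: digits are generated in reverse by
 * repeated modulo 10 and divide by 10 - for a 32-bit long each pass
 * costs a __umodsi3 call plus a __udivsi3 call. */
char *ultodec(unsigned long val, char *buf, int size)
{
    char *p = buf + size - 1;

    *p = '\0';
    do {
        *--p = '0' + (val % 10);    /* __umodsi3 */
        val /= 10;                  /* __udivsi3 */
    } while (val);
    return p;
}
```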
As it turns out, real HW DIV instructions generate both the division result as well as the remainder in one instruction. But GCC generates two separate calls for the above code, one to __umodsi3, then another to __udivsi3. What would be nice would be a single __divmod routine that calculates both in a single call, thus saving lots of CPU time. What would be even nicer would be a __divmod10 routine that is optimized for dividing/modulo 10. I'm working on that, but GCC won't generate calls to them directly, although I have found that when compiling with -Ofast instead of -Os, it can produce a long divide by 10 not using any DIVs or MULs at all. Research continues. For cases where we want maximum speed, e.g. precision timing output, we might want something like this, with __divmod10 optimized in ASM:
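(A sketch of the intended interface - the actual signature is still an assumption:)

```c
/* Sketch: one call produces both the 32-bit quotient and the 16-bit
 * remainder of a divide by 10. The real routine would be hand-written
 * ASM; the prototype here is an assumption. */
unsigned long __divmod10(unsigned long val, unsigned int *rem);

char *ultodec_fast(unsigned long val, char *buf, int size)
{
    char *p = buf + size - 1;
    unsigned int digit;

    *p = '\0';
    do {
        val = __divmod10(val, &digit);  /* quotient returned, remainder out */
        *--p = '0' + digit;
    } while (val);
    return p;
}
```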
So, I looked at the kernel printk and guess what?? The way it's written (which looks like it dates all the way back to the 1990's), it operates in the worst possible case and is guaranteed slow: it uses an algorithm that avoids reversing the decimal output pointed out above by starting with a divide by 1,000,000,000 and then div 10, mod 10 and so on in a for loop, TEN TIMES, calling both __udivsi3 and __umodsi3 every time!!! So, the kernel is taking the absolute worst time to display a calculated number in any base. No wonder printk's change the timing of the kernel - they are taking ridiculously long every time and we never even knew it!!! I'm working on a rewrite for this as well as the user mode printf, although the latter is not nearly as bad. However, %k ptick output is subject to this massive slowdown in printk as well. I thought you'd get a kick out of this.
Incredible. Yes you're right, this is up my alley (I'm still tweaking the delay calibration, unbelievable). This is research plus creativity. Looking forward to the next episode! Thanks.
I have written a very short, custom ASM __divmod10 which runs as fast as possible, and executes one DIV if it can, otherwise two, to compute both the 32-bit quotient and 16-bit remainder of a 32-bit number divided by 10 in the same routine. The next step is testing it and then rewriting printk followed by printf, ultostr, ltostr and ltoa so they're as fast as possible. This will speed up printk immensely, which is probably very good for older systems, although we may lose the delays previously used to affect timing-specific system behavior - we'll see just how slow it was and whether that mattered. In particular I have figured out how to combine the %k processing which required comma output for thousands grouping with the mainline numeric decimal conversion routine, which should reduce code size and also give us thousands grouping for other applications, which is something I've wanted for a while.
In previous comments, I made the statement:
After looking at the code generated for the kernel sys_gettimeofday routine, I would say "Oops - that's only part of the story, it's actually not fast". In order to prepare the returned system time into a struct timeval, a 32-bit divide (by calling __udivsi3) plus an additional 32-bit modulo (by calling __umodsi3) are made, along with a couple of other 32-bit adds. This is all slow as heck. (Unbundled sources for each are here and here, and they loop.)

Then, after that, ktcp's time wrapper function timer_get_time, which returns a 32-bit time in 1/16 second resolution, goes through yet another slow call to __udivsi3 to convert microseconds into 1/16 seconds by dividing by 16.

Besides being quite slow, 1/16 second = 62.5 msecs, 6 times longer than the 10ms clock resolution and the 5-10 msecs for an ACK response, discussed here.
(I also notice in the above code as I'm writing this that the second ("&tz") param to gettimeofday is not used and wastes more time returning it - should be NULL).
In another post I stated:
After thinking more on this and looking at the timer_get_time routine - that answer also gives only half the story. The problem is that for efficiency reasons, any "ktcp" time needs to fit in 32 bits, not a timeval struct, in order to do a quick subtraction and not end up calling gcc long div/mod routines each time it's used.
Thus the tradeoff: the more resolution given to micro or milliseconds, the fewer bits given to the "seconds" portion of a timeq_t.
In the current timeq_t, 28 bits are seconds and 4 bits are 1/16 seconds. This allows for 2^28 = 256M/86,400 = ~3106 days of timer. (A timeq_t may have to be in effect for the total duration of a TCP connection, I believe). To move to microsecond resolution would require 20 bits (1M) for microseconds, leaving only 12 bits (=4096) for seconds, a little over an hour. So that won't work, and the kernel hw timer has max 10ms (=HZ = 1/100) resolution anyways.
Considering that we stay with an approx ~10ms timer, which would at least measure ACK responses favorably on a 286/12.5MHz or V20/8MHz, how would the math work, given that the gettimeofday/timer_get_time path is already quite slow?
If we considered 1/256 second resolution (=~3.9ms) or 1/128 resolution (=~7.8ms), this would help a lot and the timer_get_time routine could be made faster (see below). If 1/256sec is used, that leaves 24 bits for seconds (=16M/86400 = ~194 days). 1/128 would be 25 bits with 388 days. But we should be time sampling with less than half the timer resolution, so I would say 1/256 second resolution would be best, and TLVC/ELKS is likely just fine with an open TCP connection for 194 days before worrying about RTT miscalculations.
One might ask, why not just use 1/100 second (=10ms) resolution directly? Well, the original timer_get_time routine is somewhat magic in that it uses the fact that 1M/16 = 62,500, so a microsecond divide can exactly fit into 4 bits (=1/16 sec). Dividing by 100 will still be quite slow but won't fit neatly into 7 bits. And if we're going to use 10ms resolution, why not make a special new system call and just return jiffies directly, which would be quite a bit faster, and only use a compare and subtract or two to compute elapsed time.
Another thought would be to just use a right-shift-12 (=1M/256 = 2^20/2^8) for roughly converting struct timeval microseconds to 1/256 seconds; this would leave the last line of the routine as:
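(roughly - a sketch assuming the 1/256ths go in the low 8 bits, not the actual ktcp source:)

```c
/* seconds in the upper 24 bits, ~1/256 seconds (usecs >> 12) in the low 8 */
return ((timeq_t)tv.tv_sec << 8) | (tv.tv_usec >> 12);
```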
I haven't yet figured out how to measure how long the __udivsi3 or __umodsi3 routines take, so I'm not fully sure just how important all this is, given the kernel's 10ms max resolution. What do you think?