[kernel] Enhance buildimages.sh and emulator scripts, add DMASEGEND in config.h #2091

ghaerr · 2024-11-01T01:20:18Z

This work is a precursor to splitting DMASEG/DMASEGSZ into two buffers: a bounce buffer for XMS or 64k-wrap I/O, and a floppy disk track cache buffer for reading contiguous sectors used in the BIOS and DF drivers.

Currently the shared use of this area for a bounce buffer causes the track cache to be invalidated since both use the same DMASEG start address in low memory. In addition, the BIOS HD driver has to invalidate the floppy track cache since it also performs sector I/O using DMASEG. By splitting off a separate DMASEG from CACHESEG, neither the BIOS HD driver nor BIOS FD or direct DF driver will have to invalidate the cache for XMS or HD I/O.

While this will cost 1K of low memory, testing and enhancements being performed by @Mellvik in Mellvik/TLVC#88 are showing that best floppy cache performance is obtained by disabling the cache entirely for 386+ (fast) computers and using an approx 6K cache for (other, slow) PC/XT systems. This saves 3K from the current max track cache of 9K (including a shared bounce buffer). In the case of a disabled cache, 6K could be released to the kernel for general memory use.

The DMASEG buffer could be shared across the BIOS FD and DF drivers, since the system won't work with both simultaneously. As mentioned above, the BIOS HD and FD driver can share DMASEG since the I/O is always synchronous. However, sharing DMASEG between BIOS HD and direct DF won't work and is currently a major bug, as the DF driver can receive async requests during BIOS HD I/O that require the simultaneous use of DMASEG when XMS is in use, or when I/O is requested into a non-wrap-protected 1K L1 buffer. ~~Without XMS, there shouldn't be contention since the DF driver won't receive any I/O requests with 64k address wrapping since all L1 and L2 non-XMS buffers are wrap-protected.~~

[EDIT: Actually only L2 buffers are wrap protected, as they're allocated using seg_alloc's SEG_FLAG_ALIGN1K. L1 buffers are allocated via heap_alloc which doesn't yet have the ability to align an allocation on a block boundary.] Should the DF be enhanced to handle raw device requests, address wrap could occur unless the upper level splits the I/O into multiple segments (preferred).
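To make the 64k address-wrap condition concrete: the ISA DMA controller pairs a 16-bit address counter with a separate page register, so a single transfer cannot cross a 64K physical boundary. A minimal sketch of the check follows. The helper names `linaddr` and `dma_wraps_64k` are hypothetical and not taken from the kernel source.

```c
#include <stdint.h>

/* Convert a real-mode segment:offset pair to a linear (physical) address. */
static uint32_t linaddr(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;
}

/*
 * Return nonzero if a transfer of len bytes starting at seg:off would
 * cross a 64K physical boundary, i.e. its first and last bytes lie in
 * different 64K pages. (Hypothetical helper for illustration.)
 */
static int dma_wraps_64k(uint16_t seg, uint16_t off, uint16_t len)
{
    uint32_t first, last;

    if (len == 0)
        return 0;
    first = linaddr(seg, off);
    last = first + len - 1;
    return (first >> 16) != (last >> 16);
}
```

A 1K buffer that starts on a 1K boundary can never straddle a 64K boundary, which is why an ALIGN1K allocation is enough to wrap-protect a 1K buffer.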

When HD and DF compete, the solution is probably a kernel mutex around DMASEG; that will be addressed after these next enhancements.

The kernel code segment start address REL_SYSSEG is now calculated in this PR, potentially causing trouble without testing ROM and PC98 versions heavily. In order to more quickly test this on all systems (IBM PC, PC-98, 8088 ROM and 8018X ROM) the buildimages.sh script has been enhanced to allow for rapid compilation of kernels only using ./buildimages.sh fast. Testing of that showed deficiencies in qemu.sh and emu86.sh (for ROM desktop emulation) which are updated, and dosbox.sh is added for PC-98 testing. A new file pc98-1232-nc.config is added to produce a non-compressed image for speed.

…esting Add dosbox.sh for PC-98 image testing Fix qemu.sh for macOS Enhance emu86.sh for use after buildimages.sh

ghaerr · 2024-11-01T03:31:53Z

After more careful code inspection, the next step of splitting DMASEG into two parts IMO should be considered more carefully. While there are potentially good future reasons to have a separate always-available protected single-block DMASEG, the current BIOS driver already handles HD I/O separately directly into the requested buffer without disturbing the track cache, unless XMS is on. When XMS is on, the FD I/O has the same issue of requiring DMASEG, not using the track cache but invalidating the track cache. Separating DMA and track cache buffers would give an advantage to combined HD/FD copy operations, but only when XMS is off. I'm not sure what real advantage that has, at an extra dedicated 1K low memory cost.

Also, should either BIOS driver start using DMASEG separately for XMS and not invalidate the cache, this would require an additional check and fmemcpy to update the cache when also writing the block. The DF driver already does this. A potential issue is that for older (slow, non-386) machines, this extra memcpy is quite slow and there's no guarantee the updated track cache contents will ever actually be used. All this possibly feeds into the smaller 6K cache size found by @Mellvik to be most optimal (testing on DF driver only).
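The "additional check and fmemcpy" could look roughly like the sketch below. It is a simplified host-side model, not driver code: plain `memcpy` stands in for the kernel's `fmemcpy`, the cache lives in an ordinary array rather than a low-memory segment, and `struct track_cache` and `cache_update_on_write` are hypothetical names.

```c
#include <string.h>

#define SECTOR_SIZE   512
#define CACHE_SECTORS 12        /* 6K cache, the size discussed above */

struct track_cache {
    long start;                 /* first cached sector, -1 when empty */
    int  count;                 /* number of valid cached sectors */
    unsigned char data[CACHE_SECTORS * SECTOR_SIZE];
};

/*
 * After writing nsect sectors from buf to the device starting at
 * 'sector', copy any part that overlaps the cached range into the
 * cache so it stays valid. Returns the number of sectors updated.
 */
static int cache_update_on_write(struct track_cache *c, long sector,
                                 int nsect, const unsigned char *buf)
{
    long lo, hi, cache_end;

    if (c->start < 0 || c->count <= 0)
        return 0;                               /* nothing cached */
    cache_end = c->start + c->count;
    lo = sector > c->start ? sector : c->start; /* overlap start */
    hi = sector + nsect < cache_end ? sector + nsect : cache_end;
    if (lo >= hi)
        return 0;                               /* no overlap */
    memcpy(c->data + (lo - c->start) * SECTOR_SIZE,
           buf + (lo - sector) * SECTOR_SIZE,
           (size_t)(hi - lo) * SECTOR_SIZE);
    return (int)(hi - lo);
}
```

The copy is the cost in question: on a slow machine every overlapping write pays for it, whether or not the cached sectors are ever read again.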

In any case, separating buffers without more cache and XMS buffer analysis is currently deemed both complicated and risking introducing subtle bugs, for which we don't have sufficient regression testing. Thus for the time being, this PR cleans up config.h considerably which is great, but will likely stop there with regards to multiple I/O buffers.

Now the next step will be adding the ability to dynamically set cache size on DF and BIOS, with the ability to turn the cache off on fast systems, for optimal speed seen on real hardware.

Mellvik · 2024-11-01T13:01:03Z

@ghaerr, thank you for some really useful 'loud thinking' about this issue. This is as important on TLVC as it is on ELKS, although possibly for slightly different reasons: We have more DMA devices while BIOS IO is a low priority 'add on', with ELKS it's the other way around.

Anyway, here are some considerations that come to mind ('thinking loud' back, and generally ignoring BIOS IO):

Separating the bounce buffer from the floppy cache is a good idea and a simplification now that the cache goes away for faster systems. Actually, maybe the low memory bounce buffer can be eliminated - on some systems. More about that below.
XMS buffers require continuous use of bounce buffers, regardless of device - and quite possibly concurrently, by (say) some hd drive and a floppy. This means that non-floppy storage devices really need their own 'private' (per driver) XMS bounce buffer (the TLVC directhd driver allocates such buffer from the heap on open). There is no 64k boundary restriction on this buffer. There would be for xd type (MFM) drives, but they are XTclass, where there is no XMS anyway.
That's an important point: XTclass systems will never have XMS so there is only one 'bounce situation' to take care of: DMA.
A DMA bounce buffer - regardless of platform - is required only occasionally - when a 64k boundary is crossed by the source/destination IObuffer. Also, in ELKS, DMA is used by floppy IO only, which makes me think that invalidating and using the cache for such bounces is a small price to pay for the simplicity of no bounce buffer at all (which incidentally is the way it was with the track buffer). If there is no fdcache (386+ system), a low RAM DMA bounce buffer would be required for floppyIO and would double as XMS bounce buffer since they would not be needed simultaneously (raw access disallowed when block device open).
That would mean that on XTclass systems, a reserved DMA bounce buffer would not make sense - unless there is a XD drive on the system.
On ATclass systems, the LANCE network interface would have the same requirement. In both cases, that would be a system configuration (menuconfig) option.

Should the DF be enhanced to handle raw device requests, address wrap could occur unless the upper level splits the I/O into multiple segments (preferred).

I don't agree, first because this is handled by the directfd driver already (using the dma bounce buffer) and secondly because moving hw related issues upstairs is generally discouraged.

Chances are there are situations I haven't accounted for when thinking about this. When trying to summarize, I ended up with this list of specific requirements and the platform they belong to:

Driver local XMS bounce (heap_alloc 1k in driver, no 64k boundary requirements) - (AT+)
Floppy cache with DMA bounce: DMASEG typically 6k (XT)
No floppy cache, low memory DMA bounce (DMASEG, 1k) (386+)
XD DMA bounce (DMASEG, 1k) (XT)
Lance DMA bounce (DMASEG, 1k) (AT+)

Most if not all these requirements can be implemented via menuconfig, even the latter two (which never occur concurrently), allocated below the floppy cache. The most interesting (and worst case) scenario would be a 286 AT, which may have XMS (rare, like the Compaq) and floppy cache and Lance ethernet. In this case the Lance DMA bounce would be the first 1k in DMASEG and would double as its XMS bouncer. Next in DMASEG would be a combined floppy DMA/XMS bounce buffer, followed by the fdcache, which in this case would not have to double as DMA bouncer. Total DMASEG 8k.

A generic catch-all/cover-all configuration would be to set aside 6k DMASEG, plus 1k if either LANCE or XD is configured. (BTW - implicit assumption: I think it's possible, but I'm not planning to make the LANCE driver available for 8bit ISA).
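As a rough illustration of the 286 AT worst case described above, the layout could be expressed with computed segment definitions along these lines. Everything here is an assumption made for the sketch: the `CONFIG_*` option names, the `KB` and `PARA` helpers and the `DMASEG` start value are hypothetical, and neither the ELKS nor the TLVC config.h necessarily looks like this.

```c
#define KB(n)        ((n) * 1024U)
#define PARA(bytes)  ((bytes) >> 4)     /* bytes to 16-byte paragraphs */

/* Hypothetical menuconfig results for a 286 AT with Lance and floppy cache */
#define CONFIG_LANCE_BOUNCE  1          /* 1K Lance DMA/XMS bounce buffer */
#define CONFIG_FD_BOUNCE     1          /* 1K floppy DMA/XMS bounce buffer */
#define CONFIG_FDCACHE_KB    6          /* floppy track cache size in K */

#define DMASEG            0x0080        /* start segment (hypothetical value) */

#define LANCE_BOUNCE_SEG  DMASEG
#define LANCE_BOUNCE_SZ   (CONFIG_LANCE_BOUNCE ? KB(1) : 0)
#define FD_BOUNCE_SEG     (LANCE_BOUNCE_SEG + PARA(LANCE_BOUNCE_SZ))
#define FD_BOUNCE_SZ      (CONFIG_FD_BOUNCE ? KB(1) : 0)
#define FDCACHE_SEG       (FD_BOUNCE_SEG + PARA(FD_BOUNCE_SZ))
#define FDCACHE_SZ        KB(CONFIG_FDCACHE_KB)

#define DMASEGSZ          (LANCE_BOUNCE_SZ + FD_BOUNCE_SZ + FDCACHE_SZ)
#define DMASEGEND         (DMASEG + PARA(DMASEGSZ))  /* first segment past the area */
```

With these settings `DMASEGSZ` comes to 8K, matching the total given above. Setting a bounce option to 0 collapses that buffer and moves everything after it down.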

This does indeed sound complicated, but in terms of code it's quite simple and most of the stuff is already in place.

My $0.02 (or maybe 0.04...)

ghaerr · 2024-11-01T17:04:21Z

Great comments @Mellvik, agreed on everything you're saying, and only realized after writing the post that XMS isn't applicable to PC/XT systems (nor 286 systems, since the LOADALL method of setting the shadow GDT registers isn't implemented).

I'm still convinced it's a complicated decision on what changes to make going forward, and am continuing to try to arrange the matrix before starting changes. Thinking more and especially after reading your post, I had the following additional thoughts/problems to consider:

The solution of sharing a dedicated DMASEG buffer (outside of the floppy track cache buffer) still won't work for the DF driver, even without floppy cache. This is because my idea of protecting DMASEG with a kernel mutex won't work: a mutex (for synchronous I/O) implies putting the caller to sleep until buffer available, but the DF driver is purely interrupt driven and sometimes gets its next request at interrupt time. This means sleeping for a buffer is not possible.
Thus the DF driver (or any other pure interrupt-driven driver) may need its own dedicated buffer, and a fixed low-memory location if DMA address wrap or XMS is required/supported.
Adding multiple DMASEG buffers in low memory is now possibly easily configurable since this PR adds calculation capabilities to the segment definitions, but adding a dedicated 1k per driver is costly. Allocating a buffer through heap_alloc makes good sense as the memory is only used when the driver is in use. Best solution here sounds like adding an ALIGN1K flag to heap_alloc, which is a bit messy, but solves the shared access issue.
Using heap_alloc for a track cache isn't a great idea, since that would also have to be 1K aligned, and fragmentation and excess kernel data segment would be a problem with a large alloc.
Thus as you suggested, keeping separate DMASEG and track cache for DF driver's exclusive use seems the best answer.
If DF driver is not in use, then BIOS FD & HD can share DMASEG. If DF driver is in use, then separate DMASEG for BIOS driver is necessary for BIOS HD. This could be either allocated or permanent. (Is this complicated, or what?!)

Is the XD driver asynchronous? Lance is synchronous, right?

A final important point is that for 386+ systems with cache turned off, XMS should always be ON. Almost every 386+ system will have XMS memory available, and it doesn't make much sense to try to improve I/O speeds for 386 systems when with XMS one can have up to 2500 buffers (2.5M, almost twice a full floppy) of data cached. For this reason I'm considering rethinking the decision to have XMS turned off in the default shipping configuration. I can't recall but think we might have had a couple early XMS issues with some 386 systems with regards to A20 line management.

This does indeed sound complicated

It's still complicated! I can't quite get my hands around how important dedicated low memory is for compiled-in drivers. We could switch to using seg_alloc and allocate everything out of main memory w/ALIGN1K, now that I think of it. If the driver were opened at or very near boot, this might pack all the DMASEG and track caches together, with no fragmentation. Perhaps I should test that concept.

this might pack all the DMASEG and track caches together

[EDIT: The track cache, being larger than 1K, can't be allocated w/ALIGN1K to prevent address wrap, so the floppy driver(s) can't use seg_alloc. But could be better idea for dedicated buffers for other drivers requiring only 1K buffers.]

Thank you!

Mellvik · 2024-11-01T19:06:00Z

Thanks @ghaerr, this is useful. I'll ponder your comments overnight :-), more comments tomorrow. For now just a couple of things.

... only realized after writing the post that XMS isn't applicable to PC/XT systems (nor 286 systems, since the LOADALL method of setting the shadow GDT registers isn't implemented).

Actually, and as mentioned in my post, the 286 does implement loadall and works fine with xms - vendor permitting. I have no clue about the magic behind this (you implemented it), but XMS buffers work fine on the compaq 286 portable III - 2500 buffers and all.

Is the XD driver asynchronous? Lance is synchronous, right?

All interrupt driven drivers are asynchronous. That includes the network drivers in ELKS. Lance doesn't work yet, but if possible, it's even more async than the other network drivers because it's using dma. All the others are PIO.

ghaerr · 2024-11-01T20:37:29Z

I have no clue about the magic behind this (you implemented it), but XMS buffers work fine on the compaq 286

That's undoubtedly because XMS on those machines is configured and implemented via the extended INT 15 protected block move (not unreal mode), which happens to be required on all Compaq BIOSes, because the BIOS itself runs in protected mode, which interferes with unreal mode. I'm aware of LOADALL, it is required on 286 (and still would not work on Compaq), my point is that ELKS doesn't implement it. ELKS only implements XMS via unreal mode or INT 15.

All interrupt driven drivers are asynchronous. That includes the network drivers in ELKS.

No - they don't all accept requests "asynchronously at interrupt time", which was my definition of "asynchronous". Divided into two classifications - drivers using interrupts (many) and drivers that accept I/O requests from an interrupt handler (DF & SSD) - none of the network drivers do the latter. The entire TCP/IP network subsystem is synchronous, and has to wait for ktcp to process each request before starting another (ktcp technically hangs in select so it can process another "request", but only gets one application request at a time, and there's only one NIC open at a time, and only one /etc/tcpdev file descriptor). Also, none of the network drivers have a request queue. An async driver can't return a direct result code to its caller, unless the caller is an async subsystem itself (only the block I/O subsystem is).

Given that explanation, I'm assuming the TLVC XD driver is async, DF of course is, and the TLVC HD driver is synchronous, along with ELKS BIOS FD and HD. The Lance driver puts the calling process to sleep until DMA is complete, not able to accept another request in the meantime, right? So it's synchronous also. All NIC drivers receive requests through the character filesystem sub driver read/write entry point, which requires a result to be returned to the caller synchronously.

So what I'm talking about is the ability for asynchronous I/O requests to be received (e.g. getting the next actual I/O request at interrupt time), or not - because as I mentioned above, "asynchronous" drivers can't share a buffer with anything else, since they can't put the calling process to sleep if not available. We can't reliably share DMA segments between async and sync drivers, but sync drivers could share a DMA segment, providing their read/write entry points won't ever be called/in-use simultaneously. I believe this means sync disk drivers could share DMA, but only amongst themselves, but only if the driver doesn't sleep a process. Since the NIC drivers all sleep the calling process (which is always ktcp) they can likely share a DMA segment until we implement multiple NIC cards in use at the same time. (Is that called multi-homing?)

Complicated and messy. Still trying to figure whether statically allocate, share, or dynamically allocate DMA buffers given the issues above.

Mellvik · 2024-11-02T11:06:31Z

I have no clue about the magic behind this (you implemented it), but XMS buffers work fine on the compaq 286

That's undoubtedly because XMS on those machines is configured and implemented via the extended INT 15 protected block move (not unreal mode), which happens to be required on all Compaq BIOSes, because the BIOS itself runs in protected mode, which interferes with unreal mode. I'm aware of LOADALL, it is required on 286 (and still would not work on Compaq), my point is that ELKS doesn't implement it. ELKS only implements XMS via unreal mode or INT 15.

OK, I remember the INT15 trick - doing the job of switching from real mode to virtual mode (possibly unreal mode?) and back. Very useful. My point was that XMS is indeed available on (many) 286 systems, which is relevant for this discussion: Which system groups must have XMS bounce buffers.

All interrupt driven drivers are asynchronous. That includes the network drivers in ELKS.

No - they don't all accept requests "asynchronously at interrupt time", which was my definition of "asynchronous". Divided into two classifications - drivers using interrupts (many) and drivers that accept I/O requests from an interrupt handler (DF & SSD) - none of the network drivers do the latter. The entire TCP/IP network subsystem is synchronous, and has to wait for ktcp to process each request before starting another (ktcp technically hangs in select so it can process another "request", but only gets one application request at a time, and there's only one NIC open at a time, and only one /etc/tcpdev file descriptor). Also, none of the network drivers have a request queue. An async driver can't return a direct result code to its caller, unless the caller is an async subsystem itself (only the block I/O subsystem is).

Actually I don't understand your definition and I don't think this is the optimal venue for that discussion. But I'm sure we can agree that the most fundamental prerequisite for asynchronicity in this context is the availability of buffers. The storage IO system achieves asynchronicity by passing buffers back and forth, allocating, releasing, queuing, waiting, sending wakeups etc. In most OSes there is a similar system for network requests, which TLVC and ELKS don't have and cannot reasonably afford. So ktcp with all its blessings and limitations, does IO directly with the driver level, preventing asynchronicity. Good for resource preservation, bad for performance, an acceptable balance - not the least because there are still buffers involved, although on a different (lower) level. When ktcp sends a packet, the driver hands it off to the NIC buffer where it sits until the NIC can process it - asynchronously. All NICs (except Lance) have more than one transmit buffer and with the exception of really fast (vintage scale) systems, the network is faster than the IO system, so even 2 NIC transmit buffers is a lot. Does this make the driver asynchronous? Not really, but - since the same is the case for reads, the NIC buffering the packets until ktcp finds the time and opportunity to read them, the subsystem itself may be called asynchronous.

By introducing buffers at the driver level, the network subsystem becomes completely asynchronous even though ktcp is synchronous, which brings me to the point: All network drivers have most of the logic in place to handle their own buffers. Some - specifically the ne2k and the Lance drivers - have this implemented, others can have that in a matter of minutes. No structural changes, just a few lines of code. So - without diving back into the definition challenge, we're both right - and we seemingly have slightly different definitions.

While (possibly) interesting in itself, the asynchronicity and drivers issue seems completely irrelevant to this discussion (about how to use the precious resource of low memory RAM). What I suggested in my previous post is to not share DMA segments, since there is no reason to:

If XT or 286 system, use the floppy cache for bouncing in the odd case of a DMA wrap. 386 and up: Use the XMS bounce buffer if configured otherwise configure one for DMA wrap only. (The 286 may be a special case as discussed before.)
If XD disk is configured, set aside a dedicated bounce buffer.
If Lance is configured, use the same bounce buffer that would be used for XD, they're never present on the same system. If the entire Lance packet buffer is allocated in 'safe' memory, there is no need for a bounce buffer. Currently the Lance packet buffer is allocated from the heap.

These are conveniently compile-time definitions/configurations, very simple.

More about the DMASEG allocation in the next post.

Thank you.

Mellvik · 2024-11-02T11:32:24Z

I'm still convinced it's a complicated decision on what changes to make going forward, and am continuing to try to arrange the matrix before starting changes. Thinking more and especially after reading your post, I had the following additional thoughts/problems to consider:

The solution of sharing a dedicated DMASEG buffer (outside of the floppy track cache buffer) still won't work for the DF driver, even without floppy cache. ...

Thus the DF driver (or any other pure interrupt-driven driver) may need its own dedicated buffer, and a fixed low-memory location if DMA address wrap or XMS is required/supported.

I agree, this is in line with my suggestion. And - If I'm thinking clearly - if there is an XMS bounce buffer, there is no need for a DMA bounce.

Adding multiple DMASEG buffers in low memory is now possibly easily configurable since this PR adds calculation capabilities to the segment definitions, but adding a dedicated 1k per driver is costly.

It is. And this calculation is already in place in TLVC, using the menuconfig'ured fdcache size to set the boundaries.

Allocating a buffer through heap_alloc makes good sense as the memory is only used when the driver is in use. Best solution here sounds like adding an ALIGN1K flag to heap_alloc, which is a bit messy, but solves the shared access issue.

For floppy this may be the case, thinking that cases where floppies are ROOT devices or continuously mounted will be rare. I was thinking about adding such a flag to heap_alloc when first fixing the DMA wrap problem in the xd driver, but ended up with what I think may be a simpler solution: Testing the returned block for wrap, allocating a new one if it failed, then releasing the first. A bit of heap fragmentation will occur, but that would happen if using an ALIGN1K wrapper too. The xd driver now uses a bounce buffer @ 0x90 and moves DMASEG up by 1k.
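The test-and-retry idea can be sketched as follows. This is a host-side model under stated assumptions, not the xd driver's code: the allocator is passed in as function pointers working on linear addresses so the logic can be exercised anywhere, and `demo_alloc`/`demo_free` are a toy bump allocator that exists only to show the behaviour. In a kernel, heap_alloc and its matching free routine would take their place.

```c
#include <stddef.h>
#include <stdint.h>

/* Allocator interface: returns the block's linear address, 0 on failure. */
typedef uint32_t (*alloc_fn)(size_t size);
typedef void (*free_fn)(uint32_t addr);

/* Nonzero if [addr, addr+size) straddles a 64K physical boundary. */
static int wraps_64k(uint32_t addr, size_t size)
{
    return size != 0 && (addr >> 16) != ((addr + size - 1) >> 16);
}

/*
 * Allocate a block that does not cross a 64K boundary. If the first
 * block wraps, allocate a second one while still holding the first
 * (so the allocator cannot hand back the same memory), then release
 * the first.
 */
static uint32_t alloc_no_wrap(alloc_fn alloc, free_fn release, size_t size)
{
    uint32_t first, second;

    first = alloc(size);
    if (first == 0 || !wraps_64k(first, size))
        return first;
    second = alloc(size);
    release(first);
    if (second != 0 && wraps_64k(second, size)) {   /* defensive check */
        release(second);
        return 0;
    }
    return second;
}

/* Toy bump allocator over simulated addresses, for demonstration only. */
static uint32_t demo_next = 0xFE00;     /* first 1K block will straddle 0x10000 */
static int demo_live = 0;               /* blocks currently allocated */

static uint32_t demo_alloc(size_t size)
{
    uint32_t addr = demo_next;
    demo_next += (uint32_t)size;
    demo_live++;
    return addr;
}

static void demo_free(uint32_t addr)
{
    (void)addr;
    demo_live--;
}
```

Two disjoint blocks cannot both contain the same boundary, so within a heap that spans at most one 64K boundary the second allocation is always usable.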

Using heap_alloc for a track cache isn't a great idea, since that would also have to be 1K aligned, and fragmentation and excess kernel data segment would be a problem with a large alloc.

Agreed.

Thus as you suggested, keeping separate DMASEG and track cache for DF driver's exclusive use seems the best answer.

If DF driver is not in use, then BIOS FD & HD can share DMASEG. If DF driver is in use, then separate DMASEG for BIOS driver is necessary for BIOS HD. This could be either allocated or permanent. (Is this complicated, or what?!)

I'm assuming 'not in use' means 'not configured'. And I do see this complication (although well hidden in a config.in file...). Another case for either/or: I did allow mixing direct and BIOS drivers for a while in TLVC, then ditched it for exactly this reason: DMASEG sharing became too complicated (or dangerous if not covering all the bases).

A final important point is that for 386+ systems with cache turned off, XMS should always be ON. Almost every 386+ system will have XMS memory available, and it doesn't make much sense to try to improve I/O speeds for 386 systems when with XMS one can have up to 2500 buffers (2.5M, almost twice a full floppy) of data cached. For this reason I'm considering rethinking the decision to have XMS turned off in the default shipping configuration. I can't recall but think we might have had a couple early XMS issues with some 386 systems with regards to A20 line management.

I agree and - although I have not used XMS for a very long time - I think it should be the default (the additional code is likely minimal). When I did use it, the use was extensive - on 286 and 386, and I do believe it's safe. That said, I came to use no more than 100 XMS buffers because of the syncing - blowing away the IO subsystem for elongated periods. A different discussion - again.

We could switch to using seg_alloc and allocate everything out of main memory w/ALIGN1K, now that I think of it. If the driver were opened at or very near boot, this might pack all the DMASEG and track caches together, with no fragmentation. Perhaps I should test that concept.

It's an interesting idea, but keep in mind that whatever is done/chosen along the alternatives discussed here, is still less than the FloppyTrackBuffer we've had since forever.

Thank you.

ghaerr · 2024-11-02T16:25:58Z

and we seemingly have slightly different definitions.

Agreed, and my choice of using the word "asynchronous" to describe both the workings of the method(s) of requesting I/O as well as the inner workings of various drivers falls short. As well as going off on a tangent about it all. Useful discussion though as I'd forgotten about some earlier conversations about how adding a buffer pool to the NIC drivers might increase throughput. Which begs the question of even more buffers and where to put them... We can leave that to another day.

Thanks for all your other comments about DMA, XMS and XT vs 386. (And I think 286 will fall where it may after your continued testing - either treated like XT or 386 for purposes of caching and associated buffer management).

I'll try summarizing (again) what might work for ELKS and TLVC, as I'm still interested in a common driver interface between the systems.

Keep things simple and not overly complicated. The systems remain very configurable, so assumptions do not have to be made for every user/use case. With options in either bootopts, menuconfig, config.h or the drivers themselves, most use requirements can be met.
Both the DF and BIOS FD floppy drivers should manage DMA, XMS and track caching entirely within their own buffer(s). These will likely remain allocated in low memory with a max size fixed at compile time. Various runtime options will allow for changing cache options, decreasing cache sizes, etc.
For drivers that need to handle DMA wrap, and/or for drivers that handle XMS, or for drivers that might be configurable to share a buffer, allocate at compile time a single 1K DMASEG buffer in low memory. The use of this buffer could get complicated but will be controlled by configuration options by those who know what they're doing. This buffer could also be shared using a compile-time technique (shown below).
Otherwise, the driver should allocate its own buffer at open time using heap_alloc or seg_alloc.

That's it. The separation of a DMASEG from the floppy driver(s) entirely (since they'll use their own track cache) costs 1K bytes, but keeps things simple by not having the floppy driver(s) get too far in the way of the other drivers' buffer requirements, and allows for the possibility of configuring a single extra shared 1K buffer for DMA or XMS.

This buffer could also be shared using a compile-time technique (shown below).

Have you seen the mechanism used in the ELKS DF driver for configuring its DMA and track cache buffers? This compile-time configurable method allows for using a separate or combined DMA buffer in or outside the track cache (for example purposes):

/* locations for cache and bounce buffers */
#define CACHE_SEG       DMASEG  /* track cache at DMASEG:0 (shared with BIOS driver) */
#define CACHE_OFF       0
#define BOUNCE_SEG      DMASEG  /* share bounce buffer with track cache at DMASEG:0 */
#define BOUNCE_OFF      0

The segment and offset are defined separately for cache and bounce buffers, allowing any memory area to be segmented or overlapped for use as cache or DMA buffer. Later in the driver, the track cache is invalidated only if the track cache is shared with the DMA buffer:

            dma_addr = LINADDR(BOUNCE_SEG, BOUNCE_OFF);
#if (BOUNCE_SEG == CACHE_SEG) && (BOUNCE_OFF == CACHE_OFF)
            invalidate_cache();
#endif

This is just an example of a possible means of allowing a driver's buffers to be fully configurable outside the normal configuration mechanism.
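For comparison, a separated configuration using the same mechanism might look like the sketch below, with a 1K bounce buffer at DMASEG:0 and the track cache starting right after it. The `DMASEG` value, the `LINADDR` definition and the `BOUNCE_SHARES_CACHE` name are assumptions for illustration; only the four `*_SEG`/`*_OFF` names come from the snippet above.

```c
#define DMASEG      0x0080      /* hypothetical start segment */

/* Assumed definition: real-mode segment:offset to linear address. */
#define LINADDR(seg, off)   (((unsigned long)(seg) << 4) + (off))

/* locations for cache and bounce buffers, now non-overlapping */
#define BOUNCE_SEG  DMASEG      /* 1K bounce buffer at DMASEG:0 */
#define BOUNCE_OFF  0
#define CACHE_SEG   DMASEG      /* track cache follows the bounce buffer */
#define CACHE_OFF   1024

/* Evaluates to 1 only when the bounce buffer overlays the start of the cache. */
#if (BOUNCE_SEG == CACHE_SEG) && (BOUNCE_OFF == CACHE_OFF)
#define BOUNCE_SHARES_CACHE 1
#else
#define BOUNCE_SHARES_CACHE 0
#endif
```

With these values the `#if` in the driver is false, so a bounce no longer invalidates the track cache.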

Testing the returned block for wrap, allocating a new one if it failed, then releasing the first. A bit of heap fragmentation will occur, but that would happen if using an ALIGN1K wrapper too.

Nice idea! I'll use it instead of trying to enhance heap_alloc, as the fragmentation result would likely be equal, with the exception of a possible "first-fit" rather than "best-fit" initial placement, as is used now.

If XD disk is configured
If Lance is configured

Agreed. The TLVC configuration will remain a bit more complicated depending on the final choice made for the driver's buffer allocations, but should all still be quite workable. I'm thinking ultimately what we'll really want is a Configuration Guide, or at least some documentation of what options are available on a per-driver basis, if this ever comes up.

if there is an XMS bounce buffer, there is no need for a DMA bounce.

Yes - although probably the other way around: if a DMA buffer is required, it can always also be used for XMS (but not vice versa). Our agreed point is that in all cases, the current drivers use the same buffer for DMA and XMS, I think.

I did allow mixing direct and BIOS drivers for a while in TLVC, then ditched it for exactly this reason: DMASEG sharing became too complicated (or dangerous if not covering all the bases).

Yes, except that for ELKS I think a very useful use case will be using the DF driver as standard simultaneously while allowing the use of the BIOS driver for HD access. This is especially important since the TLVC HD driver(s) are under heavy development and aren't really ready for release yet. By allowing DF and BIOS HD along with a separate DMASEG for BIOS HD, the next ELKS release could move to DF with continued HD reliability. There is some question as to whether the BIOS itself requires a DMA-wrap buffer when requesting HD I/O, but I'm thinking it's not worth taking any chances as to whether PIO is always used internally, as the BIOS could be doing anything.

I came to use no more than 100 XMS buffers because of the syncing - blowing away the IO subsystem for elongated periods.

Geez - that's a serious problem. I've made a note of it. I'm thinking something along the lines of writing a sync_some_buffers(int n) which would effectively be sync_buffers up to a limit of N buffers to get around this big problem. Another thought is to change/add the bootopts sync= option to handle a buffer count, rather than a number of seconds (or a number of seconds and number of buffers).
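A bounded sync could be sketched like this. It is a simplified model, not kernel code: the buffer list is a bare singly linked list, `write_block` only clears the dirty flag and counts, and the extra `head` parameter is added here so the example is self-contained (the proposal above takes only the count).

```c
#include <stddef.h>

struct buf {
    int dirty;                  /* nonzero if the buffer needs writing back */
    struct buf *next;
};

static int blocks_written;      /* stands in for real device writes */

static void write_block(struct buf *b)
{
    b->dirty = 0;
    blocks_written++;
}

/*
 * Write back at most n dirty buffers from the list and return how many
 * were written. Remaining dirty buffers are left for a later call, so
 * one sync pass cannot monopolize the I/O subsystem.
 */
static int sync_some_buffers(struct buf *head, int n)
{
    struct buf *b;
    int written = 0;

    for (b = head; b != NULL && written < n; b = b->next) {
        if (b->dirty) {
            write_block(b);
            written++;
        }
    }
    return written;
}
```

A periodic caller would invoke this repeatedly with a small n, bounding the time spent per pass.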

What do you think, do the above bullet points cover what we need for the next step in separating DMASEG/CACHE buffers into two, or is there more that needs to be added?

Mellvik · 2024-11-03T16:05:04Z

I'll try summarizing (again) what might work for ELKS and TLVC, as I'm still interested in a common driver interface between the systems.

Keep things simple and not overly complicated. The systems remain very configurable, so assumptions do not have to be made for every user/use case. With options in either bootopts, menuconfig, config.h or the drivers themselves, most use requirements can be met.

Both the DF and BIOS FD floppy drivers should manage DMA, XMS and track caching entirely within their own buffer(s). These will likely remain allocated in low memory with a max size fixed at compile time. Various runtime options will allow for changing cache options, decreasing cache sizes, etc.

For drivers that need to handle DMA wrap, and/or for drivers that handle XMS, or for drivers that might be configurable to share a buffer, allocate at compile time a single 1K DMASEG buffer in low memory. The use of this buffer could get complicated but will be controlled by configuration options by those who know what they're doing. This buffer could also be shared using a compile-time technique (shown below).

Otherwise, the driver should allocate its own buffer at open time using heap_alloc or seg_alloc.

Sounds good! Agreed.

Have you seen the mechanism used in the ELKS DF driver for configuring its DMA and track cache buffers? This compile-time configurable method allows for using a separate or combined DMA buffer in or outside the track cache (for example purposes):
/* locations for cache and bounce buffers */
#define CACHE_SEG       DMASEG  /* track cache at DMASEG:0 (shared with BIOS driver) */
#define CACHE_OFF       0
#define BOUNCE_SEG      DMASEG  /* share bounce buffer with track cache at DMASEG:0 */
#define BOUNCE_OFF      0
The segment and offset are defined separately for cache and bounce buffers, allowing any memory area to be segmented or overlapped for use as cache or DMA buffer. Later in the driver, the track cache is invalidated only if the track cache is shared with the DMA buffer:
            dma_addr = LINADDR(BOUNCE_SEG, BOUNCE_OFF);
#if (BOUNCE_SEG == CACHE_SEG) && (BOUNCE_OFF == CACHE_OFF)
            invalidate_cache();
#endif

Pretty neat, I didn't think about that one. That said, I keep thinking that the rarity of a DMA wrap would make it perfectly reasonable to always share the cache and bounce buffer for that purpose and just invalidate the cache and go on in the rare cases when wrap happens.
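For reference, a minimal sketch of what a "DMA wrap" check amounts to: the transfer wraps when its first and last byte fall in different 64K physical pages. The helper names linaddr() and dma_wraps() are made up for this example and are not the drivers' actual functions.

```c
#include <assert.h>
#include <stdint.h>

/* Linear (physical) address of a real-mode segment:offset pair. */
static uint32_t linaddr(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;
}

/*
 * Nonzero if the transfer [addr, addr + count) crosses a 64K physical
 * boundary, in which case the I/O has to go through a bounce buffer.
 */
static int dma_wraps(uint32_t addr, uint16_t count)
{
    assert(count > 0);
    return (addr >> 16) != ((addr + count - 1) >> 16);
}
```

A 1K buffer aligned on a 1K boundary can never wrap, since 64K is a multiple of 1K. That is why aligned buffers count as wrap-protected and only unaligned ones need the bounce path.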

I may have taken it to the extreme in terms of simplicity (and there is no provision for BIOS HD DMA wrap here), but this is the setup I'm testing right now:

#if defined(CONFIG_BLK_DEV_XD) || defined(CONFIG_ETH_LANCE)
#define XD_LANCE_BOUNCE_SEG     DMASEG  /* bounce buffer for XD driver */
#define XD_LANCE_BOUNCE_SEGSZ 0x400
#else
#define XD_LANCE_BOUNCE_SEGSZ   0
#endif

/* Always present, variable size */
#define FLOPPY_CACHE_SEG        (DMASEG + (XD_LANCE_BOUNCE_SEGSZ>>4))
#if defined(CONFIG_FLOPPY_CACHE)
#define FLOPPY_CACHE_SEGSZ      (CONFIG_FLOPPY_CACHE<<9)
#else
#define FLOPPY_CACHE_SEGSZ      0x0400  /* 1 BLOCK (1024B) */
#endif

#define REL_SYSSEG      (DMASEG + \
                        ((XD_LANCE_BOUNCE_SEGSZ+FLOPPY_CACHE_SEGSZ)>>4)) /* kernel code segment */
#define SETUP_DATA      REL_INITSEG
#endif

XD disk and LANCE share the same bounce buffer since they never occur concurrently. Floppy is always there, with at least a 1k bounce/wrap buffer. The rest use heap_alloc().
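A small worked example of this kind of segment arithmetic, using made-up values (the EX_ names and numbers are assumptions for illustration, not taken from any real config.h). Byte sizes are converted to 16-byte paragraphs with >>4 before being added to a segment. Since + binds tighter than >> in C, the shifted sum needs its own parentheses.

```c
#include <assert.h>

/* Assumed example values, for illustration only. */
#define EX_DMASEG        0x0080     /* hypothetical start segment */
#define EX_BOUNCE_SEGSZ  0x0400     /* 1K bounce buffer, in bytes */
#define EX_CACHE_SEGSZ   (12 << 9)  /* twelve 512-byte sectors = 6K */

/* Cache starts right after the bounce buffer. */
#define EX_CACHE_SEG  (EX_DMASEG + (EX_BOUNCE_SEGSZ >> 4))

/* Kernel segment starts after bounce buffer plus cache. */
#define EX_SYSSEG     (EX_DMASEG + ((EX_BOUNCE_SEGSZ + EX_CACHE_SEGSZ) >> 4))

/*
 * How the compiler groups "SEG + (A + B) >> 4" when the inner
 * parentheses are left out: the segment itself gets shifted too.
 */
#define EX_SYSSEG_AS_PARSED \
    ((EX_DMASEG + (EX_BOUNCE_SEGSZ + EX_CACHE_SEGSZ)) >> 4)

/* Segment that follows 'bytes' bytes starting at segment 'seg'. */
static unsigned seg_after(unsigned seg, unsigned bytes)
{
    assert((bytes & 15) == 0);  /* sizes must be whole paragraphs */
    return seg + (bytes >> 4);
}
```

With these values the cache lands at segment 0xC0 and the kernel at 0x240, while the unparenthesized grouping would yield 0x1C8, which falls inside the cache area.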

I'm thinking ultimately what we'll really want is a Configuration Guide, or at least some documentation of what options are available on a per-driver basis, if this ever comes up.

Yes, I think that's necessary - even for ourselves. It's hard to remember all the variants and why this was chosen instead of that - etc.

if there is an XMS bounce buffer, there is no need for a DMA bounce.

Yes - although probably the other way around: if a DMA buffer is required, it can always also be used for XMS (but not vice versa). Our agreed point is that in all cases, the current drivers use the same buffer for DMA and XMS, I think.

Yes - that's my perception too :-)

I did allow mixing direct and BIOS drivers for a while in TLVC, then ditched it for exactly this reason: DMASEG sharing became too complicated (or dangerous if not covering all the bases).

Yes, except that for ELKS I think a very useful use case will be using the DF driver as standard simultaneously while allowing the use of the BIOS driver for HD access. This is especially important since the TLVC HD driver(s) are under heavy development and aren't really ready for release yet.

The IDE driver has been stable for a long time, but you're still right in the sense that the XT-IDE stuff has been added recently. Next step - interrupts - is a big change too, and I'm considering leaving the current driver in place and letting them live alongside each other while testing. Makes life easier in many ways.

By allowing DF and BIOS HD along with a separate DMASEG for BIOS HD, the next ELKS release could move to DF with continued HD reliability. There is some question as to whether the BIOS itself requires a DMA-wrap buffer when requesting HD I/O, but I'm thinking it's not worth taking any chances as to whether PIO is always used internally, as the BIOS could be doing anything.

There is always the XD (MFM drive) case even with BIOS IO, but if you give the HD driver a DMA-safe bounce buffer, it would act as a DMA bouncer for XT class systems (no XMS there), and an XMS bouncer for AT+. The 1k in low mem is cheap for the convenience and the memory has to come from somewhere anyway...

I came to use no more than 100 XMS buffers because of the syncing - blowing away the IO subsystem for elongated periods.

Geez - that's a serious problem. I've made a note of it. I'm thinking something along the lines of writing a sync_some_buffers(int n) which would effectively be sync_buffers up to a limit of N buffers to get around this big problem. Another thought is to change/add the bootopts sync= option to handle a buffer count, rather than a number of seconds (or a number of seconds and number of buffers).

What do you think, do the above bullet points cover what we need for the next step in separating DMASEG/CACHE buffers into two, or is there more that needs to be added?

ghaerr · 2024-11-03T18:08:37Z

The segment and offset are defined separately for cache and bounce buffers, allowing any memory area to be segmented or overlapped for use as cache or DMA buffer.

I keep thinking that the rarity of a DMA wrap would make it perfectly reasonable to always share the cache and bounce buffer for that purpose and just invalidate the cache

Agreed that DMA and XMS can and should always be able to be the same buffer. What the above code does is allow the DF driver to separate its own DMA/XMS buffer out of the track cache or not - currently the DF driver DMA/XMS buffer is always within its track cache.

I plan on doing the same sort of thing in the BIOS FD/HD driver, which will allow either driver to be compiled to include or not include its DMA/XMS buffer within its track cache, to make future configuration(s) easy while we're in the depths of this.

The important part though, after careful inspection of the BIOS FD/HD code - is that ELKS can with a few changes be set up to function with the BIOS, DF or both drivers (permanently allowing either driver to be configured at any time). The enhancement would be configuring the BIOS driver to use a separate DMA/XMS buffer from its track cache and configuring the DF driver to have its DMA/XMS buffer within its track cache. Since the DF driver is "async" (explained in long detail above), it can't share a DMA/XMS buffer with any other driver. But since the BIOS driver is "sync" and BIOS FD will never be used simultaneously with DF, having a separate DMA/XMS buffer for the BIOS driver allows the BIOS HD driver to co-exist with DF and work great. This also solves the ability to access old XD drives as you point out, without having to port the XD driver over. The BIOS DMA/XMS buffer will also be available to be shared with other "sync" drivers. I'm putting together a PR, you'll see the details.

if you give the HD driver a DMA safe bounce buffer, it would act as a DMA bouncer for XT class system (no XMS there), and an XMS bouncer for AT+.

Yes, and this will also work for the BIOS HD driver as described above. I'm going to keep the name of the shareable/sync etc DMA/XMS buffer DMASEG (as you have) which was its original purpose, and rename all other uses to something else (TRACKSEG etc) similar to what you might be doing.

I may have taken it to the extreme in terms of simplicity (and there is no provision for BIOS HD DMA wrap here), but this is the setup I'm testing right now:

Yes, I was thinking similarly, but hope to avoid the #define complexity since ELKS doesn't have the XD or Lance drivers (yet).

The IDE driver has been stable for a long time, but you're still right in the sense that the XT-IDE stuff has been added recently.

There are more issues to be resolved for sharing drivers between TLVC and ELKS. Happy to talk more about them when ready. I would suggest we try getting the NIC drivers shared first, as there are fewer complications and the API is possibly the same.

Mellvik · 2024-11-04T08:23:00Z

I keep thinking that the rarity of a DMA wrap would make it perfectly reasonable to always share the cache and bounce buffer for that purpose and just invalidate the cache

Agreed that DMA and XMS can and should always be able to be the same buffer. What the above code does is allow the DF driver to separate its own DMA/XMS buffer out of the track cache or not - currently the DF driver DMA/XMS buffer is always within its track cache.

Oops; I forgot to add in that one - thanks for the reminder.

The important part though, after careful inspection of the BIOS FD/HD code - is that ELKS can with a few changes be set up to function with the BIOS, DF or both drivers (permanently allowing either driver to be configured at any time). The enhancement would be configuring the BIOS driver to use a separate DMA/XMS buffer from its track cache and configuring the DF driver to have its DMA/XMS buffer within its track cache. Since the DF driver is "async" (explained in long detail above), it can't share a DMA/XMS buffer with any other driver. But since the BIOS driver is "sync" and BIOS FD will never be used simultaneously with DF, having a separate DMA/XMS buffer for the BIOS driver allows the BIOS HD driver to co-exist with DF and work great. This also solves the ability to access old XD drives as you point out, without having to port the XD driver over. The BIOS DMA/XMS buffer will also be available to be shared with other "sync" drivers. I'm putting together a PR, you'll see the details.

That's a good plan. For TLVC all drivers are considered async (IDE driver soon), and need their own bounce buffer. That may sound like a lot but isn't. The FD driver gets its own, the XD or Lance get one and that's it. The non-DMA drivers use heap_alloc(). Getting there. I'm also leaning towards limiting the max allocatable FDcache to 6 or 7k, as there is no benefit (except experimentation) in making it larger...

ghaerr · 2024-11-05T01:13:22Z

I'm also leaning towards limiting the max allocatable FDcache to 6 or 7k, as there is no benefit (except experimentation) to make it larger...

IMO, the ability for continued experimentation is a good thing, especially up to at least a full track limit on 1440k drives. Why artificially limit what a user wants to configure, unless it breaks the system?

Mellvik · 2024-11-05T08:27:27Z

I'm also leaning towards limiting the max allocatable FDcache to 6 or 7k, as there is no benefit (except experimentation) to make it larger...

IMO, the ability for continued experimentation is a good thing, especially up to at least a full track limit on 1440k drives. Why artificially limit what a user wants to configure, unless it breaks the system?

Many reasons, actually - the most important being that rigorous testing has shown a larger cache is not meaningful. Then there are these:

- Settings tend to be forgotten. Setting it (CONFIG) to 9 and using 6k or less is a waste.
- Most users have no way of knowing (short of reading still non-existent documentation) that bigger is 'badder', so lowering the limit is helpful.
- Those capable of doing meaningful testing will easily be able to change the hard limit.
- The new setup will - in the presence of either XD or Lance plus the always-present floppy - have 2 bounce buffers allocated. Hard-limiting the cache to (say) 7 keeps us within the 'old' 9k limit, whatever that's worth.

A KB here and a KB there. BTW the upcoming directfd driver update introduces conditionals around the cache code so it will automatically be removed if CONFIG_FLOPPY_CACHE is 0. Not a KB, maybe 150 bytes ...

ghaerr · 2024-11-05T14:39:31Z

I'm also leaning towards limiting the max allocatable FDcache to 6 or 7k

Sorry, I misunderstood your statement. Agreed the floppy cache should be set by default to the optimum results from your testing. I plan on following suit after making small updates to the DF driver. The BIOS driver is a bit more complicated and may be better left as-is, since its cache design is less amenable to a fixed-size cache starting at a particular sector number with MT track wrap.

Do current test results show that a 6k cache gives the best results, or is it a different number?

A KB here and a KB there

Totally agree!

Mellvik · 2024-11-05T15:33:41Z

It's safe to say that 6 is best. 7 is marginally better for some combinations but the margin is below noticeable. I've chosen 6 as the default in menuconfig; anything above 7 will be clamped to 7 in config.h.
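A minimal sketch of the "above 7 becomes 7" rule, assuming the configured value is a size in K and that 0 disables the cache. The FD_CACHE_ names and the helper function are hypothetical; in config.h itself this would more likely be done with preprocessor conditionals.

```c
#include <assert.h>

#define FD_CACHE_MAX_K  7   /* hard upper limit discussed above */
#define FD_CACHE_DEF_K  6   /* default chosen from the test results */

/*
 * Clamp a requested floppy cache size (in K) to the hard limit.
 * A request of 0 is passed through and means "no cache".
 */
static int fd_cache_size_k(int requested_k)
{
    assert(requested_k >= 0);
    return requested_k > FD_CACHE_MAX_K ? FD_CACHE_MAX_K : requested_k;
}
```

Anyone who wants to experiment with larger caches only has to raise FD_CACHE_MAX_K in one place.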

CONFIGuring out the floppy cache turned out to save ~400 code bytes, 32 data bytes :-)

ghaerr added 2 commits October 31, 2024 13:47

Add 'buildimages.sh fast' for fast uncompressed builds for emulator testing

Add dosbox.sh for PC-98 image testing
Fix qemu.sh for macOS
Enhance emu86.sh for use after buildimages.sh

a0398e7

Use calculations in config.h for DMASEGSZ and DMASEGEND

b1b96cb

ghaerr merged commit 57236d9 into master Nov 1, 2024
2 checks passed

ghaerr deleted the config branch November 1, 2024 01:48

ghaerr mentioned this pull request Nov 4, 2024

[kernel] Split XMS/DMA buffer from track cache for BIOS and DF drivers #2094

Merged

Mellvik mentioned this pull request Nov 7, 2024

[kernel] Reworked the use of lowmem DMASEG in xd and floppy driver + more Mellvik/TLVC#103

Open
