v3 Planning & Objectives #194

jeremy-visionaid · 2024-10-13T21:48:44Z

Now that v2.4 looks like it's coming together, I was wondering what folks' thoughts were on objectives for v3?

If I understand correctly, the values/objectives for v2 are:

Pure dotnet implementation
Maximized client compatibility (i.e. netstandard2.0, net4.0)
Easy traversal of storages/streams
Easy manipulation of stream data

There are some goals I have in mind for v3:

Support 16 TB files (i.e. the maximum 0xFFFFFFFA sector count and therefore uint sector IDs)
Support transactions (e.g. scratch data rather than snapshot copy)
Support consolidation on commit (e.g. online rather than copy)
Revised API to follow dotnet conventions (e.g. CFStream by implementing Stream directly instead of via a decorator)
Idiomatic exception hierarchy (Review exception hierarchy for v3 #146)
Improved performance
Reduced memory usage
Nullable attributes/static analysis
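
For reference, the 16 TB figure can be checked with a little arithmetic. The sketch below is in Python rather than C# purely for brevity and is not code from OpenMcdf; the function names are made up for illustration. It assumes the v4 sector size of 4096 bytes and the MAXREGSECT constant from the MS-CFB specification, and it also shows why sector IDs near the top of the range do not fit in a signed 32-bit integer.

```python
# Constants as given in the MS-CFB specification.
MAXREGSECT = 0xFFFFFFFA   # regular sector IDs run from 0 to MAXREGSECT - 1
SECTOR_SIZE_V4 = 4096     # version 4 compound files use 4096-byte sectors
INT32_MAX = 0x7FFFFFFF


def max_sector_bytes(sector_count: int = MAXREGSECT,
                     sector_size: int = SECTOR_SIZE_V4) -> int:
    """Total bytes addressable by the given number of sectors."""
    return sector_count * sector_size


def fits_in_int32(sector_id: int) -> bool:
    """True if the sector ID could be stored in a signed 32-bit integer."""
    return 0 <= sector_id <= INT32_MAX


total = max_sector_bytes()
print(total)                          # 17592186019840 bytes
print(2**44 - total)                  # 24576 bytes short of 16 TiB
print(fits_in_int32(MAXREGSECT - 1))  # False: hence unsigned sector IDs
```

The header occupies one further sector-sized block in front of sector 0, so the total file size is slightly larger than the sector data alone.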

Other thoughts:

Multi-targeting for netstandard2.0 and net8.0
Spans (System.Memory if targeting netstandard2.0)
async (Currently no async BinaryReader/Writer: Add async overloads to BinaryReader/Writer dotnet/runtime#17229)

jeremy-visionaid · 2024-10-13T21:59:45Z

As I began to understand the OpenMcdf code better, and more generally the CFB format, while investigating #184, I was thinking that it might be quite difficult to add the features I've previously mentioned to v2. So, at the end of last week I took the liberty of starting some proof of concept work on a new approach for version 3. I've adapted/written code to parse headers and directory entries, enumerate FAT sectors, traverse sector chains and read the contents of a stream. There's obviously quite a lot of functionality missing (substorage traversal, reading mini FAT, any kind of writing), but it is at least enough to spin up the equivalent of the "InMemory" benchmark:

Windows Structured Storage (ILockBytes over a MemoryStream)

| Method | BufferSize | TotalStreamSize | Mean | Error | StdDev | Allocated |
| --- | --- | --- | --- | --- | --- | --- |
| Test | 1048576 | 1048576 | 215.4 us | 2.25 us | 2.11 us | 440 B |
| Test | 524288 | 1048576 | 215.6 us | 4.07 us | 4.36 us | 440 B |
| Test | 262144 | 1048576 | 212.6 us | 4.06 us | 4.51 us | 440 B |
| Test | 131072 | 1048576 | 211.4 us | 1.71 us | 1.42 us | 440 B |
| Test | 4096 | 1048576 | 205.6 us | 1.78 us | 1.58 us | 440 B |
| Test | 1024 | 1048576 | 237.9 us | 4.69 us | 4.39 us | 440 B |
| Test | 512 | 1048576 | 307.4 us | 6.02 us | 9.72 us | 440 B |

OpenMcdf v2.3.1

| Method | BufferSize | TotalStreamSize | Mean | Error | StdDev | Gen0 | Gen1 | Gen2 | Allocated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Test | 1048576 | 1048576 | 187.9 us | 3.71 us | 8.14 us | 137.6953 | 68.6035 | - | 1.1 MB |
| Test | 524288 | 1048576 | 196.9 us | 3.85 us | 5.39 us | 140.1367 | 55.1758 | - | 1.12 MB |
| Test | 262144 | 1048576 | 219.8 us | 2.81 us | 2.76 us | 144.2871 | 72.0215 | - | 1.15 MB |
| Test | 131072 | 1048576 | 276.2 us | 5.04 us | 4.47 us | 152.8320 | 76.1719 | - | 1.22 MB |
| Test | 4096 | 1048576 | 3,899.4 us | 74.65 us | 69.83 us | 671.8750 | 335.9375 | - | 5.41 MB |
| Test | 1024 | 1048576 | 15,455.3 us | 141.46 us | 125.40 us | 2281.2500 | 375.0000 | 156.2500 | 18.37 MB |
| Test | 512 | 1048576 | 30,795.2 us | 614.23 us | 880.90 us | 4468.7500 | 437.5000 | 187.5000 | 35.66 MB |

OpenMcdf v3 Proof of Concept

| Method | BufferSize | TotalStreamSize | Mean | Error | StdDev | Gen0 | Allocated |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Test | 1048576 | 1048576 | 59.28 us | 0.150 us | 0.125 us | 0.1221 | 1.96 KB |
| Test | 524288 | 1048576 | 60.57 us | 0.117 us | 0.092 us | 0.1831 | 1.96 KB |
| Test | 262144 | 1048576 | 59.48 us | 0.099 us | 0.083 us | 0.1831 | 1.96 KB |
| Test | 131072 | 1048576 | 60.04 us | 0.790 us | 0.739 us | 0.1831 | 1.96 KB |
| Test | 4096 | 1048576 | 62.56 us | 1.119 us | 1.046 us | 0.1221 | 1.96 KB |
| Test | 1024 | 1048576 | 76.02 us | 0.265 us | 0.248 us | 0.1221 | 1.96 KB |
| Test | 512 | 1048576 | 75.58 us | 0.997 us | 0.933 us | 0.1221 | 1.96 KB |

So, there are some pretty big wins to be had on reading, both in performance (around 400x faster for short reads) and in memory (Gen0 GCs are drastically reduced, and Gen1/2 GCs are eliminated), while also enforcing reasonably strict validation.

In the proof of concept, BinaryReader and BinaryWriter are extended to handle CFB types, and there is only one reader and one writer stored in a context (along with the header) that is shared across objects that need access to it. Sectors are lightweight structs mostly to record their ID and map to their position within the CFB stream/file. There are a couple of enumerators that do the main work:
FatSectorEnumerator: Enumerates the FAT sectors from the Header's DIFAT array and the DIFAT chain.
FatSectorChainEnumerator: Enumerates a chain of FAT sectors for an entry/directory/FAT
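
To make the chain enumeration idea concrete, here is a minimal sketch of walking a FAT sector chain. It is written in Python for brevity and is not the proof of concept's code: `iter_chain` and `sector_offset` are hypothetical names, and the FAT is assumed to be a small in-memory list, which the real enumerators would not rely on. The ENDOFCHAIN and MAXREGSECT values, and the offset rule (the header occupies the first sector-sized block), follow the MS-CFB specification.

```python
from typing import Iterator, Sequence

ENDOFCHAIN = 0xFFFFFFFE   # marks the last sector of a chain
MAXREGSECT = 0xFFFFFFFA   # regular sector IDs are below this value


def sector_offset(sector_id: int, sector_size: int) -> int:
    """Byte position of a sector; sector 0 starts right after the header."""
    return (sector_id + 1) * sector_size


def iter_chain(fat: Sequence[int], start: int) -> Iterator[int]:
    """Yield the sector IDs of a chain, throwing on invalid or cyclic chains."""
    current = start
    visited = 0
    while current != ENDOFCHAIN:
        if current >= MAXREGSECT or current >= len(fat):
            raise ValueError(f"invalid sector ID 0x{current:08X}")
        if visited >= len(fat):
            # A valid chain cannot be longer than the FAT itself.
            raise ValueError("sector chain contains a cycle")
        yield current
        visited += 1
        current = fat[current]  # each FAT entry holds the next sector ID
```

For example, with `fat = [2, ENDOFCHAIN, 1, 0xFFFFFFFF]`, `list(iter_chain(fat, 0))` gives `[0, 2, 1]`, and no list of sectors is built unless the caller asks for one.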

Although I haven't done any code to write data yet, I'm thinking the enumerators might be converted to mutable iterators which should also be reasonably fast/efficient. I'll share some code when I've cleaned it up and progressed a bit further!

ironfede · 2024-10-14T20:01:04Z

> Now that v2.4 looks like it's coming together, I was wondering what folks' thoughts were on objectives for v3?
>
> If I understand correctly, the values/objectives for v2 are:
>
> Pure dotnet implementation
> Maximized client compatibility (i.e. netstandard2.0, net4.0)
> Easy traversal of storages/streams
> Easy manipulation of stream data
>
> There are some goals I have in mind for v3:
>
> Support 16 TB files (i.e. the maximum 0xFFFFFFFA sector count and therefore uint sector IDs)
> Support transactions (e.g. scratch data rather than snapshot copy)
> Support consolidation on commit (e.g. online rather than copy)
> Revised API to follow dotnet conventions (e.g. CFStream by implementing Stream directly instead of via a decorator)
> Idiomatic exception hierarchy (Review exception hierarchy for v3 #146)
> Improved performance
> Reduced memory usage
> Nullable attributes/static analysis
>
> Other thoughts:
>
> Multi-targeting for netstandard2.0 and net8.0
> Spans (System.Memory if targeting netstandard2.0)
> async (Currently no async BinaryReader/Writer: Add async overloads to BinaryReader/Writer dotnet/runtime#17229)

Honestly speaking... it's a perfect summary.
I think that the 2.4 target is almost reached. I would not introduce new features in this branch since it has reached a certain maturity level. v3 should take the good ideas from it and refactor them to allow a better separation of logic and to avoid the up-and-down runs needed to allocate and persist sector chains, since that is where the big performance penalties lie, even if it is a somewhat compact representation and working unit for CFB handling.

jeremy-visionaid · 2024-10-14T20:27:39Z

@ironfede Yesterday, I added a dedicated enumerator for directory entries in a FAT chain and another enumerator for the directory tree, so it can now traverse storages and streams as part of a tree (it could only traverse them as a list before). I also improved the enumerators so they're a bit easier to follow and improved validation (enumerators and sector offsets throw if you try to access something that's invalid/out of bounds). I'll have a look at implementing support for mini FAT sectors today, then I think I'll be at the point where I'll have something worth sharing for comments and feedback. But essentially, aside from some clean-up and further validation/testing, I think the POC already meets the following objectives:

Support 16 TB files (i.e. the maximum 0xFFFFFFFA sector count and therefore uint sector IDs)
Revised API to follow dotnet conventions (e.g. CFStream by implementing Stream directly instead of via a decorator)
Idiomatic exception hierarchy (Review exception hierarchy for v3 #146)
Improved performance
Reduced memory usage
Nullable attributes/static analysis
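
As a rough illustration of the difference between traversing directory entries as a list and as a tree, the sketch below walks the sibling and child links that each CFB directory entry carries. It is Python rather than C# and is not taken from the POC; `DirEntry` and `walk` are hypothetical names and the entries are assumed to be already parsed into memory. NOSTREAM (0xFFFFFFFF) is the MS-CFB marker for "no entry".

```python
from dataclasses import dataclass
from typing import Iterator, Optional, Sequence, Set, Tuple

NOSTREAM = 0xFFFFFFFF  # MS-CFB marker for a missing sibling/child


@dataclass(frozen=True)
class DirEntry:
    name: str
    left: int = NOSTREAM    # left sibling ID
    right: int = NOSTREAM   # right sibling ID
    child: int = NOSTREAM   # root of the child tree (storages only)


def walk(entries: Sequence[DirEntry], entry_id: int, depth: int = 0,
         seen: Optional[Set[int]] = None) -> Iterator[Tuple[int, str]]:
    """Yield (depth, name) in order: left siblings, entry, children, right siblings."""
    if seen is None:
        seen = set()
    if entry_id == NOSTREAM:
        return
    if not 0 <= entry_id < len(entries):
        raise ValueError(f"directory entry ID {entry_id} is out of bounds")
    if entry_id in seen:
        raise ValueError(f"directory entry ID {entry_id} is referenced twice")
    seen.add(entry_id)
    entry = entries[entry_id]
    yield from walk(entries, entry.left, depth, seen)
    yield depth, entry.name
    yield from walk(entries, entry.child, depth + 1, seen)
    yield from walk(entries, entry.right, depth, seen)
```

Calling `walk(entries, 0)` on a parsed directory starts at the root entry and yields each storage or stream together with its nesting depth. A recursive walk like this is only a sketch; an enumerator for very deep or very large trees would more likely keep an explicit stack.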

jeremy-visionaid · 2024-10-14T20:58:00Z

@Numpsy Looks like you might have a particular interest in the OLE Property Set Data Structures? Do you have anything you'd like to see for v3? I can't say I know too much about it, so aside from some nit-pick refactoring work my only real comment is that perhaps OpenMcdf.Extensions should be renamed to OpenMcdf.Ole (i.e. explicitly about OLE only). Especially since we probably won't require a decorator for streams in v3.

Numpsy · 2024-10-14T22:38:10Z

My current use case is reading and writing metadata (summary information etc) in Office documents, so things like really massive files aren't really an issue.
The recent changes have shaved a nice amount of memory allocations from reading said properties, but as the time taken is sub-millisecond it's hard to measure changes (though any gain is nice).

As far as the API goes, I think it would be nice to review how it presents the different property sets - as well as SummaryInformation/DocumentSummaryInformation, there is some amount of support for others, but it's not clear how far the intended support goes.
@farfilli has raised some issues in that area, so maybe he has some thoughts for possible improvements?

Numpsy · 2024-10-14T22:45:19Z

There's also scope for a more complete set of functions for adding/updating/deleting properties (e.g. #190) - that's a higher level API than the changes to the storage part.

farfilli · 2024-10-15T06:58:37Z

> My current use case is reading and writing metadata (summary information etc) in Office documents, so things like really massive files aren't really an issue. The recent changes have shaved a nice amount of memory allocations from reading said properties, but as the time taken is sub-millisecond it's hard to measure changes (though any gain is nice).
>
> As far as the API goes, I think it would be nice to review how it presents the different property sets - as well as SummaryInformation/DocumentSummaryInformation, there is some amount of support for others, but it's not clear how far the intended support goes. @farfilli has raised some issues in that area, so maybe he has some thoughts for possible improvements?

My use case is similar to @Numpsy's, but on Solid Edge (CAD) documents. Besides the standard SummaryInformation/DocumentSummaryInformation, some more app-specific metadata streams need to be accessed. Regular documents contain one part per file, but some documents may contain more than one part; this means the file's structure becomes multilevel, and each part has its own SummaryInformation/DocumentSummaryInformation and so on.
It would be nice to have methods to identify these situations.

If needed I can provide example files with an explanation of their structure.

jeremy-visionaid · 2024-10-15T08:27:43Z

@farfilli Sure, I'd love to get some test documents from other apps for test purposes if they're not massive. My proof of concept also addresses #58, and I've been thinking about ways to cover #66 too which might also address your use case.

farfilli · 2024-10-15T08:58:56Z

@jeremy-visionaid that's great, let me know where to upload the files along with a brief description of them

jeremy-visionaid · 2024-10-15T09:01:20Z

@farfilli Maybe you could push them on a branch of a fork and put some descriptions in the commit message?

farfilli · 2024-10-15T09:41:03Z

@jeremy-visionaid done in 0c8fc18

jeremy-visionaid · 2024-10-16T04:16:10Z

Okeydoke, I think I've progressed with the v3 proof of concept far enough that I'm happy to push it for some early feedback.
It's a fresh branch that doesn't share any common commits with the upstream (it helped me having the projects side-by-side):
https://github.com/Visionaid-International-Ltd/openmcdf/tree/3.0-poc

There are some notable things still missing:

Comments 😅
It's currently read-only
Missing some argument validation (e.g. ArgumentNullException)
Some data/format validation
A FileFormatException or equivalent class (esp. catching ArgumentExceptions and rethrowing for corrupt files)
Red-black tree search
Some things are constant that could/should be taken from the header (but would otherwise be considered corrupt according to the spec)
Project setup stuff (e.g. Licensing, spell checking, CI etc.)
EntryInfo is sparse (only the name is provided to the client)

I'll at least try and add some comments tomorrow, and see if I can add some additional features as above through the week, but comments and feedback would be appreciated!

jeremy-visionaid · 2024-10-17T08:51:51Z

I've pushed some changes with some refactoring, improved terminology, filled out the summary comments and improved validation for corrupt CFBs. The reading part at least should be in pretty good shape if anyone wants to take it for a spin and try it out. I could bring the structured storage explorer across if it's of interest? Otherwise it's just unit tests right now.

ironfede · 2024-10-17T11:17:00Z

Thank you @jeremy-visionaid for all your work. I'm already working on a base v3, including some deep structural changes to how sector chains and streams in general are manipulated, to lay the groundwork for a future SWMR scenario and for better performance in write operations. I hope to put a draft online shortly, since I understand that there's (rightly) a little bit of pressure on this... :-)

jeremy-visionaid · 2024-10-18T02:30:35Z

@ironfede That's no worries, it's been quite fun! I did consider trying to work with v2.4 as a base, but I've handled sectors, chains and streams in a totally different way, changed and improved the error handling etc., and even the terminology for types and members is a bit different. Since I'm the one that wants most of the v3 features and robustness etc., I'm also happy to do the bulk of the work, but I think it was just more efficient for me to start afresh in order to get there.

I'd be curious as to what things you were thinking about with regards to SWMR? Although it's not a goal I had in mind, I think the architecture of what I've done in the proof of concept would likely lend itself quite well to that situation already. I have been thinking ahead a bit towards transactions, consolidation and scratch data, etc., but I only really have an idea about how they could be achieved with the POC.

It's not so much that there's a huge amount of pressure to get v3 out immediately. I can probably work around the transactions and consolidation stuff on the client side. Releasing v2.4.0 would be helpful though, since v2.3 currently crashes our app, so I can't yet merge the changes to use it. Updating to v3 I hope would improve the performance, enable larger files and reduce memory and disk space requirements, etc, but it otherwise might not be a release blocker. (Though before we go to RTM with our application, I do need to validate some things with 2.4 - or possibly even 3.0 instead depending on how things go on that front).

Though I've been somewhat busy with other tasks this week, I hope I might find some time to bring the POC up to feature parity with v2.4 around the middle of next week, but I'd also love to see your draft 3.0 to see how it compares.

ironfede · 2024-10-19T21:51:43Z

There are some points in a version 3 that are honestly a little bit difficult to implement in a clean way. The first is supporting LARGE files for v4 CFB (16 TB streams, as the specs define). This point alone means that a chain, before counting any data, could need more than 4 billion indexes, which is an absolutely huge number of sector indexes to support. So I'm thinking about how to load chains partially, by sections of requested sectors, in order to manipulate only the sections required for a read/write. The natural "old" way to handle this kind of thing would be memory-mapped files, but that means all structures would need to be rethought in terms of structs with no references at all, C-style, and this could be difficult to maintain in the long run. On the other hand, using a Stream-based approach, I think some functions to obtain partial chain sections should be introduced...
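
One way to picture "partial chain sections" is sketched below, in Python for brevity; it is not code from either v3 effort. The names `iter_chain_lazy`, `chain_section`, `sectors_for_range` and the `next_sector` callback are hypothetical. The callback is assumed to look up a single FAT entry on demand (for example by seeking into the file), so neither the whole FAT nor the whole chain is held in memory.

```python
from itertools import islice
from typing import Callable, Iterator, List, Tuple

ENDOFCHAIN = 0xFFFFFFFE  # MS-CFB end-of-chain marker


def iter_chain_lazy(next_sector: Callable[[int], int], start: int) -> Iterator[int]:
    """Yield sector IDs one at a time, fetching each FAT entry only when needed."""
    current = start
    while current != ENDOFCHAIN:
        yield current
        current = next_sector(current)


def sectors_for_range(position: int, length: int, sector_size: int) -> Tuple[int, int]:
    """Return (first chain index, sector count) covering a byte range of a stream."""
    first = position // sector_size
    if length <= 0:
        return first, 0
    last = (position + length - 1) // sector_size
    return first, last - first + 1


def chain_section(next_sector: Callable[[int], int], start: int,
                  first_index: int, count: int) -> List[int]:
    """Materialize only the requested section of a chain."""
    chain = iter_chain_lazy(next_sector, start)
    return list(islice(chain, first_index, first_index + count))
```

For a read of 1000 bytes at position 600 with 512-byte sectors, `sectors_for_range(600, 1000, 512)` returns `(1, 3)`, so only three sector IDs are materialized. Reaching the section still means following the links from the start of the chain, so a real implementation would probably want to cache positions along the way; validation and cycle checks are also left out here.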

molesmoke · 2024-10-20T00:35:54Z

That's actually largely the point of why I rewrote the sector/chain/stream handling in the proof of concept. The POC is designed/intended to handle arbitrarily long files/chains/streams, but I couldn't really see how to adapt v2.4 to handle that... I made a start on handling writing, but it's limited to in-place alterations at the moment. Allowing arbitrarily large writes while deferring commits is still to do, but I think I might have that going in another day or two.

ironfede · 2024-10-20T09:31:34Z

@jeremy-visionaid, @molesmoke, @Numpsy and everybody interested, I've added a 3.0 branch.
If you want, please start opening PRs against this branch for POCs so we have common ground in this repository.

Many thanks to everybody!

jeremy-visionaid changed the title ~~v3 Planning~~ v3 Planning & Objectives Oct 13, 2024

jeremy-visionaid mentioned this issue Oct 16, 2024

v3 Proof of Concept #199

Draft

ironfede added this to the 3.0 milestone Oct 20, 2024
