
HashStore Utility/Converter #95

Open
doulikecookiedough opened this issue Jul 24, 2024 · 14 comments

After further discussion with the team, we have confirmed that HashStore will assist the client, Metacat, in the migration process for existing data and metadata objects by providing a new utility class. Metacat will coordinate the iterative process so that the respective default checksum list for each data object is stored. This class, HashStoreConverter, has a single convert method:

  • convert(Path pathToDoc, String pid, Stream sysmeta)
    • Creates a stream from the path to the object, generates the 5 default checksums, creates a hard link to the existing data or metadata object at its permanent address, tags the pid and data object, and stores the system metadata.
  • Note: If a given path is null but system metadata is present, we will store only the system metadata for the given pid.
  • Note 2: The process may fail, in which case Metacat may attempt to restore hard links for the entire set of data and documents. To save time, we need to check for an existing data object/pid relationship before attempting the hard-linking process.

This utility class will need to call a new class, FileHashStoreLinks, which extends FileHashStore. We will have a new public method storeHardLink(...) that follows the same flow as the existing storeObject, except that it eventually arrives at an overridden move method, which calls Files.createLink instead of Files.move.

// Files.move(sourceFilePath, targetFilePath, StandardCopyOption.ATOMIC_MOVE);
// Files.createLink(Path link, Path existing): the first argument is the new link to create,
// the second is the existing file, so the permanent address (target) becomes a hard link
// to the source file.
Files.createLink(targetFilePath, sourceFilePath);
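
For illustration, here is a minimal, self-contained sketch of that link-instead-of-move idea (the class and method names are hypothetical, not the actual FileHashStore internals); it also checks whether the permanent address already exists, per the note above about verifying an existing data object/pid relationship before linking:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public final class HardLinkMoveSketch {

        // Hypothetical helper: places the object at its permanent address as a hard link
        // to the existing file instead of moving (re-writing) the bytes.
        public static void linkInsteadOfMove(Path sourceFilePath, Path targetFilePath)
                throws IOException {
            if (Files.exists(targetFilePath)) {
                // Object already in place (e.g., from a previous conversion attempt) - skip.
                return;
            }
            Files.createDirectories(targetFilePath.getParent());
            // Files.createLink(link, existing): 'targetFilePath' becomes a hard link that
            // shares the same inode as 'sourceFilePath'.
            Files.createLink(targetFilePath, sourceFilePath);
        }
    }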

Process Diagram (via @taojing2002 and @artntek)
[Image: HashStore Utility Design, Page 1]

To Do:

  • Create two new classes in a package under hashstore called 'hashstoreconverter'
    • HashStoreConverter
      • convert(Path pathToDoc, String pid, Stream sysmeta)
    • FileHashStoreLinks
      • storeHardLink(InputStream object, String pid)
      • move(Path sourcePath, Path targetPath)
  • Write new JUnit tests to confirm functionality (see the example test sketch after this list)
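
For the JUnit item, a self-contained example of the kind of assertion worth making (the test class name is hypothetical; fileKey() can be null on some filesystems, so that check is oriented at the POSIX filesystems we run on):

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import static org.junit.jupiter.api.Assertions.assertTrue;

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.attribute.BasicFileAttributes;

    import org.junit.jupiter.api.Test;
    import org.junit.jupiter.api.io.TempDir;

    class HardLinkSketchTest {

        @TempDir
        Path tempDir;

        // A hard link and its original must resolve to the same underlying file (inode).
        @Test
        void hardLinkPointsToSameUnderlyingFile() throws Exception {
            Path original = Files.writeString(tempDir.resolve("data.bin"), "test data");
            Path link = tempDir.resolve("hardlink.bin");
            Files.createLink(link, original);

            assertTrue(Files.isSameFile(original, link));
            Object originalKey = Files.readAttributes(original, BasicFileAttributes.class).fileKey();
            Object linkKey = Files.readAttributes(link, BasicFileAttributes.class).fileKey();
            assertEquals(originalKey, linkKey);
        }
    }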
@doulikecookiedough doulikecookiedough self-assigned this Jul 24, 2024

doulikecookiedough commented Jul 25, 2024

Unfortunately, simply overriding move will not be enough to address this issue. To create a proper hard link, the original supplied path pathToDoc must be used as the source.

Using the existing storeObject flow and overriding the final move operation will actually only create a link between the generated tmpFile and the target permanent address. As a result, this link points to the new inode/data blocks allocated for the tmpFile, rather than referencing the original inode of pathToDoc.

ZenUML diagram source:
    title HashStoreConverter Process
    Client->HashStoreConverter.convert(Path existDoc,pid, Stream sysmeta) {
      "new InputStream existDocStream"
      FileHashStoreLinks.storeHardLink(existDocStream, pid) {
        
        FileHashStore.storeObject(existDocStream, pid) {
          syncPutObject
          writeToTmpFileAndGenerateChecksums
          move
          // -
          // Override 'move'
          // Creates a hard link
          FileHashStoreLinks->FileHashStoreLinks
          return ObjectMetadata
        }
 
        FileHashStore.storeMetadata(sysmeta, pid) {
        // -
        // If sysmeta fails to store,
        // an exception will be thrown.
        // -
        // The hard link/tags created
        // for the data obj will remain.
        return pathToSysmeta
        }
        return ObjectMetadata
      }
      return ObjectMetadata
    }

Instead of overriding move, we will follow the storeObject flow indirectly (as in the scenario where a client receives a data stream before its metadata): rather than calling storeObject, we will call writeToTmpFileAndGenerateChecksums directly.

Since we are really only after the map of checksums/hex digests, there is no need to follow the storeObject synchronization process; the tmpFile being written to is discarded afterwards. Once that completes, we can call tagObject (which is synchronized/thread-safe), then create the hard link, and lastly store the sysmeta.

ZenUML diagram source (new flow):
    title HashStoreConverter Process
    Client->HashStoreConverter.convert(Path existDoc,pid, Stream sysmeta) {
      "new InputStream existDocStream"
      FileHashStoreLinks.storeHardLink(existDoc, existDocStream, pid) {
        FileHashStore.generateTmpFile {
            return tmpFile
        }
        FileHashStore.writeToTmpFileAndGenerateChecksums {
          return hexDigests/Checksums
        }
        delete(tmpFile)
        // -
        // 'tagObject' is synchronized/thread safe
        FileHashStore.tagObject(pid, cid)
   
        FileHashStore.getHashStoreDataObjectPath(cid) {
            return cidExpectedObjectPath
        }

        createHardLink(existDoc, cidExpectedObjectPath)
        FileHashStore.storeMetadata(sysmeta, pid) {
        // -
        // If sysmeta fails to store,
        // an exception will be thrown.
        // -
        // The hard link/tags created
        // for the data obj will remain.
        return pathToSysmeta
        }
        return ObjectMetadata
      }
      "existDocStream.close()"
      return ObjectMetadata
    }


taojing2002 commented Jul 25, 2024 via email


doulikecookiedough commented Jul 25, 2024

Thank you for the quick feedback @taojing2002! I believe we actually don't need to override/touch the move method at all. The HashStoreConverter.convert method will make two calls to FileHashStoreLinks (a rough sketch follows this list):

  • storeHardLink(Path existDoc, InputStream existDocStream, String pid)
    • This will get the 5 default checksums by writing into and then deleting a tmpFile
    • Then it will call tagObject and finally create the actual hard link
      • I am contemplating creating the hard link before calling tagObject
  • storeMetadata(InputStream sysmeta, String pid)
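
A minimal sketch of that convert flow, assuming the signatures above (the LinksStore interface below is a hypothetical stand-in for FileHashStoreLinks, and the Object return type stands in for ObjectMetadata, just to keep the example self-contained):

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class ConvertSketch {

        // Hypothetical stand-in for the FileHashStoreLinks surface used by convert().
        interface LinksStore {
            Object storeHardLink(Path existDoc, InputStream existDocStream, String pid) throws Exception;
            Path storeMetadata(InputStream sysmeta, String pid) throws Exception;
        }

        // Sketch of HashStoreConverter.convert: hard link the object first, then store sysmeta.
        static Object convert(LinksStore store, Path existDoc, String pid, InputStream sysmeta)
                throws Exception {
            Object objectMetadata = null;
            if (existDoc != null) {
                // try-with-resources closes the object stream once storeHardLink returns
                try (InputStream existDocStream = Files.newInputStream(existDoc)) {
                    objectMetadata = store.storeHardLink(existDoc, existDocStream, pid);
                }
            }
            // If existDoc is null but sysmeta is present, only the sysmeta is stored.
            try (InputStream sysmetaStream = sysmeta) {
                store.storeMetadata(sysmetaStream, pid);
            }
            return objectMetadata;
        }
    }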

Updated zenuml diagram

ZenUML diagram source (converter process, updated):
    title HashStoreConverter Process
    Client->HashStoreConverter.convert(Path existDoc,pid, Stream sysmeta) {
      "new InputStream existDocStream"
      FileHashStoreLinks.storeHardLink(existDoc, existDocStream, pid) {
        FileHashStore.generateTmpFile {
            return tmpFile
        }
        FileHashStore.writeToTmpFileAndGenerateChecksums {
          return hexDigests/Checksums
        }
        delete(tmpFile)
        // -
        // 'tagObject' is synchronized/thread safe
        FileHashStore.tagObject(pid, cid)
   
        FileHashStore.getHashStoreDataObjectPath(cid) {
            return cidExpectedObjectPath
        }

        createHardLink(existDoc, cidExpectedObjectPath)
        return ObjectMetadata
      }
        // -
        // Close object stream
        "existDocStream.close()"
        FileHashStoreLinks.storeMetadata(sysmeta, pid) {
        // -
        // If sysmeta fails to store,
        // an exception will be thrown.
        // -
        // The hard link/tags created
        // for the data obj will remain.
        FileHashStore.storeMetadata(sysmeta, pid) {
            return pathToSysmeta
        }
            return pathToSysmeta
        }
        // -
        // - Close sysmeta stream
        "sysmeta.close()"
        return ObjectMetadata
    }


taojing2002 commented Jul 25, 2024 via email


mbjones commented Jul 26, 2024

@doulikecookiedough Some idle noodling on all this, and how long it will take...

Process is still writing, not linking

The purpose of the use of a hard link is to completely avoid having to rewrite the data to disk. If you call writeToTmpFileAndGenerateChecksums, doesn't that rewrite the data to a tempfile in the process of calculating checksums? If so, then I think there will be no performance gain at all compared to just calling store_object() on all of the objects. If you can avoid the write altogether, it has promise. Reading the data from disk to calculate the checksums should be faster, I think. For a random 3GiB file on our cephfs filesystem, I just measured that it takes about 6.7s to cp the file to a new location, and about 20s to calculate a sha256 hash. When that gets done for the 140+TB of data on the ADC, it's going to take a long time.

Batching with CephFS API

Also, one of the things that it would be good to avoid is running an iterative loop across 2 million+ objects and calling a method on each, which will take a long time on its own. When I first mentioned this optimization, I mentioned that it would be much faster to call the CephFS API to modify the metadata server directly (hopefully in batch ops) than it would be to use the posix link ln command on each file. Even these metadata operations on this many files are going to be very time consuming. Maybe the link API would be faster than using ln or cp -l, maybe not; it needs some testing.

Possible alternative to consider

An alternative approach, which I have no idea whether it will work or is supported by the Ceph API, is outlined below; just food for thought:

1) QUERY: query metacat to get a table listing the PID, SID, docid, rev, and sysmeta for every object in the system
2) LINK OBJECTS: use that info to batch update cephfs creating hard links for all of the objects (see https://docs.ceph.com/en/reef/cephfs/api/libcephfs-py/#cephfs.LibCephFS.link) and create other hashstore info needed for each file
3) CALC CHECKSUMS: in a multithreaded process, iterate across all objects and calculate and store the checksums of the objects
4) UPDATE/UPGRADE METACAT: using the checksums from 3) do a batch update of all of those checksums in the Metacat checksums table (e.g., use `SQL COPY` not `SQL INSERT`)

Some quick metrics on our system write speeds

Of course, this is entirely out-of-band from both hashstore and metacat, but it might be orders of magnitude faster than trying to insert millions of objects one call at a time. In particular, once (2) is done, you can then massively parallelize (3) on our k8s cluster (although we will be quickly limited by read I/O rate). Because this process is out-of-band (and specific to our systems), it probably would not be the best automatic upgrade path for small metacat deployments. Rather, I see this as a pre-processing step to prepare a hashstore dir and set of checksums for our large installations that have a lot of objects (e.g., ADC/KNB/cn.dataone.org). Smaller metacat installs might just use the Hashstore API you are designing to convert directly.

A few stats follow for reference. I created a directory with 100 files of 3 GiB each on our cephfs filesystem (named testfile-100 to testfile-199).

Hardlinking all of the files was fast, and took 2.0 seconds, which is an average of 50 files per second, or 150 GiB/s

jones@datateam:/mnt/ceph/home/jones/test$ time seq 100 199 | parallel "cp -l testfile-{} linkfile-{}"

real	0m2.006s
user	0m1.048s
sys	0m2.214s

Copying all of the data to new files was I/O bound and took 4m58s, which is an average of 1 GiB/s

jones@datateam:/mnt/ceph/home/jones/test$ time seq 100 199 |  parallel "cp testfile-{} cpfile-{}"

real	4m58.120s
user	0m2.092s
sys	26m31.029s

We have about 140TB of data on the ADC. So, at that rate, it would take about 40 hours to make a single write for the data (at ~1 GiB/s, 140+TB works out to roughly 140,000 seconds; this was originally miscalculated as 3982 hours / 166 days, see the correction below), whereas hardlinking the same files would be much faster (a few days). All of that depends on sustained rates of transfer, which would likely be lower than this small test demonstrates (both because there would be more activity on the systems, and because these are all identical files, so they probably benefit from OS caching/optimization). So, if we can cut back on our writes substantially, we will save a lot of time in the conversion. Last, calculating checksums for these files will also take some time:

jones@datateam:/mnt/ceph/home/jones/test$ time seq 100 199 | parallel "shasum -a 256 testfile-{} > sha-{}"

real	2m5.624s
user	39m51.798s
sys	5m20.154s

So, that's about half the time of the file writes, and hopefully something that we can parallelize much more extensively on the k8s cluster because sha calculations are generally compute-bound.

All just food for thought and discussion. Not sure what you should do, but thought it was worth pointing out your "link" op is really a "write" op.


doulikecookiedough commented Jul 26, 2024

Thank you for sharing your insightful thoughts with us @mbjones 🙏!

If I call writeToTmpFileAndGenerateChecksums, I'm indeed writing to a tmpFile while calculating the checksums - so there is no performance gain. However, I think creating a hardlink is still a required step so that we aren't using up more disk space than we need to. Regarding how to speed up the process, I think we should override writeToTmpFileAndGenerateChecksums so that it only reads into a buffer to calculate the checksums (skipping the writing process as it is not necessary, as you suggested).
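
A minimal sketch of what a read-only generateChecksums could look like (a sketch only: the algorithm list is assumed to be HashStore's five defaults, and the real override will live in FileHashStoreLinks):

    import java.io.IOException;
    import java.io.InputStream;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.HashMap;
    import java.util.Map;

    public final class ChecksumSketch {
        // Assumed default algorithms: MD5, SHA-1, SHA-256, SHA-384, SHA-512.
        private static final String[] ALGORITHMS =
            {"MD5", "SHA-1", "SHA-256", "SHA-384", "SHA-512"};

        // Streams the object's bytes through all five digests without writing a tmpFile.
        public static Map<String, String> generateChecksums(InputStream dataStream)
                throws IOException, NoSuchAlgorithmException {
            MessageDigest[] digests = new MessageDigest[ALGORITHMS.length];
            for (int i = 0; i < ALGORITHMS.length; i++) {
                digests[i] = MessageDigest.getInstance(ALGORITHMS[i]);
            }
            byte[] buffer = new byte[8192];
            int bytesRead;
            while ((bytesRead = dataStream.read(buffer)) != -1) {
                for (MessageDigest digest : digests) {
                    digest.update(buffer, 0, bytesRead);
                }
            }
            Map<String, String> hexDigests = new HashMap<>();
            for (int i = 0; i < ALGORITHMS.length; i++) {
                StringBuilder hex = new StringBuilder();
                for (byte b : digests[i].digest()) {
                    hex.append(String.format("%02x", b));
                }
                hexDigests.put(ALGORITHMS[i], hex.toString());
            }
            return hexDigests;
        }
    }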

RE: Alternative Approach & CephFS
Unfortunately... it looks like libcephfs for Java is no longer actively supported/maintained. Leveraging CephFS's API would have to be done in Python - which would mean this tool would exist outside of Metacat's control. I think that this upgrade process should be coordinated by Metacat; otherwise, it could become confusing for Metacat operators to have to download a separate tool to upgrade/prep existing directories for HashStore, in addition to upgrading Metacat itself.

I will let this simmer in my mind for a bit, your feedback has been extremely helpful!


doulikecookiedough commented Jul 26, 2024

Updated Proposed Process (a rough sketch of storeHardLink follows the list):

  • FileHashStoreLinks.storeHardLink
    • Calls a new method generateChecksums that calculates the 5 default checksums (no writing)
    • Creates the hard link afterwards
    • Calls tagObject
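
A minimal sketch of this composition (StoreInternals below is a hypothetical stand-in for the FileHashStore/FileHashStoreLinks methods named in the diagram, the assumption that the cid is the SHA-256 hex digest is mine, and the hex digests are returned here just to keep the sketch self-contained):

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Map;

    public final class StoreHardLinkSketch {

        // Hypothetical stand-in for the FileHashStore/FileHashStoreLinks methods named above.
        interface StoreInternals {
            Map<String, String> generateChecksums(InputStream dataStream) throws Exception;
            Path getHashStoreDataObjectPath(String cid) throws Exception;
            void tagObject(String pid, String cid) throws Exception;
        }

        // Returns the hex digests (the real method returns ObjectMetadata).
        static Map<String, String> storeHardLink(StoreInternals store, Path existDoc,
                InputStream existDocStream, String pid) throws Exception {
            // 1. Calculate the five default checksums by reading (not writing) the stream.
            Map<String, String> hexDigests = store.generateChecksums(existDocStream);
            // Assumption: the cid is the SHA-256 hex digest of the object's contents.
            String cid = hexDigests.get("SHA-256");
            // 2. Hard link the existing file into its expected permanent address.
            Path cidExpectedObjectPath = store.getHashStoreDataObjectPath(cid);
            Files.createDirectories(cidExpectedObjectPath.getParent());
            if (!Files.exists(cidExpectedObjectPath)) {
                Files.createLink(cidExpectedObjectPath, existDoc);
            }
            // 3. Tag the pid/cid relationship ('tagObject' is synchronized/thread safe).
            store.tagObject(pid, cid);
            return hexDigests;
        }
    }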
ZenUML diagram source (converter process, updated):
    title HashStoreConverter Process
    Client->HashStoreConverter.convert(Path existDoc,pid, Stream sysmeta) {
      "new InputStream existDocStream"
      FileHashStoreLinks.storeHardLink(existDoc, existDocStream, pid) {
        FileHashStore.generateChecksums {
          return hexDigests/Checksums
        }
        FileHashStore.getHashStoreDataObjectPath(cid) {
            return cidExpectedObjectPath
        }
        createHardLink(existDoc, cidExpectedObjectPath)
        // -
        // 'tagObject' is synchronized/thread safe
        FileHashStore.tagObject(pid, cid)

        return ObjectMetadata
      }
        // -
        // Close object stream
        "existDocStream.close()"
        FileHashStoreLinks.storeMetadata(sysmeta, pid) {
        // -
        // If sysmeta fails to store,
        // an exception will be thrown.
        // -
        // The hard link/tags created
        // for the data obj will remain.
        FileHashStore.storeMetadata(sysmeta, pid) {
            return pathToSysmeta
        }
            return pathToSysmeta
        }
        // -
        // - Close sysmeta stream
        "sysmeta.close()"
        return ObjectMetadata
    }


mbjones commented Jul 26, 2024

@doulikecookiedough thanks. In my comments above, I made a pretty big calculation error on my data throughput rates, which I edited above. So instead of 3982 hours for our 140TB cp, that should have been 39.82 hours. Pretty huge difference. Wanted to point this out as it changes my thinking on the scale of this conversion.


doulikecookiedough commented Jul 29, 2024

Thank you again @mbjones for sharing your thoughts with us and the update. I realize that I may have created some confusion regarding our approach to the conversion process as well.

While Metacat will likely set up the process with a loop, the execution phase will utilize Java's concurrency APIs (TBD, perhaps via a Java Collection's .parallelStream().forEach(...) route) to handle the data efficiently.

The out-of-band process described sounds quite promising - but Metacat will eventually also need to store the system metadata for each data object. Given this, along with the fact that we want to store a hard link to optimize disk usage, the convert process outlined in this issue seems appropriate/suitable for our needs. Also, I can confirm that we are no longer writing any bytes to disk.
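
For illustration, a minimal sketch of that parallelStream route (the ConvertTask interface and convertAll helper are hypothetical; Metacat's actual coordination code will differ):

    import java.nio.file.Path;
    import java.util.Map;

    public final class ParallelConvertSketch {

        // Hypothetical per-object task; in practice this would call HashStoreConverter.convert.
        interface ConvertTask {
            void convert(String pid, Path pathToDoc) throws Exception;
        }

        // Runs the conversion over a pid -> path map using the common ForkJoinPool.
        static void convertAll(Map<String, Path> pidToPath, ConvertTask task) {
            pidToPath.entrySet().parallelStream().forEach(entry -> {
                try {
                    task.convert(entry.getKey(), entry.getValue());
                } catch (Exception e) {
                    // Log and continue so one failed object does not stop the whole batch.
                    System.err.println("Failed to convert " + entry.getKey() + ": " + e);
                }
            });
        }
    }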


doulikecookiedough commented Jul 31, 2024

Update:

After further discussion with @taojing2002 and @artntek, we have agreed to combine the Metacat DB Upgrade & HashStore conversion process together.

  • For smaller repositories, this combined process should be relatively quick and minimally impact the user experience.

When operators configure the DB or initiate the process, it will also automatically trigger the HashStore conversion.

  • Operators will be redirected to the properties configuration page, where the status will be displayed as "in progress", along with the placeholder text "Please refresh for updates" where the DB version is usually shown.

RE: Jumbo Repositories (KNB, DataONE, ADC) - Minimize Downtime (2 minutes or less?)

  • First, we will investigate how long the existing process may take with test data on the knbvm server and continue our discussions from there.


mbjones commented Jul 31, 2024

A note on the jumbo repos comment -- the reason that I think the downtime can be kept to a few minutes is that the conversion of files from the existing store to HashStore can be done ahead of time without loss (and hopefully without re-writing existing data), recording the checksum results for later update in Metacat. A rough pseudo algorithm might be:

  1. Start parallel system process to convert HashStore files
    1a) Batch query metacat to get list of all identifiers, key metadata, and their systemMetadata, and note the time this query was made
  2. for each file:
    2a) read file to calculate checksums, and save checksums for later use
    2b) hard link file into HashStore location
    2c) write sysmeta to HashStore location
    2d) write PID annotations and other metadata needed to HashStore
  3. With completed list of PID/Checksums, start Metacat upgrade process (read-only mode), and BATCH copy the metadata into appropriate postgres tables (e.g., don't do iterative INSERTS, rather do SQL COPY and batch updates)
  4. Sweep up any new identifiers that were touched after query time of step 1a and convert those few objects before resuming read-write operations

Hope this is helpful. Other approaches would be feasible too, this is just one possible path.


doulikecookiedough commented Aug 16, 2024

@taojing2002 I have tested the HashStoreConverter with the data found in knbvm, specifically the first 10,000 rows retrieved from Metacat's pg db where the object size is < 1 GB:

  • 2791 data objects
  • 2791 cid reference files
  • 4830 pid reference files

For the normal storeObject process, it takes approximately 17 minutes to write a data file and calculate the default checksums.

For the hardlinking process convert / storeHardLink, it takes approximately 8 minutes to calculate the default checksums and create a hard link.

No unexpected errors - so the hard-linking process takes roughly half the time of the normal process.

@doulikecookiedough

This has been completed via Feature-95: HashStoreConverter & FileHashStoreLinks - however, leaving this issue open to continue discussing plans/strategy to reduce downtime for jumbo repos.

@doulikecookiedough

The hard link conversion process was run for all data objects on knbvm via the hashstore client and took approximately 22 hours and 8 minutes (~5.9T)

For quick reference:

mok@knbvm:/var/metacat$ du -h -d 1
0       ./benchmark
0       ./inline-data
du: cannot read directory './.metacat': Permission denied
0       ./.metacat
1.5G    ./documents
512     ./users
178K    ./logs
5.9T    ./hashstore
du: cannot read directory './config': Permission denied
0       ./config
910M    ./solr-home3
du: cannot read directory './tdb/tdb16272684195302037564': Permission denied
409G    ./tdb
4.5K    ./dataone
507G    ./data
421M    ./temporary
57K     ./certs
6.8T    .
