
Use custom binary index file format #3

Open
msakuta opened this issue Feb 23, 2023 · 4 comments

Comments

@msakuta

msakuta commented Feb 23, 2023

This is a very interesting project: a local search engine that you can understand and customize. It could actually improve my productivity.

So I looked into its performance, because it doesn't seem very fast.

I injected performance measurement code in various places in my branch.
This is the time it takes to index the whole of docs.gl on my machine:

Indexed in 7.639098 s

And this is the time to search for a common word like TEXTURE:

Load index file (JSON): 5.9112787 s
Search index: 0.0046938 s
Total search time: 5.9307961

That is too slow for a search engine, but interestingly, most of the time is spent on loading the JSON index file.
I suppose the most time-consuming part is parsing.

I see that you have started implementing an sqlite index file. Indeed, it is very fast to load, since sqlite doesn't actually read the whole file when it opens a connection to the db.

Load index file (sqlite): 0.0006661 s

However, sqlite doesn't look particularly easy to work with. I tried to implement the search feature on top of sqlite, but it's painful to express in SQL.

So I propose a third approach: designing a custom binary file format for this specific use case.
The idea is to implement our own serializer/deserializer for a binary file, like the code below. It skips the parsing step entirely, so loading should be much faster.

use std::io::Write;

impl InMemoryModel {
    // Length-prefixed layout: every collection and string is written as its
    // length followed by its contents. Note that usize::to_le_bytes has a
    // platform-dependent width, so the file is not portable across
    // architectures with different pointer sizes.
    pub fn serialize(&self, mut writer: impl Write) -> std::io::Result<()> {
        // Documents: count, then (path, total term count, term frequencies).
        writer.write_all(&self.docs.len().to_le_bytes())?;
        for (path, doc) in &self.docs {
            let path = path.as_os_str().to_string_lossy();
            let path_bytes = path.as_bytes();
            writer.write_all(&path_bytes.len().to_le_bytes())?;
            writer.write_all(path_bytes)?;

            writer.write_all(&doc.count.to_le_bytes())?;

            writer.write_all(&doc.tf.len().to_le_bytes())?;
            for (term, f) in &doc.tf {
                let term_bytes = term.as_bytes();
                writer.write_all(&term_bytes.len().to_le_bytes())?;
                writer.write_all(term_bytes)?;

                writer.write_all(&f.to_le_bytes())?;
            }
        }

        // Global document-frequency table: count, then (term, frequency) pairs.
        writer.write_all(&self.df.len().to_le_bytes())?;
        for (term, f) in &self.df {
            let term_bytes = term.as_bytes();
            writer.write_all(&term_bytes.len().to_le_bytes())?;
            writer.write_all(term_bytes)?;

            writer.write_all(&f.to_le_bytes())?;
        }
        Ok(())
    }
}

The deserializer looks like this:

use std::error::Error;
use std::io::Read;
use std::path::PathBuf;

impl InMemoryModel {
    pub fn deserialize(mut reader: impl Read) -> Result<Self, Box<dyn Error>> {
        let mut this = Self::default();

        // Reads a length-prefixed UTF-8 string, mirroring the serializer above.
        fn deserialize_string(reader: &mut impl Read) -> Result<String, Box<dyn Error>> {
            let mut buf = [0u8; std::mem::size_of::<usize>()];
            reader.read_exact(&mut buf)?;
            let len = usize::from_le_bytes(buf);
            let mut str_buf = vec![0u8; len];
            reader.read_exact(&mut str_buf)?;
            Ok(String::from_utf8(str_buf)?)
        }

        // Documents.
        let mut buf = [0u8; std::mem::size_of::<usize>()];
        reader.read_exact(&mut buf)?;
        let num_docs = usize::from_le_bytes(buf);
        for _ in 0..num_docs {
            let path = deserialize_string(&mut reader)?;

            reader.read_exact(&mut buf)?;
            let count = usize::from_le_bytes(buf);

            let mut tf = TermFreq::new();

            reader.read_exact(&mut buf)?;
            let tf_len = usize::from_le_bytes(buf);
            for _ in 0..tf_len {
                let term = deserialize_string(&mut reader)?;
                reader.read_exact(&mut buf)?;
                let f = usize::from_le_bytes(buf);
                tf.insert(term, f);
            }

            this.docs.insert(PathBuf::from(path), Doc { count, tf });
        }

        // Global document-frequency table.
        reader.read_exact(&mut buf)?;
        let num_df = usize::from_le_bytes(buf);
        for _ in 0..num_df {
            let term = deserialize_string(&mut reader)?;
            reader.read_exact(&mut buf)?;
            let f = usize::from_le_bytes(buf);
            this.df.insert(term, f);
        }

        Ok(this)
    }
}

We don't need serde or anything; in fact, using it may add some overhead. Dumping binary data straight to disk should be the simplest and fastest way to serialize.
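
For example, hooking it up could look like this (the helper names and signatures here are just placeholders, not code from the branch):

use std::error::Error;
use std::fs::File;

// Placeholder helpers showing how the (de)serializer above would be used.
fn save_model_as_binary(model: &InMemoryModel, path: &str) -> std::io::Result<()> {
    model.serialize(File::create(path)?)
}

fn load_model_from_binary(path: &str) -> Result<InMemoryModel, Box<dyn Error>> {
    InMemoryModel::deserialize(File::open(path)?)
}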

I implemented this in the branch, and with the binary file format:

Indexed in 3.9568251 s

Load index file (binary): 1.6069981 s
Search index: 0.0043521 s
Total search time: 1.6252311000000002

It's still not very fast, but much better than JSON.

I think we can make it a randomly accessible binary file format and seek the file pointer to just the fields we need, which would avoid loading the whole file content. It should implement the Model trait. We would need a header at the beginning of the binary that records the file offset of each data record, such as a document; a rough sketch follows. At least it seems easier than complex SQL queries.
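
To illustrate the header idea, something like this (just a sketch of the concept, not implemented in the branch; all names are made up):

use std::io::{Read, Seek, SeekFrom};

// Hypothetical on-disk layout: a table of byte offsets at the start of the
// file, so a single document record can be read without deserializing
// everything else.
struct BinaryIndexHeader {
    doc_offsets: Vec<u64>, // doc_offsets[i] = absolute offset of document record i
    df_offset: u64,        // offset of the global document-frequency table
}

fn read_doc_record(
    mut file: impl Read + Seek,
    header: &BinaryIndexHeader,
    doc: usize,
) -> std::io::Result<Vec<u8>> {
    // Seek straight to the record instead of reading the whole file.
    let start = header.doc_offsets[doc];
    let end = header
        .doc_offsets
        .get(doc + 1)
        .copied()
        .unwrap_or(header.df_offset);
    file.seek(SeekFrom::Start(start))?;
    let mut buf = vec![0u8; (end - start) as usize];
    file.read_exact(&mut buf)?;
    Ok(buf)
}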

It is like designing a custom database engine, but I think it fits the philosophy of this project :)

@SoutrikBandyopadhyay

Excellent job done!

@benediktwerner

benediktwerner commented Feb 24, 2023

I wonder how just using a more efficient serde format like bincode instead of JSON would compare? I suspect it would probably perform just as well. Using serde's zero-copy deserialization features should be even better.
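
Something like this should be all that's needed (sketch, using the bincode 1.x API; assumes InMemoryModel derives serde's Serialize and Deserialize):

use std::error::Error;
use std::fs::File;
use std::io::{BufReader, BufWriter};

fn save_bincode(model: &InMemoryModel, path: &str) -> Result<(), Box<dyn Error>> {
    let writer = BufWriter::new(File::create(path)?);
    // bincode walks the struct via serde; no hand-written layout needed.
    bincode::serialize_into(writer, model)?;
    Ok(())
}

fn load_bincode(path: &str) -> Result<InMemoryModel, Box<dyn Error>> {
    let reader = BufReader::new(File::open(path)?);
    Ok(bincode::deserialize_from(reader)?)
}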

@benediktwerner

Actually, it looks like the main reason it's slow is just unbuffered IO. See #5. With that fixed, the JSON index loads in 0.3 s for me.
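
The fix is basically just wrapping the file in a buffered reader, something like this (sketch; see #5 for the actual change):

use std::fs::File;
use std::io::BufReader;

// BufReader batches the many small reads the parser makes into larger
// reads from the OS, instead of one syscall per read_exact.
fn open_index(path: &str) -> std::io::Result<BufReader<File>> {
    Ok(BufReader::new(File::open(path)?))
}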

@msakuta

msakuta commented Feb 24, 2023

Oh... I was aware that reading the XMLs used a BufReader, but I missed the index file.
I also forgot to wrap the binary format reader in a BufReader.
Now:

Load index file (JSON): 0.0862068 s
Load index file (binary): 0.0465654 s

The difference doesn't really matter much.

BTW, I managed to make sqlite search work in #4, and it gives:

Sqlite search time: 0.0187526 s

I have to admit that sqlite is fast.
