
Use custom binary index file format #3

Open
msakuta opened this issue Feb 23, 2023 · 4 comments

Comments

@msakuta

msakuta commented Feb 23, 2023

This is a very interesting project: a local search engine that you can understand and customize. It could actually improve my productivity.

So I looked into its performance, because it doesn't seem very fast.

I injected performance measurement code in various places in my branch.
This is the time it takes to index the whole of docs.gl on my machine:

Indexed in 7.639098 s

And this is the time to search for a common word like TEXTURE:

Load index file (JSON): 5.9112787 s
Search index: 0.0046938 s
Total search time: 5.9307961

That is too slow for a search engine, but interestingly, most of the time is spent on loading the JSON index file.
I suppose the most time-consuming part is parsing.

I see that you have started implementing an sqlite index file. Indeed, it is very fast to load, since sqlite doesn't actually read the whole file when it opens a connection to the db.

Load index file (sqlite): 0.0006661 s

However, sqlite doesn't look particularly easy to work with. I tried to implement the search feature on top of sqlite, but it's painful to express in SQL.

So I propose a third approach: designing a custom binary file format for this specific use case.
The idea is to implement our own serializer/deserializer for a binary file, like the code below. It skips the parsing step entirely, so loading should be much faster.

use std::io::Write;

impl InMemoryModel {
    // Length-prefixed layout: every collection and string is written as its
    // length followed by its contents. Note that usize::to_le_bytes has a
    // platform-dependent width, so the file is not portable across
    // architectures with different pointer sizes.
    pub fn serialize(&self, mut writer: impl Write) -> std::io::Result<()> {
        // Documents: count, then (path, total term count, term frequencies).
        writer.write_all(&self.docs.len().to_le_bytes())?;
        for (path, doc) in &self.docs {
            let path = path.as_os_str().to_string_lossy();
            let path_bytes = path.as_bytes();
            writer.write_all(&path_bytes.len().to_le_bytes())?;
            writer.write_all(path_bytes)?;

            writer.write_all(&doc.count.to_le_bytes())?;

            writer.write_all(&doc.tf.len().to_le_bytes())?;
            for (term, f) in &doc.tf {
                let term_bytes = term.as_bytes();
                writer.write_all(&term_bytes.len().to_le_bytes())?;
                writer.write_all(term_bytes)?;

                writer.write_all(&f.to_le_bytes())?;
            }
        }

        // Global document-frequency table: count, then (term, frequency) pairs.
        writer.write_all(&self.df.len().to_le_bytes())?;
        for (term, f) in &self.df {
            let term_bytes = term.as_bytes();
            writer.write_all(&term_bytes.len().to_le_bytes())?;
            writer.write_all(term_bytes)?;

            writer.write_all(&f.to_le_bytes())?;
        }
        Ok(())
    }
}

The deserializer looks like this:

use std::error::Error;
use std::io::Read;
use std::path::PathBuf;

impl InMemoryModel {
    pub fn deserialize(mut reader: impl Read) -> Result<Self, Box<dyn Error>> {
        let mut this = Self::default();

        // Reads a length-prefixed UTF-8 string, mirroring the serializer above.
        fn deserialize_string(reader: &mut impl Read) -> Result<String, Box<dyn Error>> {
            let mut buf = [0u8; std::mem::size_of::<usize>()];
            reader.read_exact(&mut buf)?;
            let len = usize::from_le_bytes(buf);
            let mut str_buf = vec![0u8; len];
            reader.read_exact(&mut str_buf)?;
            Ok(String::from_utf8(str_buf)?)
        }

        // Documents.
        let mut buf = [0u8; std::mem::size_of::<usize>()];
        reader.read_exact(&mut buf)?;
        let num_docs = usize::from_le_bytes(buf);
        for _ in 0..num_docs {
            let path = deserialize_string(&mut reader)?;

            reader.read_exact(&mut buf)?;
            let count = usize::from_le_bytes(buf);

            let mut tf = TermFreq::new();

            reader.read_exact(&mut buf)?;
            let tf_len = usize::from_le_bytes(buf);
            for _ in 0..tf_len {
                let term = deserialize_string(&mut reader)?;
                reader.read_exact(&mut buf)?;
                let f = usize::from_le_bytes(buf);
                tf.insert(term, f);
            }

            this.docs.insert(PathBuf::from(path), Doc { count, tf });
        }

        // Global document-frequency table.
        reader.read_exact(&mut buf)?;
        let num_df = usize::from_le_bytes(buf);
        for _ in 0..num_df {
            let term = deserialize_string(&mut reader)?;
            reader.read_exact(&mut buf)?;
            let f = usize::from_le_bytes(buf);
            this.df.insert(term, f);
        }

        Ok(this)
    }
}

We don't need serde or anything; in fact, using it may add some overhead. Dumping binary data straight to disk should be the simplest and fastest way to serialize.
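
For example, hooking it up could look like this (the helper names and signatures here are just placeholders, not code from the branch):

use std::error::Error;
use std::fs::File;

// Placeholder helpers showing how the (de)serializer above would be used.
fn save_model_as_binary(model: &InMemoryModel, path: &str) -> std::io::Result<()> {
    model.serialize(File::create(path)?)
}

fn load_model_from_binary(path: &str) -> Result<InMemoryModel, Box<dyn Error>> {
    InMemoryModel::deserialize(File::open(path)?)
}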

I implemented this in the branch, and with the binary file format:

Indexed in 3.9568251 s

Load index file (binary): 1.6069981 s
Search index: 0.0043521 s
Total search time: 1.6252311000000002

It's still not very fast, but much better than JSON.

I think we can make it a randomly accessible binary file format and seek the file pointer to just the fields we need, which would avoid loading the whole file content. It should implement the Model trait. We would need a header at the beginning of the binary that records the file offset of each data record, such as a document; a rough sketch follows. At least it seems easier than complex SQL queries.
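
To illustrate the header idea, something like this (just a sketch of the concept, not implemented in the branch; all names are made up):

use std::io::{Read, Seek, SeekFrom};

// Hypothetical on-disk layout: a table of byte offsets at the start of the
// file, so a single document record can be read without deserializing
// everything else.
struct BinaryIndexHeader {
    doc_offsets: Vec<u64>, // doc_offsets[i] = absolute offset of document record i
    df_offset: u64,        // offset of the global document-frequency table
}

fn read_doc_record(
    mut file: impl Read + Seek,
    header: &BinaryIndexHeader,
    doc: usize,
) -> std::io::Result<Vec<u8>> {
    // Seek straight to the record instead of reading the whole file.
    let start = header.doc_offsets[doc];
    let end = header
        .doc_offsets
        .get(doc + 1)
        .copied()
        .unwrap_or(header.df_offset);
    file.seek(SeekFrom::Start(start))?;
    let mut buf = vec![0u8; (end - start) as usize];
    file.read_exact(&mut buf)?;
    Ok(buf)
}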

It is like designing a custom database engine, but I think it fits the philosophy of this project :)

@SoutrikBandyopadhyay

Excellent job done!

@benediktwerner

benediktwerner commented Feb 24, 2023

I wonder how just using a more efficient serde format like bincode instead of JSON would compare? I suspect it would probably perform just as well. Using serde's zero-copy deserialization features should be even better.
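
Something like this should be all that's needed (sketch, using the bincode 1.x API; assumes InMemoryModel derives serde's Serialize and Deserialize):

use std::error::Error;
use std::fs::File;
use std::io::{BufReader, BufWriter};

fn save_bincode(model: &InMemoryModel, path: &str) -> Result<(), Box<dyn Error>> {
    let writer = BufWriter::new(File::create(path)?);
    // bincode walks the struct via serde; no hand-written layout needed.
    bincode::serialize_into(writer, model)?;
    Ok(())
}

fn load_bincode(path: &str) -> Result<InMemoryModel, Box<dyn Error>> {
    let reader = BufReader::new(File::open(path)?);
    Ok(bincode::deserialize_from(reader)?)
}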

@benediktwerner

Actually, it looks like the main reason it's slow is just unbuffered IO. See #5. With that fixed, the JSON index loads in 0.3 s for me.
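
The fix is basically just wrapping the file in a buffered reader, something like this (sketch; see #5 for the actual change):

use std::fs::File;
use std::io::BufReader;

// BufReader batches the many small reads the parser makes into larger
// reads from the OS, instead of one syscall per read_exact.
fn open_index(path: &str) -> std::io::Result<BufReader<File>> {
    Ok(BufReader::new(File::open(path)?))
}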

@msakuta

msakuta commented Feb 24, 2023

Oh... I was aware that reading the XMLs used a BufReader, but I missed the index file.
I also forgot to wrap the binary format reader in a BufReader.
Now:

Load index file (JSON): 0.0862068 s
Load index file (binary): 0.0465654 s

The difference doesn't really matter much.

BTW, I managed to make sqlite search work in #4, and it gives:

Sqlite search time: 0.0187526 s

I have to admit that sqlite is fast.
