
[FR] Add additional knowledge bases #3

Open
16 tasks
svilupp opened this issue Feb 5, 2024 · 4 comments

Comments

@svilupp
Owner

svilupp commented Feb 5, 2024

Add the following knowledge sources:

Knowledge should contain both the documentation and the code snippets.
Each source should be added as a separate artifact, with the embedding model (and associated parameters, such as dimensions) clearly labelled.
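As an illustrative sketch of "clearly label the embedding model and its parameters" (field names and values here are assumptions, not an agreed schema), each artifact could carry a small metadata record and a self-describing filename:

```python
# Hypothetical metadata record for one knowledge-pack artifact.
# All field names and example values are illustrative, not a fixed schema.
artifact_metadata = {
    "package": "Makie",                           # example package whose docs were indexed
    "package_version": "0.20.0",                  # docs version the artifact was built from
    "embedding_model": "text-embedding-3-small",  # example embedding model
    "embedding_dimension": 1536,                  # dimensionality of the stored vectors
    "chunk_size": 512,                            # max tokens/characters per chunk
    "created": "2024-03-13",
}

def artifact_filename(meta):
    """Derive a self-describing filename so model and dimensions are visible at a glance."""
    return (f"{meta['package']}-{meta['package_version']}"
            f"__{meta['embedding_model']}-{meta['embedding_dimension']}.jld2")

# artifact_filename(artifact_metadata)
# -> "Makie-0.20.0__text-embedding-3-small-1536.jld2"
```

Encoding the model and dimension in both the metadata and the filename makes it harder to accidentally mix artifacts built with incompatible embedding settings.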


In addition, we need to build:

  • Docsite scraper for repeatable collection
    • The scraper must be compliant with robots.txt and considerate (throttle its requests)
    • It must be able to focus on a specific package version only and ignore others
  • High-quality chunker/processor
    • It must be able to deduplicate content
    • It should honor the structure of the content as much as possible, to generate high-quality document chunks
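A minimal sketch of two of the requirements above (a robots.txt-aware, throttled fetcher and a deduplicating chunker). Python is used here purely for illustration; the real implementation would be Julia (e.g. Gumbo for HTML parsing), and all names are assumptions:

```python
import hashlib
import time
import urllib.robotparser

def make_polite_fetcher(base_url, user_agent="AIHelpMe-scraper", delay_s=1.0):
    """Return a fetch(url) function that honours robots.txt and throttles requests."""
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(base_url.rstrip("/") + "/robots.txt")
    rp.read()
    last_request = [0.0]

    def fetch(url):
        if not rp.can_fetch(user_agent, url):
            return None  # disallowed by robots.txt -> skip this URL
        wait = delay_s - (time.monotonic() - last_request[0])
        if wait > 0:
            time.sleep(wait)  # throttle: at most one request per delay_s seconds
        last_request[0] = time.monotonic()
        # ... perform the actual HTTP GET here ...
        return url

    return fetch

def dedup_chunks(chunks):
    """Drop exact-duplicate chunks (e.g. repeated nav/footer text) by content hash."""
    seen, unique = set(), []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.strip().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique
```

Hash-based deduplication only catches exact repeats; near-duplicate detection (e.g. shingling) would be a separate, later refinement.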

This functionality should be well-documented and user-friendly, so that anyone can index their own favourite package (and, ideally, share it with others in the Julia community).

All this tooling should live in AIHelpMe as a separate module (initially), with its own separate dependencies (e.g., Gumbo).

@splendidbug

splendidbug commented Mar 9, 2024

I like the idea of the project. Can anyone outside the organization start contributing by providing the knowledge sources? If yes, do you expect a ChunkIndex (the output of build_index())?

@svilupp
Owner Author

svilupp commented Mar 9, 2024

It was originally earmarked for the GSoC: https://julialang.org/jsoc/gsoc/juliagenai/

But I don't think anything is stopping us from laying the foundations.

Yes, anyone can contribute, but I think, especially in the beginning, the biggest help would be building a powerful data harvesting & processing pipeline.

The reason is that there are many choices to be made (how to chunk, chunk size, embedding dimensions, etc.) which heavily influence performance, and it will take a while to tune them. So we need to be able to refresh and reprocess our data sources easily.

To your specific question: we would need to time the impact on loading times, but it would be much more robust to save the knowledge via JLD2 and build the index upon download (that would allow us to support older Julia versions and also be resilient to any future ChunkIndex changes).
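The pattern described here — serialise the raw chunks and embeddings rather than the finished ChunkIndex, and rebuild the index after download — could look roughly as follows. This is a Python sketch of the idea only; the actual code would use JLD2 in Julia, and every name below is hypothetical:

```python
import pickle

def save_knowledge_pack(path, chunks, embeddings, metadata):
    """Persist only the raw ingredients, not the index structure itself."""
    with open(path, "wb") as f:
        pickle.dump({"chunks": chunks, "embeddings": embeddings, "meta": metadata}, f)

def load_and_build_index(path, build_index):
    """Rebuild the index from raw data at load time, so the on-disk
    format survives future changes to the index type."""
    with open(path, "rb") as f:
        pack = pickle.load(f)
    return build_index(pack["chunks"], pack["embeddings"], pack["meta"])
```

Decoupling the stored artifact from the in-memory index type is what buys the resilience mentioned above: the index implementation can change freely as long as it can still be built from chunks plus embeddings.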

If you would like to chat about it, we can have a call. You can find me on Julia Slack under the handle svilup.

EDIT: I forgot to say - I'm in the process of slightly rewriting the RAG tooling in PromptingTools to be more hackable/modular, which should also help us here.

@svilupp
Owner Author

svilupp commented Mar 13, 2024

@splendidbug I've just updated the first post with more details around associated websites + some more details for the data pipeline. Feel free to convert them into separate new issues if you start playing with it.

@divital-coder

Hiya, tuning in because I would love to help expand the knowledge base and fine-tune the RAG pipelines.
