
Workflow orchestrator for scalable/reliable ingestion #207

Open
JayGhiya opened this issue Nov 11, 2024 · 9 comments
Labels: enhancement (New feature or request), HA

JayGhiya commented:
Workflow orchestration has to be enabled for scalable and reliable ingestion.

@JayGhiya JayGhiya added HA enhancement New feature or request labels Nov 11, 2024
@JayGhiya JayGhiya moved this to In progress in Unoplat Roadmap Nov 11, 2024
JayGhiya commented:

@vipinshreyaskumar please create a separate folder in the repo root for now; a subfolder can cause issues for CI/CD. The root folder could be called unoplat-code-confluence-harvestor if you feel that's a cool name. Second, when you start committing, please follow https://www.conventionalcommits.org/en/v1.0.0/, as our CI/CD depends on it and non-conforming commit messages will cause issues.

I am also working on a total revamp for context, performance, and reliability at the algorithm level in issue #206; once that's completed we can merge the harvestor code and the current utility.

JayGhiya commented Nov 14, 2024

JayGhiya commented Dec 17, 2024

Cloning a GitHub repo and running certain operations on the codebase through linting tools would be the starting point, @vipinshreyaskumar.

Our config looks like this:

{
  "repositories": [
    {
      "git_url": "https://github.com/your-org/your-repo",
      "markdown_output_path": "/path/to/output",
      "codebases": [
        {
          "codebase_folder_name": "your-codebase",
          "root_package_name": "your_package",
          "programming_language_metadata": {
            "language": "python",
            "package_manager": "poetry",
            "language_version": "3.12.0"
          }        
        }
      ]
    }
  ],
  "archguard": {
    "download_url": "archguard/archguard",
    "download_directory": "/path/to/directory"
  },
  "llm_provider_config": {
    "llm_model_provider": "openai/model-name",
    "model_provider_args": {
      "max_tokens": 500,
      "temperature": 0.0
    }
  },
  "logging_handlers": [
    {
      "sink": "path/to/app.log",
      "format": "<green>{time:YYYY-MM-DD at HH:mm:ss}</green> | <level>{level}</level>",
      "rotation": "10 MB",
      "retention": "10 days",
      "level": "DEBUG"
    }
  ],
  "json_output": false,
  "sentence_transformer_model": "jinaai/jina-embeddings-v3"
} 
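A minimal sketch of loading and sanity-checking a config of this shape with only the standard library (the loader and its field checks are illustrative, not part of the repo):

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class CodebaseConfig:
    """Mirrors one entry of 'codebases' in the config above."""
    codebase_folder_name: str
    root_package_name: str
    programming_language_metadata: dict[str, Any]


def load_config(path: str) -> dict[str, Any]:
    """Load the ingestion config and verify the required sections exist."""
    with open(path) as f:
        config = json.load(f)
    for key in ("repositories", "llm_provider_config"):
        if key not in config:
            raise ValueError(f"missing required config section: {key}")
    # Constructing the dataclass raises TypeError if a codebase entry
    # is missing a required field or carries an unexpected one.
    for repo in config["repositories"]:
        for cb in repo["codebases"]:
            CodebaseConfig(**cb)
    return config
```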

Right now we use a Ruff.toml with the following content:

# Target Python 3.11
target-version = "py311"

exclude = [
    ".git",
    ".mypy_cache",
    ".pytest_cache",
    ".ruff_cache",
    ".venv",
    "venv",
    "build",
    "dist",
]

src = ["unoplat_code_confluence"]  # Adjust to your project's source directory; this is important, and the config provides it as root_package_name

[lint]
# Enable only flake8-tidy-imports
select = ["TID","F401","F841"]

[lint.per-file-ignores]
"__init__.py" = ["E402","F401"]
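The lint step can be driven from our pipeline via a subprocess call; a hedged sketch (the helper names and layout are illustrative, not from the repo):

```python
import subprocess
from pathlib import Path


def build_ruff_command(codebase_root: Path, config_path: Path) -> list[str]:
    """Assemble the ruff invocation using the Ruff.toml shown above."""
    return [
        "ruff", "check",
        "--config", str(config_path),
        "--fix",  # auto-remove unused imports (F401) and unused variables (F841)
        str(codebase_root),
    ]


def run_ruff(codebase_root: Path, config_path: Path) -> int:
    """Run ruff against a cloned codebase; requires ruff on PATH."""
    return subprocess.run(build_ruff_command(codebase_root, config_path)).returncode
```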

After this we have to run isort with the config below. Based on the package manager specified in the config, run the relevant code from the package-manager branch to collect the packages and add them to known_third_party.

[settings]
known_third_party = marko,pydantic,ruff,loguru,pygithub,pypdf,pydantic_settings,litellm,pytest,dspy_ai,packaging,progiter,sentence_transformers,einops,rich,neo4j,neomodel,requirements_parser,tomlkit,stdlib_list,pytest_cov,gitpython
import_heading_stdlib = Standard Library
import_heading_thirdparty = Third Party
import_heading_firstparty = First Party
import_heading_localfolder = Local 
combine_as_imports = true
py_version = 311  # For Python 3.11

This completes the prerequisites for our parsing.

@JayGhiya JayGhiya self-assigned this Dec 24, 2024
JayGhiya commented:

I have pushed the description/diagram in the unoplat-code-confluence README on this branch for our milestone 2, @vipinshreyaskumar @apekshamehta @milind12. I will do the same in our Google Docs tomorrow and kick off.

JayGhiya commented:

We now have a skeleton with one actual activity, implemented in Temporal and run through FastAPI. We have also improved the contributor experience through a Taskfile: one just has to run "task dev" to set up the venv, install packages, and start the FastAPI server, @vipinshreyaskumar. One only needs to install uv, task, and the FastAPI CLI.

JayGhiya commented:

This branch will receive pushes every day, so please pull before checking it out. It also has an improved CLI experience that accepts the configuration and manages n requests for n repos to our FastAPI-based code confluence flow bridge.
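The "n requests per n repos" throttling in the CLI can be sketched with an asyncio semaphore. This is an illustrative pattern, not the actual bridge client; submit_repo stands in for the HTTP call to the FastAPI service:

```python
import asyncio
from typing import Any, Awaitable, Callable


async def submit_all(
    repos: list[dict[str, Any]],
    submit_repo: Callable[[dict[str, Any]], Awaitable[str]],
    max_concurrent: int = 4,
) -> list[str]:
    """Fan out one submission per repo, keeping at most
    max_concurrent requests in flight at once."""
    sem = asyncio.Semaphore(max_concurrent)

    async def guarded(repo: dict[str, Any]) -> str:
        async with sem:
            return await submit_repo(repo)

    # gather preserves the input ordering of results
    return await asyncio.gather(*(guarded(r) for r in repos))
```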

JayGhiya commented:

We have successfully incorporated the following via Temporal:

  1. Launching a per-repo workflow.
  2. Running the main activity of the per-repo workflow, i.e. cloning the GitHub repo.
  3. Spawning a child workflow per codebase (for monorepos) and running the other per-codebase activities in that child workflow, such as parsing package manager metadata.

Code is pushed. cc: @vipinshreyaskumar. I will post the next plan for activities and update the design doc as well.

JayGhiya commented Jan 1, 2025

Tasks:

  • Validate that child workflows per codebase are launched in parallel, and fix this if required.
  • Read up on the best strategy for parent-child workflows when child workflows are long running (summarization can take long).
  • Neo4j modelling.
  • Integrate it into parent and child activities.

cc: @vipinshreyaskumar

JayGhiya commented Jan 1, 2025

> Tasks:
>
>   • Validate that child workflows per codebase are launched in parallel, and fix this if required.
>   • Read up on the best strategy for parent-child workflows when child workflows are long running (summarization can take long).
>   • Neo4j modelling.
>   • Integrate it into parent and child activities.
>
> cc: @vipinshreyaskumar

We have fixed launching parallel child workflows. For parent-child workflows, the children will be long running, and we will use database references across parent/child workflows to merge/relate data, so we do not require the parent to wait on them. We have therefore set the parent-close policy to abandon, which ensures child workflows keep running independently of the parent.
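With the abandon policy, the parent finishes without awaiting its children; in Temporal's Python SDK this is typically expressed by starting the child with ParentClosePolicy.ABANDON. A plain-asyncio analogue of the fire-and-forget shape (illustrative only, not Temporal code; the workflow names are hypothetical):

```python
import asyncio


async def child_workflow(codebase: str, done: list[str]) -> None:
    """Stands in for a long-running per-codebase child workflow."""
    await asyncio.sleep(0.01)
    done.append(codebase)


async def parent_workflow(codebases: list[str], done: list[str]) -> list[asyncio.Task]:
    # Launch children without awaiting them: the parent returns
    # immediately, mirroring the abandon parent-close policy.
    return [asyncio.create_task(child_workflow(cb, done)) for cb in codebases]
```

In real Temporal code the children would outlive the parent workflow run entirely, with the Neo4j database references (rather than return values) tying their results back together.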

@vipinshreyaskumar vipinshreyaskumar added this to the Milestone 02 milestone Jan 2, 2025