
Workflow orchestrator for scalable/reliable ingestion #207

Open
JayGhiya opened this issue Nov 11, 2024 · 9 comments
Labels: enhancement (New feature or request), HA

JayGhiya commented:
Workflow orchestration has to be enabled for scalable and reliable ingestion.

@JayGhiya JayGhiya added HA enhancement New feature or request labels Nov 11, 2024
@JayGhiya JayGhiya moved this to In progress in Unoplat Roadmap Nov 11, 2024
JayGhiya commented:

@vipinshreyaskumar please create a separate folder in the repo root for now; a subfolder can cause issues for CI/CD. The root folder could be called unoplat-code-confluence-harvestor if you feel that's a cool name. Second, when you start committing, please follow https://www.conventionalcommits.org/en/v1.0.0/, as our CI/CD depends on it and non-conforming commit messages will cause issues.

I am also working on a total revamp for context, performance, and reliability at the algorithm level in issue #206; once that's completed we can merge the harvestor code and the current utility.

JayGhiya commented Nov 14, 2024

JayGhiya commented Dec 17, 2024

Cloning a GitHub repo and running certain operations on the codebase through linting tools would be the starting point, @vipinshreyaskumar.

Our config looks like this:

{
  "repositories": [
    {
      "git_url": "https://github.com/your-org/your-repo",
      "markdown_output_path": "/path/to/output",
      "codebases": [
        {
          "codebase_folder_name": "your-codebase",
          "root_package_name": "your_package",
          "programming_language_metadata": {
            "language": "python",
            "package_manager": "poetry",
            "language_version": "3.12.0"
          }        
        }
      ]
    }
  ],
  "archguard": {
    "download_url": "archguard/archguard",
    "download_directory": "/path/to/directory"
  },
  "llm_provider_config": {
    "llm_model_provider": "openai/model-name",
    "model_provider_args": {
      "max_tokens": 500,
      "temperature": 0.0
    }
  },
  "logging_handlers": [
    {
      "sink": "path/to/app.log",
      "format": "<green>{time:YYYY-MM-DD at HH:mm:ss}</green> | <level>{level}</level>",
      "rotation": "10 MB",
      "retention": "10 days",
      "level": "DEBUG"
    }
  ],
  "json_output": false,
  "sentence_transformer_model": "jinaai/jina-embeddings-v3"
} 
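A minimal sketch of loading and sanity-checking a config of this shape with only the standard library (the loader and its field checks are illustrative, not part of the repo):

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class CodebaseConfig:
    """Mirrors one entry of 'codebases' in the config above."""
    codebase_folder_name: str
    root_package_name: str
    programming_language_metadata: dict[str, Any]


def load_config(path: str) -> dict[str, Any]:
    """Load the ingestion config and verify the required sections exist."""
    with open(path) as f:
        config = json.load(f)
    for key in ("repositories", "llm_provider_config"):
        if key not in config:
            raise ValueError(f"missing required config section: {key}")
    # Constructing the dataclass raises TypeError if a codebase entry
    # is missing a required field or carries an unexpected one.
    for repo in config["repositories"]:
        for cb in repo["codebases"]:
            CodebaseConfig(**cb)
    return config
```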

Right now we use a Ruff.toml with the following content:

# Target Python 3.11
target-version = "py311"

exclude = [
    ".git",
    ".mypy_cache",
    ".pytest_cache",
    ".ruff_cache",
    ".venv",
    "venv",
    "build",
    "dist",
]

src = ["unoplat_code_confluence"]  # Adjust to your project's source directory; this is important, and the config provides it as root_package_name

[lint]
# Enable only flake8-tidy-imports
select = ["TID","F401","F841"]

[lint.per-file-ignores]
"__init__.py" = ["E402","F401"]
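The lint step can be driven from our pipeline via a subprocess call; a hedged sketch (the helper names and layout are illustrative, not from the repo):

```python
import subprocess
from pathlib import Path


def build_ruff_command(codebase_root: Path, config_path: Path) -> list[str]:
    """Assemble the ruff invocation using the Ruff.toml shown above."""
    return [
        "ruff", "check",
        "--config", str(config_path),
        "--fix",  # auto-remove unused imports (F401) and unused variables (F841)
        str(codebase_root),
    ]


def run_ruff(codebase_root: Path, config_path: Path) -> int:
    """Run ruff against a cloned codebase; requires ruff on PATH."""
    return subprocess.run(build_ruff_command(codebase_root, config_path)).returncode
```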

After this we have to run isort with the config below. Based on the package manager specified in the config, run the relevant code from the package-manager branch to collect the packages and add them to known_third_party.

[settings]
known_third_party = marko,pydantic,ruff,loguru,pygithub,pypdf,pydantic_settings,litellm,pytest,dspy_ai,packaging,progiter,sentence_transformers,einops,rich,neo4j,neomodel,requirements_parser,tomlkit,stdlib_list,pytest_cov,gitpython
import_heading_stdlib = Standard Library
import_heading_thirdparty = Third Party
import_heading_firstparty = First Party
import_heading_localfolder = Local 
combine_as_imports = true
py_version = 311  # For Python 3.11

This completes the prerequisites for our parsing.

@JayGhiya JayGhiya self-assigned this Dec 24, 2024
JayGhiya commented:

I have pushed the description/diagram in the unoplat-code-confluence README on this branch for our milestone 2, @vipinshreyaskumar @apekshamehta @milind12. I will do the same in our Google Docs tomorrow and kick off.

JayGhiya commented:

We now have a skeleton with one actual activity, implemented in Temporal and run through FastAPI. We have also improved the contributor experience through a Taskfile: one just has to run "task dev" to set up the venv, install packages, and start the FastAPI server, @vipinshreyaskumar. One only needs to install uv, task, and the FastAPI CLI.

JayGhiya commented:

This branch will receive pushes every day, so please pull before checking it out. It also has an improved CLI experience that accepts the configuration and manages n requests for n repos to our FastAPI-based code confluence flow bridge.
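The "n requests per n repos" throttling in the CLI can be sketched with an asyncio semaphore. This is an illustrative pattern, not the actual bridge client; submit_repo stands in for the HTTP call to the FastAPI service:

```python
import asyncio
from typing import Any, Awaitable, Callable


async def submit_all(
    repos: list[dict[str, Any]],
    submit_repo: Callable[[dict[str, Any]], Awaitable[str]],
    max_concurrent: int = 4,
) -> list[str]:
    """Fan out one submission per repo, keeping at most
    max_concurrent requests in flight at once."""
    sem = asyncio.Semaphore(max_concurrent)

    async def guarded(repo: dict[str, Any]) -> str:
        async with sem:
            return await submit_repo(repo)

    # gather preserves the input ordering of results
    return await asyncio.gather(*(guarded(r) for r in repos))
```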

JayGhiya commented:

We have successfully incorporated the following via Temporal:

  1. Launching a per-repo workflow.
  2. Running the main activity of the per-repo workflow, i.e. cloning the GitHub repo.
  3. Spawning a child workflow per codebase (for monorepos) and running the other per-codebase activities in that child workflow, such as parsing package manager metadata.

Code is pushed. cc: @vipinshreyaskumar. I will post the next plan for activities and update the design doc as well.

JayGhiya commented Jan 1, 2025

Tasks:

  • Validate that child workflows per codebase are launched in parallel, and fix this if required.
  • Read up on the best strategy for parent-child workflows when child workflows are long running (summarization can take long).
  • Neo4j modelling.
  • Integrate it into parent and child activities.

cc: @vipinshreyaskumar

JayGhiya commented Jan 1, 2025

> Tasks:
>
>   • Validate that child workflows per codebase are launched in parallel, and fix this if required.
>   • Read up on the best strategy for parent-child workflows when child workflows are long running (summarization can take long).
>   • Neo4j modelling.
>   • Integrate it into parent and child activities.
>
> cc: @vipinshreyaskumar

We have fixed launching parallel child workflows. For parent-child workflows, the children will be long running, and we will use database references across parent/child workflows to merge/relate data, so we do not require the parent to wait on them. We have therefore set the parent-close policy to abandon, which ensures child workflows keep running independently of the parent.
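With the abandon policy, the parent finishes without awaiting its children; in Temporal's Python SDK this is typically expressed by starting the child with ParentClosePolicy.ABANDON. A plain-asyncio analogue of the fire-and-forget shape (illustrative only, not Temporal code; the workflow names are hypothetical):

```python
import asyncio


async def child_workflow(codebase: str, done: list[str]) -> None:
    """Stands in for a long-running per-codebase child workflow."""
    await asyncio.sleep(0.01)
    done.append(codebase)


async def parent_workflow(codebases: list[str], done: list[str]) -> list[asyncio.Task]:
    # Launch children without awaiting them: the parent returns
    # immediately, mirroring the abandon parent-close policy.
    return [asyncio.create_task(child_workflow(cb, done)) for cb in codebases]
```

In real Temporal code the children would outlive the parent workflow run entirely, with the Neo4j database references (rather than return values) tying their results back together.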

@vipinshreyaskumar vipinshreyaskumar added this to the Milestone 02 milestone Jan 2, 2025