{"payload":{"feedbackUrl":"https://github.com/orgs/community/discussions/53140","repo":{"id":653623369,"defaultBranch":"main","name":"datatrove","ownerLogin":"huggingface","currentUserCanPush":false,"isFork":false,"isEmpty":false,"createdAt":"2023-06-14T12:05:28.000Z","ownerAvatar":"https://avatars.githubusercontent.com/u/25720743?v=4","public":true,"private":false,"isOrgOwned":true},"refInfo":{"name":"","listCacheKey":"v0:1725279404.0","currentOid":""},"activityList":{"items":[{"before":"a147fd50c32452c871b2b137d6e0e5f1381f4506","after":"9ad0747db43c94ffa4c4b9ece074f90eab679431","ref":"refs/heads/multilingual","pushedAt":"2024-09-21T17:33:30.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"guipenedo","name":"Guilherme Penedo","path":"/guipenedo","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/3883401?s=80&v=4"},"commit":{"message":"add todo","shortMessageHtmlLink":"add todo"}},{"before":"9142e3eca63673075932bb885b8b85859d8c45ed","after":"c7f6f516abc1349e4995451ff4017790d00d2d68","ref":"refs/heads/main","pushedAt":"2024-09-11T13:39:02.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"guipenedo","name":"Guilherme Penedo","path":"/guipenedo","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/3883401?s=80&v=4"},"commit":{"message":"Update huggingface.py","shortMessageHtmlLink":"Update huggingface.py"}},{"before":"c2fc90213db55afc1c3dcb8d9210b273d9b38d6c","after":"9142e3eca63673075932bb885b8b85859d8c45ed","ref":"refs/heads/main","pushedAt":"2024-09-11T11:35:21.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"guipenedo","name":"Guilherme Penedo","path":"/guipenedo","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/3883401?s=80&v=4"},"commit":{"message":"Fixed a bug that in the reader pipline, the document count is always less that the actual number of documents by the number of files. (#286)\n\n* fixed a bug that in reader pipeline, the document count is always less\r\nthan the actual number of documents by the number of files.\r\n\r\n* Update document counting based on advice from @guipenedo\r\n\r\nNow use a seperate variable `ndocs` to count number of docs yielded.","shortMessageHtmlLink":"Fixed a bug that in the reader pipline, the document count is always …"}},{"before":"25a5919138901c25ec4f9ed9cc18665f3727ac51","after":"a147fd50c32452c871b2b137d6e0e5f1381f4506","ref":"refs/heads/multilingual","pushedAt":"2024-09-11T10:05:55.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"guipenedo","name":"Guilherme Penedo","path":"/guipenedo","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/3883401?s=80&v=4"},"commit":{"message":"fix tokenizer issues","shortMessageHtmlLink":"fix tokenizer issues"}},{"before":"81d8e5d594df863c4a1f9706ac479f87da2f98a0","after":"25a5919138901c25ec4f9ed9cc18665f3727ac51","ref":"refs/heads/multilingual","pushedAt":"2024-09-05T17:48:16.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"guipenedo","name":"Guilherme Penedo","path":"/guipenedo","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/3883401?s=80&v=4"},"commit":{"message":"add all available tokenizers and all iso-639-1 languages","shortMessageHtmlLink":"add all available tokenizers and all iso-639-1 languages"}},{"before":"b12390ef4f3f8991c0d9902f121fff92bdde028a","after":"81d8e5d594df863c4a1f9706ac479f87da2f98a0","ref":"refs/heads/multilingual","pushedAt":"2024-09-04T17:16:50.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"guipenedo","name":"Guilherme Penedo","path":"/guipenedo","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/3883401?s=80&v=4"},"commit":{"message":"word tokenizers changes: use spacy when possible, added missing languages from spacy and stanza","shortMessageHtmlLink":"word tokenizers changes: use spacy when possible, added missing langu…"}},{"before":"63f7f3f8c3bd30f988ba402d20561640e4f48b45","after":"b12390ef4f3f8991c0d9902f121fff92bdde028a","ref":"refs/heads/multilingual","pushedAt":"2024-09-04T11:51:26.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"guipenedo","name":"Guilherme Penedo","path":"/guipenedo","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/3883401?s=80&v=4"},"commit":{"message":"bugfix for empty lines (breaking chinese samples)","shortMessageHtmlLink":"bugfix for empty lines (breaking chinese samples)"}},{"before":"2da6f22cddf46617510144d1f5f259d806107c84","after":"c2fc90213db55afc1c3dcb8d9210b273d9b38d6c","ref":"refs/heads/main","pushedAt":"2024-09-02T12:20:56.000Z","pushType":"pr_merge","commitsCount":2,"pusher":{"login":"hynky1999","name":"Hynek Kydlíček","path":"/hynky1999","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/39408646?s=80&v=4"},"commit":{"message":"Merge pull request #280 from huggingface/readme_formatting_issues\n\nReadme nits","shortMessageHtmlLink":"Merge pull request #280 from huggingface/readme_formatting_issues"}},{"before":"685915d4a21782321cece8c8a72b9e57d073ffe9","after":"c06344a10a8521627bfef431173d2756fe5b0931","ref":"refs/heads/readme_formatting_issues","pushedAt":"2024-09-02T12:18:47.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"hynky1999","name":"Hynek Kydlíček","path":"/hynky1999","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/39408646?s=80&v=4"},"commit":{"message":"improve formatting of readme + small nit in stats docs","shortMessageHtmlLink":"improve formatting of readme + small nit in stats docs"}},{"before":null,"after":"685915d4a21782321cece8c8a72b9e57d073ffe9","ref":"refs/heads/readme_formatting_issues","pushedAt":"2024-09-02T12:16:44.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"hynky1999","name":"Hynek Kydlíček","path":"/hynky1999","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/39408646?s=80&v=4"},"commit":{"message":"improve formatting of readme + small nit in stats docs","shortMessageHtmlLink":"improve formatting of readme + small nit in stats docs"}},{"before":null,"after":"63f7f3f8c3bd30f988ba402d20561640e4f48b45","ref":"refs/heads/multilingual","pushedAt":"2024-08-30T18:06:26.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"guipenedo","name":"Guilherme Penedo","path":"/guipenedo","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/3883401?s=80&v=4"},"commit":{"message":"change fw quality to strict inequality","shortMessageHtmlLink":"change fw quality to strict inequality"}},{"before":"d95e0ee85d3ce3a376c46dfdbf22b0f23749b654","after":"2da6f22cddf46617510144d1f5f259d806107c84","ref":"refs/heads/main","pushedAt":"2024-08-28T15:42:53.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"guipenedo","name":"Guilherme Penedo","path":"/guipenedo","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/3883401?s=80&v=4"},"commit":{"message":"re-enable testpypi upload","shortMessageHtmlLink":"re-enable testpypi upload"}},{"before":"1297570901687c895b9b7272f2daaf6f842ef35a","after":"d95e0ee85d3ce3a376c46dfdbf22b0f23749b654","ref":"refs/heads/main","pushedAt":"2024-08-28T15:36:29.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"guipenedo","name":"Guilherme Penedo","path":"/guipenedo","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/3883401?s=80&v=4"},"commit":{"message":"temporarily disable testpypi upload","shortMessageHtmlLink":"temporarily disable testpypi upload"}},{"before":"7815b053b64b837a0f8c6ef7fa240a8d3622859c","after":"1297570901687c895b9b7272f2daaf6f842ef35a","ref":"refs/heads/main","pushedAt":"2024-08-28T15:15:47.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"guipenedo","name":"Guilherme Penedo","path":"/guipenedo","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/3883401?s=80&v=4"},"commit":{"message":"publish action try #6","shortMessageHtmlLink":"publish action try #6"}},{"before":"cca96dc39f61414b7b2df0f5a548294d12e8a797","after":"7815b053b64b837a0f8c6ef7fa240a8d3622859c","ref":"refs/heads/main","pushedAt":"2024-08-28T12:52:07.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"guipenedo","name":"Guilherme Penedo","path":"/guipenedo","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/3883401?s=80&v=4"},"commit":{"message":"publish action try #5","shortMessageHtmlLink":"publish action try #5"}},{"before":"ca302dfa28657483e90a4726fb723a51964f9abd","after":"cca96dc39f61414b7b2df0f5a548294d12e8a797","ref":"refs/heads/main","pushedAt":"2024-08-28T12:46:21.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"guipenedo","name":"Guilherme Penedo","path":"/guipenedo","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/3883401?s=80&v=4"},"commit":{"message":"run pip install with --system","shortMessageHtmlLink":"run pip install with --system"}},{"before":"c1416c59bfc080254f1dce7960ed792fc31a2928","after":"ca302dfa28657483e90a4726fb723a51964f9abd","ref":"refs/heads/main","pushedAt":"2024-08-28T12:38:36.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"guipenedo","name":"Guilherme Penedo","path":"/guipenedo","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/3883401?s=80&v=4"},"commit":{"message":"skip existing for failed action","shortMessageHtmlLink":"skip existing for failed action"}},{"before":"160b7481827374ab4d3c8f3f76d6259581fcb5a7","after":"c1416c59bfc080254f1dce7960ed792fc31a2928","ref":"refs/heads/main","pushedAt":"2024-08-28T12:32:45.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"guipenedo","name":"Guilherme Penedo","path":"/guipenedo","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/3883401?s=80&v=4"},"commit":{"message":"re-add comment","shortMessageHtmlLink":"re-add comment"}},{"before":"2b7538efd0ae1f727e7f84f347ca1e538e6efb86","after":"160b7481827374ab4d3c8f3f76d6259581fcb5a7","ref":"refs/heads/main","pushedAt":"2024-08-28T12:32:00.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"guipenedo","name":"Guilherme Penedo","path":"/guipenedo","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/3883401?s=80&v=4"},"commit":{"message":"changes to pypi release action","shortMessageHtmlLink":"changes to pypi release action"}},{"before":null,"after":"3416227ad960c5a4a922e8ff567e9b5a33d8352a","ref":"refs/heads/feature/add_shuffle","pushedAt":"2024-08-28T12:31:35.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"guipenedo","name":"Guilherme Penedo","path":"/guipenedo","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/3883401?s=80&v=4"},"commit":{"message":"changes to pypi release action","shortMessageHtmlLink":"changes to pypi release action"}},{"before":"6a341b9e09ab33863f2fd3e534b278daae4816b4","after":"2b7538efd0ae1f727e7f84f347ca1e538e6efb86","ref":"refs/heads/main","pushedAt":"2024-08-28T12:01:41.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"guipenedo","name":"Guilherme Penedo","path":"/guipenedo","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/3883401?s=80&v=4"},"commit":{"message":"Update pyproject.toml","shortMessageHtmlLink":"Update pyproject.toml"}},{"before":"87ddf3aa8587de00eb164e325253a8b6af115abe","after":"6a341b9e09ab33863f2fd3e534b278daae4816b4","ref":"refs/heads/main","pushedAt":"2024-08-28T12:01:16.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"guipenedo","name":"Guilherme Penedo","path":"/guipenedo","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/3883401?s=80&v=4"},"commit":{"message":"Add shuffle option on huggingface reader (#224)\n\n* Add shuffle option to readers/huggingface\r\n\r\n* Add shuffle on hf reader","shortMessageHtmlLink":"Add shuffle option on huggingface reader (#224)"}},{"before":"762cc525eb3301269547852529257346de20f87e","after":"87ddf3aa8587de00eb164e325253a8b6af115abe","ref":"refs/heads/main","pushedAt":"2024-08-28T11:49:10.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"guipenedo","name":"Guilherme Penedo","path":"/guipenedo","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/3883401?s=80&v=4"},"commit":{"message":"fix argument in example","shortMessageHtmlLink":"fix argument in example"}},{"before":"9808db8c164f438d0b30416f0d8deafe5298e165","after":"762cc525eb3301269547852529257346de20f87e","ref":"refs/heads/main","pushedAt":"2024-08-28T10:28:28.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"guipenedo","name":"Guilherme Penedo","path":"/guipenedo","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/3883401?s=80&v=4"},"commit":{"message":"Update sentence_deduplication.py","shortMessageHtmlLink":"Update sentence_deduplication.py"}},{"before":"f3945627a0d281bb6c3f97007993d48bc9e17ed4","after":"9808db8c164f438d0b30416f0d8deafe5298e165","ref":"refs/heads/main","pushedAt":"2024-08-28T10:24:19.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"guipenedo","name":"Guilherme Penedo","path":"/guipenedo","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/3883401?s=80&v=4"},"commit":{"message":"Update README.md","shortMessageHtmlLink":"Update README.md"}},{"before":"3de36c97737d2e705dde23518dd75ca1225982b6","after":"f3945627a0d281bb6c3f97007993d48bc9e17ed4","ref":"refs/heads/main","pushedAt":"2024-08-28T10:01:52.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"guipenedo","name":"Guilherme Penedo","path":"/guipenedo","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/3883401?s=80&v=4"},"commit":{"message":"Update pypi-release.yml","shortMessageHtmlLink":"Update pypi-release.yml"}},{"before":"c4f57839adde086c88a76cb6cbf81998cf301126","after":"3de36c97737d2e705dde23518dd75ca1225982b6","ref":"refs/heads/main","pushedAt":"2024-08-28T09:58:45.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"guipenedo","name":"Guilherme Penedo","path":"/guipenedo","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/3883401?s=80&v=4"},"commit":{"message":"Add expand_metadata feature in jsonlwriter (#268)","shortMessageHtmlLink":"Add expand_metadata feature in jsonlwriter (#268)"}},{"before":"d5d1924e91b378f3084a7b23c26d240c5f627702","after":"c4f57839adde086c88a76cb6cbf81998cf301126","ref":"refs/heads/main","pushedAt":"2024-08-28T09:58:30.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"guipenedo","name":"Guilherme Penedo","path":"/guipenedo","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/3883401?s=80&v=4"},"commit":{"message":"Update filter_hf_dataset.py (#274)\n\nThe type of `text_key` should be `str` instead of `int`.","shortMessageHtmlLink":"Update filter_hf_dataset.py (#274)"}},{"before":"3b91550a5d67acad067091b8ee33465ed21f73b0","after":"d5d1924e91b378f3084a7b23c26d240c5f627702","ref":"refs/heads/main","pushedAt":"2024-08-28T09:51:36.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"guipenedo","name":"Guilherme Penedo","path":"/guipenedo","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/3883401?s=80&v=4"},"commit":{"message":"Implement zstd Compression Support for JSONL and Parquet Files (#230)\n\n* Add zstandard dependency for compression support\r\n\r\n* feat: Add zstd compression support for jsonl reader\r\n\r\n* feat: Add zstd compression support for ParquetWriter\r\n\r\n* feat: Update DiskWriter to handle the other compression for Parquet files\r\n\r\n* Remove annotaion\r\n\r\n* feat: Update compression handling in DiskWriter and ParquetWriter\r\n\r\n* Update src/datatrove/pipeline/writers/disk_base.py\r\n\r\nHandle compression on ParquetWriter directly\r\n\r\nCo-authored-by: Guilherme Penedo \r\n\r\n* Update src/datatrove/pipeline/writers/parquet.py\r\n\r\nNone to out of list\r\n\r\nCo-authored-by: Guilherme Penedo \r\n\r\n* Refactor constructor to explicitly set default compression to None\r\n\r\n* Add validation for compression parameter in ParquetWriter\r\n\r\n* Update src/datatrove/pipeline/writers/disk_base.py\r\n\r\nofficial extension for zstd is \".zst\"\r\n\r\nCo-authored-by: Guilherme Penedo \r\n\r\n---------\r\n\r\nCo-authored-by: Guilherme Penedo ","shortMessageHtmlLink":"Implement zstd Compression Support for JSONL and Parquet Files (#230)"}},{"before":"6102f59d33833e0c96c4f838c82e1cfd0cd67d62","after":"3b91550a5d67acad067091b8ee33465ed21f73b0","ref":"refs/heads/main","pushedAt":"2024-08-28T09:02:48.000Z","pushType":"pr_merge","commitsCount":2,"pusher":{"login":"hynky1999","name":"Hynek Kydlíček","path":"/hynky1999","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/39408646?s=80&v=4"},"commit":{"message":"Merge pull request #276 from shizhediao/patch-2\n\nUpdate README.md","shortMessageHtmlLink":"Merge pull request #276 from shizhediao/patch-2"}}],"hasNextPage":true,"hasPreviousPage":false,"activityType":"all","actor":null,"timePeriod":"all","sort":"DESC","perPage":30,"cursor":"Y3Vyc29yOnYyOpK7MjAyNC0wOS0yMVQxNzozMzozMC4wMDAwMDBazwAAAAS8z7DX","startCursor":"Y3Vyc29yOnYyOpK7MjAyNC0wOS0yMVQxNzozMzozMC4wMDAwMDBazwAAAAS8z7DX","endCursor":"Y3Vyc29yOnYyOpK7MjAyNC0wOC0yOFQwOTowMjo0OC4wMDAwMDBazwAAAASmRVmH"}},"title":"Activity · huggingface/datatrove"}