prepdocs.py InvalidContent The file is corrupted or format is unsupported #1199

Open
tomqwu opened this issue Dec 5, 2024 · 0 comments
Labels
bug Something isn't working

tomqwu commented Dec 5, 2024

Describe the bug

Running "prepdocs.py"
Data preparation script started
Preparing data for index: gptkbindex
Ensuring search index gptkbindex exists
2024-12-05 15:43:06,245 - INFO - AzureDeveloperCliCredential.get_token succeeded
2024-12-05 15:43:06,246 - INFO - Request URL: 'https://gxxxc.search.windows.net/indexes?api-version=REDACTED'
Request method: 'GET'
Request headers:
    'Accept': 'application/json'
    'x-ms-client-request-id': '9eaeaf30-b31f-11ef-97d0-0242ac110002'
    'User-Agent': 'azsdk-python-search-documents/11.4.0b6 Python/3.10.15 (Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.36)'
    'Authorization': 'REDACTED'
No body was attached to the request
2024-12-05 15:43:06,522 - INFO - Response status: 200
Response headers:
    'Transfer-Encoding': 'chunked'
    'Content-Type': 'application/json; odata.metadata=minimal; odata.streaming=true; charset=utf-8'
    'Content-Encoding': 'REDACTED'
    'Vary': 'REDACTED'
    'Server': 'Microsoft-IIS/10.0'
    'Strict-Transport-Security': 'REDACTED'
    'Preference-Applied': 'REDACTED'
    'OData-Version': 'REDACTED'
    'request-id': '9eaeaf30-b31f-11ef-97d0-0242ac110002'
    'elapsed-time': 'REDACTED'
    'Date': 'Thu, 05 Dec 2024 15:43:04 GMT'
Search index gptkbindex already exists
Chunking directory...
Total files to process=1 out of total directory size=1
Single process to chunk and parse the files. --njobs > 1 can help performance.
  0%|                                                                                                            | 0/1 [00:00<?, ?it/s]
2024-12-05 15:43:06,798 - INFO - AzureDeveloperCliCredential.get_token succeeded
2024-12-05 15:43:06,798 - INFO - Request URL: 'https://cog-fr-7sropmy2c6ksc.cognitiveservices.azure.com/formrecognizer/documentModels/prebuilt-layout:analyze?stringIndexType=unicodeCodePoint&api-version=2023-07-31'
Request method: 'POST'
Request headers:
    'Content-Type': 'application/octet-stream'
    'Accept': 'application/json'
    'x-ms-client-request-id': '9f0a4bb0-b31f-11ef-97d0-0242ac110002'
    'User-Agent': 'azsdk-python-ai-formrecognizer/3.3.3 Python/3.10.15 (Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.36)'
    'Authorization': 'REDACTED'
A body is sent with the request
2024-12-05 15:43:07,335 - INFO - Response status: 400
Response headers:
    'Content-Length': '221'
    'Content-Type': 'application/json; charset=utf-8'
    'ms-azure-ai-errorcode': 'InvalidRequest'
    'x-ms-error-code': 'InvalidRequest'
    'x-envoy-upstream-service-time': '33'
    'apim-request-id': '456b774b-8626-486c-860e-3dd4d78b3803'
    'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload'
    'x-content-type-options': 'nosniff'
    'x-ms-region': 'Canada Central'
    'Date': 'Thu, 05 Dec 2024 15:43:06 GMT'
(InvalidRequest) Invalid request.
Code: InvalidRequest
Message: Invalid request.
Inner error: {
    "code": "InvalidContent",
    "message": "The file is corrupted or format is unsupported. Refer to documentation for the list of supported formats."
}
File (./data/GitHub Actions.docx) failed with  (InvalidRequest) Invalid request.
Code: InvalidRequest
Message: Invalid request.
Inner error: {
    "code": "InvalidContent",
    "message": "The file is corrupted or format is unsupported. Refer to documentation for the list of supported formats."
}
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.25it/s]
Warning: No chunks found. Please check the data directory for valid and supported files.
Data preparation for index gptkbindex completed

The out-of-the-box PDF also fails with a similar error.
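
Since the service reports "The file is corrupted or format is unsupported" for both a .docx and a PDF, one thing worth ruling out is that the files on disk are not what their extensions claim. The following is a minimal sketch, not part of prepdocs.py: it reads the first bytes of each file under ./data and compares them with a few well-known signatures. The helper names (`sniff_bytes`, `sniff_file`) and the signature table are illustrative assumptions, and the Git LFS pointer case is only one possible explanation, not a confirmed cause.

```python
from pathlib import Path

# Leading bytes ("magic numbers") of a few formats relevant to this report.
# This table is an illustrative assumption, not an exhaustive list.
SIGNATURES = {
    b"%PDF-": "pdf",
    b"PK\x03\x04": "zip container (docx/xlsx/pptx)",
    b"version https://git-lfs": "git-lfs pointer, not the real file",
}


def sniff_bytes(head: bytes) -> str:
    """Return a rough format guess from the first bytes of a file."""
    if not head:
        return "empty file"
    for magic, label in SIGNATURES.items():
        if head.startswith(magic):
            return label
    return "unknown"


def sniff_file(path: Path) -> str:
    """Read only the first 32 bytes of the file and classify them."""
    with path.open("rb") as f:
        return sniff_bytes(f.read(32))


if __name__ == "__main__":
    # ./data is the directory that appears in the log above.
    for path in sorted(Path("./data").glob("*")):
        if path.is_file():
            print(f"{path}: {sniff_file(path)} ({path.stat().st_size} bytes)")
```

If a file is reported as empty, unknown, or a pointer, the problem is likely in how the file reached the data directory rather than in the service. If the files look valid, the request in the log uses `api-version=2023-07-31`, so it may be worth checking the supported-formats documentation for that version to see whether DOCX is accepted by `prebuilt-layout`.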

tomqwu added the bug label Dec 5, 2024