Add filesize when using the Rest Service #1868

Merged: dadoonet merged 1 commit into master from rest/add-filesize on May 2, 2024

Conversation

dadoonet
Owner

@dadoonet commented May 2, 2024

As reported in https://discuss.elastic.co/t/358630

Filesize is missing when using the Rest service.

@dadoonet added the bug (For confirmed bugs) label May 2, 2024
@dadoonet added this to the 2.10 milestone May 2, 2024
@dadoonet self-assigned this May 2, 2024

sonarcloud bot commented May 2, 2024

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@dadoonet merged commit 9616dd1 into master May 2, 2024
13 checks passed
@dadoonet deleted the rest/add-filesize branch May 2, 2024 20:08
@mingyitianxia

When I used the following command:

[image]

the result is:

      """,
      "file": {
        "extension": "txt",
        "content_type": "text/plain; charset=ISO-8859-1",
        "indexing_date": "2024-05-03T05:15:24.128+00:00",
        "filesize": -1,
        "filename": "test.txt"
      },
      "path": {
        "virtual": "test.txt",
        "real": "test.txt"
      }
    }
  }
]

The filesize is still -1.

@dadoonet
Owner Author

dadoonet commented May 3, 2024

Argh! I need to do more testing 😬

@dadoonet
Owner Author

dadoonet commented May 3, 2024

That's weird! When I'm executing the tests from the Java code it works well but from the CLI I can reproduce the behavior you are seeing... Checking...

@mingyitianxia

I also use the Python client, and the filesize is -1 there as well.

@mingyitianxia

[image]
[image]

The bug still exists in the newest package, fscrawler-distribution-2.10-20240503.070246-354.zip.

Please use the following commands:

echo "This is my text" > test.txt
curl -F "file=@test.txt" "http://127.0.0.1:18888/fscrawler/_upload"

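and use the Python client like this:

import os

import requests
from flask import jsonify

# Fragment from the body of an upload handler (presumably a Flask view):
# `file` is the uploaded file object and FS_WEB_ADDRESS is the FSCrawler
# upload URL, both defined elsewhere.
try:
    # Measure the upload by seeking to the end, then rewind before posting.
    file.seek(0, os.SEEK_END)
    file_size = file.tell()
    file.seek(0)
    print("File size:", file_size)

    files = {
        'file': (file.filename, file, 'application/vnd.openxmlformats-officedocument.wordprocessingml.document')
    }
    # Content-Length describes the whole multipart request body, not the
    # file part, so this header does not tell FSCrawler the file size.
    headers = {
        'Content-Length': str(file_size)
    }

    response = requests.post(FS_WEB_ADDRESS, files=files, headers=headers, timeout=10)

    if response.ok:
        return jsonify({"message": "File uploaded successfully"}), 200
    else:
        return jsonify({"error": f"FSCrawler handling error: {response.status_code} {response.text}"}), response.status_code
except requests.exceptions.RequestException as e:
    return jsonify({"error": f"Request error: {str(e)}"}), 500
finally:
    file.close()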

@dadoonet
Owner Author

dadoonet commented May 3, 2024

I know. I can reproduce the problem...

from the CLI I can reproduce the behavior you are seeing...

@dadoonet
Owner Author

dadoonet commented May 3, 2024

So apparently curl does not provide this information:

curl -v --trace-ascii debug.txt -F "file=@test.txt" "http://127.0.0.1:8080/fscrawler/_document?simulate=true"
cat debug.txt

Gives

== Info:   Trying 127.0.0.1:8080...
== Info: Connected to 127.0.0.1 (127.0.0.1) port 8080
=> Send header, 224 bytes (0xe0)
0000: POST /fscrawler/_document?simulate=true HTTP/1.1
0032: Host: 127.0.0.1:8080
0048: User-Agent: curl/8.4.0
0060: Accept: */*
006d: Content-Length: 214
0082: Content-Type: multipart/form-data; boundary=--------------------
00c2: ----VzJBwyDNXJA2IVvgyzIvvA
00de: 
=> Send data, 214 bytes (0xd6)
0000: --------------------------VzJBwyDNXJA2IVvgyzIvvA
0032: Content-Disposition: form-data; name="file"; filename="test.txt"
0074: Content-Type: text/plain
008e: 
0090: This is my text.
00a2: --------------------------VzJBwyDNXJA2IVvgyzIvvA--
== Info: We are completely uploaded and fine
<= Recv header, 17 bytes (0x11)
0000: HTTP/1.1 200 OK
<= Recv header, 32 bytes (0x20)
0000: Content-Type: application/json
<= Recv header, 21 bytes (0x15)
0000: Content-Length: 489
<= Recv header, 2 bytes (0x2)
0000: 
<= Recv data, 489 bytes (0x1e9)
0000: {.  "ok" : true,.  "filename" : "test.txt",.  "url" : "https://1
0040: 27.0.0.1:9200/rest/_doc/dd18bf3a8ea2a3e53e2661c7fb53534",.  "doc
0080: " : {.    "content" : "This is my text\n\n",.    "meta" : { },. 
00c0:    "file" : {.      "extension" : "txt",.      "content_type" : 
0100: "text/plain; charset=ISO-8859-1",.      "indexing_date" : "2024-
0140: 05-03T10:39:47.685+00:00",.      "filesize" : -1,.      "filenam
0180: e" : "test.txt".    },.    "path" : {.      "virtual" : "test.tx
01c0: t",.      "real" : "test.txt".    }.  }.}
== Info: Connection #0 to host 127.0.0.1 left intact

That's why I can reproduce this... I will do some other checks.
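To make the numbers in the trace concrete, here is a minimal Python sketch (not part of FSCrawler or curl) that rebuilds the same multipart body byte for byte. The boundary string and file content are copied from the trace; `multipart_body` is just an illustrative helper name. It suggests that the request-level `Content-Length: 214` covers the whole body, while the 16-byte file part carries no size of its own.

```python
# Boundary as announced in the Content-Type header of the trace:
# 24 dashes followed by the random token.
BOUNDARY = "-" * 24 + "VzJBwyDNXJA2IVvgyzIvvA"

# What `echo "This is my text" > test.txt` writes: the text plus a newline.
FILE_BYTES = b"This is my text\n"


def multipart_body(boundary: str, filename: str, payload: bytes) -> bytes:
    """Build a single-part multipart/form-data body like the one curl sent."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: text/plain\r\n"
        "\r\n"
    ).encode("ascii")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return head + payload + tail


body = multipart_body(BOUNDARY, "test.txt", FILE_BYTES)

print("file size:", len(FILE_BYTES))   # 16
print("body size:", len(body))         # 214, the Content-Length in the trace
print(b"size=" in body)                # False: no size parameter in the part
```

So a client that only sets the HTTP `Content-Length` header still gives the server no per-file size.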

@dadoonet
Owner Author

dadoonet commented May 3, 2024

Apparently the `size` parameter of the Content-Disposition header is not mandatory per the RFC... See curl/curl#13527.

So I need to compute it if not provided.
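A rough sketch of what "compute it if not provided" could look like, written in Python for illustration only. This is not FSCrawler's actual Java code: `declared_size` and `effective_size` are hypothetical helper names, and counting the received bytes is just one possible fallback.

```python
from email.message import Message


def declared_size(content_disposition: str) -> int:
    """Return the `size` parameter of a Content-Disposition value, or -1."""
    msg = Message()
    msg["Content-Disposition"] = content_disposition
    value = msg.get_param("size", header="content-disposition")
    try:
        return int(value)
    except (TypeError, ValueError):
        # Parameter missing (None) or not a plain integer.
        return -1


def effective_size(content_disposition: str, payload: bytes) -> int:
    """Use the declared size when present, else count the received bytes."""
    size = declared_size(content_disposition)
    return size if size >= 0 else len(payload)


# Header as sent by curl: no size parameter.
curl_header = 'form-data; name="file"; filename="test.txt"'
# Header as sent by the Java client: size parameter present.
java_header = (
    'form-data; filename="test.txt"; '
    'modification-date="Fri, 03 May 2024 15:27:44 GMT"; size=30; name="file"'
)

print(declared_size(curl_header))                         # -1
print(effective_size(curl_header, b"This is my text\n"))  # 16
print(effective_size(java_header, b""))                   # 30
```

Counting bytes means the part has to be read or buffered before the size is known, which may or may not suit a streaming upload path.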

@dadoonet
Owner Author

dadoonet commented May 3, 2024

So there is a workaround using tags:

echo "This is my text" > test.txt
curl -F "file=@test.txt" \
  -F "tags={\"file\":{\"filesize\":$(ls -l test.txt | awk '{print $5}')}}" \
  "http://127.0.0.1:8080/fscrawler/_document"

Let me know if this works for you.
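For Python users, the same workaround might look like the sketch below. It assumes the third-party `requests` library and that `tags` can be sent as a plain form field, as the curl command above does; the helper names `build_tags` and `upload_with_filesize` are made up for this example.

```python
import json
import os


def build_tags(path: str) -> str:
    """Build the JSON value for the `tags` form field with the real file size."""
    return json.dumps({"file": {"filesize": os.path.getsize(path)}})


def upload_with_filesize(path: str, url: str):
    """POST the file plus a `tags` field carrying its size (needs `requests`)."""
    import requests  # third-party; imported here so build_tags works without it

    with open(path, "rb") as fh:
        return requests.post(
            url,
            files={"file": (os.path.basename(path), fh)},
            data={"tags": build_tags(path)},
            timeout=10,
        )
```

Usage would be something like `upload_with_filesize("test.txt", "http://127.0.0.1:8080/fscrawler/_document")`, with the URL adjusted to your own REST service settings.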

dadoonet added a commit that referenced this pull request May 3, 2024
See discussion at: curl/curl#13527

Calling curl with a file does not provide the `size` field for the file:

```sh
curl --trace-ascii debug.txt -F "file=@test.txt" "http://127.0.0.1:8080/fscrawler/_document?simulate=true"
```

Gives:

```txt
== Info:   Trying 127.0.0.1:8080...
== Info: Connected to 127.0.0.1 (127.0.0.1) port 8080
=> Send header, 224 bytes (0xe0)
0000: POST /fscrawler/_document?simulate=true HTTP/1.1
0032: Host: 127.0.0.1:8080
0048: User-Agent: curl/8.4.0
0060: Accept: */*
006d: Content-Length: 214
0082: Content-Type: multipart/form-data; boundary=--------------------
00c2: ----VzJBwyDNXJA2IVvgyzIvvA
00de:
=> Send data, 214 bytes (0xd6)
0000: --------------------------VzJBwyDNXJA2IVvgyzIvvA
0032: Content-Disposition: form-data; name="file"; filename="test.txt"
0074: Content-Type: text/plain
008e:
0090: This is my text.
00a2: --------------------------VzJBwyDNXJA2IVvgyzIvvA--
== Info: We are completely uploaded and fine
<= Recv header, 17 bytes (0x11)
0000: HTTP/1.1 200 OK
<= Recv header, 32 bytes (0x20)
0000: Content-Type: application/json
<= Recv header, 21 bytes (0x15)
0000: Content-Length: 489
<= Recv header, 2 bytes (0x2)
0000:
<= Recv data, 489 bytes (0x1e9)
0000: {.  "ok" : true,.  "filename" : "test.txt",.  "url" : "https://1
0040: 27.0.0.1:9200/rest/_doc/dd18bf3a8ea2a3e53e2661c7fb53534",.  "doc
0080: " : {.    "content" : "This is my text\n\n",.    "meta" : { },.
00c0:    "file" : {.      "extension" : "txt",.      "content_type" :
0100: "text/plain; charset=ISO-8859-1",.      "indexing_date" : "2024-
0140: 05-03T10:39:47.685+00:00",.      "filesize" : -1,.      "filenam
0180: e" : "test.txt".    },.    "path" : {.      "virtual" : "test.tx
01c0: t",.      "real" : "test.txt".    }.  }.}
== Info: Connection #0 to host 127.0.0.1 left intact
```

The important part is:

```txt
0000: --------------------------VzJBwyDNXJA2IVvgyzIvvA
0032: Content-Disposition: form-data; name="file"; filename="test.txt"
0074: Content-Type: text/plain
008e:
0090: This is my text.
00a2: --------------------------VzJBwyDNXJA2IVvgyzIvvA--
== Info: We are completely uploaded and fine
```

We can see that the `size` of the file is not provided.

But when calling the same endpoint with the Java `jakarta.ws.rs.client` client, the `size` is provided:

```
1 > PUT http://127.0.0.1:8080/fscrawler/_document/1234
1 > Accept: multipart/form-data,application/json
1 > Content-Type: multipart/form-data
--Boundary_1_46114008_1714750065797
Content-Type: application/octet-stream
Content-Disposition: form-data; filename="test.txt"; modification-date="Fri, 03 May 2024 15:27:44 GMT"; size=30; name="file"

This file contains some words.
--Boundary_1_46114008_1714750065797--
```

[RFC 2183](https://datatracker.ietf.org/doc/html/rfc2183#section-2.7) does not make the `size` parameter mandatory.
So the workaround is to compute it from the CLI and send it as a tag:

```sh
echo "This is my text" > test.txt
curl -F "file=@test.txt" \
  -F "tags={\"file\":{\"filesize\":$(ls -l test.txt | awk '{print $5}')}}" \
  "http://127.0.0.1:8080/fscrawler/_document"
```

Related to #1868
@mingyitianxia

    "",
    "file": {
      "extension": "txt",
      "content_type": "text/plain; charset=ISO-8859-1",
      "indexing_date": "2024-05-04T02:05:54.628+00:00",
      "filesize": 16,
      "filename": "test.txt"
    },
    "path": {
      "virtual": "test.txt",
      "real": "test.txt"
    }
  }

OK, the bug can be fixed this way!

dadoonet added a commit that referenced this pull request May 13, 2024
dadoonet added a commit that referenced this pull request May 13, 2024
Fix documentation for `filesize` is not provided by curl

Labels
bug For confirmed bugs component:rest