
Extraction starts from the beginning in the retry cycle in case of an error #1

Open
vertuk opened this issue Feb 5, 2024 · 1 comment

Comments

@vertuk

vertuk commented Feb 5, 2024

It goes like this:

$ vkimexp -b brave [ID]
Estimating…  1377 queries, 137565 messages
------------------------------------------------------------------------------
S     1/1377  0.07%  offset 137530…    213b       0   
S     2/1377  0.15%  offset 137430…    213b       0   
S     3/1377  0.22%  offset 137330…    213b       0   
S     4/1377  0.29%  offset 137230…   110kb      77   (5)I                  
S     5/1377  0.36%  offset 137130…   133kb     100   III                   
S     6/1377  0.44%  offset 137030…   132kb     100   IIP                   
S     7/1377  0.51%  offset 136930…   166kb     100   (6)I(7)P              
S     8/1377  0.58%  offset 136830…   170kb     100   (6)I(5)P
...
S   190/1377  13.8%  offset 118630…   124kb     100   P                     
S   191/1377  13.9%  offset 118530…   135kb     100   (5)P                  
S   192/1377  13.9%  offset 118430…   134kb     100   (19)I(10)P            
S   193/1377  14.0%  offset 118330…   151kb     100   (10)P                 
·   194/1377  14.1%  offset 118230…   139kb     100+  IPP·                  expected str, bytes or os.PathLike object, not NoneType
Attempt 2/10, will retry in 3.8 seconds...
Estimating…  1377 queries, 137565 messages
------------------------------------------------------------------------------
S     1/1377  0.07%  offset 137530…    213b       0   
S     2/1377  0.15%  offset 137430…    213b       0   
S     3/1377  0.22%  offset 137330…    213b       0   
S     4/1377  0.29%  offset 137230…   110kb      77   (5)I                  
S     5/1377  0.36%  offset 137130…   133kb     100   III                   
S     6/1377  0.44%  offset 137030…   132kb     100   IIP                   
S     7/1377  0.51%  offset 136930…   166kb     100   (6)I(7)P

You see? It starts cycling through the data from the beginning, not from the place where it got the error (as I would imagine it should). This repeats until it runs out of retries, 10 in total, as indicated by Attempt 2/10, will retry in 3.8 seconds...
Thankfully it doesn't redownload everything every time; it just goes through the already downloaded data, probably checking its existence and integrity, at about one line a second (sometimes up to 10, maybe 30 seconds per line), but it still adds up.
And it never gets past the failing spot until the retries are exhausted.

My setup:
OS: Ubuntu 23.10
Python: Python 3.12.0 (main, Oct 4 2023, 06:27:34) [GCC 13.2.0]

@delameter
Owner

delameter commented Feb 5, 2024

That is expected behaviour, because there is no mechanism for resuming a failed export (yet). The good news is that I was planning to implement it, just not in the first release, rather a bit later.

Tracking, and therefore preserving, media files (photos etc.) is much simpler, because 1) each of them has a unique hash from the start, and 2) this hash does not change.

Resuming the message export is a bit more complicated, because the app must keep all earlier messages somewhere, in memory or on disk; otherwise the output files would contain less data than they potentially could. Consider this example: the export process downloaded 3 pages of the history and failed, and the next attempt started from page 4, which contains a reply to a message from page 3. The app in the first attempt had that message from page 3 in memory and could insert a quotation of the earlier message next to the later one. But the app after a restart would download messages starting right from page 4 and therefore would not be able to get the data for the quoted message.

So a correct implementation should not only resume the export from some preserved page number, but also recover the full state of the previous export attempt, which means that this state has to be written somewhere to begin with. It is not really hard to implement; what is hard is making the feature flawless and elegant from the start. That's why it was deferred.
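For illustration only, here is a rough sketch of what such a checkpointing mechanism could look like. None of these names or files exist in vkimexp; the state file, the `fetch_page` callback, and the field names are all assumptions made up for the example:

```python
# Hypothetical sketch (not vkimexp's actual code): persist export progress so
# a retry can resume from the last completed offset instead of page 1.
import json
import os

STATE_FILE = "export_state.json"  # assumed checkpoint path


def load_state():
    """Return previously saved progress, or a fresh state if none exists."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE, encoding="utf-8") as f:
            return json.load(f)
    return {"next_offset": None, "messages_by_id": {}}


def save_state(state):
    """Write the current progress to disk atomically."""
    tmp = STATE_FILE + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp, STATE_FILE)


def export(fetch_page, first_offset, page_size=100):
    """Download message pages, checkpointing after each one.

    `fetch_page(offset)` stands in for whatever call hits the VK API; it is
    assumed to return a list of message dicts with an "id" field.
    """
    state = load_state()
    offset = state["next_offset"] if state["next_offset"] is not None else first_offset
    while offset >= 0:
        page = fetch_page(offset)
        # Keep every message seen so far, so later pages can still render
        # quotations of messages fetched in an earlier attempt.
        for msg in page:
            state["messages_by_id"][str(msg["id"])] = msg
        offset -= page_size
        state["next_offset"] = offset
        save_state(state)  # on a crash + retry, the export resumes here
    return state["messages_by_id"]
```

The point of the sketch is that both pieces of state get persisted together: the next offset to fetch (so the retry skips already-downloaded pages) and the accumulated messages (so quotations of earlier messages still resolve after a restart).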
