More parser gold, "the same author/work" references #191
Comments
Trying to train a model with this gold, I am getting: […]
Another question: for training, where should the token "in: " go, as in: […]
I assume it belongs into […]
I posted the current version (cleanup is still ongoing) to a gist: https://gist.github.com/cboulanger/9417648552d775d523d6961d575bc555
Yes, 'in' should definitely go with editors (it's a good marker!). The editor normalizer will strip it off. I'm not sure I've seen it often in the context of journals, but we'd obviously follow the same approach there (I would have to check whether the journal normalizer already strips it, though).
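For illustration only, here is a tiny Ruby sketch of the idea that the "in:" marker stays with the editor label in the training data but gets stripped from the normalized value; the method name and regular expression are made up for this example and this is not AnyStyle's actual normalizer code:

# Hypothetical sketch: keep "in:" with the editor label in the gold data,
# but strip the marker during normalization. Not AnyStyle's real normalizer,
# just an illustration of "the normalizer will strip it off".
def strip_in_marker(value)
  value.sub(/\A\s*in[:\s]+/i, '').strip
end

strip_in_marker('in: Armer/Grimshaw (Hrsg.), ')
#=> "Armer/Grimshaw (Hrsg.),"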
Any idea about the […]?
Maybe an empty tag somewhere?
Is there a chance you could try to train a parser model with https://gist.github.com/cboulanger/9417648552d775d523d6961d575bc555 to see if you get the error as well, or if it is just my setup?
Any idea how I could debug this? I was trying to get an extended stack trace, but to no avail. It would be so nice if I could get these two new XML training docs (1, 2) working with anystyle.
Looking only at the first of the linked datasets above, there are a few issues that cause wapiti to bail out. If you want to debug the native module you need to attach a debugger […]. Here's a diff to fix the first dataset:

*** /home/dupin/Downloads/zfrsoz-footnotes.xml 2022-08-17 11:05:56.104535376 +0200
--- zfrsoz-footnotes.xml 2022-08-17 11:36:27.720096975 +0200
***************
*** 6290,6296 ****
<note>Mainz</note>
<date>1982</date>
</sequence>
- <sequence/>
<sequence>
<author>Ministerium für Arbeit, Gesundheit und Sozialordnung:</author>
<title>Die Situation der Frau in Baden-Württemberg,</title>
--- 6290,6295 ----
***************
*** 12850,12857 ****
<volume>23/März</volume>
<date>1990</date>
</sequence>
- </dataset><?xml version='1.0' encoding='UTF-8'?>
- <dataset>
<sequence>
<editor>Armer/Grimshaw (Hrsg.), </editor>
<title>Comparative Social Research Methodological Problems and Strategies (New York, London, Sydney, Tokio </title>
--- 12849,12854 ----
***************
*** 19142,19148 ****
<note>Mainz </note>
<date>1982</date>
</sequence>
- <sequence/>
<sequence>
<author>Ministerium für Arbeit, Gesundheit und Sozialordnung: </author>
<title>Die Situation der Frau in Baden-Württemberg, </title>
--- 19139,19144 ----
***************
*** 25702,25705 ****
<volume>23/März </volume>
<date>1990</date>
</sequence>
! </dataset>
\ No newline at end of file
--- 25698,25701 ----
<volume>23/März </volume>
<date>1990</date>
</sequence>
! </dataset>

As a general observation, those datasets are very large. It's my feeling that it's better to have a smaller set with fewer inconsistencies than a larger set with more errors, though I don't have hard evidence to back this up. Smaller datasets also make for quicker training, so that's definitely a point in favor of a smaller model.

What I'd suggest doing if you have such large sets is to train on only a small subset first, then use that model to check the rest of the data. If there's a high error rate, I'd make the training set larger. Once the error rate is low, I'd pick out only those sequences that produce errors and add just those to the training set (or review them first, because errors can often point to inconsistencies in the marked-up data).

Finally, as a general tip, you can usually spot errors in large datasets quickly by using a binary search approach: keep training with one half of the dataset until there's no error. This way you can usually narrow the faulty section down to a small set that's easy to review.
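As a rough illustration of that workflow (train on a small subset first, then use the model to check the rest), here is a sketch in Ruby. The exact calls are assumptions: Wapiti::Dataset.open, Dataset#sequences, Dataset.new and AnyStyle::Parser#train/#check should be verified against the current anystyle and wapiti-ruby APIs.

# Sketch of the subset-first workflow; method names are assumptions.
require 'anystyle'

gold = Wapiti::Dataset.open('zfrsoz-footnotes.xml')

# 1. Train a model on a small subset only.
subset = Wapiti::Dataset.new(gold.sequences.take(500))
parser = AnyStyle::Parser.new
parser.train(subset)

# 2. Check the remaining sequences against that model; review the
#    sequences it gets wrong (they often point to annotation errors)
#    and add only those back into the training set.
rest = Wapiti::Dataset.new(gold.sequences.drop(500))
puts parser.check(rest).inspect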
Thanks so much for looking into it, and I am embarrassed that the XML contained junk; I did check for empty tags (but not on the […]). I'll break up the large XML into smaller parts based on discipline (there's computer science, natural sciences, and social sciences in it), which might allow some interesting tests of the performance of a domain-specific vs. a general-purpose dataset.
The multiple root problem was actually a copy/paste error when uploading the data as a gist, sorry. But removing the empty […]. I've put the individual parser training files in here: […]
I've put a lot of work into cleaning up and fixing the annotations, throwing out a large number of sequences which were poorly annotated. So at least in theory, the annotations should be of fairly high quality.
Ok, the performance, at least measured against […]
The consistency of the annotations seems to be quite good, as seen when the model is checked against its own training material.
Well, those datasets differ considerably from the data in […]. That said, if you're looking for a combined model that gives good results for both datasets, I'd add somewhere between 50 and 250 footnote references (aiming for a representative sample, of course) to the core set and use that to train the model, adding more footnote references as necessary.
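A sketch of what building such a combined set could look like (again, the dataset method names are assumptions to be checked against wapiti-ruby, the file names are placeholders, and Array#sample just picks a random subset rather than a carefully chosen representative one):

# Combine the core references with a small sample of footnote references.
require 'anystyle'

core      = Wapiti::Dataset.open('core.xml')
footnotes = Wapiti::Dataset.open('zfrsoz-footnotes.xml')

# Start with ~150 footnote sequences; add more later if needed.
sample   = footnotes.sequences.sample(150)
combined = Wapiti::Dataset.new(core.sequences + sample)

AnyStyle.parser.train(combined)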
Here is some more parser gold, which needs some more love because the source references are VERY messy and therefore the manual annotations were not always correct. I did quite a bit of manual correction after converting it from the EXparser format:
zfrsoz-footnotes-corrected.xml.txt
If you spot any obvious mislabelings that could confuse the parser, please let me know. I am happy to repost the material after some more cleaning & correcting.
But here's my question: in German footnote references (and also sometimes in bibliographies), it is common to use back-references to the previous footnote in the form of "ders." (the same author, masculine) or "dies." (the same author, feminine). In bibliographies, this sometimes appears in the form of "______". Or the previously cited work is referred to with "op. cit.", "a.a.O.", etc.
Do you have any opinion on if/how AnyStyle could handle these cases, or should it be left to the postprocessing of the CSL data?
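If it ends up being handled in postprocessing, here is a minimal sketch of what resolving the simple "ders."/"dies." case on the CSL-JSON output could look like. Everything in it is illustrative: the marker list is incomplete, the example data is made up, and "op. cit."/"a.a.O." references to a previously cited work are harder because they may not point to the immediately preceding entry.

# Illustrative postprocessing sketch (not part of AnyStyle): copy the
# author from the previous item when the current item's author is just
# a "same author" back-reference such as "ders.", "dies." or "______".
SAME_AUTHOR_MARKER = /\A\s*(ders\.|dies\.|_{3,})\s*\z/i

def resolve_back_references(items)
  items.each_cons(2) do |prev, curr|
    names = Array(curr['author']).map { |a| a['literal'] || a['family'] }.compact
    curr['author'] = prev['author'] if names.size == 1 && names.first.match?(SAME_AUTHOR_MARKER)
  end
  items
end

items = [
  { 'author' => [{ 'family' => 'Musterfrau', 'given' => 'Erika' }], 'title' => 'Erster Titel' },
  { 'author' => [{ 'literal' => 'dies.' }], 'title' => 'Zweiter Titel' }
]
resolve_back_references(items)
#=> the second item now carries Musterfrau as its author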