Some more cases that are currently not supported #5

Open
d-e-h-i-o opened this issue Feb 28, 2021 · 4 comments

@d-e-h-i-o

d-e-h-i-o commented Feb 28, 2021

Hi Malte,

there are some cases that the extractor currently does not support. I want to rework the regex a bit, but I wanted to talk with you about the changes before implementing anything.

Cases that are currently matched incorrectly or not matched at all:

  • False positives like 'Ein Satz mit 2014 und 2014/20, und weiteren Sachen' (see False-Positive Extractions #4)
  • BVerwG cases (e.g. 7 A 9.19)
  • Bavaria courts (as referenced in the source code)
  • GSZ 1/16 (courts that omit the chamber)
  • BVerfGE 1, 208 (different citation style)

The basic idea is to make the regex a bit more specific to real-world usage, and less general. That is always a bit tricky, since some courts seem to deviate arbitrarily from the naming scheme, but it would do a better job of filtering out false positives like 2014 und 2014/20. False negatives can always be added one by one as special cases when they are detected.

Chamber (Richter/Spruchkörper)

Current regex: (?P<chamber>([0-9]+)[a-z]?|([IVX]+))

Problem

It could theoretically match arbitrarily long numbers, both in Arabic and in Roman numerals.

Possible Solution

For Arabic numerals, a sensible upper bound could be chosen. Intuitively, I don't think there will be a chamber with a number as high as 2014, but I'm not 100% sure about that yet.
For Roman numerals, it might be possible to compile an exhaustive list. At OpenLegalData, the highest number seems to be XXI.
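
A minimal sketch of what a bounded chamber pattern might look like. The three-digit limit, the exclusion of leading zeros, and the helper name `is_plausible_chamber` are assumptions for illustration, not part of the existing extractor; only the upper Roman bound XXI comes from the observation above.

```python
import re

# Assumption: no chamber number has more than three digits; tune against real data.
MAX_ARABIC_DIGITS = 3

# Roman numerals I to XXI (XXI being the highest value seen at OpenLegalData).
ROMAN_CHAMBERS = [
    "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X",
    "XI", "XII", "XIII", "XIV", "XV", "XVI", "XVII", "XVIII", "XIX", "XX", "XXI",
]

# Longest alternatives first, so "XVIII" is tried before "XV" when searching.
roman_alt = "|".join(sorted(ROMAN_CHAMBERS, key=len, reverse=True))

CHAMBER_RE = re.compile(
    rf"(?P<chamber>[1-9][0-9]{{0,{MAX_ARABIC_DIGITS - 1}}}[a-z]?|{roman_alt})"
)


def is_plausible_chamber(token: str) -> bool:
    """Return True if the whole token looks like a chamber designation."""
    return CHAMBER_RE.fullmatch(token) is not None


print(is_plausible_chamber("12a"))   # bounded Arabic number with letter suffix
print(is_plausible_chamber("2014"))  # four digits, rejected under the assumed bound
```

An explicit list for the Roman numerals trades generality for precision: anything above XXI would have to be added by hand once it shows up in real data.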

Code (Registerzeichen)

Current regex: r'(?P<code>[A-Za-z]{1,6})(\s\(([A-Za-z]{1,6})\))?(\s([A-Za-z]{1,6}))?'

I'm not 100% sure what the two groups after the code group are meant to capture (are those for the format that the Sozialgerichtsbarkeit uses?).

Problem

It matches codes that are not actually being used, like und.

Possible solution

An exhaustive list of codes could be used. There is one online dictionary that looks pretty comprehensive.
There is also a list in the codebase, but it does not contain all terms from the web dictionary (e.g. 'D', which is used by the OVG, is missing). What is that list currently used for? Would it make sense to complete it (or use it as it is) for code matching?
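
As a rough sketch of list-based code matching: the `KNOWN_CODES` excerpt below only contains codes that appear in examples in this issue, and `FILE_NUMBER_RE` is a simplified, hypothetical pattern rather than the extractor's actual regex. A real list would be loaded from the dictionary or from the list in the codebase.

```python
import re

# Hypothetical excerpt; only codes that appear in examples in this issue.
KNOWN_CODES = ["BvH", "KA", "A", "C", "D"]

# Escape each code and put longer ones first so they win over shorter prefixes.
code_alt = "|".join(
    re.escape(code) for code in sorted(KNOWN_CODES, key=len, reverse=True)
)

# Simplified file-number pattern: chamber, code from the list, number/year.
FILE_NUMBER_RE = re.compile(
    r"\b(?P<chamber>\d{1,3}[a-z]?)\s(?P<code>" + code_alt + r")\s"
    r"(?P<number>\d+)/(?P<year>\d{2})\b"
)

false_positive = "Ein Satz mit 2014 und 2014/20, und weiteren Sachen"
print(FILE_NUMBER_RE.search(false_positive))  # "und" is not a known code

match = FILE_NUMBER_RE.search("Az. 7 C 123/14")
print(match.group("chamber"), match.group("code"), match.group("year"))
```

Matching is case-sensitive on purpose here, so a lower-case word such as "und" can never be taken for a code.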

Year (Jahr)

Current regex: (?P<year>[0-9]{2})

Problem

Any year could be matched, including years in the future and years before '45 (though I do not know whether that is an actual issue).

Possible solution

For the year, a sensible range could be used, with the current year determined dynamically as the upper bound. That might not feel quite right, since the regex would no longer be a purely static pattern, but it would prevent it from matching a year that lies in the future, which does not make sense.
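
One way to keep the regex itself static is to leave `[0-9]{2}` as it is and validate the captured year afterwards. The sketch below assumes 1945 as the lower bound (taken from the remark above) and a hypothetical helper name `is_plausible_year`; neither exists in the current code.

```python
from datetime import date
from typing import Optional


def is_plausible_year(
    two_digit_year: str,
    today: Optional[date] = None,
    earliest: int = 1945,  # assumed lower bound, see the remark about '45
) -> bool:
    """Check whether a two-digit year can be expanded to a year in range."""
    today = today or date.today()
    yy = int(two_digit_year)
    century = today.year // 100 * 100
    # A two-digit year is ambiguous: try the current and the previous century.
    candidates = (century + yy, century - 100 + yy)
    return any(earliest <= year <= today.year for year in candidates)


reference_day = date(2021, 2, 28)
print(is_plausible_year("20", today=reference_day))  # 2020 is in range
print(is_plausible_year("30", today=reference_day))  # 1930 too early, 2030 in the future
```

Note that this filter loses its effect over time: once the window from `earliest` to the current year spans 100 years, every two-digit value is accepted again.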

Special cases

  • Some courts prefix their cases, e.g. B 6 KA 45/13 R by the BSG. Those styles are also contained in the online dictionary, thus could be handled by a conclusive list.

  • Some courts postfix their cases (see also example above). I'm not sure whether this is important to catch.

  • Bavarian courts and BVerwG use a full stop to separate number and year, and Bavarian courts have a different order.

  • Sometimes cases are handed to a different chamber, so 7 C 123/14 becomes 12 (7) C 456/15. These are edge cases, though.

  • Courts like the Großer Senat für Zivilsachen (e.g. GSZ 1/16) do not specify a chamber, since they only have one.

  • Citation styles like BVerfGE 1, 208. This is technically not a court case number, but it is often used as a citation style, so I think it would be very helpful to support it. It refers to a position in the collection of important decisions by the BVerfG, so every one of these cases also has a traditional number (e.g. 2 BvH 1/52 for BVerfGE 1, 208).
    Since there is a mapping online I think it would make sense to build a dictionary that resolves to the court number.
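
Two of the special cases above could be sketched as follows. Everything here is illustrative: `FLEXIBLE_RE`, `resolve_bverfge`, and the tiny code list are hypothetical names and data, and the mapping table contains only the single pair mentioned in this issue. The Bavarian ordering and the prefixed and postfixed BSG style are not covered.

```python
import re
from typing import Optional

# Hypothetical excerpt of known codes, longest first.
CODES = ["GSZ", "A", "C"]
code_alt = "|".join(re.escape(c) for c in sorted(CODES, key=len, reverse=True))

# Optional chamber (for courts such as the GSZ) and "/" or "." before the year
# (the BVerwG uses a full stop).
FLEXIBLE_RE = re.compile(
    r"\b(?:(?P<chamber>\d{1,3})\s)?(?P<code>" + code_alt + r")\s"
    r"(?P<number>\d{1,5})[/.](?P<year>\d{2})\b"
)

# Only the pair named in this issue; a full table would come from the online mapping.
BVERFGE_TO_FILE_NUMBER = {(1, 208): "2 BvH 1/52"}

BVERFGE_RE = re.compile(r"\bBVerfGE\s+(?P<volume>\d{1,3}),\s*(?P<page>\d{1,4})\b")


def resolve_bverfge(text: str) -> Optional[str]:
    """Map the first 'BVerfGE <volume>, <page>' citation to its file number."""
    match = BVERFGE_RE.search(text)
    if match is None:
        return None
    key = (int(match.group("volume")), int(match.group("page")))
    return BVERFGE_TO_FILE_NUMBER.get(key)


print(FLEXIBLE_RE.search("BVerwG 7 A 9.19").groupdict())
print(FLEXIBLE_RE.search("GSZ 1/16").groupdict())
print(resolve_bverfge("vgl. BVerfGE 1, 208"))
```

Making the chamber optional is only safe in combination with a restricted code list; with the generic `[A-Za-z]{1,6}` code group it would match `und 2014/20` again.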

The citation_style.md should show the currently supported styles, right? So there I would need to update the case law part?

In the long run, it could also make sense to benchmark the regex. We could choose a random selection of court cases and count how many false positives/negatives occur. @dataspider does something similar in Coupette, 244 ff. (the regex code there is under a Creative Commons No Derivatives license, though, so we won't be able to use it here).
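
Such a benchmark could be as simple as the sketch below. The `naive_extract` function is a rough stand-in that only approximates the current regex, and the three labelled samples are made up from examples in this issue; a real benchmark would call the actual extractor on a hand-labelled random sample.

```python
import re

# Rough stand-in for the current behaviour (unbounded chamber, any short word as code).
NAIVE_RE = re.compile(r"\b(?:\d+[a-z]?|[IVX]+)\s[A-Za-z]{1,6}\s\d+/\d{2}\b")


def naive_extract(text):
    return [m.group(0) for m in NAIVE_RE.finditer(text)]


def benchmark(extract, samples):
    """Count true/false positives and false negatives over labelled samples."""
    tp = fp = fn = 0
    for text, expected in samples:
        found, wanted = set(extract(text)), set(expected)
        tp += len(found & wanted)
        fp += len(found - wanted)
        fn += len(wanted - found)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


# Toy samples: (text, file numbers a human annotator would expect).
SAMPLES = [
    ("Ein Satz mit 2014 und 2014/20, und weiteren Sachen", []),
    ("Az. 7 C 123/14", ["7 C 123/14"]),
    ("BVerwG 7 A 9.19", ["7 A 9.19"]),
]

print(benchmark(naive_extract, SAMPLES))
```

Running the same samples before and after each regex change would show whether a stricter pattern removes false positives without adding false negatives.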

Do these changes make sense? Did I miss something? I'm also interested in understanding why you chose to match the court number first and then search the surroundings for the court name. Is there a special reason why you designed it that way?

@dataspider

FYI (adding to this thread because @d-e-h-i-o tagged me here):
The quantlaw package, maintained by @beckedorf and myself, has a BSD3-licensed reference parser for references to German statutes (optimized for references from statutes to statutes but also working on references to statutes in other legal documents), which we regularly use in our research.
We'll likely extend it to cover references to German judicial decisions of all kinds and citation formats in future work.

@malteos
Contributor

malteos commented Feb 28, 2021

Hey @d-e-h-i-o

thanks for the issue. There are indeed many cases that are currently not matched by the regex. I tried to write the regex based on gerichtsaktenzeichen.de and some examples from our corpus. However, there are many different corner cases that I found very difficult to cover with one regex (with an overly complex regex I ended up with many timeouts, which made it impractical to apply to a larger corpus). For the future, it might be better to switch to a machine-learning-based approach (the data is there). For now, small modifications to the regex may be sufficient. Would you be interested in contributing these modifications?

@dataspider Great to hear that you're now working on this! Since you probably have more resources to maintain this, what about merging the two projects? Let me know if you want to chat about this.

@d-e-h-i-o
Author

@malteos Yes, I'd like to contribute those. Do you have a specific list of what makes sense to add? I would have thought that the more it catches, the better for most use cases, and I have no idea at which point it would become too slow for your applications. I would certainly like to add the citation style BVerfGE 1, 208 and cases from the BVerwG, since these are important groups of citations, as well as some checks to filter out the false positives mentioned above.

@dataspider Thanks for the hint. Do you already have an idea what exactly you need, and in which format? Then I could look into whether it makes sense to contribute that too, if I'm working on it anyway (in case you're not interested in merging as Malte suggested).

@dataspider

@malteos I'm skeptical whether merging makes sense (at least at the current stage), but we can certainly have a chat about potential synergies.

@d-e-h-i-o The "example" folder in our project should allow you to get an idea of the format we are using (the approach is designed more carefully than the BVerfGE extraction, which was my rookie project), but again, whether there could be synergies depends on what you're actually working on/interested in. Feel free to reach out to discuss this (if it's for research, I might even be able to offer some advice - unless you want to step into all the traps yourself, of course).
