Refactor VirtualNode with annotations #521

vmarkovtsev · 2019-01-10T11:28:31Z

Inspired by UIMA. We never remove annotations, only add.

...

zurk · 2019-02-18T07:48:29Z

I think that this refactoring issue is an important one to do. We do not have time to implement UIMA approach for Python, so, I think that the easiest solution will be to create a new class AnnotatedVirtualNode:

class AnnotatedVirtualNode:
    def __init__():
        self.node = None  # type: VirtualNode
        self.y = None  # type: Sequence[int]
        self.y_pred = None  # type: Sequence[int]
        self.parent = None  # AnnotatedVirtualNode
        # etc

And put here all annotations we may need. After that, we agreed that VirtualNode is immutable and that is.

It is a simple approach and maybe I missed some important details why it does not suit well, what do you think @m09?

m09 · 2019-02-19T15:03:59Z

This is mostly what we currently have. Some problems include:

the number and location of attributes is not always the same (basic formatting classes are more numerous than merged formatting classes. Merged formatting classes are more numerous than filtered merged formatting classes, etc)
this relies on us not modifying values, it's not enforced even slightly by design
type annotations are not used to their full power (Optional[…] everywhere)
single annotations would be coupled with the full type system (ie all possible annotations).

Overall it's very far from a good annotation pattern. But indeed it might still be good to fix our current VirtualNode in this way (ie remove the cases that are not add-only in the codebase)

zurk · 2019-02-19T18:02:19Z

ok, I knew that I just did not see the problem in all colors. :)
AnnotatedVirtualNode v2:

class Annotation:
    def __init__(self):
        self._annotations = {}

    def __getitem__(self, item: str):
        return self._annotations[item]

    def __setitem__(self, key: str, value):
        # We can also add a list of known annotations with their types.
        # So we can check unexpected annotations and if it is known its type.
        if key in self._annotations:
            raise KeyError(("Annotation %s exists with value %s. It is not possible to overwrite "
                           "annotation") % (key, repr(self[key])))
        self._annotations[key] = value

    # I think it will be convenient to have such properties.
    @property
    def y(self):
        try:
            return self._annotations["y"]
        except KeyError:
            raise KeyError("There is no `y` annotation vnode %s")
        
    # etc


class AnnotatedVirtualNode:
    def __init__(self):
        # plus read-only properties
        self._node = None  # type: VirtualNode
        self._annotations = Annotation()

Also we can add a check function to know if this node is annotated as expected.
Should be easy to introduce into the codebase and covers all your points.

Do you see any drawbacks?

m09 · 2019-02-20T11:03:38Z

Sadly this does not solve the biggest problem (potentially different number/location of annotations for each type). More generally, I don't think there is a shortcut that can reap the benefits of a proper annotation pattern while needing only a small refactoring.

zurk · 2019-02-20T11:35:40Z

Sorry, @m09, I did not get the problem about "different number/location of annotations for each type".
Not sure what do you mean by "type" and "location" and why we cannot have a set of 5 possible annotations that will cover all our needs (or maybe even two for now: y and predicted with y_new and rule_number)

You did not mention this problem, but we can also have special decorators for functions to check that vnode sequence has required annotations

m09 · 2019-02-20T14:37:02Z

By type I just mean a different annotation (so for example y and predicted would be different annotations). By different numbers and locations I mean what I already mentioned: atomic classes cannot be attributes of the same object than y and predicted because there are many more of them (they annotate different locations of the code).

In a nutshell: what's needed is to have a separate list of objects for each type of annotation and that is why it's a pain to refactor. The specifics of the objects that represent annotations are not really important (they do matter for the add-only aspect and general performance/usability though).

zurk · 2019-02-20T17:00:34Z

Ah, now I see the problem about vnodes sequence with atomic classes and vnodes sequence with merged classes.

However, I think that we can consider these sequences as completely different. Moreover, the first one is kind of internal and appears only inside FeatureExtractor._parce_vnodes() function. Maybe it is not the best solution to use the same VirtualNode class for both because the vnode meanings are different, but I think it is a minor issue. As a result of FeatureExtractor._parce_vnodes() we have unchangeable vnodes sequence and just add more annotations on the way: y_original, y_filtered, y_predicted, etc.

I think it is also a good idea to introduce a special class for Sequence[VirtualNode]. VirtualCode for example. with such class, we will be able to avoid pass all vnodes, vnodes_y, parents, etc independently from function to function, but will work with one solid object what will make our life easier.

@m09 do not get my arguing wrong. We just do not have enough time to create good annotation tooling (+debugging +recode everything). So I think about a good intermediate solution that is not hard to implement but have almost all good properties we want.

m09 · 2019-02-20T18:10:11Z

Don't worry, I don't get it wrong ☺️ It's just that we should probably close this issue and open a new one with different refactoring ideas since those won't be close to implementing annotations. We should keep this pattern in mind for our next project though.

Renaming some of the attributes and adding a few new ones to be more explicit is indeed a way forward 👍

zurk · 2019-02-20T18:45:53Z

Ok, we need @vmarkovtsev to have a decision here.

Actually approach I described in #521 (comment) perfectly consistent with issue description so I think we can keep it, maybe with renaming.

m09 · 2019-02-20T18:54:10Z

It's not perfectly consistent since it doesn't implement the pattern that we wanted to implement. TBH I'm not even sure it helps compared to current codebase. The point of having separate annotations is that a given piece of code only considers a given type of annotation and has no filtering/validation to do. If we don't have that we miss the gist of the pattern.

zurk · 2019-02-20T19:04:00Z

ok, I see. let's discuss it tomorrow at the meeting.

vmarkovtsev · 2019-02-21T11:44:52Z

Discussed at the meeting. Actions:

Write the core parsing in lookout-sdk-ml. Describe the classes, agree on the structure and then implement it.
Add the rest of the annotators to style-analyzer, replace the existing parsing and convert to the old format
Remove the conversion and replace everything left

zurk · 2019-02-26T14:13:18Z

We have a quick call with @m09 and he ok with this structure. Hugo, any additional comments are welcome. to_vnodes method should be removed as soon as we rewrite everything with annotations.

I start to code proper implementation for these classes and after that, I begin style-analyzer code migration.

We should use correct names for classes. I am not sure about RawData and AnnotatedData.

Core Annotation class structure

"""
Annotation tool.
Inspired by https://uima.apache.org/d/uimafit-current/api/
"""
from collections import OrderedDict
from typing import Tuple, Dict, Iterator, Sequence, List


class RawData:
    """The storage for ordered document collection indexed and accessible by global offsets."""

    def __init__(self):
        self.document_collection = []  # type: List[str]

        # Used by __getitem__ method.
        self._global_to_document_offset = None
        self._document_offset_to_global = None

    def __getitem__(self, key):
        # main function to access data parts by offset.
        if isinstance(key, slice):
            return ...
        return ...

class Annotation:
    """Base class for annotation"""

    name = None

    def __init__(self):
        self.start_offset = None
        self.end_offset = None


class ValuedAnnotation(Annotation):
    def __init__(self, value):
        self.value = value  # line number, bblfsh Node, target value, etc.


# Specific annotations for style-analyzer:
class AnnotationNames:
    atomic_token = "atomic_token"
    line = "line"
    uast_node = "uast_node"


class AtomicTokenAnnotation(Annotation):
    """Infrangible сode token annotation"""

    name = AnnotationNames.atomic_token


class LineAnnotation(ValuedAnnotation):
    """Line number annotation"""

    name = AnnotationNames.line


class UASTNodeAnnotation(ValuedAnnotation):
    """UAST Node annotation"""

    name = AnnotationNames.uast_node

# etc


AnnotationNameType = str  # just to make a difference between str and annotation name.


class AnnotatedData:
    """
    Class that couples annotations and data together.

    All special utilities to work with annotations should be implemented in this class
    List of methods that should be implemented can be found here: 
    https://uima.apache.org/d/uimafit-current/api/org/apache/uima/fit/util/JCasUtil.html
    """
    def __init__(self):
        self.raw_data = None  # type: RawData
        # Interval trees should be used for _annotations later.
        self._annotations = None  # type: OrderedDict[(int, int), Dict[AnnotationNameType, Annotation]]

    def get(self, position: Tuple[int, int]) -> Tuple[str, Dict[str, Annotation]]:
        """
        Get annotated value and all annotations for the range.
        """
        pass

    def iter_annotation(self, t: AnnotationNameType) -> Iterator[Tuple[str, Annotation]]:
        """
        Iter through specific annotation atomic_tokens, ys, files, etc
        returns slice of RawData and its annotation.
        """

    def to_vnodes(self) -> List[VirtualNode]:
        # backward capability
        # then delete
        pass

zurk · 2019-04-23T09:36:21Z

Next steps:

Merge PR Add annotations to FeatureExtractor #761
Add more tests
Resolve todos that are easy to resolve
Continue Annotation expansion.
Optimize. I feel like speed can be improved a lot.

zurk · 2019-08-20T09:08:19Z

@vmarkovtsev @m09 I suggest to close this issue and open another one when we need to continue code refactoring with annotation

vmarkovtsev added enhancement New feature or request refactor labels Jan 10, 2019

vmarkovtsev added this to the Refactoring January 2019 milestone Jan 10, 2019

vmarkovtsev assigned m09 Jan 10, 2019

vmarkovtsev added the large Large size label Jan 10, 2019

zurk self-assigned this Feb 26, 2019

zurk mentioned this issue Feb 28, 2019

Add annotations #654

Closed

zurk mentioned this issue Apr 22, 2019

Add annotations to FeatureExtractor #761

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor VirtualNode with annotations #521

Refactor VirtualNode with annotations #521

vmarkovtsev commented Jan 10, 2019

zurk commented Feb 18, 2019

m09 commented Feb 19, 2019

zurk commented Feb 19, 2019

m09 commented Feb 20, 2019

zurk commented Feb 20, 2019 •

edited

Loading

m09 commented Feb 20, 2019

zurk commented Feb 20, 2019

m09 commented Feb 20, 2019

zurk commented Feb 20, 2019

m09 commented Feb 20, 2019

zurk commented Feb 20, 2019 •

edited

Loading

vmarkovtsev commented Feb 21, 2019

zurk commented Feb 26, 2019 •

edited

Loading

zurk commented Apr 23, 2019 •

edited

Loading

zurk commented Aug 20, 2019

Refactor VirtualNode with annotations #521

Refactor VirtualNode with annotations #521

Comments

vmarkovtsev commented Jan 10, 2019

zurk commented Feb 18, 2019

m09 commented Feb 19, 2019

zurk commented Feb 19, 2019

m09 commented Feb 20, 2019

zurk commented Feb 20, 2019 • edited Loading

m09 commented Feb 20, 2019

zurk commented Feb 20, 2019

m09 commented Feb 20, 2019

zurk commented Feb 20, 2019

m09 commented Feb 20, 2019

zurk commented Feb 20, 2019 • edited Loading

vmarkovtsev commented Feb 21, 2019

zurk commented Feb 26, 2019 • edited Loading

zurk commented Apr 23, 2019 • edited Loading

zurk commented Aug 20, 2019

zurk commented Feb 20, 2019 •

edited

Loading

zurk commented Feb 20, 2019 •

edited

Loading

zurk commented Feb 26, 2019 •

edited

Loading

zurk commented Apr 23, 2019 •

edited

Loading