
Investigate performance drop for react-native repo with new tokens structure. #595

Closed
zurk opened this issue Feb 11, 2019 · 6 comments
Labels: format (Issues related to format analyzer)

@zurk (Contributor) commented Feb 11, 2019

Context: #586 (comment)

@zurk zurk added the format label Feb 11, 2019
@zurk zurk added this to the Refactoring January 2019 milestone Feb 11, 2019
@zurk zurk self-assigned this Feb 11, 2019
@zurk (Contributor, Author) commented Feb 15, 2019

What I can see from the reports:

| type  |               repo |      precision |         recall |    full_recall |             f1 |        full_f1 |           ppcr |          support |     full_support |   Rules Number |   Average Rule Len |
|-------|-------------------:|---------------:|---------------:|---------------:|---------------:|---------------:|---------------:|-----------------:|-----------------:|---------------:|-------------------:|
| test  |       react-native | 0.928 (-0.024) | 0.928 (-0.024) | 0.915 (+0.029) | 0.928 (-0.024) | 0.921 (+0.004) | 0.987 (+0.056) |  89834 (  +6513) |  91060 (  +1518) |    1043 (+875) |        11.5 (+0.5) |
| train |       react-native | 0.956 (-0.019) | 0.956 (-0.019) | 0.884 (-0.085) | 0.956 (-0.019) | 0.919 (-0.053) | 0.925 (-0.069) | 622541 ( +56980) | 673224 (+104247) |    1043 (+875) |        11.5 (+0.5) |
  • First of all, it is not related to trailing characters, because there are none. So it is only about quotes.
  • There is a big PPCR increase (+5.6%) on the test part (PPCR is defined in the note below).
  • There is a huge increase in the number of rules (+875, roughly +500%) -- maybe we did not find good parameters with GridSearch.
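For reference, PPCR in these reports is the ratio of samples on which the model actually makes a prediction (the rest are refused): support divided by full_support. A tiny sanity check against the table above:

```python
# PPCR = support / full_support: the fraction of samples the model predicts
# on at all; the remaining samples are refused.
def ppcr(support: int, full_support: int) -> float:
    return support / full_support

assert round(ppcr(89834, 91060), 3) == 0.987    # test, na
assert round(ppcr(622541, 673224), 3) == 0.925  # train, na
```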

I also took the repo with the biggest quality improvement (storybook) to see if there are similar patterns; it helps to understand what is going on.

In the following tables, 010 means style-analyzer v0.1.0 and na means commit eb2ec03d702cfbacfe2bfd1601cede2320ea4884, with all my recent changes.

test classification report comparison for react-native
| v   | Class | Precision | Recall | Full Recall | F1-score | Full F1-score | Support | Full Support | PPCR |
|:----|:------|----------:|-------:|------------:|---------:|--------------:|--------:|-------------:|-----:|
| 010 |       | 0.968 | 0.989 | 0.972 | 0.978 | 0.970 | 45992 | 46775 | 0.983 |
| na  |       | 0.950 | 0.986 | 0.986 | 0.968 | 0.968 | 46764 | 46775 | 1.000 |
| 010 |       | 0.963 | 0.975 | 0.911 | 0.969 | 0.936 | 20479 | 21917 | 0.934 |
| na  |       | 0.929 | 0.959 | 0.954 | 0.944 | 0.942 | 23070 | 23182 | 0.995 |
| 010 |       | 0.877 | 0.882 | 0.675 | 0.880 | 0.763 | 5035 | 6573 | 0.766 |
| na  |       | 0.846 | 0.828 | 0.752 | 0.837 | 0.796 | 6013 | 6622 | 0.908 |
| 010 | ⏎␣⁺␣⁺ | 0.916 | 0.706 | 0.538 | 0.798 | 0.678 | 3324 | 4368 | 0.761 |
| na  | ⏎␣⁺␣⁺ | 0.889 | 0.584 | 0.569 | 0.705 | 0.694 | 4266 | 4381 | 0.974 |
| 010 | ⏎␣⁻␣⁻ | 0.919 | 0.818 | 0.731 | 0.865 | 0.814 | 3251 | 3635 | 0.894 |
| na  | ⏎␣⁻␣⁻ | 0.892 | 0.794 | 0.764 | 0.841 | 0.823 | 3562 | 3705 | 0.961 |
| 010 | ⏎␣⁻␣⁻␣⁻␣⁻ | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 196 | 279 | 0.703 |
| na  | ⏎␣⁻␣⁻␣⁻␣⁻ | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 255 | 279 | 0.914 |
| 010 | ⏎⏎ | 0.712 | 0.806 | 0.567 | 0.756 | 0.631 | 839 | 1193 | 0.703 |
| na  | ⏎⏎ | 0.753 | 0.620 | 0.549 | 0.680 | 0.635 | 1144 | 1292 | 0.885 |
| 010 | ⏎␣⁻␣⁻␣⁻␣⁻␣⁻␣⁻ | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 45 | 65 | 0.692 |
| na  | ⏎␣⁻␣⁻␣⁻␣⁻␣⁻␣⁻ | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 61 | 65 | 0.938 |
| 010 | ⏎␣⁺␣⁺␣⁺␣⁺ | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 33 | 101 | 0.327 |
| na  | ⏎␣⁺␣⁺␣⁺␣⁺ | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 101 | 102 | 0.990 |
| 010 | ⏎⏎␣⁻␣⁻ | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 10 | 19 | 0.526 |
| na  | ⏎⏎␣⁻␣⁻ | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 16 | 19 | 0.842 |
| 010 | ' | 0.905 | 0.991 | 0.944 | 0.946 | 0.924 | 2521 | 2646 | 0.953 |
| na  | ' | 0.862 | 0.978 | 0.976 | 0.916 | 0.915 | 3713 | 3722 | 0.998 |
| 010 | ␣' | 0.873 | 0.939 | 0.823 | 0.904 | 0.847 | 701 | 800 | 0.876 |
| 010 | '␣ | 0.818 | 0.628 | 0.570 | 0.711 | 0.672 | 129 | 142 | 0.908 |
| 010 | ⏎⏎' | 0.933 | 0.990 | 0.990 | 0.960 | 0.960 | 98 | 98 | 1.000 |
| 010 | '⏎ | 1.000 | 0.500 | 0.417 | 0.667 | 0.588 | 10 | 12 | 0.833 |
| 010 | ⏎' | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 7 | 7 | 1.000 |
| 010 | " | 0.974 | 0.644 | 0.453 | 0.775 | 0.618 | 348 | 495 | 0.703 |
| na  | " | 0.960 | 0.522 | 0.496 | 0.677 | 0.654 | 869 | 916 | 0.949 |
| 010 | "␣ | 1.000 | 0.840 | 0.781 | 0.913 | 0.877 | 200 | 215 | 0.930 |
| 010 | ␣" | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 70 | 108 | 0.648 |
| 010 | "⏎ | 1.000 | 0.172 | 0.167 | 0.294 | 0.286 | 29 | 30 | 0.967 |
| 010 | "⏎␣⁻␣⁻ | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 4 | 64 | 0.062 |
| 010 | micro avg | 0.952 | 0.952 | 0.886 | 0.952 | 0.917 | 83321 | 89542 | 0.931 |
| na  | micro avg | 0.928 | 0.928 | 0.915 | 0.928 | 0.921 | 89834 | 91060 | 0.987 |
| 010 | macro avg | 0.612 | 0.518 | 0.454 | 0.544 | 0.503 | 83321 | 89542 | 0.931 |
| na  | macro avg | 0.590 | 0.523 | 0.504 | 0.547 | 0.536 | 89834 | 91060 | 0.987 |
| 010 | weighted avg | 0.947 | 0.952 | 0.886 | 0.948 | 0.909 | 83321 | 89542 | 0.931 |
| na  | weighted avg | 0.922 | 0.928 | 0.915 | 0.922 | 0.915 | 89834 | 91060 | 0.987 |
test classification report comparison for storybook (same report format as above)

What we can see:

  • The behavior is precisely the opposite: almost all storybook tokens gain precision, while react-native shows the reverse, with only a few exceptions.
  • For quote-related tokens, 010 performs better than na. Maybe the surrounding spaces and newlines help to classify these tokens.
  • The only case where na performs better is ⏎⏎.
  • react-native also contains a lot of JSX, which can also hurt the quality because of incorrect processing: Failed to fix formating in JSX #605.

Here is a comparison between the confusion matrices for react-native. They were normalized by the overall number of predictions, and the plot is on a log scale.
To read it: if a cell is green, the value in the na report is bigger; if it is purple, the value in the 010 report is bigger. The x axis is the model prediction and the y axis is the correct answer. A sketch of how such a plot can be produced follows the observations below.

[image: normalized confusion matrix difference for react-native, log scale]

  • na refuses to predict far less often. That can explain part of the quality drop.
  • We confuse ' and " more often. Maybe it is now harder to predict them.
  • We predict a space instead of a newline, or instead of a newline with an indentation increase, more often.
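A minimal sketch of how such a difference plot can be produced (this is not the original evaluation code; y_true, pred_010, and pred_na are assumed label arrays from the two models):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

labels = sorted(set(y_true))

def normalized_cm(y_pred):
    cm = confusion_matrix(y_true, y_pred, labels=labels).astype(float)
    return cm / cm.sum()  # normalize by the overall number of predictions

# Log-ratio: positive (green) where na is bigger, negative (purple) where
# 010 is bigger; eps avoids taking the log of empty cells.
eps = 1e-9
diff = np.log10((normalized_cm(pred_na) + eps) / (normalized_cm(pred_010) + eps))
lim = np.abs(diff).max()

plt.imshow(diff, cmap="PRGn", vmin=-lim, vmax=lim)  # purple-to-green diverging map
plt.xlabel("model prediction")  # columns of the confusion matrix
plt.ylabel("correct answer")    # rows of the confusion matrix
plt.colorbar()
plt.show()
```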

I also spent a lot of time digging deeper, but I did not find anything suspicious there.
These are the only insights I found.

@zurk (Contributor, Author) commented Feb 22, 2019

What I did:

  1. Trained the v010 model.
  2. Reviewed the v010 model parameters and compared them to the new parameter set.
    2.1 The boost in the number of rules comes from the random forest; the old model uses a single decision tree.
    2.2 Our unpretentious feature selection algorithm selects quite a different set of features.
  3. Disabled the random forest and pinned the features to the set selected for the v010 model to see what I can get (a sketch of this experiment follows below). Here are the results in comparison with the v010 report:
|type  |               repo |      precision |         recall |    full_recall |             f1 |        full_f1 |           ppcr |         support |    full_support |   Rules Number |   Average Rule Len |
|------|-------------------:|---------------:|---------------:|---------------:|---------------:|---------------:|---------------:|----------------:|----------------:|---------------:|-------------------:|
|train |       react-native | 0.975 (+0.000) | 0.975 (+0.000) | 0.845 (-0.124) | 0.975 (+0.000) | 0.906 (-0.066) | 0.867 (-0.127) | 578027( +12466) | 667002( +98025) |     164 (  -4) |        10.9 (-0.1) |
|test  |       react-native | 0.956 (+0.004) | 0.956 (+0.004) | 0.889 (+0.003) | 0.956 (+0.004) | 0.922 (+0.005) | 0.930 (-0.001) | 84702 (  +1381) | 91060 (  +1518) |     164 (  -4) |        10.9 (-0.1) |

So the whole problem came down to the imperfect feature selection part and the random forest.
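A minimal sketch of that experiment, not the actual style-analyzer training code; X, y, and v010_feature_indices are assumed to exist (the full feature matrix, the labels, and the indices of the features the v010 selector kept):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Pin the feature set to the one selected for v010,
# bypassing the new feature selection step entirely.
X_pinned = X[:, v010_feature_indices]

# Single decision tree, as in v010: the extracted rules
# are the root-to-leaf paths of one tree.
tree = DecisionTreeClassifier().fit(X_pinned, y)

# A random forest multiplies the extracted rule count roughly by the number
# of trees, which is consistent with the +875 rules observed above.
forest = RandomForestClassifier(n_estimators=10).fit(X_pinned, y)
```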


And I see here one important outcome that is not related to this particular issue.
Right now we generate 4185 features and select 500 of them with a rather basic algorithm that has many drawbacks. It makes the inference logic a little bit opaque, because we do not know which features we actually end up with. It also adds randomness when we check the quality of our changes.
We did not have enough time to experiment and change anything here, but we should do it after the refactoring or in Q2. Once I tried to pre-select features by hand and saw (by eye, not by numbers) a good quality gain: #190. There was a discussion that I left too few features. In the end, that PR was made for the demo and we did not do proper research after it.


Summary:

  1. The quality drop was caused by the unpretentious feature selection.
  2. We should improve feature selection after the refactoring.
    2.1. Decrease the number of generated features from 4185 to 1000-1500.
    2.2. Think about whether we can do better feature selection.
  3. If we do not want +1000 rules, we should disable the RandomForest classifier.

@m09 (Contributor) commented Feb 22, 2019

If we research feature selection again, then it is worth studying feature agglomeration instead of selection (see the sketch below). It is also worth discussing running the feature selection once on a chosen set of repos instead of training it again for each repo.

Also, now that we have found that the better features are not selected, it would be nice to find out why: they should be selected during selector fitting if they really help classification.
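A minimal sketch of the agglomeration idea, assuming a feature matrix X of shape (n_samples, 4185); this is scikit-learn's FeatureAgglomeration, not anything that exists in style-analyzer today:

```python
from sklearn.cluster import FeatureAgglomeration

# Merge correlated features into 500 clusters instead of discarding most
# of the 4185 generated features.
agglo = FeatureAgglomeration(n_clusters=500)
X_reduced = agglo.fit_transform(X)  # pools each cluster (mean by default)
```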

@vmarkovtsev (Collaborator) commented

@m09 Can we add feature selection to our Optimizer? We could raise the initial number from 500 to 2000 and then pick 500 from those 2000.
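A hypothetical sketch of that two-stage setup; the names (X, y, train_and_score, and the Optimizer integration itself) are illustrative, not style-analyzer's real API. A cheap univariate pre-selection shrinks the 4185 features to 2000, and the optimizer then tunes the final subset:

```python
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Stage 1: fixed pre-selection, run once before optimization.
X_pre = SelectKBest(mutual_info_classif, k=2000).fit_transform(X, y)

# Stage 2: the optimizer searches over params["k"] (e.g. around 500).
def objective(params):
    selector = SelectKBest(mutual_info_classif, k=params["k"])
    return train_and_score(selector.fit_transform(X_pre, y), y)  # assumed helper
```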

@m09 (Contributor) commented Feb 22, 2019

@vmarkovtsev It should be doable, yes. We can select over the 4k features in the hyperparameter optimization (or aggregate them, as mentioned above). The really too expensive thing to run in the hyperparameter optimization is feature extraction, even though we would need to. Selection itself is fast enough, I think.

@zurk (Contributor, Author) commented Feb 22, 2019

Let's continue here: #637. Please add to its description if I missed something.
