
Investigate performance drop for react-native repo with new tokens structure. #595

Closed
zurk opened this issue Feb 11, 2019 · 6 comments
Labels: format (Issues related to format analyzer)

@zurk (Contributor) commented Feb 11, 2019

Context: #586 (comment)

@zurk zurk added the format label Feb 11, 2019
@zurk zurk added this to the Refactoring January 2019 milestone Feb 11, 2019
@zurk zurk self-assigned this Feb 11, 2019
@zurk (Contributor, Author) commented Feb 15, 2019

What I can see from the reports:

| type  |               repo |      precision |         recall |    full_recall |             f1 |        full_f1 |           ppcr |          support |     full_support |   Rules Number |   Average Rule Len |
|-------|-------------------:|---------------:|---------------:|---------------:|---------------:|---------------:|---------------:|-----------------:|-----------------:|---------------:|-------------------:|
| test  |       react-native | 0.928 (-0.024) | 0.928 (-0.024) | 0.915 (+0.029) | 0.928 (-0.024) | 0.921 (+0.004) | 0.987 (+0.056) |  89834 (  +6513) |  91060 (  +1518) |    1043 (+875) |        11.5 (+0.5) |
| train |       react-native | 0.956 (-0.019) | 0.956 (-0.019) | 0.884 (-0.085) | 0.956 (-0.019) | 0.919 (-0.053) | 0.925 (-0.069) | 622541 ( +56980) | 673224 (+104247) |    1043 (+875) |        11.5 (+0.5) |
  • First of all, it is not related to trailing characters, because there are none. So it is only about quotes.
  • There is a big PPCR increase (+5.6%) on the test part (PPCR is defined in the note below).
  • There is a huge increase in the number of rules (+875, roughly +500%) -- maybe we did not find good parameters with GridSearch.
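For reference, PPCR in these reports is the ratio of samples on which the model actually makes a prediction (the rest are refused): support divided by full_support. A tiny sanity check against the table above:

```python
# PPCR = support / full_support: the fraction of samples the model predicts
# on at all; the remaining samples are refused.
def ppcr(support: int, full_support: int) -> float:
    return support / full_support

assert round(ppcr(89834, 91060), 3) == 0.987    # test, na
assert round(ppcr(622541, 673224), 3) == 0.925  # train, na
```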

I also took the repo with the biggest quality improvement (storybook) to see if there are similar patterns; it helps to understand what is going on.

In the following tables, 010 means style-analyzer v0.1.0 and na means commit eb2ec03d702cfbacfe2bfd1601cede2320ea4884, with all my recent changes.

test classification report comparison for react-native
| v   | Class | Precision | Recall | Full Recall | F1-score | Full F1-score | Support | Full Support | PPCR |
|:----|:------|----------:|-------:|------------:|---------:|--------------:|--------:|-------------:|-----:|
| 010 |       | 0.968 | 0.989 | 0.972 | 0.978 | 0.970 | 45992 | 46775 | 0.983 |
| na  |       | 0.950 | 0.986 | 0.986 | 0.968 | 0.968 | 46764 | 46775 | 1.000 |
| 010 |       | 0.963 | 0.975 | 0.911 | 0.969 | 0.936 | 20479 | 21917 | 0.934 |
| na  |       | 0.929 | 0.959 | 0.954 | 0.944 | 0.942 | 23070 | 23182 | 0.995 |
| 010 |       | 0.877 | 0.882 | 0.675 | 0.880 | 0.763 | 5035 | 6573 | 0.766 |
| na  |       | 0.846 | 0.828 | 0.752 | 0.837 | 0.796 | 6013 | 6622 | 0.908 |
| 010 | ⏎␣⁺␣⁺ | 0.916 | 0.706 | 0.538 | 0.798 | 0.678 | 3324 | 4368 | 0.761 |
| na  | ⏎␣⁺␣⁺ | 0.889 | 0.584 | 0.569 | 0.705 | 0.694 | 4266 | 4381 | 0.974 |
| 010 | ⏎␣⁻␣⁻ | 0.919 | 0.818 | 0.731 | 0.865 | 0.814 | 3251 | 3635 | 0.894 |
| na  | ⏎␣⁻␣⁻ | 0.892 | 0.794 | 0.764 | 0.841 | 0.823 | 3562 | 3705 | 0.961 |
| 010 | ⏎␣⁻␣⁻␣⁻␣⁻ | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 196 | 279 | 0.703 |
| na  | ⏎␣⁻␣⁻␣⁻␣⁻ | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 255 | 279 | 0.914 |
| 010 | ⏎⏎ | 0.712 | 0.806 | 0.567 | 0.756 | 0.631 | 839 | 1193 | 0.703 |
| na  | ⏎⏎ | 0.753 | 0.620 | 0.549 | 0.680 | 0.635 | 1144 | 1292 | 0.885 |
| 010 | ⏎␣⁻␣⁻␣⁻␣⁻␣⁻␣⁻ | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 45 | 65 | 0.692 |
| na  | ⏎␣⁻␣⁻␣⁻␣⁻␣⁻␣⁻ | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 61 | 65 | 0.938 |
| 010 | ⏎␣⁺␣⁺␣⁺␣⁺ | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 33 | 101 | 0.327 |
| na  | ⏎␣⁺␣⁺␣⁺␣⁺ | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 101 | 102 | 0.990 |
| 010 | ⏎⏎␣⁻␣⁻ | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 10 | 19 | 0.526 |
| na  | ⏎⏎␣⁻␣⁻ | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 16 | 19 | 0.842 |
| 010 | ' | 0.905 | 0.991 | 0.944 | 0.946 | 0.924 | 2521 | 2646 | 0.953 |
| na  | ' | 0.862 | 0.978 | 0.976 | 0.916 | 0.915 | 3713 | 3722 | 0.998 |
| 010 | ␣' | 0.873 | 0.939 | 0.823 | 0.904 | 0.847 | 701 | 800 | 0.876 |
| 010 | '␣ | 0.818 | 0.628 | 0.570 | 0.711 | 0.672 | 129 | 142 | 0.908 |
| 010 | ⏎⏎' | 0.933 | 0.990 | 0.990 | 0.960 | 0.960 | 98 | 98 | 1.000 |
| 010 | '⏎ | 1.000 | 0.500 | 0.417 | 0.667 | 0.588 | 10 | 12 | 0.833 |
| 010 | ⏎' | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 7 | 7 | 1.000 |
| 010 | " | 0.974 | 0.644 | 0.453 | 0.775 | 0.618 | 348 | 495 | 0.703 |
| na  | " | 0.960 | 0.522 | 0.496 | 0.677 | 0.654 | 869 | 916 | 0.949 |
| 010 | "␣ | 1.000 | 0.840 | 0.781 | 0.913 | 0.877 | 200 | 215 | 0.930 |
| 010 | ␣" | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 70 | 108 | 0.648 |
| 010 | "⏎ | 1.000 | 0.172 | 0.167 | 0.294 | 0.286 | 29 | 30 | 0.967 |
| 010 | "⏎␣⁻␣⁻ | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 4 | 64 | 0.062 |
| 010 | micro avg | 0.952 | 0.952 | 0.886 | 0.952 | 0.917 | 83321 | 89542 | 0.931 |
| na  | micro avg | 0.928 | 0.928 | 0.915 | 0.928 | 0.921 | 89834 | 91060 | 0.987 |
| 010 | macro avg | 0.612 | 0.518 | 0.454 | 0.544 | 0.503 | 83321 | 89542 | 0.931 |
| na  | macro avg | 0.590 | 0.523 | 0.504 | 0.547 | 0.536 | 89834 | 91060 | 0.987 |
| 010 | weighted avg | 0.947 | 0.952 | 0.886 | 0.948 | 0.909 | 83321 | 89542 | 0.931 |
| na  | weighted avg | 0.922 | 0.928 | 0.915 | 0.922 | 0.915 | 89834 | 91060 | 0.987 |
test classification report comparison for storybook (same report format as above)

What we can see:

  • The behavior is precisely the opposite: almost all storybook tokens gain precision, while react-native shows the reverse, with only a few exceptions.
  • For quote-related tokens, 010 performs better than na. Maybe the surrounding spaces and newlines help to classify these tokens.
  • The only case where na performs better is ⏎⏎.
  • react-native also contains a lot of JSX, which can also hurt the quality because of incorrect processing: Failed to fix formating in JSX #605.

Here is a comparison between the confusion matrices for react-native. They were normalized by the overall number of predictions, and the plot is on a log scale.
To read it: if a cell is green, the value in the na report is bigger; if it is purple, the value in the 010 report is bigger. The x axis is the model prediction and the y axis is the correct answer. A sketch of how such a plot can be produced follows the observations below.

[image: normalized confusion matrix difference for react-native, log scale]

  • na refuses to predict far less often. That can explain part of the quality drop.
  • We confuse ' and " more often. Maybe it is now harder to predict them.
  • We predict a space instead of a newline, or instead of a newline with an indentation increase, more often.
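A minimal sketch of how such a difference plot can be produced (this is not the original evaluation code; y_true, pred_010, and pred_na are assumed label arrays from the two models):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

labels = sorted(set(y_true))

def normalized_cm(y_pred):
    cm = confusion_matrix(y_true, y_pred, labels=labels).astype(float)
    return cm / cm.sum()  # normalize by the overall number of predictions

# Log-ratio: positive (green) where na is bigger, negative (purple) where
# 010 is bigger; eps avoids taking the log of empty cells.
eps = 1e-9
diff = np.log10((normalized_cm(pred_na) + eps) / (normalized_cm(pred_010) + eps))
lim = np.abs(diff).max()

plt.imshow(diff, cmap="PRGn", vmin=-lim, vmax=lim)  # purple-to-green diverging map
plt.xlabel("model prediction")  # columns of the confusion matrix
plt.ylabel("correct answer")    # rows of the confusion matrix
plt.colorbar()
plt.show()
```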

I also spent a lot of time digging deeper, but I did not find anything suspicious there.
These are the only insights I found.

@zurk (Contributor, Author) commented Feb 22, 2019

What I did:

  1. Trained the v010 model.
  2. Reviewed the v010 model parameters and compared them to the new parameter set.
    2.1 The boost in the number of rules comes from the random forest; the old model uses a single decision tree.
    2.2 Our unpretentious feature selection algorithm selects quite a different set of features.
  3. Disabled the random forest and pinned the features to the set selected for the v010 model to see what I can get (a sketch of this experiment follows below). Here are the results in comparison with the v010 report:
|type  |               repo |      precision |         recall |    full_recall |             f1 |        full_f1 |           ppcr |         support |    full_support |   Rules Number |   Average Rule Len |
|------|-------------------:|---------------:|---------------:|---------------:|---------------:|---------------:|---------------:|----------------:|----------------:|---------------:|-------------------:|
|train |       react-native | 0.975 (+0.000) | 0.975 (+0.000) | 0.845 (-0.124) | 0.975 (+0.000) | 0.906 (-0.066) | 0.867 (-0.127) | 578027( +12466) | 667002( +98025) |     164 (  -4) |        10.9 (-0.1) |
|test  |       react-native | 0.956 (+0.004) | 0.956 (+0.004) | 0.889 (+0.003) | 0.956 (+0.004) | 0.922 (+0.005) | 0.930 (-0.001) | 84702 (  +1381) | 91060 (  +1518) |     164 (  -4) |        10.9 (-0.1) |

So the whole problem came down to the imperfect feature selection part and the random forest.
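A minimal sketch of that experiment, not the actual style-analyzer training code; X, y, and v010_feature_indices are assumed to exist (the full feature matrix, the labels, and the indices of the features the v010 selector kept):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Pin the feature set to the one selected for v010,
# bypassing the new feature selection step entirely.
X_pinned = X[:, v010_feature_indices]

# Single decision tree, as in v010: the extracted rules
# are the root-to-leaf paths of one tree.
tree = DecisionTreeClassifier().fit(X_pinned, y)

# A random forest multiplies the extracted rule count roughly by the number
# of trees, which is consistent with the +875 rules observed above.
forest = RandomForestClassifier(n_estimators=10).fit(X_pinned, y)
```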


And I see here one important outcome that is not related to this particular issue.
Right now we generate 4185 features and select 500 of them with a rather basic algorithm that has many drawbacks. It makes the inference logic a little bit opaque, because we do not know which features we actually end up with. It also adds randomness when we check the quality of our changes.
We did not have enough time to experiment and change anything here, but we should do it after the refactoring or in Q2. Once I tried to pre-select features by hand and saw (by eye, not by numbers) a good quality gain: #190. There was a discussion that I left too few features. In the end, that PR was made for the demo and we did not do proper research after it.


Summary:

  1. The quality drop was caused by the unpretentious feature selection.
  2. We should improve feature selection after the refactoring.
    2.1. Decrease the number of generated features from 4185 to 1000-1500.
    2.2. Think about whether we can do better feature selection.
  3. If we do not want +1000 rules, we should disable the RandomForest classifier.

@m09 (Contributor) commented Feb 22, 2019

If we research feature selection again, then it is worth studying feature agglomeration instead of selection (see the sketch below). It is also worth discussing running the feature selection once on a chosen set of repos instead of training it again for each repo.

Also, now that we have found that the better features are not selected, it would be nice to find out why: they should be selected during selector fitting if they really help classification.
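A minimal sketch of the agglomeration idea, assuming a feature matrix X of shape (n_samples, 4185); this is scikit-learn's FeatureAgglomeration, not anything that exists in style-analyzer today:

```python
from sklearn.cluster import FeatureAgglomeration

# Merge correlated features into 500 clusters instead of discarding most
# of the 4185 generated features.
agglo = FeatureAgglomeration(n_clusters=500)
X_reduced = agglo.fit_transform(X)  # pools each cluster (mean by default)
```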

@vmarkovtsev (Collaborator) commented

@m09 Can we add feature selection to our Optimizer? We could raise the initial number from 500 to 2000 and then pick 500 from those 2000.
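A hypothetical sketch of that two-stage setup; the names (X, y, train_and_score, and the Optimizer integration itself) are illustrative, not style-analyzer's real API. A cheap univariate pre-selection shrinks the 4185 features to 2000, and the optimizer then tunes the final subset:

```python
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Stage 1: fixed pre-selection, run once before optimization.
X_pre = SelectKBest(mutual_info_classif, k=2000).fit_transform(X, y)

# Stage 2: the optimizer searches over params["k"] (e.g. around 500).
def objective(params):
    selector = SelectKBest(mutual_info_classif, k=params["k"])
    return train_and_score(selector.fit_transform(X_pre, y), y)  # assumed helper
```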

@m09 (Contributor) commented Feb 22, 2019

@vmarkovtsev It should be doable, yes. We can select over the 4k features in the hyperparameter optimization (or aggregate them, as mentioned above). The really too expensive thing to run in the hyperparameter optimization is feature extraction, even though we would need to. Selection itself is fast enough, I think.

@zurk (Contributor, Author) commented Feb 22, 2019

Let's continue here: #637. Please add to its description if I missed something.
