Metrics
1. AUC: use the rank of the probabilities before averaging submissions, not the probabilities themselves (see the sketch after this list)
you don't need all rows of the dataset (AUC is equivalent to a ranking), especially if a category is not present in the test set
2. if the regression target is between 0 and 1, try an Xentropy objective together with RMSE
3. if the regression target is not between 0 and 1, try splitting it into multiple binary targets
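A minimal sketch of the rank-averaging idea from item 1, assuming hypothetical submission files with columns id and target (AUC depends only on the ordering of predictions, not their scale):

import pandas as pd
from scipy.stats import rankdata

sub1 = pd.read_csv("submission_model_a.csv")   # hypothetical submission files
sub2 = pd.read_csv("submission_model_b.csv")   # columns: id, target

blend = sub1.copy()
# convert each model's probabilities to ranks scaled to (0, 1], then average
r1 = rankdata(sub1["target"]) / len(sub1)
r2 = rankdata(sub2["target"]) / len(sub2)
blend["target"] = (r1 + r2) / 2
blend.to_csv("blend_rank_average.csv", index=False)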
Feature engineering for time series:
1. feature types: count, cumcount (fill NAs with the mean), time delta (replace NAs with the max), moving average (see talkingdata-adtracking-fraud-detection and the pandas sketch after this list)
2. time delta: be careful about the timing of test vs. train; recompute the features over a one-day window if they produce too many NAs
3. keep an eye on duplicates for patterns (imagine how the dataset was built)
4. cluster on a text or numeric variable after grouping on a categorical variable, then group by this '_cluster' and calculate aggregated features such as price rank: https://www.kaggle.com/c/avito-demand-prediction/discussion/59886
5. reorder rows and columns to recover time series (see the Santander 2018 competition) using brute force, Jaccard index & graphs, or unique values & matrices
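A minimal pandas sketch of the groupby features listed above (column names such as 'ip', 'app' and 'click_time' are assumptions modelled on TalkingData-style data):

import pandas as pd

df = pd.read_csv("train.csv", parse_dates=["click_time"])
df = df.sort_values("click_time")
grp = ["ip", "app"]

# count: size of each group, broadcast back to every row
df["grp_count"] = df.groupby(grp)["click_time"].transform("size")
# cumcount: how many events the group has seen so far
df["grp_cumcount"] = df.groupby(grp).cumcount()
# time delta to the previous event in the group; fill NAs with the max delta
df["time_delta"] = df.groupby(grp)["click_time"].diff().dt.total_seconds()
df["time_delta"] = df["time_delta"].fillna(df["time_delta"].max())
# simple moving average over a rolling window of 5 events per group
df["rolling_mean"] = (
    df.groupby(grp)["time_delta"]
      .transform(lambda s: s.rolling(5, min_periods=1).mean())
)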
Feature engineering (traditional):
1. count NAs overall, count NAs per categorical / numerical / text variable, count duplicates (even if they are removed later); see the sketch after this list
2. if a feature is an item sequence, find the variable that changes and use the differences of that variable as a feature
3. groupby, count, ...: see the TalkingData kernels for the pandas possibilities
4. anonymous data: try the special ExtraTrees tool
5. ideas: https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/forums/t/18754/feature-engineering
6. noisy data: try the DAE (denoising autoencoder) NN from the Porto Seguro competition
7. Chris's threads on tabular data:
# https://www.kaggle.com/c/ieee-fraud-detection/discussion/108575#latest-643395
# https://www.kaggle.com/c/ieee-fraud-detection/discussion/111284#latest-643397
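A minimal sketch of the NA-count features from items 1-3 (file name and column groupings are assumptions):

import pandas as pd

df = pd.read_csv("train.csv")
cat_cols = df.select_dtypes(include="object").columns
num_cols = df.select_dtypes(include="number").columns

df["n_missing"] = df.isna().sum(axis=1)                     # NAs per row, all columns
df["n_missing_cat"] = df[cat_cols].isna().sum(axis=1)       # NAs per row, categorical only
df["n_missing_num"] = df[num_cols].isna().sum(axis=1)       # NAs per row, numeric only
df["is_duplicate"] = df.duplicated(keep=False).astype(int)  # flag duplicate rows before dropping them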
Find IDs:
Raddar:
correctly identify card IDs
identify how the public/private test split was made
consider only cards that appear exclusively in the test set
take cards that have at least 1 row in the public LB and as many rows as possible in the private LB
make a probe submission, i.e. start from your best submission and override the predictions for that specific card (see the sketch after this section)
get feedback about the card's label in the public LB by sorting "My submissions"
propagate that information to the private set
Chris:
# https://www.kaggle.com/c/ieee-fraud-detection/discussion/111510#latest-643489
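A hedged sketch of the probe-submission step above; the file names and columns (TransactionID, card_id, isFraud) are assumptions in the style of the IEEE fraud competition:

import pandas as pd

test = pd.read_csv("test.csv")               # hypothetical test set carrying a reconstructed card_id
best = pd.read_csv("best_submission.csv")    # columns: TransactionID, isFraud

card_to_probe = "card_12345"                 # a card that appears only in the test set
probe_ids = test.loc[test["card_id"] == card_to_probe, "TransactionID"]

probe = best.copy()
probe.loc[probe["TransactionID"].isin(probe_ids), "isFraud"] = 1.0
probe.to_csv("probe_card_12345.csv", index=False)
# The change in public LB score relative to the best submission hints at the card's label,
# which can then be propagated to its private-set rows.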
Feature selection:
a. LGB: rely heavily on how LightGBM prioritises features, and operate on the following principles (see the sketch after this section):
whenever a batch of features improves the local validation score, keep them
whenever a batch of features decreases the score, remove the most important features of that batch according to LightGBM and re-test
whenever features were not used by LightGBM at all (i.e. zero importance score), drop them
b. Time series:
1. try anokas' tool
c. LOFO (leave-one-feature-out importance):
d. NULL importance: https://www.kaggle.com/ogrellier/feature-selection-with-null-importances
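A minimal sketch of principle (a): drop the features LightGBM never used (zero importance), then re-validate; synthetic data stands in for the real training frame:

import lightgbm as lgb
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=30, n_informative=8, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

model = lgb.LGBMClassifier(n_estimators=300, random_state=0)
model.fit(X_tr, y_tr, eval_set=[(X_va, y_va)])

imp = pd.Series(model.feature_importances_, index=X.columns)
useless = imp[imp == 0].index.tolist()   # never chosen for any split
X_tr, X_va = X_tr.drop(columns=useless), X_va.drop(columns=useless)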
Categorical variables:
1. https://towardsdatascience.com/understanding-feature-engineering-part-2-categorical-data-f54324193e63 (for high-cardinality variables)
2. use CatBoost with the ID column
3. thermometer encoding: see the Python script in the folder; good for ordinal categories and NNs
4. target encoding (see the out-of-fold sketch after this list)
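A minimal sketch of out-of-fold target encoding (item 4); the column names are assumptions. Encoding inside CV folds keeps the target of a row out of its own encoding:

import pandas as pd
from sklearn.model_selection import KFold

def target_encode(df, col, target, n_splits=5, smoothing=10):
    prior = df[target].mean()
    encoded = pd.Series(index=df.index, dtype=float)
    for tr_idx, va_idx in KFold(n_splits, shuffle=True, random_state=0).split(df):
        stats = df.iloc[tr_idx].groupby(col)[target].agg(["mean", "count"])
        smooth = (stats["mean"] * stats["count"] + prior * smoothing) / (stats["count"] + smoothing)
        encoded.iloc[va_idx] = df.iloc[va_idx][col].map(smooth).fillna(prior).values
    return encoded

# usage: df["cat_te"] = target_encode(df, "cat", "target")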
Large dataset tools:
1. Dask: pandas-based, lazy and out-of-core (see the sketch after this list)
2. BigQuery: Google SQL (fast!)
3. memo (see the TalkingData competition)
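A minimal Dask sketch for item 1 (file name and columns are assumptions); the API mirrors pandas but stays lazy and out-of-core until .compute():

import dask.dataframe as dd

df = dd.read_csv("train_*.csv")                   # reads many / large files lazily
agg = df.groupby("ip")["is_attributed"].mean()    # pandas-style groupby, still lazy
result = agg.compute()                            # triggers the actual computation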
Cross Validation:
https://www.kaggle.com/forums/f/15/kaggle-forum/t/18793/cross-validation-strategy-when-blending-stacking/107073
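A minimal sketch of the usual stacking setup behind that thread: build out-of-fold predictions on train for the second-level model and average the fold models on test (synthetic data stands in for the real frames):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=1000, random_state=0)
X_test = X[:200]                                  # stand-in for the real test set

oof = np.zeros(len(X))
test_pred = np.zeros(len(X_test))
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for tr_idx, va_idx in skf.split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[tr_idx], y[tr_idx])
    oof[va_idx] = clf.predict_proba(X[va_idx])[:, 1]            # out-of-fold predictions
    test_pred += clf.predict_proba(X_test)[:, 1] / skf.n_splits # averaged test predictions
# 'oof' becomes a feature for the next stacking level; 'test_pred' is its test counterpart.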
Models:
1. NN:
use binary files loaded via memory mapping (np.memmap)
use time deltas if the data is not RNN-friendly (talkingdata-adtracking-fraud-detection)
use an RNN with a batch size of 1
use RankGauss for normalization (see the Porto Seguro autoencoder)
use binary files such as np.memmap with a holder in the train generator to prevent memory growth
37 reasons why your NN is not working: blog.slavv.com
IEEE NN solution: https://www.kaggle.com/c/ieee-fraud-detection/discussion/111476#latest-643042; add the id to the last level
For categorical variables, either use one-hot encoding or, better yet, keras.layers.Embedding together with label encoding.
It helps to reduce the cardinality of categorical variables as much as possible beforehand.
For numerical variables, NNs like normally distributed inputs: if a variable has a skewed distribution, first apply a log transform,
then standardize by subtracting the mean and dividing by the standard deviation, and finally replace NaNs with zero
(which is the transformed variable's mean); see the sketch at the end of these notes.
2. Trees:
a. Try DART (dropout applied to trees) instead of the traditional gbtree booster
b. custom xgboost objective:
https://www.kaggle.com/c/higgs-boson/forums/t/10286/customize-loss-function-in-xgboost?page=2
https://github.com/dmlc/xgboost/blob/master/demo/guide-python/custom_objective.py
https://www.kaggle.com/scirpus/prudential-life-insurance-assessment/xgb-test
3. computer vision:
a. see W2Unet by Dimitry on DSBL 2018.
b. Pavel Yakubovskiy posted segmentation models in both Keras and PyTorch: pip install segmentation-models
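A minimal sketch of the numeric preprocessing described in the NN notes above (log transform skewed variables, standardize, fill NaNs with zero); the column name is an assumption:

import numpy as np
import pandas as pd

def prep_numeric(s: pd.Series, skewed: bool = True) -> pd.Series:
    if skewed:
        s = np.log1p(s)                 # log transform for skewed distributions
    s = (s - s.mean()) / s.std()        # standardize: zero mean, unit variance
    return s.fillna(0.0)                # NaN -> 0, i.e. the transformed variable's mean

df = pd.DataFrame({"amount": [1.0, 10.0, 100.0, np.nan, 1000.0]})
df["amount_nn"] = prep_numeric(df["amount"])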