From 894658edeeb0bab62b0c2c725024947baa5f4c10 Mon Sep 17 00:00:00 2001 From: Dmitry Tikhomirov Date: Mon, 15 Jan 2024 10:41:18 +0300 Subject: [PATCH] Deleted old tutorials --- .../Tutorial_1_Simple_Matching.ipynb | 559 ----------------- ...Tutorial_2_Matching_with_fixed_group.ipynb | 593 ------------------ .../Tutorial_3_Results_Interpretation.ipynb | 409 ------------ 3 files changed, 1561 deletions(-) delete mode 100644 examples/tutorials/Tutorial_1_Simple_Matching.ipynb delete mode 100644 examples/tutorials/Tutorial_2_Matching_with_fixed_group.ipynb delete mode 100644 examples/tutorials/Tutorial_3_Results_Interpretation.ipynb diff --git a/examples/tutorials/Tutorial_1_Simple_Matching.ipynb b/examples/tutorials/Tutorial_1_Simple_Matching.ipynb deleted file mode 100644 index c965077a..00000000 --- a/examples/tutorials/Tutorial_1_Simple_Matching.ipynb +++ /dev/null @@ -1,559 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Tutorial 1: Simple Matching\n", - "\n", - "In this tutorial you will learn how to process simple matching with HypEx." - ] - }, - { - "cell_type": "markdown", - "source": [ - "### How Matching in HypEx works? \n", - "\n", - "To find groups for all samples we will use Faiss library and the Mahalanobis distance. \n", - "\n", - "**Faiss** is a library for efficient similarity **search and clustering** of dense vectors. \n", - "\n", - "**The Mahalanobis distance** is a measure of the **distance between vectors** of random variables, generalizing the concept of Euclidean distance. \n", - "**Using** the Mahalanobis distance, it is possible to **determine the similarity of an unknown and a known sample**. It differs from the Euclidean distance in that it takes into account correlations between variables and is invariant to the scale. 
" - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "markdown", - "source": [ - "In Faiss only the calculation of the $L_2$-metric is implemented, therefore, to calculate the Mahalanobis metric, we first apply the **Cholesky decomposition** to transform the feature space, and in the new space we already apply the standard Euclidean metric. \n", - "\n", - "**The Cholesky decomposition** is a representation of a symmetric positive definite matrix $A$ in the form $A = LL^T$, where $L$ is a lower triangular matrix with strictly positive elements on the diagonal.\n", - "**The Cholesky decomposition always exists and is unique for any symmetric positive definite matrix.**" - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "markdown", - "source": [ - "Steps: \n", - "1. Find decomposition: $\\Sigma^{-1} = LL^T$ \n", - "2. Transform the space: $\\hat x = Lx$\n", - "\n", - "Lets prove the equivalence of the transformation:\n", - "$$|| \\hat x - \\hat y||_{L_2} = \\sqrt{(Lx - Ly)^T(Lx - Ly)} = \\sqrt{(x - y)^TL^TL(x - y)} = \\sqrt{(x - y)^T\\Sigma^{-1}(x - y)} = || x - y||_{Maha}$$\n", - "\n", - "Finally, get: \n", - "$$|| \\hat x - \\hat y||_{L_2} = || x - y||_{Maha}$$ \n" - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 0. Import libraries " - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:43:51.258770600Z", - "start_time": "2023-11-23T12:43:47.367980300Z" - } - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "C:\\Users\\User\\AppData\\Local\\Programs\\Python\\Python310\\lib\\site-packages\\tqdm\\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. 
See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", - " from .autonotebook import tqdm as notebook_tqdm\n" - ] - } - ], - "source": [ - "import warnings \n", - "from hypex import Matcher\n", - "from hypex.dataset import DataGenerator\n", - "\n", - "warnings.simplefilter(action='ignore', category=FutureWarning)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 1. Create or upload your dataset \n", - "In this case we will create random dataset with known effect size \n", - "If you have your own dataset, go to the part 2 \n" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:44:12.230668Z", - "start_time": "2023-11-23T12:44:12.073259Z" - } - }, - "outputs": [ - { - "data": { - "text/plain": " info_1 info_2 feature_1 feature_2 feature_3 feature_4 treatment \\\n0 9508 Q female Deposit NaN 0.0 0.0 \n1 1783 Q female NaN -1.946330 3.0 1.0 \n2 2815 U female Investment 0.423735 3.0 1.0 \n3 6961 Q male Credit 0.183354 1.0 1.0 \n4 13036 U female Deposit -1.145800 2.0 0.0 \n... ... ... ... ... ... ... ... \n4995 1753 Q male Credit -1.032884 3.0 1.0 \n4996 13009 U female Deposit -2.286353 1.0 0.0 \n4997 10744 U male Credit -0.076360 3.0 0.0 \n4998 12343 U female Deposit 1.524106 0.0 1.0 \n4999 7078 U male Investment 1.038898 0.0 0.0 \n\n target_1 target_2 \n0 -0.215974 -0.215974 \n1 5.204231 5.204231 \n2 8.012220 3.079220 \n3 4.699710 4.699710 \n4 0.504827 0.504827 \n... ... ... \n4995 3.797474 0.979399 \n4996 -1.164799 -1.164799 \n4997 2.785862 2.785862 \n4998 4.919137 4.919137 \n4999 -1.299679 2.833841 \n\n[5000 rows x 9 columns]", - "text/html": "
" - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "data = DataGenerator(na_columns=['feature_3', 'feature_2'], \n", - " num_features=2, \n", - " num_targets=2)\n", - "data.df" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:44:15.170523600Z", - "start_time": "2023-11-23T12:44:15.042996Z" - } - }, - "outputs": [ - { - "data": { - "text/plain": "Index(['info_1', 'info_2', 'feature_1', 'feature_2', 'feature_3', 'feature_4',\n 'treatment', 'target_1', 'target_2'],\n dtype='object')" - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "data.df.columns" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": { - "scrolled": true, - "ExecuteTime": { - "end_time": "2023-11-23T12:44:20.560397600Z", - "start_time": "2023-11-23T12:44:20.523064500Z" - } - }, - "outputs": [ - { - "data": { - "text/plain": "treatment\n0.0 2528\n1.0 2472\nName: count, dtype: int64" - }, - "execution_count": 5, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "data.df.treatment.value_counts()" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:44:21.558974200Z", - "start_time": "2023-11-23T12:44:21.506538100Z" - } - }, - "outputs": [ - { - "data": { - "text/plain": "info_1 0\ninfo_2 0\nfeature_1 0\nfeature_2 500\nfeature_3 500\nfeature_4 0\ntreatment 0\ntarget_1 0\ntarget_2 0\ndtype: int64" - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "data.df.isna().sum()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 2. 
Matching \n", - "### 2.0 Init params\n", - "info_col used to define informative attributes that should not be part of matching, such as user_id \n", - "But to explicitly store this column in the table, so that you can compare directly after computation" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:44:47.768557100Z", - "start_time": "2023-11-23T12:44:47.730542300Z" - } - }, - "outputs": [], - "source": [ - "info_col = [data.info_col_names[0]]\n", - "\n", - "outcome = data.target_names\n", - "treatment = data.treatment_name" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2.1 Simple matching\n", - "This is the easiest way to initialize and calculate metrics on a Matching task \n", - "Use it when you are clear about each attribute or if you don't have any additional task conditions (Strict equality for certain features) " - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:44:51.187217700Z", - "start_time": "2023-11-23T12:44:50.141104400Z" - } - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "[23.11.2023 15:44:50 | hypex | INFO]: Number of NaN values filled with zeros: 1000\n", - "Get treated index: 100%|██████████| 5000/5000 [00:00<00:00, 26720.59it/s] \n" - ] - } - ], - "source": [ - "# Standard model with base parameters\n", - "model = Matcher(input_data=data.df, outcome=outcome, treatment=treatment, info_col=info_col)\n", - "results, quality_results, df_matched = model.estimate()" - ] - }, - { - "cell_type": "markdown", - "source": [ - "## 3. Results \n", - "### ATE, ATC, ATT \n", - "\n", - "*p.s. 
read mere about results [there](./Tutorial_3_Results_Interpretation.ipynb)*" - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:44:52.633633Z", - "start_time": "2023-11-23T12:44:52.580170100Z" - } - }, - "outputs": [ - { - "data": { - "text/plain": " effect_size std_err p-val ci_lower ci_upper outcome\nATE 2.652815 0.068678 0.00 2.518206 2.787425 target_1\nATC 2.623115 0.077757 0.00 2.470711 2.775519 target_1\nATT 2.683189 0.076139 0.00 2.533956 2.832422 target_1\nATE -0.114397 0.082510 0.17 -0.276117 0.047322 target_2\nATC -0.133360 0.092808 0.15 -0.315265 0.048544 target_2\nATT -0.095005 0.092118 0.30 -0.275556 0.085546 target_2", - "text/html": "
" - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "results" - ] - }, - { - "cell_type": "markdown", - "source": [ - "### 3.2 SMD, PSI, KS-test, repeats" - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:45:06.312331800Z", - "start_time": "2023-11-23T12:45:06.280283900Z" - } - }, - "outputs": [ - { - "data": { - "text/plain": "dict_keys(['psi', 'ks_test', 'smd', 'repeats'])" - }, - "execution_count": 12, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "quality_results.keys()" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:45:06.526584800Z", - "start_time": "2023-11-23T12:45:06.495530100Z" - } - }, - "outputs": [ - { - "data": { - "text/plain": " match_control_to_treat match_treat_to_control\nfeature_3 0.902708 0.991299\nfeature_4 1.000000 1.000000", - "text/html": "
" - }, - "execution_count": 13, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "quality_results['ks_test']" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:45:07.596980100Z", - "start_time": "2023-11-23T12:45:07.515466800Z" - } - }, - "outputs": [ - { - "data": { - "text/plain": " index feature_3 feature_4 info_2_U feature_1_male feature_2_Credit \\\n0 1783 -1.946330 3.0 0 0 0 \n1 2815 0.423735 3.0 1 0 0 \n2 6961 0.183354 1.0 0 1 1 \n3 10432 0.000000 3.0 1 1 1 \n4 10222 0.090388 1.0 0 0 0 \n... ... ... ... ... ... ... \n2523 2902 0.842208 2.0 0 0 0 \n2524 4228 1.204038 2.0 1 0 1 \n2525 13009 -2.286353 1.0 1 0 0 \n2526 10744 -0.076360 3.0 1 1 1 \n2527 7078 1.038898 0.0 1 1 0 \n\n feature_2_Deposit feature_2_Investment feature_3_matched \\\n0 0 0 -1.621690 \n1 0 1 0.439773 \n2 0 0 0.213917 \n3 0 0 0.000000 \n4 0 1 0.088173 \n... ... ... ... \n2523 0 1 0.843187 \n2524 0 0 1.111896 \n2525 1 0 -2.454503 \n2526 0 0 -0.086061 \n2527 0 1 1.198814 \n\n feature_4_matched ... feature_2_Investment_matched \\\n0 3.0 ... 0.0 \n1 3.0 ... 1.0 \n2 1.0 ... 0.0 \n3 3.0 ... 0.0 \n4 1.0 ... 1.0 \n... ... ... ... \n2523 2.0 ... 1.0 \n2524 2.0 ... 0.0 \n2525 1.0 ... 0.0 \n2526 3.0 ... 0.0 \n2527 0.0 ... 1.0 \n\n index_matched target_1 target_1_matched \\\n0 [5185] 5.204231 2.240444 \n1 [9157] 8.012220 4.175422 \n2 [7645] 4.699710 1.186502 \n3 [8854, 2941, 4294, 5695, 2998, 1624] 7.690724 2.652024 \n4 [4336] 2.617214 0.582712 \n... ... ... ... \n2523 [14560] 4.095340 6.403056 \n2524 [11908] 3.917107 6.630113 \n2525 [8008] -1.164799 -3.272139 \n2526 [2806] 2.785862 7.040315 \n2527 [12136] -1.299679 2.777553 \n\n target_1_matched_bias target_2 target_2_matched \\\n0 3.278713 5.204231 2.240444 \n1 3.852356 3.079220 8.624334 \n2 3.542858 4.699710 1.186502 \n3 5.038700 7.690724 5.111347 \n4 2.032352 0.461260 3.671266 \n... ... ... ... 
\n2523 2.305748 6.402154 3.407937 \n2524 2.898269 7.444792 2.129195 \n2525 -1.769257 -1.164799 -0.811464 \n2526 4.273958 2.785862 1.707816 \n2527 3.755702 2.833841 -0.361463 \n\n target_2_matched_bias treatment treatment_matched \n0 3.450316 1 0 \n1 -5.521078 1 0 \n2 3.559013 1 0 \n3 2.579377 1 0 \n4 -3.213325 1 0 \n... ... ... ... \n2523 -2.995618 0 1 \n2524 -5.183684 0 1 \n2525 0.594060 0 1 \n2526 -1.064157 0 1 \n2527 -3.424243 0 1 \n\n[5000 rows x 24 columns]", - "text/html": "
" - }, - "execution_count": 14, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_matched" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:45:48.479487200Z", - "start_time": "2023-11-23T12:45:48.329113900Z" - } - }, - "outputs": [ - { - "data": { - "text/plain": "Empty DataFrame\nColumns: [index, feature_3, feature_4, info_2_U, feature_1_male, feature_2_Credit, feature_2_Deposit, feature_2_Investment, feature_3_matched, feature_4_matched, info_2_U_matched, feature_1_male_matched, feature_2_Credit_matched, feature_2_Deposit_matched, feature_2_Investment_matched, index_matched, target_1, target_1_matched, target_1_matched_bias, target_2, target_2_matched, target_2_matched_bias, treatment, treatment_matched]\nIndex: []\n\n[0 rows x 24 columns]", - "text/html": "
" - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_matched[df_matched['info_2_U'] != df_matched['info_2_U_matched']]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 3.3 Validation\n", - "Validate estimated effect:\n", - "1. by replacing real treatment (`random_treatment`) with random placebo treatment.\n", - " Estimated effect must be dropped to zero;\n", - "2. by adding random feature (`random_feature`). Estimated effect shouldn't change\n", - "significantly, p-val < 0.05;\n", - "3. estimates effect on subset of data (`subset_refuter`) (default fraction is 0.8). Estimated effect\n", - "shouldn't change significantly, p-val < 0.05." - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:46:38.691819800Z", - "start_time": "2023-11-23T12:46:31.211070400Z" - } - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "100%|██████████| 10/10 [00:07<00:00, 1.35it/s]\n" - ] - }, - { - "data": { - "text/plain": "{'target_1': [-0.022073635054381528, 0.0],\n 'target_2': [-0.01543623282222804, 0.1337448453290817]}" - }, - "execution_count": 26, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "model.validate_result(refuter=\"random_treatment\", effect_type=\"att\", n_sim=10)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 4. 
Save model" - ] - }, - { - "cell_type": "code", - "execution_count": 27, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:46:38.790842800Z", - "start_time": "2023-11-23T12:46:38.652166800Z" - } - }, - "outputs": [], - "source": [ - "model.save(\"test_model.pickle\")" - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:46:38.829773700Z", - "start_time": "2023-11-23T12:46:38.759032Z" - } - }, - "outputs": [], - "source": [ - "model2 = Matcher.load(\"test_model.pickle\")" - ] - }, - { - "cell_type": "code", - "execution_count": 29, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:46:38.833878Z", - "start_time": "2023-11-23T12:46:38.791897200Z" - } - }, - "outputs": [ - { - "data": { - "text/plain": " effect_size std_err p-val ci_lower ci_upper outcome\nATE 2.653028 0.068668 0.00 2.518438 2.787617 target_1\nATC 2.623120 0.077746 0.00 2.470737 2.775502 target_1\nATT 2.683613 0.076124 0.00 2.534411 2.832815 target_1\nATE -0.113997 0.082506 0.17 -0.275710 0.047715 target_2\nATC -0.133361 0.092808 0.15 -0.315265 0.048542 target_2\nATT -0.094195 0.092107 0.31 -0.274724 0.086334 target_2", - "text/html": "
" - }, - "execution_count": 29, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "model2.results" - ] - }, - { - "cell_type": "code", - "execution_count": 30, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:46:38.885010300Z", - "start_time": "2023-11-23T12:46:38.822482900Z" - } - }, - "outputs": [ - { - "data": { - "text/plain": " effect_size std_err p-val ci_lower ci_upper outcome\nATE 2.653028 0.068668 0.00 2.518438 2.787617 target_1\nATC 2.623120 0.077746 0.00 2.470737 2.775502 target_1\nATT 2.683613 0.076124 0.00 2.534411 2.832815 target_1\nATE -0.113997 0.082506 0.17 -0.275710 0.047715 target_2\nATC -0.133361 0.092808 0.15 -0.315265 0.048542 target_2\nATT -0.094195 0.092107 0.31 -0.274724 0.086334 target_2", - "text/html": "
" - }, - "execution_count": 30, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "model.results" - ] - }, - { - "cell_type": "markdown", - "source": [ - "Read more: \n", - "[Faiss Library](https://github.com/facebookresearch/faiss) \n", - "[Matching with fixed variables](./Tutorial_2_Matching_with_fixed_group.ipynb) \n", - "[Results Interpretation](./Tutorial_3_Results_Interpretation.ipynb) " - ], - "metadata": { - "collapsed": false - } - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.13" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/examples/tutorials/Tutorial_2_Matching_with_fixed_group.ipynb b/examples/tutorials/Tutorial_2_Matching_with_fixed_group.ipynb deleted file mode 100644 index 6a0a1742..00000000 --- a/examples/tutorials/Tutorial_2_Matching_with_fixed_group.ipynb +++ /dev/null @@ -1,593 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Tutorial 2: Matching with fixed group" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 0. Import libraries " - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:43:51.258770600Z", - "start_time": "2023-11-23T12:43:47.367980300Z" - } - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "C:\\Users\\User\\AppData\\Local\\Programs\\Python\\Python310\\lib\\site-packages\\tqdm\\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. 
See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", - " from .autonotebook import tqdm as notebook_tqdm\n" - ] - } - ], - "source": [ - "import warnings \n", - "from hypex import Matcher\n", - "from hypex.dataset import DataGenerator\n", - "\n", - "warnings.simplefilter(action='ignore', category=FutureWarning)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 1. Create or upload your dataset \n", - "In this case we will create random dataset with known effect size \n", - "If you have your own dataset, go to the part 2 \n" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:44:12.230668Z", - "start_time": "2023-11-23T12:44:12.073259Z" - } - }, - "outputs": [ - { - "data": { - "text/plain": " info_1 info_2 feature_1 feature_2 feature_3 feature_4 treatment \\\n0 9508 Q female Deposit NaN 0.0 0.0 \n1 1783 Q female NaN -1.946330 3.0 1.0 \n2 2815 U female Investment 0.423735 3.0 1.0 \n3 6961 Q male Credit 0.183354 1.0 1.0 \n4 13036 U female Deposit -1.145800 2.0 0.0 \n... ... ... ... ... ... ... ... \n4995 1753 Q male Credit -1.032884 3.0 1.0 \n4996 13009 U female Deposit -2.286353 1.0 0.0 \n4997 10744 U male Credit -0.076360 3.0 0.0 \n4998 12343 U female Deposit 1.524106 0.0 1.0 \n4999 7078 U male Investment 1.038898 0.0 0.0 \n\n target_1 target_2 \n0 -0.215974 -0.215974 \n1 5.204231 5.204231 \n2 8.012220 3.079220 \n3 4.699710 4.699710 \n4 0.504827 0.504827 \n... ... ... \n4995 3.797474 0.979399 \n4996 -1.164799 -1.164799 \n4997 2.785862 2.785862 \n4998 4.919137 4.919137 \n4999 -1.299679 2.833841 \n\n[5000 rows x 9 columns]", - "text/html": "
" - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "data = DataGenerator(na_columns=['feature_3', 'feature_2'], \n", - " num_features=2, \n", - " num_targets=2)\n", - "data.df" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:44:15.170523600Z", - "start_time": "2023-11-23T12:44:15.042996Z" - } - }, - "outputs": [ - { - "data": { - "text/plain": "Index(['info_1', 'info_2', 'feature_1', 'feature_2', 'feature_3', 'feature_4',\n 'treatment', 'target_1', 'target_2'],\n dtype='object')" - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "data.df.columns" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": { - "scrolled": true, - "ExecuteTime": { - "end_time": "2023-11-23T12:44:20.560397600Z", - "start_time": "2023-11-23T12:44:20.523064500Z" - } - }, - "outputs": [ - { - "data": { - "text/plain": "treatment\n0.0 2528\n1.0 2472\nName: count, dtype: int64" - }, - "execution_count": 5, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "data.df.treatment.value_counts()" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:44:21.558974200Z", - "start_time": "2023-11-23T12:44:21.506538100Z" - } - }, - "outputs": [ - { - "data": { - "text/plain": "info_1 0\ninfo_2 0\nfeature_1 0\nfeature_2 500\nfeature_3 500\nfeature_4 0\ntreatment 0\ntarget_1 0\ntarget_2 0\ndtype: int64" - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "data.df.isna().sum()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 2. 
Matching \n", - "### 2.0 Init params\n", - "info_col used to define informative attributes that should not be part of matching, such as user_id \n", - "But to explicitly store this column in the table, so that you can compare directly after computation" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:44:47.768557100Z", - "start_time": "2023-11-23T12:44:47.730542300Z" - } - }, - "outputs": [], - "source": [ - "info_col = [data.info_col_names[0]]\n", - "\n", - "outcome = data.target_names\n", - "treatment = data.treatment_name" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2.1 Matching with a fixed variable \n", - "Used when you have categorical feature(s) that you want to compare by strict equality \n", - "group_col is used for strict comparison of categorical features. \n", - "In our case there is only one attribute \n", - "If there are several such attributes, you should make one of them and use it" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:45:52.145427900Z", - "start_time": "2023-11-23T12:45:52.065775400Z" - } - }, - "outputs": [], - "source": [ - "group_col = 'info_2'" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:45:54.197881600Z", - "start_time": "2023-11-23T12:45:52.777141400Z" - } - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "[23.11.2023 15:45:52 | hypex | INFO]: Number of NaN values filled with zeros: 1000\n", - "Get treated index by group U: 100%|██████████| 4/4 [00:00<00:00, 31.27it/s] \n" - ] - } - ], - "source": [ - "model = Matcher(input_data=data.df, outcome=outcome, treatment=treatment,\n", - " info_col=info_col, group_col=group_col)\n", - "results, quality_results, df_matched = model.estimate()" - ] - }, - { - "cell_type": "code", - 
"execution_count": 18, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:45:55.283617300Z", - "start_time": "2023-11-23T12:45:55.164796600Z" - } - }, - "outputs": [ - { - "data": { - "text/plain": " effect_size std_err p-val ci_lower ci_upper outcome\nATE 2.653028 0.068668 0.00 2.518438 2.787617 target_1\nATC 2.623120 0.077746 0.00 2.470737 2.775502 target_1\nATT 2.683613 0.076124 0.00 2.534411 2.832815 target_1\nATE -0.113997 0.082506 0.17 -0.275710 0.047715 target_2\nATC -0.133361 0.092808 0.15 -0.315265 0.048542 target_2\nATT -0.094195 0.092107 0.31 -0.274724 0.086334 target_2", - "text/html": "
" - }, - "execution_count": 18, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "results" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "outputs": [ - { - "data": { - "text/plain": " index feature_3 feature_4 feature_1_male feature_2_Credit \\\n0 1783 -1.946330 3.0 0 0 \n1 6961 0.183354 1.0 1 1 \n2 10222 0.090388 1.0 0 0 \n3 13003 -0.551919 3.0 0 0 \n4 1555 -0.452623 1.0 1 0 \n... ... ... ... ... ... \n2523 11620 -0.514484 0.0 0 0 \n2524 4228 1.204038 2.0 0 1 \n2525 13009 -2.286353 1.0 0 0 \n2526 10744 -0.076360 3.0 1 1 \n2527 7078 1.038898 0.0 1 0 \n\n feature_2_Deposit feature_2_Investment info_2 feature_3_matched \\\n0 0 0 Q -1.621690 \n1 0 0 Q 0.213917 \n2 0 1 Q 0.088173 \n3 1 0 Q -0.570587 \n4 0 1 Q -0.462746 \n... ... ... ... ... \n2523 1 0 U -0.554516 \n2524 0 0 U 1.111896 \n2525 1 0 U -2.454503 \n2526 0 0 U -0.086061 \n2527 0 1 U 1.198814 \n\n feature_4_matched ... info_2_matched index_matched target_1 \\\n0 3.0 ... Q [5185] 5.204231 \n1 1.0 ... Q [7645] 4.699710 \n2 1.0 ... Q [4336] 2.617214 \n3 3.0 ... Q [13732] 6.290954 \n4 1.0 ... Q [6115] 1.772837 \n... ... ... ... ... ... \n2523 0.0 ... U [14434] 0.059858 \n2524 2.0 ... U [11908] 3.917107 \n2525 1.0 ... U [8008] -1.164799 \n2526 3.0 ... U [2806] 2.785862 \n2527 0.0 ... U [12136] -1.299679 \n\n target_1_matched target_1_matched_bias target_2 target_2_matched \\\n0 2.240444 3.279149 5.204231 2.240444 \n1 1.186502 3.542899 4.699710 1.186502 \n2 0.582712 2.032349 0.461260 3.671266 \n3 2.216480 4.056339 6.290954 5.221042 \n4 -0.206367 1.969371 1.772837 2.843365 \n... ... ... ... ... \n2523 -2.290294 -2.269874 0.059858 -2.290294 \n2524 6.630113 2.897786 7.444792 2.129195 \n2525 -3.272139 -1.770139 -1.164799 -0.811464 \n2526 7.040315 4.273907 2.785862 1.707816 \n2527 2.777553 3.756541 2.833841 -0.361463 \n\n target_2_matched_bias treatment treatment_matched \n0 3.450538 1 0 \n1 3.559034 1 0 \n2 -3.213327 1 0 \n3 1.041921 1 0 \n4 -1.085705 1 0 \n... ... ... ... 
\n2523 -2.292803 0 1 \n2524 -5.183594 0 1 \n2525 0.594225 0 1 \n2526 -1.064148 0 1 \n2527 -3.424400 0 1 \n\n[5000 rows x 24 columns]", - "text/html": "
" - }, - "execution_count": 19, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_matched" - ], - "metadata": { - "collapsed": false, - "ExecuteTime": { - "end_time": "2023-11-23T12:45:57.316011300Z", - "start_time": "2023-11-23T12:45:57.192144200Z" - } - } - }, - { - "cell_type": "code", - "execution_count": 20, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:46:31.105049200Z", - "start_time": "2023-11-23T12:46:30.976642500Z" - } - }, - "outputs": [ - { - "data": { - "text/plain": "Empty DataFrame\nColumns: [index, feature_3, feature_4, feature_1_male, feature_2_Credit, feature_2_Deposit, feature_2_Investment, info_2, feature_3_matched, feature_4_matched, feature_1_male_matched, feature_2_Credit_matched, feature_2_Deposit_matched, feature_2_Investment_matched, info_2_matched, index_matched, target_1, target_1_matched, target_1_matched_bias, target_2, target_2_matched, target_2_matched_bias, treatment, treatment_matched]\nIndex: []\n\n[0 rows x 24 columns]", - "text/html": "
" - }, - "execution_count": 20, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_matched[df_matched['info_2'] != df_matched['info_2_matched']]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 3. Results \n", - "### 3.1 ATE, ATT, ATC" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:46:31.109051200Z", - "start_time": "2023-11-23T12:46:31.023973500Z" - } - }, - "outputs": [ - { - "data": { - "text/plain": " effect_size std_err p-val ci_lower ci_upper outcome\nATE 2.653028 0.068668 0.00 2.518438 2.787617 target_1\nATC 2.623120 0.077746 0.00 2.470737 2.775502 target_1\nATT 2.683613 0.076124 0.00 2.534411 2.832815 target_1\nATE -0.113997 0.082506 0.17 -0.275710 0.047715 target_2\nATC -0.133361 0.092808 0.15 -0.315265 0.048542 target_2\nATT -0.094195 0.092107 0.31 -0.274724 0.086334 target_2", - "text/html": "
" - }, - "execution_count": 21, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "results" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 3.2 SMD, PSI, KS-test, repeats" - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:46:31.238438200Z", - "start_time": "2023-11-23T12:46:31.073195300Z" - } - }, - "outputs": [ - { - "data": { - "text/plain": "dict_keys(['psi', 'ks_test', 'smd', 'repeats'])" - }, - "execution_count": 22, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "quality_results.keys()" - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:46:31.334334500Z", - "start_time": "2023-11-23T12:46:31.105049200Z" - } - }, - "outputs": [ - { - "data": { - "text/plain": " column_treated anomaly_score_treated check_result_treated \\\n0 feature_1_male_treated 0.0 OK \n1 feature_2_Credit_treated 0.0 OK \n2 feature_2_Deposit_treated 0.0 OK \n3 feature_2_Investment_treated 0.0 OK \n4 feature_3_treated 0.0 OK \n5 feature_4_treated 0.0 OK \n6 info_2_treated 0.0 OK \n\n column_untreated anomaly_score_untreated \\\n0 feature_1_male_untreated 0.0 \n1 feature_2_Credit_untreated 0.0 \n2 feature_2_Deposit_untreated 0.0 \n3 feature_2_Investment_untreated 0.0 \n4 feature_3_untreated 0.0 \n5 feature_4_untreated 0.0 \n6 info_2_untreated 0.0 \n\n check_result_untreated \n0 OK \n1 OK \n2 OK \n3 OK \n4 OK \n5 OK \n6 OK ", - "text/html": "
" - }, - "execution_count": 23, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "quality_results['psi']" - ] - }, - { - "cell_type": "code", - "execution_count": 24, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:46:31.493710300Z", - "start_time": "2023-11-23T12:46:31.151426Z" - } - }, - "outputs": [ - { - "data": { - "text/plain": " match_control_to_treat match_treat_to_control\nfeature_3 0.902708 0.991299\nfeature_4 1.000000 1.000000", - "text/html": "
" - }, - "execution_count": 24, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "quality_results['ks_test']" - ] - }, - { - "cell_type": "code", - "execution_count": 25, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:46:31.522873200Z", - "start_time": "2023-11-23T12:46:31.183533400Z" - } - }, - "outputs": [ - { - "data": { - "text/plain": "{'match_control_to_treat': 0.4, 'match_treat_to_control': 0.38}" - }, - "execution_count": 25, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "quality_results['repeats']" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 3.3 Validation\n", - "Validates the estimated effect:\n", - "1. by replacing the real treatment (`random_treatment`) with a random placebo treatment.\n", - " The estimated effect must drop to zero;\n", - "2. by adding a random feature (`random_feature`). The estimated effect shouldn't change\n", - "significantly, p-val < 0.05;\n", - "3. by estimating the effect on a subset of the data (`subset_refuter`) (default fraction is 0.8). The estimated effect\n", - "shouldn't change significantly, p-val < 0.05." - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:46:38.691819800Z", - "start_time": "2023-11-23T12:46:31.211070400Z" - } - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "100%|██████████| 10/10 [00:07<00:00, 1.35it/s]\n" - ] - }, - { - "data": { - "text/plain": "{'target_1': [-0.022073635054381528, 0.0],\n 'target_2': [-0.01543623282222804, 0.1337448453290817]}" - }, - "execution_count": 26, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "model.validate_result(refuter=\"random_treatment\", effect_type=\"att\", n_sim=10)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 4. 
Save model" - ] - }, - { - "cell_type": "code", - "execution_count": 27, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:46:38.790842800Z", - "start_time": "2023-11-23T12:46:38.652166800Z" - } - }, - "outputs": [], - "source": [ - "model.save(\"test_model.pickle\")" - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:46:38.829773700Z", - "start_time": "2023-11-23T12:46:38.759032Z" - } - }, - "outputs": [], - "source": [ - "model2 = Matcher.load(\"test_model.pickle\")" - ] - }, - { - "cell_type": "code", - "execution_count": 29, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:46:38.833878Z", - "start_time": "2023-11-23T12:46:38.791897200Z" - } - }, - "outputs": [ - { - "data": { - "text/plain": " effect_size std_err p-val ci_lower ci_upper outcome\nATE 2.653028 0.068668 0.00 2.518438 2.787617 target_1\nATC 2.623120 0.077746 0.00 2.470737 2.775502 target_1\nATT 2.683613 0.076124 0.00 2.534411 2.832815 target_1\nATE -0.113997 0.082506 0.17 -0.275710 0.047715 target_2\nATC -0.133361 0.092808 0.15 -0.315265 0.048542 target_2\nATT -0.094195 0.092107 0.31 -0.274724 0.086334 target_2", - "text/html": "
" - }, - "execution_count": 29, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "model2.results" - ] - }, - { - "cell_type": "code", - "execution_count": 30, - "metadata": { - "ExecuteTime": { - "end_time": "2023-11-23T12:46:38.885010300Z", - "start_time": "2023-11-23T12:46:38.822482900Z" - } - }, - "outputs": [ - { - "data": { - "text/plain": " effect_size std_err p-val ci_lower ci_upper outcome\nATE 2.653028 0.068668 0.00 2.518438 2.787617 target_1\nATC 2.623120 0.077746 0.00 2.470737 2.775502 target_1\nATT 2.683613 0.076124 0.00 2.534411 2.832815 target_1\nATE -0.113997 0.082506 0.17 -0.275710 0.047715 target_2\nATC -0.133361 0.092808 0.15 -0.315265 0.048542 target_2\nATT -0.094195 0.092107 0.31 -0.274724 0.086334 target_2", - "text/html": "
" - }, - "execution_count": 30, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "model.results" - ] - }, - { - "cell_type": "markdown", - "source": [ - "Read more: \n", - "[Simple Matching](./Tutorial_1_Simple_Matching.ipynb) \n", - "[Results Interpretation](./Tutorial_3_Results_Interpretation.ipynb) " - ], - "metadata": { - "collapsed": false - } - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.13" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/examples/tutorials/Tutorial_3_Results_Interpretation.ipynb b/examples/tutorials/Tutorial_3_Results_Interpretation.ipynb deleted file mode 100644 index df1b5b1f..00000000 --- a/examples/tutorials/Tutorial_3_Results_Interpretation.ipynb +++ /dev/null @@ -1,409 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "source": [ - "# Tutorial 3: Results of Matching\n", - "\n", - "In this tutorial you will understand how to interpret results of matching." 
- ], - "metadata": { - "collapsed": false - }, - "id": "59127a426dc5e28d" - }, - { - "cell_type": "code", - "execution_count": 18, - "id": "initial_id", - "metadata": { - "collapsed": true, - "ExecuteTime": { - "end_time": "2023-11-23T07:49:33.655029700Z", - "start_time": "2023-11-23T07:49:33.636570Z" - } - }, - "outputs": [], - "source": [ - "import warnings \n", - "from hypex import Matcher\n", - "from hypex.dataset import DataGenerator\n", - "\n", - "warnings.simplefilter(action='ignore', category=FutureWarning)" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "outputs": [], - "source": [ - "sample_data = DataGenerator()" - ], - "metadata": { - "collapsed": false, - "ExecuteTime": { - "end_time": "2023-11-23T07:49:37.302480300Z", - "start_time": "2023-11-23T07:49:37.258433Z" - } - }, - "id": "bd06ad91f1ee33b4" - }, - { - "cell_type": "code", - "execution_count": 20, - "outputs": [ - { - "data": { - "text/plain": " info_col_1 info_col_2 feature_col_1 feature_col_2 feature_col_3 \\\n0 8077 O male Credit 0.305621 \n1 14527 O male Deposit -0.933765 \n2 9124 K female Investment 1.097520 \n3 5191 K male Credit -1.390627 \n4 7636 K male Credit 1.270699 \n... ... ... ... ... ... \n4995 2851 O male Investment -1.183287 \n4996 8644 O female Investment 0.478477 \n4997 2200 K female Credit 0.341917 \n4998 1393 K male Deposit -0.183003 \n4999 5359 O male Deposit -2.345591 \n\n feature_col_4 feature_col_5 feature_col_6 treatment_1 outcome_1 \n0 1.376506 -1.885554 1.0 0.0 1.760076 \n1 -1.143372 1.940034 1.0 0.0 1.255280 \n2 1.422774 1.171370 0.0 0.0 4.096455 \n3 1.313302 -0.664406 0.0 1.0 0.705794 \n4 1.970219 0.329824 3.0 0.0 7.636827 \n... ... ... ... ... ... \n4995 -0.585822 1.797529 3.0 1.0 7.516132 \n4996 -0.129496 3.083552 2.0 1.0 11.526241 \n4997 0.157617 -0.101076 3.0 0.0 3.667932 \n4998 1.505163 -0.929622 2.0 0.0 3.018969 \n4999 1.600512 -0.407026 1.0 1.0 0.755559 \n\n[5000 rows x 10 columns]", - "text/html": "
" - }, - "execution_count": 20, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "sample_data.df" - ], - "metadata": { - "collapsed": false, - "ExecuteTime": { - "end_time": "2023-11-23T07:49:37.697429700Z", - "start_time": "2023-11-23T07:49:37.637510200Z" - } - }, - "id": "5280ce40ac6c03b" - }, - { - "cell_type": "markdown", - "source": [ - "## 1. Matching process\n", - "\n", - "If you want to understand how Matcher works, see the [simple matching tutorial](./Tutorial_1_Simple_Matching.ipynb)." - ], - "metadata": { - "collapsed": false - }, - "id": "c4272b48df925aa6" - }, - { - "cell_type": "code", - "execution_count": 21, - "outputs": [], - "source": [ - "info_col = [sample_data.info_col_names[0]]\n", - "\n", - "outcome = sample_data.outcome_name[0]\n", - "treatment = sample_data.treatment_name[0]" - ], - "metadata": { - "collapsed": false, - "ExecuteTime": { - "end_time": "2023-11-23T07:49:57.585006200Z", - "start_time": "2023-11-23T07:49:57.551159400Z" - } - }, - "id": "1b43563d32ff64aa" - }, - { - "cell_type": "code", - "execution_count": 22, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Get treated index: 100%|██████████| 5000/5000 [00:00<00:00, 32720.45it/s]\n" - ] - } - ], - "source": [ - "model = Matcher(input_data=sample_data.df, outcome=outcome, treatment=treatment, \n", - " info_col=info_col)\n", - "results, quality_results, df_matched = model.estimate()" - ], - "metadata": { - "collapsed": false, - "ExecuteTime": { - "end_time": "2023-11-23T07:50:00.766735800Z", - "start_time": "2023-11-23T07:49:59.839637900Z" - } - }, - "id": "70b2a9e73e8a1506" - }, - { - "cell_type": "markdown", - "source": [ - "## 2. Results interpretation\n", - "### 2.0 ATE, ATC, ATT \n", - "\n", - "To interpret the results correctly, you should formulate the main and alternative hypotheses before the experiment." 
- ], - "metadata": { - "collapsed": false - }, - "id": "4034676068c455f4" - }, - { - "cell_type": "markdown", - "source": [ - "**ATC (Average Treatment Effect on the Control)** - the average of treatment effects for people who were assigned to control. \n", - "**ATT (Average Treatment Effect on the Treated)** - the average of treatment effects for people who were assigned to the treatment. \n", - "**ATE (Average Treatment Effect)** - the average of treatment effects over all people, equal to the weighted average of ATC and ATT. " - ], - "metadata": { - "collapsed": false - }, - "id": "79c892ce7f578cd3" - }, - { - "cell_type": "markdown", - "source": [ - "If **ATE > 0**, it means that the treatment produces the desired results or improvement compared to a control group or baseline. \n", - "If **ATC > 0**, it means that people in the control group would, on average, have improved their outcome had they received the treatment. \n", - "If **ATT > 0**, it means that people who received the treatment improved their outcome, on average, compared to what it would have been without the treatment. " - ], - "metadata": { - "collapsed": false - }, - "id": "396273560e855c2" - }, - { - "cell_type": "code", - "execution_count": 23, - "outputs": [ - { - "data": { - "text/plain": " effect_size std_err p-val ci_lower ci_upper outcome\nATE 3.431695 0.081789 0.0 3.271388 3.592001 outcome_1\nATC 3.430632 0.090696 0.0 3.252867 3.608397 outcome_1\nATT 3.432778 0.091197 0.0 3.254031 3.611524 outcome_1", - "text/html": "
" - }, - "execution_count": 23, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "results" - ], - "metadata": { - "collapsed": false, - "ExecuteTime": { - "end_time": "2023-11-23T07:50:01.772325Z", - "start_time": "2023-11-23T07:50:01.708996400Z" - } - }, - "id": "e53777fe46c8f5c8" - }, - { - "cell_type": "markdown", - "source": [ - "### 2.1 PSI \n", - "Population Stability Index - reflects the difference between the training sample and the data on which the model is used." - ], - "metadata": { - "collapsed": false - }, - "id": "f059a00244369186" - }, - { - "cell_type": "markdown", - "source": [ - "PSI < 0.1 - no change \n", - "0.1 <= PSI < 0.2 – minor changes are required \n", - "PSI >= 0.2 - significant changes are required " - ], - "metadata": { - "collapsed": false - }, - "id": "e7f71e7bddc8dd7" - }, - { - "cell_type": "code", - "execution_count": 13, - "outputs": [ - { - "data": { - "text/plain": " column_treated anomaly_score_treated \\\n0 feature_col_1_male_treated 0.00 \n1 feature_col_2_Deposit_treated 0.00 \n2 feature_col_2_Investment_treated 0.00 \n3 feature_col_3_treated 0.01 \n4 feature_col_4_treated 0.02 \n5 feature_col_5_treated 0.01 \n6 feature_col_6_treated 0.00 \n\n check_result_treated column_untreated \\\n0 OK feature_col_1_male_untreated \n1 OK feature_col_2_Deposit_untreated \n2 OK feature_col_2_Investment_untreated \n3 OK feature_col_3_untreated \n4 OK feature_col_4_untreated \n5 OK feature_col_5_untreated \n6 OK feature_col_6_untreated \n\n anomaly_score_untreated check_result_untreated \n0 0.00 OK \n1 0.00 OK \n2 0.00 OK \n3 0.02 OK \n4 0.02 OK \n5 0.01 OK \n6 0.00 OK ", - "text/html": "
" - }, - "execution_count": 13, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "quality_results['psi']" - ], - "metadata": { - "collapsed": false, - "ExecuteTime": { - "end_time": "2023-11-23T07:46:28.147875200Z", - "start_time": "2023-11-23T07:46:28.081589100Z" - } - }, - "id": "3c90404b1ebc9973" - }, - { - "cell_type": "markdown", - "source": [ - "### 2.2 KS_test \n", - "\n", - "The Kolmogorov-Smirnov test checks whether each feature follows the same distribution in both groups." - ], - "metadata": { - "collapsed": false - }, - "id": "2bbdf4c565f8cb31" - }, - { - "cell_type": "markdown", - "source": [ - "*H0: features belong to the same distribution* \n", - "*H1: features do not belong to the same distribution* \n", - "\n", - "The values in the table are p-values. \n", - "If a p-value is **less than 0.05**, *H0* is rejected." - ], - "metadata": { - "collapsed": false - }, - "id": "9dfb6b16d601353d" - }, - { - "cell_type": "code", - "execution_count": 15, - "outputs": [ - { - "data": { - "text/plain": " match_control_to_treat match_treat_to_control\nfeature_col_3 0.105390 0.125298\nfeature_col_4 0.213033 0.195898\nfeature_col_5 0.470989 0.162552\nfeature_col_6 1.000000 1.000000", - "text/html": "
" - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "quality_results['ks_test']" - ], - "metadata": { - "collapsed": false, - "ExecuteTime": { - "end_time": "2023-11-23T07:47:06.626945700Z", - "start_time": "2023-11-23T07:47:06.579757200Z" - } - }, - "id": "351694a62b1c4430" - }, - { - "cell_type": "markdown", - "source": [ - "### 2.3 SMD\n", - "\n", - "Standardized Mean Difference - a measure of the distance between two group means in terms of one or more variables." - ], - "metadata": { - "collapsed": false - }, - "id": "2def2c9c9bd9e011" - }, - { - "cell_type": "markdown", - "source": [ - "SMD 0.2 - small difference \n", - "SMD 0.5 - medium difference \n", - "SMD 0.8 - large difference \n", - "\n", - "*read more [here](https://pubmed.ncbi.nlm.nih.gov/32965803/#:~:text=SMDs%20of%200.2%2C%200.5%2C%20and%200.8%20are%20considered%20small%2C%20medium%2C%20and%20large%2C%20respectively.)*" - ], - "metadata": { - "collapsed": false - }, - "id": "d2da757bcb07204f" - }, - { - "cell_type": "code", - "execution_count": 16, - "outputs": [ - { - "data": { - "text/plain": " match_control_to_treat match_treat_to_control\nfeature_col_3 0.002974 0.003194\nfeature_col_4 0.000296 0.000736\nfeature_col_5 0.004730 0.004899\nfeature_col_6 0.006338 0.005453", - "text/html": "
" - }, - "execution_count": 16, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "quality_results['smd']" - ], - "metadata": { - "collapsed": false, - "ExecuteTime": { - "end_time": "2023-11-23T07:47:33.591066200Z", - "start_time": "2023-11-23T07:47:33.540643200Z" - } - }, - "id": "d1a94ff92952db8f" - }, - { - "cell_type": "markdown", - "source": [ - "### 2.4 Repeats\n", - "\n", - "The fraction of duplicated indexes in the arrays " - ], - "metadata": { - "collapsed": false - }, - "id": "b820e42c0d0afd63" - }, - { - "cell_type": "code", - "execution_count": 17, - "outputs": [ - { - "data": { - "text/plain": "{'match_control_to_treat': 0.6, 'match_treat_to_control': 0.6}" - }, - "execution_count": 17, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "quality_results['repeats']" - ], - "metadata": { - "collapsed": false, - "ExecuteTime": { - "end_time": "2023-11-23T07:47:48.085299400Z", - "start_time": "2023-11-23T07:47:48.020650500Z" - } - }, - "id": "6be2c140bd5f5f6a" - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 2 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython2", - "version": "2.7.6" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -}