The following question set is adapted from the pre-interview assessment form used in [Algoritma]'s (https://algorit.ma) hiring for teaching members of the team.
-
Assuming a simple linear regression (ordinary least squares) trained on a dataset with one predictor. This model is likely to exhibit:
- A high bias and high variance
- A low bias and low variance
- A high bias but low variance
- A low bias but high variance
-
Which of the following is the most fitting definition of p-value:
- Probability of obtaining a result or value more extreme than was observed
- Probability of a null hypothesis to evaluate to False
- Probability of the alternate hypothesis to be correct
- Probability of a variable being insignificant to the true parameters of a model
-
Which is the correct formula for calculating model's sensitivity?
- True Positives / (True Positives + False Negatives)
- True Positives / (True Positives + False Positives)
- True Positives / Total Predictions
- True Negatives / Total Predictions
Refer to
threemodels.png
directly in the same directory if the image is not rendered for the following question.
-
You want a model that identify hateful tweets on Twitter and you're presented with three candidates (Model A, Model B, and Model C). You are asked to pick the model with the highest precision. Which of the following models have the highest precision?
- Model A
- Model B
- Model C
-
We want to be confident that our model can perform reasonably in real world environments, and not overfitted to the dataset it was trained on. What is a strategy that greatly diminish the possibility of overfitting?
- Gradient optimization
- Grid Search
- Train-Test Splitting
-
One difference between a supervised learning task and an unsupervised learning task is the presence of a target variable. Which of the following best describes a target variable?
- A target variable is also an indendent variable
- Target variable is an isolated variable taken in a separate data collection process
- Target variable is dependent to independent variable
-
Download
analytics.csv
, which is export as-is from the company's Google Analytics dashboard. Values in theLanguage
column is formatted to capture both the client (browser) language and keyboard language, but for this exercise we're only interested about the former. A value ofen-id
should hence be stored asen
, and a value ofid-jp
should similarly beid
. Fill missing values withmissing
. This should result inen
,id
,th
andmissing
as valid values in theLanguage
column. Which language has on average, the highestPages / Session
count?-
en
or English -
id
or Indonesian -
th
or Thai
-
-
Use any tools of your choice, run a closed-form, simple linear regression to predict
Goal Conversion Rate
(target) using the values ofPages / Session
(predictor). Call thismodel_A
. What is the multiple R-squared from your simple linear regression,model_A
, rounded to 3 decimal points? You can retrieve this value throughsklearn.metrics.r2_score
orsummary(model)$r.squared
- 0.786
- 0.826
- 0.866
-
Let beta0 be the intercept and beta1 be your slope. What is the value of
beta0
?- -25.188
- 8.65
- 0
- 0.00268
-
Add
Language
as an additional predictor to the earlier linear regression model. Call thismodel_B
. Did your multiple R-squared model improved as a result? Compare the adjusted R-squared of two modelsmodel_A
andmodel_B
.-
model_A
has a higher multiple R2 and adjusted R2 value -
model_B
has a higher multiple R2 and adjusted R2 value -
model_A
has a higher multiple R2 but lower adjusted R2 value -
model_B
has a higher multiple R2 but lower adjusted R2 value
-