publish proj a2
nsreddy16 committed Oct 25, 2024
1 parent 33d653b commit 6998af7
Showing 14 changed files with 1,140 additions and 4 deletions.
2 changes: 1 addition & 1 deletion _quarto.yml
@@ -21,7 +21,7 @@ book:
- regex/regex.md
- visualizations/visualizations.md
- projA1/projA1.md
# - projA2/projA2.md
- projA2/projA2.md
# - sql/sql.md

sidebar:
5 changes: 5 additions & 0 deletions docs/autograder_gradescope/autograder_gradescope.html
@@ -152,6 +152,11 @@
<div class="sidebar-item-container">
<a href="../projA1/projA1.html" class="sidebar-item-text sidebar-link"><span class="chapter-title">Project A1 Common Questions</span></a>
</div>
</li>
<li class="sidebar-item">
<div class="sidebar-item-container">
<a href="../projA2/projA2.html" class="sidebar-item-text sidebar-link"><span class="chapter-title">Project A2 Common Questions</span></a>
</div>
</li>
</ul>
</div>
5 changes: 5 additions & 0 deletions docs/index.html
@@ -153,6 +153,11 @@
<div class="sidebar-item-container">
<a href="./projA1/projA1.html" class="sidebar-item-text sidebar-link"><span class="chapter-title">Project A1 Common Questions</span></a>
</div>
</li>
<li class="sidebar-item">
<div class="sidebar-item-container">
<a href="./projA2/projA2.html" class="sidebar-item-text sidebar-link"><span class="chapter-title">Project A2 Common Questions</span></a>
</div>
</li>
</ul>
</div>
5 changes: 5 additions & 0 deletions docs/jupyter101/jupyter101.html
@@ -152,6 +152,11 @@
<div class="sidebar-item-container">
<a href="../projA1/projA1.html" class="sidebar-item-text sidebar-link"><span class="chapter-title">Project A1 Common Questions</span></a>
</div>
</li>
<li class="sidebar-item">
<div class="sidebar-item-container">
<a href="../projA2/projA2.html" class="sidebar-item-text sidebar-link"><span class="chapter-title">Project A2 Common Questions</span></a>
</div>
</li>
</ul>
</div>
5 changes: 5 additions & 0 deletions docs/jupyter_datahub/jupyter_datahub.html
@@ -152,6 +152,11 @@
<div class="sidebar-item-container">
<a href="../projA1/projA1.html" class="sidebar-item-text sidebar-link"><span class="chapter-title">Project A1 Common Questions</span></a>
</div>
</li>
<li class="sidebar-item">
<div class="sidebar-item-container">
<a href="../projA2/projA2.html" class="sidebar-item-text sidebar-link"><span class="chapter-title">Project A2 Common Questions</span></a>
</div>
</li>
</ul>
</div>
5 changes: 5 additions & 0 deletions docs/pandas/pandas.html
@@ -152,6 +152,11 @@
<div class="sidebar-item-container">
<a href="../projA1/projA1.html" class="sidebar-item-text sidebar-link"><span class="chapter-title">Project A1 Common Questions</span></a>
</div>
</li>
<li class="sidebar-item">
<div class="sidebar-item-container">
<a href="../projA2/projA2.html" class="sidebar-item-text sidebar-link"><span class="chapter-title">Project A2 Common Questions</span></a>
</div>
</li>
</ul>
</div>
9 changes: 9 additions & 0 deletions docs/projA1/projA1.html
@@ -30,6 +30,7 @@
<script src="../site_libs/quarto-search/fuse.min.js"></script>
<script src="../site_libs/quarto-search/quarto-search.js"></script>
<meta name="quarto:offset" content="../">
<link href="../projA2/projA2.html" rel="next">
<link href="../visualizations/visualizations.html" rel="prev">
<link href="../data100_logo.png" rel="icon" type="image/png">
<script src="../site_libs/quarto-html/quarto.js"></script>
@@ -151,6 +152,11 @@
<div class="sidebar-item-container">
<a href="../projA1/projA1.html" class="sidebar-item-text sidebar-link active"><span class="chapter-title">Project A1 Common Questions</span></a>
</div>
</li>
<li class="sidebar-item">
<div class="sidebar-item-container">
<a href="../projA2/projA2.html" class="sidebar-item-text sidebar-link"><span class="chapter-title">Project A2 Common Questions</span></a>
</div>
</li>
</ul>
</div>
@@ -618,6 +624,9 @@ <h3 class="anchored" data-anchor-id="my-ohe-columns-contain-a-lot-of-nan-values"
</a>
</div>
<div class="nav-page nav-page-next">
<a href="../projA2/projA2.html" class="pagination-link" aria-label="<span class='chapter-number'>8</span>&nbsp; <span class='chapter-title'>Project A2 Common Questions</span>">
<span class="nav-page-text"><span class="chapter-title">Project A2 Common Questions</span></span> <i class="bi bi-arrow-right-short"></i>
</a>
</div>
</nav>
</div> <!-- /content -->
769 changes: 769 additions & 0 deletions docs/projA2/projA2.html

Large diffs are not rendered by default.

Binary file added docs/projA2/under_overfit.png
5 changes: 5 additions & 0 deletions docs/regex/regex.html
@@ -181,6 +181,11 @@
<div class="sidebar-item-container">
<a href="../projA1/projA1.html" class="sidebar-item-text sidebar-link"><span class="chapter-title">Project A1 Common Questions</span></a>
</div>
</li>
<li class="sidebar-item">
<div class="sidebar-item-container">
<a href="../projA2/projA2.html" class="sidebar-item-text sidebar-link"><span class="chapter-title">Project A2 Common Questions</span></a>
</div>
</li>
</ul>
</div>
40 changes: 40 additions & 0 deletions docs/search.json
@@ -378,5 +378,45 @@
"crumbs": [
"<span class='chapter-number'>7</span>  <span class='chapter-title'>Project A1 Common Questions</span>"
]
},
{
"objectID": "projA2/projA2.html",
"href": "projA2/projA2.html",
"title": "Project A2 Common Questions",
"section": "",
"text": "Questions 5d and 5f",
"crumbs": [
"<span class='chapter-number'>8</span>  <span class='chapter-title'>Project A2 Common Questions</span>"
]
},
{
"objectID": "projA2/projA2.html#questions-5d-and-5f",
"href": "projA2/projA2.html#questions-5d-and-5f",
"title": "Project A2 Common Questions",
"section": "",
"text": "General Debugging Tips\nQuestion 5 is a challenging question that mirrors a lot of data science work in the real world: cleaning, exploring, and transforming data; fitting a model; working with a pre-defined pipeline; and evaluating your model’s performance. Here are some general debugging tips to make the process easier:\n\nSeparate small tasks into helper functions, especially if you will execute them multiple times. For example, a helper function that one-hot encodes a categorical variable may be helpful, since you can apply it to several such columns. If you’re parsing a column with RegEx, it also might be a good idea to move that logic into a helper function. This allows you to verify that you’re not making errors in these small tasks and reduces the chance of hidden bugs.\nFeel free to make new cells to play with the data! As long as you delete them afterward, it will not affect the autograder.\nThe feature_engine_final function looks daunting at first, but start small. First, try to implement a model with a single feature to get familiar with how the pipeline works, then slowly experiment with adding one feature at a time and see how that affects your training RMSE.\n\n\n\nMy training RMSE is low, but my validation/test RMSE is high\nYour model is likely overfitting to the training data and does not generalize to the test set. Recall the bias-variance tradeoff discussed in lecture. As you add more features and make your model more complex, it is expected that your training error will decrease. Your validation and test error may also decrease initially, but if your model is too complex, you end up with high validation and test RMSE.\n\n\n\nTo decrease model complexity, consider visualizing the relationship between the features you’ve chosen and the (Log) Sale Price and removing features that are not highly correlated with it. Removing outliers can also help your model generalize better and prevent it from fitting to noise in the data. 
Methods like cross-validation allow you to get a better sense of where you lie along the validation error curve. Feel free to take a look at the code used in Lecture 16 if you’re unsure how to implement cross-validation.\n\n\nValueError: Per-column arrays must each be 1-dimensional\nIf you’re passing the tests for question 5d but getting this error in question 5f, then your Y variable is likely a DataFrame, not a Series. sklearn models like LinearRegression expect X to be a 2D datatype (e.g., a DataFrame or 2D NumPy array) and Y to be a 1D datatype (e.g., a Series or 1D NumPy array).\n\n\nKeyError: 'Sale Price'/KeyError: 'Log Sale Price'\nKeyErrors are raised when a column name does not exist in your DataFrame. You could be getting this error because:\n\nThe test set does not contain a \"(Log) Sale Price\" column, as that’s what we’re trying to predict. Make sure you only reference the \"(Log) Sale Price\" column when working with training data (is_test_set=False).\nYou dropped the \"Sale Price\" column twice in your preprocessing code.\n\n\n\nValueError: could not convert string to float\nThis error occurs if your final design matrix contains non-numeric columns. For example, if you simply run X = data.drop(columns = [\"Log Sale Price\", \"Sale Price\"]), all the non-numeric columns of data are still included in X and you will see this error message. The fit function of a lm.LinearRegression object can take a pandas DataFrame as the X argument, but requires that the DataFrame is only composed of numeric values.\n\n\nValueError: Input X contains infinity or a value too large for dtype('float64')\nYour X data likely contains infinity because you are taking the logarithm of 0 somewhere in your code. To prevent this, try:\n\nAdding a small number to the features that you want to log-transform so that all values are strictly greater than 0. 
Note that whatever value you add to your train data should also be added to your test data.\nRemoving zeroes before taking the logarithm. Note that this is only possible on the training data as you cannot drop rows from the test set.\n\n\n\nValueError: Input X contains NaN\nYour design matrix X likely contains NaN values because you take the log of a negative number somewhere in your code. To prevent this, try:\n\nShifting the range of values for features that you want to log-transform so that all values are strictly greater than 0. Note that whatever value you add to your train data should also be added to your test data.\nRemoving negative values before taking the log. Note that this is only possible on the training data as you cannot drop rows from the test set.\n\n\n\nValueError: The feature names should match those that were passed during fit\nThis error is followed by one or both of the following:\nFeature names unseen at fit time: \n- FEATURE NAME 1\n- FEATURE NAME 2\n ...\n\nFeature names seen at fit time, yet now missing\n- FEATURE NAME 1\n- FEATURE NAME 2\n ...\nThis error occurs if the columns/features you’re passing in for the test dataset aren’t the same as the features you used to train the model. sklearn’s models expect the testing data’s column names to match the training data’s. The features listed under Feature names unseen at fit time are columns that are present in the testing data but were not in the training data, and features listed under Feature names seen at fit time, yet now missing were present in the training data but are absent from the testing data.\nPotential causes for this error:\n\nYour preprocessing for X is different for training and testing. Double-check your code in feature_engine_final! Besides removing any references to 'Sale Price' and code that would remove rows from the test set, your preprocessing should be the same.\nSome one-hot-encoded categories are present in training but not in testing (or vice versa). 
For example, let’s say that the feature \"Data100\" has categories “A”, “B”, “C”, and “D”. If “A”, “B”, and “C” are present in the training data, but “B”, “C”, and “D” are present in the testing data, you will get this error:\nThe feature names should match those that were passed during fit. Feature names unseen at fit time: \n- Data100_D\n ...\n\nFeature names seen at fit time, yet now missing\n- Data100_A\n\n\n\nValueError: operands could not be broadcast together with shapes ...\nThis error occurs when you attempt to perform an operation on two NumPy arrays with incompatible shapes. For example, np.ones(100000) - np.ones(1000000) is not defined since you cannot perform elementwise subtraction on arrays with these different lengths. Use the error traceback to identify which line is erroring, and print the shapes of the arrays (using .shape) on the line before it.\n\n\nTypeError: NoneType is not subscriptable\nThis error occurs when a None value is indexed as if it were a DataFrame, list, or array, for example None[0]. It may be difficult to identify where the None is coming from, but here are some possible causes:\n\nCheck that your helper functions always end with a return statement and that they return what you expect!\npandas’ inplace= argument allows us to simplify code; instead of reassigning df = df.an_operation(inplace=False), you can choose to shorten the operation as df.an_operation(inplace=True). Note that any inplace=True argument modifies the DataFrame and returns None (read more about it in this Stack Overflow post). Writing df = df.an_operation(inplace=True) sets df to None, so indexing df afterward raises this TypeError; chaining df.an_operation(inplace=True).another_operation() fails for the same reason, though Python reports that case as an AttributeError.\n\nCheck the return type of all functions you end up using. For example, np.append(arr_1, arr_2) returns a NumPy array. In contrast, Python’s list.append method mutates the list and returns None. 
If you are unsure of what data type is being returned, looking up the function’s documentation, adding print statements, and calling type(some_function(input)) are all useful ways to debug your code.",
"crumbs": [
"<span class='chapter-number'>8</span>  <span class='chapter-title'>Project A2 Common Questions</span>"
]
},
{
"objectID": "projA2/projA2.html#question-6",
"href": "projA2/projA2.html#question-6",
"title": "Project A2 Common Questions",
"section": "Question 6",
"text": "Question 6\n\nI’m getting negative values for the prop_overest plot\nNote that in the function body, the skeleton code includes:\n# DO NOT MODIFY THESE TWO LINES\n if subset_df.shape[0] == 0:\n return -1\nThe above two lines of code are included to avoid dividing by 0 when computing prop_overest in the case that subset_df does not have any rows. When interpreting the plots, you can disregard the negative portions as you know they are invalid values corresponding to this edge case. You can verify this by observing that your RMSE plot does not display the corresponding intervals.",
"crumbs": [
"<span class='chapter-number'>8</span>  <span class='chapter-title'>Project A2 Common Questions</span>"
]
},
{
"objectID": "projA2/projA2.html#gradescope",
"href": "projA2/projA2.html#gradescope",
"title": "Project A2 Common Questions",
"section": "Gradescope",
"text": "Gradescope\n\nI don’t have many Gradescope submissions left\nIf you’re almost out of Gradescope submissions, try using k-fold cross-validation to check the accuracy of your model. Results from cross-validation will be closer to the test set accuracy than results from the training data. Feel free to take a look at the code used in Lecture 16 if you’re unsure how to implement cross-validation.\n\n\n“Wrong number of lines ( __ instead of __ )”\nThis occurs when you remove outliers when preprocessing the testing data. Please do not remove any outliers from your test set. You may only remove outliers in training data.\n\n\nNumerical Overflow\nThis error is caused by overly large predictions that create an extremely large RMSE. The cell before you generate your submission runs submission_df[\"Value\"].describe(), which returns some summary statistics of your predictions. Your maximum value for Log Sale Price should not be over 25.\nFor your reference, a log sale price of 25 corresponds to a sale price of \\(e^{25} \\approx\\) 70 billion, which is far bigger than anything found in the dataset. If you see such large predictions, you can try removing outliers from the training data or experimenting with new features so that your model generalizes better.",
"crumbs": [
"<span class='chapter-number'>8</span>  <span class='chapter-title'>Project A2 Common Questions</span>"
]
}
]
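
Note on the "feature names should match" entry indexed in docs/search.json: one possible way to handle one-hot categories that appear in only one of the two datasets is to reindex the test design matrix to the training columns. The sketch below is only an illustration under stated assumptions: it assumes the encoding is done with `pd.get_dummies`, and the helper name `align_to_training_columns` and the toy `Data100` data are hypothetical, not taken from the project skeleton.

```python
import pandas as pd


def align_to_training_columns(X_test: pd.DataFrame, train_columns) -> pd.DataFrame:
    """Hypothetical helper: give X_test exactly the columns seen during fit.

    Columns missing from the test set are added and filled with 0;
    columns that were never seen at fit time are dropped.
    """
    return X_test.reindex(columns=train_columns, fill_value=0)


# Toy data mirroring the "Data100" example: training has A, B, C; testing has B, C, D.
train = pd.DataFrame({"Data100": ["A", "B", "C"]})
test = pd.DataFrame({"Data100": ["B", "C", "D"]})

# One-hot encode each set separately, as a preprocessing function typically would.
X_train = pd.get_dummies(train, columns=["Data100"], dtype=int)
X_test = pd.get_dummies(test, columns=["Data100"], dtype=int)

# Without alignment, X_test has Data100_D (unseen at fit time) and lacks Data100_A.
X_test_aligned = align_to_training_columns(X_test, X_train.columns)
print(list(X_test_aligned.columns))
```

Dropping the unseen `Data100_D` column means a test row in category "D" is encoded as all zeros, which is a modeling choice rather than the only option; no rows are removed from the test set.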
5 changes: 5 additions & 0 deletions docs/visualizations/visualizations.html
@@ -181,6 +181,11 @@
<div class="sidebar-item-container">
<a href="../projA1/projA1.html" class="sidebar-item-text sidebar-link"><span class="chapter-title">Project A1 Common Questions</span></a>
</div>
</li>
<li class="sidebar-item">
<div class="sidebar-item-container">
<a href="../projA2/projA2.html" class="sidebar-item-text sidebar-link"><span class="chapter-title">Project A2 Common Questions</span></a>
</div>
</li>
</ul>
</div>