Commit

Pushed A2

yashdave003 committed Mar 15, 2024
1 parent 610615a commit 082c693
Showing 3 changed files with 14 additions and 13 deletions.
6 changes: 3 additions & 3 deletions docs/projA2/projA2.html
Original file line number Diff line number Diff line change
@@ -244,14 +244,14 @@ <h2 class="anchored" data-anchor-id="question-5d-and-5f">Question 5d and 5f</h2>
<h3 class="anchored" data-anchor-id="general-debugging-tips">General Debugging Tips</h3>
<p>Question 5 is a challenging question that mirrors a lot of data science work in the real world: cleaning, exploring, and transforming data; fitting a model, working with a pre-defined pipeline and evaluating your model’s performance. Here are some general debugging tips to make the process easier:</p>
<ul>
-<li>Separate small tasks into helper functions, especially if you will execute them multiple times. For example, one-hot-encoding a categorical variable is a good helper function to make because you could perform it on multiple such columns. If you’re parsing a column with RegEx, it also might be a good idea to separate it to a helper function. This allows you to verify that you’re not making errors in these small tasks and prevents unknown bugs from appearing.</li>
+<li>Separate small tasks into helper functions, especially if you will execute them multiple times. For example, a helper function that one-hot encodes a categorical variable may be helpful as you could perform it on multiple such columns. If you’re parsing a column with RegEx, it also might be a good idea to separate it to a helper function. This allows you to verify that you’re not making errors in these small tasks and prevents unknown bugs from appearing.</li>
<li>Feel free to make new cells to play with the data! As long as you delete them afterward, it will not affect the autograder.</li>
-<li>The <code>feature_engine_final</code> looks daunting at first, but start small. First, try and implement a model with a single feature to get familiar with how the function works, then slowly experiment with adding one feature at a time and see how that affects your training RMSE.</li>
+<li>The <code>feature_engine_final</code> looks daunting at first, but start small. First, try and implement a model with a single feature to get familiar with how the pipeline works, then slowly experiment with adding one feature at a time and see how that affects your training RMSE.</li>
</ul>
</section>
<section id="my-training-rmse-is-low-but-my-validationtest-rmse-is-high" class="level3">
<h3 class="anchored" data-anchor-id="my-training-rmse-is-low-but-my-validationtest-rmse-is-high">My training RMSE is low, but my validation/test RMSE is high</h3>
-<p>Your model is likely overfitting to the training data and does not generalize to the test set. Recall the bias-variance tradeoff discussed in lecture. As you add more features and make your model more complex, it is expected that your training error will decrease. Your validation and test error may also decrease initially, but if your model is too complex, you run into this issue.</p>
+<p>Your model is likely overfitting to the training data and does not generalize to the test set. Recall the bias-variance tradeoff discussed in lecture. As you add more features and make your model more complex, it is expected that your training error will decrease. Your validation and test error may also decrease initially, but if your model is too complex, you end up with high validation and test RMSE.</p>
<center>
<img src="under_overfit.png" width="500">
</center>
14 changes: 7 additions & 7 deletions index.tex
@@ -196,7 +196,7 @@ \chapter*{About}\label{about}

\chapter{Jupyter 101}\label{jupyter-101}

-\begin{tcolorbox}[enhanced jigsaw, colframe=quarto-callout-note-color-frame, leftrule=.75mm, arc=.35mm, bottomrule=.15mm, rightrule=.15mm, left=2mm, coltitle=black, titlerule=0mm, colback=white, breakable, colbacktitle=quarto-callout-note-color!10!white, bottomtitle=1mm, opacitybacktitle=0.6, toptitle=1mm, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Note}, toprule=.15mm, opacityback=0]
+\begin{tcolorbox}[enhanced jigsaw, colframe=quarto-callout-note-color-frame, colback=white, title=\textcolor{quarto-callout-note-color}{\faInfo}\hspace{0.5em}{Note}, leftrule=.75mm, bottomtitle=1mm, coltitle=black, breakable, bottomrule=.15mm, arc=.35mm, colbacktitle=quarto-callout-note-color!10!white, opacityback=0, toptitle=1mm, titlerule=0mm, toprule=.15mm, left=2mm, rightrule=.15mm, opacitybacktitle=0.6]

If you're using a MacBook, replace \texttt{ctrl} with \texttt{cmd}.

@@ -1351,10 +1351,10 @@ \subsection{General Debugging Tips}\label{general-debugging-tips}
\tightlist
\item
Separate small tasks into helper functions, especially if you will
-execute them multiple times. For example, one-hot-encoding a
-categorical variable is a good helper function to make because you
-could perform it on multiple such columns. If you're parsing a column
-with RegEx, it also might be a good idea to separate it to a helper
+execute them multiple times. For example, a helper function that
+one-hot encodes a categorical variable may be helpful as you could
+perform it on multiple such columns. If you're parsing a column with
+RegEx, it also might be a good idea to separate it to a helper
function. This allows you to verify that you're not making errors in
these small tasks and prevents unknown bugs from appearing.
\item
@@ -1363,7 +1363,7 @@ \subsection{General Debugging Tips}\label{general-debugging-tips}
\item
The \texttt{feature\_engine\_final} looks daunting at first, but start
small. First, try and implement a model with a single feature to get
-familiar with how the function works, then slowly experiment with
+familiar with how the pipeline works, then slowly experiment with
adding one feature at a time and see how that affects your training
RMSE.
\end{itemize}
@@ -1376,7 +1376,7 @@ \subsection{My training RMSE is low, but my validation/test RMSE is
in lecture. As you add more features and make your model more complex,
it is expected that your training error will decrease. Your validation
and test error may also decrease initially, but if your model is too
-complex, you run into this issue.
+complex, you end up with high validation and test RMSE.
Consider visualizing the relationship between the features you've chosen
and the (Log) Sale Price and removing the features that are not highly
7 changes: 4 additions & 3 deletions projA2/projA2.md
@@ -18,12 +18,13 @@ jupyter: python3
### General Debugging Tips
Question 5 is a challenging question that mirrors a lot of data science work in the real world: cleaning, exploring, and transforming data; fitting a model, working with a pre-defined pipeline and evaluating your model's performance. Here are some general debugging tips to make the process easier:

-* Separate small tasks into helper functions, especially if you will execute them multiple times. For example, one-hot-encoding a categorical variable is a good helper function to make because you could perform it on multiple such columns. If you're parsing a column with RegEx, it also might be a good idea to separate it to a helper function. This allows you to verify that you're not making errors in these small tasks and prevents unknown bugs from appearing.
+* Separate small tasks into helper functions, especially if you will execute them multiple times. For example, a helper function that one-hot encodes a categorical variable may be helpful as you could perform it on multiple such columns. If you're parsing a column with RegEx, it also might be a good idea to separate it to a helper function. This allows you to verify that you're not making errors in these small tasks and prevents unknown bugs from appearing.
* Feel free to make new cells to play with the data! As long as you delete them afterward, it will not affect the autograder.
-* The `feature_engine_final` looks daunting at first, but start small. First, try and implement a model with a single feature to get familiar with how the function works, then slowly experiment with adding one feature at a time and see how that affects your training RMSE.
+* The `feature_engine_final` looks daunting at first, but start small. First, try and implement a model with a single feature to get familiar with how the pipeline works, then slowly experiment with adding one feature at a time and see how that affects your training RMSE.
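To make the helper-function tip concrete, here is a minimal sketch in pandas. The column names (`color`, `Description`), the `Rooms` feature, and the RegEx pattern are hypothetical stand-ins, not the project's actual data:

```python
import pandas as pd

def ohe_column(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Replace one categorical column with one-hot indicator columns."""
    dummies = pd.get_dummies(df[col], prefix=col)
    return df.drop(columns=[col]).join(dummies)

def extract_rooms(df: pd.DataFrame, col: str = "Description") -> pd.DataFrame:
    """Parse a room count out of a free-text column with RegEx."""
    out = df.copy()
    out["Rooms"] = out[col].str.extract(r"(\d+)\s+rooms", expand=False).astype(float)
    return out

# Each task lives in its own function, so each can be sanity-checked alone
# on a tiny hand-made frame before it goes anywhere near the real pipeline.
tiny = pd.DataFrame({"color": ["red", "blue", "red"],
                     "Description": ["5 rooms total", "3 rooms total", "4 rooms total"]})
encoded = ohe_column(tiny, "color")
parsed = extract_rooms(tiny)
```

Because the helpers are separate, you can call `ohe_column` on several categorical columns in a loop and verify each transformation on a toy frame like `tiny` before trusting it inside a larger feature-engineering function.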

### My training RMSE is low, but my validation/test RMSE is high
-Your model is likely overfitting to the training data and does not generalize to the test set. Recall the bias-variance tradeoff discussed in lecture. As you add more features and make your model more complex, it is expected that your training error will decrease. Your validation and test error may also decrease initially, but if your model is too complex, you run into this issue.
+
+Your model is likely overfitting to the training data and does not generalize to the test set. Recall the bias-variance tradeoff discussed in lecture. As you add more features and make your model more complex, it is expected that your training error will decrease. Your validation and test error may also decrease initially, but if your model is too complex, you end up with high validation and test RMSE.

<center><img src="under_overfit.png" width="500"></center>
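The picture above can be reproduced numerically. The sketch below uses toy sine-wave data and plain polynomial features (an illustration only, not the project's data or pipeline): training RMSE keeps shrinking as the degree grows, while the underfit straight line stays bad on both splits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

# Toy data: a nonlinear signal plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(120, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=120)
x_tr, x_val, y_tr, y_val = x[:80], x[80:], y[:80], y[80:]  # simple split

results = {}
for degree in (1, 3, 15):
    feats = PolynomialFeatures(degree).fit(x_tr)
    model = LinearRegression().fit(feats.transform(x_tr), y_tr)
    # (train RMSE, validation RMSE) for this model complexity
    results[degree] = tuple(
        mean_squared_error(t, model.predict(feats.transform(X))) ** 0.5
        for X, t in ((x_tr, y_tr), (x_val, y_val))
    )
    print(f"degree {degree:2d}: train RMSE {results[degree][0]:.3f}, "
          f"val RMSE {results[degree][1]:.3f}")
```

Running this shows the left side of the curve (degree 1, high error everywhere) and the monotone drop in training error; comparing the degree-3 and degree-15 validation numbers is a quick way to see when extra complexity stops paying off.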

Expand Down
