Train/Test Split & Overfitting

The most important rule in machine learning: never evaluate your model on the data it was trained on. A model that memorizes the training data will look perfect on it but fail on new data. This is called overfitting.

The Train/Test Split

Split your data into two parts:

  • Training set (~80%) — used to train the model
  • Test set (~20%) — held back, only used to evaluate the final model

The test set simulates "new, unseen data." If the model performs well on it, you can be more confident it will generalize.

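A minimal sketch of an 80/20 split using NumPy; the dataset, its size, and the variable names are illustrative (in practice, scikit-learn's `train_test_split(X, y, test_size=0.2)` does the same thing):

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative dataset: 100 samples, 3 features, one target
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# Shuffle the row indices, then take the first 80% for training
# and the remaining 20% for testing
idx = rng.permutation(len(X))
split = int(0.8 * len(X))
train_idx, test_idx = idx[:split], idx[split:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(len(X_train), len(X_test))  # 80 20
```

Shuffling before splitting matters: if the data is ordered (say, by date or by class), a plain front/back split would give train and test sets with different distributions.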

What is Overfitting?

Overfitting happens when a model learns the noise in the training data, not just the signal. Signs:

  • Training error is very low but test error is high
  • The model is too complex for the amount of data
  • Adding more features or model complexity makes test performance worse

The opposite problem is underfitting: the model is too simple to capture the pattern.

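One way to reproduce this effect, sketched with NumPy's `polyfit`; the function, noise level, sample sizes, and degrees are illustrative assumptions. With only 24 training points, a degree-12 polynomial has enough freedom to chase the noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a smooth function
x = rng.uniform(-1, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=30)

# 80/20 split: first 24 points train, last 6 test
x_train, x_test = x[:24], x[24:]
y_train, y_test = y[:24], y[24:]

def errors(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coefs = np.polyfit(x_train, y_train, degree)
    mse = lambda xs, ys: np.mean((np.polyval(coefs, xs) - ys) ** 2)
    return mse(x_train, y_train), mse(x_test, y_test)

for degree in (1, 3, 12):
    tr, te = errors(degree)
    print(f"degree {degree:2d}: train MSE {tr:.4f}, test MSE {te:.4f}")
```

Training error can only go down as the degree increases (each higher-degree model contains the lower-degree ones), but past some point the test error climbs back up.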

Notice how degree 12 has nearly zero training error but very high test error: that's overfitting. The model memorized the training points but can't generalize.

Key Takeaways

  • Always split data into train and test sets before modeling
  • Evaluate on the test set to estimate real-world performance
  • Overfitting = low train error, high test error
  • More complex models are more prone to overfitting
  • The goal is to find the right balance between underfitting and overfitting