Train/Test Split & Overfitting

The most important rule in machine learning: never evaluate your model on the data it was trained on. A model that memorizes the training data will look perfect on it but fail on new data. This is called overfitting.

The Train/Test Split

Split your data into two parts:

  • Training set (~80%) — used to train the model
  • Test set (~20%) — held back, only used to evaluate the final model

The test set simulates "new, unseen data." If the model performs well on it, you can be more confident it will generalize.

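A minimal sketch of an 80/20 split using NumPy; the dataset, its size, and the variable names are illustrative (in practice, scikit-learn's `train_test_split(X, y, test_size=0.2)` does the same thing):

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative dataset: 100 samples, 3 features, one target
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# Shuffle the row indices, then take the first 80% for training
# and the remaining 20% for testing
idx = rng.permutation(len(X))
split = int(0.8 * len(X))
train_idx, test_idx = idx[:split], idx[split:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(len(X_train), len(X_test))  # 80 20
```

Shuffling before splitting matters: if the data is ordered (say, by date or by class), a plain front/back split would give train and test sets with different distributions.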

What is Overfitting?

Overfitting happens when a model learns the noise in the training data, not just the signal. Signs:

  • Training error is very low but test error is high
  • The model is too complex for the amount of data
  • Adding more features or model complexity makes test performance worse

The opposite problem is underfitting: the model is too simple to capture the pattern.

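One way to reproduce this effect, sketched with NumPy's `polyfit`; the function, noise level, sample sizes, and degrees are illustrative assumptions. With only 24 training points, a degree-12 polynomial has enough freedom to chase the noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a smooth function
x = rng.uniform(-1, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=30)

# 80/20 split: first 24 points train, last 6 test
x_train, x_test = x[:24], x[24:]
y_train, y_test = y[:24], y[24:]

def errors(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coefs = np.polyfit(x_train, y_train, degree)
    mse = lambda xs, ys: np.mean((np.polyval(coefs, xs) - ys) ** 2)
    return mse(x_train, y_train), mse(x_test, y_test)

for degree in (1, 3, 12):
    tr, te = errors(degree)
    print(f"degree {degree:2d}: train MSE {tr:.4f}, test MSE {te:.4f}")
```

Training error can only go down as the degree increases (each higher-degree model contains the lower-degree ones), but past some point the test error climbs back up.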

Notice how degree 12 has nearly zero training error but very high test error: that's overfitting. The model memorized the training points but can't generalize.

Key Takeaways

  • Always split data into train and test sets before modeling
  • Evaluate on the test set to estimate real-world performance
  • Overfitting = low train error, high test error
  • More complex models are more prone to overfitting
  • The goal is to find the right balance between underfitting and overfitting