Training vs Testing

👋

Welcome Back!

Now that we understand how features and labels form the foundation of ML models, we can look at how these models are trained and evaluated. This lesson shows how to split data into training and testing sets, so a model can learn from one set and be assessed on examples it has not seen.

📚

Training and Testing Data

In ML, we split our data into a training set and a testing set. The training set is used to teach the model (it learns patterns from this data).

The testing set is a separate set of examples the model has never seen; it's used to evaluate how well the model learned.

It's like practicing problems (training) and then taking a quiz on different problems (testing).
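Here is a minimal sketch of such a split in plain Python. The function name `split_train_test`, the 20% test fraction, and the fixed seed are illustrative assumptions, not part of any particular library.

```python
import random

def split_train_test(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples, then hold out the last part as the test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    n_train = len(shuffled) - n_test
    return shuffled[:n_train], shuffled[n_train:]

examples = list(range(10))  # stand-in for 10 labeled examples
train_set, test_set = split_train_test(examples)
print(len(train_set), len(test_set))  # 8 2
```

Shuffling before slicing matters when the data is stored in some order (for example, sorted by label), because otherwise the test set might not look like the training set.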

❓

Why Split Data?

We split data to check if the model truly generalizes. If we only check performance on the training data, the model might simply memorize those answers.

By testing on new data, we get a realistic measure of how it will perform on unseen examples.
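To see why training accuracy alone can mislead, consider a toy "model" that only memorizes. In the sketch below, the labeling rule, the data, and the default guess of 0 are all made up for illustration.

```python
def true_label(x):
    # Hypothetical rule the model should learn: numbers >= 50 get label 1.
    return 1 if x >= 50 else 0

# Deterministic split: every 5th number is held out for testing.
train_x = [x for x in range(100) if x % 5 != 0]  # 80 examples
test_x = [x for x in range(100) if x % 5 == 0]   # 20 examples

# A "model" that memorizes: a lookup table from training input to label.
memory = {x: true_label(x) for x in train_x}

def memorizer_predict(x):
    # Returns the stored answer, or guesses 0 for anything it has not seen.
    return memory.get(x, 0)

def accuracy(predict, xs):
    correct = sum(1 for x in xs if predict(x) == true_label(x))
    return correct / len(xs)

train_accuracy = accuracy(memorizer_predict, train_x)
test_accuracy = accuracy(memorizer_predict, test_x)
print(train_accuracy, test_accuracy)  # 1.0 0.5
```

The memorizer looks perfect on the data it studied, but it is no better than a constant guess on the held-out numbers. Only the test score reveals that.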

🔧

Validation Set

Often, we also set aside a validation set (a part of the training data) to tune the model's hyperparameters, the settings we choose before training starts, ahead of final testing.

This is like having practice exams: we try different model settings and choose the one that works best on the validation set. Then we do a final evaluation on the test set.
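The workflow above can be sketched with a tiny k-nearest-neighbors classifier, where the setting being tuned is the number of neighbors `k`. The classifier, the noisy toy data, and the 60/20/20 split are assumptions chosen only to keep the example short.

```python
def label(x):
    # Hypothetical data: label is 1 for x >= 50, with every 7th point flipped as noise.
    clean = 1 if x >= 50 else 0
    return 1 - clean if x % 7 == 0 else clean

data = [(x, label(x)) for x in range(100)]

# Deterministic three-way split by position: 60% train, 20% validation, 20% test.
train_set = [p for i, p in enumerate(data) if i % 5 in (0, 1, 2)]
validation_set = [p for i, p in enumerate(data) if i % 5 == 3]
test_set = [p for i, p in enumerate(data) if i % 5 == 4]

def knn_predict(train_points, x, k):
    # Majority vote among the k training points closest to x.
    nearest = sorted(train_points, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(lbl for _, lbl in nearest)
    return 1 if votes * 2 > k else 0

def accuracy(train_points, eval_points, k):
    correct = sum(1 for x, y in eval_points if knn_predict(train_points, x, k) == y)
    return correct / len(eval_points)

candidate_ks = [1, 3, 5, 7]  # odd values avoid tied votes

# Pick the setting using the validation set only.
best_k = max(candidate_ks, key=lambda k: accuracy(train_set, validation_set, k))

# Touch the test set once, after the choice has been made.
test_accuracy = accuracy(train_set, test_set, best_k)
print(best_k, test_accuracy)
```

The test set plays no part in choosing `best_k`. If it did, the final score would no longer be an honest estimate of performance on new data.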

⚠️

Preventing Overfitting

Splitting data helps us detect overfitting. If a model does well on the training set but poorly on the test set, it has probably memorized the training answers instead of learning patterns that generalize.

By comparing performance on training vs testing, we can spot this and adjust (for example, simplify the model or get more data).
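As a rough sketch, this comparison can be written as a small check. The helper names and the 0.10 gap threshold are illustrative assumptions; there is no standard cutoff, and the accuracies below are invented numbers.

```python
def generalization_gap(train_accuracy, test_accuracy):
    """Training accuracy minus test accuracy; a large positive gap suggests overfitting."""
    return train_accuracy - test_accuracy

def looks_overfit(train_accuracy, test_accuracy, max_gap=0.10):
    # max_gap is an assumed, illustrative threshold, not a standard value.
    return generalization_gap(train_accuracy, test_accuracy) > max_gap

print(looks_overfit(0.99, 0.70))  # True: strong on training, weak on test
print(looks_overfit(0.88, 0.85))  # False: the two scores are close
```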

📊

Example Split

For example, with 1000 data points, we might use 800 for training and 200 for testing. The model learns from the 800.

Then we check its accuracy on the 200. This test accuracy is our estimate of how well the model will do on new data.
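The arithmetic for this example looks like the following. The count of 170 correct predictions is a hypothetical outcome used only to show how accuracy is computed.

```python
n_total = 1000
test_fraction = 0.2

n_test = round(n_total * test_fraction)  # 200 examples held out
n_train = n_total - n_test               # 800 examples to learn from

# Hypothetical result: suppose the model gets 170 of the 200 test examples right.
n_correct = 170
test_accuracy = n_correct / n_test

print(n_train, n_test, test_accuracy)  # 800 200 0.85
```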

🔄

Cross-Validation

When data is limited, we can use k-fold cross-validation: split the training data into k parts and train k times, each time using a different part as the validation set.

This way the model is trained and validated k times on different subsets. Averaging the k scores gives a more reliable performance estimate, like taking multiple practice quizzes.
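A minimal sketch of how the k folds can be formed is shown below. The function name `k_fold_indices` is an assumption, and the sketch assumes the data has already been shuffled, since it cuts folds in order.

```python
def k_fold_indices(n_examples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_examples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, 5))
for train_idx, val_idx in folds:
    print(val_idx, len(train_idx))
# [0, 1] 8
# [2, 3] 8
# [4, 5] 8
# [6, 7] 8
# [8, 9] 8
```

Every example is used for validation exactly once and for training in the other k - 1 rounds, which is why this approach makes good use of limited data.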

Quiz: Purpose of Test Data

Why do we use a separate test set in machine learning?

A

To make training faster

B

To get an unbiased estimate of performance on new data

C

To provide ground truth labels during training

D

To reduce the number of features

Fill in the Blank

We split the data into training and test sets to evaluate how well the model can ___ to new data.

💡 Drag the correct word from below into the blank to complete the sentence.

Memorize

Overfit

Generalize

Train

Reflection

💭

Imagine you study for a test with a set of practice problems and then take a different set of test problems.

How is this like splitting data into training and test sets? Why is it important that the problems (data) are different?

Lesson Completed!

Fantastic work! You now understand training, testing, and validation, which are key concepts for building reliable ML models!