After understanding the pitfalls of overfitting and underfitting, we now focus on measuring a modelβs performance. Youβll learn key metrics such as accuracy, precision, and recall, which help determine whether a model is truly effective for its intended task.
Accuracy is the fraction of all predictions that are correct. If a model correctly labels 95 out of 100 test examples, its accuracy is 95%.
Accuracy is easy to understand but can be misleading if classes are unbalanced. For example, if 99% of emails are non-spam, a model that always predicts "not spam" is 99% accurate but useless for finding spam.
Precision measures how many of the positive predictions were actually correct. For example, in spam detection: of all emails flagged as spam, what fraction were truly spam?
A high precision means few false alarms. It answers: "When the model predicts positive, how often is it right?"
Recall (also called sensitivity) measures how many of the actual positives were correctly identified. Using the spam example: of all real spam emails, how many did the model identify?
A high recall means few missed cases. It answers: "How many of the actual positives did we find?"
Often we look at both precision and recall. A model might have high precision but low recall (very confident but misses many positives), or vice versa.
The F1 score combines them by taking their harmonic mean, giving one metric that balances both. We use it when we need a single performance measure for imbalanced problems.
For regression tasks, we use metrics like Mean Squared Error (MSE) or Mean Absolute Error (MAE) to measure average prediction error. Lower values mean better fit.
For classification beyond precision/recall, metrics like ROC-AUC measure the tradeoff between true positive rate and false positive rate. Different problems need different metrics.
The choice depends on the problem. For medical diagnoses, missing a disease (low recall) may be worse than a false alarm, so we might prioritize recall.
For email spam, missing a few spam messages is okay, but sending an important email to spam (false positive) is bad, so precision is key. Always consider the cost of errors in your application.
Which Metric Measures The Fraction Of Actual Positive Examples That The Model Correctly Identified?
In classification, precision tells us how many predicted positives were correct, while ___ tells us how many actual positives were found.
Imagine a test that screens for a serious disease. Would it be more important for the test to have high precision or high recall?
Explain your reasoning (hint: consider the consequences of missing a diagnosis versus false alarms).
Outstanding! You now understand how to evaluate ML models using accuracy, precision, recall, and more!