Supervised Learning: Classification
In the previous module, we focused on regression — predicting continuous values like sales or prices. Now we shift our attention to a different but equally important branch of supervised learning: classification.
Here, instead of predicting “how much” of something will happen, we’re predicting which category something belongs to. For example:
- Will this email be spam or not spam?
- Is this bank transaction fraudulent or legitimate?
- Will this patient test positive or negative for a certain condition?
In my early career, one of my first classification projects was helping a small e-commerce company reduce refund fraud. We trained a model to classify whether a refund request was likely fraudulent based on customer behavior, order history, and payment patterns. It saved the company tens of thousands of dollars in just the first quarter after deployment.
For this module, we’ll work with a simple but relatable case: classifying emails as spam or not spam. This is binary classification, meaning we have only two possible outcomes (1 for spam, 0 for not spam).
Logistic Regression and Binary Classification
Logistic regression is often the first algorithm I teach for classification because:
- It’s simple to understand
- It works well for binary classification
- It provides interpretable results
The term “regression” in the name can be misleading — unlike linear regression, logistic regression predicts probabilities that a data point belongs to a class.
It uses the sigmoid function to map predictions to a range between 0 and 1. If the probability is above a threshold (often 0.5), the model predicts “1” (e.g., spam). Otherwise, it predicts “0” (not spam).
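To make this concrete, here is a minimal sketch of the sigmoid in plain Python/NumPy (the `sigmoid` helper is our own illustration, not something you need from scikit-learn):

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to a probability between 0 and 1."""
    return 1 / (1 + np.exp(-z))

# A raw score of 0 lands exactly on the 0.5 threshold;
# large positive scores approach 1, large negative scores approach 0.
for z in [-4, -1, 0, 1, 4]:
    print(f"z = {z:>2} -> probability = {sigmoid(z):.3f}")
```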
Example: Spam Email Detection
Let’s imagine we have a dataset with features like:
- num_links: Number of links in the email
- contains_offer: Whether the email contains promotional words like “offer”, “win”, “free”
- sender_reputation: A score based on the sender’s trustworthiness
Step 1: Preparing the Data
When building any machine learning model, how you prepare your data will determine 80% of your success. The rest is just the math and tuning.
Here, we have a small dataset that contains:
- num_links – Number of hyperlinks in the email.
- contains_offer – Whether the email contains words like "offer", "free", "win".
- sender_reputation – A trust score for the sender (0 means very untrustworthy, 1 means very trustworthy).
- is_spam – Our target variable (1 means spam, 0 means not spam).
```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Sample dataset
data = {
    "num_links": [5, 0, 8, 1, 10, 2, 0, 7],
    "contains_offer": [1, 0, 1, 0, 1, 0, 0, 1],
    "sender_reputation": [0.2, 0.9, 0.1, 0.85, 0.05, 0.8, 0.95, 0.15],
    "is_spam": [1, 0, 1, 0, 1, 0, 0, 1]
}
df = pd.DataFrame(data)

# Features and target
X = df.drop(columns=["is_spam"])
y = df["is_spam"]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```
Why we split the data
- Training set: The model learns from this data.
- Testing set: Used for final evaluation, to see how the model performs on unseen data.
If we don’t split the data, the model is evaluated on examples it has already “seen” (a form of data leakage), so it will appear to perform unrealistically well.
When I was consulting for a retail company, a junior analyst accidentally evaluated a demand forecasting model on the same data it was trained on. The model showed 99% accuracy in testing, but when deployed, accuracy dropped to 65% because the real-world data didn’t match the memorized patterns. That was the day I started double-checking every train/test split in my teams.
Step 2: Training Logistic Regression
As noted earlier, logistic regression is not “regression” in the traditional sense: it predicts class probabilities rather than continuous values.
```python
from sklearn.linear_model import LogisticRegression

# Create and train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# Check model coefficients
print("Feature Weights:", model.coef_)
print("Intercept:", model.intercept_)
```
Understanding what’s happening
- model.fit(): The algorithm finds the best set of weights that map input features to the probability of spam.
- Feature weights: Tell us how important each feature is.
- Intercept: The baseline prediction when all features are zero.
Example:
If num_links weight = +1.2, more links → more likely to be spam.
If sender_reputation weight = -3.0, higher reputation → less likely to be spam.
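To see how the weights, intercept, and sigmoid combine into a single prediction, here is a small sketch using made-up weights for illustration (your trained model’s `coef_` and `intercept_` values will differ):

```python
import numpy as np

# Hypothetical weights for illustration only, in the same column order
# as our features: num_links, contains_offer, sender_reputation
weights = np.array([1.2, 0.8, -3.0])
intercept = -0.5

# An email with many links, promotional words, and a low-reputation sender
email = np.array([8, 1, 0.1])

score = intercept + weights @ email      # linear combination of features
probability = 1 / (1 + np.exp(-score))   # sigmoid squashes it into (0, 1)
print(f"P(spam) = {probability:.3f}")    # well above 0.5, so predicted spam
```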
Insight:
In a hotel booking model I built, the intercept value was the baseline occupancy without any promotions or events — around 35%. This number later became a key performance target for operations.
Step 3: Making Predictions
Once the model is trained, we can predict both probabilities and classes.
```python
# Predict probabilities
probabilities = model.predict_proba(X_test)
print("Probabilities:\n", probabilities)

# Predict classes
y_pred = model.predict(X_test)
print("Predictions:", y_pred)
```
- Probabilities are numbers between 0 and 1 showing how confident the model is.
- Predicted classes are final 0 or 1 decisions, based on a threshold (default: 0.5).
If a business wants to err on the side of caution, they might lower the threshold. For example, in financial fraud detection, we might flag transactions with only a 30% probability of fraud just to investigate them.
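Scikit-learn’s `predict()` uses 0.5 by default, but nothing stops you from applying your own cutoff to the probabilities. A short sketch continuing our spam example, with the 0.3 threshold chosen purely for illustration:

```python
# predict_proba returns one column per class; column 1 is P(spam)
spam_probability = model.predict_proba(X_test)[:, 1]

# Lower the threshold from 0.5 to 0.3 to catch more suspicious emails
custom_threshold = 0.3
y_pred_cautious = (spam_probability >= custom_threshold).astype(int)

print("Default predictions: ", model.predict(X_test))
print("Cautious predictions:", y_pred_cautious)
```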
Decision Boundaries
A decision boundary is the dividing line between two classes in feature space. In 2D, it’s a line; in 3D, it’s a plane; in higher dimensions, it’s a hyperplane.
```python
import matplotlib.pyplot as plt
import numpy as np

# For illustration we plot two of the three features:
# num_links (x-axis) and sender_reputation (y-axis)
x_min, x_max = X_train["num_links"].min() - 1, X_train["num_links"].max() + 1
y_min, y_max = X_train["sender_reputation"].min() - 0.1, X_train["sender_reputation"].max() + 0.1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                     np.linspace(y_min, y_max, 100))

# Fix contains_offer at 0 for visualization; the column order
# (num_links, contains_offer, sender_reputation) must match training
Z = model.predict(np.c_[xx.ravel(), np.zeros(xx.ravel().shape), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.2)
plt.scatter(X_train["num_links"], X_train["sender_reputation"], c=y_train)
plt.xlabel("Number of Links")
plt.ylabel("Sender Reputation")
plt.title("Decision Boundary for Spam Detection")
plt.show()
```
In a bank fraud detection project, I once had to explain why certain transactions were flagged as suspicious. Visualizing the decision boundary made it much easier for executives to trust the AI: they could see which factors pushed a transaction into the “fraud” zone.
Accuracy, Precision, Recall, F1-score
Evaluation is critical because a model that’s “accurate” might still be bad for business.
```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")
```
Example output:
```
Accuracy: 0.88
Precision: 0.86
Recall: 0.90
F1-score: 0.88
```
When to focus on each metric
- Accuracy: When classes are balanced.
- Precision: When false positives are costly (e.g., important email marked as spam).
- Recall: When false negatives are costly (e.g., missing a fraudulent transaction).
- F1-score: When you want balance between precision and recall.
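All four metrics can be read off the confusion matrix, which is often the first thing I print when debugging a classifier. A quick sketch using the same `y_test` and `y_pred` from above:

```python
from sklearn.metrics import confusion_matrix, classification_report

# Rows are actual classes, columns are predicted classes:
# [[true negatives,  false positives],
#  [false negatives, true positives ]]
print(confusion_matrix(y_test, y_pred))

# classification_report summarizes precision, recall, and F1 per class
print(classification_report(y_test, y_pred))
```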
From my field work:
When working with refund fraud detection, I prioritized recall because missing a fraudulent case could mean big financial loss. But for email spam filters, I usually optimize for precision, because losing an important email can damage business relationships.
Takeaway
From this spam classification exercise, we learned:
- Data preparation is the foundation of a reliable model.
- Logistic regression is a simple yet powerful binary classification tool.
- Decision boundaries make classification decisions easier to visualize and explain.
- Evaluation metrics must match the business objective, not just technical performance.
I’ve seen many projects where a simple logistic regression, properly trained and understood, outperformed more complex models. In real-world settings, clarity and trust often matter more than raw accuracy.
Frequently Asked Questions
What is classification in machine learning?
Classification is a supervised learning method used to predict categories, such as spam vs. not spam, or fraud vs. legitimate transactions.

How does logistic regression classify data?
Logistic regression predicts the probability that an input belongs to a class, using a threshold (commonly 0.5) to decide between two categories.

What is a decision boundary?
A decision boundary is the line or surface that separates different classes in feature space, showing where predictions change from one class to another.

When should I prioritize precision vs. recall?
Prioritize precision when false positives are costly (e.g., losing important emails to spam) and prioritize recall when false negatives are costly (e.g., missing fraud cases).