Hands-on Project: Regression and Classification
Up until now, we’ve learned the theory of regression and classification. But theory without practice is like learning a recipe without ever cooking the dish. To become confident in machine learning, you must build real projects, see outputs, and understand mistakes.
In this lesson, we’ll build two hands-on projects:
- Predicting house prices using Linear Regression.
- Classifying emails as spam or not spam using Logistic Regression.
These two projects represent the two core pillars of supervised learning:
- Regression: Predicting continuous values (like prices or sales).
- Classification: Predicting categories (like spam vs not spam).
My first consulting project was predicting apartment rental prices. The real estate company initially relied on “gut feeling” from brokers, which often underpriced properties. After our model went live, they priced apartments more accurately and profits went up by 15%. Similarly, the first spam detection project I built for a startup reduced their manual email filtering time by 80%. These simple models had real business impact, exactly what I want you to experience in this lesson.
Project 1: Predicting House Prices with Linear Regression
Step 1: Problem Framing
Imagine you are working for a real estate agency. Every day, clients ask the same question:
“How much should we price this house?”
Traditionally, real estate agents rely on experience and intuition. They look at recent sales in the area, compare bedrooms and bathrooms, and make educated guesses. But guesswork has limitations. If a house is underpriced, the agency loses money. If overpriced, the house stays on the market too long.
This is where machine learning comes in. With enough historical data, we can build a model that learns the relationship between house features and price. For this project, our features are:
- Size in square feet (sqft) → larger homes typically cost more.
- Number of bedrooms (bedrooms) → more bedrooms often raise value, though not always linearly.
- Number of bathrooms (bathrooms) → similar to bedrooms; an extra bathroom adds convenience and price.
- Location score (location_score) → a rating (1–10) based on neighborhood desirability, amenities, and safety.
If you don’t define the problem clearly, you may end up modeling the wrong thing. For example, predicting “house quality” instead of “house price” requires a totally different target. A well-framed problem gives us a target variable (price) and features (sqft, bedrooms, bathrooms, location_score).
Step 2: Preparing the Data
Let’s create a small sample dataset:
```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Sample dataset
data = {
    "sqft": [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450],
    "bedrooms": [3, 3, 4, 4, 2, 3, 4, 5],
    "bathrooms": [2, 2, 3, 2, 1, 2, 3, 4],
    "location_score": [8, 7, 9, 6, 5, 7, 9, 10],
    "price": [200000, 240000, 280000, 340000, 150000, 250000, 400000, 475000]
}
df = pd.DataFrame(data)

# Features and target
X = df.drop(columns=["price"])  # predictors
y = df["price"]                 # target

# Split into training (70%) and testing (30%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```
Why split data?
- Training set: Used to teach the model.
- Test set: Held back to see if the model can predict new houses it hasn’t seen.
Mistake I’ve seen:
A colleague once trained and tested on the same dataset. The model showed 99% accuracy, but when applied to real houses, the predictions were way off. Why? Because the model simply memorized the training data, a problem called overfitting. Splitting the dataset prevents this false sense of success.
Step 3: Training the Model
Now we use Linear Regression to learn the relationship between features and price.
```python
from sklearn.linear_model import LinearRegression

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

print("Feature Weights:", model.coef_)
print("Intercept:", model.intercept_)
```
How to interpret results:
- Feature weights (coefficients), one per feature (see the snippet after this list for a more readable view):
  - If the sqft weight is 120 → each additional square foot adds about $120 to the predicted price.
  - If the bedrooms weight is 15,000 → each extra bedroom increases the price by about $15,000.
- Intercept: The baseline house price when all features are zero (not meaningful in practice, but mathematically necessary).
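The raw coef_ array loses the column names, which makes it easy to misread. Here is a minimal sketch, assuming the model and X defined earlier in this project, that pairs each weight with its feature:

```python
import pandas as pd

# Pair each learned weight with its feature name for easier reading
coef_table = pd.Series(model.coef_, index=X.columns)
print(coef_table.sort_values(ascending=False))
print("Intercept:", model.intercept_)
```

On a dataset this small the exact numbers will be noisy; the point is simply to read each weight next to the feature it belongs to.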
In one housing dataset I worked with, location score dominated the coefficients. Even a small 2-bedroom home in a high-scoring location was valued more than a 4-bedroom home in a poor neighborhood. This perfectly reflected real market dynamics: location, location, location.
Step 4: Making Predictions
With the model trained, let’s predict house prices for the test set.
```python
# Predict prices for the test set
y_pred = model.predict(X_test)

print("Predicted Prices:", y_pred)
print("Actual Prices:", y_test.values)
```
What happens here?
The model applies the learned weights and intercept to unseen houses in X_test. It then outputs estimated prices.
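To convince yourself that this really is just “weights times features plus intercept,” you can reproduce the prediction by hand. A minimal sketch, assuming the model and X_test from the code above:

```python
import numpy as np

# Manually apply the learned linear equation: features · weights + intercept
manual_pred = X_test.values @ model.coef_ + model.intercept_

# This should match model.predict(X_test) up to floating-point precision
print("Manual predictions:", manual_pred)
print("model.predict:     ", model.predict(X_test))
print("Match:", np.allclose(manual_pred, model.predict(X_test)))
```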
Example:
If the actual price of a home is $255,000 and the model predicts $260,000, the error is only $5,000, or about 2%.
Business interpretation:
In real estate, being within ±5% of the true price is valuable. A $5,000 difference on a $250,000 home is acceptable. But a $50,000 error might lead to serious losses.
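To check that ±5% rule directly, you can compute the percentage error for each test house. This is a small sketch, not part of the original walkthrough, and it assumes y_test and y_pred from the steps above:

```python
import numpy as np

# Percentage error per house: |predicted - actual| / actual
pct_error = np.abs(y_pred - y_test.values) / y_test.values * 100

for actual, pred, err in zip(y_test.values, y_pred, pct_error):
    flag = "within ±5%" if err <= 5 else "needs review"
    print(f"actual ${actual:,.0f}  predicted ${pred:,.0f}  error {err:.1f}%  ({flag})")
```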
Step 5: Evaluating the Model
We don’t just want predictions, we want to know how good they are.
```python
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("RMSE:", rmse)
print("R²:", r2)
```
Metrics explained:
- RMSE (Root Mean Squared Error):
  - A typical prediction error expressed in dollars (squaring penalizes large misses more heavily than small ones).
  - If RMSE = $12,000, predictions are usually off by roughly that amount.
- R² (R-squared):
  - The share of the variation in prices that the model accounts for.
  - R² = 0.9 means the model explains 90% of the variation in house prices.
A real estate firm I consulted cared only about RMSE. “Tell me how much money we’re off by on average,” they said. R² didn’t matter to them because clients negotiate in dollars, not percentages. This taught me that choosing the right evaluation metric depends on business needs.
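If, like that client, you mainly want “how many dollars are we off by on average,” MAE (Mean Absolute Error) is the most literal answer, because it averages the absolute errors without squaring them. A quick sketch, assuming y_test and y_pred from above:

```python
from sklearn.metrics import mean_absolute_error

# MAE: average absolute dollar error, with no extra penalty for large mistakes
mae = mean_absolute_error(y_test, y_pred)
print(f"MAE: ${mae:,.0f}")
```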
Key Takeaways
From this house price prediction project, we learned:
- Linear regression is a simple but powerful tool for predicting continuous values.
- Coefficients show what really matters (location, size, etc.).
- RMSE provides a practical business measure of error.
- Always frame the problem correctly and split your data to avoid misleading results.
Project 2: Classifying Spam Emails with Logistic Regression
Step 1: Problem Framing
Email providers like Gmail, Outlook, and Yahoo process billions of emails daily. One of their biggest challenges is separating spam (unwanted promotional or malicious emails) from legitimate messages (personal or business communication).
Why is this difficult? Because the cost of mistakes is high:
- Too strict (false positives): Important emails get marked as spam, damaging user trust. Imagine your job offer or bank statement going to the spam folder.
- Too lenient (false negatives): Spam slips into the inbox, frustrating users and possibly spreading malware.
Business importance:
A strong spam filter saves time, protects users from scams, and improves trust in the platform. A weak filter risks user abandonment.
Our goal: Build a binary classification model that predicts whether an email is spam (1) or not spam (0).
Step 2: Preparing the Data
We’ll create a simplified dataset with three features:
- num_links: spam often contains many links.
- contains_offer: whether the email contains promotional keywords like “offer”, “win”, “free”.
- sender_reputation: a trust score between 0 (untrusted sender) and 1 (trusted sender).
```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Sample dataset
data = {
    "num_links": [5, 0, 8, 1, 10, 2, 0, 7],
    "contains_offer": [1, 0, 1, 0, 1, 0, 0, 1],
    "sender_reputation": [0.2, 0.9, 0.1, 0.85, 0.05, 0.8, 0.95, 0.15],
    "is_spam": [1, 0, 1, 0, 1, 0, 0, 1]
}
df = pd.DataFrame(data)

# Features and target
X = df.drop(columns=["is_spam"])
y = df["is_spam"]

# Train-test split (70% training, 30% testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```
Why split the data?
- Training set: Teaches the model what spam looks like.
- Test set: Ensures the model generalizes to new emails.
💡 Mistake I’ve seen:
Beginners often evaluate on training data and brag about 100% accuracy. That isn’t learning; it’s memorization. The model looks perfect until a new spam email arrives and slips through.
Step 3: Training Logistic Regression
Logistic regression is a simple but powerful algorithm for binary classification. It predicts probabilities between 0 and 1, and then applies a threshold (default: 0.5) to decide class labels.
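Under the hood, that probability comes from passing a weighted sum of the features through the sigmoid function. Here is a minimal, self-contained sketch of the idea; the weights and intercept below are made-up numbers for illustration, not the model’s actual parameters:

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into a probability between 0 and 1."""
    return 1 / (1 + np.exp(-z))

# Hypothetical weighted sum for one email:
# weights for num_links, contains_offer, sender_reputation, plus an intercept
z = 0.6 * 5 + 1.2 * 1 + (-3.0) * 0.2 + (-1.0)
p_spam = sigmoid(z)

print(f"P(spam) = {p_spam:.2f}")                 # probability
print("Predicted class:", int(p_spam >= 0.5))    # apply the default 0.5 threshold
```

In practice, scikit-learn estimates the weights and does this calculation for us: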
```python
from sklearn.linear_model import LogisticRegression

# Create and train the model
model = LogisticRegression()
model.fit(X_train, y_train)

print("Feature Weights:", model.coef_)
print("Intercept:", model.intercept_)
```
Interpretation of weights:
- A positive weight: increases likelihood of spam.
- A negative weight: decreases likelihood of spam.
Personal project:
In a corporate spam detection system I worked on, “sender reputation” had the strongest negative weight — confirming what we intuitively know: emails from trusted internal domains almost never land in spam.
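A common trick for reading logistic regression weights is to exponentiate them into odds ratios: exp(weight) is the factor by which the odds of spam multiply for a one-unit increase in that feature. A hedged sketch, assuming the model and X from the code above:

```python
import numpy as np
import pandas as pd

# LogisticRegression stores coefficients with shape (1, n_features) for binary problems
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=X.columns)
print(odds_ratios.sort_values(ascending=False))
```

Values above 1 push an email toward spam; values below 1 (like a high sender reputation) push it toward legitimate.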
Step 4: Making Predictions
Once trained, the model can output both probabilities and class predictions.
```python
# Predict probabilities
probs = model.predict_proba(X_test)
print("Probabilities:\n", probs)

# Predict classes
y_pred = model.predict(X_test)
print("Predictions:", y_pred)
```
- Probabilities: Example → [0.15, 0.85] means “15% not spam, 85% spam.”
- Classes: Converts probabilities into final 0 or 1 predictions using the default 0.5 threshold.
Practical trick:
Some businesses adjust the threshold depending on goals:
- Lower threshold (e.g., 0.3) → catch more spam (higher recall) but risk flagging good emails.
- Higher threshold (e.g., 0.7) → fewer false alarms (higher precision) but risk letting spam through.
This flexibility makes logistic regression highly practical.
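Applying a custom threshold takes one line once you have the probabilities. A small sketch, assuming the probs array from the prediction code above:

```python
# Column 1 of predict_proba is P(spam); compare it against a custom cutoff
threshold = 0.3   # more aggressive: catches more spam, risks more false alarms
y_pred_custom = (probs[:, 1] >= threshold).astype(int)

print("Default 0.5 threshold:", model.predict(X_test))
print(f"Custom {threshold} threshold:", y_pred_custom)
```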
Step 5: Evaluating the Model
Accuracy alone is not enough. Why? Because if 90% of emails are legitimate, a dumb model that predicts “not spam” for everything gets 90% accuracy but is useless.
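You can demonstrate that “dumb baseline” with scikit-learn’s DummyClassifier, which ignores the features and always predicts the most common class. A minimal sketch reusing the train/test split from above (our toy dataset is balanced, so the effect is far more dramatic on a realistic, 90%-legitimate inbox):

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Baseline that always predicts the majority class seen in the training data
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
dummy_pred = dummy.predict(X_test)

print("Baseline accuracy:", accuracy_score(y_test, dummy_pred))
print("Baseline recall:  ", recall_score(y_test, dummy_pred, zero_division=0))
```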
That’s why we use multiple metrics:
```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")
```
Interpretation:
- Accuracy: Overall correctness.
- Precision: Of all predicted spam, how many were truly spam? (Protects against false alarms.)
- Recall: Of all actual spam, how many did we catch? (Protects against missed spam.)
- F1-score: Balance between precision and recall.
In fraud detection, I’ve optimized for recall (catch every possible fraud case, even with false alarms). But in email spam filtering, I usually optimize for precision, because wrongly flagging an important email (a job offer, a bank notification) is more damaging than letting a little spam through.
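Whichever metric you choose to optimize, it helps to see exactly where the errors fall. A confusion matrix lays out false positives and false negatives explicitly; a short sketch assuming y_test and y_pred from above:

```python
from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(confusion_matrix(y_test, y_pred))
```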
Key points
From this spam classification project, we’ve learned:
- Logistic regression is excellent for binary problems like spam vs not spam.
- Adjusting thresholds lets businesses balance between recall and precision.
- Evaluation metrics must be tied to business goals, not just technical numbers.
Frequently Asked Questions
What will I learn in this lesson?
You’ll learn how to prepare data, train regression and classification models, make predictions, and evaluate performance using real-world metrics.
Why were house pricing and spam detection chosen as the projects?
House pricing demonstrates regression (predicting numbers), while spam detection shows classification (categorizing data). These are intuitive, relatable problems.
Which tools does this lesson use?
We use beginner-friendly tools: Pandas for data handling, Scikit-learn for model training, and NumPy for numerical calculations.
How do I know if my models are performing well?
For regression, check RMSE (typical error in dollars). For classification, look at precision, recall, and F1-score depending on your business goal.
Do I need a math or machine learning background to follow along?
No. Each step is explained in plain language with examples, making it accessible to beginners with no prior ML or math background.