Data and Problem Framing

I often tell my students:

"The single most important part of a machine learning project is not the model. It’s the problem you define and the data you prepare."

When I first started with ML, I made the rookie mistake of rushing into building a model without really understanding the problem or preparing the data. The results looked good in my test environment, but when I tried them in the real world… they failed. Miserably.

In this lesson, I want to make sure you never fall into that trap.

We’ll take a real-world problem, predicting daily coffee sales for a small café, and use it to learn three key skills:

  1. How to define an ML problem
  2. How to understand datasets and features
  3. How to preprocess data (cleaning, normalization, encoding)

By the end, you’ll know exactly how to take an idea and turn it into a well-structured, ready-to-model dataset.

Project: Predicting Coffee Sales

Imagine you own a small café. Every day, customers come in for coffee. Some days you sell 150 cups, other days 300. It varies based on weather, weekends, holidays, promotions, and even events happening nearby.

Your business question is:

"How many cups of coffee will I sell tomorrow?"

Why is this important?

  • If you prepare too little, you run out of stock and lose sales.
  • If you prepare too much, you waste coffee and money.
  • If you can predict tomorrow’s sales, you can schedule staff and plan inventory efficiently.

This is the exact kind of problem machine learning is great at solving.

How to Define an ML Problem

Before you even look at the data, you must turn your business question into an ML problem.

Step 1: Identify the target

The target is what you want your model to predict.

In our café example:

  • Target: coffees_sold (the number of cups sold tomorrow)

Step 2: Decide the problem type

Machine learning problems usually fall into two categories:

  • Regression: Predicting a number (like coffee sales or house prices)
  • Classification: Predicting a category (like “spam” or “not spam”)

Since coffee sales are a number, this is a regression problem.

Step 3: Think about useful information

The pieces of information your model uses to make predictions are called features. Think of them as clues.

For the café, possible features include:

  • Weather: average temperature, whether it’s raining
  • Day info: weekend or weekday, holiday or not
  • Marketing: whether a promotional email was sent
  • Foot traffic: how many people passed the café that day

Step 4: Decide how to measure success

The metric is how we’ll judge if our model is good.

For coffee sales:

  • Mean Absolute Error (MAE) is a good choice. It tells us how many cups, on average, we’re off by. If MAE = 8, we’re wrong by 8 cups on average.
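
To make the metric concrete, here is a tiny illustration with made-up numbers (not real café data), using scikit-learn's mean_absolute_error:

python
from sklearn.metrics import mean_absolute_error

# Made-up numbers, purely for illustration
actual_sales    = [210, 235, 190]   # cups we actually sold
predicted_sales = [200, 240, 200]   # cups a model guessed

# Average of |210-200|, |235-240|, |190-200| = (10 + 5 + 10) / 3
print(mean_absolute_error(actual_sales, predicted_sales))  # about 8.33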

Code Example:

python
# The target (what we want to predict)
TARGET = "coffees_sold"

# Features (the clues that help us predict the target)
FEATURES = [
    "avg_temp_c",       # Average temperature
    "is_raining",       # 1 if raining, 0 if not
    "foot_traffic",     # Number of people passing by
    "is_weekend",       # 1 if weekend, 0 if weekday
    "is_holiday",       # 1 if holiday, 0 if not
    "email_campaign",   # 1 if we sent a promotional email, 0 if not
    "discount_active"   # 1 if discount is active, 0 if not
]

# Metric for success
METRIC = "MAE"  # Mean Absolute Error

print("Target:", TARGET)
print("Features:", FEATURES)
print("Metric:", METRIC)

Explanation:

  • We’re just making a list of what we’re predicting (target) and what we’ll use to predict it (features).
  • The print commands simply show these lists on the screen so we can double-check them.

Output:

Target: coffees_sold
Features: ['avg_temp_c', 'is_raining', 'foot_traffic', 'is_weekend', 'is_holiday', 'email_campaign', 'discount_active']
Metric: MAE

Skipping this step leads to chaos. If you don’t clearly write down your target, features, and metric, you’ll end up changing the goal halfway through without realizing it.

Datasets and Features

Now that we know what we’re predicting and what clues we’ll use, let’s look at our dataset.

Think of a dataset as a table:

  • Rows: individual examples (in our case, each day of sales)
  • Columns: different pieces of information (features and target)

Code Example: Creating a small café dataset

python
import pandas as pd  # Pandas helps us work with tables

# Create 7 days of example data
data = {
    "date": pd.to_datetime([
        "2025-07-01","2025-07-02","2025-07-03",
        "2025-07-04","2025-07-05","2025-07-06","2025-07-07"
    ]),
    "avg_temp_c": [31, 33, 34, 36, 38, 37, 32],
    "is_raining": [0, 0, 1, 0, 0, 1, 0],
    "foot_traffic": [420, 460, 380, 510, 620, 300, 450],
    "email_campaign": [0, 1, 0, 0, 0, 0, 1],
    "discount_active": [0, 0, 0, 1, 1, 0, 0],
    "is_holiday": [0, 0, 0, 1, 1, 0, 0],
    "coffees_sold": [210, 235, 190, 280, 320, 170, 240]
}

df = pd.DataFrame(data)
df

Explanation:

  • pandas is a tool for working with datasets in Python.
  • pd.to_datetime turns text into date format so we can work with dates later.
  • Each list inside data is a column.
  • pd.DataFrame(data) turns this into a table.
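
If you want to sanity-check the table before moving on (an optional step, not part of the walkthrough above), pandas has two handy helpers:

python
# Column names, data types, and non-missing counts
df.info()

# Basic statistics (mean, min, max, ...) for the numeric columns
df.describe()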

Add more useful features from dates

We can create new features from the date column.

python
# Extract day name (e.g., Monday)
df["day_of_week"] = df["date"].dt.day_name()

# Is it a weekend?
df["is_weekend"] = df["day_of_week"].isin(["Saturday","Sunday"]).astype(int)

# Which month?
df["month"] = df["date"].dt.month
df

Sometimes, sales patterns repeat on weekends or holidays. Adding these features gives our model more context.
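
One quick way to check whether a feature like is_weekend actually carries signal is to compare average sales per group. With only seven days of example data the numbers mean little, but the idea looks like this:

python
# Average cups sold on weekdays (0) vs. weekends (1)
df.groupby("is_weekend")["coffees_sold"].mean()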

Never mix up features (inputs) with the target (output). If you accidentally include the target in your features, the model will “cheat” during training — looking great in tests but failing in reality.
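
A cheap safety net, assuming the TARGET and FEATURES constants we defined earlier, is to check the lists and build the inputs and output separately:

python
# The target must never appear in the feature list
assert TARGET not in FEATURES, "Target leaked into the features!"

X = df[FEATURES]   # inputs (the clues)
y = df[TARGET]     # output (what we predict)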

Data Preprocessing Basics

Now we prepare the data for modeling.

Cleaning Data

Real data often has problems:

  • Missing values
  • Duplicates
  • Wrong formats

Let’s simulate some issues and fix them.

Code: Making data messy

python
# Copy the dataset
df_dirty = df.copy()

# Add missing values
df_dirty.loc[2, "avg_temp_c"] = None
df_dirty.loc[4, "day_of_week"] = None

# Add a duplicate row
df_dirty = pd.concat([df_dirty, df_dirty.iloc[[0]]], ignore_index=True)
df_dirty

Step 1: Remove duplicates

python
# Keep only one row per date, then renumber the index
df_clean = df_dirty.drop_duplicates(subset=["date"]).reset_index(drop=True)

Step 2: Fill missing values

python
from sklearn.impute import SimpleImputer

numeric_cols = ["avg_temp_c", "foot_traffic"]
categorical_cols = ["is_raining", "email_campaign", "discount_active",
                    "is_holiday", "day_of_week", "is_weekend", "month"]

# Fill numeric with median
num_imputer = SimpleImputer(strategy="median")
df_clean[numeric_cols] = num_imputer.fit_transform(df_clean[numeric_cols])

# Fill categorical with most common value
cat_imputer = SimpleImputer(strategy="most_frequent")
df_clean[categorical_cols] = cat_imputer.fit_transform(df_clean[categorical_cols])

A common mistake is deleting rows with missing data without thinking; you might lose valuable information that filling the gaps would have kept.
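
To see what careless deletion would cost on our tiny example, compare the row counts before and after dropping every row that has any missing value:

python
print("Rows before:", len(df_dirty))                    # 8 (7 days + 1 duplicate)
print("Rows after dropna():", len(df_dirty.dropna()))   # 6: each gap costs a whole day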

Normalization (Scaling numbers)

If one feature is measured in the hundreds and another sits between 0 and 1, many models (especially those based on distances or gradient descent) let the larger-scale feature dominate. Scaling puts all numeric features on a comparable range.

Code: Scaling numeric features

python
from sklearn.preprocessing import MinMaxScaler

# Rescale each numeric column to the 0-1 range: (x - min) / (max - min)
scaler = MinMaxScaler()
df_clean[["avg_temp_c", "foot_traffic"]] = scaler.fit_transform(df_clean[["avg_temp_c", "foot_traffic"]])
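
Since MinMaxScaler maps each column's minimum to 0 and maximum to 1, a quick check confirms the transform worked:

python
# Both columns should now have min 0 and max 1
print(df_clean[["avg_temp_c", "foot_traffic"]].min())
print(df_clean[["avg_temp_c", "foot_traffic"]].max())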

Encoding (Turning text into numbers)

ML models can’t work directly with text categories.

Code: One-hot encoding

python
# Create one 0/1 column per day name; drop_first removes one redundant column
df_encoded = pd.get_dummies(df_clean, columns=["day_of_week"], drop_first=True)
df_encoded

A common mistake is assigning numbers to categories directly (like Monday=1, Tuesday=2); the model will assume there is an order, which isn’t true.
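
To see the difference concretely, here is a small side-by-side sketch; the get_dummies call above already does the right thing, so this is only for illustration:

python
# What NOT to do: a hand-made mapping implies Monday < Tuesday < ... < Sunday
bad_mapping = {"Monday": 1, "Tuesday": 2, "Wednesday": 3, "Thursday": 4,
               "Friday": 5, "Saturday": 6, "Sunday": 7}
print(df_clean["day_of_week"].map(bad_mapping))

# One-hot encoding gives each day its own 0/1 column, with no artificial order
print(pd.get_dummies(df_clean["day_of_week"], prefix="day"))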

From our work so far:

  • Framing the problem kept us focused.
  • Understanding the dataset helped us choose the right features.
  • Preprocessing gave us a clean, usable dataset.

Now that our data is ready, the next module will cover training a model to actually make predictions.

Exercises

  • Add a new feature is_hot_day (1 if avg_temp_c > 35, else 0).
  • Simulate missing values in foot_traffic and fill them with the median.
  • One-hot encode the month column.
  • Explain in your own words the difference between a feature and a target.

Frequently Asked Questions

What does this module cover?
This module teaches you how to clearly define a machine learning problem, identify the target and features, and prepare your dataset for modeling through practical, beginner-friendly steps.

Do I need any prior machine learning or programming experience?
No. The module is written for complete beginners with no technical background. All terms and code are explained in plain language.

Why does problem framing matter so much?
Problem framing ensures you are solving the right problem with the right data. Without it, even the most advanced algorithms can fail in real-world situations.

What example is used to teach the concepts?
The module uses a neighborhood café’s coffee sales prediction problem to explain each concept, making it relatable and easy to understand.

Which tools and libraries are used?
We use beginner-friendly Python libraries like Pandas, Scikit-learn, and basic Python commands to explore and prepare data.