Machine Learning Overview
Machine Learning is a way to let a computer learn patterns from historical data.
In traditional programming, we write the rules ourselves:
if user clicks add_to_cart:
    maybe user wants to buy
In machine learning, we prepare examples and let the model learn the pattern:
past user behavior + purchase result -> model learns the pattern
For data engineers, the important idea is:
Machine learning depends on clean, well-organized data.
If the data is messy, duplicated, incomplete, or defined with the wrong meaning, the model will be unreliable too.
Python Toolset
For this beginner section, we will use:
- Python
- pandas
- scikit-learn
Install the main packages:
pip install pandas scikit-learn
These tools are enough to learn the basic workflow:
raw data -> pandas DataFrame -> feature table -> scikit-learn model -> evaluation
Example Scenario
We will use an e-commerce behavior example.
The raw data may contain events like:
- session_start
- view_item
- add_to_cart
- purchase
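As a rough illustration, raw events like these could be loaded into a pandas DataFrame with one row per event. The column names (`user_id`, `event_name`, `event_time`) and the sample values are assumptions made for this sketch, not a fixed schema.

```python
import pandas as pd

# Hypothetical raw event log: one row per event, not per user.
events = pd.DataFrame(
    {
        "user_id": ["user_001", "user_001", "user_001", "user_002"],
        "event_name": ["session_start", "view_item", "add_to_cart", "session_start"],
        "event_time": pd.to_datetime(
            ["2024-01-01 10:00", "2024-01-01 10:02", "2024-01-01 10:05", "2024-01-02 09:00"]
        ),
    }
)

# Count how often each event type appears in the log.
event_counts = events["event_name"].value_counts()
print(event_counts)
```

At this stage the data is still event-level. It has to be aggregated before a model can use it.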
The business question is:
Can we predict whether a user will purchase soon?
This is called a purchase propensity problem.
Basic Terms
Row
A row is one training example.
For this example, one row should represent one user.
one user = one row
Feature
A feature is an input column used by the model.
Examples:
- How many sessions did the user have?
- How many products did the user view?
- How many times did the user add items to cart?
- How many days since the user's last activity?
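The questions above can be turned into feature columns by aggregating events per user. The following is a minimal sketch, assuming an event table with `user_id`, `event_name`, and `event_time` columns and a reference date `as_of`. All of these names and values are illustrative.

```python
import pandas as pd

# Hypothetical event log (made-up values).
events = pd.DataFrame(
    {
        "user_id": ["user_001"] * 5 + ["user_002"] * 2,
        "event_name": [
            "session_start", "view_item", "view_item", "add_to_cart", "session_start",
            "session_start", "view_item",
        ],
        "event_time": pd.to_datetime(
            ["2024-01-01", "2024-01-01", "2024-01-01", "2024-01-01", "2024-01-08",
             "2024-01-03", "2024-01-03"]
        ),
    }
)

# Assumed reference date used to compute "days since last activity".
as_of = pd.Timestamp("2024-01-10")

# Count events per user and event type, giving one row per user.
counts = pd.crosstab(events["user_id"], events["event_name"])
counts = counts.reindex(
    columns=["session_start", "view_item", "add_to_cart"], fill_value=0
)

features = counts.rename(
    columns={
        "session_start": "session_count",
        "view_item": "view_count",
        "add_to_cart": "cart_add_count",
    }
)

# Days between the reference date and each user's most recent event.
last_seen = events.groupby("user_id")["event_time"].max()
features["days_since_last_activity"] = (as_of - last_seen).dt.days

print(features)
```

The result has one row per user, which matches the "one user = one row" rule.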
Label
A label is the answer we want the model to learn.
For this example:
label = did this user purchase?
The label can be:
1: yes, the user purchased
0: no, the user did not purchase
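One possible way to derive this label from raw events is sketched below. It assumes that any `purchase` event for a user counts as 1. A real pipeline would usually restrict this to a defined time window, which is not shown here. The table and column names are illustrative.

```python
import pandas as pd

# Hypothetical event log (made-up values).
events = pd.DataFrame(
    {
        "user_id": ["user_001", "user_001", "user_002", "user_003", "user_003"],
        "event_name": ["view_item", "purchase", "view_item", "add_to_cart", "purchase"],
    }
)

# Mark purchase events, then check per user whether any event is a purchase.
is_purchase = events["event_name"].eq("purchase")
label = is_purchase.groupby(events["user_id"]).any().astype(int).rename("label")

print(label)
```

Here `user_001` and `user_003` get label 1, and `user_002` gets label 0.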
Model
A model is the result of training: a function, learned from the examples, that maps features to a prediction.
After training, the model can receive new user behavior and output a prediction.
Prediction
A prediction is the model's guess for a new example. In our case it is the probability that the user will purchase.
For example:
user_001 -> 0.82 probability of purchase
user_002 -> 0.13 probability of purchase
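A small sketch of how such probabilities could be produced with scikit-learn is shown below. It uses `LogisticRegression` as one possible classifier. The training values and feature names are made up, so the printed numbers will not match the 0.82 and 0.13 above.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up training data: one row per user (values are illustrative only).
train = pd.DataFrame(
    {
        "session_count": [1, 2, 1, 3, 5, 6, 4, 2],
        "cart_add_count": [0, 0, 0, 1, 2, 3, 2, 0],
        "label": [0, 0, 0, 1, 1, 1, 1, 0],
    }
)
feature_cols = ["session_count", "cart_add_count"]

model = LogisticRegression()
model.fit(train[feature_cols], train["label"])

# New users the model has not seen: an active one and a mostly inactive one.
new_users = pd.DataFrame(
    {"session_count": [6, 1], "cart_add_count": [3, 0]},
    index=["user_001", "user_002"],
)

# predict_proba returns one column per class; column 1 is P(label == 1).
purchase_probability = model.predict_proba(new_users[feature_cols])[:, 1]

for user_id, p in zip(new_users.index, purchase_probability):
    print(f"{user_id} -> {p:.2f} probability of purchase")
```

The more active user should receive the higher probability, because in the training data more sessions and cart additions go together with purchases.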
Machine Learning Workflow
In this basic course, we will focus on:
- Data preparation
- Feature engineering
- Model training
- Model evaluation
- Prediction
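The five steps above can be sketched end to end as follows. This is a toy example under stated assumptions: the user table is generated from a simple made-up rule, and `LogisticRegression` with accuracy is only one of many possible model and metric choices.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Data preparation: made-up user table with one duplicated row removed.
sessions = list(range(1, 21))
users = pd.DataFrame(
    {
        "user_id": [f"user_{i:03d}" for i in sessions],
        "session_count": sessions,
        "cart_add_count": [s // 4 for s in sessions],
        "label": [1 if s > 10 else 0 for s in sessions],
    }
)
users = pd.concat([users, users.iloc[[0]]], ignore_index=True)  # simulate a duplicate
users = users.drop_duplicates(subset="user_id")

# 2. Feature engineering: derive a ratio feature, then pick inputs and label.
users["cart_adds_per_session"] = users["cart_add_count"] / users["session_count"]
X = users[["session_count", "cart_add_count", "cart_adds_per_session"]]
y = users["label"]

# 3. Model training on the training part only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
model = LogisticRegression()
model.fit(X_train, y_train)

# 4. Model evaluation on rows the model has not seen.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy: {accuracy:.2f}")

# 5. Prediction: purchase probability for each held-out user.
purchase_probability = model.predict_proba(X_test)[:, 1]
print(purchase_probability.round(2))
```

Holding out a test set matters because accuracy measured on the training rows would overstate how well the model works on new users.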
Common Machine Learning Types
Classification
Classification predicts a category.
Examples:
- Will the user purchase? yes or no
- Is this transaction fraud? yes or no
- Is this email spam? spam or not spam
Our purchase prediction example is a classification problem.
Regression
Regression predicts a number.
Examples:
- How much revenue will we get tomorrow?
- How long will delivery take?
- What will the house price be?
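As a hedged illustration of regression, the sketch below fits scikit-learn's `LinearRegression` to a made-up delivery table. The columns `distance_km` and `delivery_minutes` and their values are assumptions chosen so that the data lies exactly on the line minutes = 2 * distance + 10.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data that follows delivery_minutes = 2 * distance_km + 10.
deliveries = pd.DataFrame(
    {"distance_km": [1, 2, 3, 4], "delivery_minutes": [12, 14, 16, 18]}
)

model = LinearRegression()
model.fit(deliveries[["distance_km"]], deliveries["delivery_minutes"])

# Predict the delivery time for a new 5 km delivery.
new_delivery = pd.DataFrame({"distance_km": [5]})
predicted_minutes = model.predict(new_delivery)[0]
print(round(predicted_minutes, 1))
```

Because the toy data is perfectly linear, the prediction for 5 km is about 20 minutes. The output is a number, not a category, which is what separates regression from classification.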
Clustering
Clustering groups similar data together.
Examples:
- Group users by behavior
- Group products by buying pattern
- Group articles by topic
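Grouping users by behavior could look like the sketch below, which uses scikit-learn's `KMeans` as one possible clustering algorithm. The data is made up, and choosing two clusters is an assumption for this example. Note that no label column is needed.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up behavior data: three light users and three heavy users.
users = pd.DataFrame(
    {
        "session_count": [1, 2, 1, 20, 22, 19],
        "view_count": [3, 4, 2, 60, 55, 65],
    },
    index=["user_001", "user_002", "user_003", "user_004", "user_005", "user_006"],
)

# Ask for two groups; random_state keeps the run repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
users["cluster"] = kmeans.fit_predict(users[["session_count", "view_count"]])

print(users)
```

The cluster numbers themselves are arbitrary. What matters is which users end up in the same group.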
Why Data Engineers Should Learn This
Data engineers may not train models every day, but they often build the data foundation for machine learning.
Common data engineering responsibilities include:
- Collect raw data from systems
- Clean and transform data
- Build reliable feature tables
- Schedule pipelines
- Monitor data quality
- Deliver data to analysts, data scientists, or ML systems
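As one small example of monitoring data quality, a feature table can be checked for duplicates, missing values, and invalid labels before it is delivered. This is a minimal sketch with pandas. The table, its column names, and the choice of checks are assumptions for illustration.

```python
import pandas as pd

# Hypothetical feature table with a duplicated user and a missing value.
features = pd.DataFrame(
    {
        "user_id": ["user_001", "user_002", "user_002", "user_003"],
        "session_count": [3, 5, 5, None],
        "label": [1, 0, 0, 1],
    }
)

# Each user should appear exactly once ("one user = one row").
duplicate_rows = int(features.duplicated(subset="user_id").sum())

# Missing values per column.
missing_per_column = features.isna().sum()

# Labels must be 0 or 1.
invalid_labels = int((~features["label"].isin([0, 1])).sum())

print(f"duplicate user rows: {duplicate_rows}")
print(f"missing session_count values: {missing_per_column['session_count']}")
print(f"invalid labels: {invalid_labels}")
```

In a scheduled pipeline, checks like these would typically fail the run or raise an alert instead of only printing.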
Machine learning projects often fail because the data pipeline is weak, not because the model algorithm is weak.