Null Value
Create sample DataFrame
Create sample CSV file
employees.csv
employee_id,name,department,salary,email
E001,Alice Chen,Engineering,95000,[email protected]
E002,Bob Smith,Marketing,72000,
E003,Carol White,,68000,[email protected]
E004,,,55000,
E005,Eve Johnson,Sales,,[email protected]
E006,,,,
Read CSV as DataFrame
import pandas as pd
df = pd.read_csv("employees.csv")
print(df)
employee_id name department salary email
0 E001 Alice Chen Engineering 95000.0 [email protected]
1 E002 Bob Smith Marketing 72000.0 NaN
2 E003 Carol White NaN 68000.0 [email protected]
3 E004 NaN NaN 55000.0 NaN
4 E005 Eve Johnson Sales NaN [email protected]
5 E006 NaN NaN NaN NaN
Create a DataFrame with None in Python
import pandas as pd
data = {
"employee_id": ["E001", "E002", "E003"],
"name": ["Alice Chen", "Bob Smith", None],
"department": ["Engineering", None, "HR"],
"salary": [95000, 72000, None],
}
df_none = pd.DataFrame(data)
print(df_none)
employee_id name department salary
0 E001 Alice Chen Engineering 95000.0
1 E002 Bob Smith None 72000.0
2 E003 None HR NaN
See what differs between the two DataFrames
print(df.dtypes)
print()
print(df_none.dtypes)

# df (read from CSV): empty cells become NaN
employee_id object
name object
department object
salary float64 # int becomes float because NaN is a float
email object
dtype: object
# df_none (created in Python) — None in numeric column also becomes NaN
employee_id object
name object
department object
salary float64 # same: int becomes float due to None -> NaN
dtype: object

Note: In both cases, missing numeric values are stored as NaN (a float), which forces the column dtype to float64. Missing values in object (string) columns are stored as NaN when read from CSV, or may stay as None when created in Python.
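If the float64 cast is unwanted, pandas also offers nullable integer dtypes. A small sketch (not part of the example above) using the "Int64" dtype:

```python
import pandas as pd

# The nullable "Int64" dtype (capital I) stores missing values as pd.NA
# without casting the column to float
s = pd.Series([95000, 72000, None], dtype="Int64")
print(s.dtype)  # Int64
print(s)
# 0    95000
# 1    72000
# 2     <NA>
# dtype: Int64
```

read_csv can produce this dtype as well, e.g. with dtype={"salary": "Int64"}.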
How to create NaN
import numpy as np
import pandas as pd
nan1 = float("nan") # built-in Python
nan2 = np.nan # NumPy NaN
nan3 = pd.NA # pandas NA (newer, works with nullable dtypes)
none = None # Python None (treated as NaN in most pandas contexts)
print(nan1, nan2, nan3, none)
# nan nan <NA> None
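All four variants register as missing with pd.isna(), which is the reliable way to test for them (equality checks fail because NaN is not equal to itself). A quick sketch:

```python
import numpy as np
import pandas as pd

# pd.isna() recognizes every missing-value flavor
for value in [float("nan"), np.nan, pd.NA, None]:
    print(pd.isna(value))  # True each time
```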
Calculate with NaN
import numpy as np
print(1 + np.nan) # nan — any arithmetic with NaN returns NaN
print(np.nan == np.nan) # False — NaN is not equal to itself
print(np.nan > 0) # False
print(np.nan < 0) # False
# pandas aggregations skip NaN by default
import pandas as pd
s = pd.Series([95000, 72000, None, 68000, None])
print(s.sum()) # 235000.0 (NaN skipped)
print(s.mean()) # 78333.333... (NaN skipped)
print(s.count()) # 3 (only non-NaN values counted)
nan
False
False
False
235000.0
78333.333333333336
3
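If you want NaN to propagate into the aggregate instead of being skipped, most pandas reductions accept skipna=False. A small sketch:

```python
import pandas as pd

s = pd.Series([95000, 72000, None, 68000, None])
# skipna=False makes a single NaN poison the whole aggregate
print(s.sum(skipna=False))   # nan
print(s.mean(skipna=False))  # nan
```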
Why use NaN
Using NaN instead of a placeholder like 0 or "unknown" keeps calculations accurate and signals that the data is genuinely missing.
import pandas as pd
# Bad practice: filling missing salary with 0
df_bad = pd.DataFrame({
"name": ["Alice Chen", "Bob Smith", "Carol White"],
"salary": [95000, 0, 68000], # 0 means "missing", but distorts stats
})
print("Mean salary (with 0):", df_bad["salary"].mean()) # 54333.33 — wrong!
# Good practice: use NaN for missing salary
df_good = pd.DataFrame({
"name": ["Alice Chen", "Bob Smith", "Carol White"],
"salary": [95000, None, 68000], # NaN for missing
})
print("Mean salary (with NaN):", df_good["salary"].mean()) # 81500.0 — correct
Mean salary (with 0): 54333.333333333336
Mean salary (with NaN): 81500.0
# NaN allows pandas to distinguish "no data" from "zero"
# For example, counting actual responses in a survey
survey = pd.DataFrame({
"employee_id": ["E001", "E002", "E003", "E004", "E005"],
"satisfaction_score": [4, 5, None, 3, None], # None = did not respond
})
print("Responses received:", survey["satisfaction_score"].count()) # 3
print("Average score: ", survey["satisfaction_score"].mean()) # 4.0
Responses received: 3
Average score: 4.0
# None in a numeric column becomes NaN; arithmetic still works
s = pd.Series([95000, None, 68000])
print(s + 5000)
0 100000.0
1 NaN
2 73000.0
dtype: float64
# But NaN propagates: any row with a missing operand yields NaN
df = pd.DataFrame({
"base_salary": [95000, 72000, None],
"bonus": [10000, None, 5000],
})
df["total_compensation"] = df["base_salary"] + df["bonus"]
print(df)
base_salary bonus total_compensation
0 95000.0 10000.0 105000.0
1 72000.0 NaN NaN
2 NaN 5000.0 NaN
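If a missing bonus actually means "no bonus", make that assumption explicit by filling it before adding. A sketch (treating missing bonus as 0 is the assumption here, not something the data guarantees):

```python
import pandas as pd

df = pd.DataFrame({
    "base_salary": [95000, 72000, None],
    "bonus": [10000, None, 5000],
})
# Treat a missing bonus as 0, but leave a missing base_salary as NaN
df["total_compensation"] = df["base_salary"] + df["bonus"].fillna(0)
print(df)
#    base_salary    bonus  total_compensation
# 0      95000.0  10000.0            105000.0
# 1      72000.0      0.0             72000.0
# 2          NaN   5000.0                 NaN
```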
Deal with NaN
Create sample DataFrame first
import pandas as pd
data = {
"employee_id": ["E001", "E002", "E003", "E004", "E005", "E006"],
"name": ["Alice Chen", "Bob Smith", "Carol White", None, "Eve Johnson", None],
"department": ["Engineering", "Marketing", None, None, "Sales", None],
"salary": [95000, 72000, 68000, 55000, None, None],
"email": [
"[email protected]",
None,
"[email protected]",
None,
"[email protected]",
None,
],
}
df = pd.DataFrame(data)
print(df)
employee_id name department salary email
0 E001 Alice Chen Engineering 95000.0 [email protected]
1 E002 Bob Smith Marketing 72000.0 NaN
2 E003 Carol White NaN 68000.0 [email protected]
3 E004 None None 55000.0 None
4 E005 Eve Johnson Sales NaN [email protected]
5 E006 None None NaN None
Check if values are NaN
print(df.isnull())
employee_id name department salary email
0 False False False False False
1 False False False False True
2 False False True False False
3 False True True False True
4 False False False True False
5 False True True True True

# Count missing values per column
print(df.isnull().sum())
employee_id    0
name           2
department     3
salary         2
email          3
dtype: int64
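To flag entire rows instead of columns, combine isnull() with any(axis=1). A sketch that rebuilds the same DataFrame so it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    "employee_id": ["E001", "E002", "E003", "E004", "E005", "E006"],
    "name": ["Alice Chen", "Bob Smith", "Carol White", None, "Eve Johnson", None],
    "department": ["Engineering", "Marketing", None, None, "Sales", None],
    "salary": [95000, 72000, 68000, 55000, None, None],
    "email": ["[email protected]", None, "[email protected]", None, "[email protected]", None],
})
# any(axis=1) is True for rows containing at least one missing value
incomplete = df.isnull().any(axis=1)
print(incomplete.sum())  # 5
print(df.loc[incomplete, "employee_id"].tolist())
# ['E002', 'E003', 'E004', 'E005', 'E006']
```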
Check if values are not NaN
print(df.notnull())
employee_id name department salary email
0 True True True True True
1 True True True True False
2 True True False True True
3 True False False True False
4 True True True False True
5 True False False False False

# Filter rows where salary is not NaN
print(df[df["salary"].notnull()])
employee_id name department salary email
0 E001 Alice Chen Engineering 95000.0 [email protected]
1 E002 Bob Smith Marketing 72000.0 NaN
2 E003 Carol White NaN 68000.0 [email protected]
3 E004 None None 55000.0 None
Drop NaN
# Drop rows that have ANY NaN
print(df.dropna())
employee_id name department salary email
0 E001 Alice Chen Engineering 95000.0 [email protected]

# Drop rows only if ALL values in the row are NaN
print(df.dropna(how="all"))
# Drop rows where specific columns have NaN
print(df.dropna(subset=["name", "salary"]))

# dropna(how="all") keeps rows that have at least one non-NaN value
employee_id name department salary email
0 E001 Alice Chen Engineering 95000.0 [email protected]
1 E002 Bob Smith Marketing 72000.0 NaN
2 E003 Carol White NaN 68000.0 [email protected]
3 E004 None None 55000.0 None
4 E005 Eve Johnson Sales NaN [email protected]
5 E006 None None NaN None
# dropna(subset=["name", "salary"]) — drops rows missing name OR salary
employee_id name department salary email
0 E001 Alice Chen Engineering 95000.0 [email protected]
1 E002 Bob Smith Marketing 72000.0 NaN
2 E003 Carol White NaN 68000.0 [email protected]
# Drop rows that have fewer than `thresh` non-NaN values
# thresh=4 means keep only rows with at least 4 non-NaN values (out of 5 columns)
print(df.dropna(thresh=4))

# employee_id is always filled; count non-NaN values per row (None counts as NaN):
# E001: 5 non-NaN ✓
# E002: 4 non-NaN ✓ (email is NaN)
# E003: 4 non-NaN ✓ (department is NaN)
# E004: 2 non-NaN ✗ (name, department, email are NaN)
# E005: 4 non-NaN ✓ (salary is NaN)
# E006: 1 non-NaN ✗ (only employee_id is filled)
employee_id name department salary email
0 E001 Alice Chen Engineering 95000.0 [email protected]
1 E002 Bob Smith Marketing 72000.0 NaN
2 E003 Carol White NaN 68000.0 [email protected]
4 E005 Eve Johnson Sales NaN [email protected]
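dropna can also work column-wise via axis=1; on this data only employee_id is fully populated, so every other column is dropped. A sketch (rebuilding the DataFrame so it runs standalone):

```python
import pandas as pd

df = pd.DataFrame({
    "employee_id": ["E001", "E002", "E003", "E004", "E005", "E006"],
    "name": ["Alice Chen", "Bob Smith", "Carol White", None, "Eve Johnson", None],
    "department": ["Engineering", "Marketing", None, None, "Sales", None],
    "salary": [95000, 72000, 68000, 55000, None, None],
    "email": ["[email protected]", None, "[email protected]", None, "[email protected]", None],
})
# axis=1 drops any COLUMN containing NaN; only employee_id survives here
print(df.dropna(axis=1))
```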
Fill NaN
# Fill all NaN with a fixed value
print(df.fillna("N/A"))
employee_id name department salary email
0 E001 Alice Chen Engineering 95000.0 [email protected]
1 E002 Bob Smith Marketing 72000.0 N/A
2 E003 Carol White N/A 68000.0 [email protected]
3 E004 N/A N/A 55000.0 N/A
4 E005 Eve Johnson Sales N/A [email protected]
5 E006 N/A N/A N/A N/A

# Fill NaN per column with different values
fill_values = {
"name": "Unknown",
"department": "Unassigned",
"salary": df["salary"].median(), # fill missing salary with median
"email": "[email protected]",
}
print(df.fillna(fill_values))
employee_id name department salary email
0 E001 Alice Chen Engineering 95000.0 [email protected]
1 E002 Bob Smith Marketing 72000.0 [email protected]
2 E003 Carol White Unassigned 68000.0 [email protected]
3 E004 Unknown Unassigned 55000.0 [email protected]
4 E005 Eve Johnson Sales 70000.0 [email protected]
5 E006 Unknown Unassigned 70000.0 [email protected]
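For ordered data, filling from neighboring values is often better than a constant; ffill() and bfill() are standard pandas methods (the price series below is made-up illustration data):

```python
import pandas as pd

# Daily prices with gaps (made-up data)
prices = pd.Series([100.0, None, None, 104.0, None])
# ffill() carries the last known value forward; bfill() pulls the next one back
print(prices.ffill().tolist())  # [100.0, 100.0, 100.0, 104.0, 104.0]
print(prices.bfill().tolist())  # [100.0, 104.0, 104.0, 104.0, nan]
```

Note the trailing NaN in the bfill() result: there is no later value to pull back from.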