What is Data Cleaning in Data Science?
Introduction
Data cleaning, a core part of data preprocessing, is the process of detecting and then correcting or removing inaccurate, incomplete, or irrelevant data from a dataset. It is a crucial step in data science and machine learning because poor-quality data leads to incorrect insights and inaccurate predictions.
💡 Fact: A commonly cited estimate is that data scientists spend up to 80% of their time cleaning and preparing data.
🛠️ Why is Data Cleaning Important?
🔹 Improves Data Accuracy: Reduces errors and inconsistencies
🔹 Enhances Model Performance: Clean data leads to better predictions
🔹 Prevents Bias: Eliminates duplicate or misleading records
🔹 Ensures Data Consistency: Standardizes formats and the handling of missing values
Example:
Imagine a company analyzing customer transactions. If the dataset contains missing prices, incorrect dates, or duplicate entries, the sales analysis will be flawed.
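The three problems in this example can each be counted before any cleaning starts. The sketch below is illustrative only: the `transactions` table, its column names, and its values are all made up, and it assumes dates are expected in YYYY-MM-DD form.

```python
import pandas as pd

# Hypothetical transactions table; column names and values are invented.
transactions = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "price": [19.99, 4.50, 4.50, None],
    "order_date": ["2024-01-05", "2024-01-06", "2024-01-06", "not a date"],
})

# Count missing prices.
missing_prices = int(transactions["price"].isna().sum())

# Count dates that cannot be parsed (errors="coerce" turns them into NaT).
parsed_dates = pd.to_datetime(
    transactions["order_date"], format="%Y-%m-%d", errors="coerce"
)
invalid_dates = int(parsed_dates.isna().sum())

# Count rows that exactly repeat an earlier row.
duplicate_rows = int(transactions.duplicated().sum())

print(missing_prices, invalid_dates, duplicate_rows)
```

Running a quick audit like this first helps decide which of the cleaning steps below a dataset actually needs.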
Key Steps in Data Cleaning
1️⃣ Handling Missing Data
✅ Techniques for handling missing values:
Drop rows with missing values (if the dataset is large and only a few rows are affected)
Fill with mean/median/mode (for numerical data)
Use forward or backward fill (for time-series data)
Example in Python:
import pandas as pd
# df is assumed to be an existing DataFrame
df = df.fillna(df.mean(numeric_only=True))  # Fill missing numeric values with each column's mean
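The other techniques listed above (dropping rows, median or mode filling, and forward fill) might look like the following. This is a minimal sketch on a hypothetical `readings` table; the column names and values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical readings; column names and values are illustrative only.
readings = pd.DataFrame({
    "temperature": [20.0, np.nan, 22.0, 100.0],
    "region": ["north", None, "north", "south"],
})

# Option 1: drop every row that has at least one missing value.
dropped = readings.dropna()

# Option 2: fill a numeric column with its median
# (less sensitive to extreme values than the mean).
median_filled = readings["temperature"].fillna(readings["temperature"].median())

# Option 3: fill a categorical column with its most frequent value (the mode).
mode_filled = readings["region"].fillna(readings["region"].mode()[0])

# Option 4: forward fill, carrying the last observed value forward.
# This only makes sense when rows are in time order.
forward_filled = readings["temperature"].ffill()
```

Which option fits depends on the data: here the median (22.0) is a more plausible fill than the mean, which the 100.0 reading pulls upward.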
2️⃣ Removing Duplicates
✅ Duplicates can skew analysis and lead to incorrect conclusions.
Example in Python:
df.drop_duplicates(inplace=True)
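Exact-row matching misses records that repeat an entity but differ in one field. A possible refinement, sketched here on a hypothetical `customers` table (the column names are assumptions), is to inspect duplicates first and then deduplicate on a key column using the `subset` and `keep` parameters.

```python
import pandas as pd

# Hypothetical customer records; column names are assumptions for illustration.
customers = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "email": ["a@example.com", "b@example.com", "b@example.com", "c@example.com"],
    "signup_date": ["2024-01-01", "2024-01-02", "2024-02-15", "2024-01-03"],
})

# Inspect: flag every row that shares a customer_id with another row.
flagged = customers[customers.duplicated(subset=["customer_id"], keep=False)]

# Remove: keep only the first record per customer_id.
deduplicated = customers.drop_duplicates(subset=["customer_id"], keep="first")

# A plain drop_duplicates() removes nothing here, because the two rows for
# customer 102 differ in signup_date.
exact_only = customers.drop_duplicates()
```

Reviewing `flagged` before dropping anything is a cheap safeguard, since which record to keep is a judgment about the data rather than something pandas can decide.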
3️⃣ Standardizing Data Formats
✅ Ensure uniform formats for:
Date formats (YYYY-MM-DD vs. MM/DD/YYYY)
Text cases (uppercase/lowercase)
Units of measurement (e.g., km vs. miles)
Example in Python:
df['date_column'] = pd.to_datetime(df['date_column']) # Standardize date format
df['name'] = df['name'].str.lower() # Convert text to lowercase
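Units of measurement can be standardized in a similar way. The sketch below assumes a hypothetical `trips` table in which a `unit` column records whether each distance is in kilometres or miles; the names and values are made up.

```python
import pandas as pd

# Hypothetical trip log mixing kilometres and miles; names are illustrative.
trips = pd.DataFrame({
    "distance": [10.0, 5.0, 3.2],
    "unit": ["km", "miles", "KM"],
    "city": ["  Lyon", "LYON ", "lyon"],
})

KM_PER_MILE = 1.609344  # exact, by definition of the international mile

# Normalize the unit labels first so "KM" and "km" are treated the same.
unit = trips["unit"].str.strip().str.lower()

# Convert rows recorded in miles; leave kilometre rows unchanged.
trips["distance_km"] = trips["distance"].where(
    unit != "miles", trips["distance"] * KM_PER_MILE
)

# Trim stray whitespace before lowercasing so "  Lyon" and "LYON " match.
trips["city"] = trips["city"].str.strip().str.lower()
```

Writing the converted values to a new column (`distance_km`) keeps the original data available for checking the conversion.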
4️⃣ Handling Outliers
✅ Outliers can distort analysis and affect ML models.
🔹 Techniques:
Remove extreme values using IQR (Interquartile Range)
Use log transformations to reduce skew in the data
Example in Python:
Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['column'] >= (Q1 - 1.5 * IQR)) & (df['column'] <= (Q3 + 1.5 * IQR))]
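The log transformation mentioned above keeps every row and compresses large values instead of removing them. A minimal sketch, assuming a hypothetical series of non-negative amounts (the numbers are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed amounts; the values are invented for illustration.
amounts = pd.Series([1, 10, 100, 1_000, 10_000], dtype="float64")

# log1p computes log(1 + x), so it is defined at x = 0 (but not for x <= -1).
log_amounts = np.log1p(amounts)

# Compare skewness before and after the transformation.
print(amounts.skew(), log_amounts.skew())

# expm1 reverses log1p, so the original scale can be recovered when needed.
restored = np.expm1(log_amounts)
```

Unlike IQR filtering, this approach changes the scale of the column, so any model output or summary built on `log_amounts` has to be interpreted on the log scale or converted back.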
5️⃣ Correcting Data Entry Errors
✅ Common issues:
Variant abbreviations and typos (e.g., "USA" vs. "U.S.A")
Inconsistent naming ("Male" vs. "M")
Incorrect spellings
Example in Python:
df['country'] = df['country'].replace({'U.S.A': 'USA', 'United States': 'USA'})
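For inconsistent naming such as "Male" vs. "M", one possible approach is to normalize case and whitespace first and then map every known variant to a single label. The `survey` table and the `canonical` mapping below are assumptions for illustration, not a complete list of variants.

```python
import pandas as pd

# Hypothetical survey answers; the labels and the mapping are assumptions.
survey = pd.DataFrame(
    {"gender": ["Male", "M", " male ", "F", "Female", "unknwn"]}
)

# Map each known variant (after trimming and lowercasing) to one canonical label.
canonical = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}
cleaned = survey["gender"].str.strip().str.lower().map(canonical)

# map() returns NaN for labels it does not recognise; list them for manual review.
unrecognised = survey.loc[cleaned.isna(), "gender"].tolist()

survey["gender"] = cleaned
```

Unlike `replace`, which leaves unmatched values untouched, `map` turns them into missing values, which makes unexpected labels easy to find.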