STEP 1. Find the Dirt.
Start data cleaning by determining what is wrong with your data.
Look for the following:
Are there rows with empty values? Entire columns with no data? Which data is missing and why?
How is the data distributed? Remember, visualizations are your friends. Plot outliers. Check distributions to see which groups or ranges are more heavily represented in your dataset (a minimal exploration sketch follows this list).
Keep an eye out for the weird: are there impossible values? Like “date of birth: male”, “address: -1234”.
Is your data consistent? Why are the same product names sometimes written in uppercase and other times in camelCase?
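A minimal exploration sketch for this first pass, assuming the data is already loaded into a pandas DataFrame called df (the 'price' column is a placeholder, not from the original text):
import pandas as pd
import matplotlib.pyplot as plt
# count missing values per column
print(df.isnull().sum())
# summary statistics help surface impossible values and skewed ranges
print(df.describe(include='all'))
# distribution of a numeric column (bin count is arbitrary)
df['price'].plot(kind='hist', bins=50)
plt.show()
# box plot makes outliers easy to spot
df['price'].plot(kind='box')
plt.show()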
STEP 2. Scrub the Dirt.
Missing Data
Outliers
Contaminated Data
Inconsistent Data: You have to expect inconsistency in your data.
Especially when there is a higher possibility of human error (e.g. when salespeople enter the product info on proforma invoices manually).
The best way to spot inconsistent representations of the same elements in your database is to visualize them.
Plot bar charts per product category.
Do a count of rows by category if this is easier.
When you spot the inconsistency, standardize all elements into the same format.
Humans might understand that ‘apples’ is the same as ‘Apples’ (capitalization) which is the same as ‘appels’ (misspelling), but computers think those three refer to three different things altogether.
Lowercasing as default and correcting typos are your friends here (a short sketch follows this list).
Invalid Data
Duplicate Data
Data Type Issues
Structural Errors
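A minimal standardization sketch, assuming a DataFrame df with a 'product' column; the typo mapping is illustrative only:
# count rows per product name to surface inconsistent spellings
print(df['product'].value_counts())
# lowercase and strip whitespace as a default
df['product'] = df['product'].str.lower().str.strip()
# correct known misspellings with an explicit mapping (hypothetical values)
typo_map = {'appels': 'apples'}
df['product'] = df['product'].replace(typo_map)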
The majority of data cleaning is running reusable scripts, which perform the same sequence of actions. For example: 1) lowercase all strings, 2) remove whitespace, 3) break down strings into words.
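Such a script might look like the sketch below; the function name and columns are mine, not from the original text:
def clean_text_column(series):
    # 1) lowercase all strings
    series = series.str.lower()
    # 2) remove leading/trailing whitespace
    series = series.str.strip()
    # 3) break down strings into words
    return series.str.split()

# hypothetical usage on a free-text column
df['description_words'] = clean_text_column(df['description'])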
Problem discovery. Use any visualization tools that allow you to quickly visualize missing values and different data distributions.
Identify the problematic data
Clean the data
Remove, encode, fill in any missing data
Remove outliers or analyze them separately (see the sketch after this checklist)
Purge contaminated data and correct leaking pipelines
Standardize inconsistent data
Check if your data makes sense (is valid)
Deduplicate multiple records of the same data
Foresee and prevent type issues (string issues, DateTime issues)
Remove engineering errors (aka structural errors)
Rinse and repeat
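One common way to flag outliers for removal or separate analysis (my addition, not prescribed above) is the interquartile-range rule. A minimal sketch, assuming a numeric 'price' column in df:
# interquartile range (IQR) rule: flag points far outside the middle 50% of values
q1 = df['price'].quantile(0.25)
q3 = df['price'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df['price'] < lower) | (df['price'] > upper)]
print(len(outliers), "potential outliers")
# either drop them or set them aside for separate analysis
df_no_outliers = df[(df['price'] >= lower) & (df['price'] <= upper)]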
HANDLING MISSING VALUES
The first thing I do when I get a new dataset is take a look at some of it. This lets me see that it all read in correctly and gives me an idea of what's going on with the data. In this case, I'm looking to see if there are any missing values, which will be represented as NaN or None.
nfl_data.sample(5)
Ok, now we know that we do have some missing values. Let's see how many we have in each column.
# get the number of missing data points per column
missing_values_count = nfl_data.isnull().sum()
# look at the # of missing points in the first ten columns
missing_values_count[0:10]
That seems like a lot! It might be helpful to see what percentage of the values in our dataset were missing to give us a better sense of the scale of this problem:
# how many total missing values do we have?
total_cells = np.prod(nfl_data.shape)
total_missing = missing_values_count.sum()
# percent of data that is missing
(total_missing/total_cells) * 100
Wow, almost a quarter of the cells in this dataset are empty! In the next step, we're going to take a closer look at some of the columns with missing values and try to figure out what might be going on with them.
One of the most important questions you can ask yourself to help figure this out is this:
Is this value missing because it wasn't recorded or because it doesn't exist?
If a value is missing because it doesn't exist (like the height of the oldest child of someone who doesn't have any children), then it doesn't make sense to try to guess what it might be.
These values you probably want to keep as NaN. On the other hand, if a value is missing because it wasn't recorded, then you can try to guess what it might have been based on the other values in that column and row.
# if relevant
# replace all NA's with 0
subset_nfl_data.fillna(0)
# replace all NA's with the value that comes directly after it in the same column,
# then replace all the remaining NA's with 0
subset_nfl_data.fillna(method = 'bfill', axis=0).fillna(0)
# The default behavior fills in the mean value for imputation.
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()
data_with_imputed_values = my_imputer.fit_transform(original_data)
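Note that fit_transform returns a plain NumPy array, so the column names are lost; assuming original_data is a DataFrame, one way to keep them:
import pandas as pd
# wrap the imputed array back into a DataFrame with the original column names
data_with_imputed_values = pd.DataFrame(my_imputer.fit_transform(original_data),
                                        columns=original_data.columns)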
----------
import pandas as pd
import numpy as np
# for Box-Cox Transformation
from scipy import stats
# for min_max scaling
from mlxtend.preprocessing import minmax_scaling
# plotting modules
import seaborn as sns
import matplotlib.pyplot as plt
# return a dataframe showing the number of NaNs and their percentage
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum() / df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
# replace NaNs with 0
df.fillna(0, inplace=True)
# replace NaNs with the column mean
df['column_name'].fillna(df['column_name'].mean(), inplace=True)
# replace NaNs with the column median
df['column_name'].fillna(df['column_name'].median(), inplace=True)
# linear interpolation to replace NaNs
df['column_name'].interpolate(method='linear', inplace=True)
# replace with the next value
df['column_name'].fillna(method='backfill', inplace=True)
# replace with the previous value
df['column_name'].fillna(method='ffill', inplace=True)
# drop rows containing NaNs
df.dropna(axis=0, inplace=True)
# drop columns containing NaNs
df.dropna(axis=1, inplace=True)
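If dropping everything that contains any NaN is too aggressive, dropna's thresh parameter keeps only rows or columns with at least a minimum number of non-missing values; the 80% threshold below is illustrative:
# keep only columns where at least 80% of the values are present
df.dropna(axis=1, thresh=int(0.8 * len(df)), inplace=True)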
# replace NaNs depending on whether the feature is numerical (k-NN or simple imputation)
# or categorical (most frequent category)
from sklearn.impute import SimpleImputer, KNNImputer
missing_cols = df.isna().sum()[lambda x: x > 0]
for col in missing_cols.index:
    if df[col].dtype in ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']:
        # pick one imputation strategy:
        # imputer = KNNImputer(n_neighbors=5)
        # imputer = SimpleImputer(strategy='mean')  # or 'median', 'most_frequent', 'constant'
        imputer = SimpleImputer(strategy='constant', fill_value=0)  # replace with 0
        df[col] = imputer.fit_transform(df[col].values.reshape(-1, 1))
        # if test set: reuse the fitted imputer rather than refitting on test data
        # df_test[col] = imputer.transform(df_test[col].values.reshape(-1, 1))
    else:
        df[col] = df[col].fillna(df[col].mode().iloc[0])
        # if test set: fill with the mode learned from the training data
        # df_test[col] = df_test[col].fillna(df[col].mode().iloc[0])
PARSING DATES
See https://strftime.org/ for the full list of format codes. Some examples:
1/17/07 has the format "%m/%d/%y"
17-1-2007 has the format "%d-%m-%Y"
# create a new column, date_parsed, with the parsed dates
landslides['date_parsed'] = pd.to_datetime(landslides['date'], format = "%m/%d/%y")
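If the column mixes several date formats, one option is to let pandas infer the format instead of specifying it; this is slower and can guess wrong, so spot-check the result (infer_datetime_format is deprecated in pandas 2.x, where inference happens automatically when format is omitted):
# let pandas guess the date format for us
landslides['date_parsed'] = pd.to_datetime(landslides['date'], infer_datetime_format=True)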
One of the biggest dangers in parsing dates is mixing up the months and days. The to_datetime() function has very helpful error messages, but it doesn't hurt to double-check that the days of the month we've extracted make sense.
# get the day of the month from the parsed dates
day_of_month_landslides = landslides['date_parsed'].dt.day
# remove NA's
day_of_month_landslides = day_of_month_landslides.dropna()
# plot the day of the month (histplot replaces the deprecated distplot)
sns.histplot(day_of_month_landslides, bins=31, kde=False)
READING FILES WITH ENCODING PROBLEMS
# try to read in a file not in UTF-8 (this fails with a UnicodeDecodeError)
kickstarter_2016 = pd.read_csv("../input/kickstarter-projects/ks-projects-201612.csv")
# look at the first ten thousand bytes to guess the character encoding
import chardet
with open("../input/kickstarter-projects/ks-projects-201801.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))
# check what the character encoding might be
print(result)
So chardet is 73% confident that the right encoding is "Windows-1252". Let's see if that's correct:
# read in the file with the encoding detected by chardet
kickstarter_2016 = pd.read_csv("../input/kickstarter-projects/ks-projects-201612.csv", encoding='Windows-1252')
# look at the first few lines
kickstarter_2016.head()
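Once it reads in correctly, it is usually worth saving a UTF-8 copy so the encoding problem does not come back; the output filename here is arbitrary:
# save a copy of the file (to_csv writes UTF-8 by default)
kickstarter_2016.to_csv("ks-projects-201612-utf8.csv", index=False)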
INCONSISTENT DATA
# get all the unique values in the 'City' column
cities = suicide_attacks['City'].unique()
# sort them alphabetically and then take a closer look
cities.sort()
cities
Just looking at this, I can see some problems due to inconsistent data entry: 'Lahore' and 'Lahore ', for example, or 'Lakki Marwat' and 'Lakki marwat'.
# convert to lower case
suicide_attacks['City'] = suicide_attacks['City'].str.lower()
# remove trailing white spaces
suicide_attacks['City'] = suicide_attacks['City'].str.strip()
It does look like there are some remaining inconsistencies: 'd. i khan' and 'd.i khan' should probably be the same.
I'm going to use the fuzzywuzzy package to help identify which strings are closest to each other.
# get the top 10 closest matches to "d.i khan"
import fuzzywuzzy
from fuzzywuzzy import process
matches = fuzzywuzzy.process.extract("d.i khan", cities, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
# take a look at them
matches
We can see that two of the items in the cities are very close to "d.i khan": "d. i khan" and "d.i khan". We can also see that "d.g khan", which is a separate city, has a ratio of 88. Since we don't want to replace "d.g khan" with "d.i khan", let's replace all rows in our City column that have a ratio of > 90 with "d.i khan".
# function to replace rows in the provided column of the provided dataframe
# that match the provided string above the provided ratio with the provided string
def replace_matches_in_column(df, column, string_to_match, min_ratio=90):
    # get a list of unique strings
    strings = df[column].unique()
    # get the top 10 closest matches to our input string
    matches = fuzzywuzzy.process.extract(string_to_match, strings,
                                         limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
    # only keep matches with a ratio >= min_ratio
    close_matches = [match[0] for match in matches if match[1] >= min_ratio]
    # get the rows of all the close matches in our dataframe
    rows_with_matches = df[column].isin(close_matches)
    # replace all rows with close matches with the input string
    df.loc[rows_with_matches, column] = string_to_match
    # let us know the function's done
    print("All done!")
# use the function we just wrote to replace close matches to "d.i khan" with "d.i khan"
replace_matches_in_column(df=suicide_attacks, column='City', string_to_match="d.i khan")
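It is worth re-checking the unique values afterwards to confirm that only the intended entries were merged:
# re-check the unique values in the 'City' column
cities = suicide_attacks['City'].unique()
cities.sort()
cities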
REMOVING A CHARACTER THAT WE DON'T WANT
# remove the dollar sign (regex=False so '$' is treated as a literal character, not end-of-string)
df['GDP'] = df['GDP'].str.replace('$', '', regex=False)
CONVERTING STRINGS TO NUMERIC VALUES
# straightforward conversion:
# df['GDP'] = df['GDP'].astype(float)
# if stray characters (e.g. thousands separators) get in the way of the conversion, strip them first
df['GDP'] = df['GDP'].str.replace(',', '', regex=False).astype(float)
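An alternative I sometimes use (not from the original notes) is pd.to_numeric with errors='coerce', which turns unparseable entries into NaN so they can be handled with the missing-value techniques above:
# unparseable strings become NaN instead of raising an error
df['GDP'] = pd.to_numeric(df['GDP'].str.replace(',', ''), errors='coerce')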
TO ENCODE CATEGORICAL VARIABLES
# For variables taking more than 2 values
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
df['Country'] = ordinal_encoder.fit_transform(df[['Country']])
# to define the ordering ourselves
custom_categories = [['High School', 'Bachelor', 'Master', 'Ph.D']]  # one list per feature; encoded as 0, 1, 2, 3 in this order
ordinal_encoder = OrdinalEncoder(categories=custom_categories)
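A hypothetical usage, assuming the DataFrame has an 'Education' column whose values match the categories above:
# encode the Education column using the custom ordering (0 = 'High School', ..., 3 = 'Ph.D')
df[['Education']] = ordinal_encoder.fit_transform(df[['Education']])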