Data Cleaning
Created | |
---|---|
Tags |
Glob
To deal with files we use glob. this uses wildcards and path directories to acces files.
import glob
file_list = glob.glob("file[0-9].txt")
#file_list contains all the names of the files which are of the described pattern
Combining data from files
li = []
for filename in all_files:
df = pd.read_csv(filename, index_col=None, header=0)
li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
Melt in pandas
This is used for re-shaping data stored in panda dataframes from wide format to long format (less columns more rows)
df = df.melt(id_vars=None, value_vars=None,
var_name=None, value_name='value', col_level=None)
- frame : The dataframe
- id_vars : the columns which are to remained as columns(parameters)
- value_vars : the rest are converted into just values, anything not defined under id_vars is taken as value_vars
- var_name : name for the column with variables
- value_name : name for the columns with values
- col_level : incase the dataframe is multilevel (lists inside lists sort of)
Duplicates
To list out the duplicates columns we can use
df.duplicated()
To remove the duplicate rows
df.drop_duplicates(subset = None, keep = 'first', in-place = False)
- subset is the list of columns to be considered while finding the duplicates.
- keep: can be 'last', 'first' or False.
- first : the first value is considered as unique the rest are duplicates, same idea with last
- False : all the values which repeat are duplicates. this is to keep the ones which only occur once.
Spliting columns
Suppose there is a column for dates, but you want the columns for each day, month and year. we need to split the column.
#df['date'] = '06082001' format
df['day'] = df.date.str[:2]
df['month'] = df.date.str[2:4]
df['day'] = df.date.str[4:]
Split by character
You want to split by '_' we can split them into pieces with '_' as separators and then each piece will be indiced as a list.
split_col = df.col_name.str.split('_')
#makes a pandas series with each element being list of the split values
df['val1'] = df.split_cold.str.get(0) #1st value is taken after the split
df['val2'] = df.split_col.str.get(1)
Types
To look at types of the columns
df.dtypes
String parsing
We want to replace "$1" with just 1
df.col = df['col'].replace('[\$,]', '', regex = True)
df.col = pd.to_numeric(df.col)
Now we have a column like "30 dollars" we can extract the number out of it by:
split_df = df['col'].str.split('(\d+)', expand=True)
Missing values
There are multiple ways to deal with the missing values, one of them being dropping the whole row when there is a missing value.
df = df.dropna()
#drops the rows with missing values
As any seen before if we want to do this only for some columns with nan values we pass subset parameter.
We can also fill the values with the mean or some other aggregate of the rest of the values
df = df.fillna(value = {"col" : df.col.mean(), "col2" : df.col2.mean()})