Data Cleaning

Glob

To deal with files we use glob. This module uses wildcards and directory paths to access files.

import glob

file_list = glob.glob("file[0-9].txt")
# file_list contains the names of all files matching the pattern
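
Glob patterns can include directory parts as well. A minimal sketch, assuming a hypothetical data/ folder of CSV files:

import glob

# Hypothetical folder and file names, for illustration only
csv_files = glob.glob("data/sales_*.csv")
# recursive=True lets ** match files in nested subdirectories
nested_files = glob.glob("data/**/*.csv", recursive=True)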

Combining data from files

import glob
import pandas as pd

all_files = glob.glob("file[0-9].txt")

li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
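
The same result can be written more compactly by passing a generator of DataFrames straight to pd.concat, which skips the intermediate list. A sketch, reusing the file[0-9].txt pattern from above:

import glob
import pandas as pd

all_files = glob.glob("file[0-9].txt")
# pd.concat accepts any iterable of DataFrames
frame = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)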

Melt in pandas

This is used for reshaping data stored in pandas DataFrames from wide format to long format (fewer columns, more rows).

df = df.melt(id_vars=None, value_vars=None,
             var_name=None, value_name='value', col_level=None)
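
A small worked example, using a hypothetical wide-format table of student scores:

import pandas as pd

wide = pd.DataFrame({
    "student": ["Ann", "Bob"],   # identifier column
    "maths": [90, 75],           # one column per subject (wide format)
    "physics": [85, 80],
})

long = wide.melt(id_vars="student", value_vars=["maths", "physics"],
                 var_name="subject", value_name="score")
# long has one row per (student, subject) pair: 4 rows, 3 columns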

Duplicates

To flag the duplicate rows we can use

df.duplicated()

To remove the duplicate rows

df.drop_duplicates(subset=None, keep='first', inplace=False)
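
A quick sketch of both calls on a hypothetical frame with a repeated row:

import pandas as pd

df = pd.DataFrame({"name": ["a", "a", "b"], "score": [1, 1, 2]})

df.duplicated()                                         # False, True, False
df = df.drop_duplicates()                               # keeps the first "a" row
df = df.drop_duplicates(subset=["name"], keep="last")   # dedupe on "name" only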

Splitting columns

Suppose there is a single column for dates, but you want separate columns for the day, month and year. We need to split the column.

# df['date'] holds strings like '06082001' (ddmmyyyy)
df['day'] = df.date.str[:2]
df['month'] = df.date.str[2:4]
df['year'] = df.date.str[4:]
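
An alternative (a sketch, assuming the same ddmmyyyy strings) is to parse the column with pd.to_datetime and use the .dt accessor:

import pandas as pd

df = pd.DataFrame({"date": ["06082001", "25121999"]})  # hypothetical ddmmyyyy strings

parsed = pd.to_datetime(df["date"], format="%d%m%Y")
df["day"] = parsed.dt.day
df["month"] = parsed.dt.month
df["year"] = parsed.dt.year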

Split by character

If you want to split a column on '_', you can break each value into pieces with '_' as the separator; the pieces end up in a list that can be indexed.

split_col = df.col_name.str.split('_')
# split_col is a pandas Series whose elements are lists of the split pieces
df['val1'] = split_col.str.get(0)  # first piece after the split
df['val2'] = split_col.str.get(1)  # second piece
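
With expand=True, str.split returns a DataFrame directly, which avoids the intermediate Series of lists. A sketch with hypothetical values:

import pandas as pd

df = pd.DataFrame({"col_name": ["red_small", "blue_large"]})  # hypothetical values

parts = df["col_name"].str.split("_", expand=True)  # DataFrame with columns 0 and 1
df["val1"] = parts[0]
df["val2"] = parts[1]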

Types

To look at the types of the columns

df.dtypes
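
To change a column's type, astype can be used. A minimal sketch with hypothetical string columns:

import pandas as pd

df = pd.DataFrame({"id": ["1", "2"], "price": ["3.5", "4.0"]})  # both stored as strings

print(df.dtypes)                        # both columns show as object
df["id"] = df["id"].astype(int)
df["price"] = df["price"].astype(float)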

String parsing

Suppose we want to turn a string like "$1" into the number 1.

df['col'] = df['col'].replace(r'[\$,]', '', regex=True)
df['col'] = pd.to_numeric(df['col'])

Now suppose we have a column with values like "30 dollars"; we can extract the number from it with:

split_df = df['col'].str.split(r'(\d+)', expand=True)
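
str.extract with a capture group is often tidier here, since it returns just the matched number. A sketch with hypothetical values:

import pandas as pd

df = pd.DataFrame({"col": ["30 dollars", "45 dollars"]})  # hypothetical values

# expand=False returns a Series of the captured group instead of a DataFrame
df["amount"] = df["col"].str.extract(r"(\d+)", expand=False).astype(int)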

Missing values

There are multiple ways to deal with missing values; one of them is dropping the whole row whenever a value is missing.

df = df.dropna()
#drops the rows with missing values

As seen before, if we want to drop rows based only on missing values in certain columns, we pass the subset parameter.
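
A short sketch of dropna with subset, using hypothetical column names:

import numpy as np
import pandas as pd

df = pd.DataFrame({"col": [1.0, np.nan], "other": [np.nan, 2.0]})

# Drop a row only when 'col' is missing; NaNs in 'other' are ignored
df = df.dropna(subset=["col"])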

We can also fill missing values with the mean or some other aggregate of the remaining values.

df = df.fillna(value={"col": df.col.mean(), "col2": df.col2.mean()})
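
Other fill strategies work the same way. A sketch showing a median fill and a forward fill on a hypothetical column:

import numpy as np
import pandas as pd

df = pd.DataFrame({"col": [1.0, np.nan, 3.0]})

df["col"] = df["col"].fillna(df["col"].median())  # fill with the median instead of the mean
# or carry the previous observation forward
df["col"] = df["col"].ffill()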