```python
import pandas as pd

data_file = 'data/data.csv'
df = pd.read_csv(data_file)
print(df.shape)
```

```
(36151, 9)
```
```python
column_names = df.columns
df.head(5)
```
| | Year of arrival at port of disembarkation | Voyage ID | Vessel name | Voyage itinerary imputed port where began (ptdepimp) place | Voyage itinerary imputed principal place of slave purchase (mjbyptimp) | Voyage itinerary imputed principal port of slave disembarkation (mjslptimp) place | VOYAGEID2 | Captives arrived at 1st port | Captain's name |
|---|---|---|---|---|---|---|---|---|---|
0 | 1714.0 | 16109 | Freeke Gally | Bristol | NaN | Kingston | NaN | 283.0 | Neale, Alexander |
1 | 1713.0 | 16110 | Greyhound Gally | Bristol | NaN | Jamaica, place unspecified | NaN | NaN | Selkirk, Alexander<br/> Forrest, Henry |
2 | 1714.0 | 16111 | Jacob | Bristol | NaN | Kingston | NaN | 130.0 | Nicholls, Philip |
3 | 1714.0 | 16112 | Jason Gally | Bristol | NaN | Port Royal | NaN | 278.0 | Plummer, John |
4 | 1713.0 | 16113 | Lawford Gally | Bristol | Africa, port unspecified | Newcastle (Nevis) | NaN | NaN | Stretton, Joseph |
Now that we have gained some familiarity with our dataset, it is time to clean our data, i.e. to get rid of all those NaN values and anything else that might affect our data analysis. Where to start? Inspecting the DataFrame by eye, we see several NaNs in the first 5 rows of our DataFrame. The first column where we see NaNs is “Voyage itinerary imputed principal place of slave purchase”, the fifth column (index 4). It would be nice to check whether the other columns also contain NaNs.
Let’s start with the first column, “Year of arrival at port of disembarkation” (index 0): we will check whether this column contains any NaN, and then we will repeat the same process for all the other columns.
In this step we use: the `pandas` column selector method `.iloc[]`; the `pandas` method `.isna()`; `.sum()`; `print()`.
- `arr_year = df.iloc[:,0]`: we first select the first column of the DataFrame using the `pandas` method `.iloc[]` and we store the selected column in the new variable `arr_year` (we already used this method in the previous chapter);
- `arr_year_na = arr_year.isna()`: we apply the `pandas` method `.isna()`. When we apply `.isna()`, we obtain a result with the same dimensions as the object we applied it to, containing either True or False depending on whether the corresponding value is a missing value (NaN, “Not a Number”, or more generally NA, “not available”) or not. Indeed, the result of this method just answers the question: is this value NA? We store the result in the variable `arr_year_na` (you are free to use a more descriptive name);
- `print(arr_year_na.sum())`: in Python the boolean values True and False are equivalent to 1 and 0, respectively. This means that if we have an array (a list or sequence of values) containing True and False, summing all the values gives us the number of True values (as each True counts as 1 and everything else counts as 0). These True values correspond to the cases where the method `.isna()` found a NaN, so summing them means, as a matter of fact, counting how many NaNs have been found.

```python
arr_year = df.iloc[:,0]
arr_year_na = arr_year.isna()
print(arr_year_na)
print('Total number of NaNs in the first column:',arr_year_na.sum())
```
```
0        False
1        False
2        False
3        False
4        False
         ...
36146    False
36147    False
36148    False
36149    False
36150    False
Name: Year of arrival at port of disembarkation, Length: 36151, dtype: bool
Total number of NaNs in the first column: 1
```
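As a quick aside, we can convince ourselves that booleans really behave as numbers in Python; this is just a minimal check, not part of the analysis:

```python
# Each True counts as 1 and each False as 0, so the sum counts the Trues
print(sum([True, False, True]))  # prints 2
```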
```python
solution = 'The first column contains 1 NaN value'
question_box(solution=solution)
```
In this way we found out that the first column has 1 NaN (or NA) value, which would have been quite hard to spot by eye scrolling through 36151 lines!
It is great that we found the 1 NaN in the first column, but where exactly is it located? What is the corresponding Voyage ID of that value?
- `df[arr_year_na]`: here we use one of the most useful features when working with DataFrames: masking. From our previous code, `arr_year_na` is an object with the same shape and features as the first column, but instead of containing years it contains True and False values, where True corresponds to the NaNs found by applying the method `.isna()`. We can use this object to select data from our DataFrame `df`: indexing `df` with this boolean mask keeps only the rows where `arr_year_na` is True (all the columns are preserved), which in our case is the single row whose year of arrival is missing.

```python
df[arr_year_na]
```
| | Year of arrival at port of disembarkation | Voyage ID | Vessel name | Voyage itinerary imputed port where began (ptdepimp) place | Voyage itinerary imputed principal place of slave purchase (mjbyptimp) | Voyage itinerary imputed principal port of slave disembarkation (mjslptimp) place | VOYAGEID2 | Captives arrived at 1st port | Captain's name |
|---|---|---|---|---|---|---|---|---|---|
32248 | NaN | 91909 | Kitty | Liverpool | NaN | NaN | NaN | NaN | Fisher, Joseph |
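If we only want the Voyage ID of that row, we can combine the same boolean mask with `.loc`, which lets us select rows and a column at the same time; a minimal sketch:

```python
# Rows where the mask is True, restricted to the 'Voyage ID' column
print(df.loc[arr_year_na, 'Voyage ID'])
```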
In this way we can inspect NaNs one by one and decide how to handle them. In our DataFrame there are thousands of NaNs (as you will see in a minute), and going through ALL of them one by one is not a good idea. Let’s first try to figure out whether the other columns also have NaNs, and how many there are. The process will be quite straightforward, as we already did it for one of the columns; we just need to repeat the same procedure for all the other columns.
```python
for column_name in column_names:
    selected_column = df[column_name]
    selected_column_na = selected_column.isna()
    n_nan = selected_column_na.sum()
    print(column_name,"has",n_nan,"NaN")
```

We start with a for loop iterating over all the elements contained in the variable `column_names`, in which we previously stored all the column names; in this way the variable `column_name` will contain a single column name at each iteration of the loop. We then use the column name to select that specific column in our DataFrame, storing the selected column in the variable `selected_column`. We apply the method `.isna()` to the selected column to obtain, instead of a column of numerical or string values, a column of True and False (boolean) values. We use the method `.sum()` to sum all these boolean values; keeping in mind that True is equivalent to 1 and False to 0, this results in counting all the NaN values in the selected column. We store this last result in the variable `n_nan`. We finally print this result, together with the name of the selected column, using the Python function `print()`.
```python
for i,column_name in enumerate(column_names):
    print(f"{i}) {column_name} has {df[column_name].isna().sum()} NaN")
```

Alternatively, we can apply the methods described above one after another, without storing any intermediate result in a new variable. We can also use the Python function `enumerate()` to not only iterate through the `df` column names, but also to count the loop iterations, starting from the first one (index 0). The iteration index is stored in the variable `i`.
```python
for column_name in column_names:
    selected_column = df[column_name]
    selected_column_na = selected_column.isna()
    n_nan = selected_column_na.sum()
    print(column_name,'has',n_nan,'NaN')
```
```
Year of arrival at port of disembarkation has 1 NaN
Voyage ID has 0 NaN
Vessel name has 1614 NaN
Voyage itinerary imputed port where began (ptdepimp) place has 4508 NaN
Voyage itinerary imputed principal place of slave purchase (mjbyptimp) has 2210 NaN
Voyage itinerary imputed principal port of slave disembarkation (mjslptimp) place has 4191 NaN
VOYAGEID2 has 36101 NaN
Captives arrived at 1st port has 17743 NaN
Captain's name has 4028 NaN
```
…and if we also want to keep track of the index of each column:
```python
for i,column_name in enumerate(column_names):
    print(f"{i}) {column_name} has {df[column_name].isna().sum()} NaN")
```
```
0) Year of arrival at port of disembarkation has 1 NaN
1) Voyage ID has 0 NaN
2) Vessel name has 1614 NaN
3) Voyage itinerary imputed port where began (ptdepimp) place has 4508 NaN
4) Voyage itinerary imputed principal place of slave purchase (mjbyptimp) has 2210 NaN
5) Voyage itinerary imputed principal port of slave disembarkation (mjslptimp) place has 4191 NaN
6) VOYAGEID2 has 36101 NaN
7) Captives arrived at 1st port has 17743 NaN
8) Captain's name has 4028 NaN
```
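For completeness, `pandas` can produce the same per-column counts in a single line, without an explicit loop; a minimal sketch:

```python
# .isna() on the whole DataFrame; .sum() then counts NaNs column by column
print(df.isna().sum())
```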
At this point we have a general idea of the amount of data missing in our DataFrame. The next question is: how do we deal with this missing data? There are several things we can do; the easiest option would be to simply exclude it from our DataFrame. However, in order to answer a research question we often do not need to use or explore ALL the available information, and we are usually interested in some parameters more than others. In this case our data selection can be performed by looking at one or more specific columns. What to do with the rest of the NaNs? We can either leave them as they are, and try to figure out how our analysis program will “digest” these values, or find a good substitute for them. The value of this substitute will depend on the data type of the column containing the NaN and on our decision. For example, the NaNs in columns containing a descriptive string, like the vessel name or the starting port, could be substituted with the string “unknown”. NaNs in the “Captives arrived […]” column could be left as they are (you may be tempted to change them to 0, but zero captives is quite different from an unknown number of captives) or substituted with, for example, the average number of captives in the same year.
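If we wanted to follow that last idea, one possible approach is to group the captive counts by arrival year and fill each NaN with the mean of its year. This is only a sketch of the alternative mentioned above, not the route taken in this workshop, and `df_imputed` is a hypothetical name:

```python
year_col = 'Year of arrival at port of disembarkation'
captives_col = 'Captives arrived at 1st port'

# Mean number of captives per arrival year, aligned row by row with df
yearly_mean = df.groupby(year_col)[captives_col].transform('mean')

# Replace missing captive counts with the average of the same year
df_imputed = df.copy()
df_imputed[captives_col] = df_imputed[captives_col].fillna(yearly_mean)
```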
Each choice will have different implications for our final results; the most important thing at this stage is to clearly document our criteria for filtering NaNs. In our specific case we are mostly interested in the data containing the number of captives, so we want to filter out all those rows where the number of captives is NaN. We will also exclude the column VOYAGEID2, as we already have a voyage ID and it is not listed in the data variable description. To summarise, here are our cleaning criteria:

- All the rows not containing data about the number of captives are removed;
- All the NaN values in columns with descriptive information (e.g. names) are substituted with “unknown”;
- The column VOYAGEID2 is removed from the DataFrame.
In this step we use: `print()`; the `pandas` attribute `.columns`; the `pandas` method `.drop()`; the `pandas` method `.head()`; the `pandas` method `.dropna()`; the `pandas` method `.fillna()`; the `pandas` attribute `.shape`; `len()`.
- `print(df.columns)`: we start by printing the names of our DataFrame columns, just as a reminder. We use the Python function `print()` and the `pandas` attribute `.columns` that we already used several times;
- `column_to_remove = "VOYAGEID2"` and `column_to_remove_nan = "Captives arrived at 1st port"`: we store the names of the columns we are interested in into two variables. This is not strictly necessary, as we could use the names of the columns directly as input for methods and functions. However, with this extra step our code will be more readable, as we are using two very descriptive variable names: `column_to_remove` will, indeed, contain the name of the column we want to remove from our DataFrame (VOYAGEID2), and `column_to_remove_nan` will contain the name of the column to use as a reference for filtering rows: all the rows having NaN in this column will be removed;
- `cleaned_df_step1 = df.drop(column_to_remove,axis=1)`: we use the method `.drop()` to remove a specific column. Each row of a `pandas` DataFrame has an index. This index is different from the positional index that every Python list has (an integer number starting from 0); it is an additional index that can be not only a number, but also a string or a label. The `pandas` method `.drop()` can be used either to remove columns, based on column names, or rows, based on the `pandas` index. To indicate that we want to target columns, and not rows, we need to set the parameter `axis` equal to 1 (0 would target the rows). Together with that, we also need to specify the name of the column we want to remove, stored in the variable `column_to_remove`. We apply the method `.drop()` to our "dirty" DataFrame `df` and store the result in the variable `cleaned_df_step1`;
- `cleaned_df_step1.head(5)`: after every filtering operation it is a good idea to check its effect on our result. Here we use the `pandas` method `.head()` to visualise the first five rows of our DataFrame `cleaned_df_step1`;
- `cleaned_df_step2 = cleaned_df_step1.dropna(subset=[column_to_remove_nan])`: we use the method `.dropna()` to filter out all the rows having NaN in a list of columns. The list of columns needs to be specified in the parameter `subset`. In this case we have a single column name, stored in the variable `column_to_remove_nan`, and to make it a list we put this variable between square brackets `[ ]`. We apply the method to the DataFrame `cleaned_df_step1` and store the result in `cleaned_df_step2`;
- `cleaned_df = cleaned_df_step2.fillna("unknown")`: we finally use the method `.fillna()` to fill all the NaNs in our DataFrame with a certain value, in this case the string (word) "unknown". We apply the method to the DataFrame `cleaned_df_step2` and store the result in `cleaned_df`;
- `print(cleaned_df.shape)`, `n_filtered_rows = len(df)-len(cleaned_df)` and `per_cent = (n_filtered_rows/len(df))*100`: this block of Python instructions checks the size of our filtered DataFrame `cleaned_df` and compares it with our original, raw DataFrame `df`. We first visualise the size of `cleaned_df` using the attribute `.shape`, then we simply obtain the number of rows of each DataFrame using the Python function `len()`. We subtract the number of rows of the DataFrames before and after filtering to check how many rows we filtered out, and we store the result in the variable `n_filtered_rows`. We finally compute the percentage of filtered rows compared to the initial size of the DataFrame.

```python
# Display the name of the columns first
print(df.columns)

# Select our target columns for cleaning the data
column_to_remove = 'VOYAGEID2'
column_to_remove_nan = 'Captives arrived at 1st port'

# Perform data cleaning, visualising the result step by step
# step 1: remove the column VOYAGEID2 from the DataFrame
cleaned_df_step1 = df.drop(column_to_remove,axis=1)
cleaned_df_step1.head(5)
```
```
Index(['Year of arrival at port of disembarkation', 'Voyage ID', 'Vessel name',
       'Voyage itinerary imputed port where began (ptdepimp) place',
       'Voyage itinerary imputed principal place of slave purchase (mjbyptimp) ',
       'Voyage itinerary imputed principal port of slave disembarkation (mjslptimp) place',
       'VOYAGEID2', 'Captives arrived at 1st port', "Captain's name"],
      dtype='object')
```
| | Year of arrival at port of disembarkation | Voyage ID | Vessel name | Voyage itinerary imputed port where began (ptdepimp) place | Voyage itinerary imputed principal place of slave purchase (mjbyptimp) | Voyage itinerary imputed principal port of slave disembarkation (mjslptimp) place | Captives arrived at 1st port | Captain's name |
|---|---|---|---|---|---|---|---|---|
0 | 1714.0 | 16109 | Freeke Gally | Bristol | NaN | Kingston | 283.0 | Neale, Alexander |
1 | 1713.0 | 16110 | Greyhound Gally | Bristol | NaN | Jamaica, place unspecified | NaN | Selkirk, Alexander<br/> Forrest, Henry |
2 | 1714.0 | 16111 | Jacob | Bristol | NaN | Kingston | 130.0 | Nicholls, Philip |
3 | 1714.0 | 16112 | Jason Gally | Bristol | NaN | Port Royal | 278.0 | Plummer, John |
4 | 1713.0 | 16113 | Lawford Gally | Bristol | Africa, port unspecified | Newcastle (Nevis) | NaN | Stretton, Joseph |
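As a side note, recent `pandas` versions also accept a `columns=` keyword in `.drop()`, which removes the need for `axis=1`; a minimal sketch (the variable name `cleaned_df_step1_alt` is ours, just for illustration):

```python
# Equivalent to df.drop(column_to_remove, axis=1)
cleaned_df_step1_alt = df.drop(columns=column_to_remove)
```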
```python
# step 2: remove all the rows having NaN in the "Captives arrived at 1st port" column
cleaned_df_step2 = cleaned_df_step1.dropna(subset=[column_to_remove_nan])
cleaned_df_step2.head(5)
```
| | Year of arrival at port of disembarkation | Voyage ID | Vessel name | Voyage itinerary imputed port where began (ptdepimp) place | Voyage itinerary imputed principal place of slave purchase (mjbyptimp) | Voyage itinerary imputed principal port of slave disembarkation (mjslptimp) place | Captives arrived at 1st port | Captain's name |
|---|---|---|---|---|---|---|---|---|
0 | 1714.0 | 16109 | Freeke Gally | Bristol | NaN | Kingston | 283.0 | Neale, Alexander |
2 | 1714.0 | 16111 | Jacob | Bristol | NaN | Kingston | 130.0 | Nicholls, Philip |
3 | 1714.0 | 16112 | Jason Gally | Bristol | NaN | Port Royal | 278.0 | Plummer, John |
5 | 1714.0 | 16114 | Mercy Gally | Bristol | Africa, port unspecified | Barbados, place unspecified | 190.0 | Scott, John |
6 | 1714.0 | 16115 | Mermaid Gally | Bristol | Cape Verde Islands | Kingston | 72.0 | Banbury, John<br/> Copinger, James |
```python
# step 3: change all the remaining NaNs into "unknown"
cleaned_df = cleaned_df_step2.fillna("unknown")
cleaned_df.head(5)
```
| | Year of arrival at port of disembarkation | Voyage ID | Vessel name | Voyage itinerary imputed port where began (ptdepimp) place | Voyage itinerary imputed principal place of slave purchase (mjbyptimp) | Voyage itinerary imputed principal port of slave disembarkation (mjslptimp) place | Captives arrived at 1st port | Captain's name |
|---|---|---|---|---|---|---|---|---|
0 | 1714.0 | 16109 | Freeke Gally | Bristol | unknown | Kingston | 283.0 | Neale, Alexander |
2 | 1714.0 | 16111 | Jacob | Bristol | unknown | Kingston | 130.0 | Nicholls, Philip |
3 | 1714.0 | 16112 | Jason Gally | Bristol | unknown | Port Royal | 278.0 | Plummer, John |
5 | 1714.0 | 16114 | Mercy Gally | Bristol | Africa, port unspecified | Barbados, place unspecified | 190.0 | Scott, John |
6 | 1714.0 | 16115 | Mermaid Gally | Bristol | Cape Verde Islands | Kingston | 72.0 | Banbury, John<br/> Copinger, James |
```python
# step 4: check how much data we filtered out
print(cleaned_df.shape)
n_filtered_rows = len(df)-len(cleaned_df)
per_cent = (n_filtered_rows/len(df))*100
print('We filtered out:',n_filtered_rows,', corresponding to about', round(per_cent), '% of our initial data')
```
```
(18408, 8)
We filtered out: 17743 , corresponding to about 49 % of our initial data
```
It seems that, because of our filtering, almost half of our data will be excluded from the analysis. This is quite a large percentage, and we may decide to re-think our filtering criteria to include more data. For example, we could substitute the missing values in the Captives column with an average number of captives per trip. For the purpose of our workshop, we will keep the current filtering criteria and keep our filtered DataFrame as it is.
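For reference, the three cleaning steps can also be written as a single chain of `pandas` methods; a minimal sketch reproducing the same criteria (`cleaned_df_chained` is a hypothetical name):

```python
cleaned_df_chained = (
    df.drop(columns=column_to_remove)          # step 1: drop VOYAGEID2
      .dropna(subset=[column_to_remove_nan])   # step 2: drop rows with NaN captives
      .fillna('unknown')                       # step 3: fill the remaining NaNs
)
print(cleaned_df_chained.shape)                # should match cleaned_df.shape
```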
At this point we have obtained a “clean” DataFrame, `cleaned_df`, containing 18408 rows with values organised in 8 columns. We can now start diving into the analysis of our DataFrame: we are ready to interrogate this dataset and see what kind of story it is going to tell us.
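Before moving on, a quick sanity check (a sketch using the same methods as above) can confirm that no NaNs are left in `cleaned_df`:

```python
# Count the NaNs per column, then add them up; the total should be 0
print(cleaned_df.isna().sum().sum())
```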