```python
import pandas as pd

data_file = 'data/data.csv'
df = pd.read_csv(data_file)
print(df.shape)
```

```
(36151, 9)
```
```python
column_names = df.columns
df.head(5)
```
| | Year of arrival at port of disembarkation | Voyage ID | Vessel name | Voyage itinerary imputed port where began (ptdepimp) place | Voyage itinerary imputed principal place of slave purchase (mjbyptimp) | Voyage itinerary imputed principal port of slave disembarkation (mjslptimp) place | VOYAGEID2 | Captives arrived at 1st port | Captain's name |
|---|---|---|---|---|---|---|---|---|---|
0 | 1714.0 | 16109 | Freeke Gally | Bristol | NaN | Kingston | NaN | 283.0 | Neale, Alexander |
1 | 1713.0 | 16110 | Greyhound Gally | Bristol | NaN | Jamaica, place unspecified | NaN | NaN | Selkirk, Alexander<br/> Forrest, Henry |
2 | 1714.0 | 16111 | Jacob | Bristol | NaN | Kingston | NaN | 130.0 | Nicholls, Philip |
3 | 1714.0 | 16112 | Jason Gally | Bristol | NaN | Port Royal | NaN | 278.0 | Plummer, John |
4 | 1713.0 | 16113 | Lawford Gally | Bristol | Africa, port unspecified | Newcastle (Nevis) | NaN | NaN | Stretton, Joseph |
Now that we have gained some familiarity with our dataset, it is time to clean our data, i.e. to get rid of all those NaN values and anything else that might affect our data analysis. Where to start? Inspecting the DataFrame by eye, we see several NaNs in the first 5 rows of our DataFrame. The first column where we see NaNs is “Voyage itinerary imputed principal place of slave purchase”, the fifth column (index 4). It would be nice to check whether the other columns also contain NaNs.
Let’s start with the first column, “Year of arrival at port of disembarkation” (index 0): we will check whether this column contains any NaN, and then we will repeat the same process for all the other columns.
In this step we use: the `pandas` column selector method `.iloc[]`; the `pandas` method `.isna()`; `.sum()`; `print()`.
- `arr_year = df.iloc[:,0]`: we first select the first column of the DataFrame using the `pandas` method `.iloc[]` and we store the selected column in the new variable `arr_year` (we already used this method in the previous chapter);
- `arr_year_na = arr_year.isna()`: we apply the `pandas` method `.isna()`. When we apply `.isna()`, we obtain a result with the same dimensions as the object we applied it to, containing either True or False depending on whether the corresponding value is a missing value (NaN, “Not a Number”, or more generally NA, “not available”) or not. Indeed, the result of this method just answers the question: is this value NA? We store the result in the variable `arr_year_na` (you are free to use a more descriptive name);
- `print(arr_year_na.sum())`: in Python the boolean values True and False are equivalent to 1 and 0, respectively. This means that if we have an array (a list or sequence of values) containing True and False, summing all the values gives us the number of True values (as each True counts as 1 and everything else counts as 0). These True values correspond to the cases where the method `.isna()` found a NaN, so summing them means, as a matter of fact, counting how many NaNs have been found.

```python
arr_year = df.iloc[:,0]
arr_year_na = arr_year.isna()
print(arr_year_na)
print('Total number of NaNs in the first column:',arr_year_na.sum())
```
```
0        False
1        False
2        False
3        False
4        False
         ...
36146    False
36147    False
36148    False
36149    False
36150    False
Name: Year of arrival at port of disembarkation, Length: 36151, dtype: bool
Total number of NaNs in the first column: 1
```
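As a quick aside, we can convince ourselves that booleans really behave as numbers in Python; this is just a minimal check, not part of the analysis:

```python
# Each True counts as 1 and each False as 0, so the sum counts the Trues
print(sum([True, False, True]))  # prints 2
```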
```python
solution = 'The first column contains 1 NaN value'
question_box(solution=solution)
```
In this way we found out that the first column has 1 NaN (or NA) value, which would have been quite hard to spot by eye scrolling through 36151 lines!
It is great that we found the 1 NaN in the first column, but where exactly is it located? What is the corresponding Voyage ID of that value?
- `df[arr_year_na]`: here we use one of the most useful features when working with DataFrames: masking. From our previous code, `arr_year_na` is an object with the same shape and features as the first column, but instead of containing years it contains True and False values, where True corresponds to the NaNs found by applying the method `.isna()`. We can use this object to select data from our DataFrame `df`: indexing `df` with this boolean mask keeps only the rows where `arr_year_na` is True (all the columns are preserved), which in our case is the single row whose year of arrival is missing.

```python
df[arr_year_na]
```
| | Year of arrival at port of disembarkation | Voyage ID | Vessel name | Voyage itinerary imputed port where began (ptdepimp) place | Voyage itinerary imputed principal place of slave purchase (mjbyptimp) | Voyage itinerary imputed principal port of slave disembarkation (mjslptimp) place | VOYAGEID2 | Captives arrived at 1st port | Captain's name |
|---|---|---|---|---|---|---|---|---|---|
32248 | NaN | 91909 | Kitty | Liverpool | NaN | NaN | NaN | NaN | Fisher, Joseph |
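If we only want the Voyage ID of that row, we can combine the same boolean mask with `.loc`, which lets us select rows and a column at the same time; a minimal sketch:

```python
# Rows where the mask is True, restricted to the 'Voyage ID' column
print(df.loc[arr_year_na, 'Voyage ID'])
```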
In this way we can inspect NaNs one by one and decide how to handle them. In our DataFrame there are thousands of NaNs (as you will see in a minute), and going through ALL of them one by one is not a good idea. Let’s first try to figure out whether the other columns also have NaNs, and how many there are. The process will be quite straightforward, as we already did it for one of the columns; we just need to repeat the same procedure for all the other columns.
```python
for column_name in column_names:
    selected_column = df[column_name]
    selected_column_na = selected_column.isna()
    n_nan = selected_column_na.sum()
    print(column_name,"has",n_nan,"NaN")
```

We start with a for loop iterating over all the elements contained in the variable `column_names`, in which we previously stored all the column names; in this way the variable `column_name` will contain a single column name at each iteration of the loop. We then use the column name to select that specific column in our DataFrame, storing the selected column in the variable `selected_column`. We apply the method `.isna()` to the selected column to obtain, instead of a column of numerical or string values, a column of True and False (boolean) values. We use the method `.sum()` to sum all these boolean values; keeping in mind that True is equivalent to 1 and False to 0, this results in counting all the NaN values in the selected column. We store this last result in the variable `n_nan`. We finally print this result, together with the name of the selected column, using the Python function `print()`.
```python
for i,column_name in enumerate(column_names):
    print(f"{i}) {column_name} has {df[column_name].isna().sum()} NaN")
```

Alternatively, we can apply the methods described above one after another, without storing any intermediate result in a new variable. We can also use the Python function `enumerate()` to not only iterate through the `df` column names, but also to count the loop iterations, starting from the first one (index 0). The iteration index is stored in the variable `i`.
```python
for column_name in column_names:
    selected_column = df[column_name]
    selected_column_na = selected_column.isna()
    n_nan = selected_column_na.sum()
    print(column_name,'has',n_nan,'NaN')
```
```
Year of arrival at port of disembarkation has 1 NaN
Voyage ID has 0 NaN
Vessel name has 1614 NaN
Voyage itinerary imputed port where began (ptdepimp) place has 4508 NaN
Voyage itinerary imputed principal place of slave purchase (mjbyptimp) has 2210 NaN
Voyage itinerary imputed principal port of slave disembarkation (mjslptimp) place has 4191 NaN
VOYAGEID2 has 36101 NaN
Captives arrived at 1st port has 17743 NaN
Captain's name has 4028 NaN
```
…and if we also want to keep track of the index of each column:
```python
for i,column_name in enumerate(column_names):
    print(f"{i}) {column_name} has {df[column_name].isna().sum()} NaN")
```
```
0) Year of arrival at port of disembarkation has 1 NaN
1) Voyage ID has 0 NaN
2) Vessel name has 1614 NaN
3) Voyage itinerary imputed port where began (ptdepimp) place has 4508 NaN
4) Voyage itinerary imputed principal place of slave purchase (mjbyptimp) has 2210 NaN
5) Voyage itinerary imputed principal port of slave disembarkation (mjslptimp) place has 4191 NaN
6) VOYAGEID2 has 36101 NaN
7) Captives arrived at 1st port has 17743 NaN
8) Captain's name has 4028 NaN
```
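For completeness, `pandas` can produce the same per-column counts in a single line, without an explicit loop; a minimal sketch:

```python
# .isna() on the whole DataFrame; .sum() then counts NaNs column by column
print(df.isna().sum())
```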
At this point we have a general idea of the amount of data missing in our DataFrame. The next question is: how do we deal with this missing data? There are several things we can do; the easiest option would be to simply exclude it from our DataFrame. However, in order to answer a research question we often do not need to use or explore ALL the available information, and we are usually interested in some parameters more than others. In this case our data selection can be performed by looking at one or more specific columns. What to do with the rest of the NaNs? We can either leave them as they are, and try to figure out how our analysis program will “digest” these values, or find a good substitute for them. The value of this substitute will depend on the data type of the column containing the NaN and on our decision. For example, the NaNs in columns containing a descriptive string, like the vessel name or the starting port, could be substituted with the string “unknown”. NaNs in the “Captives arrived […]” column could be left as they are (you may be tempted to change them to 0, but zero captives is quite different from an unknown number of captives) or substituted with, for example, the average number of captives in the same year.
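If we wanted to follow that last idea, one possible approach is to group the captive counts by arrival year and fill each NaN with the mean of its year. This is only a sketch of the alternative mentioned above, not the route taken in this workshop, and `df_imputed` is a hypothetical name:

```python
year_col = 'Year of arrival at port of disembarkation'
captives_col = 'Captives arrived at 1st port'

# Mean number of captives per arrival year, aligned row by row with df
yearly_mean = df.groupby(year_col)[captives_col].transform('mean')

# Replace missing captive counts with the average of the same year
df_imputed = df.copy()
df_imputed[captives_col] = df_imputed[captives_col].fillna(yearly_mean)
```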
Each choice will have different implications for our final results; the most important thing at this stage is to clearly document our criteria for filtering NaNs. In our specific case we are mostly interested in the data containing the number of captives, so we want to filter out all those rows where the number of captives is NaN. We will also exclude the column VOYAGEID2, as we already have a voyage ID and it is not listed in the data variable description. To summarise, here are our cleaning criteria:

- All the rows not containing data about the number of captives are removed;
- All the NaN values in columns with descriptive information (e.g. names) are substituted with “unknown”;
- The column VOYAGEID2 is removed from the DataFrame.
In this step we use: `print()`; the `pandas` attribute `.columns`; the `pandas` method `.drop()`; the `pandas` method `.head()`; the `pandas` method `.dropna()`; the `pandas` method `.fillna()`; the `pandas` attribute `.shape`; `len()`.
- `print(df.columns)`: we start by printing the names of our DataFrame columns, just as a reminder. We use the Python function `print()` and the `pandas` attribute `.columns` that we already used several times;
- `column_to_remove = "VOYAGEID2"` and `column_to_remove_nan = "Captives arrived at 1st port"`: we store the names of the columns we are interested in into two variables. This is not strictly necessary, as we could use the names of the columns directly as input for methods and functions. However, with this extra step our code will be more readable, as we are using two very descriptive variable names: `column_to_remove` will, indeed, contain the name of the column we want to remove from our DataFrame (VOYAGEID2), and `column_to_remove_nan` will contain the name of the column to use as a reference for filtering rows: all the rows having NaN in this column will be removed;
- `cleaned_df_step1 = df.drop(column_to_remove,axis=1)`: we use the method `.drop()` to remove a specific column. Each row of a `pandas` DataFrame has an index. This index is different from the positional index that every Python list has (an integer number starting from 0); it is an additional index that can be not only a number, but also a string or a label. The `pandas` method `.drop()` can be used either to remove columns, based on column names, or rows, based on the `pandas` index. To indicate that we want to target columns, and not rows, we need to set the parameter `axis` equal to 1 (0 would target the rows). Together with that, we also need to specify the name of the column we want to remove, stored in the variable `column_to_remove`. We apply the method `.drop()` to our "dirty" DataFrame `df` and store the result in the variable `cleaned_df_step1`;
- `cleaned_df_step1.head(5)`: after every filtering operation it is a good idea to check its effect on our result. Here we use the `pandas` method `.head()` to visualise the first five rows of our DataFrame `cleaned_df_step1`;
- `cleaned_df_step2 = cleaned_df_step1.dropna(subset=[column_to_remove_nan])`: we use the method `.dropna()` to filter out all the rows having NaN in a list of columns. The list of columns needs to be specified in the parameter `subset`. In this case we have a single column name, stored in the variable `column_to_remove_nan`, and to make it a list we put this variable between square brackets `[ ]`. We apply the method to the DataFrame `cleaned_df_step1` and store the result in `cleaned_df_step2`;
- `cleaned_df = cleaned_df_step2.fillna("unknown")`: we finally use the method `.fillna()` to fill all the NaNs in our DataFrame with a certain value, in this case the string (word) "unknown". We apply the method to the DataFrame `cleaned_df_step2` and store the result in `cleaned_df`;
- `print(cleaned_df.shape)`, `n_filtered_rows = len(df)-len(cleaned_df)` and `per_cent = (n_filtered_rows/len(df))*100`: this block of Python instructions checks the size of our filtered DataFrame `cleaned_df` and compares it with our original, raw DataFrame `df`. We first visualise the size of `cleaned_df` using the attribute `.shape`, then we simply obtain the number of rows of each DataFrame using the Python function `len()`. We subtract the number of rows of the DataFrames before and after filtering to check how many rows we filtered out, and we store the result in the variable `n_filtered_rows`. We finally compute the percentage of filtered rows compared to the initial size of the DataFrame.

```python
# Display the name of the columns first
print(df.columns)

# Select our target columns for cleaning the data
column_to_remove = 'VOYAGEID2'
column_to_remove_nan = 'Captives arrived at 1st port'

# Perform data cleaning, visualising the result step by step
# step 1: remove the column VOYAGEID2 from the DataFrame
cleaned_df_step1 = df.drop(column_to_remove,axis=1)
cleaned_df_step1.head(5)
```
```
Index(['Year of arrival at port of disembarkation', 'Voyage ID', 'Vessel name',
       'Voyage itinerary imputed port where began (ptdepimp) place',
       'Voyage itinerary imputed principal place of slave purchase (mjbyptimp) ',
       'Voyage itinerary imputed principal port of slave disembarkation (mjslptimp) place',
       'VOYAGEID2', 'Captives arrived at 1st port', "Captain's name"],
      dtype='object')
```
| | Year of arrival at port of disembarkation | Voyage ID | Vessel name | Voyage itinerary imputed port where began (ptdepimp) place | Voyage itinerary imputed principal place of slave purchase (mjbyptimp) | Voyage itinerary imputed principal port of slave disembarkation (mjslptimp) place | Captives arrived at 1st port | Captain's name |
|---|---|---|---|---|---|---|---|---|
0 | 1714.0 | 16109 | Freeke Gally | Bristol | NaN | Kingston | 283.0 | Neale, Alexander |
1 | 1713.0 | 16110 | Greyhound Gally | Bristol | NaN | Jamaica, place unspecified | NaN | Selkirk, Alexander<br/> Forrest, Henry |
2 | 1714.0 | 16111 | Jacob | Bristol | NaN | Kingston | 130.0 | Nicholls, Philip |
3 | 1714.0 | 16112 | Jason Gally | Bristol | NaN | Port Royal | 278.0 | Plummer, John |
4 | 1713.0 | 16113 | Lawford Gally | Bristol | Africa, port unspecified | Newcastle (Nevis) | NaN | Stretton, Joseph |
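As a side note, recent `pandas` versions also accept a `columns=` keyword in `.drop()`, which removes the need for `axis=1`; a minimal sketch (the variable name `cleaned_df_step1_alt` is ours, just for illustration):

```python
# Equivalent to df.drop(column_to_remove, axis=1)
cleaned_df_step1_alt = df.drop(columns=column_to_remove)
```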
```python
# step 2: remove all the rows having NaN in the "Captives arrived at 1st port" column
cleaned_df_step2 = cleaned_df_step1.dropna(subset=[column_to_remove_nan])
cleaned_df_step2.head(5)
```
| | Year of arrival at port of disembarkation | Voyage ID | Vessel name | Voyage itinerary imputed port where began (ptdepimp) place | Voyage itinerary imputed principal place of slave purchase (mjbyptimp) | Voyage itinerary imputed principal port of slave disembarkation (mjslptimp) place | Captives arrived at 1st port | Captain's name |
|---|---|---|---|---|---|---|---|---|
0 | 1714.0 | 16109 | Freeke Gally | Bristol | NaN | Kingston | 283.0 | Neale, Alexander |
2 | 1714.0 | 16111 | Jacob | Bristol | NaN | Kingston | 130.0 | Nicholls, Philip |
3 | 1714.0 | 16112 | Jason Gally | Bristol | NaN | Port Royal | 278.0 | Plummer, John |
5 | 1714.0 | 16114 | Mercy Gally | Bristol | Africa, port unspecified | Barbados, place unspecified | 190.0 | Scott, John |
6 | 1714.0 | 16115 | Mermaid Gally | Bristol | Cape Verde Islands | Kingston | 72.0 | Banbury, John<br/> Copinger, James |
```python
# step 3: change all the remaining NaNs into "unknown"
cleaned_df = cleaned_df_step2.fillna("unknown")
cleaned_df.head(5)
```
| | Year of arrival at port of disembarkation | Voyage ID | Vessel name | Voyage itinerary imputed port where began (ptdepimp) place | Voyage itinerary imputed principal place of slave purchase (mjbyptimp) | Voyage itinerary imputed principal port of slave disembarkation (mjslptimp) place | Captives arrived at 1st port | Captain's name |
|---|---|---|---|---|---|---|---|---|
0 | 1714.0 | 16109 | Freeke Gally | Bristol | unknown | Kingston | 283.0 | Neale, Alexander |
2 | 1714.0 | 16111 | Jacob | Bristol | unknown | Kingston | 130.0 | Nicholls, Philip |
3 | 1714.0 | 16112 | Jason Gally | Bristol | unknown | Port Royal | 278.0 | Plummer, John |
5 | 1714.0 | 16114 | Mercy Gally | Bristol | Africa, port unspecified | Barbados, place unspecified | 190.0 | Scott, John |
6 | 1714.0 | 16115 | Mermaid Gally | Bristol | Cape Verde Islands | Kingston | 72.0 | Banbury, John<br/> Copinger, James |
```python
# step 4: check how much data we filtered out
print(cleaned_df.shape)
n_filtered_rows = len(df)-len(cleaned_df)
per_cent = (n_filtered_rows/len(df))*100
print('We filtered out:',n_filtered_rows,', corresponding to about', round(per_cent), '% of our initial data')
```
```
(18408, 8)
We filtered out: 17743 , corresponding to about 49 % of our initial data
```
It seems that, because of our filtering, almost half of our data will be excluded from the analysis. This is quite a large percentage, and we may decide to re-think our filtering criteria to include more data. For example, we could substitute the missing values in the Captives column with an average number of captives per trip. For the purpose of our workshop, we will keep the current filtering criteria and keep our filtered DataFrame as it is.
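For reference, the three cleaning steps can also be written as a single chain of `pandas` methods; a minimal sketch reproducing the same criteria (`cleaned_df_chained` is a hypothetical name):

```python
cleaned_df_chained = (
    df.drop(columns=column_to_remove)          # step 1: drop VOYAGEID2
      .dropna(subset=[column_to_remove_nan])   # step 2: drop rows with NaN captives
      .fillna('unknown')                       # step 3: fill the remaining NaNs
)
print(cleaned_df_chained.shape)                # should match cleaned_df.shape
```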
At this point we have obtained a “clean” DataFrame, `cleaned_df`, containing 18408 rows with values organised in 8 columns. We can now start diving into the analysis of our DataFrame: we are ready to interrogate this dataset and see what kind of story it is going to tell us.
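Before moving on, a quick sanity check (a sketch using the same methods as above) can confirm that no NaNs are left in `cleaned_df`:

```python
# Count the NaNs per column, then add them up; the total should be 0
print(cleaned_df.isna().sum().sum())
```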