4 Exploring data

Previous steps

import pandas as pd
data_file = 'data/data.csv'
df = pd.read_csv(data_file)

df.head(5)

	Year of arrival at port of disembarkation	Voyage ID	Vessel name	Voyage itinerary imputed port where began (ptdepimp) place	Voyage itinerary imputed principal place of slave purchase (mjbyptimp)	Voyage itinerary imputed principal port of slave disembarkation (mjslptimp) place	VOYAGEID2	Captives arrived at 1st port	Captain's name
0	1714.0	16109	Freeke Gally	Bristol	NaN	Kingston	NaN	283.0	Neale, Alexander
1	1713.0	16110	Greyhound Gally	Bristol	NaN	Jamaica, place unspecified	NaN	NaN	Selkirk, Alexander<br/> Forrest, Henry
2	1714.0	16111	Jacob	Bristol	NaN	Kingston	NaN	130.0	Nicholls, Philip
3	1714.0	16112	Jason Gally	Bristol	NaN	Port Royal	NaN	278.0	Plummer, John
4	1713.0	16113	Lawford Gally	Bristol	Africa, port unspecified	Newcastle (Nevis)	NaN	NaN	Stretton, Joseph

Now that we have correctly loaded our data into our working environment, it is time to figure out what the data contains. It is always a good idea to look at the dataset documentation (or metadata) to understand where the data comes from, what the source of the different records is, how the data has been collected, and any other data-related caveats. Diving into the data documentation is up to you; in this chapter we want to understand as much as we can from the data itself, looking at its columns, rows, and values.
Every dataset tells a story. You may think of it as a person with long experience who is not really willing to talk (well, some datasets “talk” more easily than others). It is your role to “interrogate” the data, to let it talk and tell its story, and to dive into the details of that story, getting as much information as you can. This also depends on how much you need to know: will you be satisfied with a small “chat”, or do you need to know all kinds of details?
Let’s formulate some questions to begin with.

Question(s)

Question: How big is the data?

What to do?

Count the number of rows and columns and check the size of the data file on disk.

(Python) Tools

pandas attribute shape

Coding

df.shape, the attribute .shape contains the size of the DataFrame expressed in rows and columns. When displayed on the screen, it shows two numbers: the number of rows followed by the number of columns.

df.shape

(36151, 9)
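The attribute .shape only counts rows and columns. The sketch below shows one possible way to also check the size of a file on disk and the memory used by a DataFrame. It does not use our real data: toy and toy.csv are hypothetical stand-ins, created only so that the example can run on its own.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the real dataset: three rows, two columns
toy = pd.DataFrame({
    "Voyage ID": [16109, 16110, 16111],
    "Vessel name": ["Freeke Gally", "Greyhound Gally", "Jacob"],
})

# Write the toy table to a temporary CSV file and measure its size in bytes
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_file = os.path.join(tmp_dir, "toy.csv")
    toy.to_csv(toy_file, index=False)
    size_on_disk = os.path.getsize(toy_file)

# Memory used by the DataFrame once loaded;
# deep=True also counts the text stored in string columns
size_in_memory = toy.memory_usage(deep=True).sum()

n_rows, n_columns = toy.shape
print(f"{n_rows} rows and {n_columns} columns")
print(f"{size_on_disk} bytes on disk, {size_in_memory} bytes in memory")
```

With the real data, os.path.getsize(data_file) and df.memory_usage(deep=True).sum() should give the corresponding numbers for our file and our DataFrame.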

solution = 'Our DataFrame contains data distributed in 36151 rows and 9 columns. '
question_box(solution=solution)

Answer

Our DataFrame contains data distributed in 36151 rows and 9 columns.

It is quite a big dataset. Should we care about how big our dataset is? We should, as this may affect our analysis. For example, if we implement a scientific analysis that requires 1 second per row to produce an output, such a program would take about 10 hours (36151 seconds) to analyse the entire dataset, and that is something we should keep in mind. That is why, in general, it is a good idea to test large analysis programs on a small subset of the data and then, once we have verified that everything runs smoothly, to perform the analysis on the entire dataset.
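The estimate above, and the idea of testing on a subset, can be sketched as follows. The cost of 1 second per row is the assumption made in the text, toy is a hypothetical stand-in with the same number of rows as our DataFrame, and the subset size of 100 rows is an arbitrary choice.

```python
import pandas as pd

n_rows = 36151           # number of rows reported by df.shape
seconds_per_row = 1.0    # assumed cost of the analysis for a single row

# 3600 seconds in one hour
estimated_hours = n_rows * seconds_per_row / 3600
print(f"Estimated running time: {estimated_hours:.1f} hours")

# Hypothetical stand-in for df, with as many rows as the real DataFrame
toy = pd.DataFrame({"Voyage ID": range(n_rows)})

# Two simple ways of extracting a small subset for testing
first_rows = toy.head(100)                        # the first 100 rows
random_rows = toy.sample(n=100, random_state=0)   # 100 random rows, reproducible

print(first_rows.shape, random_rows.shape)
```

A random sample is often a safer test than the first rows, since the beginning of a file may not be representative of the rest.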

Let’s continue exploring our DataFrame. We have 9 columns; we saw them displayed in our notebook and, luckily enough, their names are pretty descriptive, so in this case it is quite intuitive to understand what kind of information they contain. It could be useful to store the column names inside a Python variable and to display them with a corresponding index (this will be useful later).

Task(s)

Display the DataFrame column names with an index

What to do?

Identify column names and assign them an index (starting from zero) depending on their order in the DataFrame (the first column index will be 0, the second 1, and so on).

(Python) Tools

pandas attribute .columns;
Python for loop;
Python function print().

Coding

column_names = df.columns, the DataFrame attribute .columns contains the column names of our DataFrame. We store these names into the variable column_names;
print(column_names), we use the function print() to print on the screen the content of the variable column_names;
```
i=0
print("Index ) Column name")
for name in column_names:
    print(i,")",name)
    i = i + 1
```
In the block above, we first initialise the Python variable i (that is, we assign it a value); this will be the first index. We then print the string "Index ) Column name" as a header for what we are going to print next. Finally, we use a for loop to go through the values contained in column_names. The for loop reads the values stored in column_names one by one and assigns them, one at a time, to the variable name. It then performs all the instructions "inside" the loop (the indented lines) and, once all the instructions are executed, it starts over with the next value in column_names until all the values have been explored. In our case, inside the loop we perform just two operations: 1) we print the current values of the variables i and name and 2) we increase the value of i by 1. Why do we increase i? Because we want to display the column names according to their position in the DataFrame. Our loop automatically updates the value of the variable name, but it does not increase the index i, so we have to do it explicitly.

Expert Coding

print("Index) Column name") 
for i,name in enumerate(column_names): 
    print(f"{i}) {name}")

You can replace the previous block of code with this one: it performs the same task (printing indices and column names) with fewer lines of code, because the built-in function enumerate() provides the index i together with each name.

column_names = df.columns
print(column_names)
i=0 
print("Index ) Column name") 
for name in column_names: 
    print(i,")",name) 
    i = i + 1

Index(['Year of arrival at port of disembarkation', 'Voyage ID', 'Vessel name',
       'Voyage itinerary imputed port where began (ptdepimp) place',
       'Voyage itinerary imputed principal place of slave purchase (mjbyptimp) ',
       'Voyage itinerary imputed principal port of slave disembarkation (mjslptimp) place',
       'VOYAGEID2', 'Captives arrived at 1st port', "Captain's name"],
      dtype='object')
Index ) Column name
0 ) Year of arrival at port of disembarkation
1 ) Voyage ID
2 ) Vessel name
3 ) Voyage itinerary imputed port where began (ptdepimp) place
4 ) Voyage itinerary imputed principal place of slave purchase (mjbyptimp) 
5 ) Voyage itinerary imputed principal port of slave disembarkation (mjslptimp) place
6 ) VOYAGEID2
7 ) Captives arrived at 1st port
8 ) Captain's name

Now we have the column names nicely listed from top to bottom, each with its corresponding index. You might be tempted to start the indexing from 1, but since in Python the first element of a list (or of any other sequence) has index 0, we started counting from zero. You can obtain the same result with fewer lines of code, try it out!

print("Index) Column name") 
for i,name in enumerate(column_names): 
    print(f"{i}) {name}")

Index) Column name
0) Year of arrival at port of disembarkation
1) Voyage ID
2) Vessel name
3) Voyage itinerary imputed port where began (ptdepimp) place
4) Voyage itinerary imputed principal place of slave purchase (mjbyptimp) 
5) Voyage itinerary imputed principal port of slave disembarkation (mjslptimp) place
6) VOYAGEID2
7) Captives arrived at 1st port
8) Captain's name
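Why is the index useful? A possible use, sketched below, is to retrieve a column name, or the column itself, from its position. The DataFrame toy is a hypothetical stand-in containing only the first three columns of our data, and .iloc[] is the pandas indexer that selects rows and columns by position.

```python
import pandas as pd

# Hypothetical stand-in for df with the first three original columns
toy = pd.DataFrame({
    "Year of arrival at port of disembarkation": [1714.0, 1713.0],
    "Voyage ID": [16109, 16110],
    "Vessel name": ["Freeke Gally", "Greyhound Gally"],
})
column_names = toy.columns

# The index printed next to a column name retrieves that name...
second_name = column_names[1]
print(second_name)

# ...and selecting by name or by position returns the same column
by_name = toy[second_name]
by_position = toy.iloc[:, 1]
print(by_name.equals(by_position))
```

Selecting by position saves us from typing long column names, at the price of code that silently selects a different column if the column order ever changes.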

It is now time to figure out what the rows are about. Looking at the column names, we notice that the second one (index 1) is called “Voyage ID”. This suggests that the column contains a specific identifier for each ship voyage, implying that each row contains information about a single trip. To verify that each row corresponds to a single voyage, we need to check whether all the values of the Voyage ID column are different, i.e. whether they are unique.

Question(s)

Are the values of the Voyage ID column unique?

What to do?

Select the column Voyage ID, go through all its 36151 values and check if there are repetitions.

(Python) Tools

pandas position-based indexer .iloc[];
pandas attribute .is_unique;
Python function print()

Coding

voyage_id = df.iloc[:,1], we use the indexer .iloc[] to select the second column of our DataFrame and store it in the variable voyage_id. Inside the square brackets of .iloc[] we can specify the rows and columns to select in the form [selected_rows,selected_columns]. In this case, : means that we select ALL the rows, so df.iloc[:,1] selects all the rows of the column with index 1 (the second column);
print(voyage_id.is_unique), we print on the screen the pandas attribute .is_unique. This attribute is True if all the values of voyage_id are unique, False otherwise.

Expert Coding

print(df.iloc[:,1].is_unique), indexers, attributes, and methods can be chained one after another in a single line of code, so the intermediate variable voyage_id is not needed.

voyage_id = df.iloc[:,1]
print(voyage_id.is_unique)

True
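In our case the answer is True, so there is nothing more to investigate. Had it been False, we would probably want to know how many values are repeated and which ones. A minimal sketch of how this could be done is shown below, using a hypothetical Series in which the identifier 16110 is repeated on purpose.

```python
import pandas as pd

# Hypothetical identifiers: 16110 appears twice on purpose
voyage_id = pd.Series([16109, 16110, 16111, 16110], name="Voyage ID")

# The check used in this chapter
print(voyage_id.is_unique)

# Equivalent check: compare the number of distinct values with the number of values
print(voyage_id.nunique(), len(voyage_id))

# .duplicated() marks every repetition after the first occurrence of a value
n_repeated = voyage_id.duplicated().sum()
print(n_repeated)

# keep=False marks all the occurrences of a repeated value, so we can inspect them
repeated_rows = voyage_id[voyage_id.duplicated(keep=False)]
print(repeated_rows)
```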

We verified that all the values of the Voyage ID column are unique; this means that each row of our DataFrame refers to a single ship voyage. Looking at the other columns, we also notice that they provide the port where the voyage began, the place where slaves were purchased, and the port where slaves were disembarked.
Looking in particular at the fifth column (index 4, “Voyage itinerary imputed principal place of slave purchase”), we notice it contains several NaNs. NaN stands for “Not a Number”; it is a special value that typically appears when the result of an operation is undefined, and that pandas also uses to mark missing values. If something went wrong, why did our program not stop or tell us that a problem occurred? Because problems may happen more often than you think, and if our program stopped working every time it encountered a situation it cannot handle, it would most probably never finish running! In this case, most probably the record does not exist, so the dataset cell has been filled with NaN, either in our original .csv file or by the pandas function read_csv(). NaNs are not necessarily something bad, as they can be easily identified and eventually corrected (or simply ignored). Incorrect data, or missing data that is not marked as such, may be much harder to spot and correct.
In any case, the presence of NaNs or any other missing values can severely affect our data analysis; for this reason, before we start analysing the data we need to find and get rid of those values. This process is usually called “data cleaning”, and that is exactly what we are going to do in the next chapter.
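As a small preview, the sketch below shows how empty cells of a .csv file become NaN when the file is read, and how they can be counted column by column. The text in csv_text is a hypothetical three-row example with shortened column names, not our real file.

```python
import io

import pandas as pd

# Hypothetical CSV content: two voyages have no place of purchase,
# one has no number of captives
csv_text = (
    "Voyage ID,Place of purchase,Captives arrived\n"
    "16109,,283\n"
    '16113,"Africa, port unspecified",\n'
    "16111,,130\n"
)

# io.StringIO lets read_csv treat the string as if it were a file
toy = pd.read_csv(io.StringIO(csv_text))
print(toy)

# .isna() marks each missing cell with True; .sum() counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# A numeric column containing NaN is stored as floating-point numbers
print(toy["Captives arrived"].dtype)
```

The last line also hints at why the numbers of captives in our DataFrame are displayed as 283.0 rather than 283: a column of whole numbers with missing values is read as floating-point numbers by default.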