import pandas as pd
data_file = 'data/data.csv'
df = pd.read_csv(data_file)
4 Exploring data
Previous steps
df.head(5)
 | Year of arrival at port of disembarkation | Voyage ID | Vessel name | Voyage itinerary imputed port where began (ptdepimp) place | Voyage itinerary imputed principal place of slave purchase (mjbyptimp) | Voyage itinerary imputed principal port of slave disembarkation (mjslptimp) place | VOYAGEID2 | Captives arrived at 1st port | Captain's name |
---|---|---|---|---|---|---|---|---|---|
0 | 1714.0 | 16109 | Freeke Gally | Bristol | NaN | Kingston | NaN | 283.0 | Neale, Alexander |
1 | 1713.0 | 16110 | Greyhound Gally | Bristol | NaN | Jamaica, place unspecified | NaN | NaN | Selkirk, Alexander<br/> Forrest, Henry |
2 | 1714.0 | 16111 | Jacob | Bristol | NaN | Kingston | NaN | 130.0 | Nicholls, Philip |
3 | 1714.0 | 16112 | Jason Gally | Bristol | NaN | Port Royal | NaN | 278.0 | Plummer, John |
4 | 1713.0 | 16113 | Lawford Gally | Bristol | Africa, port unspecified | Newcastle (Nevis) | NaN | NaN | Stretton, Joseph |
Now that we have correctly loaded our data into our working environment, it is time to figure out what the data contains. It is always a good idea to look at the dataset documentation (or metadata) to understand where the data comes from, what the source of each record is, how the data was collected, and any other data-related caveats. Diving into the data documentation is up to you; in this chapter we want to understand as much as we can from the data itself, looking at its columns, rows, and values.
Every dataset tells a story. You may think of it as a person with long experience who is not really willing to talk (well, some datasets “talk” more easily than others). It is your role to “interrogate” the data, to get it to talk, to tell a story, and to dive into the details of that story, getting as much information as you can. This also depends on how much you need to know: will you be satisfied with a short “chat”, or do you need to know all kinds of details?
Let’s formulate some questions to begin with.

Question(s)

What to do?

(Python) Tools
pandas attribute .shape

Coding
df.shape: the attribute .shape contains the size of the DataFrame expressed in rows and columns. When printed on the screen, it displays two numbers: the number of rows and the number of columns.
df.shape
(36151, 9)
solution = 'Our DataFrame contains data distributed in 36151 rows and 9 columns. '
question_box(solution=solution)

Answer
It is quite a big dataset. Should we care about how big our dataset is? We should, as this may affect our analysis. For example, if we implement a scientific analysis that requires 1 second per row to produce an output, such a program would take about 10 hours (36151 seconds) to analyse the entire dataset, and that is something we should keep in mind. That is why, in general, it is a good idea to test large analysis programs on a small subset of data and then, once we have verified that everything runs smoothly, to perform the analysis on the entire dataset.
Let’s continue exploring our DataFrame. We have 9 columns. We saw them displayed in our notebook and, luckily enough, their names are pretty descriptive, so in this case it is quite intuitive to understand what kind of information they contain. It could be useful to store the column names inside a Python variable and to display their names with a corresponding index (this will be useful later).
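The run-time estimate can be reproduced directly from .shape, and the same snippet shows one way to carve out a small test subset. This is only a minimal sketch: the toy DataFrame, the cost of 1 second per row, and the subset size of 100 rows are illustrative assumptions, not part of the real dataset or of any actual analysis.

```python
import pandas as pd

# Toy stand-in for the real DataFrame (assumption: same number of rows, one dummy column)
df = pd.DataFrame({"Voyage ID": range(36151)})

# .shape is a tuple, so it can be unpacked into two variables
n_rows, n_cols = df.shape

# Assumed cost of a hypothetical analysis: 1 second per row
seconds_per_row = 1
estimated_hours = n_rows * seconds_per_row / 3600
print(f"Estimated run time: {estimated_hours:.1f} hours")

# Test the analysis on a small subset first (here, the first 100 rows)
df_test = df.head(100)
print(df_test.shape)
```

Using .head() keeps the first rows in their original order; if the first rows are not representative, a random subset may be a better test.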

Task(s)

What to do?

(Python) Tools
- pandas attribute .columns;
- Python for loop;
- Python function print().

Coding
- column_names = df.columns: the DataFrame attribute .columns contains the column names of our DataFrame. We store these names in the variable column_names;
- print(column_names): we use the function print() to print on the screen the content of the variable column_names;
- i = 0, print("Index ) Column name"), and the for loop over column_names: we first initialise (assign a value to) the Python variable i; this will correspond to the first index. We then print the string "Index ) Column name" as a description of what we are going to print later. We finally use a for loop to scroll through the values contained in column_names. The for loop reads the values stored in column_names one by one and assigns them, one at a time, to the variable name. It then performs all the instructions "inside" the loop (the indented lines) and, once all the instructions are executed, it starts all over again with the next value in column_names until all the values have been explored. In our case, inside the loop we perform just two operations: 1) we print the current values of the variables i and name, and 2) we increase the value of i by 1. Why do we increase i? Because we want to display the different column names according to their position in the DataFrame. Our loop automatically updates the value of the variable name, but it does not increase the index i by one step, so we have to do it explicitly.

Expert Coding
print("Index) Column name") followed by for i,name in enumerate(column_names): print(f"{i}) {name}"): you can substitute the previous block of code with this one; it performs the same tasks (printing indices and column names) with fewer lines of code.
column_names = df.columns
print(column_names)
i = 0
print("Index ) Column name")
for name in column_names:
    print(i,")",name)
    i = i + 1
Index(['Year of arrival at port of disembarkation', 'Voyage ID', 'Vessel name',
'Voyage itinerary imputed port where began (ptdepimp) place',
'Voyage itinerary imputed principal place of slave purchase (mjbyptimp) ',
'Voyage itinerary imputed principal port of slave disembarkation (mjslptimp) place',
'VOYAGEID2', 'Captives arrived at 1st port', 'Captain's name'],
dtype='object')
Index ) Column name
0 ) Year of arrival at port of disembarkation
1 ) Voyage ID
2 ) Vessel name
3 ) Voyage itinerary imputed port where began (ptdepimp) place
4 ) Voyage itinerary imputed principal place of slave purchase (mjbyptimp)
5 ) Voyage itinerary imputed principal port of slave disembarkation (mjslptimp) place
6 ) VOYAGEID2
7 ) Captives arrived at 1st port
8 ) Captain's name
Now we have the column names nicely listed from top to bottom, each with its corresponding index. You might be tempted to start the indexing from 1, but since in Python the first element of a list (or of any other sequence) has index 0, we started counting from zero. You can obtain the same result with fewer lines of code. Try it out!
print("Index) Column name")
for i,name in enumerate(column_names):
    print(f"{i}) {name}")
Index) Column name
0) Year of arrival at port of disembarkation
1) Voyage ID
2) Vessel name
3) Voyage itinerary imputed port where began (ptdepimp) place
4) Voyage itinerary imputed principal place of slave purchase (mjbyptimp)
5) Voyage itinerary imputed principal port of slave disembarkation (mjslptimp) place
6) VOYAGEID2
7) Captives arrived at 1st port
8) Captain's name
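Once each column has an index, it is possible to move between a column's position and its name in both directions. The sketch below illustrates this on a toy DataFrame that reuses three of the column names above, with values copied from the first row of the table; the variable names second_name, position, and index_to_name are only illustrative.

```python
import pandas as pd

# Toy DataFrame with three of the column names used in this chapter
df = pd.DataFrame({
    "Year of arrival at port of disembarkation": [1714.0],
    "Voyage ID": [16109],
    "Vessel name": ["Freeke Gally"],
})
column_names = df.columns

# From position to name: an Index can be indexed like a list
second_name = column_names[1]
print(second_name)

# From name to position: Index.get_loc() returns the integer position
position = column_names.get_loc("Vessel name")
print(position)

# enumerate() can also build a position -> name dictionary in one step
index_to_name = dict(enumerate(column_names))
print(index_to_name[0])
```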
It is now time to figure out what the rows are about. Looking at the column names, we notice that the second one (index 1) is called “Voyage ID”. This suggests that the column contains a specific identifier for each ship voyage, implying that each row contains information about a single trip. To verify that each row corresponds to a single voyage, we need to check whether all the values of the Voyage ID column are different, i.e. whether they are unique.

Question(s)

What to do?

(Python) Tools
- pandas column selector .iloc[];
- pandas attribute .is_unique;
- Python function print().

Coding
- voyage_id = df.iloc[:,1]: we apply the selector .iloc[] to select the second column of our DataFrame and store it in the variable voyage_id. Inside the square brackets of .iloc[] we can specify the rows and columns to select in the form [selected_rows,selected_columns]. In this case, : means that we select ALL the rows, so df.iloc[:,1] selects all the rows of the column with index 1 (the second column);
- print(voyage_id.is_unique): we print on the screen the pandas attribute .is_unique. This attribute is True if all the values of voyage_id are unique, False otherwise.

Expert Coding
print(df.iloc[:,1].is_unique): methods, attributes, and functions can be applied one after another in a single line of code.
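To make the [selected_rows,selected_columns] form of .iloc[] more concrete, here is a small sketch on a toy DataFrame. The values are copied from the first rows of the table shown earlier; the shortened column name "Year" and the variable names first_vessel and top_left are assumptions made for the example.

```python
import pandas as pd

# Toy DataFrame (values copied from the first rows shown earlier in the chapter)
df = pd.DataFrame({
    "Year": [1714.0, 1713.0, 1714.0],
    "Voyage ID": [16109, 16110, 16111],
    "Vessel name": ["Freeke Gally", "Greyhound Gally", "Jacob"],
})

# All rows, column with index 1 -> a pandas Series
voyage_id = df.iloc[:, 1]

# Row with index 0, column with index 2 -> a single value
first_vessel = df.iloc[0, 2]

# First two rows, first two columns -> a smaller DataFrame
top_left = df.iloc[0:2, 0:2]

print(voyage_id.tolist())
print(first_vessel)
print(top_left.shape)
```

As with Python lists, a slice such as 0:2 includes position 0 and excludes position 2.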
voyage_id = df.iloc[:,1]
print(voyage_id.is_unique)
True
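In our case the answer is True, but it helps to see what the check looks like when it fails. The following sketch uses a made-up column in which one identifier is repeated (this is not the case in the real data) and shows two related pandas tools, .nunique() and .duplicated(), that can help locate the repeated values.

```python
import pandas as pd

# Toy column in which the ID 16110 appears twice (made-up data)
voyage_id = pd.Series([16109, 16110, 16110, 16111], name="Voyage ID")

print(voyage_id.is_unique)   # False: one value is repeated
print(voyage_id.nunique())   # number of distinct values
print(len(voyage_id))        # number of rows

# duplicated(keep=False) flags every occurrence of a repeated value
repeated = voyage_id[voyage_id.duplicated(keep=False)]
print(repeated.tolist())
```

When the number of distinct values is smaller than the number of rows, at least one identifier is shared by several rows.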
We verified that all the values of the Voyage ID column are unique; this means that each row of our DataFrame refers to a single ship voyage. Looking at the other columns, we also notice that they provide information about where the voyage began, the port where slaves were purchased, and the port where slaves were disembarked.
Looking in particular at the fifth column (index 4, “Voyage itinerary imputed principal place of slave purchase”), we notice it contains several NaNs. NaN stands for “Not a Number”. It is a special value that can appear when a numerical operation has no defined result, and pandas also uses it to mark missing values. If something is missing or went wrong, why did our program not stop or tell us that a problem occurred? Because problems may happen more often than you think, and if our program stopped working every time it encountered a situation it cannot handle, it would most probably never finish running! In this case, most probably the record does not exist, so the dataset cell has been filled with NaN, either in our original .csv file or by the pandas function read_csv(). NaNs are not necessarily something bad, as they can be easily identified and eventually corrected (or simply ignored). Incorrect or missing data may be much harder to spot and correct.
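As a small illustration of how empty cells turn into NaN and how they can be counted, consider the sketch below. The CSV text is a made-up excerpt with shortened, hypothetical column names (voyage_id, place_of_purchase, captives), read from a string instead of a file so that the example is self-contained.

```python
import io
import pandas as pd

# Small CSV with empty cells (made-up excerpt, shortened column names)
csv_text = """voyage_id,place_of_purchase,captives
16109,,283
16110,,
16113,"Africa, port unspecified",
"""

df = pd.read_csv(io.StringIO(csv_text))

# Count the missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# NaN is never equal to itself, so use isna() rather than == to find it
value = df.loc[0, "place_of_purchase"]
print(value == value)   # False
print(pd.isna(value))   # True
```

Counting the missing values per column in this way gives a first impression of how much cleaning a dataset is likely to need.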
In any case, the presence of NaNs or any other missing values can severely affect our data analysis. For this reason, before we start analysing the data we need to find and get rid of those values. This process is usually called “data cleaning”, and that is exactly what we are going to do in the next chapter.