The first thing we need to do is load the data. This means opening the file where the data is currently stored and transferring that data here, into our working environment. Since we are working with Python in this Jupyter notebook environment, this means transferring all the data into a Python object. Which object? There are Python libraries (Python code written by other developers) that have been specifically designed for data analysis. One of these libraries, or (“Pythonically” speaking) packages, is called pandas. We will use one of the many pandas functions to read our .csv file (comma-separated values file) and we will store the information in a pandas DataFrame.
Task(s)
Loading data
What to do?
Locate the .csv file;
Have a look at it;
Transfer data to the Python working environment.
(Python) Tools
pandas package;
pandas function .read_csv();
Python function print();
Python function type()
Coding
import pandas as pd, we first import the pandas package into our working environment in order to use all its functionalities. To make our life easier, we assign the package an alias, a nickname, so that we do not need to write pandas.function_to_use() every time we need to use a pandas function. We just use the abbreviated form pd.function_to_use();
data_file = 'data/data.csv', we store the relative path of our data file as a string (between single or double quotes) in the Python variable data_file;
df = pd.read_csv(data_file), we use the function pd.read_csv() to read our data file and we store the result in the Python variable df (data frame);
print(type(df)), we first apply the Python function type() to the just-initialised variable df to check what its type is, then we print the result on the screen using the Python function print().
import pandas as pd
data_file = 'data/data.csv'
df = pd.read_csv(data_file)
print(type(df))
<class 'pandas.core.frame.DataFrame'>
We managed to transfer our data into a Python object, specifically a pandas.core.frame.DataFrame, or simply (from now on) a DataFrame. However, a lot of things can go wrong when going from one format to another, so it is a good idea to have a first look at the data.
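As a side note, if a .csv file does not load cleanly (for example because it uses a separator other than a comma, or a different text encoding), .read_csv() accepts optional parameters such as sep and encoding. The snippet below is only a sketch with made-up values, not something our data.csv actually needs:

# Hypothetical example: the file name, separator and encoding used here are
# assumptions for illustration, not properties of our actual data file.
df_other = pd.read_csv('data/some_other_file.csv', sep=';', encoding='latin-1')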
Task(s)
Have a first look at the data
What to do?
Visualise the first 10 rows of data, just to check that everything looks "ok";
(Python) Tools
pandas method .head();
Coding
df.head(10), calling the .head(10) method on the DataFrame df we visualise the first 10 rows of the DataFrame (we wrote 10, but you can use whatever number you want). This method, as a matter of fact, shows you only the "head", the beginning, of the DataFrame.
df.head(10)
| | Year of arrival at port of disembarkation | Voyage ID | Vessel name | Voyage itinerary imputed port where began (ptdepimp) place | Voyage itinerary imputed principal place of slave purchase (mjbyptimp) | Voyage itinerary imputed principal port of slave disembarkation (mjslptimp) place | VOYAGEID2 | Captives arrived at 1st port | Captain's name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1714.0 | 16109 | Freeke Gally | Bristol | NaN | Kingston | NaN | 283.0 | Neale, Alexander |
| 1 | 1713.0 | 16110 | Greyhound Gally | Bristol | NaN | Jamaica, place unspecified | NaN | NaN | Selkirk, Alexander<br/> Forrest, Henry |
| 2 | 1714.0 | 16111 | Jacob | Bristol | NaN | Kingston | NaN | 130.0 | Nicholls, Philip |
| 3 | 1714.0 | 16112 | Jason Gally | Bristol | NaN | Port Royal | NaN | 278.0 | Plummer, John |
| 4 | 1713.0 | 16113 | Lawford Gally | Bristol | Africa, port unspecified | Newcastle (Nevis) | NaN | NaN | Stretton, Joseph |
| 5 | 1714.0 | 16114 | Mercy Gally | Bristol | Africa, port unspecified | Barbados, place unspecified | NaN | 190.0 | Scott, John |
| 6 | 1714.0 | 16115 | Mermaid Gally | Bristol | Cape Verde Islands | Kingston | NaN | 72.0 | Banbury, John<br/> Copinger, James |
| 7 | 1713.0 | 16116 | Morning Star | Bristol | Africa, port unspecified | Charleston | NaN | NaN | Poole, Nicholas |
| 8 | 1714.0 | 16117 | Peterborough | Bristol | Africa, port unspecified | Barbados, place unspecified | NaN | 200.0 | Shawe, John<br/> Martin, Joseph |
| 9 | 1713.0 | 16118 | Resolution | Bristol | Gold Coast, port unspecified | Barbados, place unspecified | NaN | 255.0 | Williams, Charles |
Comparing what we see here with our .csv file, it seems that everything went well. We have the data organised in rows and columns. Each column has a name and each row an index. Looking at our data, some values are numbers, some are names and places, some contain HTML tags, and some are NaN. It is not yet time to run the data analysis: after loading the data we still need to correctly interpret the information it contains, then we need to “clean” it, and only after that can we finally proceed with some data analysis. This is just the beginning, but the best is yet to come!
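As an optional preview of those next steps, here is a minimal sketch of two quick checks that are often run at this point: .info() summarises the column names, data types and non-null counts, and .isna().sum() counts the missing (NaN) values per column. Nothing here is specific to our dataset; it is just a generic first look.

# Summarise column names, data types and non-null counts.
df.info()
# Count the missing (NaN) values in each column.
print(df.isna().sum())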