2  Reading data

The first thing we need to do is load the data. This means opening the file where the data is currently stored and transferring that data here, into our working environment. As we are working with Python in this Jupyter notebook environment, this means transferring all the data into a Python object. Which object? There are Python libraries (Python code written by other developers) that have been specifically designed for data analysis. One of these libraries, or (“Pythonically” speaking) packages, is called pandas. We will use one of the many pandas functions to read our .csv (comma-separated values) file and store the information in a pandas DataFrame.

Task(s)

Loading data

What to do?

  • Locate the .csv file;
  • Have a look at it;
  • Transfer data to the Python working environment.

(Python) Tools

  • pandas package;
  • pandas function read_csv();
  • Python function print();
  • Python function type().

Coding

  • import pandas as pd: we first import the pandas package into our working environment so that we can use its functionality. To make our life easier, we give the package an alias (a nickname), so that we do not need to write pandas.function_to_use() every time we call a pandas function. The abbreviated form pd.function_to_use() is enough;
  • data_file = 'data/data.csv': we store the relative path of our data file as a string (between single or double quotes) in the Python variable data_file;
  • df = pd.read_csv(data_file): we use the pandas function read_csv() to read our data file and we store the result in the Python variable df (short for "data frame");
  • print(type(df)): we apply the Python function type() to the newly created variable df to check its type, and we print the result on the screen using the Python function print().
import pandas as pd
data_file = 'data/data.csv'
df = pd.read_csv(data_file)
print(type(df))
<class 'pandas.core.frame.DataFrame'>
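
The "Have a look at it" step can also be done from inside the notebook, before pandas parses anything. The following is only a minimal sketch: the helper name peek() is hypothetical, the stand-in file holds a few values taken from the data set (reduced to three columns) so that the example runs anywhere, and UTF-8 encoding is an assumption about the real file.

```python
import tempfile
from itertools import islice
from pathlib import Path


def peek(path, n=5):
    """Return the first n raw lines of a text file, without parsing them."""
    # Assumption: the file is UTF-8 encoded; adjust if yours is not.
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


# Stand-in file so the sketch is self-contained; in the notebook you
# would call peek(data_file) on 'data/data.csv' instead.
sample_path = Path(tempfile.mkdtemp()) / "sample.csv"
sample_path.write_text(
    "Voyage ID,Vessel name,Captives arrived at 1st port\n"
    "16109,Freeke Gally,283.0\n"
    "16110,Greyhound Gally,\n"
    "16111,Jacob,130.0\n",
    encoding="utf-8",
)

raw_lines = peek(sample_path, n=2)
for line in raw_lines:
    print(line)
```

Reading only a handful of lines is enough to see the column names and the separator, and it stays fast even when the file is large.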

We managed to transfer our data into a Python object, specifically a pandas.core.frame.DataFrame, or simply (from now on) a DataFrame. However, a lot of things can go wrong when going from one format to another, so it is a good idea to have a first look at the data.

Task(s)

Have a first look at the data

What to do?

Display the first 10 rows of the data, just to check that everything looks "ok";

(Python) Tools

pandas method .head();

Coding

df.head(10): by calling the .head(10) method on the DataFrame df we display the first 10 rows of the DataFrame (we wrote 10, but you can use whatever number you want). As its name suggests, this method shows you only the "head", the beginning, of the DataFrame.
df.head(10)
|   | Year of arrival at port of disembarkation | Voyage ID | Vessel name | Voyage itinerary imputed port where began (ptdepimp) place | Voyage itinerary imputed principal place of slave purchase (mjbyptimp) | Voyage itinerary imputed principal port of slave disembarkation (mjslptimp) place | VOYAGEID2 | Captives arrived at 1st port | Captain's name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1714.0 | 16109 | Freeke Gally | Bristol | NaN | Kingston | NaN | 283.0 | Neale, Alexander |
| 1 | 1713.0 | 16110 | Greyhound Gally | Bristol | NaN | Jamaica, place unspecified | NaN | NaN | Selkirk, Alexander<br/> Forrest, Henry |
| 2 | 1714.0 | 16111 | Jacob | Bristol | NaN | Kingston | NaN | 130.0 | Nicholls, Philip |
| 3 | 1714.0 | 16112 | Jason Gally | Bristol | NaN | Port Royal | NaN | 278.0 | Plummer, John |
| 4 | 1713.0 | 16113 | Lawford Gally | Bristol | Africa, port unspecified | Newcastle (Nevis) | NaN | NaN | Stretton, Joseph |
| 5 | 1714.0 | 16114 | Mercy Gally | Bristol | Africa, port unspecified | Barbados, place unspecified | NaN | 190.0 | Scott, John |
| 6 | 1714.0 | 16115 | Mermaid Gally | Bristol | Cape Verde Islands | Kingston | NaN | 72.0 | Banbury, John<br/> Copinger, James |
| 7 | 1713.0 | 16116 | Morning Star | Bristol | Africa, port unspecified | Charleston | NaN | NaN | Poole, Nicholas |
| 8 | 1714.0 | 16117 | Peterborough | Bristol | Africa, port unspecified | Barbados, place unspecified | NaN | 200.0 | Shawe, John<br/> Martin, Joseph |
| 9 | 1713.0 | 16118 | Resolution | Bristol | Gold Coast, port unspecified | Barbados, place unspecified | NaN | 255.0 | Williams, Charles |
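
To illustrate how the number passed to .head() changes the result, here is a small sketch on a stand-in DataFrame. The variable name demo is hypothetical and the table holds only two columns taken from the preview; in the notebook you would call the same methods on df. The .tail() method, which mirrors .head() at the other end of the table, is shown for comparison.

```python
import pandas as pd

# Stand-in DataFrame with two columns from the preview; in the
# notebook, df already holds the full table.
demo = pd.DataFrame({
    "Voyage ID": [16109, 16110, 16111, 16112, 16113, 16114, 16115],
    "Vessel name": [
        "Freeke Gally", "Greyhound Gally", "Jacob", "Jason Gally",
        "Lawford Gally", "Mercy Gally", "Mermaid Gally",
    ],
})

first_default = demo.head()   # no argument: pandas returns the first 5 rows
first_two = demo.head(2)      # only the first 2 rows
last_two = demo.tail(2)       # the last 2 rows, the "tail" of the table

print(first_two)
print(last_two)
```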

Comparing what we see here with our .csv file, it seems that everything went well. We have the data organised in rows and columns. Each column has a name and each row an index. Looking at our data, some values are numbers, some are names and places, some contain HTML tags, and some are NaN. It is not yet time to run data analysis: after loading the data we still need to interpret the information it contains correctly, then we need to “clean” it, and only after that can we proceed with some data analysis. This is just the beginning, but the best is yet to come!
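
The observations above (numbers, text, HTML tags, NaN) can also be checked programmatically rather than by eye. The sketch below is one possible way to do it, not part of the original workflow: it rebuilds three rows of the preview, reduced to three columns, from an in-memory string so that it runs without the data file. The names sample, missing_per_column and has_tag are hypothetical; in the notebook the same calls would be applied to df.

```python
import io

import pandas as pd

# Three rows copied from the preview, reduced to three columns.
# Quoted fields keep the commas inside the captains' names intact.
sample_csv = io.StringIO('''Voyage ID,Captives arrived at 1st port,Captain's name
16109,283.0,"Neale, Alexander"
16110,,"Selkirk, Alexander<br/> Forrest, Henry"
16111,130.0,"Nicholls, Philip"
''')
sample = pd.read_csv(sample_csv)

# Size of the table as (rows, columns).
print(sample.shape)

# Type pandas inferred for each column.
print(sample.dtypes)

# Number of missing values (NaN) in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Rows whose captain field still contains the literal tag "<br/>".
has_tag = sample["Captain's name"].str.contains("<br/>", regex=False)
print(sample.loc[has_tag, "Voyage ID"].tolist())
```

Checks like these give a first idea of how much "cleaning" the data will need before any analysis.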