The first thing we need to do is load the data. This means opening the file where the data is currently stored and transferring that data here, into our working environment. Since we are working with Python in this Jupyter notebook environment, this means transferring all the data into a Python object. Which object? There are Python libraries (Python code written by other developers) that have been specifically designed for data analysis. One of these libraries, or (“Pythonically” speaking) packages, is called pandas. We will use one of the many pandas functions to read our .csv file (comma-separated values file) and we will store the information in a pandas DataFrame.
Task(s)
Loading data
What to do?
Locate the .csv file;
Have a look at it;
Transfer data to the Python working environment.
(Python) Tools
pandas package;
pandas function .read_csv();
Python function print();
Python function type()
Coding
import pandas as pd, we first import the pandas package into our working environment in order to use all its functionalities. To make our life easier, we assign the package an alias, a nickname, so that we do not need to write pandas.function_to_use() every time we need to use a pandas function. We just use the abbreviated form pd.function_to_use();
data_file = 'data/data.csv', we store the relative path of our data file as a string (between single or double quotes) in the Python variable data_file;
df = pd.read_csv(data_file), we use the function pd.read_csv() to read our data file and we store the result in the Python variable df (data frame);
print(type(df)), we first apply the Python function type() to the just-initialised variable df to check what its type is, then we print the result on the screen using the Python function print().
import pandas as pd
data_file = 'data/data.csv'
df = pd.read_csv(data_file)
print(type(df))
<class 'pandas.core.frame.DataFrame'>
We managed to transfer our data into a Python object, specifically a pandas.core.frame.DataFrame, or simply (from now on) a DataFrame. However, a lot of things can go wrong when going from one format to another, so it is a good idea to have a first look at the data.
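As a side note, if a .csv file does not load cleanly (for example because it uses a separator other than a comma, or a different text encoding), .read_csv() accepts optional parameters such as sep and encoding. The snippet below is only a sketch with made-up values, not something our data.csv actually needs:

# Hypothetical example: the file name, separator and encoding used here are
# assumptions for illustration, not properties of our actual data file.
df_other = pd.read_csv('data/some_other_file.csv', sep=';', encoding='latin-1')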
Task(s)
Have a first look at the data
What to do?
Visualise the first 10 rows of data, just to check that everything looks "ok";
(Python) Tools
pandas method .head();
Coding
df.head(10), calling the .head(10) method on the DataFrame df we visualise the first 10 rows of the DataFrame (we wrote 10, but you can use whatever number you want). This method, as a matter of fact, shows you only the "head", the beginning, of the DataFrame.
df.head(10)
| | Year of arrival at port of disembarkation | Voyage ID | Vessel name | Voyage itinerary imputed port where began (ptdepimp) place | Voyage itinerary imputed principal place of slave purchase (mjbyptimp) | Voyage itinerary imputed principal port of slave disembarkation (mjslptimp) place | VOYAGEID2 | Captives arrived at 1st port | Captain's name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1714.0 | 16109 | Freeke Gally | Bristol | NaN | Kingston | NaN | 283.0 | Neale, Alexander |
| 1 | 1713.0 | 16110 | Greyhound Gally | Bristol | NaN | Jamaica, place unspecified | NaN | NaN | Selkirk, Alexander<br/> Forrest, Henry |
| 2 | 1714.0 | 16111 | Jacob | Bristol | NaN | Kingston | NaN | 130.0 | Nicholls, Philip |
| 3 | 1714.0 | 16112 | Jason Gally | Bristol | NaN | Port Royal | NaN | 278.0 | Plummer, John |
| 4 | 1713.0 | 16113 | Lawford Gally | Bristol | Africa, port unspecified | Newcastle (Nevis) | NaN | NaN | Stretton, Joseph |
| 5 | 1714.0 | 16114 | Mercy Gally | Bristol | Africa, port unspecified | Barbados, place unspecified | NaN | 190.0 | Scott, John |
| 6 | 1714.0 | 16115 | Mermaid Gally | Bristol | Cape Verde Islands | Kingston | NaN | 72.0 | Banbury, John<br/> Copinger, James |
| 7 | 1713.0 | 16116 | Morning Star | Bristol | Africa, port unspecified | Charleston | NaN | NaN | Poole, Nicholas |
| 8 | 1714.0 | 16117 | Peterborough | Bristol | Africa, port unspecified | Barbados, place unspecified | NaN | 200.0 | Shawe, John<br/> Martin, Joseph |
| 9 | 1713.0 | 16118 | Resolution | Bristol | Gold Coast, port unspecified | Barbados, place unspecified | NaN | 255.0 | Williams, Charles |
Comparing what we see here with our .csv file, it seems that everything went well. We have the data organised in rows and columns. Each column has a name and each row an index. Looking at our data, some values are numbers, some are names and places, some contain HTML tags, and some are NaN. It is not yet time to run the data analysis: after loading the data we still need to correctly interpret the information it contains, then we need to “clean” it, and only after that can we finally proceed with some data analysis. This is just the beginning, but the best is yet to come!
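As an optional preview of those next steps, here is a minimal sketch of two quick checks that are often run at this point: .info() summarises the column names, data types and non-null counts, and .isna().sum() counts the missing (NaN) values per column. Nothing here is specific to our dataset; it is just a generic first look.

# Summarise column names, data types and non-null counts.
df.info()
# Count the missing (NaN) values in each column.
print(df.isna().sum())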