DataFrames

Basics

The DataFrame is the star of the pandas package; many of our pandas guides are simply building blocks for understanding DataFrames.

The standard workflow with DataFrames is to read a file into a variable, take a glimpse at its contents, and then use a wide variety of methods to manipulate the data toward whatever goal you have.


Viewing Data

Since we have another page devoted to reading .csv data, we’ll start by creating a simple DataFrame for analysis:

import pandas as pd

fruits = pd.DataFrame([[1, 'pear'], [2, 'apple'], [3, 'orange']], columns=['count', 'fruit'])


The head() function lets us look at the first n rows of the DataFrame, with the default being 5. head() is one of the best ways to get a feeling for the data you’ll be analyzing. In our case, fruits has fewer than 5 rows, so the whole DataFrame will be displayed.

Unlike with R’s data.frames, you usually don’t pass a pandas DataFrame as a parameter to a function. Instead, you write the DataFrame’s name, then ".", then the function you want to call on it.

print(fruits.head())
   count   fruit
0      1    pear
1      2   apple
2      3  orange
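
As a small sketch of the n parameter described above, the snippet below rebuilds the same fruits DataFrame and asks for only the first two rows. The variable names first_two and last_one are illustrative, and tail() is included only as the mirror image of head(); it is not covered elsewhere on this page.

```python
import pandas as pd

fruits = pd.DataFrame([[1, 'pear'], [2, 'apple'], [3, 'orange']], columns=['count', 'fruit'])

# Passing n explicitly overrides the default of 5.
first_two = fruits.head(2)
print(first_two)

# tail() works the same way, but counts from the end of the DataFrame.
last_one = fruits.tail(1)
print(last_one)
```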


shape

In most cases, your DataFrame will be more than 5 rows long (otherwise it wouldn’t be very useful), and sometimes there are so many columns that standard output will not include all of them. We can use the shape attribute to see how many rows and columns we have:

print(fruits.shape)
(3, 2)

Functions have parentheses after their name, while attributes do not. There isn’t a great mnemonic for knowing which are attributes and which are functions, but there is always the official documentation (or simply trial and error).
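
One hedged way to do the "trial and error" without triggering an error is Python’s built-in callable(), which reports whether a name needs parentheses. The sketch below rebuilds the fruits DataFrame from earlier and also shows that shape, being a plain tuple, can be unpacked; n_rows and n_cols are illustrative names.

```python
import pandas as pd

fruits = pd.DataFrame([[1, 'pear'], [2, 'apple'], [3, 'orange']], columns=['count', 'fruit'])

# shape is an attribute: a plain tuple that can be unpacked directly.
n_rows, n_cols = fruits.shape
print(n_rows, n_cols)

# callable() is True for functions (use parentheses) and False for attributes.
head_is_callable = callable(fruits.head)
shape_is_callable = callable(fruits.shape)
print(head_is_callable, shape_is_callable)
```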


columns

As mentioned, you might find yourself in a situation where you can’t display all columns simultaneously. To get an idea of which columns you want, use the columns attribute to display a list of all columns:

print(fruits.columns)
Index(['count', 'fruit'], dtype='object')
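
Once you know the column names, a common next step is to pick out the ones you want. The following is a minimal sketch on the same fruits data; column_names and only_fruit are illustrative names, and the double-bracket selection is shown as one possible way to keep a subset of columns.

```python
import pandas as pd

fruits = pd.DataFrame([[1, 'pear'], [2, 'apple'], [3, 'orange']], columns=['count', 'fruit'])

# tolist() turns the Index into an ordinary Python list of names.
column_names = fruits.columns.tolist()
print(column_names)

# A list of names inside the brackets returns a DataFrame with only those columns.
only_fruit = fruits[['fruit']]
print(only_fruit.shape)
```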


Manipulating Data

Generally speaking, data manipulation happens with functions rather than attributes, because most manipulations need parameters to specify what is being changed or how to change it.


rename()

What if we read some data and the column names aren’t in our desired format, or just plain unhelpful? Fortunately, we can use rename() to replace the names of certain columns, as defined by the columns dictionary parameter.

fruits = fruits.rename(columns={'fruit': 'groceries'})

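We can now introduce the important inplace=True argument, which makes the change directly to the DataFrame instead of returning a new one that has to be assigned back. (Because the column was already renamed above, the call below leaves the names unchanged; it is here to show the syntax.)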

fruits.rename(columns={'fruit': 'groceries'}, inplace=True)
print(fruits.columns)
Index(['count', 'groceries'], dtype='object')

Keep in mind that inplace=True overwrites your DataFrame. If your function is more complicated, it might not be possible to reclaim an earlier version of the DataFrame without reloading it completely. Here’s an article that briefly explains inplace=True and when you would want to use it.
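
If losing the earlier version is a concern, one option (a sketch, not the only approach) is to take a copy before the in-place change. The snippet starts again from the original fruits data; backup is an illustrative name.

```python
import pandas as pd

fruits = pd.DataFrame([[1, 'pear'], [2, 'apple'], [3, 'orange']], columns=['count', 'fruit'])

# copy() makes an independent DataFrame, so later changes to fruits do not touch it.
backup = fruits.copy()

# The in-place rename changes fruits itself.
fruits.rename(columns={'fruit': 'groceries'}, inplace=True)

print(list(fruits.columns))  # renamed
print(list(backup.columns))  # still the original names
```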


DataFrame() Constructor

It’s common in DataFrame tutorials to use a dictionary to create a DataFrame. When this is done, the keys become column names and the values become the entries of each column. The top example in the pandas DataFrame documentation is the following:

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
print(df)
   col1  col2
0     1     3
1     2     4
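
To connect this with the fruits DataFrame from the start of the page, the sketch below builds the same data both ways: row by row from a list of lists, and column by column from a dictionary. The names from_rows and from_dict are illustrative.

```python
import pandas as pd

# Row-oriented input: each inner list is one row, and columns are named separately.
from_rows = pd.DataFrame([[1, 'pear'], [2, 'apple'], [3, 'orange']], columns=['count', 'fruit'])

# Column-oriented input: each key is a column name, each value holds that column's entries.
from_dict = pd.DataFrame({'count': [1, 2, 3], 'fruit': ['pear', 'apple', 'orange']})

# equals() checks that the two DataFrames have the same shape and elements.
same = from_rows.equals(from_dict)
print(same)
```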

Very easy! One benefit is that the constructor is adaptable: it can take many kinds of multi-dimensional, array-like objects and convert them to a DataFrame. Our chosen example is a list of dictionaries:

list_of_dicts = []
list_of_dicts.append({'columnA': 1, 'columnB': 2})
list_of_dicts.append({'columnB': 4, 'columnA': 1})
list_of_dicts.append({'columnA': 3, 'columnB': 1, 'columnC': 'hello'})

construction = pd.DataFrame(list_of_dicts)
print(construction.head())
   columnA  columnB columnC
0        1        2     NaN
1        1        4     NaN
2        3        1   hello

As you can see, the constructor adapted our data with little difficulty. Instead of throwing errors when the new columnC was introduced, it filled the other rows with NaN, and it was able to handle the dictionary entries being in different orders.
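
Since the constructor fills gaps silently, it can be worth checking how many values are missing afterwards. The following is a minimal sketch using isna() on the same list of dictionaries; missing_per_column is an illustrative name, and this check is a suggestion rather than a required step.

```python
import pandas as pd

list_of_dicts = [
    {'columnA': 1, 'columnB': 2},
    {'columnB': 4, 'columnA': 1},
    {'columnA': 3, 'columnB': 1, 'columnC': 'hello'},
]
construction = pd.DataFrame(list_of_dicts)

# isna() marks each missing cell as True; sum() then counts them per column.
missing_per_column = construction.isna().sum()
print(missing_per_column)

# The column order here matches the printed table above, despite the shuffled keys.
print(list(construction.columns))
```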