Lecture 08: Basic data analysis
First, let's start with a survey on your background in data wrangling.
You may need to install the DST API data reader (pydst), the pandas_datareader package and the matplotlib_venn module. Uncomment the following cells and run them to install.
The ! in front of each command indicates that it is a system command that could just as well have been executed in the terminal/command prompt of your computer.
# The DST API wrapper
!pip install git+https://github.com/elben10/pydst
# A wrapper for multiple APIs with a pandas interface
!pip install pandas-datareader
# For Venn diagrams
!pip install matplotlib-venn
import numpy as np
import pandas as pd
import datetime
import pandas_datareader # install with `pip install pandas-datareader`
import pydst # install with `pip install git+https://github.com/elben10/pydst`
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid') # on newer matplotlib versions this style is called 'seaborn-v0_8-whitegrid'
from matplotlib_venn import venn2 # `pip install matplotlib-venn`
1. Combining datasets
When combining datasets there are a few crucial concepts:
- Concatenate (append): "stack" rows (observations) on top of each other. This works if the datasets have the same columns (variables).
- Merge: the two datasets have different variables, but may or may not have the same observations.
There are different kinds of merges depending on which observations you want to keep:
- Outer join (one-to-one): keep observations that are in either dataset or in both.
- Inner join (one-to-one): keep observations that are in both datasets.
- Left join (many-to-one): keep observations that are in the left dataset, whether or not they are also in the right dataset.
Keeping observations that are not in both datasets will result in missing values for the variables coming from the dataset where the observation does not exist. A toy example is sketched below.
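To fix ideas, here is a minimal toy example of the three join types (hypothetical numbers, just for illustration):
left = pd.DataFrame({'id': [1, 2, 3], 'x': [10, 20, 30]})
right = pd.DataFrame({'id': [2, 3, 4], 'y': [200, 300, 400]})
print(pd.merge(left, right, on='id', how='outer')) # ids 1-4; NaN where an id is missing from one dataset
print(pd.merge(left, right, on='id', how='inner')) # only ids 2 and 3
print(pd.merge(left, right, on='id', how='left'))  # ids 1-3; NaN in y for id 1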
Read data:
empl = pd.read_csv('../07/data/RAS200_long.csv') # .. -> means one folder up
inc = pd.read_csv('../07/data/INDKP107_long.csv')
area = pd.read_csv('../07/data/area.csv')
1.1 Concatenating datasets
Suppose we have two datasets that have the same variables and we just want to concatenate them.
empl.head(5)
N = empl.shape[0]
A = empl.loc[empl.index < N/2,:] # first half of observations
B = empl.loc[empl.index >= N/2,:] # second half of observations
print(f'A has shape {A.shape} ')
print(f'B has shape {B.shape} ')
Concatenation is done using the command pd.concat([df1, df2]).
C = pd.concat([A,B])
print(f'C has shape {C.shape} (same as the original empl, {empl.shape})')
1.2 Merging datasets
Two datasets with different variables: empl and inc.
Central command: pd.merge(empl, inc, on=['municipality', 'year'], how=METHOD).
The keyword on specifies the merge key(s); they uniquely identify observations in both datasets (for sure in at least one of them). The keyword how specifies the merge method (taking values such as 'outer', 'inner' or 'left').
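A handy way to see what a given how choice keeps is the indicator keyword of pd.merge, which adds a _merge column telling you which dataset each row came from. A small sketch using the datasets loaded above:
# Sketch: each row is labelled 'left_only', 'right_only' or 'both'
check = pd.merge(empl, inc, on=['municipality','year'], how='outer', indicator=True)
print(check['_merge'].value_counts())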
Look at datasets:
print(f'Years in empl: {empl.year.unique()}')
print(f'Municipalities in empl = {len(empl.municipality.unique())}')
print(f'Years in inc: {inc.year.unique()}')
print(f'Municipalities in inc = {len(inc.municipality.unique())}')
Find differences:
diff_y = [y for y in inc.year.unique() if y not in empl.year.unique()]
print(f'years in inc data, but not in empl data: {diff_y}')
diff_m = [m for m in empl.municipality.unique() if m not in inc.municipality.unique()]
print(f'municipalities in empl data, but not in inc data: {diff_m}')
Conclusion: inc has more years than empl, but empl has one municipality that is not in inc.
plt.figure()
v = venn2(subsets = (4, 4, 10), set_labels = ('empl', 'inc'))
v.get_label_by_id('100').set_text('Christiansø')
v.get_label_by_id('010').set_text('2004-07' )
v.get_label_by_id('110').set_text('common observations')
plt.show()
Outer join: union
plt.figure()
v = venn2(subsets = (4, 4, 10), set_labels = ('empl', 'inc'))
v.get_label_by_id('100').set_text('included')
v.get_label_by_id('010').set_text('included')
v.get_label_by_id('110').set_text('included')
plt.title('outer join')
plt.show()
outer = pd.merge(empl,inc,on=['municipality','year'],how='outer')
print(f'Number of municipalities = {len(outer.municipality.unique())}')
print(f'Number of years = {len(outer.year.unique())}')
We see that the outer join includes rows that exist in either dataframe and therefore includes missing values:
I = (outer.year.isin(diff_y)) | (outer.municipality.isin(diff_m))
outer.loc[I, :].head(15)
Inner join
plt.figure()
v = venn2(subsets = (4, 4, 10), set_labels = ('empl', 'inc'))
v.get_label_by_id('100').set_text('dropped'); v.get_patch_by_id('100').set_alpha(0.05)
v.get_label_by_id('010').set_text('dropped'); v.get_patch_by_id('010').set_alpha(0.05)
v.get_label_by_id('110').set_text('included')
plt.title('inner join')
plt.show()
inner = pd.merge(empl,inc,how='inner',on=['municipality','year'])
print(f'Number of municipalities = {len(inner.municipality.unique())}')
print(f'Number of years = {len(inner.year.unique())}')
We see that the inner join does not contain any rows that are not in both datasets.
I = (inner.year.isin(diff_y)) | (inner.municipality.isin(diff_m))
inner.loc[I, :].head(15)
Left join
In my work, I most frequently use the left join. It is also known as a many-to-one join.
- Left dataset: inner: many observations of a given municipality (one per year).
- Right dataset: area: at most one observation per municipality, holding the new variable (km2).
inner_with_area = pd.merge(inner, area, on='municipality', how='left')
inner_with_area.head(10)
print(f'inner has shape {inner.shape}')
print(f'area has shape {area.shape}')
print(f'merge result has shape {inner_with_area.shape}')
plt.figure()
v = venn2(subsets = (4, 4, 10), set_labels = ('inner', 'area'))
v.get_label_by_id('100').set_text('included:\n no km2');
v.get_label_by_id('010').set_text('dropped'); v.get_patch_by_id('010').set_alpha(0.05)
v.get_label_by_id('110').set_text('included:\n with km2')
plt.title('left join')
plt.show()
Intermezzo: Finding the non-overlapping observations
not_in_area = [m for m in inner.municipality.unique() if m not in area.municipality.unique()]
not_in_inner = [m for m in area.municipality.unique() if m not in inner.municipality.unique()]
print(f'There are {len(not_in_area)} municipalities in inner that are not in area. They are:')
print(not_in_area)
print('')
print(f'There are {len(not_in_inner)} municipalities in area that are not in inner. They are:')
print(not_in_inner)
print('')
Check that km2 is never missing:
inner_with_area.km2.isnull().sum()
Alternative function for left joins: df.join(), which merges on the index.
To use df.join(), we must first set the index. Technically, we do not need to do this here, but if you ever need to join on more than one variable, df.join() requires you to work with indices, so we might as well learn it now.
area.sample(10)
inner.set_index(['municipality', 'year'], inplace=True)
area.set_index('municipality', inplace=True)
inner.head()
final = inner.join(area)
print(f'final has shape: {final.shape}')
final.head(5)
1.3 Other programming languages
SQL (including SAS proc sql)
SQL is one of the most powerful database languages, and many other programming languages embed a version of it. For example, SAS has proc sql, where you can use SQL syntax.
SQL is written in statements such as:
- left join: select * from empl left join inc on empl.municipality = inc.municipality and empl.year = inc.year
- outer join: select * from empl full outer join inc on empl.municipality = inc.municipality and empl.year = inc.year
STATA
In Stata, the command merge nests many of the commands mentioned above. You specify merge 1:1 for a one-to-one merge, or merge m:1 / merge 1:m for many-to-one or one-to-many merges, and you do not use merge m:m (until you are quite advanced).
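pandas offers a similar safeguard: the validate keyword of pd.merge raises an error if the merge is not of the declared type ('1:1', 'm:1', '1:m' or 'm:m'). A small sketch using the datasets from section 1.2 (it will raise an error if municipality and year do not uniquely identify rows in both datasets):
# Sketch: assert that the merge really is one-to-one on (municipality, year)
checked = pd.merge(empl, inc, on=['municipality','year'], how='inner', validate='1:1')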
2. Fetching data using an API
API stands for Application Programming Interface. An API is an interface through which we can directly ask for and receive data from an online source. We will be using packages for this and will not look at what is going on underneath.
- We use pandas_datareader to access many common international online data sources (install with pip install pandas-datareader).
- For Statistics Denmark, Jakob Elben has written the pydst package (install with pip install git+https://github.com/elben10/pydst).
Fetching data from an API requires an internet connection and works directly, without saving data to your hard disk (unless you ask Python to do so afterwards). You can use it to automate tasks such as fetching the most recent data, doing some calculations and outputting the results in the same manner. This can be useful e.g. for quarterly reports. Remember to save the data on your computer if you really need it for later, though; the admins of the data may turn off the water one day.
Pros: Automatic; smart; everything is done from Python (so no need to remember steps in between).
Cons: The connection can be slow or drop out, which may lead to errors. If e.g. 100 students simultaneously fetch data (during, say, a lecture), the host server may not be able to service all the requests and may drop out.
The raw output data from an API could look like this: https://stats.oecd.org/SDMX-JSON/data/NAAG. It is a long list of non-human-readable gobbledygook in the so-called "JSON" format.
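If you are curious, you can peek at such raw output from Python with the requests package (a sketch, assuming requests is installed; it is not needed for the rest of this lecture):
import requests
response = requests.get('https://stats.oecd.org/SDMX-JSON/data/NAAG') # ask the OECD API for the raw data
raw = response.json() # parse the JSON into Python dictionaries and lists
print(list(raw.keys())) # top-level keys of the JSON structure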
2.1 Import data from Statistics Denmark
Setup:
Create a DST API object that will allow us to interact with the DST server.
Dst = pydst.Dst(lang='en') # set up the data loader with the language 'english'
# What is the Dst variable?
print(type(Dst))
Data from DST are organized into:
- Subjects: indexed by numbers. Use Dst.get_subjects() to see the list.
- Tables: with names like "INDKP107". Use Dst.get_tables(subjects=['X']) to see all tables in a subject.
Data is extracted with Dst.get_data(table_id='NAME', variables=DICT).
Subjects: with Dst.get_subjects() we can list all subjects.
Dst.get_subjects()
Tables: with Dst.get_tables(), we can list all tables under a subject.
tables = Dst.get_tables(subjects=['2'])
print(type(tables))
display(tables)
Variables in a dataset:
tables[tables.id == 'INDKP107']
indk_vars = Dst.get_variables(table_id='INDKP107')
indk_vars
We want to know the available levels of each conditioning variable that we may subset by. Use a loop to print out those levels.
Values of the variables in a dataset:
indk_vars = Dst.get_variables(table_id='INDKP107')
for id in ['ENHED','KOEN','UDDNIV','INDKOMSTTYPE']:
print(id)
values = indk_vars.loc[indk_vars.id == id,['values']].values[0,0]
for value in values:
print(f' id = {value["id"]}, text = {value["text"]}')
There are quite a few values to select from. We need to use a dictionary to specify the desired subset of data. Note: a '*' indicates that you want all levels; for example, we select all municipalities and all periods below.
Get data:
variables = {'OMRÅDE':['*'],'ENHED':['110', '116'],'KOEN':['M','K'],'TID':['*'],'UDDNIV':['65'],'INDKOMSTTYPE':['100']}
inc_api = Dst.get_data(table_id = 'INDKP107', variables=variables)
inc_api.sort_values(by=['OMRÅDE', 'TID', 'KOEN'], inplace=True)
inc_api.head(5)
Now you have a dataset that is ready for cleaning and renaming.
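A first cleaning step could look like the sketch below (it assumes that the value column in the extract is called INDHOLD, which is the usual DST convention; check inc_api.columns if in doubt):
# Sketch: translate the Danish column names
inc_api_clean = inc_api.rename(columns={'OMRÅDE': 'municipality', 'TID': 'year', 'KOEN': 'sex', 'INDHOLD': 'income'})
inc_api_clean.head(5)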
2.2 FRED (Federal Reserve Economic Data)
GDP data for the US
# First we need to encode dates in a Python-friendly way to specify the desired time period.
# We use the datetime module - it is the general way to handle dates in Python.
start = datetime.datetime(2005,1,1)
end = datetime.datetime(2017,1,1)
timespan = end - start # We can investigate the precise time span by simply subtracting two datetime variables.
print('total number of days:', timespan.days) # The timespan object has a days attribute.
# Call the FRED api using pandas_datareader
gdp = pandas_datareader.data.DataReader('GDP', 'fred', start, end)
gdp.head(10)
Finding data:
- go to https://fred.stlouisfed.org
- search for data in main bar, e.g. employment and unemployment
- click the first link
- the series name (e.g. PAYEMS) is shown next to the header
We now want to pull down data on aggregate employment (PAYEMS) and unemployment (UNEMPLOY) levels and then plot the development.
Pulling the data:
start = datetime.datetime(1939,1,1)
end = datetime.datetime(2021,12,1)
# We can pull multiple series in one go. Just combine their names in a list.
empl_us = pandas_datareader.data.DataReader(['PAYEMS', 'UNEMPLOY'], 'fred', start, end)
Plot:
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
# Now we are just plotting directly from the pandas dataframe. Still using matplotlib under the hood.
empl_us.plot(ax=ax)
ax.legend(frameon=True)
ax.set_xlabel('')
ax.set_ylabel('employment (US)');
2.3 World Bank indicators: wb
Finding data:
- go to https://data.worldbank.org/indicator/
- search for GDP
- variable name ("NY.GDP.PCAP.KD") is in the URL
Pull GDP numbers:
# Need a different module than in the FRED case
from pandas_datareader import wb
wb_gdp = wb.download(indicator='NY.GDP.PCAP.KD', country=['SE','DK','NO'], start=1990, end=2017)
wb_gdp = wb_gdp.rename(columns = {'NY.GDP.PCAP.KD':'GDP'})
wb_gdp = wb_gdp.reset_index()
wb_gdp.sample(5)
wb_gdp.info()
Problems:
- It turns out that the dataframe has stored the variable year as an "object", which in practice means it is a string. It must be converted to an int, as we want to use it as a number.
- country is in fact a text variable, so it is acceptable to have it as an object type. But pandas has also implemented a string type of its own, called 'string', while the text type you normally encounter inside an object column is 'str'. Yes, confusing! You want to get it right, though, because an object column can also contain numbers in addition to text, which is bad.
- Fortunately, GDP is a float (i.e. a number).
wb_gdp.year = wb_gdp.year.astype(int) # convert year
wb_gdp.country = wb_gdp.country.astype('string') # convert country to the special pandas string type
wb_gdp.info()
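A tiny illustration of why the object type can be treacherous (toy data, just for exposition):
s = pd.Series(['Denmark', 1990]) # mixes text and a number
print(s.dtype)                   # object: pandas makes no promise about what each element is
print(s.astype('string').dtype)  # string: every element is guaranteed to be text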
Fetch employment-to-population ratio:
wb_empl = wb.download(indicator='SL.EMP.TOTL.SP.ZS', country=['SE','DK','NO'], start=1990, end=2017) # don't need the special datetime here.
wb_empl.rename(columns = {'SL.EMP.TOTL.SP.ZS':'employment_to_pop'}, inplace=True) # Better col name
wb_empl.reset_index(inplace = True)
wb_empl.year = wb_empl.year.astype(int)
wb_empl.sample(3)
Merge:
wb = pd.merge(wb_gdp, wb_empl, how = 'outer', on = ['country','year']) # note: this reuses the name wb, shadowing the wb module imported above
wb.head(5)
3. Split-apply-combine
One of the most useful skills to learn is the split-apply-combine process. For example, we may want to compute the average employment rate within a municipality over time and then calculate whether the employment rate in each year is above or below that average. We calculate this variable using a split-apply-combine procedure (a one-line preview is sketched right after the list below):
- split: divide the dataset into units (one for each municipality)
- apply: compute the average employment rate for each unit
- combine: merge this new variable back onto the original dataset
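As a preview, here is the whole split-apply-combine in a single line on a toy dataframe (hypothetical numbers):
df = pd.DataFrame({'g': ['a', 'a', 'b', 'b'], 'x': [1.0, 3.0, 2.0, 6.0]})
df['x_demeaned'] = df.x - df.groupby('g')['x'].transform('mean') # split by g, apply the mean, combine back
df
The sections below unpack each of the three steps.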
3.1 Groupby
Example data:
empl.head()
empl.rename(columns={'empl':'e'}, inplace=True)
# A simple, unconditional transformation of data. Works because of broadcasting.
empl['empl_demean'] = empl.e - empl.e.mean()
empl.head()
empl = empl.sort_values(['municipality','year']) # sort by first municipality then year
empl.head(5)
Use groupby to calculate within means:
empl.groupby(['municipality'])['e'].mean().head(5)
Custom functions: the apply part can be specified using lambda notation.
Warning: lambda implementations will often be a pretty slow alternative to vectorized operations. More on that later.
An example with average change:
# Define a lambda function to be applied down the rows of a column, in blocks defined by the groupby.
avg_first_diff = lambda x: x.diff(1).mean() # A pd.Series has a function diff that does the job.
# Apply the lambda and print head of output
empl.groupby('municipality')['e'].apply(avg_first_diff).head(5)
Or:
# We can also define our lambda with a numpy implementation.
avg_first_diff = lambda x: np.mean(x[1:]-x[:-1])
# We need the extra lambda to pass the underlying values (a numpy array) of e to avg_first_diff.
empl.groupby('municipality')['e'].apply(lambda x: avg_first_diff(x.values)).head(5)
Plot statistics: Dispersion in employment rate across Danish municipalities over time.
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
empl.groupby('year')['e'].std().plot(ax=ax,style='-o')
ax.set_ylabel('std. dev.')
ax.set_title('std. dev. across municipalities in the employment rate');
3.2 Split-Apply-Combine
Goal: Calculate within municipality difference to mean employment rate.
Start by splitting, applying and combining manually:
1. Split:
e_grouped = empl.groupby('municipality')['e']
# The e_grouped object is a lazy GroupBy object and is not itself ready for inspection
print(e_grouped)
2. Apply:
e_mean = e_grouped.mean() # mean employment rate
e_mean.head(10)
Change name of series:
e_mean.name = 'e_mean' # necessary for join
3. Combine:
empl_ = empl.set_index('municipality').join(e_mean, how='left')
empl_['e_demean'] = empl_.e - empl_.e_mean
empl_.xs('Copenhagen')
Plot:
municipalities = ['Copenhagen','Roskilde','Lejre']
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
# Here we use .xs() to pick out all rows for a given municipality, which makes for an elegant loop
for m in municipalities:
empl_.xs(m).plot(x='year',y='e_demean',ax=ax,label=m)
ax.legend(frameon=True)
ax.set_ylabel('difference to mean')
Do the splitting and applying in one fell swoop with agg()
agg: produces one value per group, i.e. the same value for all observations in a group.
We can use lambdas or built-in functions for the operation. Use built-in functions whenever you can! A lambda version is shown (commented out) below for exposition.
empl_ = empl.copy()
# a. Good use: a built-in function for mean rather than lambda.
e_mean = empl_.groupby('municipality')['e'].agg('mean')
# Same result with a lambda
#e_mean = empl_.groupby('municipality')['e'].agg(lambda x: x.mean())
e_mean.name = 'e_mean'
# b. combine
empl_ = empl_.set_index('municipality').join(e_mean, how='left')
empl_['diff'] = empl_.e - empl_.e_mean
empl_.xs('Copenhagen')
Note: Same result!!
Question: are there any dangers with the variable name 'diff'?
This is still pretty cumbersome, though: we create a new variable and then merge it in a separate step. We can do better with the tools in pandas.
Splitting, applying and combining all together with apply() and transform() directly
Transform: if you are dealing with multiple variables, transform works on one variable/column at a time. In the case below, had we selected both e and year rather than just e, x.mean() would have been applied to the observations within one column at a time. transform must return either a single value or an array of the same size as the original column.
Apply: the set of columns passed to the apply function is treated as a whole dataframe of its own. You can therefore write lambda functions that use several columns of data in each operation. That is not possible with transform.
More info: you can read more about the differences between transform and apply here.
Note: when you are dealing with a selection of only one variable, transform and apply behave similarly.
empl_ = empl.copy()
empl_['e_demean'] = empl_.groupby('municipality')['e'].apply(lambda x: x - x.mean())
# In this case, you could have used transform as well
#empl_['e_demean'] = empl_.groupby('municipality')['e'].transform(lambda x: x - x.mean())
empl_.set_index('municipality').xs('Copenhagen')
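A small sketch of what apply can do that transform cannot: using more than one column inside the per-group function. The example below finds, for each municipality, the year with the highest employment rate (purely illustrative):
# apply receives each group as a dataframe, so the lambda can use both 'e' and 'year'
empl.groupby('municipality').apply(lambda df: df.loc[df.e.idxmax(), 'year']).head(5)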
3.3 Optimizing performance
It is quite important for your own and others' productivity to implement efficient procedures when dealing with large datasets. The apply method (as well as transform) essentially runs a Python-level loop over the groups of a column when applying a lambda function. This may be much slower than necessary if, for example, you end up calculating averages over the whole column or group many, many times, as in the case below. Using built-in pandas functions without lambdas gets it right. It is important to avoid such behavior with large datasets.
import time
N = 300
# a. Check performance with lambda function. Sooo slooow..
demean = lambda x: x - x.mean()
tic = time.perf_counter()
for i in range(N):
d1 = empl.groupby('municipality')['e'].transform(demean)
toc = time.perf_counter()
print(f'Performance with lambda function {toc-tic: 5.3f}')
# b. Performance when relying on built-in pandas methods. It is not because we're using transform per se.
# It is much faster because the group means are computed in optimized (Cython) code rather than in a Python lambda.
tic = time.perf_counter()
for i in range(N):
d2 = empl.e - empl.groupby('municipality')['e'].transform('mean') # Demean by subtracting grouped mean from e column.
toc = time.perf_counter()
print(f'Performance with pandas vectorized: {toc-tic: 5.3f}')
print('Check of consistency: ', np.all(d1==d2))
We can also see that an explicit numpy implementation is faster than relying on pandas methods. Here we revisit the first-differencing example from above.
# a. The pandas series implementation
avg_first_diff = lambda x: x.diff(1).mean()
tic = time.perf_counter()
for i in range(N):
d1 = empl.groupby('municipality')['e'].apply(avg_first_diff)
toc = time.perf_counter()
print(f'Performance with pandas: {toc-tic: 3.6f}')
# b. The Numpy implementation
avg_first_diff = lambda x: np.mean(x.values[1:]-x.values[:-1])
tic = time.perf_counter()
for i in range(N):
d2 = empl.groupby('municipality')['e'].apply(avg_first_diff)
toc = time.perf_counter()
print(f'Performance with numpy: {toc-tic: 3.6f}')
print('Is d1 == d2:', np.all(d1==d2))
Note: Same result!!
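If you want to avoid apply altogether here, a fully vectorized alternative exists because groupby objects have their own diff method (a sketch; the result should match d1 and d2):
# c. No lambda at all: within-group first differences, then within-group means
d3 = empl.groupby('municipality')['e'].diff().groupby(empl['municipality']).mean()
print('Is d1 == d3:', np.all(d1 == d3))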
Need more complex group by stuff?
Look here.
Additional links
- Do you have missing values in your data? Check here.
- About strings and the object type in pandas, here.
- Comparison of SQL statements and pandas groupby, here.
- Optimizing pandas routines incl. apply, here. (less technical)
- Stack Overflow musings on optimal use of apply() and its downsides. See also this. (both pretty technical)
- About optimizing pandas with numpy and vectorization, here.
4. A few examples of open access APIs
As already demonstrated, you can pull data from DST using their API. Just to give a few examples of where else you may find open access to data by API:
- Check out the documentation for pandas_datareader. There are a bunch of economic data banks to access through it.
- There is an API for covid-19 data that draws on several sources.
- The National Museum of Art (DK) gives access to their collection by an API.
- NASA has its own API. Look here for their documentation and here for a Python wrapper.
4.1 Other sources
- A crazy large collection of APIs found on this Github repo. Stocks, government, map data, lol memes, anything..
- Datasets behind FiveThirtyEight articles (you know... this site)
5. All built-in functions to aggregate data in Pandas
You can use these functions in apply(), transform() and agg() by writing their names as strings (see the examples above, and the sketch after the list below). This will normally be the fastest implementation.
- count: Number of non-null observations
- sum: Sum of values
- mean: Mean of values
- mad: Mean absolute deviation
- min: Minimum
- max: Maximum
- mode: Mode
- abs: Absolute Value
- prod: Product of values
- std: Unbiased standard deviation
- var: Unbiased variance
- sem: Unbiased standard error of the mean
- skew: Unbiased skewness (3rd moment)
- kurt: Unbiased kurtosis (4th moment)
- quantile: Sample quantile (value at %)
- cumsum: Cumulative sum
- cumprod: Cumulative product
- cummax: Cumulative maximum
- cummin: Cumulative minimum
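For example, several of these can be combined in a single agg call by passing a list of their names as strings (a small sketch using the employment data from above):
# One row per municipality, one column per statistic
empl.groupby('municipality')['e'].agg(['mean', 'std', 'min', 'max']).head(5)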
There is also your DataCamp cheat sheet for pandas for reference.
This lecture: We have discussed
- Combining datasets (merging and concatenating)
- Fetching data using an API (DST, FRED, World Bank, etc.)
- Split-apply-combine (groupby, agg, transform)
Your work: Before solving Problem Set 4 read through this notebook and play around with the code.
Project 1: See the details in the projects folder of Lectures2021 or Project 1: Data analysis here.
Deadline: 11th of April.
Next lecture: Algorithms: Searching and sorting algorithms.
Remember no lecture Monday 5th of April! (it's a holiday)