Problem set 4: Analyzing data
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import pandas as pd
import pydst
dst = pydst.Dst(lang='en')
Tasks
Import national account data from Denmark Statistics
Consider the following dictionary definitions:
columns_dict = {}
columns_dict['TRANSAKT'] = 'variable'
columns_dict['PRISENHED'] = 'unit'
columns_dict['TID'] = 'year'
columns_dict['INDHOLD'] = 'value'
var_dict = {} # var is for variable
var_dict['P.1 Output'] = 'Y'
var_dict['P.3 Final consumption expenditure'] = 'C'
var_dict['P.3 Government consumption expenditure'] = 'G'
var_dict['P.5 Gross capital formation'] = 'I'
var_dict['P.6 Export of goods and services'] = 'X'
var_dict['P.7 Import of goods and services'] = 'M'
unit_dict = {}
unit_dict['2010-prices, chained values'] = 'real'
unit_dict['Current prices'] = 'nominal'
Step 1: Download all of table nah1
.
# hint, nah1 = dst.get_data(table_id = '?', variables={'TRANSAKT':[?], 'PRISENHED':[?], 'TID':[?]})
Step 2: Rename the columns using columns_dict
and replace data using var_dict
and unit_dict
.
# hint, nah1_true.rename(?,inplace=True)
# for key,value in var_dict.items():
# nah1.variable.replace(?)
#for key,value in unit_dict.items():
# nah1.unit.replace(?)
Step 3: Only keep rows where the variable is in [Y, C, G, I, X, M]
. Afterwards convert the value
column to a float.
# write you code here
# nah1.value = nah1.value.astype('float')
Step 4: Discuss what the following summary statistics show.
# nah1_true.groupby(['variable','unit']).describe()
Answer: See A1.py
Merge with population data from Denmark Statistics
Load population data from Denmark Statistics:
pop = dst.get_data(table_id = 'FT', variables={'HOVEDDELE':['*'], 'TID':['*']})
pop.rename(columns={'TID':'year','INDHOLD':'population'},inplace=True)
I = pop.HOVEDDELE == 'All Denmark'
pop = pop.loc[I,['year','population']]
pop.head()
Question 1: Merge the population and the national account data, so there is a new column called population
. Use the merge function.
# hint, merged = pd.merge(?,?,how='?',on=[?])
# merged_true.tail(10)
Answer: See A2.py
Question 2: Merge the population on again, so there is a new column called population_alt
. Use the join method.
# pop_with_index = pop.set_index(?)
# pop_with_index.rename(columns={'population':'population_alt'},inplace=True)
# merged_with_index = merged.set_index(?)
# merged_alt = merged_with_index.join(?)
# merged_alt.tail(10)
Answer: See A3.py
Split-apply-combine-(plot)
Consider the following split-apply-combine-plot:
# a. split
nah1_true_grouped = nah1_true.groupby(['variable','unit'])
nah1_true_grouped_first = nah1_true_grouped.value.first()
nah1_true_grouped_first.name = 'first'
# b. apply
nah1_true.set_index(['variable','unit','year'],inplace=True)
nah1_true = nah1_true.join(nah1_true_grouped_first,how='left',on=['variable','unit'])
nah1_true.reset_index(inplace=True)
# c. combine
nah1_true['indexed'] = nah1_true['value']/nah1_true['first']
# d. plot
def plot(df):
df_indexed = df.set_index('year')
I = df_indexed.unit == 'real'
df_indexed[I].groupby(['variable'])['indexed'].plot(legend=True);
plot(nah1_true)
Question Implement the same split-apply-combine as above using transform
.
def first(x): # select the first element in a series
return x.iloc[0]
# nah1_alt = nah1_final.copy()
# grouped = nah1_alt.groupby(?)
#nah1_alt[?] = ?.transform(lambda x: ?)
#nah1_alt.head()
Answer: See A4.py
Problem: The Housing market
Housing data
Note: The file data/bm010_parcel.xlsx
has been downloaded from http://rkr.statistikbank.dk/201.
Question: Go through the cell below and ensure you understand ALL commands.
# a. load data
prices = pd.read_excel('data/bm010_parcel.xlsx', skiprows=2)
prices.rename(columns={'Unnamed: 2': 'municipality'}, inplace=True)
# b. delete columns
del prices['Unnamed: 0']
del prices['Unnamed: 1']
# c. rename time columns: 1992K1 -> price19921
time_dict = {}
for y in range(1992,2021+1):
for k in range(1,4+1):
str_from = f'{y}K{k}'
str_to = f'price{y}{k}'
time_dict[str_from] = str_to
prices = prices.rename(columns = time_dict)
# d. drop missing
prices = prices.dropna()
# e. convert to long
prices_long = pd.wide_to_long(prices, stubnames='price', i='municipality', j='year_quarter')
prices_long.reset_index(inplace=True)
# f. drop missing and convert to float
I = prices_long.loc[prices_long.price == '..']
prices_long.drop(I.index, inplace=True)
prices_long.price = prices_long.price.astype('float')
# g. create date variable
prices_long['d'] = (prices_long.year_quarter.astype(str).str[:4] # grab the year, first four digits
+ 'Q' # add the letter Q
+ prices_long.year_quarter.astype(str).str[4]) # the quarter (fifth digit)
prices_long['date'] = pd.to_datetime(prices_long.d)
# h. cleanup
del prices_long['year_quarter']
del prices_long['d']
prices_long.head()
Population data
Question: Go through the cell below and ensure you understand ALL commands.
# a. load data
pop = dst.get_data(table_id='FOLK1A', variables={'Alder':['IALT'], 'CIVILSTAND':['TOT'], 'Køn':['TOT'], 'Tid':['*'], 'OMRÅDE':['*']})
# b. drop and rename columns
for v in ['ALDER', 'CIVILSTAND', 'KØN']:
del pop[v]
pop = pop.rename(columns = {'INDHOLD':'population', 'OMRÅDE': 'municipality'})
# c. drop non-municipalities
for val in ['Region', 'All']:
I = pop['municipality'].str.contains(val)
pop.drop(pop[I].index, inplace=True)
# d. convert to date
pop['date'] = pd.to_datetime(pop.TID)
del pop['TID']
pop.head()
Analysis
Problem: Analyze the co-variation betwen population growth and house price growth. Reproduce the graphs below.
Hint: For the second one consider the agg
method (similar to but different from transform
, Google it).
# write your code here
Answer: See A5.py for difference within total population. See A6.py for difference within municipality