Introduction¶
In this post we will examine some basic features of the 2023 CPI data. The data includes records up to August 2023. This data was obtained from the U.S. Bureau of Labor Statistics. Clicking the Average Price Data TEXT FILES link brings us to the data files we utilize in this report.
To keep things focused we will examine the ap.data.2.Gasoline
file and look at regular unleaded gas. From the ap.item
text file we get the following line
item_code item_name
74714 Gasoline, unleaded regular, per gallon/3.785 liters
This tells us that in our dataset we need to look out for item_code
74714.
Extracting the Data¶
Let's take a look at the first 20 rows from the ap.data.2.Gasoline
dataset:
series_id year period value footnote_codes
APU000074712 1973 M10 0.402
APU000074712 1973 M11 0.418
APU000074712 1973 M12 0.437
APU000074712 1974 M01 0.465
APU000074712 1974 M02 0.491
APU000074712 1974 M03 0.528
APU000074712 1974 M04 0.537
APU000074712 1974 M05 0.550
APU000074712 1974 M06 0.556
APU000074712 1974 M07 0.558
APU000074712 1974 M08 0.554
APU000074712 1974 M09 0.550
APU000074712 1974 M10 0.534
APU000074712 1974 M11 0.528
APU000074712 1974 M12 0.528
APU000074712 1975 M01 0.532
APU000074712 1975 M02 0.533
APU000074712 1975 M03 0.534
APU000074712 1975 M04 0.540
We can see that this data is both space and tab delimited. It turns out it has a fixed width formatted lines text file format. Let's see how to work with this in Python using the pandas
package. Our first step is to install pandas
and then import it for use in this notebook. We install pandas
by running:
pip install pandas
in the command line in the virtual environment used to run the Jupyter Notebook. We then run the following code
import pandas as pd
Let's see if we can load this data up with Pandas
.
import os
fp = os.path.join('cpi', 'ap.data.2.Gasoline.txt')
df = pd.read_fwf(fp)
Let's check our first 20 values using the .head
attribute of df
.
df.head(20)
We can see that the .read_fwf
function is able to handle this fixed width formatted lines text file with ease.
Isolating a Variable¶
Now that we have full access to our dataset as a pandas
data frame we need to limit our results to regular unleaded or item_code
74714. The key to figuring this out is to look at the series_id
variable.
Reading the ap.txt
file we learn that the series_id
consists of four consecutive components: the survey abbreviation, a seasonal code, the area_code
, and the item_code
. Looking at the first row (number 0) we have survey abbreviation AP, seasonal code U, area_code
0000, and item_code
74712.
We are looking for instances with item_code
74714. We also need to account for different area_code
observations. Looking at the ap.area.txt
file we can see what each area_code
corresponds to. There are many different types of regions and cities covered by different area codes. Here is a small subset:
We will focus on area_code
0000 corresponding to the overall U.S. city average. Thus, our focus will be rows with series_id
APU000074714. We do this below:
regular_unleaded = df[df['series_id'] == 'APU000074714']
regular_unleaded.head()
Now that we have the desired dataset, let's take some time to examine what we are working with.
# Number of observations / rows
regular_unleaded.shape[0]
# Check data types for all columns
regular_unleaded.dtypes
Data Cleaning¶
With our current data frame, regular_unleaded
, we have both a year and period column. The period corresponds to the month where M01
is January, M02
is February and so on. We need to create a new column with a datetime
data type to help us visualize our data. Our steps are below:
# Create new column with datetime data type
month = regular_unleaded.apply(lambda row: row.period.replace('M', ''), axis=1)
day = [1 for day in range(len(regular_unleaded))]
regular_unleaded.loc[:, ['date']] = pd.to_datetime(
[f'{y}-{m}-{d}' for y, m, d in zip(regular_unleaded.year, month, day)]
)
regular_unleaded.head(5)
Notice that we created variables to hold the month and day values that go into the date. Furthermore, we are not concerned with the day of the month so we simply use the first day of the month as a placeholder.
Data Visualization¶
Now that we have cleaned up our data let's take a look our results. To do this we install matplotlib
using
pip install matplotlib
in our virtual environment. We then import matplotlib.pyplot
to allow access to some of the graphical functionality in matplotlib
. Below we can see our results:
import matplotlib.pyplot as plt
regular_unleaded.plot(x='date', y='value')
plt.xlabel('Date')
plt.ylabel('Price (dollars)')
plt.title("Regular Unleaded Price over Time")
We can see from our plot above the average price of regular unleaded has an upward trend. In addition, we can see that our greatest average price occurred recently (based on our current range of data). To see our start and end date for this collection of data consider the following code:
# get earliest date
min(regular_unleaded['date'])
# get newest date
max(regular_unleaded['date'])
We mentioned the recent date with the highest average price. Let's calculate this value and determine the exact date in our date range.
# Maximum value
max_price = max(regular_unleaded['value'])
max_price
Now to find the date with the maximum price. Since we can visually see that we do not have a repeat year we can do the following:
regular_unleaded['value'].idxmax()
With this we can get the actual date of the max value as follows:
regular_unleaded.loc[891, 'date']
Thus, the greatest average price for regular unleaded was in June of 2022.
Conclusion¶
This is a really rich dataset with a lot of data which can be extracted. In this post we demonstrated how to create a time series plot using one component of this massive dataset. In a future post we will revisit this dataset and extract some more insights using this dataset.
Add a new comment