Introduction

In this post we will examine some basic features of the 2023 CPI data, which includes records through August 2023. The data was obtained from the U.S. Bureau of Labor Statistics; clicking the Average Price Data TEXT FILES link brings us to the data files we use in this report.

To keep things focused we will examine the ap.data.2.Gasoline file and look at regular unleaded gas. From the ap.item text file we get the following line:

item_code    item_name
74714   Gasoline, unleaded regular, per gallon/3.785 liters

This tells us that in our dataset we need to look out for item_code 74714.
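
If we want to search the item list programmatically rather than by eye, a short sketch like the following works. It assumes the file is saved locally as cpi/ap.item.txt and is tab-delimited (both the path and the delimiter are assumptions; adjust them to match your copy):

import pandas as pd

# Load the item lookup table (assumed tab-delimited with a header row)
items = pd.read_csv('cpi/ap.item.txt', sep='\t')

# Find every item whose name mentions 'unleaded'
items[items['item_name'].str.contains('unleaded', case=False)]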

Extracting the Data

Let's take a look at the first 20 rows from the ap.data.2.Gasoline dataset:

series_id            year    period         value    footnote_codes
APU000074712        1973    M10        0.402    
APU000074712        1973    M11        0.418    
APU000074712        1973    M12        0.437    
APU000074712        1974    M01        0.465    
APU000074712        1974    M02        0.491    
APU000074712        1974    M03        0.528    
APU000074712        1974    M04        0.537    
APU000074712        1974    M05        0.550    
APU000074712        1974    M06        0.556    
APU000074712        1974    M07        0.558    
APU000074712        1974    M08        0.554    
APU000074712        1974    M09        0.550    
APU000074712        1974    M10        0.534    
APU000074712        1974    M11        0.528    
APU000074712        1974    M12        0.528    
APU000074712        1975    M01        0.532    
APU000074712        1975    M02        0.533    
APU000074712        1975    M03        0.534    
APU000074712        1975    M04        0.540    
APU000074712        1975    M05        0.550    

At first glance this data looks like it is both space and tab delimited, but it is actually a fixed-width formatted text file. Let's see how to work with this in Python using the pandas package. Our first step is to install pandas and then import it for use in this notebook. We install pandas by running:

pip install pandas

in the command line in the virtual environment used to run the Jupyter Notebook. We then run the following code:

In [1]:
import pandas as pd

Let's see if we can load this data up with pandas.

In [2]:
import os


fp = os.path.join('cpi', 'ap.data.2.Gasoline.txt')
df = pd.read_fwf(fp)

Let's check our first 20 rows using the .head method of df.

In [3]:
df.head(20)
Out[3]:
series_id year period value footnote_codes
0 APU000074712 1973 M10 0.402 NaN
1 APU000074712 1973 M11 0.418 NaN
2 APU000074712 1973 M12 0.437 NaN
3 APU000074712 1974 M01 0.465 NaN
4 APU000074712 1974 M02 0.491 NaN
5 APU000074712 1974 M03 0.528 NaN
6 APU000074712 1974 M04 0.537 NaN
7 APU000074712 1974 M05 0.550 NaN
8 APU000074712 1974 M06 0.556 NaN
9 APU000074712 1974 M07 0.558 NaN
10 APU000074712 1974 M08 0.554 NaN
11 APU000074712 1974 M09 0.550 NaN
12 APU000074712 1974 M10 0.534 NaN
13 APU000074712 1974 M11 0.528 NaN
14 APU000074712 1974 M12 0.528 NaN
15 APU000074712 1975 M01 0.532 NaN
16 APU000074712 1975 M02 0.533 NaN
17 APU000074712 1975 M03 0.534 NaN
18 APU000074712 1975 M04 0.540 NaN
19 APU000074712 1975 M05 0.550 NaN

We can see that the .read_fwf function handles this fixed-width formatted text file with ease.
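
By default, .read_fwf infers the column boundaries from the first 100 rows of the file (the infer_nrows parameter). If that inference ever misfires on another BLS file, the boundaries can be passed explicitly. A minimal sketch (the offsets below are illustrative, not the measured widths of this particular file):

# Explicit (start, end) character offsets for each column.
# These offsets are illustrative; inspect the raw file before relying on them.
colspecs = [(0, 17), (17, 28), (28, 39), (39, 51), (51, 68)]
df = pd.read_fwf(
    fp,
    colspecs=colspecs,
    names=['series_id', 'year', 'period', 'value', 'footnote_codes'],
    skiprows=1,  # skip the header row since we supply our own names
)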

Isolating a Variable

Now that we have full access to our dataset as a pandas data frame, we need to limit our results to regular unleaded gas, i.e. item_code 74714. The key to figuring this out is the series_id variable.

Reading the ap.txt file we learn that the series_id consists of four consecutive components: the survey abbreviation, a seasonal code, the area_code, and the item_code. Looking at the first row (number 0) we have survey abbreviation AP, seasonal code U, area_code 0000, and item_code 74712.
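
Because these components sit at fixed offsets, we can pull them apart with plain string slicing. A quick sketch using the first series_id:

sid = 'APU000074712'
survey = sid[:2]       # 'AP'    survey abbreviation
seasonal = sid[2]      # 'U'     seasonal code
area_code = sid[3:7]   # '0000'  area_code
item_code = sid[7:]    # '74712' item_code

The same slicing can be vectorized with, for example, df['series_id'].str[3:7] if we ever want area_code as its own column.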

We are looking for instances with item_code 74714. We also need to account for different area_code observations. Looking at the ap.area.txt file we can see what each area_code corresponds to. There are many different types of regions and cities covered by different area codes. Here is a small subset:

area_code    area_name
0000         U.S. city average
A214         Kansas City, MO-KS
S24B         St. Louis, MO-IL

We will focus on area_code 0000 corresponding to the overall U.S. city average. Thus, our focus will be rows with series_id APU000074714. We do this below:

In [4]:
# .copy() lets us add columns later without a SettingWithCopyWarning
regular_unleaded = df[df['series_id'] == 'APU000074714'].copy()
In [5]:
regular_unleaded.head()
Out[5]:
series_id year period value footnote_codes
334 APU000074714 1976 M01 0.605 NaN
335 APU000074714 1976 M02 0.600 NaN
336 APU000074714 1976 M03 0.594 NaN
337 APU000074714 1976 M04 0.592 NaN
338 APU000074714 1976 M05 0.600 NaN

Now that we have the desired dataset, let's take some time to examine what we are working with.

In [6]:
# Number of observations / rows
regular_unleaded.shape[0]
Out[6]:
572
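
As a quick sanity check, 572 rows is exactly what a monthly series running from January 1976 (the first row above) through August 2023 should contain:

# 47 full years (1976 through 2022) of 12 months, plus 8 months of 2023
47 * 12 + 8  # 572
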
In [7]:
# Check data types for all columns
regular_unleaded.dtypes
Out[7]:
series_id          object
year                int64
period             object
value             float64
footnote_codes    float64
dtype: object
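
One thing stands out: footnote_codes has been NaN in every row printed so far. A one-line check confirms whether the column is empty for the whole series (if it is, it could safely be dropped):

# True if footnote_codes holds no data at all for this series
regular_unleaded['footnote_codes'].isna().all()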

Data Cleaning

With our current data frame, regular_unleaded, we have both a year and a period column. The period corresponds to the month: M01 is January, M02 is February, and so on. We need to create a new column with a datetime data type to help us visualize our data. Our steps are below:

In [8]:
# Create a new column with a datetime data type
month = regular_unleaded['period'].str.replace('M', '')  # 'M01' -> '01'
day = [1 for _ in range(len(regular_unleaded))]          # placeholder day of month
regular_unleaded.loc[:, 'date'] = pd.to_datetime(
    [f'{y}-{m}-{d}' for y, m, d in zip(regular_unleaded.year, month, day)]
)

regular_unleaded.head(5)
Out[8]:
series_id year period value footnote_codes date
334 APU000074714 1976 M01 0.605 NaN 1976-01-01
335 APU000074714 1976 M02 0.600 NaN 1976-02-01
336 APU000074714 1976 M03 0.594 NaN 1976-03-01
337 APU000074714 1976 M04 0.592 NaN 1976-04-01
338 APU000074714 1976 M05 0.600 NaN 1976-05-01

Notice that we created variables to hold the month and day values that go into the date. Since the data is monthly we are not concerned with the day of the month, so we simply use the first day of each month as a placeholder.
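
As an aside, the same column can be built in one step by letting pandas parse the concatenated year and period strings with an explicit format. A compact alternative sketch:

# Parse strings like '1976M01' directly into Timestamps
regular_unleaded.loc[:, 'date'] = pd.to_datetime(
    regular_unleaded['year'].astype(str) + regular_unleaded['period'],
    format='%YM%m'
)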

Data Visualization

Now that we have cleaned up our data let's take a look at our results. To do this we install matplotlib using

pip install matplotlib

in our virtual environment. We then import matplotlib.pyplot to allow access to some of the graphical functionality in matplotlib. Below we can see our results:

In [9]:
import matplotlib.pyplot as plt
regular_unleaded.plot(x='date', y='value')
plt.xlabel('Date')
plt.ylabel('Price (dollars)')
plt.title("Regular Unleaded Price over Time")
Out[9]:
Text(0.5, 1.0, 'Regular Unleaded Price over Time')
[Figure: line plot of the regular unleaded gas price over time; date on the x-axis, price in dollars on the y-axis]

We can see from the plot above that the average price of regular unleaded has an upward trend. In addition, the greatest average price occurred recently (relative to our current range of data). To see the start and end dates for this collection of data, consider the following code:

In [10]:
# get earliest date
min(regular_unleaded['date'])
Out[10]:
Timestamp('1976-01-01 00:00:00')
In [11]:
# get newest date
max(regular_unleaded['date'])
Out[11]:
Timestamp('2023-08-01 00:00:00')

We mentioned that the highest average price occurred recently. Let's calculate this value and determine the exact date within our date range.

In [12]:
# Maximum value
max_price = max(regular_unleaded['value'])
max_price
Out[12]:
5.058

Now to find the date of the maximum price. Since we can see from the plot that the maximum value occurs only once, idxmax (which returns the index of the first occurrence) gives us its row:

In [13]:
regular_unleaded['value'].idxmax()
Out[13]:
891

With this we can get the actual date of the max value as follows:

In [14]:
regular_unleaded.loc[891, 'date']
Out[14]:
Timestamp('2022-06-01 00:00:00')

Thus, the greatest average price for regular unleaded was in June of 2022.
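
One caveat: if the maximum price ever appeared in more than one month, idxmax would return only the first occurrence. A boolean filter avoids that assumption and returns every row at the maximum:

# All rows where the price equals the series maximum
regular_unleaded[regular_unleaded['value'] == max_price][['date', 'value']]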

Conclusion

This is a rich dataset with a great deal that can be extracted. In this post we demonstrated how to create a time series plot from one component of this massive dataset. In a future post we will revisit the data and extract some more insights.