Introduction¶

In this post we will examine some basic features of the 2023 CPI data. The data includes records up to August 2023. This data was obtained from the U.S. Bureau of Labor Statistics. Clicking the Average Price Data TEXT FILES link brings us to the data files we utilize in this report.

To keep things focused we will examine the ap.data.2.Gasoline file and look at regular unleaded gas. From the ap.item text file we get the following line

item_code    item_name
74714   Gasoline, unleaded regular, per gallon/3.785 liters

This tells us that in our dataset we need to look out for item_code 74714.

Extracting the Data¶

Let's take a look at the first 20 rows from the ap.data.2.Gasoline dataset:

series_id            year    period         value    footnote_codes
APU000074712        1973    M10        0.402    
APU000074712        1973    M11        0.418    
APU000074712        1973    M12        0.437    
APU000074712        1974    M01        0.465    
APU000074712        1974    M02        0.491    
APU000074712        1974    M03        0.528    
APU000074712        1974    M04        0.537    
APU000074712        1974    M05        0.550    
APU000074712        1974    M06        0.556    
APU000074712        1974    M07        0.558    
APU000074712        1974    M08        0.554    
APU000074712        1974    M09        0.550    
APU000074712        1974    M10        0.534    
APU000074712        1974    M11        0.528    
APU000074712        1974    M12        0.528    
APU000074712        1975    M01        0.532    
APU000074712        1975    M02        0.533    
APU000074712        1975    M03        0.534    
APU000074712        1975    M04        0.540

We can see that this data is both space and tab delimited. It turns out it has a fixed width formatted lines text file format. Let's see how to work with this in Python using the pandas package. Our first step is to install pandas and then import it for use in this notebook. We install pandas by running:

pip install pandas

in the command line in the virtual environment used to run the Jupyter Notebook. We then run the following code

In [1]:

import pandas as pd

Let's see if we can load this data up with Pandas.

In [2]:

import os


fp = os.path.join('cpi', 'ap.data.2.Gasoline.txt')
df = pd.read_fwf(fp)

Let's check our first 20 values using the .head attribute of df.

In [3]:

df.head(20)

Out[3]:

	series_id	year	period	value	footnote_codes
0	APU000074712	1973	M10	0.402	NaN
1	APU000074712	1973	M11	0.418	NaN
2	APU000074712	1973	M12	0.437	NaN
3	APU000074712	1974	M01	0.465	NaN
4	APU000074712	1974	M02	0.491	NaN
5	APU000074712	1974	M03	0.528	NaN
6	APU000074712	1974	M04	0.537	NaN
7	APU000074712	1974	M05	0.550	NaN
8	APU000074712	1974	M06	0.556	NaN
9	APU000074712	1974	M07	0.558	NaN
10	APU000074712	1974	M08	0.554	NaN
11	APU000074712	1974	M09	0.550	NaN
12	APU000074712	1974	M10	0.534	NaN
13	APU000074712	1974	M11	0.528	NaN
14	APU000074712	1974	M12	0.528	NaN
15	APU000074712	1975	M01	0.532	NaN
16	APU000074712	1975	M02	0.533	NaN
17	APU000074712	1975	M03	0.534	NaN
18	APU000074712	1975	M04	0.540	NaN
19	APU000074712	1975	M05	0.550	NaN

We can see that the .read_fwf function is able to handle this fixed width formatted lines text file with ease.

Isolating a Variable¶

Now that we have full access to our dataset as a pandas data frame we need to limit our results to regular unleaded or item_code 74714. The key to figuring this out is to look at the series_id variable.

Reading the ap.txt file we learn that the series_id consists of four consecutive components: the survey abbreviation, a seasonal code, the area_code, and the item_code. Looking at the first row (number 0) we have survey abbreviation AP, seasonal code U, area_code 0000, and item_code 74712.

We are looking for instances with item_code 74714. We also need to account for different area_code observations. Looking at the ap.area.txt file we can see what each area_code corresponds to. There are many different types of regions and cities covered by different area codes. Here is a small subset:

area_code area_name 0000 U.S. city average A214 Kansas City, MO-KS S24B St. Louis, MO-IL

We will focus on area_code 0000 corresponding to the overall U.S. city average. Thus, our focus will be rows with series_id APU000074714. We do this below:

In [4]:

regular_unleaded = df[df['series_id'] == 'APU000074714']

In [5]:

regular_unleaded.head()

Out[5]:

	series_id	year	period	value	footnote_codes
334	APU000074714	1976	M01	0.605	NaN
335	APU000074714	1976	M02	0.600	NaN
336	APU000074714	1976	M03	0.594	NaN
337	APU000074714	1976	M04	0.592	NaN
338	APU000074714	1976	M05	0.600	NaN

Now that we have the desired dataset, let's take some time to examine what we are working with.

In [6]:

# Number of observations / rows
regular_unleaded.shape[0]

Out[6]:

In [7]:

# Check data types for all columns
regular_unleaded.dtypes

Out[7]:

series_id          object
year                int64
period             object
value             float64
footnote_codes    float64
dtype: object

Data Cleaning¶

With our current data frame, regular_unleaded, we have both a year and period column. The period corresponds to the month where M01 is January, M02 is February and so on. We need to create a new column with a datetime data type to help us visualize our data. Our steps are below:

In [8]:

# Create new column with datetime data type
month = regular_unleaded.apply(lambda row: row.period.replace('M', ''), axis=1)
day = [1 for day in range(len(regular_unleaded))]
regular_unleaded.loc[:, ['date']] = pd.to_datetime(
    [f'{y}-{m}-{d}' for y, m, d in zip(regular_unleaded.year, month, day)]
)

regular_unleaded.head(5)

Out[8]:

	series_id	year	period	value	footnote_codes	date
334	APU000074714	1976	M01	0.605	NaN	1976-01-01
335	APU000074714	1976	M02	0.600	NaN	1976-02-01
336	APU000074714	1976	M03	0.594	NaN	1976-03-01
337	APU000074714	1976	M04	0.592	NaN	1976-04-01
338	APU000074714	1976	M05	0.600	NaN	1976-05-01

Notice that we created variables to hold the month and day values that go into the date. Furthermore, we are not concerned with the day of the month so we simply use the first day of the month as a placeholder.

Data Visualization¶

Now that we have cleaned up our data let's take a look our results. To do this we install matplotlib using

pip install matplotlib

in our virtual environment. We then import matplotlib.pyplot to allow access to some of the graphical functionality in matplotlib. Below we can see our results:

In [9]:

import matplotlib.pyplot as plt
regular_unleaded.plot(x='date', y='value')
plt.xlabel('Date')
plt.ylabel('Price (dollars)')
plt.title("Regular Unleaded Price over Time")

Out[9]:

Text(0.5, 1.0, 'Regular Unleaded Price over Time')

No description has been provided for this image

We can see from our plot above the average price of regular unleaded has an upward trend. In addition, we can see that our greatest average price occurred recently (based on our current range of data). To see our start and end date for this collection of data consider the following code:

In [10]:

# get earliest date
min(regular_unleaded['date'])

Out[10]:

Timestamp('1976-01-01 00:00:00')

In [11]:

# get newest date
max(regular_unleaded['date'])

Out[11]:

Timestamp('2023-08-01 00:00:00')

We mentioned the recent date with the highest average price. Let's calculate this value and determine the exact date in our date range.

In [12]:

# Maximum value
max_price = max(regular_unleaded['value'])
max_price

Out[12]:

5.058

Now to find the date with the maximum price. Since we can visually see that we do not have a repeat year we can do the following:

In [13]:

regular_unleaded['value'].idxmax()

Out[13]:

With this we can get the actual date of the max value as follows:

In [14]:

regular_unleaded.loc[891, 'date']

Out[14]:

Timestamp('2022-06-01 00:00:00')

Thus, the greatest average price for regular unleaded was in June of 2022.

Conclusion¶

This is a really rich dataset with a lot of data which can be extracted. In this post we demonstrated how to create a time series plot using one component of this massive dataset. In a future post we will revisit this dataset and extract some more insights using this dataset.

August 2023 CPI - Regular Unleaded

Introduction¶

Extracting the Data¶

Isolating a Variable¶

Data Cleaning¶

Data Visualization¶

Conclusion¶

Similar posts

0 comments

Add a new comment

My blog

Latest posts

Most commented posts

Popular Tags see all...