Introduction

In this post we will use a portion of the Steam Hardware & Software Survey from February 2023. The data came from https://store.steampowered.com/hwsurvey/, although that page appears to refresh every month with new survey data. With my new site set up, my goal is to post content in a variety of languages, including R, Python, and web languages. In this case I am going to use Jupyter to create a short Python project.

In this project we will be analyzing the operating system data from the survey. We will be using Python's pandas package for our data cleaning and manipulation.

Data Cleaning

Our first order of business is to load the pandas package. We can then read in the CSV file that I prepared from the available data. Once the data has been read in, we will display it.

In [13]:

import pandas as pd

In [32]:

steam_os = pd.read_csv(
    r'C:\Python\JupyterNotebook\Data Science Practice Folder\data\steam_hwsurvey_os_february2023.csv',
)

In [33]:

steam_os

Out[33]:

	Version	Percentage	Change	OS
0	Windows 10 64 bit	62.33%	-1.13%	Windows
1	Windows 11 64 bit	32.06%	+1.73%	Windows
2	Windows 7 64 bit	1.43%	-0.17%	Windows
3	Windows 8.1 64 bit	0.34%	-0.05%	Windows
4	Windows 7	0.09%	-0.02%	Windows
5	MacOS 13.1.0 64 bit	0.42%	-0.10%	OSX
6	MacOS 13.2.0 64 bit	0.26%	+0.20%	OSX
7	MacOS 13.0.1 64 bit	0.17%	-0.18%	OSX
8	MacOS 10.15.7 64 bit	0.13%	-0.02%	OSX
9	MacOS 13.2.1 64 bit	0.13%	+0.13%	OSX
10	MacOS 12.6.0 64 bit	0.13%	-0.08%	OSX
11	MacOS 12.6.3 64 bit	0.11%	+0.11%	OSX
12	MacOS 13.0.0 64 bit	0.09%	-0.02%	OSX
13	MacOS 12.4.0 64 bit	0.08%	-0.03%	OSX
14	MacOS 10.13.6 64 bit	0.08%	-0.01%	OSX
15	MacOS 12.5.0 64 bit	0.07%	-0.02%	OSX
16	MacOS 12.5.1 64 bit	0.07%	-0.03%	OSX
17	MacOS 12.6.2 64 bit	0.06%	-0.04%	OSX
18	MacOS 10.14.6 64 bit	0.05%	-0.02%	OSX
19	Arch Linux 64 bit	0.13%	0.00%	Linux
20	Ubuntu 22.04.1 LTS 64 bit	0.12%	-0.04%	Linux
21	Manjaro Linux 64 bit	0.08%	-0.01%	Linux
22	Linux Mint 21.1 64 bit	0.06%	+0.01%	Linux

As we can see, we have 23 rows and 4 columns. The original data was not comma-separated and did not have a fourth column. I made those edits before loading the dataset to keep things simple.
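I added the OS column by hand, but the same label could probably be derived in pandas. Here is a rough sketch, assuming the version strings follow the patterns seen in the table above. The `os_family` helper is hypothetical and was not part of my original workflow.

```python
import pandas as pd

# A few rows copied from the table above, without the hand-added OS column.
sample = pd.DataFrame({
    "Version": [
        "Windows 10 64 bit",
        "MacOS 13.1.0 64 bit",
        "Arch Linux 64 bit",
        "Ubuntu 22.04.1 LTS 64 bit",
    ],
})


def os_family(version: str) -> str:
    """Hypothetical helper: map a version string to the OS labels used in this post."""
    if version.startswith("Windows"):
        return "Windows"
    if version.startswith("MacOS"):
        return "OSX"
    # Assumption: everything else in this extract is a Linux distribution.
    return "Linux"


sample["OS"] = sample["Version"].map(os_family)
print(sample)
```

The fallback to "Linux" only holds for this particular extract; a larger survey file would need an explicit list of distribution names.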

Let's take a look at our pandas dataframe in more detail.

In [34]:

steam_os.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Version     23 non-null     object
 1   Percentage  23 non-null     object
 2   Change      23 non-null     object
 3   OS          23 non-null     object
dtypes: object(4)
memory usage: 432.0+ bytes

At this point every column has a datatype of object. This is fine for our Version and OS columns, but not ideal for the others. For Percentage and Change we need a numeric datatype, which means first removing extra symbols like % and +. We can do this using the str.replace method.

In [35]:

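# strip % and + so the values can be converted to numbers
# (regex is set explicitly because its default differs between pandas versions)
steam_os['Percentage'] = steam_os['Percentage'].str.replace('%', '', regex=False)
steam_os['Change'] = steam_os['Change'].str.replace(r'[+%]', '', regex=True)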

In [37]:

# update columns to numerical
steam_os['Percentage'] = pd.to_numeric(steam_os['Percentage'])
steam_os['Change'] = pd.to_numeric(steam_os['Change'])

In [38]:

# confirm columns are now numerical
steam_os.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Version     23 non-null     object 
 1   Percentage  23 non-null     float64
 2   Change      23 non-null     float64
 3   OS          23 non-null     object 
dtypes: float64(2), object(2)
memory usage: 616.0+ bytes
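The strip-and-convert steps above can also be wrapped in a small function and checked on a handful of rows. This is only a sketch: `percent_to_float` is a hypothetical name, and the sample rows are copied from the table earlier in the post.

```python
import pandas as pd


def percent_to_float(column: pd.Series) -> pd.Series:
    """Hypothetical helper: strip '+' and '%' from strings, then convert to floats."""
    # regex=True is spelled out so that [+%] is treated as a character class.
    stripped = column.str.replace(r"[+%]", "", regex=True)
    return pd.to_numeric(stripped)


# Three rows copied from the table above, still stored as strings.
sample = pd.DataFrame({
    "Percentage": ["62.33%", "32.06%", "0.13%"],
    "Change": ["-1.13%", "+1.73%", "0.00%"],
})

for col in ["Percentage", "Change"]:
    sample[col] = percent_to_float(sample[col])

print(sample.dtypes)
```

Because `pd.to_numeric` raises an error on values it cannot parse, a leftover symbol would show up right away instead of silently staying a string.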

Data Exploration

Now that we have our data cleaned up we can take a look at a few values. Let's figure out how much each operating system is used in total.

In [39]:

os_agg = steam_os.groupby('OS', as_index=False)['Percentage'].agg('sum')
os_agg

Out[39]:

	OS	Percentage
0	Linux	0.39
1	OSX	1.85
2	Windows	96.25

As we can see, Windows is far and away the most used operating system among Steam users at 96.25%. Mac and Linux users are not even close, at 1.85% and 0.39% respectively.

Note that these numbers do not account for all operating systems: as the sum below shows, the listed versions add up to less than 100%.

In [40]:

sum(os_agg['Percentage'])

Out[40]:

98.49000000000001
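The gap left by versions that are not listed can be worked out from that total. A small sketch using the aggregated values from the table above (the variable names are my own):

```python
# Totals per operating system, taken from the aggregated table above.
os_totals = {"Linux": 0.39, "OSX": 1.85, "Windows": 96.25}

covered = sum(os_totals.values())
# Round to two decimals to hide floating-point noise such as 98.49000000000001.
unaccounted = round(100 - covered, 2)

print(f"Covered: {covered:.2f}%, not listed: {unaccounted:.2f}%")
```

So roughly 1.51% of surveyed users fall outside the versions in this extract.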

Let's visualize our percentages by operating system below.

In [54]:

os_agg.plot.barh(x="OS")

Out[54]:

<matplotlib.axes._subplots.AxesSubplot at 0x139b5c50>
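The default chart has no axis labels or title. Below is one way it might be dressed up, rebuilt from the aggregated values so that it runs on its own. The label and title text are my own wording, and the Agg backend line is only there so the sketch runs as a plain script; inside Jupyter it can be dropped.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook

import pandas as pd

# Aggregated values copied from the table above.
os_agg = pd.DataFrame({
    "OS": ["Linux", "OSX", "Windows"],
    "Percentage": [0.39, 1.85, 96.25],
})

# Sort so the largest bar ends up at the top of the horizontal chart.
ax = os_agg.sort_values("Percentage").plot.barh(x="OS", y="Percentage", legend=False)
ax.set_xlabel("Share of surveyed Steam users (%)")
ax.set_ylabel("Operating system")
ax.set_title("Steam survey, February 2023")
ax.figure.tight_layout()
```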

Conclusion

This is a fairly simple dataset, but we were able to do a lot with it. A little data cleaning here, some grouping there, and a splash of color with our chart. From the data itself we can see that Windows is the dominant operating system. We also have reason to believe that people are moving over to Windows 11: in the Change column, Windows 11 gained 1.73 percentage points over the month while Windows 10 lost 1.13.
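To make the Change column easier to read, the Windows rows can be sorted by it. A quick sketch, with values copied from the cleaned table (the `windows` and `movers` names are my own):

```python
import pandas as pd

# Windows rows copied from the table above, already converted to numbers.
windows = pd.DataFrame({
    "Version": [
        "Windows 10 64 bit",
        "Windows 11 64 bit",
        "Windows 7 64 bit",
        "Windows 8.1 64 bit",
        "Windows 7",
    ],
    "Change": [-1.13, 1.73, -0.17, -0.05, -0.02],
})

# Largest monthly gain first.
movers = windows.sort_values("Change", ascending=False)
print(movers)

# Net movement across the listed Windows versions, in percentage points.
net_change = round(windows["Change"].sum(), 2)
print(net_change)
```

Windows 11 is the only listed Windows version that gained share, which fits the idea of a shift toward it, although one month of data is thin evidence for a trend.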

Following this I will try to do more analysis in Python. Hopefully I can find some pretty neat datasets to work with!

Steam Operating Systems Explored with Python's Pandas
