Introduction

In this post we will use a portion of the Steam Hardware & Software Survey from February 2023. The data came from https://store.steampowered.com/hwsurvey/, although that page appears to refresh every month with new survey data. Now that my new site is set up, my goal is to post content in a variety of languages, including R, Python, and web languages. In this case I am going to use Jupyter to create a short Python project.

In this project we will be analyzing the operating system data from the survey. We will be using Python's pandas package for our data cleaning and manipulation.

Data Cleaning

Our first order of business is to load the pandas package. We can then read in the CSV file that I built from the available survey data. Once the data has been read in, we will display it.

In [13]:
import pandas as pd
In [32]:
steam_os = pd.read_csv(
    r'C:\Python\JupyterNotebook\Data Science Practice Folder\data\steam_hwsurvey_os_february2023.csv',
)
In [33]:
steam_os
Out[33]:
    Version                    Percentage  Change  OS
0   Windows 10 64 bit          62.33%      -1.13%  Windows
1   Windows 11 64 bit          32.06%      +1.73%  Windows
2   Windows 7 64 bit           1.43%       -0.17%  Windows
3   Windows 8.1 64 bit         0.34%       -0.05%  Windows
4   Windows 7                  0.09%       -0.02%  Windows
5   MacOS 13.1.0 64 bit        0.42%       -0.10%  OSX
6   MacOS 13.2.0 64 bit        0.26%       +0.20%  OSX
7   MacOS 13.0.1 64 bit        0.17%       -0.18%  OSX
8   MacOS 10.15.7 64 bit       0.13%       -0.02%  OSX
9   MacOS 13.2.1 64 bit        0.13%       +0.13%  OSX
10  MacOS 12.6.0 64 bit        0.13%       -0.08%  OSX
11  MacOS 12.6.3 64 bit        0.11%       +0.11%  OSX
12  MacOS 13.0.0 64 bit        0.09%       -0.02%  OSX
13  MacOS 12.4.0 64 bit        0.08%       -0.03%  OSX
14  MacOS 10.13.6 64 bit       0.08%       -0.01%  OSX
15  MacOS 12.5.0 64 bit        0.07%       -0.02%  OSX
16  MacOS 12.5.1 64 bit        0.07%       -0.03%  OSX
17  MacOS 12.6.2 64 bit        0.06%       -0.04%  OSX
18  MacOS 10.14.6 64 bit       0.05%       -0.02%  OSX
19  Arch Linux 64 bit          0.13%       0.00%   Linux
20  Ubuntu 22.04.1 LTS 64 bit  0.12%       -0.04%  Linux
21  Manjaro Linux 64 bit       0.08%       -0.01%  Linux
22  Linux Mint 21.1 64 bit     0.06%       +0.01%  Linux
22 Linux Mint 21.1 64 bit 0.06% +0.01% Linux

As we can see, we have 23 rows and 4 columns. The original data was not comma-separated and did not have a fourth column; I made both edits before loading the dataset to keep things simple.
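
The OS column was added by hand, but a similar grouping could also be derived from the Version text. The snippet below is only a rough sketch of that idea: `os_family` is a hypothetical helper, not part of the original notebook, and it assumes that any label not starting with "Windows" or "MacOS" is a Linux distribution, which holds for this extract but not necessarily for the full survey.

```python
import pandas as pd


def os_family(version: str) -> str:
    """Map a version label to a coarse OS family (hypothetical helper)."""
    if version.startswith("Windows"):
        return "Windows"
    if version.startswith("MacOS"):
        return "OSX"
    # Assumption: everything else in this extract is a Linux distribution.
    return "Linux"


# A few labels taken from the table above.
versions = pd.Series(
    [
        "Windows 10 64 bit",
        "MacOS 13.1.0 64 bit",
        "Arch Linux 64 bit",
        "Ubuntu 22.04.1 LTS 64 bit",
    ],
    name="Version",
)

families = versions.map(os_family)
print(families.tolist())
```

For a dataset this small, editing the CSV by hand is just as reasonable; a rule like this mostly pays off if the extract is rebuilt every month.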

Let's take a look at our pandas dataframe in more detail.

In [34]:
steam_os.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Version     23 non-null     object
 1   Percentage  23 non-null     object
 2   Change      23 non-null     object
 3   OS          23 non-null     object
dtypes: object(4)
memory usage: 432.0+ bytes

At this point every column has a datatype of object. This is fine for our Version and OS columns, but not ideal for the others. For Percentage and Change we need a numeric datatype, which first requires removing extra symbols like % and +. We can do this using the str.replace method.

In [35]:
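# replace % and +
steam_os['Percentage'] = steam_os['Percentage'].str.replace('%', '', regex=False)
# regex=True is needed for the [+%] character class; pandas 2.0+ defaults to regex=False
steam_os['Change'] = steam_os['Change'].str.replace(r'[+%]', '', regex=True)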
In [37]:
# update columns to numerical
steam_os['Percentage'] = pd.to_numeric(steam_os['Percentage'])
steam_os['Change'] = pd.to_numeric(steam_os['Change'])
In [38]:
# confirm columns are now numerical
steam_os.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Version     23 non-null     object 
 1   Percentage  23 non-null     float64
 2   Change      23 non-null     float64
 3   OS          23 non-null     object 
dtypes: float64(2), object(2)
memory usage: 616.0+ bytes
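
As an aside, the strip-and-convert steps could be folded into one small function. This is just a sketch of an alternative, not what the notebook above does: `percent_to_float` is a made-up name, and the sample frame holds only a few values copied from the table. It relies on Python's float conversion accepting a leading + sign, so only the trailing % has to be removed.

```python
import pandas as pd


def percent_to_float(column: pd.Series) -> pd.Series:
    """Turn strings like '+1.73%' into floats like 1.73 (hypothetical helper)."""
    # Only the trailing % is stripped; float conversion handles the sign.
    return column.str.rstrip("%").astype(float)


# Small sample with values copied from the survey table.
sample = pd.DataFrame(
    {
        "Percentage": ["62.33%", "32.06%", "0.13%"],
        "Change": ["-1.13%", "+1.73%", "0.00%"],
    }
)

# DataFrame.apply passes each column to the helper in turn.
cleaned = sample.apply(percent_to_float)
print(cleaned.dtypes)
```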

Data Exploration

Now that we have our data cleaned up we can take a look at a few values. Let's figure out how much each operating system is used in total.

In [39]:
os_agg = steam_os.groupby('OS', as_index=False)['Percentage'].agg('sum')
os_agg
Out[39]:
   OS       Percentage
0  Linux    0.39
1  OSX      1.85
2  Windows  96.25

As we can see, Windows is far and away the most used operating system among Steam users at 96.25%. OSX (1.85%) and Linux (0.39%) are not even close.

Note that these percentages do not sum to 100, so the versions listed here do not account for every operating system in the survey.

In [40]:
sum(os_agg['Percentage'])
Out[40]:
98.49000000000001
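
To put a number on what is missing, we can subtract the total from 100. The sketch below rebuilds the aggregated table by hand so that it runs on its own; the variable name `unlisted` is my own, and rounding to two decimals is only there to hide the floating-point noise visible in the sum above.

```python
import pandas as pd

# Aggregated shares copied from the output above.
os_agg = pd.DataFrame(
    {
        "OS": ["Linux", "OSX", "Windows"],
        "Percentage": [0.39, 1.85, 96.25],
    }
)

total = os_agg["Percentage"].sum()
# Share of users on versions that are not in this extract.
unlisted = round(100 - total, 2)
print(unlisted)  # 1.51
```

So roughly 1.5% of surveyed users are on versions that did not make it into this extract.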

Let's visualize our percentages by operating system below.

In [54]:
os_agg.plot.barh(x="OS")
Out[54]:
<matplotlib.axes._subplots.AxesSubplot at 0x139b5c50>
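
The default chart has no axis labels or title. The variation below is one possible way to add them, assuming matplotlib is installed alongside pandas; the label and title text are my own wording, and the table is rebuilt by hand so the snippet stands alone.

```python
import pandas as pd

# Aggregated shares copied from the output above.
os_agg = pd.DataFrame(
    {
        "OS": ["Linux", "OSX", "Windows"],
        "Percentage": [0.39, 1.85, 96.25],
    }
)

# Sort ascending so the largest bar ends up at the top of the chart.
ax = os_agg.sort_values("Percentage").plot.barh(
    x="OS", y="Percentage", legend=False
)
ax.set_xlabel("Share of surveyed users (%)")
ax.set_ylabel("Operating system")
ax.set_title("Steam OS share, February 2023")
```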

Conclusion

This is a fairly simple dataset, but we were able to do a lot with it. A little data cleaning here, some grouping there, and a splash of color with our chart. From the data itself we can see that Windows is the dominant operating system. We also have reason to believe that people are quickly transitioning over to Windows 11: in the Change column, Windows 11 is the biggest gainer at +1.73%, while Windows 10 shows the largest drop at -1.13%.
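
For anyone who wants to check the Windows 11 observation in code, sorting the Windows rows by Change is enough. This is a minimal sketch using the cleaned values from the table; the `windows` and `ranked` names are only for illustration.

```python
import pandas as pd

# Windows rows with Change already converted to numbers.
windows = pd.DataFrame(
    {
        "Version": [
            "Windows 10 64 bit",
            "Windows 11 64 bit",
            "Windows 7 64 bit",
            "Windows 8.1 64 bit",
            "Windows 7",
        ],
        "Change": [-1.13, 1.73, -0.17, -0.05, -0.02],
    }
)

# Largest gain first, largest loss last.
ranked = windows.sort_values("Change", ascending=False)
print(ranked.iloc[0]["Version"])   # Windows 11 64 bit
print(ranked.iloc[-1]["Version"])  # Windows 10 64 bit
```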

Following this I will try to do more analysis in Python. Hopefully I can find some pretty neat datasets to work with!