Introduction
In this post we will use a portion of the Steam Hardware & Software Survey from February 2023. The data came from https://store.steampowered.com/hwsurvey/, although that page appears to refresh every month with new survey data. With my new site set up, my goal is to post content in a variety of languages, including R, Python, and web languages. In this case I am going to use Jupyter to create a short Python project.
In this project we will analyze the operating system data from the survey, using Python's pandas package for data cleaning and manipulation.
Data Cleaning
Our first order of business is to load the pandas package. We can then read in the CSV file that I built from the available data. Once the data has been read in, we will display it.
import pandas as pd
steam_os = pd.read_csv(
r'C:\Python\JupyterNotebook\Data Science Practice Folder\data\steam_hwsurvey_os_february2023.csv',
)
steam_os
As we can see, we have 22 rows and 4 columns. The original data was not comma-separated and did not have a fourth column. I made both edits before loading the dataset to keep things simple.
Let's take a look at our pandas DataFrame in more detail.
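For readers without the CSV at hand, here is a minimal sketch of the same loading step using an in-memory string. The rows are made-up placeholders, not survey figures. Only the four column names follow the post.

```python
import io

import pandas as pd

# Made-up placeholder rows; NOT real survey values.
sample_csv = io.StringIO(
    "OS,Version,Percentage,Change\n"
    "Windows,Windows 10 64 bit,60.00%,-1.00%\n"
    "Windows,Windows 11 64 bit,30.00%,+1.50%\n"
    "Linux,Ubuntu 22.04 LTS 64 bit,0.50%,+0.10%\n"
)
sample = pd.read_csv(sample_csv)

# shape is (rows, columns)
print(sample.shape)

# The % and + symbols keep these columns from being parsed as numbers.
print(pd.api.types.is_numeric_dtype(sample["Percentage"]))
```

Checking the shape right after loading is a cheap way to confirm that the delimiter and header were picked up as expected.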
steam_os.info()
At this point every column has a datatype of object. This is fine for our Version and OS columns, but not ideal for the others. For Percentage and Change we need a numeric datatype, which first requires removing extra symbols like % and +. We can do this with the str.replace method.
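# replace % and +
# '%' is a literal match; [+%] is a regex character class, so it needs regex=True
steam_os['Percentage'] = steam_os['Percentage'].str.replace('%', '', regex=False)
steam_os['Change'] = steam_os['Change'].str.replace(r'[+%]', '', regex=True)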
# update columns to numerical
steam_os['Percentage'] = pd.to_numeric(steam_os['Percentage'])
steam_os['Change'] = pd.to_numeric(steam_os['Change'])
# confirm columns are now numerical
steam_os.info()
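The cleaning step can also be tried in isolation. The sketch below uses a small made-up frame (placeholder values, not survey data) and passes regex=True explicitly, since pandas 2.0 and later treat the pattern as a literal string by default.

```python
import pandas as pd

# Made-up placeholder values; NOT real survey figures.
demo = pd.DataFrame({
    "Percentage": ["60.00%", "30.00%", "0.50%"],
    "Change": ["-1.00%", "+1.50%", "+0.10%"],
})

# regex=True makes [+%] a character class that matches either symbol.
# The minus sign is left alone so negative changes stay negative.
for col in ["Percentage", "Change"]:
    stripped = demo[col].str.replace(r"[+%]", "", regex=True)
    demo[col] = pd.to_numeric(stripped)

print(demo.dtypes)
print(demo)
```

By default pd.to_numeric raises an error on any value it cannot parse, so a leftover symbol would surface immediately rather than silently producing bad data.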
Data Exploration
Now that we have our data cleaned up we can take a look at a few values. Let's figure out how much each operating system is used in total.
os_agg = steam_os.groupby('OS', as_index=False)['Percentage'].agg('sum')
os_agg
As we can see Windows is far and away the most used operating system for Steam users. Linux and Mac users are not even close.
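If you want the totals ranked from largest to smallest, one option is to sort after grouping. This is a sketch on made-up placeholder rows. The "Mac" label and all values are assumptions for illustration, not taken from the survey file.

```python
import pandas as pd

# Made-up placeholder rows; NOT real survey figures.
demo = pd.DataFrame({
    "OS": ["Windows", "Windows", "Linux", "Mac"],
    "Version": ["Version A", "Version B", "Version C", "Version D"],
    "Percentage": [60.0, 30.0, 0.5, 1.0],
})

# Sum the share of every version within each OS, then rank the result.
totals = (
    demo.groupby("OS", as_index=False)["Percentage"]
    .sum()
    .sort_values("Percentage", ascending=False, ignore_index=True)
)

print(totals)
```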
Note below that these numbers do not account for all operating systems.
os_agg['Percentage'].sum()
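To put a number on the gap, you can subtract the total from 100. The sketch below assumes the shortfall is simply the share not listed in the table, and it uses placeholder totals rather than the real survey values.

```python
import pandas as pd

# Placeholder totals per OS; NOT real survey figures.
totals = pd.Series({"Windows": 90.0, "Mac": 1.0, "Linux": 0.5})

listed = totals.sum()
# Assumption: whatever is missing from 100 belongs to unlisted systems.
unlisted = round(100 - listed, 2)

print(f"Listed: {listed}%, unlisted: {unlisted}%")
```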
Let's visualize our percentages by operating system below.
os_agg.plot.barh(x="OS")
Conclusion
This is a fairly simple dataset, but we were able to do a lot with it: a little data cleaning here, some grouping there, and a splash of color with our chart. From the data itself we can see that Windows is the dominant operating system. We also have reason to believe that people are quickly transitioning over to Windows 11, which you can see by looking at the Change column.
Following this I will try to do more analysis in Python. Hopefully I can find some pretty neat datasets to work with!
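One way to inspect the Change column for this kind of shift is to filter to the Windows rows and sort by the monthly change. The sketch below shows only the mechanics. Its values are made-up placeholders and say nothing about the actual survey movement.

```python
import pandas as pd

# Made-up placeholder rows; NOT real survey figures.
demo = pd.DataFrame({
    "OS": ["Windows", "Windows", "Linux"],
    "Version": ["Windows 10 64 bit", "Windows 11 64 bit", "Ubuntu 22.04 LTS 64 bit"],
    "Change": [-1.0, 1.5, 0.1],
})

# Keep only Windows versions and list the biggest gainers first.
windows = demo[demo["OS"] == "Windows"].sort_values("Change", ascending=False)

print(windows[["Version", "Change"]])
```

Applied to the cleaned steam_os frame, the same filter and sort would show which Windows versions gained or lost share during the month.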