Attempting to Cluster States by Steam Locomotive Types

Simon Miller
INST414: Data Science Techniques
Mar 29, 2024
EBT 16, an example of a 2–8–2 preserved in Pennsylvania at the East Broad Top Railroad.

The Question

Today, diesel and electric units power the country’s trains. However, this job was once done almost exclusively by steam locomotives, and different types were used in different areas. For example, geared steam locomotives were typical in mountainous areas. Large coupled engines such as 4–10–2s and 2–10–2s were more common out west, while smaller engines such as 4–6–2s and 4–4–2s were more common around cities. (These identifiers are Whyte notation wheel arrangements: the number of leading wheels, driving wheels, and trailing wheels, in that order, followed by a T if the locomotive is a tank engine. The trailing “s” only marks a plural and is not part of the notation. See “Whyte Steam Locomotive Classification” below, or the Wikipedia article “Whyte Notation”, for more detail.) This suggests that, historically, regions of the United States may be identifiable by the types of steam locomotives used there. Using clustering, this raises the following question:

Will clear, identifiable clusters emerge by grouping together states by steam locomotive type?

Although very specific, this question could be useful to historians and enthusiasts alike, helping them identify where designs were used across the country.

An explanation of Whyte notation. Note that geared engines are not listed here; they are identified by builder and number of trucks (e.g., 4-truck Shay, 3-truck Climax, 2-truck Heisler). Source: https://traveltown.org/historyandlearning/whyte-locomotive-classification/
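
To make the notation concrete, here is a minimal sketch of splitting a Whyte string into its parts. The `WheelArrangement` class and `parse_whyte` function are hypothetical helpers written for this illustration and are not part of the analysis code. The sketch assumes the first number is the leading wheels, the last is the trailing wheels, and everything in between is one or more groups of driving wheels, with any trailing letters (such as T) kept as a suffix.

```python
import re
from dataclasses import dataclass


@dataclass
class WheelArrangement:
    leading: int    # wheels on the leading truck
    driving: list   # one entry per group of driving wheels
    trailing: int   # wheels on the trailing truck
    suffix: str     # trailing letters, e.g. "T" for a tank engine


def parse_whyte(text):
    """Split a Whyte string such as '2-6-2T' into its parts."""
    # Accept both plain hyphens and the en dashes used in this post.
    cleaned = text.strip().replace("\u2013", "-")
    match = re.fullmatch(r"(\d+(?:-\d+){2,})([A-Za-z]*)", cleaned)
    if match is None:
        raise ValueError(f"not a Whyte wheel arrangement: {text!r}")
    counts = [int(part) for part in match.group(1).split("-")]
    return WheelArrangement(
        leading=counts[0],
        driving=counts[1:-1],
        trailing=counts[-1],
        suffix=match.group(2).upper(),
    )


print(parse_whyte("2-6-2T"))
print(parse_whyte("2\u20136\u20136\u20132"))
```

Geared engines described by builder and truck count (for example "4 Truck Shay") do not fit this pattern, so the sketch raises a `ValueError` for them rather than guessing.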

The Data

Ideally, the data would be a complete historical list of all steam locomotives in the country. At a minimum, it would contain each locomotive’s wheel arrangement (Whyte notation) and the state it was based in. Unfortunately, no such historical dataset exists. However, steamlocomotive.com maintains a database of preserved steam locomotives that includes the wheel arrangement and the state in which each locomotive is preserved.

To gather the information from the website, a GET request was sent to a US-only search URL, and BeautifulSoup was used to parse the response and locate the results table. The HTML table was then converted to a pandas DataFrame. The Python code for this is listed below.

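import io

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Request the US-only search results page
response = requests.get("https://www.steamlocomotive.com/survivors/search.php?str=&country=USA")
response.raise_for_status()  # stop early if the request failed
soup = BeautifulSoup(response.text, 'html.parser')  # parse the HTML
table = soup.find('table')  # find the first table on the page
# Convert the table into a pandas DataFrame (read_html expects a file-like object)
df = pd.read_html(io.StringIO(str(table)))[0]

df.head(10)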
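
For readers who want to see what the table-to-rows step involves without hitting the website, the following is a rough offline sketch. It swaps in the standard library's `html.parser` for BeautifulSoup and pandas, and it runs on a made-up HTML snippet. The `TableRows` class, the sample markup, and the place names in it are all assumptions for illustration; the real page's layout may differ.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample markup; the real page's columns and values may differ.
SAMPLE_HTML = """
<table>
  <tr><th>F.M. Whyte</th><th>Location</th></tr>
  <tr><td>2-8-2</td><td>Exampleville, PA</td></tr>
  <tr><td>0-4-0T</td><td>Sampletown, CA</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE_HTML)
header, *records = parser.rows
print(header)
print(records)
```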

To measure similarity, the DataFrame must be transformed into a matrix. The rows (the index) are the US states, and the columns are the wheel arrangements found on the steamlocomotive.com website. Each value is the number of occurrences of that wheel arrangement in that state. The Python code that accomplishes this, and a subset of the resulting DataFrame, are included below.

unique_wheel = df['F.M. Whyte'].dropna().unique()  # unique wheel arrangements
states = [
    'AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA', 'HI', 'ID', 'IL', 'IN',
    'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH',
    'NJ', 'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT',
    'VT', 'VA', 'WA', 'WV', 'WI', 'WY'
]  # state codes, including DC

# One row per state, one column per wheel arrangement, all counts starting at 0
df_comparison = pd.DataFrame(index=states, columns=unique_wheel, data=0)

# Count each locomotive under its state and wheel arrangement
for index, row in df.iterrows():
    state = str(row["Location"]).strip()[-2:]  # last two characters are the state code
    wheel_arr = row["F.M. Whyte"]

    # Skip rows with a missing wheel arrangement or an unrecognized state code
    if pd.isna(wheel_arr) or state not in df_comparison.index:
        continue

    df_comparison.loc[state, wheel_arr] += 1

df_comparison.head(10)
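
As a small illustration of what the state-by-wheel-arrangement matrix holds, the sketch below builds one from a handful of made-up records using only the standard library. The records, the shortened state list, and the variable names are assumptions for the example and are not taken from the dataset.

```python
from collections import Counter

# Made-up (state, wheel arrangement) pairs standing in for the scraped rows.
records = [
    ("PA", "2-8-0"), ("PA", "2-8-0"), ("PA", "0-4-0T"),
    ("CA", "2-6-2"), ("CA", "2-8-0"),
]

states = ["CA", "PA", "WV"]                      # rows of the matrix
arrangements = sorted({w for _, w in records})   # columns of the matrix

pair_counts = Counter(records)   # (state, arrangement) -> occurrences

# One row per state, one column per arrangement; missing pairs count as 0.
matrix = [[pair_counts[(s, w)] for w in arrangements] for s in states]

print(arrangements)
for state, row in zip(states, matrix):
    print(state, row)
```

A state with no records, like WV in this toy example, still gets a row of zeros, which is also what the state list in the analysis code ensures. In pandas, `pd.crosstab` builds a comparable table from two columns.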

Before the data can be clustered, a value of K must be selected. One way to select K is the elbow method: fit the model for a range of K values and plot each K against its inertia (the sum of squared distances from each point to its nearest centroid). The K at the bend of the curve, where adding more clusters stops reducing inertia by much, is selected as the final value. This is shown in the code below. Note that the code from this point on is modified code from Professor Cody Buntain.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import MiniBatchKMeans

df_comparison.columns = df_comparison.columns.astype(str)  # ensure column names are strings

# Test different values of k
inertia_scores = []

for test_k in sorted(set(np.random.randint(1, 50, 20))):
    print("Test k:", test_k)

    tmp_model = MiniBatchKMeans(
        n_clusters=int(test_k),
        n_init=16, max_iter=2048, tol=0.5, reassignment_ratio=0.5
    )
    tmp_model.fit(df_comparison)

    score = tmp_model.inertia_
    inertia_scores.append((test_k, score))

inertia_df = pd.DataFrame(inertia_scores, columns=["k", "score"])

fig = plt.figure(figsize=(10, 5))
ax = fig.add_subplot(1, 1, 1)

inertia_df.sort_values(by="k").plot("k", "score", ax=ax)

ax.set_ylabel("Inertia")

plt.show()

This returned the graph shown below. Although the graph does not show a sharp elbow, there is a somewhat noticeable curve. Based on this graph, K was set to 25. With 51 rows (50 states plus DC), this averages out to about two states per cluster.
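
For reference, the score plotted on the y-axis is the model's inertia: the sum of squared distances from each point to its closest cluster center. Here is a minimal sketch of that calculation on made-up 2-D points. The `inertia` function and the hand-picked centroids are illustrative assumptions, not scikit-learn's implementation.

```python
def inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    total = 0.0
    for point in points:
        total += min(
            sum((p - c) ** 2 for p, c in zip(point, centroid))
            for centroid in centroids
        )
    return total


# Made-up 2-D points forming two obvious groups.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]

one_cluster = [(5, 1)]             # k = 1: a single centroid at the overall mean
two_clusters = [(0, 1), (10, 1)]   # k = 2: one centroid per group

print(inertia(points, one_cluster))
print(inertia(points, two_clusters))
```

Going from one centroid to two drops the score sharply in this toy case, and adding further centroids could only shave off the little that remains. That flattening is the bend the elbow method looks for.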

Describing Clusters & Answering the Question

To generate the clusters, the following Python code was executed.

from sklearn.cluster import KMeans

k = 25
cluster_model = KMeans(n_clusters=k)
cluster_model.fit(df_comparison)
cluster_labels = cluster_model.predict(df_comparison)
loco_cluster_df = pd.DataFrame(cluster_labels, index=df_comparison.index, columns=["cluster"])
print(loco_cluster_df["cluster"].value_counts())  # cluster sizes

# List the states in each cluster
for cluster, locos in loco_cluster_df.groupby("cluster"):
    print("Cluster:", cluster, "Size:", locos.shape[0])

    for state in locos.index:
        print(state)

This code returned the following clusters of states.

Cluster 0: LA, MS, NJ, NM, SC

Cluster 4: GA, NV, TN

Cluster 11: AL, AZ, IN, MT, NY

*Cluster 12: MD, VA

*Cluster 18: DC, DE, ID, KY, MA, ND, OK, RI, UT, VT, WY

Cluster 23: AK, AR, CT, IA

Cluster 24: MN, NE, NC

States in their own clusters: *PA, MI, *CA, KS, *CO, OH, WA, MO, ME, IL, SD, *WV, HI, OR, FL, WI, TX, NH

There are some similarities between states. For example, MD and VA in Cluster 12 both have one 0–8–0, small numbers of 0–4–0Fs, and moderate numbers of 4–4–0s. The largest cluster (18) consists entirely of states with few preserved steam locomotives.
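
One way to check a comparison like this by hand is to look at which wheel arrangements two states share and how far apart their count vectors are. K-means groups points by Euclidean distance, so the sketch below computes that distance. The two states and their counts are made up, and both helper functions are hypothetical.

```python
import math

# Made-up counts per wheel arrangement for two hypothetical states.
state_a = {"0-8-0": 1, "0-4-0F": 2, "4-4-0": 4}
state_b = {"0-8-0": 1, "0-4-0F": 3, "4-4-0": 5, "2-8-0": 1}


def shared_arrangements(a, b):
    """Wheel arrangements present (count > 0) in both states."""
    return sorted(k for k in a if a[k] > 0 and b.get(k, 0) > 0)


def euclidean_distance(a, b):
    """Distance between two count vectors; missing arrangements count as 0."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in keys))


print(shared_arrangements(state_a, state_b))
print(euclidean_distance(state_a, state_b))
```

Because the distance works on raw counts, two states with very few locomotives are automatically close to each other, which is consistent with the large cluster of low-count states described above.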

Will clear, identifiable clusters emerge by grouping together states by steam locomotive type?

No. This analysis does return clusters of states with similar wheel arrangements. However, they are not as clear and identifiable as the question asks. There are simply not enough surviving steam locomotives in each state to show clear, identifiable trends. The clusters also contain large numbers of common steam locomotives (for example, 2–8–0s, 2–8–2s, and 0–4–0Ts were commonplace across the entire US).

What is more interesting than the clusters is the outlier states. The code for this analysis was run several times, and several different versions of the clustering were explored (not shown in this post). While the clusters with multiple states varied each time, PA, CA, WV, and CO always ended up as outliers in their own clusters. There is a reason why each of these states has a unique mix of wheel arrangements.
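
A check like this can be automated by recording the cluster labels from each run and keeping only the states that sit alone in their cluster every time. The following is a minimal sketch under that assumption. The function names, the placeholder state names, and the label assignments are all made up for illustration.

```python
from collections import Counter


def singleton_states(labels):
    """States that are alone in their cluster for one run.

    `labels` maps each state to its cluster label.
    """
    sizes = Counter(labels.values())
    return {state for state, cluster in labels.items() if sizes[cluster] == 1}


def always_singleton(runs):
    """States that are alone in their cluster in every run."""
    per_run = [singleton_states(labels) for labels in runs]
    return set.intersection(*per_run) if per_run else set()


# Made-up label assignments from three hypothetical runs.
runs = [
    {"S1": 0, "S2": 1, "S3": 2, "S4": 2, "S5": 3},
    {"S1": 4, "S2": 0, "S3": 1, "S4": 1, "S5": 1},
    {"S1": 2, "S2": 3, "S3": 0, "S4": 1, "S5": 1},
]

print(sorted(always_singleton(runs)))
```

Cluster label numbers are arbitrary and change between runs, so the sketch compares cluster sizes rather than the labels themselves.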

PA and CA both have many steam locomotives. PA was a railroading powerhouse with a once extensive network of mining railroads and mainlines that crossed the state. For this reason, PA has a large number of common steam locomotives. CA is a large state with many railroad museums. However, CA has more of the steam locomotive types that were common out west (such as 2–6–2Ts, 2–6–2s, and 2–6–6–2Ts), placing it in a different cluster than PA.

WV and CO both featured mountain railroading and have unusually high numbers of specific types of locomotives. WV is a mountainous state that contained a plethora of coal and lumber railroads. For this reason, it has a large number of geared engines (Shays, Heislers, and Climaxes). CO is another mountainous state. Instead of using geared engines, CO used narrow gauge, with large numbers of 2–8–2s and 2–8–0s found on the Denver & Rio Grande Western Railroad.

An example of a Santa Fe 2–6–2 preserved at the California State Railroad Museum. Credit: https://www.californiarailroad.museum/get-involved/at-sf-1010-restoration

Data Cleaning & Bugs

Very little data cleaning was needed for this dataset. Every column from the steamlocomotive.com website was ignored except for the “F.M. Whyte” column and the “Location” column. The state was extracted from the Location column by taking the last two characters of each cell.
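
Taking the last two characters is simple but trusts every cell to end in a state code. Here is a slightly more defensive sketch that checks the result against a list of known codes. The `extract_state` helper, the shortened code list, and the sample location strings are assumptions for illustration, not values from the dataset.

```python
VALID_STATES = {"PA", "CA", "WV", "CO"}   # shortened for the example


def extract_state(location):
    """Return the trailing two-letter state code, or None if unrecognized."""
    if not isinstance(location, str):
        return None   # e.g. a missing (NaN) cell
    code = location.strip()[-2:].upper()
    return code if code in VALID_STATES else None


print(extract_state("Exampleville, PA"))
print(extract_state("Sampletown, CA "))
print(extract_state("Somewhere, Ontario"))
```

Returning `None` for anything unexpected makes it easy to count or inspect the rows that would otherwise be miscounted.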

There were very few bugs in this analysis. Most came from adapting Professor Cody Buntain’s code to this scenario. For example, the graph used with the elbow method caused trouble. Buntain’s example tested a high number of k values. With only 51 rows (50 states plus DC), the number of test k values had to be decreased, since k-means cannot fit more clusters than there are data points.
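
One way to guard against this is to derive the candidate k values from the number of rows instead of hard-coding a range. The sketch below is an illustration only: the `candidate_ks` helper and its parameters are hypothetical, and it uses the standard library's `random` module rather than the NumPy call in the analysis code.

```python
import random


def candidate_ks(n_samples, n_candidates, seed=None):
    """Pick distinct k values between 1 and n_samples - 1, in ascending order."""
    upper = n_samples - 1
    if upper < 1:
        raise ValueError("need at least two samples to test any k")
    rng = random.Random(seed)
    count = min(n_candidates, upper)
    return sorted(rng.sample(range(1, upper + 1), count))


ks = candidate_ks(n_samples=51, n_candidates=20, seed=0)
print(ks)
```

Stopping one short of the row count is a choice made for this sketch: with as many clusters as rows, every state would simply be its own cluster.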

Limitations & Conclusion

This analysis was limited in two ways: by the number of surviving steam locomotives, and by the information about each locomotive included in the analysis. As discussed above, having more steam locomotives could potentially reveal more commonalities between states. Including more information from steamlocomotive.com, instead of just the wheel arrangement, may also have added value to the analysis. Adding gauge and builder information may have helped differentiate between locomotives.

Although it may be a fun exercise for enthusiasts and historians, clustering preserved steam locomotives does not return clear and identifiable clusters of states. However, it does provide valuable insight into what makes the steam locomotives in certain states unique, such as the commonality of steam in PA, unique western engines in CA, narrow gauge in CO, and geared steam in WV.

Resources

Steamlocomotive.com: https://www.steamlocomotive.com/

GitHub Repository: https://github.com/smiller1551/SteamLocoAnalysis

Professor Cody Buntain’s GitHub: https://github.com/cbuntain

This Medium post was created by Simon Miller at the University of Maryland — College Park for INST414: Data Science Techniques under Professor Cody Buntain.
