Constellation Network
There have been many attempts to quantify the size of space, and as astronomers look deeper into our own galaxy and others, there is a seemingly endless number of stars to find and categorize. Each newly discovered star has the potential to belong to a constellation. By examining star data, a network of stars can be built, and information about both individual stars and whole constellations can be extracted from it. Such a network can answer the question of how densely populated a constellation is, on average, and the edges between star nodes reveal which stars are connected to one another through shared constellations. These questions are especially useful for novice astronomers who want to learn about star placement within constellations. Visualizing constellations as star clusters is helpful because, in essence, this is how astronomers create constellations: an astronomer notes stars that are close to one another, and if the edges between those stars resemble a shape, a constellation is defined. A network view can help people new to astronomy recall which stars are connected, understand how large constellations can be, and estimate the average number of stars in a constellation, which in turn helps them recognize constellations they have not yet studied. Some constellations contain only two stars, while the largest can have up to roughly 150. Analyzing trends among previously catalogued stars can also assist in categorizing new ones: by noting the characteristics of stars that are grouped into the same constellation and why, astronomers can make informed decisions about how to classify newly discovered stars into constellations.
In order to answer the proposed question and gain these insights, a dataset would need to contain star names and the constellations they fall under. To learn more about star classification, the distance between stars within the same constellation would also be useful, so that astronomers can analyze average distance trends between stars. Information on each star's spectrum would be beneficial as well: a star's spectrum reveals its temperature and core elements, which in turn dictate its color and brightness. Stars are typically sorted into seven spectral categories based on these factors. Lastly, star mass should be included in the dataset, since it is another classification factor.
The dataset found here: https://www.kaggle.com/datasets/diaaessam/constellation-names includes information such as star name, constellation, Bayer designation, the scientific name as registered in the IAU catalog, and the date the star was approved for addition to the IAU catalog. Star name and constellation were used to create the network: each node represents one individual star, and two stars are connected by an edge if they belong to the same constellation. Since the factors mentioned on the previous slide were not all found in the same dataset, an astronomer would need to analyze the network and then research those characteristics separately for the stars they found important while observing the visualization. Even analyzing the network alone is still useful for estimating the average number of stars in a constellation and, for an amateur audience, for learning which stars are connected to others through constellations.
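As a concrete illustration of that node-and-edge rule, the short sketch below builds a tiny graph for a single constellation grouping. The star list is only an example (not the full membership of Ursa Major); the pairing logic mirrors the description above.

```python
from itertools import combinations
import networkx as nx

# A toy grouping: one constellation name mapped to a few of its stars.
# The star list is illustrative, not the complete Ursa Major membership.
constellation = {"Ursa Major": ["Mizar", "Merak", "Alkaid"]}

G = nx.Graph()
for name, stars in constellation.items():
    # Every star becomes a node...
    G.add_nodes_from(stars)
    # ...and every pair of stars in the same constellation is joined by an edge.
    for a, b in combinations(stars, 2):
        G.add_edge(a, b, constellation=name)

print(G.number_of_nodes(), G.number_of_edges())  # 3 nodes, 3 edges
```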
In the network above, each cluster represents a constellation, and singular nodes without any edges are lone stars. Since the nodes are labelled, an audience can learn about each individual star within a constellation and note the connections between them. The visualization also emphasizes the largest constellations: edge and node colors are mapped to degree, so nodes with higher degree appear a darker pink while nodes without any edges appear a much lighter pink. Larger constellations therefore appear darker, since each star in those clusters has more connections than stars in smaller constellations. Degree centrality is the measure of importance in this network. Using degree as the measure of node importance, the nodes within the Ursa Major constellation are the most important: with 18 stars, it is the largest constellation in this dataset, so each of its nodes has the highest number of edges. Stars such as Mizar, Merak, Talitha, Alkaid, and Muscida, which make up Ursa Major, are therefore the most important nodes in the graph. The graph also answers the question of how densely populated a constellation is with stars on average: constellations tend to contain a small number of stars. The network shows only 8 larger constellations, each with fewer than 20 stars, and of the 64 constellations in the dataset, the majority of clusters contain roughly 5 to 7 stars. While this conclusion can be drawn from the 64 constellations in the dataset, there are many other constellations in the sky, and the dataset's incomplete coverage is a limitation. If the dataset contained every constellation in the sky, the average number of stars per constellation could differ from the number extrapolated from the network above. This limitation may be biasing the data toward a smaller average number of stars per constellation, since the dataset does not include the constellations near the upper limit of roughly 150 stars.
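The same degree-based reading can be reproduced outside of Gephi. The sketch below assumes the network has already been exported to a GraphML file named constellations.graphml; that filename is an assumption, not the project's confirmed output.

```python
import networkx as nx

# Assumed filename; substitute the GraphML file exported by the build script.
G = nx.read_graphml("constellations.graphml")

# Degree centrality: stars in larger constellations score higher,
# which is why Ursa Major's members stand out in the visualization.
centrality = nx.degree_centrality(G)
top_stars = sorted(centrality, key=centrality.get, reverse=True)[:10]
print("Highest-degree stars:", top_stars)

# Each connected component with more than one node is a constellation
# (lone stars are size-1 components), so component sizes answer the
# average-density question directly.
sizes = [len(c) for c in nx.connected_components(G) if len(c) > 1]
print("Constellations found:", len(sizes))
print("Average stars per constellation:", sum(sizes) / len(sizes))
```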
Before starting any additional analysis, the data first needed to be examined for cleaning and hygiene purposes. Data cleaning involves checking the dataset for corrupted values and incorrectly entered information, and using domain expertise to flag any values that are inaccurate for other reasons. The Kaggle dataset showed no signs of corruption, and the star names matched the constellations they were listed under, so there were no mis-logged values or inaccuracies of the kind that typically need to be removed from other datasets. One issue that could arise during cleaning, however, is that someone without domain expertise might assume that a star with a two-word name, such as Lilii Borea, is actually two different stars, Lilii and Borea, and accidentally create two nodes for the same star. To avoid this mistake, a person needs to understand the naming conventions for stars so they do not treat two-word names as errors.
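A few basic hygiene checks of this kind can be scripted with pandas. The sketch below assumes the Kaggle file has been saved locally as constellation_names.csv with columns named "Name" and "Constellation"; both the filename and the column headers are assumptions rather than confirmed details of the dataset.

```python
import pandas as pd

# Assumed filename and column names for the downloaded Kaggle file.
df = pd.read_csv("constellation_names.csv")

# Missing or blank values would indicate corrupted rows.
print(df.isna().sum())

# Duplicate star rows would create duplicate nodes in the network.
print("Duplicate rows:", df.duplicated().sum())

# Two-word names such as "Lilii Borea" are legitimate single stars;
# listing them makes it easy to confirm they are not split into two nodes.
two_word = df[df["Name"].str.contains(" ", na=False)]
print(two_word[["Name", "Constellation"]])
```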
The networkx API was used to create the graph pictured. To transform the initial CSV file into data that Gephi can interpret and render as an understandable visualization, the file first needs to be read into a Python script and converted to a pandas DataFrame. The rows are then grouped by the constellation they are a part of, producing a mapping in which each constellation name is a key and the value is the list of names of the stars that make it up. To create each node, the code iterates through this mapping and calls the add_node() function; to connect the nodes, it calls the add_edge() function to create an edge between every pair of stars that share the same constellation key. Once the nodes and edges are created, the graph is written to a file using the write_graphml() function. This file can then be loaded into Gephi and manipulated into a visualization from which insights can be drawn; a sketch of the pipeline appears below. Someone attempting to recreate this code and visualization may run into issues while adding the nodes and edges, because the data is reformatted first into a DataFrame and then into the constellation-to-stars mapping; if these structures are iterated over incorrectly, the nodes and edges will be created incorrectly and any subsequent visualization will be inaccurate.
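A minimal, hedged sketch of that pipeline is shown below. The CSV filename and the column names "Name" and "Constellation" are assumptions about the Kaggle file, and the grouping is expressed as a constellation-to-stars dictionary rather than the exact intermediate structure used in the original notebook.

```python
from itertools import combinations
import networkx as nx
import pandas as pd

# Assumed filename and column names for the Kaggle dataset.
df = pd.read_csv("constellation_names.csv")

# Group rows by constellation: each key is a constellation name,
# each value is the list of star names that belong to it.
groups = df.groupby("Constellation")["Name"].apply(list).to_dict()

G = nx.Graph()
for constellation, stars in groups.items():
    for star in stars:
        G.add_node(star, constellation=constellation)
    # Connect every pair of stars that share a constellation.
    for a, b in combinations(stars, 2):
        G.add_edge(a, b)

# Export in a format Gephi can open directly.
nx.write_graphml(G, "constellations.graphml")
```

Opening the resulting GraphML file in Gephi then allows the degree-based coloring and layout described earlier to be applied.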
GitHub link:
https://github.com/anapetsmart/INST414/blob/main/module2_final.ipynb