Data Mining the City Final Paper

Ethan Hudgins
Data Mining the City
13 min read · Dec 3, 2018

This semester I took the class X Info Modeling, where we used Rhinoceros (architecture software) and Grasshopper (a visual coding environment) to produce iterative models to answer complex problems. Through this software we were able to create thousands of different models of a particular building while changing some of the parameters. Each of these model iterations was also measured for various performance data such as daylighting, proximity to other buildings, total square footage, and many others. The reason these methods are so useful and interesting is that all of the iterations can be loaded into a software that shows each iteration as a line on a spectrum, and you can select which iteration scores the best on each performance metric. You can then select which iterations you want to take a closer look at and work with those as the best examples of the possibilities for a project moving forward. This is powerful because it takes much of the guesswork out of deciding on a design and replaces it with measurable evidence.

For the final paper for this class, Data Mining the City, I wanted to take a closer look at the Python behind Grasshopper scripts and compare it to the possibility of using plain Python to accomplish the same sort of goals. Specifically, I wanted to look at data tree structures in Grasshopper. I have been told by a professor that understanding data trees is the key to gaining advanced skills with Grasshopper, so exploring this topic will be useful to me. I will use script examples from my final project to illustrate the points I make here. My final project is an iterative model of McMurdo Station, Antarctica; I am analyzing building performance and consolidating the existing buildings into larger structures. This model replicates the goals of master planning for the station over the last few decades: consolidate building spaces into higher-performing structures to improve station performance. I will use the Grasshopper script for this project as the source of my data tree examples, and later I will use a specific example to compare Python and Grasshopper scripts and outputs. My overall Grasshopper script looks like this:

The buildings in red are the existing buildings at McMurdo Station, Antarctica, and the building in green is a hypothetical new building with a simple cubic design whose floor area is the consolidated total of the square footage of all of the residential space at McMurdo. This consolidation is unlikely to happen in its entirety because several of the residential buildings are higher capacity and perform well enough to maintain for years to come.

To start, there are many types of data structures. In Python, I am familiar with lists and arrays, both of which contain data in an ordered fashion that is easily accessible by position. Other forms of data storage include bags and hash sets (no order, but better for finding specific values) and stacks and queues (ordered by when items are added, accessed last-in-first-out or first-in-first-out). There are also dictionaries, which use keys to access data in different groups. The types of data storage are seemingly endless, and they come in many different formats, ranging from mainstream formats such as JSON to more specialized types such as shapefiles.
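For reference, here is a quick sketch of those Python structures with toy values (the data is made up purely for illustration):

```python
from collections import deque

trips = [12.5, 7.0, 22.3]            # list: ordered, accessible by position, e.g. trips[0]
station_ids = {72, 79, 82}           # set: no order, but fast membership checks
queue = deque(["first", "second"])   # queue: first in, first out
queue.popleft()                      # -> "first"
stack = ["bottom", "top"]            # stack: last in, first out
stack.pop()                          # -> "top"
totals = {"station_a": 19.5, "station_b": 22.3}  # dict: access by key, e.g. totals["station_a"]
```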

There are several different forms of data storage in Grasshopper that components use to get and store data. Some of these are straightforward, while other types allow data to “flow” through Grasshopper and gain complexity over time. For this paper, I am referencing a post by David Rutten on the subject because I found it to be the clearest explanation of the data tree structure. I’ll use examples from the Grasshopper script for my final project to explain the different types of data structures:

1. Many components are relatively simple, operating on a single value and producing a single value as a result. Rutten describes these as “1:1” or “one-to-one” components that only work with one or two numbers as inputs and produce a single output. A simple example in my code is the divide function. In this case I have a number A that is being divided by a number B, which is one. The number one comes from a number slider, so it can be changed manually depending on how I want to divide up the input number A. (A minimal Python analogy of these component types appears after this list.)

2. The next level of complexity is components that create lists of values, described as “1:N”. These components are sometimes used to create new data, such as when you divide a curve into multiple segments: they take a single input and output a list of values. In my script there is an instance where I have consolidated the total square footage of the residential spaces into a single new building mass (in Rhino/Grasshopper there are no “buildings” per se, just surfaces and geometry). This new geometry is contained in a single piece of data called a Brep, or boundary representation. This Brep (Mass 6) is fed into a Deconstruct Brep component that breaks it down into faces, edges, and vertices so I can select and use those pieces later in the script. Notice the panel at the bottom that says “Closed Brep”. After I took the screenshot, I also plugged in a Parameter Viewer to double-check what I was claiming, and it reported “Data with 1 branches”. After the Brep is fed into a Deconstruct Brep, I have several lists to work with. This was important so I could find the bottom floor of the new mass and use it to create new floors, splitting the mass into four floors. There are also components that do the opposite of 1:N components; these take a list of data and return a single data point or branch. An example would be a list of points used to create a polyline, which takes the list and produces a single line as a single piece of data.

3. There are components that work with data lists to manage data rather than create new data. These are identified by their N:N functionality. One easy example is the Sort component, which simply takes a list and rearranges it based on the input parameters. The example from my script here is (surprise) the Sort component. In this instance, I am using Sort to take the list of line data I have created and arrange it according to the length of each line. Notice the panels that show lines of various lengths in no particular order; after being sorted, they are arranged from shortest to longest.

4. There are more obscure components that take data from 1 to N′, or, as Rutten describes it, take a single branch of data and output several lists. These include components that create grids, where multiple lists are generated, some representing lines and others representing points on the new grid. These instances create “layers” of complexity, similar to how a dictionary groups items.
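As promised above, here is a minimal Python analogy for those component types. The functions are hypothetical stand-ins, not actual Grasshopper code; they only mirror the 1:1, 1:N, and N:N shapes of the data:

```python
def divide(a, b):
    """1:1 — one value in, one value out (like the divide example above)."""
    return a / b

def divide_curve(length, segments):
    """1:N — a single value in, a list of values out (like dividing a curve)."""
    return [length * i / segments for i in range(segments + 1)]

def sort_lengths(lengths):
    """N:N — a list in, a list of the same size out (like the Sort component)."""
    return sorted(lengths)

print(divide(10, 1))                  # 10.0
print(divide_curve(12.0, 4))          # [0.0, 3.0, 6.0, 9.0, 12.0]
print(sort_lengths([3.2, 1.5, 2.8]))  # [1.5, 2.8, 3.2]
```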

These instances of layered complexity created the need to structure the data in a way that lets you access specific parts of it, group it in certain ways, and so on. This is where the data tree structure comes in. Data trees are an “ordered collection of lists,” according to Rutten. Each of these lists is associated with a path that identifies it, and each path is unique to its list of data. Curly brackets and semicolons identify a data list by a set of integers, such as {1;0;9}. These are used in concert with indices, which use square brackets to identify each individual data point within a list. So, for example, {10;3}[26] identifies a single data point: the 27th item on the branch with the two-part path {10;3}.
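One way to picture this in Python terms is a dictionary that maps each path (a tuple of integers) to a list. This is only a mental model I am using here, not how Grasshopper stores data internally:

```python
# A toy "data tree": tuples stand in for the paths {0}, {1;0}, and {10;3}.
tree = {
    (0,): ["a", "b", "c"],
    (1, 0): [10, 20, 30],
    (10, 3): list(range(40)),
}

# {10;3}[26] — the 27th data point on the branch with path {10;3}
print(tree[(10, 3)][26])   # 26
```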

The reason this is so important is that the entire Grasshopper plugin operates on this data tree structure. A more thorough understanding of it is necessary to function in this software environment and to gain an advanced understanding of its capabilities.

Working with Data Trees

What I want to focus on in this paper is the use of a data tree in practice, comparing its operation to a different type of data management in Python. Grasshopper can be scripted with Python, so it is interesting to compare its data tree structure to how the same data would be handled in plain Python code. The goal is to see whether this particular chunk of Grasshopper script could be more easily computed with a Python function, or to see why that would not be the ideal approach and what advantages Grasshopper has over Python.

The script I want to use as the example for this paper is a chunk that measures distances between buildings in different groups at McMurdo Station. This is being done because buildings at the station are, of course, used for different activities. Measuring the distance between the buildings helps give a sense of how spread out those activities are across the station, and it can be used as a baseline to measure against as the buildings are consolidated into larger masses.

This may look a bit intimidating, but it is not too hard to understand once you break it down into the individual components. It starts with two groups of geometry, in this case Breps of buildings at McMurdo. These geometries are in a data tree of one branch with 16 values, so each value would be identified by a path and index such as {0}[1] or {0}[2] and so on. They are fed into the Area component, which computes the area but, more importantly, finds the centroid of each geometry. The centroid is crucial because it gives each geometry a single point to measure from, so I can compute the distance between buildings. Each point is then projected to the “ground” with the Project component. This component's default projection plane is the XY plane, so it effectively reduces the Z value of each centroid's XYZ coordinates to zero. The first change in the data tree structure comes after this projection. The small upward-facing arrow on the right side of the Project component's output is the Graft function. Graft adds an additional branch level to the data tree, so after it is applied, the data is organized as 16 branches with one data point each instead of one branch with 16 data points. This is crucial for the next step in my operations.
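To make the effect of Graft concrete, here is a small sketch in the same dictionary-of-paths notation from earlier; the 16 centroid coordinates are made up:

```python
# Before grafting: one branch {0} holding 16 projected centroids (Z already zero).
flat = {(0,): [(float(i), 2.0 * i, 0.0) for i in range(16)]}

# After grafting: 16 branches {0;0} ... {0;15}, each holding a single point.
grafted = {(0, i): [pt] for i, pt in enumerate(flat[(0,)])}

print(len(flat[(0,)]))   # 16 items in one branch
print(len(grafted))      # 16 branches of one item each
```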

The Distance component measures the distance between geometry inputs, and it is useful here because it can measure the distance between the centroids of each building. For this next step we are measuring the distance from each centroid in Group A to every centroid in Group B. This is only possible because we have grafted the data in Group A into separate branches. Notice that Group A is organized as data with 16 branches, while Group B at the bottom is organized as data with one branch. Each branch of Group A is therefore measured against the whole of Group B, meaning each single data point in Group A is measured against all 32 data points in Group B. Notice that the Distance component outputs data with 16 branches; there are now 32 data points in each branch, one for the distance from that Group A building to each building in Group B.
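Here is a hedged Python sketch of that pairing behaviour, with made-up coordinates: because Group A is grafted to one point per branch and Group B stays flat, every Group A branch gets its own list of 32 distances.

```python
import math

group_a = {(0, i): [(float(i), 0.0)] for i in range(16)}   # 16 grafted branches, one point each
group_b = [(float(j), 5.0 + j) for j in range(32)]         # one flat branch with 32 points

distances = {
    path: [math.dist(branch[0], b) for b in group_b]       # 32 distances per Group A branch
    for path, branch in group_a.items()
}

print(len(distances))           # 16 branches
print(len(distances[(0, 0)]))   # 32 distances in each
```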

The Sort component here takes in the branched data and produces exactly the same structure, except the values in each branch are now sorted by length. The data is then fed into a List Item component, which is used to pick out individual data points in lists. List Item in this case does not have an index input, so by default it selects the first item in each branch. Since we sorted the data, this selects the shortest distance from each building in Group A to any building in Group B. The data is then flattened into one branch for the next portion of the script.

This next part works because each data point in the List Item output that was just flattened is still in order from 1 to 16, the same order as the Project component feeding into it. The other list we work with in this portion is the original Project output that was grafted into 16 branches. That 16-branch data is flattened back into one branch and plugged into the A input of a Sort component. The A list is sorted synchronously with the flattened list of shortest distances between buildings. Since both are still in their original 1–16 order, they stay synced up, and the output we are interested in is the A list, because it produces the Group A points sorted by their shortest distance to any building in Group B. The List Item component this list is fed into then selects the first item by default, producing the point in Group A closest to any point in Group B. This output is also flattened, though you may have noticed that this flattening is redundant.

This single point is then run through a Distance component to measure the distance between it and all of the points in Group B. This is done to select the point in Group B that is closest to the chosen Group A point, and it is more straightforward because it does not involve any grafting or flattening. The distances are sorted, and the List Item component then selects the first item of the sorted list, which is the point with the shortest distance to the Group A point. Now we have two points to feed into a Line component, which simply draws a line between them. This line is more for visualization than anything else.
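For comparison, the whole graft/sort/flatten sequence boils down to a closest-pair search, which can be written in a few lines of plain Python (coordinates again made up):

```python
import math

group_a = [(float(i), 0.0) for i in range(16)]       # Group A centroids
group_b = [(float(j), 5.0 + j) for j in range(32)]   # Group B centroids

# Find the pair of points, one from each group, with the smallest distance;
# the "line" in Grasshopper is simply drawn between these two points.
closest_a, closest_b = min(
    ((a, b) for a in group_a for b in group_b),
    key=lambda pair: math.dist(pair[0], pair[1]),
)

print(closest_a, closest_b, round(math.dist(closest_a, closest_b), 2))
```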

So that’s one way to sort through lists and graft and flatten your way to selecting points and creating lines based on distance. The question of this paper, though: Is it possible to do this with a Python script and no data trees? What are the advantages and disadvantages? To answer this question, I will have to outline how to do a similar data analysis in Python.

Previously, I conducted spatial analysis on Citi Bike trip data. For that assignment, I computed summary statistics on the Citi Bike trips and grouped them by station. So I started with the data in a standard dataframe format:

And I grouped them by Citi Bike station:

The far right column is the total of all trip time (in minutes) for trips that started at that station, so all trips and trip times are now organized by the station where they started. This is similar to a data tree in that there is a sort of layered complexity to it, but it is dissimilar in that there are no paths per se; everything is indexed by individual rows and columns. Accessing the data can look similar to using a path, though, if you use some of the pandas functions. In this example, I am creating a new dataframe from the Citi Bike data grouped by station ID and then summing all of the trip time into one data point per station. This feels similar to the data tree structure, though it is a completely different type of data management:
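A minimal sketch of that grouping step is below; the column names (start_station_id, trip_minutes) are my own stand-ins, not the exact Citi Bike field names:

```python
import pandas as pd

# Toy trips table standing in for the Citi Bike dataframe.
trips = pd.DataFrame({
    "start_station_id": [72, 72, 79, 82],
    "trip_minutes": [12.5, 7.0, 22.3, 5.8],
})

# Group every trip by its start station and total the trip time per station.
by_station = trips.groupby("start_station_id")["trip_minutes"].sum().reset_index()
print(by_station)
```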

So I wrote (or copied from examples online) the new code chunk below to approximate what I was doing in Grasshopper and Rhino. Essentially I am applying points to the geodataframe in example 9 (you can see from the picture above that the points are already applied to the coordinates). The code at the bottom is from a discussion board on Stack Overflow.

This new code chunk has a for loop at the bottom that I pulled from Stack Overflow. The loop was originally intended to draw the shortest line between points and census block boundary lines; it iterates through the dataframe to find the minimum distance. To use this loop to find the minimum distance between groups of buildings, an additional nested for loop would be needed to iterate over the different groups.
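Here is a hedged approximation of that loop with geopandas, using two tiny made-up GeoDataFrames in place of the building groups (the real code and data differ):

```python
import geopandas as gpd
from shapely.geometry import Point

group_a = gpd.GeoDataFrame(geometry=[Point(0, 0), Point(5, 5)])
group_b = gpd.GeoDataFrame(geometry=[Point(1, 1), Point(10, 0), Point(6, 6)])

# For each centroid in Group A, find the distance to its nearest Group B centroid.
min_distances = []
for pt in group_a.geometry:
    min_distances.append(group_b.distance(pt).min())  # distance to every B point, keep the smallest

group_a["nearest_dist"] = min_distances
print(group_a)
```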

Before this semester I had never used Rhino, Grasshopper, or Python, so this was quite an undertaking. I have used all three to do some form of spatial analysis, and in the case of Rhino and Grasshopper I was operating them more like a GIS. After writing this paper it has become clearer to me that the data storage, and even the software being used, is not as crucial as the actual concepts and techniques you are applying. The grouping and shortest-distance analyses conducted here are really only limited by the user's skill with each piece of software. To me, data trees are relatively easy to understand, though slightly more complicated than a for loop. At the same time, I feel the visual components and connections make Grasshopper a bit more intuitive than Python (disclaimer: I am more comfortable with Grasshopper/Rhino than I am with Python).

So there you have it. One more semester of academic firefighting in the books, and I feel almost comfortable with two more major softwares used in the field. Cheers!

References:

Re: Calculate Distance to Nearest Feature with Geopandas [Web log comment]. (2015, June 9). Retrieved December 3, 2018, from https://stackoverflow.com/questions/30740046/calculate-distance-to-nearest-feature-with-geopandas

Rutten, D. (2015, January 20). The Why and How of Data Trees [Web log post]. Retrieved December 3, 2018, from https://www.grasshopper3d.com/forum/topics/the-why-and-how-of-data-trees
