GETTING STARTED | LINE PLOTS | KNIME ANALYTICS PLATFORM

Home Sweet Home

Exploring Average Home Sale Price Based On Census Regions With Line Plots In KNIME Analytics Platform

John Denham
Low Code for Data Science

Introduction: The Keys

Due to COVID and a myriad of other factors, the 2020s have been chaotic when it comes to home ownership. In many places in the United States, supply cannot keep up with demand, and heated discussion is still taking place around the impact on Millennials and, undoubtedly soon, Generation Z. This has driven many existing homeowners to develop renewed interest in regularly checking their home Zestimate.

With all this input, filtered through friends, family, neighbors, media, home builders and others, exploring the data ourselves may provide a breath of fresh air, confirm suspicions, or at the very least offer a different perspective.

The Federal Reserve Bank of Saint Louis Economic Research FRED website has a wealth of economic data across many different categories. One of these, U.S. Regional Data, breaks down even further until we arrive at datasets providing the Average Sales Price of Houses Sold across the various census regions.

With access to these datasets, we can put something together to examine how home sale prices have changed over the years.

With the Line Plot node in KNIME Analytics Platform, we can quickly and easily generate a line plot that allows us to compare this data and interact with it in the interactive and composite views.

The four datasets we are working with contain the average sales price of houses sold in each United States census region (West, Midwest, Northeast and South), covering 1975 through July 2021. They are very straightforward, with just a date column and a price column for each region (Figure 1).

Figure 1: GEOFRED Site Showing Breakout Of U.S. Census Regions. Sources: Census; HUD.

The datasets come with the reference workflow for this blog post, named Line Plot, which is available on the KNIME Hub and can be downloaded for free.

In this post, we will describe and apply this node to:

  • Create a basic line plot from the FRED data.
  • Apply custom colors to specific columns to display in the plot.
  • Control aspects of the Line Plot node with flow variables.
  • Provide an example of the Plotly-based Line Plot node.

The Line Plot Node

Line plots assist us in visualizing changes over time. We often plot single or multiple lines against a temporal variable (such as a date) so we can more easily understand the relationship between our various data points.

Generally, we plot our continuous variable on the y-axis, so in this case the average sale price, and the temporal variable on the x-axis, in this case the date. The main area of the plot is then composed of our data points per region, color coded and connected with line segments (Figure 2).

Figure 2: The Line Plot Node.

The Line Plot node belongs to the KNIME JavaScript Views family, a set of nodes with a JavaScript-based implementation.

Aside from basic x and y plotting, we can apply color from nominal values to our series on the plot.

The Line Plot node is organized under the Views category in the node repository (Figure 3).

Figure 3: Where To Find The Line Plot Node.

Step 1: What Did They Pay For It?

While KNIME is free, these homes were not, and we want to visualize how the average sale price has changed over time (Figure 4).

Figure 4: Reference Workflow Step 1.

First, we use a Column Appender to append our 4 source tables together, creating one table with 8 columns.

We now have duplicate DATE columns in the dataset so the Column Filter filters out the DATE columns we don’t need. Finally, rather than let this temporal data stay a String data type, we convert it to a Date&Time format. In KNIME, there is a dedicated data type to work easily and effectively with columns containing date and time information. This will come in handy when we generate our plot!
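For readers who think in code, the three preparation steps above (append, filter, convert) can be sketched in plain Python. This only illustrates the logic and is not what KNIME runs internally. The column names, the sample values, the `_1`/`_2`/`_3` suffixes for duplicate names and the ISO date format are all assumptions made for the example.

```python
from datetime import date

# Hypothetical stand-ins for the four FRED tables (values invented for illustration).
west = [{"DATE": "2021-01-01", "WEST": 500000.0}, {"DATE": "2021-04-01", "WEST": 520000.0}]
midwest = [{"DATE": "2021-01-01", "MIDWEST": 330000.0}, {"DATE": "2021-04-01", "MIDWEST": 340000.0}]
northeast = [{"DATE": "2021-01-01", "NORTHEAST": 600000.0}, {"DATE": "2021-04-01", "NORTHEAST": 610000.0}]
south = [{"DATE": "2021-01-01", "SOUTH": 350000.0}, {"DATE": "2021-04-01", "SOUTH": 365000.0}]


def append_columns(*tables):
    """Join tables side by side, row by row (the Column Appender step)."""
    appended = []
    for rows in zip(*tables):
        merged = {}
        for i, row in enumerate(rows):
            for name, value in row.items():
                # Disambiguate repeated names: DATE, DATE_1, DATE_2, DATE_3.
                key = name if name not in merged else f"{name}_{i}"
                merged[key] = value
        appended.append(merged)
    return appended


def drop_columns(table, unwanted):
    """Remove the named columns from every row (the Column Filter step)."""
    return [{k: v for k, v in row.items() if k not in unwanted} for row in table]


def parse_dates(table, column="DATE"):
    """Turn ISO date strings into date objects (the String to Date&Time step)."""
    return [{**row, column: date.fromisoformat(row[column])} for row in table]


appended = append_columns(west, midwest, northeast, south)        # 8 columns
filtered = drop_columns(appended, {"DATE_1", "DATE_2", "DATE_3"})  # 5 columns
wide = parse_dates(filtered)
```

The sketch assumes the four tables have the same dates in the same row order, which is what appending columns side by side relies on.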

Next, we will just move directly to the Line Plot node. We will cover the lower branch of the workflow later in this article.

Let’s open the Line Plot node configuration. If you’ve been reading my articles on KNIME’s JavaScript-enabled visualization nodes, this should look immediately familiar. If you haven’t, go check them out! (Figure 4).

Figure 4: The Line Plot Node Configuration Options Tab.

In the Options tab of the node configuration, our first choice is to Create image at outport which will generate a .PNG image of our specified size when the node is executed. To reduce the busyness and general noise of our visual, we can limit the Maximum number of rows here as well.

As mentioned earlier, the x-axis will often contain temporal data, and in our case it will. Let’s set the DATE column as our x-axis. Next, we want the columns containing the regional sales data to be included on our y-axis. Based on our workflow and whether or not our data changes between runs, we can leverage Wildcard/Regex selection.

Based on the example in Figure 5 below, if our dataset included regions that were not in our original, these would also get included for plotting.

Figure 5: Regex Example: Include Strings That Contain ‘Region’
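The effect of a regex-based selection can be imitated in a few lines of Python. This is a sketch only: the column names are hypothetical, and matching against the whole column name is an assumption about how the selection behaves.

```python
import re

# Hypothetical column names; the real ones depend on how the tables were imported.
columns = ["DATE", "West Region", "Midwest Region", "Northeast Region", "South Region", "Notes"]

# Include every column whose name contains 'Region'.
pattern = re.compile(r".*Region.*")


def select_columns(names, regex):
    """Return the names that match the whole pattern, in their original order."""
    return [name for name in names if regex.fullmatch(name)]


selected = select_columns(columns, pattern)

# A new region column appearing in a later run is picked up automatically.
later = select_columns(columns + ["Pacific Region"], pattern)
```

This is the appeal of pattern-based selection over a manual include list: the rule is evaluated again on every run, so new matching columns do not require reconfiguring the node.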

The y-axis shows the range of values, while the points for each census region are plotted in the body of the chart, aligned with their respective DATE values on the x-axis. An example of this is shown below (Figure 6).

Figure 6: Example Of A Line Plot.

Once we’ve selected our point columns, the y-axis data, we can decide to Report on missing values. If we choose to report on missing values, we have some choices around how we want the data reported (Figure 7).

Figure 7: Missing Value Handling Choices.

The Axis Configuration tab is where we can add axis labels, choose Locale and format dates (Figure 8).

Figure 8: Line Plot Node Configuration Axis Configuration Tab.

Since we are working with multiple data series, it’s recommended to select Show Color Legend here. If we don’t, the lines will still be colored, but at a glance we won’t know which series each one represents.

The final section here is the Axes ranges. With Auto range axes, the view chooses the axis ranges automatically based on the data. Use domain information takes the ranges from the table’s domain instead (in this case it would be the same as the auto range since we have not adjusted the domain). Always show origin includes 0 even if the data domain does not include it as a value.
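To make the difference between an automatic range and Always show origin concrete, here is a small Python sketch. It mirrors only the behavior described above; how the node computes its ranges internally is not covered here, and the function name and prices are invented for the example.

```python
def y_axis_range(values, always_show_origin=False):
    """Return (low, high) for the y-axis, optionally stretched to include 0."""
    low, high = min(values), max(values)
    if always_show_origin:
        low, high = min(low, 0), max(high, 0)
    return low, high


# Invented prices for illustration.
prices = [150000, 275000, 400000]

auto_range = y_axis_range(prices)                           # hugs the data
with_origin = y_axis_range(prices, always_show_origin=True)  # starts at 0
```

Starting the axis at 0 keeps the lines in proportion to their absolute values, while the automatic range makes smaller movements easier to see.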

The General Plot Options tab allows us to set the plot title and subtitle. We can also set the image output size in pixels, if we want a static image of our plot (Figure 9).

Figure 9: Line Plot Node Configuration General Plot Options Tab.

The Resize view to fill window and Display fullscreen button are choices that appear in the composite or interactive view after the node is executed and offer more customization once the view has been opened.

Big fan of The Economist theme in R’s ggthemes? The Background section allows us to choose background, data and grid colors. There are lots of choices here, from swatches to HSV, HSL, RGB, HEX and CMYK. We can also suppress the grid altogether by de-selecting the Show grid check box here.

In the Appearance section, we can change the Line Size which is the thickness of the line on the plot.

Finally, we can choose whether warnings are shown or suppressed as we interact with the plot in the composite or interactive view by selecting or de-selecting Show warnings in view.

The View Controls tab includes customization options that directly impact the interactive experience. Some key items to pay attention to include: Snap to data points, Enable mouse wheel zooming and Show zoom reset button (Figure 10).

Figure 10: Line Plot Node Configuration View Controls Tab.

Let’s execute the node and take a look at the output plot (Figure 11).

Figure 11: Line Plot Output In KNIME Analytics Platform.

It looks great! It’s easy to see the data relationships between our regions and across time.

We have a descriptive title and subtitle and labeled axes. The color legend allows us to distinguish easily between the series. Because we are using Date&Time formatted data for our temporal column, the dates along the bottom of this interactive view ‘zoom in’ as we do, showing more data points from the specific year as we get closer (Figure 12).

Figure 12: Line Plot Interactivity.

We can see changes in the data that would correspond to real estate downturns, and more specifically we can see the upward trends in the 2020s that we read so much about. I highly encourage you to explore these datasets online to see more great examples of how line plots are used to tell concise, effective stories that have immediate impact.

What About Some Color?

The little branch in the workflow is there to allow us to color the region columns (Figure 13).

Figure 13: Setting Colors For The Lines.

To do this, we use the Extract Table Spec node to grab just our table specifications. Its output includes a column that holds the column names. In the Color Manager node, we select Column Name from the dropdown and then assign colors to the rows that correspond to the columns that contain the region-specific data (Figures 13 and 14).

Figure 14: The Color Manager Node.

When this is connected to the bottom input port of the Line Plot node, these colors will be applied to our columns on the y-axis.
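The idea behind this branch can be sketched in Python: turn the column names into rows, then attach a color to each name we care about. The table contents, column names and hex colors below are made up for illustration, and the real nodes carry more metadata than this.

```python
# Hypothetical wide table: one DATE column plus one price column per region.
table = {
    "DATE": ["2021-01-01", "2021-04-01"],
    "WEST": [500000.0, 520000.0],
    "MIDWEST": [330000.0, 340000.0],
    "NORTHEAST": [600000.0, 610000.0],
    "SOUTH": [350000.0, 365000.0],
}

# Extract Table Spec step: one row per column of the input table.
spec = [{"Column Name": name} for name in table]

# Color Manager step: colors assigned to the nominal values of 'Column Name'.
palette = {
    "WEST": "#1f77b4",
    "MIDWEST": "#ff7f0e",
    "NORTHEAST": "#2ca02c",
    "SOUTH": "#d62728",
}

# Only the region columns receive a color; DATE is left out.
line_colors = {
    row["Column Name"]: palette[row["Column Name"]]
    for row in spec
    if row["Column Name"] in palette
}
```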

The Thin Red Line: Notes On Flow Variable Control

This node has a number of flow variables worth noting (Figure 15).

Figure 15: Line Plot Flow Variables.

If we choose to control various flow variables, we just need to ensure our flow variable data types match what these outputs show. In the example above, most are boolean or string values. It should be noted that the color values are in RGBA format.
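If your colors start out as hex codes, a small helper can convert them before they are passed in as flow variables. This is a sketch under an assumption: it produces a CSS-style `rgba(r,g,b,a)` string, and the exact layout the node expects should be checked against the values shown in the Flow Variables tab.

```python
def hex_to_rgba(hex_color, alpha=1.0):
    """Convert '#RRGGBB' to an 'rgba(r,g,b,a)' string (assumed CSS-style layout)."""
    digits = hex_color.lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected a 6-digit hex color, got {hex_color!r}")
    # Read the red, green and blue channels two hex digits at a time.
    red, green, blue = (int(digits[i:i + 2], 16) for i in (0, 2, 4))
    return f"rgba({red},{green},{blue},{alpha})"


background = hex_to_rgba("#FF0000")
grid = hex_to_rgba("#336699", alpha=0.5)
```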

Alternatives

Among the JavaScript View (Labs) nodes there are Plotly.js-enabled nodes. If you’re interested in an alternative, you should explore the Line Plot (Plotly) node (Figure 16).

Figure 16: Line Plot From Plotly.js Line Node.

There is also a Line Plot (local) node (Figure 17) that requires no configuration. Its node description reads:

Plots the numeric columns of the input table as lines. All values are mapped to a single y coordinate. This may distort the visualization if the difference of the values in the columns is large.

Figure 17: Line Plot From Line Plot (local) Node.
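The distortion mentioned in that description is easy to quantify. The sketch below reads "a single y coordinate" as one shared y-axis, which is an interpretation, and uses invented numbers: one column in the hundreds of thousands and one in single digits.

```python
def share_of_axis(series, all_series):
    """Fraction of the shared y-axis height that one series actually spans."""
    low = min(min(s) for s in all_series)
    high = max(max(s) for s in all_series)
    return (max(series) - min(series)) / (high - low)


# Invented values: a price-like column and a much smaller-valued column.
large = [300000.0, 350000.0, 400000.0]
small = [2.5, 3.0, 3.5]

large_share = share_of_axis(large, [large, small])
small_share = share_of_axis(small, [large, small])
```

Here the small-valued column covers only a tiny fraction of the axis, so it would appear as a flat line along the bottom of the plot. Our four regional price columns are on the same scale, which is why they plot well together.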

Note. It is worth mentioning that (local) nodes are a legacy from older KNIME software versions. While they are still a valid option to plot your data, JavaScript-based visualization nodes ensure better interactivity and can be used in component composite views.

Additionally, with KNIME’s flexibility you could easily code your own Plotly.js line plot (with a template in the Generic JavaScript View node) or use the R or Python integrations to plot with your favorite data visualization packages directly in KNIME (Figure 18).

Figure 18: Generic JavaScript View Node Plotly.js Line Plot (Included In Reference Workflow).
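If you go the custom route, it helps to know that a Plotly figure is just a JSON structure with a list of traces and a layout. The Python sketch below builds such a structure with the standard library only. It is not the template from the reference workflow; the function name, region names, values and axis titles are invented for illustration.

```python
import json


def plotly_line_figure(dates, series, title):
    """Build a Plotly-style figure dict with one line trace per series."""
    traces = [
        {"type": "scatter", "mode": "lines", "name": name, "x": dates, "y": values}
        for name, values in series.items()
    ]
    layout = {
        "title": {"text": title},
        "xaxis": {"title": {"text": "Date"}},
        "yaxis": {"title": {"text": "Average sale price (USD)"}},
    }
    return {"data": traces, "layout": layout}


# Invented sample data for two regions.
figure = plotly_line_figure(
    dates=["2021-01-01", "2021-04-01"],
    series={"West": [500000.0, 520000.0], "South": [350000.0, 365000.0]},
    title="Average Sales Price of Houses Sold",
)

# Serialized form, ready to hand to a Plotly renderer.
figure_json = json.dumps(figure)
```

On the JavaScript side, the `data` and `layout` parts of this structure are what `Plotly.newPlot` takes as arguments.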

Conclusion: Make A House A Home

In this article, we explored a classic: the line plot. We plotted a basic line plot from FRED data, applied custom colors to our data based on the columns, explored flow variable control and saw other options available to use when plotting in KNIME.

I hope that you’re beginning to see how incredibly easy it is to read in, wrangle and visualize data with KNIME Analytics Platform. The JavaScript-enabled nodes allow us to tell a story with our data in the best way we see fit, elevating our work and ultimately democratizing the data science experience by lowering the barrier to entry.

John Denham
Low Code for Data Science

I am a Data Scientist who is passionate about empowering people to make the most of their data. I run the website KNIME.tips.