Published in Analytics Vidhya

Azure Synapse Analytics Spark Graph Processing

Load data

Load Station Data

Load Trip Data

Display schema for review

Create graph based data

Build the Graph

Now that you’ve imported your data, you need to build your graph. This takes two steps: building the structure of the vertices (or nodes) and building the structure of the edges. What’s awesome about GraphFrames is that this process is incredibly simple. All you need to do is get the distinct id values for the vertices table and rename the start and end station columns to src and dst, respectively, for the edges table. These are required naming conventions for vertices and edges in GraphFrames.
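As a minimal sketch of those two steps, the snippet below uses plain Python on a few made-up rows so that it runs without a Spark pool. The field names (`name`, `start_station`, `end_station`, `trip_id`) and the choice of the station name as the vertex id are assumptions for illustration, not the actual columns of the dataset.

```python
# Hypothetical sample rows standing in for the loaded station and trip data.
stations = [
    {"station_id": 1, "name": "Station A"},
    {"station_id": 2, "name": "Station B"},
    {"station_id": 2, "name": "Station B"},  # duplicate row on purpose
]
trips = [
    {"trip_id": 100, "start_station": "Station A", "end_station": "Station B"},
    {"trip_id": 101, "start_station": "Station B", "end_station": "Station A"},
]


def build_vertices(station_rows):
    """Keep one vertex per distinct station, exposing the required `id` field."""
    seen = set()
    vertices = []
    for row in station_rows:
        if row["name"] not in seen:
            seen.add(row["name"])
            vertices.append({"id": row["name"]})
    return vertices


def build_edges(trip_rows):
    """Rename the start and end station fields to the required `src` and `dst`."""
    return [
        {"src": row["start_station"], "dst": row["end_station"], "trip_id": row["trip_id"]}
        for row in trip_rows
    ]


vertices = build_vertices(stations)
edges = build_edges(trips)
print(vertices)
print(edges)
```

In a Spark notebook the same shapes would be DataFrames, and the graph itself would typically be created by passing them to `GraphFrame(vertices, edges)`.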

Now you can build your graph.

You’re also going to cache the input DataFrames to your graph.

Trips From Station to Station

One question you might ask is which station-to-station trips are the most common in the dataset. You can answer this by grouping the edges by source and destination and counting them. This yields a new set of edges in which each edge carries the count of all of the semantically identical edges. Think about it this way: you have a number of trips that are exactly the same, from station A to station B, and you just want to count those up!

In the query below, you grab the most common station-to-station trips and print out the top 10.
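The grouping-and-counting idea can be sketched in plain Python as follows. The edge list is made up for illustration, and `top_routes` is a hypothetical helper rather than part of GraphFrames. On a real graph `g`, a comparable DataFrame query would likely be `g.edges.groupBy("src", "dst").count().orderBy("count", ascending=False).limit(10)`.

```python
from collections import Counter

# Hypothetical (src, dst) pairs, one per trip.
edges = [
    ("Station A", "Station B"),
    ("Station A", "Station B"),
    ("Station A", "Station B"),
    ("Station B", "Station C"),
    ("Station B", "Station C"),
    ("Station C", "Station A"),
]


def top_routes(edge_pairs, n=10):
    """Count identical (src, dst) edges and return the n most frequent."""
    counts = Counter(edge_pairs)
    # Sort by count descending, then by route name so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


for (src, dst), count in top_routes(edges):
    print(f"{src} -> {dst}: {count}")
```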

You can see above that a vertex being a Caltrain station seems to be significant! This makes sense: these stations are natural connectors, and getting from A to B without needing a car is likely one of the most popular uses of these bike share programs.

In Degrees and Out Degrees

Remember that in this instance you’ve got a directed graph. That means that your trips are directional: each one goes from one location to another. This opens up a wealth of analysis. For example, you can find the number of trips that go into a specific station and the number that leave from it.

Naturally you can sort this information and find the stations with lots of inbound and outbound trips! Check out this definition of Vertex Degrees for more information.

Now that you’ve defined that process, go ahead and find the stations that have lots of inbound and outbound traffic.
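As a small illustration of in degrees and out degrees, the sketch below counts inbound and outbound trips per station in plain Python. The edges are made up, and the helper names are hypothetical. GraphFrames exposes comparable results as the `inDegrees` and `outDegrees` DataFrames of a graph.

```python
from collections import Counter

# Hypothetical (src, dst) pairs, one per trip.
edges = [
    ("Station A", "Station B"),
    ("Station A", "Station B"),
    ("Station B", "Station C"),
    ("Station C", "Station B"),
]


def in_degrees(edge_pairs):
    """Number of trips ending at each station."""
    return Counter(dst for _, dst in edge_pairs)


def out_degrees(edge_pairs):
    """Number of trips starting from each station."""
    return Counter(src for src, _ in edge_pairs)


in_deg = in_degrees(edges)
out_deg = out_degrees(edges)

# Stations sorted by inbound and outbound traffic, busiest first.
print(sorted(in_deg.items(), key=lambda item: (-item[1], item[0])))
print(sorted(out_deg.items(), key=lambda item: (-item[1], item[0])))
```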

One interesting follow-up question is which station has the highest ratio of in degrees to out degrees. In other words, which station acts as almost a pure trip sink: a station where trips end but rarely start?

You can do something similar by getting the stations with the lowest in degrees to out degrees ratios, meaning that trips start from that station but don’t end there as often. This is essentially the opposite of what you have above.
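Both ratio rankings can be sketched with the same plain-Python approach. The edges below are made up, and `degree_ratios` is a hypothetical helper. It keeps only stations that have at least one inbound and one outbound trip, which is an assumption that mirrors what an inner join of the in-degree and out-degree tables would do.

```python
from collections import Counter

# Hypothetical (src, dst) pairs, one per trip.
edges = [
    ("Station A", "Station B"),
    ("Station A", "Station B"),
    ("Station A", "Station C"),
    ("Station B", "Station C"),
    ("Station C", "Station A"),
    ("Station C", "Station B"),
]


def degree_ratios(edge_pairs):
    """Return in-degree / out-degree for stations that have both."""
    in_deg = Counter(dst for _, dst in edge_pairs)
    out_deg = Counter(src for src, _ in edge_pairs)
    shared = set(in_deg) & set(out_deg)  # stations present in both tables
    return {station: in_deg[station] / out_deg[station] for station in shared}


ratios = degree_ratios(edges)

# Highest ratio first: stations that behave like trip sinks.
sinks = sorted(ratios.items(), key=lambda item: (-item[1], item[0]))
# Lowest ratio first: stations that behave like trip sources.
sources = sorted(ratios.items(), key=lambda item: (item[1], item[0]))

print(sinks[:10])
print(sources[:10])
```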

The conclusions from the above analysis should be relatively straightforward. A higher ratio means that many more trips come into that station than leave it, and a lower ratio means that many more trips leave from that station than come into it!

Hopefully you’ve gotten some value out of this notebook! Graph structures are everywhere once you start looking for them, and hopefully GraphFrames will make analyzing them easy!

Originally published at https://github.com.


