Manure Makes Flowers Grow
Anyone worth their salt in the Geoverse has complained about the Shapefile. Its limits, its size. The fact that it’s not a file but a collection of files, which is why I and others call it the Shapefolder. But let’s be honest: our beautiful datasets being analyzed in Q, or pushed out as a REST service, or GeoJSON, or whatever you want from PostGIS, have spent at least a portion of their lives as a shapefile. In an unscientific survey, 178% of all geodata has spent time as a shapefile. The reason it’s over 100% is that the files were bloated before, because…shapefolder.
No, I am not here to sing the praises of the Shapefolder. Much like iceberg lettuce, dating, and that film on the top of yogurt, you have to deal with this crap to get to the good stuff: the data it contains. This post is about how to crack open a Shapefolder and get to that creamy data goodness without using a desktop tool. Because sometimes you just want the data for analysis.
We’re going to discuss R, Python, Julia, and Java (badly), and how they can consume and analyze the data in a Shapefolder, adding its data to your analysis. There are other methods to do this, such as Spotfire’s native Shapefolder importer, but I wanted to talk about you just banging it out in front of an IDE, or Notepad. This isn’t a deep dive, just a skim over what is possible. We’ll get into visualization next week. Maybe.
First, some caveats about Shapefolders. One, each component file is limited to about 2 GB of storage. While this does seem like a sea of data to a normal human, that’s only about 70 million point records. Two, if your shapefolder has ever touched ArcGIS, all null numeric values will have been replaced with 0. ArcGIS does not support NULLs in shapefolders, so you may have some data cleaning in front of you, unless 0 is basically null in your data anyway. If you are going to be creating shapefolders from data, then there is a whole slew of things you’re going to need to be concerned about: field size, rounding errors, or taking your dead dog to a pet cemetery, only to have it come back a possessed killing machine. Or was that the plot of a Stephen King novel? Either way, Shapefolder or Stephen King novel, we’re about to open up “the horror.”
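The zero-for-null cleanup can be sketched in a few lines of plain Python. This is only a minimal sketch, assuming the attribute rows have already been loaded as dictionaries. The function name `zeros_to_none` and the field names are hypothetical, and the approach only makes sense for fields where 0 is never a legitimate value.

```python
def zeros_to_none(rows, suspect_fields):
    """Return copies of rows with 0 replaced by None in the named fields.

    rows: iterable of dicts, one per record.
    suspect_fields: names of fields where a 0 is assumed to be a
    null that got flattened somewhere upstream.
    """
    suspect = set(suspect_fields)
    cleaned = []
    for row in rows:
        new_row = {}
        for name, value in row.items():
            # bool is a subclass of int in Python, so leave True/False alone.
            is_flattened_null = (
                name in suspect
                and isinstance(value, (int, float))
                and not isinstance(value, bool)
                and value == 0
            )
            new_row[name] = None if is_flattened_null else value
        cleaned.append(new_row)
    return cleaned


# Hypothetical records: a depth of 0 is suspicious, a pump count of 0 is real.
wells = [
    {"WELL_ID": 1, "DEPTH_FT": 0, "N_PUMPS": 0},
    {"WELL_ID": 2, "DEPTH_FT": 412.5, "N_PUMPS": 2},
]
print(zeros_to_none(wells, ["DEPTH_FT"]))
```

Naming the suspect fields explicitly is the point: a blanket replace would also wipe out the zeros that are real measurements.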
<Edit>Some Snowback called me out for not speaking about OGR. This post assumes there is data in a shapefolder that you want to use for further analysis, so you want it integrated into your model/language, not just pulled out for the data’s sake.</Edit>
Python is quick and clean, has a number of robust libraries for spatial work, and it’s the most popular language for Data Science and GIS. I prefer PySAL; it seems to be the most complete of all the spatial packages for Python. Not only does it allow you to crack open a .shp, but you can also just crack open the .dbf and harvest the data from there, which speeds up data ingestion if you aren’t doing a visual.
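To show what “just crack open the .dbf” amounts to, here is a minimal sketch that reads the attribute table with nothing but the standard library. This is not PySAL’s API. The function names `read_dbf` and `_parse` are made up for illustration, and the sketch assumes a plain dBASE-style table of the kind that ships inside a shapefolder, with text and numeric fields only.

```python
import struct


def read_dbf(path, encoding="latin-1"):
    """Return (field_names, rows) from a dBASE-style .dbf attribute table."""
    with open(path, "rb") as f:
        # 32-byte file header: skip version + date, then read record count,
        # header length, and record length (all little-endian).
        n_records, header_len, record_len = struct.unpack("<4xIHH20x", f.read(32))

        # One 32-byte descriptor per field, followed by a 0x0D terminator.
        fields = []
        for _ in range((header_len - 33) // 32):
            raw_name, ftype, length, _decimals = struct.unpack(
                "<11sc4xBB14x", f.read(32)
            )
            name = raw_name.split(b"\x00")[0].decode("ascii")
            fields.append((name, ftype.decode("ascii"), length))

        f.seek(header_len)
        rows = []
        for _ in range(n_records):
            record = f.read(record_len)
            if len(record) < record_len:
                break  # truncated file
            if record[:1] == b"*":
                continue  # record flagged as deleted
            row, pos = {}, 1  # byte 0 is the deletion flag
            for name, ftype, length in fields:
                text = record[pos:pos + length].decode(encoding).strip()
                pos += length
                row[name] = _parse(text, ftype)
            rows.append(row)
    return [name for name, _, _ in fields], rows


def _parse(text, ftype):
    """Convert numeric fields (N, F) to int/float; leave the rest as text."""
    if ftype not in ("N", "F"):
        return text
    try:
        return float(text) if "." in text or ftype == "F" else int(text)
    except ValueError:
        return None  # blank or unparseable numeric cell
```

Calling it would look like `names, rows = read_dbf("parcels.dbf")`, where the file name is a placeholder. No geometry is touched at all, which is why going straight to the .dbf is quick when you only need the attributes.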
If you know you’re just going to be dealing with Shapefolders, PySHP is another library you can check out. It does not offer the complete geospatial analytics package that PySAL does, but it is plenty if you’re just cracking open shapefolders. One of the advantages of PySHP over PySAL is that it will open the individual component files (.shp, .shx, .dbf) on their own, and since the .prj is just plain text, grabbing projections is a matter of reading that one file.
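Grabbing just the projection can be sketched with the standard library, since a .prj file is plain text holding a WKT string. This is not PySHP’s API. The helper names `shapefolder_parts` and `read_prj` are hypothetical, and the list of sidecar extensions is an assumption that covers the common ones rather than every file a shapefolder might contain.

```python
from pathlib import Path

# Common sidecar extensions; a given shapefolder may have more or fewer.
SIDECARS = (".shp", ".shx", ".dbf", ".prj", ".cpg")


def shapefolder_parts(shp_path):
    """Map each sidecar extension to its Path, for the files that exist."""
    shp = Path(shp_path)
    parts = {}
    for ext in SIDECARS:
        candidate = shp.with_suffix(ext)  # swap only the final extension
        if candidate.exists():
            parts[ext] = candidate
    return parts


def read_prj(shp_path):
    """Return the projection WKT string, or None if there is no .prj."""
    prj = shapefolder_parts(shp_path).get(".prj")
    return prj.read_text().strip() if prj else None
```

A call such as `read_prj("roads.shp")` (placeholder name) returns the WKT text, which can then be handed to whatever projection library you already use.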
It’s also feasible just to use NumPy, but you’ll be writing a bunch of code, and none of us want to do that.
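For a taste of what “a bunch of code” means, here is a sketch that reads only the 100-byte header of a .shp by hand. It uses the standard library’s `struct` module rather than NumPy, a substitution made to keep the example dependency-free. The function name `read_shp_header` is hypothetical, and the byte layout follows the published shapefile specification.

```python
import struct

# The basic 2D shape types; the Z and M variants are omitted here.
SHAPE_TYPES = {0: "Null", 1: "Point", 3: "PolyLine", 5: "Polygon", 8: "MultiPoint"}


def read_shp_header(path):
    """Parse the fixed 100-byte header at the start of a .shp file."""
    with open(path, "rb") as f:
        header = f.read(100)
    if len(header) < 100:
        raise ValueError("file too short to be a .shp")

    # Bytes 0-27 are big-endian: file code, 20 unused bytes, file length.
    file_code, length_words = struct.unpack(">i20xi", header[:28])
    if file_code != 9994:
        raise ValueError("not a .shp file (bad file code)")

    # Bytes 28-67 are little-endian: version, shape type, XY bounding box.
    version, shape_type, xmin, ymin, xmax, ymax = struct.unpack(
        "<2i4d", header[28:68]
    )
    return {
        "bytes": length_words * 2,  # length is stored in 16-bit words
        "version": version,
        "shape_type": SHAPE_TYPES.get(shape_type, shape_type),
        "bbox": (xmin, ymin, xmax, ymax),
    }
```

And that is only the header. Every geometry record after it has its own layout per shape type, which is the part none of us want to write.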
Like Python, R is a language well suited for Data Science and statistics, mainly because it was created by statisticians. There is a whole slew of spatial libraries for R; the one I am partial to is SpatStat. SpatStat allows you to do, well, what everything else so far does: crack open a shapefile and get to that data without using a desktop application. As with all things R, there is a mailing list. If you only have two takeaways from this blog entry, it’s that if your shapefile has ever touched ArcGIS you’re going to spend hours cleaning it, and that R users love mailing lists.
Julia is a recent interest of mine. I built an R model that needed to scale, and you know what? R scales…poorly. Nosing around a bit, I found Julia. We’ve only been dating for a couple of months, but as far as I can tell, this relationship is going to last a few years. <-- that sentence is why you should never name a language after a woman.
Regardless, Julia has a library just for parsing shapefolders, strangely named Shapefile. That being said, Julia can also call functions from other lingos. So if you wanted to, you could bring in PySAL, or if you wanted to pull async data, you could hook her up to Node, pulling in data and then munching it using PySAL or SpatStat or whatever Java package does spatial stuff. So Julia is really the dark matter of the Data Science world at this point.
I’m not a Java guy, never really have been. That being said, there is a really robust library for Java from the OSGeo folks, GeoTools. I’ve never used it, but it’s been around forever and has a strong user community and support. But yeah, go there, download it, play with it.
Next week we’ll talk about visualizing geospatial data without Arc or Q, and lead into a deeper discussion of “Why do I need these things in the first place?”
Originally published at www.spatialcapability.com on July 8, 2015.