
Modeling Time Series
Time series, measurement data that changes over time, underlie nearly all forms of data analysis. Do you want to show how revenue changes from quarter to quarter? You create a time series. Track the growth of plants from week to week? Establish crime statistics? Determine whether a given medication is improving a patient's health? Chances are that a time series is involved somewhere in there.
For all that, meaningful time series are hard to model, and harder still to model in a way that keeps the data portable. Time series have many dimensions. The purpose of this article is to explore how time series can be modeled at a semantic level, and from there used by applications.
It’s About Time
There is a tendency, when looking at time series, to get hung up on the time aspect and miss that the real complexity lies in the metric itself. Part of this is a direct artifact of relational data systems: if you specify a column name and a data type in a SQL database, then setting up a time series looks like a simple matter of capturing values and time stamps.
Yet the real value of time series emerges when they are far more fully specified, and this in turn comes down to actually thinking about what is being represented. One way of thinking about a time series is that it is a sequence of measurements. Each measurement, in turn, is defined as an object with a value, a named event, and a metric, and belongs to that time series.
#Turtle representation. Assume all prefixes map as follows:
#
# prefixName:term =>
# <http://semanticalllc.com/ns/prefixName#term>
# e.g.,
#
# timeSeries:totalRevenue101 =>
# <http://semanticalllc.com/ns/timeSeries#totalRevenue101>
timeSeries:totalRevenue101
    a class:TimeSeries;
    timeSeries:metric metric:revenueDef121;
    timeSeries:appliesTo salesRegion:salesPacificNorthWest;
    .

measurement:trm1
    a class:Measurement;
    measurement:timeSeries timeSeries:totalRevenue101;
    measurement:value "103425922.00"^^currency:USD2001;
    measurement:metricEvent metricEvent:2001;
    .

measurement:trm2
    a class:Measurement;
    measurement:timeSeries timeSeries:totalRevenue101;
    measurement:value "134992726.00"^^currency:USD2001;
    measurement:metricEvent metricEvent:2002;
    .

measurement:trm3
    a class:Measurement;
    measurement:timeSeries timeSeries:totalRevenue101;
    measurement:value "142285022.00"^^currency:USD2001;
    measurement:metricEvent metricEvent:2003;
    .
Here, the time series object binds a metric to a resource. In the example above, what that means is that you have a metric (here, revenue as defined by FASB Sect. 121) applied to a particular resource (a sales region for a company). A different region of that same company would have a different time series for the revenue metric, perhaps timeSeries:totalRevenue102. Putting this another way: if you had three regions, the Pacific Northwest, New England, and the Southeast, then for any given metric you would have a table with each row being the identifier for a region, and the columns being the specific metric events such as years or quarters.
Revenue (FASB 121)

       2001             2002             2003             time series id
SE     $95,944,252.00   $101,225,692.00  $111,259,216.00  timeSeries:totalRevenue99
NE     $141,224,512.00  $163,266,741.00  $181,205,266.00  timeSeries:totalRevenue100
PNW    $103,425,922.00  $134,992,726.00  $142,285,022.00  timeSeries:totalRevenue101

The advantage to this approach is that a time series in effect becomes an annotation on a resource, and you could have multiple such annotations for different metrics affiliated with that resource.
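As a sketch of what that table construction looks like outside of RDF, the following Python pivots flat measurement rows into just such a region-by-event table. The tuple layout is purely illustrative; the region codes and figures come from the table above.

```python
# Pivot flat (region, event, value) measurement rows into a
# region-by-event table, one row per resource, one column per event.
rows = [
    ("SE", "2001", 95944252.00), ("SE", "2002", 101225692.00), ("SE", "2003", 111259216.00),
    ("NE", "2001", 141224512.00), ("NE", "2002", 163266741.00), ("NE", "2003", 181205266.00),
    ("PNW", "2001", 103425922.00), ("PNW", "2002", 134992726.00), ("PNW", "2003", 142285022.00),
]

table = {}
for region, event, value in rows:
    table.setdefault(region, {})[event] = value

print(table["PNW"]["2002"])  # 134992726.0
```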
Now, to address a frequent comment about RDF: this seems like an awful lot of information to have to retain. Why talk about events rather than just providing strings of years? The problem is that an event is itself more than a specific year. It is a specific time span, and different measurements may have different reference points for what constitutes a year.
Indeed, looking at a specific event itself can yield a lot of critical information:
metricEvent:2002
    a class:MetricEvent;
    rdfs:label "2002"^^xsd:string;
    event:startDateTime "2002-01-01T00:00:00.00001Z"^^xsd:dateTime;
    event:endDateTime "2002-12-31T23:59:59.99999Z"^^xsd:dateTime;
    metricEvent:calendar calendar:YearJan01;
    metricEvent:weight "0.5"^^xsd:double;
    metricEvent:previous metricEvent:2001;
    metricEvent:order "32"^^xsd:long;
    .

calendar:YearJan01
    a class:Calendar;
    rdfs:label "Year starting Jan 1";
    .
A metric event is a subclass of a generic event class specifically for handling measurements within a time series. The start and end points indicate the interval in question. It should be noted that these are very specifically date/time values, as it makes comparisons between different metric events consistent.
The calendar indicates a specific named calendar type, typically starting at a certain point in the year. This becomes especially useful when dealing with quarters or months, because it is frequently useful to talk about a given calendar being applicable for comparing second quarters, even when the specific dates for comparison are unknown (and it cuts down on the amount of computation necessary to order metric events).
The metricEvent:previous predicate is a pointer to the previous item in the timeline, while metricEvent:order provides a sequential index from a base year (here, 1970). The key in both cases is that they provide ways to access the previous or next entry in a sequence, if it exists.
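A minimal Python sketch of that navigation, with a hypothetical dictionary standing in for the triple store and order values counted from the 1970 base:

```python
# Walk a chain of metric events backwards through their "previous"
# links. The dictionary is a stand-in for the triple store; the order
# values count years from the 1970 base (2002 - 1970 = 32).
events = {
    "metricEvent:2003": {"order": 33, "previous": "metricEvent:2002"},
    "metricEvent:2002": {"order": 32, "previous": "metricEvent:2001"},
    "metricEvent:2001": {"order": 31, "previous": None},
}

def walk_back(event_id):
    """Yield event ids from the given event back to the start of the chain."""
    while event_id is not None:
        yield event_id
        event_id = events[event_id]["previous"]

print(list(walk_back("metricEvent:2003")))
```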
The final value, weight, determines approximately where in the interval (over the range [0,1]) the value should be considered sampled. A weight of 0 puts the sample at the beginning of the interval (here, January 1st), while a weight of 1 puts it at the end (December 31st). In most cases the value is an average over the interval, and as such should be considered sampled at 0.5 (roughly July 1st). This makes a difference when interpolation takes place: an interpolated curve may look very different if its samples sit at the beginning rather than the middle of each range. Weight is also a convenient way to normalize measurements, by reusing a single event with multiple weights to give different sample points.
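As a sketch of how the weight locates the sample point within the event's interval (plain Python, with no claims about any particular RDF library):

```python
from datetime import datetime

def sample_time(start, end, weight):
    """Locate the nominal sample point within [start, end]:
    weight 0 -> start of the interval, 1 -> end, 0.5 -> midpoint."""
    return start + (end - start) * weight

# The 2002 metric event from the example above.
start = datetime(2002, 1, 1)
end = datetime(2002, 12, 31, 23, 59, 59)
print(sample_time(start, end, 0.5))  # mid-year, around July 2nd
```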
Understanding the Metric
The metric is a mix of citation, units, and other provenance information.
metric:revenueDef121
    a class:Metric;
    rdfs:label "Revenue (FASB Sect 121)";
    metric:code "RevDef121"^^xsd:string;
    metric:cites citation:FASB.121;
    metric:units currency:USD2001;
    metric:description """This determines corporate revenue based upon
the assumptions of FASB section 121"""^^xsd:string;
    metric:dateTime "2016-08-18T01:58:22.12369Z"^^xsd:dateTime;
    metric:author person:JaneDoe;
    .

citation:FASB.121
    a class:Citation;
    citation:biblio biblio:FASB;
    citation:selector "121"^^xsd:string;
    citation:href "https://asc.fasb.org#sect121"^^xsd:anyURI;
    .

biblio:FASB
    a class:Biblio;
    rdfs:label "Financial Accounting Standards Board";
    biblio:href "https://asc.fasb.org/"^^xsd:anyURI;
    .
currency:USD2001
    a class:Currency;
    rdfs:label "US Dollars (2001)";
    currency:symbol "$";
    currency:at "before";
    currency:thousands ",";
    currency:decimal ".";
    currency:negativeBefore "(";
    currency:negativeAfter ")";
    currency:mask "($#,##0.00)";
    currency:prefix "USD";
    currency:country country:USA;
    currency:referenceYear "2001"^^xsd:gYear;
    currency:baseline currency:USD1970;
    currency:value "0.76"^^xsd:double;
    .
In this case, the metric is annual revenue as defined by FASB section 121 (note that this is a fictitious reference), using a currency baseline of the 2001 US dollar. The currency description provides a great deal of information for representing this currency, and is defined in such a way that it also references another baseline, the 1970 US dollar, along with the relative value of the two (a 2001 dollar is worth only seventy-six cents compared to a 1970 dollar). The currency also contains a mask which may be used by formatting functions.
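Mask handling is left to the implementation, but as a rough Python sketch, the separator and negative-number fields of the currency:USD2001 description above are enough to drive a simple formatter. The function name and defaults here are hypothetical:

```python
def format_currency(amount, symbol="$", thousands=",", decimal=".",
                    negative_before="(", negative_after=")"):
    """Format a number with the separator and negative-number fields
    from the currency description. A real implementation would parse
    the mask ("($#,##0.00)") instead of hard-wiring two decimals."""
    prefix, suffix = "", ""
    if amount < 0:
        prefix, suffix = negative_before, negative_after
        amount = -amount
    whole, frac = f"{amount:.2f}".split(".")
    groups = []  # collect three-digit groups from the right
    while whole:
        groups.append(whole[-3:])
        whole = whole[:-3]
    return f"{prefix}{symbol}{thousands.join(reversed(groups))}{decimal}{frac}{suffix}"

print(format_currency(103425922.00))  # $103,425,922.00
print(format_currency(-5000.50))      # ($5,000.50)
```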
The citation information is actually two-fold: a citation cites a bibliographic reference (in this case the Financial Accounting Standards Board codex) and has a selector for section 121. This is useful for dealing with any kind of formal definition from a codex or similar legal or industry code reference. It also includes a link to an online representation of that code section, which can then be displayed by user agents.
This model does not track versioning changes. That is somewhat beyond the scope of this article, but in applications this author has built around the model, versioning uses named graphs and graph references to record when a definition goes out of date.
Associating Resources
The timeSeries in turn applies to a specific resource.
timeSeries:totalRevenue101 timeSeries:appliesTo salesRegion:salesPacificNorthWest.

salesRegion:salesPacificNorthWest
    a class:SalesRegion;
    rdfs:label "Pacific Northwest Sales Region";
    salesRegion:code "PNW";
    .
If the metric is what property is being measured, then the appliesTo is what resource that property is associated with. In this case, for instance, the revenue time series is applicable to the Pacific Northwest sales region of a given company.
This should be sufficient to specify most basic time series semantically.
Querying the Time Series
This seems like a lot of information to retain, but it is actually about as efficient as storing the same data in XML or JSON formats, and considerably more flexible. You can use SPARQL to recover the information. For instance, suppose that you wanted the labels and values for sales in the Pacific Northwest region for all dates in the system. The SPARQL to generate this could be written as:
select ?eventLabel ?value where {
    ?salesRegion salesRegion:code ?SalesRegionCode.
    ?timeSeries timeSeries:appliesTo ?salesRegion.
    ?timeSeries timeSeries:metric ?metric.
    ?metric metric:code ?MetricCode.
    ?metric metric:units ?currency.
    ?measure measurement:timeSeries ?timeSeries.
    ?measure measurement:metricEvent ?event.
    ?measure measurement:value ?value.
    ?event rdfs:label ?eventLabel.
    ?event event:startDateTime ?startTime.
} order by ?startTime
{SalesRegionCode:"PNW", MetricCode:"RevDef121"}
which in turn generates the output:
eventLabel  value
2001        103425922
2002        134992726
2003        142285022

or, as a JSON object:
[
  {eventLabel: "2001", value: 103425922},
  {eventLabel: "2002", value: 134992726},
  {eventLabel: "2003", value: 142285022}
]
Technically, there is a standard SPARQL JSON results format, which is a bit more complex than this, but the idea is the same.
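For reference, the standard SPARQL 1.1 JSON results format wraps each binding in a type/value object; flattening it into the simple array shown above takes only a few lines. A Python sketch over a truncated result set (note that literal values arrive as strings in this format):

```python
# A (truncated) SPARQL 1.1 JSON results document for the query above.
results = {
    "head": {"vars": ["eventLabel", "value"]},
    "results": {"bindings": [
        {"eventLabel": {"type": "literal", "value": "2001"},
         "value": {"type": "literal", "value": "103425922"}},
        {"eventLabel": {"type": "literal", "value": "2002"},
         "value": {"type": "literal", "value": "134992726"}},
    ]},
}

# Flatten each binding row into a simple {variable: value} dictionary.
simple = [{var: b[var]["value"] for var in results["head"]["vars"] if var in b}
          for b in results["results"]["bindings"]]
print(simple[0])  # {'eventLabel': '2001', 'value': '103425922'}
```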
Note that you can also format the output within the SPARQL query:
select (?eventLabel as ?Event_Label) (?value as ?Value) where {
    ?salesRegion salesRegion:code ?SalesRegionCode.
    ?timeSeries timeSeries:appliesTo ?salesRegion.
    ?timeSeries timeSeries:metric ?metric.
    ?metric metric:code ?MetricCode.
    ?metric metric:units ?currency.
    ?currency currency:mask ?format.
    ?measure measurement:timeSeries ?timeSeries.
    ?measure measurement:metricEvent ?event.
    ?measure measurement:value ?unformattedValue.
    bind (fn:formatNumber(?unformattedValue, ?format) as ?value)
    ?event rdfs:label ?eventLabel.
    ?event event:startDateTime ?startTime.
} order by ?startTime
{SalesRegionCode:"PNW", MetricCode:"RevDef121"}
Here, the extra code retrieves the currency definition from the metric and then retrieves the mask to be applied to the value, producing a formatted value. (Note that the formatNumber function is implementation specific.)
This produces the result:
Event_Label  Value
2001         $103,425,922.00
2002         $134,992,726.00
2003         $142,285,022.00

You can sum up and find the average for all of the values:
select (sum(?value) as ?Sum) (avg(?value) as ?Average) where {
    ?salesRegion salesRegion:code ?SalesRegionCode.
    ?timeSeries timeSeries:appliesTo ?salesRegion.
    ?timeSeries timeSeries:metric ?metric.
    ?metric metric:code ?MetricCode.
    ?measure measurement:timeSeries ?timeSeries.
    ?measure measurement:metricEvent ?event.
    ?measure measurement:value ?value.
}
{SalesRegionCode:"PNW", MetricCode:"RevDef121"}
Here, sum(?value) adds up all of the values while avg(?value) computes their mean; the currency mask can then be applied to format the results.
Sum              Average
$380,703,670.00  $126,901,223.33

You can also retrieve the time series summaries as tables for all of the regions:
select ?SalesRegionCode (sum(?value) as ?Sum) (avg(?value) as ?Average) where {
    ?salesRegion salesRegion:code ?SalesRegionCode.
    ?timeSeries timeSeries:appliesTo ?salesRegion.
    ?timeSeries timeSeries:metric ?metric.
    ?metric metric:code ?MetricCode.
    ?measure measurement:timeSeries ?timeSeries.
    ?measure measurement:metricEvent ?event.
    ?measure measurement:value ?value.
} group by ?SalesRegionCode
order by ?SalesRegionCode
{MetricCode:"RevDef121"}

with the output:
SalesRegionCode  Sum              Average
NE               $485,696,519.00  $161,898,839.67
PNW              $380,703,670.00  $126,901,223.33
SE               $308,429,160.00  $102,809,720.00
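As a sanity check, the same grouped aggregation can be reproduced in plain Python from the figures in the revenue table at the top of the article:

```python
# Per-region sum and average over the figures from the revenue table.
revenue = {
    "SE":  [95944252.00, 101225692.00, 111259216.00],
    "NE":  [141224512.00, 163266741.00, 181205266.00],
    "PNW": [103425922.00, 134992726.00, 142285022.00],
}

# Iterate in region-code order, mirroring the query's order by clause.
for region in sorted(revenue):
    total = sum(revenue[region])
    print(f"{region}: sum={total:,.2f} avg={total / len(revenue[region]):,.2f}")
```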
This is just scratching the surface. The above assumes a single independent variable (the resource in question) over time, but you can set up multiple resource dimensions by creating a composite entity with a hybrid key (for instance, a sales region and a gender):
salesRegionGender:PNW-Female
    a class:SalesRegionGender;
    salesRegionGender:region salesRegion:salesPacificNorthWest;
    salesRegionGender:gender gender:Female;
    .

timeSeries:salesPNW-Female
    a class:TimeSeries;
    timeSeries:metric metric:revenueDef121;
    timeSeries:appliesTo salesRegionGender:PNW-Female;
    .
This would then create a data cube for metric:revenueDef121 with 3 regions × 2 genders × 3 years = 18 potential values and six time series (assuming two genders).
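The cube arithmetic is easy to verify (a Python sketch with hypothetical code lists):

```python
from itertools import product

# Hypothetical code lists matching the example: three sales regions,
# two genders, three metric events (years).
regions = ["SE", "NE", "PNW"]
genders = ["Female", "Male"]
years = ["2001", "2002", "2003"]

# One time series per (region, gender) composite key...
series_keys = [f"salesRegionGender:{r}-{g}" for r, g in product(regions, genders)]
# ...and one potential measurement per (region, gender, year) cell.
cells = list(product(regions, genders, years))

print(len(series_keys), len(cells))  # 6 18
```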
Finally, measurements can be extended with upper and lower error bounds to better model risk and uncertainty. These would be given as measurement:upperBound and measurement:lowerBound, such as:
measurement:trm1
    a class:Measurement;
    measurement:timeSeries timeSeries:totalRevenue101;
    measurement:value "103425922.00"^^currency:USD2001;
    measurement:upperBound "106221876.00"^^currency:USD2001;
    measurement:lowerBound "99328104.00"^^currency:USD2001;
    measurement:metricEvent metricEvent:2001;
    .

Additional analysis tools such as confidence levels can also be applied, working in much the same way. Similarly, bracketing and normalization can be applied to time series by storing this information, along with filters and envelopes, in the metric definition itself.
Summary
Time series are ubiquitous. By modeling time series in this fashion using RDF and SPARQL you gain a huge number of advantages. You can represent anything from scientific experiments to financial analysis in the same basic, consistent fashion. You can compare time series for different objects across the same (or different) metrics to determine correlations. You can pass time series as open data to others with minimal (or no) rework, and you can extract time series data in convenient formats with comparatively little effort.
Kurt Cagle is the founder and ontologist for Semantical LLC, and the blogger at the Metaphorical Web, and is available for hire or consultation.
