Where does OpenAQ data come from?

Please note:

  • OpenAQ-aggregated data are gathered in real time from government agencies and no guarantees can be made for their accuracy. All quality control measures should be done by the user or by contacting the host source.
  • Licensing: The data is licensed under the Creative Commons Attribution 4.0 International License. It is attributed to the OpenAQ community. For more information about the data sources of this project, please check openaq.org. Software is licensed as below with The MIT License.
  • More FAQs on data and the OpenAQ Platform and Community can be found here.

We’ve got a lot of data, but we do not measure any of it ourselves. We aggregate it from public real-time data sources provided by official, usually government-level, organizations. They do the hard work of measuring these data and publicly sharing them, and we do the work of making them more universally accessible to both humans and machines.

This post explains what sorts of data we aggregate onto the OpenAQ platform and why. In briefest form: We’re striving for criteria that maximize the transparency of our system and provide data in the most useful fashion possible.

This same information can also be found in this README file on GitHub in the openaq-fetch repository. But we know that won’t be immediately obvious unless you’re digging through our GitHub stuff (which we encourage you to do), so we’re also putting it here along with a more detailed explanation. OpenAQ is an ever-evolving platform that is shaped by its community: your feedback and questions are actively invited. Drop us a comment below, email us at info@openaq.org, or chime in on our Slack channel.

OpenAQ Criteria for Data Sources

We’ve got five main criteria for data sources that are suitable to include on the OpenAQ platform:

(1) Data must be of one of these pollutant types: PM10, PM2.5, sulfur dioxide (SO2), carbon monoxide (CO), nitrogen dioxide (NO2), ozone (O3), or black carbon (BC). With the exception of black carbon, we chose these pollutants because they are the most commonly measured outdoor pollutants. We included black carbon because it impacts regional and global climate and is also becoming more widely measured.

(2) Data must be from an official-level stationary, outdoor air quality source, defined as data produced by a government entity or international organization. We do not, at this stage, include data from low-cost, temporary, mobile, and/or indoor sensors.

Interested in helping out? We list some of the sites we’re hoping to add soon here on GitHub in the openaq-fetch repository as ‘issues.’ Feel free to add more possibilities, comment on the ones we have listed, or help connect our platform to these sources.

(3) Data must be ‘raw’ and reported in physical concentrations on their originating site. Across the world, air quality is reported in terms of an Air Quality Index (AQI), Pollutant Standard Index (PSI), or other equivalent forms that transform physical pollutant measurements into one value. This is a valuable technique for quickly communicating the general air quality in one easy, often color-coded number, especially since air pollutants in their raw forms can be confusing, with different units (e.g. parts per million versus micrograms per cubic meter) and different time-averaged intervals.
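
To illustrate the unit differences mentioned above, here is a minimal sketch of the standard conversion from parts per million to micrograms per cubic meter for a gas. It is not something this post says the OpenAQ platform does. The function name ppm_to_ugm3 and the dictionary keys are hypothetical, and the molar volume of 24.45 L/mol assumes reference conditions of 25 °C and 1 atm, which vary between agencies.

```python
# Molar masses in g/mol (standard values, rounded to two decimals).
MOLAR_MASS = {"so2": 64.07, "co": 28.01, "no2": 46.01, "o3": 48.00}

# Molar volume of an ideal gas in L/mol at 25 °C and 1 atm.
# Assumption: agencies may use other reference conditions.
MOLAR_VOLUME_L = 24.45


def ppm_to_ugm3(ppm, gas):
    """Convert a gas concentration from ppm to micrograms per cubic meter."""
    return ppm * MOLAR_MASS[gas] * 1000.0 / MOLAR_VOLUME_L


if __name__ == "__main__":
    # 1 ppm of carbon monoxide under the assumed reference conditions.
    print(round(ppm_to_ugm3(1.0, "co")))  # prints 1146
```

The same physical concentration therefore looks very different depending on the unit and the reference conditions a source uses, which is one reason a single index number is easier to communicate.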

Often, air quality data is reported as an ‘Air Quality Index’ or equivalent form to most simply convey general air quality information to the public.

Often, however, countries have different formulas for defining their Air Quality Index or equivalent forms. That makes sense from a national perspective, though it makes it difficult to easily know what the physical pollution levels are at a given location or to compare these data across countries’ boundaries.

For these reasons, we stick to data sources that share the raw data. Importantly, although the conversion between a country’s Air Quality Index (or equivalent form) and the raw value is often public, we will not ‘back-calculate’ values to recover the raw form. We want to avoid an approach that may be error-prone and unsustainable: it can be hard to verify a country’s most current formula, human errors can easily creep in, and the methods used to calculate index values change over time, so we might not know when they change for a given location.

(4) Data must be at the ‘station-level,’ not aggregated into a higher (e.g. city) level. Often, in order to most clearly communicate air quality to the public, public air quality sites will aggregate all data from sensors in a city or region to one value. However, to realize the full utility of these data, we think it is important to be able to associate each station with a unique physical location.

(5) Data must be from measurements averaged between 10 minutes and 24 hours. Our system checks a given site for an updated pollutant value every 10 minutes. On the high end, we made a judgement call that 24 hours is the longest average time interval that we would define as ‘real-time’ data. Plus, it seems like nearly all official real-time air quality data publicly shared in the world falls within this range anyway.
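
The five criteria above can be sketched as a simple checklist in code. This is only an illustration under stated assumptions: the Candidate class, its field names, the parameter keys such as "pm25", and the unmet_criteria function are all hypothetical and are not taken from the openaq-fetch repository.

```python
from dataclasses import dataclass
from typing import List

# Criterion 1: accepted pollutant types (key spellings are an assumption).
ALLOWED_PARAMETERS = {"pm10", "pm25", "so2", "co", "no2", "o3", "bc"}

# Criterion 5: accepted averaging window, in minutes.
MIN_AVERAGING_MINUTES = 10
MAX_AVERAGING_MINUTES = 24 * 60


@dataclass
class Candidate:
    """Hypothetical description of a candidate data source."""
    parameter: str
    official_source: bool        # government entity or international organization
    stationary_outdoor: bool     # not low-cost, temporary, mobile, or indoor
    reports_concentration: bool  # physical concentrations, not an index value
    station_level: bool          # tied to one station, not a city aggregate
    averaging_minutes: float


def unmet_criteria(c: Candidate) -> List[int]:
    """Return the numbers of the criteria that the candidate fails."""
    failed = []
    if c.parameter.lower() not in ALLOWED_PARAMETERS:
        failed.append(1)
    if not (c.official_source and c.stationary_outdoor):
        failed.append(2)
    if not c.reports_concentration:
        failed.append(3)
    if not c.station_level:
        failed.append(4)
    if not (MIN_AVERAGING_MINUTES <= c.averaging_minutes <= MAX_AVERAGING_MINUTES):
        failed.append(5)
    return failed


if __name__ == "__main__":
    hourly_pm25 = Candidate("pm25", True, True, True, True, 60)
    city_index = Candidate("pm25", True, True, False, False, 60)
    print(unmet_criteria(hourly_pm25))  # prints []
    print(unmet_criteria(city_index))   # prints [3, 4]
```

Returning the list of failed criterion numbers, rather than a single yes or no, makes it easier to explain why a given source is not yet a fit.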

Thoughts or feedback? Questions? Let us know what you think and why.