Analysing active galaxies with the Sloan Digital Sky Survey

GarryPas
6 min read · Jun 16, 2020


Introduction

In my day job I’m an enterprise developer; in the evenings I’m studying for a degree in astronomy and physics with the OU, for which I’m in the final year. I recently worked with a team of 7 others on a project to produce a characteristic spectrum for active galaxies (also called quasars, or QSOs for ‘quasi-stellar objects’); these are galaxies at high redshifts (i.e. we see them as they were very long ago) with compact, energetic, point-like nuclei known as active galactic nuclei (AGN).

Active galaxies are interesting because they allow astronomers to essentially look back in time, sometimes to just a billion or so years after the big bang. There is strong evidence that the quasar luminosity function is variable: there are fewer quasars at lower redshifts, implying that quasars are a stage of galactic evolution that many galaxies, including our own Milky Way, may have gone through in a bygone epoch. Active galaxies may also be the source of the ionisation of the neutral hydrogen observed in the intergalactic medium.

The goal

The goal of the project was to collect data from the spectra of many AGN, then normalise and aggregate those data so that we could present a combined spectrum with a high signal-to-noise ratio (SNR).

An example of such a spectrum is given below from Harris et al. (2016).

Figure 1: A characteristic combined spectrum from ~100k quasars

The plot in Figure 1 shows distinct peaks at key wavelengths that can be used to infer the chemical composition of QSOs, and also to determine the velocity dispersion of the AGN. Note that the lines are wide; the explanation for this is that the gas at the centre of an active galaxy is moving at relativistic speeds. Measuring velocity dispersion would be one of the key indicators that we had correctly measured quasars and not some other class of object (e.g. a normal galaxy, like the Milky Way). This is achieved using the simple relation Δv = cΔλ/λ, where λ is the central wavelength of the line (the x-axis in Figure 1), Δλ is the width of the line taken at full width at half maximum (FWHM), and c ≈ 300,000 km s⁻¹ is the speed of light.
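As a rough illustration of the calculation (the line width here is a made-up value for the sake of the example, not one of our measurements), the conversion from line width to velocity is a one-liner:

# Line-width to velocity conversion: Delta_v = c * Delta_lambda / lambda.
# The FWHM used below is illustrative only, not a measurement from our data.
C_KM_S = 3.0e5  # speed of light in km/s

def velocity_dispersion(fwhm_nm, centre_nm):
    """Velocity width (km/s) of a line with the given FWHM and central wavelength."""
    return C_KM_S * fwhm_nm / centre_nm

# Lyman-alpha has a rest wavelength of about 121.6 nm; assume a 2 nm FWHM.
print(velocity_dispersion(2.0, 121.6))  # roughly 4,900 km/s, i.e. relativistic broadening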

Methods

We aimed to download data obtained by the Sloan Digital Sky Survey’s 2.5 m altitude-azimuth telescope located at the Apache Point Observatory in New Mexico, all of which is freely available in a huge online database. We used Data Release 12 (DR12), covering data collected through July 2014, which can be queried using T-SQL (the back-end database is Microsoft SQL Server) through a web portal called SkyServer.

We chose a redshift (z) range of 1.5 ≤ z ≤ 3.1 which, as you can see from Figure 2, covers the majority of quasars found by the Baryon Oscillation Spectroscopic Survey (BOSS). Note that QSO densities may not drop at higher redshifts as sharply as the figure implies.

Figure 2: QSO density as a function of redshift (age)

First we aimed to select representative spectra, which meant trying to obtain a sample spread broadly across the redshift range we’d agreed. The first step was to select some good sample objects. This was done by executing ad-hoc SQL queries against the SDSS database using SkyServer. An example finding 35 objects that meet a set of quality constraints, such as a minimum SNR and no warning flags, is given below:

SELECT TOP 35
  bestObjId,
  z,
  -- A URL you can paste into something like Word and click
  'http://skyserver.sdss.org/dr12/en/tools/explore/summary.aspx?id=' + CAST(bestObjId as varchar) as url,
  -- Generate some deterministic randomness
  REVERSE(SUBSTRING(CAST(specObjId as varchar), 1, 5)) indexer
FROM SpecObj
-- We want quasars
WHERE class = 'QSO'
  -- Only select from BOSS
  AND survey = 'BOSS'
  -- Redshift range
  AND z >= 1.7 AND z < 1.9
  -- 0 means all is well
  AND zWarning = 0
  -- Signal to noise, median over all good pixels
  AND snMedian >= 15
  -- Some of these come back as zero; don't know why, and finding out is a distraction. Plenty of others, so just exclude them
  AND bestObjId > 0
  -- Bitmask flag for no warnings (see schema docs for flags)
  AND zWarning_noqso = 0x00000000
  -- Redshift error
  AND zErr < 0.0005
  -- Important: ensure we're not picking up rogue objects
  AND sourcetype = 'QSO'
-- For getting a selection of items using our "random" indexer
ORDER BY indexer ASC

This would select 35 objects (though not download their spectra; that would be the next step), which could then be exported as a CSV file. With 8 team members each doing this for a range of z = 0.2, this would give us 280 spectra spanning the whole range. Mine was 1.7 ≤ z ≤ 1.9.

With objects selected, the next step was to download their spectrographic data. Doing this through the web-based UI was a laborious process that would need to be repeated hundreds of times, and possibly several times over if issues were found, so, being a developer, my immediate instinct was to automate the process. This was achieved in the form of a fairly rudimentary Python script that took the downloaded CSV file and grabbed the spectra from the SDSS website.
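The original script isn’t reproduced here, but a minimal sketch along the same lines might look like the following. It assumes the selection query is extended to also return the plate, MJD and fibre ID of each object (the identifiers SDSS uses to name spectrum files), and that the DR12 BOSS “lite” spectra follow the usual spec-PLATE-MJD-FIBER.fits naming on the science archive; check the SDSS data-access documentation before relying on the exact URL pattern.

import csv
import requests

# Assumed URL pattern for DR12 BOSS 'lite' spectra on the SDSS science archive;
# verify against the current SDSS data-access docs before use.
SPEC_URL = ("https://data.sdss.org/sas/dr12/boss/spectro/redux/v5_7_0/spectra/lite/"
            "{plate:04d}/spec-{plate:04d}-{mjd:05d}-{fiberid:04d}.fits")

def download_spectra(csv_path):
    """Read the SkyServer CSV export and fetch one FITS spectrum per row."""
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            plate, mjd, fiberid = int(row["plate"]), int(row["mjd"]), int(row["fiberid"])
            url = SPEC_URL.format(plate=plate, mjd=mjd, fiberid=fiberid)
            response = requests.get(url, timeout=60)
            response.raise_for_status()
            out_name = "spec-{:04d}-{:05d}-{:04d}.fits".format(plate, mjd, fiberid)
            with open(out_name, "wb") as out:
                out.write(response.content)
            print("Saved", out_name)

download_spectra("selected_objects.csv")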

Some spectra are extremely intense while others are faint, and this often has nothing to do with their intrinsic luminosity; it is instead a function of their distance from the observer, position in the sky, obscuring dust and gas, and so on. To compensate for this we used a process of normalisation, whereby all spectra were scaled to a baseline value using a region of the continuum where no prominent emission lines were present. This process is complicated somewhat by the range of redshifts involved: because SDSS observes from the near-ultraviolet to the near-infrared, and the data are subsequently shifted to the rest frame to represent the intrinsic emission wavelengths, the wavelength coverage of each spectrum depends on its redshift; one spectrum might begin at 100 nm and end at 400 nm, another begin at 150 nm and end at 600 nm. Normalising to a common baseline is greatly simplified if the baseline region is common to all spectra. Luckily there were several continuous ranges meeting this criterion, and we selected the range 220–230 nm. A further Python script would churn through the spectra calculating normalisation factors.
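The exact normalisation script isn’t shown in this post, but the idea is simple enough to sketch. Assuming each spectrum has been loaded as arrays of rest-frame wavelength (in nm) and flux, a factor that scales the median flux in the 220–230 nm window to a common baseline might be computed like this (the baseline value and function name are illustrative, not the ones we actually used):

import numpy as np

BASELINE_FLUX = 1.0          # arbitrary common level to scale every spectrum to
WINDOW_NM = (220.0, 230.0)   # line-free continuum region shared by all spectra

def normalisation_factor(wavelength_nm, flux):
    """Factor that scales a spectrum's median 220-230 nm flux to the baseline."""
    in_window = (wavelength_nm >= WINDOW_NM[0]) & (wavelength_nm <= WINDOW_NM[1])
    median_flux = np.median(flux[in_window])
    return BASELINE_FLUX / median_flux

# Usage: multiply the whole flux array by the factor before combining.
# normalised_flux = flux * normalisation_factor(wavelength_nm, flux)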

With spectra and normalisation factors in hand, we used an OU-provided tool called SpecCombine that would combine the spectra.
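SpecCombine is an OU tool, so I won’t describe its internals, but conceptually the combination step amounts to resampling every normalised spectrum onto a common wavelength grid and averaging. A rough sketch of an equivalent approach (my guess at the idea, not SpecCombine’s actual algorithm) is:

import numpy as np

def combine_spectra(spectra, grid_nm):
    """Median-combine normalised spectra after resampling onto a common grid.

    spectra: list of (wavelength_nm, normalised_flux) array pairs,
             with wavelengths sorted in ascending order.
    grid_nm: 1-D array of wavelengths to resample onto.
    """
    resampled = []
    for wavelength_nm, flux in spectra:
        # Linear interpolation; points outside a spectrum's coverage become NaN
        # so they are ignored rather than padded with endpoint values.
        flux_on_grid = np.interp(grid_nm, wavelength_nm, flux,
                                 left=np.nan, right=np.nan)
        resampled.append(flux_on_grid)
    # nanmedian ignores wavelengths that a given spectrum doesn't cover.
    return np.nanmedian(np.array(resampled), axis=0)

# Usage: grid = np.arange(100.0, 600.0, 0.1); composite = combine_spectra(spectra, grid)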

Results

Figure 3: How our results compared to several examples selected from academic literature

Figure 3 shows how our results compared to other examples from the academic literature, including the aforementioned spectrum from Harris et al. (2016), and the level of consistency is high.

We also compared our SNR with those from the academic literature, and this too gave impressive results considering the relatively small sample size.

Figure 4: SNR data comparison with academic literature

Calculations of velocity dispersion using prominent emission lines (such as Lyα) also gave reassuring results, of the order of ~10⁴ km s⁻¹, consistent with expectations of relativistic velocity dispersions; we had successfully produced a high-quality composite spectrum.

Summary

This project was part of the final year of an astrophysics degree, but it is the sort of project anyone can undertake, as the data and the tools are freely available (SpecCombine would be trivial to implement in code), although some research is required to attach meaning to the data. Researching the nature of QSOs and understanding how our data could be interpreted gave me a deeper understanding of astrophysical concepts and of the methods for processing and analysing scientific data.



GarryPas

Software engineering consultant specialising in NodeJS, DotNET and Java. Primarily backend these days.