I undertake interdisciplinary research that combines computer science with astronomy at the Jodrell Bank Centre for Astrophysics (SKA group) here in Manchester. I work mostly on developing machine learning algorithms and software tools for processing Square Kilometre Array (SKA) data (more info here). The SKA is a new telescope under development which, when constructed, will be the largest radio telescope ever built. The goal of my work is to help astronomers make new discoveries with the instrument (pulsars in particular) when it eventually comes online during the next 10 years.

Square Kilometre Array

Artist's impression of the Square Kilometre Array. Credit: Swinburne Astronomy Productions for SKA Project Development Office, CC BY-SA 3.0.


R. J. Lyon, J. M. Brooke, J. D. Knowles, B. W. Stappers, “A Study on Classification in Imbalanced and Partially-Labelled Data Streams”, in Simple and Effective Machine Learning for Big Data, Special Session, IEEE International Conference on Systems, Man, and Cybernetics, SMC 2013. Arxiv Pre-print, DOI: 10.1109/SMC.2013.260.

R. J. Lyon, J. M. Brooke, J. D. Knowles, B. W. Stappers, “Hellinger Distance Trees for Imbalanced Streams”, in 22nd International Conference on Pattern Recognition, pp.1969-1974, 2014. Arxiv Pre-print, DOI: 10.1109/ICPR.2014.344.

R. J. Lyon, B. W. Stappers, S. Cooper, J. M. Brooke, J. D. Knowles, “Fifty Years of Pulsar Candidate Selection: From simple filters to a new principled real-time classification approach”, Monthly Notices of the Royal Astronomical Society (MNRAS), 459 (1): 1104-1123, Arxiv Pre-print, MNRAS, DOI:10.1093/mnras/stw656.


Member of the Program Committee of the following International Conferences:

Datasets and Databases

Pulsar Survey Database

You may find the Pulsar Survey database I created useful for your research. It lists every major pulsar survey conducted during the past fifty years, along with their technical specifications.

Pulsar Data (HTRU2)

HTRU2 is a data set which describes a sample of pulsar candidates collected during the High Time Resolution Universe Survey (South). The data set shared here contains 16,259 spurious examples caused by RFI/noise, and 1,639 real pulsar examples. These examples have all been checked by human annotators. Each candidate is described by 8 continuous variables. The first four are simple statistics obtained from the integrated pulse profile (folded profile). This is an array of continuous variables that describe a longitude-resolved version of the signal that has been averaged in both time and frequency (see [3] for more details). The remaining four variables are similarly obtained from the DM-SNR curve (again see [3] for more details). These are summarised below:

  1. Mean of the integrated profile.
  2. Standard deviation of the integrated profile.
  3. Excess kurtosis of the integrated profile.
  4. Skewness of the integrated profile.
  5. Mean of the DM-SNR curve.
  6. Standard deviation of the DM-SNR curve.
  7. Excess kurtosis of the DM-SNR curve.
  8. Skewness of the DM-SNR curve.
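As a rough illustration of how the eight variables listed above could be computed, the sketch below applies the same four statistics to an integrated profile and to a DM-SNR curve. The function names are hypothetical, and the use of population (biased) estimators is an assumption; the exact estimators used to build HTRU2 may differ.

```python
import numpy as np


def candidate_statistics(values):
    """Return (mean, standard deviation, excess kurtosis, skewness) of a 1-D array.

    Assumption: population (biased) moment estimators are used throughout.
    """
    x = np.asarray(values, dtype=float)
    mean = x.mean()
    std = x.std()  # population standard deviation (ddof=0)
    if std == 0.0:
        raise ValueError("constant input: kurtosis and skewness are undefined")
    z = (x - mean) / std
    excess_kurtosis = np.mean(z ** 4) - 3.0  # zero for a normal distribution
    skewness = np.mean(z ** 3)
    return mean, std, excess_kurtosis, skewness


def htru2_style_features(profile, dm_snr_curve):
    """Concatenate the four statistics of the profile and of the DM-SNR curve,
    in the order used by the list above."""
    return candidate_statistics(profile) + candidate_statistics(dm_snr_curve)
```

Calling `htru2_style_features(profile, dm_snr_curve)` returns a tuple of eight floats, one per variable, which can be passed directly to a classifier as a feature vector.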

Classification results for: A Study on Classification in Imbalanced and Partially-Labelled Data Streams

Data sets supporting the results reported in the paper:

R. J. Lyon, J. M. Brooke, J. D. Knowles, B. W. Stappers, “A Study on Classification in Imbalanced and Partially-Labelled Data Streams”, in Simple and Effective Machine Learning for Big Data, Special Session, IEEE International Conference on Systems, Man, and Cybernetics, SMC 2013. Arxiv Pre-print, DOI: 10.1109/SMC.2013.260.

Classification results for: Hellinger Distance Trees for Imbalanced Streams

Data sets supporting the results reported in the paper:

R. J. Lyon, J. M. Brooke, J. D. Knowles, B. W. Stappers, “Hellinger Distance Trees for Imbalanced Streams”, in 22nd International Conference on Pattern Recognition, pp.1969-1974, 2014. Arxiv Pre-print, DOI: 10.1109/ICPR.2014.344.

Contained in this distribution are the results of stream classifier performance on four different data sets. Also included are the test results from our attempt at reproducing the outcome of the paper:

Learning Decision Trees for Un-balanced Data, D. A. Cieslak and N. V. Chawla, in Machine Learning and Knowledge Discovery in Databases (W. Daelemans, B. Goethals, and K. Morik, eds.), vol. 5211 of LNCS, pp. 241-256, 2008.


Much of my code is online and released under an open source license.

Pulsar Feature Lab

The Pulsar Feature Lab application is a collection of Python scripts useful for extracting machine learning features (otherwise known as scores or variables) from pulsar candidate files. The code was written to provide a tool-kit for designing and extracting new candidate features, whilst retaining the ability to extract existing features developed by the community at large. This enables newly conceived features to be evaluated against existing features, so that an objective decision on their utility can be reached.


Stuffed is a wrapper for WEKA and MOA classification algorithms, which enables testing and evaluation on unlabelled data streams. This is (or was last I checked) hard to achieve with MOA. Stuffed makes this possible by using custom sampling methods to sample large data sets so that they can contain:

– Varied levels of class balance in both test and training sets.
– Varied levels of labelling in the test data streams.

The custom sampling method produces meta data with each sampling, that allows stream classifier predictions to be evaluated on unlabelled data. For instance, if a data item in the stream is unlabelled (?), typical evaluation mechanisms would not evaluate classifier performance on this example. However since Stuffed keeps meta data at hand, it is possible to evaluate the label assigned by a classifier to each unlabelled instance.

Stuffed is designed to work on binary classification problems only. It can be used to gather statistics on classifier performance, is easily extensible, and can be used with other tools such as MATLAB.
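The sampling idea described above can be sketched in a few lines of Python. This is not the Stuffed implementation (which wraps WEKA and MOA); the function names, parameters and data layout here are hypothetical and serve only to show how keeping the true labels as meta data allows predictions on unlabelled items to be scored.

```python
import random


def sample_stream(examples, minority_fraction, labelled_fraction, size, seed=0):
    """Draw a stream with a chosen class balance and labelling level.

    examples: list of (features, label) pairs with label 0 or 1.
    Returns (stream, meta): stream items carry the label, or None when
    unlabelled; meta holds the true label of every item, in stream order.
    """
    rng = random.Random(seed)
    positives = [e for e in examples if e[1] == 1]
    negatives = [e for e in examples if e[1] == 0]
    n_pos = round(size * minority_fraction)
    n_neg = size - n_pos
    # random.sample raises ValueError if a class has too few examples.
    chosen = rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)
    rng.shuffle(chosen)
    stream, meta = [], []
    for features, label in chosen:
        visible = label if rng.random() < labelled_fraction else None
        stream.append((features, visible))
        meta.append(label)  # the truth is kept aside even when hidden
    return stream, meta


def evaluate_unlabelled(stream, meta, predict):
    """Score predictions on the unlabelled items only, using the meta data."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for (features, visible), truth in zip(stream, meta):
        if visible is not None:
            continue  # labelled items are handled by ordinary evaluation
        predicted = 1 if predict(features) else 0
        if predicted == 1:
            counts["tp" if truth == 1 else "fp"] += 1
        else:
            counts["fn" if truth == 1 else "tn"] += 1
    return counts
```

Here `predict` stands in for any trained binary classifier; the returned counts are enough to derive recall, precision and false positive rate on the unlabelled part of the stream.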


The GH-VFDT is an imbalanced data stream classifier that uses the Hoeffding bound and Hellinger distances to improve minority class recall. Its decision tree split criterion is designed for imbalanced data streams, i.e. those streams where the class distribution is worse than 1:100. This implementation is built upon the Hoeffding Tree provided in MOA, so a great deal of credit goes to the MOA team for their initial implementation and library. We gratefully acknowledge their efforts.
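To give a feel for the two ingredients named above, here is a minimal Python sketch. It assumes that each feature is summarised per class by a Gaussian (mean and standard deviation), and that a split is accepted when the best feature beats the runner-up by more than the Hoeffding bound, as in a standard Hoeffding tree. Both are assumptions made for illustration, the function names are hypothetical, and the paper cited below should be consulted for the actual algorithm.

```python
import math


def hellinger_gaussian(mu1, sigma1, mu2, sigma2):
    """Hellinger distance between N(mu1, sigma1^2) and N(mu2, sigma2^2).

    Normalised so the result lies in [0, 1]; some texts omit the
    normalisation and obtain a range of [0, sqrt(2)].
    """
    var_sum = sigma1 ** 2 + sigma2 ** 2
    # Bhattacharyya coefficient of the two normal densities.
    bc = math.sqrt(2.0 * sigma1 * sigma2 / var_sum) * math.exp(
        -((mu1 - mu2) ** 2) / (4.0 * var_sum)
    )
    return math.sqrt(max(0.0, 1.0 - bc))


def hoeffding_bound(value_range, delta, n):
    """With probability 1 - delta, the mean of n observations of a variable
    with the given range is within this distance of its true mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best, second_best, value_range, delta, n):
    """Split once the best feature's score leads the second best by more
    than the Hoeffding bound (an assumed, standard Hoeffding-tree rule)."""
    return (best - second_best) > hoeffding_bound(value_range, delta, n)
```

A feature whose pulsar and non-pulsar distributions barely overlap scores close to 1 regardless of how rare the pulsar class is, which is why a Hellinger-based criterion is attractive for skewed streams.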

For more details of the algorithm see:

R. J. Lyon, J. M. Brooke, J. D. Knowles, B. W. Stappers, “Hellinger Distance Trees for Imbalanced Streams”, in 22nd International Conference on Pattern Recognition, pp.1969-1974, 2014. Arxiv Pre-print, DOI: 10.1109/ICPR.2014.344.


Machine Learning & Science Data Processing @ SKA Delivering the Science, Cambridge, 12-13 April 2016.


Industry Engagement Day Poster, Thursday 10 July 2014.

Ph.D Work

During my Ph.D I was under the supervision of Dr. John Brooke (Computer Science), Dr. Josh Knowles (Computer Science) and Prof. Ben Stappers (Jodrell Bank Centre for Astrophysics). The algorithms I developed during my PhD have so far helped to find 20 new pulsars in data obtained during the LOFAR Tied-Array All-Sky Survey, aka the LOTAAS survey (see here for more details). The techniques I developed are also being applied to data collected during the GMRT High Resolution southern sky survey (GHRSS), and to High Time Resolution Universe Survey (HTRU) data.

During my time as a Ph.D student I was part of the SUPERB project (SUrvey for Pulsars and Extragalactic Radio Bursts), which searched for new pulsars and the more mysterious fast radio bursts (FRBs). See the SUPERB website for more details.

So what is a Pulsar?

When stars reach the end of their lives, what happens to them depends largely on how much mass they have. Extremely massive stars collapse to form black holes when they die, whilst smaller stars like our Sun become white dwarfs. Radio pulsars, on the other hand, are stars which had too much mass to become white dwarfs and too little to become black holes (they typically have a mass between approximately 1.4 M⊙ [5] and 2 M⊙ [6]). Pulsars are a type of neutron star that rotates rapidly about its axis (often hundreds of times per second), emitting regular radio pulses from its magnetic poles with almost artificial precision (see the picture below). These beams of radio emission sweep across the sky each time the pulsar rotates, much like the beam of a lighthouse. Pulsars are also extremely dense objects. Their enormous mass is condensed into a sphere only about 20 km [5] in diameter, with a surface gravity 10⁹ times that of the Earth [2]! Pulsars were first discovered in 1967 [1] by Jocelyn Bell (then a PhD student) and her colleagues at Cambridge University.

A Radio Pulsar

Typical radio pulsar, with radio pulses being emitted from the magnetic poles. Credit Brooke et al. (see [7]).

Pulsars are of particular interest for two reasons: 1) they are much more massive than any object in our own solar system, which makes them unique and interesting, and 2) the periodic radio pulses they emit can be measured here on Earth, meaning they can be studied. But why is this important? Well, it's important because it allows us to test many of our scientific theories, including, for example, theories of gravity. This is simply not possible in our own solar system, as we don't have such large objects with immense gravitational fields nearby [8]. Pulsars are interesting and useful for many other reasons. They can be used for spacecraft navigation, for precision timing since they are excellent time keepers, as probes of the space between stars (the interstellar medium), for studies of stellar evolution, and much, much more! See [2,8,9] for further examples.

So where does computer science come in?

In recent years a number of technical advances have enabled pulsars to be observed with unprecedented precision. However such precision cosmology presents us with some difficult computational challenges. Chief amongst them is figuring out how to deal with the vast amounts of data being generated by modern radio telescopes. Telescopes now produce so much data that finding signals of interest is a little bit like finding a needle in a haystack. This is because the signals of most interest to science are rare. These include exotic types of pulsar such as millisecond pulsars (MSPs)¹, binary pulsars, or even pulsar-black hole binaries. To find them we must create sophisticated search algorithms which carefully go through the data for us.

However this is not easy to do. Interesting signals can look very similar to uninteresting signals caused by noise or by interference (from mobile phones, planes, satellites, etc.). In fact they can be almost indistinguishable. Our current best search procedures therefore return millions of signals which must be checked. Almost all of these are caused by interference or noise! So we need new approaches and algorithms that are able to accurately pick out the interesting signals whilst ignoring all the ‘rubbish’.

A branch of computer science called machine learning specialises in producing these types of algorithms and data processing techniques. My work uses these techniques to help astronomers more accurately select signals to follow-up, with the ultimate aim of building an accurate signal selection algorithm for the SKA.


Developing machine learning algorithms for the SKA is far from easy. The SKA will produce terabytes of data each second, placing significant load on any proposed selection algorithm! Whilst under such load, a high recall rate (99.9%) and a low false positive rate (0.0001%) must be maintained, otherwise discoveries could be missed. The main challenges faced are:

  • Skewed class distributions (minimum of 1000 non-pulsar examples seen for every pulsar observed).
  • Data distributions which drift over time (concept drift) due to non-static RFI and galactic noise environments.
  • Operating within strict resource requirements (CPU time, memory use, power use).
  • A lack of readily available labelled examples to train our algorithms with.
  • A real-time operation requirement.
  • The need to audit and monitor classification decisions, so that science teams can improve algorithmic performance over time.
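To see why the false positive rate target quoted above needs to be so strict, it helps to work through the arithmetic for a skewed stream. The short Python sketch below assumes exactly 1000 non-pulsar examples per pulsar (the minimum imbalance listed above); the function name and the relaxed 1% false positive rate used for comparison are illustrative assumptions, not figures from the SKA requirements.

```python
def expected_precision(recall, false_positive_rate, negatives_per_positive):
    """Fraction of positive predictions that are real pulsars, computed per
    single real pulsar in the stream."""
    true_positives = recall  # expected hits per real pulsar
    false_positives = false_positive_rate * negatives_per_positive
    return true_positives / (true_positives + false_positives)


# Targets from the text: 99.9% recall and a 0.0001% false positive rate.
target = expected_precision(0.999, 0.000001, 1000)

# Illustrative comparison only: the same recall with a 1% false positive rate.
relaxed = expected_precision(0.999, 0.01, 1000)

print(f"precision at target rates:   {target:.4f}")
print(f"precision at a 1% FP rate:   {relaxed:.4f}")
```

Under these assumptions, the target rates leave almost every selected candidate a real pulsar, whereas a seemingly small 1% false positive rate would bury each real pulsar under roughly ten false alarms.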

The algorithms I develop have wide applicability outside of astronomy, particularly in those domains characterised by unlabelled/imbalanced data.

1. Pulsars with rotation periods of less than 30 milliseconds.

Further Reading

[1] T. Damour and J. Taylor, “Strong-Field Tests of Relativistic Gravity and Binary Pulsars,” Physical Review D, vol. 45, pp. 1840–1868, March 1992.

[2] J. Cordes, M. Kramer, T. Lazio, B. W. Stappers, D. Backer, and S. Johnston, “Pulsars as tools for fundamental physics and astrophysics,” New Astronomy Reviews, vol. 48, pp. 1413–1438, November 2004.

[3] T. Damour and G. Esposito-Farèse, “Gravitational-wave versus binary-pulsar tests of strong-field gravity,” Physical Review D, vol. 58, p. 042001, July 1998.

[4] M. Kramer, I. H. Stairs, R. N. Manchester, M. A. McLaughlin, A. G. Lyne, R. D. Ferdman, M. Burgay, D. R. Lorimer, A. Possenti, N. D’Amico, J. M. Sarkissian, G. B. Hobbs, J. E. Reynolds, P. C. C. Freire, and F. Camilo, “Tests of general relativity from timing the double pulsar,” Science, vol. 314, pp. 97–102, October 2006.

[5] P. Haensel, Y. Potekhin, and D. Yakovlev, eds., Neutron Stars I Equation of State and Structure, vol. 326 of Astrophysics and Space Science Library. New York, NY: Springer New York, 2007.

[6] B. Kızıltan, “Reassessing the Fundamentals: New Constraints on the Evolution, Ages and Masses of Neutron Stars,” in Astrophysics of Neutron Stars 2010: A Conference in Honor of M. Ali Alpar, pp. 41–47, AIP, August 2010.

[7] J. Brooke, S. Pickles, and P. Carr, “Workflows in pulsar astronomy,” in Workflows for e-Science (I. Taylor, E. Deelman, D. Gannon, and M. Shields, eds.), pp. 60–79, Springer, 2007.

[8] M. Kramer, D. Backer, J. Cordes, T. Lazio, B. W. Stappers, and S. Johnston, “Strong-field tests of gravity using pulsars and black holes,” New Astronomy Reviews, vol. 48, pp. 993–1002, October 2004.

[9] C. Carilli and S. Rawlings, “Motivation, key science projects, standards and assumptions,” in Science with the Square Kilometre Array (C. Carilli and S. Rawlings, eds.), pp. 1363–1375, New Astronomy Reviews, December 2004.

[10] D. R. Lorimer and M. Kramer, Handbook of pulsar astronomy. Cambridge Univ Press, 2005.

[11] A. J. Faulkner, I. H. Stairs, M. Kramer, A. G. Lyne, G. Hobbs, A. Possenti, D. R. Lorimer, R. N. Manchester, M. A. McLaughlin, N. D’Amico, F. Camilo, and M. Burgay, “The Parkes Multibeam Pulsar Survey: V. Finding binary and millisecond pulsars,” Monthly Notices of the Royal Astronomical Society, vol. 355, pp. 147–158, August 2004.

[12] R. P. Eatough, N. Molkenthin, M. Kramer, A. Noutsos, M. J. Keith, B. W. Stappers, and A. G. Lyne, “Selection of radio pulsar candidates using artificial neural networks,” Monthly Notices of the Royal Astronomical Society, vol. 407, pp. 2443–2450, May 2010.