About


Hi, I’m Rob. I’m a post-doctoral researcher at the University of Manchester. I devise intelligent algorithms that help astronomers make interesting and important new discoveries. My work is highly interdisciplinary, combining data science, machine learning, and radio astronomy. So far I’ve helped colleagues discover more than 20 new pulsars (a 1% increase in the known pulsar population), and this figure is expected to rise in the coming years. Recently I’ve been helping to design the world’s largest radio telescope, the Square Kilometre Array (SKA). I’m part of both the Central Signal Processor (CSP) and Science Data Processor (SDP) design consortia.

I have a B.Sc. in Software Engineering (First-class honours), and an M.Sc. in Advanced Computer Science (Distinction), both obtained at the University of Liverpool. I also have a Ph.D. in Machine Learning, obtained at the University of Manchester. In the past I’ve worked as a performance and scalability software engineer, and I’ve also volunteered as a STEM science ambassador. A general version of my CV can be viewed here.

I’m a proud Liverpudlian, a total science nerd, and a big sports fan.

Research

I currently work as a research associate at the Jodrell Bank Centre for Astrophysics (SKA group), which is part of the University of Manchester. I develop machine learning algorithms and software tools capable of processing the vast quantities of data produced by the Square Kilometre Array (SKA, more info here), a new radio telescope under development by an international team of scientists and engineers. When constructed, the SKA will be the world’s largest radio telescope, and the most sophisticated scientific instrument ever built.

Square Kilometre Array
Artist’s impression of the Square Kilometre Array. Credit: SKA Project Development Office and Swinburne Astronomy Productions – Swinburne Astronomy Productions for SKA Project Development Office, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=11315190

Publications

R. J. Lyon, J. M. Brooke, J. D. Knowles, B. W. Stappers, “A Study on Classification in Imbalanced and Partially-Labelled Data Streams”, in Simple and Effective Machine Learning for Big Data, Special Session, IEEE International Conference on Systems, Man, and Cybernetics, SMC 2013. Arxiv Pre-print, DOI: 10.1109/SMC.2013.260.

R. J. Lyon, J. M. Brooke, J. D. Knowles, B. W. Stappers, “Hellinger Distance Trees for Imbalanced Streams”, in 22nd International Conference on Pattern Recognition, pp.1969-1974, 2014. Arxiv Pre-print, DOI: 10.1109/ICPR.2014.344.

R. J. Lyon, B. W. Stappers, S. Cooper, J. M. Brooke, J. D. Knowles, “Fifty Years of Pulsar Candidate Selection: From simple filters to a new principled real-time classification approach”, Monthly Notices of the Royal Astronomical Society (MNRAS), 459 (1): 1104-1123, Arxiv Pre-print, MNRAS, DOI:10.1093/mnras/stw656.

R. J. Lyon, “Why Are Pulsars Hard To Find?”, PhD Thesis, School Of Computer Science, University of Manchester, 2016. Download.

R. J. Lyon, “50 Years of Candidate Pulsar Selection – What next?”, International Astronomical Union Symposium (IAU) 337, Manchester, 4-8th September, 2017, Arxiv Pre-print. Note that the supporting material can be found here, and the talk slides are here. The supporting material also has a unique identifier, see DOI: 10.5281/zenodo.883844.

Responsibilities

Member of the Program Committee of the following International Conferences:

Journals I have reviewed for:

Datasets and Databases
Pulsar Survey Database

You may find the Pulsar Survey Database useful for your research. It lists every major pulsar survey conducted during the past fifty years, along with their respective technical specifications. If you use this resource, please cite it via the DOI: 10.6084/m9.figshare.3114130.v1.

Pulsar Data (HTRU2)

HTRU2 is a data set describing a sample of pulsar candidates collected during the High Time Resolution Universe Survey (South). It contains 16,259 spurious examples caused by RFI/noise, and 1,639 real pulsar examples, all checked by human annotators. Each candidate is described by 8 continuous variables. The first four are simple statistics obtained from the integrated pulse profile (folded profile): an array of continuous variables describing a longitude-resolved version of the signal, averaged in both time and frequency (see [3] for more details). The remaining four variables are similarly obtained from the DM-SNR curve (again, see [3]). If you use this resource, please cite it via the DOI: 10.6084/m9.figshare.3080389.v1.
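As a rough illustration of how the 8-feature vector is assembled, the sketch below computes four summary statistics for each array, assuming (per [3]) these are the mean, standard deviation, excess kurtosis, and skewness. The function names here are hypothetical, not part of any released tool:

```python
import numpy as np
from scipy.stats import kurtosis, skew

def profile_statistics(values):
    """Return (mean, standard deviation, excess kurtosis, skewness)
    for a 1-D array such as an integrated pulse profile or DM-SNR curve."""
    values = np.asarray(values, dtype=float)
    return (values.mean(),
            values.std(),
            kurtosis(values),  # excess kurtosis (Fisher definition)
            skew(values))

def candidate_features(profile, dm_snr_curve):
    """Concatenate the statistics of both arrays into an 8-element feature vector."""
    return list(profile_statistics(profile)) + list(profile_statistics(dm_snr_curve))
```

The same four statistics applied to two different arrays give the eight columns found in the HTRU2 CSV, one row per candidate.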

Classification results for: A Study on Classification in Imbalanced and Partially-Labelled Data Streams

Data sets supporting the results reported in the paper:

R. J. Lyon, J. M. Brooke, J. D. Knowles, B. W. Stappers, “A Study on Classification in Imbalanced and Partially-Labelled Data Streams”, in Simple and Effective Machine Learning for Big Data, Special Session, IEEE International Conference on Systems, Man, and Cybernetics, SMC 2013. Arxiv Pre-print, DOI: 10.1109/SMC.2013.260.

Classification results for: Hellinger Distance Trees for Imbalanced Streams

Data sets supporting the results reported in the paper:

R. J. Lyon, J. M. Brooke, J. D. Knowles, B. W. Stappers, “Hellinger Distance Trees for Imbalanced Streams”, in 22nd International Conference on Pattern Recognition, pp.1969-1974, 2014. Arxiv Pre-print, DOI: 10.1109/ICPR.2014.344.

Contained in this distribution are results of stream classifier performance on four different data sets. Also included are the test results from our attempt at reproducing the outcome of the paper:

Learning Decision Trees for Un-balanced Data, D. A. Cieslak and N. V. Chawla, in Machine Learning and Knowledge Discovery in Databases (W. Daelemans, B. Goethals, and K. Morik, eds.), vol. 5211 of LNCS, pp. 241-256, 2008.

Code

Much of my code is available online, released under an open source license.

SKA Data Models

This is a Jupyter notebook that models SKA Science Data Processor (SDP) data rates and volumes. It includes diagrams defining the conceptual and logical structure of the Non-Imaging Processing (NIP) data models, activity diagrams for all NIP pipelines, and formulas providing accurate estimates of NIP pipeline data rates.

DOI: 10.5281/zenodo.836715

SKA-Test Vector Generation Pipeline

This is a software pipeline used to generate SKA-like pulsar observations, known as ‘test vectors’, which are used to test SKA models. The pipeline is packaged within a Docker container. I’ve also created a web interface which displays the outputs of the test vector generation pipeline. You can find the interface code here.

Docker Images

A collection of Docker images useful for pulsar search and data science analysis.

Pulsar Feature Lab

The pulsar feature lab application is a collection of Python scripts useful for extracting machine learning features (otherwise known as scores or variables) from pulsar candidate files. The code provides a toolkit for designing and extracting new candidate features, whilst retaining the ability to extract existing features developed by the community for comparison. If you use this resource, please cite it via the DOI: 10.6084/m9.figshare.1536472.v1.

Stuffed

Stuffed is a wrapper for WEKA and MOA classification algorithms. It enables classifier testing and evaluation on unlabelled data streams, which is (or was, last I checked) hard to achieve with MOA. Stuffed makes this possible by using custom sampling methods to sample large data sets so that they contain:

– Varied levels of class balance in both test and training sets.
– Varied levels of labelling in the test data streams.

Stuffed is only designed to work on binary classification problems. It can be used to gather statistics on classifier performance, is easily extensible, and can be used with other tools such as MATLAB. If you use this resource, please cite it via the DOI: 10.6084/m9.figshare.1536471.v1.
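To make the sampling idea concrete, here is a minimal sketch of drawing a binary data stream with a chosen class balance and labelling level. This is not Stuffed’s actual API; all names and parameters are hypothetical:

```python
import random

def sample_stream(examples, labels, positive_fraction, labelled_fraction, size, seed=0):
    """Build a binary data stream with a chosen class balance and labelling level.

    examples/labels: parallel lists; labels are 0 (majority) or 1 (minority).
    positive_fraction: desired share of minority-class examples in the output.
    labelled_fraction: share of examples whose label is kept; the rest become None.
    """
    rng = random.Random(seed)
    pos = [e for e, lab in zip(examples, labels) if lab == 1]
    neg = [e for e, lab in zip(examples, labels) if lab == 0]
    n_pos = round(size * positive_fraction)
    # Sample with replacement so any balance can be reached regardless of input skew.
    stream = ([(e, 1) for e in rng.choices(pos, k=n_pos)] +
              [(e, 0) for e in rng.choices(neg, k=size - n_pos)])
    rng.shuffle(stream)
    # Hide labels for a fraction of the stream to mimic partial labelling.
    return [(e, lab if rng.random() < labelled_fraction else None) for e, lab in stream]
```

Varying `positive_fraction` and `labelled_fraction` across runs yields the test/training set combinations described above.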

GH-VFDT

An imbalanced data stream classifier which uses the Hoeffding bound and the Hellinger distance to improve minority class recall. The GH-VFDT utilises a decision tree split criterion designed to improve minority class recall rates on imbalanced data streams, i.e. those streams where the class distribution is worse than 1:100. This implementation is built upon the Hoeffding Tree provided in MOA, thus a great deal of credit goes to the MOA team for their initial implementation and library. We gratefully acknowledge their efforts. If you use this resource, please cite it via the DOI: 10.6084/m9.figshare.1536470.v1.
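For intuition, the quantity at the heart of such a split criterion is the Hellinger distance between two discrete distributions, e.g. the per-branch frequencies of the minority and majority classes at a candidate split. This is a minimal sketch of the distance itself, not the MOA-based implementation:

```python
from math import sqrt

def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions p and q.
    0 means identical distributions; 1 means disjoint support. Because it
    compares normalised distributions, it is insensitive to how skewed the
    overall class balance is, which is why it suits imbalanced streams."""
    return sqrt(sum((sqrt(pi) - sqrt(qi)) ** 2 for pi, qi in zip(p, q)) / 2)
```

A split that sends the two classes down different branches maximises this distance, even when one class is vastly outnumbered.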

For more details of the algorithm see:

R. J. Lyon, J. M. Brooke, J. D. Knowles, B. W. Stappers, “Hellinger Distance Trees for Imbalanced Streams”, in 22nd International Conference on Pattern Recognition, pp.1969-1974, 2014. Arxiv Pre-print, DOI: 10.1109/ICPR.2014.344.

Talks
Posters

Industry Engagement Day Poster, Thursday 10 July 2014.

Ph.D Work

During my Ph.D. I was under the supervision of Dr. John Brooke (Computer Science), Dr. Josh Knowles (Computer Science), and Prof. Ben Stappers (Jodrell Bank Centre for Astrophysics). The algorithms I developed during my Ph.D. have so far helped to find 20 new pulsars in data obtained during the LOFAR Tied-Array All-Sky Survey, aka the LOTAAS survey (see here for more details). The techniques I developed are also being applied to data collected during the GMRT High Resolution Southern Sky Survey (GHRSS), and to High Time Resolution Universe Survey (HTRU) data.

During my time as a Ph.D. student I was part of the SUPERB project (SUrvey for Pulsars and Extragalactic Radio Bursts), which searched for new pulsars and the more mysterious fast radio bursts (FRBs). See the SUPERB website for more details.

Expertise

I have expertise in machine learning classification (neural networks, decision trees, etc.), algorithm design, computational optimisation, signal detection, feature extraction and design, big data frameworks (e.g. Apache Storm, Spark), GPU programming, software development (C#, C++, C, Java, CUDA, Python), Docker, automated testing, VDI, and performance optimisation.