About


Hi, I’m Rob. I’m a Senior Lecturer of AI & Robotics at Edge Hill University. My research interests include real-time machine learning, data stream processing, and imbalanced machine learning problems. I’m the principal investigator for an STFC-funded project that aims to use ML methods to improve radiotherapy treatment. I also undertake pure ML research for the various international science collaborations that I’m part of.

Since 2015 I’ve been helping design the world’s largest radio telescope, the Square Kilometre Array (SKA). I was part of both the Central Signal Processor (CSP) and Science Data Processor (SDP) design consortia, and I’m still involved with the machine learning aspects of these projects.

Curriculum Vitae

A general version of my CV can be viewed here.

My Background

I have a B.Sc. in Software Engineering (First-class honours), and an M.Sc. in Advanced Computer Science (Distinction), both obtained at the University of Liverpool. I also have a Ph.D. in Machine Learning, obtained at the University of Manchester. In the past I’ve worked as a performance and scalability software engineer, and I’ve also volunteered as a STEM science ambassador.

I spent four years as a post-doctoral researcher at the University of Manchester. During this time I worked on creating intelligent algorithms capable of helping astronomers make interesting and important new discoveries. This work was highly interdisciplinary, combining software engineering, data science, machine learning, and radio astronomy. So far I’ve helped colleagues discover more than 20 new pulsars (a 1% increase in the known pulsar population). This figure is expected to rise in the coming years.

I’m a proud Liverpudlian (from a town called Kirkby), a total science nerd, and a big sports fan.

Research

I’m interested in solving big data challenges. Specifically, I like to tackle problems requiring computationally efficient machine learning solutions, or data imbalances that make automated learning difficult.

In recent years I’ve developed machine learning algorithms and software tools capable of processing the vast quantities of data produced by instruments such as the Square Kilometre Array (SKA, more info here). This is a new radio telescope under development by an international team of scientists and engineers. Once built, the SKA will be the world’s largest radio telescope, and the most sophisticated scientific instrument ever constructed.

Square Kilometre Array
Artist’s impression of the Square Kilometre Array. Credit: SKA Project Development Office and Swinburne Astronomy Productions – Swinburne Astronomy Productions for SKA Project Development Office, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=11315190

Publications

M. Aldraimli, S. Osman, D. Grishchuck, S. Ingram, R. J. Lyon, et al., “Development and Optimization of a Machine-Learning Prediction Model for Acute Desquamation After Breast Radiation Therapy in the Multicenter REQUITE Cohort”, Advances in Radiation Oncology, Volume 7, Issue 3, 2022. DOI: 10.1016/j.adro.2021.100890.

M. Aldraimli, D. Soria, D. Grishchuck, S. Ingram, R. J. Lyon, et al., “A Data Science Approach for Predicting Patient’s Susceptibility to Acute Side Effects in Breast Cancer Radiation Therapy”, Computers in Biology and Medicine, Volume 135, 2021. DOI: 10.1016/j.compbiomed.2021.104624.

Z. Hosenie, S. Bloemen, P. J. Groot, R. J. Lyon, et al., “MeerCRAB – MeerLICHT Classification of Real and Bogus transients using deep learning”, Experimental Astronomy, Volume 1, 319-344, 2021. DOI: 10.1007/s10686-021-09757-1.

Z. Hosenie, R. J. Lyon, P. J. Groot, B. W. Stappers, “Classification of Optical Transients at the MeerLICHT Telescope using Deep Learning”, Third Workshop on Machine Learning and the Physical Sciences, NeurIPS 2020. Link.

Z. Hosenie, R. J. Lyon, B. W. Stappers, A. Mootoovaloo, “Comparing Multi-class, Binary and Hierarchical Machine Learning Classification schemes for variable stars”, Third Workshop on Machine Learning and the Physical Sciences, NeurIPS 2020. Link.

Z. Hosenie, R. J. Lyon, B. W. Stappers, A. Mootoovaloo, “Imbalance Learning for Variable Star Classification”, Monthly Notices of the Royal Astronomical Society (MNRAS), Volume 493 (4): 6050–6059, 2020. DOI: 10.1093/mnras/staa642.

Z. Hosenie, R. J. Lyon, B. W. Stappers, A. Mootoovaloo, “Comparing Multiclass, Binary, and Hierarchical Machine Learning Classification schemes for variable stars”, Monthly Notices of the Royal Astronomical Society (MNRAS), Volume 488 (4): 4858–4872, 2019. DOI: 10.1093/mnras/stz1999.

R. J. Lyon, B. W. Stappers, L. Levin, M. B. Mickaliger, A. Scaife, “A Big Data Pipeline for High Volume Scientific Data Streams”, Astronomy & Computing, Volume 28, 2019. DOI: 10.1016/j.ascom.2019.100291.

R. J. Lyon, “Imbalanced Learning In Astronomy”, European Week of Astronomy and Space Science (EWASS), April 4-6, 2018.

D. Michilli, J. W. T. Hessels, R. J. Lyon, C. M. Tan, C. Bassa, S. Cooper, V. I. Kondratiev, S. Sanidas, B. W. Stappers, J. van Leeuwen, “Single-pulse classifier for the LOFAR Tied-Array All-sky Survey”, Monthly Notices of the Royal Astronomical Society (MNRAS), Volume 480 (3): 3457-3467, 2018. DOI: 10.1093/mnras/sty2072.

C. M. Tan, R. J. Lyon, B. W. Stappers, S. Cooper, J. W. T. Hessels, V. I. Kondratiev, D. Michilli, S. Sanidas, “Ensemble candidate classification for the LOTAAS pulsar survey”, Monthly Notices of the Royal Astronomical Society (MNRAS), Volume 474 (4): 4571–4583, 2017. DOI: 10.1093/mnras/stx3047.

L. Levin, W. Armour, C. Baffa, E. Barr, S. Cooper, R. Eatough, A. Ensor, E. Giani, A. Karastergiou, R. Karuppusamy, M. Keith, M. Kramer, R. Lyon, M. Mackintosh, M. Mickaliger, R. van Nieuwpoort, M. Pearson, T. Prabu, J. Roy, O. Sinnen, L. Spitler, H. Spreeuw, B. W. Stappers, W. van Straten, C. Williams, H. Wang, K. Wiesner, “Pulsar Searches with the SKA”, International Astronomical Union Symposium (IAU) 337, Manchester, 4-8th September, 2017. Arxiv Pre-print.

R. J. Lyon, “50 Years of Candidate Pulsar Selection – What next?”, International Astronomical Union Symposium (IAU) 337, Manchester, 4-8th September, 2017, Arxiv Pre-print. Note that the supporting material can be found here, and the talk slides are here. The supporting material also has a unique identifier, see DOI: 10.5281/zenodo.883844.

R. J. Lyon, B. W. Stappers, S. Cooper, J. M. Brooke, J. D. Knowles, “Fifty Years of Pulsar Candidate Selection: From simple filters to a new principled real-time classification approach”, Monthly Notices of the Royal Astronomical Society (MNRAS), 459 (1): 1104-1123, 2016. Arxiv Pre-print, MNRAS, DOI: 10.1093/mnras/stw656.

R. J. Lyon, “Why Are Pulsars Hard To Find?”, PhD Thesis, School Of Computer Science, University of Manchester, 2016. Download.

R. J. Lyon, J. M. Brooke, J. D. Knowles, B. W. Stappers, “Hellinger Distance Trees for Imbalanced Streams”, in 22nd International Conference on Pattern Recognition, pp.1969-1974, 2014. Arxiv Pre-print, DOI: 10.1109/ICPR.2014.344.

R. J. Lyon, J. M. Brooke, J. D. Knowles, B. W. Stappers, “A Study on Classification in Imbalanced and Partially-Labelled Data Streams”, in Simple and Effective Machine Learning for Big Data, Special Session, IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2013. Arxiv Pre-print, DOI: 10.1109/SMC.2013.260.

Academic Responsibilities
Funding
  • PI for an STFC-funded project that aims to leverage astronomy knowledge and machine learning expertise to solve radiotherapy challenges in the medical domain. See http://radiotherapymlnetwork.co.uk/
  • SKAO-AWS AstroCompute grant programme.

Teaching Resources
Supporting Material: Fifty Years of Candidate Pulsar Selection – What next?

A Jupyter notebook exploring the issues that reduce the accuracy of Machine Learning classifiers. It was written to support a talk delivered at IAU Symposium No. 337, Pulsar Astrophysics: The Next Fifty Years (2017).

Supporting Material: Imbalanced Learning in Astronomy

A Jupyter notebook which explores the “Imbalanced” learning problem. It was written to support a talk delivered at EWASS 2018.

OAD Data Science Toolkit

The International Astronomical Union (IAU) Office of Astronomy for Development (OAD) Data Science Toolkit aims to provide a “common language” between the data science and astronomy communities. I’ve contributed four tutorials to this effort. These can be found either via my “fork” of the toolkit’s GitHub repository, or via the project’s GitHub repository, which can be found here.

Datasets and Databases
Pulsar Survey Database

You may find the Pulsar Survey database useful for your research. It lists every major pulsar survey conducted during the past fifty years, along with their respective technical specifications. If you use this resource, please cite via the DOI: 10.6084/m9.figshare.3114130.v1.

Pulsar Data (HTRU2)

HTRU2 is a data set which describes a sample of pulsar candidates collected during the High Time Resolution Universe Survey (South). The data set contains 16,259 spurious examples caused by RFI/noise, and 1,639 real pulsar examples. These examples have all been checked by human annotators. Each candidate is described by 8 continuous variables. The first four are simple statistics obtained from the integrated pulse profile (folded profile). The profile is an array of continuous values describing a longitude-resolved version of the signal that has been averaged in both time and frequency (see [3] for more details). The remaining four variables are similarly obtained from the DM-SNR curve (again see [3] for more details). If you use this resource, please cite via the DOI: 10.6084/m9.figshare.3080389.v1.
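To make the description above concrete, here is a minimal sketch of the class balance implied by the quoted counts and of how an 8-value feature vector could be built from two curves. The text does not name the four statistics, so the choice of mean, standard deviation, excess kurtosis and skewness is an assumption, as are all function names; this is illustrative only and is not the code used to build HTRU2.

```python
import math

# Class counts quoted in the HTRU2 description.
N_NEGATIVE = 16259  # spurious candidates (RFI/noise)
N_POSITIVE = 1639   # real pulsar candidates


def class_balance(n_pos, n_neg):
    """Return (total, positive fraction, negatives per positive)."""
    total = n_pos + n_neg
    return total, n_pos / total, n_neg / n_pos


def summary_statistics(values):
    """Mean, standard deviation, excess kurtosis and skewness of a sequence.

    Assumption: these four moments stand in for the 'simple statistics'
    mentioned in the text. Population (biased) estimators are used.
    """
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0.0:
        # Flat curve: higher moments are undefined, so report zeros.
        return mean, 0.0, 0.0, 0.0
    skew = sum(((v - mean) / std) ** 3 for v in values) / n
    kurt = sum(((v - mean) / std) ** 4 for v in values) / n - 3.0
    return mean, std, kurt, skew


def candidate_features(profile, dm_snr_curve):
    """Hypothetical helper: four statistics per curve, eight values in total."""
    return list(summary_statistics(profile)) + list(summary_statistics(dm_snr_curve))
```

With the counts above, `class_balance(N_POSITIVE, N_NEGATIVE)` gives 17,898 examples in total, of which roughly 9% are real pulsars.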

Classification results for: A Study on Classification in Imbalanced and Partially-Labelled Data Streams

Data sets supporting the results reported in the paper:

R. J. Lyon, J. M. Brooke, J. D. Knowles, B. W. Stappers, “A Study on Classification in Imbalanced and Partially-Labelled Data Streams”, in Simple and Effective Machine Learning for Big Data, Special Session, IEEE International Conference on Systems, Man, and Cybernetics, SMC 2013. Arxiv Pre-print, DOI: 10.1109/SMC.2013.260.

Classification results for: Hellinger Distance Trees for Imbalanced Streams

Data sets supporting the results reported in the paper:

R. J. Lyon, J. M. Brooke, J. D. Knowles, B. W. Stappers, “Hellinger Distance Trees for Imbalanced Streams”, in 22nd International Conference on Pattern Recognition, pp.1969-1974, 2014. Arxiv Pre-print, DOI: 10.1109/ICPR.2014.344.

Contained in this distribution are the results of stream classifier performance on four different data sets. Also included are the test results from our attempt at reproducing the outcome of the paper:

Learning Decision Trees for Un-balanced Data, D. A. Cieslak and N. V. Chawla, in Machine Learning and Knowledge Discovery in Databases (W. Daelemans, B. Goethals, and K. Morik, eds.), vol. 5211 of LNCS, pp. 241-256, 2008.

Code

Much of my code is online and released under an open source license.

SKA Data Models

This is a Jupyter notebook that models SKA Science Data Processor (SDP) data rates & volumes. It includes diagrams that define the conceptual and logical structure of the Non-Imaging Processing (NIP) data models, along with activity diagrams for all NIP pipelines. Finally, formulas are presented that provide accurate estimates of NIP pipeline data rates.

SKA-Test Vector Generation Pipeline

This is a software pipeline used to generate SKA-like pulsar observations, aka ‘test vectors’. These are used to test SKA algorithms and data processing pipelines. The software pipeline is packaged within a Docker container. I’ve also created a web interface which displays the outputs of the test vector generation pipeline. You can find the interface code here.

Update: During the Astron Hackathon, upgrades and changes were made to the pipeline. Big thanks to Yan Grange, Sophie Ashcroft, Liam Conner, Wietze Albers, Anne Archibald, and Amruta Jaodand for contributing! For those who are interested, my project pitch slides can be found here.

Docker Images

A collection of Docker images useful for pulsar search and data science analysis.

Pulsar Feature Lab

The pulsar feature lab application is a collection of Python scripts useful for extracting machine learning features (otherwise known as scores or variables) from pulsar candidate files. The code was written to provide a tool-kit for designing and extracting new candidate features, whilst retaining the ability to extract existing features developed by the community for comparison. If you use this resource, please cite via the DOI: 10.6084/m9.figshare.1536472.v1.

Stuffed

Stuffed is a wrapper for WEKA and MOA classification algorithms. It enables classifier testing and evaluation on unlabelled data streams. This is (or was last I checked) hard to achieve with MOA. Stuffed makes this possible by using custom sampling methods to sample large data sets so that they can contain:

– Varied levels of class balance in both test and training sets.
– Varied levels of labelling in the test data streams.

Stuffed is only designed to work on binary classification problems. It can be used to gather statistics on classifier performance, is easily extensible, and can be used with other tools such as MATLAB. If you use this resource, please cite via the DOI: 10.6084/m9.figshare.1536471.v1.
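The kind of sampling described above can be sketched in a few lines. This is not Stuffed’s API or implementation; the function name, parameters and return format are all hypothetical, and the sketch simply assumes a binary data set held in memory. It draws a stream with a requested class balance and then hides a share of the labels from the learner, keeping the true label alongside for scoring.

```python
import random


def sample_stream(examples, pos_fraction, labelled_fraction, size, seed=0):
    """Draw a stream with a chosen class balance and level of labelling.

    examples: list of (features, label) pairs with label in {0, 1}.
    Returns a shuffled list of (features, true_label, observed_label), where
    observed_label is None for examples presented to the learner unlabelled.
    """
    rng = random.Random(seed)
    positives = [e for e in examples if e[1] == 1]
    negatives = [e for e in examples if e[1] == 0]

    n_pos = round(size * pos_fraction)
    n_neg = size - n_pos
    if n_pos > len(positives) or n_neg > len(negatives):
        raise ValueError("not enough examples for the requested balance")

    # Sample each class without replacement, then mix the two classes.
    stream = rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)
    rng.shuffle(stream)

    # Choose which positions keep their label; the rest are hidden.
    n_labelled = round(size * labelled_fraction)
    labelled_idx = set(rng.sample(range(size), n_labelled))
    return [
        (x, y, y if i in labelled_idx else None)
        for i, (x, y) in enumerate(stream)
    ]
```

For example, `sample_stream(data, 0.01, 0.5, 1000)` would request a 1000-example stream in which 1% of examples are positive and half of the labels are visible.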

GH-VFDT

An imbalanced data stream classifier that uses the Hoeffding bound and Hellinger distances to improve minority class recall. The GH-VFDT utilises a decision tree split criterion designed to improve minority class recall rates on imbalanced data streams, i.e. those streams where the class distribution is more skewed than 1:100. This implementation is built upon the Hoeffding Tree provided in MOA, so a great deal of credit goes to the MOA team for their initial implementation and library. We gratefully acknowledge their efforts. If you use this resource, please cite via the DOI: 10.6084/m9.figshare.1536470.v1.

For more details of the algorithm see:

R. J. Lyon, J. M. Brooke, J. D. Knowles, B. W. Stappers, “Hellinger Distance Trees for Imbalanced Streams”, in 22nd International Conference on Pattern Recognition, pp.1969-1974, 2014. Arxiv Pre-print, DOI: 10.1109/ICPR.2014.344.
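The two ingredients named in the GH-VFDT description can be illustrated with a short sketch. It is not the MOA-based implementation: it assumes each class’s feature values are modelled as a univariate Gaussian, uses the Hellinger distance normalised to the range [0, 1], and all function names and default parameters are hypothetical. See the paper above for the actual algorithm.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound epsilon for n observations of a bounded variable.

    With probability 1 - delta, the true mean lies within epsilon of the
    observed mean.
    """
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))


def gaussian_hellinger(mu1, sigma1, mu2, sigma2):
    """Hellinger distance between two univariate normal distributions."""
    var_sum = sigma1 ** 2 + sigma2 ** 2
    # Bhattacharyya coefficient: 1 for identical distributions, near 0 when
    # the two distributions barely overlap.
    bc = math.sqrt(2.0 * sigma1 * sigma2 / var_sum) * math.exp(
        -((mu1 - mu2) ** 2) / (4.0 * var_sum)
    )
    return math.sqrt(max(0.0, 1.0 - bc))


def should_split(best, second_best, n, delta=1e-7, value_range=1.0):
    """Split once the best feature beats the runner-up by more than epsilon.

    best and second_best are the class-separation scores (here, Hellinger
    distances) of the two top-ranked features after n examples.
    """
    return (best - second_best) > hoeffding_bound(value_range, delta, n)
```

A feature whose positive and negative distributions barely overlap scores close to 1, and the Hoeffding bound shrinks as more examples arrive, so a small but consistent advantage eventually triggers a split.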

Talks & Public Engagement
Posters

Industry Engagement Day Poster, Thursday 10 July 2014.

Ph.D. Work

During my Ph.D. I was under the supervision of Dr. John Brooke (Computer Science), Dr. Josh Knowles (Computer Science) and Prof. Ben Stappers (Jodrell Bank Centre for Astrophysics). The algorithms I developed during my Ph.D. have so far helped to find 20 new pulsars in data obtained during the LOFAR Tied-Array All-Sky Survey, aka the LOTAAS survey (see here for more details). The techniques I developed are also being applied to data collected during the GMRT High Resolution Southern Sky survey (GHRSS), and to High Time Resolution Universe Survey (HTRU) data.

During my time as a Ph.D. student I was part of the SUPERB project (SUrvey for Pulsars and Extragalactic Radio Bursts), which searched for new pulsars and the more mysterious fast radio bursts (FRBs). See the SUPERB website for more details.

Expertise

I have expertise in machine learning classification (neural networks, decision trees, etc.), algorithm design, computational optimisation, signal detection, feature extraction & design, big data frameworks (e.g. Apache Storm, Spark), GPU programming, software development (C#, C++, C, Java, CUDA, Python), Docker, automated testing, VDI, and performance optimisation.