Hi, I’m Rob. I’m a Senior Lecturer of AI & Robotics at Edge Hill University. My research interests include real-time machine learning, data stream processing, and imbalanced machine learning problems. I’m the principal investigator for an STFC-funded project that aims to use ML methods to improve radiotherapy treatment. I also undertake pure ML research for various international science collaborations that I’m part of.
Since 2015 I’ve been helping design the world’s largest radio telescope, the Square Kilometre Array (SKA). I was part of both the Central Signal Processor (CSP) and Science Data Processor (SDP) design consortia, and I’m still involved with the machine learning aspects of these projects.
I have a B.Sc. in Software Engineering (First-class honours), and an M.Sc. in Advanced Computer Science (Distinction), both obtained at the University of Liverpool. I also have a Ph.D. in Machine Learning, obtained at the University of Manchester. In the past I’ve worked as a performance and scalability software engineer, and I’ve also volunteered as a STEM science ambassador.
I spent four years as a post-doctoral researcher at the University of Manchester. During this time I worked on creating intelligent algorithms capable of helping astronomers make interesting and important new discoveries. This work was highly interdisciplinary, combining software engineering, data science, machine learning, and radio astronomy. So far I’ve helped colleagues discover more than 20 new pulsars (a 1% increase in the known pulsar population). This figure is expected to rise in the coming years.
I’m a proud Liverpudlian (from a town called Kirkby), a total science nerd, and a big sports fan.
Research
I’m interested in solving big data challenges. Specifically, I like to tackle problems requiring computationally efficient machine learning solutions, or data imbalances that make automated learning difficult.
In recent years I’ve developed machine learning algorithms and software tools capable of processing the vast quantities of data produced by instruments such as the Square Kilometre Array (SKA, more info here). This is a new radio telescope under development by an international team of scientists and engineers. When constructed, the SKA will be the world’s largest radio telescope, and the most sophisticated scientific instrument ever built.
Artist’s impression of the Square Kilometre Array. Credit: SKA Project Development Office and Swinburne Astronomy Productions – Swinburne Astronomy Productions for SKA Project Development Office, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=11315190
Publications
M. Aldraimli, S. Osman, D. Grishchuck, S. Ingram, R. J. Lyon, et al., “Development and Optimization of a Machine-Learning Prediction Model for Acute Desquamation After Breast Radiation Therapy in the Multicenter REQUITE Cohort”, Advances in Radiation Oncology, Volume 7, Issue 3, 2022. DOI: 10.1016/j.adro.2021.100890.
M. Aldraimli, D. Soria, D. Grishchuck, S. Ingram, R. J. Lyon, et al., “A Data Science Approach for Predicting Patient’s Susceptibility to Acute Side Effects in Breast Cancer Radiation Therapy”, Computers in Biology and Medicine, Volume 135, 2021. DOI: 10.1016/j.compbiomed.2021.104624.
Z. Hosenie, S. Bloemen, P. J. Groot, R. J. Lyon, et al., “MeerCRAB – MeerLICHT Classification of Real and Bogus transients using deep learning”, Experimental Astronomy, Volume 1, 319-344, 2021. DOI: 10.1007/s10686-021-09757-1.
Z. Hosenie, R. J. Lyon, P. J. Groot, B. W. Stappers, “Classification of Optical Transients at the MeerLICHT Telescope using Deep Learning”, Third Workshop on Machine Learning and the Physical Sciences, NeurIPS 2020. Link.
Z. Hosenie, R. J. Lyon, B. W. Stappers, A. Mootoovaloo, “Comparing Multi-class, Binary and Hierarchical Machine Learning Classification schemes for variable stars”, Third Workshop on Machine Learning and the Physical Sciences, NeurIPS 2020. Link.
Z. Hosenie, R. J. Lyon, B. W. Stappers, A. Mootoovaloo, “Imbalance Learning for Variable Star Classification”, Monthly Notices of the Royal Astronomical Society (MNRAS), Volume 493 (4):6050–6059, 2020. DOI: 10.1093/mnras/staa642.
Z. Hosenie, R. J. Lyon, B. W. Stappers, A. Mootoovaloo, “Comparing Multiclass, Binary, and Hierarchical Machine Learning Classification schemes for variable stars”, Monthly Notices of the Royal Astronomical Society (MNRAS), Volume 488 (4):4858–4872, 2019. DOI: 10.1093/mnras/stz1999.
R. J. Lyon, B. W. Stappers, L. Levin, M. B. Mickaliger, A. Scaife, “A Big Data Pipeline for High Volume Scientific Data Streams”, Astronomy & Computing, Volume 28, 2019. DOI: 10.1016/j.ascom.2019.100291.
R. J. Lyon, “Imbalanced Learning In Astronomy”, European Week of Astronomy and Space Science (EWASS), April 4-6, 2018.
D. Michilli, J. W. T. Hessels, R. J. Lyon, C. M. Tan, C. Bassa, S. Cooper, V. I. Kondratiev, S. Sanidas, B. W. Stappers, J. van Leeuwen, “Single-pulse classifier for the LOFAR Tied-Array All-sky Survey”, Monthly Notices of the Royal Astronomical Society (MNRAS), Volume 480 (3): 3457-3467, 2018. DOI: 10.1093/mnras/sty2072.
C. M. Tan, R. J. Lyon, B. W. Stappers, S. Cooper, J. W. T. Hessels, V. I. Kondratiev, D. Michilli, S. Sanidas, “Ensemble candidate classification for the LOTAAS pulsar survey”, Monthly Notices of the Royal Astronomical Society (MNRAS), Volume 474 (4): 4571–4583, 2017. DOI: 10.1093/mnras/stx3047.
L. Levin, W. Armour, C. Baffa, E. Barr, S. Cooper, R. Eatough, A. Ensor, E. Giani, A. Karastergiou, R. Karuppusamy, M. Keith, M. Kramer, R. Lyon, M. Mackintosh, M. Mickaliger, R. van Nieuwpoort, M. Pearson, T. Prabu, J. Roy, O. Sinnen, L. Spitler, H. Spreeuw, B. W. Stappers, W. van Straten, C. Williams, H. Wang, K. Wiesner, “Pulsar Searches with the SKA”, International Astronomical Union Symposium (IAU) 337, Manchester, 4-8th September, 2017, Arxiv Pre-print.
R. J. Lyon, “50 Years of Candidate Pulsar Selection – What next?”, International Astronomical Union Symposium (IAU) 337, Manchester, 4-8th September, 2017, Arxiv Pre-print. Note that the supporting material can be found here, and the talk slides are here. The supporting material also has a unique identifier, see DOI: 10.5281/zenodo.883844.
R. J. Lyon, B. W. Stappers, S. Cooper, J. M. Brooke, J. D. Knowles, “Fifty Years of Pulsar Candidate Selection: From simple filters to a new principled real-time classification approach”, Monthly Notices of the Royal Astronomical Society (MNRAS), 459 (1): 1104-1123, 2016. Arxiv Pre-print, MNRAS, DOI: 10.1093/mnras/stw656.
R. J. Lyon, “Why Are Pulsars Hard To Find?”, PhD Thesis, School Of Computer Science, University of Manchester, 2016. Download.
R. J. Lyon, J. M. Brooke, J. D. Knowles, B. W. Stappers, “Hellinger Distance Trees for Imbalanced Streams”, in 22nd International Conference on Pattern Recognition, pp.1969-1974, 2014. Arxiv Pre-print, DOI: 10.1109/ICPR.2014.344.
R. J. Lyon, J. M. Brooke, J. D. Knowles, B. W. Stappers, “A Study on Classification in Imbalanced and Partially-Labelled Data Streams”, in Simple and Effective Machine Learning for Big Data, Special Session, IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2013. Arxiv Pre-print, DOI: 10.1109/SMC.2013.260.
Academic Responsibilities
Co-supervisor of one PhD student working in machine learning and astronomy.
PI for an STFC-funded project that aims to leverage astronomy knowledge and machine learning expertise to solve radiotherapy challenges in the medical domain. See http://radiotherapymlnetwork.co.uk/
SKAO-AWS AstroCompute grant programme.
Teaching Resources
Supporting Material: Fifty Years of Candidate Pulsar Selection – What next?
A Jupyter notebook exploring the issues that reduce the accuracy of Machine Learning classifiers. It was written to support a talk delivered at IAU Symposium No. 337, Pulsar Astrophysics: The Next Fifty Years (2017).
You may find the Pulsar Survey database useful for your research. It lists every major pulsar survey conducted during the past fifty years, along with their respective technical specifications. If you use this resource, please cite via the DOI: 10.6084/m9.figshare.3114130.v1.
Pulsar Data (HTRU2)
HTRU2 is a data set which describes a sample of pulsar candidates collected during the High Time Resolution Universe Survey (South). The data set contains 16,259 spurious examples caused by RFI/noise, and 1,639 real pulsar examples. These examples have all been checked by human annotators. Each candidate is described by 8 continuous variables. The first four are simple statistics obtained from the integrated pulse profile (folded profile). This is an array of continuous variables that describe a longitude-resolved version of the signal that has been averaged in both time and frequency (see [3] for more details). The remaining four variables are similarly obtained from the DM-SNR curve (again see [3] for more details). If you use this resource, please cite via the DOI: 10.6084/m9.figshare.3080389.v1.
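To make the eight-variable description concrete, here is a minimal sketch of how such features could be computed. It assumes the four "simple statistics" are the mean, standard deviation, excess kurtosis, and skewness, which the text above does not spell out. The function names are hypothetical and this is not the code used to build HTRU2.

```python
import math

def profile_statistics(values):
    """Return (mean, standard deviation, excess kurtosis, skewness) of a 1-D array.

    Assumption: these are the four statistics meant by "simple statistics".
    Population (biased) moments are used throughout.
    """
    n = len(values)
    mean = sum(values) / n
    # Second, third, and fourth central moments.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        # A constant array has no meaningful shape statistics; report zeros.
        return mean, 0.0, 0.0, 0.0
    std = math.sqrt(m2)
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return mean, std, excess_kurtosis, skewness

def candidate_features(integrated_profile, dm_snr_curve):
    """Build an 8-tuple: four statistics per input array (hypothetical helper)."""
    return profile_statistics(integrated_profile) + profile_statistics(dm_snr_curve)

# Toy arrays standing in for a folded profile and a DM-SNR curve.
features = candidate_features([0.1, 0.2, 0.9, 0.3, 0.1], [5.0, 7.5, 12.0, 8.0, 5.5])
print(len(features))
```

Each candidate then becomes one row of eight numbers plus a class label, which is the shape most classifiers expect.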
Classification results for: A Study on Classification in Imbalanced and Partially-Labelled Data Streams
Data sets supporting the results reported in the paper:
R. J. Lyon, J. M. Brooke, J. D. Knowles, B. W. Stappers, “A Study on Classification in Imbalanced and Partially-Labelled Data Streams”, in Simple and Effective Machine Learning for Big Data, Special Session, IEEE International Conference on Systems, Man, and Cybernetics, SMC 2013. Arxiv Pre-print, DOI: 10.1109/SMC.2013.260.
Classification results for: Hellinger Distance Trees for Imbalanced Streams
Data sets supporting the results reported in the paper:
R. J. Lyon, J. M. Brooke, J. D. Knowles, B. W. Stappers, “Hellinger Distance Trees for Imbalanced Streams”, in 22nd International Conference on Pattern Recognition, pp.1969-1974, 2014. Arxiv Pre-print, DOI: 10.1109/ICPR.2014.344.
Contained in this distribution are results of stream classifier performance on four different data sets. Also included are the test results from our attempt at reproducing the outcome of the paper:
Learning Decision Trees for Un-balanced Data, D. A. Cieslak and N. V. Chawla, in Machine Learning and Knowledge Discovery in Databases (W. Daelemans, B. Goethals, and K. Morik, eds.), vol. 5211 of LNCS, pp. 241-256, 2008.
Code
Much of my code is online and released under an open source license.
This is a Jupyter notebook that models SKA Science Data Processor (SDP) data rates & volumes. Diagrams are included that define the conceptual and logical structure of Non-Imaging Processing (NIP) data models. Also, activity diagrams for all NIP pipelines are included. Finally, formulas are presented that provide accurate estimates of NIP pipeline data rates.
This is a software pipeline used to generate SKA-like pulsar observations, aka ‘test vectors’. These are used to test SKA algorithms and data processing pipelines. The software pipeline is packaged within a docker container. I’ve also created a web interface which displays the outputs of the test vector generation pipeline. You can find the interface code here.
Update: During the Astron Hackathon, upgrades and changes were made to the pipeline. Big thanks to Yan Grange, Sophie Ashcroft, Liam Conner, Wietze Albers, Anne Archibald, and Amruta Jaodand for contributing! For those who are interested, my project pitch slides can be found here.
The pulsar feature lab application is a collection of Python scripts useful for extracting machine learning features (otherwise known as scores or variables) from pulsar candidate files. The code was written in order to provide a tool-kit useful for designing and extracting new candidate features, whilst retaining the ability to extract existing features developed by the community for comparison. If you use this resource, please cite via the DOI: 10.6084/m9.figshare.1536472.v1.
Stuffed is a wrapper for WEKA and MOA classification algorithms. It enables classifier testing and evaluation on unlabelled data streams. This is (or was last I checked) hard to achieve with MOA. Stuffed makes this possible by using custom sampling methods to sample large data sets so that they can contain:
– Varied levels of class balance in both test and training sets.
– Varied levels of labelling in the test data streams.
Stuffed is only designed to work on binary classification problems. It can be used to gather statistics on classifier performance, is easily extensible, and can be used with other tools such as MATLAB. If you use this resource, please cite via the DOI: 10.6084/m9.figshare.1536471.v1.
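The sampling idea described above can be sketched in a few lines. This is an illustration only, not Stuffed's own code (which wraps WEKA and MOA). The function name, parameters, and the choice to hide labels by replacing them with None are all assumptions made for this example.

```python
import random

def sample_stream(examples, n, positive_fraction, labelled_fraction, seed=0):
    """Draw a binary-class sample with a chosen class balance and labelling level.

    examples: list of (features, label) pairs, with label 0 or 1.
    Returns a list of (features, true_label, observed_label) triples, where
    observed_label is None for examples presented to the learner as unlabelled.
    All names here are hypothetical.
    """
    rng = random.Random(seed)
    positives = [e for e in examples if e[1] == 1]
    negatives = [e for e in examples if e[1] == 0]

    # Work out how many examples of each class give the requested balance.
    n_pos = round(n * positive_fraction)
    n_neg = n - n_pos
    if n_pos > len(positives) or n_neg > len(negatives):
        raise ValueError("not enough examples to reach the requested balance")

    # Sample without replacement, then shuffle so class order carries no signal.
    sample = rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)
    rng.shuffle(sample)

    # Choose which positions keep their label; the rest are hidden.
    n_labelled = round(n * labelled_fraction)
    labelled = set(rng.sample(range(n), n_labelled))
    return [(x, y, y if i in labelled else None) for i, (x, y) in enumerate(sample)]

# Toy data: 1,000 negatives and 100 positives with a single feature each.
data = [([float(i)], 0) for i in range(1000)] + [([float(i)], 1) for i in range(100)]
stream = sample_stream(data, n=200, positive_fraction=0.1, labelled_fraction=0.5)
```

Keeping the true label alongside the observed one means a classifier can be scored on every example, even those it saw without a label.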
The GH-VFDT is an imbalanced data stream classifier that uses the Hoeffding bound and Hellinger distances to improve minority class recall. It utilises a decision tree split criterion designed to improve minority class recall rates on imbalanced data streams, i.e. those streams where the class distribution is worse than 1:100. This implementation is built upon the Hoeffding Tree provided in MOA, thus a great deal of credit goes to the MOA team for their initial implementation and library. We gratefully acknowledge their efforts. If you use this resource, please cite via the DOI: 10.6084/m9.figshare.1536470.v1.
For more details of the algorithm see:
R. J. Lyon, J. M. Brooke, J. D. Knowles, B. W. Stappers, “Hellinger Distance Trees for Imbalanced Streams”, in 22nd International Conference on Pattern Recognition, pp.1969-1974, 2014. Arxiv Pre-print, DOI: 10.1109/ICPR.2014.344.
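As a rough illustration of the two ingredients named above, the sketch below shows the standard Hoeffding bound and the closed-form Hellinger distance between two univariate Gaussians. It assumes each feature is modelled per class as a Gaussian, and that a split is accepted when the best feature beats the runner-up by more than the bound. The function names and the split rule are assumptions for this example, not the released MOA-based implementation.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """With probability 1 - delta, the true mean lies within this bound of
    the mean observed over n samples of a variable with the given range."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def gaussian_hellinger(mu1, sigma1, mu2, sigma2):
    """Hellinger distance between N(mu1, sigma1^2) and N(mu2, sigma2^2), in [0, 1]."""
    var_sum = sigma1 ** 2 + sigma2 ** 2
    # Bhattacharyya coefficient for two univariate Gaussians.
    overlap = math.sqrt(2.0 * sigma1 * sigma2 / var_sum) * math.exp(
        -((mu1 - mu2) ** 2) / (4.0 * var_sum)
    )
    return math.sqrt(max(0.0, 1.0 - overlap))

def should_split(best, second_best, delta, n, value_range=1.0):
    """Hypothetical split test: accept the best feature once its lead over the
    runner-up exceeds the Hoeffding bound. value_range=1.0 assumes the
    criterion is a Hellinger distance bounded in [0, 1]."""
    return (best - second_best) > hoeffding_bound(value_range, delta, n)

# Per-feature class separation: (minority mean, std) against (majority mean, std).
d_a = gaussian_hellinger(3.0, 1.0, 0.0, 1.0)   # well-separated classes
d_b = gaussian_hellinger(0.5, 1.0, 0.0, 1.0)   # heavily overlapping classes
print(should_split(max(d_a, d_b), min(d_a, d_b), delta=1e-7, n=2000))
```

Because the distance depends only on the shape of each class-conditional distribution and not on the class counts, a criterion of this kind is less sensitive to class skew than one based on information gain.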
Time-domain Machine Learning – Opportunities and Challenges for the SKA @ Third ASTERICS-OBELICS Workshop : New paths in data analysis and open data provision in Astronomy and Astroparticle Physics 23rd-26th October 2018, Cambridge, UK.
Time-domain Machine Learning – Opportunities and Challenges for the SKA @ AI at SKA and CERN, 17th-18th September 2018, Alan Turing Institute, London, UK.
Data processing with the SKA – Machine learning at scale @ University of Southampton Physics Seminar, 20th November 2018.
During my Ph.D. I was supervised by Dr. John Brooke (Computer Science), Dr. Josh Knowles (Computer Science) and Prof. Ben Stappers (Jodrell Bank Centre for Astrophysics). The algorithms I developed during my Ph.D. have so far helped to find 20 new pulsars in data obtained during the LOFAR Tied-Array All-Sky Survey, aka the LOTAAS survey (see here for more details). The techniques I developed are also being applied to data collected during the GMRT High Resolution southern sky survey (GHRSS), and to High Time Resolution Universe Survey (HTRU) data.