Next-generation telescopes like MeerKAT and the Square Kilometre Array (SKA) will generate huge amounts of data. The SKA, for instance, when completed by the mid-2020s, is expected to produce more data in a day than the entire internet!
The MeerKAT radio telescope, a precursor to the SKA, is currently under construction in the Karoo desert of South Africa. Until the SKA is completed, MeerKAT will be the most sensitive telescope at centimetre wavelengths and will propel transformational science ahead of the SKA.
Dr. Neeraj Gupta, Associate Professor at IUCAA in Pune, has approximately 1700 hours of observing time at MeerKAT to carry out a large survey, the MeerKAT Absorption Line Survey or MALS (http://mals.iucaa.in/), which will map the evolution of cold gas in the universe. Processing the data from this survey (approx. 4 PB over 5 years) using conventional methods, which also involve manual intervention, would take decades, so automating the data processing has become the need of the hour.
To address this, Thoughtworks and IUCAA are developing an Automated Radio Telescope Image Processing Pipeline (ARTIP) that will automate the entire process of flagging, calibrating and imaging the data. The pipeline will also use various statistical techniques to identify bad data patterns, such as completely or partially bad antennas, baselines and time ranges, and will generate flagging statistics and various diagnostics at each stage.
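As an illustration of what a statistical check for bad antennas can look like, here is a minimal sketch based on the median absolute deviation (MAD). It is not taken from ARTIP: the function name `flag_outlier_antennas`, the input layout (antenna name mapped to a list of amplitudes) and the threshold of 3.5 are all assumptions made for this example.

```python
from statistics import median


def flag_outlier_antennas(amplitudes, threshold=3.5):
    """Return the names of antennas whose median amplitude is a robust outlier.

    `amplitudes` maps an antenna name to a list of visibility amplitudes.
    This layout and the default threshold are assumptions for illustration.
    """
    # Summarise each antenna by its median amplitude.
    per_antenna = {name: median(values) for name, values in amplitudes.items()}

    # Robust centre and spread across all antennas.
    centre = median(per_antenna.values())
    mad = median(abs(value - centre) for value in per_antenna.values())

    # With zero spread the score is undefined, so nothing is flagged.
    if mad == 0:
        return []

    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # for normally distributed data.
    return sorted(
        name
        for name, value in per_antenna.items()
        if 0.6745 * abs(value - centre) / mad > threshold
    )


if __name__ == "__main__":
    sample = {
        "A1": [0.9, 1.0, 1.1],
        "A2": [1.0, 1.05, 1.1],
        "A3": [0.95, 0.97, 1.0],
        "A4": [0.08, 0.1, 0.12],  # much weaker than the rest
        "A5": [1.0, 1.02, 1.04],
    }
    print(flag_outlier_antennas(sample))
```

A median-based score is used here because a single very bad antenna would distort a mean and standard deviation, which could hide the outlier it is meant to find.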
ARTIP has been tested and validated against various datasets (approx. 10 GB each) from the Giant Metrewave Radio Telescope (GMRT) and the Very Large Array (VLA). Running the pipeline sequentially on a server-class machine with 256 GB RAM and 40 cores takes around 20 minutes, as opposed to a manual process that would take between 3 and 4 hours.
The pipeline currently runs sequentially and does not utilize the hardware at full capacity. Further enhancements have been planned to scale the pipeline to handle petabytes of data. For parallelization, the team is currently working on the imaging stage, which takes more than 70% of the total processing time. The approach is to partition the data along the frequency axis and run imaging in parallel on a 13-node cluster (each node having 128 GB RAM and 16 cores) provided by IUCAA, backed by a Lustre file system. This gives a gain of 60% on a 1.4 TB dataset, but the scaling is restricted by the number of logical partitions (those useful for achieving the science goals) that can be made.
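The idea of partitioning along the frequency axis can be sketched as follows. This is only an illustration under stated assumptions, not ARTIP's implementation: `partition_channels`, `image_chunk` and `image_in_parallel` are hypothetical names, the channel count of 4096 is made up, `image_chunk` is a placeholder for the real imaging step, and local threads stand in for the cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_channels(n_channels, n_parts):
    """Split channels 0..n_channels-1 into n_parts contiguous (start, stop) ranges.

    `stop` is exclusive. Range sizes differ by at most one channel.
    """
    if n_parts < 1 or n_parts > n_channels:
        raise ValueError("n_parts must be between 1 and n_channels")
    base, extra = divmod(n_channels, n_parts)
    ranges = []
    start = 0
    for index in range(n_parts):
        # The first `extra` ranges take one additional channel each.
        stop = start + base + (1 if index < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def image_chunk(channel_range):
    """Hypothetical placeholder for imaging one frequency chunk."""
    start, stop = channel_range
    return {"channels": (start, stop), "n_channels": stop - start}


def image_in_parallel(n_channels, n_workers):
    """Image every frequency chunk concurrently, one chunk per worker."""
    chunks = partition_channels(n_channels, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map preserves the order of the input chunks.
        return list(pool.map(image_chunk, chunks))


if __name__ == "__main__":
    # 4096 channels is an assumed value; 13 matches the node count above.
    for result in image_in_parallel(n_channels=4096, n_workers=13):
        print(result)
```

Because each chunk covers a contiguous frequency range, the number of useful chunks is bounded by how finely the band can be divided while still serving the science goals, which is the scaling limit described above.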
Key achievements of ARTIP include:
- The detection of the OH radical in a galaxy. Such detections are rare (only 3-4 are known so far) and would have been easily missed without the pipeline.
- ARTIP was presented at hiAbsorption 2017, an international astronomy conference held at ASTRON, Netherlands.
- The work on MALS data processing with ARTIP has been recognized in the international astronomical journal Proceedings of Science, in the publication "The MeerKAT Absorption Line Survey (MALS)".