Articles | Volume 26, issue 3
https://doi.org/10.5194/nhess-26-1603-2026
© Author(s) 2026. This work is distributed under the Creative Commons Attribution 4.0 License.
Predicting thunderstorm risk probability at very short time range using deep learning
Download
- Final revised paper (published on 31 Mar 2026)
- Preprint (discussion started on 23 Jul 2025)
Interactive discussion
Status: closed
Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
- RC1: 'Comment on egusphere-2025-2893', Anonymous Referee #1, 02 Oct 2025
- AC1: 'Reply on RC1', Mélanie Bosc, 06 Nov 2025
- RC2: 'Comment on egusphere-2025-2893', Anonymous Referee #2, 19 Nov 2025
- AC2: 'Reply on RC2', Mélanie Bosc, 02 Dec 2025
Peer review completion
AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload
ED: Reconsider after major revisions (further review by editor and referees) (05 Dec 2025) by Ricardo Trigo
AR by Mélanie Bosc on behalf of the Authors (17 Dec 2025)
- Author's response
- Author's tracked changes
- Manuscript
ED: Referee Nomination & Report Request started (02 Jan 2026) by Ricardo Trigo
RR by Anonymous Referee #1 (02 Jan 2026)
RR by Anonymous Referee #2 (03 Feb 2026)
ED: Publish as is (17 Feb 2026) by Ricardo Trigo
AR by Mélanie Bosc on behalf of the Authors (20 Feb 2026)
Review of “Predicting thunderstorm risk probability at very short time range using deep learning”
The preprint proposes a deep learning methodology for very short-term (5-60 minutes) probabilistic forecasting of lightning risk, motivated by aviation safety within the ALBATROS project. It adapts the ED-DRAP neural network, incorporating spatio-temporal sequences from satellite (GOES-16 ABI brightness temperature and GLM lightning groups) and NWP (GFS lifted index and relative humidity) data over a region centered on the Gulf of Mexico and Florida. A key focus is on achieving well-calibrated outputs through a combined cross-entropy and Dice loss function, enabling interpretable risk probability maps. Results report F1 scores of 0.65 at 5 minutes and 0.5 at 30 minutes, with ECE below 10%. In general, the manuscript is well-structured, the approach is innovative in emphasizing calibration for probabilistic lightning nowcasting without radar data, and the topic is highly relevant for natural hazards research, particularly in aviation and thunderstorm impacts. However, I have some concerns regarding the scope, comparisons, and generalizability.
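To make the calibration discussion concrete, the two ingredients highlighted above (a combined cross-entropy and Dice loss, and the expected calibration error) can be sketched in a few lines of numpy. This is an illustrative sketch only: the function names, the equal weighting `alpha=0.5`, and the binning choices are my assumptions, not the authors' exact formulation.

```python
import numpy as np

def bce_dice_loss(y_true, y_prob, alpha=0.5, eps=1e-7):
    """Combined binary cross-entropy + Dice loss (illustrative sketch).

    y_true: binary ground-truth mask, y_prob: predicted probabilities.
    `alpha` weights the two terms; the paper's exact weighting may differ.
    """
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    bce = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    intersection = np.sum(y_true * y_prob)
    dice = 1.0 - (2.0 * intersection + eps) / (np.sum(y_true) + np.sum(y_prob) + eps)
    return alpha * bce + (1 - alpha) * dice

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Standard ECE: bin-count-weighted mean |accuracy - confidence|."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (y_prob > lo) & (y_prob <= hi)
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece
```

The Dice term counteracts the class imbalance inherent to lightning masks (few positive pixels), while the cross-entropy term keeps the outputs probabilistically meaningful, which is what the ECE then measures.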
My main concern is the limited scope and potentially limited generalizability of the dataset and results. The data are restricted to winter mornings (00:00-05:00 UTC, December-February) from 2020-2023, covering only 154 days with a balanced split of stormy and non-stormy periods. While this controls for variability, it may not capture seasonal, diurnal, or regional differences in thunderstorm dynamics (e.g., summer afternoons or other global hotspots). The study area is also narrowed to a subset of CONUS, and no sensitivity analysis is provided for other regions. A discussion of how these choices affect broader applicability, perhaps with preliminary tests on extended data, would strengthen the contribution.
My second concern is the benchmarking and novelty assessment. The model is compared to ConvLSTM, PredRNN, persistence, and U-Net, showing superior F1 and calibration scores. However, direct comparisons to recent lightning-specific DL models from the literature are missing, such as those in Brodehl et al. (2022), Geng et al. (2021), or Leinonen et al. (2023), which also use satellite/radar data for nowcasting. While the intentional exclusion of radar data is well-justified for enhancing applicability to aircraft flight paths where radar coverage may be limited or absent, discussing how the proposed method might compare to radar-inclusive baselines would better contextualize its advantages and limitations.
Other comments
L90-95: Clarify why the smaller area (red rectangle in Fig. 1) was chosen beyond computational cost; does it represent typical thunderstorm regimes?
Fig. 2: Add coordinate axes (latitude/longitude) to subfigure (b) to match (a) for consistency and better spatial context.
L164-165: The effective training/testing area is further cropped to 256×256 pixels (17.3°N–37.7°N, 93°W–72°W) from the subselected red rectangle; consider adding this cropped boundary as an inner rectangle in Fig. 1 for clarity.
L175-180: The choice of a 6-timestep input sequence is justified by a comparative study, but I suggest including a table or figure summarizing F1 scores for the 2/4/6/8-timestep configurations to support this.
L305-310: The example in Fig. 9 misses only 5 % of the lightning activity, but it is not clear which probability threshold was used in this case.