<strong>Data visualization techniques for deeper understanding</strong>

Data availability and quality

Challenges and opportunities for the future of pandemic visualization

The optimal visualization technique depends on the specific context and intended use. For example, public health agencies may prioritize clear, simple visualizations to convey key messages to the public, while researchers may benefit from more complex interactive visualizations for in-depth analysis.

Problems and opportunities for the future of pandemic visualization

Despite its considerable potential, visualization of the COVID-19 pandemic faces ongoing challenges. Among them:

Data integration and harmonization: Integrating data from different sources and keeping it consistent remains a difficult task.
Development of new visualization techniques: The pandemic highlighted the need for innovative visualization techniques that can handle complex datasets and convey nuanced information effectively.
Combating misinformation and promoting data literacy: Combating misinformation requires promoting data literacy and using accurate, transparent visualizations.

Conclusion

In conclusion, data visualization plays a critical role in understanding and addressing the COVID-19 pandemic. By taking into account key factors, challenges, and ethical considerations, researchers and communicators can create impactful visualizations that inform, educate, and inspire the public.



A project for exploring the global spread of COVID-19. Updated daily. Overview and motivation.

Reliable, up-to-date Covid-19 data sources are no longer available

August 2023: There are currently no continuous, daily, reliable sources of data on COVID cases in the United States. Johns Hopkins University, the primary and most reliable data source, no longer publishes updates.

The Our World in Data dataset is now compiled almost exclusively from World Health Organization (WHO) data. WHO data is updated by WHO member states, and the updates appear to trickle in, with months passing between updates from some countries.

91-DIVOC will remain online for some time as an archive of COVID-19 data, but it will receive no further updates. I am looking forward to my next project and hope you will join me in exploring the data there! 🙂

It has been an honor to have you on 91-DIVOC and to have your trust in my data. I hope we meet again! 🙂

A few things I found interesting to explore:

Explore more of 91-DIVOC

Guide to the 91-DIVOC visualizations

All data shown in this visualization comes from a vetted, high-quality data source: Johns Hopkins University, the University of Oxford (Our World in Data), or The Atlantic (Covid Tracking Project). You can switch between data sources using the Data Source control at the top of the visualization. Johns Hopkins University is shown by default, but your choice of data can be saved with the Direct Link bookmark displayed under each visualization. All datasets are updated several times a day.

Regions

In addition to the country and state data provided by the data sources, several regions have been added for extra context. These regions include:

World
Countries
US states

Each region can be selected as a Highlight option in the controls under each chart. In addition, by choosing + Add Additional Highlight, you can highlight multiple selections on each chart.

Population-normalized data

The first two visualizations show raw case data (for example, the number of cases, deaths, tests, and so on). The last two visualizations show the same data normalized by official population, expressed per 100,000 people. This normalized view gives a fairer comparison between regions of different sizes (for example, California, Texas, Illinois, and New York would have more cases if the distribution were uniform, simply because they have more residents). The formula used:

\[ \frac{\text{data value}}{\text{population}} \times 100000 \]
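The per-100,000 normalization can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual code: the function name `per_100k` and the region names and counts are made up for the example.

```python
def per_100k(value: float, population: int) -> float:
    """Normalize a raw count to a rate per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return value / population * 100_000


# Hypothetical regions with made-up counts, for illustration only.
regions = {
    "Region A": {"cases": 50_000, "population": 10_000_000},
    "Region B": {"cases": 5_000, "population": 500_000},
}

rates = {
    name: per_100k(r["cases"], r["population"]) for name, r in regions.items()
}
```

Region A has ten times as many raw cases as Region B, yet its normalized rate is lower, which is the kind of comparison the normalized charts are meant to make visible.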

Direct link

The visualization controls offer many more options to explore, including the ability to save an image or a video/GIF of your chart. 91-DIVOC is my spare-time project, for when I am not teaching the next generation of students about computer science at the University of Illinois. You can read about my motivation for creating this project and some of its uses on the Overview and Motivation page.

Finally, if you have questions or feedback, you can write to me (a link to my faculty page and additional contact details are on this page). I read every email, reply to as many as I can, and would be glad to hear from you!

Predictive models for COVID-19 cases: a focus on data-driven models

In addition to predictive accuracy, the importance of predictive models’ interpretability has been discussed in many previous works23,24,25,26,27. Higher model interpretability makes it easier for humans to understand a model’s predictions, and thus promotes bias detection and other factors that contribute to policy making. Specifically, we demonstrate how the coefficients from the AR part of the trained hybrid model shed light on the underlying disease transmission mechanism, and thus could help to predict its prevalence trends and to inform public health policy makers so that they can improve pandemic planning, resource allocation, and the implementation of social distancing measures and other interventions. A long-term mission of this paper is to stretch the application of hybrid models beyond COVID-19 forecasting: toward other fast-moving epidemics and cases that require accurate prediction and interpretability.

Although in this paper we focus on predicting confirmed cases, we note that the proposed framework can be easily extended to tackle other COVID-19 or more general epidemiological tasks (e.g., hot spot prediction). Furthermore, the proposed method has its own research significance from a methodological perspective. For example, it raises open questions about its theoretical guarantees, the mathematical quantification of its predictions, and its interpretability.

Related work

Recently, numerous studies have employed machine learning techniques to investigate various tasks on COVID-19 and achieved impressive results. Examples include using deep learning to detect COVID-19 through CXR images and predicting death status based on food categories to recommend healthy foods during the pandemic28,29,30. In light of these advances, our research focuses on predicting confirmed cases of COVID-19.

In this section, we provide a more detailed review of data-driven models that formulate the prediction problem as a regression problem. Regression-based models, including simple AR models and more complex models such as Random Forest, Gradient Boosting, and CNN-LSTM, have been widely used for COVID-19 prediction. For example, Mumtaz et al.31 used ARIMA to predict the daily confirmed cases in European countries, while Yesilkanat32 used a Random Forest model to predict the number of cases and deaths. Muhammad et al.33 used a CNN-LSTM model to predict the number of confirmed cases and deaths in Nigeria, South Africa, and Botswana. We summarize a list of recent work from year 2020 to 2022 in Table 1.

One advantage of these models is that they do not require a priori knowledge of the disease dynamics and can capture rich relationships in the data. They have been shown to be effective in predicting COVID-19 cases in various regions around the world. However, COVID-19 data displays rich variability, so a single predictive model has its own limitations and may not be sufficient. For example, one major disadvantage of ARIMA models is that they may not be able to capture non-linear patterns in the data, which can lead to inaccurate predictions. On the other hand, more complex models such as Random Forest and CNN-LSTM may suffer from overfitting, where the model becomes too specialized to the training data and cannot generalize well to new data. These complex models may also lack interpretability, making it difficult to understand the factors driving the predictions, and thus provide little to no guidance for actual public health policy making.

Hybrid predictive models that combine different regression models may offer the best of both worlds by capturing both linear and non-linear patterns in the data while maintaining some degree of interpretability. The idea is to decompose a model into different components that are designed to capture specific characteristics of the data. This has proven to be an effective way of improving empirical predictions in various applications, including COVID-19 prediction34,35,36,37,38.

Comparison to previous works on hybrid modeling

In our study, we design a general network architecture that includes both an AR part and an LSTM part additively and trains the entire architecture jointly by minimizing the empirical risk. By doing so, we do not arbitrarily give preference to any of the two additive components. Instead, the relative weights of the interpretable AR part and the predictive LSTM part are determined fully by the data.

Table 1 A non-exhaustive list of previous works on data-driven models for COVID-19 cases prediction in the past three years.


In summary, our contributions can be summarized as:

Methods

In time series, we often observe associations between past and present values. For example, by knowing the price of a stock in the past few days, we can often make a rough prediction about its value tomorrow. AR is a simple model that utilizes this empirical observation and can yield very accurate predictions in certain applications. It represents the time series values using a linear combination of the past values. The number of past values used is called the lag number and is often denoted by p. Let $\varepsilon_t$ denote the Gaussian noise at time t with mean 0 and variance $\sigma^2$. The structure equation of the AR(p) model can be represented as

\[ X_t = \sum_{i=1}^{p} \phi_i X_{t-i} + \varepsilon_t, \]

where $X_t$ is the value of the series at time t and $\phi_1, \dots, \phi_p$ are the AR coefficients. A series can also be differenced before fitting: the first-order difference is $\nabla X_t = X_t - X_{t-1}$, and higher-order differencing operations can be defined recursively. However, an AR model is not sufficient to capture the non-linear dependence structure, which is found to be an important feature of the COVID-19 data, as indicated by Fig. 1. A purely AR-based model is thus often insufficient for the task of predicting COVID-19 cases.
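To make the AR(p) idea concrete, the sketch below fits the coefficients by ordinary least squares and produces a one-step forecast, using only the Python standard library. It is an illustration under stated assumptions rather than the paper's implementation: the helper names (`ar_forecast`, `solve_linear`, `fit_ar`), the choice of least squares via the normal equations, the absence of an intercept, and the synthetic series are all assumptions made here.

```python
from typing import List, Sequence


def ar_forecast(history: Sequence[float], phi: Sequence[float]) -> float:
    """One-step AR(p) forecast: sum of phi_i * X_{t-i} (noise term omitted)."""
    p = len(phi)
    if len(history) < p:
        raise ValueError("need at least p past values")
    # history[-1] is X_{t-1}, history[-2] is X_{t-2}, and so on.
    return sum(phi[i] * history[-1 - i] for i in range(p))


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a small dense system a x = b by Gaussian elimination with pivoting.

    Assumes the system is non-singular.
    """
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ar(series: Sequence[float], p: int) -> List[float]:
    """Least-squares estimate of phi_1..phi_p via the normal equations."""
    rows = [[series[t - i] for i in range(1, p + 1)] for t in range(p, len(series))]
    targets = [series[t] for t in range(p, len(series))]
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    rhs = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(p)]
    return solve_linear(gram, rhs)


# Synthetic noise-free AR(2) series, so the fit should recover the coefficients.
true_phi = [0.6, 0.3]
series = [1.0, 2.0]
for _ in range(30):
    series.append(ar_forecast(series, true_phi))

estimated_phi = fit_ar(series, p=2)
next_value = ar_forecast(series, estimated_phi)
```

Because the fitted coefficients are plain numbers attached to specific lags, they can be read directly, which is the interpretability property the AR part contributes.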

An example of visualizing daily observations, where the blue line represents the data before smoothing and the orange line represents the data after smoothing. The data was collected from Los Angeles County.


Long short term memory networks (LSTM)

RNN (Recurrent Neural Network)52 is known to suffer from the long term dependency problem: as the network grows larger through time, the gradient decays quickly during back propagation, making it impossible to train RNN models with long unfolding in time. To solve this problem, Hochreiter and Schmidhuber (1997) introduced a special type of RNN called LSTM with a proper gradient-based learning algorithm22.

We employ an LSTM regression model, which takes the most recent observations of the series as input and outputs a prediction of the next value.

To achieve optimal prediction results with an LSTM model, careful hyperparameter tuning is crucial, including the choice of units (the dimension of the hidden state), the number of cells (i.e., the number of time steps), and the number of layers. This is usually a difficult task in practice. For example, too few LSTM cells are unlikely to capture the structure of the sequence, while too many LSTM cells might lead to overfitting. Moreover, as with other neural networks, a well-known limitation of LSTM is its lack of interpretability23.
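The "number of time steps" hyperparameter determines how many past observations form one input sequence. The sketch below shows one common way to cut a series into such sequences with one-step-ahead targets. The function name `make_windows`, the window length of 7, and the sample values are assumptions for illustration; the paper's own preprocessing is not shown in this text.

```python
from typing import List, Sequence, Tuple


def make_windows(
    series: Sequence[float], n_steps: int
) -> Tuple[List[List[float]], List[float]]:
    """Split a series into (input window, next value) training pairs."""
    if n_steps < 1 or n_steps >= len(series):
        raise ValueError("n_steps must be in [1, len(series) - 1]")
    inputs, targets = [], []
    for start in range(len(series) - n_steps):
        inputs.append(list(series[start:start + n_steps]))
        targets.append(series[start + n_steps])
    return inputs, targets


daily_cases = [10, 12, 15, 14, 18, 21, 25, 24, 30, 33]  # made-up values
X, y = make_windows(daily_cases, n_steps=7)
# X holds 3 windows of 7 values each; y holds the value that follows each window.
```

A longer window gives the network more context per sample but leaves fewer training pairs, which is one reason this hyperparameter is hard to tune.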

The hybrid model

As discussed above, both AR and LSTM have their relative strengths and limitations in their respective domains. We propose to combine the two models additively into one single hybrid model: the prediction is a weighted sum of an AR(p) term and an LSTM term, where p is the lag number and a weighting parameter controls the contribution of the two components. By tuning the value of this parameter, one can strike a balance between the predictions given by the AR and LSTM parts, and thus between the linear and nonlinear signals.

We illustrate the structure of the hybrid model in Fig. 2. The hybrid model is characterized as one neural network architecture in which the two component models are added together in the last layer. The AR component captures the linear relationships in the time series, and the LSTM component describes the nonlinear patterns. In section “Training” of the Supplementary Material, we show how to train the weights in each of the two components in a fully data-adaptive manner by minimizing the empirical risk. We compare the contributions of the hybrid model’s AR component and LSTM component in section Results.
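The additive structure can be illustrated with a small standard-library sketch. Several things here are assumptions rather than the authors' method: the combination is written as `weight * AR + (1 - weight) * nonlinear`, the weight is fixed by hand instead of being learned jointly with the other parameters, and `toy_nonlinear` is a hypothetical stand-in for a trained LSTM.

```python
import math
from typing import Callable, Sequence


def hybrid_forecast(
    history: Sequence[float],
    phi: Sequence[float],
    nonlinear: Callable[[Sequence[float]], float],
    weight: float,
) -> float:
    """Weighted sum of a linear AR(p) term and a nonlinear term.

    ASSUMPTION: the combination is weight * AR + (1 - weight) * nonlinear.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    p = len(phi)
    lags = history[-p:][::-1]  # X_{t-1}, ..., X_{t-p}
    ar_part = sum(c * x for c, x in zip(phi, lags))
    return weight * ar_part + (1.0 - weight) * nonlinear(lags)


def toy_nonlinear(lags: Sequence[float]) -> float:
    """Hypothetical stand-in for a trained LSTM: any nonlinear map of the lags."""
    return math.tanh(sum(lags) / len(lags))


history = [0.2, 0.4, 0.5, 0.7]  # made-up normalized observations
phi = [0.5, 0.3]                # made-up AR coefficients
mostly_linear = hybrid_forecast(history, phi, toy_nonlinear, weight=0.9)
mostly_nonlinear = hybrid_forecast(history, phi, toy_nonlinear, weight=0.1)
```

With the weight near 1 the forecast follows the AR term, and with the weight near 0 it follows the nonlinear term, which mirrors the balance between linear and nonlinear signals described in the text.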

Visualization of the hybrid architecture.

Results

The results are organized into four sections: Model evaluations, Prediction, Interpretability, and Comparative study on the WHO datasets. In Model evaluations, we introduce the metrics we use to evaluate and compare the models. In Prediction, we show visualizations of several interesting trials and compare the numerical predictions and evaluations of the three models. In Interpretability, we compare the AR component of the hybrid model with the AR model, in order to examine how the hybrid model may be interpreted. Other training details are left to the Supplementary Material. In Comparative study on the WHO datasets, we further examine the performance of the proposed hybrid model by applying it to data from 7 different countries around the world and comparing its performance with that of its component models and 3 additional models.

Data description and statistical analysis

We utilize two primary data sources. The first data source is a dataset specific to California counties, which is available in the CHHS Open Data repository under the title COVID-19 Time-Series Metrics by County and State. This dataset includes information on populations, positive and total tests, number of deaths, and positive cases. We conducted a preliminary statistical analysis to examine correlations between these variables and the number of daily cases. The results of this analysis can be found in Supplementary Fig. 3 in Supplementary Material, and we anticipate that they will provide valuable insights for future research.

The second data source, used for comparative analysis, can be found in the WHO repository at the WHO Coronavirus (COVID-19) Dashboard. This resource presents official daily counts of COVID-19 cases, deaths, and vaccine utilization, as reported by countries, territories, and areas. In this study, we use 7 countries: Japan, Canada, Brazil, Argentina, Singapore, Italy, and the United Kingdom.

All datasets generated and analysed during the current study are also available in the author’s GitHub repository24.

Model evaluations

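We use a quantitative measure to evaluate and compare the performance of models: the Mean Absolute Percentage Error (MAPE), defined as:

\[ \mathrm{MAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{X_t - \hat{X}_t}{X_t} \right|, \]

where $X_t$ is the observed value, $\hat{X}_t$ is the corresponding forecast, and $n$ is the number of forecasted points.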

A model with small values of MAPE is preferred.
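As a quick illustration of the metric, here is a minimal MAPE implementation in Python using the standard definition. The function name `mape` and the sample numbers are made up for this example.

```python
from typing import Sequence


def mape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Mean Absolute Percentage Error, in percent."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


actual = [100.0, 200.0, 400.0]    # made-up ground truth
forecast = [110.0, 190.0, 400.0]  # made-up predictions
score = mape(actual, forecast)    # (10% + 5% + 0%) / 3 = 5%
```

Because each error is divided by the actual value, MAPE is scale-free, which makes it convenient for comparing regions with very different case counts; it is undefined for days with zero observed cases.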

We examine the performance of the three models (hybrid, AR, and LSTM) on different time periods within the available range. This is essential in our research, since the performance of a model is not constant across different trends; intuitively, a model performs better on smooth curves than on steep curves. By repeating our evaluation process on different time periods, and thus on different trends, we wish to understand on which trends each model performs best. Such understanding helps us decide to what degree we may trust the performance of the models. We evaluate the models repeatedly to reduce the influence of the instability of model training. Specifically, we leave 7 days between the first dates of any two consecutive training windows. Although a larger number of repetitions seems desirable, increasing the number of repetitions comes at the cost of moving neighboring training windows closer to each other. The difference in performance between two neighboring training windows that are too close to each other would be attributable more to the instability of model training than to the difference in trend, and such results would give us little information about model performance across trends. In the end, we set the step size equal to our lag number. In doing so, we assume that the weekly cycle is important in forecasting.
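The repeated evaluation with start dates spaced 7 days apart can be sketched as follows. Only the 7-day step comes from the text; the function name `rolling_splits` and the window lengths (60 training days, 14 test days, 120 days of data) are hypothetical values chosen for illustration.

```python
from typing import List, Tuple


def rolling_splits(
    n_days: int, train_len: int, test_len: int, step: int = 7
) -> List[Tuple[range, range]]:
    """Return (train, test) index ranges whose start dates are `step` days apart."""
    splits = []
    start = 0
    while start + train_len + test_len <= n_days:
        train = range(start, start + train_len)
        test = range(start + train_len, start + train_len + test_len)
        splits.append((train, test))
        start += step
    return splits


# Hypothetical sizes: 120 days of data, 60 training days, 14 test days.
splits = rolling_splits(n_days=120, train_len=60, test_len=14)
starts = [train.start for train, _ in splits]
```

A smaller step would produce more splits, but neighboring windows would then overlap almost entirely, which is the trade-off discussed above.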

Additional evaluation metrics

In the Supplementary Material, we additionally evaluate and compare the above models using Root Mean Square Error (RMSE) and Mean Absolute Error (MAE). The evaluation is done on the same dataset for all of the compared methods.
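For reference, the two additional metrics can be computed as below, using their standard definitions. The function names and the sample numbers are made up for this sketch.

```python
import math
from typing import Sequence


def mae(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Mean Absolute Error, in the units of the data."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def rmse(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Root Mean Square Error, in the units of the data."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((a - f) ** 2 for a, f in zip(actual, forecast))
    return math.sqrt(squared / len(actual))


actual = [100.0, 200.0, 400.0]    # made-up ground truth
forecast = [110.0, 190.0, 400.0]  # made-up predictions
mae_score = mae(actual, forecast)
rmse_score = rmse(actual, forecast)
```

Unlike MAPE, both metrics depend on the scale of the data, and RMSE penalizes large individual errors more heavily than MAE does.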

The left panels show the training and testing data. The right panels show the ground truth versus the forecasts of the AR, LSTM, and hybrid models, respectively. We display the average prediction (solid line) with 2 times the standard error (shaded region). The standard error across 100 runs is reported for LSTM and hybrid. The hybrid model is more stable than the LSTM.

Prediction

In this section, we present the numerical results for all three models. We perform a comprehensive comparison of the performance for the three models in multiple counties, showing the advantage of the hybrid model. All predictions are transformed back to the original scale.

Figure 3a shows models trained on curved data and tested on downward-trend data, as shown in the left and right panels, respectively. Figure 3b shows models trained on upward-trend data and tested on downward-trend data. Figure 3c shows models trained on upward-trend data and tested on upward-trend data. Figure 3d shows models trained on downward-trend data and tested on downward-trend data. Figure 4a,b shows models trained on downward-trend data and tested on upward-trend data; Fig. 4a has gently rising testing data, while Fig. 4b has sharply rising testing data. Figure 4c shows models trained and tested on jagged data.

To ensure the results above are representative, we run each selected trial 100 times, visualize the mean and standard error of these runs, and present the averaged MAPE. While AR outperforms LSTM in some cases, the hybrid model outperforms both in most cases, except in Fig. 3b and Fig. 4c. The MAPE, averaged over the 100 runs, shows that LSTM (4.469%) slightly outperforms hybrid (4.993%) in Fig. 3b. However, as shown in the right panel of Fig. 3b, the hybrid model captures the general trend of the ground truth better than LSTM does. Similarly, in Fig. 4c, AR (3.675%) slightly outperforms hybrid (3.718%). Yet, as shown in the right panel of Fig. 4c, the hybrid model captures the general trend of the ground truth better than AR does.
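The mean forecast and the band of 2 standard errors across repeated runs can be computed as in the sketch below. The runs here are simulated with random noise as a stand-in for 100 stochastic training runs; the function name `mean_and_se`, the noise level, and the "truth" values are assumptions for illustration only.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def mean_and_se(runs: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Per-day mean and standard error across repeated forecast runs."""
    n_runs = len(runs)
    if n_runs < 2:
        raise ValueError("need at least two runs")
    means, errors = [], []
    for day_values in zip(*runs):
        means.append(statistics.fmean(day_values))
        # Standard error = sample standard deviation / sqrt(number of runs).
        errors.append(statistics.stdev(day_values) / n_runs ** 0.5)
    return means, errors


# Simulated stand-in for 100 training runs of a stochastic model.
rng = random.Random(0)
truth = [50.0, 55.0, 61.0, 58.0]
runs = [[v + rng.gauss(0.0, 2.0) for v in truth] for _ in range(100)]

means, errors = mean_and_se(runs)
band = [(m - 2 * e, m + 2 * e) for m, e in zip(means, errors)]
```

A narrower band indicates forecasts that vary less from one training run to the next, which is the sense in which one model can be called more stable than another.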

Interestingly, the hybrid model also seems to capture the trend of the ground truth in every case shown. The shape of the hybrid model’s forecasts resembles that of the AR model, that of the LSTM model, or a combination of both. When the AR model captures the trend better than the LSTM does, the hybrid model resembles the AR model in forecast shape: for example, in Fig. 3b, San Francisco 2020-02-17 to 2020-05-14, and in Fig. 4a, Santa Barbara 2022-01-17 to 2022-04-14. When the LSTM model captures the trend better than the AR does, the hybrid model resembles the LSTM model in forecast shape: for example, in Fig. 3d, San Francisco 2022-06-10 to 2022-09-05, and in Fig. 4b, Riverside 2022-02-16 to 2022-12-20. On jagged testing data, where AR performs better in some parts and LSTM in others, the hybrid model shows the advantages of both models: for example, in Fig. 4c, the hybrid model resembles AR at the two ends, where AR performs better, and it resembles LSTM in shape between day 5 and day 15, where LSTM seems to capture the trend better.

General performance

We evaluated the model performances numerically in the 8 California counties across multiple trials. The results are given in Table 2. We observe that the hybrid model outperforms the AR model and the LSTM models almost uniformly: it generally yields the smallest average MAPE. Specifically, the MAPE of each model (AR, LSTM, LSTM with 2 layers, and hybrid), averaged over the results for all 8 counties, is 5.629%, 4.934%, 6.804%, and 4.173%, respectively. Overall, the hybrid model performs best, and it outperforms the AR model by approximately 1.5 percentage points (5.629% versus 4.173%). The LSTM model suffers from overfitting when a second LSTM layer is added. As seen in the Supplementary Material, the proposed hybrid model also yields the lowest RMSE and MAE values.

Table 2 MAPE (by percentage) for each model on each county.

Interpretability

Interpretability of hybrid models can be defined as the ability to provide insight into the relationships they have learned, as introduced by Murdoch et al.23. The hybrid model uses a decomposition approach to decipher the learned data-generating mechanism, in which the estimated AR model provides the easy-to-understand linear trend. The LSTM, on the other hand, is able to capture the long-term and nonlinear trends in the time series data. Our hybrid model aims to strike a balance between interpretability and accuracy, enabling us to gain insights into the underlying data while still achieving high predictive performance.

In this section, we study how the AR and LSTM components contribute to the hybrid model when fitting the data. Our purpose is to explain why the hybrid model generally performs better. More importantly, we seek to use the interpretation of the fitted hybrid model to provide practical guidance for the public health policy making process.

Note that all models are trained on the normalized data as described in section “Training” (Supplementary Material). Consequently all figures below report predictions on the normalized scales.

In Fig. 5, we present three settings with different signal strength ratios (represented by the value of the hybrid model’s weighting parameter) between the AR component and the LSTM component in the prediction of the hybrid model. Specifically, a larger value indicates that the AR component dominates the LSTM component in prediction, and a smaller value indicates the opposite. We found that the component with the stronger signal characterizes the general trend in the data, while the other helps to stabilize the variance. This observation sheds light on why the hybrid model generally provides better predictive performance than a single model.

Moreover, the fitted value of the weighting parameter provides a characterization of the intrinsic nonlinearity of the data, and consequently of the difficulty of interpreting the linear component of the fitted hybrid model. The smaller the value, the higher the weight of the nonlinear LSTM fit in the final prediction. In such a setting, the coefficients of the AR component should be given less weight when generating interpretations for policy making. Conversely, for larger values, it is more trustworthy to interpret the coefficients of the dominant AR part. This observation helps public policy makers distinguish among different virus transmission stages.

Table 3 Coefficients of AR model vs. AR coefficients of hybrid model.

The forecasts of a hybrid model versus the ground truth, and the contribution of its AR and its LSTM component.

Comparative study on the WHO datasets

In this section, we compare our proposed hybrid model for COVID-19 prediction with its two component models, the ARIMA and LSTM models, as well as three other commonly used models: Support Vector Machines53 (SVM), Random Forest54 (RF), and eXtreme Gradient Boosting55 (XGBoost). To assess the effectiveness of our model in different application settings, we use country-level data for this comparative study, focusing on datasets from seven different countries collected by the World Health Organization.

We provide a brief overview of the three additional methods. Support Vector Machines (SVM)42,47 is a machine learning model that identifies the optimal hyperplane in a high-dimensional space that maximally separates data points into different classes. An SVM applies to both classification and regression problems. SVM is known not to perform well on noisy or unbalanced data56,57.

Random Forest43,44,45 is an ensemble learning method that constructs a multitude of decision trees. A Random Forest is very flexible and can handle complex data types. On the other hand, Random Forests are known for their reduced interpretability, sensitivity to noise, need for hyperparameter tuning, and potential issues with imbalanced data. These factors may impact their performance in the context of COVID-19 predictions58,59,60.

Extreme Gradient Boosting (XGBoost)44,46,48 has shown exceptional performance in various tasks. XGBoost is an ensemble learning method based on gradient boosting trees. It is known for its efficiency, scalability, and accuracy. However, like other tree-based ensemble methods, it can be more challenging to interpret. This may make it difficult to understand the driving factors behind predictions. In addition, XGBoost can be prone to overfitting, especially with small datasets or when the hyperparameters are not tuned properly61,62.

We present the numerical results of the comparative study, which are visualized in Fig. 6. The comparative study is done on data collected by the World Health Organization63 in Japan (JPN), Canada (CAN), Brazil (BRA), Argentina (ARG), Singapore (SGP), Italy (ITA), and the United Kingdom (GBR).

Overall, the proposed hybrid model performs better than the other models in most cases, as evidenced by its lower MAPE. This suggests that our model is effective in various situations and outperforms other commonly used models for COVID-19 prediction.

A heatmap showing the performance, measured by MAPE in percentage, of the 7 models from this study and from previous work: AR, Single LSTM (LSTM), Double LSTM (DLSTM), hybrid, SVM, Random Forest (RF), and XGBoost (XGB). The assessment was done on data collected by the World Health Organization from 7 different countries around the world: Japan (JPN), Canada (CAN), Brazil (BRA), Argentina (ARG), Singapore (SGP), Italy (ITA), and the United Kingdom (GBR).

Discussion

It is also noteworthy that the predictive performance of the proposed hybrid model can be further improved by properly choosing the hyperparameters. Furthermore, while we considered LSTM as the nonlinear component in the hybrid model, it can be substituted by any other deep learning model.

Funding

Y. Zhang was partially supported by Raymond L Wilder Award sponsored by University of California Santa Barbara and Hellman Family Faculty Fellowship. S.T. was partially supported by Regents Junior Faculty fellowship, Faculty Early Career Acceleration grant sponsored by University of California Santa Barbara, Hellman Family Faculty Fellowship and the NSF DMS-2111303. G.Y. was partially supported by Regents Junior Faculty fellowship, Faculty Early Career Acceleration grant sponsored by University of California Santa Barbara.

Author information

S.T. and Y.G. provided the initial idea and research plan. Y.Z. collected data, implemented algorithms, and performed simulations. All authors participated in the analysis and discussion of the results and in the writing of the manuscript.

Corresponding authors

Correspondence to Sui Tang or Guo Yu.

Ethics declarations

The authors declare no competing interests.


About this article

Zhang, Y., Tang, S. & Yu, G. An interpretable hybrid predictive model of COVID-19 cases using autoregressive model and LSTM. Sci Rep 13, 6708 (2023). https://doi.org/10.1038/s41598-023-33685-z