Python 102 - Plotting data on a map

Before you start make sure that you have run the following commands in anaconda prompt:

conda activate analysis-101

conda install git

conda install pycountry

git clone https://github.com/CSSEGISandData/COVID-19.git us_jhu_data

Explore environment and downloaded data

As with any downloaded data, it’s prudent to explore what we actually received and whether there are obvious patterns.

Let’s start by exploring what we can see from our notebook. We can do this using the Jupyter magic command %ls, which behaves like the Unix ls command (or the Windows DIR command) and lists the contents of the directory the notebook was opened from.

%ls 
AR_pyTemplate.xlsx                  introduction-to-SIR-models-2.ipynb
choropleth_1.html                   introduction-to-SIR-models-3.ipynb
choropleth_2.html                   mapping-data-with-plotly.ipynb
choropleth_3.html                   us_jhu_data/
introduction-to-SIR-models-1.ipynb

So it looks like we’re in our own folder: we’ve got our notebooks and the data we just downloaded. Let’s have a look inside.

%ls us_jhu_data
README.md                       csse_covid_19_data/
archived_data/                  who_covid_19_situation_reports/

That command wasn’t super helpful; all we got back was a list of more folders to look into. Let’s just go nuclear and look at everything together. To do that we can use the -R switch to recursively expand all the files and folders (on Windows, the equivalent switch is /S).

%ls -R us_jhu_data/ 
# %ls /S us_jhu_data
README.md                       csse_covid_19_data/
archived_data/                  who_covid_19_situation_reports/

us_jhu_data//archived_data:
README.md                    archived_time_series/
archived_daily_case_updates/

us_jhu_data//archived_data/archived_daily_case_updates:
01-21-2020_2200.csv  01-29-2020_1430.csv  02-06-2020_1318.csv
01-22-2020_1200.csv  01-29-2020_2100.csv  02-06-2020_2005.csv
01-23-2020_1200.csv  01-30-2020_1100.csv  02-07-2020_2024.csv
01-24-2020_0000.csv  01-30-2020_2130.csv  02-08-2020_1024.csv
01-24-2020_1200.csv  01-31-2020_1400.csv  02-08-2020_2304.csv
01-25-2020_0000.csv  02-01-2020_1000.csv  02-09-2020_1030.csv
01-25-2020_1200.csv  02-01-2020_1800.csv  02-09-2020_2320.csv
01-25-2020_2200.csv  02-01-2020_2300.csv  02-10-2020_1030.csv
01-26-2020_1100.csv  02-02-2020_0500.csv  02-10-2020_1930.csv
01-26-2020_2300.csv  02-02-2020_1945.csv  02-11-2020_1050.csv
01-27-2020_0900.csv  02-02-2020_2100.csv  02-11-2020_2044.csv
01-27-2020_1900.csv  02-03-2020_1230.csv  02-12-2020_1020.csv
01-27-2020_2030.csv  02-03-2020_2140.csv  02-12-2020_2200.csv
01-28-2020_1300.csv  02-04-2020_0800.csv  02-13-2020_1000.csv
01-28-2020_1800.csv  02-04-2020_1150.csv  02-13-2020_2115.csv
01-28-2020_2300.csv  02-04-2020_2200.csv  02-14-2020_1123.csv
01-29-2020_1330.csv  02-05-2020_1220.csv  README.md

us_jhu_data//archived_data/archived_time_series:
README.md
time_series_19-covid-Confirmed_archived_0325.csv
time_series_19-covid-Deaths_archived_0325.csv
time_series_19-covid-Recovered_archived_0325.csv
time_series_2019-ncov-Confirmed.csv
time_series_2019-ncov-Deaths.csv
time_series_2019-ncov-Recovered.csv

us_jhu_data//csse_covid_19_data:
README.md                       csse_covid_19_daily_reports_us/
UID_ISO_FIPS_LookUp_Table.csv   csse_covid_19_time_series/
csse_covid_19_daily_reports/

us_jhu_data//csse_covid_19_data/csse_covid_19_daily_reports:
01-22-2020.csv  02-09-2020.csv  02-27-2020.csv  03-16-2020.csv  04-03-2020.csv
01-23-2020.csv  02-10-2020.csv  02-28-2020.csv  03-17-2020.csv  04-04-2020.csv
01-24-2020.csv  02-11-2020.csv  02-29-2020.csv  03-18-2020.csv  04-05-2020.csv
01-25-2020.csv  02-12-2020.csv  03-01-2020.csv  03-19-2020.csv  04-06-2020.csv
01-26-2020.csv  02-13-2020.csv  03-02-2020.csv  03-20-2020.csv  04-07-2020.csv
01-27-2020.csv  02-14-2020.csv  03-03-2020.csv  03-21-2020.csv  04-08-2020.csv
01-28-2020.csv  02-15-2020.csv  03-04-2020.csv  03-22-2020.csv  04-09-2020.csv
01-29-2020.csv  02-16-2020.csv  03-05-2020.csv  03-23-2020.csv  04-10-2020.csv
01-30-2020.csv  02-17-2020.csv  03-06-2020.csv  03-24-2020.csv  04-11-2020.csv
01-31-2020.csv  02-18-2020.csv  03-07-2020.csv  03-25-2020.csv  04-12-2020.csv
02-01-2020.csv  02-19-2020.csv  03-08-2020.csv  03-26-2020.csv  04-13-2020.csv
02-02-2020.csv  02-20-2020.csv  03-09-2020.csv  03-27-2020.csv  04-14-2020.csv
02-03-2020.csv  02-21-2020.csv  03-10-2020.csv  03-28-2020.csv  04-15-2020.csv
02-04-2020.csv  02-22-2020.csv  03-11-2020.csv  03-29-2020.csv  README.md
02-05-2020.csv  02-23-2020.csv  03-12-2020.csv  03-30-2020.csv
02-06-2020.csv  02-24-2020.csv  03-13-2020.csv  03-31-2020.csv
02-07-2020.csv  02-25-2020.csv  03-14-2020.csv  04-01-2020.csv
02-08-2020.csv  02-26-2020.csv  03-15-2020.csv  04-02-2020.csv

us_jhu_data//csse_covid_19_data/csse_covid_19_daily_reports_us:
04-12-2020.csv  04-13-2020.csv  04-14-2020.csv  04-15-2020.csv  README.md

us_jhu_data//csse_covid_19_data/csse_covid_19_time_series:
README.md
time_series_covid19_confirmed_US.csv
time_series_covid19_confirmed_global.csv
time_series_covid19_deaths_US.csv
time_series_covid19_deaths_global.csv
time_series_covid19_recovered_global.csv

us_jhu_data//who_covid_19_situation_reports:
README.md                         who_covid_19_sit_rep_time_series/
who_covid_19_sit_rep_pdfs/

us_jhu_data//who_covid_19_situation_reports/who_covid_19_sit_rep_pdfs:
20200121-sitrep-1-2019-ncov.pdf     20200222-sitrep-33-covid-19.pdf
20200122-sitrep-2-2019-ncov.pdf     20200223-sitrep-34-covid-19.pdf
20200123-sitrep-3-2019-ncov.pdf     20200224-sitrep-35-covid-19.pdf
20200124-sitrep-4-2019-ncov.pdf     20200225-sitrep-36-covid-19.pdf
20200125-sitrep-5-2019-ncov.pdf     20200226-sitrep-37-covid-19.pdf
20200126-sitrep-6-2019--ncov.pdf    20200227-sitrep-38-covid-19.pdf
20200127-sitrep-7-2019--ncov.pdf    20200228-sitrep-39-covid-19.pdf
20200128-sitrep-8-ncov-cleared.pdf  20200229-sitrep-40-covid-19.pdf
20200129-sitrep-9-ncov-v2.pdf       20200301-sitrep-41-covid-19.pdf
20200130-sitrep-10-ncov.pdf         20200302-sitrep-42-covid-19.pdf
20200131-sitrep-11-ncov.pdf         20200303-sitrep-43-covid-19.pdf
20200201-sitrep-12-ncov.pdf         20200304-sitrep-44-covid-19.pdf
20200202-sitrep-13-ncov-v3.pdf      20200305-sitrep-45-covid-19.pdf
20200203-sitrep-14-ncov.pdf         20200306-sitrep-46-covid-19.pdf
20200204-sitrep-15-ncov.pdf         20200307-sitrep-47-covid-19.pdf
20200205-sitrep-16-ncov.pdf         20200308-sitrep-48-covid-19.pdf
20200206-sitrep-17-ncov.pdf         20200309-sitrep-49-covid-19.pdf
20200207-sitrep-18-ncov.pdf         20200310-sitrep-50-covid-19.pdf
20200208-sitrep-19-ncov.pdf         20200311-sitrep-51-covid-19.pdf
20200209-sitrep-20-ncov.pdf         20200312-sitrep-52-covid-19.pdf
20200210-sitrep-21-ncov.pdf         20200313-sitrep-53-covid-19.pdf
20200211-sitrep-22-ncov.pdf         20200314-sitrep-54-covid-19.pdf
20200212-sitrep-23-ncov.pdf         20200315-sitrep-55-covid-19.pdf
20200213-sitrep-24-covid-19.pdf     20200316-sitrep-56-covid-19.pdf
20200214-sitrep-25-covid-19.pdf     20200317-sitrep-57-covid-19.pdf
20200215-sitrep-26-covid-19.pdf     20200318-sitrep-58-covid-19.pdf
20200216-sitrep-27-covid-19.pdf     20200319-sitrep-59-covid-19.pdf
20200217-sitrep-28-covid-19.pdf     20200320-sitrep-60-covid-19.pdf
20200218-sitrep-29-covid-19.pdf     20200321-sitrep-61-covid-19.pdf
20200219-sitrep-30-covid-19.pdf     20200322-sitrep-62-covid-19.pdf
20200220-sitrep-31-covid-19.pdf     20200323-sitrep-63-covid-19.pdf
20200221-sitrep-32-covid-19.pdf     20200324-sitrep-64-covid-19.pdf

us_jhu_data//who_covid_19_situation_reports/who_covid_19_sit_rep_time_series:
who_covid_19_sit_rep_time_series.csv

From the output of ls we can see that we have the following data sets:

  - archived daily case updates
  - archived time series data
  - a lookup table to convert IDs to US FIPS codes (county codes)
  - daily reports from Jan 22 to Apr 15
  - time series data with different focuses
  - unstructured WHO situation reports
  - a structured WHO time series of the situation reports

For brevity, we will only explore the following two data sets:

  - time series data with different focuses
  - daily reports from Jan 22 to Apr 15

Let’s start with the global time series data.

Start the Python engines, import libraries

Before we start we need to load the libraries we’ll use. Below is a collection of libraries that I reach for time and time again, so it’s generally prudent to load them all up front rather than stopping to load them later.

import pandas as pd
import numpy as np
import os
import plotly.express as px
from pandas_profiling import ProfileReport

Global time series data

pd.read_csv('us_jhu_data/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')\
  .head()

# Further exploration of 'us_jhu_data/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'
# is left to participants as an exercise.
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 4/6/20 4/7/20 4/8/20 4/9/20 4/10/20 4/11/20 4/12/20 4/13/20 4/14/20 4/15/20
0 NaN Afghanistan 33.0000 65.0000 0 0 0 0 0 0 ... 367 423 444 484 521 555 607 665 714 784
1 NaN Albania 41.1533 20.1683 0 0 0 0 0 0 ... 377 383 400 409 416 433 446 467 475 494
2 NaN Algeria 28.0339 1.6596 0 0 0 0 0 0 ... 1423 1468 1572 1666 1761 1825 1914 1983 2070 2160
3 NaN Andorra 42.5063 1.5218 0 0 0 0 0 0 ... 525 545 564 583 601 601 638 646 659 673
4 NaN Angola -11.2027 17.8739 0 0 0 0 0 0 ... 16 17 19 19 19 19 19 19 19 19

5 rows × 89 columns

This file has almost exactly what we need, but using it would be kind of cheating. Instead, let’s take the other dataset, the daily reports, and clean that data up ourselves for the mapping exercise.

Explore chosen dataset – Daily Reports

Let’s start by peeking into one file and having a look at what’s in there.

idata = 'us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/'
pd.read_csv(os.path.join(idata,'03-13-2020.csv'))
Province/State Country/Region Last Update Confirmed Deaths Recovered Latitude Longitude
0 Hubei China 2020-03-13T11:09:03 67786 3062 51553 30.9756 112.2707
1 Guangdong China 2020-03-13T11:09:03 1356 8 1296 23.3417 113.4244
2 Henan China 2020-03-11T08:13:09 1273 22 1249 33.8820 113.6140
3 Zhejiang China 2020-03-12T01:33:02 1215 1 1197 29.1832 120.0934
4 Hunan China 2020-03-13T11:09:03 1018 4 1005 27.6104 111.7088
... ... ... ... ... ... ... ... ...
225 NaN Afghanistan 2020-03-11T20:00:00 7 0 0 33.0000 65.0000
226 NaN Monaco 2020-03-11T20:00:00 2 0 0 43.7333 7.4167
227 NaN Liechtenstein 2020-03-11T20:00:00 1 0 0 47.1400 9.5500
228 NaN Guyana 2020-03-11T20:00:00 1 1 0 5.0000 -58.7500
229 NaN Taiwan* 2020-03-11T20:00:00 50 1 20 23.7000 121.0000

230 rows × 8 columns

Check the size and column count for all the files

import glob

dataset_path = os.path.join(idata,'*.csv')

for file in glob.glob(dataset_path):
    print( pd.read_csv(file).shape) # calculate data shape (rows, columns)
(101, 6)
(105, 6)
(2883, 12)
(2911, 12)
(75, 6)
(76, 6)
(3416, 12)
(3420, 12)
(2569, 12)
(2625, 12)
(206, 8)
(216, 8)
(73, 6)
(74, 6)
(225, 8)
(199, 8)
(3027, 12)
(3014, 12)
(71, 6)
(70, 6)
(72, 6)
(72, 6)
(125, 8)
(2989, 12)
(3002, 12)
(67, 6)
(68, 6)
(62, 6)
(58, 6)
(84, 6)
(76, 6)
(3413, 12)
(3417, 12)
(2764, 12)
(2679, 12)
(276, 8)
(272, 8)
(75, 6)
(75, 6)
(47, 6)
(51, 6)
(3434, 12)
(3430, 12)
(160, 8)
(173, 8)
(71, 6)
(72, 6)
(3439, 12)
(2434, 12)
(2483, 12)
(230, 8)
(218, 8)
(38, 6)
(46, 6)
(73, 6)
(72, 6)
(94, 6)
(90, 6)
(292, 8)
(284, 8)
(52, 6)
(54, 6)
(3429, 12)
(3421, 12)
(2809, 12)
(2857, 12)
(114, 6)
(119, 6)
(249, 8)
(258, 8)
(44, 6)
(41, 6)
(75, 6)
(75, 6)
(84, 6)
(85, 6)
(299, 8)
(304, 8)
(151, 8)
(141, 8)
(2966, 12)
(2942, 12)
(67, 6)
(266, 8)
(255, 8)
for file in glob.glob(dataset_path):
    print(file # print file path
          , pd.read_csv(file).shape) # calculate data shape (rows, columns)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-26-2020.csv (101, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-27-2020.csv (105, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-08-2020.csv (2883, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-09-2020.csv (2911, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-18-2020.csv (75, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-19-2020.csv (76, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-24-2020.csv (3416, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-25-2020.csv (3420, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-02-2020.csv (2569, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-03-2020.csv (2625, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-10-2020.csv (206, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-11-2020.csv (216, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-12-2020.csv (73, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-13-2020.csv (74, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-07-2020.csv (225, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-06-2020.csv (199, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv (3027, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-14-2020.csv (3014, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-05-2020.csv (71, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-04-2020.csv (70, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-08-2020.csv (72, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-09-2020.csv (72, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-01-2020.csv (125, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-12-2020.csv (2989, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-13-2020.csv (3002, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-02-2020.csv (67, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-03-2020.csv (68, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-31-2020.csv (62, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-30-2020.csv (58, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-21-2020.csv (84, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-20-2020.csv (76, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-23-2020.csv (3413, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-22-2020.csv (3417, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-05-2020.csv (2764, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-04-2020.csv (2679, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-17-2020.csv (276, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-16-2020.csv (272, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-15-2020.csv (75, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-14-2020.csv (75, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-26-2020.csv (47, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-27-2020.csv (51, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-29-2020.csv (3434, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-28-2020.csv (3430, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-04-2020.csv (160, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-05-2020.csv (173, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-06-2020.csv (71, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-07-2020.csv (72, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-30-2020.csv (3439, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-31-2020.csv (2434, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-01-2020.csv (2483, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-13-2020.csv (230, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-12-2020.csv (218, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv (38, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-23-2020.csv (46, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-11-2020.csv (73, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-10-2020.csv (72, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-25-2020.csv (94, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-24-2020.csv (90, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-19-2020.csv (292, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-18-2020.csv (284, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-28-2020.csv (52, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-29-2020.csv (54, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-27-2020.csv (3429, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-26-2020.csv (3421, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-06-2020.csv (2809, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-07-2020.csv (2857, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-28-2020.csv (114, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-29-2020.csv (119, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-14-2020.csv (249, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-15-2020.csv (258, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-25-2020.csv (44, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-24-2020.csv (41, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-16-2020.csv (75, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-17-2020.csv (75, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-22-2020.csv (84, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-23-2020.csv (85, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-20-2020.csv (299, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-21-2020.csv (304, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-03-2020.csv (151, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-02-2020.csv (141, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-11-2020.csv (2966, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-10-2020.csv (2942, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-01-2020.csv (67, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-09-2020.csv (266, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-08-2020.csv (255, 8)
for file in glob.glob(dataset_path):
    print(file.rsplit(os.path.sep,1)[1] # isolate file name
          , pd.read_csv(file).shape) # calculate data shape (rows, columns)
02-26-2020.csv (101, 6)
02-27-2020.csv (105, 6)
04-08-2020.csv (2883, 12)
04-09-2020.csv (2911, 12)
02-18-2020.csv (75, 6)
02-19-2020.csv (76, 6)
03-24-2020.csv (3416, 12)
03-25-2020.csv (3420, 12)
04-02-2020.csv (2569, 12)
04-03-2020.csv (2625, 12)
03-10-2020.csv (206, 8)
03-11-2020.csv (216, 8)
02-12-2020.csv (73, 6)
02-13-2020.csv (74, 6)
03-07-2020.csv (225, 8)
03-06-2020.csv (199, 8)
04-15-2020.csv (3027, 12)
04-14-2020.csv (3014, 12)
02-05-2020.csv (71, 6)
02-04-2020.csv (70, 6)
02-08-2020.csv (72, 6)
02-09-2020.csv (72, 6)
03-01-2020.csv (125, 8)
04-12-2020.csv (2989, 12)
04-13-2020.csv (3002, 12)
02-02-2020.csv (67, 6)
02-03-2020.csv (68, 6)
01-31-2020.csv (62, 6)
01-30-2020.csv (58, 6)
02-21-2020.csv (84, 6)
02-20-2020.csv (76, 6)
03-23-2020.csv (3413, 12)
03-22-2020.csv (3417, 12)
04-05-2020.csv (2764, 12)
04-04-2020.csv (2679, 12)
03-17-2020.csv (276, 8)
03-16-2020.csv (272, 8)
02-15-2020.csv (75, 6)
02-14-2020.csv (75, 6)
01-26-2020.csv (47, 6)
01-27-2020.csv (51, 6)
03-29-2020.csv (3434, 12)
03-28-2020.csv (3430, 12)
03-04-2020.csv (160, 8)
03-05-2020.csv (173, 8)
02-06-2020.csv (71, 6)
02-07-2020.csv (72, 6)
03-30-2020.csv (3439, 12)
03-31-2020.csv (2434, 12)
04-01-2020.csv (2483, 12)
03-13-2020.csv (230, 8)
03-12-2020.csv (218, 8)
01-22-2020.csv (38, 6)
01-23-2020.csv (46, 6)
02-11-2020.csv (73, 6)
02-10-2020.csv (72, 6)
02-25-2020.csv (94, 6)
02-24-2020.csv (90, 6)
03-19-2020.csv (292, 8)
03-18-2020.csv (284, 8)
01-28-2020.csv (52, 6)
01-29-2020.csv (54, 6)
03-27-2020.csv (3429, 12)
03-26-2020.csv (3421, 12)
04-06-2020.csv (2809, 12)
04-07-2020.csv (2857, 12)
02-28-2020.csv (114, 6)
02-29-2020.csv (119, 6)
03-14-2020.csv (249, 8)
03-15-2020.csv (258, 8)
01-25-2020.csv (44, 6)
01-24-2020.csv (41, 6)
02-16-2020.csv (75, 6)
02-17-2020.csv (75, 6)
02-22-2020.csv (84, 6)
02-23-2020.csv (85, 6)
03-20-2020.csv (299, 8)
03-21-2020.csv (304, 8)
03-03-2020.csv (151, 8)
03-02-2020.csv (141, 8)
04-11-2020.csv (2966, 12)
04-10-2020.csv (2942, 12)
02-01-2020.csv (67, 6)
03-09-2020.csv (266, 8)
03-08-2020.csv (255, 8)
# Sort the file list

for file in sorted(glob.glob(dataset_path)):
    print(file.rsplit(os.path.sep,1)[1] # isolate file name
          , pd.read_csv(file).shape) # calculate data shape (rows, columns)
01-22-2020.csv (38, 6)
01-23-2020.csv (46, 6)
01-24-2020.csv (41, 6)
01-25-2020.csv (44, 6)
01-26-2020.csv (47, 6)
01-27-2020.csv (51, 6)
01-28-2020.csv (52, 6)
01-29-2020.csv (54, 6)
01-30-2020.csv (58, 6)
01-31-2020.csv (62, 6)
02-01-2020.csv (67, 6)
02-02-2020.csv (67, 6)
02-03-2020.csv (68, 6)
02-04-2020.csv (70, 6)
02-05-2020.csv (71, 6)
02-06-2020.csv (71, 6)
02-07-2020.csv (72, 6)
02-08-2020.csv (72, 6)
02-09-2020.csv (72, 6)
02-10-2020.csv (72, 6)
02-11-2020.csv (73, 6)
02-12-2020.csv (73, 6)
02-13-2020.csv (74, 6)
02-14-2020.csv (75, 6)
02-15-2020.csv (75, 6)
02-16-2020.csv (75, 6)
02-17-2020.csv (75, 6)
02-18-2020.csv (75, 6)
02-19-2020.csv (76, 6)
02-20-2020.csv (76, 6)
02-21-2020.csv (84, 6)
02-22-2020.csv (84, 6)
02-23-2020.csv (85, 6)
02-24-2020.csv (90, 6)
02-25-2020.csv (94, 6)
02-26-2020.csv (101, 6)
02-27-2020.csv (105, 6)
02-28-2020.csv (114, 6)
02-29-2020.csv (119, 6)
03-01-2020.csv (125, 8)
03-02-2020.csv (141, 8)
03-03-2020.csv (151, 8)
03-04-2020.csv (160, 8)
03-05-2020.csv (173, 8)
03-06-2020.csv (199, 8)
03-07-2020.csv (225, 8)
03-08-2020.csv (255, 8)
03-09-2020.csv (266, 8)
03-10-2020.csv (206, 8)
03-11-2020.csv (216, 8)
03-12-2020.csv (218, 8)
03-13-2020.csv (230, 8)
03-14-2020.csv (249, 8)
03-15-2020.csv (258, 8)
03-16-2020.csv (272, 8)
03-17-2020.csv (276, 8)
03-18-2020.csv (284, 8)
03-19-2020.csv (292, 8)
03-20-2020.csv (299, 8)
03-21-2020.csv (304, 8)
03-22-2020.csv (3417, 12)
03-23-2020.csv (3413, 12)
03-24-2020.csv (3416, 12)
03-25-2020.csv (3420, 12)
03-26-2020.csv (3421, 12)
03-27-2020.csv (3429, 12)
03-28-2020.csv (3430, 12)
03-29-2020.csv (3434, 12)
03-30-2020.csv (3439, 12)
03-31-2020.csv (2434, 12)
04-01-2020.csv (2483, 12)
04-02-2020.csv (2569, 12)
04-03-2020.csv (2625, 12)
04-04-2020.csv (2679, 12)
04-05-2020.csv (2764, 12)
04-06-2020.csv (2809, 12)
04-07-2020.csv (2857, 12)
04-08-2020.csv (2883, 12)
04-09-2020.csv (2911, 12)
04-10-2020.csv (2942, 12)
04-11-2020.csv (2966, 12)
04-12-2020.csv (2989, 12)
04-13-2020.csv (3002, 12)
04-14-2020.csv (3014, 12)
04-15-2020.csv (3027, 12)
import os

for file in sorted(glob.glob(dataset_path)):
       print(file.rsplit(os.path.sep,1)[1] # isolate file name
             , pd.read_csv(file).shape # calculate data shape (rows, columns)
             , os.stat(file).st_size/(1024*1024)) # get the individual file size in MB
01-22-2020.csv (38, 6) 0.0015974044799804688
01-23-2020.csv (46, 6) 0.00174713134765625
01-24-2020.csv (41, 6) 0.0016164779663085938
01-25-2020.csv (44, 6) 0.0017070770263671875
01-26-2020.csv (47, 6) 0.00180816650390625
01-27-2020.csv (51, 6) 0.0019540786743164062
01-28-2020.csv (52, 6) 0.0020046234130859375
01-29-2020.csv (54, 6) 0.00208282470703125
01-30-2020.csv (58, 6) 0.0022258758544921875
01-31-2020.csv (62, 6) 0.0024499893188476562
02-01-2020.csv (67, 6) 0.0026559829711914062
02-02-2020.csv (67, 6) 0.0030050277709960938
02-03-2020.csv (68, 6) 0.0030527114868164062
02-04-2020.csv (70, 6) 0.0031423568725585938
02-05-2020.csv (71, 6) 0.0031862258911132812
02-06-2020.csv (71, 6) 0.0031900405883789062
02-07-2020.csv (72, 6) 0.0032444000244140625
02-08-2020.csv (72, 6) 0.0032510757446289062
02-09-2020.csv (72, 6) 0.0032701492309570312
02-10-2020.csv (72, 6) 0.0032739639282226562
02-11-2020.csv (73, 6) 0.0033283233642578125
02-12-2020.csv (73, 6) 0.0033311843872070312
02-13-2020.csv (74, 6) 0.0033807754516601562
02-14-2020.csv (75, 6) 0.0033416748046875
02-15-2020.csv (75, 6) 0.0033464431762695312
02-16-2020.csv (75, 6) 0.0033473968505859375
02-17-2020.csv (75, 6) 0.0034189224243164062
02-18-2020.csv (75, 6) 0.003421783447265625
02-19-2020.csv (76, 6) 0.0033855438232421875
02-20-2020.csv (76, 6) 0.0033864974975585938
02-21-2020.csv (84, 6) 0.0037584304809570312
02-22-2020.csv (84, 6) 0.0038251876831054688
02-23-2020.csv (85, 6) 0.0038623809814453125
02-24-2020.csv (90, 6) 0.004063606262207031
02-25-2020.csv (94, 6) 0.004210472106933594
02-26-2020.csv (101, 6) 0.004431724548339844
02-27-2020.csv (105, 6) 0.004573822021484375
02-28-2020.csv (114, 6) 0.004901885986328125
02-29-2020.csv (119, 6) 0.0051116943359375
03-01-2020.csv (125, 8) 0.0073528289794921875
03-02-2020.csv (141, 8) 0.008268356323242188
03-03-2020.csv (151, 8) 0.008862495422363281
03-04-2020.csv (160, 8) 0.009377479553222656
03-05-2020.csv (173, 8) 0.010189056396484375
03-06-2020.csv (199, 8) 0.011743545532226562
03-07-2020.csv (225, 8) 0.013422012329101562
03-08-2020.csv (255, 8) 0.015294075012207031
03-09-2020.csv (266, 8) 0.01593017578125
03-10-2020.csv (206, 8) 0.011493682861328125
03-11-2020.csv (216, 8) 0.0117645263671875
03-12-2020.csv (218, 8) 0.011874198913574219
03-13-2020.csv (230, 8) 0.012880325317382812
03-14-2020.csv (249, 8) 0.013600349426269531
03-15-2020.csv (258, 8) 0.014292716979980469
03-16-2020.csv (272, 8) 0.014932632446289062
03-17-2020.csv (276, 8) 0.015168190002441406
03-18-2020.csv (284, 8) 0.015623092651367188
03-19-2020.csv (292, 8) 0.016092300415039062
03-20-2020.csv (299, 8) 0.01649951934814453
03-21-2020.csv (304, 8) 0.016945838928222656
03-22-2020.csv (3417, 12) 0.3128471374511719
03-23-2020.csv (3413, 12) 0.3327207565307617
03-24-2020.csv (3416, 12) 0.3328866958618164
03-25-2020.csv (3420, 12) 0.33330821990966797
03-26-2020.csv (3421, 12) 0.3335227966308594
03-27-2020.csv (3429, 12) 0.334259033203125
03-28-2020.csv (3430, 12) 0.31421375274658203
03-29-2020.csv (3434, 12) 0.3146400451660156
03-30-2020.csv (3439, 12) 0.31514453887939453
03-31-2020.csv (2434, 12) 0.23666667938232422
04-01-2020.csv (2483, 12) 0.24163150787353516
04-02-2020.csv (2569, 12) 0.23265457153320312
04-03-2020.csv (2625, 12) 0.25572681427001953
04-04-2020.csv (2679, 12) 0.2430257797241211
04-05-2020.csv (2764, 12) 0.26952648162841797
04-06-2020.csv (2809, 12) 0.25507640838623047
04-07-2020.csv (2857, 12) 0.2788209915161133
04-08-2020.csv (2883, 12) 0.2814149856567383
04-09-2020.csv (2911, 12) 0.2843132019042969
04-10-2020.csv (2942, 12) 0.2873687744140625
04-11-2020.csv (2966, 12) 0.2899484634399414
04-12-2020.csv (2989, 12) 0.2915000915527344
04-13-2020.csv (3002, 12) 0.2954998016357422
04-14-2020.csv (3014, 12) 0.2967643737792969
04-15-2020.csv (3027, 12) 0.29817771911621094

What’s the goal when preparing this dataset for visualization?

We are going to use Plotly (which renders with plotly.js under the hood) to visualize the data, and that means the data needs to be ‘clean’. The data has to arrive in exactly the format and shape the visualization expects, because Plotly won’t make assumptions about how to draw your data points. I’ve seen some beautifully picassoesque and Dr Seuss-like graphs come out of malformed data.

In this instance we want to create a graph that shows us the number of COVID cases per country per day, and animate through the days. To do this we will need the data in the following format:

Country Name    Country ISO Code    Day           Number of Cases
Switzerland     CHE                 2020-02-29    xxx
France          FRA                 2020-02-29    yyy
Switzerland     CHE                 2020-03-01    zzz
France          FRA                 2020-03-01    aaa
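
As a preview of where we’re heading, here is a minimal sketch of how Plotly Express consumes a tidy frame in exactly this shape. The frame and its case counts are made up for illustration; only the px.choropleth call itself reflects the API we’ll use later:

import pandas as pd
import plotly.express as px

# a tiny hand-made frame in the target shape (values are placeholders)
tidy = pd.DataFrame({
    'Country':   ['Switzerland', 'France', 'Switzerland', 'France'],
    'ISO_3':     ['CHE', 'FRA', 'CHE', 'FRA'],
    'Day':       ['2020-02-29', '2020-02-29', '2020-03-01', '2020-03-01'],
    'Confirmed': [10, 57, 18, 100],
})

# One row per country per day is all px.choropleth needs:
# locations takes the ISO-3 codes, color is the value to plot,
# and animation_frame gives us the day-by-day slider.
fig = px.choropleth(tidy,
                    locations='ISO_3',
                    color='Confirmed',
                    hover_name='Country',
                    animation_frame='Day')
fig.show()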

To get to that format we’ll have to transform whatever shape the chosen data set is in into the shape above. To do that we’ll load and harmonize the daily reports with the following steps:

  1. Load the data into memory
  2. Analyze the dataset and make decisions about what to do
  3. Consolidate data which has changed names over time
  4. Clean up the country names to identify the ISO codes for countries
  5. Visualize the data on a world map
  6. Tweak the visualization
  7. Adapt our visualization to look at cases per 100k instead of number of cases

1. Load the data set into a pandas dataframe

Load all the files into memory and see how we can combine them into a single dataset to use for visualization.

import glob 
# Create two lists to store our file metadata and file data
all_data = []
meta_data = []

# For every file in our data set path
for file in sorted(glob.glob(dataset_path)):
    
    # 1. Read the file to a temporary data frame
    df = pd.read_csv(file)
    
    # 2. Append a dictionary with the file meta_data into the metadata list
    meta_data.append( {  'file_name': file.rsplit(os.path.sep,1)[1]
                       , 'num_rows': df.shape[0]
                       , 'num_cols': df.shape[1]
                       , 'col_names': '\t'.join(sorted(list(df.columns)))} )
    
    # 3. Add the source file name to the loaded data
    df['source_file'] = file
    
    # 4. Add the loaded data with the file name to a list with all the loaded data
    all_data.append(df)

# 5. Create a table/dataframe out of our meta_data 
meta_data = pd.DataFrame(meta_data)

# show the metadata in jupyter notebook
meta_data
file_name num_rows num_cols col_names
0 01-22-2020.csv 38 6 Confirmed\tCountry/Region\tDeaths\tLast Update...
1 01-23-2020.csv 46 6 Confirmed\tCountry/Region\tDeaths\tLast Update...
2 01-24-2020.csv 41 6 Confirmed\tCountry/Region\tDeaths\tLast Update...
3 01-25-2020.csv 44 6 Confirmed\tCountry/Region\tDeaths\tLast Update...
4 01-26-2020.csv 47 6 Confirmed\tCountry/Region\tDeaths\tLast Update...
... ... ... ... ...
80 04-11-2020.csv 2966 12 Active\tAdmin2\tCombined_Key\tConfirmed\tCount...
81 04-12-2020.csv 2989 12 Active\tAdmin2\tCombined_Key\tConfirmed\tCount...
82 04-13-2020.csv 3002 12 Active\tAdmin2\tCombined_Key\tConfirmed\tCount...
83 04-14-2020.csv 3014 12 Active\tAdmin2\tCombined_Key\tConfirmed\tCount...
84 04-15-2020.csv 3027 12 Active\tAdmin2\tCombined_Key\tConfirmed\tCount...

85 rows × 4 columns


pd.set_option('display.max_colwidth', 150)

# output result to notebook window
meta_data.groupby(['num_cols'])\
         .agg({ 'num_rows': 'sum'
              , 'file_name': sorted
              , 'col_names': set })
num_rows file_name col_names
num_cols
6 2818 [01-22-2020.csv, 01-23-2020.csv, 01-24-2020.csv, 01-25-2020.csv, 01-26-2020.csv, 01-27-2020.csv, 01-28-2020.csv, 01-29-2020.csv, 01-30-2020.csv, 0... {Confirmed\tCountry/Region\tDeaths\tLast Update\tProvince/State\tRecovered}
8 4799 [03-01-2020.csv, 03-02-2020.csv, 03-03-2020.csv, 03-04-2020.csv, 03-05-2020.csv, 03-06-2020.csv, 03-07-2020.csv, 03-08-2020.csv, 03-09-2020.csv, 0... {Confirmed\tCountry/Region\tDeaths\tLast Update\tLatitude\tLongitude\tProvince/State\tRecovered}
12 75773 [03-22-2020.csv, 03-23-2020.csv, 03-24-2020.csv, 03-25-2020.csv, 03-26-2020.csv, 03-27-2020.csv, 03-28-2020.csv, 03-29-2020.csv, 03-30-2020.csv, 0... {Active\tAdmin2\tCombined_Key\tConfirmed\tCountry_Region\tDeaths\tFIPS\tLast_Update\tLat\tLong_\tProvince_State\tRecovered}
d_data = pd.concat(all_data
                  , axis='index'
                  , join='outer'
                  , ignore_index=True
                  , sort=True)

d_data.head()
Active Admin2 Combined_Key Confirmed Country/Region Country_Region Deaths FIPS Last Update Last_Update Lat Latitude Long_ Longitude Province/State Province_State Recovered source_file
0 NaN NaN NaN 1.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Anhui NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
1 NaN NaN NaN 14.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Beijing NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
2 NaN NaN NaN 6.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Chongqing NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
3 NaN NaN NaN 1.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Fujian NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
4 NaN NaN NaN NaN Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Gansu NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv

2. Analyze the data set

pr = ProfileReport(d_data)
pr.to_widgets()
Summarize dataset: 100%|██████████| 32/32 [00:19<00:00,  1.68it/s, Completed]
Generate report structure: 100%|██████████| 1/1 [00:03<00:00,  3.54s/it]
Render widgets: 100%|██████████| 1/1 [00:13<00:00, 13.76s/it]
Active Admin2 Combined_Key Confirmed Country/Region Country_Region Deaths FIPS Last Update Last_Update Lat Latitude Long_ Longitude Province/State Province_State Recovered source_file
0 NaN NaN NaN 1.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Anhui NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
1 NaN NaN NaN 14.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Beijing NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
2 NaN NaN NaN 6.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Chongqing NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
3 NaN NaN NaN 1.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Fujian NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
4 NaN NaN NaN NaN Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Gansu NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
5 NaN NaN NaN 26.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Guangdong NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
6 NaN NaN NaN 2.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Guangxi NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
7 NaN NaN NaN 1.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Guizhou NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
8 NaN NaN NaN 4.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Hainan NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
9 NaN NaN NaN 1.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Hebei NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
Active Admin2 Combined_Key Confirmed Country/Region Country_Region Deaths FIPS Last Update Last_Update Lat Latitude Long_ Longitude Province/State Province_State Recovered source_file
83380 85608.0 NaN United Kingdom 98476.0 NaN United Kingdom 12868.0 NaN NaN 2020-04-15 22:56:32 55.378100 NaN -3.436000 NaN NaN NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv
83381 224.0 NaN Uruguay 492.0 NaN Uruguay 8.0 NaN NaN 2020-04-15 22:56:32 -32.522800 NaN -55.765800 NaN NaN NaN 260.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv
83382 1191.0 NaN Uzbekistan 1302.0 NaN Uzbekistan 4.0 NaN NaN 2020-04-15 22:56:32 41.377491 NaN 64.585262 NaN NaN NaN 107.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv
83383 77.0 NaN Venezuela 197.0 NaN Venezuela 9.0 NaN NaN 2020-04-15 22:56:32 6.423800 NaN -66.589700 NaN NaN NaN 111.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv
83384 96.0 NaN Vietnam 267.0 NaN Vietnam 0.0 NaN NaN 2020-04-15 22:56:32 14.058324 NaN 108.277199 NaN NaN NaN 171.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv
83385 309.0 NaN West Bank and Gaza 374.0 NaN West Bank and Gaza 2.0 NaN NaN 2020-04-15 22:56:32 31.952200 NaN 35.233200 NaN NaN NaN 63.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv
83386 6.0 NaN Western Sahara 6.0 NaN Western Sahara 0.0 NaN NaN 2020-04-15 22:56:32 24.215500 NaN -12.885800 NaN NaN NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv
83387 1.0 NaN Yemen 1.0 NaN Yemen 0.0 NaN NaN 2020-04-15 22:56:32 15.552727 NaN 48.516388 NaN NaN NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv
83388 16.0 NaN Zambia 48.0 NaN Zambia 2.0 NaN NaN 2020-04-15 22:56:32 -13.133897 NaN 27.849332 NaN NaN NaN 30.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv
83389 19.0 NaN Zimbabwe 23.0 NaN Zimbabwe 3.0 NaN NaN 2020-04-15 22:56:32 -19.015438 NaN 29.154857 NaN NaN NaN 1.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv

3. Consolidate data whose names may have changed over time

Some of the data columns we looked at earlier appear to have been renamed over time. Let’s verify that for each pair and then merge the columns.


d_data[ 
    (~pd.isnull(d_data['Country/Region'])) # select rows where Country/Region is NOT null
   &                                       # AND
    (~pd.isnull(d_data['Country_Region'])) # select rows where Country_Region is NOT null
]

# we want to see no rows from this query.
Active Admin2 Combined_Key Confirmed Country/Region Country_Region Deaths FIPS Last Update Last_Update Lat Latitude Long_ Longitude Province/State Province_State Recovered source_file

Now that we’ve confirmed that the columns never overlap, we can perform the merge operation.

d_data['Country'] = d_data['Country/Region'].fillna(d_data['Country_Region']) 
#          ↑                      ↑                              ↑           
#     3. Add to       2. put it here if its null            1. take this     
#     newcolumn  

d_data['Country/Region'].fillna(d_data['Country_Region']) 
0            Mainland China
1            Mainland China
2            Mainland China
3            Mainland China
4            Mainland China
                ...        
83385    West Bank and Gaza
83386        Western Sahara
83387                 Yemen
83388                Zambia
83389              Zimbabwe
Name: Country/Region, Length: 83390, dtype: object

This worked well for the first column pair, so let’s do the same for the others. This time we’ll automate the check that makes sure the columns don’t overlap, and stop execution if they ever do.

if len(d_data[ 
    (~pd.isnull(d_data['Province/State']))
   &(~pd.isnull(d_data['Province_State'])) 
    ]) > 0:
    raise ValueError('Columns overlap, further investigation is needed.')


# no overlapping rows found, safe to merge

d_data['ps'] = d_data['Province/State'].fillna(d_data['Province_State'])
#          ↑                    ↑                            ↑           
#     3. Add to     2. put it here if its null          1. take this     
#     newcolumn  

d_data['Province/State'].fillna(d_data['Province_State'])
0            Anhui
1          Beijing
2        Chongqing
3           Fujian
4            Gansu
           ...    
83385          NaN
83386          NaN
83387          NaN
83388          NaN
83389          NaN
Name: Province/State, Length: 83390, dtype: object
if len(d_data[ 
    (~pd.isnull(d_data['Long_']))
   &(~pd.isnull(d_data['Longitude'])) 
    ]) > 0:
    raise ValueError('Columns overlap, further investigation is needed.')


# no overlapping rows found, safe to merge

d_data['long_val'] = d_data['Long_'].fillna(d_data['Longitude'])
#          ↑                   ↑                         ↑           
#     3. Add to    2. put it here if its null       1. take this     
#     newcolumn  

d_data['Long_'].fillna(d_data['Longitude'])
0              NaN
1              NaN
2              NaN
3              NaN
4              NaN
           ...    
83385    35.233200
83386   -12.885800
83387    48.516388
83388    27.849332
83389    29.154857
Name: Long_, Length: 83390, dtype: float64
if len(d_data[ 
    (~pd.isnull(d_data['Lat']))
   &(~pd.isnull(d_data['Latitude'])) 
    ]) != 0:
    raise ValueError('Columns overlap, further investigation is needed.')

# no overlapping rows found, safe to merge

d_data['lat_val'] = d_data['Lat'].fillna(d_data['Latitude'])
#          ↑                   ↑                         ↑           
#     3. Add to    2. put it here if its null       1. take this     
#     newcolumn  

d_data['Lat'].fillna(d_data['Latitude'])
0              NaN
1              NaN
2              NaN
3              NaN
4              NaN
           ...    
83385    31.952200
83386    24.215500
83387    15.552727
83388   -13.133897
83389   -19.015438
Name: Lat, Length: 83390, dtype: float64
if len(d_data[ 
    (~pd.isnull(d_data['Last Update']))
   &(~pd.isnull(d_data['Last_Update'])) 
    ]) != 0:
    raise ValueError('Columns overlap, further investigation is needed.')

# no overlapping rows found, safe to merge

d_data['updated_on'] = d_data['Last Update'].fillna(d_data['Last_Update']) 
#          ↑                         ↑                            ↑           
#     3. Add to          2. put it here if its null          1. take this   

d_data['Last Update'].fillna(d_data['Last_Update']) 
0            1/22/2020 17:00
1            1/22/2020 17:00
2            1/22/2020 17:00
3            1/22/2020 17:00
4            1/22/2020 17:00
                ...         
83385    2020-04-15 22:56:32
83386    2020-04-15 22:56:32
83387    2020-04-15 22:56:32
83388    2020-04-15 22:56:32
83389    2020-04-15 22:56:32
Name: Last Update, Length: 83390, dtype: object
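
We’ve now repeated the same check-then-merge pattern four times, so it’s worth noting it could be wrapped in a small helper. Below is a minimal sketch; merge_column_pair is our own name for it, not something from the notebook or any library:

def merge_column_pair(df, old_col, new_col, target_col):
    """Merge two columns that never overlap into a single target column."""
    # stop if any row has values in both columns at once
    if (df[old_col].notna() & df[new_col].notna()).any():
        raise ValueError(f'Columns {old_col} and {new_col} overlap, further investigation is needed.')
    # take old_col where present, otherwise fall back to new_col
    df[target_col] = df[old_col].fillna(df[new_col])

# equivalent to the cells above, e.g.:
# merge_column_pair(d_data, 'Last Update', 'Last_Update', 'updated_on')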

With the columns merged, we can drop the old columns as they are no longer needed.

d_data.columns
Index(['Active', 'Admin2', 'Combined_Key', 'Confirmed', 'Country/Region',
       'Country_Region', 'Deaths', 'FIPS', 'Last Update', 'Last_Update', 'Lat',
       'Latitude', 'Long_', 'Longitude', 'Province/State', 'Province_State',
       'Recovered', 'source_file', 'Country', 'ps', 'long_val', 'lat_val',
       'updated_on'],
      dtype='object')
d_data.drop(columns=['Active', 'Admin2', 'Combined_Key'
                     , 'Country/Region', 'Country_Region'
                     , 'Last Update', 'Last_Update', 'Lat'
                     , 'Latitude', 'Long_', 'Longitude'
                     , 'Province/State', 'Province_State']
            , inplace=True)

To work with the data later on, let’s convert the updated_on column from strings to datetime values so that we can sort and group it properly.

# convert updated time to a datetime object to work with
d_data['updated_on'] = pd.to_datetime(d_data['updated_on'])
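
As a quick sanity check that the conversion worked, and a preview of the day-level truncation we’ll want when grouping by day, here is a small inspection cell (it doesn’t modify d_data):

# the column should now be datetime64, spanning Jan 22 to Apr 15
print(d_data['updated_on'].dtype)
print(d_data['updated_on'].min(), '->', d_data['updated_on'].max())

# truncate timestamps to whole days, e.g. for later grouping
d_data['updated_on'].dt.floor('D').head()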

4. Clean up the country names to identify the ISO codes for countries

To add data onto the map we need the ISO codes for all the countries. For that we can use the pycountry library we installed at the beginning of the course. Let’s first see whether there will be any issues by looking at the country values.

set(d_data['Country'].unique())
{' Azerbaijan',
 'Afghanistan',
 'Albania',
 'Algeria',
 'Andorra',
 'Angola',
 'Antigua and Barbuda',
 'Argentina',
 'Armenia',
 'Aruba',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bahamas, The',
 'Bahrain',
 'Bangladesh',
 'Barbados',
 'Belarus',
 'Belgium',
 'Belize',
 'Benin',
 'Bhutan',
 'Bolivia',
 'Bosnia and Herzegovina',
 'Botswana',
 'Brazil',
 'Brunei',
 'Bulgaria',
 'Burkina Faso',
 'Burma',
 'Burundi',
 'Cabo Verde',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Cape Verde',
 'Cayman Islands',
 'Central African Republic',
 'Chad',
 'Channel Islands',
 'Chile',
 'China',
 'Colombia',
 'Congo (Brazzaville)',
 'Congo (Kinshasa)',
 'Costa Rica',
 "Cote d'Ivoire",
 'Croatia',
 'Cruise Ship',
 'Cuba',
 'Curacao',
 'Cyprus',
 'Czech Republic',
 'Czechia',
 'Denmark',
 'Diamond Princess',
 'Djibouti',
 'Dominica',
 'Dominican Republic',
 'East Timor',
 'Ecuador',
 'Egypt',
 'El Salvador',
 'Equatorial Guinea',
 'Eritrea',
 'Estonia',
 'Eswatini',
 'Ethiopia',
 'Faroe Islands',
 'Fiji',
 'Finland',
 'France',
 'French Guiana',
 'Gabon',
 'Gambia',
 'Gambia, The',
 'Georgia',
 'Germany',
 'Ghana',
 'Gibraltar',
 'Greece',
 'Greenland',
 'Grenada',
 'Guadeloupe',
 'Guam',
 'Guatemala',
 'Guernsey',
 'Guinea',
 'Guinea-Bissau',
 'Guyana',
 'Haiti',
 'Holy See',
 'Honduras',
 'Hong Kong',
 'Hong Kong SAR',
 'Hungary',
 'Iceland',
 'India',
 'Indonesia',
 'Iran',
 'Iran (Islamic Republic of)',
 'Iraq',
 'Ireland',
 'Israel',
 'Italy',
 'Ivory Coast',
 'Jamaica',
 'Japan',
 'Jersey',
 'Jordan',
 'Kazakhstan',
 'Kenya',
 'Korea, South',
 'Kosovo',
 'Kuwait',
 'Kyrgyzstan',
 'Laos',
 'Latvia',
 'Lebanon',
 'Liberia',
 'Libya',
 'Liechtenstein',
 'Lithuania',
 'Luxembourg',
 'MS Zaandam',
 'Macao SAR',
 'Macau',
 'Madagascar',
 'Mainland China',
 'Malawi',
 'Malaysia',
 'Maldives',
 'Mali',
 'Malta',
 'Martinique',
 'Mauritania',
 'Mauritius',
 'Mayotte',
 'Mexico',
 'Moldova',
 'Monaco',
 'Mongolia',
 'Montenegro',
 'Morocco',
 'Mozambique',
 'Namibia',
 'Nepal',
 'Netherlands',
 'New Zealand',
 'Nicaragua',
 'Niger',
 'Nigeria',
 'North Ireland',
 'North Macedonia',
 'Norway',
 'Oman',
 'Others',
 'Pakistan',
 'Palestine',
 'Panama',
 'Papua New Guinea',
 'Paraguay',
 'Peru',
 'Philippines',
 'Poland',
 'Portugal',
 'Puerto Rico',
 'Qatar',
 'Republic of Ireland',
 'Republic of Korea',
 'Republic of Moldova',
 'Republic of the Congo',
 'Reunion',
 'Romania',
 'Russia',
 'Russian Federation',
 'Rwanda',
 'Saint Barthelemy',
 'Saint Kitts and Nevis',
 'Saint Lucia',
 'Saint Martin',
 'Saint Vincent and the Grenadines',
 'San Marino',
 'Sao Tome and Principe',
 'Saudi Arabia',
 'Senegal',
 'Serbia',
 'Seychelles',
 'Sierra Leone',
 'Singapore',
 'Slovakia',
 'Slovenia',
 'Somalia',
 'South Africa',
 'South Korea',
 'South Sudan',
 'Spain',
 'Sri Lanka',
 'St. Martin',
 'Sudan',
 'Suriname',
 'Sweden',
 'Switzerland',
 'Syria',
 'Taipei and environs',
 'Taiwan',
 'Taiwan*',
 'Tanzania',
 'Thailand',
 'The Bahamas',
 'The Gambia',
 'Timor-Leste',
 'Togo',
 'Trinidad and Tobago',
 'Tunisia',
 'Turkey',
 'UK',
 'US',
 'Uganda',
 'Ukraine',
 'United Arab Emirates',
 'United Kingdom',
 'Uruguay',
 'Uzbekistan',
 'Vatican City',
 'Venezuela',
 'Viet Nam',
 'Vietnam',
 'West Bank and Gaza',
 'Western Sahara',
 'Yemen',
 'Zambia',
 'Zimbabwe',
 'occupied Palestinian territory'}
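
Scanning that list, several entries clearly won’t resolve to a country at all (‘Others’, ‘Cruise Ship’, ‘Diamond Princess’, ‘MS Zaandam’), and others use names pycountry may not recognize. As a quick illustration of the failure mode, pycountry’s search_fuzzy raises a LookupError when it finds no match:

import pycountry

try:
    pycountry.countries.search_fuzzy('Cruise Ship')
except LookupError as err:
    print('No ISO match for "Cruise Ship":', err)

So before we can map names to ISO codes we need to rename the resolvable entries and drop the rest, which is what the next cell does.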
import pycountry

d_data.loc[ d_data['Country']=='Mainland China', 'Country'] = 'China'
d_data.loc[ d_data['Country']=='Macau', 'Country'] = 'China'
d_data.loc[ d_data['Country']=='South Korea', 'Country'] = 'Republic of Korea'
d_data.loc[ d_data['Country']=='Korea, South', 'Country'] = 'Republic of Korea'
d_data.loc[ d_data['Country']=='Ivory Coast', 'Country'] = "Republic of Cote d'Ivoire"
d_data.loc[ d_data['Country']=='North Ireland', 'Country'] = "United Kingdom"
d_data.loc[ d_data['Country']=='Republic of Ireland', 'Country'] = "Ireland"
d_data.loc[ d_data['Country']=='St. Martin', 'Country'] = "France" 
d_data.loc[ d_data['Country']=='Iran (Islamic Republic of)', 'Country'] = "Iran"
d_data.loc[ d_data['Country']=='West Bank and Gaza', 'Country'] = "Palestine"
d_data.loc[ d_data['Country']=='Channel Islands', 'Country'] = "UK" ## Not technically, but effectively, great tax laws
d_data.loc[ d_data['Country'].isin([ 'Congo (Brazzaville)'
                                    ,'Congo (Kinshasa)']), 'Country'] = 'Congo'
d_data.loc[ d_data['Country']=='Gambia, The', 'Country'] = "Gambia" 
d_data.loc[ d_data['Country']=='Bahamas, The', 'Country'] = "Bahamas" 
d_data.loc[ d_data['Country']=='Cape Verde', 'Country'] = 'Republic of Cabo Verde'
d_data.loc[ d_data['Country']=='East Timor', 'Country'] = 'Timor-Leste'
d_data.loc[ d_data['Country']=='Laos', 'Country'] = "Lao People's Democratic Republic" 
d_data.loc[ d_data['Country']=="Burma", 'Country'] = 'Myanmar'

# drop disputed territories and entries that are not territories at all
d_data = d_data.drop(d_data[d_data['Country']=='Others'].index)
d_data = d_data.drop(d_data[d_data['Country']=='Taipei and environs'].index)
d_data = d_data.drop(d_data[d_data['Country']=='occupied Palestinian territory'].index) 
d_data = d_data.drop(d_data[d_data['Country']=='Taiwan*'].index)
d_data = d_data.drop(d_data[d_data['Country']=='Taiwan'].index)
d_data = d_data.drop(d_data[d_data['Country']=='MS Zaandam'].index)
d_data = d_data.drop(d_data[d_data['Country']=='Cruise Ship'].index)
d_data = d_data.drop(d_data[d_data['Country']=='Diamond Princess'].index)
countries = pd.Series(d_data['Country'].unique())

def get_iso(country_name):
    return {'Country':country_name, 'ISO_3': pycountry.countries.search_fuzzy(country_name)[0].alpha_3}

countries = pd.DataFrame(list(countries.map(get_iso)))
d_data.merge(countries
            , on='Country'
            , how='inner'
            , validate='m:1') 
Confirmed Deaths FIPS Recovered source_file Country ps long_val lat_val updated_on ISO_3
0 1.0 NaN NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv China Anhui NaN NaN 2020-01-22 17:00:00 CHN
1 14.0 NaN NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv China Beijing NaN NaN 2020-01-22 17:00:00 CHN
2 6.0 NaN NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv China Chongqing NaN NaN 2020-01-22 17:00:00 CHN
3 1.0 NaN NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv China Fujian NaN NaN 2020-01-22 17:00:00 CHN
4 NaN NaN NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv China Gansu NaN NaN 2020-01-22 17:00:00 CHN
... ... ... ... ... ... ... ... ... ... ... ...
83206 1.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-11-2020.csv Yemen NaN 48.516388 15.552727 2020-04-11 22:45:13 YEM
83207 1.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-12-2020.csv Yemen NaN 48.516388 15.552727 2020-04-12 23:17:00 YEM
83208 1.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-13-2020.csv Yemen NaN 48.516388 15.552727 2020-04-13 23:07:34 YEM
83209 1.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-14-2020.csv Yemen NaN 48.516388 15.552727 2020-04-14 23:33:12 YEM
83210 1.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv Yemen NaN 48.516388 15.552727 2020-04-15 22:56:32 YEM

83211 rows × 11 columns

d_data =\
    d_data.merge(countries
                , on='Country'
                , how='inner'
                , validate='m:1') 
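
The validate='m:1' flag is a cheap safety net: pandas raises pandas.errors.MergeError if the right-hand frame turns out not to be unique on the join key. A minimal sketch with a deliberately duplicated lookup row (toy data, not from our dataset):

import pandas as pd

left = pd.DataFrame({'Country': ['China', 'China'], 'Confirmed': [1, 2]})
right = pd.DataFrame({'Country': ['China', 'China'], 'ISO_3': ['CHN', 'CHN']})  # duplicate key!

try:
    left.merge(right, on='Country', how='inner', validate='m:1')
except pd.errors.MergeError as err:
    print('Merge validation failed:', err)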

4. a. Identify and fix issues with the updated_on column for our grouping purposes

pd.set_option("display.max_rows", 300) # increase the number of rows visible 

d_data[d_data['ISO_3'] == 'DEU']
Confirmed Deaths FIPS Recovered source_file Country ps long_val lat_val updated_on ISO_3
76227 4.0 NaN NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-28-2020.csv Germany Bavaria NaN NaN 2020-01-28 23:00:00 DEU
76228 4.0 NaN NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-29-2020.csv Germany Bavaria NaN NaN 2020-01-29 19:30:00 DEU
76229 4.0 NaN NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-30-2020.csv Germany Bavaria NaN NaN 2020-01-30 16:00:00 DEU
76230 5.0 NaN NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-31-2020.csv Germany Bavaria NaN NaN 2020-01-31 23:59:00 DEU
76231 8.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-01-2020.csv Germany NaN NaN NaN 2020-02-01 18:33:00 DEU
76232 10.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-02-2020.csv Germany NaN NaN NaN 2020-02-02 18:03:05 DEU
76233 12.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-03-2020.csv Germany NaN NaN NaN 2020-02-03 20:53:02 DEU
76234 12.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-04-2020.csv Germany NaN NaN NaN 2020-02-03 20:53:02 DEU
76235 12.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-05-2020.csv Germany NaN NaN NaN 2020-02-03 20:53:02 DEU
76236 12.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-06-2020.csv Germany NaN NaN NaN 2020-02-03 20:53:02 DEU
76237 13.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-07-2020.csv Germany NaN NaN NaN 2020-02-07 16:33:03 DEU
76238 13.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-08-2020.csv Germany NaN NaN NaN 2020-02-07 16:33:03 DEU
76239 14.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-09-2020.csv Germany NaN NaN NaN 2020-02-09 06:13:11 DEU
76240 14.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-10-2020.csv Germany NaN NaN NaN 2020-02-09 06:13:11 DEU
76241 16.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-11-2020.csv Germany NaN NaN NaN 2020-02-11 19:33:03 DEU
76242 16.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-12-2020.csv Germany NaN NaN NaN 2020-02-11 19:33:03 DEU
76243 16.0 0.0 NaN 1.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-13-2020.csv Germany NaN NaN NaN 2020-02-13 15:13:11 DEU
76244 16.0 0.0 NaN 1.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-14-2020.csv Germany NaN NaN NaN 2020-02-13 15:13:11 DEU
... ... ... ... ... ... ... ... ... ... ... ...
76304 131359.0 3294.0 NaN 68200.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-14-2020.csv Germany NaN 10.451526 51.165691 2020-04-14 23:33:12 DEU
76305 134753.0 3804.0 NaN 72600.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv Germany NaN 10.451526 51.165691 2020-04-15 22:56:32 DEU
d_data['report_date'] = \
     pd.to_datetime( 
         d_data['source_file'].astype('str')                                  # convert the values in the column to string
                              .str.rsplit(os.path.sep, n=1, expand=True)[1]   # extract the file name from the file path
                              .str.replace('.csv', '', regex=False)           # remove the .csv extension
       , dayfirst=False)                                                      # parse the extracted string as a date (month first)
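
An equivalent, cross-platform way to get the file name (a sketch, not part of the original notebook) is pathlib, whose .stem already drops the .csv extension:

from pathlib import Path

# same result as the rsplit/replace chain above
d_data['report_date'] = pd.to_datetime(
    d_data['source_file'].map(lambda p: Path(p).stem))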

5. Visualize the data on a map

To put the data we've prepared onto a map, we need to sum the number of confirmed cases by report date and country code.

d_data.groupby(by=['report_date', 'ISO_3'])\
      .agg({'Confirmed': 'sum'})
Confirmed
report_date ISO_3
2020-01-22 CHN 548.0
HKG 0.0
JPN 2.0
PRK 1.0
THA 2.0
... ... ...
2020-04-15 VNM 267.0
YEM 1.0
ZAF 2506.0
ZMB 48.0
ZWE 23.0

7866 rows × 1 columns


The grouped view above keeps report_date and ISO_3 in the index, so for plotting we group again, this time including the country name for hover text, and reset the index to get ordinary columns:

viz = \
d_data.groupby(by=['report_date', 'ISO_3', 'Country'])\
      .agg({'Confirmed': 'sum'})\
      .reset_index()

Plotly labels animation frames with strings, so we convert the report date column back to a string for display:

viz['report_date'] = viz['report_date'].dt.strftime( '%Y-%m-%d')
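# e.g. Timestamp('2020-04-14 00:00:00') becomes the string '2020-04-14'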
With locationmode='ISO-3', plotly matches the country codes against its built-in world geometry, so no separate GeoJSON file is needed. We start with a static map of a single day; the animated version follows in the next section.

subset = viz[viz['report_date'] == '2020-04-14']

px.choropleth(subset
             , locations='ISO_3'                  # column holding the ISO-3 country codes
             , locationmode='ISO-3'               # interpret those codes against plotly's built-in country outlines
             , color='Confirmed'
             , color_continuous_scale='Greens')\
  .write_html('choropleth_1.html', full_html=False, include_plotlyjs='cdn')  # save the map, loading plotly.js from the CDN

6. Tweak the visualization

Confirmed cases span several orders of magnitude across countries, so coloring by raw counts leaves everything but the largest outbreaks looking identical. Taking the square root compresses that range before we map it to color:

viz['sqrt_Confirmed'] = np.sqrt(viz['Confirmed'].clip(lower=1))  # square root of the confirmed cases,
                                                                 # clipping values below 1 up to 1 first
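
A quick check shows how strongly the square root flattens the spread between small and large outbreaks:

np.sqrt([1, 100, 10_000, 1_000_000])  # -> array([   1.,   10.,  100., 1000.])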

fig = \
px.choropleth(viz
             , locations='ISO_3'
             , locationmode='ISO-3'
             , animation_frame='report_date'
             , hover_data=['Confirmed', 'Country']
             , animation_group='ISO_3'
             , color='sqrt_Confirmed'
             , color_continuous_scale= [[0,  'rgba(0, 255, 0, .07)' ]
                                        ,[0.5, 'green']
                                        ,[1, 'palegreen']]
             , template='plotly_dark')
fig
The inline renderer can't display the interactive figure in this export, so we open it in a browser window instead:

fig.show(renderer='browser')

7. What if we look at the data by proportion of the population?

# Load population data for countries; gapminder's most recent year is 2007,
# so these populations are only approximate for 2020
pop_data = px.data.gapminder()

# Select the most recent year available
pop_data = pop_data[pop_data['year'] == 2007]

# Join the population data to our data set on the country ISO-3 code;
# dropna() discards countries that gapminder doesn't cover
pop_viz = viz.merge(pop_data[['pop', 'iso_alpha']]
          , left_on='ISO_3'
          , right_on='iso_alpha'
          , how='left').dropna()

# calculate the number of confirmed cases per 100k inhabitants
pop_viz['proportion'] = (pop_viz['Confirmed']/pop_viz['pop'])*100_000
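# Worked spot check (using gapminder's 2007 figure for Germany, 82,400,996):
# the last report above shows 134,753 confirmed cases on 2020-04-15, so
# 134753 / 82400996 * 100_000 ≈ 163.5 cases per 100k inhabitants.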

pop_viz['sqrt_proportion'] = np.sqrt(pop_viz['proportion'])

# map the data 
px.choropleth(pop_viz
             , locations='ISO_3'
             , locationmode='ISO-3'
             , animation_frame='report_date'
             , hover_data=['Confirmed', 'proportion', 'Country']
             , animation_group='ISO_3'
             , color='sqrt_proportion'
             , color_continuous_scale= [[0,  'rgba(0, 255, 0, .07)' ]
                                        ,[0.5, 'green']
                                        ,[1, 'palegreen']]
             , template='plotly_dark')\
.show(renderer='browser')

8. Extra problems to solve

After this course you should be armed with some tools to work on other datasets and problems. Here are some additional problems that will stretch your abilities a bit and require some extra reading to solve:

  1. Can you rewrite the column-merging code into a function, so that all you need to do is pass the function a dataframe, two input column names, and an output column name, and it checks whether the columns can be merged and merges them?
def merge_cols(input_df, col1, col2, outcol):
    ...
  2. Can you change the way we load the files to avoid needing to merge columns at all? (hint: you will need to use the pandas rename function, dataframe.rename(columns={'from': 'to'}) )

  3. Can you create a visualization from the time series dataset? (hint: you will need the dataframe.melt() command briefly demonstrated in Python 101, and you will likely need to clean up the country names before doing the ISO_3 lookups)

Solution for problem 1

def merge_cols(input_df, col1, col2, outcol):
    # if any row has a value in both columns, blindly merging would lose data
    if len(input_df[ 
        (~pd.isnull(input_df[col1]))
       &(~pd.isnull(input_df[col2])) 
        ]) != 0:
        raise ValueError('Columns overlap, further investigation is needed.')

    # otherwise take col1 and fill its gaps from col2
    input_df[outcol] = input_df[col1].fillna(input_df[col2])
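
A quick toy check (made-up two-row frame) exercises both paths:

toy = pd.DataFrame({'a': [1, None], 'b': [None, 2]})
merge_cols(toy, 'a', 'b', 'merged')   # fine: no row has both 'a' and 'b' filled, so merged == [1.0, 2.0]
# merge_cols(pd.DataFrame({'a': [1], 'b': [2]}), 'a', 'b', 'merged')   # would raise ValueError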

    
d_data = pd.concat(all_data
                  , axis='index'
                  , join='outer'
                  , ignore_index=True
                  , sort=True)

merge_cols(d_data, 'Last Update', 'Last_Update', 'updated_on')

d_data
Active Admin2 Combined_Key Confirmed Country/Region Country_Region Deaths FIPS Last Update Last_Update Lat Latitude Long_ Longitude Province/State Province_State Recovered source_file updated_on
0 NaN NaN NaN 1.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Anhui NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv 1/22/2020 17:00
1 NaN NaN NaN 14.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Beijing NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv 1/22/2020 17:00
2 NaN NaN NaN 6.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Chongqing NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv 1/22/2020 17:00
3 NaN NaN NaN 1.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Fujian NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv 1/22/2020 17:00
4 NaN NaN NaN NaN Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Gansu NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv 1/22/2020 17:00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
83385 309.0 NaN West Bank and Gaza 374.0 NaN West Bank and Gaza 2.0 NaN NaN 2020-04-15 22:56:32 31.952200 NaN 35.233200 NaN NaN NaN 63.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv 2020-04-15 22:56:32
83386 6.0 NaN Western Sahara 6.0 NaN Western Sahara 0.0 NaN NaN 2020-04-15 22:56:32 24.215500 NaN -12.885800 NaN NaN NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv 2020-04-15 22:56:32
83387 1.0 NaN Yemen 1.0 NaN Yemen 0.0 NaN NaN 2020-04-15 22:56:32 15.552727 NaN 48.516388 NaN NaN NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv 2020-04-15 22:56:32
83388 16.0 NaN Zambia 48.0 NaN Zambia 2.0 NaN NaN 2020-04-15 22:56:32 -13.133897 NaN 27.849332 NaN NaN NaN 30.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv 2020-04-15 22:56:32
83389 19.0 NaN Zimbabwe 23.0 NaN Zimbabwe 3.0 NaN NaN 2020-04-15 22:56:32 -19.015438 NaN 29.154857 NaN NaN NaN 1.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv 2020-04-15 22:56:32

83390 rows × 19 columns

Solution for problem 2


col_pairs = {
    'Country/Region': 'Country_Region'
    ,'Province/State': 'Province_State'
    ,'Long_': 'Longitude'
    ,'Lat': 'Latitude'
    ,'Last Update': 'Last_Update'
}

all_data = []
meta_data = []

# For every file in our data set path
for file in sorted(glob.glob(dataset_path)):
    
    # 1. Read the file into a temporary data frame, renaming the old-style
    #    column names to the new ones as we load
    df = pd.read_csv(file).rename(columns=col_pairs)
    
    # 2. Append a dictionary with the file meta_data into the metadata list
    meta_data.append( {  'file_name': file.rsplit(os.path.sep,1)[1]
                       , 'num_rows': df.shape[0]
                       , 'num_cols': df.shape[1]
                       , 'col_names': '\t'.join(sorted(list(df.columns)))} )
    
    # 3. Add the file name to the loaded data
    df['source_file'] = file
    
    # 4. Add the loaded data with the file name to a list with all the loaded data
    all_data.append(df)

# 5. Create a table/dataframe out of our meta_data 
meta_data = pd.DataFrame(meta_data)


# output result to notebook window
meta_data.groupby(['num_cols'])\
         .agg({ 'num_rows': 'sum'      # total rows per column-count group
              , 'file_name': sorted    # list the files in each group
              , 'col_names': set })    # the distinct column layouts in each group
num_rows file_name col_names
num_cols
6 2818 [01-22-2020.csv, 01-23-2020.csv, 01-24-2020.csv, 01-25-2020.csv, 01-26-2020.csv, 01-27-2020.csv, 01-28-2020.csv, 01-29-2020.csv, 01-30-2020.csv, 0... {Confirmed\tCountry_Region\tDeaths\tLast_Update\tProvince_State\tRecovered}
8 4799 [03-01-2020.csv, 03-02-2020.csv, 03-03-2020.csv, 03-04-2020.csv, 03-05-2020.csv, 03-06-2020.csv, 03-07-2020.csv, 03-08-2020.csv, 03-09-2020.csv, 0... {Confirmed\tCountry_Region\tDeaths\tLast_Update\tLatitude\tLongitude\tProvince_State\tRecovered}
12 75773 [03-22-2020.csv, 03-23-2020.csv, 03-24-2020.csv, 03-25-2020.csv, 03-26-2020.csv, 03-27-2020.csv, 03-28-2020.csv, 03-29-2020.csv, 03-30-2020.csv, 0... {Active\tAdmin2\tCombined_Key\tConfirmed\tCountry_Region\tDeaths\tFIPS\tLast_Update\tLatitude\tLongitude\tProvince_State\tRecovered}
d_data = pd.concat(all_data
                  , axis='index'
                  , join='outer'
                  , ignore_index=True
                  , sort=True)


d_data.head()
Active Admin2 Combined_Key Confirmed Country_Region Deaths FIPS Last_Update Latitude Longitude Province_State Recovered source_file
0 NaN NaN NaN 1.0 Mainland China NaN NaN 1/22/2020 17:00 NaN NaN Anhui NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
1 NaN NaN NaN 14.0 Mainland China NaN NaN 1/22/2020 17:00 NaN NaN Beijing NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
2 NaN NaN NaN 6.0 Mainland China NaN NaN 1/22/2020 17:00 NaN NaN Chongqing NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
3 NaN NaN NaN 1.0 Mainland China NaN NaN 1/22/2020 17:00 NaN NaN Fujian NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
4 NaN NaN NaN NaN Mainland China NaN NaN 1/22/2020 17:00 NaN NaN Gansu NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv

Solution for problem 3

df = pd.read_csv('us_jhu_data/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
df = df.melt( id_vars=['Province/State', 'Country/Region', 'Lat', 'Long']
            , var_name='report_date'
            , value_name='confirmed_cases')

df.rename(columns={'Country/Region': 'Country'}, inplace=True)
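
If melt is new to you, here is a toy example (made-up data) of the wide-to-long reshape it performs: every date column becomes a row.

toy = pd.DataFrame({'Country': ['A', 'B'], '1/22/20': [0, 1], '1/23/20': [2, 3]})
toy.melt(id_vars=['Country'], var_name='report_date', value_name='confirmed_cases')
#   Country report_date  confirmed_cases
# 0       A     1/22/20                0
# 1       B     1/22/20                1
# 2       A     1/23/20                2
# 3       B     1/23/20                3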
import pycountry

df.loc[ df['Country']=='Mainland China', 'Country'] = 'China'
df.loc[ df['Country']=='Macau', 'Country'] = 'China'
df.loc[ df['Country']=='South Korea', 'Country'] = 'Republic of Korea'
df.loc[ df['Country']=='Korea, South', 'Country'] = 'Republic of Korea'
df.loc[ df['Country']=='Ivory Coast', 'Country'] = "Republic of Cote d'Ivoire"
df.loc[ df['Country']=='North Ireland', 'Country'] = "United Kingdom"
df.loc[ df['Country']=='Republic of Ireland', 'Country'] = "Ireland"
df.loc[ df['Country']=='St. Martin', 'Country'] = "France" 
df.loc[ df['Country']=='Iran (Islamic Republic of)', 'Country'] = "Iran"
df.loc[ df['Country']=='West Bank and Gaza', 'Country'] = "Palestine"
df.loc[ df['Country']=='Channel Islands', 'Country'] = "UK" ## Not technically, but effectively, great tax laws
df.loc[ df['Country'].isin([ 'Congo (Brazzaville)'
                                    ,'Congo (Kinshasa)']), 'Country'] = 'Congo'
df.loc[ df['Country']=='Gambia, The', 'Country'] = "Gambia" 
df.loc[ df['Country']=='Bahamas, The', 'Country'] = "Bahamas" 
df.loc[ df['Country']=='Cape Verde', 'Country'] = 'Republic of Cabo Verde'
df.loc[ df['Country']=='East Timor', 'Country'] = 'Timor-Leste'
df.loc[ df['Country']=='Laos', 'Country'] = "Lao People's Democratic Republic" 
df.loc[ df['Country']=="Burma", 'Country'] = 'Myanmar'

# drop disputed territories and entries that aren't territories at all (cruise ships etc.)
df = df.drop(df[df['Country']=='Others'].index)
df = df.drop(df[df['Country']=='Taipei and environs'].index)
df = df.drop(df[df['Country']=='occupied Palestinian territory'].index) 
df = df.drop(df[df['Country']=='Taiwan*'].index)
df = df.drop(df[df['Country']=='Taiwan'].index)
df = df.drop(df[df['Country']=='MS Zaandam'].index)
df = df.drop(df[df['Country']=='Cruise Ship'].index)
df = df.drop(df[df['Country']=='Diamond Princess'].index)
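
The same drops can be written in a single pass with isin (an equivalent sketch):

drop_these = ['Others', 'Taipei and environs', 'occupied Palestinian territory'
             , 'Taiwan*', 'Taiwan', 'MS Zaandam', 'Cruise Ship', 'Diamond Princess']
df = df[~df['Country'].isin(drop_these)]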
countries = pd.Series(df['Country'].unique())

def get_iso(country_name):
    return {'Country':country_name, 'ISO_3': pycountry.countries.search_fuzzy(country_name)[0].alpha_3}
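# search_fuzzy returns a list of candidate matches, best match first, e.g.
# get_iso('Germany') -> {'Country': 'Germany', 'ISO_3': 'DEU'}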

countries = pd.DataFrame(list(countries.map(get_iso)))

df = df.merge(countries
            , on='Country'
            , how='inner'         # keep only countries that received an ISO code
            , validate='m:1')     # assert each country matches exactly one lookup row
df
Province/State Country Lat Long report_date confirmed_cases ISO_3
0 NaN Afghanistan 33.000000 65.000000 1/22/20 0 AFG
1 NaN Afghanistan 33.000000 65.000000 1/23/20 0 AFG
2 NaN Afghanistan 33.000000 65.000000 1/24/20 0 AFG
3 NaN Afghanistan 33.000000 65.000000 1/25/20 0 AFG
4 NaN Afghanistan 33.000000 65.000000 1/26/20 0 AFG
... ... ... ... ... ... ... ...
22180 NaN Yemen 15.552727 48.516388 4/11/20 1 YEM
22181 NaN Yemen 15.552727 48.516388 4/12/20 1 YEM
22182 NaN Yemen 15.552727 48.516388 4/13/20 1 YEM
22183 NaN Yemen 15.552727 48.516388 4/14/20 1 YEM
22184 NaN Yemen 15.552727 48.516388 4/15/20 1 YEM

22185 rows × 7 columns

As a final check, look for countries whose ISO_3 lookup failed; the empty result below means every country was matched:

df[pd.isnull(df['ISO_3'])]
Province/State Country Lat Long report_date confirmed_cases ISO_3
viz2 = \
df.groupby(by=['report_date', 'ISO_3', 'Country'])\
      .agg({'confirmed_cases': 'sum'})\
      .reset_index()
viz2['sqrt_Confirmed'] = np.sqrt(viz2['confirmed_cases'].clip(lower=1))  # square root of the confirmed cases,
                                                                         # clipping values below 1 up to 1 first

fig = \
px.choropleth(viz2
             , locations='ISO_3'
             , locationmode='ISO-3'
             , animation_frame='report_date'
             , hover_data=['confirmed_cases', 'Country']
             , animation_group='ISO_3'
             , color='sqrt_Confirmed'
             , color_continuous_scale= [[0,  'rgba(0, 255, 0, .07)' ]
                                        ,[0.5, 'green']
                                        ,[1, 'palegreen']]
             , template='plotly_dark')
fig
As before, the inline renderer can't display the figure here, so open it in a browser:

fig.show(renderer='browser')