Python 102 - Plotting data on a map

Before you start make sure that you have run the following commands in anaconda prompt:

conda activate analysis-101

conda install git

conda install pycountry

git clone https://github.com/CSSEGISandData/COVID-19.git us_jhu_data

Explore environment and downloaded data

As with any downloaded data, it’s prudent to explore what we actually received and whether there are obvious patterns.

Let’s start by exploring what we can see from our notebook. We can do this using the Jupyter magic command %ls, which behaves like the Unix ls command (or the Windows DIR command) and lists the contents of the directory the notebook was opened from.

%ls 
AR_pyTemplate.xlsx                  introduction-to-SIR-models-2.ipynb
choropleth_1.html                   introduction-to-SIR-models-3.ipynb
choropleth_2.html                   mapping-data-with-plotly.ipynb
choropleth_3.html                   us_jhu_data/
introduction-to-SIR-models-1.ipynb

So it looks like we’re in our own folder: we’ve got our notebooks and the data we just downloaded. Let’s have a look inside.

%ls us_jhu_data
README.md                       csse_covid_19_data/
archived_data/                  who_covid_19_situation_reports/

That command wasn’t super helpful; all we got back was a list of more folders to look into. Let’s just go nuclear and look at everything together. To do that we can use the -R switch to recursively expand all the files and folders (on Windows, the equivalent switch is /S).

%ls -R us_jhu_data/ 
# %ls /S us_jhu_data
README.md                       csse_covid_19_data/
archived_data/                  who_covid_19_situation_reports/

us_jhu_data//archived_data:
README.md                    archived_time_series/
archived_daily_case_updates/

us_jhu_data//archived_data/archived_daily_case_updates:
01-21-2020_2200.csv  01-29-2020_1430.csv  02-06-2020_1318.csv
01-22-2020_1200.csv  01-29-2020_2100.csv  02-06-2020_2005.csv
01-23-2020_1200.csv  01-30-2020_1100.csv  02-07-2020_2024.csv
01-24-2020_0000.csv  01-30-2020_2130.csv  02-08-2020_1024.csv
01-24-2020_1200.csv  01-31-2020_1400.csv  02-08-2020_2304.csv
01-25-2020_0000.csv  02-01-2020_1000.csv  02-09-2020_1030.csv
01-25-2020_1200.csv  02-01-2020_1800.csv  02-09-2020_2320.csv
01-25-2020_2200.csv  02-01-2020_2300.csv  02-10-2020_1030.csv
01-26-2020_1100.csv  02-02-2020_0500.csv  02-10-2020_1930.csv
01-26-2020_2300.csv  02-02-2020_1945.csv  02-11-2020_1050.csv
01-27-2020_0900.csv  02-02-2020_2100.csv  02-11-2020_2044.csv
01-27-2020_1900.csv  02-03-2020_1230.csv  02-12-2020_1020.csv
01-27-2020_2030.csv  02-03-2020_2140.csv  02-12-2020_2200.csv
01-28-2020_1300.csv  02-04-2020_0800.csv  02-13-2020_1000.csv
01-28-2020_1800.csv  02-04-2020_1150.csv  02-13-2020_2115.csv
01-28-2020_2300.csv  02-04-2020_2200.csv  02-14-2020_1123.csv
01-29-2020_1330.csv  02-05-2020_1220.csv  README.md

us_jhu_data//archived_data/archived_time_series:
README.md
time_series_19-covid-Confirmed_archived_0325.csv
time_series_19-covid-Deaths_archived_0325.csv
time_series_19-covid-Recovered_archived_0325.csv
time_series_2019-ncov-Confirmed.csv
time_series_2019-ncov-Deaths.csv
time_series_2019-ncov-Recovered.csv

us_jhu_data//csse_covid_19_data:
README.md                       csse_covid_19_daily_reports_us/
UID_ISO_FIPS_LookUp_Table.csv   csse_covid_19_time_series/
csse_covid_19_daily_reports/

us_jhu_data//csse_covid_19_data/csse_covid_19_daily_reports:
01-22-2020.csv  02-09-2020.csv  02-27-2020.csv  03-16-2020.csv  04-03-2020.csv
01-23-2020.csv  02-10-2020.csv  02-28-2020.csv  03-17-2020.csv  04-04-2020.csv
01-24-2020.csv  02-11-2020.csv  02-29-2020.csv  03-18-2020.csv  04-05-2020.csv
01-25-2020.csv  02-12-2020.csv  03-01-2020.csv  03-19-2020.csv  04-06-2020.csv
01-26-2020.csv  02-13-2020.csv  03-02-2020.csv  03-20-2020.csv  04-07-2020.csv
01-27-2020.csv  02-14-2020.csv  03-03-2020.csv  03-21-2020.csv  04-08-2020.csv
01-28-2020.csv  02-15-2020.csv  03-04-2020.csv  03-22-2020.csv  04-09-2020.csv
01-29-2020.csv  02-16-2020.csv  03-05-2020.csv  03-23-2020.csv  04-10-2020.csv
01-30-2020.csv  02-17-2020.csv  03-06-2020.csv  03-24-2020.csv  04-11-2020.csv
01-31-2020.csv  02-18-2020.csv  03-07-2020.csv  03-25-2020.csv  04-12-2020.csv
02-01-2020.csv  02-19-2020.csv  03-08-2020.csv  03-26-2020.csv  04-13-2020.csv
02-02-2020.csv  02-20-2020.csv  03-09-2020.csv  03-27-2020.csv  04-14-2020.csv
02-03-2020.csv  02-21-2020.csv  03-10-2020.csv  03-28-2020.csv  04-15-2020.csv
02-04-2020.csv  02-22-2020.csv  03-11-2020.csv  03-29-2020.csv  README.md
02-05-2020.csv  02-23-2020.csv  03-12-2020.csv  03-30-2020.csv
02-06-2020.csv  02-24-2020.csv  03-13-2020.csv  03-31-2020.csv
02-07-2020.csv  02-25-2020.csv  03-14-2020.csv  04-01-2020.csv
02-08-2020.csv  02-26-2020.csv  03-15-2020.csv  04-02-2020.csv

us_jhu_data//csse_covid_19_data/csse_covid_19_daily_reports_us:
04-12-2020.csv  04-13-2020.csv  04-14-2020.csv  04-15-2020.csv  README.md

us_jhu_data//csse_covid_19_data/csse_covid_19_time_series:
README.md
time_series_covid19_confirmed_US.csv
time_series_covid19_confirmed_global.csv
time_series_covid19_deaths_US.csv
time_series_covid19_deaths_global.csv
time_series_covid19_recovered_global.csv

us_jhu_data//who_covid_19_situation_reports:
README.md                         who_covid_19_sit_rep_time_series/
who_covid_19_sit_rep_pdfs/

us_jhu_data//who_covid_19_situation_reports/who_covid_19_sit_rep_pdfs:
20200121-sitrep-1-2019-ncov.pdf     20200222-sitrep-33-covid-19.pdf
20200122-sitrep-2-2019-ncov.pdf     20200223-sitrep-34-covid-19.pdf
20200123-sitrep-3-2019-ncov.pdf     20200224-sitrep-35-covid-19.pdf
20200124-sitrep-4-2019-ncov.pdf     20200225-sitrep-36-covid-19.pdf
20200125-sitrep-5-2019-ncov.pdf     20200226-sitrep-37-covid-19.pdf
20200126-sitrep-6-2019--ncov.pdf    20200227-sitrep-38-covid-19.pdf
20200127-sitrep-7-2019--ncov.pdf    20200228-sitrep-39-covid-19.pdf
20200128-sitrep-8-ncov-cleared.pdf  20200229-sitrep-40-covid-19.pdf
20200129-sitrep-9-ncov-v2.pdf       20200301-sitrep-41-covid-19.pdf
20200130-sitrep-10-ncov.pdf         20200302-sitrep-42-covid-19.pdf
20200131-sitrep-11-ncov.pdf         20200303-sitrep-43-covid-19.pdf
20200201-sitrep-12-ncov.pdf         20200304-sitrep-44-covid-19.pdf
20200202-sitrep-13-ncov-v3.pdf      20200305-sitrep-45-covid-19.pdf
20200203-sitrep-14-ncov.pdf         20200306-sitrep-46-covid-19.pdf
20200204-sitrep-15-ncov.pdf         20200307-sitrep-47-covid-19.pdf
20200205-sitrep-16-ncov.pdf         20200308-sitrep-48-covid-19.pdf
20200206-sitrep-17-ncov.pdf         20200309-sitrep-49-covid-19.pdf
20200207-sitrep-18-ncov.pdf         20200310-sitrep-50-covid-19.pdf
20200208-sitrep-19-ncov.pdf         20200311-sitrep-51-covid-19.pdf
20200209-sitrep-20-ncov.pdf         20200312-sitrep-52-covid-19.pdf
20200210-sitrep-21-ncov.pdf         20200313-sitrep-53-covid-19.pdf
20200211-sitrep-22-ncov.pdf         20200314-sitrep-54-covid-19.pdf
20200212-sitrep-23-ncov.pdf         20200315-sitrep-55-covid-19.pdf
20200213-sitrep-24-covid-19.pdf     20200316-sitrep-56-covid-19.pdf
20200214-sitrep-25-covid-19.pdf     20200317-sitrep-57-covid-19.pdf
20200215-sitrep-26-covid-19.pdf     20200318-sitrep-58-covid-19.pdf
20200216-sitrep-27-covid-19.pdf     20200319-sitrep-59-covid-19.pdf
20200217-sitrep-28-covid-19.pdf     20200320-sitrep-60-covid-19.pdf
20200218-sitrep-29-covid-19.pdf     20200321-sitrep-61-covid-19.pdf
20200219-sitrep-30-covid-19.pdf     20200322-sitrep-62-covid-19.pdf
20200220-sitrep-31-covid-19.pdf     20200323-sitrep-63-covid-19.pdf
20200221-sitrep-32-covid-19.pdf     20200324-sitrep-64-covid-19.pdf

us_jhu_data//who_covid_19_situation_reports/who_covid_19_sit_rep_time_series:
who_covid_19_sit_rep_time_series.csv

From the output of ls we can see that we have the following data sets:

  - archived daily case updates
  - archived time series data
  - a lookup table to convert IDs to US FIPS codes (county codes)
  - daily reports from Jan 22 to Apr 15
  - time series data with different focuses
  - unstructured WHO situation reports
  - a structured WHO time series of the situation reports

For brevity, we will only explore the following two data sets:

  - time series data with different focuses
  - daily reports from Jan 22 to Apr 15

Let’s start with the global time series data.

Start the Python engines, import libraries

Before we start we need to load the libraries we’ll use. Below is a collection of libraries that I reach for time and time again, so it’s generally prudent to load them all up front rather than stopping to load them later.

import pandas as pd
import numpy as np
import os
import plotly.express as px
from pandas_profiling import ProfileReport

Global time series data

pd.read_csv('us_jhu_data/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')\
  .head()

# Further exploration of 'us_jhu_data/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'
# is left to participants as an exercise.
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 4/6/20 4/7/20 4/8/20 4/9/20 4/10/20 4/11/20 4/12/20 4/13/20 4/14/20 4/15/20
0 NaN Afghanistan 33.0000 65.0000 0 0 0 0 0 0 ... 367 423 444 484 521 555 607 665 714 784
1 NaN Albania 41.1533 20.1683 0 0 0 0 0 0 ... 377 383 400 409 416 433 446 467 475 494
2 NaN Algeria 28.0339 1.6596 0 0 0 0 0 0 ... 1423 1468 1572 1666 1761 1825 1914 1983 2070 2160
3 NaN Andorra 42.5063 1.5218 0 0 0 0 0 0 ... 525 545 564 583 601 601 638 646 659 673
4 NaN Angola -11.2027 17.8739 0 0 0 0 0 0 ... 16 17 19 19 19 19 19 19 19 19

5 rows × 89 columns

This file has almost exactly what we need, but using it would be kind of cheating. Instead, let’s take the other dataset, the daily reports, and clean that data up ourselves for the mapping exercise.

Explore chosen dataset – Daily Reports

Let’s start by peeking into one file and having a look at what’s in there.

idata = 'us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/'
pd.read_csv(os.path.join(idata,'03-13-2020.csv'))
Province/State Country/Region Last Update Confirmed Deaths Recovered Latitude Longitude
0 Hubei China 2020-03-13T11:09:03 67786 3062 51553 30.9756 112.2707
1 Guangdong China 2020-03-13T11:09:03 1356 8 1296 23.3417 113.4244
2 Henan China 2020-03-11T08:13:09 1273 22 1249 33.8820 113.6140
3 Zhejiang China 2020-03-12T01:33:02 1215 1 1197 29.1832 120.0934
4 Hunan China 2020-03-13T11:09:03 1018 4 1005 27.6104 111.7088
... ... ... ... ... ... ... ... ...
225 NaN Afghanistan 2020-03-11T20:00:00 7 0 0 33.0000 65.0000
226 NaN Monaco 2020-03-11T20:00:00 2 0 0 43.7333 7.4167
227 NaN Liechtenstein 2020-03-11T20:00:00 1 0 0 47.1400 9.5500
228 NaN Guyana 2020-03-11T20:00:00 1 1 0 5.0000 -58.7500
229 NaN Taiwan* 2020-03-11T20:00:00 50 1 20 23.7000 121.0000

230 rows × 8 columns

Check the size and column count for all the files

import glob

dataset_path = os.path.join(idata,'*.csv')

for file in glob.glob(dataset_path):
    print( pd.read_csv(file).shape) # calculate data shape (rows, columns)
(101, 6)
(105, 6)
(2883, 12)
(2911, 12)
(75, 6)
(76, 6)
(3416, 12)
(3420, 12)
(2569, 12)
(2625, 12)
(206, 8)
(216, 8)
(73, 6)
(74, 6)
(225, 8)
(199, 8)
(3027, 12)
(3014, 12)
(71, 6)
(70, 6)
(72, 6)
(72, 6)
(125, 8)
(2989, 12)
(3002, 12)
(67, 6)
(68, 6)
(62, 6)
(58, 6)
(84, 6)
(76, 6)
(3413, 12)
(3417, 12)
(2764, 12)
(2679, 12)
(276, 8)
(272, 8)
(75, 6)
(75, 6)
(47, 6)
(51, 6)
(3434, 12)
(3430, 12)
(160, 8)
(173, 8)
(71, 6)
(72, 6)
(3439, 12)
(2434, 12)
(2483, 12)
(230, 8)
(218, 8)
(38, 6)
(46, 6)
(73, 6)
(72, 6)
(94, 6)
(90, 6)
(292, 8)
(284, 8)
(52, 6)
(54, 6)
(3429, 12)
(3421, 12)
(2809, 12)
(2857, 12)
(114, 6)
(119, 6)
(249, 8)
(258, 8)
(44, 6)
(41, 6)
(75, 6)
(75, 6)
(84, 6)
(85, 6)
(299, 8)
(304, 8)
(151, 8)
(141, 8)
(2966, 12)
(2942, 12)
(67, 6)
(266, 8)
(255, 8)
for file in glob.glob(dataset_path):
    print(file # print file path
          , pd.read_csv(file).shape) # calculate data shape (rows, columns)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-26-2020.csv (101, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-27-2020.csv (105, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-08-2020.csv (2883, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-09-2020.csv (2911, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-18-2020.csv (75, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-19-2020.csv (76, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-24-2020.csv (3416, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-25-2020.csv (3420, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-02-2020.csv (2569, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-03-2020.csv (2625, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-10-2020.csv (206, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-11-2020.csv (216, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-12-2020.csv (73, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-13-2020.csv (74, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-07-2020.csv (225, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-06-2020.csv (199, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv (3027, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-14-2020.csv (3014, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-05-2020.csv (71, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-04-2020.csv (70, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-08-2020.csv (72, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-09-2020.csv (72, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-01-2020.csv (125, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-12-2020.csv (2989, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-13-2020.csv (3002, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-02-2020.csv (67, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-03-2020.csv (68, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-31-2020.csv (62, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-30-2020.csv (58, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-21-2020.csv (84, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-20-2020.csv (76, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-23-2020.csv (3413, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-22-2020.csv (3417, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-05-2020.csv (2764, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-04-2020.csv (2679, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-17-2020.csv (276, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-16-2020.csv (272, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-15-2020.csv (75, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-14-2020.csv (75, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-26-2020.csv (47, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-27-2020.csv (51, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-29-2020.csv (3434, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-28-2020.csv (3430, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-04-2020.csv (160, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-05-2020.csv (173, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-06-2020.csv (71, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-07-2020.csv (72, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-30-2020.csv (3439, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-31-2020.csv (2434, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-01-2020.csv (2483, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-13-2020.csv (230, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-12-2020.csv (218, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv (38, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-23-2020.csv (46, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-11-2020.csv (73, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-10-2020.csv (72, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-25-2020.csv (94, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-24-2020.csv (90, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-19-2020.csv (292, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-18-2020.csv (284, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-28-2020.csv (52, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-29-2020.csv (54, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-27-2020.csv (3429, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-26-2020.csv (3421, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-06-2020.csv (2809, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-07-2020.csv (2857, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-28-2020.csv (114, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-29-2020.csv (119, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-14-2020.csv (249, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-15-2020.csv (258, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-25-2020.csv (44, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-24-2020.csv (41, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-16-2020.csv (75, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-17-2020.csv (75, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-22-2020.csv (84, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-23-2020.csv (85, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-20-2020.csv (299, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-21-2020.csv (304, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-03-2020.csv (151, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-02-2020.csv (141, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-11-2020.csv (2966, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-10-2020.csv (2942, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-01-2020.csv (67, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-09-2020.csv (266, 8)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-08-2020.csv (255, 8)
for file in glob.glob(dataset_path):
    print(file.rsplit(os.path.sep,1)[1] # isolate file name
          , pd.read_csv(file).shape) # calculate data shape (rows, columns)
02-26-2020.csv (101, 6)
02-27-2020.csv (105, 6)
04-08-2020.csv (2883, 12)
04-09-2020.csv (2911, 12)
02-18-2020.csv (75, 6)
02-19-2020.csv (76, 6)
03-24-2020.csv (3416, 12)
03-25-2020.csv (3420, 12)
04-02-2020.csv (2569, 12)
04-03-2020.csv (2625, 12)
03-10-2020.csv (206, 8)
03-11-2020.csv (216, 8)
02-12-2020.csv (73, 6)
02-13-2020.csv (74, 6)
03-07-2020.csv (225, 8)
03-06-2020.csv (199, 8)
04-15-2020.csv (3027, 12)
04-14-2020.csv (3014, 12)
02-05-2020.csv (71, 6)
02-04-2020.csv (70, 6)
02-08-2020.csv (72, 6)
02-09-2020.csv (72, 6)
03-01-2020.csv (125, 8)
04-12-2020.csv (2989, 12)
04-13-2020.csv (3002, 12)
02-02-2020.csv (67, 6)
02-03-2020.csv (68, 6)
01-31-2020.csv (62, 6)
01-30-2020.csv (58, 6)
02-21-2020.csv (84, 6)
02-20-2020.csv (76, 6)
03-23-2020.csv (3413, 12)
03-22-2020.csv (3417, 12)
04-05-2020.csv (2764, 12)
04-04-2020.csv (2679, 12)
03-17-2020.csv (276, 8)
03-16-2020.csv (272, 8)
02-15-2020.csv (75, 6)
02-14-2020.csv (75, 6)
01-26-2020.csv (47, 6)
01-27-2020.csv (51, 6)
03-29-2020.csv (3434, 12)
03-28-2020.csv (3430, 12)
03-04-2020.csv (160, 8)
03-05-2020.csv (173, 8)
02-06-2020.csv (71, 6)
02-07-2020.csv (72, 6)
03-30-2020.csv (3439, 12)
03-31-2020.csv (2434, 12)
04-01-2020.csv (2483, 12)
03-13-2020.csv (230, 8)
03-12-2020.csv (218, 8)
01-22-2020.csv (38, 6)
01-23-2020.csv (46, 6)
02-11-2020.csv (73, 6)
02-10-2020.csv (72, 6)
02-25-2020.csv (94, 6)
02-24-2020.csv (90, 6)
03-19-2020.csv (292, 8)
03-18-2020.csv (284, 8)
01-28-2020.csv (52, 6)
01-29-2020.csv (54, 6)
03-27-2020.csv (3429, 12)
03-26-2020.csv (3421, 12)
04-06-2020.csv (2809, 12)
04-07-2020.csv (2857, 12)
02-28-2020.csv (114, 6)
02-29-2020.csv (119, 6)
03-14-2020.csv (249, 8)
03-15-2020.csv (258, 8)
01-25-2020.csv (44, 6)
01-24-2020.csv (41, 6)
02-16-2020.csv (75, 6)
02-17-2020.csv (75, 6)
02-22-2020.csv (84, 6)
02-23-2020.csv (85, 6)
03-20-2020.csv (299, 8)
03-21-2020.csv (304, 8)
03-03-2020.csv (151, 8)
03-02-2020.csv (141, 8)
04-11-2020.csv (2966, 12)
04-10-2020.csv (2942, 12)
02-01-2020.csv (67, 6)
03-09-2020.csv (266, 8)
03-08-2020.csv (255, 8)
# Sort the file list

for file in sorted(glob.glob(dataset_path)):
    print(file.rsplit(os.path.sep,1)[1] # isolate file name
          , pd.read_csv(file).shape) # calculate data shape (rows, columns)
01-22-2020.csv (38, 6)
01-23-2020.csv (46, 6)
01-24-2020.csv (41, 6)
01-25-2020.csv (44, 6)
01-26-2020.csv (47, 6)
01-27-2020.csv (51, 6)
01-28-2020.csv (52, 6)
01-29-2020.csv (54, 6)
01-30-2020.csv (58, 6)
01-31-2020.csv (62, 6)
02-01-2020.csv (67, 6)
02-02-2020.csv (67, 6)
02-03-2020.csv (68, 6)
02-04-2020.csv (70, 6)
02-05-2020.csv (71, 6)
02-06-2020.csv (71, 6)
02-07-2020.csv (72, 6)
02-08-2020.csv (72, 6)
02-09-2020.csv (72, 6)
02-10-2020.csv (72, 6)
02-11-2020.csv (73, 6)
02-12-2020.csv (73, 6)
02-13-2020.csv (74, 6)
02-14-2020.csv (75, 6)
02-15-2020.csv (75, 6)
02-16-2020.csv (75, 6)
02-17-2020.csv (75, 6)
02-18-2020.csv (75, 6)
02-19-2020.csv (76, 6)
02-20-2020.csv (76, 6)
02-21-2020.csv (84, 6)
02-22-2020.csv (84, 6)
02-23-2020.csv (85, 6)
02-24-2020.csv (90, 6)
02-25-2020.csv (94, 6)
02-26-2020.csv (101, 6)
02-27-2020.csv (105, 6)
02-28-2020.csv (114, 6)
02-29-2020.csv (119, 6)
03-01-2020.csv (125, 8)
03-02-2020.csv (141, 8)
03-03-2020.csv (151, 8)
03-04-2020.csv (160, 8)
03-05-2020.csv (173, 8)
03-06-2020.csv (199, 8)
03-07-2020.csv (225, 8)
03-08-2020.csv (255, 8)
03-09-2020.csv (266, 8)
03-10-2020.csv (206, 8)
03-11-2020.csv (216, 8)
03-12-2020.csv (218, 8)
03-13-2020.csv (230, 8)
03-14-2020.csv (249, 8)
03-15-2020.csv (258, 8)
03-16-2020.csv (272, 8)
03-17-2020.csv (276, 8)
03-18-2020.csv (284, 8)
03-19-2020.csv (292, 8)
03-20-2020.csv (299, 8)
03-21-2020.csv (304, 8)
03-22-2020.csv (3417, 12)
03-23-2020.csv (3413, 12)
03-24-2020.csv (3416, 12)
03-25-2020.csv (3420, 12)
03-26-2020.csv (3421, 12)
03-27-2020.csv (3429, 12)
03-28-2020.csv (3430, 12)
03-29-2020.csv (3434, 12)
03-30-2020.csv (3439, 12)
03-31-2020.csv (2434, 12)
04-01-2020.csv (2483, 12)
04-02-2020.csv (2569, 12)
04-03-2020.csv (2625, 12)
04-04-2020.csv (2679, 12)
04-05-2020.csv (2764, 12)
04-06-2020.csv (2809, 12)
04-07-2020.csv (2857, 12)
04-08-2020.csv (2883, 12)
04-09-2020.csv (2911, 12)
04-10-2020.csv (2942, 12)
04-11-2020.csv (2966, 12)
04-12-2020.csv (2989, 12)
04-13-2020.csv (3002, 12)
04-14-2020.csv (3014, 12)
04-15-2020.csv (3027, 12)
import os

for file in sorted(glob.glob(dataset_path)):
       print(file.rsplit(os.path.sep,1)[1] # isolate file name
             , pd.read_csv(file).shape # calculate data shape (rows, columns)
             , os.stat(file).st_size/(1024*1024)) # get the individual file size in MB
01-22-2020.csv (38, 6) 0.0015974044799804688
01-23-2020.csv (46, 6) 0.00174713134765625
01-24-2020.csv (41, 6) 0.0016164779663085938
01-25-2020.csv (44, 6) 0.0017070770263671875
01-26-2020.csv (47, 6) 0.00180816650390625
01-27-2020.csv (51, 6) 0.0019540786743164062
01-28-2020.csv (52, 6) 0.0020046234130859375
01-29-2020.csv (54, 6) 0.00208282470703125
01-30-2020.csv (58, 6) 0.0022258758544921875
01-31-2020.csv (62, 6) 0.0024499893188476562
02-01-2020.csv (67, 6) 0.0026559829711914062
02-02-2020.csv (67, 6) 0.0030050277709960938
02-03-2020.csv (68, 6) 0.0030527114868164062
02-04-2020.csv (70, 6) 0.0031423568725585938
02-05-2020.csv (71, 6) 0.0031862258911132812
02-06-2020.csv (71, 6) 0.0031900405883789062
02-07-2020.csv (72, 6) 0.0032444000244140625
02-08-2020.csv (72, 6) 0.0032510757446289062
02-09-2020.csv (72, 6) 0.0032701492309570312
02-10-2020.csv (72, 6) 0.0032739639282226562
02-11-2020.csv (73, 6) 0.0033283233642578125
02-12-2020.csv (73, 6) 0.0033311843872070312
02-13-2020.csv (74, 6) 0.0033807754516601562
02-14-2020.csv (75, 6) 0.0033416748046875
02-15-2020.csv (75, 6) 0.0033464431762695312
02-16-2020.csv (75, 6) 0.0033473968505859375
02-17-2020.csv (75, 6) 0.0034189224243164062
02-18-2020.csv (75, 6) 0.003421783447265625
02-19-2020.csv (76, 6) 0.0033855438232421875
02-20-2020.csv (76, 6) 0.0033864974975585938
02-21-2020.csv (84, 6) 0.0037584304809570312
02-22-2020.csv (84, 6) 0.0038251876831054688
02-23-2020.csv (85, 6) 0.0038623809814453125
02-24-2020.csv (90, 6) 0.004063606262207031
02-25-2020.csv (94, 6) 0.004210472106933594
02-26-2020.csv (101, 6) 0.004431724548339844
02-27-2020.csv (105, 6) 0.004573822021484375
02-28-2020.csv (114, 6) 0.004901885986328125
02-29-2020.csv (119, 6) 0.0051116943359375
03-01-2020.csv (125, 8) 0.0073528289794921875
03-02-2020.csv (141, 8) 0.008268356323242188
03-03-2020.csv (151, 8) 0.008862495422363281
03-04-2020.csv (160, 8) 0.009377479553222656
03-05-2020.csv (173, 8) 0.010189056396484375
03-06-2020.csv (199, 8) 0.011743545532226562
03-07-2020.csv (225, 8) 0.013422012329101562
03-08-2020.csv (255, 8) 0.015294075012207031
03-09-2020.csv (266, 8) 0.01593017578125
03-10-2020.csv (206, 8) 0.011493682861328125
03-11-2020.csv (216, 8) 0.0117645263671875
03-12-2020.csv (218, 8) 0.011874198913574219
03-13-2020.csv (230, 8) 0.012880325317382812
03-14-2020.csv (249, 8) 0.013600349426269531
03-15-2020.csv (258, 8) 0.014292716979980469
03-16-2020.csv (272, 8) 0.014932632446289062
03-17-2020.csv (276, 8) 0.015168190002441406
03-18-2020.csv (284, 8) 0.015623092651367188
03-19-2020.csv (292, 8) 0.016092300415039062
03-20-2020.csv (299, 8) 0.01649951934814453
03-21-2020.csv (304, 8) 0.016945838928222656
03-22-2020.csv (3417, 12) 0.3128471374511719
03-23-2020.csv (3413, 12) 0.3327207565307617
03-24-2020.csv (3416, 12) 0.3328866958618164
03-25-2020.csv (3420, 12) 0.33330821990966797
03-26-2020.csv (3421, 12) 0.3335227966308594
03-27-2020.csv (3429, 12) 0.334259033203125
03-28-2020.csv (3430, 12) 0.31421375274658203
03-29-2020.csv (3434, 12) 0.3146400451660156
03-30-2020.csv (3439, 12) 0.31514453887939453
03-31-2020.csv (2434, 12) 0.23666667938232422
04-01-2020.csv (2483, 12) 0.24163150787353516
04-02-2020.csv (2569, 12) 0.23265457153320312
04-03-2020.csv (2625, 12) 0.25572681427001953
04-04-2020.csv (2679, 12) 0.2430257797241211
04-05-2020.csv (2764, 12) 0.26952648162841797
04-06-2020.csv (2809, 12) 0.25507640838623047
04-07-2020.csv (2857, 12) 0.2788209915161133
04-08-2020.csv (2883, 12) 0.2814149856567383
04-09-2020.csv (2911, 12) 0.2843132019042969
04-10-2020.csv (2942, 12) 0.2873687744140625
04-11-2020.csv (2966, 12) 0.2899484634399414
04-12-2020.csv (2989, 12) 0.2915000915527344
04-13-2020.csv (3002, 12) 0.2954998016357422
04-14-2020.csv (3014, 12) 0.2967643737792969
04-15-2020.csv (3027, 12) 0.29817771911621094

What’s the goal when preparing this dataset for visualization?

We are going to use Plotly (which renders with plotly.js under the hood) to visualize the data, and that means the data needs to be ‘clean’. The data has to arrive in exactly the format and shape the visualization expects, because Plotly won’t make assumptions about how to draw your data points. I’ve seen some beautifully picassoesque and Dr Seuss-like graphs come out of malformed data.

In this instance we want to create a graph that shows us the number of COVID cases per country per day, and animate through the days. To do this we will need the data in the following format:

Country Name    Country ISO Code    Day           Number of Cases
Switzerland     CHE                 2020-02-29    xxx
France          FRA                 2020-02-29    yyy
Switzerland     CHE                 2020-03-01    zzz
France          FRA                 2020-03-01    aaa
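
As a preview of where we’re heading, here is a minimal sketch of how Plotly Express consumes a tidy frame in exactly this shape. The frame and its case counts are made up for illustration; only the px.choropleth call itself reflects the API we’ll use later:

import pandas as pd
import plotly.express as px

# a tiny hand-made frame in the target shape (values are placeholders)
tidy = pd.DataFrame({
    'Country':   ['Switzerland', 'France', 'Switzerland', 'France'],
    'ISO_3':     ['CHE', 'FRA', 'CHE', 'FRA'],
    'Day':       ['2020-02-29', '2020-02-29', '2020-03-01', '2020-03-01'],
    'Confirmed': [10, 57, 18, 100],
})

# One row per country per day is all px.choropleth needs:
# locations takes the ISO-3 codes, color is the value to plot,
# and animation_frame gives us the day-by-day slider.
fig = px.choropleth(tidy,
                    locations='ISO_3',
                    color='Confirmed',
                    hover_name='Country',
                    animation_frame='Day')
fig.show()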

To get to that format we’ll have to transform whatever shape the chosen data set is in into the shape above. To do that we’ll load and harmonize the daily reports with the following steps:

  1. Load the data into memory
  2. Analyze the dataset and make decisions about what to do
  3. Consolidate data which has changed names over time
  4. Clean up the country names to identify the ISO codes for countries
  5. Visualize the data on a world map
  6. Tweak the visualization
  7. Adapt our visualization to look at cases per 100k instead of number of cases

1. Load the data set into a pandas dataframe

Load all the files into memory and see how we can combine them into a single dataset to use for visualization.

import glob 
# Create two lists to store our file metadata and file data
all_data = []
meta_data = []

# For every file in our data set path
for file in sorted(glob.glob(dataset_path)):
    
    # 1. Read the file to a temporary data frame
    df = pd.read_csv(file)
    
    # 2. Append a dictionary with the file meta_data into the metadata list
    meta_data.append( {  'file_name': file.rsplit(os.path.sep,1)[1]
                       , 'num_rows': df.shape[0]
                       , 'num_cols': df.shape[1]
                       , 'col_names': '\t'.join(sorted(list(df.columns)))} )
    
    # 3. Add the source file name to the loaded data
    df['source_file'] = file
    
    # 4. Add the loaded data with the file name to a list with all the loaded data
    all_data.append(df)

# 5. Create a table/dataframe out of our meta_data 
meta_data = pd.DataFrame(meta_data)

# show the metadata in jupyter notebook
meta_data
file_name num_rows num_cols col_names
0 01-22-2020.csv 38 6 Confirmed\tCountry/Region\tDeaths\tLast Update...
1 01-23-2020.csv 46 6 Confirmed\tCountry/Region\tDeaths\tLast Update...
2 01-24-2020.csv 41 6 Confirmed\tCountry/Region\tDeaths\tLast Update...
3 01-25-2020.csv 44 6 Confirmed\tCountry/Region\tDeaths\tLast Update...
4 01-26-2020.csv 47 6 Confirmed\tCountry/Region\tDeaths\tLast Update...
... ... ... ... ...
80 04-11-2020.csv 2966 12 Active\tAdmin2\tCombined_Key\tConfirmed\tCount...
81 04-12-2020.csv 2989 12 Active\tAdmin2\tCombined_Key\tConfirmed\tCount...
82 04-13-2020.csv 3002 12 Active\tAdmin2\tCombined_Key\tConfirmed\tCount...
83 04-14-2020.csv 3014 12 Active\tAdmin2\tCombined_Key\tConfirmed\tCount...
84 04-15-2020.csv 3027 12 Active\tAdmin2\tCombined_Key\tConfirmed\tCount...

85 rows × 4 columns


pd.set_option('display.max_colwidth', 150)

# output result to notebook window
meta_data.groupby(['num_cols'])\
         .agg({ 'num_rows': 'sum'
              , 'file_name': sorted
              , 'col_names': set })
num_rows file_name col_names
num_cols
6 2818 [01-22-2020.csv, 01-23-2020.csv, 01-24-2020.csv, 01-25-2020.csv, 01-26-2020.csv, 01-27-2020.csv, 01-28-2020.csv, 01-29-2020.csv, 01-30-2020.csv, 0... {Confirmed\tCountry/Region\tDeaths\tLast Update\tProvince/State\tRecovered}
8 4799 [03-01-2020.csv, 03-02-2020.csv, 03-03-2020.csv, 03-04-2020.csv, 03-05-2020.csv, 03-06-2020.csv, 03-07-2020.csv, 03-08-2020.csv, 03-09-2020.csv, 0... {Confirmed\tCountry/Region\tDeaths\tLast Update\tLatitude\tLongitude\tProvince/State\tRecovered}
12 75773 [03-22-2020.csv, 03-23-2020.csv, 03-24-2020.csv, 03-25-2020.csv, 03-26-2020.csv, 03-27-2020.csv, 03-28-2020.csv, 03-29-2020.csv, 03-30-2020.csv, 0... {Active\tAdmin2\tCombined_Key\tConfirmed\tCountry_Region\tDeaths\tFIPS\tLast_Update\tLat\tLong_\tProvince_State\tRecovered}
d_data = pd.concat(all_data
                  , axis='index'
                  , join='outer'
                  , ignore_index=True
                  , sort=True)

d_data.head()
Active Admin2 Combined_Key Confirmed Country/Region Country_Region Deaths FIPS Last Update Last_Update Lat Latitude Long_ Longitude Province/State Province_State Recovered source_file
0 NaN NaN NaN 1.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Anhui NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
1 NaN NaN NaN 14.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Beijing NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
2 NaN NaN NaN 6.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Chongqing NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
3 NaN NaN NaN 1.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Fujian NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
4 NaN NaN NaN NaN Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Gansu NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv

2. Analyze the data set

pr = ProfileReport(d_data)
pr.to_widgets()
Summarize dataset: 100%|██████████| 32/32 [00:19<00:00,  1.68it/s, Completed]
Generate report structure: 100%|██████████| 1/1 [00:03<00:00,  3.54s/it]
Render widgets: 100%|██████████| 1/1 [00:13<00:00, 13.76s/it]
Active Admin2 Combined_Key Confirmed Country/Region Country_Region Deaths FIPS Last Update Last_Update Lat Latitude Long_ Longitude Province/State Province_State Recovered source_file
0 NaN NaN NaN 1.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Anhui NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
1 NaN NaN NaN 14.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Beijing NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
2 NaN NaN NaN 6.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Chongqing NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
3 NaN NaN NaN 1.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Fujian NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
4 NaN NaN NaN NaN Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Gansu NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
5 NaN NaN NaN 26.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Guangdong NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
6 NaN NaN NaN 2.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Guangxi NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
7 NaN NaN NaN 1.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Guizhou NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
8 NaN NaN NaN 4.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Hainan NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
9 NaN NaN NaN 1.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Hebei NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
Active Admin2 Combined_Key Confirmed Country/Region Country_Region Deaths FIPS Last Update Last_Update Lat Latitude Long_ Longitude Province/State Province_State Recovered source_file
83380 85608.0 NaN United Kingdom 98476.0 NaN United Kingdom 12868.0 NaN NaN 2020-04-15 22:56:32 55.378100 NaN -3.436000 NaN NaN NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv
83381 224.0 NaN Uruguay 492.0 NaN Uruguay 8.0 NaN NaN 2020-04-15 22:56:32 -32.522800 NaN -55.765800 NaN NaN NaN 260.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv
83382 1191.0 NaN Uzbekistan 1302.0 NaN Uzbekistan 4.0 NaN NaN 2020-04-15 22:56:32 41.377491 NaN 64.585262 NaN NaN NaN 107.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv
83383 77.0 NaN Venezuela 197.0 NaN Venezuela 9.0 NaN NaN 2020-04-15 22:56:32 6.423800 NaN -66.589700 NaN NaN NaN 111.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv
83384 96.0 NaN Vietnam 267.0 NaN Vietnam 0.0 NaN NaN 2020-04-15 22:56:32 14.058324 NaN 108.277199 NaN NaN NaN 171.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv
83385 309.0 NaN West Bank and Gaza 374.0 NaN West Bank and Gaza 2.0 NaN NaN 2020-04-15 22:56:32 31.952200 NaN 35.233200 NaN NaN NaN 63.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv
83386 6.0 NaN Western Sahara 6.0 NaN Western Sahara 0.0 NaN NaN 2020-04-15 22:56:32 24.215500 NaN -12.885800 NaN NaN NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv
83387 1.0 NaN Yemen 1.0 NaN Yemen 0.0 NaN NaN 2020-04-15 22:56:32 15.552727 NaN 48.516388 NaN NaN NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv
83388 16.0 NaN Zambia 48.0 NaN Zambia 2.0 NaN NaN 2020-04-15 22:56:32 -13.133897 NaN 27.849332 NaN NaN NaN 30.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv
83389 19.0 NaN Zimbabwe 23.0 NaN Zimbabwe 3.0 NaN NaN 2020-04-15 22:56:32 -19.015438 NaN 29.154857 NaN NaN NaN 1.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv

3. Consolidate data whose names may have changed over time

Some of the data columns we looked at earlier appear to have been renamed over time. Let’s verify that for each pair and then merge the columns.


d_data[ 
    (~pd.isnull(d_data['Country/Region'])) # select rows where Country/Region is NOT null
   &                                       # AND
    (~pd.isnull(d_data['Country_Region'])) # select rows where Country_Region is NOT null
]

# we want to see no rows from this query.
Active Admin2 Combined_Key Confirmed Country/Region Country_Region Deaths FIPS Last Update Last_Update Lat Latitude Long_ Longitude Province/State Province_State Recovered source_file

Now that we’ve confirmed that the columns never overlap, we can perform the merge operation.

d_data['Country'] = d_data['Country/Region'].fillna(d_data['Country_Region']) 
#          ↑                      ↑                              ↑           
#     3. Add to       2. put it here if its null            1. take this     
#     newcolumn  

d_data['Country/Region'].fillna(d_data['Country_Region']) 
0            Mainland China
1            Mainland China
2            Mainland China
3            Mainland China
4            Mainland China
                ...        
83385    West Bank and Gaza
83386        Western Sahara
83387                 Yemen
83388                Zambia
83389              Zimbabwe
Name: Country/Region, Length: 83390, dtype: object

This worked well for the first column pair, so let’s do the same for the others. This time we’ll automate the check that makes sure the columns don’t overlap, and stop execution if they ever do.

if len(d_data[ 
    (~pd.isnull(d_data['Province/State']))
   &(~pd.isnull(d_data['Province_State'])) 
    ]) > 0:
    raise ValueError('Columns overlap, further investigation is needed.')


# no overlapping rows found, safe to merge

d_data['ps'] = d_data['Province/State'].fillna(d_data['Province_State'])
#          ↑                    ↑                            ↑           
#     3. Add to     2. put it here if its null          1. take this     
#     newcolumn  

d_data['Province/State'].fillna(d_data['Province_State'])
0            Anhui
1          Beijing
2        Chongqing
3           Fujian
4            Gansu
           ...    
83385          NaN
83386          NaN
83387          NaN
83388          NaN
83389          NaN
Name: Province/State, Length: 83390, dtype: object
if len(d_data[ 
    (~pd.isnull(d_data['Long_']))
   &(~pd.isnull(d_data['Longitude'])) 
    ]) > 0:
    raise ValueError('Columns overlap, further investigation is needed.')


# no overlapping rows found, safe to merge

d_data['long_val'] = d_data['Long_'].fillna(d_data['Longitude'])
#          ↑                   ↑                         ↑           
#     3. Add to    2. put it here if its null       1. take this     
#     newcolumn  

d_data['Long_'].fillna(d_data['Longitude'])
0              NaN
1              NaN
2              NaN
3              NaN
4              NaN
           ...    
83385    35.233200
83386   -12.885800
83387    48.516388
83388    27.849332
83389    29.154857
Name: Long_, Length: 83390, dtype: float64
if len(d_data[ 
    (~pd.isnull(d_data['Lat']))
   &(~pd.isnull(d_data['Latitude'])) 
    ]) != 0:
    raise ValueError('Columns overlap, further investigation is needed.')

# no overlapping rows found, safe to merge

d_data['lat_val'] = d_data['Lat'].fillna(d_data['Latitude'])
#          ↑                   ↑                         ↑           
#     3. Add to    2. put it here if its null       1. take this     
#     newcolumn  

d_data['Lat'].fillna(d_data['Latitude'])
0              NaN
1              NaN
2              NaN
3              NaN
4              NaN
           ...    
83385    31.952200
83386    24.215500
83387    15.552727
83388   -13.133897
83389   -19.015438
Name: Lat, Length: 83390, dtype: float64
if len(d_data[ 
    (~pd.isnull(d_data['Last Update']))
   &(~pd.isnull(d_data['Last_Update'])) 
    ]) != 0:
    raise ValueError('Columns overlap, further investigation is needed.')

# no overlapping rows found, safe to merge

d_data['updated_on'] = d_data['Last Update'].fillna(d_data['Last_Update']) 
#          ↑                         ↑                            ↑           
#     3. Add to          2. put it here if its null          1. take this   

d_data['Last Update'].fillna(d_data['Last_Update']) 
0            1/22/2020 17:00
1            1/22/2020 17:00
2            1/22/2020 17:00
3            1/22/2020 17:00
4            1/22/2020 17:00
                ...         
83385    2020-04-15 22:56:32
83386    2020-04-15 22:56:32
83387    2020-04-15 22:56:32
83388    2020-04-15 22:56:32
83389    2020-04-15 22:56:32
Name: Last Update, Length: 83390, dtype: object
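
We’ve now repeated the same check-then-merge pattern four times, so it’s worth noting it could be wrapped in a small helper. Below is a minimal sketch; merge_column_pair is our own name for it, not something from the notebook or any library:

def merge_column_pair(df, old_col, new_col, target_col):
    """Merge two columns that never overlap into a single target column."""
    # stop if any row has values in both columns at once
    if (df[old_col].notna() & df[new_col].notna()).any():
        raise ValueError(f'Columns {old_col} and {new_col} overlap, further investigation is needed.')
    # take old_col where present, otherwise fall back to new_col
    df[target_col] = df[old_col].fillna(df[new_col])

# equivalent to the cells above, e.g.:
# merge_column_pair(d_data, 'Last Update', 'Last_Update', 'updated_on')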

With the columns merged, we can drop the old columns as they are no longer needed.

d_data.columns
Index(['Active', 'Admin2', 'Combined_Key', 'Confirmed', 'Country/Region',
       'Country_Region', 'Deaths', 'FIPS', 'Last Update', 'Last_Update', 'Lat',
       'Latitude', 'Long_', 'Longitude', 'Province/State', 'Province_State',
       'Recovered', 'source_file', 'Country', 'ps', 'long_val', 'lat_val',
       'updated_on'],
      dtype='object')
d_data.drop(columns=['Active', 'Admin2', 'Combined_Key'
                     , 'Country/Region', 'Country_Region'
                     , 'Last Update', 'Last_Update', 'Lat'
                     , 'Latitude', 'Long_', 'Longitude'
                     , 'Province/State', 'Province_State']
            , inplace=True)

To work with the data later on, let’s convert the updated_on column from strings to datetime values so that we can sort and group it properly.

# convert updated time to a datetime object to work with
d_data['updated_on'] = pd.to_datetime(d_data['updated_on'])
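
As a quick sanity check that the conversion worked, and a preview of the day-level truncation we’ll want when grouping by day, here is a small inspection cell (it doesn’t modify d_data):

# the column should now be datetime64, spanning Jan 22 to Apr 15
print(d_data['updated_on'].dtype)
print(d_data['updated_on'].min(), '->', d_data['updated_on'].max())

# truncate timestamps to whole days, e.g. for later grouping
d_data['updated_on'].dt.floor('D').head()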

4. Clean up the country names to identify the ISO codes for countries

To add data onto the map we need the ISO codes for all the countries. For that we can use the pycountry library we installed at the beginning of the course. Let’s first see whether there will be any issues by looking at the country values.

set(d_data['Country'].unique())
{' Azerbaijan',
 'Afghanistan',
 'Albania',
 'Algeria',
 'Andorra',
 'Angola',
 'Antigua and Barbuda',
 'Argentina',
 'Armenia',
 'Aruba',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bahamas, The',
 'Bahrain',
 'Bangladesh',
 'Barbados',
 'Belarus',
 'Belgium',
 'Belize',
 'Benin',
 'Bhutan',
 'Bolivia',
 'Bosnia and Herzegovina',
 'Botswana',
 'Brazil',
 'Brunei',
 'Bulgaria',
 'Burkina Faso',
 'Burma',
 'Burundi',
 'Cabo Verde',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Cape Verde',
 'Cayman Islands',
 'Central African Republic',
 'Chad',
 'Channel Islands',
 'Chile',
 'China',
 'Colombia',
 'Congo (Brazzaville)',
 'Congo (Kinshasa)',
 'Costa Rica',
 "Cote d'Ivoire",
 'Croatia',
 'Cruise Ship',
 'Cuba',
 'Curacao',
 'Cyprus',
 'Czech Republic',
 'Czechia',
 'Denmark',
 'Diamond Princess',
 'Djibouti',
 'Dominica',
 'Dominican Republic',
 'East Timor',
 'Ecuador',
 'Egypt',
 'El Salvador',
 'Equatorial Guinea',
 'Eritrea',
 'Estonia',
 'Eswatini',
 'Ethiopia',
 'Faroe Islands',
 'Fiji',
 'Finland',
 'France',
 'French Guiana',
 'Gabon',
 'Gambia',
 'Gambia, The',
 'Georgia',
 'Germany',
 'Ghana',
 'Gibraltar',
 'Greece',
 'Greenland',
 'Grenada',
 'Guadeloupe',
 'Guam',
 'Guatemala',
 'Guernsey',
 'Guinea',
 'Guinea-Bissau',
 'Guyana',
 'Haiti',
 'Holy See',
 'Honduras',
 'Hong Kong',
 'Hong Kong SAR',
 'Hungary',
 'Iceland',
 'India',
 'Indonesia',
 'Iran',
 'Iran (Islamic Republic of)',
 'Iraq',
 'Ireland',
 'Israel',
 'Italy',
 'Ivory Coast',
 'Jamaica',
 'Japan',
 'Jersey',
 'Jordan',
 'Kazakhstan',
 'Kenya',
 'Korea, South',
 'Kosovo',
 'Kuwait',
 'Kyrgyzstan',
 'Laos',
 'Latvia',
 'Lebanon',
 'Liberia',
 'Libya',
 'Liechtenstein',
 'Lithuania',
 'Luxembourg',
 'MS Zaandam',
 'Macao SAR',
 'Macau',
 'Madagascar',
 'Mainland China',
 'Malawi',
 'Malaysia',
 'Maldives',
 'Mali',
 'Malta',
 'Martinique',
 'Mauritania',
 'Mauritius',
 'Mayotte',
 'Mexico',
 'Moldova',
 'Monaco',
 'Mongolia',
 'Montenegro',
 'Morocco',
 'Mozambique',
 'Namibia',
 'Nepal',
 'Netherlands',
 'New Zealand',
 'Nicaragua',
 'Niger',
 'Nigeria',
 'North Ireland',
 'North Macedonia',
 'Norway',
 'Oman',
 'Others',
 'Pakistan',
 'Palestine',
 'Panama',
 'Papua New Guinea',
 'Paraguay',
 'Peru',
 'Philippines',
 'Poland',
 'Portugal',
 'Puerto Rico',
 'Qatar',
 'Republic of Ireland',
 'Republic of Korea',
 'Republic of Moldova',
 'Republic of the Congo',
 'Reunion',
 'Romania',
 'Russia',
 'Russian Federation',
 'Rwanda',
 'Saint Barthelemy',
 'Saint Kitts and Nevis',
 'Saint Lucia',
 'Saint Martin',
 'Saint Vincent and the Grenadines',
 'San Marino',
 'Sao Tome and Principe',
 'Saudi Arabia',
 'Senegal',
 'Serbia',
 'Seychelles',
 'Sierra Leone',
 'Singapore',
 'Slovakia',
 'Slovenia',
 'Somalia',
 'South Africa',
 'South Korea',
 'South Sudan',
 'Spain',
 'Sri Lanka',
 'St. Martin',
 'Sudan',
 'Suriname',
 'Sweden',
 'Switzerland',
 'Syria',
 'Taipei and environs',
 'Taiwan',
 'Taiwan*',
 'Tanzania',
 'Thailand',
 'The Bahamas',
 'The Gambia',
 'Timor-Leste',
 'Togo',
 'Trinidad and Tobago',
 'Tunisia',
 'Turkey',
 'UK',
 'US',
 'Uganda',
 'Ukraine',
 'United Arab Emirates',
 'United Kingdom',
 'Uruguay',
 'Uzbekistan',
 'Vatican City',
 'Venezuela',
 'Viet Nam',
 'Vietnam',
 'West Bank and Gaza',
 'Western Sahara',
 'Yemen',
 'Zambia',
 'Zimbabwe',
 'occupied Palestinian territory'}
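
Scanning that list, several entries clearly won’t resolve to a country at all (‘Others’, ‘Cruise Ship’, ‘Diamond Princess’, ‘MS Zaandam’), and others use names pycountry may not recognize. As a quick illustration of the failure mode, pycountry’s search_fuzzy raises a LookupError when it finds no match:

import pycountry

try:
    pycountry.countries.search_fuzzy('Cruise Ship')
except LookupError as err:
    print('No ISO match for "Cruise Ship":', err)

So before we can map names to ISO codes we need to rename the resolvable entries and drop the rest, which is what the next cell does.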
import pycountry

d_data.loc[ d_data['Country']=='Mainland China', 'Country'] = 'China'
d_data.loc[ d_data['Country']=='Macau', 'Country'] = 'China'
d_data.loc[ d_data['Country']=='South Korea', 'Country'] = 'Republic of Korea'
d_data.loc[ d_data['Country']=='Korea, South', 'Country'] = 'Republic of Korea'
d_data.loc[ d_data['Country']=='Ivory Coast', 'Country'] = "Republic of Cote d'Ivoire"
d_data.loc[ d_data['Country']=='North Ireland', 'Country'] = "United Kingdom"
d_data.loc[ d_data['Country']=='Republic of Ireland', 'Country'] = "Ireland"
d_data.loc[ d_data['Country']=='St. Martin', 'Country'] = "France" 
d_data.loc[ d_data['Country']=='Iran (Islamic Republic of)', 'Country'] = "Iran"
d_data.loc[ d_data['Country']=='West Bank and Gaza', 'Country'] = "Palestine"
d_data.loc[ d_data['Country']=='Channel Islands', 'Country'] = "UK" ## Not technically, but effectively, great tax laws
d_data.loc[ d_data['Country'].isin([ 'Congo (Brazzaville)'
                                    ,'Congo (Kinshasa)']), 'Country'] = 'Congo'
d_data.loc[ d_data['Country']=='Gambia, The', 'Country'] = "Gambia" 
d_data.loc[ d_data['Country']=='Bahamas, The', 'Country'] = "Bahamas" 
d_data.loc[ d_data['Country']=='Cape Verde', 'Country'] = 'Republic of Cabo Verde'
d_data.loc[ d_data['Country']=='East Timor', 'Country'] = 'Timor-Leste'
d_data.loc[ d_data['Country']=='Laos', 'Country'] = "Lao People's Democratic Republic" 
d_data.loc[ d_data['Country']=="Burma", 'Country'] = 'Myanmar'

# drop disputed territories and entries that are not territories at all
d_data = d_data.drop(d_data[d_data['Country']=='Others'].index)
d_data = d_data.drop(d_data[d_data['Country']=='Taipei and environs'].index)
d_data = d_data.drop(d_data[d_data['Country']=='occupied Palestinian territory'].index) 
d_data = d_data.drop(d_data[d_data['Country']=='Taiwan*'].index)
d_data = d_data.drop(d_data[d_data['Country']=='Taiwan'].index)
d_data = d_data.drop(d_data[d_data['Country']=='MS Zaandam'].index)
d_data = d_data.drop(d_data[d_data['Country']=='Cruise Ship'].index)
d_data = d_data.drop(d_data[d_data['Country']=='Diamond Princess'].index)
countries = pd.Series(d_data['Country'].unique())

def get_iso(country_name):
    return {'Country':country_name, 'ISO_3': pycountry.countries.search_fuzzy(country_name)[0].alpha_3}

countries = pd.DataFrame(list(countries.map(get_iso)))
d_data.merge(countries
            , on='Country'
            , how='inner'
            , validate='m:1') 
Confirmed Deaths FIPS Recovered source_file Country ps long_val lat_val updated_on ISO_3
0 1.0 NaN NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv China Anhui NaN NaN 2020-01-22 17:00:00 CHN
1 14.0 NaN NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv China Beijing NaN NaN 2020-01-22 17:00:00 CHN
2 6.0 NaN NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv China Chongqing NaN NaN 2020-01-22 17:00:00 CHN
3 1.0 NaN NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv China Fujian NaN NaN 2020-01-22 17:00:00 CHN
4 NaN NaN NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv China Gansu NaN NaN 2020-01-22 17:00:00 CHN
... ... ... ... ... ... ... ... ... ... ... ...
83206 1.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-11-2020.csv Yemen NaN 48.516388 15.552727 2020-04-11 22:45:13 YEM
83207 1.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-12-2020.csv Yemen NaN 48.516388 15.552727 2020-04-12 23:17:00 YEM
83208 1.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-13-2020.csv Yemen NaN 48.516388 15.552727 2020-04-13 23:07:34 YEM
83209 1.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-14-2020.csv Yemen NaN 48.516388 15.552727 2020-04-14 23:33:12 YEM
83210 1.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv Yemen NaN 48.516388 15.552727 2020-04-15 22:56:32 YEM

83211 rows × 11 columns

d_data =\
    d_data.merge(countries
                , on='Country'
                , how='inner'
                , validate='m:1') 
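
The validate='m:1' flag is a cheap safety net: pandas raises pandas.errors.MergeError if the right-hand frame turns out not to be unique on the join key. A minimal sketch with a deliberately duplicated lookup row (toy data, not from our dataset):

import pandas as pd

left = pd.DataFrame({'Country': ['China', 'China'], 'Confirmed': [1, 2]})
right = pd.DataFrame({'Country': ['China', 'China'], 'ISO_3': ['CHN', 'CHN']})  # duplicate key!

try:
    left.merge(right, on='Country', how='inner', validate='m:1')
except pd.errors.MergeError as err:
    print('Merge validation failed:', err)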

4. a. Identify and fix issues with the updated_on column for our grouping purposes

pd.set_option("display.max_rows", 300) # increase the number of rows visible 

d_data[d_data['ISO_3'] == 'DEU']
Confirmed Deaths FIPS Recovered source_file Country ps long_val lat_val updated_on ISO_3
76227 4.0 NaN NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-28-2020.csv Germany Bavaria NaN NaN 2020-01-28 23:00:00 DEU
76228 4.0 NaN NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-29-2020.csv Germany Bavaria NaN NaN 2020-01-29 19:30:00 DEU
76229 4.0 NaN NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-30-2020.csv Germany Bavaria NaN NaN 2020-01-30 16:00:00 DEU
76230 5.0 NaN NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-31-2020.csv Germany Bavaria NaN NaN 2020-01-31 23:59:00 DEU
76231 8.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-01-2020.csv Germany NaN NaN NaN 2020-02-01 18:33:00 DEU
76232 10.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-02-2020.csv Germany NaN NaN NaN 2020-02-02 18:03:05 DEU
76233 12.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-03-2020.csv Germany NaN NaN NaN 2020-02-03 20:53:02 DEU
76234 12.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-04-2020.csv Germany NaN NaN NaN 2020-02-03 20:53:02 DEU
76235 12.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-05-2020.csv Germany NaN NaN NaN 2020-02-03 20:53:02 DEU
76236 12.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-06-2020.csv Germany NaN NaN NaN 2020-02-03 20:53:02 DEU
76237 13.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-07-2020.csv Germany NaN NaN NaN 2020-02-07 16:33:03 DEU
76238 13.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-08-2020.csv Germany NaN NaN NaN 2020-02-07 16:33:03 DEU
76239 14.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-09-2020.csv Germany NaN NaN NaN 2020-02-09 06:13:11 DEU
76240 14.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-10-2020.csv Germany NaN NaN NaN 2020-02-09 06:13:11 DEU
76241 16.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-11-2020.csv Germany NaN NaN NaN 2020-02-11 19:33:03 DEU
76242 16.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-12-2020.csv Germany NaN NaN NaN 2020-02-11 19:33:03 DEU
76243 16.0 0.0 NaN 1.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-13-2020.csv Germany NaN NaN NaN 2020-02-13 15:13:11 DEU
76244 16.0 0.0 NaN 1.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-14-2020.csv Germany NaN NaN NaN 2020-02-13 15:13:11 DEU
... ... ... ... ... ... ... ... ... ... ... ...
76304 131359.0 3294.0 NaN 68200.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-14-2020.csv Germany NaN 10.451526 51.165691 2020-04-14 23:33:12 DEU
76305 134753.0 3804.0 NaN 72600.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv Germany NaN 10.451526 51.165691 2020-04-15 22:56:32 DEU
d_data['report_date'] = \
     pd.to_datetime( 
         d_data['source_file'].astype('str')                                  # convert the values in the column to string
                              .str.rsplit(os.path.sep, n=1, expand=True)[1]   # extract the file name from the file path
                              .str.replace('.csv', '', regex=False)           # remove the .csv extension
       , dayfirst=False)                                                      # parse the extracted string as a date (month first)
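
An equivalent, cross-platform way to get the file name (a sketch, not part of the original notebook) is pathlib, whose .stem already drops the .csv extension:

from pathlib import Path

# same result as the rsplit/replace chain above
d_data['report_date'] = pd.to_datetime(
    d_data['source_file'].map(lambda p: Path(p).stem))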

5. Visualize the data on a map

To put the data we've prepared onto a map, we need to sum the number of confirmed cases by report date and country code.

d_data.groupby(by=['report_date', 'ISO_3'])\
      .agg({'Confirmed': 'sum'})
Confirmed
report_date ISO_3
2020-01-22 CHN 548.0
HKG 0.0
JPN 2.0
PRK 1.0
THA 2.0
... ... ...
2020-04-15 VNM 267.0
YEM 1.0
ZAF 2506.0
ZMB 48.0
ZWE 23.0

7866 rows × 1 columns


The grouped view above keeps report_date and ISO_3 in the index, so for plotting we group again, this time including the country name for hover text, and reset the index to get ordinary columns:

viz = \
d_data.groupby(by=['report_date', 'ISO_3', 'Country'])\
      .agg({'Confirmed': 'sum'})\
      .reset_index()

Plotly labels animation frames with strings, so we convert the report date column back to a string for display:

viz['report_date'] = viz['report_date'].dt.strftime( '%Y-%m-%d')
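# e.g. Timestamp('2020-04-14 00:00:00') becomes the string '2020-04-14'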
With locationmode='ISO-3', plotly matches the country codes against its built-in world geometry, so no separate GeoJSON file is needed. We start with a static map of a single day; the animated version follows in the next section.

subset = viz[viz['report_date'] == '2020-04-14']

px.choropleth(subset
             , locations='ISO_3'                  # column holding the ISO-3 country codes
             , locationmode='ISO-3'               # interpret those codes against plotly's built-in country outlines
             , color='Confirmed'
             , color_continuous_scale='Greens')\
  .write_html('choropleth_1.html', full_html=False, include_plotlyjs='cdn')  # save the map, loading plotly.js from the CDN

6. Tweak the visualization

Confirmed cases span several orders of magnitude across countries, so coloring by raw counts leaves everything but the largest outbreaks looking identical. Taking the square root compresses that range before we map it to color:

viz['sqrt_Confirmed'] = np.sqrt(viz['Confirmed'].clip(lower=1))  # square root of the confirmed cases,
                                                                 # clipping values below 1 up to 1 first
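
A quick check shows how strongly the square root flattens the spread between small and large outbreaks:

np.sqrt([1, 100, 10_000, 1_000_000])  # -> array([   1.,   10.,  100., 1000.])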

fig = \
px.choropleth(viz
             , locations='ISO_3'
             , locationmode='ISO-3'
             , animation_frame='report_date'
             , hover_data=['Confirmed', 'Country']
             , animation_group='ISO_3'
             , color='sqrt_Confirmed'
             , color_continuous_scale= [[0,  'rgba(0, 255, 0, .07)' ]
                                        ,[0.5, 'green']
                                        ,[1, 'palegreen']]
             , template='plotly_dark')
fig
The inline renderer can't display the interactive figure in this export, so we open it in a browser window instead:

fig.show(renderer='browser')

7. What if we look at the data by proportion of the population?

# Load population data for countries; gapminder's most recent year is 2007,
# so these populations are only approximate for 2020
pop_data = px.data.gapminder()

# Select the most recent year available
pop_data = pop_data[pop_data['year'] == 2007]

# Join the population data to our data set on the country ISO-3 code;
# dropna() discards countries that gapminder doesn't cover
pop_viz = viz.merge(pop_data[['pop', 'iso_alpha']]
          , left_on='ISO_3'
          , right_on='iso_alpha'
          , how='left').dropna()

# calculate the number of confirmed cases per 100k inhabitants
pop_viz['proportion'] = (pop_viz['Confirmed']/pop_viz['pop'])*100_000
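# Worked spot check (using gapminder's 2007 figure for Germany, 82,400,996):
# the last report above shows 134,753 confirmed cases on 2020-04-15, so
# 134753 / 82400996 * 100_000 ≈ 163.5 cases per 100k inhabitants.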

pop_viz['sqrt_proportion'] = np.sqrt(pop_viz['proportion'])

# map the data 
px.choropleth(pop_viz
             , locations='ISO_3'
             , locationmode='ISO-3'
             , animation_frame='report_date'
             , hover_data=['Confirmed', 'proportion', 'Country']
             , animation_group='ISO_3'
             , color='sqrt_proportion'
             , color_continuous_scale= [[0,  'rgba(0, 255, 0, .07)' ]
                                        ,[0.5, 'green']
                                        ,[1, 'palegreen']]
             , template='plotly_dark')\
.show(renderer='browser')

8. Extra problems to solve

After this course you should be armed with some tools to work on other datasets and problems. Here are some additional problems that will stretch your abilities a bit and require some extra reading to solve:

  1. Can you rewrite the column-merging code into a function, so that all you need to do is pass the function a dataframe, two input column names, and an output column name, and it checks whether the columns can be merged and merges them?
def merge_cols(input_df, col1, col2, outcol):
    ...
  2. Can you change the way we load the files to avoid needing to merge columns at all? (hint: you will need to use the pandas rename function, dataframe.rename(columns={'from': 'to'}) )

  3. Can you create a visualization from the time series dataset? (hint: you will need the dataframe.melt() command briefly demonstrated in Python 101, and you will likely need to clean up the country names before doing the ISO_3 lookups)

Solution for problem 1

def merge_cols(input_df, col1, col2, outcol):
    # if any row has a value in both columns, blindly merging would lose data
    if len(input_df[ 
        (~pd.isnull(input_df[col1]))
       &(~pd.isnull(input_df[col2])) 
        ]) != 0:
        raise ValueError('Columns overlap, further investigation is needed.')

    # otherwise take col1 and fill its gaps from col2
    input_df[outcol] = input_df[col1].fillna(input_df[col2])
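
A quick toy check (made-up two-row frame) exercises both paths:

toy = pd.DataFrame({'a': [1, None], 'b': [None, 2]})
merge_cols(toy, 'a', 'b', 'merged')   # fine: no row has both 'a' and 'b' filled, so merged == [1.0, 2.0]
# merge_cols(pd.DataFrame({'a': [1], 'b': [2]}), 'a', 'b', 'merged')   # would raise ValueError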

    
d_data = pd.concat(all_data
                  , axis='index'
                  , join='outer'
                  , ignore_index=True
                  , sort=True)

merge_cols(d_data, 'Last Update', 'Last_Update', 'updated_on')

d_data
Active Admin2 Combined_Key Confirmed Country/Region Country_Region Deaths FIPS Last Update Last_Update Lat Latitude Long_ Longitude Province/State Province_State Recovered source_file updated_on
0 NaN NaN NaN 1.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Anhui NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv 1/22/2020 17:00
1 NaN NaN NaN 14.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Beijing NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv 1/22/2020 17:00
2 NaN NaN NaN 6.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Chongqing NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv 1/22/2020 17:00
3 NaN NaN NaN 1.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Fujian NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv 1/22/2020 17:00
4 NaN NaN NaN NaN Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Gansu NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv 1/22/2020 17:00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
83385 309.0 NaN West Bank and Gaza 374.0 NaN West Bank and Gaza 2.0 NaN NaN 2020-04-15 22:56:32 31.952200 NaN 35.233200 NaN NaN NaN 63.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv 2020-04-15 22:56:32
83386 6.0 NaN Western Sahara 6.0 NaN Western Sahara 0.0 NaN NaN 2020-04-15 22:56:32 24.215500 NaN -12.885800 NaN NaN NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv 2020-04-15 22:56:32
83387 1.0 NaN Yemen 1.0 NaN Yemen 0.0 NaN NaN 2020-04-15 22:56:32 15.552727 NaN 48.516388 NaN NaN NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv 2020-04-15 22:56:32
83388 16.0 NaN Zambia 48.0 NaN Zambia 2.0 NaN NaN 2020-04-15 22:56:32 -13.133897 NaN 27.849332 NaN NaN NaN 30.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv 2020-04-15 22:56:32
83389 19.0 NaN Zimbabwe 23.0 NaN Zimbabwe 3.0 NaN NaN 2020-04-15 22:56:32 -19.015438 NaN 29.154857 NaN NaN NaN 1.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv 2020-04-15 22:56:32

83390 rows × 19 columns

Solution for problem 2


col_pairs = {
    'Country/Region': 'Country_Region'
    ,'Province/State': 'Province_State'
    ,'Long_': 'Longitude'
    ,'Lat': 'Latitude'
    ,'Last Update': 'Last_Update'
}

all_data = []
meta_data = []

# For every file in our data set path
for file in sorted(glob.glob(dataset_path)):
    
    # 1. Read the file into a temporary data frame, renaming the old-style
    #    column names to the new ones as we load
    df = pd.read_csv(file).rename(columns=col_pairs)
    
    # 2. Append a dictionary with the file meta_data into the metadata list
    meta_data.append( {  'file_name': file.rsplit(os.path.sep,1)[1]
                       , 'num_rows': df.shape[0]
                       , 'num_cols': df.shape[1]
                       , 'col_names': '\t'.join(sorted(list(df.columns)))} )
    
    # 3. Add the file name to the loaded data
    df['source_file'] = file
    
    # 4. Add the loaded data with the file name to a list with all the loaded data
    all_data.append(df)

# 5. Create a table/dataframe out of our meta_data 
meta_data = pd.DataFrame(meta_data)


# output result to notebook window
meta_data.groupby(['num_cols'])\
         .agg({ 'num_rows': 'sum'      # total rows per column-count group
              , 'file_name': sorted    # list the files in each group
              , 'col_names': set })    # the distinct column layouts in each group
num_rows file_name col_names
num_cols
6 2818 [01-22-2020.csv, 01-23-2020.csv, 01-24-2020.csv, 01-25-2020.csv, 01-26-2020.csv, 01-27-2020.csv, 01-28-2020.csv, 01-29-2020.csv, 01-30-2020.csv, 0... {Confirmed\tCountry_Region\tDeaths\tLast_Update\tProvince_State\tRecovered}
8 4799 [03-01-2020.csv, 03-02-2020.csv, 03-03-2020.csv, 03-04-2020.csv, 03-05-2020.csv, 03-06-2020.csv, 03-07-2020.csv, 03-08-2020.csv, 03-09-2020.csv, 0... {Confirmed\tCountry_Region\tDeaths\tLast_Update\tLatitude\tLongitude\tProvince_State\tRecovered}
12 75773 [03-22-2020.csv, 03-23-2020.csv, 03-24-2020.csv, 03-25-2020.csv, 03-26-2020.csv, 03-27-2020.csv, 03-28-2020.csv, 03-29-2020.csv, 03-30-2020.csv, 0... {Active\tAdmin2\tCombined_Key\tConfirmed\tCountry_Region\tDeaths\tFIPS\tLast_Update\tLatitude\tLongitude\tProvince_State\tRecovered}
d_data = pd.concat(all_data
                  , axis='index'
                  , join='outer'
                  , ignore_index=True
                  , sort=True)


d_data.head()
Active Admin2 Combined_Key Confirmed Country_Region Deaths FIPS Last_Update Latitude Longitude Province_State Recovered source_file
0 NaN NaN NaN 1.0 Mainland China NaN NaN 1/22/2020 17:00 NaN NaN Anhui NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
1 NaN NaN NaN 14.0 Mainland China NaN NaN 1/22/2020 17:00 NaN NaN Beijing NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
2 NaN NaN NaN 6.0 Mainland China NaN NaN 1/22/2020 17:00 NaN NaN Chongqing NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
3 NaN NaN NaN 1.0 Mainland China NaN NaN 1/22/2020 17:00 NaN NaN Fujian NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
4 NaN NaN NaN NaN Mainland China NaN NaN 1/22/2020 17:00 NaN NaN Gansu NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv

Solution for problem 3

df = pd.read_csv('us_jhu_data/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
df = df.melt( id_vars=['Province/State', 'Country/Region', 'Lat', 'Long']
            , var_name='report_date'
            , value_name='confirmed_cases')

df.rename(columns={'Country/Region': 'Country'}, inplace=True)
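
If melt is new to you, here is a toy example (made-up data) of the wide-to-long reshape it performs: every date column becomes a row.

toy = pd.DataFrame({'Country': ['A', 'B'], '1/22/20': [0, 1], '1/23/20': [2, 3]})
toy.melt(id_vars=['Country'], var_name='report_date', value_name='confirmed_cases')
#   Country report_date  confirmed_cases
# 0       A     1/22/20                0
# 1       B     1/22/20                1
# 2       A     1/23/20                2
# 3       B     1/23/20                3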
import pycountry

df.loc[ df['Country']=='Mainland China', 'Country'] = 'China'
df.loc[ df['Country']=='Macau', 'Country'] = 'China'
df.loc[ df['Country']=='South Korea', 'Country'] = 'Republic of Korea'
df.loc[ df['Country']=='Korea, South', 'Country'] = 'Republic of Korea'
df.loc[ df['Country']=='Ivory Coast', 'Country'] = "Republic of Cote d'Ivoire"
df.loc[ df['Country']=='North Ireland', 'Country'] = "United Kingdom"
df.loc[ df['Country']=='Republic of Ireland', 'Country'] = "Ireland"
df.loc[ df['Country']=='St. Martin', 'Country'] = "France" 
df.loc[ df['Country']=='Iran (Islamic Republic of)', 'Country'] = "Iran"
df.loc[ df['Country']=='West Bank and Gaza', 'Country'] = "Palestine"
df.loc[ df['Country']=='Channel Islands', 'Country'] = "UK" ## Not technically, but effectively, great tax laws
df.loc[ df['Country'].isin([ 'Congo (Brazzaville)'
                                    ,'Congo (Kinshasa)']), 'Country'] = 'Congo'
df.loc[ df['Country']=='Gambia, The', 'Country'] = "Gambia" 
df.loc[ df['Country']=='Bahamas, The', 'Country'] = "Bahamas" 
df.loc[ df['Country']=='Cape Verde', 'Country'] = 'Republic of Cabo Verde'
df.loc[ df['Country']=='East Timor', 'Country'] = 'Timor-Leste'
df.loc[ df['Country']=='Laos', 'Country'] = "Lao People's Democratic Republic" 
df.loc[ df['Country']=="Burma", 'Country'] = 'Myanmar'

# drop disputed territories and entries that aren't territories at all (cruise ships etc.)
df = df.drop(df[df['Country']=='Others'].index)
df = df.drop(df[df['Country']=='Taipei and environs'].index)
df = df.drop(df[df['Country']=='occupied Palestinian territory'].index) 
df = df.drop(df[df['Country']=='Taiwan*'].index)
df = df.drop(df[df['Country']=='Taiwan'].index)
df = df.drop(df[df['Country']=='MS Zaandam'].index)
df = df.drop(df[df['Country']=='Cruise Ship'].index)
df = df.drop(df[df['Country']=='Diamond Princess'].index)
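
The same drops can be written in a single pass with isin (an equivalent sketch):

drop_these = ['Others', 'Taipei and environs', 'occupied Palestinian territory'
             , 'Taiwan*', 'Taiwan', 'MS Zaandam', 'Cruise Ship', 'Diamond Princess']
df = df[~df['Country'].isin(drop_these)]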
countries = pd.Series(df['Country'].unique())

def get_iso(country_name):
    return {'Country':country_name, 'ISO_3': pycountry.countries.search_fuzzy(country_name)[0].alpha_3}
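# search_fuzzy returns a list of candidate matches, best match first, e.g.
# get_iso('Germany') -> {'Country': 'Germany', 'ISO_3': 'DEU'}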

countries = pd.DataFrame(list(countries.map(get_iso)))

df = df.merge(countries
            , on='Country'
            , how='inner'         # keep only countries that received an ISO code
            , validate='m:1')     # assert each country matches exactly one lookup row
df
Province/State Country Lat Long report_date confirmed_cases ISO_3
0 NaN Afghanistan 33.000000 65.000000 1/22/20 0 AFG
1 NaN Afghanistan 33.000000 65.000000 1/23/20 0 AFG
2 NaN Afghanistan 33.000000 65.000000 1/24/20 0 AFG
3 NaN Afghanistan 33.000000 65.000000 1/25/20 0 AFG
4 NaN Afghanistan 33.000000 65.000000 1/26/20 0 AFG
... ... ... ... ... ... ... ...
22180 NaN Yemen 15.552727 48.516388 4/11/20 1 YEM
22181 NaN Yemen 15.552727 48.516388 4/12/20 1 YEM
22182 NaN Yemen 15.552727 48.516388 4/13/20 1 YEM
22183 NaN Yemen 15.552727 48.516388 4/14/20 1 YEM
22184 NaN Yemen 15.552727 48.516388 4/15/20 1 YEM

22185 rows × 7 columns

As a final check, look for countries whose ISO_3 lookup failed; the empty result below means every country was matched:

df[pd.isnull(df['ISO_3'])]
Province/State Country Lat Long report_date confirmed_cases ISO_3
viz2 = \
df.groupby(by=['report_date', 'ISO_3', 'Country'])\
      .agg({'confirmed_cases': 'sum'})\
      .reset_index()
viz2['sqrt_Confirmed'] = np.sqrt(viz2['confirmed_cases'].clip(lower=1))  # square root of the confirmed cases,
                                                                         # clipping values below 1 up to 1 first

fig = \
px.choropleth(viz2
             , locations='ISO_3'
             , locationmode='ISO-3'
             , animation_frame='report_date'
             , hover_data=['confirmed_cases', 'Country']
             , animation_group='ISO_3'
             , color='sqrt_Confirmed'
             , color_continuous_scale= [[0,  'rgba(0, 255, 0, .07)' ]
                                        ,[0.5, 'green']
                                        ,[1, 'palegreen']]
             , template='plotly_dark')
fig
As before, the inline renderer can't display the figure here, so open it in a browser:

fig.show(renderer='browser')