Mapping Data With Plotly

25 minute read

Plotting data on a map with plotly

Before you start, make sure that you have run the following commands in the Anaconda Prompt:

conda activate analysis-101

conda install git

conda install pycountry

git clone https://github.com/CSSEGISandData/COVID-19.git us_jhu_data

Explore environment and downloaded data

As with any downloaded data, it’s prudent to explore what we actually received and whether there are obvious patterns.

Let’s start by exploring what we can see from our notebook. We can do this using the Jupyter magic command %ls, which behaves like the Unix ls command (or the Windows DIR command), listing the files and folders in the directory the notebook was opened from.

%ls 
mapping-data-with-plotly.ipynb
us_jhu_data/

So it looks like we’re in our own folder: we’ve got our notebook and the data we just downloaded. Let’s have a look inside.

%ls us_jhu_data
README.md                       csse_covid_19_data/
archived_data/                 who_covid_19_situation_reports/

That command wasn’t super helpful; all we got back was a list of more folders to look into. Let’s just go nuclear and look at everything together. To do that we can use the -R switch to recursively list all the files and folders.

%ls -R us_jhu_data/ 
# %ls /S us_jhu_data # use this for windows
README.md                       csse_covid_19_data/
archived_data/                  who_covid_19_situation_reports/

us_jhu_data//archived_data:
README.md                       archived_time_series/
archived_daily_case_updates/

us_jhu_data//archived_data/archived_daily_case_updates:
01-21-2020_2200.csv  01-29-2020_1430.csv  ...

us_jhu_data//archived_data/archived_time_series:
README.md
time_series_19-covid-Confirmed_archived_0325.csv
...

us_jhu_data//csse_covid_19_data:
README.md                       csse_covid_19_daily_reports_us/
...

us_jhu_data//csse_covid_19_data/csse_covid_19_daily_reports:
01-22-2020.csv  02-09-2020.csv  02-27-2020.csv  ...

us_jhu_data//csse_covid_19_data/csse_covid_19_daily_reports_us:
04-12-2020.csv  04-13-2020.csv  04-14-2020.csv  ...

us_jhu_data//csse_covid_19_data/csse_covid_19_time_series:
README.md
time_series_covid19_confirmed_US.csv
...

us_jhu_data//who_covid_19_situation_reports:
README.md                         who_covid_19_sit_rep_time_series/
who_covid_19_sit_rep_pdfs/

us_jhu_data//who_covid_19_situation_reports/who_covid_19_sit_rep_pdfs:
20200121-sitrep-1-2019-ncov.pdf     20200222-sitrep-33-covid-19.pdf
20200122-sitrep-2-2019-ncov.pdf     ...

us_jhu_data//who_covid_19_situation_reports/who_covid_19_sit_rep_time_series:
who_covid_19_sit_rep_time_series.csv

From the output of ls we can see that we have the following data sets:

  • archived daily case updates
  • archived time series data
  • a lookup table to convert IDs to US FIPS codes (county codes)
  • daily reports from Jan 22 to Apr 15
  • time series data with different focuses
  • unstructured WHO situation reports
  • WHO structured time series of the situation reports

For brevity, we will only explore the following two data sets:

  • time series data with different focuses
  • daily reports from Jan 22 to Apr 15

Let’s start with the global time series data.

Start the Python engines, import libraries

Before we start we need to load the libraries we’ll use. Below is a collection of libraries that I reach for time and time again, so it’s prudent to load them all up front rather than scattering imports through the notebook.

import pandas as pd
import numpy as np
import os
import plotly.express as px
from pandas_profiling import ProfileReport

Global time series data

pd.read_csv('us_jhu_data/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')\
  .head()

# further exploration of 'us_jhu_data/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'
# will be left to participants as an exercise.
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 4/6/20 4/7/20 4/8/20 4/9/20 4/10/20 4/11/20 4/12/20 4/13/20 4/14/20 4/15/20
0 NaN Afghanistan 33.0000 65.0000 0 0 0 0 0 0 ... 367 423 444 484 521 555 607 665 714 784
1 NaN Albania 41.1533 20.1683 0 0 0 0 0 0 ... 377 383 400 409 416 433 446 467 475 494
2 NaN Algeria 28.0339 1.6596 0 0 0 0 0 0 ... 1423 1468 1572 1666 1761 1825 1914 1983 2070 2160
3 NaN Andorra 42.5063 1.5218 0 0 0 0 0 0 ... 525 545 564 583 601 601 638 646 659 673
4 NaN Angola -11.2027 17.8739 0 0 0 0 0 0 ... 16 17 19 19 19 19 19 19 19 19

5 rows × 89 columns

This file has almost exactly what we need, but using it would be kind of cheating, so let’s take the other dataset, the daily reports, and clean that data up ourselves for the mapping exercise.

Explore chosen dataset – Daily Reports

Let’s start by peeking into one file and having a look at what’s in there.

idata = 'us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/'
pd.read_csv(os.path.join(idata,'03-13-2020.csv'))
Province/State Country/Region Last Update Confirmed Deaths Recovered Latitude Longitude
0 Hubei China 2020-03-13T11:09:03 67786 3062 51553 30.9756 112.2707
1 Guangdong China 2020-03-13T11:09:03 1356 8 1296 23.3417 113.4244
2 Henan China 2020-03-11T08:13:09 1273 22 1249 33.8820 113.6140
3 Zhejiang China 2020-03-12T01:33:02 1215 1 1197 29.1832 120.0934
4 Hunan China 2020-03-13T11:09:03 1018 4 1005 27.6104 111.7088
... ... ... ... ... ... ... ... ...
225 NaN Afghanistan 2020-03-11T20:00:00 7 0 0 33.0000 65.0000
226 NaN Monaco 2020-03-11T20:00:00 2 0 0 43.7333 7.4167
227 NaN Liechtenstein 2020-03-11T20:00:00 1 0 0 47.1400 9.5500
228 NaN Guyana 2020-03-11T20:00:00 1 1 0 5.0000 -58.7500
229 NaN Taiwan* 2020-03-11T20:00:00 50 1 20 23.7000 121.0000

230 rows × 8 columns

Check the size and column count for all the files

import glob

dataset_path = os.path.join(idata,'*.csv')

for file in glob.glob(dataset_path):
    print( pd.read_csv(file).shape) # calculate data shape (rows, columns)
(101, 6)
(105, 6)
(2883, 12)
(2911, 12)
...
for file in glob.glob(dataset_path):
    print(file # print file path
          , pd.read_csv(file).shape) # calculate data shape (rows, columns)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-26-2020.csv (101, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-27-2020.csv (105, 6)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-08-2020.csv (2883, 12)
us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-09-2020.csv (2911, 12)
...
for file in glob.glob(dataset_path):
    print(file.rsplit(os.path.sep,1)[1] # isolate file name
          , pd.read_csv(file).shape) # calculate data shape (rows, columns)
02-26-2020.csv (101, 6)
02-27-2020.csv (105, 6)
04-08-2020.csv (2883, 12)
...
# Sort the file list

for file in sorted(glob.glob(dataset_path)):
    print(file.rsplit(os.path.sep,1)[1] # isolate file name
          , pd.read_csv(file).shape) # calculate data shape (rows, columns)
01-22-2020.csv (38, 6)
01-23-2020.csv (46, 6)
01-24-2020.csv (41, 6)
...
for file in sorted(glob.glob(dataset_path)):
    print(file.rsplit(os.path.sep,1)[1] # isolate file name
          , pd.read_csv(file).shape # calculate data shape (rows, columns)
          , os.stat(file).st_size/(1024*1024)) # get the individual file size in MB
01-22-2020.csv (38, 6) 0.0015974044799804688
...
03-01-2020.csv (125, 8) 0.0073528289794921875
...
03-22-2020.csv (3417, 12) 0.3128471374511719
...

What’s the goal when preparing this dataset for visualization?

We are going to use Plotly (which renders with plotly.js) to visualize the data; however, that means the data needs to be ‘clean’. The data must be in exactly the format and shape the visualization requires, because Plotly won’t make assumptions about how to draw your data points. I’ve seen some beautifully Picassoesque and Dr Seuss-like graphs come out of malformed data.

In this instance we want to create a graph that shows the number of COVID cases per country per day, and animates through the days. To do this we will need the data in the following format:

| Country Name | Country ISO Code | Day        | Number of Cases |
|--------------|------------------|------------|-----------------|
| Switzerland  | CHE              | 2020-02-29 | xxx             |
| France       | FRA              | 2020-02-29 | yyy             |
| Switzerland  | CHE              | 2020-03-01 | zzz             |
| France       | FRA              | 2020-03-01 | aaa             |
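To make the target concrete, here’s that shape as a tiny DataFrame (a sketch; the column names match what we’ll build later, and the counts are placeholders mirroring the table above):

import pandas as pd

# One row per country per day -- the 'tidy' shape plotly expects
target_shape = pd.DataFrame({
    'Country':     ['Switzerland', 'France', 'Switzerland', 'France'],
    'ISO_3':       ['CHE', 'FRA', 'CHE', 'FRA'],
    'report_date': ['2020-02-29', '2020-02-29', '2020-03-01', '2020-03-01'],
    'Confirmed':   ['xxx', 'yyy', 'zzz', 'aaa'],  # placeholder case counts
})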

To get there we’ll have to transform whatever format the chosen data set is in into the format above. To do that we’ll load and harmonize the daily reports with the following steps:

  1. Load the data into memory
  2. Analyze the dataset and make decisions about what to do
  3. Consolidate data which has changed names over time
  4. Clean up the country names to identify the ISO codes for countries
  5. Visualize the data on a world map
  6. Tweak the visualization
  7. Adapt our visualization to look at cases per 100k instead of number of cases

1. Load the data set into a pandas dataframe

Load all the data into memory and see how we can combine the data into a single set for us to use with data visualization.

import glob 
# Create two lists to store our file metadata and file data
all_data = []
meta_data = []

# For every file in our data set path
for file in sorted(glob.glob(dataset_path)):
    
    # 1. Read the file to a temporary data frame
    df = pd.read_csv(file)
    
    # 2. Append a dictionary with the file meta_data into the metadata list
    meta_data.append( {  'file_name': file.rsplit(os.path.sep,1)[1]
                       , 'num_rows': df.shape[0]
                       , 'num_cols': df.shape[1]
                       , 'col_names': '\t'.join(sorted(list(df.columns)))} )
    
    # 3. Add the file name to the loaded data
    df['source_file'] = file
    
    # 4. Add the loaded data with the file name to a list with all the loaded data
    all_data.append(df)

# 5. Create a table/dataframe out of our meta_data 
meta_data = pd.DataFrame(meta_data)

# show the metadata in jupyter notebook
meta_data
file_name num_rows num_cols col_names
0 01-22-2020.csv 38 6 Confirmed\tCountry/Region\tDeaths\tLast Update...
1 01-23-2020.csv 46 6 Confirmed\tCountry/Region\tDeaths\tLast Update...
2 01-24-2020.csv 41 6 Confirmed\tCountry/Region\tDeaths\tLast Update...
3 01-25-2020.csv 44 6 Confirmed\tCountry/Region\tDeaths\tLast Update...
4 01-26-2020.csv 47 6 Confirmed\tCountry/Region\tDeaths\tLast Update...
... ... ... ... ...
80 04-11-2020.csv 2966 12 Active\tAdmin2\tCombined_Key\tConfirmed\tCount...
81 04-12-2020.csv 2989 12 Active\tAdmin2\tCombined_Key\tConfirmed\tCount...
82 04-13-2020.csv 3002 12 Active\tAdmin2\tCombined_Key\tConfirmed\tCount...
83 04-14-2020.csv 3014 12 Active\tAdmin2\tCombined_Key\tConfirmed\tCount...
84 04-15-2020.csv 3027 12 Active\tAdmin2\tCombined_Key\tConfirmed\tCount...

85 rows × 4 columns


pd.set_option('display.max_colwidth', 150) # widen columns so the full col_names strings stay visible

# output result to notebook window
meta_data.groupby(['num_cols'])\
         .agg({ 'num_rows': 'sum'
              , 'file_name': sorted
              , 'col_names': set })

num_rows file_name col_names
num_cols
6 2818 [01-22-2020.csv, 01-23-2020.csv, 01-24-2020.csv, 01-25-2020.csv, 01-26-2020.csv, 01-27-2020.csv, 01-28-2020.csv, 01-29-2020.csv, 01-30-2020.csv, 0... {Confirmed\tCountry/Region\tDeaths\tLast Update\tProvince/State\tRecovered}
8 4799 [03-01-2020.csv, 03-02-2020.csv, 03-03-2020.csv, 03-04-2020.csv, 03-05-2020.csv, 03-06-2020.csv, 03-07-2020.csv, 03-08-2020.csv, 03-09-2020.csv, 0... {Confirmed\tCountry/Region\tDeaths\tLast Update\tLatitude\tLongitude\tProvince/State\tRecovered}
12 75773 [03-22-2020.csv, 03-23-2020.csv, 03-24-2020.csv, 03-25-2020.csv, 03-26-2020.csv, 03-27-2020.csv, 03-28-2020.csv, 03-29-2020.csv, 03-30-2020.csv, 0... {Active\tAdmin2\tCombined_Key\tConfirmed\tCountry_Region\tDeaths\tFIPS\tLast_Update\tLat\tLong_\tProvince_State\tRecovered}
d_data = pd.concat(all_data
                  , axis='index'
                  , join='outer'
                  , ignore_index=True
                  , sort=True)

d_data.head()
Active Admin2 Combined_Key Confirmed Country/Region Country_Region Deaths FIPS Last Update Last_Update Lat Latitude Long_ Longitude Province/State Province_State Recovered source_file
0 NaN NaN NaN 1.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Anhui NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
1 NaN NaN NaN 14.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Beijing NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
2 NaN NaN NaN 6.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Chongqing NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
3 NaN NaN NaN 1.0 Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Fujian NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
4 NaN NaN NaN NaN Mainland China NaN NaN NaN 1/22/2020 17:00 NaN NaN NaN NaN NaN Gansu NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv

2. Analyze the data set

pr = ProfileReport(d_data)
pr.to_widgets()
Summarize dataset: 100%|██████████| 32/32 [00:19<00:00,  1.68it/s, Completed]
Generate report structure: 100%|██████████| 1/1 [00:03<00:00,  3.54s/it]
Render widgets: 100%|██████████| 1/1 [00:13<00:00, 13.76s/it]

![Profile Report](/assets/images/posts/mapping-data-with-plotly/profile-report.png)
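If the widget view is slow to render or you want to share the analysis, pandas-profiling can also write the same report to a standalone HTML file (the file name here is arbitrary):

# Alternative output: save the profile report as a shareable HTML file
pr.to_file('daily_reports_profile.html')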

3. Consolidate data whose names may have changed over time

Some of the data columns we looked at earlier appear to have been renamed over time. Let’s verify that for each pair and then merge the columns.


d_data[ 
    (~pd.isnull(d_data['Country/Region'])) # select rows where Country/Region is NOT null
   &                                       # AND
    (~pd.isnull(d_data['Country_Region'])) # select rows where Country_Region is NOT null
]

# we want to see no rows from this query; the two columns should never be populated at the same time.
Active Admin2 Combined_Key Confirmed Country/Region Country_Region Deaths FIPS Last Update Last_Update Lat Latitude Long_ Longitude Province/State Province_State Recovered source_file

Now that we’ve confirmed the columns never overlap, we can perform the merge operation.

d_data['Country'] = d_data['Country/Region'].fillna(d_data['Country_Region']) 
#          ↑                      ↑                              ↑           
#     3. Add to       2. put it here if its null            1. take this     
#     newcolumn  

d_data['Country/Region'].fillna(d_data['Country_Region']) 
0            Mainland China
1            Mainland China
2            Mainland China
3            Mainland China
4            Mainland China
                ...        
83385    West Bank and Gaza
83386        Western Sahara
83387                 Yemen
83388                Zambia
83389              Zimbabwe
Name: Country/Region, Length: 83390, dtype: object

That worked well for the first column pair; let’s do it for the others. This time we’ll automate the overlap check, stopping execution if the columns ever overlap.

if len(d_data[ 
    (~pd.isnull(d_data['Province/State']))
   &(~pd.isnull(d_data['Province_State'])) 
    ]) > 0:
    raise ValueError('Columns overlap, further investigation is needed.')


# no overlap found, safe to merge

d_data['ps'] = d_data['Province/State'].fillna(d_data['Province_State'])
#          ↑                    ↑                            ↑           
#     3. Add to     2. put it here if its null          1. take this     
#     newcolumn  

d_data['Province/State'].fillna(d_data['Province_State'])
0            Anhui
1          Beijing
2        Chongqing
3           Fujian
4            Gansu
           ...    
83385          NaN
83386          NaN
83387          NaN
83388          NaN
83389          NaN
Name: Province/State, Length: 83390, dtype: object
if len(d_data[ 
    (~pd.isnull(d_data['Long_']))
   &(~pd.isnull(d_data['Longitude'])) 
    ]) > 0:
    raise ValueError('Columns overlap, further investigation is needed.')


# no overlap found, safe to merge

d_data['long_val'] = d_data['Long_'].fillna(d_data['Longitude'])
#          ↑                   ↑                         ↑           
#     3. Add to    2. put it here if its null       1. take this     
#     newcolumn  

d_data['Long_'].fillna(d_data['Longitude'])
0              NaN
1              NaN
2              NaN
3              NaN
4              NaN
           ...    
83385    35.233200
83386   -12.885800
83387    48.516388
83388    27.849332
83389    29.154857
Name: Long_, Length: 83390, dtype: float64
if len(d_data[ 
    (~pd.isnull(d_data['Lat']))
   &(~pd.isnull(d_data['Latitude'])) 
    ]) != 0:
    raise ValueError('Columns overlap, further investigation is needed.')

# no overlap found, safe to merge

d_data['lat_val'] = d_data['Lat'].fillna(d_data['Latitude'])
#          ↑                   ↑                         ↑           
#     3. Add to    2. put it here if its null       1. take this     
#     newcolumn  

d_data['Lat'].fillna(d_data['Latitude'])
0              NaN
1              NaN
2              NaN
3              NaN
4              NaN
           ...    
83385    31.952200
83386    24.215500
83387    15.552727
83388   -13.133897
83389   -19.015438
Name: Lat, Length: 83390, dtype: float64
if len(d_data[ 
    (~pd.isnull(d_data['Last Update']))
   &(~pd.isnull(d_data['Last_Update'])) 
    ]) != 0:
    raise ValueError('Columns overlap, further investigation is needed.')

# no overlap found, safe to merge

d_data['updated_on'] = d_data['Last Update'].fillna(d_data['Last_Update']) 
#          ↑                         ↑                            ↑           
#     3. Add to          2. put it here if its null          1. take this   

d_data['Last Update'].fillna(d_data['Last_Update']) 
0            1/22/2020 17:00
1            1/22/2020 17:00
2            1/22/2020 17:00
3            1/22/2020 17:00
4            1/22/2020 17:00
                ...         
83385    2020-04-15 22:56:32
83386    2020-04-15 22:56:32
83387    2020-04-15 22:56:32
83388    2020-04-15 22:56:32
83389    2020-04-15 22:56:32
Name: Last Update, Length: 83390, dtype: object
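We’ve now repeated the same verify-then-merge pattern four times, which suggests wrapping it in a small helper. A minimal sketch (the function name and signature are my own, not from the original notebook):

def coalesce_renamed(df, old_col, new_col, target_col):
    """Combine a renamed column pair into target_col, stopping if the pair ever overlaps."""
    if (df[old_col].notna() & df[new_col].notna()).any():
        raise ValueError(f'{old_col} and {new_col} overlap, further investigation is needed.')
    df[target_col] = df[old_col].fillna(df[new_col])

# e.g. coalesce_renamed(d_data, 'Last Update', 'Last_Update', 'updated_on')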

With the columns merged, we can drop the old columns as they are no longer needed.

d_data.columns
Index(['Active', 'Admin2', 'Combined_Key', 'Confirmed', 'Country/Region',
       'Country_Region', 'Deaths', 'FIPS', 'Last Update', 'Last_Update', 'Lat',
       'Latitude', 'Long_', 'Longitude', 'Province/State', 'Province_State',
       'Recovered', 'source_file', 'Country', 'ps', 'long_val', 'lat_val',
       'updated_on'],
      dtype='object')
d_data.drop(columns=['Active', 'Admin2', 'Combined_Key'
                     , 'Country/Region', 'Country_Region'
                     , 'Last Update', 'Last_Update', 'Lat'
                     , 'Latitude', 'Long_', 'Longitude'
                     , 'Province/State', 'Province_State']
            , inplace=True)

In order to work with the data later on, let’s convert the updated_on column from strings to proper datetimes so that we can sort and group it correctly.

# convert updated time to a datetime object to work with
d_data['updated_on'] = pd.to_datetime(d_data['updated_on'])
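A quick optional sanity check (my own addition) confirms the conversion produced a sensible range:

# The parsed timestamps should span late January to mid April 2020
print(d_data['updated_on'].min(), d_data['updated_on'].max())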

4. Clean up the country names to identify the ISO codes for countries

To add data onto the map we need ISO codes for all the countries. For that we can use the pycountry library we installed at the beginning of the course. Let’s first see whether there will be any issues by looking at the country names.

set(d_data['Country'].unique())
{' Azerbaijan',
 'Afghanistan',
 'Albania',
 ...
 'Zambia',
 'Zimbabwe',
 'occupied Palestinian territory'}
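Most of these names should resolve through pycountry’s fuzzy search, but entries like ' Azerbaijan' (note the leading space) and 'occupied Palestinian territory' hint at trouble ahead. As a quick illustration of the lookup we’re about to rely on (results can vary slightly between pycountry versions):

import pycountry

# search_fuzzy returns a list of candidate matches, best match first
match = pycountry.countries.search_fuzzy('Switzerland')[0]
print(match.name, match.alpha_3)  # Switzerland CHE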
import pycountry

d_data.loc[ d_data['Country']=='Mainland China', 'Country'] = 'China'
d_data.loc[ d_data['Country']=='Macau', 'Country'] = 'China'
d_data.loc[ d_data['Country']=='South Korea', 'Country'] = 'Republic of Korea'
d_data.loc[ d_data['Country']=='Korea, South', 'Country'] = 'Republic of Korea'
d_data.loc[ d_data['Country']=='Ivory Coast', 'Country'] = "Republic of Cote d'Ivoire"
d_data.loc[ d_data['Country']=='North Ireland', 'Country'] = "United Kingdom"
d_data.loc[ d_data['Country']=='Republic of Ireland', 'Country'] = "Ireland"
d_data.loc[ d_data['Country']=='St. Martin', 'Country'] = "France" 
d_data.loc[ d_data['Country']=='Iran (Islamic Republic of)', 'Country'] = "Iran"
d_data.loc[ d_data['Country']=='West Bank and Gaza', 'Country'] = "Palestine"
d_data.loc[ d_data['Country']=='occupied Palestinian territory', 'Country'] = "Palestine"
d_data.loc[ d_data['Country']=='Channel Islands', 'Country'] = "UK" ## Not technically, but effectively, great tax laws
d_data.loc[ d_data['Country'].isin([ 'Congo (Brazzaville)'
                                    ,'Congo (Kinshasa)']), 'Country'] = 'Congo'
d_data.loc[ d_data['Country']=='Gambia, The', 'Country'] = "Gambia" 
d_data.loc[ d_data['Country']=='Bahamas, The', 'Country'] = "Bahamas" 
d_data.loc[ d_data['Country']=='Cape Verde', 'Country'] = 'Republic of Cabo Verde'
d_data.loc[ d_data['Country']=='East Timor', 'Country'] = 'Timor-Leste'
d_data.loc[ d_data['Country']=='Laos', 'Country'] = "Lao People's Democratic Republic" 
d_data.loc[ d_data['Country']=="Burma", 'Country'] = 'Myanmar'

# drop disputed territories and entries that are not territories at all
d_data = d_data.drop(d_data[d_data['Country']=='Others'].index)
d_data = d_data.drop(d_data[d_data['Country']=='Taipei and environs'].index)
d_data = d_data.drop(d_data[d_data['Country']=='MS Zaandam'].index)
d_data = d_data.drop(d_data[d_data['Country']=='Cruise Ship'].index)
d_data = d_data.drop(d_data[d_data['Country']=='Diamond Princess'].index)
countries = pd.Series(d_data['Country'].unique())

def get_iso(country_name):
    return {'Country':country_name, 'ISO_3': pycountry.countries.search_fuzzy(country_name)[0].alpha_3}

countries = pd.DataFrame(list(countries.map(get_iso)))
d_data.merge(countries
            , on='Country'
            , how='inner'
            , validate='m:1') 
Confirmed Deaths FIPS Recovered source_file Country ps long_val lat_val updated_on ISO_3
0 1.0 NaN NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv China Anhui NaN NaN 2020-01-22 17:00:00 CHN
1 14.0 NaN NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv China Beijing NaN NaN 2020-01-22 17:00:00 CHN
2 6.0 NaN NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv China Chongqing NaN NaN 2020-01-22 17:00:00 CHN
3 1.0 NaN NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv China Fujian NaN NaN 2020-01-22 17:00:00 CHN
4 NaN NaN NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv China Gansu NaN NaN 2020-01-22 17:00:00 CHN
... ... ... ... ... ... ... ... ... ... ... ...
83206 1.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-11-2020.csv Yemen NaN 48.516388 15.552727 2020-04-11 22:45:13 YEM
83207 1.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-12-2020.csv Yemen NaN 48.516388 15.552727 2020-04-12 23:17:00 YEM
83208 1.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-13-2020.csv Yemen NaN 48.516388 15.552727 2020-04-13 23:07:34 YEM
83209 1.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-14-2020.csv Yemen NaN 48.516388 15.552727 2020-04-14 23:33:12 YEM
83210 1.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv Yemen NaN 48.516388 15.552727 2020-04-15 22:56:32 YEM

83211 rows × 11 columns
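Before overwriting d_data with the merged result, an optional defensive check (my own addition, not in the original flow) confirms the inner join won’t silently drop anything beyond the territories we removed above:

# Any Country value with no ISO match would show up here as 'left_only'
check = d_data.merge(countries, on='Country', how='left', indicator=True)
print(check.loc[check['_merge'] == 'left_only', 'Country'].unique())  # expect an empty array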

d_data =\
    d_data.merge(countries
                , on='Country'
                , how='inner'
                , validate='m:1') 

4. a. Identify and fix issues with the last-updated column for our grouping purposes

pd.set_option("display.max_rows", 300) # increase the number of rows visible 

d_data[d_data['ISO_3'] == 'DEU']
Confirmed Deaths FIPS Recovered source_file Country ps long_val lat_val updated_on ISO_3
76227 4.0 NaN NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-28-2020.csv Germany Bavaria NaN NaN 2020-01-28 23:00:00 DEU
76228 4.0 NaN NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-29-2020.csv Germany Bavaria NaN NaN 2020-01-29 19:30:00 DEU
76229 4.0 NaN NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-30-2020.csv Germany Bavaria NaN NaN 2020-01-30 16:00:00 DEU
76230 5.0 NaN NaN NaN us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/01-31-2020.csv Germany Bavaria NaN NaN 2020-01-31 23:59:00 DEU
76231 8.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-01-2020.csv Germany NaN NaN NaN 2020-02-01 18:33:00 DEU
76232 10.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-02-2020.csv Germany NaN NaN NaN 2020-02-02 18:03:05 DEU
76233 12.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-03-2020.csv Germany NaN NaN NaN 2020-02-03 20:53:02 DEU
76234 12.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-04-2020.csv Germany NaN NaN NaN 2020-02-03 20:53:02 DEU
76235 12.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-05-2020.csv Germany NaN NaN NaN 2020-02-03 20:53:02 DEU
76236 12.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-06-2020.csv Germany NaN NaN NaN 2020-02-03 20:53:02 DEU
76237 13.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-07-2020.csv Germany NaN NaN NaN 2020-02-07 16:33:03 DEU
76238 13.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-08-2020.csv Germany NaN NaN NaN 2020-02-07 16:33:03 DEU
76239 14.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-09-2020.csv Germany NaN NaN NaN 2020-02-09 06:13:11 DEU
76240 14.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-10-2020.csv Germany NaN NaN NaN 2020-02-09 06:13:11 DEU
76241 16.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-11-2020.csv Germany NaN NaN NaN 2020-02-11 19:33:03 DEU
76242 16.0 0.0 NaN 0.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-12-2020.csv Germany NaN NaN NaN 2020-02-11 19:33:03 DEU
76243 16.0 0.0 NaN 1.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-13-2020.csv Germany NaN NaN NaN 2020-02-13 15:13:11 DEU
76244 16.0 0.0 NaN 1.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-14-2020.csv Germany NaN NaN NaN 2020-02-13 15:13:11 DEU
76245 16.0 0.0 NaN 1.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-15-2020.csv Germany NaN NaN NaN 2020-02-13 15:13:11 DEU
76246 16.0 0.0 NaN 1.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-16-2020.csv Germany NaN NaN NaN 2020-02-13 15:13:11 DEU
76247 16.0 0.0 NaN 1.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-17-2020.csv Germany NaN NaN NaN 2020-02-13 15:13:11 DEU
76248 16.0 0.0 NaN 12.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-18-2020.csv Germany NaN NaN NaN 2020-02-18 17:03:03 DEU
76249 16.0 0.0 NaN 12.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-19-2020.csv Germany NaN NaN NaN 2020-02-18 17:03:03 DEU
76250 16.0 0.0 NaN 12.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-20-2020.csv Germany NaN NaN NaN 2020-02-18 17:03:03 DEU
76251 16.0 0.0 NaN 14.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-21-2020.csv Germany NaN NaN NaN 2020-02-21 23:03:13 DEU
76252 16.0 0.0 NaN 14.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-22-2020.csv Germany NaN NaN NaN 2020-02-21 23:03:13 DEU
76253 16.0 0.0 NaN 14.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-23-2020.csv Germany NaN NaN NaN 2020-02-21 23:03:13 DEU
76254 16.0 0.0 NaN 14.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-24-2020.csv Germany NaN NaN NaN 2020-02-21 23:03:13 DEU
76255 17.0 0.0 NaN 14.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-25-2020.csv Germany NaN NaN NaN 2020-02-25 21:33:02 DEU
76256 27.0 0.0 NaN 15.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-26-2020.csv Germany NaN NaN NaN 2020-02-26 23:43:03 DEU
76257 46.0 0.0 NaN 16.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-27-2020.csv Germany NaN NaN NaN 2020-02-27 23:13:04 DEU
76258 48.0 0.0 NaN 16.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-28-2020.csv Germany NaN NaN NaN 2020-02-28 00:13:18 DEU
76259 79.0 0.0 NaN 16.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/02-29-2020.csv Germany NaN NaN NaN 2020-02-29 14:43:03 DEU
76260 130.0 0.0 NaN 16.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-01-2020.csv Germany NaN 9.000000 51.000000 2020-03-01 23:23:02 DEU
76261 159.0 0.0 NaN 16.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-02-2020.csv Germany NaN 9.000000 51.000000 2020-03-02 20:33:02 DEU
76262 196.0 0.0 NaN 16.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-03-2020.csv Germany NaN 9.000000 51.000000 2020-03-03 20:03:06 DEU
76263 262.0 0.0 NaN 16.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-04-2020.csv Germany NaN 9.000000 51.000000 2020-03-04 19:33:03 DEU
76264 482.0 0.0 NaN 16.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-05-2020.csv Germany NaN 9.000000 51.000000 2020-03-05 17:43:03 DEU
76265 670.0 0.0 NaN 17.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-06-2020.csv Germany NaN 9.000000 51.000000 2020-03-06 17:53:03 DEU
76266 799.0 0.0 NaN 18.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-07-2020.csv Germany NaN 9.000000 51.000000 2020-03-07 17:43:05 DEU
76267 1040.0 0.0 NaN 18.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-08-2020.csv Germany NaN 9.000000 51.000000 2020-03-08 21:03:03 DEU
76268 1176.0 2.0 NaN 18.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-09-2020.csv Germany NaN 9.000000 51.000000 2020-03-09 18:13:11 DEU
76269 1457.0 2.0 NaN 18.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-10-2020.csv Germany NaN 9.000000 51.000000 2020-03-10 18:53:02 DEU
76270 1908.0 3.0 NaN 25.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-11-2020.csv Germany NaN 9.000000 51.000000 2020-03-11 19:13:17 DEU
76271 2078.0 3.0 NaN 25.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-12-2020.csv Germany NaN 9.000000 51.000000 2020-03-12 09:53:06 DEU
76272 3675.0 7.0 NaN 46.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-13-2020.csv Germany NaN 10.451500 51.165700 2020-03-11 20:00:00 DEU
76273 4585.0 9.0 NaN 46.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-14-2020.csv Germany NaN 10.451500 51.165700 2020-03-14 22:13:11 DEU
76274 5795.0 11.0 NaN 46.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-15-2020.csv Germany NaN 10.451500 51.165700 2020-03-15 18:20:18 DEU
76275 7272.0 17.0 NaN 67.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-16-2020.csv Germany NaN 10.451500 51.165700 2020-03-16 20:13:11 DEU
76276 9257.0 24.0 NaN 67.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-17-2020.csv Germany NaN 10.451500 51.165700 2020-03-17 18:53:02 DEU
76277 12327.0 28.0 NaN 105.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-18-2020.csv Germany NaN 10.451500 51.165700 2020-03-18 19:33:02 DEU
76278 15320.0 44.0 NaN 113.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-19-2020.csv Germany NaN 10.451500 51.165700 2020-03-19 20:13:08 DEU
76279 19848.0 67.0 NaN 180.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-20-2020.csv Germany NaN 10.451500 51.165700 2020-03-20 20:13:15 DEU
76280 22213.0 84.0 NaN 233.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-21-2020.csv Germany NaN 10.451500 51.165700 2020-03-21 20:43:02 DEU
76281 24873.0 94.0 NaN 266.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-22-2020.csv Germany NaN 10.451526 51.165691 2020-03-22 23:45:00 DEU
76282 29056.0 123.0 NaN 453.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-23-2020.csv Germany NaN 10.451526 51.165691 2020-03-23 23:19:21 DEU
76283 32986.0 157.0 NaN 3243.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-24-2020.csv Germany NaN 10.451526 51.165691 2020-03-24 23:37:15 DEU
76284 37323.0 206.0 NaN 3547.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-25-2020.csv Germany NaN 10.451526 51.165691 2020-03-25 23:33:04 DEU
76285 43938.0 267.0 NaN 5673.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-26-2020.csv Germany NaN 10.451526 51.165691 2020-03-26 23:48:18 DEU
76286 50871.0 342.0 NaN 6658.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-27-2020.csv Germany NaN 10.451526 51.165691 2020-03-27 23:23:03 DEU
76287 57695.0 433.0 NaN 8481.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-28-2020.csv Germany NaN 10.451526 51.165691 2020-03-28 23:05:00 DEU
76288 62095.0 533.0 NaN 9211.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-29-2020.csv Germany NaN 10.451526 51.165691 2020-03-29 23:08:00 DEU
76289 66885.0 645.0 NaN 13500.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-30-2020.csv Germany NaN 10.451526 51.165691 2020-03-30 22:52:00 DEU
76290 71808.0 775.0 NaN 16100.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/03-31-2020.csv Germany NaN 10.451526 51.165691 2020-03-31 23:43:43 DEU
76291 77872.0 920.0 NaN 18700.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-01-2020.csv Germany NaN 10.451526 51.165691 2020-04-01 21:58:34 DEU
76292 84794.0 1107.0 NaN 22440.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-02-2020.csv Germany NaN 10.451526 51.165691 2020-04-02 23:25:00 DEU
76293 91159.0 1275.0 NaN 24575.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-03-2020.csv Germany NaN 10.451526 51.165691 2020-04-03 22:46:20 DEU
76294 96092.0 1444.0 NaN 26400.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-04-2020.csv Germany NaN 10.451526 51.165691 2020-04-04 23:34:00 DEU
76295 100123.0 1584.0 NaN 28700.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-05-2020.csv Germany NaN 10.451526 51.165691 2020-04-05 23:06:26 DEU
76296 103374.0 1810.0 NaN 28700.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-06-2020.csv Germany NaN 10.451526 51.165691 2020-04-06 23:21:00 DEU
76297 107663.0 2016.0 NaN 36081.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-07-2020.csv Germany NaN 10.451526 51.165691 2020-04-07 23:04:29 DEU
76298 113296.0 2349.0 NaN 46300.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-08-2020.csv Germany NaN 10.451526 51.165691 2020-04-08 22:51:39 DEU
76299 118181.0 2607.0 NaN 52407.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-09-2020.csv Germany NaN 10.451526 51.165691 2020-04-09 23:02:19 DEU
76300 122171.0 2767.0 NaN 53913.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-10-2020.csv Germany NaN 10.451526 51.165691 2020-04-10 22:53:48 DEU
76301 124908.0 2736.0 NaN 57400.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-11-2020.csv Germany NaN 10.451526 51.165691 2020-04-11 22:45:13 DEU
76302 127854.0 3022.0 NaN 60300.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-12-2020.csv Germany NaN 10.451526 51.165691 2020-04-12 23:17:00 DEU
76303 130072.0 3194.0 NaN 64300.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-13-2020.csv Germany NaN 10.451526 51.165691 2020-04-13 23:07:34 DEU
76304 131359.0 3294.0 NaN 68200.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-14-2020.csv Germany NaN 10.451526 51.165691 2020-04-14 23:33:12 DEU
76305 134753.0 3804.0 NaN 72600.0 us_jhu_data/csse_covid_19_data/csse_covid_19_daily_reports/04-15-2020.csv Germany NaN 10.451526 51.165691 2020-04-15 22:56:32 DEU
Looking at the German rows above, updated_on is clearly unreliable as a report day: the same timestamp (e.g. 2020-02-03 20:53:02) repeats across several daily files whenever nothing changed. The file name, on the other hand, encodes exactly one report per day, so let’s derive a report_date column from it instead.

d_data['report_date'] = \
     pd.to_datetime( 
         d_data['source_file'].astype('str')                            # convert the values in the column to string
                              .str.rsplit(os.path.sep, n=1, expand=True)[1]  # extract the file name from the file path
                              .str.replace('.csv', '', regex=False)     # remove the csv extension (regex=False: match literally)
       , dayfirst=False)                                                # parse the MM-DD-YYYY file names as dates
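As an aside, an equivalent and arguably more readable way to get the same report_date uses pathlib (an alternative sketch, not the notebook’s original approach):

from pathlib import Path

# Path(...).stem strips the directory and the '.csv' extension in one go,
# e.g. '.../04-15-2020.csv' -> '04-15-2020'
d_data['report_date'] = pd.to_datetime(
    d_data['source_file'].map(lambda p: Path(p).stem), dayfirst=False)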

5. Visualize the data on a map

To take the data we’ve prepared and put it onto the map we need to sum the number of confirmed cases by the date of the report and the country code.

d_data.groupby(by=['report_date', 'ISO_3'])\
      .agg({'Confirmed': 'sum'})
Confirmed
report_date ISO_3
2020-01-22 CHN 548.0
HKG 0.0
JPN 2.0
PRK 1.0
THA 2.0
... ... ...
2020-04-15 VNM 267.0
YEM 1.0
ZAF 2506.0
ZMB 48.0
ZWE 23.0

7866 rows × 1 columns


viz = \
d_data.groupby(by=['report_date', 'ISO_3', 'Country'])\
      .agg({'Confirmed': 'sum'})\
      .reset_index()

We also need to convert the report date back to a string so that Plotly can use it as the animation frame label.

viz['report_date'] = viz['report_date'].dt.strftime( '%Y-%m-%d')
from urllib.request import urlopen
import json

# Download a GeoJSON of US counties keyed by FIPS code. We won't need it for the
# ISO-3 world map below, but it comes in handy for county-level maps of the US data.
with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)
subset = viz[viz['report_date'] == '2020-04-14'] # a single day to start with; pass viz itself to animate every day

# Note: we don't pass geojson=counties here. With locationmode='ISO-3' plotly uses its
# built-in country outlines; the county GeoJSON's FIPS ids would never match ISO codes.
px.choropleth(subset
             , locations='ISO_3'
             , locationmode = 'ISO-3'
             , animation_frame='report_date'
             , animation_group='ISO_3'
             , color='Confirmed'
             , color_continuous_scale= 'Greens').write_html('choropleth_1.html', full_html=False, include_plotlyjs='cdn')
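write_html saves the finished map as choropleth_1.html alongside the notebook; include_plotlyjs='cdn' keeps the file small by loading the plotly.js library from Plotly’s CDN instead of embedding it. Open the file in a browser and, once the full viz frame is passed in, use the play button to step through the report dates.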