Using Dask/Dask Dashboard to play with Covid-19 vaccination data just a little bit

(conda202111base) [hadoop@tomato1 ~]$ conda list | grep dask
dask 2021.10.0 pyhd3eb1b0_0
dask-core 2021.10.0 pyhd3eb1b0_0
[root@fig1 ~]# /shared/Anaconda202111/envs/conda202111base/bin/dask-scheduler
distributed.scheduler - INFO - -----------------------------------------------
distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://9.21.104.146:8786
distributed.scheduler - INFO - dashboard at: :8787
[root@kale1 ~]# /shared/Anaconda202111/envs/conda202111base/bin/dask-worker tcp://9.21.104.146:8786
[root@onion1 ~]# /shared/Anaconda202111/envs/conda202111base/bin/dask-worker tcp://9.21.104.146:8786
(conda202111base) [hadoop@beet1 ~]$ /shared/Anaconda202111/envs/conda202111base/bin/jupyter-enterprisegateway --ip=0.0.0.0 --log-level=DEBUG > jeg.log 2>&1 &
(conda202111base) [hadoop@tomato1 ~]$ jupyter notebook --gateway-url=http://beet1:8888 > /home/hadoop/ipython.log 2>&1 &
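Inside the notebook, a Dask Client has to point at the scheduler started above so that the computation (and the dashboard on port 8787) runs on the cluster. A minimal sketch, assuming the scheduler address shown in the log above:

from dask.distributed import Client

# Connect to the scheduler started earlier (address taken from the scheduler log)
client = Client('tcp://9.21.104.146:8786')

# Print the dashboard URL (port 8787 on the scheduler host) for easy access
print(client.dashboard_link)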
%%time
import dask.dataframe as dd

# Lazily read the county-level vaccination CSV; dtypes are pinned so that
# columns with missing/mixed values parse consistently across partitions
df = dd.read_csv('file:///shared/data/COVID-19_Vaccinations_in_the_United_States_County.csv',
                 dtype={'Administered_Dose1_Recip': 'float64',
                        'Administered_Dose1_Recip_12Plus': 'float64',
                        'Administered_Dose1_Recip_18Plus': 'float64',
                        'Administered_Dose1_Recip_65Plus': 'float64',
                        'FIPS': 'object',
                        'Series_Complete_12Plus': 'float64'})

# Build the lazy task graph: group by state, sum completed series, take the top 5
group_by = df.groupby('Recip_State')
group_by_sum = group_by.Series_Complete_Yes.sum()
group_by_sum_nlargest = group_by_sum.nlargest(5)
%%time
# compute() triggers actual execution on the Dask workers
group_by_sum_nlargest.compute()
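The number of “read-csv” tasks that shows up in the dashboard equals the number of partitions Dask split the CSV into, which can be checked before looking at the Task Stream. The df_more_parts name and the 32MB blocksize below are purely illustrative, not values used in this run:

# Number of partitions == number of read-csv tasks in the Task Stream
print(df.npartitions)

# blocksize controls how the CSV is split; a smaller value yields more,
# smaller partitions (illustrative only; the run above used the default)
df_more_parts = dd.read_csv(
    'file:///shared/data/COVID-19_Vaccinations_in_the_United_States_County.csv',
    dtype={'FIPS': 'object'},
    blocksize='32MB')
print(df_more_parts.npartitions)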
Combining the tool-tips of the individual tasks in the dashboard, here is what the Task Stream shows:
  1. In the “Task Stream” section, each colored rectangle/block is one Dask task, and each horizontal line represents a core/thread executing tasks over time. Red usually marks a data-transfer task that moves data between Dask workers. (A sketch for capturing this view to an HTML report follows the list.)
  2. There are three “read-csv” tasks reading the data file in parallel, so three data chunks are held in memory, one per task.
  3. After each data chunk is read, a “series-groupby-sum-chunk” task groups that chunk by state and sums the counts.
  4. After all three “series-groupby-sum-chunk” tasks finish, a red task called “transfer-series-groupby-sum-agg” transfers the “groupby-sum” results of two of the chunks to the worker that holds the third chunk’s result.
  5. On that worker, another task called “series-groupby-sum-agg” aggregates the sums across all three chunks.
  6. Next, a “series-nlargest-chunk” task picks the n largest values from the aggregated chunk. Since there is only one chunk at this point, the following “series-nlargest-agg” task simply returns the final results to the Dask scheduler.
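If the dashboard cannot be watched live while the computation runs, the same Task Stream (and the other diagnostics) can be saved to a standalone HTML file with the performance_report context manager from dask.distributed; the filename below is arbitrary:

from dask.distributed import performance_report

# Everything computed inside this block is recorded, including the Task Stream,
# and written to a self-contained HTML report that can be opened later
with performance_report(filename='vaccination-groupby-report.html'):
    group_by_sum_nlargest.compute()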
