Weekly Digest for Data Science and AI: Python and R (Volume 17)

Hello everyone! Happy to have you back, and welcome to Volume 17. This week, we have two great Python packages that have been trending recently, raster-vision and trfl, plus two great packages from the R world, precisely and DataExplorer. To receive this digest directly in your inbox each week, sign up here.

Favio Vázquez
Ciencia y Datos
7 min read · Oct 26, 2018


Table of contents:

Python: raster-vision, trfl

R: precisely, DataExplorer

raster-vision — An open source framework for deep learning on satellite and aerial imagery.

https://github.com/azavea/raster-vision

This framework blew me away.

It’s an amazing tool for building computer vision models on satellite, aerial, and other large imagery sets (including oblique drone imagery).

As the creators state:

[Rastervision] … allows for engineers to quickly and repeatably configure experiments that go through core components of a machine learning workflow: analyzing training data, creating training chips, training models, creating predictions, evaluating models, and bundling the model files and configuration for easy deployment.

Raster Vision workflows begin when you have a set of images and training data, optionally with Areas of Interest (AOIs) that describe where the images are labeled. Raster Vision workflows end with a packaged model and configuration that allows you to easily utilize models in various deployment situations. Inside the Raster Vision workflow, there’s the process of running multiple experiments to find the best model or models to deploy.

Running experiments means executing workflows that perform the commands listed above: analyzing training data, creating training chips, training models, creating predictions, evaluating models, and bundling the results.

The package supports chip classification, object detection, and semantic segmentation, among other tasks.
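To make the "creating training chips" step a bit more concrete, here is a minimal sketch of how a large image can be tiled into fixed-size windows. It is plain Python and does not use Raster Vision's API; the function name `chip_windows`, the image dimensions, and the chip size are all hypothetical choices for illustration.

```python
from typing import Iterator, Tuple

# (row_start, col_start, row_end, col_end), with exclusive end coordinates.
Window = Tuple[int, int, int, int]


def chip_windows(height: int, width: int, chip_size: int, stride: int) -> Iterator[Window]:
    """Yield square windows that tile a height x width image.

    Windows that would run past the bottom or right edge are skipped,
    so every chip covers exactly chip_size x chip_size pixels.
    """
    for row in range(0, height - chip_size + 1, stride):
        for col in range(0, width - chip_size + 1, stride):
            yield (row, col, row + chip_size, col + chip_size)


# A hypothetical 1000 x 1500 pixel scene cut into non-overlapping 300-pixel chips.
windows = list(chip_windows(height=1000, width=1500, chip_size=300, stride=300))
print(len(windows))   # 15 (3 rows of 5 chips)
print(windows[0])     # (0, 0, 300, 300)
print(windows[-1])    # (600, 1200, 900, 1500)
```

Setting `stride` smaller than `chip_size` would produce overlapping chips, which is one common way to get more training samples out of the same scene.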

Installation

pip install rastervision

Usage

Here you can find examples and more information about the package:

Check out the original repo too: https://github.com/azavea/raster-vision

trfl — TensorFlow Reinforcement Learning

TRFL (pronounced “truffle”) is a library built on top of TensorFlow that exposes several useful building blocks for implementing Reinforcement Learning agents.

If you want to know more about Reinforcement Learning, Mohammad Ashraf has an amazing series on the topic:

And the great Siraj Raval is starting a course on YouTube about it:

Installation

TRFL can be installed with pip directly from GitHub, using the following command:

pip install git+git://github.com/deepmind/trfl.git

TRFL works with both the CPU and GPU versions of TensorFlow. To allow for that, it does not list TensorFlow as a requirement, so you need to install TensorFlow and TensorFlow Probability separately if you haven’t already done so.

Usage

import tensorflow as tf
import trfl
# Q-values for the previous and next timesteps, shape [batch_size, num_actions].
q_tm1 = tf.get_variable(
    "q_tm1", initializer=[[1., 1., 0.], [1., 2., 0.]], dtype=tf.float32)
q_t = tf.get_variable(
    "q_t", initializer=[[0., 1., 0.], [1., 2., 0.]], dtype=tf.float32)
# Action indices, discounts and rewards, shape [batch_size].
a_tm1 = tf.constant([0, 1], dtype=tf.int32)
r_t = tf.constant([1, 1], dtype=tf.float32)
pcont_t = tf.constant([0, 1], dtype=tf.float32) # the discount factor
# Q-learning loss, and auxiliary data.
loss, q_learning = trfl.qlearning(q_tm1, a_tm1, r_t, pcont_t, q_t)

loss is the tensor representing the loss. For Q-learning, it is half the squared difference between the predicted Q-values and the TD targets, with shape [batch_size]. Extra information is in the q_learning namedtuple, including q_learning.td_error and q_learning.target.
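If you'd like to see what those numbers should be without running TensorFlow, here is a small sketch that works through the same inputs by hand in plain Python. It assumes the standard Q-learning target (reward plus discount times the maximum next-step Q-value); it is an illustration of the arithmetic, not TRFL's actual implementation.

```python
# Same inputs as the TRFL snippet, written as plain Python lists.
q_tm1 = [[1.0, 1.0, 0.0], [1.0, 2.0, 0.0]]  # Q-values at the previous timestep
q_t = [[0.0, 1.0, 0.0], [1.0, 2.0, 0.0]]    # Q-values at the next timestep
a_tm1 = [0, 1]                              # actions taken at the previous timestep
r_t = [1.0, 1.0]                            # rewards
pcont_t = [0.0, 1.0]                        # discounts

targets, td_errors, losses = [], [], []
for q_prev, q_next, action, reward, discount in zip(q_tm1, q_t, a_tm1, r_t, pcont_t):
    target = reward + discount * max(q_next)   # TD target
    td_error = target - q_prev[action]         # target minus predicted Q-value
    targets.append(target)
    td_errors.append(td_error)
    losses.append(0.5 * td_error ** 2)         # half the squared difference

print(targets)    # [1.0, 3.0]
print(td_errors)  # [0.0, 1.0]
print(losses)     # [0.0, 0.5]
```

The first batch element has a discount of 0, so its target is just the reward and the prediction is already exact. The second element bootstraps from the best next-step Q-value, which leaves a TD error of 1.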

You can find much more information about the package here:


precisely — An R package to estimate sample size based on precision rather than power


precisely is a study planning tool to calculate sample size based on precision rather than power. Power calculations are focused on whether or not an estimate will be statistically significant; calculations of precision are based on the same principles as power calculation but turn the focus to the width of the confidence interval.

precisely has functions for studies using risk differences, risk ratios, rate differences, rate ratios, and odds ratios. The heart of these calculations is the desired precision.

Installation

You can install the development version of precisely with:

# install.packages("devtools")
devtools::install_github("malcolmbarrett/precisely")

Usage

Let’s say we want to calculate the sample size needed to estimate a 90% CI for a risk difference of .1 with an absolute width of .08. Here, the risk among the exposed is .4, the risk among the unexposed is .3, and there are three times as many unexposed participants.

library(tidyr)
library(dplyr)
library(purrr)
library(ggplot2)
library(precisely)
n_risk_difference(
  precision = .08,
  exposed = .4,
  unexposed = .3,
  group_ratio = 3,
  ci = .90
)
#> # A tibble: 1 x 8
#>   n_exposed n_unexposed n_total risk_difference precision exposed unexposed
#>       <dbl>       <dbl>   <dbl>           <dbl>     <dbl>   <dbl>     <dbl>
#> 1      524.       1573.   2097.             0.1      0.08     0.4       0.3
#> # ... with 1 more variable: group_ratio <dbl>

Rounding up, we need 525 exposed participants and 1,573 unexposed participants, for a total sample size of 2,098.
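The figures above can be reproduced by hand. The sketch below assumes a simple Wald confidence interval for the risk difference, whose width is 2·z·sqrt(p1(1−p1)/n1 + p0(1−p0)/n0) with n0 = group_ratio·n1, and solves for n1. The function name `n_for_risk_difference` is hypothetical, and this is not necessarily how precisely computes it internally, although it gives the same numbers for this example.

```python
import math
from statistics import NormalDist


def n_for_risk_difference(precision, exposed, unexposed, group_ratio, ci=0.95):
    """Sample sizes that give a Wald CI of the requested absolute width.

    precision   -- desired full width of the confidence interval
    exposed     -- risk among the exposed
    unexposed   -- risk among the unexposed
    group_ratio -- number of unexposed participants per exposed participant
    """
    z = NormalDist().inv_cdf(1 - (1 - ci) / 2)  # two-sided critical value
    variance_term = exposed * (1 - exposed) + unexposed * (1 - unexposed) / group_ratio
    n_exposed = 4 * z ** 2 * variance_term / precision ** 2
    n_unexposed = group_ratio * n_exposed
    return n_exposed, n_unexposed


n_exposed, n_unexposed = n_for_risk_difference(
    precision=0.08, exposed=0.4, unexposed=0.3, group_ratio=3, ci=0.90
)
print(math.ceil(n_exposed), math.ceil(n_unexposed))  # 525 1573
```

The unrounded values are roughly 524.2 and 1572.6, which is why the tibble shows 524 and 1573 while whole participants round up to 525 and 1,573.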

This package includes a Shiny app to help with calculations, which you can start with launch_precisely_app(). You can also find a live version at malcolmbarrett.shinyapps.io/precisely.

You can find much more information about the package in this vignette:

DataExplorer — Automate data exploration and treatment

https://github.com/boxuancui/DataExplorer

Exploratory Data Analysis (EDA) is the initial, and an important, phase of data analysis and predictive modeling. During this process, analysts and modelers take a first look at the data, generate relevant hypotheses, and decide on next steps. However, the EDA process can be a hassle at times. This R package aims to automate most of the data handling and visualization, so that users can focus on studying the data and extracting insights.

Installation

The package can be installed directly from CRAN.

install.packages("DataExplorer")

Usage

With the package you can create reports, plots and tables like this:

library(DataExplorer)
library(ggplot2)  # provides the diamonds dataset
## Plot basic description for airquality data
plot_intro(airquality)
## View missing value distribution for airquality data
plot_missing(airquality)
## Frequency distribution of all discrete variables
plot_bar(diamonds)
## `price` distribution across all discrete variables
plot_bar(diamonds, with = "price")
## View histogram of all continuous variables
plot_histogram(diamonds)
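To give a feel for the kind of per-column summary a missing-value plot is built on, here is a tiny Python sketch that computes the share of missing entries in each column. The table, its column names, and the `missing_share` helper are all made up for illustration; this is not DataExplorer's code and the values are not taken from airquality.

```python
# Hypothetical toy table: column name -> values, with None marking a missing entry.
table = {
    "ozone": [41, 36, None, 18, None, 28],
    "wind": [7.4, 8.0, 12.6, 11.5, 14.3, 14.9],
    "temp": [67, 72, 74, None, 56, 66],
}


def missing_share(columns):
    """Return {column: fraction of entries that are None}, most missing first."""
    shares = {
        name: sum(value is None for value in values) / len(values)
        for name, values in columns.items()
    }
    return dict(sorted(shares.items(), key=lambda item: item[1], reverse=True))


profile = missing_share(table)
for name, share in profile.items():
    print(f"{name}: {share:.1%}")
# ozone: 33.3%
# temp: 16.7%
# wind: 0.0%
```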

You can find much more like this on the official webpage of the package:

And in this vignette:


Favio Vázquez
Ciencia y Datos

Data scientist, physicist and computer engineer. Love sharing ideas, thoughts and contributing to Open Source in Machine Learning and Deep Learning ;).