Data Science newsletter 2017–11

Published in Compendium, Jun 11, 2018

The Data Science community group in Computas works on monthly newsletters that we will publish continuously on this blog. This post presents the newsletter for November 2017, with the content divided into three sections: «Getting started», «Beginner Tutorials», and «Advanced». We hope you enjoy it!

Getting started

This section includes links to articles that give an overview of machine learning. No code, no math, just plain English.

Best-Ever Algorithm Found for Huge Streams of Data

“We developed a new algorithm that is simultaneously the best” on every performance dimension, said Jelani Nelson, a computer scientist at Harvard University and a co-author of the work with Kasper Green Larsen of Aarhus University in Denmark, Huy Nguyen of Northeastern University and Mikkel Thorup of the University of Copenhagen.

This best-in-class streaming algorithm works by remembering just enough of what it’s seen to tell you what it’s seen most frequently. It suggests that compromises that seemed intrinsic to the analysis of streaming data are not actually necessary. It also points the way forward to a new era of strategic forgetting.
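To make "remembering just enough" concrete, here is a minimal sketch of the classic Misra-Gries frequent-items summary. It is not the algorithm described in the article, only a much older and simpler technique for the same problem, and the function name and example stream are made up for illustration. It keeps at most k - 1 counters no matter how long the stream is, and any item that makes up more than a 1/k share of the stream is guaranteed to survive among the candidates.

```python
from typing import Dict, Hashable, Iterable


def misra_gries(stream: Iterable[Hashable], k: int) -> Dict[Hashable, int]:
    """Return candidate frequent items using at most k - 1 counters.

    Any item occurring more than n / k times in a stream of length n
    is guaranteed to be among the returned keys. The stored counts are
    lower bounds on the true frequencies, not exact values.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    counters: Dict[Hashable, int] = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: "forget" a little about everything.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Illustrative stream: "a" appears 5 times out of 10.
stream = ["a", "b", "a", "c", "a", "d", "a", "b", "a", "e"]
candidates = misra_gries(stream, k=3)
print(candidates)
```

With k=3 the summary never holds more than two counters, and "a" (more than a third of the stream) is among the candidates. The counts are underestimates, so a second pass over the data is needed if exact frequencies matter.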

AI Experts Want to End ‘Black Box’ Algorithms in Government

Public agencies responsible for areas such as criminal justice, health, and welfare increasingly use scoring systems and software to steer or make decisions on life-changing events like granting bail, sentencing, enforcement, and prioritizing services. The report from AI Now, a research institute at NYU that studies the social implications of artificial intelligence, says too many of those systems are opaque to the citizens they hold power over.

The AI Now report calls for agencies to refrain from what it calls “black box” systems opaque to outside scrutiny. Kate Crawford, a researcher at Microsoft and cofounder of AI Now, says citizens should be able to know how systems making decisions about them operate and have been tested or validated. Such systems are expected to get more complex as technologies such as machine learning used by tech companies become more widely available.

You can read the full report here: AI_Now_Institute_2017_Report_.pdf

Beginner Tutorials / best practices

This section includes links to tutorials and best practices. Some have code you can follow along with.

10 tips on using Jupyter Notebook

Jupyter Notebook (a.k.a. IPython Notebook) is a brilliant coding tool. It is ideal for doing reproducible research. Here is my list of 10 tips on structuring Jupyter notebooks, which I have worked out over time.

Advice for aspiring data scientists and other FAQs

Aspiring data scientists and other visitors to this site often repeat the same questions. This post is the definitive collection of my answers to such questions (which may evolve over time).

Advanced

This section includes links to resources that demand more effort, but the effort pays off.

Deep Reinforcement Learning Demystified (Episode 0)

The goal of this series of articles is to provide an introduction to “Deep Reinforcement Learning”, which is an important approach to building intelligent machines. I will cover the basics of reinforcement learning from the ground up and provide practical examples to illustrate how to implement various RL algorithms. No prior knowledge of reinforcement learning or deep learning is required to follow these articles, but any previous experience with machine learning will be useful. I will provide code examples in Python and will be using OpenAI gym for evaluating our algorithms.

DeepXplore: automated whitebox testing of deep learning systems

The state space of deep learning systems is vast. As we’ve seen with adversarial examples, that creates opportunity to deliberately craft inputs that fool a trained network. Forget adversarial examples for a moment though, what about the opportunity for good old-fashioned bugs to hide within that space? Experience with distributed systems tells us that there are likely to be plenty! And that raises an interesting question: how do you test a DNN? And by test here, I mean the kind of testing designed to deliberately find and expose bugs in corner cases etc., not the more common (in ML circles) definition of a test set that mostly focuses on demonstrating that the mainline cases work well.

How do you even know when you have done enough testing?

How do you know what the correct output should be for a given test case?
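The DeepXplore paper approaches these two questions with neuron coverage (what fraction of neurons has the test suite activated?) and differential testing (do several networks trained for the same task disagree on an input?). The sketch below illustrates both measures on two tiny hand-written networks. The ToyNet class, its weights, the threshold, and the inputs are all hypothetical, and the input-generation step that DeepXplore performs with gradients is left out entirely.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]
Input = Tuple[float, float]


class ToyNet:
    """A hypothetical one-hidden-layer ReLU classifier with fixed weights."""

    def __init__(self, hidden_w: Matrix, hidden_b: Sequence[float], out_w: Matrix):
        self.hidden_w = hidden_w
        self.hidden_b = hidden_b
        self.out_w = out_w

    def hidden(self, x: Sequence[float]) -> List[float]:
        # ReLU activations of the hidden layer.
        return [
            max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(self.hidden_w, self.hidden_b)
        ]

    def predict(self, x: Sequence[float]) -> int:
        h = self.hidden(x)
        scores = [sum(w * hi for w, hi in zip(row, h)) for row in self.out_w]
        return max(range(len(scores)), key=scores.__getitem__)


def neuron_coverage(net: ToyNet, inputs: Sequence[Input], threshold: float = 0.25) -> float:
    """Fraction of hidden neurons activated above threshold by any input."""
    fired = set()
    for x in inputs:
        for i, activation in enumerate(net.hidden(x)):
            if activation > threshold:
                fired.add(i)
    return len(fired) / len(net.hidden_b)


def disagreements(net_a: ToyNet, net_b: ToyNet, inputs: Sequence[Input]) -> List[Input]:
    """Inputs on which the two networks predict different labels."""
    return [x for x in inputs if net_a.predict(x) != net_b.predict(x)]


# Two networks that share a hidden layer but weigh the third neuron differently.
hidden_w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden_b = [0.0, 0.0, -1.5]
net_a = ToyNet(hidden_w, hidden_b, out_w=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
net_b = ToyNet(hidden_w, hidden_b, out_w=[[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])

small_suite = [(1.0, 0.2), (0.2, 1.0)]
larger_suite = small_suite + [(0.9, 1.0)]

coverage_small = neuron_coverage(net_a, small_suite)
coverage_larger = neuron_coverage(net_a, larger_suite)
suspicious = disagreements(net_a, net_b, larger_suite)
print(coverage_small, coverage_larger, suspicious)
```

In this toy setup the small suite never activates the third hidden neuron, so coverage is two thirds. Adding the input (0.9, 1.0) brings coverage to 1.0 and is also the only input on which the two networks disagree, which marks it as worth a closer look without anyone having to label it by hand.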

Embed, encode, attend, predict: The new deep learning formula for state-of-the-art NLP models

This is a post from last year, but still relevant.

Over the last six months, a powerful new neural network playbook has come together for Natural Language Processing. The new approach can be summarised as a simple four-step formula: embed, encode, attend, predict. This post explains the components of this new approach, and shows how they’re put together in two recent systems.
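As a rough illustration of how the four steps fit together, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the weights are random and untrained, the vocabulary is made up, and the encode step uses a simple neighbour average as a stand-in for the recurrent encoders that real systems use. Only the shape of the pipeline is the point: token IDs become vectors, vectors become a context-aware sentence matrix, the matrix is reduced to one vector, and that vector is turned into class probabilities.

```python
import math
import random
from typing import Dict, List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def embed(tokens: List[str], table: Dict[str, Vector]) -> List[Vector]:
    """Embed: look up one vector per token."""
    return [table[t] for t in tokens]


def encode(vectors: List[Vector]) -> List[Vector]:
    """Encode: make each row depend on its context.

    Stand-in for a recurrent encoder: average each vector with its neighbours.
    """
    encoded = []
    for i in range(len(vectors)):
        window = vectors[max(0, i - 1): i + 2]
        encoded.append([sum(col) / len(window) for col in zip(*window)])
    return encoded


def attend(matrix: List[Vector], query: Vector) -> Vector:
    """Attend: reduce the sentence matrix to a single weighted-sum vector."""
    scores = [sum(q * v for q, v in zip(query, row)) for row in matrix]
    weights = softmax(scores)
    return [sum(w * row[d] for w, row in zip(weights, matrix)) for d in range(len(query))]


def predict(vector: Vector, class_weights: List[Vector]) -> Vector:
    """Predict: map the summary vector to class probabilities."""
    logits = [sum(w * v for w, v in zip(row, vector)) for row in class_weights]
    return softmax(logits)


# Random, untrained parameters (hypothetical; a real model would learn these).
rng = random.Random(0)
dim = 4
vocab = ["the", "movie", "was", "great"]
table = {word: [rng.uniform(-1, 1) for _ in range(dim)] for word in vocab}
query = [rng.uniform(-1, 1) for _ in range(dim)]
class_weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(2)]

tokens = ["the", "movie", "was", "great"]
sentence_matrix = encode(embed(tokens, table))
summary = attend(sentence_matrix, query)
probs = predict(summary, class_weights)
print(probs)
```

Because the parameters are random, the printed probabilities carry no meaning. In a trained model the embedding table, the encoder, the attention query, and the classifier weights would all be learned jointly from labelled data.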