Data Science — Moving beyond Introduction

Comparing R, SAS and Python

Dec 17, 2019

SAS is an enterprise product: a very stable platform that is supported by its vendor and used by large banks and similar organisations. It has about 20% market share.
R is open source, built by statisticians for statisticians. It has about 50% market share and is growing. R is also taught in universities and actively used in the research community.
Python is a full-featured, general-purpose programming language that has been widely adopted by the Data Engineering community. Most advances in machine learning and neural networks are released as Python packages first; equivalent packages typically become available in other languages within 1–2 months. It has about 30% market share and is growing.

Using Jupyter as the programming environment

Jupyter is a web-based IDE for Julia, Python and R, among other languages.

The key difference between Jupyter and traditional IDEs is that it lets documentation and code reside side by side. This matters in Data Science, where there is no predefined sequence of steps for creating a model: the output of one step needs to be analysed or visualized, and the next piece of code is written based on those observations. Documenting the observations is at least as important as the code. Jupyter lets us capture both elegantly and in context.

Each logical group of code (a code block) goes into a single cell. As a rule of thumb, if a set of lines needs to work together to produce a single (intermediate but valuable) output, those lines belong in one cell.

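A notebook is attached to one kernel at a time, so by default all of its cells run in the same language, such as Python or R. The kernel is chosen when the notebook is created and can be changed later from the Kernel menu. Individual cells can still run another language through cell magics, for example %%R in a Python notebook when the rpy2 extension is installed.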

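Version control: Jupyter saves work automatically and keeps a checkpoint that we can revert to. This is a single restore point rather than a full history, so a tool such as Git is still needed for real versioning.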

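Kernel: The runtime environment that executes the code is called the kernel. If a cell or notebook hangs, for example because of an infinite loop, we can interrupt or restart the kernel. Restarting clears all variables held in memory, so earlier cells need to be run again.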

Export: Notebooks can be exported to Python (.py) files for execution in a plain Python runtime; the documentation (Markdown) cells become comments in that case. Notebooks can also be exported as LaTeX or PDF files for printing.

What Python concepts are important for Data Science

Syntax
Variable types
Packages and package managers
Data structures: lists, tuples, dictionaries, sets
Mathematical operators
Logical operators
Loops and conditional statements
Functions: predefined and user-defined (normal and lambda functions)
Classes
Debugging options
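
As a quick illustration of several items in the list above, here is a minimal sketch. All names and values (temperatures, to_fahrenheit and so on) are invented for the example and are not from any particular dataset.

```python
from statistics import mean  # a predefined function from a standard-library package

# Data structures (all values are made up for illustration)
temperatures = [21.5, 23.0, 19.8, 23.0]          # list: ordered, mutable
point = (3, 4)                                   # tuple: ordered, immutable
city_by_code = {"BLR": "Bengaluru", "DEL": "Delhi"}  # dictionary: key -> value
unique_temps = set(temperatures)                 # set: duplicates removed


def to_fahrenheit(celsius):
    """User-defined function: convert Celsius to Fahrenheit."""
    return celsius * 9 / 5 + 32


# Lambda: a small anonymous function, used here as a sort key
hottest_first = sorted(unique_temps, key=lambda t: -t)

# Loop plus conditional: convert only the readings above the average
average = mean(temperatures)
above_average = []
for t in temperatures:
    if t > average:
        above_average.append(to_fahrenheit(t))

print(hottest_first)
print(above_average)
```

The set drops the repeated reading, while the loop still sees both copies because it iterates over the original list.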

Feature Engineering

Feature engineering, also known as feature creation, is the process of constructing new features from existing data to train a machine learning model. We can group the operations of feature creation into two categories: transformations and aggregations.
1. Transformation converts an existing attribute into a more usable form. For example, a date can be converted to MONTH to capture seasonal patterns in the data. Similarly, an Income attribute can be transformed to LOG(INCOME) to compress its typically skewed range into a more even scale.
2. Aggregation works across a one-to-many relationship: the observations on the "many" side are grouped by the "one" side, and statistics are calculated per group. For example, a customer's many transactions can be summarised as a count, mean and maximum amount per customer.
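
The two categories can be sketched in a few lines of plain Python. The transactions and incomes below are hypothetical, as are the feature names; this is only one possible way to lay the steps out.

```python
import math
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical data: one row per transaction; one customer has many transactions
transactions = [
    {"customer": "A", "date": date(2019, 1, 15), "amount": 120.0},
    {"customer": "A", "date": date(2019, 7, 3), "amount": 80.0},
    {"customer": "B", "date": date(2019, 12, 24), "amount": 300.0},
]
incomes = {"A": 40_000, "B": 250_000}  # hypothetical annual income per customer

# Transformation: derive a new attribute from an existing one, row by row
for row in transactions:
    row["month"] = row["date"].month  # date -> MONTH, to capture seasonality
log_income = {c: math.log(v) for c, v in incomes.items()}  # INCOME -> LOG(INCOME)

# Aggregation: group the "many" side by customer, then compute statistics
amounts_by_customer = defaultdict(list)
for row in transactions:
    amounts_by_customer[row["customer"]].append(row["amount"])

customer_features = {
    customer: {
        "n_transactions": len(amounts),
        "mean_amount": mean(amounts),
        "max_amount": max(amounts),
        "log_income": log_income[customer],
    }
    for customer, amounts in amounts_by_customer.items()
}

print(customer_features)
```

In practice this kind of grouping is usually done with a dataframe library; in pandas, for instance, groupby followed by agg covers the aggregation step.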

The curse of dimensionality

When the number of features (dimensions) increases, the volume of the space increases so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality.

Also, organizing and searching data often relies on detecting areas where objects form groups with similar properties; in high dimensional data, however, all objects appear to be sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient.
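
A small simulation can make this effect visible. The sketch below assumes points drawn uniformly at random from a unit hypercube, and the helper name distance_contrast is invented for this example. It compares the distance from one query point to its nearest and farthest neighbours; as the number of dimensions grows, the two distances become almost the same, so "near" and "far" lose their meaning.

```python
import math
import random


def distance_contrast(n_dims, n_points=200, seed=0):
    """Return the nearest/farthest distance ratio from one random query point.

    Points are drawn uniformly from the unit hypercube [0, 1]^n_dims.
    A ratio near 0 means some neighbours are much closer than others;
    a ratio near 1 means every point is almost equally far away.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    distances = [
        math.dist(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    return min(distances) / max(distances)


contrast = {d: distance_contrast(d) for d in (2, 10, 100, 1000)}
for d, ratio in contrast.items():
    print(f"{d:>5} dimensions: nearest/farthest = {ratio:.2f}")
```

The number of points is held fixed at 200 on purpose: the same amount of data that separates neighbours clearly in 2 dimensions no longer does so in 1000.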

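There are various dimensionality reduction techniques (such as PCA) that reduce the number of dimensions in the data. PCA, for instance, replaces the original features with a smaller set of combined features (principal components) that retain most of the variance, so the analysis works with the most informative directions rather than every raw feature.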
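
As a rough sketch of how PCA works, the example below computes it with NumPy's singular value decomposition on synthetic data. The data is generated for illustration only (5 observed features driven by 2 hidden factors), and the choice of k = 2 components is an assumption that happens to match how the data was built. Libraries such as scikit-learn provide a ready-made PCA class that would normally be used instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 5 features, but only 2 underlying factors
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

# PCA via singular value decomposition of the centred data
X_centred = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)

# Share of the total variance captured by each principal component
explained_variance_ratio = S**2 / np.sum(S**2)

# Keep the first k principal components (k = 2 is an assumption for this data)
k = 2
X_reduced = X_centred @ Vt[:k].T  # shape (200, 2)

print(explained_variance_ratio.round(3))
print(X_reduced.shape)
```

Note that the reduced columns are combinations of all five original features, not a subset of them.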