This article, part of a series, explains distribution functions and their use in Exploratory Data Analysis in data science. In the next article, I’ll walk through practical uses of the terms defined here in my sample project.

Exploratory Data Analysis combines many small tasks, such as data cleansing, data munging, and creating visualizations, to understand the value in data. By studying the distribution of the data, we try to extract that value. The distribution also matters when the data is ready for analysis and we have received another set of samples that requires analysis…
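As a rough illustration of comparing the distribution of one batch of samples against a newly received batch, here is a minimal sketch of an empirical distribution function in plain Python. The helper name `make_ecdf` and both sample lists are made up for this example and do not come from the sample project.

```python
from bisect import bisect_right

def make_ecdf(samples):
    """Build the empirical distribution function F(x) = share of samples <= x."""
    ordered = sorted(samples)
    n = len(ordered)

    def ecdf(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(ordered, x) / n

    return ecdf

# Made-up numbers for illustration only.
first_batch = [2, 4, 4, 5, 7, 9, 10, 12]
second_batch = [3, 5, 6, 6, 8, 11, 13, 15]

f1 = make_ecdf(first_batch)
f2 = make_ecdf(second_batch)

print(f1(5))  # 0.5
print(f2(5))  # 0.25
```

Half of the first batch is at or below 5, but only a quarter of the second batch is, which hints that the new samples are shifted toward larger values.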

One of my biggest hurdles learning linear algebra was getting that intuition. Eigenvalues and eigenvectors are one of those things that pop up in a million places because they’re so useful, but to recognize where they may be useful you need intuition as to what they’re doing.

The eigenvectors are the “axes” of the transformation represented by the matrix. Consider spinning a globe (the universe of vectors): every location faces a new direction, except the poles. The eigenvalue is the amount the eigenvector is scaled up or down when going through the matrix.
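To make the "only scaled, never turned" idea concrete, here is a small sketch in plain Python. The matrix `A`, the vectors, and the helper `mat_vec` are illustrative assumptions chosen so the arithmetic is easy to check by hand.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

# Illustrative symmetric matrix (an assumption, not taken from the article).
A = [[2, 1],
     [1, 2]]

# Eigenvectors of this particular matrix and their eigenvalues.
v1, lambda1 = [1, 1], 3
v2, lambda2 = [1, -1], 1

# An eigenvector is only scaled: A applied to v equals lambda times v.
assert mat_vec(A, v1) == [lambda1 * x for x in v1]
assert mat_vec(A, v2) == [lambda2 * x for x in v2]

# A vector that is not an eigenvector changes direction.
w = [1, 0]

print(mat_vec(A, v1))  # [3, 3]
print(mat_vec(A, v2))  # [1, -1]
print(mat_vec(A, w))   # [2, 1]
```

The first two vectors come out pointing the same way they went in, stretched by 3 and by 1. The third vector, `[1, 0]`, comes out as `[2, 1]`, which points in a new direction, so it is not an eigenvector of `A`.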

Eigenvalues are special numbers associated with a…

Notebooks are great for prototyping, longer pipelines or processes.

If you are a user of PyCharm or Jupyter Notebook and an exploratory data scientist, I would encourage you to switch to Jupyter Lab.

For Jupyter Lab installation steps go here

Below are some of the advantages that I see in using Jupyter Lab over Jupyter Notebook:

  1. The new terminal opens in a tab view, which is easier to use by comparison.
  2. The ability to set out multiple windows easily, much like an IDE
  3. This will make working on a remote server so much nicer, just start Jupyter Lab and an ssh tunnel and you have…

I have worked in data management and data quality for several years. When I ask people what their data management processes are, they often simply reply, “well, we have some of our data stored in a database and other data stored on file shares with proper permissions.” This isn’t data management…it’s data storage. If you or your organization don’t have good, clean data, you are most definitely not ready for machine learning. Data management should be your first step before diving into any other data project.

Now, I’d say that if you have good data management and your data is tagged for machine learning, give yourself a…

There are many ways to use Python’s lists and lambdas. Here I am going to show some useful tips and tricks. So let’s start with lists.

Below are the list methods we use most of the time.

>>> a = [66.6, 333, 333, 1, 1234.5]
>>> print(a.count(333), a.count(66.6), a.count('x'))
2 1 0
>>> a.insert(2, -1)
>>> a.append(333)
>>> a
[66.6, 333, -1, 333, 1, 1234.5, 333]
>>> a.index(333)
1
>>> a.index(333, 2)
3
>>> a.remove(333)
>>> a
[66.6, -1, 333, 1, 1234.5, 333]
>>> a.reverse()
>>> a
[333, 1234.5, 1, 333…
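Since the article pairs lists with lambdas, here is a short sketch of how a lambda typically combines with a list. It reuses the starting values of `a` from the example above; the thresholds 300 and 100 are arbitrary choices made for this illustration.

```python
# Starting values of the sample list from the list-method example.
a = [66.6, 333, 333, 1, 1234.5]

# sorted() with a lambda key: order the values by their distance from 300.
closest_first = sorted(a, key=lambda x: abs(x - 300))

# filter() with a lambda predicate: keep only the values above 100.
large = list(filter(lambda x: x > 100, a))

# map() with a lambda: flag each value as above 100 or not.
flags = list(map(lambda x: x > 100, a))

print(closest_first)  # [333, 333, 66.6, 1, 1234.5]
print(large)          # [333, 333, 1234.5]
print(flags)          # [False, True, True, False, True]
```

None of these calls modify `a`. Each returns a new list, unlike `insert`, `remove`, and `reverse`, which change the list in place.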

In my previous article, I used Scala to show the usability of the Spark RDD API. Many of us use PySpark to work with RDDs and lambda functions. The function names and outputs are the same as in Scala, but the syntax in PySpark is different, and I’ll show you PySpark RDD operations with a different approach to the problem.

Let us consider that we are streaming data using Spark, we have created an RDD from this streaming application, and we want to perform RDD operations on this stream of data at a particular time interval. …

Python has three commonly used built-in sequence types: lists, tuples, and strings. The built-in functions str(), tuple(), and list() convert between them, as the following example shows:

>>> s = '123456'

>>> list(s)

['1', '2', '3', '4', '5', '6']

>>> tuple(s)

('1', '2', '3', '4', '5', '6')

>>> tuple(list(s))

('1', '2', '3', '4', '5', '6')

>>> list(tuple(s))

['1', '2', '3', '4', '5', '6']

>>> "".join(tuple(s))

'123456'

>>> "".join(list(s))

'123456'

>>> str(tuple(s))

"('1', '2', '3', '4', '5', '6')"

>>> str(list(s))

"['1', '2', '3', '4', '5', '6']"

I am planning to share my knowledge of the Apache Spark RDD and DataFrame APIs, along with some tips and tricks. If I combined everything into one article, it would be very lengthy. Therefore, I am dividing it into three separate articles, and this article is the first in that series.

  1. Spark RDD API
  2. Dataframe API
  3. Tips and tricks on RDD API and Dataframe API.

Let us start with the basics of the RDD API. A Resilient Distributed Dataset (RDD) is, essentially, the Spark representation of a set of data, spread across multiple machines, with APIs to let you act on it…

Kafka has already been adopted by, and has helped, many organizations in the microservices architecture world. If Kafka is still not part of your infrastructure, it’s high time for you to adopt it. I am not promoting Kafka as better than any other message queue system, as many articles on that subject are already floating around the internet.

Kafka’s uniqueness is that it provides both simple file system and bridge functions. A Kafka broker’s most basic task is to write messages to and read messages from the log on disk as quickly as possible. …
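The idea of a broker that appends messages to a log and reads them back by position can be sketched with a toy append-only file. This is only an illustration under simplifying assumptions: the `ToyLog` class, its one-line-per-message format, and the file name are invented here and do not reflect Kafka's actual on-disk format or API.

```python
import os
import tempfile

class ToyLog:
    """Toy append-only log: one message per line, offset = line index."""

    def __init__(self, path):
        self.path = path
        self.next_offset = 0
        # Create the file so reads work even before the first append.
        open(self.path, "a", encoding="utf-8").close()

    def append(self, message):
        """Write a message at the end of the log and return its offset."""
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(message + "\n")
        offset = self.next_offset
        self.next_offset += 1
        return offset

    def read_from(self, offset):
        """Return every message stored at or after the given offset."""
        with open(self.path, encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f][offset:]

with tempfile.TemporaryDirectory() as tmp:
    log = ToyLog(os.path.join(tmp, "topic-0.log"))  # hypothetical file name
    offsets = [log.append(m) for m in ["a", "b", "c"]]
    tail = log.read_from(1)

print(offsets)  # [0, 1, 2]
print(tail)     # ['b', 'c']
```

Writes only ever go to the end of the file, and a reader picks up from any offset it remembers. That sequential access pattern is the property the paragraph above points to when it says a broker should read and write the log as quickly as possible.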

At present, many open source distributed processing systems, such as Cloudera, Apache Storm, Spark, and others, support integration with Kafka. Kafka is increasingly favored by many internet companies, which use it as one of their core messaging engines. For a commercial-grade messaging middleware solution, the importance of reliable message delivery is easy to imagine.

In this article, we will look at Kafka’s storage mechanism, replication principle, synchronization principle, and durability guarantees to analyze its reliability.

As shown in the figure above, a typical Kafka architecture includes several Producers (which can be server logs, business data, page views generated by…

Mukesh Kumar

Apart from big data as my full-time profession, I am a robotics hobbyist and enthusiast… My Web Site:
