This is the first post in a series on getting started with Data Science. In it, I briefly cover an introduction to Data Science, the skills you need to become a Data Scientist, and Python, the general-purpose programming language that is becoming more and more popular for data science.
Introduction to Data Science:
“Without Data, you’re just another person with an opinion.” — W. Edwards Deming
Data is distinct pieces of information, usually formatted in a special way. Data can exist in a variety of forms — as numbers or text on pieces of paper, as bits and bytes stored in electronic memory, or as facts stored in a person’s mind. Since the mid-1900s, people have used the word data to mean computer information that is transmitted or stored.
Data in the 21st Century is like Oil in the 18th Century: an immensely valuable, untapped asset. As with oil, there will be huge rewards for those who see Data’s fundamental value and learn to extract and use it.
Data science is a multidisciplinary blend of data inference, algorithm development, and technology in order to solve analytically complex problems. Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Data mining is a particular data analysis technique that focuses on modeling and knowledge discovery for predictive rather than purely descriptive purposes, while business intelligence covers data analysis that relies heavily on aggregation, focusing mainly on business information.
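As a rough illustration of the four data analysis steps named above (inspecting, cleansing, transforming, and modeling), here is a minimal sketch using only the Python standard library. The records, the field meanings (hours studied, exam score), and the choice of a least-squares line as the "model" are all assumptions made up for this example.

```python
from statistics import mean

# Hypothetical raw records: (hours_studied, exam_score); None marks a missing value.
raw = [(1, 52), (2, 58), (None, 61), (4, 70), (5, None), (6, 81)]

# 1. Inspect: count the records that have a missing field.
missing = sum(1 for hours, score in raw if hours is None or score is None)

# 2. Cleanse: drop the incomplete records.
clean = [(h, s) for h, s in raw if h is not None and s is not None]

# 3. Transform: rescale scores from the 0-100 range to the 0-1 range.
data = [(h, s / 100) for h, s in clean]

# 4. Model: fit a least-squares line, score = slope * hours + intercept.
mean_h = mean(h for h, _ in data)
mean_s = mean(s for _, s in data)
slope = sum((h - mean_h) * (s - mean_s) for h, s in data) / sum(
    (h - mean_h) ** 2 for h, _ in data
)
intercept = mean_s - slope * mean_h

print(f"missing={missing}, kept={len(data)}")
print(f"score = {slope:.3f} * hours + {intercept:.3f}")
```

Real projects would normally use libraries such as pandas for these steps; the plain-Python version just keeps each step visible.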
At the core is data. Troves of raw information, streaming in and stored in enterprise data warehouses. Much to learn by mining it. Advanced capabilities we can build with it. Data science is ultimately about using this data in creative ways to generate business value.
Data science — discovery of data insight
This aspect of data science is all about uncovering findings from data: diving in at a granular level to mine and understand complex behaviors, trends, and inferences. It’s about surfacing hidden insight. Data science is a blend of skills in three major areas: Mathematics, Technology, and Business/Strategy acumen.
Machine learning is a term closely associated with data science. It refers to a broad class of methods that revolve around data modeling to (1) algorithmically make predictions, and (2) algorithmically decipher patterns in data.
“Data Analyst” and “Data Scientist” are not exactly synonymous, but they are not mutually exclusive either. Data science is an evolving field. For any company that wishes to enhance its business by being more data-driven, data science is the secret sauce. Data science projects can have multiplicative returns on investment, both from guidance through data insight and from the development of data products.
“Information is the oil of the 21st century and analytics is the combustion engine.” — Peter Sondergaard
Skills required for a Data Scientist:
Data scientists are highly educated: 88% have at least a Master’s degree and 46% have PhDs. While there are notable exceptions, a very strong educational background is usually required to develop the depth of knowledge necessary to be a data scientist. To become a data scientist, you could earn a Bachelor’s degree in Computer Science, Social Sciences, Physical Sciences, or Statistics. The most common fields of study are Mathematics and Statistics (32%), followed by Computer Science (19%) and Engineering (16%). A degree in any of these fields will give you the skills you need to process and analyze big data.
Technical Skills: Computer Science
2. R Programming
You need in-depth knowledge of at least one analytical tool; for data science, R is generally preferred. R is designed with statistical computing in mind, and you can use it to solve most problems you encounter in data science. In fact, 43% of data scientists are using R to solve statistical problems. However, R has a steep learning curve.
3. Python Coding
Python is the most common coding language I typically see required in data science roles, along with Java, Perl, or C/C++. Python is a great programming language for data scientists. Because of its versatility, you can use Python for almost all the steps involved in data science processes. It can read data in various formats, and you can easily import SQL tables into your code. It allows you to create datasets, and you can find almost any type of dataset you need on Google.
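To make the point about data formats concrete, here is a small sketch that reads CSV, JSON, and a SQL table into the same Python structure using only the standard library. The sample data, the `people` table, and the in-memory SQLite database are stand-ins invented for the example; real work would read from files or a real database.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample data; in practice these would come from files.
csv_text = "name,age\nAda,36\nLinus,29\n"
json_text = '[{"name": "Grace", "age": 45}]'

# CSV -> list of dicts (all values arrive as strings).
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON -> list of dicts.
json_rows = json.loads(json_text)

# SQL table -> list of dicts, using an in-memory SQLite database as a stand-in.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.execute("INSERT INTO people VALUES ('Guido', 40)")
conn.row_factory = sqlite3.Row  # rows become addressable by column name
sql_rows = [dict(row) for row in conn.execute("SELECT name, age FROM people")]
conn.close()

# Combine everything into one dataset, normalizing age to int.
people = [
    {"name": r["name"], "age": int(r["age"])}
    for r in csv_rows + json_rows + sql_rows
]
print(people)
```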
4. Hadoop Platform
Although this isn’t always a requirement, it is heavily preferred in many cases. Having experience with Hive or Pig is also a strong selling point. Familiarity with cloud tools such as Amazon S3 can also be beneficial. As a data scientist, you may encounter a situation where the volume of data you have exceeds the memory of your system, or where you need to send data to different servers; this is where Hadoop comes in. You can use Hadoop to quickly convey data to various points on a system. That’s not all: you can also use Hadoop for data exploration, data filtration, data sampling, and summarization.
5. SQL Database/Coding
Even though NoSQL and Hadoop have become a large component of data science, it is still expected that a candidate will be able to write and execute complex queries in SQL. SQL (Structured Query Language) is a programming language that helps you carry out operations such as adding, deleting, and extracting data from a database. It can also help you carry out analytical functions and transform database structures. You need to be proficient in SQL as a data scientist, because SQL is specifically designed to help you access, communicate, and work on data. It gives you insights when you use it to query a database.
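The operations mentioned above (adding, deleting, and extracting data, running an analytical function, and changing a table's structure) can be sketched with Python's built-in `sqlite3` module. The `orders` table, its columns, and the sample rows are hypothetical, and SQLite is used only because it needs no server; other databases differ in SQL dialect details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

cur.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)

# Add rows.
cur.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0), ("carol", 5.0)],
)

# Delete rows that match a condition.
cur.execute("DELETE FROM orders WHERE amount < ?", (10,))

# Extract data with an analytical (aggregate) function: total per customer.
cur.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
)
totals = cur.fetchall()

# Transform the table structure: add a column.
cur.execute("ALTER TABLE orders ADD COLUMN currency TEXT DEFAULT 'USD'")
columns = [row[1] for row in cur.execute("PRAGMA table_info(orders)")]

conn.close()
print(totals)
print(columns)
```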
6. Apache Spark
Apache Spark is becoming one of the most popular big data technologies worldwide. It is a big data computation framework, just like Hadoop. A key difference is that Spark is generally faster: Hadoop MapReduce reads from and writes to disk between steps, which makes it slower, whereas Spark can cache its computations in memory. This helps data science workloads run complicated algorithms faster. Spark distributes data processing when you are dealing with a big sea of data, thereby saving time. It also helps data scientists handle complex unstructured data sets. You can use it on one machine or on a cluster of machines. Apache Spark also helps data scientists guard against loss of data. The strength of Apache Spark lies in its speed and its platform, which make it easy to carry out data science projects. With Apache Spark, you can carry out analytics from data intake to distributed computing.
7. Machine Learning and AI
A large number of data scientists are not proficient in machine learning areas and techniques such as neural networks, reinforcement learning, and adversarial learning. If you want to stand out from other data scientists, you need to know machine learning techniques such as supervised machine learning, decision trees, and logistic regression. These skills will help you solve different data science problems that are based on predictions of major organizational outcomes. Data science needs the application of skills in different areas of machine learning, and it involves working with large amounts of data.
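To give a feel for one of the supervised techniques named above, here is a minimal sketch of logistic regression trained with plain gradient descent, written without any libraries. The toy data (hours of product usage versus whether a customer was retained), the learning rate, and the iteration count are all assumptions chosen for the example; in practice you would normally reach for a library such as scikit-learn.

```python
import math

# Hypothetical toy data: hours of product usage -> retained (1) or not (0).
X = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
y = [0, 0, 0, 0, 1, 1, 1, 1]


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Model: P(retained) = sigmoid(w * x + b); start from zero weights.
w, b = 0.0, 0.0
learning_rate = 0.1

for _ in range(5000):
    grad_w = grad_b = 0.0
    for xi, yi in zip(X, y):
        error = sigmoid(w * xi + b) - yi  # gradient of the log loss w.r.t. z
        grad_w += error * xi
        grad_b += error
    # Step against the average gradient.
    w -= learning_rate * grad_w / len(X)
    b -= learning_rate * grad_b / len(X)


def predict(x):
    """Return the predicted class (0 or 1) for a single input."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0


print([predict(x) for x in X])
```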
8. Data Visualization
The business world frequently produces vast amounts of data. This data needs to be translated into a format that is easy to comprehend. People naturally understand pictures in the form of charts and graphs more readily than raw data. As the saying goes, “A picture is worth a thousand words”. As a data scientist, you must be able to visualize data with the aid of data visualization tools such as ggplot, d3.js, Matplotlib, and Tableau. These tools will help you convert complex results from your projects into a format that is easy to comprehend. The thing is, a lot of people do not understand serial correlation or p-values. You need to show them visually what those terms represent in your results. Data visualization gives organizations the opportunity to work with data directly. They can quickly grasp insights that will help them act on new business opportunities and stay ahead of the competition.
9. Unstructured data
It is critical that a data scientist be able to work with unstructured data. Unstructured data is content with no predefined structure that does not fit into database tables. Examples include videos, blog posts, customer reviews, social media posts, video feeds, audio, etc. Much of it is heavy text lumped together. Sorting this type of data is difficult because it is not streamlined. Many people refer to unstructured data as “dark analytics” because of its complexity. Working with unstructured data helps you unravel insights that can be useful for decision making. As a data scientist, you must have the ability to understand and manipulate unstructured data from different platforms.
10. Intellectual curiosity
Curiosity can be defined as the desire to acquire more knowledge. As a data scientist, you need to be able to ask questions about data, because data scientists spend about 80% of their time discovering and preparing data. Data science is also a field that is evolving very fast, and you have to keep learning to keep up with the pace. You need to regularly update your knowledge by reading content online and relevant books on trends in data science. Curiosity is one of the skills you need to succeed as a data scientist. For example, initially, you may not see much insight in the data you have collected. Curiosity will enable you to sift through the data to find answers and more insights.
11. Business acumen
To be a data scientist, you’ll need a solid understanding of the industry you’re working in and of the business problems your company is trying to solve. In terms of data science, being able to discern which problems are important to solve for the business is critical, in addition to identifying new ways the business should be leveraging its data. To be able to do this, you must understand how the problem you solve can impact the business. This is why you need to know how businesses operate, so you can point your efforts in the right direction.
12. Communication skills
Companies searching for a strong data scientist are looking for someone who can clearly and fluently translate their technical findings to a non-technical team, such as the Marketing or Sales departments. A data scientist must enable the business to make decisions by arming them with quantified insights, in addition to understanding the needs of their non-technical colleagues in order to wrangle the data appropriately. As well as speaking the same language the company understands, you also need to communicate by using data storytelling. As a data scientist, you have to know how to create a storyline around the data to make it easy for anyone to understand. For instance, presenting a table of data is not as effective as sharing the insights from those data in a storytelling format. Using storytelling will help you to properly communicate your findings to your employers.
A data scientist cannot work alone. You will have to work with company executives to develop strategies, with product managers and designers to create better products, with marketers to launch better-converting campaigns, and with client and server software developers to create data pipelines and improve workflow. You will literally have to work with everyone in the organization, including your customers. Essentially, you will be collaborating with your team members to develop use cases in order to know the business goals and the data that will be required to solve problems. You will need to know the right approach to address the use cases, the data that is needed to solve the problem, and how to translate and present the result in a form that can easily be understood by everyone involved.
Introduction to Python:
Python is a general-purpose programming language that is becoming more and more popular for doing data science. Companies worldwide are using Python to harvest insights from their data and get a competitive edge. It was created by Guido van Rossum and first released in 1991.
It is used for:
- web development (server-side),
- software development,
- system scripting.
What can Python do?
- Python can be used on a server to create web applications.
- Python can be used alongside software to create workflows.
- Python can connect to database systems. It can also read and modify files.
- Python can be used to handle big data and perform complex mathematics.
- Python can be used for rapid prototyping, or for production-ready software development.
- Python works on different platforms (Windows, Mac, Linux, Raspberry Pi, etc.)
- Python has a simple syntax similar to the English language.
- Python has syntax that allows developers to write programs with fewer lines than some other programming languages.
- Python runs on an interpreter system, meaning that code can be executed as soon as it is written. This means that prototyping can be very quick.
- Python has a large collection of libraries viz. NumPy, SciPy, matplotlib, nltk, SymPy, etc.
- Python can be treated in a procedural way, an object-oriented way or a functional way.
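As a small illustration of the last point in the list, the sketch below solves the same task (summing the squares of the even numbers in a list) in a procedural, an object-oriented, and a functional style. The task and the `EvenSquareSummer` class name are invented for the example.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5, 6]

# Procedural style: explicit loop and a running total.
total_procedural = 0
for n in numbers:
    if n % 2 == 0:
        total_procedural += n * n


# Object-oriented style: data and behavior bundled in a class.
class EvenSquareSummer:
    def __init__(self, values):
        self.values = values

    def total(self):
        return sum(v * v for v in self.values if v % 2 == 0)


total_oop = EvenSquareSummer(numbers).total()

# Functional style: compose filter and reduce, no mutable state.
total_functional = reduce(
    lambda acc, n: acc + n * n,
    filter(lambda n: n % 2 == 0, numbers),
    0,
)

print(total_procedural, total_oop, total_functional)
```

All three produce the same result; which style reads best depends on the problem and the surrounding code.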
Brief introduction to OOP in Python:
Python is a multi-paradigm programming language, meaning it supports different programming approaches. Python has been an object-oriented language since it first appeared. Because of this, creating and using classes and objects is downright easy.
One popular approach to solving a programming problem is to create objects. This is known as Object-Oriented Programming (OOP).
An object has two characteristics: attributes and behavior.
Let’s take an example:
Parrot is an object,
- name, age, color are attributes
- singing, dancing are behavior
The concept of OOP in Python focuses on creating reusable code. In Python, the concept of OOP follows some basic principles:
· Inheritance: a process of using details from an existing class in a new class without modifying the existing class.
· Encapsulation: hiding the private details of a class from other objects.
· Polymorphism: a concept of using a common operation in different ways for different data input.
A class is a blueprint for the object.
We can think of a class as a sketch of a parrot with labels. It contains all the details about the name, colors, size, etc. Based on these descriptions, we can study the parrot. Here, parrot is an object.
An example of a class for a parrot could be:
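A minimal sketch of such a class follows. The listing is reconstructed from the surrounding description of an empty class named Parrot; using `pass` as the placeholder body is an assumption.

```python
class Parrot:
    pass  # empty body: no attributes or methods are defined yet
```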
Here, we use the class keyword to define an empty class Parrot. From a class, we construct instances. An instance is a specific object created from a particular class.
An object (instance) is an instantiation of a class. When a class is defined, only the description for the object is defined. Therefore, no memory or storage is allocated for any object until the class is instantiated.
An example of an object of the Parrot class could be:
obj = Parrot()
Here, obj is an object of class Parrot.
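To tie this back to the attributes mentioned earlier (name, age), here is a slightly fuller sketch in which each instance carries its own data. The `__init__` signature, the `species` class attribute, and the names "Blu" and "Woo" are illustrative assumptions.

```python
class Parrot:
    # Class attribute: shared by every instance.
    species = "bird"

    def __init__(self, name, age):
        # Instance attributes: unique to each object.
        self.name = name
        self.age = age


# Two separate objects (instances) built from the same class.
blu = Parrot("Blu", 10)
woo = Parrot("Woo", 15)

print(f"{blu.name} is a {blu.species} and is {blu.age} years old")
print(f"{woo.name} is a {woo.species} and is {woo.age} years old")
```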
Methods are functions defined inside the body of a class. They are used to define the behaviors of an object.
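A short sketch of methods, using the singing and dancing behaviors from the parrot example above. The method names `sing` and `dance`, their return strings, and the instance name are assumptions made for illustration.

```python
class Parrot:
    def __init__(self, name):
        self.name = name

    # Methods define the behaviors of the object.
    def sing(self, song):
        return f"{self.name} sings {song}"

    def dance(self):
        return f"{self.name} is now dancing"


blu = Parrot("Blu")
print(blu.sing("'Happy'"))
print(blu.dance())
```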
Inheritance is a way of creating a new class that uses the details of an existing class without modifying it. The newly formed class is a derived class (or child class). Similarly, the existing class is a base class (or parent class).
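A minimal sketch of a base class and a derived class. The `Bird` and `Penguin` classes and their methods are hypothetical names chosen for the example.

```python
class Bird:
    """Base (parent) class."""

    def __init__(self, name):
        self.name = name

    def describe(self):
        return f"{self.name} is a bird"

    def can_fly(self):
        return True


class Penguin(Bird):
    """Derived (child) class: reuses Bird without modifying it."""

    def can_fly(self):
        # Override the inherited behavior.
        return False

    def swim(self):
        # Add behavior that the base class does not have.
        return f"{self.name} swims"


peggy = Penguin("Peggy")
print(peggy.describe())  # inherited from Bird
print(peggy.can_fly())   # overridden in Penguin
print(peggy.swim())      # defined only in Penguin
```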
Using OOP in Python, we can restrict access to methods and variables. This prevents data from being modified directly, which is called encapsulation. In Python, we denote private attributes using an underscore prefix: a single “_” (a naming convention) or a double “__” (which also triggers name mangling).
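A small sketch of how the double-underscore prefix behaves in practice. The `Computer` class, its price values, and the setter method are made up for the example. Note that Python does not enforce privacy strictly; the double underscore renames the attribute internally (name mangling), which is what keeps the outside assignment below from touching it.

```python
class Computer:
    def __init__(self):
        # Double underscore: stored internally as _Computer__max_price.
        self.__max_price = 900

    def sell(self):
        return f"Selling price: {self.__max_price}"

    def set_max_price(self, price):
        # The intended way to change the private value.
        self.__max_price = price


c = Computer()
before = c.sell()

# This creates a new, unrelated attribute on the instance;
# the private value used by sell() is left untouched.
c.__max_price = 1000
after_direct = c.sell()

# Going through the setter does change it.
c.set_max_price(1000)
after_setter = c.sell()

print(before, after_direct, after_setter, sep="\n")
```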
Polymorphism is the ability (in OOP) to use a common interface for multiple forms (data types).
Suppose we need to color a shape, and there are multiple shape options (rectangle, square, circle). We could use the same method to color any shape. This concept is called polymorphism.
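The shape example above can be sketched as follows. The class names, the `color` method, and the `paint` helper are hypothetical; the point is only that one call works on any object that provides the same method.

```python
class Rectangle:
    def color(self, shade):
        return f"rectangle colored {shade}"


class Square:
    def color(self, shade):
        return f"square colored {shade}"


class Circle:
    def color(self, shade):
        return f"circle colored {shade}"


def paint(shape, shade):
    # Common interface: works for any object that has a color() method.
    return shape.color(shade)


results = [paint(shape, "red") for shape in (Rectangle(), Square(), Circle())]
for line in results:
    print(line)
```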
We have now explored the various aspects of classes and objects as well as the terminology associated with them. We have also seen the benefits and pitfalls of object-oriented programming. Python is highly object-oriented, and understanding these concepts carefully will help you a lot in the long run.
In 2012, Harvard Business Review called data scientist “The Sexiest Job of the 21st Century”. The scope and impact of data science will continue to expand enormously in the coming decades as scientific data and data about science itself become ubiquitously available. Regardless of how things evolve, one thing is sure: an enormous amount of data will be used to drive key business decisions, and skilled data scientists will be the key to unlocking the endless possibilities.