Want to become a data scientist:

People have become crazy about being data scientists. Everybody wants to pursue data science, but it is not something you can acquire by learning alone; it also takes practice. In this write-up I want to share my experience of how you can build this skill.
Why people are excited about being a data scientist:
From what I have heard, demand in the data science field is growing, and people with this skill set have a greater chance of making good money. Let’s not leave out the idealists who want to take the existing customer experience and their business to the next level. It also helps that IT jobs are always in fashion, and people get to work on cutting-edge technologies to take the software experience to the next level.
Where do you start this journey:
Regardless of your technology or domain background, you can start the journey to becoming a data scientist. You do need to be sure about some prerequisites. The first is time: this is definitely going to take a good amount of it, and you should be ready to spend the time you need to learn. The second is some basic statistical concepts; you need not master everything in statistics. Statistics helps bring the important features to the front. A good example is feature selection while building a model: you can compute the correlation between data fields and remove the redundant fields that capture the same variable from a different facet. The third is being strong in a programming language. I prefer Python for all my data science work.
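To make the correlation example above concrete, here is a minimal sketch in plain Python. The column names, the sample values, the 0.95 threshold, and the helper names `pearson` and `drop_correlated` are all assumptions made up for illustration; in practice a library such as pandas would typically do the correlation step for you.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists.

    Assumes neither list is constant (a constant column has zero variance).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.95):
    """Keep a column only if it is not strongly correlated with one already kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical data: height_in is the same measurement as height_cm in other units.
data = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [35, 22, 48, 29, 41],
}

selected = drop_correlated(data)
print(selected)
```

Here the second height column is dropped because it carries the same information as the first, while age is kept. The threshold is a judgment call and depends on the dataset.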
Learning from others:
Learning how others have implemented their solutions is a nice way to start. Take around 10 different problems from Kaggle kernels and start digging into them. Read the problem definition; this will help you tell apart what really is a data science problem. Some datasets produce insights straight away: simply putting the data on a chart is enough. You can see the same problem solved repeatedly by different kernels, where one kernel drops a feature when modeling while another finds an insight in that same feature. It is all about trying different possibilities of implementation to build the model.
What tools you need:
There are many tools. For the environment, my suggestion is Anaconda, a Python distribution that comes with the packages you need preinstalled. I would suggest working in an IPython notebook rather than any other Python editor, because you need to break the dataset into small chunks and run small experiments for every aspect you are looking at in the dataset. Most of a data scientist’s time goes into analyzing and preparing the dataset so that it is ready for machine learning algorithms to process.
How to start preparing the dataset:
No dataset is ready for machine learning straight away. Machine learning models are only the end output of all the work we do. Data preprocessing is a huge task, and sometimes you need to aggregate the data from different sources. Datasets are mostly passed around as CSV, but getting the data into CSV format can take a lot of processing. The important terms to learn around data processing are DataFrames and NumPy arrays; these two data structures are the most used. Machines only process 0’s and 1’s, and the algorithms likewise work on numbers, so you need to convert all the feature variables into equivalent numerical values. There are a couple of techniques for this, such as dummy variables and one-hot encoding.
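As an illustration of the encoding step, the sketch below turns a text column into 0/1 columns using plain Python. The rows, the column names, and the helper `one_hot` are hypothetical and exist only for this example; with pandas, the `get_dummies` function is the usual way to get the same result on a DataFrame.

```python
def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    # Sort the categories so the new columns come out in a stable order.
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the categorical one.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical dataset with one numeric and one categorical feature.
rows = [
    {"price": 10, "color": "red"},
    {"price": 12, "color": "blue"},
    {"price": 9, "color": "red"},
]

encoded = one_hot(rows, "color")
for row in encoded:
    print(row)
```

Each row keeps its numeric fields and gains one column per category, with a 1 marking the category that row belongs to.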
Choosing the right algorithms:
There is no such thing as the right algorithm. You have to choose the algorithm based on the structure and features of the dataset and the problem you are trying to solve. To build the first version of your model, split your dataset into a training set and a testing set; there are predefined split functions in the sklearn module, or you can split it evenly yourself. A general piece of advice is to randomly shuffle your dataset before splitting. This ensures that the machine learning algorithm is not biased by the order of the data. In your first run you need not know what happens inside the algorithm; you can learn that later.
The model I would suggest for learning to become a data scientist:
It is a long road and a never-ending process. As data scientists we need to discipline ourselves with early cuts and get into an iterative learning model. Don’t shop for everything in every store; just shop for the problems you have. So go with a problem-solving approach: learn the statistics needed to solve the specific problem.
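The shuffle-then-split advice can be sketched by hand as follows. The helper name `shuffle_split`, the 20% test fraction, and the seed are assumptions chosen for the example; sklearn’s `train_test_split` in `sklearn.model_selection` does the same job, including the shuffling, in one call.

```python
import random


def shuffle_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows, then cut it into training and testing sets."""
    shuffled = list(rows)  # copy, so the caller's data keeps its order
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical dataset: ten samples identified by the numbers 0..9.
samples = list(range(10))

train_set, test_set = shuffle_split(samples, test_fraction=0.2)
print(len(train_set), len(test_set))
```

Fixing the seed is a design choice: it keeps the split identical between runs, so a change in model score comes from the model and not from a different split.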
Big data and data science:
Big data and data science go hand in hand. But you can still learn machine learning without knowing any big data tools. Big data is needed only where there is heavy data processing; today’s laptops and desktop machines are good enough for building a good model that can still solve production problems.
