How To Trigger a Data Scientist, and Why I Love Python.
NumPy is the foundational package for mathematics in Python. Its n-dimensional arrays let you store data in a super convenient format, and they can be thought of, and referenced, almost exactly like regular Python lists. Things like mean and square root, which for some reason I will never understand are not built into the core Python language, can be done with NumPy. You can also perform math element-wise across entire arrays, which I often find useful when trying to feature engineer things. It also makes working with matrices much simpler and quicker. Plus its random number generator ends up being extremely useful for all sorts of things.
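A minimal sketch of those conveniences (the array names are just for illustration):

```python
import numpy as np

prices = np.array([10.0, 12.0, 11.0, 13.0])

# Mean and square root come straight from NumPy.
avg = np.mean(prices)                         # 11.5
roots = np.sqrt(np.array([4.0, 9.0, 16.0]))   # [2. 3. 4.]

# "Math across lists": element-wise operations on whole arrays at once.
doubled = prices * 2                          # [20. 24. 22. 26.]

# Matrix work is just as direct.
m = np.array([[1, 2], [3, 4]])
product = m @ m                               # matrix multiplication -> [[7, 10], [15, 22]]

# And the random number generator.
rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=0.0, scale=1.0, size=3)
```

Notice that `doubled` required no loop at all; that is the element-wise behavior that makes feature engineering so quick.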
Whenever you hear the term data science, it’s usually in close proximity to the term “big data”. Pandas is the standard package for reading and storing large amounts of data in Python. It lets you work with date-time objects, and its super powerful groupby and manipulation tools let you view and aggregate your data any which way your heart desires. One of the biggest challenges of working with big datasets is learning how to get a feel for what your data looks like without being able to actually see it, and Pandas really does a good job of helping you do that. “Oh, so it’s like Excel,” your uncle will say the next time you attempt to explain it to him. My response usually goes something along the lines of: Excel is to Pandas as a bicycle is to a race-car.
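A quick sketch of the date-time and groupby machinery (the column names and numbers here are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2023-01-05", "2023-01-12", "2023-02-03", "2023-02-20"],
    "store": ["north", "south", "north", "south"],
    "sales": [100, 150, 120, 130],
})

# Parse the strings into real date-time objects.
df["date"] = pd.to_datetime(df["date"])

# Aggregate any which way you like: total sales per store...
per_store = df.groupby("store")["sales"].sum()

# ...or mean sales per month, pulled straight out of the dates.
per_month = df.groupby(df["date"].dt.month)["sales"].mean()
```

That `.dt.month` trick, slicing a date column by any component, is exactly the kind of thing that gives you a feel for a dataset you can’t see all at once.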
NLTK, or the Natural Language Toolkit, is my favorite package in my favorite part of data science. Natural language processing teaches a computer how to recognize human-talk through numbers, which is how computers understand everything. NLTK contains a fully stocked NLP toolkit with tools for tokenization, stemming, parsing, and semantic analysis of text. It also has over 50 different corpora (the real way to say the plural of corpus) that contain massive amounts of text to help you train your model on whatever it is you are trying to understand. Just a quick warning: all of this awesomeness comes at a storage price, since the full NLTK data download runs to several gigabytes, but it’s well worth it even if you think you aren’t going to use it very often. An amazing study can be found here that used NLTK to predict schizophrenia with incredible accuracy.
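A tiny taste of the tokenization and stemming tools (I use a regex tokenizer here because it needs no extra downloads; the more thorough `nltk.word_tokenize` requires fetching the “punkt” models first):

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Split a sentence into word tokens.
tokenizer = RegexpTokenizer(r"\w+")
tokens = tokenizer.tokenize("The runners were running happily.")
# ['The', 'runners', 'were', 'running', 'happily']

# Stem each token down toward its root form.
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]
# e.g. 'running' -> 'run'
```

Collapsing “running” and “runners” toward a common root is what lets a model treat them as the same idea instead of three unrelated strings.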
Scikit-learn is the backbone of machine learning in data science. It contains tools to implement both supervised and unsupervised machine learning, and includes nearly every type of model. It also contains tools for preprocessing and cross-validation, both extremely necessary but rarely-spoken-of parts of data science. It has a very consistent API across the whole package, which makes it easy to start out on, though it can take a while to learn all of the incredible tools it contains. Something else to love about scikit-learn is how fantastically its documentation is maintained, something that is often a struggle with other packages.
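A minimal sketch of that consistent API, chaining the rarely-spoken-of preprocessing and cross-validation steps together (the bundled iris dataset stands in for real data):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Preprocessing and model chained together; every step speaks
# the same fit/transform/predict language.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Five-fold cross-validation in a single line.
scores = cross_val_score(model, X, y, cv=5)
```

Because every estimator follows the same interface, swapping `LogisticRegression` for nearly any other model is a one-line change.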
While I have really only used the package a few times, Bokeh is definitely number one on my “learn how to use really well” list. The two main Python packages used for visualizations are matplotlib and seaborn, and they both just sort of get the job done. Bokeh, however, makes interactive charts that are easy to create and actually really nice. You can also export them to an HTML file, so they can easily be placed on a static site or blog. Plus, looking at the hundreds of lines of HTML produced by fewer than twenty lines of Python is awesome, and you can’t help but feel a little bad for developers. So the good news is I can stop screenshotting Tableau to get some nice-looking graphs. Check out the gallery over here for some really cool visuals.
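A small sketch of the chart-to-HTML trick (the data points are made up; `file_html` renders the figure as a standalone page you can drop into a blog):

```python
from bokeh.embed import file_html
from bokeh.plotting import figure
from bokeh.resources import CDN

# A small interactive line chart.
p = figure(title="Example", x_axis_label="x", y_axis_label="y")
p.line([1, 2, 3, 4], [4, 7, 2, 5], line_width=2)

# Render the whole chart as a self-contained HTML page.
html = file_html(p, CDN, "Example chart")
```

Write that `html` string to a file, open it in a browser, and you get a pannable, zoomable chart from under a dozen lines of Python.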