NoSQL, Python, and MongoDB

--

Edited by Yifei Xu

This is a recap of a FocusKPI Analytics at Work event. Join our LinkedIn group to learn more.

Speaker: Xuanfu Wu

Expertise: Data analytics, statistical analysis, machine learning, data engineering with Python, SAS, SQL, R, C++, Java and UNIX.

A data ecosystem integrates three things: data infrastructure, which promotes data sharing and consumption; analytics, the process of inspecting, cleansing, transforming, and modeling data; and applications, the way our world of data gets operationalized.

Data Ecosystem Sketch Map (image source: http://wikibon.org/w/images/0/03/BigDataComponents.JPG)

Raw data is first sent to the base, also known as the storage layer, which is typically composed of HDFS (Hadoop Distributed File System) and a NoSQL database. Above it sits a computation or logic layer, MapReduce or Pig, which typically reorganizes large raw datasets into smaller, more manageable ones. In other words, Hadoop works on MapReduce principles. At the top lies the application logic or interaction layer, which here is Hive and Cascading. Hive lets you process and analyze data in HDFS with Hive queries, which resemble SQL. Cascading is an API layer that runs on top of the Hadoop stack. Beyond these, there are specialized databases such as Netezza and Greenplum for scale-out needs.

MapReduce Workflow (Image Source: https://blog.sqlauthority.com/2013/10/09/big-data-buzz-words-what-is-mapreduce-day-7-of-21/)

This might be overwhelming, so today we are going to focus on NoSQL. NoSQL, short for "not only SQL", refers to non-relational database management systems. As data storage costs decreased, the amount of data that applications needed to store and query increased. This data is not only tabular; it also includes documents, graphs, and key-value pairs. NoSQL systems are designed around trade-offs among consistency, availability, and partition tolerance (by the CAP theorem, a distributed store cannot fully guarantee all three at once), and they typically offer more limited query capabilities than SQL.

Here is an example of document-type data. It looks like a JSON file. The ‘firstname’ and ‘age’ parts are key-value pairs; the ‘skills’ part is an array of strings (or could be a single string); the ‘location’ part is an object, since it is surrounded by curly braces {} and has key-value pairs inside it; and the last part, ‘hobbies’, is an array of objects. This could very well be scraped data about a blogger.
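As a minimal sketch, a document with the shape described above could look like the following in Python. Only the field names come from the description; every value is made up for illustration.

```python
import json

# Hypothetical document: only the field names come from the text above.
blogger = {
    "firstname": "Alex",              # key-value pair (string value)
    "age": 30,                        # key-value pair (numeric value)
    "skills": ["Python", "SQL"],      # array of strings
    "location": {                     # object with its own key-value pairs
        "city": "San Francisco",
        "country": "USA",
    },
    "hobbies": [                      # array of objects
        {"name": "hiking", "years": 5},
        {"name": "photography", "years": 2},
    ],
}

# A dict like this serializes directly to JSON text.
as_json = json.dumps(blogger, indent=2)
print(as_json)
```

Because the nesting is part of the document itself, no separate tables or joins are needed to represent the location or the hobbies.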

First, we start the MongoDB server. Notice that the window is titled "Command Prompt - mongod"; mongod is the MongoDB server process.

Then we go to the "Command Prompt - mongo" window and list the databases we have. Now that the environment is configured, let’s move to Python and create a dataset to illustrate an aggregation example in MongoDB.

First, we need to switch to the aggregation_example database by running the command use aggregation_example in the command prompt, and then we type db.getCollectionNames() to see what collections we have. (A collection is the equivalent of what we know as a table.)
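The same inspection can be done from Python. The sketch below assumes PyMongo is installed and that mongod is listening on the default local address; the helper names and the URI are illustrative and not taken from the talk.

```python
def describe_server(client, db_name="aggregation_example"):
    """Return the database names on the server and the collections in one database."""
    return {
        "databases": client.list_database_names(),
        "collections": client[db_name].list_collection_names(),
    }


def connect(uri="mongodb://localhost:27017"):
    """Open a connection (assumed local URI); pymongo is only imported when called."""
    from pymongo import MongoClient
    return MongoClient(uri)
```

With the server running, describe_server(connect()) gives roughly the information that show dbs and db.getCollectionNames() print in the mongo shell.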

Then, we create the collection in Python from JSON-like documents. After running this part of the code in the console, switch back to the mongo shell and run show dbs again; you will see the new ‘demo_pymongo’ collection in the aggregation_example database.

What are the indices of this collection, then? We can see that MongoDB automatically generated an object ID (the _id field) for each document. Even when two documents are exactly the same, their IDs are different.
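The creation step can be sketched as follows. The sample documents are an assumption (the original data is not shown here); they are chosen to match the later description of a key x and tags for cats, dogs, and mice. The helper names and the local URI are also illustrative.

```python
# Assumed sample data; the documents used in the talk are not shown in the text.
DEMO_DOCS = [
    {"x": 1, "tags": ["dog", "cat"]},
    {"x": 2, "tags": ["cat"]},
    {"x": 2, "tags": ["mouse", "cat", "dog"]},
    {"x": 3, "tags": []},
]


def insert_demo_documents(collection, docs=DEMO_DOCS):
    """Insert copies of the documents and return the ids assigned to them."""
    # Copy each dict first: insert_many adds an "_id" key to the dicts it receives.
    result = collection.insert_many([dict(d) for d in docs])
    return result.inserted_ids


def get_demo_collection(uri="mongodb://localhost:27017"):
    """Return the demo_pymongo collection (assumed local server)."""
    from pymongo import MongoClient
    return MongoClient(uri)["aggregation_example"]["demo_pymongo"]
```

With a server running, insert_demo_documents(get_demo_collection()) returns one generated id per document, and inserting the same documents a second time yields a fresh set of ids.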

Then we move on to the aggregation itself. Note that bson stands for binary JSON, the binary format MongoDB uses to store and transfer JSON-like documents. We build a pipeline and pass the documents through it with the aggregate function, and from the result we learn how many cats, dogs, and mice there are in the collection.
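A pipeline of this kind might look like the sketch below. The sample documents are assumed, not the ones from the talk, and the pure-Python counter is only there to show what the pipeline computes without needing a server.

```python
from collections import Counter

# Assumed sample data, matching the "cats, dogs, and mice" description.
SAMPLE_DOCS = [
    {"x": 1, "tags": ["dog", "cat"]},
    {"x": 2, "tags": ["cat"]},
    {"x": 2, "tags": ["mouse", "cat", "dog"]},
    {"x": 3, "tags": []},
]


def tag_count_pipeline():
    """Build an aggregation pipeline that counts how often each tag occurs."""
    return [
        {"$unwind": "$tags"},                                # one document per tag
        {"$group": {"_id": "$tags", "count": {"$sum": 1}}},  # count per tag
        {"$sort": {"count": -1, "_id": -1}},                 # most frequent first
    ]


def count_tags_locally(docs):
    """Pure-Python equivalent of the $unwind and $group stages."""
    return dict(Counter(tag for doc in docs for tag in doc.get("tags", [])))


print(count_tags_locally(SAMPLE_DOCS))
```

Against a live collection, the pipeline would be run as list(collection.aggregate(tag_count_pipeline())), where collection is a PyMongo collection holding documents like these.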

Remember that the data ecosystem has a logic layer called Map/Reduce, and that the diagram above showed its workflow? We can also use it to reach the same goal. You may notice that the language is totally different from SQL and reads more like OOP (object-oriented programming).
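To make the idea concrete without a server, here is a pure-Python sketch of the map and reduce steps for the same tag count. It is not the MongoDB API (there, the map and reduce functions are written in JavaScript), and the sample documents are assumed.

```python
from collections import defaultdict

# Assumed sample data, as in the aggregation description.
SAMPLE_DOCS = [
    {"x": 1, "tags": ["dog", "cat"]},
    {"x": 2, "tags": ["cat"]},
    {"x": 2, "tags": ["mouse", "cat", "dog"]},
    {"x": 3, "tags": []},
]


def mapper(doc):
    """Map step: emit a (tag, 1) pair for every tag in a document."""
    for tag in doc.get("tags", []):
        yield tag, 1


def reducer(key, values):
    """Reduce step: combine all values emitted for one key."""
    return sum(values)


def run_map_reduce(docs):
    """Group the emitted pairs by key (the shuffle), then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}


print(run_map_reduce(SAMPLE_DOCS))
```

The result matches what the aggregation pipeline computes; the difference is that the grouping and summing are spelled out as separate map and reduce functions.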

The package does offer map/reduce-style functionality as well. Suppose we want to find the documents whose value for key x is less than 2; look at the code and the result. It returns the first document in the collection. You may find that this has something in common with the WHERE clause in SQL. What if you change 2 to 3? Check the screenshots below.

Did you get this right?
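For readers following along without the screenshots, the comparison can be sketched like this. The documents are assumed sample data with x values 1, 2, 2, and 3, so the counts below hold only for that assumption; the helper names are illustrative.

```python
# Assumed sample data; only the key name "x" comes from the text.
SAMPLE_DOCS = [
    {"x": 1, "tags": ["dog", "cat"]},
    {"x": 2, "tags": ["cat"]},
    {"x": 2, "tags": ["mouse", "cat", "dog"]},
    {"x": 3, "tags": []},
]


def lt_filter(field, bound):
    """Build a MongoDB query document meaning: field < bound."""
    return {field: {"$lt": bound}}


def apply_lt_locally(docs, field, bound):
    """Pure-Python equivalent of the filter, similar to WHERE field < bound."""
    return [doc for doc in docs if field in doc and doc[field] < bound]


print(len(apply_lt_locally(SAMPLE_DOCS, "x", 2)))  # only the x == 1 document
print(len(apply_lt_locally(SAMPLE_DOCS, "x", 3)))  # the x == 1 and both x == 2 documents
```

Against a live collection, the same filter would be passed as collection.find(lt_filter("x", 2)).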

NoSQL is powerful, and there is no doubt it has a large market share. However, it has disadvantages. One is that the learning curve can be steep for new developers; the other is that its query capabilities are quite limited, as discussed above.

If you are interested in this topic or have any related questions about our event and service, please reach out to NEWS@FOCUSKPI.COM.

--


Analytics News & Events Powered by FocusKPI
