My Internship Experience at Couture AI

Nahush Kumta · Published in cverse-ai · 7 min read · Oct 1, 2019

The last two months (the summer of 2019) were a huge learning curve in my life, and all credit for it goes to the company where I interned: Couture AI.

I learned a few aspects of machine learning, a fair bit about DBMS query optimization techniques, and quite a bit about Big Data analytics. Beyond the technical knowledge, I also learned a lot about work culture and professionalism, and picked up several other soft skills along the way. I'll now describe my work at the company.

My Projects:

At Couture AI, I worked on two main things:

1. I built optimized dashboards containing charts that helped the company visualize and analyse its data and outputs.

2. I set up Apache Ozone and built an API to upload files to it and retrieve files from it. Apart from these projects, I also picked up a bit of web development, e-commerce, and various machine learning algorithms while working alongside my mentors and peers. I describe each project briefly below.

Project 1: Best Practices in Apache Superset — An Open Source Data Visualization Tool

To build these visualizations, I had to learn Apache Superset, a platform for displaying charts of many kinds: bar graphs, pie charts, line charts, and so on. But first, to install it, I had to learn the basics of Docker. Docker is a tool designed to make it easier to create, deploy, and run applications using containers. A container packages an application together with everything it needs to run (dependencies, libraries, and so on). I won't explain Docker in detail here, but you can think of a container as a lighter-weight alternative to a Virtual Machine (VM): containers share the host's kernel instead of each running a full guest OS, so they use fewer system resources. The image below highlights the difference between them:

[Image 1: Containers (Docker) vs. virtual machines]

Now, on to Superset. Superset is one hell of a piece of software, with about as many applications as it has limitations: you can show a large number of charts on your dashboard, but not always the way you want them :D. After installing it, you can access it at localhost:8088. A basic Superset dashboard looks like this:

[Image 2: A basic Superset dashboard]

So where does my work come in? The dashboard you see renders data from SQL tables on the local system (or on a server the system is connected to). If the queries behind that data are not optimized, the dashboard can take so long to load that we lose patience and reload the page (and then it takes time again, and the cycle continues), which we all do. So I not only built dashboards but also optimized how quickly data was rendered from these tables. This taught me a lot about query optimization techniques, the use of indexes, and SQL caching mechanisms. Here are some of the guidelines I followed while optimizing the dashboards and each query (a small runnable illustration of tip 3 follows the list):

  1. Reduce the number of subqueries in the query.
  2. Avoid the DISTINCT and LIKE keywords unless they are absolutely necessary.
  3. Index all columns that appear in GROUP BY and ORDER BY clauses. Also, ensure that these indexed columns don't appear inside an aggregate function: if the indexed column is named 'hi', try your best to keep the query from wrapping 'hi' in an aggregate, since the query then runs without using the index.
  4. In joins, apply the selection before the join and projection operations, so the join runs on fewer rows and incurs a lower cost.
  5. Use operators such as IN and EXISTS appropriately. IN is usually slow, so prefer to avoid it.
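
To make tip 3 concrete, here is a minimal runnable sketch using Python's built-in sqlite3 module. The table, column names, and data are made up for illustration (our real dashboards ran against other SQL databases, where the same principle applies):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 10.0), ("south", 20.0)] * 5000)

query = "SELECT region, SUM(amount) FROM sales GROUP BY region"

# Without an index, the planner typically scans the whole table and
# builds a temporary B-tree just to group the rows.
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row)

# Index the GROUP BY column (tip 3): rows can now be read in
# region order, so the temporary sort structure is avoided.
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row)
```

On a dashboard, every chart fires a query like this, so each avoided scan or sort compounds across the whole page.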

One obstacle in this work was that my laptop would hang and stop responding because of the sheer amount of data these dashboards rendered. For example, one dashboard was originally supposed to display about 150,000 images, each retrieved from a webpage as the dashboard loaded. Since displaying all of them was quite unnecessary (come on, most people won't look at all 150,000 images), we decided to reduce the number of images displayed. Also, wherever the data was fixed for a decently long period of time (or was historical data), I cached the outputs of the queries used, which made rendering faster; a sketch of what such a cache configuration can look like in Superset follows.
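
Superset handles query-result caching through flask-caching, configured via the CACHE_CONFIG entry in superset_config.py. A minimal sketch, assuming a local Redis instance; the timeout and key prefix here are illustrative values, not the ones we used:

```python
# superset_config.py -- read by Superset at startup.
# Assumes a Redis server at localhost:6379; adjust to your setup.
CACHE_CONFIG = {
    "CACHE_TYPE": "redis",                   # flask-caching backend
    "CACHE_DEFAULT_TIMEOUT": 60 * 60 * 24,   # keep results for a day
    "CACHE_KEY_PREFIX": "superset_",
    "CACHE_REDIS_URL": "redis://localhost:6379/0",
}
```

With this in place, repeated loads of a chart over fixed or historical data are served from the cache instead of re-running the query.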

Throughout the internship, I made many dashboards which I optimized using these techniques.

Project 2: Apache Ozone for Object Storage

In this second project, I dealt with a relatively new platform for Big Data storage named Apache Ozone. Datasets are central to data science, and they need to be precise and complete enough to train mathematical and machine learning (ML) models well, so large datasets are the norm. We all know the Hadoop Distributed File System (HDFS) for storing large files as blocks spread across many Datanodes, with a Namenode keeping track of where each block lives. These blocks can then be processed in parallel and the results combined, which makes it possible to process large amounts of data quickly. Pretty neat, isn't it! A picture should make the HDFS architecture clear.

[Image 4: HDFS architecture]

For more information on HDFS, look at my friend Pratik Borikar’s blog to get a bigger picture.

Now, why Ozone? HDFS works well with a large number of big files, but terribly with small files at scale: every file, directory, and block consumes Namenode memory (roughly 150 bytes each, per the Cloudera post in the references), so millions of small files exhaust the namespace server long before the disks fill up. In fact, I have heard that people in the Hadoop community get so infuriated if you use HDFS on small files that they will shoot you if it is brought to their notice (if you do commit this crime, now you know whom to hide from :D). To overcome this problem, Apache introduced Ozone, which was designed to run concurrently with HDFS (although it can also run independently). Ozone gets around HDFS's limitation by scaling the namespace server and the block management layer independently. The block management side is handled by the Hadoop Distributed Data Store (HDDS), a highly available, replicated block storage layer that acts as an abstraction over the Datanodes. Its core component is the Storage Container Manager (SCM), which manages the block and container information (like the blocks in HDFS), while the Ozone Manager (OM) takes care of the Ozone namespace. In short, the OM is Ozone's metadata manager.

Once the Datanodes are set up, files can be deployed onto the cluster and scalability is ensured. OzoneFS is the name given to Ozone's Hadoop-compatible file system interface. The picture below explains this better.

[Image 5: Apache Ozone architecture]

To be honest, the greatest problem I had with Ozone was downloading it (ironic, yes!). Apache Ozone, being very new, has very few resources available for reference apart from the official Apache documentation. So far we haven't been able to run Ozone concurrently with HDFS (hopefully we will). You can find download instructions and other information at this link: https://hadoop.apache.org/ozone/docs .

Finally, we built an API to upload files to and retrieve files from the Ozone cluster (the details are in the link above, under the Java API section), so that the files could be processed by the machine learning models they were meant to train. A rough sketch of the same idea is shown below.
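
Our actual code used the Java API from that documentation, so the sketch below is not our implementation; it illustrates the same upload/retrieve idea through Ozone's S3-compatible gateway using boto3. The endpoint, credentials, and bucket name are all placeholders for illustration:

```python
import boto3

# Ozone ships an S3-compatible gateway (default port 9878), so a
# standard S3 client can talk to the cluster. The values below are
# assumptions; adjust them to your deployment.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9878",
    aws_access_key_id="testuser",
    aws_secret_access_key="testsecret",
)

s3.create_bucket(Bucket="training-data")

# Upload a local file as an object, then fetch it back.
s3.upload_file("cat.jpg", "training-data", "images/cat.jpg")
s3.download_file("training-data", "images/cat.jpg", "/tmp/cat.jpg")
```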

Apart from the above, I learnt a lot about open-source contributions (while trying to understand both Superset and Apache Ozone) and about various machine learning algorithms, some as basic as k-means and k-nearest neighbours, and some a bit more complex, such as LDA (this was while working with Rahul Shevade, another peer of mine, whose blog you can also check out).

All in all, I would like to thank Couture AI and my university, BITS Pilani, for giving me this wonderful opportunity to experience the corporate world and gain more domain knowledge.

Some references are listed below; refer to them if you'd like to read further:

  1. https://www.comparitech.com/net-admin/docker-vs-virtual-machines/
  2. Image 1: https://www.docker.com/resources/what-container
  3. https://superset.incubator.apache.org/tutorial.html
  4. Image 2: Google Images
  5. Sites used for optimization techniques:
     i. https://www.ibm.com/support/knowledgecenter/en/SSZLC2_9.0.0/com.ibm.commerce.developer.doc/refs/rsdperformanceworkspaces.htm
     ii. https://beginner-sql-tutorial.com/sql-query-tuning.htm
  6. https://blog.cloudera.com/blog/2009/02/the-small-files-problem/
  7. Image 4: Google Images
  8. Image 5: Google Images
  9. https://hadoop.apache.org/ozone/docs
