My first experiment with Azure ML Studio, Part 1

Sambhu Surya Mohan
Apr 2, 2018

I recently went through a full lecture series on Azure ML Studio and wanted to see how well I could handle the tool. I had done a data analytics project when my company was still active: an analysis of a supermarket's retail data. The data used here is not the real data; I modified it a bit, but it still resembles the real thing. I wanted to see how quickly the work could be replicated in Azure ML Studio by someone with no prior knowledge of it.

At first look it was fast, elegant and stable. Part of that impression came from the lecture, which was good enough to make it seem like one of the easiest tools available (you should go through the course Data Science Essentials). But getting down to work showed me it is not so easy that anyone can become a full-fledged data scientist with it right away. I had to play around with it for some time before I could really handle it.

Essentially it is just drag and drop, which is really simple. But to drag and drop, we need to know what to drag. Some techniques have to be followed strictly to make the process behave the way we want. There were some tasks that I was not able to do right away.

  1. I wasn’t able to find a component that filters out rows where a value is 0, which affected my division. I had to filter out the rows in which the cost rate is 0 (a typing error, or maybe the item comes free with something else). To find the profit percentage of individual products, I can’t divide by 0. I worked around it with two components, “Clip Values” and “Clean Missing Data”.
  2. I wasn’t able to find a component that selects rows by comparing their keys against another set of data. Taking the same example as above, I was able to filter the rows based on a rule, but I didn’t find a component to merge the result back.
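For comparison, the key-based selection from item 2 is short to express in pandas, for example inside a Python script step. This is only a sketch under assumptions: the datasets and the column names (`item_id`, `rate`, `cost_rate`, `on_hand`) are hypothetical and are not taken from my experiment.

```python
import pandas as pd

# Hypothetical sales lines; all column names are assumptions for illustration.
sales = pd.DataFrame({
    "item_id": [101, 102, 103, 104],
    "rate": [12.0, 5.0, 30.0, 8.0],
    "cost_rate": [10.0, 0.0, 24.0, 0.0],
})

# Filter by a rule first: collect the keys of rows with a non-zero cost rate.
valid_keys = sales.loc[sales["cost_rate"] != 0, "item_id"]

# A second, hypothetical dataset that shares the same key column.
stock = pd.DataFrame({
    "item_id": [101, 102, 103, 105],
    "on_hand": [40, 15, 7, 22],
})

# Keep only the rows of the second dataset whose key survived the filter.
selected = stock[stock["item_id"].isin(valid_keys)]
print(selected)
```

`isin` keeps the columns of the second dataset untouched, which is why I would pick it here over a full merge when only row selection is needed.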

Another problem was that I couldn’t see the scatter plot in the data visualization. I thought it might be because of the subscription, since I was using the free version. The real reason was that the Azure ML Studio visualization could not handle data of that size. I had around 700k rows, and even after normalization the data wasn’t plotted. Once the number of rows dropped below a certain level, the scatter plot showed up automatically.
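One way around a row limit like this is to plot a random sample instead of the full data. The sketch below assumes a limit of 10,000 points, which is a number I made up for illustration (I don’t know the actual threshold), and uses synthetic data with hypothetical column names.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a ~700k row dataset; column names are assumptions.
rng = np.random.default_rng(0)
n_rows = 700_000
data = pd.DataFrame({
    "rate": rng.uniform(1, 100, n_rows),
    "cost_rate": rng.uniform(1, 80, n_rows),
})

MAX_POINTS = 10_000  # assumed plotting budget, not a documented limit


def sample_for_plot(df, max_points=MAX_POINTS, seed=42):
    """Return df unchanged if it is small enough, else a random sample of rows."""
    if len(df) <= max_points:
        return df
    return df.sample(n=max_points, random_state=seed)


plot_data = sample_for_plot(data)
print(len(data), "rows reduced to", len(plot_data))
```

With matplotlib installed, `plot_data.plot.scatter(x="cost_rate", y="rate")` would then draw the sampled points.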

One more thing I noticed was that Python execution in Azure ML Studio is a little slow compared to running the same code locally, while the built-in components are fast. So where possible, it is advisable to rely on the built-in components as much as you can.
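When a Python step is unavoidable, keeping the work vectorized should at least avoid adding slow row-by-row loops on top of the module overhead. The sketch below follows the `azureml_main` entry point that the Execute Python Script component expects in the classic Studio; the column names and the `line_total` calculation are hypothetical.

```python
import pandas as pd


def azureml_main(dataframe1=None, dataframe2=None):
    """Entry point shape used by the Execute Python Script component."""
    out = dataframe1.copy()
    # Vectorized column arithmetic: one operation over the whole column
    # instead of a Python-level loop over rows.
    out["line_total"] = out["rate"] * out["quantity"]
    # The component expects a sequence of data frames, hence the trailing comma.
    return out,


# Local check with a tiny hypothetical frame.
sample = pd.DataFrame({"rate": [12.0, 0.5], "quantity": [1, 500]})
(result,) = azureml_main(sample)
print(result)
```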

I haven’t gone very far with Azure ML Studio yet. So far I have only finished the pre-processing of my data.

The figure shows the high-level diagram of my operations. The step-by-step pre-processing is as follows:

  1. Remove rows with a cost rate of 0 by first resetting the zero values to missing values and then cleaning the missing values from the data.
  2. Use “Apply Math Operation” to subtract Cost rate from Rate, which gives the profit for a single product, and add it to the data as a column.
  3. Use “Apply Math Operation” to find the profit percentage by dividing Rate by Cost rate.
  4. Normalize all the data except the quantity. The quantity fluctuates drastically because the products are mixed: paper may go up to 500, while a product such as a pen may be just 1. So for now I am keeping the quantity intact and normalizing the rest of the data.
  5. The last 3 Python executions define 3 flows for moving forward. I will be grouping the data in three ways: by item, by bill id, and by date (I may split the date further).
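For readers who think in code, the same five steps can be sketched roughly in pandas. This is not the experiment itself, only an approximation under assumptions: the sample rows and column names are hypothetical, profit is taken as rate minus cost rate, and min-max scaling stands in for whichever normalization method the Studio component is configured with.

```python
import numpy as np
import pandas as pd

# Hypothetical bill lines; every name and value here is made up.
df = pd.DataFrame({
    "bill_id": [1, 1, 2, 2, 3],
    "date": pd.to_datetime(["2018-03-01", "2018-03-01", "2018-03-01",
                            "2018-03-02", "2018-03-02"]),
    "item": ["paper", "pen", "pen", "paper", "ink"],
    "quantity": [500, 1, 2, 250, 1],
    "rate": [0.5, 12.0, 12.0, 0.5, 40.0],
    "cost_rate": [0.4, 10.0, 10.0, 0.0, 32.0],
})

# Step 1: turn zero cost rates into missing values, then drop those rows.
df["cost_rate"] = df["cost_rate"].replace(0, np.nan)
df = df.dropna(subset=["cost_rate"]).copy()

# Step 2: profit for a single product (rate minus cost rate).
df["profit"] = df["rate"] - df["cost_rate"]

# Step 3: rate divided by cost rate, as in the step list. Strictly this is a
# price-to-cost ratio; profit / cost_rate * 100 would be the percentage form.
df["profit_pct"] = df["rate"] / df["cost_rate"]

# Step 4: min-max normalize everything numeric except quantity (assumed method).
scaled = df.copy()
for col in ["rate", "cost_rate", "profit", "profit_pct"]:
    lo, hi = scaled[col].min(), scaled[col].max()
    scaled[col] = (scaled[col] - lo) / (hi - lo) if hi > lo else 0.0

# Step 5: three groupings to carry forward.
by_item = scaled.groupby("item")
by_bill = scaled.groupby("bill_id")
by_date = scaled.groupby("date")

print(by_item["quantity"].sum())
```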

I will pick up the work with this data at a later stage and add it as the second part of this series. Thank you.

Part 2 of the experiment is available.
