Data usage monitoring using Pandas and Bokeh libraries

Published in

Wenable

9 min readJun 8, 2020

Introduction:

In this blog post, I lay out my experience of using data analysis and visualization processes using the python libraries pandas and Bokeh to answer some business use cases here at Wenable. Our customers, looking for Enterprise Mobility Management(EMM) solutions for their organization look towards WeGuard, which acts as a one-stop answer for all the EMM needs.

Firstly, the customers sign-up into the WeGuard portal, and each customer is assigned a unique activation code. Following that, they register all the devices which are managed under their profile. Devices have a unique device-ID/IMEI number and are classified based on the policies, namely kiosk mode, work-managed, etc.

For example, several restaurants have a digital tablet on each table so their customers can place an order and pay the bill. Here the restaurants register their devices under kiosk mode and customize the apps they want their customers to see like a digital menu, payment app, etc. Besides, their customers cannot use any other apps like a browser, social media sites, etc. In this way, they can manage their devices and avoid any misuse of them.

Another example would be a transportation company that gives the registered devices to their employees, especially truck drivers. The company can monitor all its devices from a single platform. Some of the platform’s functionalities include geo-tracking the devices, installing/uninstalling the apps, enabling network policies to restrict connecting with untrusted public networks, allowing password policies to secure data, etc. Devices’ data is sent to the server frequently via APIs and stored in a MongoDB database. Here, the information includes the date, device id, apps, data usage, location coordinates, etc.

The idea here is to monitor the data usage of the devices(primarily mobiles) of our customers and to deploy this at the backend where the customers can check the stats. Eventually, the next step would be to notify them when the data usage of a particular device goes above the specified limit to avoid excessive billing. Other analysis includes: To visualize the top 5 apps across all devices of a customer, To visualize the top 3 apps which consume the most data per device, etc.

Note: This is not a tutorial. It’s more of documenting and sharing the results and observations of my project.

Technologies used:

PyMongo — MongoDB is one of the most popular NoSQL databases.MongoDB stores data in JSON-like documents, which makes the database very flexible and scalable. To access the database using python, we would need pymongo, which is a python based MongoDB driver.

pandas — It is a high-performance data analysis and manipulation tool, built on the numpy package. It is mainly used for structured and time-series data. Dataframes are a vital data structure of pandas, which allows you to store, slice, and dice the tabular data in rows of observations and columns of different fields.

Bokeh — It is an interactive visualization library for modern web browsers. It provides elegant, concise construction of versatile graphics, and affords high-performance interactivity over large or streaming datasets.

Preparing the Data:

Before starting, I have created a conda virtual environment and installed all the dependencies in it using pip. Firstly, We import data from MongoDB, which runs locally. For this, we need to ensure that MongoDB has been installed locally, and then connect the database using a pymongo connector.

Fig 1. Snapshot of how a single document looks in this MongoDB collection

Loading the data from the required collection:

client = MongoClient('mongodb://localhost')
db = client['weguard']
dataUsage = db.deviceDataUsage

Since I consider only the app data usage for analysis, I only need the documents which contain the “appdatausagemodels” field and then extract the required subfields.

d1 =[ ]for usage in dataUsage.find({'appDataUsageModels':{'$exists':True,'$ne' : None}},{"_id": 0,"activationCode":1, "deviceId":1, "deviceDU": 1, "wegDU": 1,'appDataUsageModels':1,'deviceDUDay':1}):d1.append(usage)

The result of the above code is we get all the documents containing the field ‘appDataUsageModels’ in the form of a list of dictionaries nested within a list of dictionaries:

Now from the above, I need only a few fields from each document, namely deviceDUDay, activationCode, appID, appName, appBGDU, appFGDU, deviceId, and have it into a single list of dictionaries where each dictionary corresponds to a single record.

Hence we process the subfields and extract them into a list of dictionaries and load them into a pandas data frame to do further processing. I know using 2 for loops gives O(n²) efficiency, but since the data is not significant (hardly in GB), performance doesn’t matter much. We can use the JSON module of python to pre-process the data.

device = []
for i,v in enumerate(d1):
    try:
        for j in d1[i]['appDataUsageModels']:
            app = {}
            app['deviceDUDay'] = d1[i]['deviceDUDay']
            app['activationCode'] = d1[i]['activationCode']
            app['appID'] = j['appID']
            try:
                app['appName'] = j['appName']
            except KeyError:
                app['appName'] = ' '.join(j['appID'].split('.')[1:])
            app['appBGDU'] = j['appBGDU']
            app['appFGDU'] = j['appFGDU']
            app['deviceId'] = d1[i]['deviceId']
            device.append(app)
    except KeyError:
        Continue

The result of the above code is a list of dictionaries with the required key-value pairs as shown below. We then pass this list to a pandas dataframe.

df = pd.DataFrame(device)
df['deviceDUDay'] =  pd.to_datetime(df['deviceDUDay'])
df.set_index('deviceDUDay',inplace=True)
df.dropna(inplace=True)

The data is collected every day, We index the DataFrame on the date, i.e., on the ‘deviceDUDay’ field. Now we have the data loaded in the DataFrame ready for analysis. Below in Fig 2, we can see how it looks.

Fig2. Final Dataframe with the required fields fetched from the database

Above, the activation code is unique to each customer, appBGDU, appFGDU are the data consumed(in bytes) by apps in the device, deviceId is a unique identifier for each device.

Business Use Cases:

Using this data, we can answer several questions.

→ How much data does each device consume?

We can answer this question by applying grouping and sorting functions of pandas by using the below piece of code:

df_com = df.groupby(['activationCode','deviceId']).sum()/1000000000
df_com.dropna(inplace=True)
df_com['totalDataUsage'] = df_com['appBGDU']+df_com['appFGDU']
df_com = round(df_com.drop(columns =['appBGDU', 'appFGDU']),2)
df_com = df_com.sort_values(['totalDataUsage'])
df_com.reset_index(inplace=True)

Data frame showing data consumed by all devices in sorted order.

Fig 3. DataFrame which shows data consumed(GB) by each device

Now I want to show this as an interactive visualization. I chose the Bokeh Visualization library for this purpose. The alternative was plotly, but I decided Bokeh as I felt it integrates well with pandas by using the ColumnDataSource (It is an object where the data of a Bokeh graph is stored). It is a web-focused tool for creating interactive plots.

The python code below creates a figure and a horizontal bar graph, which shows the data usage of all the devices for a specific customer. Here the y-axis has the device numbers as shown in fig 4.

code = 'EQVAB'    # default company code
source_com = ColumnDataSource(df_com.loc[(df_com.activationCode == code)])
devices = source_com.data['deviceId'].tolist()     # list of devicesp = figure(y_range=(devices),x_axis_label ='Data Used(GB)', y_axis_label ='Devices',
           plot_height=1000, plot_width = 950,
           sizing_mode='scale_height' , title="Device Data Usage",
           toolbar_location=None, tools="")
           
p.hbar(y='deviceId', right='totalDataUsage', height=0.5, color='orange', source = source_com)

Fig 4. Bar Graph showing data Used by all devices of a customer

But we want interactivity where we can show devices filtered by customers and also for required date ranges. Hence we use the callback functions provided by Bokeh.

Firstly below, we get the list of customers and create a selection widget and start date — end date widgets. For dates, we use the DateTime module of python.

# list of customers
codes = list(df_com.activationCode.unique())# Selection widget
menu = [(x,x) for x in codes]
select = Select(title="Activation Code", value=code, options=menu)# Date Widget
datepicker = DatePicker(title="Start Date", min_date=datetime(2019,10,16), max_date=datetime(datetime.now().year,datetime.now().month,datetime.now().day)-timedelta(days=2), value=datetime(datetime.now().year,datetime.now().month-1, datetime.now().day))datepicker1 = DatePicker(title="End Date", min_date=datetime(2019,10,16), max_date=datetime(datetime.now().year,datetime.now().month,datetime.now().day)-timedelta(days=2),
value=datetime(datetime.now().year,datetime.now().month, datetime.now().day)-timedelta(days=2))

Now It’s time to use the callback function for interactivity. When the user selects the date range and the customer code from the menu widgets, the original data frame gets filtered based on the user selection and then fed to the ColumnDataSource of Bokeh, which shows the updated plots based on the user selection.

# Callback Function
 def callback(attr, old, new):
    d = datepicker.value
    d1 = datepicker1.value
    code = select.value
    
    a1 = d.strftime("%m/%d/%Y")
    a2 = d1.strftime("%m-%d-%Y")
    
    dfo1 = df[a1: a2]
    df_com = dfo1.groupby(['activationCode','deviceId']).sum()/1000000000
    df_com['totalDataUsage'] = df_com['appBGDU']+df_com['appFGDU']
    df_com = round(df_com.drop(columns =['appBGDU', 'appFGDU']),2)
    df_com = df_com.sort_values(['totalDataUsage'])
    df_com.reset_index(inplace=True)
    
    df_com = df_com.loc[df_com.activationCode == code]
    source_com.data = ColumnDataSource.from_df(df_com)
    devices = source_com.data['deviceId'].tolist()
    p.y_range.factors = devices    # Gives info about the device when we hover over it using cursor
    hover = HoverTool()
    hover.tooltips = [("Company Code", "@activationCode"),
                      ("Device ID", "@deviceId"),
                      ("Data", "@totalDataUsage{1.11} GB")]
    hover.mode = 'hline'
    p.add_tools(hover)
 
# Whenever the value changes, the callback function is triggered and thus gives the interactivity
select.on_change('value', callback)
datepicker.on_change('value', callback)
datepicker1.on_change('value', callback)layout = column(widgetbox(row(select), row(datepicker, datepicker1)), p)
curdoc().add_root(layout)

→ What are the top 5 applications which consume the most data per company?

We answer this question by applying grouping and sorting pandas functions by using the below piece of code. The code would look similar to the above one. The main difference is we use the nlargest() method to find the top 5 data usage numbers for each activation code, i.e., each customer.

dft = df.groupby(['activationCode','appName']).sum()/1000000
dft = dft.reset_index()
dft['totalDataUsage'] = dft['appBGDU']+dft['appFGDU']
dft = round(dft.drop(columns =['appBGDU', 'appFGDU']),2)
dft = dft.groupby(['activationCode','appName'])['totalDataUsage'].sum()
dft = dft.groupby(level='activationCode').nlargest(5).reset_index(level=0, drop=True)
dft = dft.to_frame()
dft.reset_index(inplace = True)
dft['totalDataUsage'] = round(dft['totalDataUsage']/1000,2)

Fig 5. DataFrame which shows Top 5 app data consumption per customer

We also find the stats of Weguard App data usage, which is a flagship Mobile Device Management solution from Wenable.

Now again, we want to show this as an interactive visualization using Bokeh.

Below python code creates a figure and a vertical bar graph, which shows the data usage of the top 5 data consuming apps across all the devices for a particular customer, where we filter the data frame based on that specific customer code.

code5 = 'EQVAB'
source_t5 = ColumnDataSource(dft.loc[dft.activationCode == code5])
app = source_t5.data['appName'].tolist()p5 = figure(x_range=(app),x_axis_label ='Apps', y_axis_label ='Data Used(GB)',
           plot_height=500, plot_width = 1000,
           sizing_mode='fixed' , title="Top 5 Apps Data Usage",
           toolbar_location=None, tools="")
p5.vbar(x='appName', top='totalDataUsage', width=0.9, color='orange', source = source_t5)

Fig 6. Top 5 Apps DataUsage per customer

Again as previously done, we want interactivity where we can show devices filtered by customers and also for required date ranges by using the callback functions provided by Bokeh.

Firstly below, we get the list of customers and create a selection, start date — end date widgets. For dates, we use the python’s DateTime module.

# list of customers
codes5 = list(df_com.activationCode.unique())# Selection widget
menu5 = [(x,x) for x in codes5]
select5 = Select(title="Activation Code",value=code5, options=menu5)# Date Widget
datepicker5 = DatePicker(title="Start Date", min_date=datetime(2019,10,16), max_date=datetime(datetime.now().year,datetime.now().month,datetime.now().day)-timedelta(days=2), value=datetime(datetime.now().year,datetime.now().month-1, datetime.now().day))datepicker55 = DatePicker(title="End Date", min_date=datetime(2019,10,16), max_date=datetime(datetime.now().year,datetime.now().month,datetime.now().day)-timedelta(days=2),
value=datetime(datetime.now().year,datetime.now().month, datetime.now().day)-timedelta(days=2))

# Callback Function
def callback(attr, old, new):
    d = datepicker5.value
    d1 = datepicker55.value
    code5 = select5.value
    
    a1 = d.strftime("%m/%d/%Y")
    a2 = d1.strftime("%m-%d-%Y")
    
    dft = df[a1: a2]
    dft = dft.groupby(['activationCode','appName']).sum()/1000000
    dft = dft.reset_index()
    dft['totalDataUsage'] = dft['appBGDU']+dft['appFGDU']
    dft = round(dft.drop(columns =['appBGDU', 'appFGDU']),2)
    dft = dft.groupby(['activationCode','appName'])['totalDataUsage'].sum()
    dft = dft.groupby(level='activationCode').nlargest(5).reset_index(level=0, drop=True)
    dft = dft.to_frame()
    dft.reset_index(inplace = True)
    dft['totalDataUsage'] = round(dft['totalDataUsage']/1000,2)
    
    dft = dft.loc[dft.activationCode == code5]
    source_t5.data = ColumnDataSource.from_df(dft)
    app = source_t5.data['appName'].tolist()
    p5.x_range.factors = app
 
# Gives info about the device when we hover over it using cursor
    hover = HoverTool()
    hover.tooltips = [("Company Code", "@activationCode"),
                      ("Data", "@totalDataUsage{1.11} GB")]
    hover.mode = 'vline'
    p5.add_tools(hover)# Whenever the value changes, the callback function is triggered and thus gives the interactivity
select5.on_change('value', callback)
datepicker5.on_change('value', callback)
datepicker55.on_change('value', callback)
 
layout = column(widgetbox(row(select5), row(datepicker5, datepicker55)), p5)
curdoc().add_root(layout)

I will conclude my article by thanking you for reading it, which is my first one. In no way, I am an expert python programmer, but I hope I sparked some interest in trying out pandas and Bokeh for your projects. I look forward to your suggestions, comments, questions :)

The article was first published here.

Data usage monitoring using Pandas and Bokeh libraries

Introduction:

Technologies used:

Preparing the Data:

Business Use Cases:

Written by Akshay Kumar Bung