What Constitutes a Perfect Data Team?

A guide to who makes up a data team and the key role each member plays!

Kartikay Laddha
Analytics Vidhya
6 min read · Sep 14, 2020

Data science is one of the most promising fields for the near future. With recent advances in technology and statistical models, a new wave of data is knocking at our doors, ready to bring a complete revolution. It is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. As diverse as this field sounds, its team also has to be diverse enough to carry out tasks efficiently! To understand this better, let’s follow the pipeline of a data science project.

The most important part of the job comes at the very beginning: Understanding the Business Problem. In meetings with clients, a data science professional asks relevant questions, then understands and defines the objectives of the problem to be tackled. Asking many questions to understand the project better is one of the traits of a good data scientist. Next comes Data Acquisition: gathering and scraping data from multiple sources such as web servers, logs, databases, APIs and online repositories. Finding the right data takes both time and effort.

After the data is gathered comes Data Preparation, which involves data cleaning and data transformation. Data cleaning is the most time-consuming step, as it involves handling many complex scenarios such as inconsistent datatypes, misspelt attributes, and missing and duplicate values. The data is then modified in the transformation step according to mapping rules. In a project, ETL tools are used to perform complex transformations, which also helps the team understand the structure of the data.
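The cleaning scenarios above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a prescribed workflow: the records, the field names (`customer`, `age`, `signup`) and the helper functions `clean_row` and `clean` are all hypothetical, and a real project would more likely use a dataframe library or an ETL tool.

```python
from datetime import datetime

# Hypothetical raw records, as they might arrive from a scraped source.
raw_rows = [
    {"customer": " Alice ", "age": "34", "signup": "2020-01-15"},
    {"customer": "bob", "age": "", "signup": "2020-02-03"},
    {"customer": "Alice", "age": "34", "signup": "2020-01-15"},  # duplicate
    {"customer": "Carol", "age": "twenty", "signup": "2020-03-09"},
]


def clean_row(row):
    """Tidy the name, coerce types, and mark unusable values as None."""
    name = row["customer"].strip().title()
    age_text = row["age"].strip()
    # Missing or non-numeric ages become None instead of raising an error.
    age = int(age_text) if age_text.isdigit() else None
    signup = datetime.strptime(row["signup"], "%Y-%m-%d").date()
    return {"customer": name, "age": age, "signup": signup}


def clean(rows):
    """Clean every row and drop exact duplicates, keeping the first one."""
    seen = set()
    cleaned = []
    for row in rows:
        record = clean_row(row)
        key = (record["customer"], record["age"], record["signup"])
        if key not in seen:
            seen.add(key)
            cleaned.append(record)
    return cleaned


cleaned_rows = clean(raw_rows)
for record in cleaned_rows:
    print(record)
```

Note that the duplicate only becomes visible after the name has been normalised, which is one reason cleaning usually comes before de-duplication.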

It is then crucial to understand what can actually be done with the data, and this is where Exploratory Data Analysis (EDA) is applied. With the help of EDA, the team defines and selects the feature variables that will be used in model development. Next is the core activity of a data science project: Data Modelling. Various machine learning techniques, such as KNN, Naive Bayes, Decision Tree and Support Vector Machine, are applied to the data in order to identify the model that best fits the business need. Each model is trained on the training dataset and then tested to select the best-performing one. The team uses languages such as Python, R and SAS to model the data.
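The train, test and select loop described above might look roughly like the sketch below. It is an assumption-laden toy: the data is synthetic (two Gaussian clusters), the only real technique shown is KNN implemented by hand, and the "majority baseline" is added purely as a reference point. In practice a team would usually reach for a library such as scikit-learn rather than writing these by hand.

```python
import random
from collections import Counter

# Synthetic two-class data: two Gaussian clusters (purely illustrative).
rng = random.Random(0)
points = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(100)]
points += [((rng.gauss(4, 1), rng.gauss(4, 1)), 1) for _ in range(100)]
rng.shuffle(points)

# Hold out 25% of the data for testing.
split = int(len(points) * 0.75)
train, test = points[:split], points[split:]


def knn_predict(x, k):
    """Predict by majority vote among the k nearest training points."""
    nearest = sorted(
        train, key=lambda p: (p[0][0] - x[0]) ** 2 + (p[0][1] - x[1]) ** 2
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def majority_predict(x):
    """Baseline: always predict the most common training label."""
    return Counter(label for _, label in train).most_common(1)[0][0]


def accuracy(predict):
    """Fraction of held-out points whose label is predicted correctly."""
    return sum(predict(x) == y for x, y in test) / len(test)


candidates = {
    "majority baseline": majority_predict,
    "KNN (k=1)": lambda x: knn_predict(x, 1),
    "KNN (k=5)": lambda x: knn_predict(x, 5),
}
scores = {name: accuracy(model) for name, model in candidates.items()}
best_model = max(scores, key=scores.get)
print(scores)
print("selected:", best_model)
```

The important design point is that the models are compared on data they did not see during training, so the selected model is the one that generalises best rather than the one that memorises best.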

Now comes the trickiest part, Visualisation and Communication, in which the team meets the clients again to communicate the business findings simply and effectively and to convince the stakeholders. Tools such as Tableau, Power BI and QlikView help create powerful reports and dashboards. Finally, the model is deployed and maintained. The selected model is tested in a pre-production environment before it is deployed to production, and after a successful deployment the team uses dashboards and reports to get real-time analytics. The team also monitors and maintains the project’s performance. This is how a data science project is completed!

Hence, building and structuring a good team is essential to meeting the business needs of an organisation. It should not be surprising that data science isn’t a single field. It is actually three different jobs, with people working together to produce the final answers. These jobs can be broadly classified into three categories.

Data Engineer

Data Engineers control the flow of information. As information architects, they build specialised data storage systems and the infrastructure that keeps data easy to obtain and process, and they maintain access to that data. Most data engineers are very familiar with SQL, which they use to store and manage large quantities of data. They also use programming languages such as Java, Scala or Python to process data and automate data-related tasks.

Data Analyst

Data Analysts describe the present through data. They do this by creating dashboards, testing hypotheses and visualising data. They often have some background in statistics or computer science, but tend to have less engineering experience than data engineers and less maths experience than machine learning scientists. Data Analysts use spreadsheets (Excel or Google Sheets) to perform simple analysis on small quantities of data. For large-scale analysis they use SQL, the same language used by data engineers. While data engineers build and configure SQL storage solutions, data analysts use existing databases to consume and summarise data. Analysts also use Business Intelligence (BI) tools such as Tableau, Power BI or Looker to create dashboards and share their analysis.
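To make the split between building storage and consuming it concrete, here is a small sketch using Python’s built-in `sqlite3` module. The `orders` table, its columns and the figures in it are invented for illustration. The first half stands in for what a data engineer would set up, and the query in the second half is the kind of summary an analyst might run.

```python
import sqlite3

# In-memory database standing in for storage a data engineer maintains.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [
        ("North", 120.0),
        ("South", 80.0),
        ("North", 200.0),
        ("South", 50.0),
        ("East", 300.0),
    ],
)

# The analyst's query: order count and total revenue per region.
summary = conn.execute(
    """
    SELECT region, COUNT(*) AS n_orders, SUM(amount) AS revenue
    FROM orders
    GROUP BY region
    ORDER BY revenue DESC
    """
).fetchall()
conn.close()

for region, n_orders, revenue in summary:
    print(region, n_orders, revenue)
```

The same `GROUP BY` pattern carries over to most SQL databases, which is part of why SQL is shared ground between engineers and analysts.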

Machine Learning Scientist

Machine learning is perhaps the buzziest part of data science. It’s used to predict and extrapolate what is likely to be true from what we already know. Machine learning scientists use training data to classify larger, unrulier data. For example, machine learning can help estimate how much a stock may be worth next week, predict which images contain a car through image processing, or identify the sentiment expressed in a tweet through automated text analysis. Machine learning scientists use either Python or R to create predictive models. Both are great programming languages for data science, and a candidate who knows one can likely read code in the other. It is worth noting that programming languages aren’t as difficult to learn as spoken languages. If someone knows how to speak Hindi, it might take them years to learn to speak Spanish. Programming languages are more similar to power tools. If we know how to use a power drill, we may not necessarily know how to use an electric saw, but we can probably learn with a little training or help.
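As a rough illustration of using training data to classify new text, the sketch below trains a tiny Naive Bayes sentiment classifier in plain Python. Everything here is an assumption made for the example: the six training sentences are hand-written, the labels are just `positive` and `negative`, and the `classify` function is a hypothetical helper. A real sentiment model would need far more data and usually an established library.

```python
import math
from collections import Counter

# Tiny hand-written training set (hypothetical, for illustration only).
training = [
    ("i love this phone", "positive"),
    ("what a great day", "positive"),
    ("great service and friendly staff", "positive"),
    ("i hate waiting in line", "negative"),
    ("this is a terrible movie", "negative"),
    ("awful service and rude staff", "negative"),
]

# Count how often each word appears under each label.
word_counts = {"positive": Counter(), "negative": Counter()}
doc_counts = Counter()
for text, label in training:
    word_counts[label].update(text.split())
    doc_counts[label] += 1
vocabulary = {word for counts in word_counts.values() for word in counts}


def classify(text):
    """Return the label with the highest log-probability under Naive Bayes."""
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        total = sum(counts.values())
        # Log prior plus log likelihood with add-one (Laplace) smoothing.
        score = math.log(doc_counts[label] / len(training))
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / (total + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


print(classify("i love this great phone"))
print(classify("terrible and rude service"))
```

The add-one smoothing keeps a single unseen word from driving a label’s probability to zero, which matters a great deal with a training set this small.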

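To summarise: data engineers build and maintain the data infrastructure, data analysts describe the present through dashboards and analysis, and machine learning scientists build models that predict what is likely to be true.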

Once the roles are well defined for everyone in the team and an organisation has hired some data professionals, there are three main ways a data team can be structured.

Isolated

An isolated data team contains one or more kinds of data employees and sits apart from other teams such as engineering or product. This is a great structure for training new team members and for quickly changing the project each member is working on.

Embedded

Alternatively, it can be helpful to use an embedded model, where each data employee is part of a squad that also contains engineers and product managers. This model lets each data employee gain experience on a specific business project, making them a valuable expert.

Hybrid

The hybrid model is similar to the embedded model, but with an additional sync for all data employees across all squads. This additional layer of organisation allows for uniform data processes and career development, regardless of which project an employee is assigned to.

Kartikay Laddha is pursuing a Bachelor of Technology in Data Science (Business Analytics) at SVKM’s NMIMS University.