My Second Quarter in the World of Data

Musings of a Newbie in a Corporate Data Science Team

Sukriti Paul
8 min read · Jul 13, 2020
Source: Franki Chamaki, Unsplash

Not long ago, I penned an article describing key takeaways from the first quarter of my new job. As anticipated, the learnings have transformed within the span of a quarter; I have a completely new list now! Through interactions with my manager, team, and senior leaders, I’ve learned important lessons which (I believe) could benefit any newbie pursuing a Data Analytics/Data Science role. Therefore, I’ve compiled a list of learnings; if you breathe data, this might be of use!

Note: This article does not contain any company-specific or confidential information. My views do not represent those of my workplace.

Run It At One Go!

Source: Markus Spiske

One Well-Planned (Minimally Iterative) Step

Data and business use cases tend to become complex over time. As a consequence, there’s significant scope for errors and delays (even for something as simple as an outer join operation). A change in logic could add several business hours to your code execution time. Inculcate the habit of allocating a considerable amount of time to building your logic, to avoid repeated executions (I was advised to dedicate 2–3 hours to the logic before diving into any code).

  • Code optimization and logic optimization can be developed by brainstorming problems with teammates and regularly referring to existing code written by experienced team members. By no means does this imply that you should suppress the Internet Ninja in you — refer to Stack Overflow, GitHub Issues, etc.
  • While dealing with gargantuan databases, it becomes cumbersome and time-consuming to correct issues over many iterations. Be perceptive in assessing the glitches that could arise, and keep pointers handy for crucial, commonly used databases. Imagine getting an ‘Out of Memory’ error after your code has run the entire night!
  • Make note of ways to speed up implementation. For example, knowing that a vectorized count() or sum() is far faster than row-by-row iteration over a Pandas DataFrame (even apply() beats an explicit Python loop) could save hours during an analysis, given that you’re dealing with data comprising millions of entries. A quick sketch follows this list.
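Here’s a minimal sketch of that difference, using a hypothetical single-column DataFrame; on millions of rows, the vectorized version is typically orders of magnitude faster:

```python
# A minimal sketch contrasting loop-based and vectorized Pandas operations.
# The DataFrame and column names are purely illustrative.
import numpy as np
import pandas as pd

df = pd.DataFrame({"amount": np.random.rand(1_000_000)})

# Slow: an explicit Python loop (iterrows builds a Series for every row).
total = 0.0
for _, row in df.iterrows():
    total += row["amount"]

# Fast: the vectorized equivalent runs in optimized C under the hood.
total = df["amount"].sum()

# Likewise, prefer boolean masks over row-wise filtering.
large = df[df["amount"] > 0.9]
```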

Permissions & Documentation Everywhere!

Unlike academia, any incremental change in a learning model requires approvals across different verticals. Any change that is anticipated to boost model performance must be backed by data (compliance may make things more eventful 😉). Gone are the days when one could make teeny, experimental changes in the model to discover what clicked; sound research and analysis underlie any decision-making in model design. Although the room for experimentation may seem limited, it is good practice to design effective tests and invest in extensive model testing (which may include detailed reports comparing model architectures). Once the model is in production, trends like consumer behavior could take months to observe. In a way, rigorous inquiry ensures that product quality is enhanced when upgrading to challenger models.
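As a toy illustration of what “backed by data” can look like, here’s a sketch (using scikit-learn and synthetic data as stand-ins; real evaluations are far more involved) comparing a champion and a challenger model on identical cross-validation folds:

```python
# A minimal champion-vs-challenger comparison on synthetic data.
# Models, metric, and data are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5_000, random_state=42)

models = {
    "champion (logistic regression)": LogisticRegression(max_iter=1_000),
    "challenger (gradient boosting)": GradientBoostingClassifier(),
}

# The same deterministic folds for both models keep the comparison fair.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: AUC = {scores.mean():.3f} (+/- {scores.std():.3f})")
```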

Data Pipeline Breaks: The Onus Isn’t on the Tech/Dev Team Alone

As part of a Data Science team, one is equally responsible for coordinating and resolving breaks in the data pipeline. Dig a little deeper: understand the data flow; how online (implicit and explicit) data is collected from each channel; what kind of treatment or imputation is performed on raw data; how the data is stored and divided into training/testing data across several cells; how to detect anomalies in the ensemble logs; how the tech team optimizes the production code from their end; how the tech team’s tests vary from those of the DS team; and what the plausible delays in production are. A toy sanity check is sketched after the list below.

  • Documenting the data flow diagrammatically works wonders (given that the data could have a gazillion internal sources)!
  • Clean code, crisp user stories, and comprehensive model documentation ensure that the tech team uses/reproduces the correct logic in the deployment code.
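For instance, a lightweight check on each pipeline stage’s output can surface a break before bad data reaches model training. This is a hypothetical sketch (the table, columns, and thresholds are made up); real checks usually live in the orchestration layer:

```python
# A toy per-stage sanity check; columns and thresholds are illustrative.
import pandas as pd

def check_stage(df, required_cols, min_rows, max_null_share=0.2):
    """Return human-readable issues found in one pipeline stage's output."""
    issues = []
    missing = set(required_cols) - set(df.columns)
    if missing:
        issues.append(f"missing columns: {sorted(missing)}")
    if len(df) < min_rows:
        issues.append(f"row count {len(df)} below expected minimum {min_rows}")
    for col, share in df.isna().mean().items():  # null fraction per column
        if share > max_null_share:
            issues.append(f"column '{col}' is {share:.0%} null")
    return issues

# Example on a made-up clickstream extract.
clicks = pd.DataFrame({"user_id": [1, 2, None], "channel": ["web", "app", "web"]})
print(check_stage(clicks, {"user_id", "channel", "timestamp"}, min_rows=1_000))
```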

Juggling Tasks & Projects

Abrupt changes in business plans may call for an urgent analysis or a new use case, and one’s current projects could well get sidelined. For example, boosting offline retail shopping recommendations on an e-commerce website may not be the best of ideas during a global pandemic. It’s important to adapt to sudden high-priority requirements and shuffle between older long-term projects resiliently.

  • Regular note-taking helps in summarizing progress on each project (jotting down impact, work done so far, requirements, timelines, etc.).
  • You can also maintain a separate label for e-mails regarding the ‘most recent project updates’.

At times, one may abandon an analysis and revisit it later, possibly due to data scarcity, a lack of campaigns, issues that need to be resolved by external teams (e.g. a database decommission), or some behavior/metric that has to be observed over a longer duration. If you’re in the habit of working continuously on a project till completion, you might want to work on need-based task prioritization.

Your Data Should Tell a Story

Here’s a great quote from Chris Arnold (commonly known as The Data Whisperer), from his talk ‘Data Viz in a Box’ held at IISc Bangalore:

“The thing with data analysts is that most of them get so engrossed in the complexities that they fail to consider how their audience will comprehend the visual data presented. If the audience is finding it hard to understand something, it’s because the data analyst has not done a good job. It’s a case of work transfer: your viewer is doing what you should have done.” — Christopher Arnold Wells

The minute you’re tasked with a presentation, ask yourself: “Will the target audience infer fairly well without too many prompts, repetitions, or back-and-forth?” Data storytellers connect the dots to make a convincing case in support of their conclusions; the content flows spontaneously, cogently, and systematically. They don’t dive into the details of each analysis; instead, they decide which pieces of analysis help in deriving actionable insights!

  • Mock presentations with team members and your manager can help identify the gaps (before facing a larger audience).
  • Keep your eyes peeled for the decks created by senior leadership.
  • Revisit and review your content after a time gap.

Who Is Your Audience?

Source: Teemu Paananen (Unsplash)

Have you come across data analysts harping on “knowing one’s audience”? I can’t stress enough that every unnecessary detail in your presentation comes at a cost, and that cost is the viewer’s attention. Remember to present only what is relevant to the audience. Over time, data findings may have to be presented across different levels of leadership. My manager elaborated on how she and I are much closer to the data; we can delve into the nitty-gritty of the analysis. However, presenting these findings to a director or VP comes with a level of abstraction: senior leadership will most likely be interested in the insights/highlights/overviews rather than in data-related nuances! Undoubtedly, learning the right level of detail or abstraction comes with experience in the organization.

Data Science teams are constantly in touch with diverse teams (including tech, marketing, product development, data compliance, etc). In such a setting, it goes without saying that effective communication is an absolute necessity for getting one’s ideas/thoughts across. Many a time, what you convey signals your clarity of thought, logic, and execution.

“Everything is Available Because Y’all Have the Data”

Well, Is It?

Not everything comes on a silver platter. On the contrary, there’s quite a bit of figuring out to do! Occasionally (or maybe even oftentimes) there are no readily available resources regarding testing processes, documentation, metrics, algorithms, what have you. We’re expected to research improvements to existing solutions, propose new use cases, enhance existing algorithms, and even pitch ideas to the product team. Yes, we ideate and implement. For example, your team may create a proprietary correlation metric, modify a loss function, or design slightly complex derived features! Or an e-commerce firm may try to find a new way to boost certain personalization-based capabilities for a handful of loyal customers.
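To make the last example concrete, here’s a toy sketch of one hand-crafted derived feature, a recency-weighted purchase frequency per customer; the column names and decay constant are entirely hypothetical:

```python
# A hypothetical derived feature: recency-weighted purchase frequency.
# Recent, frequent buyers score highest; all values are illustrative.
import numpy as np
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "days_ago": [3, 40, 1, 7, 90],
})

# Down-weight older orders exponentially (30-day decay scale), then
# sum the weights per customer into a single feature value.
orders["recency_weight"] = np.exp(-orders["days_ago"] / 30.0)
feature = orders.groupby("customer_id")["recency_weight"].sum()
print(feature)
```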

Time Management (Project Planning Phase)

Ever underestimated the time required for a task, rather overambitiously? Or overcautiously allotted a lot of buffer days, should something go haywire? While playing with Big Data, it’s perfectly normal to face delays due to code execution, unexpected data ambiguities, processes that depend on external teams, or the need for further analysis. However, reasonable timelines are crucial in coordinating tasks across teams. For example, the marketing team of an e-commerce firm may plan a campaign for X months. During this time, the learning model has to be designed and deployed — the tech team’s UAT will depend on the model creation and validation timeline.

  • Walk your manager through a detailed timeline for the first few tasks. Ask her (1) what challenges she can anticipate from her experience, and (2) whether your buffer time is realistic.
  • Make note of all the instances when you missed a deadline due to a factor beyond your control; this will help as a future reference.
  • If the task is dependent on other employees, make sure to discuss how much time they would need for their contribution.

Learn & Discover On the Go

Learning is experiential and non-linear. After the first quarter, SQL, Python/R, and Excel become one’s bread and butter. By this time, the Data Science tech stack should be at one’s fingertips. As opposed to my experience in academia, I’ve seen that prolonged learning time may not count as productive business hours in an outcome-driven, result-oriented setting: deliverables matter as much as your learning curve.

  • A curious mindset coupled with an open-minded attitude towards learning new technologies can go a long way!
  • Be flexible in adapting to new platforms. ‘Learning’ ultimately boils down to how efficiently one can apply any learning to a project.

Source: Tim Mossholder, Unsplash.

Keep Your Eyes Open

In a recent leadership webinar, a senior Customer Data Science VP highlighted why it’s important to stay updated on how other businesses are leveraging technology to enhance their products and customer experience (ethically, without any breach of confidentiality, of course). Actively engaging with similar products outside of work has helped me appreciate the potential of the field. The sky’s the limit when it comes to innovation; you never know how a product feature could get you thinking. As a general user, if you observe a new feature on Spotify (one that collects user feedback in a unique way), there’s no harm in deriving some inspiration for explicit data collection in your own use case. 😉

Also, a shoutout to my managers, who raise the bar every single day and help bring out the best in me. ❤
