THE HIGHEST STANDARD IN AI — AI GUILD #DATACAREER

The Foundation of Your Career is Data, Not Algorithms.

The Golden Age: Data Career in the 2020s (No 2).

Chris Armbruster
5 min read · Jan 24, 2023

For a while, founders and investors were betting on the algorithms, a bet they did not win. By 2020, the notion of a Data-centric AI was widespread. The prominence of the large language and image models shows data as the critical foundation.

Yet, the Internet is not your oyster, at least not until you understand that, beyond seemingly infinite data availability, you also need to address quality and accessibility. The success of ChatGPT relies on OpenAI investing in de-biasing and de-risking the data so that racist, sexist, and other discriminatory content surfaces less readily. And in January 2023, if you signed up for ChatGPT, you likely got the message, “ChatGPT is at capacity right now.” Access is vital, and you cannot take it for granted.

Data Availability, Data Quality, and Data Access

As I think about data careers, I suggest that at your current or next company, the three data issues that you need to investigate are

  1. Availability: What type and quantity of data are available, and what amount is stuck in silos and legacy technology?
  2. Quality: How incomplete and biased is the data, and what strategy is in place to ensure that reproducible and reliable pipelines feed your use cases?
  3. Access: Is the data accessible to you and all the relevant people so that use cases stay in production?

“We have no data.”

Since 2017, I have supported talent in landing a first or second data role. I emphasize that a career start is more successful if you know the type of company (e.g., startup, corporate, consultancy) and the domain or industry you want to join. That way, candidates can search purposefully and prepare for interviews.

Nevertheless, sometimes I get that phone call opening with “We have no data.”

Why is data availability a problem?

Companies always have data, typically lots of data. However, the data may not be suitable for analytics and machine learning. Some typical issues are

  • The company is not data literate, and it fails at ETL tasks.
  • The company deploys legacy technology and is not investing in data migration.
  • Data is siloed, and often the owners of the silos are not collaborating, so there is no shared data infrastructure.

This is a problematic situation. You must find out whether the company is willing and able to remedy it within the next twelve months. If yes, then taking a leading role is rewarding because

  • In designing and executing ETL and related tasks, you ensure that automated pipelines emerge that serve as many use cases as possible.
  • By leading on data migration, you get to explore the types of data most relevant to the company and blueprint a data infrastructure that lasts.
  • If you can get the authority and resources to overcome or break down the data silos, you will be the avant-garde defining how the company makes and saves money with data.

“Our pipelines are refreshed manually.”

I realize that even Internet companies might struggle with data analytics and machine learning.

Sometimes I get a call from a practitioner frustrated by failing infrastructure, data trash, and lack of quality management: “Our pipelines are refreshed manually.”

You can recognize the absence of a scalable infrastructure and data quality management, e.g.,

  • Data analytics is mainly done ad hoc, and decision-makers treat the analytics as just another option or opinion.
  • Data pipelines are singular and monitored manually, possibly by junior staff.
  • Models don’t go to production because of data quality issues.
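
As a contrast to manual monitoring, here is a minimal sketch of an automated freshness check. It assumes you can obtain a last-refresh timestamp per pipeline; the pipeline names, timestamps, and the 24-hour threshold are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone


def stale_pipelines(last_refreshed, max_age, now=None):
    """Return the sorted names of pipelines whose last refresh is older than max_age."""
    now = now or datetime.now(timezone.utc)
    return sorted(
        name for name, refreshed_at in last_refreshed.items()
        if now - refreshed_at > max_age
    )


# Hypothetical pipelines and refresh times (assumptions, not from a real system).
now = datetime(2023, 1, 24, 12, 0, tzinfo=timezone.utc)
last_refreshed = {
    "sales_daily": now - timedelta(hours=3),
    "crm_export": now - timedelta(days=2),
    "web_events": now - timedelta(hours=30),
}

# Flag everything that has not been refreshed within the last 24 hours.
stale = stale_pipelines(last_refreshed, max_age=timedelta(hours=24), now=now)
print(stale)
```

In practice, a scheduler would run a check like this and raise an alert, so that nobody has to watch the pipelines by hand.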

Quality data requires investment. Here are some questions you can ask to ascertain if a company has the will and the resources to build a scalable foundation for productionizing use cases.

  1. How is the data infrastructure designed to serve the development of data assets? What data assets are identified across the company? Which public or external assets are integrated?
  2. Are the data assets recognized as a verifiable source of truth for the company? Does the company have a strategy for achieving data literacy?
  3. How does the company manage data quality and curate or de-bias the data?
  4. Does the company have a scalable data infrastructure? If not, what is the strategy?
  5. Does the company have serviceable data in pipelines for various use cases? If not, what is the strategy?

(Figure: Data Quality Management)
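
To make the questions about data quality more concrete, the following is a minimal sketch of two basic checks: how complete each field is and how skewed a categorical field is. The records and field names are invented for illustration, and a real setup would likely use a dedicated data quality tool rather than hand-written functions.

```python
from collections import Counter

MISSING = (None, "")  # values treated as missing in this sketch


def completeness(rows, fields):
    """Share of non-missing values per field, between 0.0 and 1.0."""
    total = len(rows)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for row in rows if row.get(field) not in MISSING) / total
        for field in fields
    }


def group_shares(rows, field):
    """Relative frequency of each value of a field, as a first look at skew."""
    counts = Counter(
        row.get(field) for row in rows if row.get(field) not in MISSING
    )
    n = sum(counts.values())
    return {value: count / n for value, count in counts.items()} if n else {}


# Hypothetical records, invented for illustration only.
records = [
    {"customer_id": 1, "region": "north", "revenue": 120.0},
    {"customer_id": 2, "region": "north", "revenue": None},
    {"customer_id": 3, "region": "south", "revenue": 80.0},
    {"customer_id": 4, "region": "north", "revenue": 95.5},
]

report = completeness(records, ["customer_id", "region", "revenue"])
shares = group_shares(records, "region")
print(report)
print(shares)
```

A low completeness score or a heavily lopsided share is not proof of a problem, but it is a useful prompt for asking how the company curates and de-biases its data.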

“I have no access to the data.”

Did you know? You can get hired by a top-tier company and not have access to the data.

I have witnessed this several times. A top-tier ML talent or expert in Europe gets hired by one of those companies that too many people name-drop to feel more important.

A few months later, I get the call: “I have no access to the data.”

What is going on?

  • The role you get hired for and the tasks you do may diverge (e.g., you end up doing simple software engineering). Companies may oversell when hiring. Also, the bigger the company, the more likely a divergence becomes, because many staff members can do similar tasks.
  • Companies govern and restrict access to data on purpose. For some reason, you are not given access.
  • Access is more likely restricted when you are not at the company headquarters.

No access to the data sucks. It isn’t easy because you can do little about it except move on.

Building your portfolio through data

If you read this at the start of your data career, I have a strategic suggestion: Focus on a specific type of data in an industry or domain and build your portfolio cumulatively.

Frequently I get asked by talents what they can do to break into a data career faster, e.g.,

  • Should I do more online courses? Do the certificates count?
  • Would an in-person bootcamp boost my chances? Does that give me access to employers?
  • Should I participate in online challenges or competitions? Does that help me to stand out?

You may have good reasons to seek further technical training. However, there is one thing that specialized training typically does not give you: Data domain expertise. Most training is not organized by domain or industry. Yet, I find that the one thing that helps you start a data career faster is if you have data domain expertise.

You may acquire that expertise during your studies. For example, if you combine studying agriculture with learning data analytics, or neuroscience with machine learning, then you have that data domain expertise already.

Suppose you do not have that yet. My most relevant insight is that you can build a cumulative domain portfolio through self-study if you consistently engage with data from a specific domain, e.g., machine data for predictive maintenance or financial data for forecasting.

In conclusion, data is the foundation of your career, and you need to look at data availability, quality, and access in your chosen industry and company.

The AI Guild’s 2000+ Specialists

The AI Guild is Europe’s leading practitioner community in Data Analytics, Data Engineering, Data Science, Machine Learning, Deep Learning, NLP, Computer Vision, and MLOps.

Do you want to progress to Senior, Lead, and Director?

It takes 60 days to build your competency profile. You can find out more by booking a first conversation at https://www.datacareer.eu.

The AI Guild

Chris Armbruster
Fluent in Data

Director, 2400+ Data Analytics and Machine Learning specialists | Data Leader | Keynote Speaker | Use Cases in Production