The Shift in Data Science: Skills for Today's Comprehensive Data Scientist

In the 1980s, financial firms on Wall Street recognized the exceptional problem-solving capabilities of physicists, leading to the rise of the "quant" profession. Fast forward two decades to the late 2000s, and a similar demand arose in the business sector with the onset of the big data revolution, creating a need for professionals adept in navigating large datasets for meaningful insights. This gave birth to the field of data science.

In 2018, while finalizing a PhD focused on advanced models of cancer treatment, I made the leap from academia to the corporate world and joined one of Australia's leading banks. Alongside me were seven other PhD candidates with specialties ranging from diabetes and machine learning to neuroscience and aerospace engineering. Ironically, all of us found roles in the bank's big data unit, a running joke amongst us even today.

Much like physicists thriving in finance, it appears that individuals with STEM backgrounds excel in data science roles. Thomas H. Davenport and DJ Patil highlighted this phenomenon in their influential 2012 article, "Data Scientist: The Sexiest Job of the 21st Century," noting the role's scientific requirements:

> “...the word ‘scientist’ fits this emerging role, as individuals are expected to design tools, gather data, conduct experiments, and communicate findings.”

They further indicated that companies are successfully recruiting from physical and social sciences, with many data scientists possessing PhDs in specialized fields like ecology or systems biology. However, the landscape of data science has evolved significantly since then, with new expectations and technologies emerging, leading to an ever-expanding skill set.

Keeping pace with these advancements can be overwhelming. I personally grappled with feelings of inadequacy during my PhD, and even six years into my engineering and data science career, I still confront the same challenges. (Rest assured, you’re not alone!)

In this piece, I will tackle three key questions:

  • How has data science progressed to its current state?
  • What essential skills should modern data scientists possess?
  • What does the future of analytics hold for your career?

Evolution of Data Science

The Harvard Business Review published an influential article in 2006 titled “Competing on Analytics,” authored by Thomas Davenport and Jeanne Harris, which ignited conversations about leveraging analytics for competitive business advantage. Companies invested heavily in business intelligence tools from providers like SAS, SAP, IBM, Microsoft, Tableau, Oracle, MicroStrategy, and QlikView, enabling data analysts to identify trends in historical data and make informed decisions.

However, the future of analytics lay in machine learning and predictive analytics. Organizations began investing in big data tools and the personnel capable of extracting insights from vast amounts of data. This led to the emergence of the Data Scientist (DS) role, focused on transforming businesses into data-driven entities, with the ultimate aim of basing decisions on data and providing hyper-personalized products and services.

The Emergence of the "Sexiest Profession"

The job title "data scientist" was coined in 2008 by DJ Patil and Jeff Hammerbacher, who led data efforts at LinkedIn and Facebook, respectively. Patil later became the first Chief Data Scientist of the United States in 2015, and Hammerbacher co-founded Cloudera.

In 2009, Hal Varian, Google's Chief Economist, made a prescient remark:

> “I keep saying the sexy job in the next ten years will be statisticians. People think I’m joking, but who would’ve guessed that computer engineers would’ve been the sexy job of the 1990s?”

By 2012, Patil and Davenport asserted that the unique blend of scarce, highly sought-after skills would render data scientists the most desirable professionals of the 21st century:

> “Data scientists are difficult and costly to hire, and with the competitive market for their services, they are hard to retain. The combination of their scientific background and computational skills is rare.”

Facebook's Impact on Data Science

In 2012, Facebook established its own data science team, quickly showcasing the immense value of data science when expertise and autonomy converged. Led by Cameron Marlow, a 35-year-old MIT PhD, the team of 12 researchers mined the growing social data pool, paving the way for many of the platform's future features.

The team identified user patterns that allowed Facebook to recommend potential friends and made numerous enhancements to the site through A/B testing, now a standard tool for data scientists. Remarkably, they discovered that, contrary to the popular notion of "six degrees of separation," the average user was separated from any other by fewer than five degrees, illustrating the interconnectedness of individuals on social media.

The team even devised a method to gauge a country's "gross national happiness" through sentiment analysis of citizen posts, analyzing word frequencies that indicated positive or negative emotions.
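The word-frequency technique described above can be sketched in a few lines. The word lists below are illustrative toys, not the validated lexicons (such as LIWC) that studies of this kind actually relied on:

```python
from collections import Counter

# Toy lexicons for illustration only; real sentiment studies used
# validated word lists rather than hand-picked sets like these.
POSITIVE = {"happy", "great", "love", "awesome", "good"}
NEGATIVE = {"sad", "angry", "terrible", "hate", "bad"}

def sentiment_score(posts):
    """Net rate of positive words across a collection of posts."""
    words = Counter(w for post in posts for w in post.lower().split())
    pos = sum(c for w, c in words.items() if w in POSITIVE)
    neg = sum(c for w, c in words.items() if w in NEGATIVE)
    total = sum(words.values())
    return (pos - neg) / total if total else 0.0

posts = ["I love this great day", "what an awesome good time"]
print(sentiment_score(posts))  # → 0.4
```

Aggregated over millions of posts per day, even a crude signal like this traces a national mood curve; the hard part at Facebook's scale was the counting, not the arithmetic.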

Hadoop: Making Big Data Accessible

Facebook had an advantage in rapidly experimenting with ideas on a large scale, a luxury not often afforded to academic researchers or data scientists at other firms. However, the team faced significant challenges in processing massive datasets at a granular level, which required innovative frameworks and tools from Silicon Valley's big tech sector.

Inspired by Google's papers on the Google File System and MapReduce, engineers at Yahoo played a pivotal role in developing Apache Hadoop, an open-source framework that made daunting computing tasks tractable by distributing data and computation across numerous low-cost machines. Over time, engineers from Amazon, Facebook, eBay, Google, LinkedIn, Microsoft, Twitter, and Walmart refined the Hadoop ecosystem.

When I began my journey at the bank in 2018, we were just starting to explore big data technologies, building a data lake on a Cloudera-provided Hadoop stack. Cloudera, founded in 2008 by former employees from Facebook, Google, Oracle, and Yahoo, packaged Hadoop for enterprise adoption.

Our data lake quickly filled with data from numerous engineering teams, which built pipelines to pull data from various bank sources and store it as Hive tables. Developed by Facebook engineers and open-sourced in the late 2000s, Hive allowed data analysts to query large datasets using SQL, while Spark, which reached its 1.0 release in 2014, enabled data scientists to manipulate data using Python.
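The draw of Hive was that analysts could express distributed batch jobs in plain SQL. The shape of that workflow can be shown with Python's built-in sqlite3 standing in for the Hive engine; in production the same query would run over Hive tables, for example via spark.sql in PySpark. The transactions table here is a made-up example:

```python
import sqlite3

# Stand-in for a Hive table; in production this SQL would run over
# HDFS-backed Hive tables (e.g. via spark.sql in PySpark).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer_id INT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(1, 120.0), (1, 80.0), (2, 300.0)],
)

# The analyst-facing workflow is identical: aggregate with SQL,
# no knowledge of the distributed storage underneath required.
rows = conn.execute(
    "SELECT customer_id, SUM(amount) AS total "
    "FROM transactions GROUP BY customer_id ORDER BY customer_id"
).fetchall()
print(rows)  # → [(1, 200.0), (2, 300.0)]
```

That abstraction is exactly why Hive unlocked big data for analysts: the query language stayed familiar while the execution engine changed underneath.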

The Influx of Junior Data Scientists

By 2015, technologies like Hadoop, Hive, and Spark made big data analytics feasible for organizations globally. Yet, challenges persisted:

  1. Is your company’s data prepared for data science?
  2. Does your organization know how to manage its data science talent effectively?
  3. Does your firm possess the necessary data science expertise?

The first two challenges are crucial for establishing a successful data science foundation. Without quality data and a solid analytics strategy, leaders struggle to unlock the full potential of their data scientists.

Despite these hurdles, companies began hiring data science talent aggressively, figuring out how to utilize them later due to their scarcity. This scarcity stemmed from the limited number of formal data science training programs available.

In the 1980s, universities started producing financial engineering graduates to fulfill Wall Street's demand for quants. The 1990s saw a rise in search engineers, prompting universities to adjust their computer science programs accordingly.

By the late 2010s, with the surge in student interest for analytics careers, universities scrambled to adapt their mathematics and statistics programs to include data science courses and degrees.

In 2015, I was teaching calculus, linear algebra, and statistics to math and stats majors at a prominent Australian university. A few years later, just before transitioning to industry, I found myself teaching the same subjects under the data science banner.

By 2019, educational institutions and MOOCs were producing a significant influx of data science talent, resulting in a market flooded with junior data scientists. Vicki Boykis offered practical advice during this time:

> “Don’t do what everyone else is doing, because it won’t differentiate you. You’re competing against a saturated industry, making things harder for yourself.”

She noted the disparity in job postings: about 50k data science positions compared to 500k for data engineering and 125k for data analysts.

Boykis suggested that starting in roles like junior developer or data analyst could provide a more effective entry point into data science than competing for the same limited positions.

However, formal training programs often lack the ability to impart the essential domain knowledge and stakeholder management skills necessary for success in organizational settings. As junior data scientists filled the market, the demand for experienced, senior data scientists continued to grow.

I was one of those junior data scientists, armed with technical skills but lacking industry experience. After joining my bank in 2018, I quickly immersed myself in large data projects, learning soft skills through experience and identifying gaps in my knowledge, which I addressed through online courses.

Emerging Skills — Data Engineering, MLOps & GenAI

By 2020, it became evident that data science was evolving towards a more engineering-focused approach, straddling the realms of data engineering and MLOps within the data value chain. Data scientists needed high-quality data, a critical gap that many organizations realized hindered their transformation into data-centric entities.

> “Without good data, AI is dead.”

This realization led to the emergence of Data Engineers (DE), responsible for constructing data pipelines to source, process, and prepare large datasets for data scientists.

At my bank, our data science initiatives faced challenges due to the lack of reliable, accessible data. Following industry trends towards productizing data, our DEs shifted focus from creating one-off bespoke pipelines to developing strategic, reusable, and trustworthy data products.

On the flip side, data scientists were tasked with deploying their models in production environments to serve actual users, monitor performance for any drift, and maintain the models throughout their lifecycle. This need birthed the role of Machine Learning Engineer (MLE), one of the most sought-after positions alongside AI engineers.

MLEs handle the operational aspects of machine learning, known as MLOps. With a blend of machine learning expertise and software engineering skills, MLEs refine and optimize code to ensure models provide tangible value by being deployed effectively.

The current major focus in tech is Generative AI. In 2017, researchers from Google introduced a transformative architecture called the transformer. A year later, OpenAI utilized this framework to develop their groundbreaking Generative Pre-trained Transformer (GPT) models.

Their ChatGPT product, built on GPT-3.5, marked a pivotal moment in AI, attracting over 100 million users within two months of its launch in November 2022. Since then, organizations worldwide have raced to experiment with and implement Generative AI use cases, placing significant pressure on data scientists and machine learning experts to rapidly adapt to new concepts and tools.

In just a few years, many data scientists who once trained models using sklearn on clean datasets in Jupyter Notebooks found themselves navigating uncharted territories. They were now tasked with identifying promising Generative AI applications with limited resources, adapting large language models for their companies using techniques like fine-tuning or Retrieval Augmented Generation (RAG) with cloud platforms such as Microsoft Azure ML or AWS Sagemaker while managing AI hallucinations and deploying their models for real-world usage.
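For readers new to RAG, the pattern itself is simple: retrieve documents relevant to the user's question, then pack them into the prompt so the model answers from supplied context rather than from memory. A dependency-free sketch, using bag-of-words cosine similarity as a stand-in for a real embedding model and vector store (the documents and query are invented):

```python
import math
import re
from collections import Counter

# Toy knowledge base; a real system would embed documents with a model
# and store the vectors in a vector database.
DOCS = [
    "Fixed-rate mortgages keep the same interest rate for the loan term.",
    "Offset accounts reduce the interest charged on a home loan.",
    "Credit cards charge interest on unpaid balances.",
]

def tokenize(text: str) -> Counter:
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def retrieve(query: str, k: int = 1) -> list:
    """Return the k documents most similar to the query."""
    q = tokenize(query)
    return sorted(DOCS, key=lambda d: cosine(q, tokenize(d)), reverse=True)[:k]

def build_prompt(query: str) -> str:
    # Grounding the LLM in retrieved context is what mitigates hallucination
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How does an offset account affect interest?"))
```

Swapping the toy retriever for real embeddings and the print for an LLM call gives the production version, but the data flow is unchanged.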

At our bank, we have been exploring Generative AI use cases through a combination of centralized planning and decentralized hackathons. These hackathons provide our team with opportunities to brainstorm potential applications and prototype working products in an open environment. The resulting ideas are then presented to our Chief Technology Office, which outlines a strategic roadmap for AI implementation, the requisite data infrastructure, and necessary safeguards. They ultimately determine which Generative AI projects proceed to full development and deployment.

The Modern Data Scientist

Back in 2018, Emmanuel Ameisen reflected on the hiring landscape:

> “Hiring Managers across the valley frequently express that while there’s no shortage of individuals capable of training models on datasets, they need engineers who can build data-driven products.”

Fast forward to today, and most organizations and data science leaders are searching for a particular type of professional: the end-to-end data scientist.

These specialists not only possess a solid understanding of core data science and modeling but also excel in soft skills and engineering capabilities. However, such individuals are exceedingly rare. Most organizations would be pleased with a seasoned data scientist who has some level of end-to-end expertise.

Eugene Yan outlines four reasons why end-to-end data scientists hold significant value:

1. Acquire Contextual Understanding

When data scientists concentrate solely on modeling, they risk developing tunnel vision that leads to subpar results. Without insight into the data value chain, they may overlook key factors impacting outcomes. For instance, if our Mortgages team observes a drop in loan applications and requests a new marketing model, we would not rush in. Several factors could be influencing this:

  • Products: Are the new mortgage offerings appealing? Is competition arising from emerging fintech companies? What about the broader economic context?
  • Data: Are our data pipelines functioning correctly? Is the data quality sufficient?
  • Models: Are existing models performing as expected? Is there any drift in the data or model?

Most of these issues are not strictly machine learning challenges. As Yan notes:

> “More often than not, the problem — and solution — lies outside of machine learning.”

Additionally, having visibility downstream is crucial to ensure that the ML solution aligns with the engineering and product constraints of the business.

Our bank recently launched a Digital Instant Mortgage product that allows our 11 million customers to secure multi-million dollar home loans in minutes. Data scientists needed to collaborate closely with infrastructure and product teams to ensure that their solutions were technologically viable and integrated seamlessly with our online banking application.
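One of the checks listed above, monitoring for data or model drift, is often quantified with the Population Stability Index (PSI). Below is a self-contained sketch; the thresholds quoted are a common industry convention rather than a hard standard, and the data is synthetic:

```python
import math

def psi(expected, actual, bins=5):
    """Population Stability Index between baseline and recent samples.
    Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift,
    > 0.25 significant drift worth investigating."""
    lo, hi = min(expected), max(expected)
    # Bin edges come from the baseline; the last bin is open-ended, and
    # actual values below the baseline minimum are ignored in this sketch.
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]
    edges[-1] = float("inf")

    def frac(data):
        counts = [0] * bins
        for x in data:
            for i in range(bins):
                if edges[i] <= x < edges[i + 1]:
                    counts[i] += 1
                    break
        n = len(data)
        return [max(c / n, 1e-6) for c in counts]  # avoid log(0)

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [0.1 * i for i in range(100)]        # scores at training time
recent = [0.1 * i + 3.0 for i in range(100)]    # shifted production scores
print(round(psi(baseline, recent), 3))           # large value flags drift
```

A scheduled job computing PSI over each model's inputs and scores is often the first piece of monitoring a team builds.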

2. Reduce Communication Overhead

More participants in a project often result in increased communication overhead. More time is spent aligning ideas and negotiating details, leaving less time for productive work. The COVID-19 pandemic has made this even more challenging, as remote collaboration can complicate technical discussions.

Communication overhead grows much faster than team size itself. J. Richard Hackman, a Harvard psychologist, observed that the number of pairwise relationships in a team of n people grows quadratically, as n(n − 1)/2, leading to more time spent on coordination and a higher chance of misalignment.
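Hackman's point about pairwise relationships can be made concrete: a team of n people has n(n − 1)/2 potential one-to-one channels, so the coordination burden grows quadratically rather than linearly. A quick illustration:

```python
def pairwise_links(n: int) -> int:
    """Potential one-to-one communication channels in a team of n people."""
    return n * (n - 1) // 2

# Doubling a team from 6 to 12 people more than quadruples the channels
for size in (3, 6, 12):
    print(size, pairwise_links(size))  # 3 -> 3, 6 -> 15, 12 -> 66
```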

In the data science workflow, this is evident when data engineers (DE), data scientists (DS), and machine learning engineers (MLE) must collaborate. DE and DS need to align on data cleaning and feature engineering, while MLE and DS must discuss model deployment and maintenance. In larger organizations, multiple stakeholders, such as Service Operation Managers, Scrum Masters, and Product Owners, may also become involved, complicating communication.

A data scientist with robust engineering skills can significantly mitigate this communication overhead.

3. Foster Ownership and Accountability

Dividing the data science process among various roles can lead to a lack of accountability, as individuals may focus solely on their specific tasks. This often results in the classic "Throw It Over the Wall" scenario, where DE hands off databases to DS and DS sends scripts to MLE without sufficient context.

When issues arise, this can lead to finger-pointing as team members deflect responsibility. Research has shown that the presence of others can lead to Diffusion of Responsibility, where individuals are less likely to take action, and Social Loafing, where group work results in diminished effort.

An end-to-end data scientist, equipped with strong engineering skills, is empowered to manage the entire data science process from start to finish. This enables them to engage with customer needs, oversee model deployment, and utilize appropriate metrics to assess project success as a comprehensive, data-driven product.

4. Accelerate Learning and Iteration

As Yan succinctly puts it:

> “With greater context and lesser overhead, we can now iterate, fail (read: learn), and deliver value faster.”

This holistic perspective fosters a culture of rapid learning and innovation, directly contributing to a firm’s capacity for innovation.

Becoming more end-to-end can enhance a data scientist's motivation and job satisfaction, providing greater autonomy, opportunities for skill enhancement, and a sense of purpose stemming from a direct impact on work outcomes.

Skills of an End-to-End Data Scientist

What does it take to become more end-to-end? I categorize the skills into three tiers:

Tier 1 Skills

These foundational skills are typically acquired through formal education, whether at universities or through online courses and bootcamps.

  • Programming: Especially in scripting languages such as SQL for data cleaning and Python/R for prototyping ML models.
  • Data Analysis: Understanding data, visualizing it, and employing statistical tools like A/B testing to make data-driven conclusions.
  • Machine Learning: Engineering features, training and optimizing models, and selecting appropriate evaluation metrics.
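These foundations lend themselves to small worked examples. Below is a minimal two-proportion z-test for an A/B experiment, using only the standard library; the conversion counts are invented for illustration:

```python
from math import sqrt
from statistics import NormalDist

def ab_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """One-sided two-proportion z-test: is variant B's conversion rate higher?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)           # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 1 - NormalDist().cdf(z)                  # P(Z > z) under H0
    return z, p_value

# Invented numbers: 4.0% vs 5.2% conversion over 5,000 users each
z, p = ab_test(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
print(f"z = {z:.2f}, one-sided p = {p:.4f}")
```

With p below the usual 0.05 cutoff, this sample would favour shipping variant B; in practice, sample-size planning and multiple-testing corrections matter just as much as the test itself.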

Tier 2 Skills

These skills are typically developed through practical experience as a data scientist in the industry.

  • Soft Skills: Engaging with stakeholders, facilitating communication across teams, and effectively communicating insights.
  • Product Skills and Business Acumen: Understanding customer issues and crafting requirements to maximize the impact of solutions.
  • Domain Expertise: Acquiring knowledge of industry trends, business processes, and metrics relevant to specific domains, along with developing intuition for effective ML techniques.

Tier 3 Skills

These are advanced skills that bring you closer to being an end-to-end data scientist and can help pivot into new roles:

  • Data Engineering: Mobilizing large datasets from various sources, building scalable pipelines, and integrating data into a data warehouse or lake. DEs are responsible for cleaning and preparing data for use.
  • MLOps: The practice of deploying, monitoring, and maintaining ML models in production. MLEs must be familiar with containerization tools like Docker, orchestration frameworks like Kubernetes, CI/CD pipelines, and principles of scalability.
  • Software Engineering Skills: Writing clean, maintainable, and efficient code, including knowledge of design patterns and testing methodologies.
  • Generative AI: According to Gartner, Generative AI is now the most sought-after solution in enterprises, with the most common application being a ‘private ChatGPT’ for employees and customers. Data scientists and ML practitioners are increasingly required to learn fine-tuning and RAG techniques on foundation models. Effective fine-tuning requires significant amounts of relevant data, which often surprises organizations and leads to disappointing results. RAG remains popular for enterprise Generative AI adoption, as it mitigates hallucinations and ensures traceability, which is essential in regulated industries like banking and healthcare.
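As a concrete taste of the MLOps items above, here is a minimal, standard-library-only sketch of two building blocks: persisting a model alongside its validation metrics (the seed of a model registry) and a CI/CD gate that blocks promotion when the metric misses an agreed bar. The ThresholdModel and the 0.7 AUC bar are illustrative stand-ins, not a real framework's API:

```python
import json
import os
import pickle
import tempfile

class ThresholdModel:
    """Illustrative stand-in for a trained model (e.g. an sklearn estimator)."""
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def predict(self, score: float) -> int:
        return int(score >= self.threshold)

def save_with_metadata(model, metrics: dict, path: str) -> None:
    # Persist the artifact and its metrics side by side: the minimal
    # ingredients of a model registry entry.
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path + ".meta.json", "w") as f:
        json.dump(metrics, f)

def ci_gate(path: str, min_auc: float = 0.7) -> bool:
    # A check a CI/CD pipeline could run before promoting a model:
    # refuse deployment if the recorded validation AUC misses the bar.
    with open(path + ".meta.json") as f:
        meta = json.load(f)
    return meta["val_auc"] >= min_auc

artifact = os.path.join(tempfile.mkdtemp(), "model.pkl")
save_with_metadata(ThresholdModel(0.6), {"val_auc": 0.82}, artifact)
print(ci_gate(artifact))  # True: the model clears the bar and may be promoted
```

Real platforms (MLflow, SageMaker Model Registry, and the like) wrap these same ideas in richer tooling, but the logic a pipeline enforces is this simple.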

Conclusion

Since being labeled the ‘sexiest job of the 21st century’ in 2012, data science has navigated through each phase of the Gartner hype cycle.

What lies ahead? Here are two key pieces of advice:

Firstly, as the enterprise data and technology landscape evolves, new skills and technologies will continually emerge for data scientists to explore. Understand that very few people will achieve true end-to-end expertise; such professionals are even rarer than the original "unicorn" data scientists. That's perfectly acceptable.

I hope this realization encourages you to focus on gradually acquiring engineering skills while refining your business acumen, domain knowledge, and soft skills.

To catalyze your progress, consider these suggestions:

  • Volunteer for challenging projects at work.
  • Pursue personal end-to-end projects. Brainstorm business ideas, gather data, and develop a deployed application. The entrepreneurial skills gained will be invaluable.
  • Join a startup-like team within your organization that requires members to wear multiple hats and iterate on products rapidly. This experience will help you internalize the 80–20 rule and prioritize high-impact tasks.

Overall, adopt a growth mindset and seek opportunities to expand your skill set beyond core competencies.

Secondly, be prepared for the possibility that data science may evolve drastically in the next decade. Widely adopted technologies tend to become more user-friendly and democratized. Self-service BI tools like Power BI and Tableau are making insights more accessible, while no-code ML platforms such as Alteryx and Dataiku empower everyday knowledge workers to perform advanced analytics.

Every Wednesday afternoon, I facilitate a Dataiku training session for users at our bank. A data scientist from Singapore regularly demonstrates how to conduct natural language processing (NLP) using the Dataiku Data Science Studio platform.

Three years ago, they showcased using the NLTK Python package in a Jupyter Notebook integrated with Dataiku. Two years later, they demonstrated tokenization and stemming using codeless Dataiku Recipes, enabling non-coders to leverage NLP. Last year, they announced the latest GenAI integration, allowing even entry-level Excel users to utilize NLP.

Interestingly, a representative from Dataiku predicted at our firm’s annual data conference that the current all-in-one ML platforms would soon become obsolete, with knowledge workers querying AI for insights using natural language, effectively abstracting away the underlying calculations.

Things are evolving rapidly in this field, right?

Does this mean it's pointless to enhance your skills in data science? Software engineers face similar questions in the age of GPT-4 and GenAI copilots capable of writing entire applications in seconds.

My perspective: don’t panic. Data science and engineering skills are fundamentally about problem-solving, making them highly transferable. While I can't predict precisely what new careers will emerge, it’s likely that the role you occupy in a decade doesn’t exist today.

Focus on what you can control, invest in yourself (the best investment you can make), and enjoy the journey.

Connect with me on Twitter and YouTube for more insights.
