Data science workflows are where the rubber hits the road. If the data scientist's workflows don't support operations, the workflows are stranded and of little to no value. Operationalizing Machine Learning should have equal priority to model building.
Reposted from Neptune.AI, Kurtis Pykes
“Many Data Scientists spend the early stages of their careers obsessing over Machine Learning algorithms and the state of the art. As time goes on, however, many begin to realize that their attention should be diverted to what are known as soft skills.”
Since the 2012 Harvard Business Review article that called Data Science the sexiest job of the 21st century, the field has taken industries by storm. Data Science jobs consistently rank near the top of best-job listings – it’s no surprise that a single listing can receive 200-300 applications per day. An article by Forbes stated that “From 2010 to 2020, the amount of data created, captured, copied, and consumed in the world increased from 1.2 trillion gigabytes to 59 trillion gigabytes, an almost 5000% growth”. At this rate, we are drowning in data, and businesses across various industries have been looking for ways to capitalize on it.
To some it may feel like déjà vu, but those who lived through the software engineering boom may recognize a familiar pattern forming. Essentially, everyone doing software engineering wanted to maintain, or develop, their competitive edge in the market, so they were determined to deliver high-quality products to the marketplace – which in turn led to a revolution in methods and tooling, with examples including Agile, DevOps, CI/CD, and much more.
However, where software engineering and Data Science differ is that a software engineer learns in order to build, whereas for a Data Scientist the learning typically comes after building. Nonetheless, that is not to say that Data Science has nothing to learn from its cousin, software engineering. In fact, the reality is quite the opposite.
What is a Data Science workflow?
Generally, a workflow describes the way people perform tasks to get the work done. To illustrate what a typical workflow looks like, we’d list a series of steps that ought to be completed sequentially using a diagram or a checklist.
“A workflow consists of an orchestrated and repeatable pattern of business activity enabled by the systematic organization of resources into processes that transform materials, provide services, or process information. It can be depicted as a sequence of operations, declared as work of a person or group, an organization of staff, or one or more simple or complex mechanisms.” [Source: Wikipedia]
This is the essence of a Data Science workflow: it lays out the different steps taken within a Data Science project. Since the purpose of a workflow is to illustrate how to get stuff done, a well-defined Data Science workflow is extremely useful, serving as a reminder to all team members of what work has been done and what is yet to be done.
The development of a workflow
The simple answer to ‘where does the Data Science workflow come from?’ is software engineering, but as with most things, it’s really not that simple. Firstly, Software Engineers are engineers, therefore whenever they learn something new, the goal is to build with it. On the other hand, Data Scientists are scientists before they are engineers (that’s if they have any engineering capabilities), therefore they build in order to learn.
Hence, the relationship a Data Scientist has with code is different from that of an engineer. It’s rare that a Data Scientist would be thinking about coding best practices while experimenting – instead, they want to learn something valuable from the insights of the experiment. Data Scientists therefore leverage code to derive insights from data and arrive at answers to the interesting questions formulated at the beginning of the project.
Despite the differences, the best practices used by Data Science teams were actually borrowed from software development best practices. Although there are multitudes of development workflows, a commonality among them is that they typically include steps to define specifications, write code, review code, test code, integrate code, and deploy the system to a production environment where it can serve a purpose for the business.
In the same way, Data Science workflows have commonalities of their own.
General aspects of a Data Science workflow
Due to the nature of Data Science problems – we don’t know the end from the beginning – it’s very hard to define a concrete template that can be applied universally when approaching Data Science problems. Depending on the problem and the data, the roadmap for how you’d want to approach a task may vary, so it’s down to the team to define a structure that’s suitable.
Nonetheless, we do witness very common steps when approaching many different problems, regardless of the dataset. Let’s take a look at these steps.
Note: In no way is the process defined below linear. Data Science projects are quite iterative and many stages are repeated and/or revisited.
Defining the problem
Defining a problem is not as easy as it may seem, since there are many factors to consider to ensure the correct problem is being tackled. Questions to consider whilst defining the problem are as follows:
What problem are we trying to solve?
What challenges are our customers facing when using our product/service?
What insights would we like to know more about?
What problems are we currently facing?
Stating the problem clearly is more of an art, but it’s an essential first step before conducting any Data Science project. Without a compass that all members of the team are following, it’s easy to spend lots of time doing lots of things without making much progress towards adding business value.
Data collection
Data is where most bottlenecks occur in industrial Data Science projects; it’s quite rare that all of the data we need is readily available, so it’s important to know some techniques for acquiring data. This process is known as Data Acquisition.
According to Wikipedia, Data Acquisition is the process of sampling signals that measure real-world physical conditions and converting the resulting samples into digital numeric values that can be manipulated by a computer [Source: Wikipedia].
There are a variety of ways to acquire data; here are some options:
Public Data
Data Scraping
Product Intervention
Data Augmentation
Essentially, data can come from a variety of sources – there are some gotchas, but detailing them is beyond the scope of this article. The sketch below illustrates the first two approaches.
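As a rough illustration only, here is a minimal sketch of loading a public dataset with pandas and scraping a web page with requests and BeautifulSoup; the URLs and the CSS selector are placeholders, not real endpoints.

```python
# Minimal data acquisition sketch. The URLs and selector below are
# placeholders for illustration only.
import pandas as pd
import requests
from bs4 import BeautifulSoup

# 1. Public data: many open datasets are published as CSV files that
#    pandas can read directly from a URL.
public_df = pd.read_csv("https://example.com/open-data/sales.csv")  # placeholder URL

# 2. Data scraping: fetch a page and pull out the elements of interest.
response = requests.get("https://example.com/products")  # placeholder URL
soup = BeautifulSoup(response.text, "html.parser")
product_names = [tag.get_text(strip=True) for tag in soup.select("h2.product-name")]

scraped_df = pd.DataFrame({"product": product_names})
```

Whichever route is used, the goal of this step is simply to land the raw data somewhere the team can work with it; understanding it comes next.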
Data exploration
Once the data has been collected and made accessible to the Data Scientists, it’s important to spend time becoming acquainted with it.
During this phase, it’s important to develop hypotheses about the data, whilst searching for patterns and anomalies. You also want to determine the type of problem being solved, i.e. is this a supervised learning task or an unsupervised learning task? Is it a classification task or regression task? Are we trying to predict something or are we inferring something?
Supervised Learning involves building a model that learns the function which maps an input to an output based on examples of input-output pairs.
Unsupervised Learning involves building a model that learns patterns from unlabeled data.
Classification is a form of supervised learning which refers to a modeling problem where the output of the model is a discrete label.
Regression is a form of supervised learning which refers to a modeling problem where the output of the model is continuous.
The main gist is that we wish to understand our data well enough to develop hypotheses that we could potentially test when we get to the next phase of the workflow – modeling the data.
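To make the exploration step concrete, here is a minimal sketch using pandas; the file path and the ‘target’ column are assumptions made for illustration, not part of the original article.

```python
# Minimal exploration sketch; "data.csv" and the "target" column are
# hypothetical and stand in for whatever the collection step produced.
import pandas as pd

df = pd.read_csv("data.csv")

print(df.shape)         # how much data do we have?
print(df.dtypes)        # which columns are numeric, categorical, dates?
print(df.isna().sum())  # where are values missing?
print(df.describe())    # ranges and summary statistics help spot anomalies

# If a labelled target column exists, its type hints at the framing:
# a discrete label suggests classification, a continuous value suggests regression.
print(df["target"].value_counts().head())
```

Exploration like this is where hypotheses come from: odd distributions, missing values, and the shape of the target all feed directly into the decisions made in the modeling phase.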
Modeling
Once we’ve explored the data comprehensively, we will have a much better idea of the type of problem we are faced with, and hopefully will have generated some hypotheses in the previous stage that we can try out.
Since Data Science is a science, it’s likely that we’ll have to test a variety of solutions before we can conclude how to proceed with the project. Each experiment or iteration involves three stages (sketched in code after the list below):
Building involves selecting a machine learning algorithm and constructing a model from it.
Fitting involves training the model on the training data so that it learns the patterns in that data.
Validation involves evaluating the trained model on testing data held out from training, to measure its ability to generalize to previously unseen examples that are similar to the data it was trained on.
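As a rough illustration of one such iteration, here is a minimal scikit-learn sketch; the file path, the ‘target’ column, and the choice of RandomForestClassifier are assumptions made for illustration rather than recommendations.

```python
# Minimal modeling iteration sketch: build, fit, validate.
# "data.csv" and the "target" column are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")
X, y = df.drop(columns=["target"]), df["target"]

# Hold out a portion of the data so validation uses examples the model never saw.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)  # building
model.fit(X_train, y_train)                                        # fitting
predictions = model.predict(X_test)                                # validation
print("Accuracy:", accuracy_score(y_test, predictions))
```

Holding out the test split before fitting is what allows the validation stage to measure generalization rather than memorization; each new experiment repeats the build-fit-validate loop with a different algorithm, feature set, or set of hyperparameters.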
Communicating the results
Many Data Scientists spend the early stages of their careers obsessing over Machine Learning algorithms and the state of the art. As time goes on, however, many begin to realize that their attention should be diverted to what are known as soft skills.
Communicating your results clearly is one of the most important skills to possess as a Data Scientist because you will be doing a lot of it. In this phase, the Data Scientist is required to communicate the findings, results, and/or story back to various stakeholders. For the most part, these stakeholders aren’t fully ingrained in Data Science, hence being able to tailor your message so your audience can understand it is a very important part of the Data Scientist’s workflow.
Existing workflows
Data Science workflows are not a new concept in the field; in fact, there are many frameworks readily available for teams to select from.