How and Why Data Scientists Are Embracing DevOps
Instead of being relegated to the drawing board, data scientists are becoming more involved in the implementation of their solutions through collaborative DevOps practices.
Why Bring DevOps to Data Scientists
We’ve seen how DevOps – a contraction of Development and Operations – has entered software development to ensure that what runs on a developer’s machine works just as well in production. With DevOps, the software goes through an automated cycle of design, deployment, and testing until it is implemented without problems and with minimal time between coding and deployment.
Data scientists, however, have been “siloed”, or kept separate, from this cycle. Their role has been seen as purely model development, with the problem-solving of its actual application falling to the developers. This meant that different skills and environments were distinct from each other, resulting in poor collaboration and delays between results discovery and results sharing as the application developed in a linear fashion from theory to practice.
DevOps in Constructing a Data Science Question
To utilize a DevOps approach, there must be a process of deployment and testing. Identifying a goal is a natural place to start when forming a data science question.
Once there’s a goal to reach, a model can be developed, and the results of its performance can be collected and analyzed. But the development of the software often relies on additional data, and the mining of this data can become more difficult as the program’s needs change. After all, a model that works well in theory will not necessarily perform as robustly when exposed to live data and interactions.
Dealing with Data Restrictions
Gaining access to the relevant information can be difficult when the data is unfiltered or not formatted correctly. Some data may not be stored after a certain time, may not be correlated with another value as you would like, or may never have been collected at all. Restricted data can make it difficult to formulate a question in the first place, and it certainly complicates further development.
A DevOps approach would see data being sourced with the appropriate toolset, such as a flexible data management platform, which will absorb all the available information and sort it into more usable data.
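As a rough illustration of what "sorting into more usable data" can mean in practice, the sketch below cleans a handful of raw records and sets aside the ones that cannot be used. The field names (`user_id`, `amount`, `timestamp`), the date format, and the function names are all hypothetical, chosen only for this example; a real data management platform would do far more.

```python
from datetime import datetime


def normalize_record(raw):
    """Return a cleaned record, or None if a required field is missing or malformed.

    The field names and the date format are assumptions made for this sketch.
    """
    try:
        return {
            "user_id": int(raw["user_id"]),
            "amount": float(raw["amount"]),
            "timestamp": datetime.strptime(raw["timestamp"].strip(), "%Y-%m-%d"),
        }
    except (KeyError, ValueError, TypeError, AttributeError):
        return None


def normalize_all(raw_records):
    """Split raw records into usable ones and rejected ones kept for inspection."""
    cleaned, rejected = [], []
    for raw in raw_records:
        record = normalize_record(raw)
        if record is None:
            rejected.append(raw)
        else:
            cleaned.append(record)
    return cleaned, rejected


raw_records = [
    {"user_id": "17", "amount": "12.50", "timestamp": "2020-01-05"},
    {"user_id": "18", "amount": "n/a", "timestamp": "2020-01-06"},  # malformed amount
    {"user_id": "19", "timestamp": "2020-01-07"},                   # amount never collected
]
cleaned, rejected = normalize_all(raw_records)
print(len(cleaned), "usable,", len(rejected), "rejected")
```

Keeping the rejected records, rather than silently dropping them, lets the team see which restrictions in the data are getting in the way of the question being asked.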
Automation of Testing
The data science question can be refined as it is implemented and tested. Once continuous integration is in place and the tests themselves are robust, testing can be automated, which speeds up the entire process. New features are added incrementally, and only once a feature has passed testing and been successfully implemented is the next “block” of the software considered.
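A minimal sketch of what such automated checks might look like is shown below. The `predict_churn` function is a hypothetical stand-in for a real model, and the two checks are illustrative assumptions; in practice a continuous integration server would run checks like these on every change, typically through a test framework.

```python
def predict_churn(monthly_logins, support_tickets):
    """Hypothetical stand-in for a model: returns a churn score between 0 and 1."""
    score = 0.5 - 0.05 * monthly_logins + 0.1 * support_tickets
    return max(0.0, min(1.0, score))


def check_score_is_bounded():
    # The score must stay within [0, 1] across a grid of plausible inputs.
    for logins in range(0, 50, 5):
        for tickets in range(0, 10):
            assert 0.0 <= predict_churn(logins, tickets) <= 1.0


def check_more_logins_never_raise_score():
    # A more active user should not be scored as more likely to churn.
    assert predict_churn(20, 2) <= predict_churn(2, 2)


def run_checks():
    """Run every check and report which passed, as a CI job might."""
    checks = [check_score_is_bounded, check_more_logins_never_raise_score]
    results = {}
    for check in checks:
        try:
            check()
            results[check.__name__] = "passed"
        except AssertionError:
            results[check.__name__] = "failed"
    return results


results = run_checks()
for name, outcome in results.items():
    print(name, outcome)
```

Each new "block" of the software would add its own checks to this list, so a later change that breaks an earlier feature is caught before it is released.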
Preparing the Model for Real-World Application
Once the program has been tested and works well under test conditions, it can be deployed in preparation for real-world application. At this point, the program would typically be handed over from the data scientists to the engineers, but in a DevOps approach, data scientists remain a part of the process. The application must be capable of scaling and working beyond its test parameters to handle the real data it will, or could, be processing.
Following its implementation, there must then be an evaluation of its effectiveness. Did it react well to large amounts of data? Did it perform properly in real time? Is there any new data that wasn’t accounted for?
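The last of these questions can be partly automated. The sketch below, which uses made-up country codes and amounts and hypothetical function names, compares live inputs with what the model saw during development and flags anything unfamiliar. It is one simple way to approach the problem, not a complete monitoring setup.

```python
def find_unseen_values(training_values, live_values):
    """Return the live values that never appeared during development, sorted."""
    return sorted(set(live_values) - set(training_values))


def fraction_out_of_range(live_numbers, low, high):
    """Return the share of live numbers outside the range seen during development."""
    if not live_numbers:
        return 0.0
    outside = sum(1 for x in live_numbers if x < low or x > high)
    return outside / len(live_numbers)


# Illustrative data only: categories and amounts are invented for this example.
training_countries = ["DE", "FR", "UK"]
live_countries = ["DE", "BR", "FR", "IN", "BR"]
unseen = find_unseen_values(training_countries, live_countries)

live_amounts = [10.0, 250.0, 40.0, 999.0]
share = fraction_out_of_range(live_amounts, low=0.0, high=200.0)

print("Unseen categories:", unseen)
print("Share of amounts outside the development range:", share)
```

A report like this gives data scientists a concrete reason to revisit the model, which is exactly the feedback they miss when the program is simply handed over to the engineers.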
Benefits of a DevOps Approach
Speed has become a motivating factor, with companies wanting their developments online and being useful, rather than languishing in labs, wasting money, falling behind the competition, and still constrained by test conditions. Consumers now expect upgrades and new features, and if they feel that a company is falling flat and failing to keep up with demand, it can become a disaster.
Of course, developers are understandably wary of implementing new features without assurances that they won’t fail under pressure, but that is why automated continuous testing is so valuable. With incremental releases, each version contains very few changes compared with the one before, which reduces risks to the system and makes it easier to find and fix errors if they do occur.
Drawbacks to Using DevOps
The amalgamation of development and operations can be a tricky shift to navigate, and a company might find it difficult to accommodate this new way of working when a siloed approach is the standard. Interdisciplinary working can also cause friction, particularly when developers favour innovation and operations prefer standardized, trusted methods.
Different languages and frameworks also need to be considered. For example, a data scientist may be working in R, a language geared toward statistical analysis and research, and a programmer may then need to translate that work into something production-ready, for example an application built on the .NET Framework.
Uniting Data Scientists with DevOps
While data scientists are often perceived as building the foundations for the engineers to erect the walls, DevOps brings data scientists into more rigorous testing of their applications, where they can see them unfold piece by piece on a live server.