In this blog, I outline briefly:
- Common Applications of Data Science
- Definitions: Machine learning, deep learning, data engineering and data science
- Why Java for data science workflows, for both production and research.
Common Applications of Data Science
The blogosphere is full of descriptions of how data science and “AI” are changing the world. In financial services, applications include personalized financial offers, fraud detection, risk assessment (e.g. loans), portfolio analysis and trading strategies, but the technologies are relevant elsewhere too: customer churn in telecoms, personalized treatment in healthcare, predictive maintenance in manufacturing, and demand forecasting in retail.
These applications are largely not new, nor are "AI" algorithms like neural networks. However, increasingly commoditized, flexible and cheaper hardware, combined with readily available algorithms and APIs, has lowered the barriers to the data- and compute-intensive approaches common in data science, making the use of "AI" algorithms much more straightforward.
Key Definitions: Machine Learning, Data Science, etc
For practitioners, definitions are well understood. For those less familiar and curious, here are some quick definitions and introductions to baseline everyone.
At their heart, data science workflows transform data from heterogeneous sources of information, through models and learning, into information from which “useful” decisions can be expedited. Decisions may be automated (e.g. an online search or a retail credit fraud check) or inform human decisions (e.g. a portfolio manager's investment decisions or a complex corporate lending negotiation).
Some see a distinction between Data Science and Data Engineering, but they are two sides of the same coin; as U2 once put it, “we’re one but we’re not the same.” I was recently pointed to this table, which I adjusted a tad below, and I’d argue that developers/DevOps should be called out as a distinct column too. In the same article, a commentator observed:
"Most cloud-native-type companies need five data engineers for each data scientist to get the data into the form and location needed for good data science," said Jason Preszler, head data scientist at Karat, a technical hiring service. "Without both roles,
the data [that] companies are easily collecting is just sitting around or underutilized."
I’ve seen exceptional domain-specialists-turned-data-scientists also be CTO-like unicorns, bridging the gap between algorithm, implementation and business insight. I've also seen enterprise architects and CTOs, particularly those gifted with both soft skills and AI-focused STEM PhDs, drive algorithmic research (a chance, perhaps, to relive their university days). Their direction in turn helps specialists deliver individual tasks, from algorithms and research, data munging and warehousing, and software and application development right through to business-level reporting and, where applicable, automated activity execution.
Now let’s briefly examine some key algorithmic terms, important because we’ll return to them later in the article when exploring emerging Java capabilities:
Machine Learning: "The field of study that gives computers the ability to learn without being explicitly programmed” - Arthur Samuel (1959)
The field subdivides in multiple ways. Supervised learning trains a model on known inputs and outputs (labeled training data) to predict future values, essentially learning from example; unsupervised learning finds hidden patterns or intrinsic structures in unlabeled input data.
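To make the supervised idea concrete, here is a minimal sketch in plain Java (no external libraries; the toy data and class name are invented for illustration): ordinary least squares fits a line to labeled (x, y) training pairs and then predicts on an unseen input.

```java
// Minimal supervised-learning sketch: fit y = slope*x + intercept
// to labeled training data via closed-form least squares.
public class LeastSquaresDemo {
    // Returns {slope, intercept} minimizing squared error over the data.
    static double[] fitLine(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i]; sumY += y[i];
            sumXY += x[i] * y[i]; sumXX += x[i] * x[i];
        }
        double slope = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
        double intercept = (sumY - slope * sumX) / n;
        return new double[] { slope, intercept };
    }

    public static void main(String[] args) {
        // Labeled training examples: here, y = 2x + 1 exactly.
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {3, 5, 7, 9, 11};
        double[] model = fitLine(x, y);
        System.out.printf("slope=%.2f intercept=%.2f%n", model[0], model[1]);
        // "Learning from example": predict at an unseen input.
        System.out.printf("prediction at x=10: %.2f%n", model[0] * 10 + model[1]);
    }
}
```

Real workflows use richer models and libraries, of course, but the shape is the same: labeled examples in, a fitted model out, predictions on new data.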
In deep learning, a computer model learns to perform classification tasks directly from images, text, signals or sound. Models are trained using a large set of labeled data and neural network architectures that contain many layers.
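A tiny sketch of the layered idea, in plain Java: a single hidden layer of two sigmoid units computing XOR, a function no single neuron can represent. The weights here are hand-set for illustration; in a real network they are learned from the labeled data.

```java
// Minimal feedforward neural network sketch: one hidden layer, hand-set
// weights, computing XOR. Illustrative only; real networks learn weights.
public class TinyNet {
    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    // Forward pass: inputs -> hidden layer -> output neuron.
    static double forward(double x1, double x2) {
        double h1 = sigmoid(10 * x1 + 10 * x2 - 5);   // fires if x1 OR x2
        double h2 = sigmoid(10 * x1 + 10 * x2 - 15);  // fires if x1 AND x2
        return sigmoid(10 * h1 - 10 * h2 - 5);        // OR but not AND = XOR
    }

    public static void main(String[] args) {
        for (int a = 0; a <= 1; a++)
            for (int b = 0; b <= 1; b++)
                System.out.printf("%d XOR %d ~ %.3f%n", a, b, forward(a, b));
    }
}
```

Deep learning stacks many such layers, letting each layer build features from the outputs of the previous one.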
Among the various deep learning algorithms, I’ve interacted with two significantly in my financial services life:
- Convolutional neural networks, which extracted image features, in my case cars in out-of-town car parks/malls from satellite images, forming the basis of an "alternative" RetailWatch car-count data-set giving daily insights into retail performance.
- Long Short-Term Memory networks, which I demoed to drive sentiment classification from news, tweets and earnings announcements, underpinning alternative data-driven trading strategies.
Reinforcement Learning: A human-like, trial-and-error, “agent-based” approach that reinforces paths that work and discards paths that don’t. Such approaches are popular in search, retail and trading strategies, as they can mimic complex human behavior. They are also applied in ADAS (Advanced Driver-Assistance Systems), intersecting well with the human-machine interface on which such systems depend.
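The trial-and-error idea can be sketched in plain Java with tabular Q-learning on a toy environment (the five-state corridor, class name and hyperparameters are all invented for illustration): the agent starts at state 0 and learns by exploration that moving right reaches the reward at state 4.

```java
import java.util.Random;

// Toy reinforcement-learning sketch: tabular Q-learning on a five-state
// corridor. Hyperparameters are illustrative, not tuned recommendations.
public class QLearningDemo {
    static final int STATES = 5, LEFT = 0, RIGHT = 1;

    static double[][] train(long seed) {
        double alpha = 0.5, gamma = 0.9, epsilon = 0.2;
        double[][] q = new double[STATES][2];
        Random rng = new Random(seed);
        for (int episode = 0; episode < 500; episode++) {
            int s = 0;
            while (s != STATES - 1) {              // until the goal is reached
                // Epsilon-greedy: mostly exploit, occasionally explore.
                int a = rng.nextDouble() < epsilon ? rng.nextInt(2)
                        : (q[s][RIGHT] >= q[s][LEFT] ? RIGHT : LEFT);
                int next = Math.max(0, Math.min(STATES - 1, s + (a == RIGHT ? 1 : -1)));
                double reward = (next == STATES - 1) ? 1.0 : 0.0;
                double best = Math.max(q[next][LEFT], q[next][RIGHT]);
                // Reinforce paths that work; alternatives decay by comparison.
                q[s][a] += alpha * (reward + gamma * best - q[s][a]);
                s = next;
            }
        }
        return q;
    }

    public static void main(String[] args) {
        double[][] q = train(42);
        for (int s = 0; s < STATES - 1; s++)
            System.out.printf("state %d: prefer right? %b%n", s, q[s][RIGHT] > q[s][LEFT]);
    }
}
```

Production agents replace the table with a neural network (as in Deep Q-Networks) and the corridor with a market, search session or driving simulator, but the reinforce-what-works loop is the same.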
Why Java in Your Data Science Workflows?
All languages are beautiful; their individual beauty often lies in the eye of the beholder. The open source languages Python and R have dominated upstream data science since roughly 2010-15; before that, the commercial language MATLAB did, and many game-changing early neural net algorithms were implemented in it. Views differ on how far Python and R extend into the enterprise stack. In research, R has a rich statistical library ecosystem, while key libraries like TensorFlow, PyTorch and Keras are accessible from Python, facilitated by the SciPy stack and Pandas. However, other languages are coming to the fore, including Java, C++ and .NET.
Gartner machine learning guru Andriy Burkov eloquently writes:
"Some people working in data analysis think that there's something special about Python (or R, or, Scala).
They will tell you that you have to use one of those because otherwise, you will not get the best result. It's not true. The choice of language should be made based on two factors: 1) how well your final product will integrate with the existing ecosystem
and 2) the availability of production-grade data analysis libraries for each language.
Currently, almost any popular language has one or more powerful libraries for data analysis. Java is an excellent example, where the development of everything hot is happening right now because of a multitude of existing JVM languages. C++ historically has
a huge choice of implemented algorithms. Even proprietary ecosystems such as .NET today contain implementations of most of the state-of-the-art algorithms and learning paradigms. So, if someone tells you that only Python is the way to go, I would be skeptical
and look for someone who embraces diversity."
Great advice. Two key points primarily from the Java perspective:
i) Data science algorithms “upstream,” particularly for statistics, machine learning and deep learning methodologies (neural nets), hitherto the province of Python, R and MATLAB, are increasingly available across more languages. In Java, for example, the following frameworks are emerging:
- DeepLearning4J (DL4J): A toolkit for building, training and deploying neural networks on the JVM.
- RL4J: Extends DL4J with reinforcement learning, targeting applications such as image processing, and includes Markov Decision Process (MDP) and Deep Q-Network (DQN) methods.
- ND4J: Key scientific computing libraries for the JVM, modeled on NumPy and core MATLAB, including deep learning capabilities.
- Amazon's Deep Java Library (DJL): Develop and deploy machine and deep learning models, drawing on the MXNet, PyTorch and TensorFlow frameworks.
These and other capabilities make Java accessible to developer-savvy scientific programmers.
Note that commercial "upstream" environments such as SAS, KNIME and RapidMiner offer data science platforms with strong Java foundations. MATLAB too has historically integrated well with Java for application development and API connectivity, a theme in Yair Altman’s aging Java/MATLAB classic, Undocumented MATLAB. The MATLAB Production Server is one of several vehicles to deploy MATLAB algorithms into Java enterprise applications: in your Java code, you define a Java interface to represent the deployed MATLAB function, instantiate a proxy object to communicate with the Production Server, and call the MATLAB-generated function through it.
You can also interface and deploy open source R code to Java in many ways, including via dedicated bridging packages.
In short, there are increasing capabilities to code (production-ready-ish) data science algorithms in Java, and if not in Java, then to call other languages from Java. Python (with NumPy, SciPy and Pandas), R and MATLAB will surely remain algorithmic domain leaders given their matrix algebra, technical computing and statistical foundations, but Java and other languages are increasingly compelling.
A quick nod to C++: remember that key “Python” libraries, including TensorFlow and PyTorch, have strong C++ foundations. Away from algorithms and toward data engineering, Pandas creator Wes McKinney, for example, has highlighted the relevance of C++ to the multi-platform Arrow project.
ii) Data science enterprise architectures “downstream,” particularly those focusing on secure data throughput, are often Java-based and/or underpinned by platforms or languages (e.g. Scala or Clojure) using the Java Virtual Machine (JVM), including:
- Hadoop: Distributed storage and processing of big data using the MapReduce programming model
- Spark: Where Hadoop tends towards batch processing, Spark performs both batch and streaming.
- Kafka: Messaging and Streaming
- Cassandra: NoSQL Database
- Neo4J: A popular graph-oriented NoSQL database
- Elasticsearch: A search engine based on the Lucene library, providing a distributed full-text search engine with an HTTP web interface and schema-free JSON documents.
With careful JVM tuning, or by swapping in a high-performance JVM, these already high-performance applications become even more performant and glitch-free.
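As a hedged illustration of what such tuning can look like, here are a few commonly used HotSpot flags (heap sizes, pause targets and the application jar name are placeholder examples only; always benchmark against your own workload and JDK version):

```shell
# Illustrative JVM tuning flags for a data-intensive service.
# Values are examples, not recommendations.
java \
  -Xms8g -Xmx8g \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=200 \
  -Xlog:gc*:file=gc.log \
  -jar my-streaming-app.jar

# -Xms/-Xmx equal         fixed heap size avoids resize pauses
# -XX:+UseG1GC            low-pause collector, the default since JDK 9
# -XX:MaxGCPauseMillis    G1 pause-time target
# -Xlog:gc*               unified GC logging (JDK 9+) for later analysis
```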
Java excels in distributed environments. Secure data handling, manipulation, transfer and connectivity are among its natural strengths, benefiting too from a coordinated security strategy enforced over the years by Sun Microsystems, Oracle and now the vibrant OpenJDK community. The cross-platform approach underpinned by the JVM, i.e. develop once, deploy anywhere, facilitates enterprise development. Key projects, for example Project Panama, enhance ease of access to native code, and will bring compute-intensive, deep learning-friendly CUDA- and OpenCL-based libraries and GPU hardware within easier reach.
In conclusion, Java is prominent in enterprise architectures and increasingly versatile in “upstream” data science algorithmic capabilities. It will operate alongside Python, R, MATLAB, C++ and others, not instead of them, but it is increasingly possible to use Java across all aspects of data science workflows.
As an MBA student wrote in response to Burkov's post referenced earlier: "I started learning Java last year and I am beyond excited about learning how to do advanced analytics in this language! I am currently taking Java based courses in ‘Data Structures and Algorithms’ & ‘Mathematics in Computing'"