Organizations are, on average, using 976 individual applications (compared to 843 a year ago). Yet only 28% of these applications are integrated, indicating there is still an enormous opportunity to improve the digital experience.
The main culprit behind this low number is legacy applications. Many older applications were built before open-source software and open standards took hold, so their vendors implemented custom interfaces of their own. With the mass adoption of open-source software over the last 10 years and the resulting standardization across the field, most modern data applications are pluggable by default: they expose common, highly standardized, and very popular interfaces such as REST APIs, JDBC, and ANSI SQL. They also return data in standard formats, typically JSON. Many high-quality, free tools exist that help upload, download, and work with data coming through these interfaces in those formats.
This was not the case 20+ years ago, when vendors were building the first enterprise applications such as SAP or the Oracle database. In those dark ages of the data world, vendors used custom interfaces and often exposed data in custom or obscure formats. To work with this data, you either had to purchase a targeted, dedicated solution or build your own from scratch, without any of the standard tooling or building blocks of today. There were no giants yet on whose shoulders you could stand. Maintenance was also challenging, as each new version of the software could change its interface.
Client’s SAP Connectivity Challenge
One of the most widely used applications in the enterprise landscape is SAP. Its ecosystem consists of dozens of products that offer similar functionality yet require very different connectivity setups. Many of them lock you into webs of tightly integrated, even interdependent applications, which makes it hard to incorporate them into the rest of your IT infrastructure (or even into other SAP products) and often forces you to purchase additional solutions.
When coupled with the vendor’s traditional, book-style documentation, this approach creates a situation where it may take weeks or even months to properly research which products fit best within the company’s overall technology stack and IT strategy. Integrating these products into the client’s infrastructure might be more complicated than developing an integration solution from scratch.
Our client struggled with the same challenge.
- Finance wanted to analyze data from SAP; and
- then combine that data with data coming from other, non-SAP sources (i.e., eliminate SAP’s data silo problem).
When we came in, the client had been evaluating several options. We committed to providing the complete Minimum Viable Product (MVP), including several data extracts, a data lake implementation, data modeling, and visualization within three months.
The solution consisted of several modern technologies and techniques:
- Orchestration layer (ELT pipelines etc.) written in Python;
- A data lake to serve as the company’s centralized data storage;
- A database to enable fast dashboarding.
1. Orchestration layer
We use Python for the orchestration layer for several reasons:
- Automation: it follows the IaC (Infrastructure as Code) principle;
- Flexibility: it allows expressing any possible business logic;
- Usability: it’s high-level and reads almost like natural language, requiring only basic knowledge of computer science or programming;
- Learnability: many analysts already use it, which makes it much easier to implement self-service ELT.
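To give a flavor of what this looks like in practice, here is a minimal, hypothetical sketch of an ELT step expressed as plain Python (the function and dataset names are made up; a real pipeline would call actual source systems and a real data lake client):

```python
# Minimal, hypothetical sketch of an ELT step as plain Python.
# In a real pipeline, extract() would call a source system and
# load() would write to the data lake.

def extract(source: str) -> list[dict]:
    """Pull raw rows from a source system (stubbed here)."""
    return [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]

def load(rows: list[dict], target: str) -> int:
    """Write rows to target storage; returns the row count."""
    print(f"loaded {len(rows)} rows into {target}")
    return len(rows)

def pipeline() -> int:
    """The whole pipeline is ordinary, reviewable, versionable code."""
    rows = extract("sap_finance")
    return load(rows, "lake/raw/finance")

pipeline()
```

Because the pipeline is just code, it lives in version control and deploys like any other software, which is what makes the IaC-style automation above possible.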
2. Data lake
As for the data lake, it’s the standard solution for getting rid of data silos. The main benefits are summarized below:
- Capacity: data lakes are basically infinite storage;
- Pluggability: data lakes integrate with vast amounts of software. There are many ways to get the data in, out, or to work with the data within it;
- Flexibility: data lakes can store data in any format – be it standard data formats such as CSV or Parquet or non-structured data, e.g., PDF or HTML documents;
- Reliability: a well-set-up data lake offers essentially 100% reliability.
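To make the format flexibility concrete, here is an illustrative sketch in which a local directory stands in for an object store such as S3 or ADLS (those stores expose a similar bucket/zone/dataset key layout; the zone and dataset names here are hypothetical):

```python
import csv
import json
import tempfile
from pathlib import Path

# A local directory stands in for the data lake in this sketch.
lake = Path(tempfile.mkdtemp()) / "lake"
dataset = lake / "raw" / "finance"
dataset.mkdir(parents=True)

# Structured data (CSV) and semi-structured data (JSON) live side by side;
# a real lake could equally hold Parquet files or PDF documents.
with open(dataset / "invoices.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "amount"])
    writer.writerow([1, 100])

(dataset / "_metadata.json").write_text(json.dumps({"source": "sap", "rows": 1}))

print(sorted(p.name for p in dataset.iterdir()))
```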
3. Analytical database
While data lakes can store any kind of data for multiple purposes, fast handling of analytical data requires purpose-built solutions. Modern lakehouse solutions allow data visualization without moving it outside the data lake. However, implementing a lakehouse requires several additional components and takes significant engineering effort (even with popular solutions such as Databricks). If you are building your data infrastructure and related processes from scratch, the lakehouse architecture is probably a good idea. In many cases, though, simply adding an OLAP database for final (filtered and aggregated) data will be much faster, cheaper, and require far less maintenance.
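A hedged sketch of that last pattern: only the final, aggregated data lands in the dashboarding database. SQLite stands in here so the example runs anywhere; in production this would be a columnar OLAP engine, and the table names are made up.

```python
import sqlite3

# SQLite stands in for a real OLAP database in this illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales_raw (region TEXT, amount REAL)")
con.executemany("INSERT INTO sales_raw VALUES (?, ?)",
                [("EU", 100.0), ("EU", 50.0), ("US", 75.0)])

# Only filtered and aggregated data is exposed to the dashboards.
con.execute("""
    CREATE TABLE sales_by_region AS
    SELECT region, SUM(amount) AS total
    FROM sales_raw GROUP BY region
""")
print(con.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall())
# → [('EU', 150.0), ('US', 75.0)]
```

Dashboards then query the small, pre-aggregated table instead of scanning raw data, which is what makes this setup fast and cheap to operate.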
Our journey to the solution
We approach our projects in an agile manner. We focus on building a Minimum Viable Product (MVP) first and utilizing our existing architecture and processes.
This way, we can do two things:
- Build bold and robust data products quickly. We work in close collaboration with the client, allowing us to incorporate their feedback whenever needed.
- Make opinionated choices & use recipes, which allows us to build off of our existing solutions and processes to move efficiently and effectively.
We split the task in two: finding a connectivity method with which to build our prototype, and ensuring that this method was at least reasonably viable. Having the right contact points on the client’s side (for business logic and technical support) was also crucial in ensuring that we could move quickly.
The following SAP blog article was an important starting point.
Within an hour, we started prototyping the solution. Since we had already built an in-house Python connector library, it was easy to incorporate the new connector into it. It took us two hours to implement the connector and add a simple SQL interface on top of it. We then wrote the first data extracts and models, adding functionality and tests and fixing bugs as they came up.
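Our library’s internals are not shown here, but a hedged sketch of what translating a simple SELECT into RFC parameters could look like is below. The tiny grammar and the sql_to_rfc helper are hypothetical; QUERY_TABLE, FIELDS, and OPTIONS are the actual parameters of the RFC_READ_TABLE function module that pyRFC would call.

```python
import re

# Hypothetical sketch: translate a very restricted SELECT into the
# parameters RFC_READ_TABLE expects. The real library supports far more.

def sql_to_rfc(sql: str) -> dict:
    m = re.match(
        r"SELECT\s+(?P<cols>[\w\s,*]+?)\s+FROM\s+(?P<table>\w+)"
        r"(?:\s+WHERE\s+(?P<where>.+))?$",
        sql.strip(), re.IGNORECASE,
    )
    if not m:
        raise ValueError("unsupported query")
    # "*" means no FIELDS restriction, i.e. all columns.
    cols = [c.strip() for c in m["cols"].split(",") if c.strip() != "*"]
    return {
        "QUERY_TABLE": m["table"].upper(),
        "FIELDS": [{"FIELDNAME": c.upper()} for c in cols],
        "OPTIONS": [{"TEXT": m["where"]}] if m["where"] else [],
    }

print(sql_to_rfc("SELECT matnr, werks FROM mard WHERE werks = '1000'"))
```

The resulting dictionary maps directly onto a pyRFC call to RFC_READ_TABLE, so the analyst never has to see the RFC layer at all.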
During the first few days, we discovered some limitations of the RFC solution. For example, it does not allow individual filters longer than 75 characters. Another limitation is that no row in the result set can exceed 512 characters.
However, thanks to our experience handling data programmatically, it was easy to overcome both limitations, not with workarounds but with general, reliable solutions. We fixed the filtering issue by implementing client-side filtering: we send as many filters as will fit to the pyRFC connector, and any filters that don’t fit are applied to the partly filtered data after the client receives it.
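A hedged sketch of this filter split (the predicate representation and helper names are ours for illustration, not pyRFC’s; the 75-character constant reflects the limitation described above):

```python
MAX_OPTION_LEN = 75  # per-filter character limit described above

def split_filters(predicates):
    """predicates: list of (sql_text, python_func) pairs.
    Short filters go to the server; oversized ones run client-side."""
    server, client = [], []
    for text, func in predicates:
        (server if len(text) <= MAX_OPTION_LEN else client).append((text, func))
    return server, client

def apply_client_filters(rows, client_preds):
    """Apply the leftover predicates to rows the server already returned."""
    for _, func in client_preds:
        rows = [r for r in rows if func(r)]
    return rows

preds = [
    ("WERKS = '1000'", lambda r: r["WERKS"] == "1000"),
    # An IN list over 20 values easily exceeds the 75-character limit:
    ("MATNR IN (" + ", ".join(f"'{i:010d}'" for i in range(20)) + ")",
     lambda r: int(r["MATNR"]) < 20),
]
server, client = split_filters(preds)
rows = [{"WERKS": "1000", "MATNR": "5"}, {"WERKS": "1000", "MATNR": "99"}]
print(apply_client_filters(rows, client))
```

The server-side list would be passed as OPTIONS rows, while the client-side list runs over the partly filtered result, so the end result is identical to full server-side filtering.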
For the 512-character limit, we use table metadata to divide the table into column blocks and internally create a separate query for each block. The blocks are then concatenated on the client side. We have abstracted this entire process away from the analyst: they specify their SQL query as usual, and the library does any special handling of the request and data internally.
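A hedged sketch of the column-block splitting (the field names and widths are invented; the real implementation reads them from SAP table metadata):

```python
ROW_LIMIT = 512  # maximum result-row width described above

def make_blocks(fields):
    """fields: list of (name, width). Group columns into blocks whose
    combined width stays under the row limit."""
    blocks, current, width = [], [], 0
    for name, w in fields:
        if current and width + w > ROW_LIMIT:
            blocks.append(current)
            current, width = [], 0
        current.append(name)
        width += w
    if current:
        blocks.append(current)
    return blocks

def merge(block_results):
    """Stitch per-block results back into full rows by row position."""
    return [dict(kv for part in row for kv in part.items())
            for row in zip(*block_results)]

# Invented metadata: two wide text columns force a split into two blocks.
fields = [("MATNR", 18), ("MAKTX", 300), ("LTEXT", 400), ("WERKS", 4)]
print(make_blocks(fields))
```

Each block becomes its own query; merge() then reassembles the rows so the analyst sees a single, complete result set.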
After building the prototype, we researched the space further and came across a fantastic summary of SAP’s products and connectivity methods. Thank you, Kai Waehner! Thanks to the rapid prototyping of the pyRFC solution and the tremendous research done by Kai, we were able to assess the landscape quickly and conclude that for this use case, the pyRFC approach would be far quicker and cheaper to implement than other solutions.
True, RFC is sometimes considered a legacy solution. It can be hard to debug and has several poorly documented limitations. The code execution environment also has to be specifically customized to run the RFC connector: SAP RFC requires the installation of a custom proprietary driver into the environment. However, it gets the basic job of reading data from SAP done, and it does so fast and reliably enough that we decided to add the missing functionality on our end. We filled the gaps in functionality, codified and automated the environment, and abstracted away the interface with simple Python and SQL, making it an analyst-friendly solution that is also easy to deploy and operate.
With our practical approach, we had a working connector prototype within two hours and could spend the rest of the time iterating on it toward the full MVP solution. We spent the following weeks adding tests and features, getting the connector production-ready, and producing analytical insights and data models, all of which happened concurrently from the moment we established connectivity. Though the MVP infrastructure and analytics ultimately took three months to deliver, in that time we provided a robust and feature-rich analytics solution, including infrastructure, code, data models, dashboards, and documentation.
Follow these 4 steps to overcome SAP integration challenges.
- Find a way to prototype a solution
- Build the prototype, evaluating it continuously against real-world scenarios, and improving usability for end-users as you go
- Evaluate the prototype against other possible solutions and determine whether it’s worth it to pursue them (usually they offer marginal improvement for a lot of extra work)
- Integrate the prototype into your MVP by adding components such as security, monitoring, etc.