Discover more from Fogplug by Thor Ingham
Episode 1: Integration. Is it Data or Application Orientated?
Welcome to this episode, the first episode from Fogplug.
Here we will talk about integration, as always. In this episode we will talk about the difference between application integration and data integration.
We'll start by defining application integration. This refers to the process of combining multiple software applications, systems, and data sources into a single unified system to improve efficiency and data sharing. This will often involve connecting disparate applications using APIs or middleware to enable the exchange of data and information between them.
The goal of application integration is to eliminate the need for manual data entry, reduce data duplication and errors, and enable a more seamless flow of information across an organization and to and from other organizations.
There are typically two technical terms that people will refer to when talking about application and data integration. Some will throw out the term ESB, which stands for Enterprise Service Bus.
This is more a pattern for application integration. It's not really a tool. An ESB as a pattern is something you can choose to apply to your architecture or not. An application integration tool should support ESB as a pattern amongst others, but you choose which pattern to apply.
So if we look at ESBs and ETL tools, they are two different technologies, each with a distinct purpose and use case. For the sake of this podcast and blog, I will refer to message orientated middleware and data orientated middleware. Now, message orientated middleware is what people typically refer to as an ESB, and data orientated middleware is what some people would refer to as an ETL tool. ETL stands for extract, transform, load.
So message orientated middleware provides a communication layer between applications and services within an organization. It enables them to exchange data and messages in a loosely coupled manner. It acts as a central hub to route, transform, and manage data and messages, promoting service integration and reuse.
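To make the hub idea a bit more concrete, here is a minimal Python sketch of a central hub that routes and transforms messages. The names in it (MessageHub, the "order.created" message type, the billing fields) are made up for illustration and don't come from any particular integration product.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Message = Dict[str, object]
Transform = Callable[[Message], Message]
Handler = Callable[[Message], None]


class MessageHub:
    """Toy central hub: routes by message type and transforms per subscriber."""

    def __init__(self) -> None:
        self._routes: Dict[str, List[Tuple[Transform, Handler]]] = defaultdict(list)

    def subscribe(self, msg_type: str, transform: Transform, handler: Handler) -> None:
        # Each subscriber brings its own transform, so the sender never needs
        # to know the receiver's format (loose coupling).
        self._routes[msg_type].append((transform, handler))

    def publish(self, msg_type: str, message: Message) -> int:
        # Deliver a transformed copy to every subscriber of this message type
        # and report how many deliveries were made.
        delivered = 0
        for transform, handler in self._routes.get(msg_type, []):
            handler(transform(dict(message)))
            delivered += 1
        return delivered


hub = MessageHub()
billing_inbox: List[Message] = []

# Hypothetical billing system that expects "customer" and "amount_cents".
hub.subscribe(
    "order.created",
    lambda m: {"customer": m["customer_id"], "amount_cents": int(round(m["total"] * 100))},
    billing_inbox.append,
)

hub.publish("order.created", {"customer_id": "C42", "total": 19.99})
```

The sender only knows the hub and the message type. Adding a second receiver with a different format would be one more subscribe call, without touching the sender.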
This is also known as service orientated architecture, or SOA. Message orientated middleware typically consists of an integration tool and a message queue, and it could also have Kafka to provide event streaming capabilities to empower an event driven architecture.
There are various tools that integrators have in their tool belt, so it shouldn't come as a surprise to anyone that it's difficult to differentiate application integration from data integration. Data orientated middleware, on the other hand, is a process used to extract data from multiple sources, transform it into a desired format, and load it into a target system.
This would typically be a data warehouse used for reporting and analysis. Data orientated middleware is mainly focused on the integration of data from different systems and its preparation for analysis. As I mentioned, this is what people typically refer to as an ETL tool.
To quickly summarize, we would use message orientated middleware when we need to integrate different applications and services within an organization, or to facilitate communication and data exchange with other organizations. And we would use data orientated middleware when we need to extract data from multiple sources, transform it, and load it into a target system for reporting and analysis.
Both approaches have their own strengths and weaknesses, and the choice between them will depend on the specific requirements of a given project. Sometimes your project might be pressed for time and the application integration team might have a huge backlog of work to do, and you might have to go to the data integration people to solve your project requirements with their tools. The most important thing, in my opinion, is to document where your data flows, how it's flowing, which tools are used, and the transformations in use. There will be a time when you might want to move an implementation from the data integration team to the application integration team, or the other way around, because it would be better suited in one team or the other. So documentation is the key here. Even though we have two separate types of tools to solve more or less the same challenges, they aren't mutually exclusive. And this is what makes it difficult when deciding which team should do what, but I'll try and go through some examples of when it's appropriate to use one or the other.
So when could it be more appropriate to choose a data orientated approach? One example would be when the main focus is on data integration and preparation for reporting and analysis. Another example would be if you're dealing with large amounts of data from multiple sources and they need to be transformed and loaded into a centralized data repository.
Another example is when the data integration requirements are well defined and stable, meaning you're not really expecting changes to this data flow, and your focus is mainly on batch processing. Typically, this is an indicator that you're looking at data orientated middleware.
When could it be more appropriate to choose a message orientated approach? If your main focus is on real time communication and integration between applications and services within an organization, or even exchanging messages between organizations, it could be more appropriate to choose a message orientated approach.
If there is a need for a flexible and scalable communication infrastructure that can handle a high volume of messages and data, or if your integration requirements are dynamic, as in changing, there could be a need for a more agile approach to integration, and in my opinion, application integration tools are well suited for this scenario.
So to quickly summarize, if your project focuses more on data integration and preparation for analysis, an ETL approach may be a better choice. And if your project focuses more on real time communication and integration between services and applications, then message orientated middleware is probably your better choice.
However, it's not a clear cut choice.
When there's a large volume of data that needs to be moved quickly, an ETL tool can be a more effective solution because it can use batch processing and parallel processing to speed up the data transfer. This isn't to say that you can't do this in an application integration tool, but typically it's a lot easier to perform parallel processing in an ETL tool.
If you have complex data transformations, then an ETL tool can usually handle this more easily than an application integration tool. For example, if your data needs to be cleansed, reformatted, and aggregated, ETL tools are surely going to be a better option for you.
Again, you could do it with an application integration tool.
So if you're integrating with legacy systems, old systems, without APIs, old systems with a database, then ETL could be more effective for integrating because ETL tools typically have great mechanisms for communicating directly with databases. However, if your legacy system is file based, then it's quite likely that your application integration team have connectors for files.
That's not to say they can't connect to databases, but again, it depends on what you're facing. If your project is about data warehousing and business intelligence, then it's usually a good idea to go with the data orientated approach. An ETL tool would be well suited for data warehousing and business intelligence use cases, especially when we're talking about large quantities of data that need to be transformed before analysis.
Now, it's worth noting that these are not mutually exclusive tools. Most organizations choose to use a combination of both to meet their specific needs. What I try to get organizations to do is to create a center of excellence for integration, a place where projects can come and get expertise to solve their integration challenges.
Now, in this center of excellence, one will try to determine what type of approach is correct for any given project. You would typically start out by looking at the characteristics of a project to see if it's a clear-cut case of let's do a message orientated approach, or let's do a data orientated approach.
More often than not, we will solve a project's data requirements in both tools. So we'll use the message orientated approach for the part of the project where that is appropriate, and we'll use the data orientated approach for the pieces that are more clear cut for that approach.
Now, the point here is that your projects need to have somewhere to go to get answers to their requirements. It's not easy for projects in an early stage to determine what they need. Saying, ah, we need integration, we're gonna go to the application integration team, will often result in integrations being delivered on an incorrect tool. And over time you'll see that you are not getting the benefits of the tools that you were hoping for, because you can do anything in either tool, but it will be more efficient if you have a mechanism in place to apply the correct set of tools.
An application integration approach is usually message orientated, and some people will refer to this as having an enterprise service bus, or an ESB. This isn't really a tool. It's more a pattern that you can choose to implement in your message orientated middleware. However, for the sake of conversation, we can say that an enterprise service bus is better for providing a centralized infrastructure for communication and data exchange between different applications in a service orientated architecture, or an SOA.
Now, the job of an ESB pattern is to act as an intermediary between applications. It will typically support many protocols, many data formats, and many message exchange patterns, while it also provides features like routing, transformation, security, validation, and error handling. This is what I call the message orientated approach.
When you do application integration in a message orientated approach, you don't have to apply an ESB pattern. A lot of folks out there have a history with ESBs as being this enterprise bus which is slow, prone to error, and difficult to change. Basically a black box where you put all your spaghetti inside, and that's not really the application integration tool's fault. It's more the fault of how it's been used.
These days, and we're now in 2023, we typically use message orientated middleware to create microservices. Microservices, which is a different discussion, are actually a subset of a service orientated architecture, and they are all about moving messages, transforming messages, and hopefully doing this in a secure and sustainable way. To me that's an integration problem. Modern integration tools work very well in solving this challenge; it's basically what they've done for the last 30 years.
On the other side of the table, if you will, we have the data orientated tools. They are typically referred to as ETL or ELT tools. Now, ETL stands for extract, transform, then load, and ELT is extract, load, then transform.
The main difference is that ETL tools extract the data and transform it before loading, while an ELT approach will extract the data, load it into a target system, and transform it inside the target system using the facilities of the target system. These types of tools are usually better at extracting data from multiple sources, transforming it into the desired format, and loading it into a target system, especially when the target system is a data warehouse, and especially when your data is going to be used for analysis and reporting.
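The difference in ordering can be sketched in a few lines of Python. This is only an illustration under assumptions: an in-memory SQLite database stands in for the target system, and the table names and sample rows are invented.

```python
import sqlite3

# Hypothetical raw rows as they might arrive from a source system.
source_rows = [("  alice ", "10.50"), ("BOB", "4.25"), ("alice", "1.00")]


def run_etl(conn: sqlite3.Connection) -> None:
    # ETL: clean and convert in the tool (here, Python) BEFORE loading.
    cleaned = [(name.strip().lower(), float(amount)) for name, amount in source_rows]
    conn.execute("CREATE TABLE sales_etl (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?)", cleaned)


def run_elt(conn: sqlite3.Connection) -> None:
    # ELT: load the raw data untouched, then transform INSIDE the target
    # using the target's own SQL engine.
    conn.execute("CREATE TABLE sales_raw (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?)", source_rows)
    conn.execute(
        "CREATE TABLE sales_elt AS "
        "SELECT lower(trim(customer)) AS customer, CAST(amount AS REAL) AS amount "
        "FROM sales_raw"
    )


def totals(conn: sqlite3.Connection, table: str) -> list:
    # The table name is interpolated only because it is a fixed, trusted value here.
    query = f"SELECT customer, SUM(amount) FROM {table} GROUP BY customer ORDER BY customer"
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
run_etl(conn)
run_elt(conn)
etl_totals = totals(conn, "sales_etl")
elt_totals = totals(conn, "sales_elt")
```

Both routes end up with the same totals per customer. What changes is where the transformation work happens: in the tool for ETL, in the target system for ELT.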
So a data orientated approach is typically used for large scale data migration, aggregation, and data warehousing. However, it's not designed to provide communication and integration between applications in real time. That's where I would draw the line on when to use one or the other.
So both a message orientated and a data orientated approach have similar capabilities. What they share is that they can both extract, transform, and load data. The point I'm making is that message orientated middleware will be better suited for real-time communication and data exchange between applications, while the data orientated approach is better suited for batch processing, data warehousing, aggregation of data, and analysis and reporting scenarios. Now we'll get into the details. Moving on.
If we look at data orientated middleware, we can say that it's more appropriate to use an ELT tool when your data volume is large. The reason is that an ELT tool can handle big data better because it uses the processing power of your data warehouse or cloud storage.
If you have complex transformations of data, with multiple sources and aggregation, a data orientated tool like an ELT tool is typically better suited to implement these scenarios.
ELT tools can handle complex transformations better because they can leverage the processing power and SQL capabilities of your target system. Remember, ELT tools extract, load, then transform. So you're doing your transformations at your target system, and because your target system is most likely a data warehouse or an analysis tool, it is very well suited for complex transformations.
If you have real time analytics and you're using the data orientated approach, I would argue that doing the extract, load, and then transform is better, because the faster you get your data into your target system, the faster you can do your real time analytics on it. If you're transforming it on the way, which would be the result if you used an ETL tool (extract, transform, then load), then you'll be doing your analytics on older data.
Now, ELT versus ETL. I have noticed that developers out there will prefer to use ETL, especially if the source systems have limited processing power, or if the data requires extensive cleansing before it's loaded into the target system. Then you will be using your ETL tool's engine to do the transformations of the data. Not all cloud data warehouse solutions provide great cleansing capabilities yet, so that's something to be aware of. If you're cleansing the data, you might want to use an ETL approach as opposed to an ELT approach. And data cleansing most likely rules out the usage of message orientated middleware.
So earlier on I briefly mentioned the term SOA, or service orientated architecture. An SOA is a widely used approach for building and organizing software systems in a modular, scalable, and flexible manner. An SOA allows for the integration of independent applications and services, where each of them will have a well-defined interface, and by integrating different services and applications, we can create larger and more complex solutions. The term SOA has fallen out of fashion in recent years. However, the principles and practices behind it are highly relevant and widely used in modern software development.
Martin Fowler has a paper where he talks about microservices as a subset of an SOA, which I highly recommend.
So an SOA and microservices have a couple of things in common. When you exchange messages, you will typically want to have a contract that defines the data being exchanged. So you need to know that this is the type of data I'm expecting to receive and this is the type of data that I can provide you. We need contracts to control and manage the flow of data. This is also a really good way of creating tests, so we can have test cases against these contracts, and if we change a service and the test breaks, it means we've changed the contract, and we can't change the contract without notifying our subscribers.
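As a rough sketch of what testing against a contract could look like, here is a small Python example. The contract format (a plain dictionary of field names and types) and the two service versions are assumptions made up for this illustration. Real projects would more likely use a schema language or a contract testing tool.

```python
from typing import Any, Dict, List

# Hypothetical contract: the fields and types that subscribers rely on.
CUSTOMER_CONTRACT: Dict[str, type] = {"id": str, "name": str, "active": bool}


def violations(payload: Dict[str, Any], contract: Dict[str, type]) -> List[str]:
    """Return a list of readable contract violations (empty if the payload is OK)."""
    problems = []
    for field, expected in contract.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected):
            problems.append(f"wrong type for {field}")
    return problems


def customer_service_v1() -> Dict[str, Any]:
    # The version of the service that subscribers were built against.
    return {"id": "C42", "name": "Alice", "active": True}


def customer_service_v2() -> Dict[str, Any]:
    # A seemingly harmless refactor that renames a field breaks the contract.
    return {"id": "C42", "full_name": "Alice", "active": True}


v1_problems = violations(customer_service_v1(), CUSTOMER_CONTRACT)
v2_problems = violations(customer_service_v2(), CUSTOMER_CONTRACT)
```

Here the first version passes and the second one reports the missing field, which is the signal that subscribers need to be notified before the change goes out.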
To put it all into context, we have some commonly used integration patterns.
One pattern would be batch processing. This is a simple pattern that involves collecting data in batches and processing it in one go. This is best used when there's a low volume of data and no real time requirements.
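A minimal sketch of the collecting part of this pattern, in Python. The helper name in_batches and the batch size of three are just assumptions for the example.

```python
from typing import Iterable, Iterator, List


def in_batches(records: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group incoming records into lists of at most `size` items."""
    batch: List[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch


# Seven made-up records, processed three at a time.
records = [{"id": i} for i in range(7)]
batch_sizes = [len(batch) for batch in in_batches(records, size=3)]
```

Each batch would then be handed to whatever does the processing in one go, for example a bulk insert into a target table.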
Another pattern could be real time streaming. This is a pattern that involves sending data in real time as soon as it becomes available. It is best used when data needs to be processed as soon as it's generated and low latency is a requirement. Then we have message queuing. This is a pattern that involves sending messages into a queue, and then we process them as soon as resources are available.
This is best used when there's a high volume of data and also real-time requirements. However, we can't guarantee that our receivers are always there. So by using message queuing, we achieve what is known as temporal loose coupling, which is great because it means we grab a piece of data, we do something with it, and when we're done we put it on a queue, and whenever the receiver or receivers are ready, they can consume it in a timely fashion.
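Temporal loose coupling can be shown with a small Python sketch. An in-process queue from the standard library stands in for a real message broker here, and the order messages are invented for the example.

```python
import queue

# An in-process queue standing in for a real message broker.
outbox: "queue.Queue[dict]" = queue.Queue()


def producer() -> None:
    # The sender finishes its work and moves on; no receiver has to be up.
    for order_id in (1, 2, 3):
        outbox.put({"order_id": order_id})


def consumer() -> list:
    # The receiver drains whatever is waiting once it becomes available.
    received = []
    while True:
        try:
            received.append(outbox.get_nowait())
        except queue.Empty:
            return received


producer()                # the receiver is "offline" at this point
pending = outbox.qsize()  # the messages wait on the queue
handled = consumer()      # the receiver comes online later
```

The sender and the receiver never have to be running at the same moment. The queue holds the messages in order until someone is ready to consume them.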
Another pattern is to apply what we call a CDC solution or change data capture. This is a pattern that involves capturing changes to data in a database.
A CDC solution is a non-intrusive way of listening to changes in a database. It works by listening to the changes in a database log file, so it doesn't listen by creating triggers in the database. Creating triggers in a database, and creating an integration table or a shadow table, will typically increase the number of cursors you need in the database, which often causes problems with performance.
When we use CDC, which operates on the file system level and looks at the log files, we can achieve the same thing as using triggers. The problem with this approach is that it will also bypass your application's set of business rules, so to speak, because you will be getting all the events, regardless of what your application might do with these events to end up in a final state.
It's a good way to get data quickly from an underlying database, but it also requires a great deal of insight into how an application works.
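Here is a small Python sketch of why that insight matters. The change log below is invented, and its shape (operation, key, values) is an assumption for illustration; it is not the format of any real database log or CDC product.

```python
# Hypothetical change-log entries, in commit order, as a log-based CDC
# reader might emit them: (operation, primary key, row values).
change_log = [
    ("insert", 1, {"status": "draft"}),
    ("update", 1, {"status": "pending"}),
    ("update", 1, {"status": "approved"}),
    ("insert", 2, {"status": "draft"}),
    ("delete", 2, None),
]


def replay(log):
    """Apply every change in order; return (events seen, final table state)."""
    table = {}
    events = []
    for op, key, values in log:
        events.append((op, key))
        if op == "delete":
            table.pop(key, None)
        else:
            table[key] = values
    return events, table


events, final_state = replay(change_log)
```

The consumer sees five events, including a draft row that was created and deleted again, while the application would only ever present one approved row. Deciding which of those events mean something to the business is the part that requires knowing how the application works.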
In the end, the best integration pattern for a specific scenario will depend on your specific requirements and constraints, and the software systems involved. And as you heard, each type of tool will fit differently, and the caveat here is that if you use the wrong tool, I can guarantee you that the tool will not deliver on the promise of increased productivity.
In summary, message orientated middleware is also referred to, incorrectly, as an enterprise service bus. Remember, an enterprise service bus is a pattern. Message orientated middleware is used in service orientated architectures and microservice architectures, and it plays nice with data orientated middleware. Data orientated middleware is also referred to as ETL or ELT.
They are all popular approaches for integrating and processing data in modern enterprise architecture. Each of them has its own advantages and disadvantages, and the best approach depends on the specific requirements of a project.
I hope that by listening to this talk today, you have a better understanding of when to use the message orientated approach and when to use the data orientated approach. So when do you go to the application integration team, and when do you go to the data integration team to solve your project requirements?
Hopefully you'll be in a better position to have the discussion with your architects and developers when trying to decide where your project's requirements would be best solved.
That's it for this episode. I hope to have another episode ready within the next two weeks. And I would very much appreciate a comment, a like, or even a subscribe to the podcast series. Until next time. Thank you for listening. Happy integrating.