In a recent post, I talked about the merits and challenges of using Kibana and the ELK Stack for corporate “big data” management. I thought I would follow that up with a related commentary on Kafka.
Kafka is an open-source big data technology from the Apache stable. It provides real-time streaming of big data for a variety of data processing purposes, and it's part of the data pipeline for many big data projects. Kafka comes with built-in high-availability and scalability features that make it particularly suitable for larger-scale, mission-critical projects. The other attractive part is how data is fed into the queue – it provides what's called asynchronous queuing for log data. In plainer English, that means your API platform (e.g. a booking engine, CRS or PSS) can log information about a request or a response without having to wait for that data to be written to a data store, which, in computing terms, is very slow. In turn, that means the impact of logging on your API response times is minimised, and you can offset most of the added workload with some extra CPUs.
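To make the asynchronous-queuing idea concrete, here is a minimal sketch in plain Python (names and structure are my own illustration, standing in for Kafka rather than using its API): the request handler pushes a log record onto a queue and returns immediately, while a background thread drains the queue into the slow data store.

```python
import queue
import threading
import time

# Illustration only: an in-memory queue plays the role Kafka plays at scale.
log_queue = queue.Queue()
written = []  # stands in for a slow data store

def consumer():
    # Background writer: drains the queue and does the slow persistence work.
    while True:
        record = log_queue.get()
        if record is None:  # sentinel: shut down
            break
        time.sleep(0.01)    # simulate a slow write to the store
        written.append(record)
        log_queue.task_done()

threading.Thread(target=consumer, daemon=True).start()

def handle_request(request_id, payload):
    # Fire-and-forget: enqueue the log record without waiting for the write,
    # so the API response is not delayed by storage latency.
    log_queue.put({"id": request_id, "payload": payload})
    return {"status": "ok"}

for i in range(5):
    handle_request(i, f"booking-{i}")

log_queue.join()  # wait for the background writer to catch up
print(len(written))  # 5 records eventually persisted
```

The point of the pattern is that `handle_request` returns as soon as the record is enqueued; the latency of the write is paid by the background consumer, not by the caller.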
The Triometric team have just completed a Kafka-based project with a major travel organisation, so I can confirm that the technology does provide an alternative data capture mechanism. Whilst Kafka will never offer the ‘zero impact’ of Trio’s own network-based traffic capture, it is in the low-impact category and is dramatically better than most logging approaches.
It’s not surprising, therefore, that a brief search on Kafka usage throws up a list demonstrating wide-scale adoption amongst major organisations, including large travel-sector companies such as Skyscanner and Hotels.com.
So, what is Kafka being used for? Some examples are technology-centric, such as internally connecting micro-components in an event-processing architecture, whilst some of the more recognisable B2C organisations use it for more business-centric outcomes such as clickstream analytics for websites. Either way, there is not one mention of APIs.
That’s not that surprising. The Trio project mentioned above is undoubtedly a very early example. It’s about consuming hundreds of millions of discrete API requests and responses from a scaled-up Kafka queue with multiple feeding systems, matching each separated request with its response, and processing the XML content to show what products or services were asked for and what the resultant offer was. In short, it delivers real-time business reporting and alerting for an API. The great thing about this project is that the development of a “Kafka connector” with request/response matching technology has allowed us to deploy our very mature real-time, fully XML-aware, transactional data analytics technology as the meaningful, ROI-generating part of a Kafka pipeline. That’s definitely something worth writing about. This approach is now available as part of the Trio Data Engine.
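The request/response matching step can be sketched like this (a simplified illustration of the general technique, not Trio’s actual connector): messages from the queue arrive interleaved and out of order, and each request is paired with its response via a shared correlation id, with unmatched messages held in a pending map.

```python
# Hypothetical sketch of request/response matching on a mixed message stream.
# Each message carries a correlation id ("cid"); a message waits in a pending
# map until its counterpart arrives.

def match_stream(messages):
    pending = {}   # correlation id -> unmatched message
    matched = []
    for msg in messages:
        cid = msg["cid"]
        other = pending.pop(cid, None)
        if other is None:
            pending[cid] = msg  # first half of the pair: park it
        else:
            # Pair found: order the two halves as (request, response).
            req, resp = (other, msg) if other["kind"] == "request" else (msg, other)
            matched.append({"cid": cid,
                            "request": req["body"],
                            "response": resp["body"]})
    return matched, pending  # pending holds orphans (e.g. lost responses)

stream = [
    {"cid": "a1", "kind": "request",  "body": "search LHR->JFK"},
    {"cid": "b2", "kind": "request",  "body": "search CDG->MAD"},
    {"cid": "a1", "kind": "response", "body": "3 offers"},
]
matched, orphans = match_stream(stream)
print(len(matched), list(orphans))  # 1 ['b2']
```

In a real pipeline the pending map would also need eviction on a timeout, since responses can be lost; the orphans returned here are where such alerting would hook in.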
(This article first appeared on my profile on LinkedIn)