Welcome to the Data Engineering category of DZone, where you will find all the information you need for AI/ML, big data, data, databases, and IoT. As you determine the first steps for new systems or reevaluate existing ones, you're going to require tools and resources to gather, store, and analyze data. The Zones within our Data Engineering category contain resources that will help you expertly navigate through the SDLC Analysis stage.
Artificial intelligence (AI) and machine learning (ML) are two fields that work together to create computer systems capable of perception, recognition, decision-making, and translation. Separately, AI is the ability for a computer system to mimic human intelligence through math and logic, and ML builds off AI by developing methods that "learn" through experience and do not require instruction. In the AI/ML Zone, you'll find resources ranging from tutorials to use cases that will help you navigate this rapidly growing field.
Big data comprises datasets that are massive, varied, complex, and can't be handled traditionally. Big data can include both structured and unstructured data, and it is often stored in data lakes or data warehouses. As organizations grow, big data becomes increasingly more crucial for gathering business insights and analytics. The Big Data Zone contains the resources you need for understanding data storage, data modeling, ELT, ETL, and more.
Data is at the core of software development. Think of it as information stored in anything from text documents and images to entire software programs, and these bits of information need to be processed, read, analyzed, stored, and transported throughout systems. In this Zone, you'll find resources covering the tools and strategies you need to handle data properly.
A database is a collection of structured data that is stored in a computer system, and it can be hosted on-premises or in the cloud. As databases are designed to enable easy access to data, our resources are compiled here for smooth browsing of everything you need to know from database management systems to database languages.
IoT, or the Internet of Things, is a technological field that makes it possible for users to connect devices and systems and exchange data over the internet. Through DZone's IoT resources, you'll learn about smart devices, sensors, networks, edge computing, and many other technologies — including those that are now part of the average person's daily life.
Enterprise AI
In recent years, artificial intelligence has become less of a buzzword and more of an adopted process across the enterprise. With that comes a growing need to increase operational efficiency as customer demands rise. AI platforms have become increasingly sophisticated, and there is now a need to establish guidelines and ownership. In DZone’s 2022 Enterprise AI Trend Report, we explore MLOps, explainability, and how to select the best AI platform for your business. We also share a tutorial on how to create a machine learning service using Spring Boot, and how to deploy AI with an event-driven platform. The goal of this Trend Report is to better inform the developer audience on practical tools and design paradigms, new technologies, and the overall operational impact of AI within the business. This is a technology space that's constantly shifting and evolving. As part of our December 2022 re-launch, we've added new articles pertaining to knowledge graphs, a solutions directory for popular AI tools, and more.
One of Apache Kafka’s best-known mantras is “it preserves the message ordering per topic-partition,” but is it always true? In this blog post, we’ll analyze a few real scenarios where accepting the dogma without questioning it could result in unexpected and erroneous sequences of messages.

Basic Scenario: Single Producer

We can start our journey with a basic scenario: a single producer sending messages to an Apache Kafka topic with a single partition, in sequence, one after the other. In this basic situation, as per the known mantra, we should always expect correct ordering. But is it true? Well… it depends!

The Network Is Not Equal

In an ideal world, the single-producer scenario should always result in correct ordering. But our world isn’t perfect! Different network paths, errors, and delays could mean that a message gets delayed or lost. Let’s imagine the situation below: a single producer sending three messages to a topic:
- Message 1, for some reason, finds a long network route to Apache Kafka
- Message 2 finds the quickest network route to Apache Kafka
- Message 3 gets lost in the network

Even in this basic scenario, with only one producer, we could get an unexpected series of messages on the topic. The end result on the Kafka topic will show only two events being stored, with the unexpected ordering 2, 1. If you think about it, it’s the correct ordering from the Apache Kafka point of view: a topic is only a log of information, and Apache Kafka will write the messages to the log depending on when it “senses” the arrival of a new event. It’s based on Kafka ingestion time and not on when the message was created (event time).

Acks and Retries

But not all is lost! If we look into the producing libraries (aiokafka being an example), we have ways to ensure that messages are delivered properly. First of all, to avoid the problem with message 3 in the above scenario, we could define a proper acknowledgment mechanism. The acks producer parameter allows us to define what confirmation of message reception we want to have from Apache Kafka. Setting this parameter to 1 will ensure that we receive an acknowledgment from the primary broker responsible for the topic (and partition). Setting it to all will ensure that we receive the ack only if both the primary and the replicas correctly store the message, thus saving us from problems when only the primary receives the message and then fails before propagating it to the replicas. Once we set a sensible ack, we should enable retries in case we don’t receive a proper acknowledgment. Differently from other libraries (kafka-python being one of them), aiokafka will retry sending the message automatically until the timeout (set by the request_timeout_ms parameter) has been exceeded. With acknowledgment and automatic retries, we should solve the problem of message 3. The first time it is sent, the producer will not receive the ack. Therefore, after the retry_backoff_ms interval, it will send message 3 again.

Max In-Flight Requests

However, if you watch the end result in the Apache Kafka topic closely, the resulting ordering is not correct: we sent 1,2,3 and got 2,1,3 in the topic… how do we fix that? The old method (available in kafka-python) was to set the maximum in-flight requests per connection: the number of messages we allow to be “in the air” at the same time without acknowledgment. The more messages we allow in the air at the same time, the more risk of getting out-of-order messages.
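To make the producer settings above concrete, here is a minimal, hedged sketch using aiokafka; the broker address and topic name are illustrative placeholders, not values from the scenario:

import asyncio
from aiokafka import AIOKafkaProducer

async def produce():
    producer = AIOKafkaProducer(
        bootstrap_servers="localhost:9092",  # placeholder broker address
        acks="all",                # wait until primary and replicas store the message
        request_timeout_ms=30000,  # aiokafka keeps retrying until this timeout expires
        retry_backoff_ms=100,      # pause between automatic retries
        # enable_idempotence=True, # idempotent producer, discussed below
    )
    await producer.start()
    try:
        for i in (1, 2, 3):
            # send_and_wait returns only once the ack arrives (or the timeout is hit)
            await producer.send_and_wait("my_topic", f"Message {i}".encode())
    finally:
        await producer.stop()

asyncio.run(produce())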
When using kafka-python, if we absolutely needed to have a specific ordering in the topic, we were forced to limit the max_in_flight_requests_per_connection to 1. Basically, supposing that we set the ack parameter to at least 1, we were waiting for an acknowledgment of every single message (or batch of messages if the message size is less than the batch size) before sending the following one. The absolute correctness of ordering, acknowledgment, and retries comes at the cost of throughput. The fewer messages we allow to be “in the air” at the same time, the more acks we need to wait for, and the fewer overall messages we can deliver to Kafka in a defined timeframe.

Idempotent Producers

To overcome the strict serialization of sending one message at a time and waiting for acknowledgment, we can define idempotent producers. With an idempotent producer, each message gets labeled with both a producer ID and a serial number (a sequence maintained for each partition). This composed ID is then sent to the broker alongside the message. The broker keeps track of the serial number per producer and topic/partition. Whenever a new message arrives, the broker checks the composed ID, and if, within the same producer, the value is equal to the previous number + 1, then the new message is acknowledged. Otherwise, it is rejected. This provides a guarantee of the ordering of messages within a partition while allowing a higher number of in-flight requests per connection (a maximum of 5 for the Java client).

Increase Complexity With Multiple Producers

So far, we imagined a basic scenario with only one producer, but the reality in Apache Kafka is that often the producers will be multiple. What are the little details to be aware of if we want to be sure about the end ordering result?

Different Locations, Different Latency

Again, the network is not equal, and with several producers located in possibly very remote positions, the different latency means that the Kafka ordering could differ from the one based on event time. Unfortunately, the different latencies between different locations on Earth can’t be fixed. Therefore, we will need to accept this scenario.

Batching, an Additional Variable

To achieve higher throughput, we might want to batch messages. With batching, we send messages in “groups,” minimizing the overall number of calls and increasing the payload to overall message size ratio. But, in doing so, we can again alter the ordering of events. The messages in Apache Kafka will be stored per batch, depending on the batch ingestion time. Therefore, the ordering of messages will be correct per batch, but different batches could have different ordered messages within them. Now, with both different latencies and batching in place, it seems that our global ordering premise would be completely lost… So, why are we claiming that we can manage the events in order?

The Savior: Event Time

We understood that the original premise about Kafka keeping the message ordering is not 100% true. The ordering of the messages depends on the Kafka ingestion time and not on the event generation time. But what if the ordering based on event time is important? Well, we can’t fix the problem on the production side, but we can do it on the consumer side. All the most common tools that work with Apache Kafka have the ability to define which field to use as event time, including Kafka Streams, Kafka Connect with the dedicated Timestamp extractor single message transformation (SMT), and Apache Flink®.
Consumers, when properly defined, will be able to reshuffle the ordering of messages coming from a particular Apache Kafka topic. Let’s analyze the Apache Flink example below:

CREATE TABLE CPU_IN (
    hostname STRING,
    cpu STRING,
    usage DOUBLE,
    occurred_at BIGINT,
    time_ltz AS TO_TIMESTAMP_LTZ(occurred_at, 3),
    WATERMARK FOR time_ltz AS time_ltz - INTERVAL '10' SECOND
) WITH (
    'connector' = 'kafka',
    'properties.bootstrap.servers' = '',
    'topic' = 'cpu_load_stats_real',
    'value.format' = 'json',
    'scan.startup.mode' = 'earliest-offset'
)

In the above Apache Flink table definition, we can notice:
- occurred_at: the field is defined in the source Apache Kafka topic in unix time (datatype is BIGINT).
- time_ltz AS TO_TIMESTAMP_LTZ(occurred_at, 3): transforms the unix time into the Flink timestamp.
- WATERMARK FOR time_ltz AS time_ltz - INTERVAL '10' SECOND: defines the new time_ltz field (calculated from occurred_at) as the event time and defines a threshold for late arrival of events, with a maximum of 10 seconds delay.

Once the above table is defined, the time_ltz field can then be used to correctly order events and define aggregation windows, making sure that all events within the accepted latency are included in the calculations. The - INTERVAL '10' SECOND defines the latency of the data pipeline and is the penalty we need to include to allow the correct ingestion of late-arriving events. Please note, however, that the throughput is not impacted. We can have as many messages flowing in our pipeline as we want, but we’re “waiting 10 seconds” before calculating any final KPI in order to make sure we include in the picture all the events in a specific timeframe. An alternative approach, which works only if the events contain the full state, is to keep, for a certain key (hostname and cpu in the above example), the maximum event time reached so far, and only accept changes where the new event time is greater than that maximum.

Wrapping Up

The concept of ordering in Kafka can be tricky, even if we only include a single topic with a single partition. This post shared a few common situations that could result in an unexpected series of events. Luckily, options like limiting the number of messages in flight, or using idempotent producers, can help achieve an ordering in line with expectations. In the case of multiple producers and the unpredictability of network latency, the option available is to fix the overall ordering on the consumer side by properly handling the event time, which needs to be specified in the payload. Some further reading:
- Kafka Streams event time
- The Timestamp router SMT in Kafka Connect
Authentication is the process of identifying a user and verifying that they have access to a system or server. It is a security measure that protects the system from unauthorized access and guarantees that only valid users are using the system. Given the expansive nature of the IoT industry, it is crucial to verify the identity of those seeking access to its infrastructure. Unauthorized entry poses significant security threats and must be prevented. And that's why IoT developers should possess a comprehensive understanding of the various authentication methods. Today, I'll explain how authentication works in MQTT, what security risks it solves, and introduce the first authentication method: password-based authentication.

What Is Authentication in MQTT?

Authentication in MQTT refers to the process of verifying the identity of a client or a broker before allowing them to establish a connection or interact with the MQTT network. It is only about the right to connect to the broker and is separate from authorization, which determines which topics a client is allowed to publish and subscribe to. Authorization will be discussed in a separate article in this series. The MQTT broker can authenticate clients mainly in the following ways:
- Password-based authentication: The broker verifies that the client has the correct connecting credentials: username, client ID, and password. The broker can verify either the username or client ID against the password.
- Enhanced authentication (SCRAM): This authenticates the clients using a back-and-forth challenge-based mechanism known as Salted Challenge Response Authentication Mechanism.
- Other methods include token-based authentication like JWT, HTTP hooks, and more.

In this article, we will focus on password-based authentication.

Password-Based Authentication

Password-based authentication aims to determine if the connecting party is legitimate by verifying that it has the correct password credentials. In MQTT, password-based authentication generally refers to using a username and password to authenticate clients, which is also the recommended approach. However, in some scenarios, some clients may not carry a username, so the client ID can also be used as a unique identifier to represent the identity. When an MQTT client connects to the broker, it sends its username and password in the CONNECT packet. The example below shows a Wireshark capture of the CONNECT packet for a client with the corresponding values of client1, user, and MySecretPassword. After the broker gets the username (or client ID) and password from the CONNECT packet, it needs to look up the previously stored credentials in the corresponding database according to the username, and then compare them with the password provided by the client. If the username is not found in the database, or the password does not match the credentials in the database, the broker will reject the client's connection request. This diagram shows a broker using PostgreSQL to authenticate the client's username and password. Password-based authentication solves one security risk: clients that do not hold the correct credentials (username and password) will not be able to connect to the broker. However, as you can see in the Wireshark capture, a hacker who has access to the communication channel can easily sniff the packets and see the connect credentials because everything is in plaintext. We will see in a later article in this series how we can solve this problem using TLS (Transport Layer Security).
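As a hedged sketch of the client side, here is how the credentials from the capture above would be set with the paho-mqtt Python library (1.x callback API assumed; the broker address is a placeholder):

import paho.mqtt.client as mqtt

def on_connect(client, userdata, flags, rc):
    # rc == 0 means the broker accepted the credentials
    print("Connected with result code", rc)

client = mqtt.Client(client_id="client1")            # client ID from the capture above
client.username_pw_set("user", "MySecretPassword")   # travels in the CONNECT packet
client.on_connect = on_connect
client.connect("broker.example.com", 1883)           # placeholder broker, plain TCP (no TLS yet)
client.loop_forever()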
Secure Your Passwords With Salt and Hash

Storing passwords in plaintext is not considered a secure practice because it leaves passwords vulnerable to attacks. If an attacker gains access to a password database or file, they can easily read and use the passwords to gain unauthorized access to the system. To prevent this from happening, passwords should instead be stored in a hashed and salted format. What is a hash? It is a function that takes some input data, applies a mathematical algorithm to the data, and then generates an output that looks like complete nonsense. The idea is to obfuscate the original input data, and the function should also be one-way: there is no way to calculate the input given the output. However, hashes by themselves are not secure and can be vulnerable to dictionary attacks, as shown in the following example. Consider this sha256 hash: 8f0e2f76e22b43e2855189877e7dc1e1e7d98c226c95db247cd1d547928334a9 It looks secure; you cannot tell what the password is by looking at it. However, the problem is that for a given password, the hash always produces the same result. So, it is easy to create a database of common passwords and their hash values. A hacker could look up this hash in an online hash database and learn that the password is passw0rd. "Salting" a password solves this problem. A salt is a random string of characters that is added to the password before hashing. This makes each password hash unique, even if the passwords themselves are the same. The salt value is stored alongside the hashed password in the database. When a user logs in, the salt is added to their password, and the resulting hash is compared to the hash stored in the database. If the hashes match, the user is granted access. Suppose that we add a random string of text to the password before we perform the hash function. The random string is called the salt value. For example, with a salt value of az34ty1, sha256(passw0rdaz34ty1) is 6be5b74fa9a7bd0c496867919f3bb93406e21b0f7e1dab64c038769ef578419d This is unlikely to be in a hash database, since that would require a large number of hash entries just for the single plaintext passw0rd value.

Best Practices for Password-Based Authentication in MQTT

Here are some key takeaways from what we’ve mentioned in this article, which can serve as best practices for password-based authentication in MQTT: One of the most important aspects of password-based authentication in MQTT is choosing strong and unique passwords. Passwords that are easily guessable or reused across multiple accounts can compromise the security of the entire MQTT network. It is also crucial to securely store and transmit passwords to prevent them from falling into the wrong hands. For instance, passwords should be hashed and salted before storage, and transmitted over secure channels like TLS. In addition, it's a good practice to limit password exposure by avoiding hard-coding passwords in code or configuration files, and instead using environment variables or other secure storage mechanisms.

Summary

In conclusion, password-based authentication plays a critical role in securing MQTT connections and protecting the integrity of IoT systems. By following best practices for password selection, storage, and transmission, and being aware of common issues like brute-force attacks, IoT developers can help ensure the security of their MQTT networks.
However, it's important to note that password-based authentication is just one of many authentication methods available in MQTT, and may not always be the best fit for every use case. For instance, more advanced methods like digital certificates or OAuth 2.0 may provide stronger security in certain scenarios. Therefore, it's important for IoT developers to stay up-to-date with the latest authentication methods and choose the one that best meets the needs of their particular application. Next, I'll introduce another authentication method: SCRAM. Stay tuned for it!
When I first planned to start this blog, I had in mind to talk about my personal views on generative language technology. However, after sending my first draft to friends, family, and colleagues, it soon became clear that some background information about generative text AI itself was needed first. So, my first challenge is to offer an introduction and simple explanation about generative text AI. This is for all the folks who are flabbergasted by the wonder of generative text AI and wonder how and where the magic happens. Warning: if you’re not genuinely interested in it, this will be boring. Xavier, let's go.

What Is Artificial Intelligence, What Is Machine Learning, and What Is Deep Learning?

Artificial intelligence is the broader concept of machines being able to perform tasks that would require human intelligence. It’s just a computer program doing something intelligent. A system that uses if-then-else rules in order to make a decision is also artificial intelligence. If you build a loan expert system consisting of thousands of rules like “if spends more on coffee than on groceries, less chance to get a loan” or “if hides his/her online shopping from his/her significant other, less chance for the loan” to decide whether you get a loan, that system belongs to the artificial intelligence domain. These programs (although with slightly better rules) have already been around for 50+ years. Artificial intelligence is not something new.

Machine learning is the subset of AI that involves the use of algorithms and statistics to enable systems to learn from data and make predictions by themselves. Simply put, it is a way for computers to learn how to do things by themselves without being specifically programmed to do so, and without a human expert specifying all the domain knowledge like “if borrows money from parents to pay for Netflix subscription, less chance for a loan.” The system figures out what to do based on examples. Again, these — and even popular current approaches like neural networks — have already been around for 50+ years, although largely in academic settings.

Deep learning is the subset of machine learning that is usually associated with the use of (deep) artificial neural networks — the most popular approach to machine learning. We’ll explain artificial neural networks in the next section, but for now, think of them as a computer system that tries to work like our brains. Just like our brains have lots of little cells called neurons that work together to help us think and learn, an artificial neural network has lots of nodes that work together to solve problems. The term “deep” refers to the depth of these networks. The reason they appear more and more in the media and in real use cases is that, to make them work well, they need a lot of nodes and layers (hence the term “deep”), which in turn requires a lot of data and computational power. What ended the previous AI winter (= a period without major breakthroughs in AI) is mostly the fact that we have more data and computational power now. Some recent important algorithmic advances ushered in a new spring in the AI seasons, though with the data and computational power of 20 years ago, (almost) no impressive machine learning system would be used in a real product today.
These neural networks can figure out things like “if a person has a pet rock, wears a tutu to work, speaks only in rhymes, is allergic to money, has a fear of banks, but has initial influencer success, gets more chance for a loan” — but then even a thousand or million-fold more complex, and without the possibility of the deep net explaining itself in a way that’s interpretable by us humans. Xavier, next time I’ll leave it up to you to make the text more interesting to read. The link between AI, ML, and DL is represented like this in a Venn diagram:

What Are Artificial Neural Networks?

In this blog, we will often talk about generative text AI. Since every performant generative text AI belongs to the deep learning domain, this means we need to talk about artificial neural networks. Artificial neural networks (ANNs) are a type of machine learning model that is inspired by the human brain. They consist of layers of interconnected nodes or “neurons”: The neural network in the picture determines for a word whether it has a positive, neutral, or negative connotation. It has 1 input layer (that accepts the input word), 2 hidden layers (= the layers in the middle), and 1 output layer (that returns the output). A neural network makes predictions by processing and transmitting numbers to each next layer. The input word is transmitted to the nodes of the first hidden layer, where some new numbers are calculated that are passed to the next layer. To calculate these numbers, the weights of each connection are multiplied with the input values of each connection, after which another function (like sigmoid in the example) is applied to keep the passed numbers in a specific range. It is these weights that steer the calculations. Finally, the numbers calculated by the last hidden layer are transmitted to the output layer and determine what the output is. For our example, the end result is 0.81 for positive, 0.02 for negative, and 0.19 for neutral, indicating that the neural network predicts this word should be classified as positive (since 0.81 is the highest):

Technical Notes for Xavier

In the above animation, we do not explain why we represent the word “happy” with the numbers [4, 0.5, -1] in the input layer. We just ignore how to represent a word to the network for now — the important thing to get is that the word is represented to the network as numbers. We will cover later how we represent a word as numbers. Also, the ANN above returns a probability distribution. ANNs almost always work like this. They never say, “This is the exact output”; it is almost always a probability over all (or a lot of) the different options.

Training a Neural Network

How does the neural network know which calculations to perform in order to achieve a correct prediction? We can’t easily make up any rules or do calculations in some traditional way to end up with a correct probability distribution (when working with unstructured data such as a piece of text, an audio fragment, or an image). Well, the weights of the first version of a neural network are actually entirely wrong. They are initialized randomly. But by processing example after example and using an algorithm called backpropagation, it knows how to adjust its weights to get closer and closer to an accurate solution. This process is also called the training of the neural network. In order for this to work, the network needs to know the correct output value for each training example.
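To make the forward pass described above concrete, here is a tiny, hedged sketch with made-up weights. The numbers will not reproduce the 0.81/0.02/0.19 example from the picture; it only shows the mechanics of multiplying by weights and squashing with sigmoid:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Made-up input representation for "happy" and random, purely illustrative weights
x = np.array([4.0, 0.5, -1.0])
w_hidden1 = np.random.randn(3, 4)   # 3 inputs -> 4 nodes in the first hidden layer
w_hidden2 = np.random.randn(4, 4)   # 4 -> 4 nodes in the second hidden layer
w_output = np.random.randn(4, 3)    # 4 -> 3 output nodes (positive, negative, neutral)

h1 = sigmoid(x @ w_hidden1)         # multiply by weights, keep values in a fixed range
h2 = sigmoid(h1 @ w_hidden2)
output = sigmoid(h2 @ w_output)     # one score per class

print(output, "-> predicted class:", ["positive", "negative", "neutral"][output.argmax()])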
Without that known correct output, it couldn’t calculate its error and would never adapt its weights towards the correct solution. So, for our example above, we need (word -> classification) pairs as training data, like (happy: positive = 1; neutral = 0, negative = 0), to feed to the network during training. For other ANNs, this will be something different. For speech recognition, for example, this would be (speech fragment -> correctly transcribed text) pairs. If you need to have this training data anyway, why do you still need the neural network? Can’t you just do a lookup on the training data? For the examples you have, you can do such a lookup, but for most problems for which ANNs are used, no example is ever unique, and it is not feasible to collect the training data for all possible inputs. The point is that it learns to model the problem in its neurons, and it learns to generalize, so it can deal with inputs it didn’t see at training time too.

Small, Big, or Huge AI Models

You will hear people talk about small, big, or huge AI models. The size of an AI model refers to the number of parameters an artificial neural network has. A parameter is really just one of the weights you saw before (note for the purists: we are omitting the bias parameters here, and we are also not covering the hyperparameters of the network, which is something entirely different). Small ANN student assignments involve neural networks with only a few parameters. Many production language-related ANNs have about 100M parameters. The OpenAI GPT3 model reportedly has 175 billion parameters. The number of parameters is directly linked to the computational power you need and what the ANN can learn. The more parameters, the more computational power you need, and the more different things your ANN can learn. The GPT3 model’s size especially takes the crown: to store the parameters alone, you need 350 GB! To store all the textual data of Wikipedia, you only need 50 GB. This gives a sense of its ability to store detailed data, but it also gives an impression of how much computational power it needs compared to other models.

Transfer Learning

A popular thing to do — in the AI community (haha, Xavier, I preempted you there!) — is to use transfer learning: To understand how transfer learning works, you can think of the first hidden layers of an ANN as containing more general info for solving your problems. For language models, the first layers could represent more coarse features, such as individual words and syntax, while later layers learn higher-level features like semantic meaning, contextual information, and relationships between words. By only removing the last layers of the network, you can use the same network for different tasks without having to retrain the entire network. This means that larger models can be retrained for a particular task or domain by retraining only the last layers, which is much less computationally expensive and requires, in general, much less training data (often, a few thousand training samples are enough as opposed to millions). This makes it possible for smaller companies or research teams with less budget to do stuff with large models and democratizes AI models to a wider public (since the base models are usually extremely expensive to train).

How Are Words Represented in Generative Text AI?

In the article about artificial neural networks, we gave some high-level representation of what such a neural network is.
The deeper you dig into these neural networks, the more details you encounter, like why it often makes sense to prune nodes, or that there are several kinds of neurons, some of which keep different “states.” Or that there are even very different kinds of neural network layers and complex architectures consisting of multiple neural networks communicating with each other. But we will just focus on one question now, specifically for language-related ANNs. Let’s get back to the neural network we talked about before: In the example with the calculations, we presented the input word “happy” as [4, 0.5, -1] to demonstrate how a neural network works. Of course, this representation of the word happy was entirely made up and wouldn’t work in a real setting. So then, how do we represent a word?

1. Naive Approach: Take No Semantic Information Into Account

A natural reflex would be: what kind of question is that? By its letters, of course. You represent the word “word” with “word.” Simple! Well, it’s not. A machine only understands numbers, so we need to convert words into numbers. So, what is a good way to do this? To convert a word to numbers, a naive approach is to create a dictionary with each known word (or word part) and give each word a unique number. The word “horse” could be number 1122, and the word “person” could be number 1123.

Technical Note for Xavier

I’m taking a shortcut here: you actually need to convert the numbers to their one-hot vector encodings first in order for this approach to work. Otherwise, a relationship between words or letters is created that is just not there, e.g., “person” almost being the same as “horse” because their numbers are almost the same.

For many problems, such as predicting the next word or translating the next word in a sentence, it turns out the above approach does not work well. The important piece of the puzzle here is that this approach fundamentally lacks the ability to relate words to each other in a semantic way.

2. Better Approach: Take Static Semantic Information Into Account

Let’s consider the words “queen” and “king.” These words are very similar semantically — they basically mean the same thing except for gender. But in our approach above, the words would be entirely unrelated to each other. This means that whatever our neural network learns for the word “queen,” it will need to learn again for the word “king.” But we know words share many characteristics with other words — it doesn’t make sense to treat them as entirely separate entities. They might be synonyms or opposites, belong to the same semantic domain, or have the same grammatical category… Seeing each word as a standalone unit, or even just as the combination of its letters, is not the way to go — it certainly isn’t how a human would approach it. What we need is a numeric semantic representation that takes all these relations into account.

How Do We Get Such Semantic Word Representations?

The answer is: with an algorithm that’s based on which words appear together in a huge set of sentences. The details of it don’t matter. If you want to know more, read about Word2Vec, GloVe, and word embeddings. Using this approach, we get semantic representations of a word that capture its (static) meaning.
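As a hedged illustration of what such static embeddings let you do, here is a short sketch assuming the gensim library and one of its downloadable pretrained GloVe models:

import gensim.downloader as api

# Downloads a small pretrained GloVe model on first use
vectors = api.load("glove-wiki-gigaword-50")

print(vectors.similarity("king", "queen"))    # high similarity score for related words
print(vectors.most_similar("queen", topn=3))  # semantically close words
# The classic analogy query: king - man + woman ≈ ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))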
The simplified representations could be:
Queen: [-0.43 0.57 0.12 0.48 0.63 0.93 0.01 … 0.44]
King: [-0.43 0.57 0.12 0.48 -0.57 -0.91 0.01 … 0.44]

As you may notice, the representation of both words is the same (they share the same numbers), except for the fifth and sixth values, which might be linked to the different gender of the words.

Technical Note for Xavier

Again, we are simplifying — in reality, the embeddings of 2 such related words are very similar but never identical, since the training of an AI system never produces perfect results. The values are also never humanly readable; you can think of each number representing something like gender, word type, or domain… but when they are trained by a machine, it is much more likely the machine comes up with characteristics that are not humanly readable, or that a human-readable one is spread over multiple values.

It turns out that for tasks in which semantic meaning is important (such as predicting or translating the next word), capturing such semantic relationships between words is of vital importance. It is an essential ingredient to make tasks like “next word prediction” and “masked language modeling” work well. Although there are many more semantic tasks, I’m mentioning “next word prediction” and “masked language modeling” here because they are the primary tasks most generative text AI models are trained for. “Next word prediction” is the task of predicting the most likely word after a given phrase. For example, for the given phrase “I am writing a blog,” such a system would output a probability distribution over all possible next word (parts) in which “article,” “about,” or “.” are probably all very probable. “Masked language modeling” is a similar task, but instead of predicting the next word, the most likely word in the middle of a sentence is predicted. For a phrase like “I am writing an <MASK> article,” it would return a probability distribution over all possible words at the <MASK> position.

Technical Note for Xavier

These word embeddings are a great starting point to get your hands dirty with language-related machine learning. Training them is easy — you don’t need tons of computational power or a lot of annotated data. They are also relatively special in the sense that when training them, the actual output that is used is the weights, not the output of the trained network itself. This helps to get some deeper intuition on what ANN weights and their activations are. They also have many practical and fun applications. You can use them to query for synonyms of another word or for a list of words for a particular domain. By training them yourself on extra texts, you can control which words are known within the embeddings and, for example, add low-frequency or local words (very similar words have very similar embeddings). They can also be asked questions like “man is to computer programmer what woman is to …” Unfortunately, their answer, in this case, is likely to be something like “housewife.” So, they usually contain a lot of bias. But it has been shown that you can actually remove that bias once it is found, despite word embeddings being unreadable by themselves.

3. Best Approach: Take Contextual Semantic Information Into Account

However, this is not the end of it. We need to take the representation one step further. These static semantic embeddings we now have for each word are great, but we are still missing something essential: each word actually means something slightly different in each sentence.
A clear example of this is the sentences “The rabbit knew it had to flee from the fox” and “The fox knew it could attack the rabbit.” The same word “it” refers to 2 completely different things: one time the fox, one time the rabbit, and in both cases it’s essential info. Although “it” is an easy-to-understand example, the same is true for each word in a sentence, which is a bit more difficult to understand. It turns out that modeling the relationship of each word with all the other words in a sentence gives much better results. And our semantic embeddings do not model relationships like that — they don’t “look around” to the other words. So, we do not only need to capture the “always true” relationships between words, such as that king is the male version of the queen, but also the relationships between words in the exact sentences we use them in. The first way to capture such relationships was to use a special kind of neural network — recurrent neural networks — but the current state-of-the-art approach to this is using a transformer, which has an encoder that gives us contextual embeddings (based on the static ones) as opposed to the non-contextual ones we discussed in the previous section. This approach — first described in the well-referenced paper “Attention Is All You Need” — was probably the most important breakthrough for GPT3, ChatGPT, and GPT4 to work as well as they do today. Moving from a static semantic word representation to a contextual semantic word representation was the last remaining part of the puzzle when it comes down to representing a word to the machine.

What Is Generative Text AI?

Generative text AI is AI — an ANN, either a recurrent neural network or a transformer — that produces continuous text. Usually, the general base task on which these systems are trained is “next word prediction,” but it can be anything: translating a text, paraphrasing a text, correcting a text, making a text easier to read, predicting what came before instead of after… Oversimplified, generative text AI is next-word prediction. When you enter a prompt in ChatGPT or GPT4, like “Write me a story about generative text AI,” it generates the requested story by predicting word after word.

Technical Note for Xavier

To make generative AI work well, <START> and <END> tokens are usually added to the training data to teach the model where something begins and ends. Recent models also incorporate human feedback on top to be more truthful and possibly less harmful. And on top of next-word prediction, in-between word prediction or previous-word prediction can also be trained. But disregarding the slightly more complex architectures and other options, these systems are always just predicting word after word without any added reasoning on top whatsoever.

The basic idea behind generative text AI is to train the model on a large dataset of text, such as books, articles, or websites, to make it predict the next word. We don’t need labeled data for this since the “next word” is always available in the sentences of the texts themselves. When we don’t need human-labeled data, we talk about “unsupervised learning.” This has a huge advantage: since there exist trillions of sentences available for training, access to “big data” is easy. If we want to visualize such an ANN on a high level, it looks something like this: This is only a high-level representation. In reality, the most state-of-the-art architecture of these systems — the transformer — is quite complex.
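Before digging into the transformer, here is a minimal, hedged sketch of the “predict one word at a time” idea, assuming the Hugging Face transformers library and the small gpt2 model as a stand-in for a much larger system:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("I am working", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(10):
        logits = model(input_ids).logits
        next_id = logits[0, -1].argmax()  # greedy choice: the most probable next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))

Real systems usually sample from the probability distribution instead of always taking the top token, which is the “configuration option” for freedom mentioned further below.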
We already mentioned transformers in the previous section when we were talking about the encoder that produces the contextual embeddings. As we showed there, it is able to process text in a way that is much more complex and sophisticated than previous language models. The intuition about it is that:
- By considering the surrounding words and sentences of a word, it is much better able to understand the intended meaning.
- It works in a “bidirectional” manner — it can understand how the words in a sentence relate to each other both forwards and backwards, which can help it to better understand the overall meaning of a piece of text.

Technical Note for Xavier

This transformer architecture actually consists of both an encoder (e.g., “from static word embeddings to contextual embeddings”) and a decoder (e.g., “from contextual embeddings of the encoder and its own contextual embeddings to target task output”). The encoder is bidirectional, and the decoder is unidirectional (only looking at words that come before, not after). The explanation of this system is quite complex, and there are many variants that all use the concept of contextual semantic embeddings/self-attention. The GPT models are actually called "decoder-only" since they are only using the decoder part of a typical transformer, although "modified encoder only" would be a more suitable name imho. This video is a good explanation of Transformers — it’s one of the clearest videos explaining the current generative text AI breakthroughs. Following up on that, the blog article “The Illustrated Transformer” explains the transformer even more in-depth. And if you want to dig even deeper, this video explains how a transformer works at both training and inference times. Some more technical background is required for these, though.

The task a system like this learns is “next word prediction,” not “next entire phrase prediction.” It can only predict one word at a time. So, the ANN is run in different time steps, predicting the first next word first, then the second, and so on. As we mentioned before, the output for each run is again a probability distribution. So, for instance, if you feed the ANN the sentence “I am working,” the next word with the highest probability could be “in,” but it could be closely followed by “at” or “on.” One option is then selected.

Technical Notes for Xavier

This doesn’t really need to be the word with the highest probability — all implementations allow a configuration option to pick either the word with the highest probability or allow some freedom in this. Once the word is picked, the algorithm continues with predicting the next word. It goes on like this until a certain condition is met, like maximum tokens reached or something like a <STOP> token is encountered (which is a “word” the model learned at training time so it knows when to stop).

An important thing to realize here is the simplicity of the algorithm. The only thing it is really doing — albeit with the “smart, meaningful word and in-sentence relationship representations” the transformer is providing — is predicting the next word given some input text (“the prompt”) based on a lot of statistics. One cool thing about this generative text AI is that it doesn’t necessarily need to predict the next word. You can just as easily train it for “previous word prediction” or “in-between word prediction” — something that would be much more difficult for humans. You could pass it “Hi, here is the final chapter of my book.
Can you give me what comes before?” — and it would work just as well as the other way around.

Is ChatGPT Just Using Next Word Prediction?

In the previous article, we explained how GPT itself (like GPT3 and GPT3.5) works, not ChatGPT (or GPT4). So, what’s the difference? Base models for generative text AI are always trained on tasks like “next word prediction” (or “masked language modeling”). This is because the data to learn this task is abundant: all the sentences we produce on a daily basis are accessible in Internet databases, documents, and webpages — they are everywhere. No labeling is necessary. This is of vital importance. There is no feasible way to get a training set of trillions of manually labeled examples. But a task like “next word prediction” creates a capability vs. alignment problem. We literally train our model to predict the next word. Is this what we actually want or expect as a human? Most definitely not. Let’s consider some examples. What would be the next word for the prompt “What is the gender of a manager?” Since the base model was trained on different texts, many of which are decades old, we know for a fact that the training data contained a lot of bias relating to this question. Because of this, statistically, the output “male” will be much more probable than “female.” Or let’s ask it for the next word after “The United States went to war with Liechtenstein in.” Statistically, the most likely outputs are year numbers, not “an” (as you would need to reach “an alternate universe”) or any other word. But since the task it got was to actually predict the statistically most likely word given the expectations of the data it was trained with, it’s doing an awesome job here if it outputs some year, no? 100% correct, 100% capable. The problem is that we don’t really want it to predict the next most likely word. We want it to give a proper answer based on our human preferences. There is a clear divergence here between the way the model was trained and the way we want to use it. It’s inherently misaligned. Predicting the next word is not the same task as giving a truthful, harmless, helpful answer. This is exactly what ChatGPT tries to fix, and it tries to do so by learning to mimic human preference. In case you expect a big revelation now, like the thing learning to reason, or it somehow at least learning to reason about itself, you’re in for a disappointment: it’s actually just more of the same ANN stuff, but a little different. Very briefly, an approach using transfer learning (see before) was developed in which the GPT3.5 model was finetuned to “learn which responses humans like based on human feedback.” As we said before, transfer learning means freezing the first layers of the neural network (in the case of GPT-3, this is almost the entire model since even finetuning the last 10B parameters is prohibitively expensive). The first step was to create a finetuned model using standard transfer learning based on labeled data. A dataset of about 15,000 <prompt, ideal human response> pairs was created for this. This finetuned model can already start outputting responses that are more favored by humans (more truthful and helpful, with less risk of being harmful) than the base model. Creating the dataset for this model, however, was already a huge task — for each prompt, a human needed to do some intellectual work to make sure the response could be considered an “ideal human response” or at least be close enough.
The problem with this approach is that it doesn’t scale. The second step then was to learn a reward model. As we saw before, trained generative text AI outputs a probability distribution. So instead of asking it for the most probable next word, you can also ask it for the second most probable next word, for the 10 most probable next words, or for all the words above some probability threshold. You can sample it for as many responses to a prompt as you want. To get a dataset to train this reward model, 4 to 9 possible responses to each prompt were sampled, and for each prompt, humans ordered these responses from least favorable to most favorable. This approach scales much better than humans writing ideal responses to prompts manually. The fact that they needed to order the responses might make the reward model a bit unclear, but the reward model is as you would expect: it just outputs a score on a “human preference scale” for each text: the higher, the more preferable. The reason why they asked people to order the responses instead of just assigning a score is that different humans always give different scores on such “free text,” and using ordering + something like an ELO system (i.e., a standardized ranking system) works much better to calculate a consistent score for a response than manually assigning a number when multiple humans are involved in the scoring. In the third step, another finetuned model (finetuned from the model that was created in the first step) is created by using the reward model; this model is even better at outputting human-preferred responses. Step 1 is only done once; steps 2 and 3 can be repeated iteratively to keep improving the model. Using this transfer learning approach, we end up with a system — ChatGPT — that’s actually much better than the base model at creating responses preferred by humans, responses that are indeed more truthful and less biased. If you don’t get how this can work, the answer is in the embeddings again. This model has some kind of deep knowledge about concepts. So if we rank responses that contain a male/female bias consistently worse than ones without such bias, the model will actually see that pattern and apply it quite generally, and this bias will get quite successfully removed (except for cases where the output is entirely different, like when it is part of a computer program).
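Going back to the reward model of step two, here is a hedged sketch of the pairwise idea commonly described for such models (this is the general technique, not necessarily OpenAI’s exact implementation); reward_model is a hypothetical network ending in a single scalar output:

import torch.nn.functional as F

def pairwise_ranking_loss(reward_model, preferred_ids, rejected_ids):
    # The reward model scores each response on a "human preference scale"
    score_preferred = reward_model(preferred_ids)  # shape: (batch, 1)
    score_rejected = reward_model(rejected_ids)    # shape: (batch, 1)
    # -log(sigmoid(difference)) is small when the human-preferred response scores higher,
    # so minimizing it pushes the rankings chosen by humans into the model's scores
    return -F.logsigmoid(score_preferred - score_rejected).mean()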
In the previous article, we discussed the emergence of Data Lakehouses as the next-generation data management solution designed to address the limitations of traditional data warehouses and Data Lakes. Data Lakehouses combine the strengths of both approaches, providing a unified platform for storing, processing, and analyzing diverse data types. This innovative approach offers flexibility, scalability, and advanced analytics capabilities that are essential for businesses to remain competitive in today's data-driven landscape. In this article, we will delve deeper into the architecture and components of Data Lakehouses, exploring the interconnected technologies that power this groundbreaking solution.

The Pillars of Data Lakehouse Architecture

A Data Lakehouse is a comprehensive data management solution that combines the best aspects of data warehouses and Data Lakes, offering a unified platform for storing, processing, and analyzing diverse data types. The Data Lakehouse architecture is built upon a system of interconnected components that work together seamlessly to provide a robust and flexible data management solution. In this section, we discuss the fundamental components of the Data Lakehouse architecture and how they come together to create an effective and convenient solution for the end user. At the core of the Data Lakehouse lies unified data storage. This element is designed to handle various data types and formats, including structured, semi-structured, and unstructured data. The storage layer's flexibility is enabled through storage formats such as Apache Parquet, ORC, and Delta Lake, which are compatible with distributed computing frameworks and cloud-based object storage services. By unifying data storage, Data Lakehouses allow organizations to easily ingest and analyze diverse data sources without extensive data transformation or schema modifications. Another essential aspect of the Data Lakehouse architecture is data integration and transformation. Data Lakehouses excel at handling data ingestion and transformation from various sources by incorporating built-in connectors and support for a wide array of data integration tools, such as Apache Nifi, Kafka, or Flink. These technologies enable organizations to collect, transform, and enrich data from disparate sources, including streaming data, providing real-time insights and decision-making capabilities. By offering seamless data integration, Data Lakehouses help reduce the complexity and cost associated with traditional data integration processes. Metadata management is a critical component of a Data Lakehouse, facilitating data discovery, understanding, and governance. Data cataloging tools like Apache Hive, Apache Atlas, or AWS Glue allow organizations to create a centralized metadata repository for their data assets. A comprehensive view of data lineage, schema, relationships, and usage patterns provided by metadata management tools enhances data accessibility, ensures data quality, and enables better compliance with data governance policies. Data processing and analytics capabilities are also integral to the Data Lakehouse architecture. Unified query engines like Apache Spark, Presto, or Dremio provide a single interface for querying data using SQL or other query languages, integrating batch and real-time processing for both historical and live data.
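As a small, hedged illustration of such a unified SQL interface, here is a PySpark sketch; the bucket path, view name, and column names are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-sql-demo").getOrCreate()

# Register a (hypothetical) Parquet dataset as a SQL view and query it like a warehouse table
spark.read.parquet("s3://my-bucket/events/").createOrReplaceTempView("events")

daily_counts = spark.sql("""
    SELECT event_date, COUNT(*) AS events
    FROM events
    GROUP BY event_date
    ORDER BY event_date
""")
daily_counts.show()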
Moreover, Data Lakehouses often support advanced analytics and machine learning capabilities, making it easier for organizations to derive valuable insights from their data and build data-driven applications. Finally, data governance and security are crucial in any data-driven organization. Data Lakehouses address these concerns by providing robust data quality management features like data validation, data lineage tracking, and schema enforcement. Additionally, Data Lakehouses support role-based access control, which enables organizations to define granular access permissions to different data assets, ensuring that sensitive information remains secure and compliant with regulatory requirements.

Optimizing Storage Formats for Data Lakehouses

In a Data Lakehouse architecture, the storage layer is crucial for delivering high performance, efficiency, and scalability while handling diverse data types. This section will focus on the storage formats and technologies used in Data Lakehouses and their significance in optimizing storage for better performance and cost-effectiveness. Columnar storage formats such as Apache Parquet and ORC are key components of Data Lakehouses. By storing data column-wise, these formats offer improved query performance, enhanced data compression, and support for complex data types. This enables Data Lakehouses to handle diverse data types efficiently without requiring extensive data transformation. Several storage solutions have been developed to cater to the unique requirements of Data Lakehouses. Delta Lake, Apache Hudi, and Apache Iceberg are three notable examples. Each of these technologies has its own set of advantages and use cases, making them essential components of modern Data Lakehouse architectures. Delta Lake is a storage layer project explicitly designed for Data Lakehouses. Built on top of Apache Spark, it integrates seamlessly with columnar storage formats like Parquet. Delta Lake provides ACID transaction support, schema enforcement and evolution, and time travel features, which enhance reliability and consistency in data storage. Apache Hudi is another storage solution that brings real-time data processing capabilities to Data Lakehouses. Hudi offers features such as incremental data processing, upsert support, and point-in-time querying, which help organizations manage large-scale datasets and handle real-time data efficiently. Apache Iceberg is a table format for large, slow-moving datasets in Data Lakehouses. Iceberg focuses on providing better performance, atomic commits, and schema evolution capabilities. It achieves this through a novel table layout that uses metadata more effectively, allowing for faster queries and improved data management. The intricacies of Delta Lake, Apache Hudi, and Apache Iceberg, as well as their unique advantages, are fascinating topics on their own. In one of our upcoming articles, we will delve deeper into these technologies, providing a comprehensive understanding of their role in Data Lakehouse architecture. Optimizing storage formats for Data Lakehouses involves leveraging columnar formats and adopting storage solutions like Delta Lake, Apache Hudi, and Apache Iceberg. These technologies work together to create an efficient and high-performance storage layer that can handle diverse data types and accommodate the growing data needs of modern organizations.
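To make the ACID and time travel features mentioned above a bit more tangible, here is a hedged PySpark sketch. It assumes a Spark session with the delta-spark package on the classpath, and the table path is a placeholder:

from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("delta-demo")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "/tmp/demo-delta-table"
spark.range(0, 5).write.format("delta").mode("overwrite").save(path)  # creates version 0
spark.range(5, 10).write.format("delta").mode("append").save(path)    # ACID append, version 1

# Time travel: read the table as it looked at version 0
spark.read.format("delta").option("versionAsOf", 0).load(path).show()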
Embracing Scalable and Distributed Processing in Data Lakehouses Data Lakehouse architecture is designed to address modern organizations' growing data processing needs. By leveraging distributed processing frameworks and techniques, Data Lakehouses can ensure optimal performance, scalability, and cost-effectiveness. Apache Spark, a powerful open-source distributed computing framework, is a foundational technology in Data Lakehouses. Spark efficiently processes large volumes of data and offers built-in support for advanced analytics and machine learning workloads. By supporting various programming languages, Spark is a versatile choice for organizations implementing distributed processing. Distributed processing frameworks like Spark enable parallel execution of tasks, which is essential for handling massive datasets and complex analytics workloads. Data partitioning strategies divide data into logical partitions, optimizing query performance and reducing the amount of data read during processing. Resource management and scheduling are crucial for distributed processing in Data Lakehouses. Tools like Apache Mesos, Kubernetes, and Hadoop YARN orchestrate and manage resources across a distributed processing environment, ensuring tasks are executed efficiently and resources are allocated optimally. In-memory processing techniques significantly improve the performance of analytics and machine learning tasks by caching data in memory instead of reading it from disk. This reduces latency, resulting in faster query execution and better overall performance. Data Lakehouses embrace scalable and distributed processing technologies like Apache Spark, partitioning strategies, resource management tools, and in-memory processing techniques. These components work together to ensure Data Lakehouses can handle the ever-growing data processing demands of modern organizations. Harnessing Advanced Analytics and Machine Learning in Data Lakehouses Data Lakehouse architectures facilitate advanced analytics and machine learning capabilities, enabling organizations to derive deeper insights and drive data-driven decision-making. This section discusses the various components and techniques employed by Data Lakehouses to support these essential capabilities. First, the seamless integration of diverse data types in Data Lakehouses allows analysts and data scientists to perform complex analytics on a wide range of structured and unstructured data. This integration empowers organizations to uncover hidden patterns and trends that would otherwise be difficult to discern using traditional data management systems. Second, the use of distributed processing frameworks such as Apache Spark, which is equipped with built-in libraries for machine learning and graph processing, enables Data Lakehouses to support advanced analytics workloads. By leveraging these powerful tools, Data Lakehouses allow data scientists and analysts to build and deploy machine learning models and perform sophisticated analyses on large datasets. Additionally, Data Lakehouses can be integrated with various specialized analytics tools and platforms. For example, integrating Jupyter Notebooks and other interactive environments provides a convenient way for data scientists and analysts to explore data, develop models, and share their findings with other stakeholders.
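Tying the partitioning and in-memory processing ideas from this section to code, here is a minimal Java sketch with Spark. The paths and the event_date column are hypothetical, and the right partitioning and caching strategy always depends on the actual workload.
Java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.storage.StorageLevel;

public class PartitionAndCacheExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("partition-and-cache")
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> events = spark.read().parquet("s3a://lakehouse/events/");

        // Write the data partitioned by date so queries filtering on event_date
        // only read the relevant directories (partition pruning).
        events.write()
                .partitionBy("event_date")
                .mode("overwrite")
                .parquet("s3a://lakehouse/events_by_date/");

        // Cache a frequently used subset in memory to avoid repeated disk reads.
        Dataset<Row> recent = spark.read()
                .parquet("s3a://lakehouse/events_by_date/")
                .filter("event_date >= '2023-01-01'")
                .persist(StorageLevel.MEMORY_ONLY());

        System.out.println("recent events: " + recent.count()); // materializes the cache
        recent.unpersist();
        spark.stop();
    }
}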
To further enhance the capabilities of Data Lakehouses, machine learning platforms like TensorFlow, PyTorch, and H2O.ai can be integrated to support the development and deployment of custom machine learning models. These platforms provide advanced functionality and flexibility, enabling organizations to tailor their analytics and machine learning efforts to their specific needs. Lastly, real-time analytics and stream processing play an important role in Data Lakehouses. Technologies like Apache Kafka and Apache Flink enable organizations to ingest and process real-time data streams, allowing them to respond more quickly to market changes, customer needs, and other emerging trends. Ensuring Robust Data Governance and Security in Data Lakehouses Data Lakehouses prioritize data governance and security, addressing the concerns of organizations regarding data privacy, regulatory compliance, and data quality. This section delves into the various components and techniques that facilitate robust data governance and security in Data Lakehouses. Data cataloging and metadata management tools play a crucial role in establishing effective data governance within a Data Lakehouse. Tools such as Apache Atlas, AWS Glue, and Apache Hive provide centralized repositories for metadata, enabling organizations to track data lineage, discover data assets, and enforce data governance policies. Fine-grained access control is essential for maintaining data privacy and security in Data Lakehouses. Role-based access control (RBAC) and attribute-based access control (ABAC) mechanisms allow organizations to define and enforce user access permissions, ensuring that sensitive data remains secure and available only to authorized users. Data encryption is another key component of Data Lakehouse security. By encrypting data both at rest and in transit, Data Lakehouses ensure that sensitive information remains protected against unauthorized access and potential breaches. Integration with key management systems like AWS Key Management Service (KMS) or Azure Key Vault further enhances security by providing centralized management of encryption keys. Data Lakehouses also incorporate data quality and validation mechanisms to maintain the integrity and reliability of the data. Data validation tools like Great Expectations, data profiling techniques, and automated data quality checks help identify and address data inconsistencies, inaccuracies, and other issues that may impact the overall trustworthiness of the data. Auditing and monitoring are essential for ensuring compliance with data protection regulations and maintaining visibility into Data Lakehouse operations. Data Lakehouses can be integrated with logging and monitoring solutions like Elasticsearch, Logstash, and Kibana (the ELK Stack) or AWS CloudTrail, providing organizations with a comprehensive view of their data management activities and facilitating effective incident response. By prioritizing data privacy, regulatory compliance, and data quality, Data Lakehouses enable organizations to confidently manage their data assets and drive data-driven decision-making in a secure and compliant manner. Embracing the Data Lakehouse Revolution The Data Lakehouse architecture is a game-changing approach to data management, offering organizations the scalability, flexibility, and advanced analytics capabilities necessary to thrive in the era of big data.
By combining the strengths of traditional data warehouses and Data Lakes, Data Lakehouses empower businesses to harness the full potential of their data, driving innovation and informed decision-making. In this article, we have explored the key components and technologies that underpin the Data Lakehouse architecture, from data ingestion and storage to processing, analytics, and data governance. By understanding the various elements of a Data Lakehouse and how they work together, organizations can better appreciate the value that this innovative approach brings to their data management and analytics initiatives. As we continue our series on Data Lakehouses, we will delve deeper into various aspects of this revolutionary data management solution. In upcoming articles, we will cover topics such as a comparison of Delta Lake, Apache Hudi, and Apache Iceberg (three storage solutions that are integral to Data Lakehouse implementations), as well as best practices for Data Lakehouse design, implementation, and operation. Additionally, we will discuss the technologies and tools that underpin Data Lakehouse architecture, examine real-world use cases that showcase the transformative power of Data Lakehouses, and explore the intricacies and potential of this groundbreaking approach. Stay tuned for more insights and discoveries as we navigate the exciting journey of Data Lakehouse architectures together!
Previous Articles on CockroachDB CDC
Using CockroachDB CDC with Azure Event Hubs
Using CockroachDB CDC with Confluent Cloud Kafka and Schema Registry
SaaS Galore: Integrating CockroachDB with Confluent Kafka, Fivetran, and Snowflake
CockroachDB CDC using Minio as a cloud storage sink
CockroachDB CDC using Hadoop Ozone S3 Gateway as a cloud storage sink
Motivation Apache Pulsar is a cloud-native distributed messaging and streaming platform. In my customer conversations, it most often comes up when compared to Apache Kafka. I have a customer who needs Pulsar sink support because they rely on Pulsar's multi-region capabilities. CockroachDB does not have a native Pulsar sink; however, the Pulsar project supports the Kafka protocol through Kafka-on-Pulsar (KoP), and that's the core of today's article. This tutorial assumes you have an enterprise license; alternatively, you can leverage our managed offerings, where enterprise changefeeds are enabled by default. I am going to demonstrate the steps using a Docker environment instead. High-Level Steps
Deploy Apache Pulsar
Deploy a CockroachDB cluster with enterprise changefeeds
Deploy a Kafka consumer
Verify
Conclusion
Step-By-Step Instructions Deploy Apache Pulsar Since I'm using Docker, I'm relying on the KoP Docker Compose environment provided by StreamNative, the company that spearheads the development of Apache Pulsar. I've used the service taken from the KoP example almost as-is, aside from a few differences:
pulsar:
  container_name: pulsar
  hostname: pulsar
  image: streamnative/sn-pulsar:2.11.0.5
  command: >
    bash -c "bin/apply-config-from-env.py conf/standalone.conf && exec bin/pulsar standalone -nss -nfw" # disable stream storage and functions worker
  environment:
    allowAutoTopicCreationType: partitioned
    brokerDeleteInactiveTopicsEnabled: "false"
    PULSAR_PREFIX_messagingProtocols: kafka
    PULSAR_PREFIX_kafkaListeners: PLAINTEXT://pulsar:9092
    PULSAR_PREFIX_brokerEntryMetadataInterceptors: org.apache.pulsar.common.intercept.AppendIndexMetadataInterceptor
    PULSAR_PREFIX_webServicePort: "8088"
  ports:
    - 6650:6650
    - 8088:8088
    - 9092:9092
I removed PULSAR_PREFIX_kafkaAdvertisedListeners: PLAINTEXT://127.0.0.1:19092 as I don't need it, and I changed the exposed port from - 19092:9092 to - 9092:9092. My PULSAR_PREFIX_kafkaListeners address points to a Docker container with the hostname pulsar; I will need to access the address from other containers, so I can't rely on localhost. I'm also using a more recent version of the image than the one in their docs. Deploy a CockroachDB Cluster With Enterprise Changefeeds I am using a 3-node cluster in Docker. If you've followed my previous articles, you should be familiar with it. I am using Flyway to set up the schema and seed the tables. The actual schema and data are taken from the changefeed examples we have in our docs. The only difference is that I'm using a database called example. To enable CDC, we need to execute the following commands:
SET CLUSTER SETTING cluster.organization = '<organization name>';
SET CLUSTER SETTING enterprise.license = '<secret>';
SET CLUSTER SETTING kv.rangefeed.enabled = true;
Again, if you don't have an enterprise license, you won't be able to complete this tutorial. Feel free to use our Dedicated or Serverless instances if you want to follow along. Finally, after the tables and the data are in place, we can create a changefeed on these tables:
CREATE CHANGEFEED FOR TABLE office_dogs, employees INTO 'kafka://pulsar:9092';
Here I am using the Kafka port and the address of the Pulsar cluster, in my case pulsar.
job_id
----------------------
855538618543276035
(1 row)

NOTICE: changefeed will emit to topic office_dogs
NOTICE: changefeed will emit to topic employees

Time: 50ms total (execution 49ms / network 1ms)
Everything seems to work, and the changefeed does not error out. Deploy a Kafka Consumer To validate that data is being written to Pulsar, we need to stand up a Kafka client. I've created an image that downloads and installs Kafka. Once the entire Docker Compose environment is running, we can access the client and run the console consumer to verify.
/opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server pulsar:9092 --topic office_dogs --from-beginning
{"after": {"id": 1, "name": "Petee H"}}
{"after": {"id": 2, "name": "Carl"}}
If we want to validate that new data is flowing, let's insert another record into CockroachDB:
INSERT INTO office_dogs VALUES (3, 'Test');
The consumer will print a new row:
{"after": {"id": 3, "name": "Test"}}
Since we've created two topics, let's now look at the employees topic.
/opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server pulsar:9092 --topic employees --from-beginning
{"after": {"dog_id": 1, "employee_name": "Lauren", "rowid": 855539880336523267}}
{"after": {"dog_id": 2, "employee_name": "Spencer", "rowid": 855539880336654339}}
Similarly, let's update a record and see the changes propagate to Pulsar.
UPDATE employees SET employee_name = 'Spencer Kimball' WHERE dog_id = 2;
{"after": {"dog_id": 2, "employee_name": "Spencer Kimball", "rowid": 855539880336654339}}
Verify We've confirmed we can produce messages to Pulsar topics using the Kafka protocol via KoP. We've also confirmed we can consume using the Kafka console consumer. We can also use the native Pulsar tooling to confirm the data is consumable from Pulsar. I installed the Pulsar Python client, pip install pulsar-client, on the Kafka client machine and created a Python script with the following code:
import pulsar

client = pulsar.Client('pulsar://pulsar:6650')
consumer = client.subscribe('employees', subscription_name='my-sub')

while True:
    msg = consumer.receive()
    print("Received message: '%s'" % msg.data())
    consumer.acknowledge(msg)

client.close()
I execute the script:
root@kafka-client:/opt/kafka# python3 consume_messages.py
2023-04-11 14:17:21.761 INFO [281473255101472] Client:87 | Subscribing on Topic :employees
2023-04-11 14:17:21.762 INFO [281473255101472] ClientConnection:190 | [<none> -> pulsar://pulsar:6650] Create ClientConnection, timeout=10000
2023-04-11 14:17:21.762 INFO [281473255101472] ConnectionPool:97 | Created connection for pulsar://pulsar:6650
2023-04-11 14:17:21.763 INFO [281473230237984] ClientConnection:388 | [172.28.0.3:33826 -> 172.28.0.6:6650] Connected to broker
2023-04-11 14:17:21.771 INFO [281473230237984] HandlerBase:72 | [persistent://public/default/employees-partition-0, my-sub, 0] Getting connection from pool
2023-04-11 14:17:21.776 INFO [281473230237984] ClientConnection:190 | [<none> -> pulsar://pulsar:6650] Create ClientConnection, timeout=10000
2023-04-11 14:17:21.776 INFO [281473230237984] ConnectionPool:97 | Created connection for pulsar://localhost:6650
2023-04-11 14:17:21.776 INFO [281473230237984] ClientConnection:390 | [172.28.0.3:33832 -> 172.28.0.6:6650] Connected to broker through proxy.
Logical broker: pulsar://localhost:6650
2023-04-11 14:17:21.786 INFO [281473230237984] ConsumerImpl:238 | [persistent://public/default/employees-partition-0, my-sub, 0] Created consumer on broker [172.28.0.3:33832 -> 172.28.0.6:6650]
2023-04-11 14:17:21.786 INFO [281473230237984] MultiTopicsConsumerImpl:274 | Successfully Subscribed to a single partition of topic in TopicsConsumer. Partitions need to create : 0
2023-04-11 14:17:21.786 INFO [281473230237984] MultiTopicsConsumerImpl:137 | Successfully Subscribed to Topics
Let's insert a record into the employees table:
INSERT INTO employees (dog_id, employee_name) VALUES (3, 'Test');
UPDATE employees SET employee_name = 'Artem' WHERE dog_id = 3;
The Pulsar client output is as follows:
Received message: 'b'{"after": {"dog_id": 3, "employee_name": "Test", "rowid": 855745376561364994}}''
Received message: 'b'{"after": {"dog_id": 3, "employee_name": "Artem", "rowid": 855745376561364994}}''
Conclusion This is how you can leverage existing CockroachDB capabilities with non-standard services like Apache Pulsar. Hopefully, you've found this article useful and can start leveraging the existing Kafka sink with non-standard message brokers.
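As a final note, because KoP speaks the Kafka wire protocol, a plain Java Kafka consumer pointed at the same pulsar:9092 listener can replace the console consumer used above. A hedged sketch with the standard kafka-clients library follows; the consumer group name is arbitrary, and only the bootstrap address and topic come from the setup described earlier.
Java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ChangefeedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "pulsar:9092"); // the KoP listener
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "cockroach-changefeed-reader");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("employees"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Each value is the JSON envelope emitted by the changefeed.
                    System.out.println(record.topic() + " -> " + record.value());
                }
            }
        }
    }
}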
Any trustworthy data streaming pipeline needs to be able to identify and handle faults, especially when IoT devices continuously ingest critical data/events into permanent persistent storage like an RDBMS for future analysis via a multi-node Apache Kafka cluster. (Please click here to read how to set up a multi-node Apache Kafka cluster.) There could be scenarios where IoT devices send faulty/bad events for various reasons at the source, and appropriate actions can then be taken downstream to correct them. The Apache Kafka architecture does not include any filtering and error handling mechanism within the broker so that maximum performance/scale can be yielded. Instead, it is included in Kafka Connect, which is an integration framework of Apache Kafka. As a default behavior, if a problem arises as a result of consuming an invalid message, the Kafka Connect task terminates, and the same applies to the JDBC Sink Connector. Kafka Connect has been classified into two categories, namely Source (to ingest data from various data generation sources and transport it to the topic) and Sink (to consume data/messages from the topic and send them eventually to various destinations). Without implementing a strict filtering mechanism or exception handling, we can ingest/publish messages, including incorrectly formatted ones, to the Kafka topic because the Kafka topic accepts all messages or records as byte arrays in key-value pairs. But by default, the Kafka Connect task stops if an error occurs because of consuming an invalid message, and on top of that, the JDBC sink connector additionally won't work if there is ambiguity in the message schema. The biggest difficulty with the JDBC sink connector is that it requires knowledge of the schema of data that has already landed on the Kafka topic. Schema Registry must therefore be integrated as a separate component with the existing Kafka cluster to transfer the data into the RDBMS. Therefore, to sink data from the Kafka topic to the RDBMS, the producers must publish messages/data containing the schema. You can read here to learn about streaming data via the Kafka JDBC Sink Connector without leveraging Schema Registry from the Kafka topic. Since Apache Kafka 2.0, Kafka Connect has integrated error management features, such as the ability to reroute messages to a dead letter queue. In the Kafka cluster, a dead letter queue (DLQ) is a straightforward topic that serves as the destination for messages that, for some reason, were unable to reach their intended recipients; for the JDBC sink connector, those recipients are the tables in the RDBMS. There are two major reasons why the JDBC Kafka sink connector may stop working abruptly while consuming messages from the topic:
Ambiguity between data types and the actual payload
Junk data in the payload or a wrong schema
Configuring a DLQ for the JDBC sink connector is not complicated. The following parameters need to be added to the sink configuration file (.properties file):
errors.tolerance=all
errors.deadletterqueue.topic.name=<<Name of the DLQ topic>>
errors.deadletterqueue.topic.replication.factor=<<Number of replicas>>
Note: The number of replicas should be equal to or less than the number of Kafka brokers in the cluster. The DLQ topic will be created automatically with the above-mentioned replication factor when we start the JDBC sink connector for the first time.
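Once records start landing in the DLQ, a quick way to see what ended up there is a plain Kafka consumer that prints each record along with its headers (if the connector is also configured to add error context headers, they typically describe why the record failed). The following is only a hedged sketch: the bootstrap address, group id, and DLQ topic name are hypothetical placeholders.
Java
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.header.Header;
import org.apache.kafka.common.serialization.StringDeserializer;

public class DlqInspector {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // adjust to your cluster
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "dlq-inspector");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("jdbc-sink-dlq")); // hypothetical DLQ topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println("Bad record payload: " + record.value());
                    // Print any headers attached to the record; with error context enabled on
                    // the connector, these usually carry the failure reason.
                    for (Header header : record.headers()) {
                        System.out.println("  " + header.key() + " = "
                                + new String(header.value(), StandardCharsets.UTF_8));
                    }
                }
            }
        }
    }
}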
When an error occurs or bad data is encountered by the JDBC sink connector while consuming messages from the topic, these unprocessed messages/bad data are forwarded straight to the DLQ. Correct messages or data continue to flow to the respective RDBMS tables; whenever bad messages are encountered in between, they too are forwarded to the DLQ, and so on. After the bad or erroneous messages land on the DLQ, we have two options: either manually introspect each message to understand the root cause of the error (for example, with a small consumer like the sketch above), or implement a mechanism to reprocess the bad messages and eventually push them to their consumers; for the JDBC sink connector, that destination is the RDBMS tables. Because this extra handling is left to the team, dead letter queues are not enabled by default in Kafka Connect. Even though Kafka Connect supports several error-handling strategies, such as dead letter queues, silently ignoring, and failing fast, adopting a DLQ is the best approach when configuring the JDBC sink connector. Completely decoupling the handling of bad/error messages from the normal transport of messages/data from the Kafka topic boosts the overall efficiency of the entire system and allows the development team to build an independent error-handling mechanism that is easy to maintain. Hope you have enjoyed this read. Please like and share if you feel this composition is valuable.
Amazon S3 is the most commonly used managed storage solution in AWS. It provides object storage in a highly scalable and secure way. AWS guarantees 11 9s for its durability. Objects stored in S3 are shared access objects; shared meaning they can be accessed by different clients at the same time. S3 provides low latency for data, and it has high throughput (able to move data in and out of S3 quickly). S3 is highly available, durable, and can encrypt data. S3 provides access management, lifecycle management, and the ability to query in place (no need to move data to a data lake). Static website hosting is another very popular feature of S3. The following storage tiers are offered by S3:
Standard
Intelligent-Tiering (saves cost)
Standard-IA (storage cost is reduced; there is a cost for reads)
One Zone-IA (lowest price, but lower availability/durability since data lives in a single AZ)
For more information, please visit the official website at this link. Application Requirements Persisting application data is a common requirement in software/services development. There are all kinds of data storage solutions available. RDBMS and NoSQL solutions are very popular for some applications, and at the same time, storing data in files is also very common. For our application, let's assume we are asked to store application data in files, and we decided to use the AWS S3 storage service for this purpose. Technical Stack We will be building a web API using the .NET6 tech stack. The application will allow users to perform typical CRUD operations for the Note entity. It is a very simple application with the bare minimum to keep our focus on data storage in AWS S3 from a .NET application. The source code for the application is available on this GitHub repository. We will be performing various actions on the AWS cloud and in our .NET application, as follows:
AWS Resources Setup
Create an S3 bucket.
Create an IAM user and access keys.
Create a policy and attach it to the user.
.NET6 Application Wiring
Create a .NET6 application.
Wire the .NET6 application with AWS resources.
.NET6 Application Code
Write application code for CRUD operations on S3 objects.
Let's start with these steps next. Creating an S3 Bucket S3 buckets are created in a region. The S3 bucket name has to be globally unique. The reason for that is that a DNS record is going to be assigned to each bucket we create. For all other fields, we can use the defaults for now. Once the bucket is created, the next step is to set up access to this bucket for our application. Creating a User for Application Access In order for our application to access the S3 bucket, we will need to create a new user in the AWS IAM service and give this user access to our S3 bucket (you can create a role instead if you like). In the following screenshot, I created a user for our application. For other fields, use the default settings. We'll also need to set up programmatic access for this user (access keys), but we will do that a little bit later. Next, let's create a policy that will allow permissions for the bucket. Creating a Policy From the IAM policies screen, we can create a custom policy as follows. This policy allows certain actions on the bucket we created earlier. Next, we can attach this policy to the application user. Attach Policy to User From the IAM dashboard, we can edit the user to attach the policy as shown below. Create Access Keys for User We also need to create access keys for our user, which we will use in our .NET application later.
The following screen from the IAM user shows this step. For now, I selected Local code, as I will be running the application from my development machine. However, you can try other options if you like. Here is the screen showing the access keys created for the user. Note down these keys somewhere for later reference. So, up to this point, we have created an S3 bucket to store our application data files. We then created a user for our application. We created a custom policy to allow certain action permissions on the S3 bucket and attached this policy to our application user. We also created the access keys for this user, which we will use in our .NET application later. The AWS resource setup part is done, and we can now start working on the .NET6 application part. Create .NET6 Web API Application To start, I created a .NET6 application using a Visual Studio built-in template, as shown below. .NET6 Application Wiring With S3 To wire up the application, the following configurations are set up in the appsettings.json file, along with the corresponding C# class to read these configs (note that these configs point to the bucket we created earlier and use the AWS access keys from the user we set up earlier). In Program.cs, we then read the configs and populate the S3Settings class as shown below. At this point, we have wired up the configuration needed by our application to connect to and access S3. With all this done, we can now focus on writing the application code for the REST API, and we will do that in the next post. Summary In this post, we started with a basic introduction to S3 and discussed how S3 can be used for our application's data storage requirement. We created an S3 bucket to store our application data, and we also created a user and a policy and attached the policy to the user. We then created access keys for this user to use in our .NET application later. Next, we created a very basic .NET6 application and did some configuration wiring to make it ready to access our S3 bucket. In the next post, we will write our application code to allow users to perform CRUD operations, which will result in creating various files in the S3 bucket, and we will be able to access those files in our application. The source code is available on this GitHub repository. Let me know if you have some comments or questions.
Small and medium-sized enterprises (SMEs) face unique challenges in their operations, especially when it comes to IT infrastructure. However, with the right technology stack, SMEs can streamline their operations, reduce costs, and achieve success in the ever-changing business landscape. In this blog post, we will explore how SMEs can use CRM, AWS, Kafka, and the Snowflake data warehouse in their IT architecture to achieve the best solution with a scalable and advanced technology stack. Reference data stream system: Unlocking the Potential of IoT Applications. AWS (Amazon Web Services) AWS is a cloud computing platform that provides a range of services to businesses, including compute power, storage, and databases. SMEs can use AWS to reduce IT costs by only paying for what they use, improve scalability and flexibility, enhance security and compliance, and streamline operations by automating tasks. With AWS, SMEs can leverage cloud-based services and manage their infrastructure with greater agility and efficiency. This enables SMEs to access advanced computing resources without the need to maintain their own servers or hardware. Once you have an AWS account, you can use Terraform to provision your infrastructure on AWS. Here's some sample Terraform code to create a VPC, subnet, security group, and EC2 instance:
provider "aws" {
  region = "us-west-2"
}

resource "aws_vpc" "example" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "example" {
  vpc_id     = aws_vpc.example.id
  cidr_block = "10.0.1.0/24"
}

resource "aws_security_group" "example" {
  name_prefix = "example"
  vpc_id      = aws_vpc.example.id

  ingress {
    from_port   = 0
    to_port     = 65535
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 65535
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_instance" "example" {
  ami                    = "ami-0c55b159cbfafe1f0"
  instance_type          = "t2.micro"
  subnet_id              = aws_subnet.example.id
  vpc_security_group_ids = [aws_security_group.example.id]

  tags = {
    Name = "example-instance"
  }
}
Kafka Kafka is an open-source event streaming platform that enables SMEs to process, store, and analyze large volumes of data in real time. Kafka can help SMEs to build real-time data pipelines, streamline data processing and analysis, scale horizontally to handle increasing data volumes, and integrate with other systems and tools seamlessly. By implementing Kafka in their IT architecture, SMEs can manage their data with greater efficiency, reduce processing times, and make faster, data-driven decisions. CRM (Customer Relationship Management) A CRM system is a must-have for any business that wants to manage customer interactions effectively. It enables SMEs to organize customer data, track sales, and improve customer relationships. However, with a scalable CRM system, SMEs can achieve more than just managing customer interactions. They can achieve a more comprehensive view of their customer base, allowing them to make data-driven decisions about marketing and sales efforts and automate key sales processes. Snowflake Data Warehouse Snowflake is a cloud-based data warehousing platform that enables SMEs to store and analyze large volumes of data in a scalable, cost-effective, and secure manner. Snowflake can help SMEs to improve data governance and compliance, reduce data storage and processing costs, achieve near real-time data processing and analysis, and integrate with other tools and platforms seamlessly.
With Snowflake, SMEs can manage their data more effectively and get insights faster than with traditional data warehousing platforms. Java
// Example Java code that reads messages from a Kafka topic and writes them to
// Snowflake Data Warehouse using the Snowflake JDBC driver
import java.sql.*;
import java.time.Duration; // required for consumer.poll(Duration.ofMillis(...))
import java.util.*;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.*;
import org.apache.kafka.common.serialization.*;

public class KafkaToSnowflake {
    public static void main(String[] args) throws Exception {
        // Kafka configuration
        String topic = "my-topic";
        String brokers = "localhost:9092";
        String groupId = "my-group";

        // Snowflake configuration
        String snowflakeUrl = "jdbc:snowflake://my-account.snowflakecomputing.com";
        String snowflakeUser = "my-user";
        String snowflakePassword = "my-password";
        String snowflakeWarehouse = "my-warehouse";
        String snowflakeDatabase = "my-database";
        String snowflakeSchema = "my-schema";
        String snowflakeTable = "my-table";

        // Kafka consumer configuration
        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers);
        consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);

        // Snowflake JDBC configuration
        Class.forName("net.snowflake.client.jdbc.SnowflakeDriver");
        Properties snowflakeProps = new Properties();
        snowflakeProps.put("user", snowflakeUser);
        snowflakeProps.put("password", snowflakePassword);
        snowflakeProps.put("warehouse", snowflakeWarehouse);
        snowflakeProps.put("db", snowflakeDatabase);
        snowflakeProps.put("schema", snowflakeSchema);
        Connection snowflakeConn = DriverManager.getConnection(snowflakeUrl, snowflakeProps);

        // Subscribe to the Kafka topic
        consumer.subscribe(Collections.singleton(topic));

        // Start consuming messages from the Kafka topic and inserting them into Snowflake
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
                String message = record.value();
                System.out.println("Received message: " + message);

                // Insert the message into Snowflake
                PreparedStatement stmt = snowflakeConn.prepareStatement(
                        "INSERT INTO " + snowflakeTable + " (message) VALUES (?)"
                );
                stmt.setString(1, message);
                stmt.execute();
            }
        }
    }
}
Conclusion Small and Medium Enterprises (SMEs) must adopt advanced technologies to remain competitive. By leveraging cutting-edge solutions like Customer Relationship Management (CRM), Amazon Web Services (AWS), Kafka, and the Snowflake data warehouse, SMEs can streamline their operations, reduce costs, and achieve success. One of the most significant benefits of CRM systems is the ability to manage customer interactions and automate processes. With CRM, SMEs can improve customer satisfaction, leading to higher sales and increased customer loyalty. The system also offers valuable insights into customer behavior, enabling SMEs to make informed decisions and improve the sales process. AWS offers a wide range of cloud-based services that can help SMEs reduce infrastructure costs, increase scalability, and improve flexibility. These services can be tailored to meet the specific needs of an SME, allowing them to manage their IT infrastructure with greater agility and efficiency.
AWS is ideal for SMEs with limited budgets, as it offers affordable pricing and payment options. Kafka is an open-source platform that enables the real-time streaming of data between systems. This makes it an ideal tool for processing large volumes of data and improving operational efficiency. Kafka can help SMEs make more informed decisions by providing real-time insights into business operations and customer behavior. This can help SMEs remain competitive in the fast-paced business environment. Snowflake data warehouse is a scalable, cloud-based solution for storing and analyzing data. With Snowflake, SMEs can reduce their data management costs, improve data quality, and gain real-time insights into their business operations. The system offers high-performance analytics and can handle large amounts of data, making it an ideal solution for SMEs with significant data management needs. Implementing a scalable and advanced technology stack can help SMEs leverage cloud-based services and manage their infrastructure with greater efficiency. A well-designed technology stack can help SMEs stay competitive, achieve success, and grow their business. However, SMEs must also consider the challenges associated with adopting modern technologies, such as data security, privacy, and compliance. They should ensure that their IT infrastructure and processes are designed to meet the highest standards of security and compliance. In conclusion, modern technologies like CRM, AWS, Kafka, and Snowflake data warehouse offer SMEs a multitude of benefits. These cutting-edge solutions can help SMEs streamline their operations, reduce costs, and achieve success in the ever-changing business landscape. By implementing a scalable and advanced technology stack, SMEs can leverage cloud-based services and manage their infrastructure with greater agility and efficiency. With the right technology stack, SMEs can achieve success, grow their business, and make the most of their IT resources. SMEs that invest in modern technologies and implement them strategically will be best positioned to compete in today's digital economy.
If you're developing applications for a business, then one of your most important tasks is collecting payment for goods or services. Sure, providing those goods or services is essential to keeping customers happy. But if you don’t collect payments, your business won’t be around for very long. In the dev world, when we talk about infrastructure, we often consider the resiliency of our servers and APIs. We don’t talk about payment processing in the same way. But we should. Payment processing is something companies take for granted as long as it’s working smoothly. Once they’ve put some sort of solution in place, the cash starts flowing. Then, they forget about it until they encounter issues with their payment processor, or need to expand into a new region. With steady cashflows being essential to so many businesses, it’s worth thinking about resiliency for this critical piece of your business operations. In this post, we’ll look at a few reasons businesses should put time into improving the resiliency of their payment processing, and how to approach this problem from a technical perspective. Why Bother Building Resiliency Into Payments? If your company is like most others, you’ve probably been using a single payment processor for a while. In that case, you might ask: Why should I build more resiliency than my processor already has in place? After all, that’s why you pay them their processing fees. It’s up to them to make sure things work properly. Even if you set aside the present resiliency of whatever payment processor you’re using, you’ll still find many benefits from adding more processing options to your application. Of course, this isn’t possible if all of your customer PCI data is stored with a single payment processor, so you’ll need technical solutions to allow you to work with multiple payment processors without increasing your PCI compliance scope. Potential Cost Benefits For example, if you have only a single payment processor, you’re stuck paying whatever fees they charge you for the transactions you send their way. If you have multiple processors in place, you can route payments based on whichever service charges the lowest transaction cost and has the highest authorization rates. Maybe one processor has better pricing on higher volumes of transactions, but a different one has better rates for high-amount transactions. In this situation, you could send the majority of your customers’ purchases through the higher volume processor, but send large transactions through the processor that gives you better rates based on the individual payment amount. It’s a great way to boost profits without passing along costs to your customers. Overcoming Geographical or Regional Restrictions Certain individual payment processors may be constrained by geographical restrictions, giving you the ability to process payments only from specific countries. If you’re seeking to expand your business into other markets, you’ll encounter less friction if you already have several options on hand. This way, you can route customer payments to specific processors based on their location. It’s also possible that you can benefit from different processing costs across different regions, continuing to find savings from that difference. 
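To make the routing idea concrete, here is a hedged, deliberately simplified Java sketch of a payment router that picks the cheapest processor available in the customer's region and falls back to the next candidate on a decline or outage. The PaymentProcessor interface and every name in it are hypothetical; a real integration would work with tokenized card data held in a vault rather than raw card details, as discussed later in this article.
Java
import java.math.BigDecimal;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical abstraction over a payment gateway SDK; real processor APIs differ.
interface PaymentProcessor {
    String name();
    boolean supportsRegion(String countryCode);
    BigDecimal feeFor(BigDecimal amount);
    boolean charge(String cardToken, BigDecimal amount); // false on decline or outage
}

public class PaymentRouter {

    private final List<PaymentProcessor> processors;

    public PaymentRouter(List<PaymentProcessor> processors) {
        this.processors = processors;
    }

    // Route to the cheapest processor that serves the customer's region,
    // and fall back to the next candidate if the charge fails.
    public boolean charge(String cardToken, BigDecimal amount, String countryCode) {
        List<PaymentProcessor> candidates = processors.stream()
                .filter(p -> p.supportsRegion(countryCode))
                .sorted(Comparator.comparing((PaymentProcessor p) -> p.feeFor(amount)))
                .collect(Collectors.toList());

        for (PaymentProcessor processor : candidates) {
            if (processor.charge(cardToken, amount)) {
                return true; // settled with the cheapest healthy processor
            }
            // Otherwise try the next candidate (failover).
        }
        return false; // every eligible processor declined or was unavailable
    }
}
In practice, the routing rules would also factor in volume tiers, authorization rates, and live health signals, but the core decision loop looks much like this sketch.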
Greater Control Over the Process Another benefit of adding more payment processing services to your stack is that you gain greater control over the details of how your payments are processed. By controlling how payments are processed in your systems, you can run more analytics to better understand the types of purchases your customers are making. With these insights, you can make even better decisions about which processors should receive your transactions. Greater control also means that you can make sure you provide a better customer experience when any one of your payment processors experiences an outage. If you only have a single payment processor and it experiences an outage, then you'll be unable to accept payments. You'll be scrambling to find a workaround. And for companies that make most of their annual business during a few key days, like many US retailers who rely on Black Friday shopping, such an outage can be disastrous. If your business already has control over your payments stack, then you can design your system to fail over automatically if transaction decline rates increase. Along with protecting your sales, you'll also benefit your customers by ensuring that they have a seamless payments experience, even if you're rerouting payments to a backup processor behind the scenes. With payment processing resiliency, your customers will experience no problems even as you fail over to another payment processor. How Do You Build for Payment Processing Resiliency? By now, you're probably thinking: How is this even possible? Doesn't PCI compliance require passing your customers' payment information straight from your purchase page to your payment processor, bypassing any systems that aren't PCI compliant? At the very least, wouldn't introducing this type of resiliency widen the scope of compliance, causing headaches for any business that has offloaded PCI compliance to a single payment processor? If you don't have the right technology, then yes, you definitely could end up increasing how much of your infrastructure falls under the scope of PCI compliance. Fortunately, using an architectural approach to this problem with a data privacy vault provides business flexibility without adding to your PCI compliance scope. But without a good solution for keeping customer financial data safe, your hands are tied. Instead of being able to enjoy true payment processing resiliency, you have to hope that your current payment processor is resilient. But there's a better way. Using a well-designed data privacy vault can unlock all of the benefits described above. A data privacy vault lets you isolate, protect, and govern any type of sensitive information, including PCI data, while still remaining fully PCI compliant and without increasing your PCI compliance scope. Instead of introducing greater risk, data privacy vaults significantly reduce the risk of using PCI data to process payments and help you to ensure PCI compliance across your business applications. By enabling you to completely separate sensitive data (not just financial information, but also PHI and PII) from the rest of the transactional data in your systems, a data privacy vault gives your sensitive PCI data an extra layer of protection while easing compliance with data privacy regulations. An Example Implementation What does it look like to implement this technology in your systems? We'll outline one example, from Skyflow (described in more detail here), looking at some of their diagrams to illustrate how this works.
To start with, let’s consider what a single credit card transaction looks like. How PCI Tokenization Works for Card Transactions (source) From the outset, when a merchant seeks to carry out a transaction with a credit card, the credit card data they send to the processor is already tokenized. This point is significant as it relates to introducing other processors since it means that you can store tokens in your systems, rather than sensitive, plain-text PCI data. With these tokens, you can reference the PCI data that’s stored in the vault — credit card details like the PANs and expiration dates. Based on the routing logic for your application, you can send the PCI data to the appropriate processor at the appropriate time. High-level Architecture for Using Multiple Payment Gateways with Skyflow (source) By using a data privacy vault to store PCI data you no longer need to store sensitive information within your own infrastructure or bet your business on a single payment processor that is storing PCI data on your behalf. One company that is offering this type of payment resiliency, as well as data residency, is Apaya, a merchant-enabling payment automation platform based in Dubai. As Apaya emphasized building for payment processing resiliency, it leaned heavily on data privacy vaults to get the job done. Conclusion As a software developer, you build business applications that depend on payments to keep the faucet running. That means that heavy dependence on one payment processor introduces a single point of failure that could cripple your business if an issue arises. For this reason, many enterprises are building resiliency into their payment processing, leveraging multiple payment processors. Of course, building in this flexibility without adding to your PCI compliance burden is only possible when companies leverage tools like a data privacy vault. Designing your systems to isolate sensitive data and ease compliance with a data privacy vault is good design, and good for business.
In this tutorial, we will explore the exciting world of MicroStream, a powerful open-source platform that enables ultrafast data processing and storage. Specifically, we will explore how MicroStream can leverage the new Jakarta Data and NoSQL specifications, which offer cutting-edge solutions for handling data in modern applications. With MicroStream, you can use these advanced features and supercharge your data processing capabilities while enjoying a simple and intuitive development experience. So whether you're a seasoned developer looking to expand your skill set or just starting in the field, this tutorial will provide you with a comprehensive guide to using MicroStream to explore the latest in data and NoSQL technology. MicroStream is a high-performance, in-memory, NoSQL database platform for ultrafast data processing and storage. One of the critical benefits of MicroStream is its ability to achieve lightning-fast data access times, thanks to its unique architecture that eliminates the need for disk-based storage and minimizes overhead. With MicroStream, you can easily store and retrieve large amounts of data in real-time, making it an ideal choice for applications that require rapid data processing and analysis, such as financial trading systems, gaming platforms, and real-time analytics engines. MicroStream provides a simple and intuitive programming model, making it easy to integrate into your existing applications and workflows. One of the main differences between MicroStream and other databases is its focus on in-memory storage. While traditional databases rely on disk-based storage, which can lead to slower performance due to disk access times, MicroStream keeps all data in memory, allowing for much faster access. Additionally, MicroStream's unique architecture allows it to achieve excellent compression rates, further reducing the memory footprint and making it possible to store even more data in a given amount of memory. Finally, MicroStream is designed with simplicity and ease of use. It provides a developer-friendly interface and minimal dependencies, making integrating into your existing development workflow easy. MicroStream Eliminates Mismatch Impedance Object-relational impedance mismatch refers to the challenge of mapping data between object-oriented programming languages and relational databases. Object-oriented programming languages like Java or Python represent data using objects and classes, whereas relational databases store data in tables, rows, and columns. This fundamental difference in data representation can lead to challenges in designing and implementing database systems that work well with object-oriented languages. One of the trade-offs of the object-relational impedance mismatch is that it can be challenging to maintain consistency between the object-oriented and relational databases. For example, suppose an object in an object-oriented system has attributes related to one another. In that case, mapping those relationships to the relational database schema may be challenging. Additionally, object-oriented systems often support inheritance, which can be tough to represent in a relational database schema. While various techniques and patterns can be used to address the object-relational impedance mismatch, such as object-relational mapping (ORM) tools or database design patterns, these solutions often come with their trade-offs. They may introduce additional complexity to the system. 
Ultimately, achieving a balance between object-oriented programming and relational database design requires careful consideration of the specific needs and constraints of the application at hand. MicroStream can help reduce the object-relational impedance mismatch by eliminating the need for a conversion layer between object-oriented programming languages and relational databases. Since MicroStream is an in-memory, NoSQL database platform that stores data as objects, it provides a natural fit for object-oriented programming languages, eliminating the need to map between object-oriented data structures and relational database tables. With MicroStream, developers can work directly with objects in their code without worrying about the complexities of mapping data to a relational database schema. This can result in increased productivity and improved performance, as there is no need for an additional conversion layer that can introduce overhead and complexity. Moreover, MicroStream's in-memory storage model ensures fast and efficient data access without expensive disk I/O operations. Data can be stored and retrieved quickly and efficiently, allowing for rapid processing and analysis of large amounts of data. Overall, by eliminating the object-relational impedance mismatch and providing a simple, efficient, and performant way to store and access data, MicroStream can help developers focus on building great applications rather than worrying about database architecture and design. MicroStream can deliver better performance by reducing the conversion to/from objects. As the next step on this journey, let's create a MicroProfile application. MicroStream Faces Jakarta Specifications Now that I have introduced MicroStream, let's create our microservices application using Eclipse MicroProfile. The first step is going to the Eclipse MicroProfile Starter, where you can define the configuration for your initial scope. Your application will be a simple library service using Open Liberty, running with Java 17 and MicroStream. With the project downloaded, we must add the dependency that integrates MicroProfile and MicroStream. This dependency will later move into MicroStream itself, so this is a temporary home for the integration:
XML
<dependency>
    <groupId>expert.os.integration</groupId>
    <artifactId>microstream-jakarta-data</artifactId>
    <version>${microstream.data.version}</version>
</dependency>
The beauty of this integration is that it works with any vendor that supports MicroProfile 5 or higher. Currently, we're using Open Liberty. This integration enables both Jakarta persistence specifications: NoSQL and Data. Jakarta NoSQL and Jakarta Data are two related specifications developed by the Jakarta EE Working Group, aimed at providing standard APIs for working with NoSQL databases and managing data in Java-based applications. With the project defined, let's create a Book entity. The code below shows the annotations. We currently use the Jakarta NoSQL annotations.
Java
@Entity
public class Book {

    @Id
    private String isbn;

    @Column("title")
    private String title;

    @Column("year")
    private int year;

    @JsonbCreator
    public Book(@JsonbProperty("isbn") String isbn,
                @JsonbProperty("title") String title,
                @JsonbProperty("year") int year) {
        this.isbn = isbn;
        this.title = title;
        this.year = year;
    }
}
The next step is the Jakarta Data part, where you can define a single interface with several capabilities for working with this database.
Java
@Repository
public interface Library extends CrudRepository<Book, String> {
}
The last step is the resource, where we'll make the service available.
Java
@Path("/library")
@ApplicationScoped
@Consumes(MediaType.APPLICATION_JSON)
@Produces(MediaType.APPLICATION_JSON)
public class LibraryResource {

    private final Library library;

    @Inject
    public LibraryResource(Library library) {
        this.library = library;
    }

    @GET
    public List<Book> allBooks() {
        return this.library.findAll().collect(Collectors.toUnmodifiableList());
    }

    @GET
    @Path("{id}")
    public Book findById(@PathParam("id") String id) {
        return this.library.findById(id)
                .orElseThrow(() -> new WebApplicationException(Response.Status.NOT_FOUND));
    }

    @PUT
    public Book save(Book book) {
        return this.library.save(book);
    }

    @DELETE // maps this method to HTTP DELETE so the endpoint is actually exposed
    @Path("{id}")
    public void deleteBy(@PathParam("id") String id) {
        this.library.deleteById(id);
    }
}
Conclusion Jakarta NoSQL and Jakarta Data are critical specifications that provide a standard set of APIs and tools for managing data in Java-based applications. Jakarta NoSQL enables developers to interact with various NoSQL databases using a familiar interface, while Jakarta Data provides APIs for working with data in multiple formats. These specifications help reduce the complexity and costs of application development and maintenance, enabling developers to achieve greater interoperability and portability across different NoSQL databases and data formats. Furthermore, MicroStream provides a high-performance, in-memory NoSQL database platform that eliminates the need for a conversion layer between object-oriented programming languages and relational databases, reducing the object-relational impedance mismatch and increasing productivity and performance. By combining the power of MicroStream with the standard APIs provided by Jakarta NoSQL and Jakarta Data, developers can create robust and scalable applications that can easily handle large amounts of data. The combination of MicroStream, Jakarta NoSQL, and Jakarta Data offers robust tools and specifications for managing data in modern Java-based applications. These technologies help streamline the development process and enable developers to focus on building great applications that meet the needs of their users. Find the source code on GitHub.
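As a quick usage illustration, here is a hedged sketch of how the resource above might be exercised with the standard Jakarta REST client API. The host, port, and context root are assumptions and will differ depending on how Open Liberty is configured, and a JSON-B provider is assumed to be available on the client side.
Java
import jakarta.ws.rs.client.Client;
import jakarta.ws.rs.client.ClientBuilder;
import jakarta.ws.rs.client.Entity;
import jakarta.ws.rs.core.GenericType;
import jakarta.ws.rs.core.MediaType;
import java.util.List;

public class LibraryClient {

    public static void main(String[] args) {
        // Base URL is an assumption; adjust to your Open Liberty host, port, and context root.
        String baseUrl = "http://localhost:9080/library";
        Client client = ClientBuilder.newClient();

        // Store a book via PUT, using the Book entity defined earlier in the article.
        Book book = new Book("978-0321125217", "Domain-Driven Design", 2003);
        client.target(baseUrl)
                .request(MediaType.APPLICATION_JSON)
                .put(Entity.json(book));

        // List all books via GET /library.
        List<Book> books = client.target(baseUrl)
                .request(MediaType.APPLICATION_JSON)
                .get(new GenericType<List<Book>>() {});
        System.out.println("Stored books: " + books.size());

        client.close();
    }
}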