A database is a collection of structured data that is stored in a computer system, and it can be hosted on-premises or in the cloud. Databases are designed to enable easy access to the data they store.
A streaming database is a type of database that is designed specifically to process large amounts of real-time streaming data. Unlike traditional databases, which store data in batches before processing, a streaming database processes data as soon as it is generated, allowing for real-time insights and analysis. Unlike traditional stream processing engines that do not persist data, a streaming database can store data and respond to user data access requests. Streaming databases are ideal for latency-critical applications such as real-time analytics, fraud detection, network monitoring, and the Internet of Things (IoT) and can simplify the technology stack. Brief History The concept of a streaming database was first introduced in academia in 2002. A group of researchers from Brown, Brandeis, and MIT pointed out the demand for managing data streams inside databases and built the first streaming database, Aurora. A few years later, the technology was adopted by large enterprises. The top three database vendors, Oracle, IBM, and Microsoft, successively launched their stream processing solutions, known as Oracle CQL, IBM System S, and Microsoft SQL Server StreamInsight. Rather than developing a streaming database from scratch, these vendors integrated stream processing functionality directly into their existing databases. Since the late 2000s, developers inspired by MapReduce have separated stream processing functionality from database systems and developed large-scale stream processing engines, including Apache Storm, Apache Samza, Apache Flink, and Apache Spark Streaming. These systems were designed to continuously process ingested data streams and deliver results to downstream systems. However, compared to streaming databases, stream processing engines do not store data and, therefore, cannot serve user-initiated ad-hoc queries. Streaming databases have kept evolving in parallel with stream processing engines. Two streaming databases, PipelineDB and KsqlDB, were developed in the 2010s and gained popularity at the time. In the early 2020s, a few cloud-based streaming databases, like RisingWave, Materialize, and DeltaStream, emerged. These products aim to provide users with streaming database services in the cloud. To achieve that objective, they focus on designing architectures that fully utilize cloud resources to deliver horizontal scalability and high cost efficiency. Typical Use Cases Streaming databases are well-suited for real-time applications that demand up-to-date results with a freshness requirement ranging from sub-seconds to minutes. Applications like IoT and network monitoring require sub-second latency, and latency requirements for applications like ad recommendations, stock dashboarding, and food delivery can range from hundreds of milliseconds to several minutes. Streaming databases continuously deliver results at low latency and can be a good fit for these applications. Some applications are not freshness sensitive and can tolerate delays of tens of minutes, hours, or even days. Representative examples include hotel reservations and inventory tracking. In these cases, users may consider using either streaming databases or traditional batch-based databases and should decide based on other factors, such as cost efficiency, flexibility, and tech stack complexity.
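To make the continuous-query model concrete, the sketch below shows how a streaming database can express an always-fresh result as a materialized view that is maintained incrementally as events arrive. This is only a rough sketch: the exact syntax varies by product (the style below loosely follows RisingWave and Materialize), and the source and column names are hypothetical.

SQL
-- Hedged sketch of a continuously maintained result in a streaming database.
-- The ad_click_events source and its columns are hypothetical; windowing syntax
-- differs between products.
CREATE MATERIALIZED VIEW ad_clicks_per_minute AS
SELECT
    ad_id,
    window_start,
    COUNT(*) AS clicks
FROM TUMBLE(ad_click_events, click_time, INTERVAL '1 MINUTE')
GROUP BY ad_id, window_start;

-- Downstream consumers simply query the view; results stay fresh without re-running a batch job.
SELECT * FROM ad_clicks_per_minute WHERE ad_id = 42;

Unlike a nightly batch job, the view above is updated by the database itself as new click events arrive, which is the behavior the use cases in this section rely on.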
Streaming databases are commonly used alongside other modern data systems in real-time applications to facilitate two classic types of use cases: streaming ingestion (ETL) and streaming analytics. Streaming Ingestion (ETL) Streaming ingestion provides a continuous flow of data from one set of systems to another. Developers can use a streaming database to clean streaming data, join multiple streams, and move the joined results into downstream systems in real time. In real-world scenarios, data ingested into streaming databases typically comes from OLTP databases, message queues, or storage systems. After processing, the results are typically written back into these systems or inserted into data warehouses or data lakes. Streaming Analytics Streaming analytics focuses on performing complex computations and delivering fresh results on-the-fly. In the streaming analytics scenario, data typically comes from OLTP databases, message queues, and storage systems, and results are usually ingested into a serving system to support user-triggered requests. A streaming database can also serve queries on its own. Users can connect a streaming database directly with a BI tool to visualize results. With the growing demand for real-time machine learning, streaming databases have also become a crucial tool for enabling agile feature engineering. By utilizing streaming databases to store transformed data as features, developers can respond promptly to changing data patterns and new events. Streaming databases allow for real-time ingestion, processing, and transformation of data into meaningful features that can enhance the accuracy and efficiency of machine learning models while also reducing data duplication and improving data quality. This empowers organizations to make faster and more informed decisions, optimize their machine-learning workflows, and gain a competitive advantage. Streaming Databases vs. Traditional Databases Traditional databases are designed to store large amounts of batch data and provide fast, consistent access to that data through transactions and queries. They are often optimized for processing complex operations, such as aggregations and joins, that manipulate the data in bulk. Traditional databases’ execution models are often referred to as Human-Active, DBMS-Passive (HADP) models. That is, a traditional database passively stores data, and queries actively initiated by humans trigger computations. Examples of traditional databases include OLTP databases like MySQL and PostgreSQL and OLAP databases like DuckDB and ClickHouse. Streaming databases, on the other hand, are designed to incrementally process a large volume of continuously ingested data on-the-fly and provide low-latency access to the data and results for further processing and analysis. They are optimized for processing data as soon as it arrives rather than bulk processing after data is persisted. Streaming databases’ execution models are often called DBMS-Active, Human-Passive (DAHP) models. A streaming database actively triggers computation as data comes in, and humans passively receive results from the database. Examples of streaming databases include PipelineDB, KsqlDB, and RisingWave. Streaming Databases vs.
OLTP Databases An OLTP database is ACID-compliant and can process concurrent transactions. In contrast, a streaming database does not guarantee ACID compliance and, therefore, cannot be used to support transactional workloads. In terms of data correctness, streaming databases enforce consistency and completeness. A well-designed streaming database should guarantee the following two properties: Exactly-once semantics, meaning that every data event will be processed once and only once, even if a system failure occurs. Out-of-order processing, meaning that users can require a streaming database to process data events in a predefined order, even if the events arrive out of order. Streaming Databases vs. OLAP Databases An OLAP database is optimized for efficiently answering user-initiated analytical queries. OLAP databases typically implement columnar stores and a vectorized execution engine to accelerate complex query processing over large amounts of data. OLAP databases are best suited for use cases where interactive queries are essential. Unlike OLAP databases, streaming databases focus more on result freshness, and they use an incremental computation model to optimize latency. Streaming databases typically do not adopt column stores but may implement vectorized execution for query processing. Conclusion In conclusion, a streaming database is an essential system for organizations that require real-time insights from large amounts of data. By providing real-time processing, scalability, and reliability, a streaming database can help organizations make better decisions, identify opportunities, and respond to threats in real time.
When it comes to managing large amounts of data in a distributed system, Apache Cassandra and Apache Pulsar are two names that often come up. Apache Cassandra is a highly scalable NoSQL database that excels at handling high-velocity writes and queries across multiple nodes. It is an ideal solution for use cases such as user profile management, product catalogs, and real-time analytics. Apache Pulsar is a distributed messaging and streaming platform created to handle data in motion. It can handle standard messaging workloads and more complex streaming use cases, including real-time data processing and event-driven architectures. This article covers the main steps of building a Spring Boot and React-based web application that interacts with Pulsar and Cassandra, displaying stock data live as it is received. This is not a complete tutorial; it only covers the most important steps. You can find the complete source code for the application on GitHub. You will learn how to: Set up Cassandra and Pulsar instances using DataStax Astra DB and Astra Streaming. Publish and consume Pulsar messages in a Spring Boot application. Store Pulsar messages in Cassandra using a sink. View live and stored data in React using the Hilla framework by Vaadin. Used Technologies and Libraries Apache Cassandra (with Astra DB) Apache Pulsar (with Astra Streaming) Spring Boot Spring for Apache Pulsar Spring Data for Apache Cassandra React Hilla AlphaVantage API Requirements Java 17 or newer Node 18 or newer Intermediate Java skills and familiarity with Spring Boot Storing Sensitive Data in Spring Boot Much of the setup for Cassandra and Pulsar is configuration-based. While it might be tempting to put the configuration in application.properties, it is not a good idea because the file is under source control, and you may unintentionally reveal secrets. Instead, create a local config/local/application.properties configuration file and add it to .gitignore to ensure it does not leave your computer. The settings from the local configuration file will be automatically applied by Spring Boot: mkdir -p config/local touch config/local/application.properties echo " # Contains secrets that shouldn't go into the repository config/local/" >> .gitignore In production, you can provide these options to Spring Boot as environment variables. Setting Up Cassandra and Pulsar Using DataStax Astra Both Apache technologies used in this article are open-source projects and can be installed locally. However, using cloud services to set up the instances is a simpler option. In this article, we set up the data infrastructure required for our example web application using DataStax free tier services. Begin by logging in to your existing account or signing up for a new one on the DataStax Astra website, where you will create the database and streaming service separately. Cassandra Setup Start by clicking “Create Database” on the DataStax Astra website. Sinking data from a stream into Astra DB requires that both services are deployed in a region that supports both Astra Streaming and Astra DB: Enter the name of your new database instance. Select the keyspace name. (A keyspace stores your group of tables, a bit like a schema in relational databases.) Select a cloud provider and region. Note: For the demo application to work, you need to deploy the database service in a region that supports streaming too.
Select “Create Database.” Cassandra: Connecting to the Service Once the database service has been initialized, you need to generate a token and download the “Secure Connection Bundle” that encrypts the data transfer between the app and the cloud database (mTLS). Navigate to the DB dashboard “Connect” tab sheet, where you will find the button to generate a one-time token (please remember to download it) and the bundle download button: spring.cassandra.schema-action=CREATE_IF_NOT_EXISTS spring.cassandra.keyspace-name=<KEYSPACE_NAME> spring.cassandra.username=<ASTRADB_TOKEN_CLIENT_ID> spring.cassandra.password=<ASTRADB_TOKEN_SECRET> # Increase timeouts when connecting to Astra from a dev workstation spring.cassandra.contact-points=<ASTRADB_DATACENTER_ID> spring.cassandra.port=9042 spring.cassandra.local-datacenter=<ASTRADB_REGION> datastax.astra.secure-connect-bundle=<secure-connect-astra-stock-db.zip> Cassandra parameters for application.properties. Pulsar Setup Start by clicking “Create Stream” from the main DataStax Astra page: Enter the name for your new streaming instance. Select a provider and region. Note: Remember to use the same provider and region you used to create the database service. Select “Create Stream.” Pulsar: Enabling Auto Topic Creation In addition to getting the streaming service up and running, you will also need to define the topic that is used by the application to consume and produce messages. You can create a topic explicitly using the UI, but a more convenient way is to enable the “Allow Auto Topic Creation” setting for the created instance: Click on the newly created stream instance, navigate to the “Namespace and Topics” tab sheet, and click “Modify Namespace.” Navigate to the “Settings” tab located under the default namespace (not the top-level “Settings” tab) and scroll all the way down. Change the “Allow Topic Creation” setting to “Allow Auto Topic Creation.” Changing this default setting will allow the application to create new topics automatically without any additional admin effort in Astra. With this, you have successfully established the infrastructure for hosting your active and passive data. Pulsar: Connecting to the Service Once the streaming instance has been set up, you need to create a token to access the service from your app. Most of the necessary properties are located on the “Connect” tab sheet of the “Streaming dashboard.” The “topic-name” input is found in the “Namespaces and Topics” tab sheet: ## Client spring.pulsar.client.service-url=<Broker Service URL> spring.pulsar.client.auth-plugin-class-name=org.apache.pulsar.client.impl.auth.AuthenticationToken spring.pulsar.client.authentication.token=<Astra_Streaming_Token> ## Producer spring.pulsar.producer.topic-name=persistent://<TENANT_NAME>/default/<TOPIC_NAME> spring.pulsar.producer.producer-name=<name of your choice> ## Consumer spring.pulsar.consumer.topics=persistent://<TENANT_NAME>/default/<TOPIC_NAME> spring.pulsar.consumer.subscription-name=<name of your choice> spring.pulsar.consumer.consumer-name=<name of your choice> spring.pulsar.consumer.subscription-type=key_shared Pulsar parameters for application.properties. Publishing Pulsar Messages From Spring Boot The Spring for Apache Pulsar library takes care of setting up Pulsar producers and consumers based on the given configuration. In the application, the StockPriceProducer component handles message publishing. To fetch stock data, it makes an external API call before publishing the data to a Pulsar stream using a PulsarTemplate.
Autowire the PulsarTemplate into the class and save it to a field: Java @Component public class StockPriceProducer { private final PulsarTemplate<StockPrice> pulsarTemplate; public StockPriceProducer(PulsarTemplate<StockPrice> pulsarTemplate) { this.pulsarTemplate = pulsarTemplate; } //... } Then use it to publish messages: Java private void publishStockPrices(Stream<StockPrice> stockPrices) { // Publish items to Pulsar with 100ms intervals Flux.fromStream(stockPrices) // Delay elements for the demo, don't do this in real life .delayElements(Duration.ofMillis(100)) .subscribe(stockPrice -> { try { pulsarTemplate.sendAsync(stockPrice); } catch (PulsarClientException e) { throw new RuntimeException(e); } }); } You need to configure the schema for the custom StockPrice type. In Application.java, define the following bean: Java @Bean public SchemaResolver.SchemaResolverCustomizer<DefaultSchemaResolver> schemaResolverCustomizer() { return (schemaResolver) -> schemaResolver.addCustomSchemaMapping(StockPrice.class, Schema.JSON(StockPrice.class)); } Consuming Pulsar Messages in Spring Boot The Spring for Apache Pulsar library comes with a @PulsarListener annotation for a convenient way of listening to Pulsar messages. Here, the messages are emitted to a Project Reactor Sink so the UI can consume them as a Flux: Java @Service public class StockPriceConsumer { private final Sinks.Many<StockPrice> stockPriceSink = Sinks.many().multicast().directBestEffort(); private final Flux<StockPrice> stockPrices = stockPriceSink.asFlux(); @PulsarListener private void stockPriceReceived(StockPrice stockPrice) { stockPriceSink.tryEmitNext(stockPrice); } public Flux<StockPrice> getStockPrices() { return stockPrices; } } Creating a Server Endpoint for Accessing Data From React The project uses Hilla, a full-stack web framework for Spring Boot. It manages websocket connections for reactive data types and allows type-safe server communication. The client may utilize the matching TypeScript methods created by the StockPriceEndpoint to fetch data: Java @Endpoint @AnonymousAllowed public class StockPriceEndpoint { private final StockPriceProducer producer; private final StockPriceConsumer consumer; private final StockPriceService service; StockPriceEndpoint(StockPriceProducer producer, StockPriceConsumer consumer, StockPriceService service) { this.producer = producer; this.consumer = consumer; this.service = service; } public List<StockSymbol> getSymbols() { return StockSymbol.supportedSymbols(); } public void produceDataForTicker(String ticker) { producer.produceStockPriceData(ticker); } public Flux<StockPrice> getStockPriceStream() { return consumer.getStockPrices(); } public List<StockPrice> findAllByTicker(String ticker) { return service.findAllByTicker(ticker); } } Displaying a Live-Updating Chart in React The DashboardView has an Apex Chart candle stick chart for displaying the stock data. It’s bound to a state of type ApexAxisChartSeries: TypeScript const [series, setSeries] = useState<ApexAxisChartSeries>([]); The view uses a React effect hook to call the server endpoint and subscribe to new data. It returns a disposer function to close the websocket when it is no longer needed: TypeScript useEffect(() => { const subscription = StockPriceEndpoint .getStockPriceStream() .onNext((stockPrice) => updateSeries(stockPrice)); return () => subscription.cancel(); }, []); The series is bound to the template. 
Because the backend and frontend are reactive, the chart is automatically updated any time a new Pulsar message is received: HTML <ReactApexChart type="candlestick" options={options} series={series} height={350} /> Persisting Pulsar Messages to Cassandra Sinking Pulsar messages to Astra DB can be useful in scenarios where you need a reliable, scalable, and secure platform to store event data from Pulsar for further analysis, processing, or sharing. Perhaps you need to retain a copy of event data for compliance and auditing purposes, need to store event data from multiple tenants in a shared database, or have some other use case. Astra Streaming offers numerous fully managed Apache Pulsar connectors you can use to persist event data to various databases and third-party solutions, like Snowflake. In this article, we are persisting the stream data into Astra DB. Creating a Sink Start by selecting the “Sink” tab sheet from the Astra Streaming dashboard. Select the “default” namespace: From the list of available “Sink Types,” choose “Astra DB.” Give the sink a name of your choice. Select the “stock-feed” topic, which will be available once you have published messages to it from your app. After selecting the data stream input, select the database to which you want to persist Pulsar messages: To enable table creation, paste an Astra DB token with valid roles. Once a valid token is entered, you’ll see the available keyspaces; choose the keyspace name that was used to create the database. Then enter the table name. Note: This needs to match the @Table("stock_price") annotation value you use in the StockPrice.java class to read back the data. Next, you need to map the properties from the Pulsar message to the database table columns. Property fields are automatically mapped in our demo application, so you can simply click “Create” to proceed. If you were, for instance, persisting only a portion of the data to the database, opening the schema definition would enable you to view the property names employed and create a custom mapping between the fields. After the sink is created, the initialization process will begin, after which the status will change to “active.” Stock data is then automatically persisted into your database for easy access by the application. The sink dashboard provides access to sink log files in the event of an error. Displaying Cassandra Data in a Table The historical data that is stored in Cassandra is displayed in a data grid component. The DetailsView contains a Vaadin Grid component that is bound to an array of StockPrice objects which are kept in a state variable: TypeScript const [stockData, setStockData] = useState<StockPrice[]>([]); The view has a dropdown selector for selecting the stock you want to view. When the selection is updated, the view fetches the data for that stock from the server endpoint: TypeScript async function getDataFor(ticker?: string) { if (ticker) setStockData(await StockPriceEndpoint.findAllByTicker(ticker)); } The stockData array is bound to the grid in the template. GridColumns define the properties that columns should map to: HTML <Grid items={stockData} className="flex-grow"> <GridColumn path="time" ></GridColumn> <GridColumn path="open" ></GridColumn> <GridColumn path="high" ></GridColumn> <GridColumn path="low" ></GridColumn> <GridColumn path="close" ></GridColumn> <GridColumn path="volume" ></GridColumn> </Grid> Conclusion In this article, we showed how you can build a scalable real-time application using an open-source Java stack.
You can clone the completed application and use it as a base for your own experiments.
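For readers who want to inspect the persisted data directly in Cassandra, the table behind the DetailsView grid could look roughly like the CQL sketch below. This is only an illustration: in the demo the schema is created for you (via spring.cassandra.schema-action=CREATE_IF_NOT_EXISTS and the sink's table mapping), so the column types and primary key shown here are assumptions rather than the project's actual definition.

CQL
-- Hypothetical sketch of the stock_price table; types and the primary key are
-- assumptions. The real table is created automatically by the application/sink.
CREATE TABLE IF NOT EXISTS stock_price (
    ticker text,
    time   timestamp,
    open   decimal,
    high   decimal,
    low    decimal,
    close  decimal,
    volume bigint,
    PRIMARY KEY (ticker, time)
);

-- Example: read back one symbol's history, newest first (assumes time is a clustering column).
SELECT * FROM stock_price WHERE ticker = 'AAPL' ORDER BY time DESC LIMIT 10;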
ClickHouse is an open-source real-time analytics database built and optimized for use cases requiring super-low latency analytical queries over large amounts of data. To achieve the best possible performance for analytical applications, combining tables in a data denormalization process is typical. Flattened tables help minimize query latency by avoiding joins at the cost of incremental ETL complexity, typically acceptable in return for sub-second queries. However, denormalizing data isn't always practical for some workloads, for instance, those coming from more traditional data warehouses. Sometimes, part of the source data for analytical queries needs to remain normalized. These normalized tables take less storage and provide flexibility with data combinations, but they require joins at query time for certain types of analysis. Fortunately, contrary to some misconceptions, joins are fully supported in ClickHouse! In addition to supporting all standard SQL JOIN types, ClickHouse provides additional JOIN types useful for analytical workloads and time-series analysis. ClickHouse allows you to choose between six different algorithms for the join execution or to let the query planner adaptively choose and dynamically change the algorithm at runtime, depending on resource availability and usage. Join Types Supported in ClickHouse We use Venn diagrams and example queries on a normalized IMDB dataset originating from the relational dataset repository to explain the available join types in ClickHouse. Instructions for creating and loading the tables are here. The dataset is also available in our playground for users wanting to reproduce queries. We are going to use four tables from our example dataset: The data in these four tables represents movies. A movie can have one or many genres. The roles in a movie are played by actors. The arrows in the diagram above represent foreign-to-primary-key relationships, e.g., the movie_id column of a row in the genres table contains the id value from a row in the movies table. There is a many-to-many relationship between movies and actors. This many-to-many relationship is normalized into two one-to-many relationships by using the roles table. Each row in the roles table contains the values of the id fields of the movies table and the actors table. Inner Join The Inner Join returns, for each pair of rows matching on join keys, the column values of the row from the left table, combined with the column values of the row from the right table. If a row has more than one match, then all matches are returned (meaning that the cartesian product is produced for rows with matching join keys). This query finds the genre(s) for each movie by joining the movies table with the genres table: SQL SELECT m.name AS name, g.genre AS genre FROM movies AS m INNER JOIN genres AS g ON m.id = g.movie_id ORDER BY m.year DESC, m.name ASC, g.genre ASC LIMIT 10; ┌─name───────────────────────────────────┬─genre─────┐ │ Harry Potter and the Half-Blood Prince │ Action │ │ Harry Potter and the Half-Blood Prince │ Adventure │ │ Harry Potter and the Half-Blood Prince │ Family │ │ Harry Potter and the Half-Blood Prince │ Fantasy │ │ Harry Potter and the Half-Blood Prince │ Thriller │ │ DragonBall Z │ Action │ │ DragonBall Z │ Adventure │ │ DragonBall Z │ Comedy │ │ DragonBall Z │ Fantasy │ │ DragonBall Z │ Sci-Fi │ └────────────────────────────────────────┴───────────┘ 10 rows in set. Elapsed: 0.126 sec. Processed 783.39 thousand rows, 21.50 MB (6.24 million rows/s., 171.26 MB/s.)
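As mentioned above, ClickHouse lets you choose between several join algorithms or leave the choice to the query planner. As a quick, hedged illustration, the algorithm can be requested per query via the join_algorithm setting; which values are available (for example 'hash', 'parallel_hash', 'full_sorting_merge', 'grace_hash', or 'auto') depends on your ClickHouse version. The query below reuses the inner join example:

SQL
-- Same inner join as above, but explicitly requesting a join algorithm.
-- Available algorithm names depend on the ClickHouse version in use.
SELECT m.name AS name, g.genre AS genre
FROM movies AS m
INNER JOIN genres AS g ON m.id = g.movie_id
ORDER BY m.year DESC, m.name ASC, g.genre ASC
LIMIT 10
SETTINGS join_algorithm = 'parallel_hash';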
Note that the Inner keyword can be omitted. The behavior of the Inner Join can be extended or changed by using one of the following other join types. (Left/Right/Full) Outer Join The Left Outer Join behaves like the Inner Join; plus, for non-matching left table rows, ClickHouse returns default values for the right table’s columns. A Right Outer Join query is similar and also returns values from non-matching rows from the right table together with default values for the columns of the left table. A Full Outer Join query combines the left and right outer join and returns values from non-matching rows from the left and the right table, together with default values for the columns of the right and left table, respectively. Note that ClickHouse can be configured to return NULLs instead of default values (however, for performance reasons, that is less recommended). This query finds all movies that have no genre by querying for all rows from the movies table that don’t have matches in the genres table and therefore get (at query time) the default value 0 for the movie_id column: SQL SELECT m.name FROM movies AS m LEFT JOIN genres AS g ON m.id = g.movie_id WHERE g.movie_id = 0 ORDER BY m.year DESC, m.name ASC LIMIT 10; ┌─name──────────────────────────────────────┐ │ """Pacific War, The""" │ │ """Turin 2006: XX Olympic Winter Games""" │ │ Arthur, the Movie │ │ Bridge to Terabithia │ │ Mars in Aries │ │ Master of Space and Time │ │ Ninth Life of Louis Drax, The │ │ Paradox │ │ Ratatouille │ │ """American Dad""" │ └───────────────────────────────────────────┘ 10 rows in set. Elapsed: 0.092 sec. Processed 783.39 thousand rows, 15.42 MB (8.49 million rows/s., 167.10 MB/s.) Note that the Outer keyword can be omitted. Cross Join The Cross Join produces the full cartesian product of the two tables without considering join keys. Each row from the left table is combined with each row from the right table. The following query, therefore, combines each row from the movies table with each row from the genres table: SQL SELECT m.name, m.id, g.movie_id, g.genre FROM movies AS m CROSS JOIN genres AS g LIMIT 10; ┌─name─┬─id─┬─movie_id─┬─genre───────┐ │ #28 │ 0 │ 1 │ Documentary │ │ #28 │ 0 │ 1 │ Short │ │ #28 │ 0 │ 2 │ Comedy │ │ #28 │ 0 │ 2 │ Crime │ │ #28 │ 0 │ 5 │ Western │ │ #28 │ 0 │ 6 │ Comedy │ │ #28 │ 0 │ 6 │ Family │ │ #28 │ 0 │ 8 │ Animation │ │ #28 │ 0 │ 8 │ Comedy │ │ #28 │ 0 │ 8 │ Short │ └──────┴────┴──────────┴─────────────┘ 10 rows in set. Elapsed: 0.024 sec. Processed 477.04 thousand rows, 10.22 MB (20.13 million rows/s., 431.36 MB/s.) While the previous example query alone didn’t make much sense, it can be extended with a where clause for associating matching rows to replicate Inner Join behavior for finding the genre(s) for each movie: SQL SELECT m.name, g.genre FROM movies AS m CROSS JOIN genres AS g WHERE m.id = g.movie_id ORDER BY m.year DESC, m.name ASC LIMIT 10; ┌─name───────────────────────────────────┬─genre─────┐ │ Harry Potter and the Half-Blood Prince │ Action │ │ Harry Potter and the Half-Blood Prince │ Adventure │ │ Harry Potter and the Half-Blood Prince │ Family │ │ Harry Potter and the Half-Blood Prince │ Fantasy │ │ Harry Potter and the Half-Blood Prince │ Thriller │ │ DragonBall Z │ Action │ │ DragonBall Z │ Sci-Fi │ │ DragonBall Z │ Fantasy │ │ DragonBall Z │ Comedy │ │ DragonBall Z │ Adventure │ └────────────────────────────────────────┴───────────┘ 10 rows in set. Elapsed: 0.441 sec. Processed 783.39 thousand rows, 21.50 MB (1.78 million rows/s., 48.78 MB/s.)
An alternative syntax for Cross Join specifies multiple tables in the from clause separated by commas. ClickHouse rewrites a Cross Join to an Inner Join if there are join expressions in the where section of the query. We can check that for the example query via EXPLAIN SYNTAX (which returns the syntactically optimized version into which a query gets rewritten before being executed): SQL EXPLAIN SYNTAX SELECT m.name AS name, g.genre AS genre FROM movies AS m CROSS JOIN genres AS g WHERE m.id = g.movie_id ORDER BY m.year DESC, m.name ASC, g.genre ASC LIMIT 10; ┌─explain─────────────────────────────────────┐ │ SELECT │ │ name AS name, │ │ genre AS genre │ │ FROM movies AS m │ │ ALL INNER JOIN genres AS g ON id = movie_id │ │ WHERE id = movie_id │ │ ORDER BY │ │ year DESC, │ │ name ASC, │ │ genre ASC │ │ LIMIT 10 │ └─────────────────────────────────────────────┘ 11 rows in set. Elapsed: 0.077 sec. The Inner Join clause in the syntactically optimized Cross Join query version contains the all keyword, which was explicitly added in order to keep the cartesian product semantics of the Cross Join even when it is rewritten into an Inner Join, for which the cartesian product can be disabled. And because, as mentioned above, the Outer keyword can be omitted for a Right Outer Join and the optional all keyword can be added, you can write All Right Join and it will work all right. Left/Right Semi Join A Left Semi Join query returns column values for each row from the left table that has at least one join key match in the right table. Only the first found match is returned (the cartesian product is disabled). A Right Semi Join query is similar and returns values for all rows from the right table with at least one match in the left table, but only the first found match is returned. This query finds all actors/actresses that performed in a movie in 2023. Note that with a normal (Inner) join, the same actor/actress would show up more than one time if they had more than one role in 2023: SQL SELECT a.first_name, a.last_name FROM actors AS a LEFT SEMI JOIN roles AS r ON a.id = r.actor_id WHERE toYear(created_at) = '2023' ORDER BY id ASC LIMIT 10; ┌─first_name─┬─last_name──────────────┐ │ Michael │ 'babeepower' Viera │ │ Eloy │ 'Chincheta' │ │ Dieguito │ 'El Cigala' │ │ Antonio │ 'El de Chipiona' │ │ José │ 'El Francés' │ │ Félix │ 'El Gato' │ │ Marcial │ 'El Jalisco' │ │ José │ 'El Morito' │ │ Francisco │ 'El Niño de la Manola' │ │ Víctor │ 'El Payaso' │ └────────────┴────────────────────────┘ 10 rows in set. Elapsed: 0.151 sec. Processed 4.25 million rows, 56.23 MB (28.07 million rows/s., 371.48 MB/s.) Left/Right Anti Join A Left Anti Join returns column values for all non-matching rows from the left table. Similarly, the Right Anti Join returns column values for all non-matching right table rows. An alternative formulation of our previous outer join example query uses an anti join for finding movies that have no genre in the dataset: SQL SELECT m.name FROM movies AS m LEFT ANTI JOIN genres AS g ON m.id = g.movie_id ORDER BY year DESC, name ASC LIMIT 10; ┌─name──────────────────────────────────────┐ │ """Pacific War, The""" │ │ """Turin 2006: XX Olympic Winter Games""" │ │ Arthur, the Movie │ │ Bridge to Terabithia │ │ Mars in Aries │ │ Master of Space and Time │ │ Ninth Life of Louis Drax, The │ │ Paradox │ │ Ratatouille │ │ """American Dad""" │ └───────────────────────────────────────────┘ 10 rows in set. Elapsed: 0.077 sec.
Processed 783.39 thousand rows, 15.42 MB (10.18 million rows/s., 200.47 MB/s.) Left/Right/Inner Any Join A Left Any Join is the combination of the Left Outer Join + the Left Semi Join, meaning that ClickHouse returns column values for each row from the left table, either combined with the column values of a matching row from the right table or combined with default column values for the right table, in case no match exists. If a row from the left table has more than one match in the right table, ClickHouse only returns the combined column values from the first found match (the cartesian product is disabled). Similarly, the Right Any Join is the combination of the Right Outer Join + the Right Semi Join. And the Inner Any Join is the Inner Join with a disabled cartesian product. We demonstrate the Left Any Join with an abstract example using two temporary tables (left_table and right_table) constructed with the values table function: SQL WITH left_table AS (SELECT * FROM VALUES('c UInt32', 1, 2, 3)), right_table AS (SELECT * FROM VALUES('c UInt32', 2, 2, 3, 3, 4)) SELECT l.c AS l_c, r.c AS r_c FROM left_table AS l LEFT ANY JOIN right_table AS r ON l.c = r.c; ┌─l_c─┬─r_c─┐ │ 1 │ 0 │ │ 2 │ 2 │ │ 3 │ 3 │ └─────┴─────┘ 3 rows in set. Elapsed: 0.002 sec. This is the same query using a Right Any Join: SQL WITH left_table AS (SELECT * FROM VALUES('c UInt32', 1, 2, 3)), right_table AS (SELECT * FROM VALUES('c UInt32', 2, 2, 3, 3, 4)) SELECT l.c AS l_c, r.c AS r_c FROM left_table AS l RIGHT ANY JOIN right_table AS r ON l.c = r.c; ┌─l_c─┬─r_c─┐ │ 2 │ 2 │ │ 2 │ 2 │ │ 3 │ 3 │ │ 3 │ 3 │ │ 0 │ 4 │ └─────┴─────┘ 5 rows in set. Elapsed: 0.002 sec. This is the query with an Inner Any Join: SQL WITH left_table AS (SELECT * FROM VALUES('c UInt32', 1, 2, 3)), right_table AS (SELECT * FROM VALUES('c UInt32', 2, 2, 3, 3, 4)) SELECT l.c AS l_c, r.c AS r_c FROM left_table AS l INNER ANY JOIN right_table AS r ON l.c = r.c; ┌─l_c─┬─r_c─┐ │ 2 │ 2 │ │ 3 │ 3 │ └─────┴─────┘ 2 rows in set. Elapsed: 0.002 sec. ASOF Join The ASOF Join, implemented for ClickHouse in 2019 by Martijn Bakker and Artem Zuikov, provides non-exact matching capabilities. If a row from the left table doesn’t have an exact match in the right table, then the closest matching row from the right table is used as a match instead. This is particularly useful for time-series analytics and can drastically reduce query complexity. We will do time-series analytics of stock market data as an example. A quotes table contains stock symbol quotes based on specific times of the day. The price is updated every 10 seconds in our example data. A trades table lists symbol trades - a specific volume of a symbol got bought at a specific time: In order to calculate the concrete cost of each trade, we need to match the trades with their closest quote time. 
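The quotes and trades tables themselves are not defined in the text above; as a rough sketch of what such tables could look like (the engine choice, ordering key, and column types here are assumptions for illustration, not the actual DDL behind the example data):

SQL
-- Hypothetical definitions for the ASOF example; names and types are assumptions.
CREATE TABLE quotes
(
    symbol String,
    time   DateTime,
    price  Float64
)
ENGINE = MergeTree
ORDER BY (symbol, time);

CREATE TABLE trades
(
    symbol String,
    time   DateTime,
    volume UInt32
)
ENGINE = MergeTree
ORDER BY (symbol, time);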
This is easy and compact with the ASOF Join, where we use the ON clause for specifying an exact match condition and the AND clause for specifying the closest match condition — we are looking for the closest row from the quotes table exactly or before the date of a trade: SQL SELECT t.symbol, t.volume, t.time AS trade_time, q.time AS closest_quote_time, q.price AS quote_price, t.volume * q.price AS final_price FROM trades t ASOF LEFT JOIN quotes q ON t.symbol = q.symbol AND t.time >= q.time FORMAT Vertical; Row 1: ────── symbol: ABC volume: 200 trade_time: 2023-02-22 14:09:05 closest_quote_time: 2023-02-22 14:09:00 quote_price: 32.11 final_price: 6422 Row 2: ────── symbol: ABC volume: 300 trade_time: 2023-02-22 14:09:28 closest_quote_time: 2023-02-22 14:09:20 quote_price: 32.15 final_price: 9645 2 rows in set. Elapsed: 0.003 sec. Note that the ON clause of the ASOF Join is required and specifies an exact match condition next to the non-exact match condition of the AND clause. ClickHouse currently doesn't support (yet) joins without any part of the join keys performing strict matching. Summary This blog post showed how ClickHouse supports all standard SQL Join types, plus specialized joins to power analytical queries. We described and demonstrated all supported join types.
In a contemporary application design that utilizes microservices and keeps up with dynamic analytics demands, it’s common to synchronize data from various databases dispersed throughout an enterprise and integrated with diverse systems. In modern applications, real-time syncing of databases is often necessary, so the synchronization cannot wait for a batch job to run daily or even hourly. The “Change Data Capture” concept addresses this need for real-time syncing by capturing and tracking any changes made to the data in the source databases and propagating those changes to the target databases in real time. As organizations continue to generate and store large amounts of data, the need for efficient and reliable real-time data replication has become increasingly important. Two solutions in this space are Debezium and DBConvert Streams. Both platforms promise to replicate data between different types of databases, but which one is right for you? In this article, we’ll compare DBConvert Streams and Debezium in terms of performance. Debezium/Kafka Architecture Source: Debezium The architecture for “Change Data Capture” using Debezium and Kafka Connect involves several components working together to capture, process, and route the data changes: Source database: The source database is where the changes occur, emitting events on inserts, updates, and deletes. Debezium connector: Debezium provides connectors for different databases, which capture the changes in a database-agnostic way and convert them into a standard message format. Apache Kafka: Kafka is a distributed streaming platform that acts as a message queue and ensures the messages are delivered in order and without loss. Kafka Connect: Kafka Connect is a component of Kafka that handles the integration between Kafka and external systems. It is used to pull data from Debezium and push it to the destination systems. Consumer applications: These applications consume the captured data changes and process them according to the use case. They could be analytics applications, data warehouses, or any other application that requires real-time data. DBConvert Streams Architecture In the DBConvert Streams architecture, the flow of data is as follows: The DBConvert Streams database source reader component reads CDC changes from the upstream source database. The source reader component propagates the changes to the NATS message broker. DBConvert Streams (DBS) uses NATS instead of Kafka in its architecture for Change Data Capture. The NATS message broker acts as a message queue, ensuring reliable and efficient delivery. The DBS target writer component consumes the events from the NATS message broker and writes them to the target database. This architecture enables real-time data integration between different databases, making it easier for development teams to work with diverse databases and reducing the amount of custom code needed to implement CDC. Different Database Log Formats There is no common standard for the format of database logs. Each database system typically uses its proprietary format for its transaction logs. This lack of standardization can make it challenging for developers to work with change data capture, particularly in multi-database environments. To address this issue, Debezium and DBConvert Streams offer connectors for popular databases, thereby simplifying the implementation of CDC. 
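The performance tests below replicate a simple products table from MySQL to Postgres. Its actual definition lives in the repository's initdb.sql and is not reproduced in the article; based on the INSERT statement used later (name, price, weight), a minimal MySQL sketch might look like this, where the id column and the column types are assumptions:

SQL
-- Hypothetical sketch of the replicated table; see initdb.sql in the repository
-- for the real definition. Types and the id column are assumptions.
CREATE TABLE products (
    id     INT AUTO_INCREMENT PRIMARY KEY,
    name   VARCHAR(255),
    price  DECIMAL(10, 2),
    weight DOUBLE
);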
Performance Tests We will set up the test environment using Docker Compose to start all services and conduct performance tests to compare the efficiency of Debezium and DBConvert Streams in replicating one million records from a MySQL source database to a Postgres target database. We have created a GitHub repository to store the test scripts and results: Debezium tests DBConvert Streams tests Prerequisites Before proceeding, please ensure you have installed the necessary prerequisites, including: Docker Docker Compose Curl Please clone the specified repository onto your local machine: git clone git@github.com:slotix/dbconvert-streams-public.git Table Structure Source and Target Databases The mysql-source database image is based on slotix/dbs-mysql:8, which has all the necessary settings to enable MySQL CDC replication. This image also contains the initdb.sql script, which creates a table with the above structure. The postgres-target database is based on the official postgres:15-alpine image. It will receive all changes made to the mysql-source database. These databases are typically hosted on separate physical servers in a production environment. However, we will run them on a single machine using separate containers for our example. Debezium Test cd dbconvert-streams-public/examples/mysql2postgres/1-million-records-debezium export DEBEZIUM_VERSION=2.0 docker-compose up --build -d Set the environment variable DEBEZIUM_VERSION to 2.0 and then run the docker-compose up command with the options --build to build the images and -d to start the containers in the background. Deployment curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" http://localhost:8083/connectors/ -d @source.json curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" http://localhost:8083/connectors/ -d @target.json These two curl commands create Kafka Connect connectors for Debezium to read data from the MySQL source database and write data to the Postgres target database. Visit this link to view a list of active and running connectors. Monitor the Number of Records on the Target To ensure the replication process is working correctly, we will monitor the number of records on the target database using the script below: #!/bin/bash count=0 start_time=0 while true; do output=$(docker compose exec postgres-target bash -c "psql -U \$POSTGRES_USER \$POSTGRES_DB -c 'select count(*) from products'") count=$(echo "$output" | sed -n '3p' | awk '{print $1}') if [[ $count -gt 1 && $start_time -eq 0 ]]; then start_time=$(date +%s) echo "Started at" fi if [[ $count -gt 1000000 ]]; then end_time=$(date +%s) elapsed_time=$((end_time - start_time)) echo "Record count exceeded 1000000 after $elapsed_time seconds." break fi echo "$output" | sed -n '3p' | awk '{print $1}' sleep 1 done The script will print the number of records in the target database every second. Once the number of records exceeds 1 million, the script will print the elapsed time and exit.
In the next terminal run the following commands: export DEBEZIUM_VERSION=2.0 ./count.sh Populate the Source Table With Sample Data To execute the SQL script that populates the source table with sample data, you can run the following commands in the first terminal: docker compose exec -it \ mysql-source \ mysql -u root -p123456 -D source In the MySQL prompt, execute the following command: INSERT INTO products (name, price, weight) SELECT CONCAT('Product', number) AS name, ROUND(RAND() * 100, 2) AS price, RAND() * 10 AS weight FROM (SELECT @row := @row + 1 AS number FROM (SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9 UNION SELECT 10) t1, (SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9 UNION SELECT 10) t2, (SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9 UNION SELECT 10) t3, (SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9 UNION SELECT 10) t4, (SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9 UNION SELECT 10) t5, (SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9 UNION SELECT 10) t6, (SELECT @row := 0) r ) numbers LIMIT 1000000; This command will insert 1 million rows of random data into a table named products. Results Once the operation is completed, you will see something like this in the terminal where the count.sh script is running: .... 942882 960702 977532 995352 Record count exceeded 1000000 after 96 seconds. That means it took about 96 seconds to replicate 1 million records from MySQL to Postgres Database. DBConvert Streams Test We are going to make the same test with DBConvert Streams: cd dbconvert-streams-public/examples/mysql2postgres/1-million-records docker-compose up --build -d The command above will start the services in the background, build the images, and use the docker-compose.yml file to configure the services. Send Stream Configuration Send a request to the DBConvert Streams API with configuration parameters. 
Run the curl command: curl --request POST --url http://127.0.0.1:8020/api/v1/streams\?file=./mysql2pg.json Populate the Source Table With Sample Data Connect to the MySQL container to interact with the MySQL database running inside the container: docker exec -it \ mysql-source \ mysql -uroot -p123456 -D source In the MySQL prompt, execute the following command: INSERT INTO products (name, price, weight) SELECT CONCAT('Product', number) AS name, ROUND(RAND() * 100, 2) AS price, RAND() * 10 AS weight FROM (SELECT @row := @row + 1 AS number FROM (SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9 UNION SELECT 10) t1, (SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9 UNION SELECT 10) t2, (SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9 UNION SELECT 10) t3, (SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9 UNION SELECT 10) t4, (SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9 UNION SELECT 10) t5, (SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9 UNION SELECT 10) t6, (SELECT @row := 0) r ) numbers LIMIT 1000000; This command will insert 1 million rows of random data into a table named products. Results Copy and paste the URL into your web browser to access the Prometheus metrics page: http://127.0.0.1:9090/graph?g0.expr=source_events_in&g0.tab=1&g0.stacked=0&g0.show_exemplars=0&g0.range_input=1h&g1.expr=source_duration_seconds&g1.tab=1&g1.stacked=0&g1.show_exemplars=0&g1.range_input=1h&g2.expr=target_events_out&g2.tab=1&g2.stacked=0&g2.show_exemplars=0&g2.range_input=1h&g3.expr=target_duration_seconds&g3.tab=1&g3.stacked=0&g3.show_exemplars=0&g3.range_input=1h Once you’re on the page, you’ll be able to see various performance metrics related to the replication process, including the number of events in and out of the source and target databases and the duration of the replication process. It took about 8 seconds to transfer 1 million records from MySQL to Postgres. Minimum Hardware Requirements We conducted additional tests, as described above, on the DigitalOcean cloud computing platform to find out the minimum hardware resources Debezium and DBConvert Streams require. The table below summarizes the test results of replicating 1 million records from MySQL to PostgreSQL:
HARDWARE RESOURCES | DEBEZIUM | DBCONVERT STREAMS
2 CPU / 2 GB RAM | Failed | 15 seconds
2 CPU / 4 GB RAM | Failed (after ~300k records) | 12 seconds
4 CPU / 8 GB RAM | 236 seconds | 8 seconds
8 CPU / 16 GB RAM | 221 seconds | 8 seconds
According to the table, the DBConvert Streams platform requires fewer resources and provides faster replication speeds compared to Debezium. Conclusion While Debezium is a popular CDC platform that offers a lot of flexibility and powerful features, it may not always be the most cost-effective solution for some use cases. Specifically, DBConvert Streams may provide better performance and cost-efficiency for use cases requiring high throughput and real-time replication.
Recently, researchers at UC Berkeley published TAOBench, an end-to-end benchmark for social network workloads based on the traces collected from TAO, the distributed database for Meta. Among the typical data operations in social network applications, 99.7% are read, while only 0.2% are write and 0.01% are write transactions. Some other findings include: Transaction hotkeys are often co-located. Read and write hotspots appear on different keys. Contention can result intentionally. All these characteristics of social network application workloads pose special challenges. In this article, we will share the typical technological challenges faced by social network applications. Based on the findings from TAOBench, we will also discuss some features and optimizations of Distributed SQL databases that help optimize social network benchmark performance. The Underlying Technical Challenges of Social Networking Applications Social network applications allow users to share information, such as text, photos, videos, and links, and to interact with each other through various features, such as messaging, forums, and groups. These objects and features work together to describe relationships. As a user of a social application, you will have many needs that pose data challenges: You don't want unauthorized people to see your posts, but you do want people relevant to you to see them. You want your posts visible immediately, not delayed for hours. You don't want to miss replies that come in late. Late replies also confuse users. You want to access the information anytime, anywhere. The underlying technology challenges of social network applications include scalability, privacy, security, and data management. Scalability: This is the ability of a system to handle an increasing amount of traffic and data without becoming slow or unresponsive. Social network applications must handle large amounts of data and traffic: they often have millions of users who are constantly generating and accessing data. Privacy and security: These are vital for social network applications. These applications often contain sensitive personal information, such as user profiles, messages, and connections, and they need to protect this information from unauthorized access and disclosure. Data management: Applications need to be able to efficiently store, process, and retrieve large amounts of data while also ensuring data integrity and availability. All these needs are even more challenging at hyper-scale—a scale that increases with time. As the TAO paper mentions, "A single Facebook page may aggregate and filter hundreds of items from the social graph…. It must be efficient, highly available, and scale to high query rates." How Do Distributed SQL Databases Fit? To handle the high volume of data and traffic that social networks generate, database systems must be able to scale horizontally, using multiple servers and other resources to distribute the workload and improve performance. In addition to scalability, social network databases must be able to support fast and efficient querying to provide users with a smooth and responsive experience. This involves using specialized indexing and data structures, as well as complex query optimization algorithms. Distributed SQL databases are designed to handle large amounts of data and traffic and can be easily scaled horizontally across multiple servers and other resources. 
They may also offer features such as high availability, fault tolerance, and flexible data modeling, which can be useful for social network applications. Indications in the TAOBench Test In the TAOBench test, a UC Berkeley scholar tested several distributed cloud databases with equivalently priced infrastructure resources. The result is shown below. TAOBench’s benchmark on distributed databases The figure shows that the latency of all databases increases when the throughput reaches a certain level. (Each database has different scalability and performance limits.) This is because these databases (except for Cloud Spanner) are limited to resources of equal cost. Compared to other vendors, TiDB, a distributed SQL database, demonstrates stable performance with the best scalability. This allows it to achieve higher throughput. TiDB’s Architecture and Optimizations for Social Networking TiDB is a Distributed SQL database that offers scalability, high availability, ACID transactions, and MySQL compatibility, making it ideal for OLTP scenarios. Today, TiDB plays a vital role in many social networking companies, such as Zhihu (China's Quora), Redbook, Pinterest, and Bilibili. Many enterprises are using TiDB to help them deal with large-scale data issues. TiDB also provides true Hybrid Transactional/Analytical Processing (HTAP) capabilities that simplify technology stacks by combining OLTP and real-time analysis. As a Distributed SQL database, TiDB excels in TAOBench tests for performance and scaling. There are some good architectural reasons: Consistency and isolation: ACID and distributed transaction support based on Percolator High availability: data replicas based on Raft High throughput: horizontal elastic, scalable nodes to support Multi-write Relational data access: MySQL compatibility Ability to handle hotspot issues: Auto split and rebalance with the data region In addition, certain aspects of TiDB's design make it well-suited for networking applications. Auto Dynamic Sharding and Rebalancing As the TAOBench paper says, "Transaction hotkeys are often colocated." The hot spot issue is a difficult one in social networking applications. In TiDB, the fundamental data storage unit for management and distribution is called a "Region.” Regions can be split and merged based on the amount of data they manage and can be scheduled to move between nodes. Typically, data is distributed evenly across all storage nodes, and TiDB automatically balances the resources of each store based on the load. A storage node’s CPU and disk usage may become a bottleneck. TiDB’s Placement Driver (PD) estimates the load of the data regions based on the statistics such as the number of requests and amount of data written and synchronized. PD can schedule the balancing operation accordingly. Data region rebalancing with PD In a social network, hotspots may be concentrated within one data Region. TiDB samples a data region to analyze the distribution of workload. It then finds a suitable split point to allow the hot data Region to split into smaller Regions. After splitting, the hotspot balancing scheduler can move the hotspots into different nodes. With these two scheduling features, TiDB can fully utilize the distributed nature of storage, IO, and computing. This keeps performance stable — even in cases of severe hot-spotting. 
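As a hedged illustration of working with Regions, TiDB exposes SQL statements for inspecting how a table is split into Regions and, if needed, pre-splitting a known-hot table; the table name and key range below are hypothetical, and in most cases the automatic splitting and rebalancing described above is sufficient:

SQL
-- Inspect how a table's data is currently split into Regions and where the Regions live.
SHOW TABLE posts REGIONS;

-- Optionally pre-split a hypothetical hot table into 16 Regions across an assumed key range.
SPLIT TABLE posts BETWEEN (0) AND (1000000) REGIONS 16;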
Hotspots processing in TiDB Write Transaction Optimization for Colocated Participants Distributed systems that support cross-row and cross-node transactions typically use 2-phase-commit (2PC) to achieve atomicity. TiDB's 2PC implementation is based on Percolator. In TiDB’s Percolator model, a transaction is considered committed once the primary key is committed. This requires at least two network round trips. However, not all transactions require 2PC to achieve atomicity. If a transaction only involves data hosted on one node, atomic commits can be achieved with only one round of RPCs. TiDB’s optimized process for write transactions The TAOBench paper says, "Hotkeys in write transactions tend to be colocated on the same shards." This optimization in TiDB effectively reduces the number of transaction commit phases. In the test results, we observed that Commit Operations per Second (OPS) dropped from 6,000 to less than 1,000, indicating that most 2PCs were reduced to 1PC. However, since writes in TAOBench only account for approximately 0.2% of all traffic, the overall Queries per Second (QPS) only saw a slight improvement. Commit performance observed in TAOBench A potential topic for future optimization is using data affinity to colocate as much relevant data as possible in one data region. This can reduce the overhead of 2PC and improve performance. Plan Cache for Read-Heavy Workload TiDB supports plan cache for SQL statements. This allows it to skip the optimization phase, which includes rule-based optimization and cost-based optimization. For read-heavy workloads, skipping these processes saves computing resources and improves performance. Based on our testing, enabling the plan cache improves QPS by about 10.5%. Plan cache in TiDB Semi-Auto-Tuning Garbage Collection For any data-intensive system, garbage collection (GC) is a resource-intensive background task. The GC threshold parameters can significantly affect the system's performance, particularly when it consumes a lot of CPU resources. Go GC auto-tuning, an optimization proposed by an Uber engineer, can reduce unnecessary GC operations and save overhead on frequent lightweight operations. TiDB adopted this optimization, which significantly improved throughput for TAOBench and many other Online Transactional Processing (OLTP) workloads. However, there’s a tradeoff. Although this method reduces unnecessarily frequent GC, in extreme cases, it may increase the risk of out-of-memory (OOM) crashes. Results can be found in the following graphic. Continuous Iteration and Evaluation In addition to the features and optimizations we’ve discussed, the evolution of the product itself is essential to address scalability challenges. TiDB iterates quickly and often, with a release cadence of one to two months. To capture the performance gains across different versions, the team also set up TAOBench to compare the performance of TiDB 6.4 with TiDB 5.0. As indicated in the chart, we achieved a 30% QPS improvement over the past year and a half. TiDB’s overall performance comparison on TAOBench Conclusion Overall, the workload of a social network database can be quite challenging. It requires advanced technologies and techniques to manage and optimize the data to give users a seamless and enjoyable experience. To learn more, you can watch this Meetup playback.
NoSQL stands for "Not Only SQL" and refers to a type of database management system that is designed to handle large volumes of unstructured and semi-structured data. Unlike traditional SQL databases that use a tabular format with predefined schemas, NoSQL databases are schema-less and allow for flexible and dynamic data structures. NoSQL databases are required because they can handle the large volumes and complex data types associated with Big Data. They are designed to scale horizontally by distributing data across many servers, making them well-suited for handling large and growing datasets. Additionally, NoSQL databases are often faster and more efficient than SQL databases for certain types of queries, such as those involving large amounts of data and complex data structures. NoSQL databases are also used in modern web applications that require fast and flexible data storage, such as social media platforms, online marketplaces, and content management systems. They are particularly useful for applications that require high levels of availability and scalability, as they can handle large amounts of traffic and data without sacrificing performance. Different Types of NoSQL Databases There are several types of NoSQL databases, each designed to handle different types of data and workloads. Some common types of NoSQL databases include: Document Databases These databases store and manage semi-structured data as documents, typically in JSON or XML formats. Document databases are well-suited for managing unstructured data, such as user profiles, product catalogs, or content management systems. Examples of document databases include MongoDB, Elasticsearch, and Couchbase. Key-Value Databases These databases store data as key-value pairs, making them ideal for simple lookups and high-speed data retrieval. Key-value databases are often used for caching, session management, and message queues. Examples of key-value databases include Redis and Riak. Column-Family Databases Also known as column-oriented databases, these databases store data as columns instead of rows, making them ideal for handling large amounts of data and complex queries. Column-family databases are often used for analytics, content management, and data warehousing. Examples of column-family databases include Apache Cassandra and HBase. Graph Databases These databases store and manage data as nodes and edges, making them well-suited for managing complex relationships and hierarchies. Graph databases are often used for social networks, recommendation engines, and fraud detection. Examples of graph databases include Neo4j and OrientDB. CAP Theorem for NoSQL Database The CAP theorem, also known as Brewer's theorem, is a fundamental concept in distributed computing that applies to NoSQL databases. The CAP theorem states that in any distributed system, it is impossible to simultaneously provide all three of the following guarantees: Consistency: Every read request from a node in the system will return the most recent write request. Availability: Every request to the system will receive a response without guaranteeing that it contains the most recent written request. Partition tolerance: The system can continue to operate and function correctly even if there are network partitions or messages are lost between nodes. In other words, when designing a distributed system like a NoSQL database, developers have to make trade-offs between consistency, availability, and partition tolerance. 
NoSQL databases are typically designed to prioritize either availability or partition tolerance while sacrificing some degree of consistency. This means that in certain failure scenarios, a NoSQL database may not provide the most up-to-date data to all nodes in the system but instead might return stale or conflicting data. For example, in a partitioned network, a NoSQL database may prioritize partition tolerance and continue to accept writes from multiple nodes, but these nodes may have different versions of the same data. In contrast, a traditional relational database might prioritize consistency and reject writes until it can guarantee that all nodes have the most recent data. Overall, the CAP theorem is an important consideration when designing and choosing a NoSQL database, as it helps to identify the trade-offs between consistency, availability, and partition tolerance that must be made in a distributed system. Use of NoSQL Database NoSQL databases are widely used for a variety of reasons, including: Scalability: NoSQL databases are highly scalable, allowing them to handle large amounts of data and high-traffic loads more easily than traditional relational databases. Flexibility: NoSQL databases allow for flexible data modeling, making it easier to handle unstructured or semi-structured data such as social media posts, documents, and sensor data. Performance: NoSQL databases are often faster than traditional relational databases, particularly when handling large volumes of data. Availability: NoSQL databases are designed to be highly available and fault-tolerant, ensuring that data is always accessible, even in the event of hardware or network failures. Cost-effectiveness: NoSQL databases can be more cost-effective than traditional relational databases, particularly for large-scale applications that require significant amounts of data storage and processing. Common Use Cases for NoSQL Databases Web applications: NoSQL databases are often used to power web applications, which require scalability, performance, and flexibility. Big Data: NoSQL databases are commonly used in big data applications, where traditional relational databases can struggle to handle the massive volumes of data involved. Internet of Things (IoT): NoSQL databases are used to store and process data from IoT devices, which can generate massive amounts of data in real time. Real-Time Analytics: NoSQL databases can be used for real-time analytics, enabling businesses to make faster, data-driven decisions. Content Management: NoSQL databases are often used for content management applications, which require the ability to handle unstructured or semi-structured data such as documents, images, and videos. Big Data Technologies Using NoSQL Big data technologies rely on NoSQL databases due to their scalability and ability to handle large volumes of unstructured and semi-structured data. Here are some of the most used big data technologies that leverage NoSQL databases: Hadoop: Hadoop is a popular open-source big data platform that includes the Hadoop Distributed File System (HDFS) for storing and processing large amounts of data, and Apache HBase, a NoSQL column-family database that provides low-latency access to Hadoop data. Cassandra: Apache Cassandra is a highly scalable NoSQL column-family database that is often used in big data applications. Cassandra can handle massive amounts of data across multiple nodes and data centers, making it ideal for distributed systems. 
MongoDB: MongoDB is a popular document-oriented NoSQL database that is often used in big data applications. MongoDB can store and process large amounts of data, and its flexible data model makes it well-suited for handling unstructured data. Couchbase: Couchbase is a NoSQL document-oriented database that provides a distributed key-value store with high performance and scalability. It is often used in big data applications where real-time data access and processing are critical. Neo4j: Neo4j is a graph database that is often used in big data applications that require the processing of complex relationships between data points. Neo4j is well-suited for applications such as social networks, recommendation engines, and fraud detection systems. Overall, NoSQL databases are a critical component of many big data architectures, enabling organizations to store and process large volumes of data efficiently and effectively. Conclusion NoSQL databases have become increasingly popular in recent years due to their ability to handle large amounts of unstructured or semi-structured data, their scalability, and their high availability. They provide a flexible data model that can adapt to changing data requirements and allow for efficient data processing. NoSQL databases come in various types, including document-oriented, key-value, column-family, and graph databases. Each type has its own strengths and weaknesses, and the choice of the database will depend on the specific requirements of the application. One of the key trade-offs when using NoSQL databases is the CAP theorem, which states that consistency, availability, and partition tolerance cannot be simultaneously guaranteed in a distributed system. NoSQL databases typically prioritize either availability or partition tolerance over consistency, which can lead to data inconsistencies in certain failure scenarios. Overall, NoSQL databases have revolutionized the way we store and process data, particularly in big data applications. They provide a powerful and flexible alternative to traditional relational databases and have become a critical component of many modern data architectures. However, as with any technology, they have their limitations and are not always the best choice for every application. It's important to carefully evaluate the requirements of your application and choose the database that best fits those needs.
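To make the document model described in this article a little more concrete, here is a minimal, hedged sketch using the MongoDB Java sync driver; the connection string, database, collection, and field names are assumptions made purely for illustration.
Java
// Minimal sketch (illustrative only): storing and querying a schema-less user
// profile as a document. Connection string, database, and collection names are
// assumptions for this example.
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

import java.util.List;

public class DocumentStoreExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> profiles =
                    client.getDatabase("app").getCollection("user_profiles");

            // No predefined schema: nested fields and arrays are stored as-is.
            profiles.insertOne(new Document("username", "jdoe")
                    .append("interests", List.of("databases", "hiking"))
                    .append("settings", new Document("theme", "dark")));

            // Query by a field inside the document.
            Document found = profiles.find(Filters.eq("username", "jdoe")).first();
            System.out.println(found == null ? "not found" : found.toJson());
        }
    }
}
Note how the nested settings document and the interests array go in without any schema migration, which is the flexibility argument made above.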
There is a common problem most backend developers face at least once in their careers: where should we store our secrets? It appears simple enough: there are plenty of services focusing on this very issue, so we just need to pick one and move on to the next task. Sounds easy, but how can we pick the right solution for our needs? We should evaluate our options to see more clearly. The Test For the demonstration, we can take a simple Spring Boot application as an example. This is a good fit because it is one of the most popular backend technology choices today. In our example, we will assume we need to use a MySQL database over JDBC; therefore, our secrets will be the connection URL, driver class name, username, and password. This is only a proof of concept; any dependency would do as long as it uses secrets. We can easily generate such a project using Spring Initializr. We will get the DataSource auto-configured and then create a bean that will do the connection test. The test can look like this:
Java
@Component
public class MySqlConnectionCheck {

    private final DataSource dataSource;

    @Autowired
    public MySqlConnectionCheck(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public void verifyConnectivity() throws SQLException {
        try (final Connection connection = dataSource.getConnection()) {
            query(connection);
        }
    }

    private void query(Connection connection) throws SQLException {
        final String sql = "SELECT CONCAT(@@version_comment, ' - ', VERSION()) FROM DUAL";
        try (final ResultSet resultSet = connection.prepareStatement(sql).executeQuery()) {
            resultSet.next();
            final String value = resultSet.getString(1);
            //write something that will be visible on the Gradle output
            System.err.println(value);
        }
    }
}
This class will establish a connection to MySQL and make sure we are, in fact, using MySQL, as it will print the MySQL version comment and version. This way, we would notice our mistake even if an auto-configured H2 instance was used by the application. Furthermore, if we generate a random password for our MySQL Docker container, we can make sure we are using the instance we wanted, validating that the whole configuration works properly. Back to the problem, shall we? Storing Secrets The Easy Way The most trivial option is to store the secrets together with the code, either hard-coded or as a configuration property, using profiles to support separate environments (dev/test/staging/prod). As simple as it is, this is a horrible idea, as many popular sites had to learn the hard way over the years. These “secrets” are anything but secret. As soon as someone gets access to a repository, they will have the credentials to the production database. Adding insult to injury, we won’t even know about it! This is the most common cause of data breaches. A good indicator of how serious the situation is: secret scanning offerings have become common on GitHub, GitLab, Bitbucket, and other git hosting services. The Right Way Now that we see what the problem is, we can start to look for better options. There is one common thing we will notice in all the solutions we can use: they want us to store our secrets in an external service that will keep them secure. This comes with a lot of benefits these services can provide, such as: Solid access control. Encrypted secrets (and sometimes more, like certificates and keys). Auditable access logs. A way to revoke access/rotate secrets in case of a suspected breach.
Natural separation of environments as they are part of the stack (one secrets manager per env). Sounds great, did we solve everything? Well, it is not that simple. We have some new questions we need to answer first: Who will host and maintain these? Where should we put the secrets we need for authentication when we want to access the secrets manager? How will we run our code locally on the developer laptops? How will we run our tests on CI? Will it cost anything? These are not trivial, and their answers depend very much on the solution we want to use. Let us review them one by one in the next section. Examples of Secrets Managers In all cases below, we will introduce the secrets manager as a new component of our stack, so if we had an application and a database, it would look like the following diagram. HashiCorp Vault If we go for the popular open-source option, HashiCorp Vault, then we can either self-host, or use their managed service, HCP Vault. Depending on the variant we select, we may or may not have some maintenance effort already, but it answers the first question. Answering the rest should be easy as well. Regarding the authentication piece, we can use, for example, the AppRole Auth Method using environment variables providing the necessary credentials to our application instances in each environment. Regarding the local and CI execution, we can simply configure and run a vault instance in dev server mode on the machine where the app should run and pass the necessary credentials using environment variables similarly to the live app instances. As these are local vaults, providing access to throw-away dev databases, we should not worry too much about their security as we should avoid storing meaningful data in them. To avoid spending a lot of effort on maintaining these local/CI vault instances, it can be a clever idea to store their contents in a central location, and let each developer update their vault using a single command every now and then. Regarding the cost, it depends on a few things. If you can go with the self-hosted open-source option, you should worry only about the VM cost (and the time spent on maintenance); otherwise, you might need to figure out how you can optimize the license/support cost. Cloud-Based Solutions If we are hosting our services using the services of one of the three big cloud providers, we have even more options. AWS, Azure, and Google Cloud are all offering a managed service for secrets managers. Probably because of the nature of the problem, AWS Secrets Manager, Azure Key Vault, and Google Cloud Secret Manager share many similarities. Please see a list below for examples: Stores versioned secrets. Logs access to the service and its contents. Uses solid authentication and authorization features. Well integrated with other managed services of the same provider Provides an SDK for developers of some popular languages At the same time, we should keep in mind that these are still hugely different services. Some of the obvious differences are the API they are using for communication, and the additional features they provide. For example, Azure Key Vault can store secrets, keys, and certificates, while AWS and GCP provide separate managed services for these additional features. Thinking about the questions we wanted to answer, they can answer the first two questions the same way. All of them are managed services, and the managed identity solution of the cloud provider they belong to is the most convenient, secure way to access them. 
Thanks to this, we do not need to bother storing secrets/tokens in our application configuration, just the URL of the secrets manager, which is not considered to be a secret. Regarding the cost, AWS and GCP can charge by the number of secrets and number of API calls. On the other hand, Azure only charges for the latter. In general, they are very reasonably priced, and we can sleep better at night knowing our security posture is a bit better. Trouble starts when we try to answer the remaining two questions dealing with the local and CI use-cases. All three solutions can be accessed from the outside world (given the proper network configuration), but simply punching holes on a firewall and sharing the same secrets manager credentials is not an ideal solution. There are situations when doing so is simply not practical, such as the following cases: Our team is scattered around the globe in the home office, and we would not be able to use strong IP restrictions, or we would need constant VPN connection just to build/test the code. Needing internet connection for tests is bad enough. But, using VPN constantly while at work can put additional stress on the infrastructure and team at the same time. When our CI instances are spawning with random IPs from an unknown range, we cannot set proper IP restrictions. A similar case to the previous. We cannot trust the whole team with the secrets of the shared secrets manager. For example, in the case of open-source projects, we cannot run around and share a secrets manager instance with the rest of the world. We need to change the contents of the secrets manager during the tests. When this happens, we are risking isolation problems between each developer and CI instance. We cannot launch a different secrets manager instance for each person and process (or test case) as that would not be very scalable. We do not want to pay extra for the additional secrets managers used in these cases. Can We Fake It Locally? Usually, this would be the moment when I start to search for a suitable test double and formulate plans about using that instead of the real service locally and on CI. What do we expect from such a test double? Behave like the real service would include in exceptional situations. Be actively maintained to reduce the risk of lagging behind in case of API version changes in the real service. Have a way to initialize the content of the secrets manager double on start-up to not need additional code in the application. Allow us to synchronize the secret values between the team and CI instances to reduce maintenance cost. Let us start and throw-away the test double simply, locally and on CI. Do not use a lot of resources. Do not introduce additional dependencies to our application if possible. I know about third-party solutions ticking all the boxes in case of AWS or Azure, while I have failed to locate one for GCP. Solving the Local Use Case for Each Secrets Manager in Practice It is finally time for us to roll up our sleeves and get our hands dirty. How should we modify our test project to be able to use our secrets manager integrations locally? Let us see for each of them: HashiCorp Vault Since we can run the real thing locally, getting a test double is pointless. 
We can simply integrate vault using the Spring Vault module by adding a property source: Java @Component("SecretPropertySource") @VaultPropertySource(value = "secret/datasource", propertyNamePrefix = "spring.datasource.") public class SecretPropertySource { private String url; private String username; private String password; private String driverClassName; // ... getters and setters ... } As well as a configuration for the “dev” profile: Java @Configuration @Profile("dev") public class DevClientConfig extends AbstractVaultConfiguration { @Override public VaultEndpoint vaultEndpoint() { final String uri = getEnvironment().getRequiredProperty("app.secrets.url"); return VaultEndpoint.from(URI.create(uri)); } @Override public ClientAuthentication clientAuthentication() { final String token = getEnvironment().getRequiredProperty("app.secrets.token"); return new TokenAuthentication(token); } @Override public VaultTemplate vaultTemplate() { final VaultTemplate vaultTemplate = super.vaultTemplate(); final SecretPropertySource datasourceProperties = new SecretPropertySource(); datasourceProperties.setUrl("jdbc:mysql://localhost:15306/"); datasourceProperties.setDriverClassName("com.mysql.cj.jdbc.Driver"); datasourceProperties.setUsername("root"); datasourceProperties.setPassword("16276ec1-a682-4022-b859-38797969abc6"); vaultTemplate.write("secret/datasource", datasourceProperties); return vaultTemplate; } } We need to be careful, as each bean—depending on the fetched secret values (or the DataSource)—must be marked with @DependsOn("SecretPropertySource") to make sure it will not be populated earlier during start-up while the vault backend PropertySource is not registered. As for the reason we used a “dev” specific profile, it was necessary because of two things: The additional initialization of the vault contents on start-up. The simplified authentication as we are using a simple token instead of the aforementioned AppRole. Performing the initialization here solves the worries about the maintenance of the vault contents as the code takes care of it, and we did not need any additional dependencies either. Of course, it would have been even better if we used some Docker magic to add those values without ever needing to touch Java. This might be an improvement for later. Speaking of Docker, the Docker Compose file is simple as seen below: YAML version: "3" services: vault: container_name: self-hosted-vault-example image: vault ports: - '18201:18201' restart: always cap_add: - IPC_LOCK entrypoint: vault server -dev-kv-v1 -config=/vault/config/vault.hcl volumes: - config-import:/vault/config:ro environment: VAULT_DEV_ROOT_TOKEN_ID: 00000000-0000-0000-0000-000000000000 VAULT_TOKEN: 00000000-0000-0000-0000-000000000000 # ... MySQL config ... volumes: config-import: driver: local driver_opts: type: "none" o: "bind" device: "vault" The key points to remember are the dev mode in the entry point, the volume config that will allow us to add the configuration file, and the environment variables baking in the dummy credentials we will use in the application. As for the configuration, we need to set in-memory mode and configure a HTTP endpoint without TLS: disable_mlock = true storage "inmem" {} listener "tcp" { address = "0.0.0.0:18201" tls_disable = 1 } ui = true max_lease_ttl = "7200h" default_lease_ttl = "7200h" api_addr = "http://127.0.0.1:18201" The complexity of the application might need some changes in the vault configuration or the Docker Compose content. 
However, for this simple example, we should be fine. Running the project, should produce the expected output: MySQL Community Server - GPL - 8.0.32 We are done with configuring vault for local use. Setting it up for tests should be even more simple using the things we have learned here. Also, we can simplify some of the steps there if we decide to use the relevant Testcontainers module. Google Cloud Secret Manager As there is no readily available test double for Google Cloud Secret Manager, we need to make a trade-off. We can decide what we would like to choose from the following three options: We can fall back to the easy option in case of the local/CI case, disabling the logic that will fetch the secrets for us in any real environment. In this case, we will not know whether the integration works until we deploy the application somewhere. We can decide to use some shared Secret Manager instances, or even let every developer create one for themselves. This can solve the problem locally, but it is inconvenient compared to the solution we wanted, and we would need to avoid running our CI tests in parallel and clean up perfectly in case the content of the Secret Manager must change on CI. We can try mocking/stubbing the necessary endpoints of the Secret Manager ourselves. WireMock can be a good start for the HTTP API, or we can even start from nothing. It is a worthy endeavor for sure, but will take a lot of time to do it well. Also, if we do this, we must consider the ongoing maintenance effort. As the decision will require quite different solutions for each, there is not much we can solve in general. AWS Secrets Manager Things are better in case of AWS, where LocalStack is a tried-and-true test double with many features. Chances are that if you are using other AWS managed services in your application, you will be using LocalStack already, making this even more appealing. Let us make some changes to our demo application to demonstrate how simple it is to implement the AWS Secrets Manager integration as well as using LocalStack locally. Fetching the Secrets First, we need a class that will know the names of the secrets in the Secrets Manager: Java @Configuration @ConfigurationProperties(prefix = "app.secrets.key.db") public class SecretAccessProperties { private String url; private String username; private String password; private String driver; // ... getters and setters ... } This will read the configuration and let us conveniently access the names of each secret by a simple method call. 
Next, we need to implement a class that will handle communication with the Secrets Manager: Java @Component("SecretPropertySource") public class SecretPropertySource extends EnumerablePropertySource<Map<String, String>> { private final AWSSecretsManager client; private final Map<String, String> mapping; private final Map<String, String> cache; @Autowired public SecretPropertySource(SecretAccessProperties properties, final AWSSecretsManager client, final ConfigurableEnvironment environment) { super("aws-secrets"); this.client = client; mapping = Map.of( "spring.datasource.driver-class-name", properties.getDriver(), "spring.datasource.url", properties.getUrl(), "spring.datasource.username", properties.getUsername(), "spring.datasource.password", properties.getPassword() ); environment.getPropertySources().addFirst(this); cache = new ConcurrentHashMap<>(); } @Override public String[] getPropertyNames() { return mapping.keySet() .toArray(new String[0]); } @Override public String getProperty(String property) { if (!Arrays.asList(getPropertyNames()).contains(property)) { return null; } final String key = mapping.get(property); //not using computeIfAbsent to avoid locking map while the value is resolved if (!cache.containsKey(key)) { cache.put(key, client .getSecretValue(new GetSecretValueRequest().withSecretId(key)) .getSecretString()); } return cache.get(key); } } This PropertySource implementation will know how each secret name can be translated to Spring Boot configuration properties used for the DataSource configuration, self-register as the first property source, and cache the result whenever a known property is fetched. We need to use the @DependsOn annotation same as in case of the vault example to make sure the properties are fetched in time. As we need to use basic authentication with LocalStack, we need to implement one more class, which will only run in the “dev” profile: Java @Configuration @Profile("dev") public class DevClientConfig { @Value("${app.secrets.url}") private String managerUrl; @Value("${app.secrets.accessKey}") private String managerAccessKey; @Value("${app.secrets.secretKey}") private String managerSecretKey; @Bean public AWSSecretsManager secretClient() { final EndpointConfiguration endpointConfiguration = new EndpointConfiguration(managerUrl, Regions.DEFAULT_REGION.getName()); final BasicAWSCredentials credentials = new BasicAWSCredentials(managerAccessKey, managerSecretKey); return AWSSecretsManagerClientBuilder.standard() .withEndpointConfiguration(endpointConfiguration) .withCredentials(new AWSStaticCredentialsProvider(credentials)) .build(); } } Our only goal with this service is to set up a suitable AWSSecretsManager bean just for local use. Setting Up the Test Double With the coding done, we need to make sure LocalStack will be started using Docker Compose whenever we start our Spring Boot app locally and stop it when we are done. Starting with the Docker Compose part, we need it to start LocalStack and make sure to use the built-in mechanism for running an initialization script when the container starts using the approach shared here. 
To do so, we need a script that can add the secrets: Shell #!/bin/bash echo "########### Creating profile ###########" aws configure set aws_access_key_id default_access_key --profile=localstack aws configure set aws_secret_access_key default_secret_key --profile=localstack aws configure set region us-west-2 --profile=localstack echo "########### Listing profile ###########" aws configure list --profile=localstack echo "########### Creating secrets ###########" aws secretsmanager create-secret --endpoint-url=http://localhost:4566 --name database-connection-url --secret-string "jdbc:mysql://localhost:13306/" --profile=localstack || echo "ERROR" aws secretsmanager create-secret --endpoint-url=http://localhost:4566 --name database-driver --secret-string "com.mysql.cj.jdbc.Driver" --profile=localstack || echo "ERROR" aws secretsmanager create-secret --endpoint-url=http://localhost:4566 --name database-username --secret-string "root" --profile=localstack || echo "ERROR" aws secretsmanager create-secret --endpoint-url=http://localhost:4566 --name database-password --secret-string "e8ce8764-dad6-41de-a2fc-ef905bda44fb" --profile=localstack || echo "ERROR" echo "########### Secrets created ###########" This will configure the bundled AWS CLI inside the container and perform the necessary HTTP calls to port 4566 where the container listens. To let LocalStack use our script, we will need to start our container with a volume attached. We can do so using the following Docker Compose configuration: YAML version: "3" services: localstack: container_name: aws-example-localstack image: localstack/localstack:latest ports: - "14566:4566" environment: LAMBDA_DOCKER_NETWORK: 'my-local-aws-network' LAMBDA_REMOTE_DOCKER: 0 SERVICES: 'secretsmanager' DEFAULT_REGION: 'us-west-2' volumes: - secrets-import:/docker-entrypoint-initaws.d:ro # ... MySQL config ... volumes: secrets-import: driver: local driver_opts: type: "none" o: "bind" device: "localstack" This will set up the volume, start LocalStack with the “secretsmanager” feature active, and allow us to map port 4566 from the container to port 14566 on the host so that our AWSSecretsManager can access it using the following configuration: Properties files app.secrets.url=http://localhost:14566 app.secrets.accessKey=none app.secrets.secretKey=none If we run the project, we will see the expected output: MySQL Community Server - GPL - 8.0.32 Well done, we have successfully configured our local environment. We can easily replicate these steps for the tests as well. We can even create multiple throw-away containers from our tests for example using Testcontainers. Azure Key Vault Implementing the Azure Key Vault solution will look like a cheap copy-paste job after the AWS Secrets Manager example we have just implemented above. Fetching the Secrets We have the same SecretAccessProperties class for the same reason. The only meaningful difference in SecretPropertySource is the fact that we are using the Azure SDK. 
The changed method will be this: Java @Override public String getProperty(String property) { if (!Arrays.asList(getPropertyNames()).contains(property)) { return null; } final String key = mapping.get(property); //not using computeIfAbsent to avoid locking map while the value is resolved if (!cache.containsKey(key)) { cache.put(key, client.getSecret(key).getValue()); } return cache.get(key); } The only missing piece is the “dev” specific client configuration that will create a dumb token and an Azure Key Vault SecretClient for us: Java @Configuration @Profile("dev") public class DevClientConfig { @Value("${app.secrets.url}") private String vaultUrl; @Value("${app.secrets.user}") private String vaultUser; @Value("${app.secrets.pass}") private String vaultPass; @Bean public SecretClient secretClient() { return new SecretClientBuilder() .credential(new BasicAuthenticationCredential(vaultUser, vaultPass)) .vaultUrl(vaultUrl) .disableChallengeResourceVerification() .buildClient(); } } With this, the Java side changes are completed, we can add the missing configuration and the application is ready: Properties files app.secrets.url=https://localhost:10443 app.secrets.user=dummy app.secrets.pass=dummy The file contents are self-explanatory, we have some dummy credentials for the simulated authentication and a URL for accessing the vault. Setting Up the Test Double Although setting up the test double will be like the LocalStack solution we implemented above, it will not be the same. We will use Lowkey Vault, a fake, that implements the API endpoints we need and more. As Lowkey Vault provides a way for us to import the vault contents using an attached volume, we can start by creating an import file containing the properties we will need: { "vaults": [ { "attributes": { "baseUri": "https://{{host}:{{port}", "recoveryLevel": "Recoverable+Purgeable", "recoverableDays": 90, "created": {{now 0}, "deleted": null }, "keys": { }, "secrets": { "database-connection-url": { "versions": [ { "vaultBaseUri": "https://{{host}:{{port}", "entityId": "database-connection-url", "entityVersion": "00000000000000000000000000000001", "attributes": { "enabled": true, "created": {{now 0}, "updated": {{now 0}, "recoveryLevel": "Recoverable+Purgeable", "recoverableDays": 90 }, "tags": {}, "managed": false, "value": "jdbc:mysql://localhost:23306/", "contentType": "text/plain" } ] }, "database-username": { "versions": [ { "vaultBaseUri": "https://{{host}:{{port}", "entityId": "database-username", "entityVersion": "00000000000000000000000000000001", "attributes": { "enabled": true, "created": {{now 0}, "updated": {{now 0}, "recoveryLevel": "Recoverable+Purgeable", "recoverableDays": 90 }, "tags": {}, "managed": false, "value": "root", "contentType": "text/plain" } ] }, "database-password": { "versions": [ { "vaultBaseUri": "https://{{host}:{{port}", "entityId": "database-password", "entityVersion": "00000000000000000000000000000001", "attributes": { "enabled": true, "created": {{now 0}, "updated": {{now 0}, "recoveryLevel": "Recoverable+Purgeable", "recoverableDays": 90 }, "tags": {}, "managed": false, "value": "5b8538b6-2bf1-4d38-94f0-308d4fbb757b", "contentType": "text/plain" } ] }, "database-driver": { "versions": [ { "vaultBaseUri": "https://{{host}:{{port}", "entityId": "database-driver", "entityVersion": "00000000000000000000000000000001", "attributes": { "enabled": true, "created": {{now 0}, "updated": {{now 0}, "recoveryLevel": "Recoverable+Purgeable", "recoverableDays": 90 }, "tags": {}, "managed": false, "value": 
"com.mysql.cj.jdbc.Driver", "contentType": "text/plain" } ] } } } ] } This is a Handlebars template that would allow us to use placeholders for the host name, port, and the created/updated/etc., timestamp fields. We must use the {{port} placeholder as we want to make sure we can use any port when we start our container, but the rest of the placeholders are optional, we could have just written a literal there. See the quick start documentation for more information. Starting the container has a similar complexity as in case of the AWS example: YAML version: "3" services: lowkey-vault: container_name: akv-example-lowkey-vault image: nagyesta/lowkey-vault:1.18.0 ports: - "10443:10443" volumes: - vault-import:/import/:ro environment: LOWKEY_ARGS: > --server.port=10443 --LOWKEY_VAULT_NAMES=- --LOWKEY_IMPORT_LOCATION=/import/keyvault.json.hbs # ... MySQL config ... volumes: vault-import: driver: local driver_opts: type: "none" o: "bind" device: "lowkey-vault/import" We need to notice almost the same things as before, the port number is set, the Handlebars template will use the server.port parameter and localhost by default, so the import should work once we have attached the volume using the same approach as before. The only remaining step we need to solve is configuring our application to trust the self-signed certificate of the test double, which is used for providing an HTTPS connection. This can be done by using the PKCS#12 store from the Lowkey Vault repository and telling Java that it should be trusted: Groovy bootRun { systemProperty("javax.net.ssl.trustStore", file("${projectDir}/local/local-certs.p12")) systemProperty("javax.net.ssl.trustStorePassword", "changeit") systemProperty("spring.profiles.active", "dev") dependsOn tasks.composeUp finalizedBy tasks.composeDown } Running the project will log the expected string as before: MySQL Community Server - GPL - 8.0.32 Congratulations, we can run our app without the real Azure Key Vault. Same as before, we can use Testcontainers for our test, but, in this case, the Lowkey Vault module is a third-party from the Lowkey Vault project home, so it is not in the list provided by the Testcontainers project. Summary We have established that keeping secrets in the repository defeats the purpose. Then, we have seen multiple solution options for the problem we have identified in the beginning, and can select the best secrets manager depending on our context. Also, we can tackle the local and CI use cases using the examples shown above. The full example projects can be found on GitHub here.
As digital transformation reaches more industries, the number of data points generated is growing exponentially. As such, data integration strategies to collect such large volumes of data from different sources in varying formats and structures are now a primary concern for data engineering teams. Traditional approaches to data integration, which have largely focused on curating highly structured data into data warehouses, struggle to deal with the volume and heterogeneity of new data sets. Time series data present an additional layer of complexity. By nature, the value of each time series data point diminishes over time as the granularity of the data loses relevance as it gets stale. So it is crucial for teams to carefully plan data integration strategies into time series databases (TSDBs) to ensure that the analysis reflects the trends and situation in near real-time. In this article, we’ll examine some of the most popular approaches to data integration for TSDBs: ETL (Extract, Transform, Load) ELT (Extract, Load, Transform) Data Streaming with CDC (Change Data Capture) Given the need for real-time insights for time series data, most modern event-driven architectures now implement data streaming with CDC. To illustrate how it works in practice, we will walk through a reference implementation with QuestDB (a fast TSDB) to show that CDC can flexibly handle the needs of a time series data source. Extract, Transform, Load (ETL) ETL is a traditional and popular data integration strategy that involves first transforming the data into a predetermined structure before loading the data into the target system (typically a data warehouse). One of the main advantages of ETL is that it provides the highest degree of customization. Since the data is first extracted to a staging area where it is transformed into a clean, standardized format, ETL systems can handle a wide range of formats and structures. Also, once the data is loaded into the data warehouse, data science teams can run efficient queries and analyses. Finally, given the maturity of the ETL ecosystem, there is a plethora of enterprise-grade tools to choose from. On the other hand, ETL is both time- and resource-intensive to maintain. The logic for sanitizing and transforming the data can be complex and computationally expensive. This is why most ETL systems are typically batch-oriented, only loading the data into the warehouse periodically. As the volume of data and the sources of data grows, this can become a bottleneck. Given these qualities, ETL systems are most used for datasets that require complex transformation logic before analysis. It can also work well for datasets that do not require real-time insights and can be stored for long-term trend analysis. Extract, Load, Transform (ELT) ELT, as the name suggests, loads the data first into the target system (typically a data lake) and performs the transformation within the system itself. Given the responsibilities of the target system to handle both fast loads and transformations, ELT pipelines usually leverage modern, cloud-based data lakes that can deal with the processing requirements. Compared to ETL pipelines, ELT systems can provide more real-time analysis of the data since raw data is ingested and transformed on the fly. Most cloud-based data lakes provide SDKs or endpoints to efficiently ingest data in micro-batches and provide almost limitless scalability. However, ELT is not without downsides. 
Since transformation is done by the target system, such operations are limited by the capabilities of the data lake. If you need more complex transformation logic, additional steps may be needed to re-extract data and store it in a friendlier format. For most use cases, ELT is a more efficient data integration strategy than ETL. If your application can leverage cloud-based tools and does not require complex processing, ELT can be a great choice to handle large amounts of data in near real-time. Change Data Capture (CDC) For new projects, teams can plan to utilize TSDBs as one of the target systems in an ETL or ELT pipeline. However, for existing projects, either migrating to or adding a TSDB into the mix can be a challenge. For one, work may be required to modify an existing driver/SDK or adopt a new one to stream data into the TSDB. Even if the same drivers are supported, data formats may also need to change to fully take advantage of TSDB capabilities. CDC tools can be useful to bridge this gap. CDC is simple in principle: CDC tools such as Debezium continuously monitor changes in the source system and notify your data pipeline whenever there is a change. The application causing the change is often not even aware that there is a CDC process listening for changes. This makes CDC a good fit for integrating new real-time data pipelines into existing architectures because it requires few or no changes to the existing applications. As such, CDC can be used in conjunction with either ETL or ELT pipelines. For example, the source system can be a SQL RDBMS (e.g., MySQL or PostgreSQL) or a NoSQL database (e.g., MongoDB or Cassandra), and one of the target systems can be a TSDB along with other data lakes or warehouses. The main advantage of using CDC for data integration is that it provides real-time data replication. Unlike traditional ETL and ELT systems that work with batches, the changes to the source system are continuously streamed into one or more target systems. This can be useful for replicating either a subset of the data or the entire dataset across multiple databases in near real-time. The target databases may even be in different geographical regions or serve different purposes (e.g., long-term storage vs. real-time analytics). For time series data, where the changes in value over time are often what matters most, CDC efficiently captures that delta for real-time insights. Reference Implementation for CDC To illustrate how CDC works more concretely, let’s take the reference implementation I wrote about recently to stream stock prices into QuestDB. At a high level, a Java Spring app publishes stock price information into PostgreSQL. The Debezium connector then reads the changes from PostgreSQL and publishes them to Kafka. On the other side, QuestDB’s Kafka Connector reads from the Kafka topics and streams them into QuestDB. For a deeper dive, please refer to Change Data Capture with QuestDB and Debezium. In this reference architecture, the Java Spring app transforms and loads the data into PostgreSQL before it is replicated to the TSDB for further analysis. If a more ELT-like pipeline is desired, raw data from another provider could also be loaded directly into PostgreSQL and transformed later in QuestDB. The important thing to note with this architecture is that CDC can seamlessly integrate with existing systems. From the application standpoint, it retains the transactional guarantees of PostgreSQL while adding a new TSDB component down the pipeline.
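To give a feel for what flows through the middle of this pipeline, here is a minimal, hedged sketch of a plain Kafka consumer reading Debezium change events. The bootstrap server, group id, and topic name (which follows Debezium's usual server.schema.table pattern) are assumptions, and in the reference implementation this consumer role is actually played by QuestDB's Kafka Connector rather than hand-written code.
Java
// Illustrative sketch only: consuming Debezium change events from Kafka.
// The topic name ("dbserver1.public.stock_prices"), bootstrap server, and group id
// are assumptions; in the reference architecture QuestDB's Kafka Connector plays
// this consumer role instead of custom code.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ChangeEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "tsdb-sink-demo");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("dbserver1.public.stock_prices"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Each value is a JSON change event describing an insert, update,
                    // or delete; a real sink would parse it and write the row to the TSDB.
                    System.out.println(record.value());
                }
            }
        }
    }
}
Each change event carries the before and after state of the affected row, which is what lets a downstream system such as a TSDB apply only the delta in near real-time.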
Conclusion Data integration plays an important role for organizations that use a TSDB to store and analyze time series data. In this article, we looked at some advantages and disadvantages of using ETL or ELT. We also examined how CDC can be used in conjunction with those pipelines to provide real-time replication into TSDBs. Given the special qualities of time series data, using a TSDB to properly store and analyze them is important. If you are starting fresh, look to build an ELT pipeline to stream data into a TSDB. To integrate with an existing system, look at utilizing CDC tools to limit the disruption to the current architecture.
SQL Server Management Studio (SSMS) is one of the most proven database administration and management tools. Whether you use it for database administration or just for database development, it gets the job done. Still, we want to extend its capabilities and adapt the tool to our specific needs. For this, we use a range of SSMS add-ins and extensions. We'll discuss two productivity-focused add-ins that help us achieve more in less time. They're a great way to extend the functionality of SSMS: they make it more flexible and enhance our SQL coding with autocompletion, snippets, and refactoring. These are SSMS Tools Pack and dbForge SQL Complete. Let's first take a brief overview of both products, then look at the features that might interest you. This should help you decide whether SQL Complete or SSMS Tools Pack is the better fit for you. The Overview of dbForge SQL Complete We start with dbForge SQL Complete, available for SSMS and Microsoft Visual Studio. It offers IntelliSense-like context-based code completion, SQL formatting, and smart refactoring with auto-correction of references. This makes coding up to four times faster. It's suitable for individual use, and it also helps form and unify SQL standards for corporate teamwork. This tool has a powerful T-SQL debugger, tab coloring, and a document outline. SQL Complete has many features and a clean interface, all of which make your work more convenient. Pricing: dbForge SQL Complete is available in three editions: Free Basic (Express), Standard, and Professional. The Express edition is a unique offering that comes completely free of charge. There is no other way to extend the code completion functionality in SSMS for free. SQL Complete can also be purchased in a package called dbForge SQL Tools, which includes fifteen essential products that cover nearly any aspect of SQL Server development, management, and administration. The Overview of SSMS Tools Pack While the second contender, SSMS Tools Pack, doesn't come close to being as versatile and powerful as the first, it does offer quite a bit of functionality. It is a SQL Server Management Studio plugin created to boost the user's productivity. It's easy to use and delivers a handy SQL editor, CRUD procedure generation, snippets, formatting, convenient search with filtering, and SQL execution history. Like SQL Complete, it also includes features that aren't essential but are convenient, such as tab coloring and the ability to export to Excel spreadsheets. Pricing: SSMS Tools Pack is a commercial product with licenses available in Small, Large, and Enterprise team packages. A free version is available for one computer for sixty days. Feature Comparison of dbForge SQL Complete and SSMS Tools Pack To make this comparison, we used the latest versions of the tools: SQL Complete v6.12.8 and SSMS Tools Pack v5.5.2. Read the feature descriptions carefully. Some may be far more critical for your particular goals than others.
Feature | dbForge SQL Complete | SSMS Tools Pack
Compatibility
SSMS integration | Yes | Yes
Visual Studio integration | Yes | No
Improved code quality
Find invalid objects | Yes | No
CRUD procedure generation | Yes | Yes
Generation of the CREATE/ALTER script for server objects | Yes | No
Execution Plan Analyzer | No | Yes
Renaming of objects, variables, and aliases | Yes | No
T-SQL Debugger | Yes | No
Run on multiple targets | Yes | Yes
Safe work with document environment and databases
Various options for executing statements | Yes | Yes
Execution warnings | Yes | Yes
Execution notifications | Yes | No
Transaction reminder | Yes | Yes
Run At Status Bar Element | No | Yes
Tab coloring | Yes | Yes
Custom SSMS main window title | Yes | Yes
Execution history of SQL statements | Yes | Yes
Tab management | Yes | Yes
Quick Connect Active SQL Editor Window | No | Yes
Document sessions | Yes | No
Operations with data in the SSMS data grid
Results Grid data visualizers | Yes | No
Copy Data As from the SSMS grid to XML, CSV, HTML, and JSON | Yes | No
Copy Results Grid headers (column names + types) | Yes | No
Export to Excel from the SSMS Results Grid | No | Yes
Grid aggregates | Yes | Yes
Find in Results Grid | Yes | Yes
Generate Script As from the SSMS data grid | Yes | Yes
Increased coding productivity
Context-sensitive suggestion of object names and keywords | Yes | No
Expand SELECT * | Yes | Yes
Object information | Yes | No
Parameter information | Yes | No
SQL snippets | Yes | Yes
New query template | No | Yes
‘Go to definition’ for database objects | Yes | Yes
Highlighted occurrences of identifiers | Yes | No
Named regions | Yes | Yes
Document Outline window | Yes | No
Unified SQL standards
SQL formatting | Yes | Yes
Multiple predefined formatting profiles | Yes | No
Preservation of the original formatting for any selected piece of code | Yes | No
Command-line interface | Yes | No
Settings
Import/Export Settings Wizard | Yes | Yes
Quick search for options | Yes | No
Releases
Initial release | v1.0 (November 19, 2010) | v1.0 (May 1, 2008)
Latest release (as of September 2022) | v6.12 (September 12, 2022) | v5.5 (July 1, 2020)
Total quantity of releases | 133 | 41
The Verdict of the Comparison of dbForge SQL Complete and SSMS Tools Pack We saw that dbForge SQL Complete has more features than SSMS Tools Pack. It offers more to improve productivity, provides a much wider range of data operations, has noticeably more advanced formatting, and supports a command-line interface. You should also be aware that SQL Complete is updated more frequently. If you're looking for an SSMS Tools Pack alternative, SQL Complete is the best solution for you. It's compatible with both Microsoft Visual Studio and SQL Server Management Studio. It doesn't take a lot of effort to see how effective it is.
There is a conventional approach to learning a framework such as Flask: learn with a tutorial, then build something small, and gradually increase functionality. There are dozens to hundreds of such tutorials, and they are very helpful. Here we offer a complementary approach, one that entirely reverses the script. Build a complete running project you can explore within a minute, then learn how to alter it, debug it - and then how to create it, literally in seconds. App Fiddle: An In-Action Flask Tutorial Tools like JSFiddle are extremely useful. Without installation, you can use your browser to explore existing JavaScript/HTML code, alter it, and see the results. Here, we apply this approach to an entire app: an App (Flask) Fiddle (link at the end). Like a JSFiddle, it opens in your browser; no install. But it's a complete Flask app: a running project, with a database, accessed with SQLAlchemy Accessed via VSCode, running in your browser, courtesy of Codespaces Codespaces is a remarkable new product from GitHub. When you click the link at the end, it requisitions a server, installs your project (and all its dependencies, such as Python and Flask), and opens it in VSCode in your browser. You can also use this App Fiddle to explore Codespaces, how to set up a dev container, and use it on your own projects. The link (at the end) actually opens 3 projects. The first is a minimal Flask/SQLAlchemy app. It has a README: use it to explore the code, run it, alter/debug it, etc. Deliver While Learning But that's not all. While the first project shows it's pretty simple to create a single endpoint, gather some data, and return it, it's a lot more work to create an entire project (multiple endpoints, an Admin App, etc). That's a horse of an entirely different feather! So, we've created API Logic Server. It's an open-source Python app, already loaded into our Codespace project. It creates an entire Flask project with a single command, like this: ApiLogicServer create --project_name=ApiLogicProject --db_url=nw- This reads your database schema (here, a version of Northwind) and creates a complete, executable project, instantly: API: An endpoint for each table, with filtering, sorting, pagination, and related data access. Swagger is automatic. Admin UI: Multi-page/Multi-table apps, with page navigations, automatic joins, and declarative hide/show. It executes a YAML file, so basic customizations do not require HTML or JavaScript background. Custom UIs can be built using your tool of choice (React, Angular, etc), using the API. Fully Customizable: Standard Python, Flask, SQLAlchemy Creating the executable project requires no background in Flask, SQLAlchemy, or even Python. In fact, you can use the created project to learn these technologies, by "fiddling" with a running system that's already delivering value (e.g, enabling custom UI dev, integration, etc). That's because the created project is a standard Flask/SQLAlchemy project. Customize and extend it with all the fundamentals you learned in conventional tutorials, and in the App Fiddle, with your favorite IDE. Unique Spreadsheet-Like Business Rules As an experienced app developer, I think of projects as about half the back end and half the front end. Your mileage may vary, but the backend is certainly a lot of work. 
Multi-table derivations and constraints applied on update: E.g., the customer's balance - the sum of the unshipped order amounts - cannot exceed the credit limit Authorization and authentication E.g., users must enter a valid id and password to gain access Their roles determine what database rows they see (e.g., a multi-tenant application). API Logic Server enables you to declare spreadsheet-like rules to implement these. Rules are a very significant technology, but perhaps the most striking characteristic is that they are 40X more concise than code. These 5 rules represent the same logic as 200 lines of Python: Python Rule.constraint(validate=models.Customer, # logic design translates directly into rules as_condition=lambda row: row.Balance <= row.CreditLimit, error_msg="balance ({row.Balance}) exceeds credit ({row.CreditLimit})") Rule.sum(derive=models.Customer.Balance, # adjust iff AmountTotal or ShippedDate or CustomerID changes as_sum_of=models.Order.AmountTotal, where=lambda row: row.ShippedDate is None) # adjusts - *not* a sql select sum... Rule.sum(derive=models.Order.AmountTotal, # adjust iff Amount or OrderID changes as_sum_of=models.OrderDetail.Amount) Rule.formula(derive=models.OrderDetail.Amount, # compute price * qty as_expression=lambda row: row.UnitPrice * row.Quantity) Rule.copy(derive=models.OrderDetail.UnitPrice, # get Product Price (e,g., on insert, or ProductId change) from_parent=models.Product.UnitPrice) The third project in the fiddle illustrates both the rules and some "standard" Flask/SQLAlchemy customizations. A tutorial is included to help you explore these, run them, see how to debug them, etc. You can access the app fiddle by following the links at the top of this blog.
Oren Eini, Wizard, Hibernating Rhinos (@ayende)
Abhishek Gupta, Principal Developer Advocate, AWS
Artem Ervits, Solutions Engineer, Cockroach Labs
Sahiti Kappagantula, Product Associate, Oorwin