In this article, we will discuss some of the most popular algorithm problems that can be solved with arrays and hashing. I was given some of these problems during interviews. Let's start with the first one.

Contains Duplicate

Description: Given an integer array nums, return true if any value appears at least twice in the array, and return false if every element is distinct.

Solution: What if we add an additional data structure like a HashSet and put the elements inside? If the set already contains an element before we insert it, we return true, and that is it. Simple, isn't it?

Java

public boolean containsDuplicate(int[] nums) {
    Set<Integer> set = new HashSet<>();
    for (int n : nums) {
        if (set.contains(n)) {
            return true;
        } else {
            set.add(n);
        }
    }
    return false;
}

Moving on to our next task: Valid Anagram

Description: Given two strings s and t, return true if t is an anagram of s, and false otherwise. An Anagram is a word or phrase formed by rearranging the letters of a different word or phrase, typically using all the original letters exactly once.

Example 1: Input: s = "anagram", t = "nagaram" Output: true
Example 2: Input: s = "rat", t = "car" Output: false

Solution: First of all, we should understand what an anagram is. Two words are anagrams only if they consist of the same characters, possibly in a different order. That means we should compare characters. There are a few ways to handle this. In the first variant, we can sort the characters of each word and then compare them. Alternatively, we can create a HashMap (or a count array) and, for one word, add characters, and for the other, subtract them (a sketch of that variant appears at the end of this article). Below is the variant with the sorting algorithm.

Java

public boolean isAnagram(String s, String t) {
    if (s == null && t == null) {
        return true;
    } else if (s == null || t == null) {
        return false;
    }
    if (s.length() != t.length()) {
        return false;
    }
    char[] sCh = s.toCharArray();
    char[] tCh = t.toCharArray();
    Arrays.sort(sCh);
    Arrays.sort(tCh);
    for (int i = 0; i < s.length(); i++) {
        if (sCh[i] != tCh[i]) {
            return false;
        }
    }
    return true;
}

Is it clear? Please let me know in the comments. Our next problem: Two Sum

Description: Given an array of integers nums and an integer target, return indices of the two numbers such that they add up to target. You may assume that each input would have exactly one solution, and you may not use the same element twice. You can return the answer in any order.

Example 1: Input: nums = [2,7,11,15], target = 9 Output: [0,1] Explanation: Because nums[0] + nums[1] == 9, we return [0, 1].
Example 2: Input: nums = [3,2,4], target = 6 Output: [1,2]
Example 3: Input: nums = [3,3], target = 6 Output: [0,1]

Solution: This is one of the basic hashing problems. Let's start with a brute-force solution. We could use two nested for loops, iterate over the elements, and compare their sums. It works, but the time complexity is O(N^2), which can be very, very slow. What if, instead of the second loop, we save every element we have already seen into a HashMap and check it against the current element? For example, take the array [3,3] and target = 6. In the first iteration, we put 3 into the map as the key and 0 (its index) as the value. On the next iteration, we look up target - current in the map. In our case, that is 6 - 3 = 3. The map already contains 3 with index 0, so we have found our pair and can return both indices.
Let's take a look at the code:

Java

public int[] twoSum(int[] nums, int target) {
    int[] rez = new int[2];
    Map<Integer, Integer> map = new HashMap<>();
    for (int i = 0; i < nums.length; i++) {
        int rest = target - nums[i];
        if (map.containsKey(rest)) {
            rez[0] = map.get(rest);
            rez[1] = i;
            return rez;
        } else {
            map.put(nums[i], i);
        }
    }
    return rez;
}

For some of you, these problems may look easy, but they were not for me. I spent a lot of time trying to find a correct solution. Now we will look at the hardest problem in this article: Group Anagrams

Description: Given an array of strings strs, group the anagrams together. You can return the answer in any order. An Anagram is a word or phrase formed by rearranging the letters of a different word or phrase, typically using all the original letters exactly once.

Example 1: Input: strs = ["eat","tea","tan","ate","nat","bat"] Output: [["bat"],["nat","tan"],["ate","eat","tea"]]
Example 2: Input: strs = [""] Output: [[""]]
Example 3: Input: strs = ["a"] Output: [["a"]]

Solution: Do you remember the previous problem with anagrams? I want to use the same approach. Remember that anagrams are words with the same characters and the same character counts. What if we sort the characters of each word and build a string from them? For example, take ["nat", "tan"]. We sort "nat" and receive "ant". We sort "tan" and again receive "ant". So we can sort each word and put the words into a map, where the key is the sorted string and the value is the list of original words that share it. Smart, isn't it? Time to look at the code:

Java

public List<List<String>> groupAnagrams(String[] strs) {
    Map<String, List<String>> map = new HashMap<>();
    for (String s : strs) {
        char[] chars = s.toCharArray();
        Arrays.sort(chars);
        String sorted = String.valueOf(chars);
        if (map.containsKey(sorted)) {
            map.get(sorted).add(s);
        } else {
            List<String> list = new ArrayList<>();
            list.add(s);
            map.put(sorted, list);
        }
    }
    return new ArrayList<>(map.values());
}

I hope you are enjoying this topic. Next time, I'm going to tackle more complicated problems. Feel free to add your thoughts in the comments. I really appreciate your time and want to hear your feedback.
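As promised in the Valid Anagram solution, here is the counting-based variant that adds character counts for one string and subtracts them for the other. This is my own sketch rather than code from the original write-up; it assumes the inputs contain only lowercase English letters, which lets us use a fixed 26-slot array instead of a HashMap and brings the running time down to O(N) without sorting.

Java

public boolean isAnagramCounting(String s, String t) {
    if (s == null && t == null) {
        return true;
    }
    if (s == null || t == null || s.length() != t.length()) {
        return false;
    }
    // One counter array for both strings: add for characters of s, subtract for t.
    int[] counts = new int[26];
    for (int i = 0; i < s.length(); i++) {
        counts[s.charAt(i) - 'a']++;
        counts[t.charAt(i) - 'a']--;
    }
    // If every counter is back to zero, the two strings contain exactly the same characters.
    for (int c : counts) {
        if (c != 0) {
            return false;
        }
    }
    return true;
}

If the inputs can contain arbitrary Unicode characters, swap the array for a HashMap<Character, Integer>; the idea stays the same.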
Sorting can take a lot of time when dealing with large amounts of data. It would be great if, instead of sorting the data every time, we could write each element into memory directly at the correct position, so the data would already be sorted. This would allow us to always know in advance where to search, for example, starting from the center. We would know exactly where to go: to the left, discarding the half of the data on the right, or to the right, discarding the half on the left. The number of elements considered in each search step would be halved, which gives us fast logarithmic complexity. But if we really want to insert an element into its correct position in an array, the insertion itself is quite slow. After all, we first need to allocate new space in memory for the required number of elements, then copy part of the old data there, put the new data in place, and then copy all the remaining elements to their new locations. In other words, we get fast search but slow insertion.

In such a situation, a linked list seems more suitable. Insertion into it really does happen in O(1) time without any copying. However, before we can insert, we have to find the place for the insertion by walking through most of the nodes of the list, which again leads us to linear O(N) complexity.

To make both operations fast, the binary search tree was invented. It is a hierarchical data structure consisting of nodes, each of which stores data (a key or a key-value pair) and can have at most two children. In a simple binary tree, data can be stored in any order. In a binary search tree, however, data can only be added according to special rules that allow the operations mentioned above to work quickly. To understand these rules, let's first create a foundation for the tree and its nodes.

C#

public class BinarySearchTree<T> where T : IComparable<T>
{
    private Node? _head;

    private class Node
    {
        public T Value; // not readonly: the removal logic later overwrites this value
        private Node? _left;
        private Node? _right;

        public Node(T value)
        {
            Value = value;
        }
    }
}

Insertion of a New Value

So, let's implement the insertion of a new value. First of all, if the root does not exist, that is, the tree is completely empty, we simply add a new node, which becomes the root. Otherwise, if the new element is less than the current node we are standing on, we recursively move to the left; if it is greater than or equal to the current node, we recursively move to the right. When we reach a place where the child reference is NULL, we perform the insertion there.

C#

public class BinarySearchTree<T> where T : IComparable<T>
{
    ...
    public void Add(T value)
    {
        if (_head is null)
        {
            _head = new Node(value);
        }
        else
        {
            _head.Insert(value);
        }
    }
    ...
}

private class Node
{
    ...
    public void Insert(T value)
    {
        ref var branch = ref value.CompareTo(Value) < 0 ? ref _left : ref _right;
        if (branch is null)
        {
            branch = new Node(value);
        }
        else
        {
            branch.Insert(value);
        }
    }
}

Thanks to recursion, all the nodes that we traverse while searching for the correct location are kept on the call stack. Once a new node is added and the recursion begins to unwind, we ascend through each ancestor node until we reach the root. This property greatly simplifies the code later on, so keep it in mind. If we continue to insert new elements in this way, we will notice that they always end up in sorted order: smaller values always go to the left, while larger ones always go to the right.
Thanks to this, we always quickly find the correct path for both insertions of new elements and searching without affecting the other nodes in the tree. So now we have search and insertion happening in O(log(N)) time. This also allows us to quickly find the minimum and maximum elements of the tree. The minimum will always be the lowest element on the left, while the maximum will be the lowest element on the right. Therefore, in the first case, we simply always recursively descend to the left until we hit NULL. In the second case, it is similar, but we descend to the right. It is important to understand that such nodes cannot have more than one child. Otherwise, they would not be the minimum or maximum. C# public class BinarySearchTree<T> where T : IComparable<T> { ... public T Min() { if (_head is null) { throw new InvalidOperationException("The tree is empty."); } return _head.Min().Value; } public T Max() { if (_head is null) { throw new InvalidOperationException("The tree is empty."); } return _head.Max().Value; } ... } private class Node { ... public Node Min() { return _left is null ? this : _left.Min(); } public Node Max() { return _right is null ? this : _right.Max(); } } Item Search We will create a simple search function to find a specific node in the tree based on its value. Due to the structure of the tree, the search process is relatively straightforward. If the current node contains the given value, we will return it. If the value being searched for is less than the value of the current node, we will recursively search in the left subtree. Conversely, if the search value is greater than the current node's value, we will look in the right subtree. If we reach an empty subtree, i.e., it points to NULL, then the node with the searched value does not exist. The algorithm is similar to the insertion algorithm, and we will implement the Contains method for the tree and the Find method for the nodes. C# public class BinarySearchTree<T> where T : IComparable<T> { ... public bool Contains(T value) { return _head?.Find(value) is not null; } ... } private class Node { ... public Node? Find(T value) { var comparison = value.CompareTo(Value); if (comparison == 0) { return this; } return comparison < 0 ? _left?.Find(value) : _right?.Find(value); } } Removing Values Now, in order for our tree not to break in case we want to delete some element from it, we need some special deletion rules. Firstly, if the element being deleted is a leaf node, then we simply replace it with NULL. If this element had one child, then we replace the deleted node not with NULL but with that child. So, in these two cases, deletion simply boils down to replacing the deleted node with its child, which can be either a regular existing node or NULL. Therefore, we simply check that the parent definitely does not have two children and overwrite the deleted node with one of its children, which, I repeat, can turn out to be either an existing node or NULL. The last case is when the deleted node had two children. In this case, the node that will come in place of the deleted node must be greater than all nodes in the left subtree of the deleted node and smaller than all nodes in the right subtree of the deleted node. Therefore, first, we need to find either the largest element in the left subtree or the smallest element in the right subtree. After we have found it, we overwrite the deleted node's data and recursively delete the node that moved to the deleted node's position. 
It will be deleted according to the same rules: it will either be replaced with NULL or its only child. C# public class BinarySearchTree<T> where T : IComparable<T> { ... public bool Remove(T value) { return _head?.Remove(value, out _head) ?? false; } ... } private class Node { ... public bool Remove(T value, out Node? root) { var comparison = value.CompareTo(Value); if (comparison < 0) { root = this; return _left?.Remove(value, out _left) ?? false; } if (comparison > 0) { root = this; return _right?.Remove(value, out _right) ?? false; } if (_left is null || _right is null) { root = _left ?? _right; return true; } var leftMax = _left.Max(); _left.Remove(leftMax.Value, out _left); Value = leftMax.Value; root = this; return true; } } Tree Traversal To check that the deletion was successful and the order of all nodes has been preserved, there is a special tree traversal method that allows us to output all nodes in ascending order. This method is called an in-order traversal. It involves recursively outputting first the left child, then the parent, and then the right child. Let's convert the tree to a regular list using an in-order traversal. C# public class BinarySearchTree<T> where T : IComparable<T> { ... public List<T> ToList() { var list = new List<T>(); _head?.AddTo(list); return list; } ... } private class Node { ... public void AddTo(ICollection<T> list) { _left?.AddTo(list); list.Add(Value); _right?.AddTo(list); } } Now we have a simple way to output the tree to the console. Let's do that and make sure that the deletion works correctly. C# var tree = new BinarySearchTree<int>(); tree.Add(50); tree.Add(40); tree.Add(30); tree.Add(45); tree.Add(35); Print(tree.ToList()); tree.Remove(35); tree.Remove(40); Print(tree.ToList()); void Print(List<int> list) { foreach (var value in list) { Console.Write(value); Console.Write(" "); } Console.WriteLine(); } Another type of tree traversal is called a pre-order traversal. It involves outputting first the parent, then the left child, and then the right child. This can be useful, for example, when copying a tree in memory, because we traverse the nodes in the exact same order as they were placed in the tree from top to bottom. There are other types of binary search tree traversals, but their implementation differs little. Conclusion Finally, we have a data structure that can do everything quickly. Let's take a moment to think and create a tree from elements 1 to 5. If each subsequent node is always greater or always less than the previous node, then we again get an ordinary linked list with operations of complexity O(N). Therefore, our tree turns out to be completely useless. Fortunately, people quickly realized this and came up with a more advanced tree called AVL with self-balancing, which will again allow us to achieve logarithmic complexity regardless of the incoming data. But we will cover this type of tree and its balancing implementation in the next article Understanding AVL Trees in C#: A Guide to Self-Balancing Binary Search Trees.
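To make the degenerate case from the conclusion concrete, here is a small sketch of my own (not from the original article) that inserts the values 1 through 5 in ascending order. Each new value is greater than everything inserted before it, so every insertion goes to the right child, and the "tree" collapses into a right-leaning chain where search and insertion degrade to O(N).

C#

var degenerate = new BinarySearchTree<int>();
for (var value = 1; value <= 5; value++)
{
    // Always greater than the previous values, so each node becomes the right child of the last one.
    degenerate.Add(value);
}

// The in-order traversal still prints 1 2 3 4 5, but the shape is effectively a linked list:
// 1 -> 2 -> 3 -> 4 -> 5, so Contains(5) has to visit every node.
// Reusing the Print helper from the example above.
Print(degenerate.ToList());

This is exactly the problem that the self-balancing AVL tree mentioned above solves.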
This blog post is for folks interested in learning how to use Golang and AWS Lambda to build a serverless solution. You will be using the aws-lambda-go library along with the AWS Go SDK v2 for an application that will process records from an Amazon Kinesis data stream and store them in a DynamoDB table. But that's not all! You will also use Go bindings for AWS CDK to implement "Infrastructure-as-code" for the entire solution and deploy it with the AWS CDK CLI. Introduction Amazon Kinesis is a platform for real-time data processing, ingestion, and analysis. Kinesis Data Streams is a serverless streaming data service (part of the Kinesis streaming data platform, along with Kinesis Data Firehose, Kinesis Video Streams, and Kinesis Data Analytics) that enables developers to collect, process, and analyze large amounts of data in real-time from various sources such as social media, IoT devices, logs, and more. AWS Lambda, on the other hand, is a serverless compute service that allows developers to run their code without having to manage the underlying infrastructure. The integration of Amazon Kinesis with AWS Lambda provides an efficient way to process and analyze large data streams in real time. A Kinesis data stream is a set of shards and each shard contains a sequence of data records. A Lambda function can act as a consumer application and process data from a Kinesis data stream. You can map a Lambda function to a shared-throughput consumer (standard iterator), or to a dedicated-throughput consumer with enhanced fan-out. For standard iterators, Lambda polls each shard in your Kinesis stream for records using HTTP protocol. The event source mapping shares read throughput with other consumers of the shard. Amazon Kinesis and AWS Lambda can be used together to build many solutions including real-time analytics (allowing businesses to make informed decisions), log processing (use logs to proactively identify and address issues in server/applications, etc. before they become critical), IoT data processing (analyze device data in real-time and trigger actions based on the results), clickstream analysis (provide insights into user behavior), fraud detection (detect and prevent fraudulent card transactions) and more. As always, the code is available on GitHub. Prerequisites Before you proceed, make sure you have the Go programming language (v1.18 or higher) and AWS CDK installed. Clone the GitHub repository and change to the right directory: git clone https://github.com/abhirockzz/kinesis-lambda-events-golang cd kinesis-lambda-events-golang Use AWS CDK To Deploy the Solution To start the deployment, simply invoke cdk deploy and wait for a bit. You will see a list of resources that will be created and will need to provide your confirmation to proceed. cd cdk cdk deploy # output Bundling asset KinesisLambdaGolangStack/kinesis-function/Code/Stage... ✨ Synthesis time: 5.94s This deployment will make potentially sensitive changes according to your current security approval level (--require-approval broadening). Please confirm you intend to make the following modifications: //.... omitted Do you wish to deploy these changes (y/n)? y This will start creating the AWS resources required for our application. If you want to see the AWS CloudFormation template which will be used behind the scenes, run cdk synth and check the cdk.out folder. You can keep track of the progress in the terminal or navigate to the AWS console: CloudFormation > Stacks > KinesisLambdaGolangStack. 
Once all the resources are created, you can try out the application. You should have: a Lambda function, a Kinesis stream, and a DynamoDB table, along with a few other components (IAM roles, etc.).

Verify the Solution

You can check the table and Kinesis stream info in the stack output (in the terminal or the Outputs tab in the AWS CloudFormation console for your stack). Publish a few messages to the Kinesis stream. For the purposes of this demo, you can use the AWS CLI:

export KINESIS_STREAM=<enter the Kinesis stream name from cloudformation output>

aws kinesis put-record --stream-name $KINESIS_STREAM --partition-key user1@foo.com --data $(echo -n '{"name":"user1", "city":"seattle"}' | base64)
aws kinesis put-record --stream-name $KINESIS_STREAM --partition-key user2@foo.com --data $(echo -n '{"name":"user2", "city":"new delhi"}' | base64)
aws kinesis put-record --stream-name $KINESIS_STREAM --partition-key user3@foo.com --data $(echo -n '{"name":"user3", "city":"new york"}' | base64)

Check the DynamoDB table to confirm that the records have been stored. You can use the AWS console or the AWS CLI: aws dynamodb scan --table-name <enter the table name from cloudformation output>.

Don't Forget To Clean Up

Once you're done, to delete all the services, simply use:

cdk destroy

#output prompt (choose 'y' to continue)
Are you sure you want to delete: KinesisLambdaGolangStack (y/n)?

You were able to set up and try the complete solution. Before we wrap up, let's quickly walk through some of the important parts of the code to get a better understanding of what's going on behind the scenes.

Code Walkthrough

Some of the code (error handling, logging, etc.) has been omitted for brevity since we only want to focus on the important parts.

AWS CDK

You can refer to the CDK code here. We start by creating the DynamoDB table:

table := awsdynamodb.NewTable(stack, jsii.String("dynamodb-table"), &awsdynamodb.TableProps{
	PartitionKey: &awsdynamodb.Attribute{
		Name: jsii.String("email"),
		Type: awsdynamodb.AttributeType_STRING},
})

table.ApplyRemovalPolicy(awscdk.RemovalPolicy_DESTROY)

We create the Lambda function (CDK will take care of building and deploying the function) and make sure we provide it with the appropriate permissions to write to the DynamoDB table.

function := awscdklambdagoalpha.NewGoFunction(stack, jsii.String("kinesis-function"), &awscdklambdagoalpha.GoFunctionProps{
	Runtime:     awslambda.Runtime_GO_1_X(),
	Environment: &map[string]*string{"TABLE_NAME": table.TableName()},
	Entry:       jsii.String(functionDir),
})

table.GrantWriteData(function)

Then, we create the Kinesis stream and add it as an event source to the Lambda function.

kinesisStream := awskinesis.NewStream(stack, jsii.String("lambda-test-stream"), nil)

function.AddEventSource(awslambdaeventsources.NewKinesisEventSource(kinesisStream, &awslambdaeventsources.KinesisEventSourceProps{
	StartingPosition: awslambda.StartingPosition_LATEST,
}))

Finally, we export the Kinesis stream and DynamoDB table name as CloudFormation outputs.

awscdk.NewCfnOutput(stack, jsii.String("kinesis-stream-name"), &awscdk.CfnOutputProps{
	ExportName: jsii.String("kinesis-stream-name"),
	Value:      kinesisStream.StreamName()})

awscdk.NewCfnOutput(stack, jsii.String("dynamodb-table-name"), &awscdk.CfnOutputProps{
	ExportName: jsii.String("dynamodb-table-name"),
	Value:      table.TableName()})

Lambda Function

You can refer to the Lambda function code here.
The Lambda function handler iterates over each record in the Kinesis event and, for each of them, unmarshals the JSON payload in the Kinesis record into a Go struct and stores the record's partition key as the primary key attribute (email) of the DynamoDB table. The rest of the information is picked up from the stream data and also stored in the table.

func handler(ctx context.Context, kinesisEvent events.KinesisEvent) error {
	for _, record := range kinesisEvent.Records {
		data := record.Kinesis.Data

		var user CreateUserInfo
		err := json.Unmarshal(data, &user)
		if err != nil {
			return err
		}

		item, err := attributevalue.MarshalMap(user)
		if err != nil {
			return err
		}

		item["email"] = &types.AttributeValueMemberS{Value: record.Kinesis.PartitionKey}

		_, err = client.PutItem(context.Background(), &dynamodb.PutItemInput{
			TableName: aws.String(table),
			Item:      item,
		})
	}
	return nil
}

type CreateUserInfo struct {
	Name string `json:"name"`
	City string `json:"city"`
}

Wrap Up

In this blog, you saw an example of how to use Lambda to process messages in a Kinesis stream and store them in DynamoDB, thanks to the Kinesis and Lambda integration. The entire infrastructure life cycle was automated using AWS CDK. All this was done using the Go programming language, which is well supported in DynamoDB, AWS Lambda, and AWS CDK. Happy building!
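As a closing note, one detail the handler above relies on but does not show is how the client and table variables are initialized. The repository has its own version of this setup; the following is only a sketch of what it typically looks like with the AWS SDK for Go v2, assuming the TABLE_NAME environment variable set by the CDK stack earlier.

package main

import (
	"context"
	"log"
	"os"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
)

var (
	client *dynamodb.Client
	table  string
)

func init() {
	// Load the default AWS configuration (region, credentials) available to the Lambda runtime.
	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		log.Fatalf("failed to load AWS config: %v", err)
	}
	client = dynamodb.NewFromConfig(cfg)

	// TABLE_NAME is injected by the CDK stack as a Lambda environment variable.
	table = os.Getenv("TABLE_NAME")
	if table == "" {
		log.Fatal("TABLE_NAME environment variable is not set")
	}
}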
This is the first blog in a series that will focus on Snowflake, where we'll cover best practices for using Snowflake, explore various Snowflake functionalities, discuss how to maximize the benefits of Snowflake, and address the challenges that come with its implementation or migration. In this blog, we'll start by discussing how to set up a Snowflake account, especially for those new to the Snowflake ecosystem. Even with a Snowflake account readily available, a limited understanding of its system-defined roles often makes it a challenge for a team lead or an admin to set up environments with proper access controls for developers and users.

To start with the account setup, you first need a user with ACCOUNTADMIN role access for the Snowflake account. This can be provided by a user with ORGADMIN access. As an example: an organization has one organization-wide Snowflake setup managed by ORGADMIN, and ORGADMIN can create multiple accounts under the same organization in Snowflake, which can be managed separately by different teams within the organization. Before starting to create users, roles, warehouses, databases, etc., you first need to understand the system-defined roles below and what Snowflake recommends as best practices when setting up the account.

System-Defined Roles

USERADMIN: The initial part of the account setup is creating users and roles within an account, and that is exactly what the USERADMIN role is for. It holds the CREATE USER and CREATE ROLE security privileges.

SECURITYADMIN: A role is incomplete without grants, and the SECURITYADMIN role is used precisely for granting. Anything relating to grants in Snowflake is managed by SECURITYADMIN. Once users and roles are created by USERADMIN, you can use SECURITYADMIN to grant the users the appropriate roles. Using SECURITYADMIN, you can grant a role access to warehouses, databases, schemas, and integration objects, as well as the privileges to create tables, stages, views, etc. The SECURITYADMIN role inherits the privileges of the USERADMIN role via the system role hierarchy. Note that Snowflake doesn't have the concept of user groups. Instead, users are created and the necessary roles are granted to each user.

SYSADMIN: SYSADMIN creates objects such as databases, warehouses, and schemas in an account. Although it creates these objects, it doesn't grant other roles access to them; that is done by SECURITYADMIN.

ACCOUNTADMIN: The ACCOUNTADMIN role encapsulates the SYSADMIN and SECURITYADMIN system-defined roles. It is the top-level role in the system and should be granted only to a limited, controlled number of users in your account. Beyond that, only ACCOUNTADMIN has access to create integration objects in Snowflake. As a best practice, users with the ACCOUNTADMIN role should have MFA enabled.

ORGADMIN: This role is mainly used to create accounts within an organization. Each account acts as a separate entity and will have its own databases, warehouses, and other objects.

PUBLIC: As the name suggests, this role is available to every other user in an account. Objects created with the PUBLIC role can be accessed by anyone in the account; it is used when there is no need for access controls over the objects and they can be shared across the account. It is generally not recommended to use this role for production purposes.
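To illustrate the ORGADMIN point above, creating an additional account within the organization looks roughly like the sketch below. The account name, admin user, password, email, and edition are placeholders made up for illustration, and the available editions and regions depend on your organization, so treat this as an outline rather than a copy-paste command.

SQL

use role ORGADMIN;

-- Create a separate account for another team under the same organization.
-- All values below are placeholders.
create account analytics_team_account
  admin_name = analytics_admin
  admin_password = 'ChangeMe123!'
  email = 'analytics.admin@example.com'
  edition = enterprise;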
Setting up an Account With an Example

Now that it's clear what each system-defined role is meant to do in Snowflake, let's walk through a basic example of setting up an account using them. Assuming you have logged in as a user with ACCOUNTADMIN access, consider the following use case:

There are four users named meghna, adnan, kaushik, and shushant. meghna and adnan are from an analytics team who build reports using reporting tools; hence, they only need read access to the objects created. kaushik and shushant are from the data engineering team and build pipelines to load data into the Snowflake databases. Since it's a development environment, they will have read-and-write access to the objects created.

So, let's use their first names as the usernames: meghna, adnan, kaushik, shushant. Since they are working on an analytics project in a dev environment, we can create two roles: one for read access named ROLE_DEV_ANALYTICS_RO and one for read/write access named ROLE_DEV_ANALYTICS_RW.

Steps: First, as discussed, let's create the roles and users using the USERADMIN role.

SQL

use role USERADMIN;

-- Create the roles
create role ROLE_DEV_ANALYTICS_RO;
create role ROLE_DEV_ANALYTICS_RW;

-- Create the users
create user meghna password='abc123' default_role = ROLE_DEV_ANALYTICS_RO default_secondary_roles = ('ALL') must_change_password = true;
create user adnan password='abc123' default_role = ROLE_DEV_ANALYTICS_RO default_secondary_roles = ('ALL') must_change_password = true;
create user kaushik password='abc123' default_role = ROLE_DEV_ANALYTICS_RW default_secondary_roles = ('ALL') must_change_password = true;
create user shushant password='abc123' default_role = ROLE_DEV_ANALYTICS_RW default_secondary_roles = ('ALL') must_change_password = true;

Note that all four users are created with the same password and the argument must_change_password = true, which will force them to change their passwords upon first login.

Next, use SECURITYADMIN for the grants. First, grant the new roles to SYSADMIN:

SQL

use role SECURITYADMIN;

-- Grant the roles created to SYSADMIN
grant role ROLE_DEV_ANALYTICS_RO to role SYSADMIN;
grant role ROLE_DEV_ANALYTICS_RW to role SYSADMIN;

This is done so that objects like tables, stages, and views created using these roles are accessible to SYSADMIN as well. Without this grant, SYSADMIN wouldn't be able to access or manage the objects created by these roles. Then grant the users their respective roles:

SQL

use role SECURITYADMIN;

-- Grant the roles to the users
grant role ROLE_DEV_ANALYTICS_RO to user meghna;
grant role ROLE_DEV_ANALYTICS_RO to user adnan;
grant role ROLE_DEV_ANALYTICS_RW to user kaushik;
grant role ROLE_DEV_ANALYTICS_RW to user shushant;

Now let's use SYSADMIN to create the warehouse, database, and schemas.

SQL

use role SYSADMIN;

-- Create database and schemas
create database analytics_dev;
create schema analytics_dev.analytics_master;
create schema analytics_dev.analytics_summary;

-- Create warehouse
create warehouse analytics_small with
  warehouse_size = 'SMALL'
  warehouse_type = 'STANDARD'
  auto_suspend = 60
  auto_resume = TRUE;

The SQL above creates a small warehouse that suspends after 60 seconds of inactivity and automatically resumes whenever queries are triggered. Now that the database, schemas, and warehouse are ready, it is time to grant the roles the necessary access using SECURITYADMIN. Let's assume that only tables and views are used for this project.
SQL

use role SECURITYADMIN;

-- Granting the usage access to ROLE_DEV_ANALYTICS_RO
grant usage on database analytics_dev to role ROLE_DEV_ANALYTICS_RO;
grant usage on all schemas in database analytics_dev to role ROLE_DEV_ANALYTICS_RO;
grant select on future tables in database analytics_dev to role ROLE_DEV_ANALYTICS_RO;
grant select on all tables in database analytics_dev to role ROLE_DEV_ANALYTICS_RO;
grant select on future views in database analytics_dev to role ROLE_DEV_ANALYTICS_RO;
grant select on all views in database analytics_dev to role ROLE_DEV_ANALYTICS_RO;

-- Granting the usage access to ROLE_DEV_ANALYTICS_RW
grant usage on database analytics_dev to role ROLE_DEV_ANALYTICS_RW;
grant usage on all schemas in database analytics_dev to role ROLE_DEV_ANALYTICS_RW;
grant select on future tables in database analytics_dev to role ROLE_DEV_ANALYTICS_RW;
grant select on all tables in database analytics_dev to role ROLE_DEV_ANALYTICS_RW;
grant select on future views in database analytics_dev to role ROLE_DEV_ANALYTICS_RW;
grant select on all views in database analytics_dev to role ROLE_DEV_ANALYTICS_RW;
grant create table on schema analytics_dev.analytics_master to role ROLE_DEV_ANALYTICS_RW;
grant create view on schema analytics_dev.analytics_master to role ROLE_DEV_ANALYTICS_RW;
grant create table on schema analytics_dev.analytics_summary to role ROLE_DEV_ANALYTICS_RW;
grant create view on schema analytics_dev.analytics_summary to role ROLE_DEV_ANALYTICS_RW;

As seen above, ROLE_DEV_ANALYTICS_RO has been granted read access only, while ROLE_DEV_ANALYTICS_RW is granted both read and write access. Finally, let's grant the warehouse to the roles.

SQL

use role SECURITYADMIN;

grant usage, operate on warehouse analytics_small to role ROLE_DEV_ANALYTICS_RO;
grant usage, operate on warehouse analytics_small to role ROLE_DEV_ANALYTICS_RW;

The users should now be able to log in to Snowflake and work only with the roles, and therefore the permissions, assigned to them. A quick way to verify the setup is shown at the end of this article.

Thank you for reading through the entire article. In the next installment of this series, we will delve into some of the most effective practices for loading files into Snowflake.
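As referenced above, here is an optional sanity check (not part of the original walkthrough) that you can run before handing the environment over, to confirm that the grants landed where you expect:

SQL

use role SECURITYADMIN;

-- What has each role been granted?
show grants to role ROLE_DEV_ANALYTICS_RO;
show grants to role ROLE_DEV_ANALYTICS_RW;

-- Which roles does each user hold?
show grants to user meghna;
show grants to user kaushik;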
What We Use ClickHouse For The music library of Tencent Music contains data of all forms and types: recorded music, live music, audio, videos, etc. As data platform engineers, our job is to distill information from the data, based on which our teammates can make better decisions to support our users and musical partners. Specifically, we do an all-round analysis of the songs, lyrics, melodies, albums, and artists, turn all this information into data assets, and pass them to our internal data users for inventory counting, user profiling, metrics analysis, and group targeting. We stored and processed most of our data in Tencent Data Warehouse (TDW), an offline data platform where we put the data into various tag and metric systems and then created flat tables centering each object (songs, artists, etc.). Then we imported the flat tables into ClickHouse for analysis and Elasticsearch for data searching and group targeting. After that, our data analysts used the data under the tags and metrics they needed to form datasets for different usage scenarios, during which they could create their own tags and metrics. The data processing pipeline looked like this: Why ClickHouse Is Not a Good Fit When working with the above pipeline, we encountered a few difficulties: Partial Update: Partial update of columns was not supported. Therefore, any latency from any one of the data sources could delay the creation of flat tables and thus undermine data timeliness. High storage cost: Data under different tags and metrics was updated at different frequencies. As much as ClickHouse excelled in dealing with flat tables, it was a huge waste of storage resources to just pour all data into a flat table and partition it by day, not to mention the maintenance cost coming with it. High maintenance cost: Architecturally speaking, ClickHouse was characterized by the strong coupling of storage nodes and compute nodes. Its components were heavily interdependent, adding to the risks of cluster instability. Plus, for federated queries across ClickHouse and Elasticsearch, we had to take care of a huge amount of connection issues. That was just tedious. Transition to Apache Doris Apache Doris, a real-time analytical database, boasts a few features that are exactly what we needed to solve our problems: Partial update: Doris supports a wide variety of data models, among which the Aggregate Model supports the real-time partial update of columns. Building on this, we can directly ingest raw data into Doris and create flat tables there. The ingestion goes like this: Firstly, we use Spark to load data into Kafka; then, any incremental data will be updated to Doris and Elasticsearch via Flink. Meanwhile, Flink will pre-aggregate the data so as to release the burden on Doris and Elasticsearch. Storage cost: Doris supports multi-table join queries and federated queries across Hive, Iceberg, Hudi, MySQL, and Elasticsearch. This allows us to split the large flat tables into smaller ones and partition them by update frequency. The benefits of doing so include relief of storage burden and an increase in query throughput. Maintenance cost: Doris is of simple architecture and is compatible with MySQL protocol. Deploying Doris only involves two processes (FE and BE) with no dependency on other systems, making it easy to operate and maintain. Also, Doris supports querying external ES data tables. 
It can easily interface with the metadata in ES and automatically map the table schema from ES so we can conduct queries on Elasticsearch data via Doris without grappling with complex connections. What's more, Doris supports multiple data ingestion methods, including batch import from remote storage such as HDFS and S3, data reads from MySQL binlog and Kafka, and real-time data synchronization or batch import from MySQL, Oracle, and PostgreSQL. It ensures service availability and data reliability through a consistency protocol and is capable of auto-debugging. This is great news for our operators and maintainers. Statistically speaking, these features have cut our storage cost by 42% and development cost by 40%. During our usage of Doris, we have received lots of support from the open-source Apache Doris community and timely help from the SelectDB team, which is now running a commercial version of Apache Doris. Further Improvements To Serve Our Needs Introduce a Semantic Layer Speaking of the datasets, on the bright side, our data analysts are given the liberty of redefining and combining the tags and metrics at their convenience. But on the dark side, high heterogeneity of the tag and metric systems leads to more difficulty in their usage and management. Our solution is to introduce a semantic layer in our data processing pipeline. The semantic layer is where all the technical terms are translated into more comprehensible concepts for our internal data users. In other words, we are turning the tags and metrics into first-class citizens for data definement and management. Why Would This Help? For data analysts, all tags and metrics will be created and shared at the semantic layer so there will be less confusion and higher efficiency. For data users, they no longer need to create their own datasets or figure out which one is applicable for each scenario but can simply conduct queries on their specified tagset and metricset. Upgrade the Semantic Layer Explicitly defining the tags and metrics at the semantic layer was not enough. In order to build a standardized data processing system, our next goal was to ensure consistent definition of tags and metrics throughout the whole data processing pipeline. For this sake, we made the semantic layer the heart of our data management system: How Does It Work? All computing logics in TDW will be defined at the semantic layer in the form of a single tag or metric. The semantic layer receives logic queries from the application side, selects an engine accordingly, and generates SQL. Then it sends the SQL command to TDW for execution. Meanwhile, it might also send configuration and data ingestion tasks to Doris and decide which metrics and tags should be accelerated. In this way, we have made the tags and metrics more manageable. A fly in the ointment is that since each tag and metric is individually defined, we are struggling with automating the generation of a valid SQL statement for the queries. If you have any idea about this, you are more than welcome to talk to us. Give Full Play to Apache Doris As you can see, Apache Doris has played a pivotal role in our solution. Optimizing the usage of Doris can largely improve our overall data processing efficiency. So, in this part, we are going to share with you what we do with Doris to accelerate data ingestion and queries and reduce costs. What We Want Currently, we have 800+ tags and 1300+ metrics derived from the 80+ source tables in TDW. 
When importing data from TDW to Doris, we hope to achieve: Real-time availability: In addition to the traditional T+1 offline data ingestion, we require real-time tagging. Partial update: Each source table generates data through its own ETL task at various paces and involves only part of the tags and metrics, so we require support for partial update of columns. High performance: We need a response time of only a few seconds in group targeting, analysis, and reporting scenarios. Low costs: We hope to reduce costs as much as possible. What We Do Generate Flat Tables in Flink Instead of TDW Generating flat tables in TDW has a few downsides: High storage cost: TDW has to maintain an extra flat table apart from the discrete 80+ source tables. That's huge redundancy. Low real-timeliness: Any delay in the source tables will be augmented and retard the whole data link. High development cost: To achieve real-timeliness would require extra development efforts and resources. On the contrary, generating flat tables in Doris is much easier and less expensive. The process is as follows: Use Spark to import new data into Kafka in an offline manner. Use Flink to consume Kafka data. Create a flat table via the primary key ID. Import the flat table into Doris. As is shown below, Flink has aggregated the five lines of data, of which "ID"=1, into one line in Doris, reducing the data writing pressure on Doris. This can largely reduce storage costs since TDW no longer has to maintain two copies of data, and KafKa only needs to store the new data pending for ingestion. What's more, we can add whatever ETL logic we want into Flink and reuse lots of development logic for offline and real-time data ingestion. Name the Columns Smartly As we mentioned, the Aggregate Model of Doris allows for a partial update of columns. Here we provide a simple introduction to other data models in Doris for your reference: Unique Model: This is applicable for scenarios requiring primary key uniqueness. It only keeps the latest data of the same primary key ID. (As far as we know, the Apache Doris community is planning to include partial update of columns in the Unique Model, too.) Duplicate Model: This model stores all original data exactly as it is without any pre-aggregation or deduplication. After determining the data model, we had to think about how to name the columns. Using the tags or metrics as column names was not a choice because: Our internal data users might need to rename the metrics or tags, but Doris 1.1.3 does not support the modification of column names. Tags might be taken online and offline frequently. If that involves the adding and dropping of columns, it will be not only time-consuming but also detrimental to query performance. Instead, we do the following: For flexible renaming of tags and metrics, we use MySQL tables to store the metadata (name, globally unique ID, status, etc.). Any change to the names will only happen in the metadata but will not affect the table schema in Doris. For example, if a song_name is given an ID of 4, it will be stored with the column name of a4 in Doris. Then if the song_nameis involved in a query, it will be converted to a4 in SQL. For the onlining and offlining of tags, we sort out the tags based on how frequently they are being used. The least used ones will be given an offline mark in their metadata. No new data will be put under the offline tags but the existing data under those tags will still be available. 
For real-time availability of newly added tags and metrics, we prebuild a few ID columns in Doris tables based on the mapping of name IDs. These reserved ID columns are allocated to newly added tags and metrics. Thus, we can avoid table schema changes and the consequent overhead. Our experience shows that only 10 minutes after tags and metrics are added, the data under them becomes available. Notably, the recently released Doris 1.2.0 supports Light Schema Change, which means that to add or remove columns, you only need to modify the metadata in FE. Also, you can rename the columns in data tables as long as you have enabled Light Schema Change for the tables. This is a big trouble saver for us.

Optimize Data Writing

Here are a few practices that have reduced our daily offline data ingestion time by 75% and our CUMU compaction score from 600+ to 100.

Flink pre-aggregation: as mentioned above.
Auto-sizing of write batches: To reduce Flink resource usage, we enable the data in one Kafka topic to be written into various Doris tables and automatically adjust the batch size based on the data amount.
Optimization of Doris data writing: fine-tune the sizes of tablets and buckets as well as the compaction parameters for each scenario: max_XXXX_compaction_thread, max_cumulative_compaction_num_singleton_deltas.
Optimization of the BE commit logic: conduct regular caching of BE lists, commit them to the BE nodes batch by batch, and use finer load balancing granularity.

Use Doris-on-ES in Queries

About 60% of our data queries involve group targeting. Group targeting means finding our target data by using a set of tags as filters. It poses a few requirements for our data processing architecture:

Group targeting related to APP users can involve very complicated logic. That means the system must support hundreds of tags as filters simultaneously.
Most group targeting scenarios only require the latest tag data. However, metric queries need to support historical data.
Data users might need to perform further aggregated analysis of metric data after group targeting.
Data users might also need to perform detailed queries on tags and metrics after group targeting.

After consideration, we decided to adopt Doris-on-ES. Doris is where we store the metric data for each scenario as a partition table, while Elasticsearch stores all tag data. The Doris-on-ES solution combines the distributed query planning capability of Doris and the full-text search capability of Elasticsearch. The query pattern is as follows:

SELECT tag, agg(metric) FROM Doris WHERE id in (select id from Es where tagFilter) GROUP BY tag

As shown, the ID data located in Elasticsearch is used in the sub-query in Doris for metric analysis. In practice, we find that the query response time is related to the size of the target group. If the target group contains over one million objects, the query takes up to 60 seconds. If it is even larger, a timeout error might occur. After investigation, we identified our two biggest time wasters:

When Doris BE pulls data from Elasticsearch (1024 lines at a time by default) for a target group of over one million objects, the network I/O overhead can be huge.
After the data pulling, Doris BE needs to conduct Join operations with the local metric tables via SHUFFLE/BROADCAST, which can cost a lot.

Thus, we make the following optimizations:

Add a query session variable es_optimize that specifies whether to enable optimization.
In data writing into ES, add a BK column to store the bucket number after the primary key ID is hashed. The algorithm is the same as the bucketing algorithm in Doris (CRC32). Use Doris BE to generate a Bucket Join execution plan, dispatch the bucket number to BE ScanNode, and push it down to ES. Use ES to compress the queried data, turn multiple data fetch into one and reduce network I/O overhead. Make sure that Doris BE only pulls the data of buckets related to the local metric tables and conducts local Join operations directly to avoid data shuffling between Doris BEs. As a result, we reduce the query response time for large group targeting from 60 seconds to a surprising 3.7 seconds. Community information shows that Doris is going to support inverted indexing since version 2.0.0, which is soon to be released. With this new version, we will be able to conduct a full-text search on text types, equivalence or range filtering of texts, numbers, and datetime, and conveniently combine AND, OR, NOT logic in filtering since the inverted indexing supports array types. This new feature of Doris is expected to deliver 3~5 times better performance than Elasticsearch on the same task. Refine the Management of Data Doris' capability of cold and hot data separation provides the foundation of our cost reduction strategies in data processing. Based on the TTL mechanism of Doris, we only store data of the current year in Doris and put the historical data before that in TDW for lower storage cost. We vary the number of copies for different data partitions. For example, we set three copies for data from the recent three months, which is used frequently, one copy for data older than six months, and two copies for data in between. Doris supports turning hot data into cold data, so we only store data of the past seven days in SSD and transfer data older than that to HDD for less expensive storage. Conclusion Thank you for scrolling all the way down here and finishing this long read. We've shared our cheers and tears, lessons learned, and a few practices that might be of some value to you during our transition from ClickHouse to Doris. We really appreciate the help from the Apache Doris community and the SelectDB team, but we might still be chasing them around for a while since we attempt to realize auto-identification of cold and hot data, pre-computation of frequently used tags/metrics, simplification of code logic using Materialized Views, and so on and so forth. (This article is co-written by me and my colleague Kai Dai. We are both data platform engineers at Tencent Music (NYSE: TME), a music streaming service provider with a whopping 800 million monthly active users. To drop the number here is not to brag but to give a hint of the sea of data that my poor coworkers and I have to deal with every day.)
Data streaming is one of the most relevant buzzwords in tech for building scalable real-time applications in the cloud and innovative business models. Do you wonder about my predicted top 5 data streaming trends in 2023 to set data in motion? Check out the following presentation and learn what role Apache Kafka plays. Learn about the decentralized data mesh, cloud-native lakehouse, data sharing, improved user experience, and advanced data governance. Some followers might notice that this has become a series, with past posts about the top 5 data streaming trends for 2021 and the top 5 for 2022. Data streaming with Apache Kafka is a journey and evolution to set data in motion. Trends change over time, but the core value of a scalable real-time infrastructure as the central data hub stays.

Gartner Top Strategic Technology Trends for 2023

The research and consulting company Gartner defines the top strategic technology trends every year. This time, the trends are more focused on particular niche concepts. On a higher level, it is all about optimizing, scaling, and pioneering. Here is what Gartner expects for 2023 (source: Gartner). It is funny (but not surprising): Gartner's predictions overlap with and complement the five trends I focus on for data streaming with Apache Kafka looking forward to 2023. I explore how data streaming enables better time to market with decentralized, optimized architectures, cloud-native infrastructure for elastic scale, and pioneering innovative use cases to build valuable data products. Hence, here you go with the top 5 trends in data streaming for 2023.

The Top 5 Data Streaming Trends for 2023

I see the following topics coming up more regularly in conversations with customers, prospects, and the broader Kafka community across the globe:

Cloud-native lakehouses
Decentralized data mesh
Data sharing in real time
Improved developer and user experience
Advanced data governance and policy enforcement

The following sections describe each trend in more detail. The end of the article contains the complete slide deck. The trends are relevant for various scenarios, no matter whether you use open-source Apache Kafka, a commercial platform, or a fully managed cloud service like Confluent Cloud.

Kafka as Data Fabric for Cloud-Native Lakehouses

Many data platform vendors pitch the lakehouse vision today. That's the same story as the data lake in the Hadoop era, with a few new nuances: put all your data into a single data store to save the world and solve every problem and use case. In the last ten years, most enterprises realized this strategy did not work. The data lake is great for reporting and batch analytics but not the right choice for every problem. Besides technical challenges, new challenges emerged: data governance, compliance issues, data privacy, and so on. Applying a best-of-breed enterprise architecture for real-time and batch data analytics, using the right tool for each job, is a much more successful, flexible, and future-ready approach. Data platforms like Databricks, Snowflake, Elastic, MongoDB, BigQuery, etc., have their sweet spots and trade-offs. Data streaming increasingly becomes the real-time data fabric between all the different data platforms and other business applications, leveraging the real-time Kappa architecture instead of the much more batch-focused Lambda architecture.
Decentralized Data Mesh With Valuable Data Products

Focusing on business value by building data products in independent domains with various technologies is key to success in today's agile world with ever-changing requirements and challenges. The data mesh came to the rescue and emerged as a next-generation design pattern, succeeding service-oriented architectures and microservices. Vendors offer two main proposals for building a data mesh: data integration with data streaming enables fully decentralized business products, while data virtualization provides centralized queries. Centralized queries are simple but do not provide a clean architecture with decoupled domains and applications. They might work well to solve a single problem in a project. However, I highly recommend building a decentralized data mesh with data streaming to decouple the applications, especially for strategic enterprise architectures.

Collaboration Within and Across Organizations in Real Time

Collaborating within and outside the organization with data sharing using Open APIs, streaming data exchange, and cluster linking enables many innovative business models. The difference between data streaming and a database, data warehouse, or data lake is crucial: all those platforms enable data sharing at rest. The data is stored on disk before it is replicated and shared within the organization or with partners. This is not real time, and you cannot connect a real-time consumer to data at rest. Real-time data beats slow data. Hence, sharing data in real time with data streaming platforms like Apache Kafka or Confluent Cloud delivers accurate data as soon as a change happens. A consumer can be real-time, near real-time, or batch. A streaming data exchange puts data in motion within the organization or for B2B data sharing and Open API business models.

AsyncAPI Spec for Apache Kafka API Schemas

AsyncAPI allows developers to define the interfaces of asynchronous APIs. It is protocol agnostic. Features include:

Specification of API contracts (= schemas in the data streaming world)
Documentation of APIs
Code generation for many programming languages
Data governance
And much more...

Confluent Cloud recently added a feature for generating an AsyncAPI specification for Apache Kafka clusters. We don't know yet where the market is going. Will AsyncAPI become for data streaming what OpenAPI is for RESTful APIs? Maybe. I see increasing demand for this specification from customers. Let's review the status of AsyncAPI in a few quarters or years. But it has the potential.

Improved Developer Experience With Low-Code/No-Code Tools for Apache Kafka

Many analysts and vendors pitch low-code/no-code tools. Visual coding is nothing new. Very sophisticated, powerful, and easy-to-use solutions exist as IDEs or cloud applications. The significant benefit is time to market for developing applications and easier maintenance. At least in theory. These tools support various personas like developers, citizen integrators, and data scientists. At least in theory. The reality is that code is king, development is about evolution, and open platforms win. Low code/no code is great for some scenarios and personas, but it is just one option of many. Let's look at a few alternatives for building Kafka-native applications. These Kafka-native technologies have their trade-offs. For instance, the Confluent Stream Designer is perfect for building streaming ETL pipelines between various data sources and sinks. Just click the pipeline and transformations together.
Then deploy the data pipeline as a scalable, reliable, and fully managed streaming application. The difference from separate tools like Apache NiFi is that the generated code runs in the same streaming platform, i.e., one infrastructure end-to-end. This makes ensuring SLAs and latency requirements much more manageable and the whole data pipeline more cost-efficient. However, the simpler a tool is, the less flexible it is. It is as simple as that, no matter which product or vendor you look at. This is not just true for Kafka-native tools. And you are flexible with your tool choice per project or business problem. Add your favorite non-Kafka stream processing engine to the stack, for instance, Apache Flink. Or use a separate iPaaS middleware like Dell Boomi or SnapLogic.

Domain-Driven Design With Dumb Pipes and Smart Endpoints

The real benefit of data streaming is the freedom of choice for your favorite Kafka-native technology, open-source stream processing framework, or cloud-native iPaaS middleware. Choose the proper library, tool, or SaaS for your project. Data streaming enables a decoupled, domain-driven design with dumb pipes and smart endpoints. Data streaming with Apache Kafka is perfect for domain-driven design (DDD). In contrast, the point-to-point connections often used in microservice architectures, whether HTTP/REST web services or push-based message brokers like RabbitMQ, create much stronger dependencies between applications.

Data Governance Across the Data Streaming Pipeline

An enterprise architecture powered by data streaming enables easy access to data in real time. Many enterprises leverage Apache Kafka as the central nervous system between all data sources and sinks. The consequence of being able to access all data easily across business domains is two conflicting pressures on organizations: unlock the data to enable innovation versus lock up the data to keep it safe. Achieving data governance across the end-to-end data streams with data lineage, event tracing, policy enforcement, and time travel to analyze historical events is critical for strategic data streaming in the enterprise architecture. Data governance on top of the streaming platform is required for end-to-end visibility, compliance, and security.

Policy Enforcement With Schemas and API Contracts

The foundation for data governance is the management of API contracts (so-called schemas in data streaming platforms like Apache Kafka). Solutions like Confluent enforce schemas along the data pipeline, including the data producer, server, and consumer. Additional data governance tools like data lineage, data catalogs, or policy enforcement are built on this foundation. The recommendation for any serious data streaming project is to use schemas from the beginning. They may seem unnecessary for the first pipeline, but the following producers and consumers need a trusted environment with enforced policies to establish a decentralized data mesh architecture with independent but connected data products.

Slides and Video for Data Streaming Use Cases in 2023

Here is the slide deck from my presentation, and here is the free on-demand video recording.

Data Streaming Goes Up the Maturity Curve in 2023

It is still an early stage for data streaming in most enterprises. But the discussion goes beyond questions like "when to use Kafka?" or "which cloud service to use?"... In 2023, most enterprises look at more sophisticated challenges around their numerous data streaming projects. The new trends are often related to each other.
A data mesh enables the building of independent data products that focus on business value. Data sharing is a fundamental requirement for a data mesh. New personas access the data stream; often, citizen developers or data scientists need easy tools to pioneer new projects. The enterprise architecture requires and enforces data governance across the pipeline for security, compliance, and privacy reasons. Scalability and elasticity need to be there out of the box. Fully managed data streaming is a brilliant opportunity for getting started in 2023 and moving up in the maturity curve from single projects to a central nervous system of real-time data. What are your most relevant and exciting trends for data streaming and Apache Kafka in 2023 to set data in motion? What is your strategy and timeline?
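To make the "dumb pipes and smart endpoints" idea above more tangible, here is a minimal Java sketch of a Kafka producer that publishes a domain event to a topic. It is an illustrative example rather than anything from the article: the topic name, broker address, and JSON payload are assumptions, and in a governed pipeline you would typically swap the plain string serializer for a schema-aware serializer backed by a schema registry.
Java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed broker address; replace with your cluster's bootstrap servers.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The producer only knows the topic (the "dumb pipe"); it has no
            // dependency on whoever consumes the event downstream.
            ProducerRecord<String, String> record =
                new ProducerRecord<>("orders", "order-42", "{\"status\":\"CREATED\"}");
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("Published to %s-%d at offset %d%n",
                        metadata.topic(), metadata.partition(), metadata.offset());
                }
            });
            producer.flush();
        }
    }
}
A consuming domain subscribes to the same topic independently; neither side knows anything about the other's implementation, which is exactly the decoupling the article argues for.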
Data engineering is the practice of managing large amounts of data efficiently, from storing and processing to analyzing and visualizing. Therefore, data engineers must be well-versed in data structures and algorithms that can help them manage and manipulate data efficiently. This article will explore some of the most important data structures and algorithms that data engineers should be familiar with, including their uses and advantages. Data Structures Relational Databases Relational databases are one of the most common data structures used by data engineers. A relational database consists of a set of tables with defined relationships between them. These tables are used to store structured data, such as customer information, sales data, and product inventory. Relational databases are typically used in transactional systems like e-commerce platforms or banking applications. They are highly scalable, provide data consistency and reliability, and support complex queries. NoSQL Databases NoSQL databases are a type of non-relational database used to store and manage unstructured or semi-structured data. Unlike relational databases, NoSQL databases do not use tables or relationships. Instead, they store data using documents, graphs, or key-value pairs. NoSQL databases are highly scalable and flexible, making them ideal for handling large volumes of unstructured data, such as social media feeds, sensor data, or log files. They are also highly resilient to failures, provide high performance, and are easy to maintain. Data Warehouses Data warehouses are specialized databases designed for storing and processing large amounts of data from multiple sources. Data warehouses are typically used for data analytics and reporting and can help streamline and optimize data processing workflows. Data warehouses are highly scalable, support complex queries, and perform well. They are also highly reliable and support data consolidation and normalization. Distributed File Systems Distributed file systems such as Hadoop Distributed File System (HDFS) are used to store and manage large volumes of data across multiple machines. In addition, these highly scalable file systems provide fault tolerance and support batch processing. Distributed file systems are used to store and process large volumes of unstructured data, such as log files or sensor data. They are also highly resilient to failures and support parallel processing, making them ideal for big data processing. Message Queues Message queues are used to manage the data flow between different components of a data processing pipeline. They help to decouple different parts of the system, improve scalability and fault tolerance, and support asynchronous communication. Message queues are used to implement distributed systems, such as microservices or event-driven architectures. They are highly scalable, support high throughput, and provide resilience to system failures. Algorithms Sorting Algorithms Sorting algorithms are used to arrange data in a specific order. Sorting is an essential operation in data engineering as it can significantly improve the performance of various operations such as search, merge, and join. Sorting algorithms can be classified into two categories: comparison-based sorting algorithms and non-comparison-based sorting algorithms. Comparison-based sorting algorithms such as bubble sort, insertion sort, quicksort, and mergesort compare elements in the data to determine the order. 
Efficient comparison-based sorts such as mergesort run in O(n log n) time in both the average and worst cases; quicksort averages O(n log n) but can degrade to O(n^2) in the worst case, while simpler algorithms like bubble sort and insertion sort take O(n^2) on average. Non-comparison-based sorting algorithms such as counting sort, radix sort, and bucket sort do not compare elements to determine the order. As a result, these algorithms can run in roughly linear time, O(n), under suitable assumptions about the range or distribution of the keys. Sorting algorithms are used in various data engineering tasks, such as data preprocessing, data cleaning, and data analysis. Searching Algorithms Searching algorithms are used to find specific elements in a dataset. Searching algorithms are essential in data engineering as they enable efficient retrieval of data from large datasets. Searching algorithms can be classified into two categories: linear search and binary search. Linear search is a simple algorithm that checks each element in a dataset until the target element is found. Linear search has a time complexity of O(n) in the worst case. Binary search is a more efficient algorithm that works on sorted datasets. Binary search divides the dataset in half at each step and compares the middle element to the target element (see the short Java sketch at the end of this article). Binary search has a time complexity of O(log n) in the worst case. Searching algorithms are used in various data engineering tasks such as data retrieval, data querying, and data analysis. Hashing Algorithms Hashing algorithms are used to map data of arbitrary size to fixed-size values. Hashing algorithms are essential in data engineering as they enable efficient data storage and retrieval. Hashing algorithms can be classified into two categories: cryptographic hashing and non-cryptographic hashing. Cryptographic hashing algorithms such as SHA-256 are used for secure data storage and transmission (MD5 is also widely known but is no longer considered secure because practical collisions exist). These algorithms produce a fixed-size hash value that is effectively unique to the input data, and the hash value cannot feasibly be reversed to obtain the original input data. Non-cryptographic hashing algorithms such as MurmurHash and CityHash are used for efficient data storage and retrieval. These algorithms produce a fixed-size hash value that is based on the input data. The hash value can be used to quickly search for the input data in a large dataset. Hashing algorithms are used in various data engineering tasks such as data storage, data retrieval, and data analysis. Graph Algorithms Graph algorithms are used to analyze data that can be represented as a graph. Graphs are used to represent relationships between data elements such as social networks, web pages, and molecules. Graph algorithms can be classified into two categories: traversal algorithms and pathfinding algorithms. Traversal algorithms such as breadth-first search (BFS) and depth-first search (DFS) are used to visit all the nodes in a graph. Traversal algorithms can be used to find connected components, detect cycles, and perform topological sorting. Pathfinding algorithms such as Dijkstra's algorithm and the A* algorithm are used to find the shortest path between two nodes in a graph. For example, pathfinding algorithms can be used to find the shortest path in a road network, find the optimal route for a delivery truck, and find the most efficient path for a robot. Data structures and algorithms are essential tools for data engineers, enabling them to build scalable, efficient, and optimized solutions for managing and processing large datasets.
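As promised above, here is a minimal Java sketch of binary search on a sorted int array. It is an illustrative example only (the class, method, and sample data are invented for this sketch); it returns the index of the target, or -1 when the target is absent.
Java
public class BinarySearchDemo {

    // Returns the index of target in the sorted array, or -1 if not found.
    static int binarySearch(int[] sorted, int target) {
        int lo = 0;
        int hi = sorted.length - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;   // avoids overflow for large indices
            if (sorted[mid] == target) {
                return mid;
            } else if (sorted[mid] < target) {
                lo = mid + 1;               // discard the left half
            } else {
                hi = mid - 1;               // discard the right half
            }
        }
        return -1;                          // target is not present
    }

    public static void main(String[] args) {
        int[] data = {2, 5, 8, 12, 16, 23, 38, 56, 72, 91};
        System.out.println(binarySearch(data, 23));  // prints 5
        System.out.println(binarySearch(data, 7));   // prints -1
    }
}
Each iteration halves the remaining search range, which is where the O(log n) worst-case bound comes from.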
A data warehouse was defined by Bill Inmon as "a subject-oriented, integrated, nonvolatile, and time-variant collection of data in support of management's decisions" over 30 years ago. However, the initial data warehouses were unable to store massive heterogeneous data, hence the creation of data lakes. In modern times, the data lakehouse has emerged as a new paradigm. It is an open data management architecture featuring strong data analytics and governance capabilities, high flexibility, and open storage. If I could only use one word to describe the next-gen data lakehouse, it would be unification: Unified data storage to avoid the trouble and risks brought by redundant storage and cross-system ETL. Unified governance of both data and metadata with support for ACID transactions, schema evolution, and snapshots. Unified data application that supports data access via a single interface for multiple engines and workloads. Let's look into the architecture of a data lakehouse. We will find that it is not only supported by table formats such as Apache Iceberg, Apache Hudi, and Delta Lake, but more importantly, it is powered by a high-performance query engine to extract value from data. Users are looking for a query engine that allows quick and smooth access to the most popular data sources. What they don't want is for their data to be locked into a certain database and rendered unavailable to other engines, or to spend extra time and computing costs on data transfer and format conversion. To turn these visions into reality, a data query engine needs to figure out the following questions: How to access more data sources and acquire metadata more easily? How to improve query performance on data coming from various sources? How to enable more flexible resource scheduling and workload management? Apache Doris provides a possible answer to these questions. It is a real-time OLAP database that aspires to build itself into a unified data analysis gateway. This means it needs to be easily connected to various RDBMS, data warehouses, and data lake engines (such as Hive, Iceberg, Hudi, Delta Lake, and Flink Table Store) and allow for quick data ingestion from, and queries on, these heterogeneous data sources. The rest of this article is an in-depth explanation of Apache Doris' techniques in the above three aspects: metadata acquisition, query performance optimization, and resource scheduling. Metadata Acquisition and Data Access Apache Doris 1.2.2 supports a wide variety of data lake formats and data access from various external data sources. In addition, via the Table Value Function, users can analyze files in object storage or HDFS directly. To support multiple data sources, Apache Doris has put significant effort into metadata acquisition and data access. Metadata Acquisition Metadata consists of information about the databases, tables, partitions, indexes, and files from the data source. The metadata of various data sources comes in different formats and patterns, adding to the difficulty of metadata connection. An ideal metadata acquisition service should include the following: A metadata structure that can accommodate heterogeneous metadata. An extensible metadata connection framework that enables quick and low-cost data connection. Reliable and efficient metadata access that supports real-time metadata capture. Custom authorization services to interface with external privilege management systems and thus reduce migration costs. Metadata Structure Older versions of Doris supported a two-tiered metadata structure: database and table.
As a result, users need to create mappings for external databases and tables one by one, which is heavy work. To address this, Apache Doris 1.2.0 introduced the Multi-Catalog functionality. With this, you can map to external data at the catalog level, which means: You can map to the whole external data source and ingest all metadata from it. You can manage the properties of the specified data source at the catalog level, such as connection, privileges, and data ingestion details, and easily handle multiple data sources. Data in Doris falls into two types of catalogs: Internal Catalog: Existing Doris databases and tables all belong to the Internal Catalog. External Catalog: This is used to interface with external data sources. For example, an HMS External Catalog can be connected to a cluster managed by Hive Metastore, and an Iceberg External Catalog can be connected to an Iceberg cluster. You can use the SWITCH statement to switch catalogs. You can also conduct federated queries using fully qualified names. For example: SELECT * FROM hive.db1.tbl1 a JOIN iceberg.db2.tbl2 b ON a.k1 = b.k1; Extensible Metadata Connection Framework The introduction of the catalog level also enables users to add new data sources simply by using the CREATE CATALOG statement: CREATE CATALOG hive PROPERTIES ( 'type'='hms', 'hive.metastore.uris' = 'thrift://172.21.0.1:7004' ); In data lake scenarios, Apache Doris currently supports the following metadata services: Hive Metastore-compatible metadata services, Alibaba Cloud Data Lake Formation, and AWS Glue. This also paves the way for developers who want to connect to more data sources via an External Catalog: all they need to do is implement the access interface. Efficient Metadata Access Access to external data sources is often hindered by network conditions and data resources. This requires extra effort from a data query engine to guarantee reliable, stable, and timely metadata access. Doris achieves high efficiency in metadata access via its Meta Cache, which includes the Schema Cache, Partition Cache, and File Cache. This means that Doris can respond to metadata queries on thousands of tables in milliseconds. In addition, Doris supports manual refresh of metadata at the Catalog/Database/Table level. Meanwhile, it enables auto synchronization of metadata in Hive Metastore by monitoring Hive Metastore events, so any changes can be updated within seconds. Custom Authorization External data sources usually come with their own privilege management services. Many companies use a single tool (such as Apache Ranger) to provide authorization for their multiple data systems. Doris supports a custom authorization plugin, which can be connected to the user's own privilege management system via the Doris Access Controller interface. As a user, you only need to specify the authorization plugin for a newly created catalog, and then you can readily perform authorization, audit, and data encryption on external data in Doris. Data Access Doris supports data access to external storage systems, including HDFS and S3-compatible object storage. Query Performance Optimization After clearing the way for external data access, the next step for a query engine would be to accelerate data queries. In the case of Apache Doris, efforts are made in data reading, the execution engine, and the optimizer. Data Reading Reading data on remote storage systems is often bottlenecked by access latency, concurrency, and I/O bandwidth, so reducing read frequency is often the better choice.
Native File Format Reader Improving data reading efficiency entails optimizing the reading of Parquet and ORC files, which are the most commonly seen data files. Doris has refactored its File Reader, which is fine-tuned for each data format. Take the Native Parquet Reader as an example: Reduce format conversion: It can directly convert files to the Doris storage format or to a format of higher performance using dictionary encoding. Smart indexing of finer granularity: It supports Page Index for Parquet files, so it can utilize Page-level smart indexing to filter Pages. Predicate pushdown and late materialization: It reads the columns with filters first and then reads the other columns of the filtered rows. This significantly reduces file read volume since it avoids reading irrelevant data. Lower read frequency: Building on the high throughput and low concurrency of remote storage, it combines multiple data reads into one in order to improve overall data reading efficiency. File Cache Doris caches files from remote storage on local high-performance disks as a way to reduce overhead and increase performance in data reading. In addition, it has developed two new features that make queries on remote files as quick as those on local files: Block cache: Doris supports the block cache of remote files and can automatically adjust the block size from 4KB to 4MB based on the read request. The block cache method reduces read/write amplification and read latency in cold caches. Consistent hashing for caching: Doris applies consistent hashing to manage cache locations and schedule data scanning. By doing so, it prevents cache failures brought about by the onlining and offlining of nodes. It can also increase the cache hit rate and query service stability. Execution Engine Developers surely don't want to rebuild all the general features for every new data source. Instead, they hope to reuse the vectorized execution engine and all operators in Doris in the data lakehouse scenario. Thus, Doris has refactored the scan nodes: Layer the logic: All data queries in Doris, including those on internal tables, use the same operators, such as Join, Sort, and Agg. The only difference between queries on internal and external data lies in data access. In Doris, anything above the scan nodes follows the same query logic, while below the scan nodes, the implementation classes take care of access to different data sources. Use a general framework for scan operators: Even for the scan nodes, different data sources have a lot in common, such as task splitting logic, scheduling of sub-tasks and I/O, predicate pushdown, and Runtime Filter. Therefore, Doris uses interfaces to handle them. Then, it implements a unified scheduling logic for all sub-tasks. The scheduler is in charge of all scanning tasks in the node. With global information about the node in hand, the scheduler is able to do fine-grained management. Such a general framework makes it easy to connect a new data source to Doris, which takes only about a week of work for one developer. Query Optimizer Doris supports a range of statistical information from various data sources, including Hive Metastore, Iceberg Metafile, and Hudi MetaTable. It has also refined its cost model inference based on the characteristics of different data sources to enhance its query planning capability. Performance We tested Doris and Presto/Trino on HDFS in flat table scenarios (ClickBench) and multi-table scenarios (TPC-H).
Here are the results: As shown, with the same computing resources and on the same dataset, Apache Doris takes much less time to respond to SQL queries in both scenarios, delivering 3 to 10 times higher performance than Presto/Trino. Workload Management and Elastic Computing Querying external data sources requires none of Doris' internal storage. This makes elastic stateless computing nodes possible. Apache Doris 2.0 is going to implement the Elastic Compute Node, which is dedicated to supporting query workloads on external data sources. Stateless computing nodes can scale quickly, so users can easily cope with query workloads during peaks and valleys and strike a balance between performance and cost. In addition, Doris has optimized itself for Kubernetes cluster management and node scheduling. Now Master nodes can automatically manage the onlining and offlining of Elastic Compute Nodes, so users can govern their cluster workloads in cloud-native and hybrid cloud scenarios without difficulty. Use Case Apache Doris has been adopted by a financial institution for risk management. The user's high demands for data timeliness meant that their data mart built on Greenplum and CDH, which could only process data from one day ago, was no longer a great fit. In 2022, they incorporated Apache Doris into their data production and application pipeline, which allowed them to perform federated queries across Elasticsearch, Greenplum, and Hive. A few highlights from the user's feedback include: Doris allows them to create one Hive Catalog that maps to tens of thousands of external Hive tables and run fast queries on them. Doris makes it possible to perform real-time federated queries using the Elasticsearch Catalog and achieve a response time of mere milliseconds. Doris enables the decoupling of daily batch processing and statistical analysis, bringing lower resource consumption and higher system stability.
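Because Apache Doris is compatible with the MySQL protocol, the federated query shown earlier in this article can be issued from any MySQL-compatible client, including plain JDBC in Java. The following is a minimal sketch only: the host, FE query port, credentials, and catalog/table names are placeholder assumptions for illustration.
Java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DorisFederatedQuery {
    public static void main(String[] args) throws Exception {
        // Doris speaks the MySQL protocol, so a standard MySQL JDBC driver works.
        // Host, port, user, and password below are placeholder assumptions.
        String url = "jdbc:mysql://doris-fe-host:9030/";
        try (Connection conn = DriverManager.getConnection(url, "root", "");
             Statement stmt = conn.createStatement()) {

            // Federated join across two external catalogs, as in the article's example.
            String sql = "SELECT * FROM hive.db1.tbl1 a "
                       + "JOIN iceberg.db2.tbl2 b ON a.k1 = b.k1";
            try (ResultSet rs = stmt.executeQuery(sql)) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}
Run against a real cluster, the same statement joins tables from the Hive and Iceberg catalogs registered via CREATE CATALOG, without moving the data beforehand.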
Recently, I posted a tutorial on how I monitored my Raspberry Pi-based Pi-hole server using New Relic and Flex. Like many tutorials, what you read there is the end result: a narrative of the perfect execution of a well-conceived idea where all steps and variables are foreseen beforehand. This, my friends, is not how a day in IT typically goes. The truth is that I had a lot of false starts and a couple of sleepless nights trying to get everything put together just the way I wanted it. Aspects of both the Pi-hole itself and New Relic didn't work the way I initially expected, and I had to find workarounds. In the end, if it weren't for the help of several colleagues - including Zameer Fouzan, Kav Pather, and Haihong Ren - I never would have gotten it all working. Before you call for your fainting couch and smelling salts, shocked as I know you are to hear me imply that New Relic isn't pure perfection and elegant execution, I want to be clear: While no tool is perfect or does all things for all people, the issues I ran into were completely normal, and the ultimate solutions were both simple to understand and easy to execute. What I initially struggled with was trying to make New Relic operate based on my biases of how I thought things ought to work rather than understanding and accepting how they did work. The Problem in a Nutshell But I'm being unnecessarily vague. Let me get to the specifics: On the Pi-hole, you can query the API for data like this: http://pi.hole/admin/api.php?summary And it will give you output like this: JSON { "domains_being_blocked": "177,888", "dns_queries_today": "41,240", "ads_blocked_today": "2,802", "ads_percentage_today": "6.8", "unique_domains": "8,001", "queries_forwarded": "18,912", "queries_cached": "19,266", "clients_ever_seen": "34", "unique_clients": "28", "dns_queries_all_types": "41,240", "reply_UNKNOWN": "258", "reply_NODATA": "1,155", "reply_NXDOMAIN": "11,989", "reply_CNAME": "12,296", "reply_IP": "15,436", "reply_DOMAIN": "48", "reply_RRNAME": "0", "reply_SERVFAIL": "2", "reply_REFUSED": "0", "reply_NOTIMP": "0", "reply_OTHER": "0", "reply_DNSSEC": "0", "reply_NONE": "0", "reply_BLOB": "56", "dns_queries_all_replies": "41,240", "privacy_level": "0", "status": "enabled", "gravity_last_updated": { "file_exists": true, "absolute": 1676309149, "relative": { "days": 4, "hours": 0, "minutes": 27 } } } While it may not be immediately obvious, let me draw your attention to the issue: "domains_being_blocked": "177,888". Being surrounded by quotes, New Relic will treat the number "177,888" as a string (text) rather than as a number. NRQL to the Rescue! My first attempt to fix this leveraged the obvious (and ultimately incomplete) approach of changing the type of input using a function. In this case, numeric() is purpose-built to do just that: take data that is "typed" as a string and treat it as a number. Easy-peasy, right? If you've worked in IT for more than 15 minutes, you know the answer is, "of course not." This technique only worked for numbers less than 1,000. The reason for this is that numeric() can't handle formatted numbers, meaning strings with symbols for currency, percentage, or (to my chagrin) commas. At that point, my colleague and fellow DevRel Advocate Zameer Fouzan came to the rescue. He helped me leverage one of the newer capabilities in NRQL: the ability to parse out sub-elements in a table.
The feature is named aparse(), which stands for "anchor parse." You can find more information about it here, but in brief, it lets you name a field, describe how you want to separate it, and then rename the separated parts. Like this: aparse(unique_domains ,'*,*' ) As (n1,n2) In plain English, this says, "take the data in the unique_domains field, put everything before the comma into one variable (called n1), and everything after the comma into another variable (called n2)." Now I have the two halves of my number, and I can recombine them: numeric(concat(n1,n2)) The result looks like this: Which might have been the end of my problems, except that numbers in the millions contain two commas, and this approach only handles one. A More FLEX-able Approach The penultimate step for resolving this issue was to take it back to the source - in this case, the New Relic Flex integration - to see if I couldn't reformat the numbers before sending them into New Relic. Which is absolutely possible. Within the Flex YAML file, there are a lot of possibilities for parsing, re-arranging, and reformatting the data prior to passing it into the data store. One of the most powerful of these is jq. You can find the New Relic documentation on jq here. But for a deeper dive into the utility itself, you should go to the source. I can't describe how jq works any better than the author: "A jq program is a "filter": it takes an input, and produces an output. There are a lot of builtin filters for extracting a particular field of an object, or converting a number to a string, or various other standard tasks." To be a little more specific, jq will take JSON input, and for every key that matches your search parameters, it will output the value of that key and reformat it in the process if you tell it to. Therefore, I could create the most basic search filter like this: jq > .domains_being_blocked ...and it would output "177,888". But that wouldn't solve my issue. HOWEVER, using additional filters, you can split the output by its comma, join the two parts back together, set the output as a number, and come out the other side with a beautiful (and correct) output set. But I don't want you to think this solution occurred to me all on my own or that I was able to slap it all together with minimal effort. This was all as new to me as it may be to you, and what you're reading below comes from the amazing and generous minds of Kav Pather (who basically invented the Flex integration) and Senior Solutions Architect Haihong Ren. Unwinding the fullness of the jq string below is far beyond the scope of this blog. But Kav and Haihong have helped me to understand enough to summarize it as: Pull out the "status:" key and the entire "gravity_last_updated" section and keep them as-is. For everything else, split the value on the comma, put the component parts back together (without the comma), and output it as a number rather than a string. Finally, output everything (status, gravity_last_updated, and all the other values) as a single data block, which Flex will pack up and send to New Relic.
The full YAML file looks like this: YAML integrations: - name: nri-flex config: name: pihole_test apis: - name: pihole_test url: http://pi.hole/admin/api.php?summary&auth=a11049ddbf38fc1b678f4c4b17b87999a35a1d56617a9e2dcc36f1cc176ab7ce jq: > .[]|with_entries( select(.key | test("^gravity_last_updated|status|api"))) as $xx | with_entries( select(.key | test("^gravity_last_updated|status|api")|not)) |to_entries|map({(.key):(.value|split(",")|join("")|tonumber)})|add+$xx headers: accept: application/json remove_keys: - timestamp Which results in data that looks like this: Plain Text "domains_being_blocked": 182113 "dns_queries_today": 41258 "ads_blocked_today": 3152 (and so on) This is a wonderfully effective solution to the entire issue! A Summary, With a Plot Twist To sum everything up: We learned how to convert strings to numbers in NRQL using numeric(). We learned how to split strings in NRQL based on a delimiter (or even text) using aparse(). We learned how to put the split parts back together in NRQL using concat(). We learned how to use jq in Flex to perform fairly complex string and data manipulations. But most importantly, we learned that asking colleagues for help isn't a sign of weakness. It's a sign of maturity and that, as author and speaker Ken Blanchard said, "None of us is as smart as all of us." But... after all of my searching, all of my queries to coworkers, all of my testing and troubleshooting, and after all the associated tears, rage, and frustration - I discovered I didn't need any of it. Remember what I said at the start of this blog? What I initially struggled with was trying to make New Relic operate based on my biases of how I thought things ought to work, rather than understanding and accepting how they did work. In my race to force New Relic to do all the heavy lifting, in my egocentric need to force New Relic to accept any type of data I threw at it, I ignored the simplest and most elegant solution of all: Pi-hole has the option to output raw, unformatted data. In place of the standard URL: http://pi.hole/admin/api.php?summary If, instead, I had used: http://pi.hole/admin/api.php?summaryRaw ...all of the numbers would come out in a way that New Relic can take and use without any additional manipulation. Plain Text { "domains_being_blocked": 182113, "dns_queries_today": 42825, "ads_blocked_today": 1846, "ads_percentage_today": 4.310566, "unique_domains": 9228, "queries_forwarded": 25224, "queries_cached": 15547, "clients_ever_seen": 36, (and so on) This simply goes to show that the solutions to our problems are out there as long as we have the patience and perseverance to find them.
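For readers more comfortable with code than with jq one-liners, here is a small, hypothetical Java sketch that performs the same transformation outside of Flex: it strips the thousands separators from the Pi-hole summary values and parses them as numbers, leaving non-numeric fields like "status" untouched. The field names mirror the API output shown above; the class and method names are invented for illustration.
Java
import java.util.LinkedHashMap;
import java.util.Map;

public class PiholeNumberCleaner {

    // Converts a comma-formatted value like "177,888" into a number.
    // Non-numeric values (e.g., "enabled") are returned unchanged.
    static Object clean(String value) {
        String stripped = value.replace(",", "");
        try {
            return stripped.contains(".") ? Double.parseDouble(stripped)
                                          : Long.parseLong(stripped);
        } catch (NumberFormatException e) {
            return value;  // keep fields such as "status": "enabled" as-is
        }
    }

    public static void main(String[] args) {
        // A few fields from the Pi-hole summary endpoint, as strings.
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("domains_being_blocked", "177,888");
        raw.put("dns_queries_today", "41,240");
        raw.put("ads_percentage_today", "6.8");
        raw.put("status", "enabled");

        Map<String, Object> cleaned = new LinkedHashMap<>();
        raw.forEach((key, value) -> cleaned.put(key, clean(value)));

        System.out.println(cleaned);
        // {domains_being_blocked=177888, dns_queries_today=41240, ads_percentage_today=6.8, status=enabled}
    }
}
This mirrors the split/join/tonumber steps of the jq filter, just expressed imperatively.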
Data Science is a rapidly developing discipline that has the power to completely change how organizations conduct business and solve problems. In order to apply the most efficient techniques and tools available, it is crucial for data scientists to stay current with the most recent trends and technologies. In this article, you will discover ways to keep up with the most recent data science trends and technologies. You will learn about the latest industry trends and make sure that you are keeping pace with the advancements in the field. By the end of this article, you will have the knowledge and resources to stay current in the world of data science. Latest Trends in Data Science Data Science is rapidly advancing, and the latest trends continue to bring together the worlds of data and technology. Artificial intelligence, machine learning, and deep learning are just some of the new-generation tools taking the industry by storm. With the ability to quickly gain insights from massive amounts of data, these innovative techniques are changing the game, giving organizations valuable new ways to manage their data and get ahead of the competition. 1. Automated Machine Learning Automated machine learning (AutoML) is an emerging field of data science that uses algorithms to automate the process of building and optimizing data models. It uses a combination of feature engineering, model selection, hyperparameter optimization, and model ensembling to obtain the best possible performance from a machine-learning model. Automated machine learning offers the potential for data scientists to streamline their workflows, reduce time to market, and increase model performance. 2. Blockchain Technology Blockchain technology has become a hot topic in data science circles lately. This technology allows data to be stored securely in a distributed and immutable ledger. It can support complex multi-party transactions and also potentially add a layer of data security by ensuring that data can only be accessed by authorized users. Blockchain technology and its applications are still in the early stages, but the approach holds promise for data science and could become an important tool for securing large datasets. 3. Immersive Experiences Immersive data science experiences, such as augmented reality and virtual reality, offer a new way for data scientists to interact with their data. By allowing users to navigate datasets in a 3D environment, immersive experiences can open up new ways of understanding complex data and uncovering insights. These experiences can also be used to create interactive data visualizations that better convey the importance of data science. 4. Robotic Process Automation Robotic process automation (RPA) is a form of automation that uses software robots to automate mundane, repetitive tasks. In the data science field, this technology can be used to automate data collection, cleansing, and preparation tasks, helping data scientists save time and focus on more advanced analytics. RPA can also be used to improve the accuracy of data collection, as it reduces the potential for human error. 5. AI-Powered Virtual Assistants AI-powered virtual assistants are becoming increasingly popular in data science circles. These virtual assistants use natural language processing and machine learning algorithms to understand complex conversations and respond in appropriate ways.
They can be used to automate data analysis tasks and help data scientists spend less time on mundane tasks and more time on higher-value activities. 6. Natural Language Processing Natural language processing (NLP) is a sub-domain of artificial intelligence that focuses on enabling computers to understand and generate human speech. This technology is becoming increasingly important in data science as it allows machines to understand natural language queries better, making it easier for data scientists to ask complex questions. NLP can also be used to automatically generate questions based on the data and provide more detailed insight. 7. Graph Analytics Graph analytics is a branch of data science that uses graph theory to analyze interconnected data sets. It can be used to uncover relationships and patterns existing in the data sets, as well as to analyze the structure of networks and make data-driven decisions. Graph analytics is often used with other analytical techniques like machine learning and predictive analytics to get a more comprehensive view of the data sets. 8. Artificial Intelligence Artificial Intelligence (AI) is a broad field of data science that focuses on machines that are programmed to behave in a similar manner to humans. AI and data science systems can perform tasks more efficiently than humans, making them invaluable in various industries and applications. AI is used in tasks such as facial recognition, speech recognition, and driverless cars, as well as in data analysis, natural language processing, and robotics. 9. Image Processing Image processing is a branch of data science that deals with the analysis of digital images and videos. It is used for various purposes, such as detecting objects and recognizing faces and activities. Image processing techniques can be used to analyze large sets of image data and to extract information from digital images. 10. Text Mining Text mining is a field of data science that focuses on the analysis of text data. It is used to uncover patterns in text data and gain insights from it. Text mining techniques can be applied to large volumes of text data, such as from social media, web pages, and news articles. 11. Internet of Things The Internet of Things (IoT) is a term that describes a network of internet-connected devices that can share data with each other. These devices can collect data from different sources and can be integrated into data science projects. IoT technology can be used to monitor large datasets in real time, allowing data scientists to uncover insights quickly. 12. Big Data Big data describes datasets that are too complex and large to be managed by traditional databases. Big data sets pose a challenge to data scientists, as they require new tools and techniques to be able to process and analyze them. Fortunately, new technologies such as Apache Hadoop and Spark have made it easier to manage and analyze large datasets, making the process more efficient and increasing the potential for data scientists to uncover valuable insights. 13. Data Visualization Data visualization is one of the most important tools used in data science, both to explore and analyze data, as well as to communicate results to others. Data visualization tools are becoming increasingly powerful and user-friendly, allowing users to quickly and easily visualize and communicate even the most complex datasets. 
Popular visualization tools such as Tableau, Qlik, and Power BI are allowing data scientists to quickly and easily create interactive visualizations that can be easily shared and understood by stakeholders and colleagues. Additionally, tools such as Matplotlib, Seaborn, and Bokeh allow data scientists to create and customize more sophisticated visualizations for their own analysis and exploration. 14. Cloud Computing Cloud computing is an increasingly popular tool for data science. It provides data scientists with quick and easy access to the computing power and storage capacity needed to run large data analysis projects and the ability to share results with colleagues and stakeholders quickly and easily. By leveraging cloud computing, data scientists are able to access large datasets and distributions of computing resources that would otherwise be unavailable or too costly to access. 15. Predictive Analytics Predictive analytics is a data science method that uses data-driven algorithms to make predictions about the future. This technology can be used to anticipate customer behavior, detect patterns, and identify trends. By using predictive analytics, data scientists can gain valuable insights into the future, allowing them to make better-informed decisions. 16. Augmented Analytics Augmented analytics is a field of data science that uses machine learning and natural language processing to automate and enhance the data analysis process. Augmented analytics uses advanced analytics techniques, such as natural language queries and automated machine learning, to make data analysis easier and more efficient. It also enables data scientists to gain deeper insights into datasets and make better-informed decisions. These technologies in data science are essential for uncovering previously unknown insights from data and helping drive better decision-making. In addition, integrating these technologies into data analysis can result in more accurate insights and analysis and improve the accuracy of predictions and forecasts. As AI, image processing, text mining, and other technologies continue to expand, the potential applications for data science and analytics become even more exciting. Bottom Line The key to staying ahead in the data science field is to stay informed and knowledgeable of the latest trends and technologies in the field in order to ensure success. From reading industry publications and attending conferences to taking data science courses and using online resources, there are many ways to stay connected to the data science community and the most innovative technologies and trends in the field. With these tools, you can stay ahead of the pack and be the first to know about new opportunities and advancements to hone your skills and maximize your career potential.
Oren Eini, Wizard, Hibernating Rhinos (@ayende)
Kai Wähner, Technology Evangelist, Confluent
Gilad David Maayan, CEO, Agile SEO
Grant Fritchey, Product Advocate, Red Gate Software