Access Control is a process we use to manage access to data and resources within an organization. We can implement access control in many different ways, such as using passwords, PINs, and biometric authentication, often combining methods via multi-factor authentication (MFA).
We refer to a wide range of data analysis techniques and tools when we use the term Advanced Analytics. These methods go beyond traditional business intelligence (BI) to uncover deeper insights, make predictions, or generate recommendations. As this umbrella term covers a great variety of advanced analytic techniques, here is a non-exhaustive list of examples: extracting valuable information from data and text, using algorithms to learn and make predictions, identifying patterns in data, predicting future outcomes, displaying data in visual forms, creating models to simulate real-world scenarios, and using neural networks to solve complex problems.
Analytics allows us to make better decisions by taking raw data and uncovering its meaningful patterns. Essentially, analytics applies statistics, computer programming, and operations research to quantify and gain insight into the meanings of data. While every sector can benefit from analytics, data-rich sectors find it especially useful.
An API (application programming interface) is a set of rules and definitions that specifies how different software applications can interact with each other. An API can be thought of as a waiter at a restaurant. Just as you, the customer, would use a menu to place an order with the waiter, a developer would use an API to request specific information or actions from a software program or web-based service. The waiter would then go to the kitchen, gather the requested dishes, and return them to you at the table. In the same way, the API retrieves the requested data or performs the requested actions and returns the result to the developer.
When we talk about Artificial Intelligence (AI), we refer to a broad field of computer science that gives machines the ability to sense, reason, engage and learn. AI allows machines to do tasks that would typically require human intelligence, including speech recognition, problem-solving, decision-making, planning, and visual perception.
Batch Ingestion allows us to collect and transfer data in chunks or groups. This process usually handles a large volume of data at once and is either triggered by an event or scheduled at regular intervals. We usually opt for this solution when we want to collect specific data points periodically or when we don’t need the data immediately available for real-time analysis.
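To illustrate the idea of moving data in groups, here is a minimal Python sketch. The `batches` helper, the sensor readings, and the batch size of 3 are all hypothetical, and the `loaded` list merely stands in for a real target system.

```python
from itertools import islice

def batches(records, batch_size):
    """Yield successive lists of at most batch_size records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical source data: seven sensor readings collected since the last run.
readings = [{"sensor": "s1", "value": v} for v in range(7)]

loaded = []  # stands in for the target system
for batch in batches(readings, batch_size=3):
    loaded.append(batch)  # each batch is transferred in one step

batch_sizes = [len(b) for b in loaded]
print(batch_sizes)
```

In practice, a scheduler or an event trigger would start a run like this one; the sketch only shows the grouping step.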
When we talk about Big Data, we refer to the massive amounts of structured and unstructured data collected from different sources. Given its size and complexity, such data is almost impossible to process using traditional methods. Big Data comprises three characteristics, known as the three Vs:
- Volume: Data doesn’t just come from one source. We might derive it from videos, images, transactions, audio, social media, smart devices (IoT), and more. Previously, storing such data was resource-intensive; however, data lakes, Hadoop, and the cloud have made storage much more accessible.
- Velocity: Gathering real-time data for immediate implementation is a vital challenge to sectors such as traffic control and the global supply chain. Big Data analysis answers this need through innovative processing technologies.
- Variety: Organizations collect data from all types of sources, resulting in various data formats – from structured, numeric data to unstructured text documents, emails, videos, and audio.
BigQuery is a fully managed, cloud-based data warehouse that helps us manage and analyze data via machine learning, geospatial analysis, and business intelligence. Because it allows us to query terabytes in seconds and petabytes in minutes, this solution is often our go-to option to process and analyze massive datasets quickly and easily.
Blockchain is a digital, decentralized ledger that records transactions on multiple computers rather than storing them in a central location. Each block in the chain contains a cryptographic hash of the previous block, as well as a timestamp and transaction data. Consequently, no block can be altered without also changing all subsequent blocks and, more importantly, without the consensus of the network.
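The hash link between blocks can be sketched in a few lines of Python. This is a simplified illustration only: it leaves out timestamps, the network, and consensus, and the function names and transaction strings are hypothetical.

```python
import hashlib
import json

def block_hash(block):
    """Hash a block's contents deterministically with SHA-256."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def build_chain(transactions):
    """Create a list of blocks, each storing the hash of its predecessor."""
    chain = []
    previous = "0" * 64  # placeholder hash for the first block
    for index, data in enumerate(transactions):
        block = {"index": index, "data": data, "previous_hash": previous}
        chain.append(block)
        previous = block_hash(block)
    return chain

def is_valid(chain):
    """Check that every block still stores the true hash of the block before it."""
    return all(
        chain[i]["previous_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )

chain = build_chain(["A pays B 5", "B pays C 2", "C pays A 1"])
valid_before = is_valid(chain)

chain[0]["data"] = "A pays B 500"  # tamper with an early block
valid_after = is_valid(chain)
print(valid_before, valid_after)
```

Once the first block is edited, its hash no longer matches the value stored in the block that follows it, so the tampering is detectable unless every later block is rewritten as well.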
Business Intelligence (BI) is all about using data to make smarter business decisions. It involves collecting, storing, and analyzing data from various sources to help us better understand businesses, customers, and operations. Thanks to BI tools and techniques, we can turn all that data into valuable insights, identify trends, spot opportunities, and make better informed decisions.
A CI/CD (Continuous Integration / Continuous Delivery) pipeline is a series of automated processes that helps us build, test, and deliver software more quickly and reliably. It typically includes steps such as pulling code changes from version control, building and testing the code, and then deploying it to production. By automating these processes, a CI/CD pipeline helps us get new features and updates out the door faster and with fewer mistakes.
When we talk about Cloud Computing, we refer to the use of off-site systems hosted on the cloud instead of on one’s computer or other local storage. Cloud computing can include various services, such as email servers, software programs, data storage, and even additional processing power.
Clustering is a technique in data analysis that involves grouping similar data points into clusters or segments based on their similarities to identify meaningful patterns or relationships within a dataset that may not be immediately apparent.
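As one possible illustration, the sketch below groups one-dimensional values with a basic k-means procedure, which is only one of many clustering techniques. The function name, the order values, and the starting centers are assumptions chosen for the example.

```python
def k_means_1d(values, centers, iterations=10):
    """Assign each number to its nearest center, then move each center to its group's mean."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Keep the old center if a cluster ends up empty.
        centers = [
            sum(cluster) / len(cluster) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return clusters, centers

# Hypothetical order values that fall into a "small" and a "large" group.
order_values = [10, 12, 11, 95, 100, 105]
clusters, centers = k_means_1d(order_values, centers=[0.0, 50.0])
print(clusters, centers)
```

No labels are supplied here; the two groups emerge purely from how close the values are to each other.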
Code Review describes the early-stage systematic examination of software source code by fellow programmers to find and remove bugs and address vulnerabilities. It’s a powerful way to improve the overall quality of our code, and it should be an ongoing practice within any software development team.
We use a Control Tower to centralize the management of the organization’s resources. It provides us with immediate insight into the location of our resources and helps to streamline operations using dashboards and other visual tools. We find it especially useful in supply chain management because it helps us keep track of critical issues and allows us to resolve them quickly.
CSV (Comma Separated Values)
CSV (.csv) is a file format that we use when we want to store data in a tabular form, with each row on a new line and each value within a row separated by a comma. It’s a plain text file that’s simple and easy to use, and it can be opened and edited with many different software programs, such as Microsoft Excel, Google Sheets, or even text editors. Python’s standard library includes a csv module for reading and writing CSV files.
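As a small sketch of working with CSV data in Python, the example below uses the standard library's `csv` module. The sample rows are made up, and `io.StringIO` stands in for a real file that would normally be opened with `open(...)`.

```python
import csv
import io

# Hypothetical CSV content; io.StringIO stands in for a file on disk.
raw = "name,city,orders\nAda,London,3\nGrace,New York,5\n"

# Read each row as a dictionary keyed by the header line.
rows = list(csv.DictReader(io.StringIO(raw)))
total_orders = sum(int(row["orders"]) for row in rows)

# Write the rows back out, adding one more record.
rows.append({"name": "Alan", "city": "Manchester", "orders": "2"})
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=["name", "city", "orders"])
writer.writeheader()
writer.writerows(rows)

print(total_orders)
print(output.getvalue())
```

Note that every value is read as text, so numeric columns such as `orders` need an explicit conversion.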
Customer Data Platform (CDP)
A Customer Data Platform, or CDP, gathers all the data from multiple sources and shapes a set of unique customer profiles. Customer data might come from numerous sources like CRMs, websites, social media, and ERPs. By merging data related to the same customer, the CDP allows a 360-degree understanding of each customer. Moreover, the customer data is stored, enabling us to follow the customer journey over time.
A Dashboard is a user interface that uses visuals to represent data in an organized and easy-to-read way. Well-designed dashboards allow us to keep track of KPIs and data trends for better decision-making. Dashboards might be used to monitor business performance, customer service metrics, financial performance, or any other data we want to keep an eye on.
Data Accuracy refers to whether the stored data values are correct, ensuring that we can use records as a reliable source of information. Data accuracy is essential for informed decision-making and data-driven insights. Poor data accuracy can lead to incorrect conclusions, which in turn can cause costly errors.
Data Architecture describes the design and organization of data assets we use in an organization. A data architecture typically includes several key components, such as:
- data sources,
- data models representing and defining the relationships between different data elements,
- storage infrastructure (for example, databases, data warehouses, and data lakes), and
- data governance frameworks.
Data as a Service (DaaS)
When we talk about Data as a Service (DaaS), we refer to a data management strategy that leverages the cloud to make data available on demand across all departments. DaaS enables data storage, integration, processing, and analytics to take place over the network and, as a result, benefits us with speed, scalability, and flexibility.
A Data Breach happens when unauthorized individuals or groups access, steal or use sensitive information. This can occur in multiple ways, including hacking into computer systems, stealing physical data storage devices, or unintentionally exposing sensitive information.
We use a Data Catalog as an organization’s data assets inventory to help technical and non-technical users locate relevant information. Data Catalogs can help facilitate better decision-making by providing a single source of truth for organizational data. Not only do Data Catalogs provide an organized inventory of all the data assets, but they can also offer insights into how data is being used and which pieces of it are most valuable to different teams and stakeholders.
Data Cleaning is a process that aims to fix or remove incorrect and/or incomplete data from a dataset to maintain high levels of data quality. Without the cleaning processes, data collection and combination often result in duplicated, mislabeled, or even corrupted records.
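A minimal cleaning step might look like the following Python sketch. The `clean` function, the field names, and the sample records are hypothetical; real cleaning rules depend on the dataset.

```python
def clean(records, required_fields):
    """Trim stray whitespace, then drop incomplete records and exact duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        normalized = {
            key: value.strip() if isinstance(value, str) else value
            for key, value in record.items()
        }
        if any(normalized.get(field) in (None, "") for field in required_fields):
            continue  # incomplete record
        fingerprint = tuple(sorted(normalized.items()))
        if fingerprint in seen:
            continue  # duplicate record
        seen.add(fingerprint)
        cleaned.append(normalized)
    return cleaned

raw_records = [
    {"id": 1, "email": "ada@example.com "},
    {"id": 1, "email": "ada@example.com"},   # duplicate once whitespace is trimmed
    {"id": 2, "email": ""},                  # missing email
    {"id": 3, "email": "grace@example.com"},
]
cleaned = clean(raw_records, required_fields=["id", "email"])
print(cleaned)
```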
When we talk about Data Collection, we refer to the process of gathering and measuring data in an organized manner, enabling us to elicit insights into phenomena and forecast future trends. Regardless of the field, accurate data collection is essential for reliable results. For example, in medicine, data collection can help us uncover correlations between diseases and their causes. In marketing, data collection can provide valuable insights into consumer behavior.
A Data Consumer is a person or a system that uses or processes data in some way, including analyzing data to make decisions, displaying data to users, or using data to train a machine learning model.
When we talk about Data Democratization, we refer to a powerful concept that aims to make digital information accessible to everyone – including individuals without a technical background. Data Democratization is about giving people access to data, insights, and tools without relying on a third party like system administrators, data stewards, or IT departments.
Data Encryption, one of many data masking techniques, protects sensitive information by scrambling it into ciphertext that only authorized people can decipher using a decryption key or password.
Data Frame describes a structure in which we can store and organize data in rows and columns, similar to a spreadsheet. A data frame is handy for analyzing and manipulating data because it allows you to work with many different variables simultaneously while keeping everything organized and easy to understand.
Data Governance is an essential part of managing data quality, security, and privacy. It describes the system we use to manage information, as well as the actions taken with that information, including the availability, usability, integrity, and security of data in an organization.
Data Ingestion is the first step in collecting data from various sources. At this stage, we can understand the size and complexity of the data, influencing how we use the data later on in terms of access, use, and analysis. The goal of data ingestion is to collect data into any system requiring a particular structure or format for downstream data use.
Data Integration is the process of consolidating data from multiple sources into a single location to achieve a unified view of data for improved decision-making. Based on their data pipelines needs, companies usually use one of two data integration approaches: ETL (extract, transform, load) or ELT (extract, load, transform).
A Data Lake is a storage repository designed to store both structured and unstructured data in any form. This means we can keep the data in its raw, unprocessed form. How does a data lake benefit us? Simply put, storing data from any source, at any size, speed, and structure, makes more robust and diverse queries, data science use cases, and new information discoveries possible. Compared with storing data in individual databases, a data lake is often cheaper, more flexible, and more scalable.
Data Lakehouse is a storage architecture that combines the cost-effectiveness of a data lake with a data warehouse’s analytic and structural benefits. A data lakehouse enables us to use the same large data sets for different types of machine learning and business intelligence workloads.
Data Lineage visually describes data flow over time, detailing its origin, iterations, and destination. The process allows us to track day-to-day use and error resolution operations while maintaining the accuracy and reliability of the data and compliance with regulatory requirements.
We use Data Mapping to match data from one data model to another by drawing connections and relationships between them. By doing so, we ensure our data is accurate and standardized across the organization.
Unlike a central data warehouse, a Data Mart is a smaller data store focused on the specific data used by a particular business unit within an organization, such as the Finance or Marketing department. Since data marts store only the data specific to a single group, they require less storage space and are therefore often faster and more easily accessible.
Data Masking is the process of disguising sensitive data so that we can keep working with realistic data without exposing private information. There are many common data masking techniques, including:
- Nulling: Returning data values as blank or replacing them with placeholders.
- Anagramming: Shuffling the order of characters or digits in each entry.
- Substitution: Replacing each value with a randomly selected value.
- Encryption: Translating sensitive exported data into a cipher that requires a password or key.
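Three of the techniques listed above can be sketched in Python as follows. The function names and sample values are hypothetical, and encryption is left out because it would need a dedicated cryptography library.

```python
import random

def null_out(value):
    """Nulling: replace the value with a placeholder of the same length."""
    return "*" * len(value)

def anagram(value, rng):
    """Anagramming: shuffle the order of the characters."""
    chars = list(value)
    rng.shuffle(chars)
    return "".join(chars)

def substitute(replacements, rng):
    """Substitution: return a randomly chosen stand-in instead of the real value."""
    return rng.choice(replacements)

rng = random.Random(42)  # seeded only so the sketch is repeatable
card = "4111111111111234"  # made-up card number

masked_null = null_out(card)
masked_anagram = anagram(card, rng)
masked_name = substitute(["Customer A", "Customer B"], rng)
print(masked_null, masked_anagram, masked_name)
```

Each result keeps the general shape of the original value, which is what lets masked data remain usable for testing or analysis.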
Data Mesh refers to a decentralized data architecture and operating model focused on bringing data closer to the teams where it is generated and treating data as a product.
There are four core principles of data mesh:
- Domain-oriented data ownership and architecture: The data mesh creates a communication structure between data owners, preventing data siloing.
- Data as a product: The change in perspective impacts the way we collect, serve, and manage data and, thus, boosts data quality and user satisfaction.
- Self-service data platform: The domain should have the appropriate infrastructure to support data democratization and empower domain teams.
- Federated computational governance: The data mesh is an ecosystem that enables interoperability across different data sources.
Data Migration is all about moving data from one system to another, typically between applications, formats, existing databases, or storage systems. In general, Data Migration involves preparing, extracting, and transforming data from the source systems, loading it into the target system, and then testing for quality and validating the outcomes.
When we talk about Data Mining, we refer to extracting information from massive data sets to identify patterns and trends that we can act on. It typically involves business understanding, data understanding, data preparation, modeling, evaluation, and deployment. Through data mining, we can access the valuable information necessary to mitigate risk effectively, anticipate demand, monitor operational performance, and acquire new customers.
A Data Model is a framework for organizing and representing data, helping us to understand the relationships between different pieces of data and how they relate to real-world concepts. We can think of it as a blueprint for our data – just as a blueprint helps us to construct a building, a data model helps us to construct a database or other data storage systems. There are many different types of data models, but some of the most common include relational models, hierarchical models, and object-oriented models.
When we talk about Data Modeling, we refer to creating a conceptual representation of data and its relationships to other data. Essentially, it’s a way to map out how different pieces of data are connected and how they relate to each other. This is typically done using diagrams or other visual representations that allow us to easily understand the relationships between different data entities.
Data Monitoring refers to oversight mechanisms helping us to ensure that we use accurate, valid, and consistent data while maintaining security. It typically involves a manual or automated reporting process, with the ability to notify administrators of important events.
Data Orchestration, a relatively new discipline in computer engineering, aims to match the right data with the right purpose. It does so by automating processes related to managing data, including collecting, combining, and preparing data for analysis.
With a Data Pipeline, we can move data from one place to another, typically through some kind of data transformation, including filtering, masking, and aggregations, to ensure both data integration and standardization.
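A pipeline of this kind can be sketched as a chain of small Python functions, one per transformation. The order records and function names below are hypothetical, and a production pipeline would typically read from and write to real systems.

```python
from collections import defaultdict

def keep_completed(orders):
    """Filtering: pass through only completed orders."""
    return (o for o in orders if o["status"] == "completed")

def mask_email(orders):
    """Masking: hide the local part of each email address."""
    for o in orders:
        domain = o["email"].split("@", 1)[1]
        yield {**o, "email": "***@" + domain}

def total_by_country(orders):
    """Aggregation: sum order amounts per country."""
    totals = defaultdict(float)
    for o in orders:
        totals[o["country"]] += o["amount"]
    return dict(totals)

orders = [
    {"email": "ada@example.com", "country": "UK", "amount": 20.0, "status": "completed"},
    {"email": "bob@example.com", "country": "UK", "amount": 5.0, "status": "cancelled"},
    {"email": "eve@example.org", "country": "US", "amount": 12.5, "status": "completed"},
    {"email": "dan@example.org", "country": "UK", "amount": 7.5, "status": "completed"},
]

# Each stage consumes the output of the previous one.
masked = list(mask_email(keep_completed(orders)))
totals = total_by_country(masked)
print(totals)
```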
We use a Data Platform as a central repository that combines and utilizes the features and capabilities of several big data applications. It supports us in the acquisition, storage, preparation, delivery, and the governance of our data while also ensuring high levels of security.
Data Platform as a Service (DPaaS)
Data Platform as a Service (DPaaS) enables companies to collect, manage, monitor, analyze and present data via a centralized platform. When using a DPaaS, we ensure strict governance, privacy, and security features for data protection and integrity.
A Data Producer is the root source of data: any entity that collects, stores, and provides data as a result of its activities. It might be a device, service, software, or organization.
Data Quality tells us about the condition of a particular set of data, including its completeness, accuracy, consistency, timeliness, validity, and uniqueness. Data quality activities can involve data integration, cleaning, rationalization, and validation.
Data Security involves procedures and specific controls to protect data from accidental data exposure, unauthorized access, or data loss. We apply various techniques to mitigate security threats, including data encryption, data erasure, data resilience, or data masking.
When we talk about Data Silos, we refer to individual data repositories that are held by one group and isolated from the rest of the organization, and that therefore remain inaccessible to others. Data silos can exist within a single department, branch, or company or even be shared between multiple organizations.
When we talk about a Data Source, we refer to the original location from which a piece or a set of data comes. It can be a database, a data warehouse, a flat file, an XML file, or any other readable format.
Data Stewardship is a set of people, processes, and tools that ensure the accuracy, reliability, security, and proper management of data. Data Stewardship involves:
- Defining roles and responsibilities for data management,
- Establishing a set of procedures and policies,
- Implementing appropriate security controls,
- Monitoring data usage.
Data Storytelling is all about creating a compelling narrative from data analysis results, aiming to translate complex ideas into actionable insights tailored to a specific audience. Among the most critical aspects of data storytelling are narrative and visualizations.
We perform Data Validation to check and confirm that data is accurate, complete, and meets certain requirements or standards. Data Validation includes a range of activities and techniques, such as:
- Data profiling: Analyze the data to understand its structure, format, and quality.
- Data cleaning: Identify and correct errors or inconsistencies in the data.
- Data verification: Check that the data is correct and complete.
- Data testing: Test the data against specific requirements or standards.
- Data documentation: Document the validation process and results to ensure transparency and repeatability.
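The testing step in the list above, checking data against specific requirements, can be sketched in Python like this. The three rules and the field names are hypothetical examples of such requirements.

```python
from datetime import date

def validate(record):
    """Return a list of rule violations for one record (an empty list means valid)."""
    errors = []
    if not isinstance(record.get("customer_id"), int):
        errors.append("customer_id must be an integer")
    if "@" not in str(record.get("email", "")):
        errors.append("email must contain '@'")
    try:
        date.fromisoformat(str(record.get("signup_date")))
    except ValueError:
        errors.append("signup_date must be a valid ISO date (YYYY-MM-DD)")
    return errors

records = [
    {"customer_id": 1, "email": "ada@example.com", "signup_date": "2023-01-15"},
    {"customer_id": "2", "email": "grace.example.com", "signup_date": "2023-02-30"},
]

# Keeping the results per record also serves as simple documentation of the run.
report = {index: validate(record) for index, record in enumerate(records)}
print(report)
```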
We use Data Visualization to represent information graphically, highlight patterns and trends and transform complex data into a more accessible and understandable form, such as a chart or graph.
We use a Data Warehouse as a central repository for storing data from multiple sources within an organization, allowing us to make better decisions based on a single source of truth. Unlike data lakes that contain a vast amount of raw data, data warehouses store structured, processed data ready for strategic data analysis.
Database Management System
A Database Management System (DBMS) is a software system that serves as an interface for interacting with a database. It allows end-users to organize and access the information in their databases as needed, while helping to ensure data security and integrity.
When we talk about ELT (Extract, Load, Transform), we refer to a three-step process of moving data from one place to another and preparing it for analysis. Here’s how it works:
- Extract: we get data out of its current location, whether that’s a database, a file, or somewhere else.
- Load: we move the data to its final destination, which is typically a data warehouse or another type of database optimized for storing and analyzing large amounts of data.
- Transform: we change the data into a format that’s easier to work with. This could involve tasks like sorting, cleaning, and aggregating the data.
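The three steps above can be sketched with Python's built-in `sqlite3` module. The in-memory database stands in for a data warehouse, and the table names and sample rows are hypothetical.

```python
import sqlite3

# Extract: hypothetical rows pulled from a source system, still as raw text.
source_rows = [
    ("2024-01-01", "uk", "20.00"),
    ("2024-01-01", "UK", "5.50"),
    ("2024-01-02", "us", "10.00"),
]

warehouse = sqlite3.connect(":memory:")  # stands in for the data warehouse

# Load: store the data as-is in a raw table.
warehouse.execute("CREATE TABLE raw_sales (sold_on TEXT, country TEXT, amount TEXT)")
warehouse.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", source_rows)

# Transform: clean and aggregate inside the warehouse, using SQL.
warehouse.execute(
    """
    CREATE TABLE sales_by_country AS
    SELECT UPPER(country) AS country_code, SUM(CAST(amount AS REAL)) AS total
    FROM raw_sales
    GROUP BY UPPER(country)
    """
)
result = warehouse.execute(
    "SELECT country_code, total FROM sales_by_country ORDER BY country_code"
).fetchall()
print(result)
```

The point of the sketch is the ordering: the raw rows land in the destination first, and the transformation runs there afterwards.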
First-Party Data describes customer data collected directly through customer interactions, including demographic data, purchase history, and preferences. This data comes from the organization’s own sources, such as its website analytics, social media accounts, mobile apps, or CRM systems. Because the entity collecting the data is also responsible for obtaining consent, the collection and use of first-party data must comply with privacy regulations such as GDPR and CCPA.
A Log File is a file that contains information about the events, processes, messages, and other data arising from an operator’s use of devices, applications, and operating systems. The log file will record any program running, background scripts, and website visits.
Machine Learning (ML) is a type of artificial intelligence that allows computer systems to automatically improve their performance on a specific task by learning from data, without being explicitly programmed. As we retrain an ML model on new data, it can adapt and become more effective and more accurate in its predictions.
Master Data represents the key data entities critical to an organization’s operations and strategy, such as customers, products, suppliers, employees, and assets, and is typically stored in a centralized system or database. Customer master data, for example, might include information about the customer’s name, address, contact information, and purchase history. Master Data is often mistaken for reference data.
Metadata, commonly described as “data about data,” is structured information that describes, locates, or makes it easier to use or manage an information resource. Typical metadata includes file size, image color/resolution, date, authorship, and keywords.
MLOps, or DevOps for machine learning, combines software development (Dev) and operations (Ops) to streamline the process of building, testing, and deploying machine learning models. In other words, it’s all about making sure that we develop machine learning models in a reliable and scalable manner, just like any other software.
Essentially, Model Deployment refers to putting a machine learning model into production to make predictions on new, unseen data. It’s the final step in the machine learning pipeline, where the model is taken from development and integrated into a real-world application so we can use it to deliver valuable insights and make data-driven decisions.
When we talk about Model Retraining, we refer to updating an ML model with new data to improve its performance. We can do it manually or automate the process by applying Continuous Training (CT), a part of the MLOps practices.
Natural Language Processing (NLP)
Natural Language Processing (NLP) falls into the area of artificial intelligence (AI) that enables computers to read, analyze, and respond to human language, just like humans do. In a nutshell, NLP makes it possible for us to communicate with computers in a way that feels natural and human-like. NLP is used in various applications such as voice assistants, language translation, chatbots, sentiment analysis, and many more.
NoSQL, or “not only SQL,” is a class of databases designed to handle large volumes of unstructured or semi-structured data. We mainly classify NoSQL databases into four types: document-based databases, key-value stores, column-oriented databases, and graph-based databases. Unlike traditional relational databases, NoSQL databases allow for more flexible and scalable data storage, enabling organizations to store data in more intuitive ways that are closer to how users access data. Thus, they are widely used in real-time web applications and big data.
Apache Parquet is a free and open-source column-based file format used to store big data. As opposed to row-based formats such as CSV and JSON, Apache Parquet organizes data by column, which allows for better compression and faster analytical queries that read only the columns they need.
Power Query is a Microsoft tool used to perform ETL, a three-step process where data is extracted, transformed, and loaded into its final destination. With Power Query, we can import data from various sources and perform data preparation according to our needs.
Predictive Analytics combines historical and current data with advanced statistics and machine learning techniques to forecast possible future outcomes. Many companies use predictive analytics to identify customer behavior patterns, assess risk, reduce downtime, and personalize treatment plans for patients.
Qualitative Analysis refers to analyzing non-numerical data such as words, images, or observations. This method comes in handy when we strive to gain a deeper understanding of a phenomenon, often by examining the context and meaning of the data. One example of qualitative analysis is a case study of a company’s organizational culture, which involves analyzing the data collected from interviews with employees, internal documents, and observation notes.
In contrast to qualitative analysis, Quantitative Analysis is a method of measuring and interpreting numerical data to produce objective and precise results that we can use to draw conclusions or make predictions. For example, a quantitative survey can collect numerical data on consumer preferences for different types of products or services to identify behavioral trends or patterns.
A Query in SQL is a request for data within the database environment. When we talk about queries in a database, we refer to either a select query or an action query. We use a select query to retrieve data from a database, while an action query helps us perform other operations on data, including adding, removing, or changing data.
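The difference between the two kinds of query can be sketched with Python's built-in `sqlite3` module. The `products` table and its rows are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (name TEXT, price REAL)")

# Action queries: add, change, and remove data.
db.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("pen", 1.5), ("notebook", 4.0), ("stapler", 9.0)],
)
db.execute("UPDATE products SET price = 5.0 WHERE name = 'notebook'")
db.execute("DELETE FROM products WHERE name = 'stapler'")

# Select query: retrieve data without modifying it.
rows = db.execute("SELECT name, price FROM products ORDER BY name").fetchall()
print(rows)
```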
When we talk about Raw Data (sometimes called source data, atomic data, or primary data), we refer to data collected from a source or sources that have not yet been processed, organized, or analyzed for use, thus remaining unactionable. Raw data can come in various forms, such as numbers, text, images, video, or audio.
Real-Time Analytics refers to analyzing and processing data in real-time or near real-time as it is generated or received. So, instead of waiting for a batch of data to build up, real-time analytics processes data as it comes in, allowing for more immediate insights and decision-making.
We generally use Reference Data to classify, categorize, or provide context for other data with the aim of standardizing and harmonizing data across different systems and applications. Different product codes, for instance, can be reference data used by a manufacturing company to introduce a standardized product classification system to simplify comparing sales data across regions. This way, they ensure that data is consistent and accurate across the organization and, thus, improve decision-making and operational efficiency. Reference Data is often mistaken for master data.
Relational Database (RDBMS)
A Relational Database (RDBMS) is a type of database that stores data in tables with columns and rows, similar to a spreadsheet. In an RDBMS, data is stored in tables related to each other based on common attributes or keys. For example, a customer table might be related to an orders table through a customer ID key. This allows data to be organized and accessed in a logical and efficient way.
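The customer and orders example can be sketched with Python's built-in `sqlite3` module. The table layouts and sample rows are assumptions made for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 10.0), (12, 2, 7.5);
    """
)

# The shared customer_id key lets us combine rows from both tables.
totals = db.execute(
    """
    SELECT c.name, SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
    ORDER BY c.name
    """
).fetchall()
print(totals)
```

Customer names are stored once in `customers`, and each order refers to them only through the key.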
Simply put, Second-Party Data is someone else’s first-party data. It is gathered by a trusted partner or other business entity and typically comes from various sources, including websites, apps, and social media.
Self-Service Tools are applications or platforms that empower people to access information and perform tasks independently without needing external support or assistance. By equipping users with the power and autonomy to select, filter, compare, visualize, and analyze data, companies can drastically reduce the need for costly and extensive IT training, freeing up resources and improving efficiency.
Semi-Structured Data is data that is not stored in a tabular format but still contains tags and other markers that make it possible to group and label its elements. Examples of semi-structured data include emails, XML and JSON documents, and web pages.
When we talk about Structured Data, we refer to quantitative, predefined, and formatted data, typically stored in tabular form in relational databases (RDBMSs). Structured Data usually comes in the form of numbers and values and thus is easier to search, analyze, and comprehend.
Structured Query Language (SQL)
Structured Query Language (SQL) is a standardized programming language designed for managing relational databases and carrying out operations on the data, such as storing, searching, removing, and retrieving data, as well as maintaining and optimizing database performance.
In contrast to Unsupervised Learning, Supervised Learning is a way of training an algorithm on labeled examples, or input-output pairs. Simply put, in Supervised Learning, the algorithm is explicitly taught what answers are “right” and uses this information to make predictions on new data.
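As a minimal illustration, the Python sketch below uses a nearest-neighbor rule, one of the simplest supervised methods: a new input receives the label of the most similar labeled example. The measurements and labels are made up.

```python
def nearest_neighbor_predict(training_data, features):
    """Return the label of the labeled example closest to the new input."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, closest_label = min(
        training_data, key=lambda pair: squared_distance(pair[0], features)
    )
    return closest_label

# Hypothetical labeled examples: (height_cm, weight_kg) paired with a "right" answer.
training_data = [
    ((25, 4), "cat"),
    ((30, 5), "cat"),
    ((60, 25), "dog"),
    ((70, 30), "dog"),
]

prediction_small = nearest_neighbor_predict(training_data, (28, 5))
prediction_large = nearest_neighbor_predict(training_data, (65, 28))
print(prediction_small, prediction_large)
```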
Unlike first- and second-party data, Third-Party Data comes from an outside entity that often sells previously collected customer data to companies for advertising purposes. In this case, organizations possess customer data without having direct relationships with their customers.
In the software development process, Unit Testing allows developers to test individual units or components of code independently from other parts of the system. This way, they ensure the piece of software is fit for use.
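A unit test for a single function might look like the following sketch, which uses Python's built-in `unittest` module. The `apply_discount` function is a hypothetical unit under test.

```python
import unittest

def apply_discount(price, percent):
    """Return the price after a percentage discount."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)

class ApplyDiscountTest(unittest.TestCase):
    def test_regular_discount(self):
        self.assertEqual(apply_discount(200.0, 25), 150.0)

    def test_invalid_percent_is_rejected(self):
        with self.assertRaises(ValueError):
            apply_discount(200.0, 120)

# Run the tests programmatically so the sketch works in a script or a notebook.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ApplyDiscountTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test exercises the function in isolation, without touching any other part of a system.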
When we talk about Unstructured Data, we refer to data that has no predefined data model or format and therefore does not fit neatly into the rows and columns of a relational database. Unstructured data can come in various forms, such as text documents, emails, images, video, or audio.
Unlike Supervised Learning, in Unsupervised Learning, the algorithm is not given any labeled examples and must instead find patterns and relationships in the data on its own.
User Acceptance Testing (UAT)
UAT stands for User Acceptance Testing, the final round of testing a software product goes through before its release. The main idea of UAT is to test a solution with the help of real users to see if it works as expected and meets their needs. The users receive a set of test cases to follow. As they perform these tasks, they report any issues they encounter to the development team, which then implements the necessary changes in the system.
A VPN, or Virtual Private Network, creates a secure tunnel between a user’s device and the internet by assigning the user a new anonymous IP address, rerouting the internet connection through a server in its network, and encrypting all data. A VPN masks the user’s identity and online traffic from internet service providers, hackers, and third parties. Overall, VPNs are a helpful tool for anyone who wants to enhance their online privacy and security or who needs to access private networks from remote locations.
XML, or Extensible Markup Language, is a markup language that uses tags and other elements to define the structure and meaning of the data it contains. For example, an XML file that stores information about books might use tags such as <title>, <author>, and <publisher> to define different pieces of data associated with each book.
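The book example can be sketched with Python's built-in `xml.etree.ElementTree` module. The `<library>` and `<book>` wrapper elements and the sample books are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

document = """
<library>
  <book>
    <title>Dune</title>
    <author>Frank Herbert</author>
    <publisher>Chilton Books</publisher>
  </book>
  <book>
    <title>Neuromancer</title>
    <author>William Gibson</author>
    <publisher>Ace</publisher>
  </book>
</library>
""".strip()

root = ET.fromstring(document)

# Turn each <book> element into a dictionary keyed by tag name.
books = [
    {child.tag: child.text for child in book}
    for book in root.findall("book")
]
titles = [book["title"] for book in books]
print(titles)
```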
An XML (Extensible Markup Language) Database is a database management system designed to store and manage data in XML format. We use XML databases to store and manage complex and hierarchical data structures, particularly in applications where data needs to be easily queried, modified, and updated.