Data Glossary
Advanced Analytics
We refer to a wide range of data analysis techniques and tools when we use the term Advanced Analytics. These methods go beyond traditional business intelligence (BI) to uncover deeper insights, make predictions, or generate recommendations. As this umbrella term covers a great variety of advanced analytic techniques, here is a non-exhaustive list of examples: extracting valuable information from data and text, using algorithms to learn and make predictions, identifying patterns in data, predicting future outcomes, displaying data in visual forms, creating models to simulate real-world scenarios, and using neural networks to solve complex problems.
Analytics
Analytics allows us to make better decisions by taking raw data and uncovering its meaningful patterns. Essentially, analytics applies statistics, computer programming, and operations research to quantify data and gain insight into what it means. While every sector can benefit from analytics, data-rich sectors find it especially useful.
API
An API (application programming interface) is a set of rules and definitions that specifies how different software applications can interact with each other. An API can be thought of as a waiter at a restaurant. Just as you, the customer, would use a menu to place an order with the waiter, a developer would use an API to request specific information or actions from a software program or web-based service. The waiter then goes to the kitchen, gathers the requested dishes, and returns them to your table; in the same way, the API retrieves the requested data or performs the requested actions and returns the result to the developer.
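To make the analogy concrete, here is a minimal sketch of calling a web API from Python with the widely used requests library; the URL, parameters, and token below are hypothetical placeholders, not a real service.

```python
import requests

# Hypothetical endpoint: ask the "kitchen" (the service) for a customer's orders.
response = requests.get(
    "https://api.example.com/v1/orders",          # placeholder URL
    params={"customer_id": 42},                   # what we are asking for
    headers={"Authorization": "Bearer <token>"},  # who is asking
    timeout=10,
)
response.raise_for_status()   # fail loudly if the service reports an error
orders = response.json()      # the "dishes" come back as structured data (JSON)
print(orders)
```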
Artificial Intelligence
When we talk about Artificial Intelligence (AI), we refer to a broad field of computer science that gives machines the ability to sense, reason, engage and learn. AI allows machines to do tasks that would typically require human intelligence, including speech recognition, problem-solving, decision-making, planning, and visual perception.
Batch Ingestion
Batch Ingestion allows us to collect and transfer data in chunks or groups. It typically handles data in bulk and is triggered or scheduled at regular intervals. We usually opt for this approach when we want to collect specific data points at regular intervals or when we don’t need the data immediately available for real-time analysis.
Big Data
When we talk about Big Data, we refer to the massive amounts of structured and unstructured data collected from different sources. Given its size and complexity, such data is almost impossible to process using traditional methods. Big Data is characterized by three attributes, known as the three Vs:
- Volume: Data doesn’t just come from one source. We might derive it from videos, images, transactions, audio, social media, smart devices (IoT), and more. Previously, storing such data was resource-intensive; however, data lakes, Hadoop, and the cloud have made storage much more accessible.
- Velocity: Gathering real-time data for immediate use is a critical challenge for sectors such as traffic control and the global supply chain. Big Data analysis answers this need through innovative processing technologies.
- Variety: Organizations collect data from all types of sources, resulting in various data formats – from structured, numeric data to unstructured text documents, emails, videos, and audio.
BigQuery
BigQuery is a fully managed, cloud-based data warehouse that helps us manage and analyze data via machine learning, geospatial analysis, and business intelligence. Because it allows us to query terabytes in seconds and petabytes in minutes, this solution is often our go-to option to process and analyze massive datasets quickly and easily.
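As a rough illustration, a query can be issued from Python with the google-cloud-bigquery client library. This is only a sketch: the project, dataset, and table names are hypothetical, and it assumes authentication is already configured in the environment.

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes credentials are already configured

# Hypothetical dataset and table names.
query = """
    SELECT country, COUNT(*) AS visits
    FROM `my_project.analytics.page_views`
    GROUP BY country
    ORDER BY visits DESC
    LIMIT 10
"""

for row in client.query(query).result():  # runs the job and waits for it to finish
    print(row.country, row.visits)
```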
Blockchain
Blockchain is a digital, decentralized ledger that records transactions on multiple computers rather than storing them in a central location. Each block in the chain contains a cryptographic hash of the previous block, as well as a timestamp and transaction data. Consequently, no block can be altered without changing all subsequent blocks and – more importantly – without the network’s consensus.
Business Intelligence
Business Intelligence (BI) is all about using data to make smarter business decisions. It involves collecting, storing, and analyzing data from various sources to help us better understand businesses, customers, and operations. Thanks to BI tools and techniques, we can turn all that data into valuable insights, identify trends, spot opportunities, and make better-informed decisions.
CI/CD Pipeline
A CI/CD (Continuous Integration / Continuous Delivery) pipeline is a series of automated processes that helps us build, test, and deliver software more quickly and reliably. It typically includes steps such as pulling code changes from version control, building and testing the code, and then deploying it to production. By automating these processes, a CI/CD pipeline helps us get new features and updates out the door faster and with fewer mistakes.
Cloud Computing
When we talk about Cloud Computing, we refer to the use of off-site systems hosted on the cloud instead of on one’s computer or other local storage. Cloud computing can include various services, such as email servers, software programs, data storage, and even additional processing power.
Code Review
Code Review describes the early-stage systematic testing of software source code by fellow programmers to find and remove bugs and address vulnerabilities. It’s a powerful way to improve the overall quality of our code, and it should be an ongoing practice within any software development team.
Control Tower
We use a Control Tower to centralize the management of the organization’s resources. It provides us with immediate insight into the location of our resources and helps to streamline operations using dashboards and other visual tools. We find it especially useful in supply chain management because it helps us keep track of critical issues and allows us to resolve them quickly.
CSV (Comma Separated Values)
CSV (.csv) is a plain-text file format that we use to store data in tabular form, with each row on a new line and each value within a row separated by a comma. It’s simple and easy to use, and it can be opened and edited with many different programs, such as Microsoft Excel, Google Sheets, or even plain text editors. Python provides dedicated libraries for working with CSVs.
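For example, the csv module in Python’s standard library can write and read such a file; the file name and columns below are made up for illustration.

```python
import csv

# Write a small table to disk, one row per line, values separated by commas.
rows = [
    {"name": "Alice", "city": "Lisbon"},
    {"name": "Bob", "city": "Porto"},
]
with open("people.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "city"])
    writer.writeheader()
    writer.writerows(rows)

# Read it back: each line becomes a dictionary keyed by the header row.
with open("people.csv", newline="") as f:
    for record in csv.DictReader(f):
        print(record["name"], record["city"])
```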
Dashboard
A Dashboard is a user interface that uses visuals to represent data in an organized and easy-to-read way. Well-designed dashboards allow us to keep track of KPIs and data trends for better decision-making. Dashboards might be used to monitor business performance, customer service metrics, financial performance, or any other data we want to keep an eye on.
Data Accuracy
Data Accuracy refers to whether the stored data values are correct, ensuring that we can use records as a reliable source of information. Data accuracy is essential for informed decision-making and data-driven insights. Poor data accuracy can lead to incorrect conclusions, which in turn can cause costly errors.
Data as a Service (DaaS)
When we talk about Data as a Service (DaaS), we refer to a data management strategy that leverages the cloud to make data available on demand across all departments. DaaS enables data storage, integration, processing, and analytics to take place over the network and, as a result, gives us speed, scalability, and flexibility.
Data Catalog
We use a Data Catalog as an inventory of an organization’s data assets that helps technical and non-technical users locate relevant information. Data Catalogs can help facilitate better decision-making by providing a single source of truth for organizational data. Not only do Data Catalogs provide an organized inventory of all the data assets, but they can also offer insights into how data is being used and which pieces of it are most valuable to different teams and stakeholders.
Data Cleaning
Data Cleaning is a process that aims to fix or remove incorrect or incomplete data from a dataset to maintain high levels of data quality. Without such cleaning, data collection and combination often result in duplicated, mislabeled, or even corrupted records.
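As a minimal sketch (using pandas, with invented column names and values), a cleaning step might remove duplicates, normalize labels, and drop records with impossible values:

```python
import pandas as pd

# Hypothetical raw records: a duplicated row, inconsistent casing, an impossible age.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "city": ["Lisbon", "Lisbon", "porto ", "Porto"],
    "age": [34, 34, -1, 58],
})

df = df.drop_duplicates()                        # remove exact duplicate rows
df["city"] = df["city"].str.strip().str.title()  # normalize labels ("porto " -> "Porto")
df = df[df["age"].between(0, 120)]               # drop records with impossible ages
print(df)
```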
Data Collection
When we talk about Data Collection, we refer to the process of gathering and measuring data in an organized manner, enabling us to elicit insights into phenomena and forecast future trends. Regardless of the field, accurate data collection is essential for reliable results. For example, in medicine, data collection can help us uncover correlations between diseases and their causes. In marketing, data collection can provide valuable insights into consumer behavior.
Data Consumer
A Data Consumer is a person or a system that uses or processes data in some way, whether by analyzing data to make decisions, displaying data to users, or using data to train a machine learning model.
Data Democratization
When we talk about Data Democratization, we refer to a powerful concept that aims to make digital information accessible to everyone – including individuals without a technical background. Data Democratization is about giving people access to data, insights, and tools without relying on a third party like system administrators, data stewards, or IT departments.
Data Governance
Data Governance is an essential part of managing data quality, security, and privacy. It describes the system we use to manage information, as well as the actions taken with that information, including the availability, usability, integrity, and security of data in an organization.
Data Ingestion
Data Ingestion is the first step in collecting data from various sources. At this stage, we can understand the size and complexity of the data, which influences how we access, use, and analyze it later on. The goal of data ingestion is to bring data into a system, in whatever structure or format that system requires, for downstream use.
Data Integration
Data Integration is the process of consolidating data from multiple sources into a single location to achieve a unified view of data for improved decision-making. Based on their data pipeline needs, companies usually use one of two data integration approaches: ETL (extract, transform, load) or ELT (extract, load, transform).
Data Lake
A Data Lake is a storage repository designed to store both structured and unstructured data in any form, which means we can keep the data in its raw, unprocessed state. How does a data lake benefit us? Simply put, storing data from any source, at any size, speed, and structure makes more robust and diverse queries, data science use cases, and new information discoveries possible. Compared with storing data in individual databases, a data lake is cheaper, more flexible, more scalable, and easier to use, and it provides superior data quality.
Data Lakehouse
A Data Lakehouse is a storage architecture that combines the cost-effectiveness of a data lake with a data warehouse’s analytic and structural benefits. A data lakehouse enables us to use the same large data sets for different types of machine learning and business intelligence workloads.
Data Lineage
Data Lineage visually describes data flow over time, detailing its origin, iterations, and destination. It allows us to track day-to-day use, resolve errors, and maintain the accuracy and reliability of the data as well as compliance with regulatory requirements.
Data Masking
Data Masking is the practice of disguising sensitive data so that we can work with it accurately without exposing private information. There are several common data masking techniques, including the following (a minimal code sketch of two of them follows the list):
- Nulling – Returning data values as blank or replacing them with placeholders.
- Anagramming – Shuffling the character or digit order of each entry.
- Substitution – Replacing each value with a randomly selected value.
- Encryption – Translating sensitive exported data into a cipher that requires a password or key.
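The snippet below is a minimal sketch of nulling and substitution on a made-up record; real-world masking would rely on dedicated tooling and proper key management.

```python
import random
import string

record = {"name": "Alice Doe", "phone": "+351 912 345 678", "city": "Lisbon"}
masked = dict(record)

# Nulling: replace a sensitive value with a placeholder.
masked["phone"] = "***MASKED***"

# Substitution: replace the name with a random string of the same length.
masked["name"] = "".join(random.choices(string.ascii_uppercase, k=len(record["name"])))

print(masked)  # the city is untouched, so the record stays usable for analysis
```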
Data Mining
When we talk about Data Mining, we refer to extracting information from massive data sets to identify patterns and trends. It typically involves business understanding, data understanding, data preparation, modeling, evaluation, and deployment. Through data mining, we can access the valuable information needed to mitigate risk effectively, anticipate demand, monitor operational performance, and acquire new customers.
Data Monitoring
Data Monitoring refers to oversight mechanisms helping us to ensure that we use accurate, valid, and consistent data while maintaining security. It typically involves a manual or automated reporting process, with the ability to notify administrators of important events.
Data Orchestration
Data Orchestration, a relatively new discipline in computer engineering, aims to match the right data with the right purpose. It does so by automating processes related to managing data, including collecting, combining, and preparing data for analysis.
Data Pipeline
With a Data Pipeline, we can move data from one place to another, typically applying some kind of transformation along the way, such as filtering, masking, or aggregation, to ensure both data integration and standardization.
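As a toy sketch (pandas, with invented column names), a single pipeline step might filter, mask, and aggregate records before handing them on to the next stage:

```python
import pandas as pd

# Hypothetical input batch arriving from an upstream source.
orders = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "c@example.com"],
    "country": ["PT", "PT", "ES"],
    "amount": [120.0, 80.0, 200.0],
})

orders = orders[orders["amount"] > 0]                                       # filtering
orders["email"] = orders["email"].str.replace(r".+@", "***@", regex=True)   # masking
summary = orders.groupby("country", as_index=False)["amount"].sum()         # aggregation
print(summary)  # this output would be loaded into the pipeline's destination
```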
Data Platform
We use a Data Platform as a central repository that combines and utilizes the features and capabilities of several big data applications. It supports us in the acquisition, storage, preparation, delivery, and governance of our data while also ensuring high levels of security.
Data Platform as a Service (DPaaS)
Data Platform as a Service (DPaaS) enables companies to collect, manage, monitor, analyze, and present data via a centralized platform. When using a DPaaS, we benefit from strict governance, privacy, and security features that protect data and preserve its integrity.
Data Quality
Data Quality tells us about the condition of a particular set of data, including its completeness, accuracy, consistency, timeliness, validity, and uniqueness. Data quality activities can involve data integration, cleaning, rationalization, and validation.
Data Security
Data Security involves procedures and specific controls that protect data from accidental exposure, unauthorized access, or loss. We apply various techniques to mitigate security threats, including data encryption, data erasure, data resilience, and data masking.
Data Source
When we talk about a Data Source, we refer to the original location from which a piece or a set of data comes. It can be a database, a data warehouse, a flat file, an XML file, or any other readable format.
Data Storytelling
Data Storytelling is all about creating a compelling narrative from data analysis results, aiming to translate complex ideas into actionable insights tailored to a specific audience. Among the most critical aspects of data storytelling are narrative and visualizations.
Data Visualization
We use Data Visualization to represent information graphically, highlight patterns and trends, and transform complex data into a more accessible and understandable form, such as a chart or graph.
Data Warehouse
We use a Data Warehouse as a central repository for storing data from multiple sources within an organization, allowing us to make better decisions based on a single source of truth. Unlike data lakes that contain a vast amount of raw data, data warehouses store structured, processed data ready for strategic data analysis.
Database Management System
A Database Management System (DBMS) is a software system that serves as an interface for interacting with a database. It allows end users to organize and access information as needed while helping to ensure data security and integrity.
ELT
When we talk about ELT (Extract, Load, Transform), we refer to a three-step process of moving data from one place to another and preparing it for analysis. Here’s how it works (a minimal sketch follows the steps):
- Extract: we get data out of its current location, whether that’s a database, a file, or somewhere else.
- Load: we move the data to its final destination, which is typically a data warehouse or another type of database optimized for storing and analyzing large amounts of data.
- Transform: we change the data into a format that’s easier to work with. This could involve tasks like sorting, cleaning, and aggregating the data.
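Here is a heavily simplified sketch of those three steps, using Python’s built-in sqlite3 as a stand-in for a real data warehouse; the records, tables, and column names are invented for illustration.

```python
import sqlite3

# Extract: pull raw records out of their current location (here, a stand-in list;
# in practice this would come from an application database, an API, or files).
raw_rows = [
    {"region": "north", "amount": "120.50"},
    {"region": "south", "amount": "80.00"},
    {"region": "north", "amount": "99.90"},
]

# Load: land the data in the analytical store first, still untouched.
db = sqlite3.connect(":memory:")  # stand-in for a real warehouse
db.execute("CREATE TABLE raw_sales (region TEXT, amount TEXT)")
db.executemany("INSERT INTO raw_sales VALUES (:region, :amount)", raw_rows)

# Transform: reshape inside the destination, e.g. into a clean reporting table.
db.execute("""
    CREATE TABLE sales_by_region AS
    SELECT region, SUM(CAST(amount AS REAL)) AS total
    FROM raw_sales
    GROUP BY region
""")
print(db.execute("SELECT * FROM sales_by_region").fetchall())
```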
JSON
JSON (JavaScript Object Notation) is a way of encoding data so that it can be easily read and processed by computers. Thanks to its flexible and lightweight format, it makes sharing data between different applications much easier.
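For instance, Python’s built-in json module converts between native data structures and JSON text, which is what makes the format so convenient for exchanging data between applications:

```python
import json

profile = {"name": "Alice", "languages": ["en", "pt"], "active": True}

text = json.dumps(profile)       # encode: Python dict -> JSON string
print(text)                      # {"name": "Alice", "languages": ["en", "pt"], "active": true}

restored = json.loads(text)      # decode: JSON string -> Python dict
print(restored["languages"][1])  # pt
```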
Log File
A Log File is a file that contains information about the events, processes, messages, and other data arising from an operator’s use of devices, applications, and operating systems. A log file records things such as which programs ran, background scripts, and website visits.
Machine Learning
Machine Learning (ML) is a type of artificial intelligence that allows computer systems to automatically improve their performance on a specific task by learning from data, without being explicitly programmed. As we introduce new input data to a trained ML model, it can keep learning and improving, resulting in a more effective and predictive model.
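As a minimal sketch with scikit-learn and a made-up toy dataset, the “learning from data” step boils down to fitting a model on examples and then asking it to predict unseen cases:

```python
from sklearn.linear_model import LogisticRegression

# Toy data: [hours studied, hours slept] -> passed the exam (1) or not (0).
X = [[2, 9], [1, 5], [5, 8], [6, 4], [8, 8], [3, 6]]
y = [0, 0, 1, 1, 1, 0]

model = LogisticRegression()
model.fit(X, y)                 # the model "learns" a rule from the examples

print(model.predict([[7, 7]]))  # predicts the outcome for an unseen student
```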
Master Data
Master Data represents the key data entities critical to an organization’s operations and strategy, such as customers, products, suppliers, employees, and assets, and is typically stored in a centralized system or database. Customer master data, for example, might include information about the customer’s name, address, contact information, and purchase history. Master Data is often mistaken for reference data.
Metadata
Metadata, commonly described as “data about data,” is structured information that describes, locates, or makes it easier to use or manage an information resource. Typical metadata includes file size, image color/resolution, date, authorship, and keywords.
MLOps
MLOps, or DevOps for machine learning, combines software development (Dev) and operations (Ops) to streamline the process of building, testing, and deploying machine learning models. In other words, it’s all about making sure that we develop machine learning models in a reliable and scalable manner, just like any other software.
Model Deployment
Essentially, Model Deployment refers to putting a machine learning model into production to make predictions on new, unseen data. It’s the final step in the machine learning pipeline, where the model is taken from development and integrated into a real-world application so we can use it to deliver valuable insights and make data-driven decisions.
Model Retraining
When we talk about Model Retraining, we refer to updating an ML model with new data to improve its performance. We can do it manually or automate the process by applying Continuous Training (CT), a part of the MLOps practices.
Natural Language Processing (NLP)
Natural Language Processing (NLP) falls into the area of artificial intelligence (AI) that enables computers to read, analyze, and respond to human language, much as humans do. In a nutshell, NLP makes it possible for us to communicate with computers in a way that feels natural and human-like. NLP is used in various applications, such as voice assistants, language translation, chatbots, and sentiment analysis.
NoSQL
NoSQL, or “not only SQL,” refers to databases designed to handle large volumes of unstructured or semi-structured data. We mainly classify NoSQL databases into four types: document-based databases, key-value stores, column-oriented databases, and graph-based databases. Unlike traditional relational databases, NoSQL databases allow for more flexible and scalable data storage, enabling organizations to store data in more intuitive ways that are closer to how users access it. As a result, they are widely used in real-time web applications and big data.
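As a rough illustration of the document-based flavour, a single record can carry its own nested structure, much like the Python dictionary below, instead of being split across fixed columns; the fields are invented, and a real document store (rather than plain Python) would persist and query such records.

```python
# One self-describing "document": nested fields, and not every document
# needs the same shape -- a flexibility that fixed relational rows don't offer.
order = {
    "_id": "order-1001",
    "customer": {"name": "Alice", "email": "alice@example.com"},
    "items": [
        {"sku": "A-12", "qty": 2, "price": 9.99},
        {"sku": "B-07", "qty": 1, "price": 24.50},
    ],
    "notes": "gift wrap",  # optional field; other orders may simply omit it
}

total = sum(item["qty"] * item["price"] for item in order["items"])
print(order["_id"], total)
```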
Power Query
Power Query is a Microsoft tool used to perform ETL, a three-step process where data is extracted, transformed, and loaded into its final destination. With Power Query, we can import data from various sources and perform data preparation according to our needs.
Predictive Analytics
Predictive Analytics combines historical and current data with advanced statistics and machine learning techniques to forecast possible future outcomes. Many companies use predictive analytics to identify customer behavior patterns, assess risk, reduce downtime, and personalize treatment plans for patients.
Raw Data
When we talk about Raw Data (sometimes called source data, atomic data, or primary data), we refer to data collected from one or more sources that has not yet been processed, organized, or analyzed for use, and thus remains unactionable. Raw data can come in various forms, such as numbers, text, images, video, or audio.
Real-Time Analytics
Real-Time Analytics refers to analyzing and processing data in real-time or near real-time as it is generated or received. So, instead of waiting for a batch of data to build up, real-time analytics processes data as it comes in, allowing for more immediate insights and decision-making.
Reference Data
We generally use Reference Data to classify, categorize, or provide context for other data with the aim of standardizing and harmonizing data across different systems and applications. Different product codes, for instance, can be reference data used by a manufacturing company to introduce a standardized product classification system to simplify comparing sales data across regions. This way, they ensure that data is consistent and accurate across the organization and, thus, improve decision-making and operational efficiency. Reference Data is often mistaken for master data.
Relational Database (RDBMS)
A Relational Database (RDBMS) is a type of database that stores data in tables with columns and rows, similar to a spreadsheet. In an RDBMS, data is stored in tables related to each other based on common attributes or keys. For example, a customer table might be related to an orders table through a customer ID key. This allows data to be organized and accessed in a logical and efficient way.
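The customer/orders example can be sketched with Python’s built-in sqlite3, a small relational database; the table and column names are illustrative only.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),  -- the shared key
        total       REAL
    );
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 99.90), (11, 1, 15.00), (12, 2, 42.00);
""")

# The common customer_id key lets us combine the two tables in a single query.
for name, spent in db.execute("""
    SELECT c.name, SUM(o.total)
    FROM customers c JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.name
"""):
    print(name, spent)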
Second-Party Data
Simply put, Second-Party Data is someone else’s first-party data. It is gathered by a trusted partner or other business entity and typically comes from various sources, including websites, apps, and social media.
Self-Service Tools
Self-Service Tools are applications or platforms that empower people to access information and perform tasks independently without needing external support or assistance. By equipping users with the power and autonomy to select, filter, compare, visualize, and analyze data, companies can drastically reduce the need for costly and extensive IT training, freeing up resources and improving efficiency.
Semi-Structured Data
Semi-Structured Data is a form of structured data that is not stored in a tabular format but contains tags and other elements that make it possible to group and label data. Examples of semi-structured data sources include zip files, emails, XML, and web pages.
Structured Data
When we talk about Structured Data, we refer to quantitative, predefined, and formatted data, typically stored in tabular form in relational databases (RDBMSs). Structured data usually comes in the form of numbers and values and is thus easier to search, analyze, and comprehend.
Structured Query Language (SQL)
Structured Query Language (SQL) is a standardized programming language designed for managing relational databases and carrying out operations on the data, such as storing, searching, removing, and retrieving data, as well as maintaining and optimizing database performance.
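As a brief illustration of those operations, here are a few common SQL statements run from Python against a throwaway SQLite table; the table and names are made up.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, team TEXT)")

# Storing data
db.execute("INSERT INTO employees (name, team) VALUES (?, ?)", ("Alice", "Data"))
db.execute("INSERT INTO employees (name, team) VALUES (?, ?)", ("Bob", "Web"))

# Searching and retrieving data
print(db.execute("SELECT name FROM employees WHERE team = 'Data'").fetchall())

# Updating existing data
db.execute("UPDATE employees SET team = 'Data' WHERE name = 'Bob'")

# Removing data
db.execute("DELETE FROM employees WHERE name = 'Alice'")
```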
Third-Party Data
Unlike first- and second-party data, Third-Party Data comes from an outside entity that often sells previously collected customer data to companies for advertising purposes. In this case, organizations possess customer data without having direct relationships with their customers.
Unit Testing
In the software development process, Unit Testing allows developers to test individual units or components of code independently from other parts of the system. This way, they ensure each piece of software is fit for use.
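For example, a tiny test written with Python’s built-in unittest module checks one function in isolation; the function under test is invented purely for illustration.

```python
import unittest

def apply_discount(price, percent):
    """Function under test: reduce a price by a percentage (hypothetical example)."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)

class ApplyDiscountTests(unittest.TestCase):
    def test_regular_discount(self):
        self.assertEqual(apply_discount(200.0, 25), 150.0)

    def test_invalid_percentage_is_rejected(self):
        with self.assertRaises(ValueError):
            apply_discount(100.0, 150)

if __name__ == "__main__":
    unittest.main()
```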
Unstructured Data
When we talk about Unstructured Data, we refer to data that doesn’t follow a predefined model or tabular format and has not yet been organized for use, such as free-form text, images, video, or audio. Because it lacks that structure, it is harder to search and analyze than structured data.