Glossary
A
Access Control
Access Control is a process we use to manage access to data and resources within an organization. We can implement access control in many different ways, such as using passwords, PINs, and biometric authentication, often combining methods via multi-factor authentication (MFA).
Advanced Analytics
We refer to a wide range of data analysis techniques and tools when we use the term Advanced Analytics. These methods go beyond traditional business intelligence (BI) to uncover deeper insights, make predictions, or generate recommendations.
As this umbrella term covers a great variety of advanced analytic techniques, here is a non-exhaustive list of examples: extracting valuable information from data and text, using algorithms to learn and make predictions, identifying patterns in data, predicting
future outcomes, displaying data in visual forms, creating models to simulate real-world scenarios, and using neural networks to solve complex problems.
Analytics
Analytics allows us to make better decisions by taking raw data and uncovering its meaningful patterns. Essentially, analytics applies statistics, computer programming, and operations research to quantify and gain insight into the meanings of data. While every sector can benefit from analytics, data-rich sectors find it especially useful.
API
An API (application programming interface) is a set of rules and definitions that specifies how different software applications can interact with each other. An API can be thought of as a waiter at a restaurant. Just as you, the customer, would use a menu to place an order with the waiter, a developer would use an API to request specific information or actions from a software program or web-based service. The waiter then goes to the kitchen, gathers the requested dishes, and returns them to your table; in the same way, the API retrieves the requested data or performs the requested actions and returns the result to the developer.
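To make the waiter analogy concrete, here is a minimal sketch of a developer placing a request with a web API over HTTP. It assumes the third-party requests library is installed, and the endpoint (api.example.com) and its parameters are purely hypothetical.

```python
import requests  # third-party HTTP client: pip install requests

# Hypothetical endpoint used only for illustration.
response = requests.get(
    "https://api.example.com/v1/orders",
    params={"status": "open"},   # the "order" we place, like choosing from a menu
    timeout=10,
)
response.raise_for_status()      # raise an error for 4xx/5xx responses
orders = response.json()         # the "dishes" brought back: parsed JSON data
print(len(orders), "open orders returned")
```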
Artificial Intelligence
When we talk about Artificial Intelligence (AI), we refer to a broad field of computer science that gives machines the ability to sense, reason, engage and learn. AI allows machines to do tasks that would typically require human intelligence, including speech recognition, problem-solving, decision-making, planning, and visual perception.
B
Batch Ingestion
Batch Ingestion allows us to collect and transfer data in chunks or groups. The process usually handles data in bulk and is triggered or scheduled at regular intervals. We usually opt for this solution when we want to collect specific data points frequently or when we don’t need the data immediately available for real-time analysis.
Big Data
When we talk about Big Data, we refer to the massive amounts of structured and unstructured data collected from different sources. Given its size and complexity, such data is almost impossible to process using traditional methods. Big Data comprises three characteristics, known as the three Vs:
- Volume: Data doesn’t just come from one source. We might derive it from videos, images, transactions, audio, social media, smart devices (IoT), and more. Previously, storing such data was resource-intensive; however, data lakes, Hadoop, and the cloud have made storage much more accessible.
- Velocity: Gathering real-time data for immediate use is a pressing challenge for sectors such as traffic control and the global supply chain. Big Data analysis answers this need through innovative processing technologies.
- Variety: Organizations collect data from all types of sources, resulting in various data formats – from structured, numeric data to unstructured text documents, emails, videos, and audio.
BigQuery
BigQuery is a fully managed, cloud-based data warehouse that helps us manage and analyze data via machine learning, geospatial analysis, and business intelligence. Because it allows us to query terabytes in seconds and petabytes in minutes, this solution is often our go-to option to process and analyze massive datasets quickly and easily.
Blockchain
Blockchain is a digital, decentralized ledger that records transactions on multiple computers rather than storing them in a central location. Each block in the chain contains a cryptographic hash of the previous block, as well as a timestamp and transaction data. Consequently, no block can be altered without changing every subsequent block and – more importantly – without the network’s consensus.
Business Intelligence
Business Intelligence (BI) is all about using data to make smarter business decisions. It involves collecting, storing, and analyzing data from various sources to help us better understand businesses, customers, and operations. Thanks to BI tools and techniques, we can turn all that data into valuable insights, identify trends, spot opportunities, and make better-informed decisions.
C
CI/CD Pipeline
A CI/CD (Continuous Integration / Continuous Delivery) pipeline is a series of automated processes that helps us build, test, and deliver software more quickly and reliably. It typically includes steps such as pulling code changes from version control, building and testing the code, and then deploying it to production. By automating these processes, a CI/CD pipeline helps us get new features and updates out the door faster and with fewer mistakes.
Cloud Computing
When we talk about Cloud Computing, we refer to the use of off-site systems hosted on the cloud instead of on one’s computer or other local storage. Cloud computing can include various services, such as email servers, software programs, data storage, and even additional processing power.
Clustering
Clustering is a data analysis technique that groups similar data points into clusters or segments based on their similarities. It helps us identify meaningful patterns or relationships within a dataset that may not be immediately apparent.
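As a minimal illustration, the sketch below groups a handful of made-up 2-D points into two clusters with the k-means algorithm; it assumes the scikit-learn and NumPy libraries are installed.

```python
import numpy as np
from sklearn.cluster import KMeans  # assumes scikit-learn is installed

# Toy 2-D points forming two loose groups around (0, 0) and (5, 5).
points = np.array([[0.1, 0.2], [0.3, 0.1], [0.2, 0.4],
                   [5.1, 5.0], [4.9, 5.2], [5.3, 4.8]])

model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(points)  # assign each point to one of the two clusters
print(labels)                       # e.g. [0 0 0 1 1 1]
```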
Code Review
Code Review describes the early-stage systematic testing of software source code by fellow programmers to find and remove bugs and address vulnerabilities. It’s a powerful way to improve the overall quality of our code, and it should be an ongoing practice within any software development team.
Control Tower
We use a Control Tower to centralize the management of the organization’s resources. It provides us with immediate insight into the location of our resources and helps to streamline operations using dashboards and other visual tools. We find it especially useful in supply chain management because it helps us keep track of critical issues and allows us to resolve them quickly.
CRM Database
A CRM (customer relationship management) database serves organizations as a valuable resource that houses all the important information about clients. This data often includes contacts, leads, accounts, activity logs, performance indicators, and more. The CRM database plays a critical role in facilitating sales and marketing campaigns.
CSV (Comma Separated Values)
CSV (.csv) is a plain-text file format that we use to store data in tabular form, with each row on a new line and each value within a row separated by a comma. It’s simple and easy to use, and it can be opened and edited with many different software programs, such as Microsoft Excel, Google Sheets, or even text editors. Python provides dedicated libraries to work with CSVs.
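For example, Python’s built-in csv module can write and read such a file; the file name and columns below are invented for illustration.

```python
import csv

rows = [
    {"name": "Alice", "city": "Lisbon"},
    {"name": "Bob", "city": "Warsaw"},
]

# Write a small table: one header line, then one comma-separated row per record.
with open("people.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "city"])
    writer.writeheader()
    writer.writerows(rows)

# Read it back as dictionaries keyed by the header.
with open("people.csv", newline="") as f:
    for record in csv.DictReader(f):
        print(record["name"], "lives in", record["city"])
```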
Customer Data Platform (CDP)
A Customer Data Platform, or CDP, gathers all the data from multiple sources and shapes a set of unique customer profiles. Customer data might come from numerous sources like CRMs, websites, social media, and ERPs. By merging data related to the same customer, the CDP allows a 360-degree understanding of each customer. Moreover, the customer data is stored, enabling us to follow the customer journey over time.
D
Dashboard
A Dashboard is a user interface that uses visuals to represent data in an organized and easy-to-read way. Well-designed dashboards allow us to keep track of KPIs and data trends for better decision-making. Dashboards might be used to monitor business performance, customer service metrics, financial performance, or any other data we want to keep an eye on.
Data Accuracy
Data Accuracy refers to whether the stored data values are correct, ensuring that we can use records as a reliable source of information. Data accuracy is essential for informed decision-making and data-driven insights. Poor data accuracy can lead to incorrect conclusions, which in turn can cause costly errors.
Data Architecture
Data Architecture describes the design and organization of data assets we use in an organization. A data architecture typically includes several key components, such as:
- data sources,
- data models representing and defining the relationships between different data elements,
- storage infrastructure (for example, databases, data warehouses, and data lakes), and
- data governance frameworks.
Data as a Service (DaaS)
When we talk about Data as a Service (DaaS), we refer to a data management strategy that leverages the cloud to make data available on demand across all departments. DaaS enables data storage, integration, processing, and analytics to take place over the network and, as a result, benefits us with speed, scalability, and flexibility.
Data Breach
A Data Breach happens when unauthorized individuals or groups access, steal or use sensitive information. This can occur in multiple ways, including hacking into computer systems, stealing physical data storage devices, or unintentionally exposing sensitive information.
Data Catalog
We use a Data Catalog as an organization’s data assets inventory to help technical and non-technical users locate relevant information. Data Catalogs can help facilitate better decision-making by providing a single source of truth for organizational data. Not only do Data Catalogs provide an organized inventory of all the data assets, but they can also offer insights into how data is being used and which pieces of it are most valuable to different teams and stakeholders.
Data Classification
Data Classification is the process of categorizing and labeling data based on its sensitivity, importance, value, and other relevant attributes. By classifying data, we can determine which data we need to keep confidential and which we can share freely. This way, we ensure that sensitive information is only accessed by those with a legitimate need for it, preventing data breaches and unauthorized access.
Data Cleaning
Data Cleaning is a process that aims to fix or remove incorrect and/or incomplete data from a dataset to maintain high levels of data quality. Without the cleaning processes, data collection and combination often result in duplicated, mislabeled, or even corrupted records.
Data Collection
When we talk about Data Collection, we refer to the process of gathering and measuring data in an organized manner, enabling us to elicit insights into phenomena and forecast future trends. Regardless of the field, accurate data collection is essential for reliable results. For example, in medicine, data collection can help us uncover correlations between diseases and their causes. In marketing, data collection can provide valuable insights into consumer behavior.
Data Consumer
A Data Consumer is a person or a system that uses or processes data in some way, for example by analyzing data to make decisions, displaying data to users, or using data to train a machine learning model.
Data Democratization
When we talk about Data Democratization, we refer to a powerful concept that aims to make digital information accessible to everyone – including individuals without a technical background. Data Democratization is about giving people access to data, insights, and tools without relying on a third party like system administrators, data stewards, or IT departments.
Data Discovery
When we talk about Data Discovery, we refer to the process of locating and identifying relevant data assets within an organization’s systems or repositories. It aims to enable users of all levels to uncover hidden patterns in data and make better-informed decisions.
Data Encryption
Being one of many data masking techniques, Data Encryption is a way to protect your sensitive information by scrambling it into a secret code that only authorized people can decipher using a decryption key or password.
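As a small illustration of scrambling data with a key, the sketch below uses the Fernet recipe from the third-party cryptography package; the plaintext is invented.

```python
from cryptography.fernet import Fernet  # third-party package: pip install cryptography

key = Fernet.generate_key()   # the decryption key only authorized people should hold
cipher = Fernet(key)

token = cipher.encrypt(b"card number: 4111 1111 1111 1111")
print(token)                  # unreadable ciphertext

plaintext = cipher.decrypt(token)  # only possible with the same key
print(plaintext.decode())
```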
Data Enrichment
Data Enrichment involves the process of integrating fresh updates and information into an organization’s existing database. The aim is to enhance accuracy and fill in any missing details. It differs from data cleansing, which primarily focuses on removing inaccurate, irrelevant, or outdated data. Data enrichment, on the other hand, is primarily concerned with supplementing the existing data with additional valuable information.
Data Exchange
Data Exchange is the process of sharing data between different systems by moving data in various forms from one location or source to another. A well-designed data exchange preserves the meaning of the transferred information while simplifying the acquisition and integration of data.
Data Governance
Data Governance is an essential part of managing data quality, security, and privacy. It describes the system we use to manage information, as well as the actions taken with that information, including the availability, usability, integrity, and security of data in an organization.
Data Frame
A Data Frame describes a structure in which data can be stored and organized in rows and columns, similar to a spreadsheet. It is handy for analyzing and manipulating data because it allows you to work with many different variables simultaneously while keeping everything organized and easy to understand.
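A minimal sketch with pandas, one widely used data frame implementation in Python, shows the spreadsheet-like rows and columns and how several variables can be handled at once; the sample values are invented.

```python
import pandas as pd  # assumes pandas is installed

df = pd.DataFrame({
    "product": ["laptop", "phone", "tablet"],
    "units_sold": [12, 30, 17],
    "unit_price": [999.0, 599.0, 399.0],
})

# Work with several columns at once while keeping everything organized.
df["revenue"] = df["units_sold"] * df["unit_price"]
print(df.sort_values("revenue", ascending=False))
```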
Data Ingestion
Data Ingestion is the first step in collecting data from various sources. At this stage, we can understand the size and complexity of the data, which influences how we use it later on in terms of access, use, and analysis. Data ingestion aims to bring data into any system that requires a particular structure or format for downstream data use.
Data Integration
Data Integration is the process of consolidating data from multiple sources into a single location to achieve a unified view of data for improved decision-making. Based on their data pipeline needs, companies usually use one of two data integration approaches: ETL (extract, transform, load) or ELT (extract, load, transform).
Data Lake
A Data Lake is a storage repository designed to store both structured and unstructured data in any form. This means we can keep and use the data in its raw, unprocessed form. How does a data lake benefit us? Simply put, storing data from any source, at any size, speed, and structure makes more robust and diverse queries, data science use cases, and new information discoveries possible. Compared with storing data in individual databases, a data lake is cheaper, more flexible, more scalable, and easier to use, and it can provide superior data quality.
Data Lakehouse
Data Lakehouse is a storage architecture that combines the cost-effectiveness of a data lake with a data warehouse’s analytic and structural benefits. A data lakehouse enables us to use the same large data sets for different types of machine learning and business intelligence workloads.
Data Lineage
Data Lineage visually describes data flow over time, detailing its origin, iterations, and destination. This process allows us to track day-to-day use and error-resolution operations while maintaining the data’s accuracy, reliability, and compliance with regulatory requirements.
Data Literacy
When we talk about Data Literacy, we refer to an individual’s or organization’s ability to read, work with, analyze, and communicate with data, as well as having an understanding of such fundamental concepts as:
- Data sources,
- Types of data,
- Types of data analysis,
- Data cleansing practices,
- Data tools, techniques, and frameworks.
Data Mapping
We use Data Mapping to match data from one data model to another by drawing connections and relationships between them. Doing so ensures our data is accurate and standardized across the organization.
Data Mart
Unlike a central data warehouse, a Data Mart is a smaller storage space focused on specific data used by a particular organization’s business unit, such as the Finance or Marketing department. Since data marts store only data specific to a single group, they require less storage space and are, thus, often faster and more easily accessible.
Data Masking
Data Masking is the practice of disguising sensitive data so that we can use it accurately without exposing private information. There are many common data masking techniques (two of them are sketched in the example after this list), including:
- Nulling – Returning data values as blank or replaced with placeholders.
- Anagramming – Shuffling characters or digit order for each entry.
- Substitution – Replacing each value with a randomly selected value.
- Encryption – Translating sensitive exported data into a cipher that requires a password or key.
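Below is a small, purely illustrative sketch of two of these techniques, nulling and substitution, in plain Python; the record and field names are invented, and real masking would normally rely on dedicated tooling.

```python
import random
import string

def null_out(value):
    """Nulling: return a blank placeholder instead of the real value."""
    return "***"

def substitute(value):
    """Substitution: replace each character with a random one of the same kind."""
    return "".join(
        random.choice(string.digits) if ch.isdigit()
        else random.choice(string.ascii_letters) if ch.isalpha()
        else ch
        for ch in value
    )

record = {"name": "Jane Doe", "phone": "555-0123"}
masked = {"name": null_out(record["name"]), "phone": substitute(record["phone"])}
print(masked)  # e.g. {'name': '***', 'phone': '918-4471'}
```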
Data Mesh
Data Mesh refers to a decentralized data architecture and operating model focused on bringing data closer to the teams where it is generated and treating data as a product.
There are four core principles of data mesh:
- Domain-oriented data ownership and architecture: The data mesh creates a communication structure between data owners, preventing data siloing.
- Data as a product: The change in perspective impacts the way we collect, serve, and manage data and, thus, boosts data quality and user satisfaction.
- Self-service data platform: The domain should have the appropriate infrastructure to support data democratization and empower domain teams.
- Federated computational governance: The data mesh is an ecosystem that enables interoperability across different data sources.
Data Migration
Data Migration is all about moving data from one system to another, typically between applications, formats, existing databases, or storage systems. In general, Data Migration involves tasks such as preparing, extracting, and transforming data from source systems; loading data into the target system; testing for quality; and validating the outcomes.
Data Mining
When we talk about Data Mining, we refer to extracting information from massive data sets to identify patterns and trends. It typically involves business understanding, data understanding, data preparation, modeling, evaluation, and deployment. Through data mining, we can surface the information needed to mitigate risk effectively, anticipate demand, monitor operational performance, and acquire new customers.
Data Model
A Data Model is a framework for organizing and representing data, helping us understand the relationships between different pieces of data and how they relate to real-world concepts. We can think of it as a blueprint for our data—just as a blueprint helps us construct a building, a data model helps us construct a database or other data storage systems. There are many types of data models, but some of the most common include relational, hierarchical, and object-oriented models.
Data Modeling
When we talk about Data Modeling, we refer to creating a conceptual representation of data and its relationships to other data. Essentially, it’s a way to map out how different pieces of data are connected and how they relate to each other. This is typically done using diagrams or other visual representations that allow us to easily understand the relationships between different data entities.
Data Monitoring
Data Monitoring refers to oversight mechanisms helping us to ensure that we use accurate, valid, and consistent data while maintaining security. It typically involves a manual or automated reporting process, with the ability to notify administrators of important events.
Data Orchestration
Data orchestration, a relatively new discipline in computer engineering, aims to match the right data with the right purpose. It does so by automating processes related to managing data, including collecting, combining, and preparing data for analysis.
Data Pipeline
With a Data Pipeline, we can move data from one place to another, typically through some kind of data transformation, including filtering, masking, and aggregations, to ensure both data integration and standardization.
Data Platform
We use a Data Platform as a central repository that combines and utilizes the features and capabilities of several big data applications. It supports us in the acquisition, storage, preparation, delivery, and governance of our data while ensuring high security levels.
Data Platform as a Service (DPaaS)
Data Platform as a Service (DPaaS) enables companies to collect, manage, monitor, analyze, and present data via a centralized platform. When using a DPaaS, we ensure strict governance, privacy, and security features for data protection and integrity.
Data Producer
A Data Producer is the root source of any data: any entity that collects, stores, and provides data as a result of its activities. It might be a device, a service, a piece of software, or an organization.
Data Profiling
Data Profiling is the process of analyzing and examining data to gain a better understanding of its characteristics, quality, and structure. It involves collecting metadata, tagging data with keywords and categories, and discovering relationships between different data sources.
Data Replication
Data Replication involves duplicating data and storing it in multiple locations. This is done for the purpose of creating backups, ensuring resilience in case of system failures, and improving data accessibility.
Data Security
Data Security involves procedures and specific controls to protect data from accidental exposure, unauthorized access, or loss. We apply various techniques to mitigate security threats, including data encryption, erasure, resilience, and masking.
Data Silo
When we talk about Data Silos, we refer to individual data repositories held by one group and isolated from the rest of the organization, thus, remaining inaccessible to the others. Data silos can exist within a single department, branch, or company or even be shared between multiple organizations.
Data Source
When we talk about a Data Source, we refer to the original location from which a piece or a set of data comes. It can be a database, a data warehouse, a flat file, an XML file, or any other readable format.
Data Stewardship
Data Stewardship is a set of people, processes, and tools that ensure the accuracy, reliability, security, and proper management of data. Data Stewardship involves:
- Defining roles and responsibilities for data management,
- Establishing a set of procedures and policies,
- Implementing appropriate security controls,
- Monitoring data usage.
Data Storytelling
Data Storytelling is all about creating a compelling narrative from data analysis results. It aims to translate complex ideas into actionable insights tailored to a specific audience. Among the most critical aspects of data storytelling are narrative and visualizations.
Data Quality
Data Quality tells us about the condition of a particular set of data, including its completeness, accuracy, consistency, timeliness, validity, and uniqueness. Data quality activities can involve data integration, cleaning, rationalization, and validation.
Data Validation
We perform Data Validation to check and confirm that data is accurate, complete, and meets certain requirements or standards. Data Validation includes a range of activities and techniques, such as:
- Data profiling: Analyze the data to understand its structure, format, and quality.
- Data cleaning: Identify and correct errors or inconsistencies in the data.
- Data verification: Check that the data is correct and complete.
- Data testing: Test the data against specific requirements or standards.
- Data documentation: Document the validation process and results to ensure transparency and repeatability.
Data Visualization
We use Data Visualization to represent information graphically, highlight patterns and trends, and transform complex data into a more accessible and understandable form, such as a chart or graph.
Data Warehouse
We use a Data Warehouse as a central repository for storing data from multiple sources within an organization. This allows us to make better decisions based on a single source of truth. Unlike data lakes, which contain a vast amount of raw data, data warehouses store structured, processed data ready for strategic data analysis.
Data Workflow
Data Workflow encompasses the activities of gathering, arranging, transforming, and analyzing data to guarantee its accurate storage and effective management. The objective is to ensure that the information is readily accessible to anyone who needs it at any time.
Data Workflow Diagram
A Data Workflow Diagram visually outlines the workflow of data from its source to processing, storage, assignment, and, finally, the output of results in a user-friendly format, making it easier to understand.
Database Management System
A Database Management System (DBMS) is a software system that serves as an interface for interacting with a database. It allows end users to organize, retrieve, and manage the information in their databases as needed while helping to ensure data security and integrity.
Dataset
A Dataset is a collection of related data organized and stored in a specific format. It can be as simple as a collection of numbers or as complex as a collection of images, videos, and text.
dbt
dbt, or data build tool, is an open-source command line tool we use to transform and structure data in data warehouses. With dbt, we can write SQL transformations as modular and reusable code, thus, saving time and effort by not having to rewrite the same SQL code multiple times, improving the codebase’s maintainability, and making it easier to collaborate on projects with multiple team members.
Descriptive Analytics
Descriptive Analytics involves analyzing past and present data to find patterns and connections to answer questions like “What happened?” or “What is happening?”. This process typically uses common visual aids like bar charts, pie charts, tables, line graphs, and written reports.
Diagnostic Analytics
Diagnostic Analytics uses data to determine the causes of trends and correlations between variables and ultimately answer the question: “Why did it happen?” Diagnostic analytics is characterized by techniques such as data discovery, data mining, and correlations. It usually occurs after the conclusion of descriptive analytics.
E
ELT (Extract, Load, Transform)
When we talk about ELT (Extract, Load, Transform), we refer to a three-step process of moving data from one place to another and preparing it for analysis. Here’s how it works:
- Extract: we get data out of its current location, whether that’s a database, a file, or somewhere else.
- Load: we move the data to its final destination, which is typically a data warehouse or another type of database optimized for storing and analyzing large amounts of data.
- Transform: we change the data into a format that’s easier to work with. This could involve tasks like sorting, cleaning, and aggregating the data.
Entity Relationship Diagram (ERD)
Also known as an Entity Relationship Model, an Entity Relationship Diagram visualizes the attributes and the relationships between entities like people, things, or concepts in a database, thus illustrating its logical structure. An ERD is useful for sketching out a design of a new database, analyzing existing databases, and uncovering information more easily to resolve issues and improve results.
ETL
When we talk about ETL (Extract, Transform, Load), we refer to a three-step process of moving data from one place to another and preparing it for analysis. Here’s how it works (a minimal sketch follows the list):
- Extract: we get data out of its current location, whether that’s a database, a file, or somewhere else.
- Transform: we change the data into a format that’s easier to work with. This could involve tasks like sorting, cleaning, and aggregating the data.
- Load: we move the data to its final destination, which is typically a data warehouse or another type of database optimized for storing and analyzing large amounts of data.
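Here is a minimal, illustrative ETL sketch in Python: it extracts rows from a hypothetical CSV export, transforms them by fixing types and dropping incomplete records, and loads them into SQLite, which stands in for a data warehouse.

```python
import csv
import sqlite3

# Extract: read raw rows from a source file (hypothetical name and columns).
with open("sales_export.csv", newline="") as f:
    raw_rows = list(csv.DictReader(f))

# Transform: normalize values and drop records with no amount.
clean_rows = [
    (row["order_id"], row["country"].strip().upper(), float(row["amount"]))
    for row in raw_rows
    if row.get("amount")
]

# Load: write the prepared rows into the target database.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (order_id TEXT, country TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean_rows)
conn.commit()
conn.close()
```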
F
First-Party Data
First-Party Data describes customer data collected directly through customer interactions, including demographic data, purchase history, and preferences. This data comes from the organization’s own sources, such as Google Analytics, social media, mobile apps, or CRM systems. First-party data must comply with GDPR and CCPA regulations, as the entity collecting the data is the owner of consent.
J
JSON
JSON (JavaScript Object Notation) is a way of encoding data so that it can be easily read and processed by computers. Thanks to its flexible and lightweight format, it makes sharing data between different applications much easier.
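For example, Python’s built-in json module encodes an object as JSON text and decodes it back; the order data below is invented.

```python
import json

order = {"id": 42, "items": ["notebook", "pen"], "paid": True}

text = json.dumps(order)    # '{"id": 42, "items": ["notebook", "pen"], "paid": true}'
decoded = json.loads(text)  # back to a Python dictionary
print(decoded["items"])
```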
K
KPI Dashboard
A KPI (key performance indicator) dashboard is a visual reporting tool that enables organizations to consolidate vast amounts of data for a single-screen view of performance over time for a specific objective. Creating effective dashboards starts with collecting data from different tools, then organizing, filtering, and analyzing the data, and ultimately visualizing it. A KPI dashboard helps users easily access data, identify trends, and make data-driven decisions.
L
Log File
A Log File is a file that contains information about the events, processes, messages, and other data arising from an operator’s use of devices, applications, and operating systems. The log file will record any program running, background scripts, and website visits.
M
Machine Learning
Machine Learning (ML) is a type of artificial intelligence that allows computer systems to automatically improve their performance on a specific task by learning from data, without being explicitly programmed. As we introduce new input data to a trained ML model, it can continue to learn and adapt, resulting in a more effective and predictive model.
Master Data
Master Data represents the key data entities critical to an organization’s operations and strategy, such as customers, products, suppliers, employees, and assets, and is typically stored in a centralized system or database. Customer master data, for example, might include information about the customer’s name, address, contact information, and purchase history. Master Data is often mistaken for reference data.
Metadata
Metadata, commonly described as “data about data,” is structured information that describes, locates, or makes it easier to use or manage an information resource. Typical metadata includes file size, image color/resolution, date, authorship, and keywords.
MLOps
MLOps, or DevOps for machine learning, combines software development (Dev) and operations (Ops) to streamline the process of building, testing, and deploying machine learning models. In other words, it’s all about making sure that we develop machine learning models in a reliable and scalable manner, just like any other software.
Model Deployment
Essentially, Model Deployment refers to putting a machine learning model into production to make predictions on new, unseen data. It’s the final step in the machine learning pipeline, where the model is taken from development and integrated into a real-world application so we can use it to deliver valuable insights and make data-driven decisions.
Model Retraining
When we talk about Model Retraining, we refer to updating an ML model with new data to improve its performance. We can do it manually or automate the process by applying Continuous Training (CT), a part of the MLOps practices.
Monitoring
Monitoring refers to the practice of collecting data about a system to detect problems. Because it relies on predefined metrics and logs, it gives a much more limited view of the system than observability. While monitoring notifies you that a system is at fault, observability helps you understand why issues occurred.
MVP (Minimum Viable Product)
An MVP (Minimum Viable Product) is an Agile concept that refers to the most basic version of a product, with just enough features to satisfy the needs of early users. The idea behind an MVP is to quickly gather users’ feedback and consistently improve the product with minimal effort and resources. By starting with an MVP, we can prioritize essential features and learn along the way while reducing the time and cost of development and avoiding building unnecessary features.
N
Natural Language Processing (NLP)
Natural Language Processing (NLP) falls into the area of artificial intelligence (AI) that enables computers to read, analyze, and respond to human language, just like humans do. In a nutshell, NLP makes it possible for us to communicate with computers in a way that feels natural and human-like. NLP is used in various applications such as voice assistants, language translation, chatbots, sentiment analysis, and many more.
NoSQL
NoSQL, or “not only SQL,” is a database designed to handle large volumes of unstructured or semi-structured data. We mainly classify NoSQL databases into four types: document-based databases, key-value stores, column-oriented databases, and graph-based databases. Unlike traditional relational databases, NoSQL databases allow for more flexible and scalable data storage, enabling organizations to store data in more intuitive ways that are closer to how users access data. Thus, they are widely used in real-time web applications and big data.
O
Observability
Observability refers to an organization’s ability to understand the health and state of data in their systems based on external outputs such as logs, metrics, and traces. Unlike traditional monitoring that relies on predefined metrics and logs, observability provides a broader and more dynamic view of the system. It allows organizations to explore how and why issues occur rather than simply receiving alerts. At its core, data observability rests on five pillars:
- Freshness: tracks how up-to-date the data is.
- Distribution: details whether collected data falls within expected value ranges.
- Volume: checks whether the amount of data received matches expectations to confirm the data is complete.
- Schema: monitors changes in data organization.
- Lineage: describes data flow over time, detailing its origin, iterations, and destination.
P
Parquet
Apache Parquet is a free and open-source column-based format used to store big data. As opposed to row-based formats such as CSV and JSON, Apache Parquet organizes files by column, therefore saving storage space and increasing performance.
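As a small illustration of the column-based layout, the sketch below writes a tiny pandas DataFrame to Parquet and reads back a single column; it assumes pandas plus a Parquet engine such as pyarrow are installed, and the figures are invented.

```python
import pandas as pd  # assumes pandas and a Parquet engine (e.g. pyarrow) are installed

df = pd.DataFrame({"city": ["Lisbon", "Warsaw"], "population": [545_000, 1_790_000]})

df.to_parquet("cities.parquet")  # stored column by column, compressed on disk

# Because the format is columnar, we can read back just the column we need.
restored = pd.read_parquet("cities.parquet", columns=["population"])
print(restored)
```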
PostgreSQL
PostgreSQL, often called “Postgres,” is an open-source relational database management system (RDBMS) known for its robustness, reliability, and flexibility. It supports many data types, including complex data like arrays and JSON. Its powerful query language allows users to retrieve, manipulate, and analyze data with ease.
Power Query
Power Query is a Microsoft tool used to perform ETL, a three-step process where data is extracted, transformed, and loaded into its final destination. With Power Query, we can import data from various sources and perform data preparation according to our needs.
Predictive Analytics
Predictive Analytics combines historical and current data with advanced statistics and machine learning techniques to forecast possible future outcomes. Many companies use predictive analytics to identify customer behavior patterns, assess risk, reduce downtime, and personalize treatment plans for patients.
Prescriptive Analytics
Prescriptive Analytics focuses on finding possibilities and recommendations for a particular scenario, answering the question, “What should be done?”. Building on descriptive and predictive analytics, prescriptive analytics aims to produce actionable insights guided by algorithmic models rather than data monitoring alone.
Q
Query
A Query in SQL is a request for data within the database environment. When we talk about queries in a database, we refer to either a select query or an action query. We use a select query to retrieve data from a database, while an action query helps us perform other operations on data, including adding, removing, or changing data.
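The sketch below illustrates both kinds of query with Python’s built-in sqlite3 module and an in-memory database; the table and values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")

# Action queries: add and change data.
conn.execute("INSERT INTO customers VALUES (1, 'Alice')")
conn.execute("UPDATE customers SET name = 'Alicia' WHERE id = 1")

# Select query: retrieve data.
for row in conn.execute("SELECT id, name FROM customers"):
    print(row)  # (1, 'Alicia')

conn.close()
```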
Qualitative Analysis
Qualitative Analysis refers to analyzing non-numerical data such as words, images, or observations. This method comes in handy when we strive to gain a deeper understanding of a phenomenon, often by examining the context and meaning of the data. One example of qualitative analysis is a case study of a company’s organizational culture, which involves analyzing the data collected from interviews with employees, internal documents, and observation notes.
Quantitative Analysis
In contrast to qualitative analysis, Quantitative Analysis is a method of measuring and interpreting numerical data to provide objective and precise measurements and data that we can use to draw conclusions or make predictions. For example, a quantitative survey can collect numerical data on consumer preferences for different types of products or services to identify behavioral trends or patterns.
R
Raw Data
When we talk about Raw Data (sometimes called source data, atomic data, or primary data), we refer to data collected from a source or sources that have not yet been processed, organized, or analyzed for use, thus remaining unactionable. Raw data can come in various forms, such as numbers, text, images, video, or audio.
Real-Time Analytics
Real-Time Analytics refers to analyzing and processing data in real-time or near real-time as it is generated or received. So, instead of waiting for a batch of data to build up, real-time analytics processes data as it comes in, allowing for more immediate insights and decision-making.
Reference Data
We generally use Reference Data to classify, categorize, or provide context for other data with the aim of standardizing and harmonizing data across different systems and applications. Different product codes, for instance, can be reference data used by a manufacturing company to introduce a standardized product classification system to simplify comparing sales data across regions. This way, they ensure that data is consistent and accurate across the organization and, thus, improve decision-making and operational efficiency. Reference Data is often mistaken for master data.
Relational Database (RDBMS)
A Relational Database (RDBMS) is a type of database that stores data in tables with columns and rows, similar to a spreadsheet. In an RDBMS, data is stored in tables related to each other based on common attributes or keys. For example, a customer table might be related to an orders table through a customer ID key. This allows data to be organized and accessed in a logical and efficient way.
S
Schema
A database Schema refers to a blueprint describing how the database is organized and defining the relationships between various data elements, often visually presented in the form of diagrams.
Scrum
Scrum is a lightweight, Agile framework that helps teams deal with complex projects more effectively. It follows an iterative and incremental approach, where work is divided into small units called “user stories.” Scrum Teams select a set of user stories for each sprint (a short, repeatable, time-boxed phase, typically one to four weeks) and commit to completing them within that period. The Scrum Team usually contains the following roles: the Development Team, the Product Owner (a representative of the client’s goals), and the Scrum Master (a “coach” facilitating agility).
Second-Party Data
Simply put, Second-Party Data is someone else’s first-party data. It is gathered by a trusted partner or other business entity and typically comes from various sources, including websites, apps, and social media.
Self-Service Tools
Self-Service tools are applications or platforms that empower people to access information and perform tasks independently without needing external support or assistance. By equipping users with the power and autonomy to select, filter, compare, visualize, and analyze data, companies can drastically reduce the need for costly and extensive IT training, freeing up resources and improving efficiency.
Semi-structured Data
Semi-Structured Data is data that is not stored in a tabular format but contains tags and other elements that make it possible to group and label it. Examples of semi-structured data sources include zip files, emails, XML, and web pages.
Spark
Apache Spark is an open-source distributed computing system we use to process data in parallel across a cluster of computers in various formats, including batch, streaming, and interactive modes. This way, we can handle large datasets quickly and efficiently.
Structured Data
When we talk about Structured Data, we refer to quantitative, predefined, and formatted data, typically stored in tabular form in relational databases (RDBMSs). Structured Data usually comes in the form of numbers and values and is thus easier to search, analyze, and comprehend.
Structured Query Language (SQL)
Structured Query Language (SQL) is a standardized programming language designed for managing relational databases and carrying out operations on the data, such as storing, searching, removing, and retrieving data, as well as maintaining and optimizing database performance.
Supervised Learning
In contrast to Unsupervised Learning, Supervised Learning is a way of training an AI algorithm to use labeled examples or input-output pairs. Simply put, in Supervised Learning, the algorithm is explicitly taught what answers are “right” and uses this information to make predictions on new data.
T
Third-Party Data
Unlike first- and second-party data, Third-Party Data comes from an outside entity that often sells previously collected customer data to companies for advertising purposes. In this case, organizations possess customer data without having direct relationships with their customers.
U
Unit Testing
In the software development process, Unit Testing allows developers to test individual units or components of code independently from other parts of the system. This helps ensure that each piece works as intended and that the software is fit for use.
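As an illustration, the sketch below defines a small function and two unit tests for it in the style of pytest, a common Python testing framework assumed to be installed; the function itself is invented for the example.

```python
import pytest  # assumes pytest is installed; tests are discovered and run with the `pytest` command

def apply_discount(price: float, percent: float) -> float:
    """Return the price after applying a percentage discount."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)

def test_apply_discount():
    assert apply_discount(100.0, 20) == 80.0

def test_apply_discount_rejects_bad_percent():
    with pytest.raises(ValueError):
        apply_discount(100.0, 150)
```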
Unstructured Data
When we talk about Unstructured Data, we refer to data that does not follow a predefined model or format, making it harder to store and analyze with traditional, table-based tools. Unstructured data can come in various forms, such as free text, images, video, or audio.
Unsupervised Learning
Unlike Supervised Learning, in Unsupervised Learning, the algorithm is not given any labeled examples and must instead find patterns and relationships independently without being explicitly trained or guided by classified or labeled examples.
User Acceptance Testing (UAT)
UAT stands for User Acceptance Testing, the final round of testing a software product goes through before its release. The main idea of UAT is to test a solution with the help of real users to see if it works as expected and meets their needs. The users receive a set of test cases to follow. As they perform these tasks, they report any issues they encounter to the development team, which then implements the necessary changes in the system.
V
VPN
A VPN, or Virtual Private Network, creates a secure tunnel between a user’s device and the internet by assigning the user a new anonymous IP address, rerouting the internet connection through a server in its network, and encrypting all data. A VPN masks the user’s identity and online traffic from internet service providers, hackers, and third parties. Overall, VPNs are a helpful tool for anyone who wants to enhance their online privacy and security or who needs to access private networks from remote locations.
W
Web Scraping
We use Web Scraping tools to extract data from a website’s source code. Imagine you want to gather specific data from many web pages, such as product information from an online store. Instead of copying the data by hand from each webpage, web scraping can automate this process (see the sketch after this list) by:
- Scrolling through pages,
- Clicking on links,
- Locating specific information,
- Filling out forms,
- Extracting the desired information,
- Saving data in a spreadsheet or a database.
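The sketch below shows what such automation might look like in Python with the requests and BeautifulSoup libraries; the URL and the HTML structure (div.product, span.price) are purely hypothetical.

```python
import requests                # HTTP client: pip install requests
from bs4 import BeautifulSoup  # HTML parser: pip install beautifulsoup4

# Hypothetical product listing page, used only for illustration.
html = requests.get("https://shop.example.com/products", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Locate specific information, assuming each product sits in <div class="product">.
for product in soup.select("div.product"):
    name = product.select_one("h2").get_text(strip=True)
    price = product.select_one("span.price").get_text(strip=True)
    print(name, price)  # could also be saved to a CSV file or a database
```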
X
XML
XML, or Extensible Markup Language, is a markup language that uses tags and other elements to define the structure and meaning of the data it contains. For example, an XML file that stores information about books might use tags such as <title>, <author>, and <publisher> to define different pieces of data associated with each book.
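To make the books example concrete, here is a small sketch that parses such an XML snippet with Python’s built-in xml.etree.ElementTree module; the book data is invented.

```python
import xml.etree.ElementTree as ET

xml_text = """
<library>
  <book>
    <title>Dune</title>
    <author>Frank Herbert</author>
    <publisher>Chilton Books</publisher>
  </book>
</library>
"""

root = ET.fromstring(xml_text.strip())
for book in root.findall("book"):
    print(book.findtext("title"), "by", book.findtext("author"))
```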
XML Database
An XML (Extensible Markup Language) Database is a database management system designed to store and manage data in XML format. We use XML databases to store and manage complex and hierarchical data structures, particularly in applications where data needs to be easily queried, modified, and updated.