Data Quality with Great Expectations in Python and Airflow

A guide to using Great Expectations, a powerful Python library, to enhance data quality in your pipelines.

Maik Paixão
5 min read · Dec 8, 2023

Data quality is the cornerstone of reliable analytics and informed decision-making. It refers to the accuracy, consistency, completeness, reliability, and relevance of data.

High-quality data is crucial for generating valid insights, while poor data quality can lead to misleading analysis and erroneous business decisions. This article delves into the nuances of data quality, highlighting common issues like inaccuracies, duplications, and missing values.

Understanding these challenges is the first step in ensuring your data is trustworthy and fit for purpose. We’ll also discuss the broader impacts of data quality on business outcomes, emphasizing why it’s an essential aspect of any data strategy.

This tutorial provides an in-depth exploration of Great Expectations, guiding you through its installation, basic concepts, and advanced functionalities. By the end, you’ll be equipped to implement robust data quality checks seamlessly within your Python projects, enhancing the integrity and reliability of your data-driven insights.

What is Great Expectations?

Great Expectations is an innovative open-source library in Python, designed to enhance data quality and testing. This tool provides a robust framework for validating, documenting, and profiling your data, which is critical for maintaining high data quality standards.

I’ll guide you through the installation process, making it easy to incorporate into your Python environment. By understanding Great Expectations’ fundamental concepts and capabilities, you’ll be well-equipped to start implementing effective data quality checks in your data projects.

Setting Up your Project

Creating your first project with Great Expectations begins with setting up the necessary environment.

Start by installing the package using pip:

 pip install great_expectations 

Once installed, create a new project:

great_expectations init

This command sets up the structure of your project, including directories for Expectations, Checkpoints, and Data Docs.

Expectation Suites: Define specific tests for your data.

Data Context: Manages the configuration of your project.

Data Docs: Visual representations of your Expectations and validation results.

By following these steps, you’ll have a foundational setup ready for creating and managing your data quality checks.
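
From Python, you would then load the project's Data Context. The snippet below is a minimal sketch assuming a recent Great Expectations release where gx.get_context() is available; the later examples in this article use a context object obtained this way.

import great_expectations as gx

# Load the Data Context created by `great_expectations init`.
# It tracks where your Expectation Suites, Checkpoints, and Data Docs live.
context = gx.get_context()

# Listing existing suites is a quick way to confirm the project is wired up.
print(context.list_expectation_suite_names())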

Creating Expectations

Expectations are the core of Great Expectations, serving as assertions about your data’s quality.

In this section, we’ll explore how to create and define these expectations.

First, select a data source to work with. For example, if you’re using a CSV file:

from great_expectations.dataset import PandasDataset
import pandas as pd

# Load the data with pandas, then wrap it so expectations can be applied directly.
data = pd.read_csv('your_data_file.csv')
dataset = PandasDataset(data)

Then, apply expectations to your dataset. For instance, to ensure a column contains unique values:

dataset.expect_column_values_to_be_unique(column='your_column_name')

Or to verify the number of rows in a table falls within a specific range:

dataset.expect_table_row_count_to_be_between(min_value=100, max_value=2000)

Great Expectations offers a wide array of built-in expectations, and you can also create custom ones tailored to your specific data needs.
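
With the PandasDataset wrapper used above, each expectation call returns a result you can inspect immediately, and the attached expectations can be re-run in a single pass. This is a rough sketch; your_column_name is a placeholder.

# Each expectation call returns a validation result with a success flag.
result = dataset.expect_column_values_to_not_be_null(column='your_column_name')
print(result.success)

# Re-run every expectation attached to this dataset and get an overall verdict.
validation = dataset.validate()
print(validation.success)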

Validating Data with Checkpoints

Checkpoints in Great Expectations are a way to automate the validation of your data against the defined Expectations.

Start by configuring a checkpoint:

import great_expectations as gx
from great_expectations.data_context.types.base import CheckpointConfig

# Load the project's Data Context before registering the checkpoint.
context = gx.get_context()

checkpoint_config = CheckpointConfig(
    name="sample_checkpoint",
    config_version=1,
    class_name="SimpleCheckpoint",
    validations=[
        {
            "batch_request": {
                "datasource_name": "your_datasource",
                "data_connector_name": "default_inferred_data_connector_name",
                "data_asset_name": "your_data_asset",
            },
            "expectation_suite_name": "your_expectation_suite",
        }
    ],
)
context.add_checkpoint(**checkpoint_config.to_dict())

Now, run the checkpoint to validate your data:

results = context.run_checkpoint(checkpoint_name="sample_checkpoint")

This code initiates the validation process using the specified checkpoint, which applies your defined Expectations to the data batch.

Successful validation will confirm that your data meets the quality standards set in your Expectation Suite, while any deviations will be reported, allowing for quick identification and rectification of data issues.
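
In practice, you would typically inspect the returned result and refresh Data Docs so the outcome is easy to review. A brief sketch, assuming the context and checkpoint from the previous snippet:

# Fail fast if the checkpoint did not pass, so bad data never flows downstream.
if not results.success:
    raise ValueError("Data quality validation failed for sample_checkpoint")

# Rebuild Data Docs so the latest validation results are browsable as HTML.
context.build_data_docs()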

Integrating with Data Pipelines (Airflow)

Integrating Great Expectations with your data pipelines enhances the automation of data quality checks.

For example, integrating with an ETL pipeline in Python might look like this:

import great_expectations as gx

# Load the project's Data Context so checkpoints can be run after the ETL step.
context = gx.get_context()

# Assuming you have an ETL function
def etl_process():
    # Your ETL code here
    pass

# After completing ETL, run a Great Expectations checkpoint
def run_data_quality_checks():
    context.run_checkpoint(checkpoint_name="your_checkpoint_name")

# Main ETL process
def main():
    etl_process()
    run_data_quality_checks()

if __name__ == "__main__":
    main()

In this snippet, after the ETL process completes, a Great Expectations checkpoint is triggered to validate the data. This ensures data quality checks are seamlessly integrated into your regular data processing activities.

Similarly, for more complex workflows, you can integrate Great Expectations with orchestration tools like Apache Airflow, using Airflow operators to trigger data quality checks as part of your pipeline:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
import great_expectations as gx

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2023, 1, 1),
    # Other default args
}

dag = DAG('etl_with_data_quality',
          default_args=default_args,
          schedule_interval='@daily')

def etl_task():
    # ETL logic here
    pass

def data_quality_task():
    # Load the Data Context inside the task so it runs on the Airflow worker.
    context = gx.get_context()
    context.run_checkpoint(checkpoint_name="your_checkpoint_name")

etl = PythonOperator(
    task_id='etl',
    python_callable=etl_task,
    dag=dag)

data_quality = PythonOperator(
    task_id='data_quality_check',
    python_callable=data_quality_task,
    dag=dag)

etl >> data_quality

This integration allows automated execution of data quality checks within your existing data workflows, ensuring continuous monitoring and validation of data quality standards.
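
As an alternative to a plain PythonOperator, the community-maintained Airflow provider for Great Expectations ships a dedicated operator. The sketch below assumes the airflow-provider-great-expectations package is installed and a checkpoint already exists; parameter names can vary between provider versions, so treat this as illustrative.

from great_expectations_provider.operators.great_expectations import (
    GreatExpectationsOperator,
)

# Runs the named checkpoint and fails the task if validation does not pass.
data_quality = GreatExpectationsOperator(
    task_id='data_quality_check',
    data_context_root_dir='/path/to/great_expectations',
    checkpoint_name='your_checkpoint_name',
    fail_task_on_validation_failure=True,
    dag=dag,
)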

Advanced Features

Great Expectations not only provides essential data validation tools but also offers advanced features for sophisticated data quality management. One such feature is data profiling, which automatically generates Expectations based on the characteristics of your dataset.

This is particularly useful for getting a quick start on creating Expectations or understanding new data sources.

import great_expectations as gx
from great_expectations.profile.basic_suite_builder_profiler import BasicSuiteBuilderProfiler

context = gx.get_context()

# The profiler returns a generated suite along with the validation results it produced.
suite, validation_results = BasicSuiteBuilderProfiler().profile(dataset)
context.save_expectation_suite(suite, "your_new_expectation_suite_name")

This code snippet illustrates how to use data profiling to create an Expectation Suite. Additionally, managing Expectation Suites becomes crucial as your project grows. Organizing these suites, maintaining version control, and documenting changes are best practices that ensure long-term manageability.

For scaling data quality checks, consider optimizing performance for large datasets. This might involve selectively applying Expectations or using batch processing to manage resource utilization effectively.
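
One simple way to sketch batch processing with the PandasDataset wrapper used earlier is to validate a large CSV chunk by chunk rather than loading it all at once; the file and column names here are placeholders.

import pandas as pd
from great_expectations.dataset import PandasDataset

# Validate a large CSV in chunks to keep memory usage bounded.
for chunk in pd.read_csv('your_data_file.csv', chunksize=100_000):
    batch = PandasDataset(chunk)
    result = batch.expect_column_values_to_not_be_null(column='your_column_name')
    if not result.success:
        print("Null values found in this chunk")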

Embracing these advanced features and best practices enhances the robustness of your data quality framework, ensuring it remains efficient and scalable as your data environment evolves.

That’s Great Expectations

Remember, the journey to excellent data quality is continuous. As data evolves, so should our strategies to maintain its integrity. Great Expectations offers the flexibility and depth required to meet these evolving challenges. By leveraging its capabilities, we can ensure our data remains a reliable foundation for insights and decisions, ultimately driving success in our data-centric initiatives.

For further exploration, the Great Expectations documentation is an invaluable resource. Additionally, the community forums and GitHub repository are great places for support and to stay updated with the latest developments. Keep experimenting, learning, and elevating the quality of your data with Great Expectations.

LinkedIn: https://www.linkedin.com/in/maikpaixao/
Twitter: https://twitter.com/maikpaixao
Facebook: https://www.facebook.com/maikpaixao
Youtube: https://www.youtube.com/@maikpaixao
Instagram: https://www.instagram.com/datamaikpaixao/
Github: https://github.com/maikpaixao
