Version: 0.16.8

Quickstart with Great Expectations

Introduction

Few things are as daunting as taking your first steps with a new piece of software. This guide will introduce you to GX and demonstrate how easily you can implement the basic GX workflow. We will walk you through the entire process: installing GX, connecting to some sample data, building your first Expectation based on an initial Batch of that data, validating your data with that Expectation, and finally reviewing the results of your validation.

Once you have completed this guide you will have a foundation in the basics of using GX. In the future you will be able to adapt GX to suit your specific needs by customizing the execution of the individual steps you will learn here.

Prerequisites

This guide assumes you have:
  • A supported version of Python (versions 3.7 to 3.10)
  • The ability to install Python packages with pip
  • A working internet browser
  • A passion for data quality
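
The Python version requirement above can be checked from Python itself. The following is a minimal sketch using only the standard library; the helper name is_supported_python is hypothetical (it is not part of GX), and the bounds simply restate the 3.7 to 3.10 range listed in this guide.

```python
import sys

# Supported range as stated in the prerequisites of this guide.
MIN_SUPPORTED = (3, 7)
MAX_SUPPORTED = (3, 10)


def is_supported_python(version_info=None):
    """Return True if the (major, minor) version is within the supported range.

    Hypothetical helper for illustration only; it is not part of the GX API.
    """
    if version_info is None:
        version_info = sys.version_info
    major_minor = tuple(version_info[:2])
    return MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED


if __name__ == "__main__":
    status = "supported" if is_supported_python() else "outside the supported range"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is {status}")
```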

Overview

With GX you can get up and running with just a few lines of code. The full process you'll be using will look like:

Terminal input
pip install great_expectations

import great_expectations as gx

# Set up
context = gx.get_context()

# Connect to data
validator = context.sources.pandas_default.read_csv(
    "https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv"
)

# Create Expectations
validator.expect_column_values_to_not_be_null("pickup_datetime")
validator.expect_column_values_to_be_between("passenger_count", auto=True)

# Validate data
checkpoint = gx.checkpoint.SimpleCheckpoint(
    name="my_quickstart_checkpoint",
    data_context=context,
    validator=validator,
)

checkpoint_result = checkpoint.run()

# View results
validation_result_identifier = checkpoint_result.list_validation_result_identifiers()[0]
context.open_data_docs(resource_identifier=validation_result_identifier)

In the following steps we'll break down exactly what is happening here so that you can follow along and perform a Validation yourself.

Steps

1. Install GX and set up your code environment

1.1 Install GX using pip

Starting from an empty base directory inside a Python virtual environment, we use pip to install Great Expectations:

Terminal input
pip install great_expectations

When you run this command from the terminal you will see pip go through the process of installing GX and its related dependencies. This may take a moment to complete.
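
If you want to confirm that the install landed in the environment you are actually running, one option is to look the module up from Python before importing it. This is a minimal standard-library sketch; is_importable is a hypothetical helper name, not something GX provides.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module can be found in the current environment.

    Hypothetical helper for illustration; it only locates the module and
    does not import it.
    """
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_importable("great_expectations"):
        print("great_expectations is available in this environment")
    else:
        print("great_expectations was not found; re-run the pip command above")
```

A "not found" result usually means pip installed into a different interpreter than the one running the check, which is a common issue when a virtual environment is not activated.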

1.2 Import Great Expectations

For the rest of this tutorial we will be working with Python code in a Jupyter Notebook. Jupyter is included with GX and provides a very convenient interface that lets us easily edit code and immediately see the result of our changes.

The code to import the great_expectations module is:

import great_expectations as gx

1.3 Instantiate a Data Context

We will get a DataContext object with the following code:

context = gx.get_context()

The Data Context will provide you with access to a variety of utility and convenience methods. It is the entry point for using the GX Python API.

2. Connect to data

For the purpose of this guide, we will connect to .csv data stored in our GitHub repo:

validator = context.sources.pandas_default.read_csv(
    "https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv"
)

The above code uses our Data Context's default Datasource for Pandas to access the .csv data in the file at the provided path.

3. Create Expectations

When we read our .csv data, we got a Validator instance back. A Validator is a robust object capable of storing Expectations about the data it is associated with, as well as performing introspections on that data.

In this guide, we will define two Expectations. The first is based on our domain knowledge: we know that pickup_datetime should never be null. The second uses GX to detect the range of values in the passenger_count column (by passing auto=True).

The code we will use for this is:

validator.expect_column_values_to_not_be_null("pickup_datetime")
validator.expect_column_values_to_be_between("passenger_count", auto=True)

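With the first Expectation above, we are stating that we expect the column pickup_datetime to always be populated. That is: none of the column's values should be null. With the second, we ask GX to determine the expected range for passenger_count from the data itself rather than supplying the bounds by hand.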

In the future, you may define numerous Expectations about a Validator's associated data by calling multiple methods that follow the validator.expect_* syntax.
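
To make concrete what the two Expectations above assert, here is a plain-Python sketch of the same ideas: a null check, and a range check whose bounds are taken from the observed data. It is a conceptual illustration only and is not how GX implements these Expectations or auto=True. The function names and the sample rows are made up for this example; the rows are not taken from the tutorial's CSV file.

```python
def count_nulls(rows, column):
    """Count rows where the column is missing or None."""
    return sum(1 for row in rows if row.get(column) is None)


def observed_range(rows, column):
    """Return (min, max) over the non-null values of a column."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return min(values), max(values)


def all_between(rows, column, low, high):
    """Return True if every non-null value lies in the closed range [low, high]."""
    return all(
        low <= row[column] <= high
        for row in rows
        if row.get(column) is not None
    )


# Made-up sample rows for illustration only.
sample = [
    {"pickup_datetime": "2019-01-15 03:36:12", "passenger_count": 1},
    {"pickup_datetime": "2019-01-25 18:20:32", "passenger_count": 2},
    {"pickup_datetime": "2019-01-05 06:47:31", "passenger_count": 6},
]

# "Not null" check: zero nulls means the check passes.
null_count = count_nulls(sample, "pickup_datetime")

# "Between" check with bounds derived from the data rather than supplied by hand.
low, high = observed_range(sample, "passenger_count")
in_range = all_between(sample, "passenger_count", low, high)

print(null_count, (low, high), in_range)
```

Bounds derived from a first Batch describe that Batch only. Later Batches are then checked against those bounds, which is where a range Expectation starts to be informative.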

4. Validate data

4.1 Execute your defined Expectations

Now that we have defined our Expectations it is time for GX to introspect our data and see if it corresponds to what we told GX to expect. To do this, we define a Checkpoint (which will allow us to repeat the Validation in the future).

checkpoint = gx.checkpoint.SimpleCheckpoint(
    name="my_quickstart_checkpoint",
    data_context=context,
    validator=validator,
)

Once we have created the Checkpoint, we will run it and get back the results from our Validation.

checkpoint_result = checkpoint.run()
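
Before opening any reports, you may want to know whether the run passed overall, for example to fail a scheduled job. The sketch below assumes the object returned by checkpoint.run() supports key access on "success"; that is an assumption of this example, so verify it against your installed GX version. The helper names are hypothetical.

```python
def validation_passed(checkpoint_result):
    """Return True if the result reports overall success.

    Assumption: the result supports key access on "success". Adjust this
    if your GX version exposes the flag differently.
    """
    return bool(checkpoint_result["success"])


def exit_code_for(checkpoint_result):
    """Map a result to a process exit code: 0 on success, 1 on failure."""
    return 0 if validation_passed(checkpoint_result) else 1
```

In a script, the exit code could be passed to sys.exit so that a failed Validation stops a pipeline step.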

4.2 Review your results

Great Expectations provides a friendly, human-readable way to view the results of Validations: Data Docs. Our Checkpoint will have automatically compiled new Data Docs to include the results of the Validation we ran, so we can view them immediately:

validation_result_identifier = checkpoint_result.list_validation_result_identifiers()[0]
context.open_data_docs(resource_identifier=validation_result_identifier)

Next Steps

Now that you've seen how easy it is to implement the GX workflow, it is time to customize that workflow to suit your specific use cases! To help with this we have prepared some more detailed guides on setting up and installing GX and getting an initial Data Context that are tailored to specific environments and resources.

Great Expectations Cloud

This guide has introduced you to the open source Python and command line use of Great Expectations. GX also offers an online interface, currently in Beta. The GX Cloud interface significantly simplifies collaboration between data teams and domain experts.

If you are interested in GX Cloud, you should join the GX Cloud Beta. During this program limited seats are available, but signing up will keep you informed of the product's progress.

Sign up for the GX Cloud Beta!

Installing GX for specific environments and source data systems

Setup and installation of GX for local filesystems

For more details on installing GX for use with local filesystems, please see:

Setup and installation of GX for cloud storage systems

Setup and installation of GX for SQL databases

For information on installing GX for use with SQL databases, see:

Setup and installation of GX for hosted data systems

For instructions on installing GX for use with hosted data systems, read:

Initializing, instantiating, and saving a Data Context

Getting a Data Context

Saving a Data Context

Filesystem and Cloud Data Contexts automatically save any changes as they are made. The only type of Data Context that does not immediately save changes in a persisting way is the Ephemeral Data Context, which is an in-memory Data Context that will not persist beyond the current Python session. However, an Ephemeral Data Context can be converted to a Filesystem Data Context if you wish to save its contents for future use.

For more information, please see: