Quickstart with Great Expectations
Introduction
Few things are as daunting as taking your first steps with a new piece of software. This guide will introduce you to GX and demonstrate the ease with which you can implement the basic GX workflow. We will walk you through the entire process of installing GX, connecting to some sample data, building your first Expectation based on an initial Batch of that data, validating your data with that Expectation, and finally reviewing the results of your validation.
Once you have completed this guide you will have a foundation in the basics of using GX. In the future you will be able to adapt GX to suit your specific needs by customizing the execution of the individual steps you will learn here.
Prerequisites
- A supported version of Python (versions 3.7 to 3.10)
- For details on how to download and install Python on your platform, please see Python's documentation and download sites
- The ability to install Python packages with pip
- A working internet browser
- A passion for data quality
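If you are unsure whether your interpreter falls inside the supported range listed above, a short check like the following can confirm it. This is a minimal sketch: the function name `is_supported_python` and the two constants are illustrative and are not part of GX; the version bounds are taken from the prerequisites list.

```python
import sys

# Bounds copied from the prerequisites list above (Python 3.7 to 3.10).
MIN_SUPPORTED = (3, 7)
MAX_SUPPORTED = (3, 10)


def is_supported_python(version=None):
    """Return True if the (major, minor) version is within the supported range.

    `version` is any tuple-like whose first two items are major and minor;
    it defaults to the running interpreter's sys.version_info.
    """
    if version is None:
        version = sys.version_info
    major_minor = (version[0], version[1])
    return MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED
```

Calling `is_supported_python()` with no arguments checks the interpreter that is currently running.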
Overview
With GX you can get up and running with just a few lines of code. The full process you'll be using will look like:
pip install great_expectations
import great_expectations as gx
# Set up
context = gx.get_context()
# Connect to data
validator = context.sources.pandas_default.read_csv(
"https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv"
)
# Create Expectations
validator.expect_column_values_to_not_be_null("pickup_datetime")
validator.expect_column_values_to_be_between("passenger_count", auto=True)
# Validate data
checkpoint = gx.checkpoint.SimpleCheckpoint(
name="my_quickstart_checkpoint",
data_context=context,
validator=validator,
)
checkpoint_result = checkpoint.run()
# View results
validation_result_identifier = checkpoint_result.list_validation_result_identifiers()[0]
context.open_data_docs(resource_identifier=validation_result_identifier)
In the following steps we'll break down exactly what is happening here so that you can follow along and perform a Validation yourself.
Steps
1. Install GX and set up your code environment
1.1 Install GX using pip
Starting from an empty base directory inside a Python virtual environment, we use pip to install Great Expectations:
pip install great_expectations
When you run this command from the terminal you will see pip go through the process of installing GX and its related dependencies. This may take a moment to complete.
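Once pip has finished, you may want to confirm that you are working inside a virtual environment and that the package can be found. The sketch below uses only the Python standard library. The helper names `in_virtual_environment` and `is_installed` are hypothetical, and the environment check only recognizes venv-style environments (it will not detect, for example, a conda environment).

```python
import importlib.util
import sys


def in_virtual_environment():
    """Return True when running inside a venv-style virtual environment.

    Inside such an environment sys.prefix points at the environment,
    while sys.base_prefix points at the base Python installation.
    """
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def is_installed(module_name):
    """Return True when the named top-level module can be located for import."""
    return importlib.util.find_spec(module_name) is not None
```

After a successful install, `is_installed("great_expectations")` should return `True` in the same environment where you ran pip.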
1.2 Import Great Expectations
For the rest of this tutorial we will be working with Python code in a Jupyter Notebook. Jupyter is included with GX and provides a very convenient interface that lets us easily edit code and immediately see the result of our changes.
The code to import the great_expectations module is:
import great_expectations as gx
1.3 Instantiate a Data Context
We will get a DataContext object with the following code:
context = gx.get_context()
The Data Context will provide you with access to a variety of utility and convenience methods. It is the entry point for using the GX Python API.
2. Connect to data
For the purpose of this guide, we will connect to .csv data stored in our GitHub repo:
validator = context.sources.pandas_default.read_csv(
"https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv"
)
The above code uses our Data Context's default Datasource for Pandas to access the .csv data in the file at the provided path.
3. Create Expectations
When we read our .csv data, we got a Validator instance back. A Validator is a robust object capable of storing Expectations about the data it is associated with, as well as performing introspections on that data.
In this guide, we will define two Expectations: one based on our domain knowledge (knowing that pickup_datetime should not be null), and one by using GX to detect the range of values in the passenger_count column (using auto=True).
The code we will use for this is:
validator.expect_column_values_to_not_be_null("pickup_datetime")
validator.expect_column_values_to_be_between("passenger_count", auto=True)
With the first Expectation defined above, we are stating that we expect the column pickup_datetime to always be populated. That is: none of the column's values should be null.
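To make the meaning of "none of the column's values should be null" concrete, here is a toy version of such a check written in plain Python. It is not GX's implementation and does not use the GX API: the function `check_not_null`, the shape of its result, and the sample rows are all invented for illustration.

```python
def check_not_null(rows, column):
    """Toy stand-in for a not-null check over a list of row dictionaries."""
    values = [row.get(column) for row in rows]
    # Count the entries that are missing (None) in the chosen column.
    unexpected = sum(1 for value in values if value is None)
    return {
        "success": unexpected == 0,
        "element_count": len(values),
        "unexpected_count": unexpected,
    }


# Hypothetical sample rows; the second row has no pickup time.
sample_rows = [
    {"pickup_datetime": "2019-01-15 03:36:12", "passenger_count": 1},
    {"pickup_datetime": None, "passenger_count": 2},
]
result = check_not_null(sample_rows, "pickup_datetime")
```

Here the check fails because one of the two rows has no value in pickup_datetime, which is the kind of problem the Expectation is meant to surface.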
In the future, you may define numerous Expectations about a Validator's associated data by calling multiple methods that follow the validator.expect_* syntax.
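The second Expectation in this guide relies on GX detecting a range from the data rather than on values you supply. How GX estimates that range is not described in this guide, so the sketch below simply assumes the observed minimum and maximum of an initial batch and then applies them to later data. The functions `observed_range` and `check_between` and the sample values are hypothetical and are not part of GX.

```python
def observed_range(values):
    """Return (min, max) of the non-null values in a batch."""
    present = [v for v in values if v is not None]
    return min(present), max(present)


def check_between(values, min_value, max_value):
    """Toy between-check: every non-null value lies in [min_value, max_value]."""
    unexpected = [
        v for v in values
        if v is not None and not (min_value <= v <= max_value)
    ]
    return {"success": not unexpected, "unexpected_list": unexpected}


# Hypothetical passenger counts from an initial batch of data.
initial_batch = [1, 2, 1, 6, 3]
low, high = observed_range(initial_batch)

# A later batch is then checked against the range learned from the first one.
later_batch = [1, 4, 9]
result = check_between(later_batch, low, high)
```

The design point is that the bounds come from data you already trust, so a later value outside them (9 passengers in this made-up batch) is flagged without you having to choose the limits by hand.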
4. Validate data
4.1 Execute your defined Expectations
Now that we have defined our Expectations it is time for GX to introspect our data and see if it corresponds to what we told GX to expect. To do this, we define a Checkpoint (which will allow us to repeat the Validation in the future).
checkpoint = gx.checkpoint.SimpleCheckpoint(
name="my_quickstart_checkpoint",
data_context=context,
validator=validator,
)
Once we have created the Checkpoint, we will run it and get back the results from our Validation.
checkpoint_result = checkpoint.run()
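The value of a Checkpoint is that the same set of checks can be run again later against new data. The following toy class illustrates that idea only; it is a hypothetical stand-in written with the standard library, not GX's Checkpoint, and the check labels, bounds, and sample rows are invented.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyCheckpoint:
    """Hypothetical stand-in: a named, reusable bundle of checks."""

    name: str
    checks: Dict[str, Callable[[List[dict]], bool]] = field(default_factory=dict)

    def run(self, rows):
        # Evaluate every stored check against the rows passed to this run.
        results = {label: check(rows) for label, check in self.checks.items()}
        return {
            "checkpoint": self.name,
            "success": all(results.values()),
            "results": results,
        }


toy_checkpoint = ToyCheckpoint(
    name="my_quickstart_checkpoint",
    checks={
        "pickup_datetime not null": lambda rows: all(
            r["pickup_datetime"] is not None for r in rows
        ),
        "passenger_count between 1 and 6": lambda rows: all(
            1 <= r["passenger_count"] <= 6 for r in rows
        ),
    },
)

# The same checkpoint is run twice, against two different (made-up) batches.
first_run = toy_checkpoint.run(
    [{"pickup_datetime": "2019-01-01 00:46:40", "passenger_count": 1}]
)
second_run = toy_checkpoint.run(
    [{"pickup_datetime": None, "passenger_count": 2}]
)
```

Because the checks live on the object rather than in the calling code, the second run needs nothing more than a new batch of rows.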
4.2 Review your results
Great Expectations provides a friendly, human-readable way to view the results of Validations: Data Docs. Our Checkpoint will have automatically compiled new Data Docs to include the results of the Validation we ran, so we can view them immediately:
validation_result_identifier = checkpoint_result.list_validation_result_identifiers()[0]
context.open_data_docs(resource_identifier=validation_result_identifier)
Next Steps
Now that you've seen how easy it is to implement the GX workflow, it is time to customize that workflow to suit your specific use cases! To help with this we have prepared some more detailed guides on setting up and installing GX and getting an initial Data Context that are tailored to specific environments and resources.
This guide has introduced you to the open source Python and command line use of Great Expectations. GX also offers an online interface, currently in Beta. The GX Cloud interface significantly simplifies collaboration between data teams and domain experts.
If you are interested in GX Cloud, you should join the GX Cloud Beta. Seats in this program are limited, but signing up will keep you informed of the product's progress.
Installing GX for specific environments and source data systems
Setup and installation of GX for local filesystems
For more details on installing GX for use with local filesystems, please see:
Setup and installation of GX for cloud storage systems
For guides on installing GX for use with cloud storage systems, please reference:
Setup and installation of GX for SQL databases
For information on installing GX for use with SQL databases, see:
Setup and installation of GX for hosted data systems
For instructions on installing GX for use with hosted data systems, read:
Initializing, instantiating, and saving a Data Context
Getting a Data Context
Quickstart Data Context
Filesystem Data Contexts
- How to initialize a new Data Context with the CLI
- How to initialize a filesystem Data Context in Python
- How to instantiate a specific Filesystem Data Context
In-memory Data Contexts
Saving a Data Context
Filesystem and Cloud Data Contexts automatically save any changes as they are made. The only type of Data Context that does not immediately save changes in a persisting way is the Ephemeral Data Context, which is an in-memory Data Context that will not persist beyond the current Python session. However, an Ephemeral Data Context can be converted to a Filesystem Data Context if you wish to save its contents for future use.
For more information, please see: