
Automated Identification and Graphing of SQL Dependencies

All of your queries are running and your SQL-defined charts are returning results, but when maintenance time rolls around, you don’t have a clear picture of the cascade of views, tables, and CSV files your analysis depends on. At least it’s lucky that we’re good analysts who don’t make mistakes like circular references … right? How many of you just ran straight for the Git repo?
You could sit down and start coding directed acyclic graphs (DAGs) in something like Airflow, but that’s a laborious task that involves reading through tons of SQL files and recording dependencies. There must be a better way!
I searched far and wide and found nothing, so I decided to dust off my Python skills from a previous life. This blog post will walk you through what I wrote, and leave you with a (mostly) working mapper, so you too can identify, clean up, and maintain your SQL ETL. If you want a copy of the script to run on your own set of SQL, email us at [email protected] and we’ll send it right over.
First things first, you’re going to need Python (I chose 2.7, because that’s what I’m used to), an IDE (Spyder was readily available), and a handful of packages. One of the folks on my team pointed me to Anaconda Navigator, which effortlessly set up all of the above.
Before we dive in, here’s an overview of what the code will do:
Import relevant packages

Open the CSV file that contains the relevant SQL

Iterate over the CSV, extract SQL and metadata

Clean SQL

Extract parents from SQL

Define relevant node-set

Graph and color the defined node-set

Import Relevant Packages
We’re going to need to import a couple of packages to make this work:
os—used to access the current working directory

argparse—parses command line arguments

re—regex library for parsing out table names from SQL

csv—a library for navigating CSV files

networkx—a Python graph library used for both creating and visualizing the directed acyclic graph

Some of these may require additional installation; I had to use Homebrew to get everything set up.
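Put together, the top of the script might look like this (a sketch; the `nx` alias for networkx is the conventional one):

```python
import os        # access the current working directory
import argparse  # parse command line arguments
import csv       # navigate CSV files
import re        # parse table names out of SQL

import networkx as nx  # build and visualize the directed acyclic graph
```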
Open the CSV File That Contains the Relevant SQL
To begin, you need to get the DDL for all of your SQL-defined views, charts, and CSVs into a single table. We want the following four columns:
Query ID (can just be auto-increment)

Query Name

Query SQL

Object Type (CSV, Chart, View)

We use Periscope Data for our reporting needs, so it’s straightforward to access the repository of SQL that I’ve created for my data transformations. With Periscope Data’s new Usage Data feature, you can easily pull this information:
select
    sql_views.id thing_id
    , sql_views.name
    , sql_views.sql
    , 'view' as tpe
from
    periscope_usage_data.sql_views
where
    sql_views.deleted_at is null
union all
select
    charts.id
    , charts.name
    , charts.sql
    , 'chart' as tpe
from
    periscope_usage_data.charts
join periscope_usage_data.dashboards on
    charts.dashboard_id = dashboards.id
where
    charts.deleted_at is null
    and dashboards.deleted_at is null
union all
select
    csvs.id
    , csvs.name
    , '' as sql
    , 'csv' as tpe
from
    periscope_usage_data.csvs
Once we’ve got our query data, we can open it up in Python.
I’ve broken the code up into functional pieces, with everything coming together in the main() function near the bottom. The rest of the tutorial will bounce around a bit so we can follow the code as it executes, rather than in the order it is written. I’ll try my best to keep you oriented, but just in case, you can get the full Python file by emailing [email protected].
We will create two data structures from the information in this CSV:
A dictionary, whose keys are the names of all of the objects and whose values are the object type (view, csv, chart). This will later be used for node color coding.

A similar dictionary with the same keys, but with the direct parents of each object as the values.

Before going much further, let’s disambiguate some DAG terminology:
Each object is a node, each arrow is an edge describing a relationship. Parents, children, ancestors, and descendants are concisely explained in the following diagram:
Our first function will then be defined as:
def get_node_info(csv_infile_path):

It will take one parameter, the location and name of the CSV file we generated from the SQL above; for example:
'/Users/username/Gviz/digraph.csv'

We create the desired dictionaries using {} notation:
    node_info_dict = {}
    parent_dict = {}

and immediately open the file:
    with open(csv_infile_path) as infile:

Using the with statement means we don’t have to worry about closing the file later; we just un-indent. This could also have been written as infile = open(csv_infile_path, 'rb'), but that would have required cleanup work at the end.
The open command opens the CSV as a text file, but we know it’s a CSV that has attributes like headers which we’d like to use. To take advantage of that, we will read our CSV file with csv.DictReader, which allows us to call values based on headers, among a few other neat tricks:
       infile_reader = csv.DictReader(infile)

Though it may seem like we could have nested those two calls into a single line, saving the naming of one variable, we actually can’t: with calls an exit method (__exit__) on un-indent, which csv.DictReader doesn’t implement, so nesting them would make our code error.
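As a quick illustration of what DictReader buys us, here’s a minimal sketch (the rows are hypothetical, io.StringIO stands in for the real file, and it’s written for Python 3):

```python
import csv
import io

# a stand-in for the exported query CSV (hypothetical rows)
raw = io.StringIO(
    "thing_id,name,sql,tpe\n"
    "1,customers_view,select * from raw_customers,view\n"
    "2,revenue_chart,select * from [customers_view],chart\n"
)

infile_reader = csv.DictReader(raw)
for row in infile_reader:
    print(row['name'], row['tpe'])  # columns addressed by header name, not position
```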
Iterate Over CSV, Extract SQL and Metadata
Once we have our CSV loaded, we can iterate over it and pull out all of the information we want to put in our node_info_dict:
        for row in infile_reader:
            # pull data from CSV
            object_name = row['name']
            object_type = row['tpe']
            sql = row['sql']

We’re not done yet. We don’t just want the SQL, we want the parents in the SQL statement.
Luckily, there are only two places where a source table can be referenced in SQL: immediately following a from or a join keyword. It’s time for regex!
Before we dig into the details, let’s review what we want to accomplish:
SQL cleanup

Lowercase everything

Remove block comments

Remove line comments

Replace all whitespace with single spaces

Get set of parents

Get set of common table expressions (CTEs / with statements)

Remove CTEs from set of parents

Remove sub-selects

Clean up brackets*

*Periscope uses bracketed table names for views defined in the product; e.g., [customers_view].
To achieve this, we define a set of RegEx strings:
REG_BLOCK_COMMENT = re.compile(r"(/\*)[\w\W]*?(\*/)", re.I)
REG_LINE_COMMENT = re.compile(r'(--.*)', re.I)
REG_BRACKETS = re.compile(r"\[|\]|\)|\"", re.I)
REG_PARENTS = re.compile(r"(?<=join\s)+[\S\.\"\']+|(?<=from\s)+[\S\.\"\']+", re.I)
REG_CTES = re.compile(r"(\S+)\sas\W*\(", re.I)

I won’t dive into how these work—that could be a blog post all on its own. In the meantime, we've already written a few posts on RegEx: Getting Started With Regular Expressions and String Parsing in SQL.
SQL Cleanup
We use a handful of these patterns in the SQL cleanup step described above, by defining a clean_sql function:
def clean_sql(sql):
    c_sql = sql.lower()  # lowercase everything (for easier matching)
    c_sql = REG_BLOCK_COMMENT.sub('', c_sql)  # remove block comments
    c_sql = REG_LINE_COMMENT.sub('', c_sql)  # remove line comments
    c_sql = ' '.join(c_sql.split())  # replace \n and multiple spaces with a single space
    return c_sql

c_sql = ' '.join(c_sql.split()) is a little tricky. This bit splits the SQL statement into an array based on arbitrary whitespace as a delimiter, and rejoins that array with a single space as a delimiter. This is to reduce multi-spaces, line breaks, and combinations of the two down to a single space.
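A quick sanity check of clean_sql on a deliberately messy, made-up snippet (only the two comment patterns from above are needed here):

```python
import re

REG_BLOCK_COMMENT = re.compile(r"(/\*)[\w\W]*?(\*/)", re.I)
REG_LINE_COMMENT = re.compile(r'(--.*)', re.I)

def clean_sql(sql):
    c_sql = sql.lower()
    c_sql = REG_BLOCK_COMMENT.sub('', c_sql)
    c_sql = REG_LINE_COMMENT.sub('', c_sql)
    return ' '.join(c_sql.split())

print(clean_sql("SELECT foo  -- pick a column\nFROM   Bar /* the\n   source */"))
# -> select foo from bar
```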
Extract Parents From SQL
Next we create functions that get the parents and CTEs:
# this returns the unique set of parents per query
def get_parents(c_sql):
    parents = set(REG_PARENTS.findall(c_sql))
    return parents

# this returns the unique set of CTEs per query, so we can exclude them from the parents
def get_ctes(c_sql):
    ctes = set(REG_CTES.findall(c_sql))
    return ctes
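Run against a small made-up query, the two patterns plus the CTE-discard step behave like this (the `+` after each lookbehind in REG_PARENTS is a no-op and omitted here):

```python
import re

REG_PARENTS = re.compile(r"(?<=join\s)[\S\.\"\']+|(?<=from\s)[\S\.\"\']+", re.I)
REG_CTES = re.compile(r"(\S+)\sas\W*\(", re.I)

c_sql = "with recent as ( select * from orders ) select * from recent join customers using (id)"
parents = set(REG_PARENTS.findall(c_sql))  # {'orders', 'recent', 'customers'}
for cte in set(REG_CTES.findall(c_sql)):   # {'recent'}
    parents.discard(cte)
print(sorted(parents))  # -> ['customers', 'orders']
```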

Coming back to our iteration over the CSV in get_node_info, we call all of these:
            # clean the sql
            c_sql = clean_sql(sql)

            # get the set of parents
            parents = get_parents(c_sql)

            # get the set of ctes to exclude from parents
            ctes = get_ctes(c_sql)

            # remove ctes from the set of parents
            for cte in ctes:
                parents.discard(cte)

            # get rid of brackets in views
            c_parents = set()
            for parent in parents:
                if not parent[:1] == '(':
                    c_parents.add(REG_BRACKETS.sub('', parent))

            # add the object name, type, and direct parents to the dicts
            node_info_dict[object_name] = object_type
            parent_dict[object_name] = c_parents

    return (parent_dict, node_info_dict)

Define Relevant Node-set
My first iteration of this project would draw a full map of the entire view and chart ecosystem. Here’s a screenshot of what that looked like:
Needless to say, that isn’t the most useful visualization of my career. I quickly realized that specifying a focal object around which to graph ancestors and descendants makes for a much more digestible map.
However, in order to achieve that, we first need to find all of the relatives of a given node and only graph that set.  
This tool allows the user to specify a direction: ancestors, descendants, or both. In this tutorial, we will go through the process of identifying a node set that contains both, as it will require us to go through every relationship type.
We define a function that returns a relevant node set based on a parent dictionary, a focal node, and a direction:
def get_node_set(parent_dict, focal_node=None, direction=None):

As before, we define some placeholders to keep our data:
    descendant_dict = {}  # stores all descendants (any generation)
    ancestor_dict = {}  # stores all ancestors (any generation)
    node_set = set()  # the final result is stored and returned via this set
    node_set.add(focal_node)

Here’s where it gets a little tough. If you recall, our parent_dict is a key:value dictionary mapping each object to the set of its direct parents. For example, parent_dict['A'] = {'C', 'B', 'D'}, where C, B, and D together create A:
create table A as
select
    foo
from
    C
join B using (id)
join D using (id)

But what if we wanted a dictionary of children instead, so that we can look at descendants and not just ancestors? This would make C, B, and D the keys, and A one of the values for each.
To derive that, we loop over all of the parents and create a reverse dictionary called child_dict. Notice how there are two loops: one over each key in parent_dict, and one over each node in the value set. Deduplication is handled natively by writing to set and dictionary objects:
# this reverses a parent tree to a child tree
def get_child_dict(parent_dict):
    child_dict = {}
    for node in parent_dict:
        for parent in parent_dict[node]:
            if parent not in child_dict:
                child_dict[parent] = {node}  # a one-element set; set(node) would split the name into characters
            else:
                child_dict[parent].add(node)
    return child_dict
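Here’s the reversal applied to a tiny hypothetical parent dictionary:

```python
def get_child_dict(parent_dict):
    child_dict = {}
    for node in parent_dict:
        for parent in parent_dict[node]:
            if parent not in child_dict:
                child_dict[parent] = {node}  # start a one-element set
            else:
                child_dict[parent].add(node)
    return child_dict

# A is built from B and C; D is built from B
print(get_child_dict({'A': {'B', 'C'}, 'D': {'B'}}))
# B ends up with two children, {'A', 'D'}; C with one, {'A'}
```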

We then call this function and store the child_dict for use in just a bit:
   child_dict = get_child_dict(parent_dict)  # immediate children

Now we have a parent list and a child list, but what we really want is all the ancestors and descendants of a given node. Given this data structure, we must execute a tree traversal! This is where a recursive function comes in handy, since the parent has parents, and so on. Here’s the result:
# this traverses an arbitrary tree (parents or children) to get all ancestors or descendants
def traverse_tree(node, d_tree, been_done=set()):
    tree_outputs = set(d_tree.get(node, set()))  # direct relatives
    for key in d_tree.get(node, set()):  # 2nd-step relatives
        if key not in been_done:
            been_done.add(key)  # to break any circular references
            tree_outputs = tree_outputs.union(traverse_tree(key, d_tree, been_done))
    return tree_outputs
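And a small check that the been_done set really does break cycles — here three hypothetical tables form a loop:

```python
def traverse_tree(node, d_tree, been_done=set()):
    tree_outputs = set(d_tree.get(node, set()))  # direct relatives
    for key in d_tree.get(node, set()):
        if key not in been_done:
            been_done.add(key)  # breaks circular references
            tree_outputs = tree_outputs.union(traverse_tree(key, d_tree, been_done))
    return tree_outputs

# A depends on B, B on C, and C (circularly) on A
parents = {'A': {'B'}, 'B': {'C'}, 'C': {'A'}}
print(sorted(traverse_tree('A', parents, been_done=set())))
# -> ['A', 'B', 'C']  (the cycle makes A its own ancestor, but recursion still terminates)
```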

Using this function, we build a dictionary that contains ancestors, and another dictionary that contains descendants:
    # build descendant dict
    for node in child_dict:
        descendant_dict[node] = traverse_tree(node, child_dict, been_done=set())

    # build ancestor dict
    for node in parent_dict:
        ancestor_dict[node] = traverse_tree(node, parent_dict, been_done=set())

Depending on whether we want ancestors, descendants, or everything related to a node, we pick some overlap of these sets as our relevant set list.
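The selection logic itself isn’t shown in this post, but it can be sketched roughly like this (pick_node_set and its arguments are hypothetical names, not the original script’s):

```python
def pick_node_set(focal_node, ancestor_dict, descendant_dict, direction='both'):
    node_set = {focal_node}
    if direction in ('ancestors', 'both'):
        node_set |= ancestor_dict.get(focal_node, set())
    if direction in ('descendants', 'both'):
        node_set |= descendant_dict.get(focal_node, set())
    return node_set

ancestor_dict = {'revenue_chart': {'customers_view', 'orders_csv'}}
descendant_dict = {'revenue_chart': set()}
print(sorted(pick_node_set('revenue_chart', ancestor_dict, descendant_dict, 'ancestors')))
# -> ['customers_view', 'orders_csv', 'revenue_chart']
```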
Build the Graph
Finally, with the relevant nodes and the parent-child relationships, we can build our DAG:
def build_d_graph(parent_dict, node_set, node_type_dict):
    G = nx.DiGraph()  # initialize graph object
    for node in node_set:
        # add color-coded nodes
        if node_type_dict.get(node, None) == 'view':
            G.add_node(node, color='green')
        elif node_type_dict.get(node, None) == 'chart':
            G.add_node(node, color='blue')
        elif node_type_dict.get(node, None) == 'csv':
            G.add_node(node, color='red')
        # add edges and non-color nodes
        for parent in parent_dict.get(node, set()):
            G.add_edge(parent, node)
    return G
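To actually draw the result, networkx can hand the stored color attributes to its drawing routines. A minimal sketch with two hypothetical nodes (nx.draw itself needs matplotlib, so only the color lookup is shown running):

```python
import networkx as nx

G = nx.DiGraph()
G.add_node('customers_view', color='green')    # view
G.add_node('revenue_chart', color='blue')      # chart
G.add_edge('customers_view', 'revenue_chart')  # parent -> child

# one color per node, falling back to gray for nodes build_d_graph left uncolored
colors = [G.nodes[n].get('color', 'gray') for n in G.nodes]
print(colors)  # -> ['green', 'blue']
# then e.g.: nx.draw(G, node_color=colors, with_labels=True)
```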

And done!
