All of your queries are running, and your SQL-defined charts are returning results, but when it’s maintenance time, you don’t have a clear picture of the cascade of views, tables, and CSV files that your analysis depends on. At least it’s lucky that we’re good analysts and don’t make mistakes like circular references … right? How many of you ran straight for the Git repo?
You can sit down and start coding directed acyclic graphs (DAGs) in something like Airflow, but that’s a laborious task that involves reading through tons of SQL files and recording dependencies. There must be a better way!
I searched far and wide and found nothing, so I decided to dust off my Python skills from a previous life. This blog post will walk you through what I wrote, and leave you with a (mostly) working mapper, so you too can identify, clean up, and maintain your SQL ETL. If you want a copy of the script to run on your own set of SQL, email us at [email protected] and we’ll send it right over.
First things first, you’re going to need Python (I chose to use 2.7, because that’s what I’m used to), an IDE (Spyder was readily available), and a handful of packages. One of the folks on my team pointed me to Anaconda Navigator, which effortlessly set up all of the above.
Before we dive in, here’s an overview of what the code will do:
Import relevant packages
Open the CSV file that contains the relevant SQL
Iterate over the CSV, extract SQL and metadata
Clean SQL
Extract parents from SQL
Define relevant node-set
Graph and color the defined node-set
Import Relevant Packages
We’re going to need to import a couple of packages to make this work:
os—used to access the current working directory
argparse—parses command line arguments
re—regex library for parsing out table names from SQL
csv—a library for navigating CSV files
networkx—a Python graph library used for both creating and visualizing the directed acyclic graph
Some of these may require additional installation; I had to use Homebrew to get them set up.
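In code, the imports look like this (aliasing networkx as nx, which is how the later functions reference it):

import os        # access the current working directory
import argparse  # parse command line arguments
import re        # regex for pulling table names out of SQL
import csv       # read the CSV of queries with DictReader
import networkx as nx  # build the directed acyclic graph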
Open the CSV File That Contains the Relevant SQL
To begin, you need to get all your DDLs for your SQL-defined views, charts, and CSVs into a single table. We want the four following columns:
Query ID (can just be auto-increment)
Query Name
Query SQL
Object Type (CSV, Chart, View)
We use Periscope Data for our reporting needs, so it’s straightforward to access the repository of SQL that I’ve created for my data transformations. With Periscope Data’s new Usage Data feature, you can easily pull this information:
select
    sql_views.id thing_id
  , sql_views.name
  , sql_views.sql
  , 'view' as tpe
from
  periscope_usage_data.sql_views
where
  sql_views.deleted_at is null
union all
select
    charts.id
  , charts.name
  , charts.sql
  , 'chart' as tpe
from
  periscope_usage_data.charts
  join periscope_usage_data.dashboards on
    charts.dashboard_id = dashboards.id
where
  charts.deleted_at is null
  and dashboards.deleted_at is null
union all
select
    csvs.id
  , csvs.name
  , '' as sql
  , 'csv' as tpe
from
  periscope_usage_data.csvs
Once we’ve got our query data, we can open it up in Python.
I’ve broken the code up into functional pieces, with everything coming together in the main() function near the bottom. The rest of the tutorial will bounce around a bit so we can follow the code as it executes, rather than in the order it is written. I’ll try my best to keep you oriented, but just in case, you can get the full Python file by emailing [email protected].
We will create two data structures from the information in this CSV:
A dictionary, whose keys are the names of all of the objects and whose values are the object type (view, csv, chart). This will later be used for node color coding.
A similar dictionary with the same keys, but with the direct parents of each object as the values.
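For illustration (these object names are made up), the two dictionaries might end up looking something like this:

node_info_dict = {'customers_view': 'view', 'revenue_chart': 'chart', 'signups.csv': 'csv'}
parent_dict = {'revenue_chart': {'customers_view', 'orders'}, 'customers_view': {'accounts'}}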
Before going much further, let’s disambiguate some DAG terminology:
Each object is a node, and each arrow is an edge describing a relationship. Parents, children, ancestors, and descendants are concisely explained in the following diagram:
Our first function will then be defined as:
def get_node_info(csv_infile_path):
It will take one parameter, the location and name of the CSV file we generated from the SQL above; for example:
'/Users/username/Gviz/digraph.csv'
We create the desired dictionaries using {} notation:
node_info_dict = {}
parent_dict = {}
and immediately open the file:
with open(csv_infile_path) as infile:
Using the with statement means we don’t have to worry about closing the file later; we just un-indent. This could also have been written as infile = open(csv_infile_path, "rb"), but that would have required cleanup work at the end.
The open command opens the CSV as a text file, but we know it’s a CSV that has attributes like headers which we’d like to use. To take advantage of that, we will read our CSV file with csv.DictReader, which allows us to call values based on headers, among a few other neat tricks:
infile_reader = csv.DictReader(infile)
Though it may seem like we could have nested those two calls into a single line, saving the naming of one variable, we actually can’t: with relies on context-manager methods (__enter__ and __exit__) that csv.DictReader doesn’t implement, so nesting them would raise an error.
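For comparison, here’s roughly what the manual version without with would look like (just an illustration, not part of the script):

infile = open(csv_infile_path, "rb")
try:
    infile_reader = csv.DictReader(infile)
    # ... iterate over infile_reader here ...
finally:
    infile.close()  # with open(...) handles this step for us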
Iterate Over CSV, Extract SQL and Metadata
Once we have our CSV loaded, we can iterate over it and pull out all of the information we want to put in our node_info_dict:
for row in infile_reader:
    # pull data from the CSV
    object_name = row['name']
    object_type = row['tpe']
    sql = row['sql']
We’re not done yet, though. We don’t just want the SQL; we want the parents referenced in each SQL statement.
Luckily, there are only two places where a source table can appear in SQL: immediately following a from or a join keyword. It’s time for regex!
Before we dig into the details, let’s review what we want to accomplish:
SQL cleanup
Lowercase everything
Remove block comments
Remove line comments
Replace all whitespace with single spaces
Get set of parents
Get set of Common Table Expressions (CTEs / with statements)
Remove CTEs from set of parents
Remove sub-selects
Clean up brackets*
*Periscope uses bracketed table names for views defined in the product; e.g., [customers_view].
To achieve this, we define a set of RegEx strings:
REG_BLOCK_COMMENT = re.compile("(/\*)[\w\W]*?(\*/)", re.I)
REG_LINE_COMMENT = re.compile('(--.*)', re.I)
REG_BRACKETS = re.compile("\[|\]|\)|\"", re.I)
REG_PARENTS = re.compile("(?<=join\s)+[\S\.\"\']+|(?<=from\s)+[\S\.\"\']+", re.I)
REG_CTES = re.compile("(\S+)\sas\W*\(", re.I)
I won’t dive into how these work—that could be a blog post all on its own. In the meantime, we've already written a few posts on RegEx: Getting Started With Regular Expressions and String Parsing in SQL.
SQL Cleanup
We use a handful of these regexes for the SQL cleanup step described above, by defining a clean_sql function:
def clean_sql(sql):
    c_sql = sql.lower()                       # lowercase everything (for easier matching)
    c_sql = REG_BLOCK_COMMENT.sub('', c_sql)  # remove block comments
    c_sql = REG_LINE_COMMENT.sub('', c_sql)   # remove line comments
    c_sql = ' '.join(c_sql.split())           # collapse \n and multiple spaces into single spaces
    return c_sql
c_sql = ' '.join(c_sql.split()) is a little tricky. This bit splits the SQL statement into an array based on arbitrary whitespace as a delimiter, and rejoins that array with a single space as a delimiter. This is to reduce multi-spaces, line breaks, and combinations of the two down to a single space.
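Here’s a quick sanity check of clean_sql on a small query (the expected output assumes the regex definitions above):

raw_sql = """
SELECT foo   -- grab foo
FROM my_schema.my_table  /* a block
comment */
"""
print(clean_sql(raw_sql))  # prints: select foo from my_schema.my_table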
Extract Parents From SQL
Next we create functions that get the parents and CTEs:
# this returns the unique set of parents per query
def get_parents(c_sql):
    parents = set(REG_PARENTS.findall(c_sql))
    return parents

# this returns the unique set of CTEs per query, so we can exclude them from the list of parents
def get_ctes(c_sql):
    ctes = set(REG_CTES.findall(c_sql))
    return ctes
Coming back to our iteration over the CSV in get_node_info, we call all of these:
# clean the sql
c_sql = clean_sql(sql)
# get the set of parents
parents = get_parents(c_sql)
# get the set of CTEs to exclude from parents
ctes = get_ctes(c_sql)
# remove CTEs from the parents set
for cte in ctes:
    parents.discard(cte)
# get rid of brackets in views and skip sub-selects
c_parents = set()
for parent in parents:
    if not parent[:1] == '(':
        c_parents.add(REG_BRACKETS.sub('', parent))
# add the object name, type, and direct parents to the dicts
node_info_dict[object_name] = object_type
parent_dict[object_name] = c_parents
return (parent_dict, node_info_dict)
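Putting those fragments together, the whole function reads roughly like this (a consolidated sketch of the code above, not a new implementation):

def get_node_info(csv_infile_path):
    node_info_dict = {}  # object name -> object type (view, chart, csv)
    parent_dict = {}     # object name -> set of direct parents
    with open(csv_infile_path) as infile:
        infile_reader = csv.DictReader(infile)
        for row in infile_reader:
            object_name = row['name']
            object_type = row['tpe']
            sql = row['sql']
            c_sql = clean_sql(sql)
            parents = get_parents(c_sql)
            for cte in get_ctes(c_sql):
                parents.discard(cte)
            c_parents = set()
            for parent in parents:
                if not parent[:1] == '(':  # skip sub-selects
                    c_parents.add(REG_BRACKETS.sub('', parent))
            node_info_dict[object_name] = object_type
            parent_dict[object_name] = c_parents
    return (parent_dict, node_info_dict)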
Define Relevant Node-set
My first iteration of this project would draw a full map of the entire view and chart ecosystem. Here’s a screenshot of what that looked like:
Needless to say, that isn’t the most useful visualization of my career. I quickly realized that specifying a focal object around which to graph ancestors and descendants makes for a much more digestible map.
However, in order to achieve that, we first need to find all of the relatives of a given node and only graph that set.
This tool allows the user to specify a direction: ancestors, descendants, or both. In this tutorial, we will go through the process of identifying a node set that contains both, as it will require us to go through every relationship type.
We define a function that returns a relevant node set based on a parent dictionary, a focal node, and a direction:
def get_node_set(parent_dict, focal_node=None, direction=None):
As before, we define some placeholders to keep our data:
descendant_dict = {} # intended to store all descendants (any generation)
ancestor_dict = {} # intended to store all ancestry (any generation)
node_set = set() # final result is stored and then returned via this set
node_set.add(focal_node)
Here’s where it gets a little tough. If you recall, our parent_dict is a dictionary whose keys are object names and whose values are each object’s direct parents as a set. For example, parent_dict['A'] = {'B', 'C', 'D'}, where B, C, and D together create A:
create table A as
select
  foo
from
  C
  join B using (id)
  join D using (id)
But what if we wanted a dictionary of children instead, so that we can look at descendants and not just ancestors? This would make C, B, and D the keys, and A one of the values for each.
To derive that, we loop over all of the parents and create a reverse dictionary called child_dict. Notice how there are two loops: one over each key in parent_dict, and one over each node in the value set. Deduplication is handled natively by writing to set and dictionary objects:
# this reverses a parent tree into a child tree
def get_child_dict(parent_dict):
    child_dict = {}
    for node in parent_dict:
        for parent in parent_dict[node]:
            if parent not in child_dict:
                child_dict[parent] = {node}  # start a new one-element set for this parent
            else:
                child_dict[parent].add(node)
    return child_dict
We then call this function and store the child_dict for use in just a bit:
child_dict = get_child_dict(parent_dict) # immediate children
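Continuing the A/B/C/D example from above, the reversal works like this:

parent_dict = {'A': set(['B', 'C', 'D'])}
child_dict = get_child_dict(parent_dict)
# each of B, C, and D now maps to the set containing 'A', its only child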
Now we have a parent list and a child list, but what we really want is all the ancestors and descendants of a given node. Given this data structure, we must execute a tree traversal! This is where a recursive function comes in handy, since the parent has parents, and so on. Here’s the result:
# this traverses an arbitrary tree (parents or children) to get all ancestors or descendants
def traverse_tree(node, d_tree, been_done=set()):
    tree_outputs = set(d_tree.get(node, set()))  # direct relatives
    for key in d_tree.get(node, set()):          # second-step relatives
        if key not in been_done:
            been_done.add(key)  # to break any circular references
            tree_outputs = tree_outputs.union(traverse_tree(key, d_tree, been_done))
    return tree_outputs
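As a quick illustration, take a hypothetical three-object chain where A is built from B, and B is built from C:

chain = {'A': set(['B']), 'B': set(['C'])}
print(traverse_tree('A', chain, been_done=set()))  # B and C: all of A's ancestors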
Using this function, we build a dictionary that contains ancestors, and another dictionary that contains descendants:
# build descendant dict
for node in child_dict:
    descendant_dict[node] = traverse_tree(node, child_dict, been_done=set())

# build ancestor dict
for node in parent_dict:
    ancestor_dict[node] = traverse_tree(node, parent_dict, been_done=set())
Depending on whether we want ancestors, descendants, or everything related to a node, we combine some of these sets into our relevant node set, as sketched below.
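The original script’s exact branching isn’t shown here, but the selection boils down to something like this (the direction values are my assumption):

# inside get_node_set, after the ancestor and descendant dicts are built
if direction in ('ancestors', 'both', None):
    node_set = node_set.union(ancestor_dict.get(focal_node, set()))
if direction in ('descendants', 'both', None):
    node_set = node_set.union(descendant_dict.get(focal_node, set()))
return node_set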
Build the Graph
Finally, with the relevant nodes and the parent-child relationships, we can build our DAG:
def build_d_graph(parent_dict, node_set, node_type_dict):
    G = nx.DiGraph()  # initialize the directed graph object
    for node in node_set:
        # add color-coded nodes
        if node_type_dict.get(node, None) == 'view':
            G.add_node(node, color='green')
        elif node_type_dict.get(node, None) == 'chart':
            G.add_node(node, color='blue')
        elif node_type_dict.get(node, None) == 'csv':
            G.add_node(node, color='red')
        # add edges and non-color nodes
        for parent in parent_dict.get(node, set()):
            G.add_edge(parent, node)
    return G
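To actually render the result, you need a little glue. Here’s a minimal sketch of a main() that wires the pieces together and draws the colored graph; the command-line argument names and the matplotlib dependency are my assumptions, not necessarily what the original script does:

import matplotlib.pyplot as plt  # assumed extra dependency for rendering

def main():
    arg_parser = argparse.ArgumentParser(description='Map SQL dependencies as a DAG')
    arg_parser.add_argument('csv_infile_path', help='CSV of query names, SQL, and object types')
    arg_parser.add_argument('--focal-node', default=None, help='object to center the graph on')
    arg_parser.add_argument('--direction', default=None, choices=['ancestors', 'descendants', 'both'])
    args = arg_parser.parse_args()

    parent_dict, node_info_dict = get_node_info(args.csv_infile_path)
    node_set = get_node_set(parent_dict, args.focal_node, args.direction)
    G = build_d_graph(parent_dict, node_set, node_info_dict)

    # the color list has to line up with node order; gray covers nodes that
    # were only added as edge endpoints and never got a color attribute
    colors = [attrs.get('color', 'gray') for node, attrs in G.nodes(data=True)]
    nx.draw(G, with_labels=True, node_color=colors)
    plt.show()

if __name__ == '__main__':
    main()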
And done!