If you are an Excel pro who has been fortunate enough to find Alteryx Designer, you probably love the Join Tool and are feeling OK about breaking up with VLOOKUP 😊
But if you aren’t careful about how you use this new superpower, can you turn bad data into *really* bad information when using the Join Tool.
I like to “cover my joins” with Filter, Formula, Unique and Summarize tools to check inputs and outputs in order to minimize the likelihood of joins gone bad.
Let’s run though a simple example with only a handful of records for illustration purposes.
Say you are trying to merge 2 data sets describing your companies’ PCs. Specifically, which employees are assigned to them, and which software titles are installed on each PC.
The first data set gives us a “Name” and a “PC Tag”:
The second data set gives us software “Titles” installed on a “PC”:
So your first pass at a workflow to accomplish this join might look like this:
When you run it, you see the following output:
Not what you expected, right?
Join is very picky about case when matching strings…if the case doesn’t match, even with all the same letters in the same position, no match on join. “ABC” doesn’t match with “abc”.
Solution: Create a new field with a Formula Tool that forces string characters to all lower case from the join field in both data sets. Then use that new field to join both tables.
Now your workflow might look like this:
When you run it, you see the following output:
Better, but still not right. Bill Nye allegedly has two software titles installed on a “null” PC that belongs to him. Does that make sense? Probably not.
Turns out that when you join a field from one table to a field in another table where they each contain “null” values, those are considered matches by the Join Tool. Each null value from table “A” will join with each null value in table “B”, potentially creating lots of garbage data in your output. They are fruitful and oh will they multiply!
Also, “null” is supposed to represent the absence of information, not something you want to match or join on, generally.
Solution: Filter both data sets on your join field for “IsEmpty”. When you filter for IsEmpty, you’ll catch both “blank” and “null” values. While “blank and “null” are different values, “IsEmpty” gets them both. (If you want to separate “blank” from “null” for further investigation, put a second Filter on the “True” output of your first Filter and test for “IsNull”.)
Now your workflow might look something like this:
And the output like this:
I also like to ensure that I understand the uniqueness of values in the fields I intend to use for a join between two tables. The Unique Tool is the way to start. I check the “Duplicates” output for any records in the join field(s), then use the Summarize Tool to Group By values in the join field(s). The last step is to use the Join Tool, combining:
- (L)eft: the unique list of values for the join field(s) from the output of the Summarize Tool
- (R)ight: the join field(s) from the input of the Unique Tool, i.e. the output of the Formula Tool. (I typically use a wireless connection for this input to the Join Tool…less messy)
Inspecting the “(J)oin” output of the Join Tool will show you all the records that are associated with what were considered duplicate values in our join field(s).
It may make perfect sense for there to be duplicate values in the join field(s) from one or both of your tables, and if so, you’ll be able to see and confirm that. However, if your untested assumption is that all values in the join field(s) are unique, then this little 3-tool-rig will help you figure out what’s wrong and what needs to change before you run the workflow and call the output good.
I normally do a final test for uniqueness in what I think is the primary key for the output dataset, just to be sure I am getting what I expect. If not, figure out why those duplicates are present:
As a bonus trick, I have at times also added a Record ID Tool on both the (L)eft and (R)ight inputs to the Join Tool. This way I have a single field as a unique identifier for each record on both sides of the join inputs, which can be summarized from (J)oin output to see which records from the (L)eft are being “multiplied” the most by records from the (R)ight, and vice versa.
Finally, use a Select Tool to remove any fields you created to debug your Join results, sending only good data downstream for further processing.
So if you sometimes get unexpected results out of joins between two tables on text fields, you might try covering your join the way I’ve described in this blog to turn bad data into good information. I am sure this isn’t the only or best way to get the job done, but it’s a good place to start if you have nothing else.
What’s worse…a workflow that takes a little longer to build and run, or sending bad info to the CFO? I’ll leave with you that thought 😊