After building a complex workflow that finally works and gets you the output you wanted, it’s tempting to give yourself a high five and move on. But do you ever think about the data you lose as it passes through Filters and Joins, from input to output, and whether you need to be concerned about that “lost data”?

  • Is there a lot of lost data, as a percentage of the input records you start with?
  • Is there an important pattern that, once detected and corrected, would get that data back into your output results?
  • Unless you know what data is lost between input and output, how can you be sure the results you pass along to your peers or customers are complete and trustworthy?

By inspecting your workflow and keeping an eye out for where data is “lost”, you can capture that lost data as another important output of your workflow. I’ll show you how I’ve done it in what follows.

By way of example, let’s look at a workflow I built with two inputs: one for “pc” data, and the other for “software” data. Both inputs are filtered and then joined on the field “pctag_join” to create the final output.

We’ll just consider the input “Table = pc” for now.

The first place you lose data is at the “IsEmpty[PC Tag]” Filter Tool. Add a Select Tool to that Filter Tool’s True output (the True side is where the rows with an empty PC Tag fall out).

[Screenshot filterout1: a Select Tool added to the Filter Tool’s True output]

Then make the connection wireless, and drag it down below the main workflow.

[Screenshot filterout2: the wireless connection dragged below the main workflow]
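If it helps to see the mechanics outside of Alteryx, here’s a minimal pandas sketch of what that Filter step is doing; the `pc` DataFrame and its values are hypothetical stand-ins for the “Table = pc” input:

```python
import pandas as pd

# Hypothetical stand-in for the "Table = pc" input.
pc = pd.DataFrame({
    "PC Tag":     ["A100", "", None, "B200"],
    "pctag_join": ["A100", "X1", "X2", "B200"],
})

# IsEmpty([PC Tag]): the True side is the "lost" data; the False side flows on.
is_empty = pc["PC Tag"].isna() | (pc["PC Tag"].str.strip() == "")
lost_filter = pc[is_empty]   # the Filter Tool's True output
pc_clean = pc[~is_empty]     # the False output that continues downstream
```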

The second and last place you lose data from “Table = pc” in this workflow is at “Join pctag_join to pctag_join”. Add a Select Tool to the Left output of the Join Tool (the Left output holds the pc rows that found no match on “pctag_join”).

[Screenshot joinout1: a Select Tool added to the Join Tool’s Left output]

As before, make the connection wireless and drag it down below the main workflow.

[Screenshot joinout2: the wireless connection dragged below the main workflow]
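That Left output is effectively an anti-join: the pc rows whose “pctag_join” found no match on the software side. Continuing the pandas sketch, where the `software` DataFrame is again a made-up stand-in for the filtered software input:

```python
# Hypothetical stand-in for the filtered "software" input.
software = pd.DataFrame({"pctag_join": ["A100"], "app": ["Excel"]})

# indicator=True tags each row with where its join key was found.
merged = pc_clean.merge(software, on="pctag_join", how="left", indicator=True)
lost_join = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
joined = merged[merged["_merge"] == "both"].drop(columns="_merge")  # the J output
```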

Next, we’ll bring our two Select Tools together, label them, add them as inputs to a Union Tool, and finally add a Browse Tool on the output of the Union Tool.

[Screenshot finalrig: the two Select Tools labeled and fed into a Union Tool, with a Browse Tool on the Union output]
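In pandas terms, labeling the Select Tools and unioning them amounts to tagging each lost-data frame with the tool that dropped it and stacking the two; the `lost_at` column name is just an illustrative choice:

```python
# Label each lost-data set with the tool that dropped it, then stack them.
lost_filter = lost_filter.assign(lost_at="Filter IsEmpty[PC Tag]")
lost_join = lost_join.assign(lost_at="Join pctag_join to pctag_join")

lost_pc = pd.concat([lost_filter, lost_join], ignore_index=True)
print(lost_pc)  # the Browse Tool's view: every pc row that never reached OUTPUT
```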

To reduce visual clutter and improve the readability of your workflow, I’d recommend wrapping your “lost data” rig in a Tool Container, collapsing it, and giving it a meaningful name like “Lost Data Table pc”. Place it near your “OUTPUT” on the canvas so you don’t forget about it.

[Screenshot output: the collapsed “Lost Data Table pc” container placed near the OUTPUT]

You’ll want to do this for each of the tables that serve as your input data sources; in this example, both “pc” and “software”.

Now when you run this workflow, in addition to the “good” data that passes downstream as the OUTPUT, you can also see exactly which rows of “bad data” were lost along the way. Inspecting the “bad data” shows you how much of your input does *not* make it through to your OUTPUT, and lets you start figuring out why.
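Putting a number on the first bullet at the top of this post is then a one-liner, using the hypothetical frames from the sketches above:

```python
# What fraction of the original input never made it to OUTPUT?
lost_pct = 100 * len(lost_pc) / len(pc)
print(f"{len(lost_pc)} of {len(pc)} input rows lost ({lost_pct:.1f}%)")
```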

Upon further inspection of the “bad data”, you may find patterns or exceptions that can be corrected in the source data, or handled inside the workflow, so that “bad data” becomes “good data” and is included in your main OUTPUT. That improves the quality of the information you’re responsible for providing, and helps others make better decisions.

You just can’t ignore the data lost as it passes through the Filters and Joins in your workflow, and you can’t assume the input data source is already clean enough that only “good data” will make it through to your output. It’s always safer to assume there’s some “good” hiding in the “bad data”, waiting to be found and folded into the best possible results by the careful Alteryx Designer analyst.