Tags » Tips

A nifty little tool to find all the distinct values in all columns

A few months ago I became quite frustrated because I had a huge backlog of work to do performing the Quality Control of the analyses of my team members. A large portion of performing this type of Quality Control was getting a sense of the data ingested by the analyses and determining if there were any data-specific elements of the code which were not appropriately commented or if there were strange data points in the output data. To do this I needed to quickly get a sense for the possible values in an unfamiliar table, doing the same thing over and over again gets tedious, so I wrote this little script to allow me to quickly get all the distinct values in a column for every column in a table. I’ve provided the code to do this below with explanatory comments.

Creating anonymized pseudo-random IDs

I sometimes need to convert a series of unique sensitive IDs into a list of unique IDs which I can use to share with people who aren’t allowed to know the IDs in the first list. The very easiest was to do that in SQL is to either use the HASHBYTES function in T-SQL and provide it with the SHA512 hashing algorithm or to use the t-SQL newID function to create new GUIDs. The difficulty with this is that it creates a very difficult to read ID so if I started with relatively simple to remember 5 character IDs I end up with a super user unfriendly 40 character hexadecimal number.

So how could one scramble a friendly¬†6 character ID into an anonymous friendly 7 character ID? The easiest way to do this is to use some basic number theory and combine that with SQL’s NEWID() function.

  1. We start off by getting a unique list of all the ID numbers we want to anonymize and scramble the order by using NEWID()

  2. We then Take that randomly ordered list and perform some math on the order using two carefully selected primes, one is a small prime the number of digits that we want the result to be in (A), and the other is a very large prime (B). I’ve chosen only a middlingly large prime here to make reading easier.
    A = 7957597
    B = 362736035870515331128527330659
    We multiply these in the following way to create a unique list of numbers. NB, prime A must be larger than the total number of IDs you want to create or else this doesn’t work.

  3. If we want to be fancy we can convert the integer into hex and add a prefix to denote that this is an anonymous ID. NB the size of your VARBINARY depends on the size you want your strings to be.

Pivoting a table when you don’t know how many columns you need with dynamic SQL

Here’s the situation, you’ve got a column in a table which you want to pivot on but you don’t have any control over the distinct number of values that can be in that column. It could be 3 or it could be 20. However, you still need to be able to pivot the whole table to display the information to the user. How do you do this?

The answer here lies in the way you can use string variables as code in dynamic SQL, and there are two parts of the pivot code which have become variables in the situation outlined above:

Therefore, to create this pivot when we have a variable number of values in our PivotSource column we need to populate two strings with the distinct values of the PivotSource Column — Variable Block #1 and Variable Block #2.

To do this we will use the method I explained in an earlier post. For Variable Block #1 the code looks like:

The stuff command here removes the comma at the beginning so that we have a comma-separated list of values and no trailing comma.

The code for Variable block #2 is similar but formatted slightly differently.

Finally we need to construct our Dynamic SQL statement that we need to execute:

And there you have it, a Pivot statement that can pivot a table if you don’t know how many variables will be in the column you want to pivot on.


How to include several values in one field in SQL

I found out the other day how to solve a problem that had been causing me issues when presenting data. I often got requests to resent data in such a way that we display information from a table in a grouped form while still allowing someone to consult all of the values relating to this grouping.

The solution isn’t very machine friendly and is end-user oriented. The code can be seen below:

Getting the number of lines in multiple csv files with Powershell v1

To find out the number of lines in each file of a bunch of csv files in the directory tree you can run the following snippet:

A function similar to first or last from access in MS SQL

I’ve been looking for a while now to find a way to return just one row when the fields I am using split up my table return several rows per unique field combination. I found this little gem today:

This of course does not emulate first and last exactly as it doesn’t really allow you to influence the order of the returned fields in the partitioning. It is however very useful when you just want to return one value.

Finding hidden characters in Excel

Did you know that you could search for hidden characters and replace with hidden characters in excel?

Simply type:

To find any non-printable characters.


Here are some handy Ascii codes: