Find duplicate lines in SQLite

Well it is good to know that Murphy’s Law is still going.

Two days ago I updated the downloadable SQLite-OLE application, and today I found a bug.

I regularly (3 or 4 times a week) have to deal with Excel files that have ~ 100 columns and more than 800,000 lines. Rather than do the analysis in Excel with macros, I transfer everything to the SQLite-OLE application. I find that far easier and far faster.

Recently I needed to identify the rows that were 100% duplicated across all columns . Now in SQL it is easy to remove duplicates, but it is a bit of a pain to find full row duplicates.

I will give an example:

create table family   ( Position text , Age INT)
 insert into family values ( "Father", 42  )
 insert into family values ( "Mother", 21 )
 insert into family values ( "Daughter", 4 )
 insert into family values ( "Son", 11 )
 insert into family values ( "Mother", 21 )
 insert into family values ( "Mother", 21 )
 insert into family values ( "Son", 11 )

Removing duplicates is easy, you will find lots of examples in the Web. One easy way is to create a new table based on the old table using the distinct keyword:

create table Family2 as select distinct * from family

But finding duplicates is a bit more complicated.

Eventually I found the following which is fairly neat:

create temp table Family2 as select rowid as Ind from family

delete from Family2 where Ind in ( select rowid from family group by Position, Age having count (*) = 1 )

This leaves Family2 holding the rowids into Family of all the duplicated rows!

The only trouble is that when there are lots of columns in Family then the expression

delete from Family2 where Ind in ( select rowid from family group by Position, Age having count (*) = 1 )

has to be extended to include all of the column names separated by commas. Frankly when there are lots of columns, then creating this column list is tedious, prone to error and horrible if you need to apply this to lots of different tables.

So I have extended the SQLite-OLE application’s (shell) commands to include a command that creates a comma separated list of the column names in a user specified table.

DBcolumnList ( <string expression = DB alias name> ,  <string expression = Table name> )

So in this example I would have used

 create temp table Family2 as select rowid as Ind from family"

dim queryString = " delete from Family2 where Ind in ( select rowid from family group by "

queryString  = queryString  &   DBcolumnList ( "main" , "family")   having count (*) = 1 ) "

sql = queryString 

Message = "Table temp.Family2 holds the rowids of the duplicate lines in table  main.Family"

Yes, yes, I know. The red text is not SQLite syntax, but belongs to the SQLite-OLE application.

So I admit it, a little bit of advertising for SQLite-OLE, but on the other hand it did find the duplicates, and one bug has been removed! So there is a new version. For details, and the download link, please look in the SQLite-OLE web-page.