• deleting duplicate data in many rows (Excel 2000)

    Home » Forums » AskWoody support » Productivity software by function » Visual Basic for Applications » deleting duplicate data in many rows (Excel 2000)

    • This topic has 12 replies, 5 voices, and was last updated 23 years ago.
    Author
    Topic
    #372163

    Hi all,
    I have 20,000+ rows of data in multiple cols. I need to be able to occasionally go through this data and delete duplicates.

    In this case a duplicate is defined as any occurrence of the same data in more than one row. In any given row, the data can occur more than once in one or more columns, however. confused

    Example (3 data rows)
    This is OK.
    AA, BB, AA, DD
    EE, KK, LL, MM
    ZZ, FF, FF, WW

    This is NOT ok: (3 data rows)
    AA, BB, AA, DD
    EE, KK, LL, AA
    ZZ, FF, FF, AA

    Rows 2 and 3 contain data that already exist in previous rows (AA is found in row 1). These last two rows would get deleted.

    I think I need VBA here, not a formula. It’d have to be some triply nested loop I think. One for the whole data set, one within sequencial dupes in one column, and then one to check all columns for a pair of cells. The code would also have to keep track of the “used” data so that it searches through the list of previous data and if any matches are found (for the new rows) that row is deleted.

    This is more than a search/replace since there is no known data to search for (it’s any duplicate data).

    I’ll start working on the algorithm now but wanted to post this to the group in case someone comes up with a good methodology. I’ll post what I came up with when I get it working.

    Thnx, Deb compute

    Viewing 1 reply thread
    Author
    Replies
    • #593947

      A couple of questions:

      1- How many columns are involved in the check for duplicates (it looks like four from your example)?

      2- What should happen in this case:

      AA, BB, AA, DD
      EE, KK, LL, AA
      ZZ, FF, FF, DD
      EE,GG,GG,HH

      • #593948

        The # of cols is not fixed, but I was going to assume that either the source data is the current worksheet or a selected range before starting the search. I’d then just query for the # of cols that contain data for the known rows.

        I was thinking of first making a pass through the data to gather all the possible data and store them in a collection (which would by definition get rid of dups for purpose of searching). This collection would be my compare list when I’m checking if any given data for a cell has already been found. If yes, I’d increment a counter by 1 and that would tell me that the data has already occured. I was going to create a user defined data type to store in the collection (e.g., DataFound as variant, DataCount as integer, DataRow as long). If I’m still on the same row, DataCount can be > 1 as long as DataFound has not already been found in a previous row. (I know this because DataCount = 1 and DataRow current row for the active cell.) This is quite confusing I know.

        To answer your 2nd question, rows 2,3,4 would be deleted since they all contain at least one data element that exists in a previous row. (For row 2, AA appears in row 1; for row 3, DD appears in row 1; for row 4, EE appears in row 2).

        I haven’t started this algorithm yet as I’m working on something else but will try it later tonight.

        Thanks for looking into this,
        Deb crazy

        • #593992

          This is some strange data. If you don’t care what row the data is (was) in, or how it was grouped with anything else, you might take each non-empty cell, copy it to a single column on a separate sheet, sort that, and then the work of killing duplicates should be simple.

        • #594003

          But if 2,3 and 4 should be deleted, then EE would not appear in the results. Is that what you want? Or, would row four not be deleted because row two got deleted before row four is checked?

      • #594246

        Ok, I now know what I’m talking about clapping clapping

        I talked to the user and I was making it too complicated. They need to delete duplicate rows where ALL the cells in a given row match 100% of the cells in any previous row.

        They tried doing it with MapPoint and it worked but took 30 minutes for each data set (which contains 100,000 data ponts) and there are 120 data sets. (It was on a slow 400MHz PC with 15GB disk).

        So, back to the beginning…If my data is

        row1: AA, BB, AA, DD
        row2: AA, DD, CC, AA
        row3: BB,CC,AA, DD
        row4: AA, DD, CC, AA

        only row 4 would be deleted.

        Now I wonder if there is a formula to do this as well. A VBA solution would be fine too.

        Thanks all for the time spent to figure this one out.
        Deb bow

        • #594277

          OK, try this code:

          Public Sub DeleteDeptNums()
          Dim lLastRow As Long, lLastCol As Long, I As Long, J As Long, K As Long
              lLastRow = Worksheets("Sheet1").Range("A65536").End(xlUp).Row - 1
              lLastCol = Worksheets("Sheet1").UsedRange.Columns.Count - 1
              For I = 0 To lLastRow - 1
                  For J = lLastRow To I + 1 Step -1
                      For K = 0 To lLastCol
                          If Worksheets("Sheet1").Range("A1").Offset(I, K).Value  _
                            Worksheets("Sheet1").Range("A1").Offset(J, K).Value Then
                              Exit For
                          End If
                      Next K
                      If K > lLastCol Then
                          Worksheets("Sheet1").Range("A1").Offset(J, 0).EntireRow.Delete
                      End If
                  Next J
              Next I
          End Sub
          
        • #594445

          Deb, you don’t need VBA to do this, just use the Data | Filter | Advance Filter menu and use a blank criteria and check the box “Unique records only.” HTH –Sam

          • #594594

            Hey that actually works! This is great. I obviously have never look at that menu before. Thanks for filling me in.

            Deb fanfare

          • #594639

            Oopps, I spoke too soon or I’m not using this feature right. doh

            I entered this data in A1:D7
            AA BB AA DD
            DD DD AA BB
            AA BB AA DD
            ZZ HH KK LL
            DD DD AA BB

            I selected the area, did Data/Filter. Advanced Filter. I checked ‘unique records’ only and did the blank criteria (it prompts for a cell so I selected an empty cell).

            I got this:
            AA BB AA DD
            DD DD AA BB
            AA BB AA DD
            ZZ HH KK LL

            So it did not remove duplicates, I still have two copies of “AA BB AA DD”

            What did I do wrong?

            AAH, I did it again and found out it ONLY works on numbers, not text!!! I must have headers on each column.

            When I changed the data to numbers with text labels it worked:

            A, B, C, D <—- labels
            11,22,33,44
            44,33,11,22
            11,22,33,44
            77,88,99,77
            11,22,33,44
            77,88,99,77

            This created:
            11,22,33,44
            44,33,11,22
            77,88,99,77

            which is correct. My data is a mix of text and numbers so I'll have to figure a way to turn it into numbers.

            This all helps though… At least I have a start on the solution (and there's always the code another lounger came up with).

            Deb bash

            • #594651

              It works on everything, it’s just tough to figure out the first time. To make it easier, we will use two sheets, one for the original data and one for the criteria.

              First, make sure that your data has a row of headers, one for each column. You can just make them up: C1, C2, etc. if you want.

              Next, copy the row of headers and paste it on sheet 2, cell A1.

              Now, you are ready to do the Advanced filter. Click in the original table and do Data | Filter | Advanced Filter. Excel tries to read your mind, so hopefully it will already have you original data selected and entered in the list range. If not, enter it there.

              Now for the criteria range, select the first row of headers and a blank row beneath the headers. Finally, check the “unique records only,” and press OK.

              Sometimes, Excel cannot figure out your header row, if you get this alert, just press OK to use the first row as headers.

              Since this was done in place, Excel has hidden the duplicate rows, so, if it looks OK, you can do Edit | GoTo | Special, Visible cells only, copy this data and paste it on another worksheet. If you don’t like the filtered data, you can do a Data | Filter | Show All. HTH –Sam

    • #594032

      Hi Deb,
      See if this meets your needs. It simply trawls through the active sheet finding rows with duplicates and deleting them. I’ve also included an alternative for deleting just the duplicate cells, in case that’s what you’re after.

      Sub ClearDuplicateRowsInColumns()

      Dim A As Long
      Dim B As Long
      Dim C As Range
      Dim D As Range

      On Error GoTo Abort
      Application.ScreenUpdating = False
      Application.Calculation = xlCalculationManual

      Set D = ActiveSheet.UsedRange.Columns
      Set C = ActiveSheet.UsedRange.Rows
      For B = D.Columns.Count To 1 Step -1
      For A = C.Rows.Count To 1 Step -1
      If Application.WorksheetFunction.CountIf(Range(Rows(1).Columns(1), Rows(A).Columns(), C.Rows(A).Columns() > 1 Then
      ‘If you only want to delete the offending cell, use: If Application.WorksheetFunction.CountIf(Columns(, C.Rows(A).Columns() > 1 Then
      C.Rows(A).EntireRow.Delete
      ‘If you only want to delete the offending cell, use: Rows(A).Columns(.Delete Shift:=xlUp
      End If
      Next A
      Next B

      Abort:
      Application.ScreenUpdating = True
      Application.Calculation = xlCalculationAutomatic
      End Sub

      Cheers

      Cheers,
      Paul Edstein
      [Fmr MS MVP - Word]

      • #594247

        Hey I like that! Cool. I did try it and it seemed to work but now I find that my requirement is not as complicated as I had thought. See my updated post which says only the entire row needs to be a duplicate of a previous row before it’s deleted.

        Thanks a lot. Deb groovin

    Viewing 1 reply thread
    Reply To: Reply #594594 in deleting duplicate data in many rows (Excel 2000)

    You can use BBCodes to format your content.
    Your account can't use all available BBCodes, they will be stripped before saving.

    Your information:




    Cancel