keep specific duplicate based on lookup table in R

Question!

Thank you to anyone who could help me with this. I have tried to figure this out for days now without luck. My apologies if the solution was out there but extensive web search did not help.

So I have two datasets, df and df2: df is my dataset where I have pseudo-duplicates (duplicates if I only consider certain variables) and df2 is my lookup table.

df <- data.frame(
  x = c("green", "green", "blue", "orange", "orange"),
  y = c("W12", "W12", "W12", "W11", "W12"),
  z = c(23, 54, 21, 16, 54)
  )
df2 <- data.frame(y=c("W12","W11"), z=c(54, 16))

So, we have:

> df
       x   y  z
1  green W12 23
2  green W12 54
3   blue W12 21
4 orange W11 16
5 orange W12 54

> df2
     y  z
 1 W12 54
 2 W11 16

I am looking for a way to not only weed out one of the duplicates based on (x,y), but also to tell R which one to keep based on the value of z in the lookup table. So here, keep record #2, but not based on its position in the dataset (in my real data, the value of z is sometimes large and other times small, depending on y).

I have tried using !duplicated() but cannot find a way to point to the reference table; it only lets me retain either the first record or the last.

df_dup<-df[c("x", "y")]
df[!duplicated(df_dup),]

I also tried something along the lines of

ddply(df,c("x", "y"), 
             function(v) {
               if (nrow(v)>1) v[which(c(df$y, df$z) %in% c(df2$y, df2$z)), ]
               if (nrow(v)==1) v
               }
               )

df %>% 
  group_by(x,y) %>% 
  filter(c(df$y,df$z) %in% c(df2$y,df2$z))

But something funky is happening here, and the %in% does not match the pairs exactly but any combinations of (y,z).

The output I am hoping for is

 df
       x   y  z
2  green W12 54
3   blue W12 21
4 orange W11 16
5 orange W12 54

But with Row#2 chosen not because it is the last row but because it matches the lookup table. In my longer dataset, the rows to keep might end up being the first or the second.

Thank you in advance again to anyone who can find a way to do this in R. Ultimately, I will need to do this on a gigantic dataset and with several variables as grouping variables with only one of them being part of the lookup table.

By : Marie T.


Answers

One approach is the following:

  1. Find all the rows that have duplicates for x and y in df. For this, we use Sven Hohenstein's answer found here:

    dup.ind <- which(duplicated(df[,c("x","y")]) | duplicated(df[,c("x","y")], fromLast = TRUE))
    
  2. We also want to keep all other rows (that do not have duplicates) in the result so we use setdiff to identify those:

    other.ind <- setdiff(seq_len(nrow(df)), dup.ind)
    
  3. From dup.ind keep only those for which the z value in df is equal to that in df2 for the matching y values. Here, df2$z[match(df$y[dup.ind], df2$y)] looks up the z value in df2 for each dup.ind:

    keep.ind <- dup.ind[df$z[dup.ind] == df2$z[match(df$y[dup.ind], df2$y)]]
    
  4. Subset the original df using c(keep.ind,other.ind). Here, we sort these to maintain the original order (but that is not necessary):

    result <- df[sort(c(keep.ind, other.ind)),]
    

Using your input data, the result is:

print(result)
##       x   y  z
##2  green W12 54
##3   blue W12 21
##4 orange W11 16
##5 orange W12 54
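
The four steps above can also be folded into one helper, which may be easier to reuse when there are several grouping variables. This is only a sketch: the function name keep_lookup_duplicate and its argument names are made up for illustration, and it assumes the lookup table has exactly one row per key value. It also guards against NA comparisons (for example when a y value is missing from the lookup table), which would otherwise produce rows of NA when subsetting.

```r
# Hypothetical helper (not part of the original answer).
# group_cols: columns that define a duplicate; key_col / value_col: lookup columns.
keep_lookup_duplicate <- function(df, lookup, group_cols, key_col, value_col) {
  keys <- df[, group_cols, drop = FALSE]
  # TRUE for every row that belongs to a duplicated group (not just the repeats)
  is_dup <- duplicated(keys) | duplicated(keys, fromLast = TRUE)
  # value the lookup table expects for each row's key (NA if the key is absent)
  expected <- lookup[[value_col]][match(df[[key_col]], lookup[[key_col]])]
  # %in% TRUE turns NA comparisons into FALSE
  matches <- (df[[value_col]] == expected) %in% TRUE
  # keep non-duplicated rows, plus duplicated rows that agree with the lookup
  df[!is_dup | matches, , drop = FALSE]
}

df <- data.frame(
  x = c("green", "green", "blue", "orange", "orange"),
  y = c("W12", "W12", "W12", "W11", "W12"),
  z = c(23, 54, 21, 16, 54)
)
df2 <- data.frame(y = c("W12", "W11"), z = c(54, 16))

result <- keep_lookup_duplicate(df, df2, group_cols = c("x", "y"),
                                key_col = "y", value_col = "z")
print(result)
```

Note that, like the step-by-step version, this drops a duplicated group entirely when none of its rows match the lookup table.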
By : aichao


I might do...

library(data.table)
setDT(df); setDT(df2)

ord = +is.na(df2[df, on=c("y", "z"), which=TRUE])
unique(df[ order(ord) ], by=c("x","y"))

        x   y  z
1:  green W12 54
2: orange W11 16
3: orange W12 54
4:   blue W12 21

This prioritizes rows with matches in df2; but if you want to do the opposite (as it looked like in an earlier version of the question), just put a - in the definition of ord instead of a +.


How it works:

X[Y, on, which=TRUE] returns, for each row of Y, the row(s) of X that are matched. If there are multiple matches, they are all returned (but in your lookup table, there's no reason to have repeats). If there is no match, a missing value is returned.

+is.na(w) where w is a vector of row numbers returns a vector we can sort by:

  • 1 if w is a missing value
  • 0 otherwise

unique(Y[order(ord)], by) sorts Y by our vector and then drops duplicates as usual, keeping the first observation per group. You could alternately do Y[order(ord), .SD[1L], by] for this step.
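
To make the explanation concrete, here are the intermediate values for the example data. This is just an illustrative walk-through: the names dt, lut, w and res are chosen here and are not in the code above, and the tables are built with data.table() directly rather than setDT().

```r
library(data.table)

dt <- data.table(
  x = c("green", "green", "blue", "orange", "orange"),
  y = c("W12", "W12", "W12", "W11", "W12"),
  z = c(23, 54, 21, 16, 54)
)
lut <- data.table(y = c("W12", "W11"), z = c(54, 16))

# For each row of dt, the matching row number in lut (NA when there is no match)
w <- lut[dt, on = c("y", "z"), which = TRUE]
# w is NA 1 NA 2 1

# 1 for unmatched rows, 0 for matched rows, so matched rows sort first
ord <- +is.na(w)
# ord is 1 0 1 0 0

# Sort matched rows to the top, then keep the first row of each (x, y) group
res <- unique(dt[order(ord)], by = c("x", "y"))
print(res)
```

Unlike dropping unmatched duplicates outright, this keeps one row per (x, y) group even when no row in the group matches the lookup table; in that case the first row in the original order survives.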

By : Frank



