Generate random pairs from a list of numbers making sure that the generated random pairs are not already present

By : Jordan
Source: Stackoverflow.com
Question!

Given a set of genes and a set of existing gene pairs, I want to generate new pairs of genes that do not already exist.

The genes file has the following format :

123    
134   
23455  
3242  
3423  
...  
...  

The genes pairs file has the following format :

12,345    
134,23455   
23455,343  
3242,464452  
3423,7655  
...  
...  

However, I still get a few common elements between known_interactions and new_pairs, and I'm not sure where the error is.

For the arguments,
perl generate_random_pairs.pl entrez_genes_file known_interactions_file 250000
I got 15880 common elements. The number 250000 tells the program how many random pairs I want it to generate.

#!/usr/bin/perl

use strict;
use warnings;

if (@ARGV != 3) {
    die "Usage: generate_random_pairs.pl <entrez_genes> <known_interactions> <number_of_interactions>\n";
}
my ($e_file, $k_file, $interactions) = @ARGV;

open (IN, $e_file) or die "Error!! Cannot open $e_file\n";
open (IN2, $k_file) or die "Error!! Cannot open $k_file\n";

my @e_file = <IN>; s/\s+\z// for @e_file;
my @k_file = <IN2>; s/\s+\z// for @k_file;

my (%known_interactions);

my %entrez_genes;
$entrez_genes{$_}++ foreach @e_file;

foreach my $line (@k_file) {
    my @array = split (/,/, $line);
    $known_interactions{$array[0]} = $array[1];
}
my $count = 0;

foreach my $key1 (keys %entrez_genes) {
    foreach my $key2 (keys %entrez_genes) {
        if ($key1 != $key2) {
            if (exists $known_interactions{$key1} && ($known_interactions{$key1} == $key2)) {next;}
            if (exists $known_interactions{$key2} && ($known_interactions{$key2} == $key1)) {next;}
            if ($key1 < $key2) { print "$key1,$key2\n"; $count++; }
            else { print "$key2,$key1\n"; $count++; }
        }
        if ($count == $interactions) {
            die "$count\n";
        }
    }
}


Answers

First of all, you are not chomping (removing newlines) from your file of known interactions. That means that given a file like:

1111,2222

you will build this hash:

 $known_interactions{1111} = "2222\n";

That is probably why you are getting duplicate entries. My guess is (can't be sure without your actual input files) that these loops should work ok:

 map{
    chomp;
    $entrez_genes{$_}++ ;
 }@e_file;

and

map {
    chomp;
    my @array = sort(split (/,/));
    $known_interactions{$array[0]} = $array[1];
}@k_file;

Also, as a general rule, I find my life is easier if I sort the interacting pair (the joys of bioinformatics :) ). That way I know that 111,222 and 222,111 will be treated in the same way and I can avoid multiple if statements like you have in your code.
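
A minimal sketch of that idea, assuming the gene IDs are plain integers. The helper pair_key and the sample data are hypothetical and not part of the original script. It puts the smaller ID first and joins the two with a comma, so both orderings of a pair give the same hash key. Using the whole pair as the key also keeps every partner of a gene, whereas a hash that maps one gene to one partner only remembers the last partner read for that gene.

```perl
use strict;
use warnings;

# Hypothetical helper: build an order-independent key for a gene pair.
sub pair_key {
    my ($x, $y) = @_;
    return join ',', sort { $a <=> $b } ($x, $y);
}

# Made-up example data: gene 111 has two known partners.
my @known_lines = ('111,222', '333,111');

my %known;
for my $line (@known_lines) {
    my ($x, $y) = split /\s*,\s*/, $line;
    $known{ pair_key($x, $y) } = 1;
}

# Either ordering of a known pair is found with a single lookup.
print "222,111 is known\n" if exists $known{ pair_key(222, 111) };
print "111,333 is known\n" if exists $known{ pair_key(111, 333) };
print "222,333 is new\n"   unless exists $known{ pair_key(222, 333) };
```

With a key like this, the check for an existing pair becomes a single exists test instead of two lookups with value comparisons.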

Your next loop would then be (which IMHO is more readable):

my @genes=keys(%entrez_genes);
for (my $i=0; $i<=$#genes;$i++) {
   for (my $k=$i; $k<=$#genes;$k++) {
     next if $genes[$i] == $genes[$k];
     # Sort the pair the same way as when %known_interactions was built
     my @pp=sort($genes[$i],$genes[$k]);
     # Skip the pair only when it is already a known interaction
     next if exists $known_interactions{$pp[0]}
          && $known_interactions{$pp[0]} == $pp[1];
     print "$pp[0],$pp[1]\n";
     $count++;
     die "$count\n" if $count == $interactions;
  }
}
By : terdon


I can see nothing wrong with your code. I wonder if you have some whitespace in your data - either after the comma or at the end of the line? It would be safer to extract just the digit fields with, for instance

my @e_file = map /\d+/g, <IN>;

Also, you would be better off keeping both elements of the pair as the hash key, so that you can just check the existence of the element. And if you make sure the lower number is always first you don't need to do two lookups.

This example should work for you. It doesn't address the random selection part of your requirement, but that wasn't in your own code and wasn't your immediate problem.

use strict;
use warnings;

@ARGV = qw/ entrez_genes.txt known_interactions.txt 9 /;

if (@ARGV != 3) {
    die "Usage: generate_random_pairs.pl <entrez_genes> <known_interactions> <number_of_interactions>\n";
}

my ($e_file, $k_file, $interactions) = @ARGV;

open my $fh, '<', $e_file or die "Error!! Cannot open $e_file: $!";
my @e_file = sort { $a <=> $b } map /\d+/g, <$fh>;

open $fh, '<', $k_file or die "Error!! Cannot open $k_file: $!";
my %known_interactions;
while (<$fh>) {
  my $pair = join ',', sort { $a <=> $b } /\d+/g;
  $known_interactions{$pair}++;
}

close $fh;

my $count = 0;
PAIR:
for my $i (0 .. $#e_file-1) {
  for my $j ($i+1 .. $#e_file) {
    my $pair = join ',', @e_file[$i, $j];
    unless ($known_interactions{$pair}) {
      print $pair, "\n";
      last PAIR if ++$count >= $interactions;
    }
  }
}

print "\nTotal of $count interactions\n";
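
The random selection part that the example above leaves out could be sketched roughly as follows. This is only one possible approach (rejection sampling), not something taken from the question or the answers. The function name random_new_pairs and the sample data are hypothetical, and the sketch assumes that the gene IDs are unique integers and that the known pairs are stored under "smaller,larger" keys as in the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: draw up to $wanted distinct random pairs from @$genes,
# skipping any pair whose "smaller,larger" key is already in %$known.
sub random_new_pairs {
    my ($genes, $known, $wanted) = @_;
    my $n = scalar @$genes;

    # Cap the request at a safe lower bound on the number of new pairs
    # available; without a cap the loop below might never finish.
    my $possible = $n * ($n - 1) / 2 - scalar(keys %$known);
    $wanted = $possible if $wanted > $possible;

    my %picked;
    while (scalar(keys %picked) < $wanted) {
        my $i = int(rand($n));
        my $j = int(rand($n));
        next if $i == $j;                      # a gene cannot pair with itself
        my $pair = join ',', sort { $a <=> $b } @{$genes}[$i, $j];
        next if $known->{$pair};               # reject known interactions
        $picked{$pair} = 1;                    # hash keys remove repeats
    }
    my @pairs = sort keys %picked;
    return @pairs;
}

# Made-up example data in the formats described in the question.
my @genes = (123, 134, 23455, 3242, 3423);
my %known = ('134,23455' => 1);
print "$_\n" for random_new_pairs(\@genes, \%known, 3);
```

Rejection sampling is simple and works well while the requested number of pairs is small compared with the number of possible pairs. When the request approaches that limit, most draws are rejected, and enumerating all new pairs and shuffling them is likely the better choice.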
By : Borodin

