Discussion:
[Help-bash] split
Val Krem
2018-05-26 17:28:05 UTC
Permalink
Hi All,
I wanted to split a big file based on the value of the first column of this file.

file 1 (atest.dat):
1 ah1 25
1 ah2 26
4 ah5 35
4 ah6 36
2 ah2 54

I want to split this file into three files based on the first column.
file 1 would be:
1 ah1 25
1 ah2 26
file 2 would be:
4 ah5 35
4 ah6 36
file 3 would be:
2 ah2 54
The range of the first column could vary from 1 up to 100.
I tried the following script:
################################################
#!/bin/bash
numb=($(seq 1 1 10))
for i in "${numb[@]}"
   do
     awk '{if($1=='"${i}"') print $0}' atest.dat   > numb${i}.txt
   done
#################################################

The above script gave me 10 files while I was expecting only 3 files.
How do I limit the output to only the three files that actually have data in atest.dat?


Thank you
Seth David Schoen
2018-05-26 22:29:23 UTC
Permalink
Post by Val Krem
Hi All,
I wanted to split a big file based on the value of the first column of thsi file
file 1 (atest.dat).1 ah1 251 ah2 26
4 ah5 354 ah6 362 ah2 54
I want to split this file into three files based on the first column
 file 1 will be
     1 ah1 25     1 ah2 26
file 2 would be    4 ah5 35    4 ah6 36
file three would be     2 ah2 54
The range of the first column could vary from 1 up to 100.
I trad  the following script
################################################
#! bin/bash
numb=($(seq 1 1 10))
for i in "${numb[@]}"
   do
     awk '{if($1=='"${i}"') print $0}' atest.dat   > numb${i}.txt
   done
#################################################
A more idiomatic use of awk here would be simply

awk '$1=="'$i'"'

because awk allows you to set a condition before the action, and its
default action is {print}.
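A sketch of that idiom, passing the shell variable in with awk's -v option instead of splicing it into the program text (the sample data here is an assumption about atest.dat's layout):

```shell
# Sample data assumed to match the thread's atest.dat layout.
printf '%s\n' '1 ah1 25' '1 ah2 26' '4 ah5 35' '4 ah6 36' '2 ah2 54' > atest.dat

# Condition before the action; the default action {print} is implied.
# -v passes the value in cleanly, with no quoting gymnastics.
i=4
awk -v id="$i" '$1 == id' atest.dat > "numb${i}.txt"
```

With -v there is also no risk of shell text being interpreted as awk code.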

One idea to limit the number of files created might be to start with

numb=($(cut -d' ' -f1 < atest.dat | sort -u))

which will literally only run the command for the values that actually
occur in the first column of atest.dat (whatever they are, whether they
are in the range 1 to 100 or not).
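Putting the two pieces together, a corrected loop might look like the following; the sample data and the single-space field delimiter are assumptions:

```shell
printf '%s\n' '1 ah1 25' '1 ah2 26' '4 ah5 35' '4 ah6 36' '2 ah2 54' > atest.dat

# Loop only over the values that actually occur in the first field.
for i in $(cut -d' ' -f1 atest.dat | sort -u); do
    awk -v id="$i" '$1 == id' atest.dat > "numb${i}.txt"
done
```

This produces numb1.txt, numb2.txt and numb4.txt, and nothing else.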

One inefficient thing about this is that you read the file one hundred
times (in your original version) or one more than the number of distinct
values in the first column (in my modified version). There are various
alternatives, such as writing a script in some language that appends
to the appropriate file in each case, which could even be a loop in
bash. For example

while read -r -a line; do
    printf '%s\n' "${line[*]}" >> "numb${line[0]}.txt"
done < atest.dat

This might not have faster overall I/O performance than the original
version because it will have to constantly open and close each file,
and also won't be able to do buffered writes. However, it will only
read through the original file once.

Another option could be to write the script in another language and
hold all of the files open, with references to them in a hash table,
and generate appropriate writes on each file by looking up its file
descriptor in the hash table.

Another option could be to sort the file first. Then an advantage is
that you know when to switch to a new output file because the first
field of the input line changes. However, with most sorts you may lose
the original relative order of the input lines.
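A sketch of the sort-first approach: with GNU or BSD sort, -s requests a stable sort, which preserves the relative order of lines sharing a key, and awk then needs at most one output file open at a time (sample data assumed):

```shell
printf '%s\n' '1 ah1 25' '1 ah2 26' '4 ah5 35' '4 ah6 36' '2 ah2 54' > atest.dat

# Stable sort on field 1 keeps each group's original line order;
# a new output file is opened only when the key changes.
sort -s -k1,1n atest.dat |
awk '$1 != prev { if (out) close(out); out = "numb" $1 ".txt"; prev = $1 }
     { print > out }'
```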
--
Seth David Schoen <***@loyalty.org> | No haiku patents
http://www.loyalty.org/~schoen/ | means I've no incentive to
8F08B027A5DB06ECF993B4660FD4F0CD2B11D2F9 | -- Don Marti
Greg Wooledge
2018-05-29 12:57:44 UTC
Permalink
Post by Val Krem
I wanted to split a big file based on the value of the first column of thsi file
You seem to be asking a lot of basic programming questions here.
Have you never programmed before? This is elementary work. Even a
first-year student should be able to do this.
Post by Val Krem
file 1 (atest.dat).1 ah1 251 ah2 26
4 ah5 354 ah6 362 ah2 54
Are you writing your email in a freakin' web browser? Instead of a text
editor in a terminal like a normal programmer?

This looks like you've completely corrupted the input file by feeding
it to a web browser, or a Windows-based "word processor". Where do the
lines begin and end? How much whitespace is actually present, and
where?
Post by Val Krem
I want to split this file into three files based on the first column
The range of the first column could vary from 1 up to 100.
FIELD.

A COLUMN is a single character.

A FIELD is a "word" composed of one or more characters, terminated or
delimited in some way. In your case there MIGHT be whitespace delimiters
between fields. It's hard to be sure because your sample input has been
corrupted.

Anyway....

The basic algorithm here is extremely simple.

1) Open three output file descriptors.

2) Read the input file line by line.

2a) For each line, examine the first field, and use that to decide which
output file to write to.

2b) Write the line to the appropriate file descriptor.

3) There is no 3. Once you reach the end of the input, you're done.
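The steps above can be sketched in bash; the mapping of field values to files (1 to file 1, 4 to file 2, 2 to file 3, taken from the original post's examples) and the sample data are assumptions:

```shell
printf '%s\n' '1 ah1 25' '1 ah2 26' '4 ah5 35' '4 ah6 36' '2 ah2 54' > atest.dat

# 1) Open three output file descriptors once, up front.
exec 3> file1.txt 4> file2.txt 5> file3.txt

# 2) Read line by line; 2a) examine field 1; 2b) write to the matching fd.
while read -r first rest; do
    case $first in
        1) printf '%s %s\n' "$first" "$rest" >&3 ;;
        4) printf '%s %s\n' "$first" "$rest" >&4 ;;
        2) printf '%s %s\n' "$first" "$rest" >&5 ;;
    esac
done < atest.dat

# 3) Close the descriptors; end of input means we're done.
exec 3>&- 4>&- 5>&-
```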
Post by Val Krem
 file 1 will be
     1 ah1 25     1 ah2 26
file 2 would be    4 ah5 35    4 ah6 36
file three would be     2 ah2 54
How does the first field value tell you which output FD to use?

You want "1" to map to file 1, and "4" to map to file 2, and "2" to map
to file 3? This is lunacy. There is no discernible pattern here. Where
does 17 go? Where does 42 go? Where does 100 go?

Is it supposed to be RANDOM?!
Post by Val Krem
I trad  the following script
################################################
#! bin/bash
numb=($(seq 1 1 10))
seq is Linux-only, and is stupid. If you want to loop 10 times, simply
write a for loop that counts to 10.
You don't need to store a list of the integers {1..10} in an array.
You could simply write

for i in {1..10}

Except, don't do that! Why are you looping 10 times? Where did you
get the number 10 from? Which part of the problem specification does
this represent? Which part of the algorithm that I described above
includes the number 10?
Post by Val Krem
   do
     awk '{if($1=='"${i}"') print $0}' atest.dat   > numb${i}.txt
This would have been a code injection vulnerability if you didn't already
know that $i will be an integer.

Also, you're reading the input 10 times instead of 1 time. Why?

Also also, you're only handling 10 of the possible 100 values
of the first input field. You're putting "4" into file 4, and so on.
Where does 11 go? Nowhere. Where does 47 go? Nowhere.

Also also also, your mapping does not match what you said you
wanted in each output file. You said that "4" should map to output
file 2. But you're putting "4" in output file 4.
Post by Val Krem
   done
#################################################
The above script gave me 10  files while I was expecting only 3  files.
Because you ran awk 10 times! If you looped 10 times, and you produced
a different output file each time, why are you surprised when there are
10 files?

Which part of your program had the number 3 in it? NONE!

Why would you expect 3 output files when you loop 10 times instead of
3 times?
Pierre Gaston
2018-05-29 13:22:56 UTC
Permalink
Post by Val Krem
Hi All,
I wanted to split a big file based on the value of the first column of thsi file
file 1 (atest.dat).1 ah1 251 ah2 26
4 ah5 354 ah6 362 ah2 54
I want to split this file into three files based on the first column
file 1 will be
1 ah1 25 1 ah2 26
file 2 would be 4 ah5 35 4 ah6 36
file three would be 2 ah2 54
The range of the first column could vary from 1 up to 100.
I trad the following script
################################################
#! bin/bash
numb=($(seq 1 1 10))
for i in "${numb[@]}"
do
awk '{if($1=='"${i}"') print $0}' atest.dat > numb${i}.txt
done
#################################################
The above script gave me 10 files while I was expecting only 3 files.
How do I limit to get only the three files that do have data in atest.dat file?
awk can do the redirection, you may hit a problem if the number of open
files is too high, but with gnu awk you should be fine.

awk '{print > ("numb" $1 ".txt")}' atest.dat

I'd suggest you use padding for the number so that lexical sorting works,
e.g.:

awk '{print > (sprintf("numb%03d.txt", $1))}' atest.dat
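If the number of distinct keys could exceed the process's open-file limit (a concern with awks that do not manage descriptors for you), one workaround is to append and close on every line; slower, but portable to any POSIX awk (sample data assumed):

```shell
printf '%s\n' '1 ah1 25' '1 ah2 26' '4 ah5 35' '4 ah6 36' '2 ah2 54' > atest.dat

# >> appends, so reopening a file does not truncate it; close() keeps
# at most one output file open at a time.
awk '{ f = sprintf("numb%03d.txt", $1); print >> f; close(f) }' atest.dat
```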
Greg Wooledge
2018-05-29 13:42:07 UTC
Permalink
Post by Pierre Gaston
awk can do the redirection, you may hit a problem if the number of open
files is too high, but with gnu awk you should be fine.
awk '{print > ("numb" $1 ".txt")}' atest.dat
i'd suggest you use padding for the number so that lexical sorting works
awk '{print > (sprintf("numb%03d.txt", $1))}' atest.dat
But....
Post by Pierre Gaston
Post by Val Krem
The above script gave me 10 files while I was expecting only 3 files.
We still have no idea what his basic splitting criteria are. We
can't write his program for him until we know this part.
Pierre Gaston
2018-05-29 13:50:42 UTC
Permalink
Post by Greg Wooledge
Post by Pierre Gaston
awk can do the redirection, you may hit a problem if the number of open
files is too high, but with gnu awk you should be fine.
awk '{print > ("numb" $1 ".txt")}' atest.dat
i'd suggest you use padding for the number so that lexical sorting works
awk '{print > (sprintf("numb%03d.txt", $1))}' atest.dat
But....
Post by Pierre Gaston
Post by Val Krem
The above script gave me 10 files while I was expecting only 3 files.
We still have no idea what his basic splitting criteria are. We
can't write his program for him until we know this part.
ok, my guess is that he just didn't want to end up with empty files.
Greg Wooledge
2018-05-29 13:55:36 UTC
Permalink
Post by Pierre Gaston
ok, my guess is that he just didn't want to end up with empty files.
And the part about input field "4" going into file "2" was just a lie
too? Yes, that's possible. Sadly.

I always overestimate the intelligence of people who are supposed to
be writing computer programs.
Val Krem
2018-05-31 03:24:34 UTC
Permalink
Thank you All for your help. I got what I wanted.

I am sorry for my elementary question to you.
Post by Pierre Gaston
ok, my guess is that he just didn't want to end up with empty files.
And the part about input field "4" going into file "2" was just a lie
too?  Yes, that's possible.  Sadly.

I always overestimate the intelligence of people who are supposed to
be writing computer programs.
Greg Wooledge
2018-05-31 12:39:36 UTC
Permalink
Post by Val Krem
Thank you All for your help. I got what I wanted.
But of course you won't tell us what you wanted, or which solution
satisfied it.

This mailing list is publicly logged for eternity. The main reason
people are willing to offer their time to help with questions on
this list (and most other lists) is because doing so creates a
permanent, searchable, record that can help others who have the same
problem in the future.
