edit big CSV file

Discussion in 'Computing and Networks' started by praondevou, Jan 17, 2013.

  1. praondevou

    Thread Starter AAC Fanatic!

    Jul 9, 2011
    I have a CSV file that contains 49 million lines and is 2Gb big.

    I would like to cut the last 13million lines and save only the first 36. Is there any software that can easily accomplish this?

    I have the necessary RAM to open such a big file. Wordpad doesn't do it. VIM opens it but it seems I can only delete the content of the current window.
  2. tshuck

    Well-Known Member

    Oct 18, 2012
    You could write a program that reads the file in, counts the number of newlines in the file, and save the file...

    it may take a while, though:p
  3. djsfantasi

    AAC Fanatic!

    Apr 11, 2010
    Why will it take a while? He only needs the first 36 lines...
    Write code to read in only 36 lines, write them to a new file and stop.

    set /a countlines=0
    if exist newfile.csv erase newfile.csv
    setlocal enabledelayedexpansion
    for /F %%l in (file.CSV) do set myline=%%l&call :doit
    goto :EOF
    set /a countlines=!countlines!+1
    if %countlines% GTR 36 exit /B
    echo %myline% >> newfile.CSV
  4. tshuck

    Well-Known Member

    Oct 18, 2012
    36 million...

    49 - 13 = 36... he just didn't add the million at the end...
  5. praondevou

    Thread Starter AAC Fanatic!

    Jul 9, 2011
    I'm using "TextFileSplitterv2.0.4" now. It works eventhough it takes quite some time.

    Thanks guys
  6. WBahn


    Mar 31, 2012
    I don't think there is anything you can do to keep it from taking some time. Since every line is, presumably, potentially a different length (even if it might have the same number of values, the values, such as -5 and 5674, occupy different lengths) you don't have much choice but to read every byte in until you get to the end of the data you are instered in. You do NOT necessarily need to read in the contents of the file beyond that.

    A trivial program that does not need hardly any RAM would be to open the file for reading and a new file for writing. Then read a character, increment a counter if it is a newline, and then write the character out to the new file. Repeat until you have the desired number of lines. This can be easily adapted to extract any subset of lines by just turning on and off whether you echo out to the new file.
  7. praondevou

    Thread Starter AAC Fanatic!

    Jul 9, 2011
    I tried several programs and only got it to work with the text editor. Strangely enough it runs equally fast on a 2GHZ/3GbRAM or on a 12 prozessor/20GbRAM machine...

    Anyway, I found a way even if it takes half an hour for each file to split.
  8. WBahn


    Mar 31, 2012
    It's not surprising that it takes about the same amount of time -- it would be very hard to do something like this on more than one processor.

    I think you can do a lot better than 30 minutes. (36/49)*2GB is about 1.5GB. You have to read in and write out, so that is 3GB of transfer. 3GB/30min is only about 1.7MB/s. Let's say that you can only sustain 10MB/s average transfer rate to the drive and that you are I/O bound, that would mean that it should take you about five minutes.
  9. WBahn


    Mar 31, 2012
    Okay, so I told myself to put my money where my mouth was. So I wrote a program to generate a 50 million line file that would be around 2GB (1.91GB).

    I then wrote another program to read the file, one character at a time, and export the first 39 million lines that it read to another file, which ended up at 1.40GB. It took that program 4 minutes and 8 seconds on a Toshiba Ultrabook under Win7.

    Here is the entire program:

    Code ( (Unknown Language)):
    2. #include <stdio.h>
    3. #include <stdlib.h>
    5. #define KEEP (36000000)
    6. #define INFILE "huge.txt"
    7. #define OUTFILE "big.txt"
    9. int main(void)
    10. {
    11.    int lines;
    12.    int c;
    13.    FILE *fp_i, *fp_o;
    15.    fp_i = fopen(INFILE, "rt");
    16.    if (!fp_i)
    17.    {
    18.       printf("ABORT - Input file failed to open.\n");
    19.       exit(EXIT_FAILURE);
    20.    }
    22.    fp_o = fopen(OUTFILE, "wt");
    23.    if (!fp_o)
    24.    {
    25.       printf("ABORT - Output file failed to open.\n");
    26.       fclose(fp_i);
    27.       exit(EXIT_FAILURE);
    28.    }
    30.    lines = 0;
    31.    while ((lines < KEEP) && (EOF != (c=getc(fp_i))))
    32.    {
    33.       putc(c, fp_o);
    34.       if ('\n'==c)
    35.          lines++;
    36.    }
    38.    fclose(fp_i);
    39.    fclose(fp_o);
    41.    return 0;
    42. }
    As you can see, the guts of it is a loop with three lines of code. Most of it is just error checking my file open operations.
    praondevou likes this.
  10. chrisw1990

    Active Member

    Oct 22, 2011
    i have a question.. whether iv missed someone elses suggestion..

    why read it in in one go? i have a pic application that reads in 512 bytes at a time and processes that, then processes the next 512.. why not do that.. but ya no.. more? :D
    read in 100 lines say.. process and output that.. then carries on with the next 100.. itll still take a while, but your codes reduced, processing overhead, and memory.. whether your code does that im not sure, hard to follow without being able to look at the program properly :)
  11. WBahn


    Mar 31, 2012
    My code reads the file ONE BYTE at a time and completely processes that byte before reading another byte from the file

    I could potentially get substantially higher throughput if I read on block (4096 bytes on this machine) at a time and buffer an entire block before writing to disk, but that depends on how smart the compiler and OS are at optimizing the read/write buffers for sequential single byte ping-pong reads interlaced with writes to two files.

    I don't know what you mean about hard to follow without being able to look at the program. That IS the program -- the entire program. You can copy it and paste it into a text file and compile it with your favorite ANSI-C compiliant compiler and run it. The file names are hardcoded for simplicity, but you can change them to whatever you like.