edit big CSV file

praondevou · Jan 17, 2013

I have a CSV file that contains 49 million lines and is 2 GB in size.

I would like to cut the last 13 million lines and save only the first 36. Is there any software that can easily accomplish this?

I have the necessary RAM to open such a big file. Wordpad doesn't do it. VIM opens it but it seems I can only delete the content of the current window.

tshuck · Jan 17, 2013

You could write a program that reads the file in, counts the number of newlines in the file, and saves the file...

it may take a while, though

djsfantasi · Jan 17, 2013

Why will it take a while? He only needs the first 36 lines...
Write code to read in only 36 lines, write them to a new file and stop.

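@echo off
setlocal enabledelayedexpansion
set /a countlines=0
if exist newfile.csv erase newfile.csv
rem "delims=" keeps each line whole instead of splitting it on spaces and tabs.
rem Caveats: for /F skips blank lines, and delayed expansion strips "!" from the data.
for /F "usebackq delims=" %%l in ("file.CSV") do (
  set /a countlines+=1
  if !countlines! GTR 36 goto :done
  >>newfile.csv echo(%%l
)
:done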

tshuck · Jan 17, 2013

djsfantasi said:
Why will it take a while? He only needs the first 36 lines...
Write code to read in only 36 lines, write them to a new file and stop.

36 million...

I have a CSV file that contains 49 million lines and is 2 GB in size.

I would like to cut the last 13 million lines and save only the first 36.

49 - 13 = 36... he just didn't add the million at the end...

praondevou · Jan 17, 2013

I'm using "TextFileSplitterv2.0.4" now. It works even though it takes quite some time.

Thanks guys

WBahn · Jan 17, 2013

I don't think there is anything you can do to keep it from taking some time. Since every line is, presumably, potentially a different length (even if it might have the same number of values, the values, such as -5 and 5674, occupy different lengths) you don't have much choice but to read every byte in until you get to the end of the data you are interested in. You do NOT necessarily need to read in the contents of the file beyond that.

A trivial program that needs hardly any RAM would be to open the file for reading and a new file for writing. Then read a character, increment a counter if it is a newline, and then write the character out to the new file. Repeat until you have the desired number of lines. This can be easily adapted to extract any subset of lines by just turning on and off whether you echo out to the new file.
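As a rough illustration of the "turn the echo on and off" idea, here is a minimal sketch in C. The function name copy_line_range and its parameters first and last are made up for this example, and it assumes the caller has already opened both streams.

```c
#include <assert.h>
#include <stdio.h>

/* Copy lines first..last (1-based, inclusive) from in to out, one
   character at a time. Returns the number of lines copied.
   copy_line_range, first and last are illustrative names only. */
long copy_line_range(FILE *in, FILE *out, long first, long last)
{
   long line = 1;     /* number of the line currently being read */
   long copied = 0;   /* lines echoed to the output so far */
   int wrote = 0;     /* has any character of the current line been echoed? */
   int c;

   assert(in && out && first >= 1 && last >= first);

   /* Stop reading as soon as the last wanted line has been passed. */
   while (line <= last && (c = getc(in)) != EOF)
   {
      if (line >= first)   /* echo is "on" only inside the wanted range */
      {
         putc(c, out);
         wrote = 1;
      }
      if (c == '\n')
      {
         if (line >= first)
            copied++;
         line++;
         wrote = 0;
      }
   }
   if (wrote)
      copied++;            /* final line had no trailing newline */
   return copied;
}
```

Calling copy_line_range(in, out, 1, 36000000) would keep the first 36 million lines. Nothing past the last wanted line is read, and memory use stays constant regardless of file size.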

praondevou · Jan 17, 2013

WBahn said:
I don't think there is anything you can do to keep it from taking some time.

I tried several programs and only got it to work with the text editor. Strangely enough it runs equally fast on a 2 GHz/3 GB RAM machine or on a 12-processor/20 GB RAM machine...

Anyway, I found a way even if it takes half an hour for each file to split.

WBahn · Jan 18, 2013

It's not surprising that it takes about the same amount of time -- it would be very hard to do something like this on more than one processor.

I think you can do a lot better than 30 minutes. (36/49)*2GB is about 1.5GB. You have to read in and write out, so that is 3GB of transfer. 3GB/30min is only about 1.7MB/s. Let's say that you can only sustain 10MB/s average transfer rate to the drive and that you are I/O bound, that would mean that it should take you about five minutes.
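The back-of-the-envelope estimate above can be written as a tiny helper. This is only a sketch: estimate_seconds is a hypothetical name, and the model assumes the job is purely I/O bound, with every kept byte read once and written once at the same sustained rate.

```c
#include <assert.h>

/* Rough I/O-bound time estimate in seconds.
   mb_kept:  size of the part of the file that is kept, in MB
   mb_per_s: assumed sustained transfer rate, in MB/s
   Both names are illustrative, not from any library. */
double estimate_seconds(double mb_kept, double mb_per_s)
{
   assert(mb_per_s > 0.0);
   /* The kept data is read in and written out, so it crosses the drive twice. */
   return (2.0 * mb_kept) / mb_per_s;
}
```

With 1500 MB kept and 10 MB/s, estimate_seconds(1500.0, 10.0) gives 300 seconds, which is the five minutes quoted above. Going the other way, 3000 MB moved in 1800 seconds works out to about 1.7 MB/s.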

WBahn · Jan 18, 2013

Okay, so I told myself to put my money where my mouth was. So I wrote a program to generate a 50 million line file that would be around 2GB (1.91GB).

I then wrote another program to read the file, one character at a time, and export the first 36 million lines that it read to another file, which ended up at 1.40GB. It took that program 4 minutes and 8 seconds on a Toshiba Ultrabook under Win7.

Here is the entire program:

#include <stdio.h>
#include <stdlib.h>

#define KEEP (36000000)
#define INFILE "huge.txt"
#define OUTFILE "big.txt"

int main(void)
{
   int lines;
   int c;
   FILE *fp_i, *fp_o;

   /* "r" and "w" are the standard text modes; "rt"/"wt" are a compiler extension */
   fp_i = fopen(INFILE, "r");
   if (!fp_i)
   {
      printf("ABORT - Input file failed to open.\n");
      exit(EXIT_FAILURE);
   }

   fp_o = fopen(OUTFILE, "w");
   if (!fp_o)
   {
      printf("ABORT - Output file failed to open.\n");
      fclose(fp_i);
      exit(EXIT_FAILURE);
   }

   lines = 0;
   while ((lines < KEEP) && (EOF != (c=getc(fp_i))))
   {
      putc(c, fp_o);
      if ('\n'==c)
         lines++;
   }
   
   fclose(fp_i);
   fclose(fp_o);
   
   return 0;
}

As you can see, the guts of it is a loop with three lines of code. Most of it is just error checking my file open operations.

chrisw1990 · Jan 19, 2013

I have a question, in case I've missed someone else's suggestion.

Why read it in in one go? I have a PIC application that reads in 512 bytes at a time and processes that, then processes the next 512. Why not do that, but, you know, with more?

Read in 100 lines, say, process and output that, then carry on with the next 100. It'll still take a while, but your code is reduced, along with the processing overhead and memory. Whether your code does that I'm not sure; it's hard to follow without being able to look at the program properly.

WBahn · Jan 19, 2013

My code reads the file ONE BYTE at a time and completely processes that byte before reading another byte from the file.

I could potentially get substantially higher throughput if I read one block (4096 bytes on this machine) at a time and buffer an entire block before writing to disk, but that depends on how smart the compiler and OS are at optimizing the read/write buffers for sequential single byte ping-pong reads interlaced with writes to two files.
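For comparison, a block-at-a-time variant might look something like the sketch below. It is not the program that was timed above. The function name copy_first_lines_blocked and the BLOCK constant are made up for illustration, and 4096 is just the block size mentioned for this machine.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define BLOCK 4096   /* assumed block size; other machines may differ */

/* Copy the first `keep` lines from in to out, one block at a time.
   Returns the number of complete lines copied, or -1 on a write error.
   If the input has fewer lines, everything is copied. */
long copy_first_lines_blocked(FILE *in, FILE *out, long keep)
{
   char buf[BLOCK];
   long lines = 0;
   size_t n;

   assert(in && out && keep >= 0);

   while (lines < keep && (n = fread(buf, 1, sizeof buf, in)) > 0)
   {
      size_t used = 0;   /* bytes of this block that belong in the output */

      while (used < n && lines < keep)
      {
         char *nl = memchr(buf + used, '\n', n - used);
         if (!nl)
         {
            used = n;    /* no newline left: the rest is part of a longer line */
            break;
         }
         used = (size_t)(nl - buf) + 1;   /* keep up to and including the newline */
         lines++;
      }
      if (fwrite(buf, 1, used, out) != used)
         return -1;
   }
   return lines;
}
```

Since stdio normally buffers getc and putc on files anyway, the gain over the byte-at-a-time loop may turn out to be modest. It would need to be measured on the actual machine.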

I don't know what you mean about hard to follow without being able to look at the program. That IS the program -- the entire program. You can copy it and paste it into a text file and compile it with your favorite ANSI-C compliant compiler and run it. The file names are hardcoded for simplicity, but you can change them to whatever you like.
