Problems writing a large text file in Python

Thread Starter

Vilius_Zalenas

Joined Jul 24, 2022
192
Hi,

I have developed this semi-scientific code that acts as a TCP server. It listens for incoming messages, processes them and stores them in the computer's RAM. Once the client disconnects, the PC attempts to write all the gathered data to a text file. The complete code is quite bulky, but you only need to concentrate on two functions:


Python:
def print_decimal_chunks_to_file_and_terminal(data, file_path, timestamps, cps_values):
    global total_logs_received
    global filename
    try:
        data = list(data)
        timestamps = list(map(float, timestamps)) 
        cps_values = list(cps_values)
        lines_to_write = []
 
        for i in range(len(data)):
            packet_data = data[i]
            timestamp = timestamps[i]
            cps = cps_values[i]
            formatted_timestamp = time.strftime('%Y-%m-%d %H:%M:%S.', time.localtime(timestamp)) + f"{timestamp:.6f}"[11:]
            formatted_cps = f"{max(0, int(cps) - 1):d}" if cps is not None else "N/A"
            line = f"{formatted_timestamp} - {formatted_cps} CPS - "
            formatted_line = " ".join(f"{max(num - 2048, 0):4d}" for num in packet_data)
            line += formatted_line
            lines_to_write.append(line + "\n")
            total_logs_received += 1
        data_to_write = ''.join(lines_to_write)
 
        if os.path.getsize(filename) == 0:
            first_line = lines_to_write[0]
            with open(filename, "a+") as file:
                file.write(first_line)
            lines_to_write = lines_to_write[1:]
 
 
        remaining_data = ''.join(lines_to_write).encode()
        data_length = len(remaining_data)
            
        with open(filename, "r+b") as file:
            file.seek(0, os.SEEK_END)
            current_size = file.tell()
            file.write(b'\0' * data_length)
                
            mmapped_file = mmap.mmap(file.fileno(), current_size + data_length)
            mmapped_file.seek(current_size) 
            mmapped_file.write(remaining_data)
            mmapped_file.close()
 
    except Exception as write_error:
        print(f"Error writing to file: {write_error}")

def write_accumulated_data_to_file():
    global buffer_queue
    global timestamps
    global cps_values
    global filename
 
    with write_lock:
        if buffer_queue and timestamps:
            try:
                cps_values = calculate_cps(timestamps)
                with open(filename, "a+") as file:
                    print_decimal_chunks_to_file_and_terminal(buffer_queue, file, timestamps, cps_values)
                    file.flush()   
                buffer_queue.clear()
                timestamps.clear()
                cps_values = []
 
            except Exception as write_error:
                print(f"Error writing to file: {write_error}")
So the PC gathers data and processes it into its final form like this:

2024-12-18 14:11:03.031651 - 0 CPS - 485 702 925 1152 1319 1440 1511 1568 1602 1624 1631 1631 1599 1534 1441 1344 1245 1124 994 896 774 645 513 390 283 189 5 0 5 5 34 133 289

This is an example of one fully processed data packet. Depending on the environment and my application's use case, I may only need to write a few of these lines to a file, but it is also possible I will have to write up to half a million of them.

The problem is that the more lines I have to write, the longer it takes, and the growth is much worse than linear. For example: a couple of thousand lines takes a few seconds, but 50k+ lines may take up to 15 minutes, and 129k lines took half a day... I don't think it should be like that. I do not have any hardware limitations and I am ready to tweak the usual OS settings, but I want to believe it is possible to write 500k lines in under 5-10 minutes somehow...

I did my initial research: I am already using memory-mapped files, I avoid issuing many small write commands and try to do it all at once. I also tried saving the data as a CSV file or even as a binary file in HDF5 format, but it made almost no difference in how long the write takes. Not many quick fixes are left for me... So I am asking for help and observations on where the bottlenecks in my code are; I feel like I am missing some fundamental Python restriction or something else. It does not have to be a text file in the first place, but I need a working algorithm that converts all my data into lines like the example above. I do not have much experience with Python, but this time I cannot fundamentally change the program architecture: I cannot write data in small chunks as soon as it is received, I still have to write the whole buffer at once. Thank you for any help.
 


nsaspook

Joined Aug 27, 2009
16,249
I would first add debug profiling code that marks (prints to screen) the entry and exit times of the likely time-consuming operations: anything doing I/O (seeks, reads, writes, opens, closes) and any 'for' loops that repeat sets of commands over possibly large index ranges.
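
As a rough illustration of that kind of instrumentation, a small timing helper could be dropped around the suspect blocks. The timed helper and its labels below are placeholders, not something from the original code:

Python:
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # Print entry/exit markers and the elapsed time for the wrapped block.
    t0 = time.perf_counter()
    print(f"[{label}] start")
    try:
        yield
    finally:
        print(f"[{label}] end after {time.perf_counter() - t0:.3f} s")

# Usage around the suspect spots, e.g.:
# with timed("join lines"):
#     data_to_write = ''.join(lines_to_write)
# with timed("mmap write"):
#     mmapped_file.write(remaining_data)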
 

abrsvc

Joined Jun 16, 2018
159
Creating an arbitrarily large data stream and attempting to write it out as a single write makes no sense. If you have an average size for each buffer, break the data up into segments that are multiples of that size. For example, you mention that writing a few thousand packets takes a few seconds; write out that many packets at a time and loop back to start again. This method should result in a more linear increase in overall write time as packets are received. The underlying file system is likely the cause here, as the file needs to grow to accommodate the write of that overly large data "record".

Regardless of the above, at some point the current method could run out of memory space for the large buffer. Not virtual space, but the "real" memory available to the program itself in main memory, which will incur additional overhead to manage.

The bottom line: break the writes up into more manageable pieces. Since the packets end with \n (a newline) anyway, why not let the file system manage the records? That is why it exists. You are over-complicating the process for no performance gain.
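
To sketch what that might look like (the helper name and the 5000-line chunk size are just guesses to illustrate the idea, not values from the original code):

Python:
CHUNK_LINES = 5000  # assumed batch size; pick whatever takes "a few seconds" on your system

def write_in_chunks(path, lines):
    # Append the already-formatted lines in fixed-size batches
    # instead of one enormous write at the end.
    with open(path, "a") as f:
        for i in range(0, len(lines), CHUNK_LINES):
            f.writelines(lines[i:i + CHUNK_LINES])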

This is not a python issue as any language would result in the same performance outcome.

Dan
 
The first thing I notice is that you're iterating over the data WAAAAY more than you need to. Zip the timestamp, cps and data into a tuple and only iterate over it _once_. Do all your formatting inline.

lines_to_write.append(line + "\n") #<---- This is bad practice. Just write the data.
######
file.write(b'\0' * data_length) #<-- Ooophf! Forget about using sparse files then...
mmapped_file = mmap.mmap(file.fileno(), current_size + data_length) #<---- Forgot to flush buffered file after writing the \0s
# Also, you only need to mmap data_length if you use the offset keyword parameter instead of seeking...
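
For what it's worth, a hedged sketch of the flush-plus-offset variant (the helper name append_tail_mmap is made up; note that mmap offsets must be multiples of mmap.ALLOCATIONGRANULARITY, hence the rounding):

Python:
import mmap
import os

def append_tail_mmap(path, payload):
    # Illustrative only: extend the file, flush so the OS sees the new size,
    # then map just the tail of the file rather than the whole thing.
    with open(path, "r+b") as f:
        f.seek(0, os.SEEK_END)
        start = f.tell()
        f.write(b'\0' * len(payload))
        f.flush()  # the flush the original code is missing before mmap()

        gran = mmap.ALLOCATIONGRANULARITY  # offset must be a multiple of this
        aligned = (start // gran) * gran
        mm = mmap.mmap(f.fileno(), (start - aligned) + len(payload), offset=aligned)
        mm[start - aligned:] = payload
        mm.close()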


If I were you, I would rewrite this so it emits a stream. Then you can use any stream handler to write to the filesystem, or any other storage you may desire. Decouple the "filesystem stuff" from the "data processing stuff"...
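
Something along these lines, perhaps; the function names here are made up and the formatting simply mirrors the original code, but the point is a single pass over the zipped inputs that yields a stream of lines whose destination is the caller's problem:

Python:
import time

def format_lines(data, timestamps, cps_values):
    # One pass over the zipped inputs, formatting inline and yielding lines.
    for packet_data, ts, cps in zip(data, timestamps, cps_values):
        stamp = time.strftime('%Y-%m-%d %H:%M:%S.', time.localtime(ts)) + f"{ts:.6f}"[11:]
        cps_str = f"{max(0, int(cps) - 1):d}" if cps is not None else "N/A"
        samples = " ".join(f"{max(num - 2048, 0):4d}" for num in packet_data)
        yield f"{stamp} - {cps_str} CPS - {samples}\n"

# The consumer decides where the stream goes; a plain append to a file is one option.
def write_stream(path, data, timestamps, cps_values):
    with open(path, "a") as f:
        f.writelines(format_lines(data, timestamps, cps_values))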
 