Trivial question about text files and non-text files...

Thread Starter

Spacerat

Joined Aug 3, 2015
36
Hello,

All computer files are collections of 1s and 0s at the lowest level. Files can have different content, i.e. some files contain text, some images, etc., and can be opened by different types of application programs.

html files, txt files, json files, xml files, etc. are all examples of text files. How are they different from each other? What do they have in common?
I guess they are similar to each other in that they can all be opened with any text editor and the output is understandable. A text editor is able to interpret the bits in those files using ASCII or UTF decoding and render them as text. An HTML file is a bunch of characters (the tags and the text between them). A browser can somehow read an HTML text file and also give meaning to the tags in it, rendering the HTML file as more than plain text.

Even a .jpg file (an image) can be opened with a text editor, but the content is not rendered correctly. An image processing application can render the image, though. Does that mean that image processing applications do not support ASCII/UTF and use a different decoding strategy to interpret and display the bits in image files?

Thanks!
 

dl324

Joined Mar 30, 2015
14,704
How are they different from each other? What do they have in common?
It depends on what operating system you're using.

On Windows, binary files are treated differently than text files, so you have to specify file type when opening them (from a language like C). On Linux/Unix, all files are the same; just a collection of bytes.
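The same distinction shows up in Python's file modes, so here is a minimal sketch (file name and byte values are arbitrary): in binary mode every byte round-trips unchanged, while text mode decodes bytes to characters and translates line endings.

```python
# Sketch: text mode vs binary mode when opening files.
# In binary mode ("rb"/"wb") every byte passes through unchanged;
# in text mode ("r") the runtime decodes bytes to characters and
# translates "\r\n" line endings to "\n".
import os
import tempfile

# Bytes including \r\n, Ctrl-Z (historic DOS EOF marker) and NUL
payload = b"line one\r\nline two\x1a\x00 end"

path = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(path, "wb") as f:          # binary write: bytes stored verbatim
    f.write(payload)

with open(path, "rb") as f:          # binary read: identical bytes back
    raw = f.read()
assert raw == payload

# Text mode needs a character encoding; latin-1 maps each byte to one
# character, so the data survives, but \r\n comes back as a single \n.
with open(path, "r", encoding="latin-1") as f:
    text = f.read()
assert "\r" not in text
```

On Linux the only difference between the two modes is the decoding step; on Windows the newline translation happens at the C-runtime level too, which is why C programs there must pass `"rb"` for binary data.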
 

boostbuck

Joined Oct 5, 2017
204
A JPG can be opened with a text editor and it will be presented as a series of bytes mapped onto the ASCII character table for graphic presentation. As the file was not encoded using the ASCII table, the presentation does not resemble the source document.

The issue you are considering is that of using the same decoding rules for a file as were used to encode it into binary. Even a text file can have a higher level of presentation beyond the text - for example, a document encoded into an HTML file and subsequently viewed with a text editor will not resemble the source document.

A program object file could be opened with a text editor and viewed as a sequence of ASCII characters, but it has no relevance to the intended presentation of the file - the behaviour of a processor, which has no visual counterpart at all.
 

djsfantasi

Joined Apr 11, 2010
8,394
Any “text” file restricts itself to the printable (and understandable) character set in whatever encoding system is in use. ASCII or UTF-8.

The difference between straight text, HTML and other text files is the syntax they follow. Different syntaxes require different applications to render their content. For example, a browser renders HTML. Coding languages require a compiler or interpreter to execute.

Binary files (a compiled program, a picture file in JPG or MPx) are also interpreted via corresponding applications. While they can be viewed as raw files (binary or hex) in a text editor, they shouldn’t be edited. For example, most text editors will lose some of the binary values if edited and saved.

ALL files are binary. You can use a text editor to view them, but not to edit them. All files are organized by syntax, standards or purpose. What you need to learn is which applications are appropriate for each file type.
 

xox

Joined Sep 8, 2017
697
Even a .jpg file (an image) can be opened with a text editor, but the content is not rendered correctly. An image processing application can render the image, though. Does that mean that image processing applications do not support ASCII/UTF and use a different decoding strategy to interpret and display the bits in image files?


ALL file formats imply an encoding of one kind or another. The text editor for example was originally designed to interpret single-byte ASCII octets. Later UTF-8 support was almost universally adopted.


An image format, by the way, can absolutely be implemented in ASCII/UTF. Just look at SVG. The only problem with that approach is that it isn't very efficient. Formats such as the JPEG family instead focus on being as compact as possible. The file is essentially read in and processed at the raw binary level. Text only appears in specially demarcated "annotation" sections.
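To make the SVG point concrete, here is a sketch of a complete image file that is nothing but text (the file name and the drawing itself are arbitrary examples):

```python
# A complete image file written as plain text: a minimal SVG.
# Any text editor can open and meaningfully edit this; an SVG-capable
# viewer renders it as a red circle.
svg = """<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
  <circle cx="50" cy="50" r="40" fill="red"/>
</svg>
"""

with open("circle.svg", "w", encoding="utf-8") as f:
    f.write(svg)

# The same bytes read back are readable text, unlike a JPEG's
# entropy-coded data.
with open("circle.svg", "rb") as f:
    data = f.read()
assert data.decode("utf-8").startswith("<svg")
```

The price of that readability is size: describing each pixel-level detail in characters takes far more bytes than JPEG's binary coding.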


At the end of the day of course it is all really nothing more than an arbitrary matter of convention. You could create your very own "custom" file format for that matter, then write some program which allows you to "edit" such files. How the editor maps its symbols to binary would be completely up to you.
 

Thread Starter

Spacerat

Joined Aug 3, 2015
36
Thank you everyone! Much appreciated.

Summarizing my thoughts below after digesting your comments and looking further into it:

  • All computer files (images, sound files, programs, text files) are binary: they are all sequences of bytes or, equivalently, long sequences of decimal numbers from 0 to 255. However, in common tech jargon, files are divided into text files (which can be opened usefully with a text editor) and binary files (everything that is not text, like images, applications, sound, etc.).
  • Format: every file has a format (format = encoding), i.e. rules that describe the file structure. The format is often signalled by the first few bytes of the file, the magic number. Ex: all PKZIP files start with 0x50 0x4b 0x03 0x04. Programs that handle zip files read the magic number first and, if it is not recognized, refuse to open the file.
  • Extension: files also have an extension. The extension is not a rigid attribute; it is just a hint of what type of file it may be. For example, we could save an image with a .txt extension.
  • Character encoding: Text files don't have a magic number. A text editor therefore opens/processes/renders any file and assumes there is a one-to-one relation between bytes and the characters of a certain alphabet. This relation is the character encoding (there are many). Many alphabets means many different encodings, hence the concepts of codepoints (numbers) and characters. Images also have an encoding, which is understood by image processing programs.
  • Text files and structure: not every text file is plain text. Different text files have different structure. Ex: HTML and XML have tags that are text but carry additional meaning. The syntax may be the same but the semantics are different...
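The magic-number idea in the summary can be sketched directly; the PKZIP signature 0x50 0x4b 0x03 0x04 is the real one, and the standard-library `zipfile` module builds a genuine archive to check against (file contents here are arbitrary):

```python
# Sketch of a magic-number check: PKZIP archives begin with the four
# bytes 0x50 0x4B 0x03 0x04 ("PK\x03\x04").
import io
import zipfile

# Build a real (in-memory) zip file with the standard library.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", "hello")

def looks_like_zip(data: bytes) -> bool:
    """Return True if the data starts with the PKZIP magic number."""
    return data[:4] == b"\x50\x4b\x03\x04"

assert looks_like_zip(buf.getvalue())
assert not looks_like_zip(b"plain text, no magic number")
```

This is exactly what archive tools do before attempting a full parse: a cheap sniff of the first few bytes, with the extension treated only as a hint.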

Thank you for any correction.
 

Ya’akov

Joined Jan 27, 2019
6,044
[This is not to disagree with anything previously written here, rather to possibly provide an additional perspective that might help clarify things further.]

One of the confusing aspects of the terminology used in computer science and engineering is its contextual quality. While sometimes language is precise and careful, the same language can turn up being used in a more colloquial way.

This is compounded by the fact that there are layers in all computer applications and the names given to things can shift meanings depending on the layer under consideration.

For example, binary is, of course, just base-2 numbering. Computers use on and off signals for everything, so binary arithmetic is a basic description of many operating parts.

But binary can mean something only has two states, or it can mean math performed with only the symbols 0 and 1. These two things are clearly closely related, but they are not the same. Our brains, being relentless pattern-recognition engines, notice the strong similarities (and the actual fundamental commonalities) and try to treat the differences as properties the two share, when they are, in fact, differences.

As you point out, "all computer files are collections of 1s and 0s at the lowest level", which is true to a great extent, but not quite. At the lowest level, computer files are a scattered set of electrical charges (in memory) or magnetic domains (on disk). The way in which you are correct is if you understand the term file as a reference only to the collection of these physical entities, which is done by the index of the filesystem.

In other words, so long as the file is understood to be only a logical entity, then we can speak in terms like "collection of 1s and 0s", because *collection*, *1*, and *0* in that phrase are all at the same layer: the logical layer. Any time we shift layers in the architecture of the computer system, we have to abandon the idea that the terminology we use necessarily resembles any particular aspect of something at a different layer with the same name.

So, if you look at your very first sentence, you will find the key to the answer embedded in it. You spoke about files as collections of symbolic bits, and that's where you find your answer.

The key here is to add in the idea of *information*. Files are intended to be logical containers for information. In order to interpret the bits encoded in a file, we need to understand the *format* that organizes the information in it. And so you hear about *file formats*.

A file format is an agreed upon way of organizing the bits collected together into a logical file so that the information stored in that file can be retrieved from it.

In the simplest case, we have the format imposed on a file by the underlying physical constraints and architectural choices made at the layers below the one where we are dealing with "files" at all. The nature of the source of the data, the hardware, and the various layers of operating system code all require that a file, no matter what its high-level format might be, share a low-level format with all other files on the system.

So if we simply stuff the bits from some source into the basic file, the data will be there, and no matter the source, it will look the same. But, interpreting the data will require that we know the source, and can extract the information by organizing the data in relation to itself.

One format, which is not a file format but lower level, is a character encoding like ASCII or UTF-8. For any given file, we can use that encoding (format) to interpret the data. But if the information in the file is not symbolically encoded, that is, if it isn't written language that uses some character encoding, we will not get the information in any form we can use; it will be gibberish.

Even if it is symbolically encoded language, if we choose the wrong character encoding, the interpretation by higher layers of the operating system (i.e. the display) will produce something that is either nonsense or, as a side effect of how the schemes were developed, "broken", incomplete text.

If we are dealing with files of textual data, the character encoding is the lowest layer. It is a little bit of presentation, since our programs need to know which glyphs to present to us, but it is just text, with newlines that can break it into paragraphs.

But in the case where we want more information about presentation we need some format that will add data about things like typefaces, sizes, colors, positions, and the like. So, there are formats like DOCX and HTML and TEX. These use symbols encoded in the character stream to indicate to the interpreting program not only what something says but how it should look.

Note when you said, "files can have different content", that's true, but only at the level of the interpreting programs. If we move back down, we can see they only have the information as a collection of zeroes and ones. What they do have is different sorts of information encoded in them. The "content" is an ambiguous idea. It depends entirely on how you choose to interpret the data.

So, every part of this process involves knowing what format was used to store the information in the file. How to know that can vary, but there are two primary elements. The first is the file extension: .jpg, .txt, .html and the like are hints as to which program should attempt to interpret the file.

They don't determine anything, as you pointed out, and the "wrong" program will still attempt to interpret the file with various results from displaying gibberish to emitting an error like "not a valid jpeg image".

So how does it know that it's not valid? Because of the second element: the internal file structure, particularly but not exclusively *headers*.

The header in a JPEG image is precisely defined. Here's the struct used in C code to define it for the program:

typedef struct _JFIFHeader
{
    BYTE SOI[2];          /* 00h Start of Image Marker */
    BYTE APP0[2];         /* 02h Application Use Marker */
    BYTE Length[2];       /* 04h Length of APP0 Field */
    BYTE Identifier[5];   /* 06h "JFIF" (zero terminated) Id String */
    BYTE Version[2];      /* 0Bh JFIF Format Revision */
    BYTE Units;           /* 0Dh Units used for Resolution */
    BYTE Xdensity[2];     /* 0Eh Horizontal Resolution */
    BYTE Ydensity[2];     /* 10h Vertical Resolution */
    BYTE XThumbnail;      /* 12h Horizontal Pixel Count */
    BYTE YThumbnail;      /* 13h Vertical Pixel Count */
} JFIFHEAD;


Note the "JFIF", which is actually readable if you look inside a JPEG file. It is a standard called the JPEG File Interchange Format. It is intended to make JPEGs usable no matter which application generates them, because in spite of our everyday experience, JPEG files do have to be compatible across implementations to still be JPEG encoded data, which is something from a different layer.
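As a sketch, the same header can be unpacked field by field outside C. The byte string below is hand-built for illustration (a real file's version, density, and thumbnail fields will vary), but the SOI marker (FF D8), APP0 marker (FF E0), and the zero-terminated "JFIF" identifier are the standard values:

```python
# Sketch: parsing the JFIF header fields from the C struct above,
# using a hand-built example header (field values are illustrative).
import struct

header = (b"\xff\xd8"            # SOI  - Start of Image marker
          b"\xff\xe0"            # APP0 - Application Use marker
          b"\x00\x10"            # Length of APP0 field (big-endian 16)
          b"JFIF\x00"            # zero-terminated identifier string
          b"\x01\x01"            # JFIF version 1.1
          b"\x00"                # units (0 = aspect ratio only)
          b"\x00\x01\x00\x01"    # Xdensity, Ydensity
          b"\x00\x00")           # thumbnail width/height (none)

# ">" = big-endian, "2s" = 2 raw bytes, "H" = 16-bit unsigned, "B" = byte
soi, app0, length, ident, ver_hi, ver_lo = struct.unpack(">2s2sH5sBB",
                                                         header[:13])
assert soi == b"\xff\xd8" and app0 == b"\xff\xe0"
assert ident == b"JFIF\x00"      # the readable "JFIF" seen in a text editor
print(f"JFIF version {ver_hi}.{ver_lo}, APP0 length {length}")
```

A real decoder would validate these fields exactly this way before touching the compressed image data, and emit "not a valid jpeg image" when the markers or identifier don't match.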

So, you can see that character encodings aren't fundamental to files; they are only relevant to files containing character-based information. And, even when they are used, they can be accompanied by other elements as part of a format which explains how to interpret whatever else is alongside them.

One last point: the idea of a "binary" file is one of those language-based confusions which, unfortunately, we are stuck with. There is nothing more "binary" about a file that is not character information, but because of our tendency to conflate attributes of things that share names, the term has simply come to mean a blob of data that isn't intended for presentation, is instead meant for use by some other layer of the system (executable code, program input data, image data, etc.), and does not follow the character-encoding conventions.
 
Last edited:

SamR

Joined Mar 19, 2019
4,303
One of the tricks I used with the OpenVMS operating system (which only had a really crappy line editor similar to MS-DOS Edlin) was to use a Telnet app on my office Windows system to transfer .txt files from the VMS (using our plant fiber ethernet) to my Windows 95 desktop. Then I could use any full-screen text editor (I seem to remember using Notepad) to create or edit control algorithms, to be telnetted back to the VMS system, where the control system's compiler would integrate the changes into the GSE Solutions Distributed Control System used for that operations area. As DJ and Yaakov said, ASCII text is the same no matter which operating system it runs on. Saved quite a few long industrial bicycle (like pedaling a tank) rides out to the operations area to make changes in the control room.
 
Last edited:

Ya’akov

Joined Jan 27, 2019
6,044
I don’t think we have to worry about files that are a collection of trits for some time. They do not even have a material implementation, just a simulation. I am also skeptical about the hand-waving about applications. He really didn’t specify anything, just how amazingly good it all is.

So for now in any case I think I’ll wait for a practical demonstration before I get too excited.
 

djsfantasi

Joined Apr 11, 2010
8,394
One of the tricks I used with the OpenVMS operating system (which only had a really crappy line editor similar to MS-DOS Edlin) was to use a Telnet app on my office Windows system to transfer .txt files from the VMS (using our plant fiber ethernet) to my Windows 95 desktop. Then I could use any full-screen text editor (I seem to remember using Notepad) to create or edit control algorithms, to be telnetted back to the VMS system, where the control system's compiler would integrate the changes into the GSE Solutions Distributed Control System used for that operations area. As DJ and Yaakov said, ASCII text is the same no matter which operating system it runs on. Saved quite a few long industrial bicycle (like pedaling a tank) rides out to the operations area to make changes in the control room.
Another lost technique was to use the MS-DOS program DEBUG, to manually edit a data file that contained mixed formats. One could modify floating point values with DEBUG. One could patch executables with DEBUG. One could extend a mixed format data file with debug. One could split a data file into two smaller files with DEBUG.

It was a great, low-level tool that solved many problems.
 

sparky 1

Joined Nov 3, 2018
718
The question involves a comparison of one instruction set to a higher-level instruction set. Lately, probability about 1s and 0s is ideal theory.
If you begin with machine language, which is considered a low-level language, then it is a logical compare-and-contrast for the elements found in a text file. Having some idea about how text is read in and out of memory, you might compare a basic text file's format capability with that of an HTML text file. The no-frills text editor named Notepad.exe (1983) was 6.539,000 bytes long. HTML has the capability of a web page. It is considered a markup language. The delimiters of some of the text files are a subset of markup language delimiters. PHP is another language that can dynamically convert text files.


Hello World in machine code
b8 21 0a 00 00
a3 0c 10 00 06
b8 6f 72 6c 64
a3 08 10 00 06
b8 6f 2c 20 57
a3 04 10 00 06
b8 48 65 6c 6c
a3 00 10 00 06
b9 00 10 00 06
ba 10 00 00 00
bb 01 00 00 00
b8 04 00 00 00
cd 80
b8 01 00 00 00
cd 80
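Reading the opcodes, this is 32-bit x86 Linux machine code: four `mov eax, imm32` (b8) / `mov [addr], eax` (a3) pairs store "Hello, World!\n" into memory four bytes at a time, then registers are loaded and `int 0x80` (cd 80) invokes the write system call (eax=4, fd in ebx=1) followed by exit (eax=1). A rough high-level sketch of the same behaviour:

```python
# Rough equivalent of the machine code above, under the reading that it
# stores "Hello, World!\n" four bytes at a time and then calls
# sys_write and sys_exit via int 0x80.
import os

# The four dword stores, in memory order: "Hell", "o, W", "orld", "!\n"
msg = b"Hell" + b"o, W" + b"orld" + b"!\n"

os.write(1, msg)   # sys_write(fd=1, buf, len), i.e. eax=4 / int 0x80
```

The machine code carries no character-encoding metadata at all: the bytes it writes only become "text" because the terminal chooses to interpret them as ASCII.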
 
Last edited:

MrAl

Joined Jun 17, 2014
9,186
Hello,

All computer files are collections of 1s and 0s at the lowest level. Files can have different content, i.e. some files contain text, some images, etc., and can be opened by different types of application programs.

html files, txt files, json files, xml files, etc. are all examples of text files. How are they different from each other? What do they have in common?
I guess they are similar to each other in that they can be opened with any text editors and the output is understandable. A text editor is able to interpret the bits in those files using ASCII or UTF decoding and render them as text. A html file is a bunch of characters (the tags and the text between them). A browser can somehow read a html text file and also give meaning to the tags in it rendering the html file as more than text.

Even a .jpg file (a bitmap) can be opened with a text editor but the content is not rendered correctly. An image processing application can render the image though. Does that mean that image processing applications do not support ASCII/UTF and uses a different decoding strategy to interpret and display the bits in image files?

Thanks!
There are different types of files because they all have specific purposes. First came text files, then image files, and later text and images in one file.
One of the problems in programming is that you can't really do everything with one program. It's easier to break things into individual programs for specific purposes, and maybe mix one or two.

But you can't always see every single byte of an arbitrary file with a text editor. Historically, programs reading in text mode (on CP/M, DOS and Windows) treated the Ctrl-Z character (0x1A) as an end-of-file marker, even if more bytes were stored after that character; a reader working that way stops early. Modern editors generally read the whole file in binary, but can still misdisplay bytes that are not valid in the assumed character encoding.
Now jpg files end up having bytes that are essentially random. That's because the compression schemes used to make an image smaller produce arbitrary strings of 1s and 0s during the Huffman part of the encoding. Before even that, there is a discrete cosine transform that creates bytes which look random to a program that can't decode them. That means at some point within the file, probably well before the actual end of the image, a 0x1A byte will appear, and a text-mode reader stops reading there and just displays what it already has. With a 100 kB jpg file, such a reader might show only a tenth of the file for this reason.

I happen to like pure text files because they are so simple yet can convey so much information.
I like image files too but they have to be read knowing something about the possible formats that may be encountered so it's a much more difficult task to create an image viewer.
I happened to create an image viewer that can view most image file formats. It was quite a long road to getting something decent that can do all kinds of stuff like rename files, move files, search for files, etc.
But in the image decoder section I rely on the Windows GDI Plus API to load image files, convert them to other types, and load them into memory for custom processing. Before that, all it could read were bmp, jpg, and gif files. With the GDI Plus library you can read a whole bunch of formats. So that makes that part of the program easier. The rest is file management, really.

If you really want to read a jpg (or any other file) with a text editor, you can create a small program that reads the bytes and converts them to ASCII hex, so a byte value of 97 would produce its hex equivalent and the text editor would read that. That would display every byte (all bits) in the file regardless of what kind it was. If it is a very large file, however, you have to read part of it at a time or it will take too long and use a lot of memory.
This kind of program would be called a hex editor.
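A minimal sketch of that idea (function and file names are arbitrary): convert any file's bytes to hex text so a text editor can display every byte, reading in chunks so a large file never has to fit in memory at once.

```python
# Minimal hex-dump sketch: every byte of any file becomes printable text.
def hex_dump(path, width=16):
    """Yield one text line per `width` bytes of the file."""
    with open(path, "rb") as f:   # binary mode: no byte acts as an EOF marker
        offset = 0
        while chunk := f.read(width):          # chunked read: bounded memory
            hexpart = " ".join(f"{b:02x}" for b in chunk)
            # printable ASCII shown alongside, '.' for everything else
            text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
            yield f"{offset:08x}  {hexpart:<{width * 3}} {text}"
            offset += len(chunk)

# Demo: the byte value 97 ('a') comes out as the hex text "61"
with open("demo.bin", "wb") as f:
    f.write(bytes([97, 0, 255]))
lines = list(hex_dump("demo.bin"))
for line in lines:
    print(line)
```

This is exactly the view a hex editor gives: the data itself is untouched; only the presentation layer changes from "decode as characters" to "spell out each byte".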
 

MrChips

Joined Oct 2, 2009
26,526
Thank you everyone! Much appreciated.

Summarizing my thoughts below after digesting your comments and looking further into it:

  • All computer files (images, sound files, programs, text files) are binary: they are all sequences of bytes or, equivalently, long sequences of decimal numbers from 0 to 255. However, in common tech jargon, files are divided into text files (which can be opened usefully with a text editor) and binary files (everything that is not text, like images, applications, sound, etc.).
  • Format: every file has a format (format = encoding), i.e. rules that describe the file structure. The format is often signalled by the first few bytes of the file, the magic number. Ex: all PKZIP files start with 0x50 0x4b 0x03 0x04. Programs that handle zip files read the magic number first and, if it is not recognized, refuse to open the file.
  • Extension: files also have an extension. The extension is not a rigid attribute; it is just a hint of what type of file it may be. For example, we could save an image with a .txt extension.
  • Character encoding: Text files don't have a magic number. A text editor therefore opens/processes/renders any file and assumes there is a one-to-one relation between bytes and the characters of a certain alphabet. This relation is the character encoding (there are many). Many alphabets means many different encodings, hence the concepts of codepoints (numbers) and characters. Images also have an encoding, which is understood by image processing programs.
  • Text files and structure: not every text file is plain text. Different text files have different structure. Ex: HTML and XML have tags that are text but carry additional meaning. The syntax may be the same but the semantics are different...

Thank you for any correction.
You got that right.
All files, text or non-text, are a collection of 0s and 1s.
You can make up whatever rules you desire in order to decode the information in the file.
 

Ya’akov

Joined Jan 27, 2019
6,044
All computer files (images, sound files, programs, text files) are binary: they are all sequence of bytes or, equivalently, a long sequence of decimal numbers going from 0 to 255. However, in common tech jargon, files can be divided into text files (can be open using text editor) and binary files (all files that are not binary, like images, applications, sound, etc.).
This is wrong in an important way. The "bytes" are just a conventional way of organizing the bits, like character encoding at a different level. The files are just bits scattered all over disk and/or memory, with maps to indicate the addresses of groups of bits that are related to each other, and their order.

Depending on what level you inspect a "file", you will see various sorts of organization, or none at all. Without the map, which is a table of locations for the file's contents, finding it would be nearly impossible*, because while chunks of data will be written in order in various places, the free space available to write files is all over the disk (or memory).

*There are ways of recovering files by using clues about the file format and contents but it doesn't really change the idea.
 

eetech00

Joined Jun 8, 2013
3,241
Some comments:

Thank you everyone! Much appreciated.

Summarizing my thoughts below after digesting your comments and looking further into it:

  • All computer files (images, sound files, programs, text files) are binary: they are all sequences of bytes or, equivalently, long sequences of decimal numbers from 0 to 255. However, in common tech jargon, files are divided into text files (which can be opened usefully with a text editor) and binary files (everything that is not text, like images, applications, sound, etc.).
All computer files (images, sound files, programs, text files) at the lowest level contain binary information.

That said, the term "text file" (previously known as "ASCII text file") is used to describe a file that is written using the standard ASCII character set. It can be read and displayed by a standard ASCII text editor. Such an editor reads/writes/displays the ASCII content in human-readable form.
A "binary file", on the other hand, describes a file whose content contains byte values beyond the standard ASCII character set (though it may include ASCII characters). Because of this, it may partially display in an ASCII text editor, but usually ends up as a partially scrambled view of the content. If the editor doesn't understand the content, then it cannot properly read, write, or display it. So binary files are written to be read by specific programs or operating systems.

  • Format: every file has a format (format = encoding), i.e. rules that describe the file structure. The format is often signalled by the first few bytes of the file, the magic number. Ex: all PKZIP files start with 0x50 0x4b 0x03 0x04. Programs that handle zip files read the magic number first and, if it is not recognized, refuse to open the file.
The content of each file follows a specific "format". The format is a specific sequencing of characters designed to be recognized by specific programs or operating systems. Some programs can read/recognize multiple formats.

  • Extension: files also have an extension. The extension is not a rigid attribute; it is just a hint of what type of file it may be. For example, we could save an image with a .txt extension.
The file type may provide a hint of the file format. However, a specific file extension name (such as .txt or .jpg) may or may not be required by the program or operating system reading the file.
 

MrAl

Joined Jun 17, 2014
9,186
Hello again,

There is one more little caveat, raw disk space vs file space.

The file space would be the bytes that are contained within a file of a specific type. For example, if a text file or jpg file has 1000 bytes in it, then the program for that file type would read 1000 bytes (possibly a larger amount, dropping the extra bytes) and just use those 1000 bytes.

The raw disk space would be the actual sectors of the disk which contain many bytes for many files.
Then also the allocation unit is the smallest amount of disk space that can be used by a file on the disk. For the above file with 1000 bytes, if the allocation unit was 1024 bytes then the file would actually take up 1024 bytes, so there would be 24 bytes not used but also not usable by any regular program. These bytes can be read with a raw disk routine, although they usually won't do any good unless you want to clone a drive exactly as it is, in which case the entire sector is read and stored on the new disk.
Then there is a file index section that stores the names of the files and where they are located on the disk. This varies with different types of file systems.

As the above example of 1000 bytes shows, there were 24 bytes of wasted disk space, but with a default allocation unit size of 4096 bytes there would be more than 3000 bytes wasted. With an allocation unit size of 2048 bytes there would be 1048 bytes wasted, and if there were a lot of files with 1000 bytes that means more than half of the entire disk space would be wasted, with no good way to use that space unless you combine files into libraries or albums where a lot of files are connected end to end to make one big file. Of course then you need a companion program to read the individual files. A .zip compression program does this and also saves some more space by compressing each file. Bmp files, for example, often compress down to one tenth of their original size, while jpg files are already compressed so they don't reduce as much.
On the other hand, if many of the files are very large and there is an allocation unit size of 4096 bytes, then there will be much less wasted space, because each file will take up several full allocation units and only at the end of the file will there be some wasted space. If the file size was 8191 bytes, then only 1 byte would be wasted, for example.
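The slack-space arithmetic above fits in a few lines (function name is arbitrary): a file occupies whole allocation units, so the waste is the unused tail of the last unit.

```python
# Sketch of the allocation-unit ("cluster") slack calculation above.
def slack(file_size, alloc_unit):
    """Bytes allocated to the file but unusable at the end of its last unit."""
    remainder = file_size % alloc_unit
    return 0 if remainder == 0 else alloc_unit - remainder

# The worked examples from the post:
assert slack(1000, 1024) == 24     # 1000-byte file, 1024-byte unit
assert slack(1000, 4096) == 3096   # "more than 3000 bytes wasted"
assert slack(1000, 2048) == 1048
assert slack(8191, 4096) == 1      # large file: only 1 byte wasted
```

The same function makes the trade-off clear: small allocation units waste little space per file but need a bigger file index; large units waste more per file but keep the index small.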

Disk organization systems are interesting to study to get a good feel for how all this works.
 

MrAl

Joined Jun 17, 2014
9,186
Oh, here is another interesting fact about text files and related topics.

Say for example you want to name your files with a name that is currently not accepted by Windows OS.
That system does not allow you to use characters like forward slash, backslash, colon, etc. So if you want to name the file say
"MyFile:File1:YesterdayLog\Logs"
You can't do it.
In this case what you could do, should you be able to program just a little, is create a program that reads the first line of the file and displays that in a listing instead of (or along with) the actual name.
So the text file you create to say log voltages would look like this:
"MyFile:File1:YesterdayLog\Logs"
120 volts
119 volts
121 volts
117 volts.

The program you create would read the first line and use that as the name, then send the rest of the lines to a text editor, or to a text reader program with the extra code to read the first line and use it as the file name.
You could name all the files in the file system itself as File_00000001, File_00000002, etc., and use that first line as the human-readable name.
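A sketch of that scheme (file and function names are illustrative): the on-disk name is an opaque File_0000000N, and a small helper reads the first line as the display name, including characters Windows forbids in real file names.

```python
# Sketch of the first-line-as-name idea from the post: the real file name
# is opaque; the first line inside the file is the human-readable name.
def display_name(path):
    """Read the first line of a file and use it as its display name."""
    with open(path, "r", encoding="utf-8") as f:
        return f.readline().rstrip("\n")

# Build a sample voltage log the way the post describes; note the ':'
# and '\' characters, which Windows would reject in an actual file name.
with open("File_00000001", "w", encoding="utf-8") as f:
    f.write("MyFile:File1:YesterdayLog\\Logs\n"
            "120 volts\n119 volts\n121 volts\n117 volts\n")

print(display_name("File_00000001"))
```

A listing program would simply call `display_name` on each File_0000000N and show the results instead of (or next to) the real names.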

One way this idea is already used is in Notepad that ships with many Windows versions.
If you open a text file and type on the first line:
".LOG"
the file acts as a log file so every time you open it it starts by appending the file with the date.
You have to open it with Notepad.exe each time though, because that is what recognizes the first line as such.
Here is an example that i just did (not including FILE_START and FILE_END here):
FILE_START
.LOG

3:43 PM 2/21/2022
This log 1.
3:43 PM 2/21/2022
This log 2.
3:44 PM 2/21/2022
This log 3.
FILE_END

The program typed the time and date each time it was opened; I typed the "This log 1/2/3" lines myself before closing the program.
It's a little interesting that they don't include the seconds, just the hour, minutes and date. You could write your own little program as above to rectify this and also use a more standard format like:
03:45:12pm 02/21/2022
or the way i like to do it:
02/21/2022 at 03:45:12pm

This is a little interesting too, because many file systems do not keep file times with 1-second resolution, so if we see 12 seconds on one file we will never see 13 seconds on the next file even if it was created exactly 1 second later; we would see either 12 again or 14 seconds.

As you can see, the format is very flexible; almost anything you want to do, you can do.
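A do-it-yourself version of Notepad's ".LOG" behaviour, with seconds included and the date-first format preferred above, could be sketched like this (file and function names are arbitrary):

```python
# Sketch: append a timestamp line on every "open", the way Notepad does
# for files whose first line is ".LOG", but with seconds included.
from datetime import datetime

def open_log(path):
    """Append one timestamp line, e.g. '02/21/2022 at 03:45:12pm'."""
    stamp = datetime.now().strftime("%m/%d/%Y at %I:%M:%S%p").lower()
    with open(path, "a", encoding="utf-8") as f:   # "a": append, never truncate
        f.write(stamp + "\n")

# Two "opens" produce two timestamp lines
open_log("voltages.log")
open_log("voltages.log")
with open("voltages.log", "r", encoding="utf-8") as f:
    lines = f.read().splitlines()
```

Because it is just a text file, any editor can still read the result; the timestamping is pure convention layered on top.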
 