Why can I store more chars than I reserve a place for?

Thread Starter

StrongPenguin

Joined Jun 9, 2018
307
I'm chewing my way through the book "C Programming for absolute beginners", which is wonderful.

For practice, I type in many of the examples in CodeBlocks and compile, just for exercise, then I fiddle around with them.

Here is something I can't figure out.
Code:
#include <stdio.h>
#include <string.h>

int main(void)
{
    char month[] = "Oktober";  // no need to put a number in [] when defining a string
    char something[5];  // uninitialized string, room for 4 ASCII chars + the \0 string end

    printf("What month is it? %s ofc. \n", month);
    printf("Which is your favorite month?\n");
    scanf(" %s", something);  // no width limit, so long input overflows something[]
    printf("Computing\n\n");
    printf("Your favorite month is %s\n", something);
    printf("%s\n\n", something);  // test if the something var holds more than 5 elements
    return 0;
}
I set the something variable to only 5 elements, just to see whether I can enter something longer than 5 (well, 4) characters and have it print. And it does.

Why? I don't get that.
 

dl324

Joined Mar 30, 2015
16,846
You're writing to memory "after" the allocated space. Sometimes the operating system detects it and aborts the program and sometimes it doesn't.

Bounds checking is the programmer's responsibility.

Try swapping the variable declarations.
 

nsaspook

Joined Aug 27, 2009
13,086
That's how a classic buffer-overflow happens.
https://en.wikipedia.org/wiki/Buffer_overflow

The variable is just a location in memory with a fixed number of bytes reserved for its contents, usually inside a larger memory block.

[Image: a possible memory space allocation for a string.]

If the extra memory locations in the block are not used by other program data, whatever you write there will stay there (usually, as long as they are valid memory locations for the running process). It is NOT a good idea to stomp on random memory locations by going outside the memory allocated for the variable, because another variable or program code could be allocated in those next locations.
 

WBahn

Joined Mar 31, 2012
29,979
This is one of the big reasons why you should never use scanf() for user input.

In general, when memory for an array is allocated, the name of the array is associated with the address of the first element of the array. The size of the array is only used to enable the memory allocation code (in this case, the code that sets up the function call to main()) to avoid putting other variables into any of the space allocated to the array.

But the runtime code ONLY sees the address of the start of an array of type char (actually, it only sees the address of a single char that just happens to coincide with the first char in an array). Something else has to tell the code when to stop walking down the memory looking for values of type char. Most string functions look for the NUL terminator contained within the string (or input data). Unless YOU, the programmer, take responsibility for ensuring that your code doesn't walk past the end of the allocated memory, then the functions have no way of knowing that they did and will happily do so. In the process, they will read or, in this case, overwrite memory that belongs to other things -- such as other variables or control structures used by the program such as the call stack frame contents. In nearly all cases those fall under the heading of "a bad thing."

Since scanf() isn't capable of detecting the end of the allocated memory either, the user can enter data that overwrites memory beyond the array. In doing so, it is possible for them to insert malicious code into the string and then overwrite the return address for the function call thereby forcing the execution of that code. This is the classic "buffer overflow attack".
 

Thread Starter

StrongPenguin

Joined Jun 9, 2018
307
Ok, the book did not mention buffer overflow, so that's nice to know. I also didn't really get why I should allocate X number of words for the array. Arrays as a whole still confuse me, so I need to work more with them.

If scanf() is not the best way to get user input, then what other ways are there?
 

dl324

Joined Mar 30, 2015
16,846
Standard C doesn't have a string datatype, so we use arrays of characters. When you allocate space, you need to also include the null terminating character.
You can use fgets() with stdin as the file stream. The size argument is the size of your buffer: fgets() reads at most size - 1 characters and then appends the null terminator, so it never stores more than size bytes in total.

char *fgets(char *s, int size, FILE *stream);

I have used gets() to read from stdin, but I see in the version of Debian Linux I'm using that it has been marked as deprecated and buffer overrun is listed as a bug:

"BUGS
Never use gets(). Because it is impossible to tell without knowing the
data in advance how many characters gets() will read, and because
gets() will continue to store characters past the end of the buffer, it
is extremely dangerous to use. It has been used to break computer
security. Use fgets() instead."
 

WBahn

Joined Mar 31, 2012
29,979
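In the newer libraries there are more secure, bounds-checked versions of most of the string functions (for example the optional _s functions described in Annex K of the C11 standard), although not every compiler ships them.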

Since I'm an old codger and set in my ways, I prefer using fgets() because it gives the programmer complete control. I tell it to get a string from the source stream and I tell it to get no more characters than my buffer has room for. I can then examine that string and validate it before I proceed. But it is a bit more cumbersome because I also have to check to see if the entire input was captured and respond correctly if not -- but that's a burden I'll gladly accept because if I detect that that's the case, then it also means that I would have had a buffer overflow if I had just used scanf().
 

WBahn

Joined Mar 31, 2012
29,979
The gets() function has ALWAYS had the same problem that scanf() has -- namely that it blindly copies characters from input to the memory pointed to without any awareness or concern of the size of the buffer.
 

dl324

Joined Mar 30, 2015
16,846
It was never an issue. I always had a large buffer I used for unconstrained input.
 

nsaspook

Joined Aug 27, 2009
13,086
Guess we never had any deviants working at my company.
Too late. I used it for decades in software that was used for decades.
The deviants are not the worry, it's the professionals who do this for a living that should worry companies with IP they need to protect.
 

dl324

Joined Mar 30, 2015
16,846
AFAIK, my company was only hacked once; by someone who was acting as a security consultant. He was caught and convicted.

If someone wanted to do damage, it was far easier to try to discover a sysadmin's password.
 

WBahn

Joined Mar 31, 2012
29,979
To be sure, if the software you write is always used internally and is never accessible from a public interface, then the risk is greatly reduced. The biggest risk is widely used software that is accessible online, particularly if the binary and/or source is also available to an attacker. But successful exploits exist where code that was originally written for personal and informal use (and thus with no input validation because the perceived threat was so low) later came to be used by others internally and then, for convenience, was made available on the organization's website. Attackers then made some assumptions about the most likely ways that the input routines were written and wrote scripts that modified a basic attack pattern for progressively longer string buffers, and within literally minutes of commencing the attack they had root privileges on the webserver.

Since programmers follow the patterns that they learn early on, it is best that they NOT learn to use programming practices that are widely exploited -- and scanf() and gets() are probably among the most exploited security holes out there. While some argue -- with reasonable justifications -- that all code should be written with the most rigorous security concerns from day one, I think this is going overboard from a practicality standpoint. But there ARE some low-hanging fruits that can and should be addressed from the get-go.

I originally banned scanf() from student code not because of security concerns (that just wasn't on my radar in those days) but because of all the problems I saw with how poorly scanf() could handle user input that wasn't exactly correct.
 

dl324

Joined Mar 30, 2015
16,846
I worked mainly on software to facilitate design automation. We usually wrapped our programs with scripts that allowed sites to customize input validation to be appropriate for their sites and to make them bulletproof from Users. When we switched to Unix based computers, User input was via command line options and we typically used getopt() for argument processing.

The people using the programs and wrappers were far more interested in performing their jobs as efficiently as possible vs hacking to do something they shouldn't. Sometimes the wrappers were GUIs, but they performed similar functionality; verified inputs against a set of project specific requirements.

I worked with top secret company information, so there was no external access to our network. Contractors, vendors, and visitors were always escorted while they were on site.
 

nsaspook

Joined Aug 27, 2009
13,086
None of that excuses using 'gets' or its kin in any program today, even toy examples. The ISO has actually removed gets() in the C11 standard.

This should be the current implementation.
Code:
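#include <assert.h>
#include <stdlib.h>

/* Tongue-in-cheek replacement: any call to gets() terminates the program. */
char *gets(char *buffer)
{
    assert(buffer != 0);
    abort();
    return 0;  /* never reached */
}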
 

MrSoftware

Joined Oct 29, 2013
2,188
To the original poster, as mentioned above there are more secure versions that can help prevent overwriting memory that isn't yours. For example, scanf_s() (available in Microsoft's C runtime and as an optional part of C11, so not on every compiler) takes a parameter that specifies the buffer size, and it won't read more than you tell it that there is room for. Note that there is still the limitation that you have to specify the size correctly:

https://docs.microsoft.com/en-us/cp...-s-scanf-s-l-wscanf-s-wscanf-s-l?view=vs-2017

As for security, most of the time the risk is directly proportional to how interesting or valuable the target system is, or becomes. For example, no one really considered home security cameras a risk to the general public... until a bunch of college kids demonstrated the danger by using them to launch DDoS attacks against some mighty big targets (fascinating must-read article). Similarly, the guys who wrote the code to control centrifuges on a closed network probably didn't think incredibly hard about security... until Stuxnet appeared. There are lots of examples of things that seem uninteresting and unimportant suddenly becoming very important because they became very interesting to the wrong people.
 

dl324

Joined Mar 30, 2015
16,846
The scanf() call could also be modified to add bounds restrictions:
Code:
  ...
  char fmt[16];

  printf("What month is it? %s ofc. \n", month);
  printf("Which is your favorite month?\n");
  sprintf(fmt, "%%%ds", (int)(sizeof(something) - 1));  /* builds "%4s"; cast because %d expects an int */
  scanf(fmt, something);
 

WBahn

Joined Mar 31, 2012
29,979
The big problem with this approach (IIRC, since I haven't used scanf() since my first semester learning C) is that you have no idea whether the input was cut short. If it was, then the rest of the input is still sitting in the input buffer and will be processed by a subsequent call to scanf(). With fgets() this is not a problem, since the newline is included as part of the string that is read, so you can determine by examining the buffer contents whether the entire input was captured and respond accordingly if it wasn't before proceeding to the next input.

The other problem is that (again, IIRC) scanf() breaks strings at whitespace, so if the person is asked to enter their last name and it is "De Lorosa", only the "De" will be accepted and the "Lorosa" will be seen as the input for the next thing they are asked. I believe you can avoid this by using a scanset, a regex-like expression such as %[^\n], as part of the format string, but I suspect this is almost never done.
 

WBahn

Joined Mar 31, 2012
29,979
Oh, and the most concise and, from a certain perspective, most accurate and useful answer to the question in the thread title:

"Why can I store more chars than I reserve a place for?"

is: "Because C gives you plenty of rope with which to hang yourself."

While nice and pithy, there really is a lot of truth to this. C was a language designed by people who knew what the heck they were doing and it was written to allow them to exploit that skill set by placing as few restrictions on them as possible since they knew exactly how everything worked under the hood. The result is a language that grants great power to those that program in it, but with great power goes great responsibility -- something which run-of-the-mill C programmers are often not equipped, either intellectually or emotionally, to handle.
 