High level language telltales in machine code?

Discussion in 'Programmer's Corner' started by WBahn, Feb 14, 2013.

  1. WBahn

    Thread Starter Moderator

    Mar 31, 2012
    17,788
    4,808
    I was asked a question that I have no idea how to answer. Basically, the question was what the difference is between C and C++ once you get down to the low level (i.e., once the code is compiled). This got me thinking about the more general question of how feasible it would be for someone to examine compiled code (say, once it has been disassembled) and figure out what language the original program was written in.

    My initial gut feeling was that it should be impossible in theory but that the language constructs probably lend themselves to translation strategies that probably leave signatures (patterns) that can be identified. But even then, are the patterns more likely to be a function of the compiler writer (or tools used by the compiler writer) or are they more likely to be a function of the language?

    Perhaps to give a more concrete working example: would you expect object-oriented code to result in compiled code that is demonstrably recognizable compared to non-object-oriented code?

    It would be interesting to hear people's thoughts.
     
  2. tshuck

    Well-Known Member

    Oct 18, 2012
    3,531
    675
    My thoughts are that it would be impossible to distinguish C++ and C in disassembled form. A compiler can turn a while loop into any of a number of implementations, but it would prefer one due to the style chosen by the compiler's creators. In that respect, the implementation of an object, provided it is unique to the compiler, could reveal which compiler, and therefore which language, was used.

    However, a C++ class would carry information about its size and shape, along with its members, in whatever memory eventually stores the object and its references. This would be reproduced for each object, so one could possibly determine the presence of a class from the way in which it is referenced and the sequence of steps used to access the object in the disassembly.

    In that respect, identifying common sequences of writes to sections of memory would likely indicate a class object. However, this would not be guaranteed, as the compiler could do the same when accessing any data type.

    If, however, you knew the constructs a compiler makes in order to implement/access a certain C++ object, you could determine that the code is most likely C++.
     
  3. Brownout

    Well-Known Member

    Jan 10, 2012
    2,375
    998
    This is nothing more than an educated guess. At the low level, it's all just 1's and 0's, so there is no real difference. Your question about patterns is a good one, however. Certainly, C++ would compile with recognizable patterns. Now, it is possible that C can be structured with objects that would theoretically compile to similar patterns, and it might not be possible to discern between the two. At that point we would need to ask: would anyone use C in that manner when a better tool was available? In most cases the answer would be 'no.' But there would be exceptions.
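To make the "C structured with objects" case concrete, here is a minimal sketch of a hand-rolled method table written with C-style constructs only (no `class`, no `virtual`), compiled here as C++. All names (`Shape`, `ShapeOps`, `shape_area`, and so on) are hypothetical and chosen for illustration. The point is that the dispatch in `shape_area` is a load of a table pointer from the object, a load of a function pointer from the table, and an indirect call, which is the same general shape a C++ virtual call usually takes.

```cpp
// Hand-written "object" with a manually maintained method table.
// Hypothetical names; an illustration, not any particular compiler's output.
struct Shape;  // forward declaration so the table can refer to it

struct ShapeOps {
    double (*area)(const Shape*);  // one slot per "method"
};

struct Shape {
    const ShapeOps* ops;  // plays the role a compiler-generated table pointer would
    double w;
    double h;
};

static double rect_area(const Shape* s) { return s->w * s->h; }
static double tri_area(const Shape* s)  { return 0.5 * s->w * s->h; }

// One table per "class", shared by every object of that kind.
static const ShapeOps rect_ops = { rect_area };
static const ShapeOps tri_ops  = { tri_area };

// Dispatch: object -> table pointer -> function pointer -> indirect call.
double shape_area(const Shape* s) { return s->ops->area(s); }
```

A `Shape` initialised with `&rect_ops` and one initialised with `&tri_ops` go through the same `shape_area` call site but end up in different functions, so a disassembly of this code would not obviously distinguish it from compiled C++.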
     
  4. spinnaker

    AAC Fanatic!

    Oct 29, 2009
    4,887
    1,019
    I would think that someone with an intimate knowledge of the compiler that produced the code might be able to determine that their compiler produced the code.
     
  5. WBahn

    Thread Starter Moderator

    Mar 31, 2012
    17,788
    4,808
    That's my general supposition, as well. It's one thing (and a useful tidbit of information) if it's practical or even probable to know the telltales of a specific compiler well enough to spot code that was compiled with that compiler. But I'm curious whether different languages, because of the syntax and semantics of the language itself, would make it so that someone who was very familiar with many different compilers for many different languages would have a fair shot at determining that a particular program was written in a given language even if it were compiled by a compiler they had never come across previously.

    I'm guessing that the telltales would exist but would be much, much weaker even than those that let you spot that a particular program was compiled from an unknown language by one of, say, Borland's compilers because their other compilers for other languages exhibit certain patterns in the generated code. But I'm curious whether those language-specific telltales are strong enough to actually be usable.
     
  6. ErnieM

    AAC Fanatic!

    Apr 24, 2011
    7,395
    1,607
    The difference between the compiled output machine codes from C and C++ compilers would be immediately obvious.

    C++ is object oriented, C is not. Objects are not just a high-level construct that disappears in low-level code: an object is created by first obtaining RAM from the heap, and the address of this memory is the pointer to the object. Object member variables are accessed by adding an offset to the object pointer and using the value stored there. Object member methods are accessed by reading a function pointer also stored at an offset to the object pointer.

    Thus a pattern of variable and function being accessed by double dereferenced pointers would show code to be C++ and not C.
     
  7. WBahn

    Thread Starter Moderator

    Mar 31, 2012
    17,788
    4,808
    How sure of this are you? My limited understanding is that the method pointers are NOT stored in each object because that would allow each object to have different methods, not just each class. Instead, either a single table of method pointers is generated (or possibly one table per class) by the compiler and those are used by the code (with no reference to it in the object's data block) or, I believe in C++ anyway, it can be even more non-obvious than that because if a class has static methods then the compiler can produce jumps to those methods directly just like any other function.
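One way to check the "one table per class, not one pointer per method" idea is to compare object sizes. The sketch below uses hypothetical types (`Plain`, `OneVirtual`, `ThreeVirtual`, `Derived`). It assumes a mainstream implementation such as GCC, Clang, or MSVC, where an object with virtual functions carries a single hidden table pointer; the C++ standard itself does not mandate this layout.

```cpp
#include <cstddef>

struct Plain {
    int x;
    int get() const { return x; }              // non-virtual: target known at compile time
    static int twice(int v) { return 2 * v; }  // static: called like any free function
};

struct OneVirtual {
    int x;
    virtual int get() const { return x; }      // dispatched through a per-class table
};

struct ThreeVirtual {
    int x;
    virtual int get() const { return x; }
    virtual int a() const { return 1; }
    virtual int b() const { return 2; }
};

struct Derived : OneVirtual {
    int get() const override { return x + 1; } // replaces the slot in Derived's table
};

// The callee sees only a base reference, so the target is chosen at run time.
int call_through_base(const OneVirtual& o) { return o.get(); }

// Sizes exposed as constants so they can be printed or compared.
constexpr std::size_t kPlainSize        = sizeof(Plain);
constexpr std::size_t kOneVirtualSize   = sizeof(OneVirtual);
constexpr std::size_t kThreeVirtualSize = sizeof(ThreeVirtual);
```

On typical compilers `kOneVirtualSize` is larger than `kPlainSize`, while `kThreeVirtualSize` equals `kOneVirtualSize`: adding more virtual methods grows the shared table, not each object. Calls to `Plain::get` and `Plain::twice` need no table at all.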

    As for the data objects themselves being dynamically allocated in C++ but not in C, most C programs of any size use dynamically allocated structures extensively.

    I was talking to a friend at dinner tonight and he thought that the virtual function tables used by most object oriented languages could probably be identified. He also said that languages that allow functions to be defined within functions would have telltales that languages that don't allow that would lack.
     
  8. thatoneguy

    AAC Fanatic!

    Feb 19, 2009
    6,357
    718
    Back when it was just Windows 3.11, Borland C++ and Microsoft C, and two others, there was a program called "whatis.exe" that would say which compiler was used, and even the version of that compiler.

    It used binary signatures from libraries to identify the different compilers, much like antivirus programs use signatures to identify a virus. It could even detect which assembler was used in the event the program was written in assembly. That was accomplished by matching macro behavior.

    The "stoned" Virus was written in Borland C, for example (this was in the early 90's)

    I can see a similar program being built today for the M$ Win platform, though it would be FAR more complicated with the number of compilers around.

    Today, it's not really difficult if you have cygwin installed: simply run `strings prog.exe` and you'll see the standard errors/output/warnings and locations, and whether it is a .NET file or a 32-bit or 64-bit executable. The compiler also tags its name and version near the top or bottom of the file.
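A rough sketch of what that kind of scan does, in the spirit of `strings` but not a reimplementation of it: collect runs of printable ASCII from the raw bytes, then look for a run containing a known compiler marker. The function names `printable_runs` and `find_tag` are hypothetical. The marker is whatever you pass in; GCC, for example, typically embeds a string beginning with `GCC: (` in ELF objects, but treat any specific marker as an assumption to verify against real binaries.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Collect runs of printable ASCII (0x20..0x7e) that are at least min_len long.
std::vector<std::string> printable_runs(const std::string& bytes,
                                        std::size_t min_len = 4) {
    std::vector<std::string> runs;
    std::string current;
    for (unsigned char c : bytes) {
        if (c >= 0x20 && c <= 0x7e) {
            current.push_back(static_cast<char>(c));
        } else {
            if (current.size() >= min_len) runs.push_back(current);
            current.clear();
        }
    }
    if (current.size() >= min_len) runs.push_back(current);  // trailing run
    return runs;
}

// Return the first printable run that contains marker, or "" if none does.
std::string find_tag(const std::string& bytes, const std::string& marker) {
    for (const std::string& run : printable_runs(bytes)) {
        if (run.find(marker) != std::string::npos) return run;
    }
    return std::string();
}
```

To use it, read the executable into a `std::string` in binary mode and call `find_tag(contents, "GCC: (")` or another marker of interest. Note that this identifies the compiler rather than the language, and a stripped or packed binary may carry no such tag.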

    When it comes to which compiler created a .hex file for a microcontroller, I've never seen one, or heard of one. Since the "whatis.exe" essentially scanned the program, then ran the program keeping track of how stacks were maintained, it could get it right. That's not possible on a uC (unless you use an emulator and have signatures for a few dozen compilers).

    --ETA: Just Searched, found one!



    Here is a Download Link (can't vouch for it)

    The details on the download page are:
    --ETA2: Here is a page of MANY different tools to do what the OP is asking

    I just searched for "whatis.exe" after writing the post above. Weird.
     
    Last edited: Feb 15, 2013
  9. WBahn

    Thread Starter Moderator

    Mar 31, 2012
    17,788
    4,808
    A friend on another mailing list mentioned whatis.exe, which I had never heard of prior to that.

    What the OP is really asking about is not whether it is possible to identify the compiler and, from that, infer the language. The real question is whether or not it is practical to determine the language based on the signatures that the language itself, not the compiler, introduces into the machine level code.

    I think the answer at this point is a qualified maybe. It's almost certain that some degree of categorization can take place.

    It would be interesting if I could get ahold of J.E. Smith (the guy that wrote whatis.exe) and see what he might have to say about it.
     
  10. thatoneguy

    AAC Fanatic!

    Feb 19, 2009
    6,357
    718
    I just tried out PDiE.

    MikroC - Borland Delphi 6.0-7.0 And Visual Studio C++

    DipTrace Schematic - gcc

    Some come up unidentifiable (PICKit 3 standalone), but it hits most others.
     