High level language telltales in machine code?

Thread Starter

WBahn

Joined Mar 31, 2012
29,976
I was asked a question that I have no idea how to answer. Basically, the question was what the difference is between C and C++ once you get down to the low level (i.e., once the code is compiled). This got me thinking about the more general question of how feasible it would be for someone to examine compiled code (say, once it has been disassembled) and figure out what language the original program was written in.

My initial gut feeling was that it should be impossible in theory, but that language constructs lend themselves to translation strategies that probably leave signatures (patterns) that can be identified. But even then, are the patterns more likely to be a function of the compiler writer (or the tools used by the compiler writer), or are they more likely to be a function of the language?

Perhaps to give a more concrete working example: would you expect object-oriented code to result in compiled code that is demonstrably recognizable compared to non-object-oriented code?

It would be interesting to hear people's thoughts.
 

tshuck

Joined Oct 18, 2012
3,534
My thoughts are that it would be impossible to distinguish C++ and C in disassembled form. A compiler can turn a while loop into any of a number of implementations, but would prefer one due to the style chosen by the compiler's creators. In that respect, the implementation of an object, provided it is unique to the compiler, could reveal which compiler, and therefore which language, was used.

However, a C++ class would carry information about its size and shape, along with its class members, in whatever eventually stores the object and its references. This would be reproduced for each object, so one could possibly determine the presence of a class from the way it is referenced and the sequence of steps used to access the object in the disassembly.

In that respect, identifying common sequences of writes to sections of memory would likely indicate a class object; however, this would not be guaranteed, as the compiler could do the same for accessing any data type.

If, however, you knew the constructs a compiler makes in order to implement/access a certain C++ object, you could determine that the code is most likely C++.
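As a rough illustration of why offset-based access alone is weak evidence, here is a minimal sketch. The names `CPoint`, `Point`, `read_y_c`, and `read_y_cpp` are hypothetical and not from any post in this thread. A plain C-style struct and a C++ class with the same data members normally end up with the same size and member offsets, so reading a field is the same "base pointer plus constant offset" access in both cases.

```cpp
#include <cstddef>   // offsetof

// Hypothetical C-style record.
struct CPoint {
    int x;
    int y;
};

// Hypothetical C++ class with the same data plus a non-virtual method.
// All members are public, so the class stays standard-layout and
// offsetof is well defined for it.
class Point {
public:
    int x;
    int y;
    int sum() const { return x + y; }   // needs no per-object storage
};

// Both functions load the int stored at a fixed offset from the pointer.
int read_y_c(const CPoint* p)  { return p->y; }
int read_y_cpp(const Point* p) { return p->y; }

// Same size and same member offset for the struct and the class.
static_assert(sizeof(CPoint) == sizeof(Point), "same object size");
static_assert(offsetof(CPoint, y) == offsetof(Point, y), "same member offset");
```

Comparing the disassembly of `read_y_c` and `read_y_cpp` from the same compiler would be one way to check this on a particular toolchain.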
 

Brownout

Joined Jan 10, 2012
2,390
This is nothing more than an educated guess. At the low level, it's all just 1's and 0's, so there is no real difference. Your question about patterns is a good one, however. Certainly, C++ would compile with recognizable patterns. Now, it is possible that C can be structured with objects that would theoretically compile with similar patterns, and it might not be possible to discern between the two. At that point one would need to ask: would anyone use C in that manner when a better tool was available? In most cases the answer would be 'no.' But there would be exceptions.
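To make the "C structured with objects" case concrete, here is a minimal sketch with hypothetical names (`CShape`, `ShapeOps`, `Shape`, `Square`). The first half is written in the C subset and builds a dispatch table by hand; the second half lets the C++ compiler build the equivalent table for a virtual function. Both calls go through "load table pointer, load function pointer, call indirectly", which is why the two could plausibly look alike in a disassembly. How close they actually look depends on the compiler and optimization settings.

```cpp
// --- C-style "object": the programmer builds the dispatch table by hand ---
struct ShapeOps;                        // forward declaration of the table type

struct CShape {
    const ShapeOps* ops;                // pointer to a table shared by all squares
    int side;
};

struct ShapeOps {
    int (*area)(const CShape*);         // one slot per "method"
};

static int c_square_area(const CShape* s) { return s->side * s->side; }

static const ShapeOps kSquareOps = { c_square_area };

// Load the table pointer, load the slot, call through it.
int c_area(const CShape* s) { return s->ops->area(s); }

// --- C++ object: the compiler generates an equivalent table ---
struct Shape {
    virtual ~Shape() = default;
    virtual int area() const = 0;
};

struct Square : Shape {
    explicit Square(int s) : side(s) {}
    int area() const override { return side * side; }
    int side;
};

// A virtual call through a base reference is also an indirect call.
int cpp_area(const Shape& s) { return s.area(); }
```

With `CShape c{&kSquareOps, 4};` and `Square q(4);`, both `c_area(&c)` and `cpp_area(q)` return 16.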
 

spinnaker

Joined Oct 29, 2009
7,830
I would think that someone with an intimate knowledge of the compiler that produced the code might be able to determine that their compiler produced the code.
 

Thread Starter

WBahn

Joined Mar 31, 2012
29,976
I would think that someone with an intimate knowledge of the compiler that produced the code might be able to determine that their compiler produced the code.
That's my general supposition, as well. It's one thing (and a useful tidbit of information) if it's practical or even probable to know the telltales of a specific compiler well enough to spot code that was compiled with that compiler. But I'm curious whether different languages, because of the syntax and semantics of the language itself, would make it so that someone who was very familiar with many different compilers for many different languages would have a fair shot at determining that a particular program was written in a given language even if it were compiled by a compiler they had never come across previously.

I'm guessing that the telltales would exist but would be much, much weaker even than spotting that a particular program was compiled from an unknown language by one of, say, Borland's compilers because their other compilers for other languages exhibit certain patterns in the generated code. But I'm curious whether those language-specific telltales are strong enough to actually be usable.
 

ErnieM

Joined Apr 24, 2011
8,377
The difference between the compiled output machine codes from C and C++ compilers would be immediately obvious.

C++ is object orientated, C is not. Objects are not just a high level construct that disappears in low level code: an object is created by first obtaining RAM from the heap, and the address of this memory is the pointer to the object. Object member variables are accessed by adding an offset to the object pointer and using the value stored there. Object member methods are accessed by reading a function pointer also stored at an offset to the object pointer.

Thus a pattern of variables and functions being accessed through doubly dereferenced pointers would show code to be C++ and not C.
 

Thread Starter

WBahn

Joined Mar 31, 2012
29,976
The difference between the compiled output machine codes from C and C++ compilers would be immediately obvious.

C++ is object orientated, C is not. Objects are not just a high level construct that disappears in low level code: an object is created by first obtaining RAM from the heap, and the address of this memory is the pointer to the object. Object member variables are accessed by adding an offset to the object pointer and using the value stored there. Object member methods are accessed by reading a function pointer also stored at an offset to the object pointer.

Thus a pattern of variables and functions being accessed through doubly dereferenced pointers would show code to be C++ and not C.
How sure of this are you? My limited understanding is that the method pointers are NOT stored in each object, because that would allow each object, not just each class, to have different methods. Instead, either a single table of method pointers is generated by the compiler (or possibly one table per class) and used by the code (with no reference to it in the object's data block) or, I believe in C++ anyway, it can be even less obvious than that: if a class has static methods, the compiler can produce jumps to those methods directly, just like any other function.

As for the data objects themselves being dynamically allocated in C++ but not in C, most C programs of any size use dynamically allocated structures extensively.

I was talking to a friend at dinner tonight and he thought that the virtual function tables used by most object oriented languages could probably be identified. He also said that languages that allow functions to be defined within functions would have telltales that languages that don't allow that would lack.
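The per-class table idea can be probed with a small experiment. This is a sketch with hypothetical type names (`Plain`, `OneVirtual`, `ManyVirtual`), and it relies on implementation behavior rather than anything the C++ standard guarantees. On mainstream compilers (GCC, Clang, MSVC), an object of a class with virtual functions typically carries one hidden pointer to a table shared by the whole class, so adding more virtual functions grows the table and not the object, and non-virtual methods add nothing to the object at all.

```cpp
// No virtual functions: the object is just its data.
struct Plain {
    int value;
    int get() const { return value; }   // ordinary direct call (or inlined)
};

// One virtual function (the destructor).
struct OneVirtual {
    virtual ~OneVirtual() = default;
    int value = 0;
};

// Several virtual functions.
struct ManyVirtual {
    virtual ~ManyVirtual() = default;
    virtual int a() const { return 1; }
    virtual int b() const { return 2; }
    virtual int c() const { return 3; }
    int value = 0;
};

// Expected to be true on common ABIs; treated as an observation,
// not a language guarantee, so these are flags rather than static_asserts.
constexpr bool kNonVirtualMethodsAreFree = sizeof(Plain) == sizeof(int);
constexpr bool kTableIsPerClass = sizeof(OneVirtual) == sizeof(ManyVirtual);
```

If both flags are true on a given toolchain, the function pointers themselves are not stored in each object; only a single table pointer is, which matches the "virtual function table" telltale mentioned above.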
 

thatoneguy

Joined Feb 19, 2009
6,359
Back when it was just Windows 3.11, Borland C++ and Microsoft C, and two others, there was a program called "whatis.exe" that would say which compiler was used, and even the version of that compiler.

It used binary signatures from libraries to identify the different compilers, much like antivirus programs use signatures to identify a virus. It could even detect which assembler was used in the event the program was written in assembly. That was accomplished by matching macro behavior.
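For anyone curious what signature matching of this kind looks like in principle, here is a minimal sketch. It is not how whatis.exe was implemented (I have no idea about its internals); the function name `find_signature` and the wildcard constant `kAny` are made up for illustration. A signature is a byte pattern in which some positions are wildcards, and the scanner looks for the first place in the file where every non-wildcard byte matches.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical wildcard marker: matches any byte at that position.
const int kAny = -1;

// Returns the offset of the first match of `sig` in `data`, or -1 if none.
// Each signature entry is either a byte value (0..255) or kAny.
std::ptrdiff_t find_signature(const std::vector<std::uint8_t>& data,
                              const std::vector<int>& sig) {
    if (sig.empty() || sig.size() > data.size()) return -1;
    for (std::size_t start = 0; start + sig.size() <= data.size(); ++start) {
        bool match = true;
        for (std::size_t i = 0; i < sig.size(); ++i) {
            if (sig[i] != kAny && sig[i] != static_cast<int>(data[start + i])) {
                match = false;
                break;
            }
        }
        if (match) return static_cast<std::ptrdiff_t>(start);
    }
    return -1;
}
```

For example, searching the bytes `90 55 8B EC 83 EC 10` for the illustrative pattern `55 8B EC 83 EC ??` (not taken from any real signature database) returns offset 1. A real tool would hold many such patterns, each tagged with a compiler name and version.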

The "stoned" Virus was written in Borland C, for example (this was in the early 90's)

I can see a similar program being built today for the M$ Win platform, though it would be FAR more complicated with the number of compilers around.

Today, it's not really difficult if you have cygwin installed: simply do `strings prog.exe` and you'll see the standard errors/output/warnings and locations, and whether it is a .NET file or a 32-bit or 64-bit executable. The compiler also tags its name and version near the top or bottom of the file.

When it comes to a tool that identifies which compiler created a .hex file for a microcontroller, I've never seen or heard of one. Since "whatis.exe" essentially scanned the program, then ran the program keeping track of how stacks were maintained, it could get it right. That's not possible on a uC (unless you use an emulator and have signatures for a few dozen compilers).
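The idea behind `strings` is simple enough to sketch. This is only similar in spirit to the real utility, which has more options and handles other encodings; the function name `extract_strings` and the default minimum length of 4 are assumptions for illustration. It collects runs of printable ASCII from a blob of bytes, which is how embedded compiler banners and runtime error messages become visible.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Collect runs of printable ASCII (0x20..0x7E) that are at least
// `min_len` characters long. `bytes` may contain embedded NULs.
std::vector<std::string> extract_strings(const std::string& bytes,
                                         std::size_t min_len = 4) {
    std::vector<std::string> found;
    std::string run;
    for (unsigned char c : bytes) {
        if (c >= 0x20 && c <= 0x7E) {
            run.push_back(static_cast<char>(c));   // extend the current run
        } else {
            if (run.size() >= min_len) found.push_back(run);
            run.clear();                           // non-printable byte ends the run
        }
    }
    if (run.size() >= min_len) found.push_back(run);  // run ending at end of input
    return found;
}
```

Feeding it the contents of an executable and looking through the result for compiler or runtime names is roughly what the manual `strings` inspection amounts to.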

--ETA: Just Searched, found one!

PEiD detects most common packers, cryptors and compilers for PE files. It can currently detect more than 600 different signatures in PE files.
PEiD is special in some aspects when compared to other identifiers already out there!

  1. It has a superb GUI and the interface is really intuitive and simple.
  2. Detection rates are amongst the best given by any other identifier.
  3. Special scanning modes for advanced detections of modified and unknown files.
  4. Shell integration, Command line support, Always on top and Drag'n'Drop capabilities.
  5. Multiple file and directory scanning with recursion.
  6. Task viewer and controller.
  7. Plugin Interface with plugins like Generic OEP Finder and Krypto ANALyzer.
  8. Extra scanning techniques used for even better detections.
  9. Heuristic Scanning options.
  10. New PE details, Imports, Exports and TLS viewers
  11. New built in quick disassembler.
  12. New built in hex viewer.
  13. External signature interface which can be updated by the user.


Here is a Download Link (can't vouch for it)

The details on the download page are:
Detects packers, cryptors and compilers

Written by Giorgiana Bursuc on August 10th, 2012
PEiD is an intuitive application that relies on its user-friendly interface to detect packers, cryptors and compilers found in PE executable files – its detection rate is higher than that of other similar tools since the app packs more than 600 different signatures in PE files.

PEiD comes with three different scanning methods, each suitable for a distinct purpose. The Normal one scans the user-specified PE file at its Entry Point for all its included signatures. The so-called Deep Mode comes with increased detection ratio since it scans the file's Entry Point containing section, whereas the Hardcore mode scans the entire file for all the documented signatures.

When users need to get their results right away, they can rely on the Normal or the Deep modes, and they can turn to the Hardcore one when they are willing to wait the time it takes for the scan to complete – regardless of the chosen type, the generated results are as accurate as possible due to PEiD’s error control method.

In addition to the intuitive interface of PEiD, its functions can also be accessed via command-line, and the detailed documentation can help users get familiarized with the proper commands and parameters.

PEiD also allows users to explore all the currently running processes and terminate them with a single mouse click. One can also dump a module then scan the dumped image, or analyze the dependent modules of a process.

The best results can be obtained if each file is analyzed separately as it takes less time to complete the scan, but PEiD also supports batch processing. Users can choose a folder, then set PEiD to select the PE files and scan them.

To sum it up, PEiD is a feature-packed application that can scan PE files and identify packers and compilers, while also featuring a HEX viewer and a task manager.
--ETA2: Here is a page of MANY different tools to do what the OP is asking

I just searched for "whatis.exe" after writing the post above. Weird.
 

Thread Starter

WBahn

Joined Mar 31, 2012
29,976
Back when it was just Windows 3.11, Borland C++ and Microsoft C, and two others, there was a program called "whatis.exe" that would say which compiler was used, and even the version of that compiler.
A friend on another mailing list mentioned whatis.exe, which I had never heard of prior to that.

What the OP is really asking about is not whether it is possible to identify the compiler and, from that, infer the language. The real question is whether or not it is practical to determine the language based on the signatures that the language itself, not the compiler, introduces into the machine level code.

I think the answer at this point is a qualified maybe. It's almost certain that some degree of categorization can take place.

It would be interesting if I could get ahold of J.E. Smith (the guy that wrote whatis.exe) and see what he might have to say about it.
 

thatoneguy

Joined Feb 19, 2009
6,359
I just tried out PEiD.

MikroC - Borland Delphi 6.0-7.0 and Visual Studio C++

DipTrace Schematic - gcc

Some come up unidentifiable (PICKit 3 standalone), but it hits most others.
 