Identifying the country of origin for a malware PE executable

22 Aug, 2012

REPOST - ORIGINALLY POSTED NOVEMBER 25, 2010

Have you ever wondered how people writing reports about malware can say where the malware was likely developed?

Sometimes you get totally lucky and log files created by the malware will help answer the question. Given the following line from a log:

11/16/2009 6:41:48 PM –> Hook instalate lsass.exe

We can use Google Translate’s “language detect” feature to help up determine the language used:

Of course, it’s not often we get THAT lucky!

A more interesting method is the examination of certain structures known as the Resource Directory within the executable file itself. For the purpose of this post, I will not be describing the Resource Directory structure. It’s a complicated beast, making it a topic I will save for later posts that actually warrant and/or require a low-level understanding of it. Suffice it to say, the Resource Directory is where embedded resources like bitmaps (used in GUI graphics), file icons, etc. are stored. The structure is frequently compared to the layout of files on a file system, although I think it’s insulting to file systems to say such a thing. For those more graphically inclined, I took the following image from http://www.devsource.com/images/stories/PEFigure2.jpg.

For the sake of example, here’s some images showing you just a few of the resources embedded inside of notepad.exe: (using CFF Explorer from: http://www.ntcore.com/exsuite.php)

Now it’s important to note that an executable may have only a few or even zero resources – especially in the case of malware. Consider the following example showing a recent piece of malware with only a single resource called “BINARY.”

Moving on, let’s look at another piece of malware… Below, we see this piece of malware has five resource directories.

We could pick any of the five for this analysis, but I’ll pick RCData – mostly because it’s typically an interesting directory to examine when reverse engineering malware. (This is because RCData defines a raw data resource for an application. Raw data resources permit the inclusion of any binary data directly in the executable file.) Under RCData, we see three separate entries:

The first one to catch my eye is the one called IE_PLUGIN. I’ll show a screenshot of it below, but am saving the subject of executables embedded within executables for a MUCH more technical post in the near future (when it’s not 1:30 am and I actually feel like writing more!).

Going back to the entry structure itself, the IE_PLUGIN entry will have at least one Directory Entry underneath it to describe the size(s) and offset(s) to the data contained within that resource. I have expanded it as shown next:

And that’s where things get interesting – as it relates to answering the question at the start of this post anyways. Notice the ID: 1055. That’s our money shot for helping to determine what country this binary was compiled in. Or, more specifically, the default locale codepage of the computer used to compile this binary. Those ID’s have very legitimate uses, for example, you can have the same dialog in English, French and German localized forms. The system will choose the dialog to load based on the thread’s locale. However, when resources are added to the binary without explicitly setting them to different locale IDs, those resources will be assigned the default locale ID of the compiler’s computer.

So in the example above, what does 1055 mean?

It means this piece of malware likely was developed (or at least compiled in) Turkey.

How do we know that one resource wasn’t added with a custom ID? Because we see the same ID when looking at almost all the other resources in the file (anything with an ID of zero just means “use the default locale”):

In this case, we are also lucky enough to have other strings in the binary (once unpacked) to help solidify the assertion this binary is from Turkey. One such string is “Aktif Pencere,” which Google’s Translation detection engine shows as:

However, as you can see, this technique is very useful even when no strings are present – in logs or the binary itself.

So is this how the default binary locale identification works normally (eg: non-malware executable files)?

Not exactly. The above techniques are generally used with malware (if the malware even has exposed resources), but not generally with normal/legitimate binaries. Consider the following legitimate binary. What is the source locale for the following example?

As you see in the green box, we have some cursor resources with the ID for the United States. (I’m including a lookup table at the bottom of this post.) In the orange box, there are additional cursor resources with the ID for Germany. In the red box is RCData, like we examined before, but all of these resources have the ID specifying the default language of the computer executing the application.

As it turns out, the normal value to examine is the ID for the Version Information Table resource (in the blue box). In the case above, it’s the Czech Republic. The Version Information Table contains the “metadata” you normally see depicted in locations like this:

In the above screenshot, Windows is identifying the source/target local as English, and specifically, United States English (as opposed to UK English, Australian English, etc…). That information is not stored within the Version Information table, but rather is determined by the ID of the Version Information Table.

However, in malware, the Version Information table is almost always stripped or mangled, as is the case with our original example from earlier:

Because of that, the earlier techniques are more applicable to malware.

Below, I’m including a table to help you translate Resource Entry IDs to locales (sorted by decimal ID number).

Locale	Language	LCID	Decimal	Codepage

Arabic – Saudi Arabia	ar	ar-sa	1025	1256
Bulgarian	bg	bg	1026	1251
Catalan	ca	ca	1027	1252
Chinese – Taiwan	zh	zh-tw	1028
Czech	cs	cs	1029	1250
Danish	da	da	1030	1252
German – Germany	de	de-de	1031	1252
Greek	el	el	1032	1253
English – United States	en	en-us	1033	1252
Spanish – Spain (Traditional)	es	es-es	1034	1252
Finnish	fi	fi	1035	1252
French – France	fr	fr-fr	1036	1252
Hebrew	he	he	1037	1255
Hungarian	hu	hu	1038	1250
Icelandic	is	is	1039	1252
Italian – Italy	it	it-it	1040	1252
Japanese	ja	ja	1041
Korean	ko	ko	1042
Dutch – Netherlands	nl	nl-nl	1043	1252
Norwegian – Bokml	nb	no-no	1044	1252
Polish	pl	pl	1045	1250
Portuguese – Brazil	pt	pt-br	1046	1252
Raeto-Romance	rm	rm	1047
Romanian – Romania	ro	ro	1048	1250
Russian	ru	ru	1049	1251
Croatian	hr	hr	1050	1250
Slovak	sk	sk	1051	1250
Albanian	sq	sq	1052	1250
Swedish – Sweden	sv	sv-se	1053	1252
Thai	th	th	1054
Turkish	tr	tr	1055	1254
Urdu	ur	ur	1056	1256
Indonesian	id	id	1057	1252
Ukrainian	uk	uk	1058	1251
Belarusian	be	be	1059	1251
Slovenian	sl	sl	1060	1250
Estonian	et	et	1061	1257
Latvian	lv	lv	1062	1257
Lithuanian	lt	lt	1063	1257
Tajik	tg	tg	1064
Farsi – Persian	fa	fa	1065	1256
Vietnamese	vi	vi	1066	1258
Armenian	hy	hy	1067
Azeri – Latin	az	az-az	1068	1254
Basque	eu	eu	1069	1252
Sorbian	sb	sb	1070
FYRO Macedonia	mk	mk	1071	1251
Sesotho (Sutu)			1072
Tsonga	ts	ts	1073
Setsuana	tn	tn	1074
Venda			1075
Xhosa	xh	xh	1076
Zulu	zu	zu	1077
Afrikaans	af	af	1078	1252
Georgian	ka		1079
Faroese	fo	fo	1080	1252
Hindi	hi	hi	1081
Maltese	mt	mt	1082
Sami Lappish			1083
Gaelic – Scotland	gd	gd	1084
Yiddish	yi	yi	1085
Malay – Malaysia	ms	ms-my	1086	1252
Kazakh	kk	kk	1087	1251
Kyrgyz – Cyrillic			1088	1251
Swahili	sw	sw	1089	1252
Turkmen	tk	tk	1090
Uzbek – Latin	uz	uz-uz	1091	1254
Tatar	tt	tt	1092	1251
Bengali – India	bn	bn	1093
Punjabi	pa	pa	1094
Gujarati	gu	gu	1095
Oriya	or	or	1096
Tamil	ta	ta	1097
Telugu	te	te	1098
Kannada	kn	kn	1099
Malayalam	ml	ml	1100
Assamese	as	as	1101
Marathi	mr	mr	1102
Sanskrit	sa	sa	1103
Mongolian	mn	mn	1104	1251
Tibetan	bo	bo	1105
Welsh	cy	cy	1106
Khmer	km	km	1107
Lao	lo	lo	1108
Burmese	my	my	1109
Galician	gl		1110	1252
Konkani			1111
Manipuri			1112
Sindhi	sd	sd	1113
Syriac			1114
Sinhala; Sinhalese	si	si	1115
Amharic	am	am	1118
Kashmiri	ks	ks	1120
Nepali	ne	ne	1121
Frisian – Netherlands			1122
Filipino			1124
Divehi; Dhivehi; Maldivian	dv	dv	1125
Edo			1126
Igbo – Nigeria			1136
Guarani – Paraguay	gn	gn	1140
Latin	la	la	1142
Somali	so	so	1143
Maori	mi	mi	1153
HID (Human Interface Device)			1279
Arabic – Iraq	ar	ar-iq	2049	1256
Chinese – China	zh	zh-cn	2052
German – Switzerland	de	de-ch	2055	1252
English – Great Britain	en	en-gb	2057	1252

Topic:

0 Comment

Identifying the country of origin for a malware PE executable

CATEGORIES

RECENT POST