Resource Center - High CPU usage

This is one of the recent issues that I was tracking This issue is pretty simple but because of that, its solution is not so trivial. So let's dig into it. The resource center is a standalone app that:

  • Is a remote file system server 
  • Integrate with Version Control System (currently P4)
  • Is a resource server that is responsible for building resources
  • Manage remote resource builders. 
  • Manage the creation of game build
  • Have UI which allows to browse the file system and trigger compilation of resources and build.  
All communication uses TCP/IP sockets and most operations are async for performance reasons. 

Recently I was doing changes to speed up the remote file system. I decided to switch all calls into async processing. While doing that I was put in front of the decision: Should I put all processing of connection read/write operations to a single thread? or maybe I should create a separate thread for each of them? I naively decided to go with a single-thread idea. 

The whole code looked more or less like this:

while (true)
{
    for (auto* connection : m_connections)
        connection->processIncoming();
}

on top of this:

bool Connection::processIncoming()
{
    if (!socket->isDataAvailable())
        return false;
    .. // process data
    return true;
}

Well as you can see this is a pretty tight loop that when there is no data sent, will occupy most of the CPU. For my stationary PC, this was not an issue but while using the laptop sound of the fans was accompanying me for a whole time the app was running.    

This was a simple overlook. We can come with a simple solution:

int32_t runReadThread(Server* server)
{
    while (server->isAlive())
    {
        for (auto* connection : server->m_connections)
            connection->processIncoming();
        Sleep(kSleepTime);
    }
}

One run of the app and high CPU use is gone, just like the performance of the whole app. But you can of course cover this by tweaking of kSleepTime to have some balance. A little bit of spoiler: doing this is a bad idea.  That is why we may change a little bit approach may and do this:

while (true)
{
    bool hasWork = false;

    for (auto* connection : m_connections)
        if (connection->processIncoming())
            hasWork = true;

    if (hasWork)
        Sleep(kSleepTime);
}

Now if there is work we do not sleep and gain some performance, only that in practice we did not. Now I know that no matter what I would try to do this approach would not work. We can think about the whole processing as being into one of the states:
  1. connections have no data
  2. connections have a continuous stream of data 
  3. connections have data from time to time 
Let's analyze all cases. The second code example covers cases 1 and 2 which are the trivial ones. We put a thread to sleep when there is no data. When there is a continuous stream of data we just do not sleep.

Issues start with the 3rd case which in practice is the most common one. We have some data, then there is some gap without it, and so on. The presented approach introduces a delay to processing data after encounter gaps. These delays will be between 0 - kSleepTime. You can tweak it to find something acceptable but we will still slow processing. There may exist some tricks to drop sleep time when there are events and then increase it when there is no data. But to my knowledge, there will be always loose of performance.

So here I'm wanting to return back to the writing game and not fixing the underlying system. I have done this for the past 3-4 weeks and feel that I drifted too much into tech. On the other hand, I cannot allow for high CPU use because this is annoying for all devs who work on tech (2 right now). That is why initially I decided to go naively with the sleep which was really a mistake. Once again by trying to save time I lost more of it. 

Before we get to my final solution I need to mention that I'm not a fan of putting the sleep in the code. I'm also not a big fan of the code that locks execution till something happens. In real-time app's they often can be replaced by async calls or delaying them till later execution. The problem is that what I do is not a real-time app. What I working on is just a normal app, you can think about it even as a demon that running somewhere in the background and does not use a lot of resources.

In the end, I think I understood my mistake and decided to do stuff differently. I mentioned earlier "I naively decided to use a single thread.". The logic behind it was that one resource center right now can have more than 16 connections at a time, which gives us more than 32 threads (separate for the read and write data). In my opinion, this was a lot so I tried to avoid it. But this is the road that I decided to follow in the end. Now my code looks more or less like this: 

int32_t runReadThread(Connection* a_connection)
{
    while (connection->isConnected())
        connection->processIncoming(500);
    return 0;
}

and function is modified:

void Connection::processIncoming(msec_t timeout)
{
    if (!socket->isDataAvailable(timeout))
        return;
    .. // process data
}

The idea here is that we pushing waiting on the operating system. Whenever there is data,  isDataAvailable(timeout) will awake the thread and start processing. In case it times out we just checking if the connection was not requested to shut down and if not, return to waiting. There is also no way of combining the processing of multiple connections into a single thread. If we do that we will need to wait for a timeout of all other connections before processing the next data. This would introduce a significant delay in the processing of events.

Is this solution worked? I would say yes. I dropped CPU use from 30% to 0.16%, no matter if there is an app attached or not. 

If I'm happy about it? I'm not really sure, somehow deep inside I suspect that there is a lot more elegant solution. Sadly I don't know it and being 3 days later in the problem I run out of the time that I can spend on it. 

For now, I will take this issue as a reminder that we programmers, like to go down the rabbit holes. We try to improve our solution, tweak it, change some equations, add a null check. Everything to make it work. We do that even when deep down we know that we should not do this.  I would love to say that I'm free of this behavior but I'm just like everyone else and sometimes falling into this trap.  

Like always if you have some comments on the topic, suggestions about different solutions I would love to hear about them. 



Comments

Popular posts from this blog

Query commands execution

W.U. 0x21

WRE