TX2, allocate different threads on different CPUs? Possible?

For the TX2, it has 6 CPUs (2 cores from the dual-core Denver and 4 cores from the quad-core ARM A57). Now I have 6 threads (thread1, thread2, thread3, 4, 5, 6).

I want to allocate:
thread1 on Denver 1, core 1
thread2 on Denver 2, core 2
thread3 on the 1st A57, core 1
thread4 on the 2nd A57, core 1
thread5 on the 3rd A57, core 1
thread6 on the 4th A57, core 1

Can I allocate threads specifically to certain CPUs? And, even more specifically, to a certain core of a certain CPU? It would be perfect if you could provide some sample code or a link; it's easier to understand by reading code.

Thanks in advance.

You may start looking at:

man sched_setaffinity
man taskset   # also has some details

More specifically, I am using it in my C++ program (one program with 6 threads), not from the command line. Thanks a lot.

The man pages are just API notes. Using this in C++ just requires wrapping it with:

extern "C"

For example, where an include file is mentioned in the man page:

extern "C" {
#include <sched.h>
}

…from there forward you can use “sched_setaffinity()” in C++ in the manner stated in the manual (“man”) page. I find man pages are often a better format than web tools.

There are sometimes C++ wrappers to such C functions, and newer C++ standards have some threading additions, but the C interface will be available on an extraordinarily wide set of platforms going far back in time.

cpu_set_t cpuset;
CPU_ZERO (&cpuset);                                     /* start with an empty set */
CPU_SET (0, &cpuset);                                   /* allow only CPU0 */
rc = sched_setaffinity (pid, sizeof(cpuset), &cpuset);  /* pid 0 == calling thread */
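
If the intent is to pin individual threads of one program, note that sched_setaffinity() applies to the single thread (task) whose ID you pass in…a process PID names only its main thread. The GNU per-thread calls are usually more convenient for that. Below is a minimal sketch (not from the man page, and not TX2-specific): it starts six threads and pins thread i to logical CPU i, assuming six online CPUs numbered 0-5, so check the numbering on your board first (e.g., lscpu or /proc/cpuinfo). Build with "g++ -pthread".

#include <pthread.h>
#include <sched.h>
#include <cstdio>

static void *worker(void *arg) {
    long id = reinterpret_cast<long>(arg);
    /* sched_getcpu() reports which core this thread is currently running on */
    std::printf("thread %ld on CPU %d\n", id, sched_getcpu());
    return nullptr;
}

int main() {
    pthread_t threads[6];
    for (long i = 0; i < 6; ++i) {
        /* restrict thread i to logical CPU i before it starts running */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(static_cast<int>(i), &set);

        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
        pthread_create(&threads[i], &attr, worker, reinterpret_cast<void *>(i));
        pthread_attr_destroy(&attr);
    }
    for (int i = 0; i < 6; ++i)
        pthread_join(threads[i], nullptr);
    return 0;
}

Each thread prints the core it ends up on, which is a quick way to check whether the affinity request was honored.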

  1. It seems it can only choose the CPU, but that function is not able to select which core of that CPU. How can I also choose a core within a specific CPU?

  2. I set all the threads on CPU0 just as an experiment, then ran my program (1 single program including 6 threads) and checked the system monitor of the TX2. I can see the usage of CPU0 is 80%, but the usage of all the other CPUs is also around 30%. How come CPU1-CPU5 are involved? (CPU1-CPU5 are used by this program; when I close my program, the usage of all CPUs drops to around 0.)

Each CPU is a core. So far as the operating system is concerned there is no difference between four cores on one chip and four chips of one core each. CPU0 is core 0, CPU1 is core 1. Two names for the same thing in this case.

Long ago nobody put two CPUs on a single chip. Much of the technology of multi-core CPUs is nothing more than packaging (there tends to be some level of integration improvement when on the same chip since there is no need for all of those external pins and motherboard support for the two discrete cores talking to each other).

I don’t know about your specific case, but consider that some parts of what is going on may not require access specifically to CPU0 (parts not needing I/O bound by a physical GPIO wire for example). It isn’t unusual for drivers and other software in libraries to split work to other cores when there isn’t a conflict (a well designed driver makes this possible and efficient). In the case of drivers an IRQ typically triggers the start of running the driver…if the driver can be split into two pieces where the hardware dependent I/O is in one piece, then the software half can go elsewhere and free up that core for other hardware dependent drivers. ksoftirqd schedules these software-only parts of drivers onto other cores just like separate programs might compete for and get scheduled time from CPUs. To know what is going on from your program one would have to know everything linked to it, and also have to know what the scheduler considers as conflicts.

The scheduler itself determines where things actually run. For the most part attempts to set affinity are hints and pressure to run on a given core…the scheduler is the final authority as to where it actually runs. Should your process not be dependent on hardware I/O, and thus able to run on any core, then most likely trying to force affinity to a non-CPU0 core would get 100% of its time on a non-CPU0 core…running purely on CPU0 will be rejected if something else is deemed a conflict and if that part can run on another core.

Every library function could itself spawn a thread, and that thread could be the same process ID, and yet end up eventually being managed by ksoftirqd…ksoftirqd might honor the request for CPU0, or it might not. Consider that if hardware must use CPU0, then the scheduler has to decide whether to wait to service the software IRQ on CPU0 or to migrate it to another core…since hardware access is rather important, there is a good chance ksoftirqd will migrate to another core when this is practical (there would be a cache miss, but the scheduler is thinking this cache miss is less costly than waiting for CPU0).

Perhaps a more productive approach is to ask what your program does, and why it must run on CPU0 specifically? Or perhaps a description of what you are trying to accomplish. The question of what the program does might be more properly phrased as “what must the program access in terms of hardware resources and data sources”.

Dear linuxdev,

Thanks a million for your detailed explanation.

  1. All of the threads don't have to run on CPU0; what I did was just an experiment to see whether the 3 lines of code below do what I expected, which is that CPU0 is the only busy CPU while all the others stay idle.

CPU_ZERO (&cpuset);
CPU_SET (0, &cpuset);
rc = sched_setaffinity (pid, size_cpu_set, &cpuset);

Apparently the system didn't do what I coded. Maybe, as you said, the CPU affinity I set is just a hint and the system has its own mechanism.

  2. My purpose is very simple: the TX2 has Denver cores and A57 cores, and Denver is more powerful than the A57. I have 6 threads in 1 program; some of them are very simple and some need complex computation. I want to allocate the complex-computation threads to Denver and the simple-computation threads to the A57.

As long as that process does not require physical hardware access (at least none which isn’t wired to all cores…the memory controller is an example of something wired to all cores), then you should be able to put this process on some particular core other than CPU0. Keep in mind that if your process triggers other hardware access (e.g., perhaps it needs to read eMMC or USB), then there is still an indirect dependency on CPU0.

Also keep in mind that sometimes it may seem that separating processes to different cores would boost performance, but this depends in part on cache behavior. If you run multiple threads on one core, then there will be cache hits if something is in common…if they run on separate cores, then it will be 100% cache miss. The scheduler has some concept of this, and so sometimes seeing a multi-threaded program running on just one core is actually the better choice.

Separate processes which are not sharing anything are better candidates for running on different cores. Consider that there is a certain amount of context which must be saved when the CPU switches to some other process…much is security related and goes beyond just the state of what the program has in registers for computation. A thread (as opposed to a full separate process) shares part of this context (e.g., both threads of one process share security)…as such, swapping from thread to thread is more efficient and faster than swapping between two separate processes. You could take this even further in the case of a particle engine where yield points are chosen such that going from one particle to the next has no requirement at all to save certain state (sometimes this is called a coroutine or a microthread…this relies on the programmer to know what is needed to switch between particles and does not use the scheduler). When you run multiple threads from a single process there is a reason why the scheduler often keeps threads on a single core…at least part of the context required for loading into registers is not required when swapping context on a single core because this state is already present…doing so across multiple cores guarantees more register loads on context switch because none of this state is already loaded on the different core.
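
To match the goal described earlier (heavy threads on the Denver cores, light threads on the A57 cores), one option is to give each thread a set of allowed cores rather than a single core…one cpu_set_t holding the Denver CPUs and one holding the A57 CPUs…and let the scheduler pick within that set. The sketch below uses std::thread together with the GNU pthread call via native_handle(). The mapping of CPUs 1-2 to Denver and CPUs 0, 3, 4, 5 to A57 is an assumption and must be verified on the running board (lscpu or /proc/cpuinfo), and keep in mind the Denver cores are offline in some nvpmodel power modes.

#include <pthread.h>
#include <sched.h>
#include <cstddef>
#include <cstdio>
#include <initializer_list>
#include <thread>
#include <vector>

/* build a cpu_set_t from a list of logical CPU numbers */
static cpu_set_t make_set(std::initializer_list<int> cpus) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c : cpus)
        CPU_SET(c, &set);
    return set;
}

/* std::thread::native_handle() is a pthread_t on Linux/glibc */
static void pin(std::thread &t, const cpu_set_t &set) {
    pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}

int main() {
    /* assumed TX2 numbering -- verify before relying on it */
    cpu_set_t denver = make_set({1, 2});
    cpu_set_t a57    = make_set({0, 3, 4, 5});

    auto heavy = [] { /* complex computation would go here */ };
    auto light = [] { /* simple computation would go here */ };

    std::vector<std::thread> threads;
    threads.emplace_back(heavy);   /* threads 0-1: heavy work -> Denver */
    threads.emplace_back(heavy);
    threads.emplace_back(light);   /* threads 2-5: light work -> A57 */
    threads.emplace_back(light);
    threads.emplace_back(light);
    threads.emplace_back(light);

    /* threads may briefly start elsewhere; the affinity takes effect here */
    for (std::size_t i = 0; i < threads.size(); ++i)
        pin(threads[i], i < 2 ? denver : a57);

    for (auto &t : threads)
        t.join();
    std::puts("all threads finished");
    return 0;
}

Handing the scheduler a whole cluster rather than one fixed core keeps the cache and migration points above in play: the scheduler can still balance threads within the cluster it was given.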

Hi Linuxdev,

As you said

"The scheduler itself determines where things actually run. 
For the most part attempts to set affinity are hints and pressure to run on a given core...
the scheduler is the final authority as to where it actually runs."

Then it seems that we will never be able to set affinity to limit a specific thread to a specific CPU/core, because we can never guarantee that my own affinity setting matches the system's automatic allocation exactly, especially when the system allocates threads and CPUs dynamically, and the system has the right to ignore the user's setting.

What's more, for multi-CPU/core systems, sched_setaffinity() called by the user will be useless if the system sticks to its own mechanisms. Where can I find a description of how the system manages its own mechanism versus the user's input?

Note: even if I write a simple loop and set affinity to 1 CPU, when I run it, all 6 CPUs are triggered to run. I know that, from the system's side, it's not good if the system only uses 1 CPU and just leaves the other 5 CPUs asleep. But in a lot of applications people would like 1 CPU to take care of 1 specific thread; how can I do that?

Hi heyworld, linuxdev may have further suggestions to help you from the default Ubuntu environment; however, you may also be interested in Concurrent's RedHawk realtime Linux kernel, which has support for the Jetson TX2 and includes tools for locking the affinity.

If your process is purely software and does not involve talking through some of the I/O (such as GPIO), then the scheduler would honor your request if no other higher priority process needs that core. Keep in mind that much depends on the nature of your process and we don’t know anything about your process…plus it depends on the nature of other processes.

More needs to be known about your process, e.g., is it a hardware I/O driver? What other drivers does it use as it runs, e.g., would it use a USB driver or GPIO? Does data come from disk, ethernet, and so on?

One thing I am not clear on is if the GPU is accessible to all cores. I say this because the memory controller is accessible to all cores (and thus so is the GPU), but GPIO and some of the other hardware must go through CPU0. If GPU gets data from USB or ethernet drivers, then somewhere in that chain even if GPU runs from any core there will still be a limitation in getting data to the GPU…there would be an indirect dependency. Once data is in the GPU then I’d think it could run anywhere…if the driver is arranged wrong then what you’d see is part of the process running on CPU0 while obtaining data, and then the possibility of it migrating to another core…but if it is already on CPU0 I couldn’t say what the scheduler would do. What I’m getting at is that for anything running in kernel space the way you arrange its execution can change whether a core is available and whether the scheduler can or will honor affinity. For user space this typically is a separation of various parts of what the kernel does and will mostly go to the core you want when it isn’t hardware I/O. If caching is involved, then there will be some pressure to not migrate elsewhere.

Note that on a desktop multi-core PC there is a mechanism to be able to deal with hardware interrupts being sent to any core…a TX2 does not have this for hardware IRQs, and hardware IRQs are what start code running in any particular hardware driver. Software IRQs can always go wherever the scheduler wants them to go, but only CPU0 on a Jetson will run hardware IRQs (you might start a hardware IRQ elsewhere, but it will migrate to CPU0).

If this is a driver there comes a question about whether there are parts of the driver which depend on hardware, and yet software-only parts could run separately…you could shorten the time needed on hardware and spawn the rest to kthreadd…the kthreadd part could be bound to a different core, the hardware-only part would demand running on CPU0 in competition with everything else.

A second thing to consider is running your process at a higher priority (a lower “nice” value). You can get into trouble if you run yours at too high of a priority since you end up with a priority inversion in some cases, e.g., if its priority is too high but it uses the eMMC and eMMC itself can’t respond because your process is blocking it. Generally speaking a process runs by default at a nice value of “0”, but you can set it to “-2” and get quite a priority boost in comparison to other user space and priority 0 processes. If you get to “-2” and it doesn’t help, then priority changes won’t help in general (if you try “-5” you’re just asking for trouble…either the code is arranged well to do this or it isn’t). This applies a significant amount of pressure on the scheduler to give your process priority instead of the other way around.

You can use the “nice” command on command line, plus there is a C API version. If you run “man -a nice”, then as you quit one man page section it’ll bring up the next man page section’s information (section 2 has the C API information, section 1 has the command line information). The command “renice” can alter an existing process. Only root can nice or renice to a negative value (to a higher priority…though the section 2 man page mentions an exception…see “RLIMIT_NICE” and the referenced “getrlimit” man page).
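
For the section 2 (C API) side of that, the call is nice() from <unistd.h>. A tiny sketch, assuming the program is started with enough privilege (root, or an RLIMIT_NICE that allows it) for the negative value to be accepted:

#include <unistd.h>
#include <cerrno>
#include <cstdio>
#include <cstring>

int main() {
    /* nice(2) adds its argument to the current nice value; starting from the
       default of 0, nice(-2) asks for a nice value of -2 (a higher priority). */
    errno = 0;
    int new_nice = nice(-2);
    if (new_nice == -1 && errno != 0)
        std::fprintf(stderr, "nice failed: %s\n", std::strerror(errno));
    else
        std::printf("now running at nice %d\n", new_nice);
    /* ...the real work starts here; threads created after this point inherit
       the new priority... */
    return 0;
}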

Dusty mentioned RedHawk, which is hard realtime. Linux can be used for soft realtime, but if you need guaranteed deterministic realtime then you need something like RedHawk. Even RedHawk would not be able to migrate to different cores for some component which requires wiring which only goes to CPU0…so it depends again on the nature of your program and the resources it uses. You would however get much more control over getting your process to run when you want…the overall speed of the system may not be quite as fast, but it would be more predictable and give you more control.