sched: add a reference count to the TCB to prevent it from being deleted. by hujun260 · Pull Request #17468 · apache/nuttx

hujun260 · 2025-12-10T08:16:02Z

Summary

Implement reference counting for task control blocks (TCBs) to enable fine-grained
synchronization and reduce reliance on global task locks. This mechanism allows code
to safely access TCBs without holding large locks, reducing contention and lock
recursion while ensuring TCBs remain valid during access.

should merge with apache/nuttx-apps#3246

Problem Statement

Current implementation relies on large global locks to protect task data structures.
The new reference counting approach provides a lightweight alternative that only
guarantees TCB validity rather than exclusive access, reducing:

Lock contention in multi-threaded scenarios
Critical section failed do to wait
Overhead of global synchronization

Key Changes

1. Core Implementation

Enhanced nxsched_get_tcb() and nxsched_put_tcb() for reference counting
Modified TCB release mechanism to respect reference counts
Added reference count field to task control block structure

2. Scheduler Updates (72 files)

Task management and lifecycle
Pthread operations
Signal and group handling
Memory and filesystem operations
Device drivers

Usage Pattern

// New: Reference counting
struct tcb_s *tcb = nxsched_get_tcb(pid);
if (tcb != NULL)
{
// ... safely access tcb ...
nxsched_put_tcb(tcb);
}

Impact

tcb release

Test Objective nucleo-g431rb:nsh

after this patch
size nuttx
text data bss dec hex filename
127964 904 8644 137512 21928 nuttx

before this patch
size nuttx
text data bss dec hex filename
127232 904 8612 136748 2162c nuttx

flash size increase 732 byte

Testing

test in hardware

esp32s3-devkit:nsh

user_main: scheduler lock test
sched_lock: Starting lowpri_thread at 97
sched_lock: Set lowpri_thread priority to 97
sched_lock: Starting highpri_thread at 98
sched_lock: Set highpri_thread priority to 98
sched_lock: Waiting...
sched_lock: PASSED No pre-emption occurred while scheduler was locked.
sched_lock: Starting lowpri_thread at 97
sched_lock: Set lowpri_thread priority to 97
sched_lock: Starting highpri_thread at 98
sched_lock: Set highpri_thread priority to 98
sched_lock: Waiting...
sched_lock: PASSED No pre-emption occurred while scheduler was locked.
sched_lock: Finished

End of test memory usage:
VARIABLE BEFORE AFTER
======== ======== ========
arena 5d8bc 5d8bc
ordblks 7 6
mxordblk 548a0 548a0
uordblks 5014 5014
fordblks 588a8 588a8

Final memory usage:
VARIABLE BEFORE AFTER
======== ======== ========
arena 5d8bc 5d8bc
ordblks 1 6
mxordblk 59238 548a0
uordblks 4684 5014
fordblks 59238 588a8
user_main: Exiting
ostest_main: Exiting with status 0
nsh> u
nsh: u: command not found
nsh>
nsh>
nsh>
nsh> uname -a
NuttX 12.11.0 ef91333e3ac-dirty Dec 10 2025 16:11:04 xtensa esp32s3-devkit
nsh>

hujun260 · 2026-01-30T13:05:26Z

It seems that not all implementations have replaced nxsched_get_tcb_noref, which means the system performance will still degrade. My test case is only intended to make you aware of the performance degradation caused by nxsched_get_tcb. The pthread_mutex case is just one of them—if I provide other test cases, your patch will have many more performance bottlenecks requiring optimization. could we address this issue from the perspective of architectural implementation?

Some kernel features cannot be made optional via configuration switches, as disabling them would introduce bugs.

anchao

I won't let this submission be merged:

code size will increase.
performance will degrade.

anchao · 2026-01-31T09:56:26Z

If you can't resolve these 2 issues, please stop changing the status of this PR.

hujun260 · 2026-02-01T03:18:37Z

I won't let this submission be merged:

code size will increase.

performance will degrade.

However, currently, using nxsched_get_tcb cannot guarantee safety even within critical sections, which is more critical than other factors. Furthermore, this patch reduces the scope of critical sections, thereby improving system real-time performance. As for the code size, it remains acceptable because the added safety justifies the increase.

anchao · 2026-02-02T02:21:28Z

I won't let this submission be merged:

code size will increase.

performance will degrade.

However, currently, using nxsched_get_tcb cannot guarantee safety even within critical sections, which is more critical than other factors. Furthermore, this patch reduces the scope of critical sections, thereby improving system real-time performance. As for the code size, it remains acceptable because the added safety justifies the increase.

Zephyr is no more safety either. If your proposal can be applied to any other RTOS, I will approve this submission.

hujun260 · 2026-02-02T02:29:25Z

Zephyr is no more safety either. If your proposal can be applied to any other RTOS, I will approve this submission.

We found an issue with NuttX, so why drag other RTOSes into this? I'm not even developing on Zephyr.

anchao · 2026-02-02T03:04:48Z

Zephyr is no more safety either. If your proposal can be applied to any other RTOS, I will approve this submission.

We found an issue with NuttX, so why drag other RTOSes into this? I'm not even developing on Zephyr.

None of the leading mainstream RTOSes have adopted this change, as it would introduce unnecessary performance overhead.
Your commit will affect the project I’m maintaining and cause a slowdown of the whole system.
A number of development boards in the community are IoT devices, whose clock speed and memory space are insufficient to support this feature.

Additionally, please stop setting the status manually – GitHub will send out unnecessary push notifications and emails because of this.

hujun260 · 2026-02-02T03:14:17Z

Zephyr is no more safety either. If your proposal can be applied to any other RTOS, I will approve this submission.

We found an issue with NuttX, so why drag other RTOSes into this? I'm not even developing on Zephyr.

None of the leading mainstream RTOSes have adopted this change, as it would introduce unnecessary performance overhead.

Your commit will affect the project I’m maintaining and cause a slowdown of the whole system.

A number of development boards in the community are IoT devices, whose clock speed and memory space are insufficient to support this feature.

Additionally, please stop setting the status manually – GitHub will send out unnecessary push notifications and emails because of this.

I will proceed with submitting my proposal.
1 On NuttX, this is a fundamental strategy to ensure the safe usage of TCBs, and it also reduces interrupt disable time.
2 This submission will not impact the core system performance; I have already provided RTOS benchmark data.
3 The increase in code size is controllable and will not cause compilation failures for developers in the community.

anchao · 2026-02-02T03:17:57Z

Zephyr is no more safety either. If your proposal can be applied to any other RTOS, I will approve this submission.

We found an issue with NuttX, so why drag other RTOSes into this? I'm not even developing on Zephyr.

None of the leading mainstream RTOSes have adopted this change, as it would introduce unnecessary performance overhead.

Your commit will affect the project I’m maintaining and cause a slowdown of the whole system.

A number of development boards in the community are IoT devices, whose clock speed and memory space are insufficient to support this feature.

Additionally, please stop setting the status manually – GitHub will send out unnecessary push notifications and emails because of this.

I will proceed with submitting my proposal. 1 On NuttX, this is a fundamental strategy to ensure the safe usage of TCBs, and it also reduces interrupt disable time. 2 This submission will not impact the core system performance; I have already provided RTOS benchmark data. 3 The increase in code size is controllable and will not cause compilation failures for developers in the community.

If the only way you can avoid hitting performance is to add a noref method, then what’s the point of having ref in the first place? Should we use noref everywhere instead? Why do I have to bear the extra overhead for the implementation of ref?

xiaoxiang781216 · 2026-02-02T03:45:47Z

I won't let this submission be merged:

code size will increase.

performance will degrade.

without refcount, the used after free will happen under the stress testing of task/thread create/destroy. Used after free is a critical bug which must be fixed. refcount is one of well known method to fix this type of error, if you have better suggestion/method, please point out.

hujun260 · 2026-02-02T03:46:33Z

Zephyr is no more safety either. If your proposal can be applied to any other RTOS, I will approve this submission.

We found an issue with NuttX, so why drag other RTOSes into this? I'm not even developing on Zephyr.

None of the leading mainstream RTOSes have adopted this change, as it would introduce unnecessary performance overhead.

Your commit will affect the project I’m maintaining and cause a slowdown of the whole system.

A number of development boards in the community are IoT devices, whose clock speed and memory space are insufficient to support this feature.

Additionally, please stop setting the status manually – GitHub will send out unnecessary push notifications and emails because of this.

I will proceed with submitting my proposal. 1 On NuttX, this is a fundamental strategy to ensure the safe usage of TCBs, and it also reduces interrupt disable time. 2 This submission will not impact the core system performance; I have already provided RTOS benchmark data. 3 The increase in code size is controllable and will not cause compilation failures for developers in the community.

If the only way you can avoid hitting performance is to add a noref method, then what’s the point of having ref in the first place? Should we use noref everywhere instead? Why do I have to bear the extra overhead for the implementation of ref?

Setting aside the naming issue for now,
Currently, nxsched_get_tcb_noref is only used to implement nxsched_verify_pid.
If we adopt two different usage patterns for TCBs in the future:
1 nxsched_get_tcb_noref + enter_critical_section (unsafe)
2 nxsched_get_tcb and nxsched_put_tcb
This would break code consistency and make maintenance difficult. I believe the pattern (nxsched_get_tcb_noref + enter_critical_section) should only be reserved for extreme, isolated cases where performance is critical.
Our priority should be correctness first, with performance being optimized on a case-by-case basis.

anchao · 2026-02-02T03:54:36Z

I won't let this submission be merged:

code size will increase.

performance will degrade.

without refcount, the used after free will happen under the stress testing of task/thread create/destroy. Used after free is a critical bug which must be fixed. refcount is one of well known method to fix this type of error, if you have better suggestion/method, please point out.

This is because in your business logic, Task A still holds the resources of Task B when it exits. Why can't the resources be released properly?
Most IoT and wearable devices use a fixed thread pool. If your business process involves the frequent creation and destruction of a large number of threads, why not design them as deterministic tasks and implement the creation and exit logic with wait/notify mechanisms?
The only acceptable solution to me is to make this feature configurable via Kconfig – no other options are on the table.

anchao · 2026-02-02T03:56:51Z

Zephyr is no more safety either. If your proposal can be applied to any other RTOS, I will approve this submission.

We found an issue with NuttX, so why drag other RTOSes into this? I'm not even developing on Zephyr.

None of the leading mainstream RTOSes have adopted this change, as it would introduce unnecessary performance overhead.

Your commit will affect the project I’m maintaining and cause a slowdown of the whole system.

A number of development boards in the community are IoT devices, whose clock speed and memory space are insufficient to support this feature.

Additionally, please stop setting the status manually – GitHub will send out unnecessary push notifications and emails because of this.

I will proceed with submitting my proposal. 1 On NuttX, this is a fundamental strategy to ensure the safe usage of TCBs, and it also reduces interrupt disable time. 2 This submission will not impact the core system performance; I have already provided RTOS benchmark data. 3 The increase in code size is controllable and will not cause compilation failures for developers in the community.

If the only way you can avoid hitting performance is to add a noref method, then what’s the point of having ref in the first place? Should we use noref everywhere instead? Why do I have to bear the extra overhead for the implementation of ref?

Setting aside the naming issue for now, Currently, nxsched_get_tcb_noref is only used to implement nxsched_verify_pid. If we adopt two different usage patterns for TCBs in the future: 1 nxsched_get_tcb_noref + enter_critical_section (unsafe) 2 nxsched_get_tcb and nxsched_put_tcb This would break code consistency and make maintenance difficult. I believe the pattern (nxsched_get_tcb_noref + enter_critical_section) should only be reserved for extreme, isolated cases where performance is critical. Our priority should be correctness first, with performance being optimized on a case-by-case basis.

The top priority is to ensure the kernel incurs no performance or footprint impact, right? For the sake of this so-called unnecessary security, making all developers bear the extra performance overhead—isn’t that the biggest issue here?

anchao · 2026-02-02T04:07:15Z

One more update for you all:

In the process of promoting NuttX right now, most business stakeholders have extremely high requirements for kernel performance and will opt for Zephyr as their first choice. Developers also have numerous doubts about the stability and performance of the NuttX kernel.
When working on your internal solutions, you’d better research how other companies have implemented their corresponding solutions. Even Bestechnic, your Tier 1 supplier, does not recommend using your framework for headphone and Wi-Fi/BT module solutions.
The next-generation OS will be built on the first principles rather than being based on excessive assumptions. Any business implementations with poor performance or flaws should be phased out and optimized, instead of having the operating system do a great deal of unnecessary fallback work for them.

hujun260 · 2026-02-02T04:30:07Z

One more update for you all:

In the process of promoting NuttX right now, most business stakeholders have extremely high requirements for kernel performance and will opt for Zephyr as their first choice. Developers also have numerous doubts about the stability and performance of the NuttX kernel.

When working on your internal solutions, you’d better research how other companies have implemented their corresponding solutions. Even Bestechnic, your Tier 1 supplier, does not recommend using your framework for headphone and Wi-Fi/BT module solutions.

The next-generation OS will be built on the first principles rather than being based on excessive assumptions. Any business implementations with poor performance or flaws should be phased out and optimized, instead of having the operating system do a great deal of unnecessary fallback work for them.

It just means we are encountering different usage scenarios. While enter_critical_section might work for very simple cases, But, nuttX needs to evolve.

Memory safety issues are not imaginary.
Let's take the function nxsched_foreach as an example. We cannot guarantee that the tcb parameter passed to the handler is always valid. Neither enter_critical_section nor sched_lock can solve this problem. If the handler performs a wait operation, it will invalidate the entire critical section and the scheduling lock.

xiaoxiang781216 · 2026-02-02T04:30:19Z

I won't let this submission be merged:

code size will increase.

performance will degrade.

without refcount, the used after free will happen under the stress testing of task/thread create/destroy. Used after free is a critical bug which must be fixed. refcount is one of well known method to fix this type of error, if you have better suggestion/method, please point out.

This is because in your business logic, Task A still holds the resources of Task B when it exits. Why can't the resources be released properly?

Most IoT and wearable devices use a fixed thread pool. If your business process involves the frequent creation and destruction of a large number of threads, why not design them as deterministic tasks and implement the creation and exit logic with wait/notify mechanisms?

No, I just consider POSIX spec, not private business logic. POSIX spec allow appliation create/destroy task/thread dynamically, we need fix the problem. Let's hightlight the inviolable rule here:
https://github.com/apache/nuttx/blob/master/INVIOLABLES.md#strict-posix-compliance

The only acceptable solution to me is to make this feature configurable via Kconfig – no other options are on the table.

it's bug fix, I don't unerstand why we add an option to choice whether to fix a critical(memory corruption) bug.

anchao · 2026-02-02T05:07:56Z

One more update for you all:

In the process of promoting NuttX right now, most business stakeholders have extremely high requirements for kernel performance and will opt for Zephyr as their first choice. Developers also have numerous doubts about the stability and performance of the NuttX kernel.

When working on your internal solutions, you’d better research how other companies have implemented their corresponding solutions. Even Bestechnic, your Tier 1 supplier, does not recommend using your framework for headphone and Wi-Fi/BT module solutions.

The next-generation OS will be built on the first principles rather than being based on excessive assumptions. Any business implementations with poor performance or flaws should be phased out and optimized, instead of having the operating system do a great deal of unnecessary fallback work for them.

It just means we are encountering different usage scenarios. While enter_critical_section might work for very simple cases, But, nuttX needs to evolve.

Memory safety issues are not imaginary. Let's take the function nxsched_foreach as an example. We cannot guarantee that the tcb parameter passed to the handler is always valid. Neither enter_critical_section nor sched_lock can solve this problem. If the handler performs a wait operation, it will invalidate the entire critical section and the scheduling lock.

Shouldn't we optimize the implementation here?

anchao · 2026-02-02T05:10:21Z

I won't let this submission be merged:

code size will increase.

performance will degrade.

without refcount, the used after free will happen under the stress testing of task/thread create/destroy. Used after free is a critical bug which must be fixed. refcount is one of well known method to fix this type of error, if you have better suggestion/method, please point out.

This is because in your business logic, Task A still holds the resources of Task B when it exits. Why can't the resources be released properly?

Most IoT and wearable devices use a fixed thread pool. If your business process involves the frequent creation and destruction of a large number of threads, why not design them as deterministic tasks and implement the creation and exit logic with wait/notify mechanisms?

No, I just consider POSIX spec, not private business logic. POSIX spec allow appliation create/destroy task/thread dynamically, we need fix the problem. Let's hightlight the inviolable rule here: https://github.com/apache/nuttx/blob/master/INVIOLABLES.md#strict-posix-compliance

The only acceptable solution to me is to make this feature configurable via Kconfig – no other options are on the table.

it's bug fix, I don't unerstand why we add an option to choice whether to fix a critical(memory corruption) bug.

But this is not a bug on my end. There is no scenario in my commercial solution where threads are dynamically created or deleted. If you insist on merging this commit, make it configurable—this way it won’t block existing users. This is my final concession.

hujun260 · 2026-02-02T11:47:19Z

One more update for you all:

In the process of promoting NuttX right now, most business stakeholders have extremely high requirements for kernel performance and will opt for Zephyr as their first choice. Developers also have numerous doubts about the stability and performance of the NuttX kernel.

When working on your internal solutions, you’d better research how other companies have implemented their corresponding solutions. Even Bestechnic, your Tier 1 supplier, does not recommend using your framework for headphone and Wi-Fi/BT module solutions.

The next-generation OS will be built on the first principles rather than being based on excessive assumptions. Any business implementations with poor performance or flaws should be phased out and optimized, instead of having the operating system do a great deal of unnecessary fallback work for them.

It just means we are encountering different usage scenarios. While enter_critical_section might work for very simple cases, But, nuttX needs to evolve.
Memory safety issues are not imaginary. Let's take the function nxsched_foreach as an example. We cannot guarantee that the tcb parameter passed to the handler is always valid. Neither enter_critical_section nor sched_lock can solve this problem. If the handler performs a wait operation, it will invalidate the entire critical section and the scheduling lock.

Shouldn't we optimize the implementation here?

This is just one example. Your suggestion hardly solves this category of issues at the root. The fundamental problem is that critical sections are simply not a safe solution in complex scenarios. Take task_fssync for instance, which involves a wait operation.

I think you are somewhat overstating the performance impact of this patch. The so-called core performance of an RTOS primarily revolves around context switching and mutex/pthread_mutex operations, where this change has little effect. Moreover, the usage scenarios for nxsched_get_tcb are mostly non-frequently-called interfaces.

…deleted. To replace the large lock with smaller ones and reduce the large locks related to the TCB, in many scenarios, we only need to ensure that the TCB won't be released instead of locking, thus reducing the possibility of lock recursion. Signed-off-by: hujun5 <hujun5@xiaomi.com>

hujun260 requested review from Donny9, GUIDINGLI, Ouss4, anchao, davids5, fjpanag, gustavonihei, hartmannathan, jerpelea, masayuki2009, pkarashchenko, pussuw, simbit18, tmedicci, xiaoxiang781216, yamt and yf13 as code owners December 10, 2025 08:16

hujun260 force-pushed the apache_12 branch from 1ef0f02 to 8ef0e9e Compare December 10, 2025 08:19

hujun260 mentioned this pull request Dec 10, 2025

add a reference count to the TCB to prevent it from being deleted. apache/nuttx-apps#3246

Merged

hujun260 force-pushed the apache_12 branch 3 times, most recently from f192057 to cae0ffd Compare December 11, 2025 06:54

hujun260 requested a review from acassis as a code owner December 11, 2025 06:54

github-actions bot added the Board: arm label Dec 11, 2025

hujun260 marked this pull request as ready for review January 26, 2026 06:19

anchao requested changes Jan 31, 2026

View reviewed changes

anchao marked this pull request as draft January 31, 2026 09:56

hujun260 marked this pull request as ready for review February 1, 2026 03:18

anchao marked this pull request as draft February 2, 2026 02:21

hujun260 marked this pull request as ready for review February 2, 2026 02:29

anchao marked this pull request as draft February 2, 2026 03:04

hujun260 marked this pull request as ready for review February 2, 2026 03:14

hujun260 marked this pull request as draft February 3, 2026 11:23

hujun260 force-pushed the apache_12 branch from 66cc16d to 414c423 Compare February 3, 2026 12:21

xiaoxiang781216 closed this Feb 3, 2026

Conversation

hujun260 commented Dec 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem Statement

Key Changes

1. Core Implementation

2. Scheduler Updates (72 files)

Usage Pattern

Impact

Testing

Uh oh!

hujun260 commented Jan 30, 2026

Uh oh!

anchao left a comment

Choose a reason for hiding this comment

Uh oh!

anchao commented Jan 31, 2026

Uh oh!

hujun260 commented Feb 1, 2026

Uh oh!

anchao commented Feb 2, 2026

Uh oh!

hujun260 commented Feb 2, 2026

Uh oh!

anchao commented Feb 2, 2026

Uh oh!

hujun260 commented Feb 2, 2026

Uh oh!

anchao commented Feb 2, 2026

Uh oh!

xiaoxiang781216 commented Feb 2, 2026

Uh oh!

hujun260 commented Feb 2, 2026

Uh oh!

anchao commented Feb 2, 2026

Uh oh!

anchao commented Feb 2, 2026

Uh oh!

anchao commented Feb 2, 2026

Uh oh!

hujun260 commented Feb 2, 2026

Uh oh!

xiaoxiang781216 commented Feb 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

anchao commented Feb 2, 2026

Uh oh!

anchao commented Feb 2, 2026

Uh oh!

hujun260 commented Feb 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

10 participants

hujun260 commented Dec 10, 2025 •

edited

Loading

xiaoxiang781216 commented Feb 2, 2026 •

edited

Loading