
Bug Fix: Thread Leak in client.go #131

Merged
directionless merged 1 commit into osquery:master from brian-mckinney:thread_leak_bugfix
Feb 26, 2026

Conversation

@brian-mckinney
Contributor

Description

We (Elastic) had a customer report excessive resource usage from osqueryd.exe after a recent update of Elastic's osquerybeat.

We did some research (and consulted with Claude a bit) and identified a thread leak in osquery-go. When a program creates a new ExtensionManagerClient using the Go bindings, trans is created and passed to the thrift client factory, but it is never assigned to c.transport. Later, when the client is closed, Close() finds c.transport nil and does not close the transport, so it is left dangling.

When the calling program exits, the threads are cleaned up in osqueryd. However, if the calling program never exits, the threads stay alive in osqueryd in a waiting state indefinitely.

In our case, osquerybeat was creating a client once every 30s, issuing a query, then closing the client, while the process itself stayed alive. So the customer saw unbounded growth of threads in osqueryd.

Solution

The fix is simple: add one line assigning the transport to c.transport, and it will be closed properly when the client is closed.

Offending Code

osquery-go/client.go

Lines 51 to 88 in f77b3a1

func NewClient(path string, socketOpenTimeout time.Duration, opts ...ClientOption) (*ExtensionManagerClient, error) {
	c := &ExtensionManagerClient{
		waitTime:    defaultWaitTime,
		maxWaitTime: defaultMaxWaitTime,
	}
	for _, opt := range opts {
		opt(c)
	}
	if c.waitTime > c.maxWaitTime {
		return nil, errors.New("default wait time larger than max wait time")
	}
	c.lock = NewLocker(c.waitTime, c.maxWaitTime)
	if c.client == nil {
		trans, err := transport.Open(path, socketOpenTimeout)
		if err != nil {
			return nil, err
		}
		c.client = osquery.NewExtensionManagerClientFactory(
			trans,
			thrift.NewTBinaryProtocolFactoryDefault(),
		)
	}
	return c, nil
}

// Close should be called to close the transport when use of the client is
// completed.
func (c *ExtensionManagerClient) Close() {
	if c.transport != nil && c.transport.IsOpen() {
		c.transport.Close()
	}
}

Proof of Concept

I had Claude help me make a proof of concept that exercises this bug.

POC code to trigger the bug:
package main

import (
	"context"
	"flag"
	"fmt"
	"log"
	"os"
	"os/signal"
	"time"

	osquery "github.com/osquery/osquery-go"
)

func main() {
	socketPath := flag.String("socket", "", "path to osqueryd's extension manager socket/pipe")
	interval := flag.Duration("interval", 1*time.Second, "delay between connections")
	count := flag.Int("count", 50, "number of connections to open")
	flag.Parse()

	if *socketPath == "" {
		fmt.Fprintln(os.Stderr, "error: --socket is required")
		flag.Usage()
		os.Exit(1)
	}

	if *count < 0 {
		fmt.Fprintln(os.Stderr, "error: --count must be non-negative")
		flag.Usage()
		os.Exit(1)
	}

	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt)
	defer stop()

	log.Printf("Target socket : %s", *socketPath)
	log.Printf("Interval      : %s", *interval)
	log.Printf("Max connections: %d", *count)
	log.Println("Watch osqueryd's thread count while this runs.")
	log.Println("Press Ctrl+C to stop.")
	fmt.Println()

	client_count := 0
	for {
		if client_count >= *count {
			log.Printf("Reached max connections (%d). Stopping.", *count)
			break
		}

		// Open a new client
		client, err := osquery.NewClient(*socketPath, 5*time.Second)
		if err != nil {
			log.Printf("[%d] connect error: %v", client_count, err)
			time.Sleep(*interval)
			continue
		}

		client_count++
		fmt.Printf("Opened client #%d\n", client_count)

		// Send a lightweight RPC so the server-side thread processes at least
		// one request (this is what makes it block on the *next* read forever).
		status, err := client.PingContext(ctx)
		if err != nil {
			log.Printf("ping error from client #%d: %v", client_count, err)
		} else {
			log.Printf("ping OK from client #%d (code=%d, msg=%q)", client_count, status.GetCode(), status.GetMessage())
		}

		// Close() should clean up the server-side thread, but with the bug it doesn't.
		log.Printf("Closing client #%d", client_count)
		client.Close()

		// Brief pause to make thread growth easy to observe.
		select {
		case <-ctx.Done():
			log.Printf("Interrupted after %d connections.", client_count)
			return
		case <-time.After(*interval):
		}
	}

	// Keep the process alive so the user can inspect osqueryd.
	log.Println("Holding process open. Press Ctrl+C to exit.")
	<-ctx.Done()
	log.Println("Done.")
}
  1. Start osqueryd in daemon mode:
     osqueryd.exe -D --allow_unsafe
  2. Create a folder and set up the POC code:
     mkdir C:\tmp\leak_poc
     cd C:\tmp\leak_poc
     <write main.go here>
     go mod init leak_poc
     go mod tidy
  3. Run the POC and observe osqueryd's thread count:
     go run . --socket "\\.\pipe\osquery.em" --interval .25s --count 500

@directionless (Member) left a comment
Nice finding this. It looks like I introduced this in #108 when I changed the return from return &ExtensionManagerClient{client, trans}, nil to return c, nil.

@directionless directionless merged commit 0cc22f4 into osquery:master Feb 26, 2026
3 checks passed