Tweag

Getting started with CodeQL, GitHub's declarative static analyzer for security

7 August 2025 — by Clément Hurlin

CodeQL is a declarative static analyzer owned by GitHub, whose purpose is to discover security vulnerabilities. Declarative means that, to use CodeQL, you write rules describing the vulnerabilities you want to catch, and you let an engine check your rules against your code. If there is a match, an alert is raised. Static means that it checks your source code, as opposed to checking specific runs. Owned by GitHub means that CodeQL’s engine is not open-source: it is free to use only on research and open-source code. If you want to use CodeQL on proprietary code, you need a GitHub Advanced Security license. The CodeQL rules that model specific programming languages and libraries, however, are open-source.

CodeQL is designed to do two things:

  1. Perform all kinds of quality and compliance checks. CodeQL’s query language is expressive enough to describe a variety of patterns (e.g., “find any loop, enclosed in a function named foo, when the loop’s body contains a call to function bar”). As such, it enables complex, semantic queries over codebases, which can uncover a wide range of issues and patterns.
  2. Track the flow of tainted data. Tainted data is data provided by a potentially malicious user. If tainted data is sent to critical operations (database requests, custom processes) without being sanitized, it can have catastrophic consequences, such as data loss, a data breach, arbitrary code execution, etc. Statements of your source code from where tainted data originates are called sources, while statements of your source code where tainted data is consumed are called sinks.
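To make the source, sink, and sanitizer vocabulary concrete, here is a minimal Python sketch of the idea. It is a toy model that checks taint at runtime, whereas CodeQL reasons about the same flows statically, without running the code. All names in it (`Tainted`, `read_form_field`, `sanitize`, `run_command`) are hypothetical and are not part of CodeQL or of any library.

```python
import shlex


class Tainted(str):
    """Hypothetical marker type: a string that came from an untrusted user."""


def read_form_field(raw: str) -> Tainted:
    """Source: the place where user-controlled data enters the program."""
    return Tainted(raw)


def sanitize(value: str) -> str:
    """Sanitizer: shell-quote the value and return a plain (untainted) str."""
    # str(...) drops the Tainted subclass when shlex.quote returns its input unchanged.
    return str(shlex.quote(value))


def run_command(command: str) -> str:
    """Sink: a critical operation that must never receive tainted data."""
    if isinstance(command, Tainted):
        raise ValueError("tainted data reached a sink")
    # A real sink would execute the command; this sketch only reports it.
    return f"would run: {command}"


if __name__ == "__main__":
    user_input = read_form_field("ls; rm -rf ~")
    print(run_command(sanitize(user_input)))
    try:
        run_command(user_input)
    except ValueError as error:
        print(f"rejected: {error}")
```

A taint tracking query answers the same question as the `isinstance` check above, but for every possible path from a source to a sink in the codebase.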

This tutorial is aimed at software and security engineers who want to try out CodeQL, focusing on the second use case above. I explain how to set up CodeQL and how to write your first taint tracking query, and I give a methodology for doing so.

Writing the vulnerable code

First, I need to write some code to execute my query against. As the attack surface, I’m choosing calls to the sarge Python library, for three reasons:

  • It is available on PyPI, so it is easy to install.
  • It is niche enough that it is not already modeled in CodeQL’s Python standard library, so CodeQL’s out-of-the-box queries won’t catch vulnerabilities that use sarge. We need to write our own rules.
  • It performs calls to subprocess.Popen, which is a data sink. As a consequence, code calling sarge is prone to having command injection vulnerabilities.

For my data source, I use flask. That’s because HTTP requests contain user-provided data, and as such, they are modeled as data sources in CodeQL’s standard library. With both sarge and flask in place, we can write the following vulnerable code:

from flask import Flask, request

import sarge

app = Flask(__name__)


@app.route("/", methods=["POST"])
def user_to_sarge_run():
    """This function shows a vulnerability: it forwards user input (through a POST request) to sarge.run."""
    print("/ handler")
    if request.method != "POST":
        return "Method not allowed"
    default_value = "default"
    received: str = request.form.get("key", "default")
    print(f"Received: {received}")
    sarge.run(received)  # Unsafe, don't do that!
    return "Called sarge"

To run the application locally, execute in one terminal:

> flask --debug run

In another terminal, trigger the vulnerability as follows:

> curl -X POST http://localhost:5000/ -d "key=ls"

Now observe that in the terminal running the app, the ls command (provided by the user! 💣) was executed:

/ handler
Received: ls
app.py	__pycache__  README.md	requirements.txt

Wow, pretty scary, right? What if I had passed the string rm -Rf ~/*? Now let’s see how to catch this vulnerability with CodeQL.

Running CodeQL on the CLI

To run CodeQL on the CLI, I need to download the CodeQL binaries from the github/codeql-cli-binaries repository. At the time of writing, there are CodeQL binaries for the three major platforms. Where I clone this repository doesn’t matter, as long as the codeql binary ends up in PATH. Then, because I am going to write my own queries (as opposed to solely using the queries shipped with CodeQL), I need to clone CodeQL’s standard library: github/codeql. I recommend putting this repository in a folder that is a sibling of the repository being analyzed. In this manner, the codeql binary will find it automatically.

Before I write my own query, let’s run standard CodeQL queries for Python. First, I need to create a database. Instead of analyzing code at each run, CodeQL’s way of operating is to:

  1. Store the code in a database,
  2. Then run one or many queries on the database.

While I develop a query, and so iterate on step 2 above, keeping the two steps distinct saves computing time. As long as the code being analyzed doesn’t change, there is no need to rebuild the database. Let’s build the database as follows:

> codeql database create --language=python codeql-db --source-root=.
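The two-step workflow can be sketched as a small Python helper that only rebuilds the database when it is missing. This is a hedged sketch, not an official wrapper: the function names (`create_command`, `analyze_command`, `plan`) are hypothetical, and the flags are only the ones that appear in this tutorial's commands (the analysis flags are the ones used in the next step). It assumes that an existing database folder is still up to date, which is only true if the analyzed code has not changed.

```python
from pathlib import Path


def create_command(db: Path, source_root: Path) -> list[str]:
    """Command line for step 1: store the code in a database."""
    return [
        "codeql", "database", "create",
        "--language=python", str(db), f"--source-root={source_root}",
    ]


def analyze_command(db: Path, query: str, output: Path) -> list[str]:
    """Command line for step 2: run a query (or query suite) on the database."""
    return [
        "codeql", "database", "analyze", str(db), query,
        "--format=sarif-latest", f"--output={output}",
    ]


def plan(db: Path, source_root: Path, query: str, output: Path) -> list[list[str]]:
    """Return the commands to run, skipping step 1 if the database folder exists."""
    commands = []
    if not db.exists():
        commands.append(create_command(db, source_root))
    commands.append(analyze_command(db, query, output))
    return commands


if __name__ == "__main__":
    for command in plan(Path("codeql-db"), Path("."),
                        "python-security-and-quality", Path("codeql.sarif")):
        print(" ".join(command))
```

Each returned command can be passed to `subprocess.run(command, check=True)`. When the analyzed code changes, delete the database folder so that step 1 runs again.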

Now that the database is created, let’s run the python-security-and-quality queries, a set of default queries for Python provided by CodeQL’s standard library:

> codeql database analyze codeql-db python-security-and-quality --format=sarif-latest --output=codeql.sarif
# Now, transform the SARIF output into CSV, for better human readability, using https://pypi.org/project/sarif-tools/
> sarif csv codeql.sarif
> cat codeql.csv
Tool,Severity,Code,Description,Location,Line
CodeQL,note,py/unused-local-variable,Variable default_value is not used.,app.py,12

Indeed, in the snippet above, it looks like the developer intended to use a variable to store the value "default" but forgot to use it in the end. This is not a security vulnerability, but it exemplifies the kind of programming mistakes that CodeQL’s default rules find. Note that the vulnerability of passing data from the POST request to the sarge.run call is not yet caught. That is because sarge is not in CodeQL’s list of supported Python libraries.
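If you prefer not to install an extra tool, the SARIF file is plain JSON and can be flattened with Python's standard library. The sketch below is an assumption-laden alternative to sarif-tools, not a replacement for it: the function name `sarif_rows` is hypothetical, and it assumes every result carries a `ruleId` and at least one location with a `region.startLine`, which the SARIF format does not guarantee. The `SAMPLE` document is hand-written to mirror the row shown above; it is not real CodeQL output.

```python
import json


def sarif_rows(sarif: dict) -> list[tuple[str, str, str, str, int]]:
    """Flatten SARIF results into (tool, rule, message, file, line) rows."""
    rows = []
    for run in sarif.get("runs", []):
        tool = run["tool"]["driver"]["name"]
        for result in run.get("results", []):
            # Assumption: the first location is the one worth reporting.
            location = result["locations"][0]["physicalLocation"]
            rows.append((
                tool,
                result["ruleId"],
                result["message"]["text"],
                location["artifactLocation"]["uri"],
                location["region"]["startLine"],
            ))
    return rows


# Hand-written sample mirroring the CSV row above (not real CodeQL output).
SAMPLE = json.loads("""
{
  "version": "2.1.0",
  "runs": [{
    "tool": {"driver": {"name": "CodeQL"}},
    "results": [{
      "ruleId": "py/unused-local-variable",
      "message": {"text": "Variable default_value is not used."},
      "locations": [{"physicalLocation": {
        "artifactLocation": {"uri": "app.py"},
        "region": {"startLine": 12}
      }}]
    }]
  }]
}
""")

if __name__ == "__main__":
    for row in sarif_rows(SAMPLE):
        print(",".join(str(field) for field in row))
```

To use it on a real run, replace `SAMPLE` with `json.load(open("codeql.sarif"))`.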

Writing a query to model sarge.run: modeling the source

The sarge.run function executes a command, like subprocess does. As such, it is a sink for tainted data: one should make sure that data passed to sarge.run is trusted or has been sanitized.

CodeQL performs a modular analysis: it doesn’t inspect the source code of your dependencies. As a consequence, you need to model your dependencies’ behavior for them to be treated correctly by CodeQL’s analysis. Modeling tainted sources and sinks is done by implementing the DataFlow::ConfigSig interface:

/** An input configuration for data flow. */
signature module ConfigSig {
  /** Holds if `source` is a relevant data flow source. */
  predicate isSource(Node source);

  /** Holds if `sink` is a relevant data flow sink. */
  predicate isSink(Node sink);
}

In this snippet, a predicate is a function returning a Boolean, while Node is a class modeling statements in the source code. So, to implement isSource, I need to capture the Nodes that we deem relevant sources of tainted data with respect to sarge.run. Since any source of tainted data is dangerous if you send its content to sarge.run, I implement isSource as follows:

predicate isSource(DataFlow::Node source) { source instanceof ActiveThreatModelSource }

Threat models control which sources of data are considered dangerous. Usually, only remote sources (data in an HTTP request, packets from the network) are considered dangerous. That’s because, if local sources (content of local files, content passed by the user in the terminal) are tainted, it means an attacker already has such a level of control over your software that you are doomed. That is why CodeQL’s default threat model only considers remote sources.[1] In isSource, by using ActiveThreatModelSource, we declare that the sources of interest are the sources of the currently active threat model.

To make sure that ActiveThreatModelSource works correctly on my codebase, I write the following test query in file Scratch.ql:

import python
import semmle.python.Concepts

from ActiveThreatModelSource src
select src, "Tainted data source"

Because this file depends on CodeQL’s Python APIs, I need to put a qlpack.yml file next to Scratch.ql, as follows:

name: smelc/sarge-queries
version: 0.0.1
extractor: python
library: false
dependencies:
  codeql/python-queries: "*"

I can now execute Scratch.ql as follows:

> codeql database analyze codeql-db queries/Scratch.ql --format=sarif-latest --output=codeql.sarif
> sarif csv codeql.sarif
> cat codeql.csv
Tool,Severity,Code,Description,Location,Line
CodeQL,note,py/get-remote-flow-source,Tainted data source,app.py,1

This seems correct: something is flagged. Let’s make it more visual by running the query in VSCode. For that, I need to install the CodeQL extension. To run queries within VSCode, I first need to specify the database to use. It is the codeql-db folder that we created with codeql database create above:

Selecting the CodeQL database in VSCode

Now I run the query by right-clicking in its opened file:

Running the debug query in VSCode

Doing so opens the CodeQL results view:

Result of running the debug query

I see that the import of request is flagged as a potential data source. This is correct: in my program, tainted data can come through usages of this package.

Writing a query to model sarge.run: modeling the sink

This is where things get more interesting. As per the ConfigSig interface above, I need to implement isSink(Node sink), so that it captures calls to sarge.run. Because CodeQL is a declarative[2] object-oriented language, this means isSink must hold for the instances of Node subclasses that represent calls to sarge.run. Let me describe a methodology to discover how to do that. First, I modify the Scratch.ql query to list all instances of Node in my application:

import python
import semmle.python.dataflow.new.DataFlow

from DataFlow::Node src
select src, "DataFlow::Node"

Executing this query in VSCode yields the following results:

Result of querying for all nodes

Wow, that’s a lot of results! In a real codebase with multiple files, this would be unmanageable. Fortunately code completion works in CodeQL, so I can filter the results using the where clause, discovering the methods to call by looking at completions on the . symbol. Since the call to sarge.run I am looking for is at line 17, I can refine the query as follows:

from DataFlow::Node src, Location loc
where src.getLocation() = loc
  and loc.getFile().getBaseName() = "app.py"
  and loc.getStartLine() = 17
select src, "DataFlow::Node"

With these constraints, the query returns only a handful of results:

Results of querying some nodes

Still, there are 4 hits on line 17. Let’s see how I can disambiguate those. For this, CodeQL provides the getAQlClass predicate that returns the most specific type a variable has (as explained in CodeQL zero to hero part 3):

from DataFlow::Node src, Location loc
where src.getLocation() = loc
  and loc.getFile().getBaseName() = "app.py"
  and loc.getStartLine() = 17
select src, src.getAQlClass(), "DataFlow::Node"

See how the select clause now includes src.getAQlClass() as its second element. This makes the CodeQL Query Results view show it in the central column:

Results of getAQlClass

There are many more results, and that is because entries that were indistinguishable before are now disambiguated by their class. If in doubt, one can consult the list of classes of CodeQL’s standard Python library to understand what each class is about. In our case, I had read the official documentation on using CodeQL for Python, and I recognize the CallNode class from this list.
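If it is unclear why a single source line yields several hits with different classes, Python's own `ast` module offers a rough analogy. The sketch below lists the class of every syntax node that starts on a given line. It is only an analogy: CodeQL's data-flow nodes and classes are not Python's `ast` classes, and the helper name `nodes_on_line` and the small `SOURCE` program are hypothetical.

```python
import ast

# A tiny stand-in program; the call of interest is on line 4.
SOURCE = '''\
import sarge

def handler(received):
    sarge.run(received)
'''


def nodes_on_line(source: str, line: int) -> list[str]:
    """Return the class name of every AST node that starts on `line`."""
    tree = ast.parse(source)
    return [
        type(node).__name__
        for node in ast.walk(tree)
        # Not every node carries a position (e.g. Module), hence getattr.
        if getattr(node, "lineno", None) == line
    ]


if __name__ == "__main__":
    for class_name in nodes_on_line(SOURCE, 4):
        print(class_name)
```

One statement gives several nodes (the expression statement, the call, the attribute access, and two names), and the class is what tells them apart. This is the same role that getAQlClass plays in the query above.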

As the documentation explains, there is actually an API to retrieve the CallNode instances corresponding to calls of functions imported from an external module: API::moduleImport. Let’s use it to restrict our Nodes to instances of CallNode (using a cast) that are calls to sarge.run:

import python
import semmle.python.dataflow.new.DataFlow
import semmle.python.ApiGraphs

from DataFlow::Node src
where src.(API::CallNode) = API::moduleImport("sarge").getMember("run").getACall()
select src, "CallNode calling sarge.run"

Executing this query yields the only result we want:

Result of final debug query

Putting this all together, I can finalize the implementation of ConfigSig as shown below. The getArg(0) suffix models that the tainted data flows into sarge.run’s first argument:

private module SargeConfig implements DataFlow::ConfigSig {
  predicate isSource(DataFlow::Node source) {
    source instanceof ActiveThreatModelSource
  }

  predicate isSink(DataFlow::Node sink) {
    sink = API::moduleImport("sarge").getMember("run").getACall().getArg(0)
  }
}
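To see what the isSink predicate selects, here is a purely syntactic Python analogy built on the standard `ast` module: it finds calls written literally as `sarge.run(...)` and returns their first argument. This is a sketch under loud assumptions, not how CodeQL works. The helper name `sarge_run_first_args` is hypothetical, and unlike the API graph query, it matches on spelling only, so it would miss `from sarge import run` or `import sarge as s`.

```python
import ast


def sarge_run_first_args(source: str) -> list[tuple[int, str]]:
    """Return (line, source text) of the first argument of each sarge.run(...) call."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "run"
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == "sarge"
            and node.args  # analogous to getArg(0): needs a first positional argument
        ):
            found.append((node.lineno, ast.unparse(node.args[0])))
    return found


if __name__ == "__main__":
    example = (
        "import sarge\n"
        "\n"
        "def handler(received):\n"
        "    sarge.run(received)\n"
    )
    print(sarge_run_first_args(example))
```

The sink is the argument, not the call itself: the question the taint tracking query asks is whether tainted data can reach that argument.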

Following the official template for queries tracking tainted data, I write the query as follows:

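module SargeFlow = TaintTracking::Global<SargeConfig>;

// Path queries also need the path graph of the flow module, as in the official template.
import SargeFlow::PathGraph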

from SargeFlow::PathNode source, SargeFlow::PathNode sink
where SargeFlow::flowPath(source, sink)
select sink.getNode(), source, sink, "Tainted data passed to sarge"

Executing this query in VSCode returns the paths (list of steps) along which the vulnerability takes place:

Result of the final query

Conclusion

I have demonstrated how to use CodeQL to model a Python library, covering the setup and the steps a developer must take to write their first CodeQL query. I gave a methodology for writing instances of CodeQL interfaces, even without intimate knowledge of CodeQL’s APIs. I believe this is important, as the CodeQL ecosystem is small and resources are limited: users of CodeQL often have to find out what to write on their own, with limited support from both the tooling and generative AI tools (probably because there is little CodeQL material for them to learn from).

To dive deeper, I recommend reading the official CodeQL for Python resource and joining the GitHub Security Lab Slack to get support from CodeQL users and developers. And remember that this tutorial’s material is available at tweag/sarge-codeql-minimal if you want to experiment with it yourself!


  1. The default threat model can be overridden by command line flags and by configuration files.
  2. CodeQL belongs to the Datalog family of languages.

Behind the scenes

Clément Hurlin

Clément is a Director of Engineering, leading the Build Systems department. He studied Computer Science at Telecom Nancy and received his PhD from Université Nice Sophia Antipolis, where he proved multithreaded programs using linear logic. His technical background includes functional programming, compilers, provers, distributed systems, and build systems.

This article is licensed under a Creative Commons Attribution 4.0 International license.
