
Run GLM-4.7 Flash Locally: Step-by-Step Installation

What is GLM-4.7 Flash?

GLM-4.7 Flash is a newly released open-weight large language model from Z.AI that’s gained attention for running locally while delivering strong performance in coding, reasoning, and agent-based tasks. Built on a Mixture of Experts (MoE) architecture with 30 billion total parameters but only 3 billion active per token, it’s designed for speed and efficiency. Unlike many models that rely on paid APIs or cloud infrastructure, GLM-4.7 Flash runs entirely on local hardware using lightweight inference tools.

The model works well on consumer GPUs with as little as 16GB of VRAM when quantized properly, reaching 60 to 100 tokens per second on cards like the RTX 3090 or 4090. Typical uses include coding assistants for IDEs like VS Code or Cursor, internal chatbots that keep data private, automated testing and code review systems, and AI agents that work with local files or databases. The model handles tool calling and function execution natively, making it practical for workflows that need to interact with external APIs, databases, or system resources.
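Because the model handles tool calling natively, a request through an OpenAI-compatible local server (like the ones set up later in this guide) can declare callable functions in the request body. The sketch below shows the shape of such a payload; the `get_weather` function, its schema, and the endpoint URL are illustrative assumptions, not part of the model's own API:

```python
# Sketch of an OpenAI-style tool-calling request payload.
# The function name, schema, and endpoint are illustrative assumptions.
payload = {
    "model": "glm-4.7-flash",
    "messages": [
        {"role": "user", "content": "What's the weather in Paris right now?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string", "description": "City name"}
                    },
                    "required": ["city"],
                },
            },
        }
    ],
}

if __name__ == "__main__":
    # Requires a running local server (see Step 3); guarded so the
    # payload above can be inspected without one.
    import json, urllib.request

    req = urllib.request.Request(
        "http://localhost:8080/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    print(urllib.request.urlopen(req).read().decode())
```

When the model decides a tool is needed, the response contains a tool-call object rather than plain text, which your agent code executes before sending the result back in a follow-up message.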

What hardware do you need to run GLM-4.7 Flash locally?

System requirements to run GLM-4.7 Flash locally

Running GLM-4.7 Flash requires adequate hardware to achieve practical speeds. Here’s what you need based on how you plan to use the model.

Minimum Setup (for testing and light use)

  • GPU: NVIDIA RTX 3090 or similar with 16GB memory
  • RAM: 16GB
  • Storage: 15GB free space
  • Operating System: Windows, Linux, or macOS
  • Additional: CUDA 11.8 or newer (for NVIDIA GPUs)

Recommended Setup (for regular use)

  • GPU: NVIDIA RTX 4090 or similar with 24GB memory
  • RAM: 32GB
  • Storage: 25GB+ free space on SSD
  • Operating System: Linux (Ubuntu 20.04+) or Windows with WSL2
  • Additional: CUDA 12.1 or newer

A GPU accelerates inference dramatically. Expect 60-100 tokens per second with proper GPU support versus 5-10 tokens per second on CPU alone. The GPU memory determines which model version you can run. Smaller compressed versions need less memory but work well for most tasks.
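As a rough back-of-envelope check on what those throughput figures mean in practice, here is the wall-clock time for a typical 500-token response at mid-range GPU and CPU speeds (the token counts and rates are just the estimates quoted above):

```python
# Rough generation-time estimates from the throughput figures above.
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate num_tokens at a given sustained throughput."""
    return num_tokens / tokens_per_second

response_tokens = 500
gpu_time = generation_seconds(response_tokens, 80)   # mid-range GPU estimate
cpu_time = generation_seconds(response_tokens, 7.5)  # mid-range CPU estimate

print(f"GPU (~80 tok/s): {gpu_time:.1f}s")   # ~6.2s
print(f"CPU (~7.5 tok/s): {cpu_time:.1f}s")  # ~66.7s
```

The order-of-magnitude gap is why a response that feels instant on a GPU takes over a minute on CPU alone.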

Once your system meets these requirements, you can begin the installation process.

Running GLM-4.7 Flash locally

Getting GLM-4.7 Flash running locally involves four main steps: installing an inference engine, downloading the model files, running the model, and testing it to verify everything works. Each step builds on the previous one, so let’s follow them in order:

Step 1: Installing the inference engine

An inference engine is the software that loads and runs the GLM-4.7 Flash model on your local hardware. There are several options depending on your needs:

Installing llama.cpp with CUDA

llama.cpp is a lightweight inference framework that runs efficiently on NVIDIA GPUs when built with CUDA support. Start by cloning the repository and building it with GPU acceleration enabled:

git clone https://github.com/ggml-org/llama.cpp

This downloads the llama.cpp source code to your local machine. Next, navigate into the directory and create a build folder:

cd llama.cpp
mkdir build && cd build

Now configure the build with CUDA support enabled:

cmake .. -DGGML_CUDA=ON

The -DGGML_CUDA=ON flag tells the build system to compile with GPU acceleration. This is what allows llama.cpp to use your NVIDIA GPU for fast inference. Finally, compile the project:

cmake --build . --config Release

This step compiles the code with optimizations enabled. After building, verify that CUDA is working by checking for GPU detection:

./llama-cli --version

If CUDA is properly configured, you’ll see GPU information in the output. The compiled binary will be located in the build/bin directory and is ready to load GGUF model files.

Installing Ollama

Ollama simplifies local model deployment by handling downloads, configuration, and serving automatically. Install it based on your operating system:

Windows:

Download the installer from the official Ollama website and follow the setup wizard.

macOS:

brew install ollama

Linux:

curl -fsSL https://ollama.com/install.sh | sh

Once installed, Ollama manages model files and inference settings for you, making it ideal for users who want to skip manual configuration. The trade-off is less low-level control than llama.cpp offers.

Installing vLLM or SGLang

vLLM and SGLang are Python-based inference frameworks designed for high-throughput scenarios and API-style deployments. These tools are best for users building production systems or serving models to multiple users simultaneously.

Installing vLLM:

pip install vllm --pre --index-url https://pypi.org/simple
pip install git+https://github.com/huggingface/transformers.git

Installing SGLang:

pip install sglang
pip install git+https://github.com/huggingface/transformers.git

Both frameworks require CUDA-enabled environments and Python 3.9 or newer. They offer features like batched inference, request queuing, and OpenAI-compatible API endpoints, making them suitable for applications that need to handle concurrent requests efficiently.

Step 2: Downloading the GLM-4.7 Flash model

GLM-4.7 Flash is distributed in GGUF (GPT-Generated Unified Format), a file format designed for efficient model storage and fast loading with llama.cpp and similar tools. GGUF files come in different quantization levels that balance model size against quality:

  • Q4_K_M (4-bit): Smallest size, around 10GB, suitable for 16GB VRAM GPUs
  • Q5_K_M (5-bit): Medium size, around 12GB, better quality with minimal size increase
  • Q8_0 (8-bit): Larger size, around 18GB, near-original quality for 24GB+ VRAM

For most users, the 4-bit quantization offers the best balance.
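The size table above can be encoded as a small helper that suggests a quantization level from your available VRAM. Note that the thresholds below are rough guidelines derived from the file sizes listed, not hard limits, and you should leave headroom for the context cache:

```python
# Pick a GGUF quantization from available VRAM, using the rough size
# guidelines above (thresholds are approximate, not hard limits).
def suggest_quantization(vram_gb: float) -> str:
    if vram_gb >= 24:
        return "Q8_0"    # ~18GB file, near-original quality
    if vram_gb >= 20:
        return "Q5_K_M"  # ~12GB file, slightly better than Q4
    return "Q4_K_M"      # ~10GB file, fits 16GB cards

print(suggest_quantization(16))  # Q4_K_M
print(suggest_quantization(24))  # Q8_0
```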

Downloading from Hugging Face

The easiest way to download GLM-4.7 Flash is from the Hugging Face repository. You can use the Hugging Face CLI or download directly through your browser.

Using Hugging Face CLI:

First, install the Hugging Face hub library if you haven’t already:

pip install huggingface_hub

Then download the model files:

huggingface-cli download unsloth/GLM-4.7-Flash-GGUF --include "*Q4_K_M*" --local-dir ./models

This command downloads the 4-bit quantized version to a models folder in your current directory. The --include flag takes a glob pattern that filters for a specific quantization level.
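The same download can also be scripted with the huggingface_hub library instead of the CLI. A sketch, where the repo ID and glob pattern simply mirror the command above:

```python
# Scripted equivalent of the CLI download above, using huggingface_hub.
# Kept as plain arguments so they are easy to inspect and adjust.
download_args = {
    "repo_id": "unsloth/GLM-4.7-Flash-GGUF",
    "allow_patterns": ["*Q4_K_M*"],  # glob matching the 4-bit files
    "local_dir": "./models",
}

if __name__ == "__main__":
    # Requires `pip install huggingface_hub`; guarded so the arguments
    # above can be inspected without triggering a multi-gigabyte download.
    from huggingface_hub import snapshot_download
    snapshot_download(**download_args)
```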

Manual download

Alternatively, visit the Hugging Face repository in your browser, navigate to the Files tab, and download the GGUF file that matches your hardware capabilities.

Screenshot: the Hugging Face repository page for GLM-4.7-Flash-GGUF, showing the Files and versions section with the downloadable GGUF files.

Store your model files in an organized directory structure. A common practice is to keep all models in a dedicated folder like ~/models/glm-4.7-flash/ for easy access during inference.

Step 3: Running GLM-4.7 Flash locally

With the inference engine installed and model files downloaded, you’re ready to start GLM-4.7 Flash. The process varies slightly depending on which inference engine you chose earlier.

Running with llama.cpp

Navigate to your llama.cpp build directory and start the server with the downloaded model:

./llama-server \
--model /path/to/models/GLM-4.7-Flash-Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
--ctx-size 4096 \
--n-gpu-layers 99 \
--threads 8

Let’s break down what each parameter does:

  • --model: Path to your downloaded GGUF file
  • --host 0.0.0.0: Makes the server accessible on your local network
  • --port 8080: The port where the server will listen for requests
  • --ctx-size 4096: Context window size (adjust based on your VRAM)
  • --n-gpu-layers 99: Number of model layers to offload to GPU (99 means all layers)
  • --threads 8: CPU threads to use for processing

If the model starts successfully, you’ll see output indicating the server is running and listening on http://localhost:8080. The initial load takes 30-60 seconds as the model loads into GPU memory.

Running with Ollama

Ollama simplifies the process significantly. First, create a model configuration:

ollama pull hf.co/unsloth/GLM-4.7-Flash-GGUF:Q4_K_M

This might take some time, depending on your internet connection speed.

Screenshot: terminal output of the ollama pull command downloading the GLM-4.7 Flash model.

Then run the model:

ollama run hf.co/unsloth/GLM-4.7-Flash-GGUF:Q4_K_M

Ollama automatically handles server startup and provides a chat interface in your terminal. You can also run it as a background service:

ollama serve

This starts the Ollama server at http://localhost:11434, making the model available via API calls.
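With the server running, the model is also reachable programmatically over Ollama's REST API. A minimal sketch using only the Python standard library; the prompt is arbitrary, and setting "stream" to false asks for a single JSON object instead of a token stream:

```python
# Minimal request against Ollama's REST API (POST /api/generate).
import json
import urllib.request

payload = {
    "model": "hf.co/unsloth/GLM-4.7-Flash-GGUF:Q4_K_M",
    "prompt": "Write a one-line Python hello world.",
    "stream": False,  # return one JSON object instead of a stream
}

if __name__ == "__main__":
    # Requires `ollama serve` running on localhost:11434.
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])
```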

Running with vLLM

For vLLM, use the following command to start an OpenAI-compatible API server:

vllm serve unsloth/GLM-4.7-Flash-GGUF \
--trust-remote-code \
--tensor-parallel-size 1 \
--dtype bfloat16 \
--port 8000

The server starts at http://localhost:8000 and provides endpoints compatible with OpenAI’s API format, making it easy to integrate with existing tools and libraries.

Running with SGLang

SGLang offers similar functionality with additional optimization features:

python -m sglang.launch_server \
--model-path unsloth/GLM-4.7-Flash-GGUF \
--host 0.0.0.0 \
--port 8000 \
--mem-fraction-static 0.8

The --mem-fraction-static parameter controls how much GPU memory is reserved for the model, helping prevent out-of-memory errors during inference.

Once your chosen inference engine starts, you should see log messages indicating successful model loading and server startup. For llama.cpp and vLLM, you can verify the server by visiting http://localhost:8080 or http://localhost:8000 in your browser. Most inference engines provide a simple web interface for testing.

Step 4: Testing the GLM-4.7 Flash model

After starting the inference server, let’s verify that GLM-4.7 Flash is working and responding as expected. You can test the model through several methods depending on your workflow.

Testing via browser interface

Most inference engines provide a built-in web interface for quick testing. Open your browser and navigate to the URL your server logged at startup, for example http://localhost:8080 for llama.cpp or http://localhost:8000 for vLLM and SGLang.

You’ll see a chat interface where you can type prompts and receive responses.

Screenshot: the llama.cpp web interface with GLM-4.7 Flash loaded and ready for prompts.

Try a basic prompt first:

Write a Python function that calculates the factorial of a number.

A successful response should produce working code within a few seconds. Pay attention to response time and quality; you should see tokens appearing smoothly without long pauses.

Testing via command line (curl)

For quick API testing without a browser, use curl to send HTTP requests:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "glm-4.7-flash",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
],
"temperature": 0.7
}'

This sends a simple question to the model and returns a JSON response. You should receive a properly formatted answer within seconds.

Testing via Python script

For more comprehensive testing, use Python with the OpenAI library (works with most inference engines):

from openai import OpenAI

# Point to your local server
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="glm-4.7-flash",
    messages=[
        {"role": "user", "content": "Explain what a binary search tree is."}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)

Beyond checking that the model responds, evaluate these aspects:

Response speed: On a properly configured GPU setup, you should see 60-100 tokens per second. Slower speeds may indicate CPU fallback or insufficient VRAM.

Output quality: Test with coding tasks, reasoning questions, and creative writing to assess whether the model maintains quality across different domains.

Consistency: Run the same prompt multiple times with different temperature settings to ensure stable behavior.

Memory usage: Monitor GPU memory with nvidia-smi to confirm the model is loaded on the GPU and not swapping to system RAM.
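One way to script that memory check is to parse nvidia-smi's CSV output. A sketch; the query flags are standard nvidia-smi options, and the test string in the comments stands in for real output:

```python
# Parse the output of:
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
# into (used_mb, total_mb) pairs, one per GPU.
import subprocess

def parse_memory_csv(csv_text: str) -> list[tuple[int, int]]:
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows

if __name__ == "__main__":
    # Requires an NVIDIA GPU with drivers installed.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"], text=True)
    for i, (used, total) in enumerate(parse_memory_csv(out)):
        print(f"GPU {i}: {used}/{total} MiB used")
```

If the used figure is far below the model's file size, the weights have likely fallen back to system RAM and you should revisit your --n-gpu-layers setting.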

If responses are slow, incoherent, or the server crashes, revisit your system requirements and ensure CUDA is properly configured. Most issues stem from insufficient VRAM or incorrect quantization selection for your hardware.

Even with careful setup, you might encounter issues when running GLM-4.7 Flash locally. Let’s look at the most common problems and how to fix them.

Troubleshooting common issues when running GLM-4.7 Flash locally

Most issues relate to GPU configuration, memory constraints, or software compatibility. This section covers the most common problems and their solutions to help you get back on track quickly.

GPU not being detected

If your model runs on CPU instead of GPU or you see CUDA-related errors, the issue is usually with your CUDA installation or configuration.

Solutions:

  • Verify your CUDA installation with the nvidia-smi command; you should see your GPU listed

  • Rebuild llama.cpp with -DGGML_CUDA=ON flag if needed

  • Update NVIDIA drivers to the latest version

  • Ensure CUDA is in your system PATH

Slow inference speeds

When token generation falls below 20 tokens per second on a GPU, something is limiting performance.

Solutions:

  • Confirm GPU usage with nvidia-smi while the model is running; utilization should be above 70%

  • Reduce context size to 2048 or lower with --ctx-size 2048

  • Set --n-gpu-layers 99 to offload all layers to GPU

  • Switch to Q4 quantization if running out of VRAM

  • Close other GPU-intensive applications

Startup crashes or memory errors

If the server crashes during model loading or shows out-of-memory errors, your hardware is struggling with the model size.

Solutions:

  • Check available VRAM with nvidia-smi; a Q4 model needs roughly 12GB free (about 10GB of weights plus the context cache)

  • Use more aggressive quantization like Q4_K_M instead of Q5 or Q8

  • Re-download model files if corruption is suspected

  • Kill other processes consuming GPU memory

  • Reduce --n-gpu-layers to offload fewer layers initially

Context length or quantization mismatches

Errors about incompatible context sizes or garbled output usually point to configuration problems.

Solutions:

  • Start with --ctx-size 4096 or lower

  • Verify you’re loading the correct GGUF file for your quantization level

  • Use --jinja flag with llama.cpp if required

  • Try a different quantization variant if issues persist

Most problems stem from mismatched configurations or hardware limitations. Working through these issues will help you identify the root cause and get GLM-4.7 Flash running smoothly.

With these troubleshooting tips in hand, you’re equipped to resolve most issues and keep GLM-4.7 Flash running smoothly on your local hardware.

Conclusion

Running GLM-4.7 Flash locally gives you a powerful AI coding assistant with zero API costs, complete privacy, and 60-100 tokens per second on consumer GPUs. The setup requires a compatible GPU with at least 16GB VRAM, an inference engine like llama.cpp or Ollama, and about 30-60 minutes for installation. Once GLM-4.7 Flash is running locally, you can build coding agents, automated testing systems, private chatbots, and agentic workflows entirely on your own hardware.

Key advantages of running GLM-4.7 Flash locally include the 200K context window for handling large codebases, native tool calling support for agent workflows, and offline capability for complete data privacy. Performance ranges from 60 to 100 tokens per second depending on your GPU and quantization choice.

Ready to dive deeper into language models? Check out Codecademy’s Intro to Language Models in Python course to learn the fundamentals of how these models work and how to use them effectively in your projects.

Frequently asked questions

1. How long is the context window for GLM-4.7 Flash?

GLM-4.7 Flash supports up to 200,000 tokens. For local deployment on consumer hardware, start with 4,096 or 8,192 tokens to manage memory usage effectively.

2. How is the “Flash” version different from the standard GLM-4.7?

Flash is optimized for speed and local deployment. It activates only 3 billion parameters per token from its 30 billion total, making it faster and more resource-efficient for consumer hardware.

3. Is GLM-4.7-Flash the same as GLM-4.7?

No, GLM-4.7 Flash is a variant optimized for fast inference and lower hardware requirements. It provides the best balance of speed and capability for local deployment.

4. How does GLM-4.7 compare to Claude Sonnet?

GLM-4.7 Flash is open-weight and runs locally with no ongoing costs, unlike Claude Sonnet which requires API access and per-token fees. It excels at coding, tool calling, and offers full privacy and offline capability.

5. Why is GLM-4.7 called “Flash”?

The name reflects its speed, delivering 60-100 tokens per second on consumer GPUs. It’s designed for real-time applications like coding assistants without requiring enterprise infrastructure.

Codecademy Team

The Codecademy Team, composed of experienced educators and tech experts, is dedicated to making tech skills accessible to all. We empower learners worldwide with expert-reviewed content that develops and enhances the technical skills needed to advance and succeed in their careers.
