Run GLM-4.7 Flash Locally: Step-by-Step Installation
What is GLM-4.7 Flash?
GLM-4.7 Flash is a newly released open-weight large language model from Z.AI that’s gained attention for running locally while delivering strong performance in coding, reasoning, and agent-based tasks. Built on a Mixture of Experts (MoE) architecture with 30 billion total parameters but only 3 billion active per token, it’s designed for speed and efficiency. Unlike many models that rely on paid APIs or cloud infrastructure, GLM-4.7 Flash runs entirely on local hardware using lightweight inference tools.
The model works well on consumer GPUs with as little as 16GB of VRAM when quantized properly, reaching 60 to 100 tokens per second on cards like the RTX 3090 or 4090. Typical uses include coding assistants for IDEs like VS Code or Cursor, internal chatbots that keep data private, automated testing and code review systems, and AI agents that work with local files or databases. The model handles tool calling and function execution natively, making it practical for workflows that need to interact with external APIs, databases, or system resources.
What hardware do you need to run GLM-4.7 Flash locally?
System requirements to run GLM-4.7 Flash locally
Running GLM-4.7 Flash requires adequate hardware to achieve practical speeds. Here’s what you need based on how you plan to use the model.
Minimum Setup (for testing and light use)
- GPU: NVIDIA GPU with at least 16GB of VRAM (an RTX 4060 Ti 16GB qualifies; an RTX 3090's 24GB exceeds it)
- RAM: 16GB
- Storage: 15GB free space
- Operating System: Windows, Linux, or macOS
- Additional: CUDA 11.8 or newer (for NVIDIA GPUs)
Recommended Setup (for regular use)
- GPU: NVIDIA RTX 4090 or similar with 24GB memory
- RAM: 32GB
- Storage: 25GB+ free space on SSD
- Operating System: Linux (Ubuntu 20.04+) or Windows with WSL2
- Additional: CUDA 12.1 or newer
A GPU accelerates inference dramatically. Expect 60-100 tokens per second with proper GPU support versus 5-10 tokens per second on CPU alone. The GPU memory determines which model version you can run. Smaller compressed versions need less memory but work well for most tasks.
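Before going further, it helps to confirm what your system actually sees. A quick check sketch (guarded so it degrades gracefully on machines where the NVIDIA driver isn't installed yet):

```shell
# Check that the NVIDIA driver is installed and list each GPU's total VRAM.
# If nvidia-smi is missing, the driver (and CUDA) still need to be installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,memory.total --format=csv
else
  echo "nvidia-smi not found -- install the NVIDIA driver first"
fi
```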
Once your system meets these requirements, you can begin the installation process.
Running GLM-4.7 Flash locally
Getting GLM-4.7 Flash running locally involves four main steps: installing an inference engine, downloading the model files, running the model, and testing it to verify everything works. Each step builds on the previous one, so let’s follow them in order:
Step 1: Installing the inference engine
An inference engine is the software that loads and runs the GLM-4.7 Flash model locally on your hardware. There are several options depending on your needs:
Installing llama.cpp with CUDA
llama.cpp is a lightweight inference framework that runs efficiently on NVIDIA GPUs when built with CUDA support. Start by cloning the repository and building it with GPU acceleration enabled:
git clone https://github.com/ggml-org/llama.cpp
This downloads the llama.cpp source code to your local machine. Next, navigate into the directory and create a build folder:
cd llama.cpp
mkdir build && cd build
Now configure the build with CUDA support enabled:
cmake .. -DGGML_CUDA=ON
The -DGGML_CUDA=ON flag tells the build system to compile with GPU acceleration. This is what allows llama.cpp to use your NVIDIA GPU for fast inference. Finally, compile the project:
cmake --build . --config Release -j
This step compiles the code with optimizations enabled. After building, verify that CUDA is working by checking for GPU detection:
./llama-cli --version
If CUDA is properly configured, you’ll see GPU information in the output. The compiled binary will be located in the build/bin directory and is ready to load GGUF model files.
Installing Ollama
Ollama simplifies local model deployment by handling downloads, configuration, and serving automatically. Install it based on your operating system:
Windows:
Download the installer from the official Ollama website and follow the setup wizard.
macOS:
brew install ollama
Linux:
curl -fsSL https://ollama.com/install.sh | sh
Once installed, Ollama manages model files and inference settings for you, making it ideal for users who want to skip manual configuration. However, you sacrifice some low-level control compared to llama.cpp.
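Whichever installer you used, a quick sanity check confirms the CLI is on your PATH (the fallback message just covers the case where installation hasn't completed):

```shell
# Confirm the ollama CLI is available after installation.
if command -v ollama >/dev/null 2>&1; then
  ollama --version
else
  echo "ollama not found -- install it first"
fi
```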
Installing vLLM or SGLang
vLLM and SGLang are Python-based inference frameworks designed for high-throughput scenarios and API-style deployments. These tools are best for users building production systems or serving models to multiple users simultaneously.
Installing vLLM:
pip install vllm --pre --index-url https://pypi.org/simple
pip install git+https://github.com/huggingface/transformers.git
Installing SGLang:
pip install sglang
pip install git+https://github.com/huggingface/transformers.git
Both frameworks require CUDA-enabled environments and Python 3.9 or newer. They offer features like batched inference, request queuing, and OpenAI-compatible API endpoints, making them suitable for applications that need to handle concurrent requests efficiently.
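A minimal pre-flight check for the Python 3.9+ requirement (it does not verify the CUDA environment, which you still need at runtime):

```python
# Check the interpreter version before installing vLLM or SGLang:
# both require Python 3.9 or newer.
import sys

if sys.version_info >= (3, 9):
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: OK for vLLM/SGLang")
else:
    print("Python too old: vLLM and SGLang need 3.9+")
```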
Step 2: Downloading the GLM-4.7 Flash model
GLM-4.7 Flash is distributed in GGUF (GPT-Generated Unified Format), a file format designed for efficient model storage and loading with llama.cpp and similar tools. Within GGUF releases, you’ll find different quantization levels that balance model size against quality:
- Q4_K_M (4-bit): Smallest size, around 10GB, suitable for 16GB VRAM GPUs
- Q5_K_M (5-bit): Medium size, around 12GB, better quality with minimal size increase
- Q8_0 (8-bit): Larger size, around 18GB, near-original quality for 24GB+ VRAM
For most users, the 4-bit quantization offers the best balance.
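As a rough back-of-the-envelope check, the model file plus working memory must fit in VRAM. The file sizes below come from the list above; the ~2GB cushion for the KV cache and runtime buffers is an assumption, not a measured figure:

```python
# Rough fit check: GGUF file size plus an assumed ~2GB cushion for the
# KV cache and runtime buffers must fit within available VRAM.
QUANT_SIZES_GB = {"Q4_K_M": 10, "Q5_K_M": 12, "Q8_0": 18}

def fits_in_vram(quant: str, vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """Return True if the quantized model plausibly fits in vram_gb."""
    return QUANT_SIZES_GB[quant] + overhead_gb <= vram_gb

for quant in QUANT_SIZES_GB:
    print(f"{quant}: fits in 16GB VRAM -> {fits_in_vram(quant, 16)}")
```

By this estimate, Q4_K_M and Q5_K_M fit on a 16GB card, while Q8_0 needs the 24GB class.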
Downloading from Hugging Face
The easiest way to download GLM-4.7 Flash is from the Hugging Face repository. You can use the Hugging Face CLI or download directly through your browser.
Using Hugging Face CLI:
First, install the Hugging Face hub library if you haven’t already:
pip install huggingface_hub
Then download the model files:
huggingface-cli download unsloth/GLM-4.7-Flash-GGUF --include "*Q4_K_M*" --local-dir ./models
This command downloads the 4-bit quantized version to a models folder in your current directory. The --include flag filters for specific quantization levels.
Manual download
Alternatively, visit the Hugging Face repository in your browser, navigate to the Files tab, and download the GGUF file that matches your hardware capabilities.

Store your model files in an organized directory structure. A common practice is to keep all models in a dedicated folder like ~/models/glm-4.7-flash/ for easy access during inference.
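A minimal sketch of that layout (the exact path is a convention, not a requirement):

```shell
# One dedicated folder per model; move downloaded GGUF files here.
MODEL_DIR="$HOME/models/glm-4.7-flash"
mkdir -p "$MODEL_DIR"
echo "Store GGUF files in: $MODEL_DIR"
```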
Step 3: Running GLM-4.7 Flash locally
With the inference engine installed and model files downloaded, you’re ready to start GLM-4.7 Flash. The process varies slightly depending on which inference engine you chose earlier.
Running with llama.cpp
Navigate to your llama.cpp build directory and start the server with the downloaded model:
./llama-server \
  --model /path/to/models/GLM-4.7-Flash-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 4096 \
  --n-gpu-layers 99 \
  --threads 8
Let’s break down what each parameter does:
- --model: Path to your downloaded GGUF file
- --host 0.0.0.0: Makes the server accessible on your local network
- --port 8080: The port where the server will listen for requests
- --ctx-size 4096: Context window size (adjust based on your VRAM)
- --n-gpu-layers 99: Number of model layers to offload to GPU (99 means all layers)
- --threads 8: CPU threads to use for processing
If the model starts successfully, you’ll see output indicating the server is running and listening on http://localhost:8080. The initial load takes 30-60 seconds as the model loads into GPU memory.
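Once the server reports it is listening, a quick liveness probe confirms the model has loaded. llama.cpp's server exposes a /health endpoint; the fallback message here just covers the case where the server isn't up yet:

```shell
# Probe the llama.cpp server's /health endpoint; it returns a small JSON
# status object once the model has finished loading.
curl -s http://localhost:8080/health || echo "server not reachable yet"
```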
Running with Ollama
Ollama simplifies the process significantly. First, create a model configuration:
ollama pull hf.co/unsloth/GLM-4.7-Flash-GGUF:Q4_K_M
This might take some time, depending on your internet speed.
Then run the model:
ollama run hf.co/unsloth/GLM-4.7-Flash-GGUF:Q4_K_M
Ollama automatically handles server startup and provides a chat interface in your terminal. You can also run it as a background service:
ollama serve
This starts the Ollama server at http://localhost:11434, making the model available via API calls.
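With the background service running, you can exercise Ollama's native generate endpoint directly. The model name must match what `ollama pull` registered; the fallback echo covers the case where the server isn't running:

```shell
# One-shot, non-streaming generation request against Ollama's native API.
curl -s http://localhost:11434/api/generate -d '{
  "model": "hf.co/unsloth/GLM-4.7-Flash-GGUF:Q4_K_M",
  "prompt": "Say hello in one short sentence.",
  "stream": false
}' || echo "ollama server not reachable"
```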
Running with vLLM
For vLLM, use the following command to start an OpenAI-compatible API server:
vllm serve unsloth/GLM-4.7-Flash-GGUF \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --dtype bfloat16 \
  --port 8000
The server starts at http://localhost:8000 and provides endpoints compatible with OpenAI’s API format, making it easy to integrate with existing tools and libraries.
Running with SGLang
SGLang offers similar functionality with additional optimization features:
python -m sglang.launch_server \
  --model-path unsloth/GLM-4.7-Flash-GGUF \
  --host 0.0.0.0 \
  --port 8000 \
  --mem-fraction-static 0.8
The --mem-fraction-static parameter controls how much GPU memory is reserved for the model, helping prevent out-of-memory errors during inference.
Once your chosen inference engine starts, you should see log messages indicating successful model loading and server startup. For llama.cpp and vLLM, you can verify the server by visiting http://localhost:8080 or http://localhost:8000 in your browser. Most inference engines provide a simple web interface for testing.
Step 4: Testing the GLM-4.7 Flash model
After starting the inference server, let’s verify that GLM-4.7 Flash is working and responding as expected. You can test the model through several methods depending on your workflow.
Testing via browser interface
Most inference engines provide a built-in web interface for quick testing. Open your browser and navigate to the appropriate URL:
llama.cpp: http://localhost:8080
vLLM/SGLang: http://localhost:8000
Ollama: Use the terminal interface directly or access via http://localhost:11434
You’ll see a chat interface where you can type prompts and receive responses.

Try a basic prompt first:
Write a Python function that calculates the factorial of a number.
A successful response should generate working code within a few seconds. Pay attention to the response time and quality; you should see tokens appearing smoothly without long pauses.
Testing via command line (curl)
For quick API testing without a browser, use curl to send HTTP requests:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-4.7-flash",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "temperature": 0.7
  }'
This sends a simple question to the model and returns a JSON response. You should receive a properly formatted answer within seconds.
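The reply text sits a few levels deep in that JSON. A minimal parsing sketch, using a hardcoded sample in the same shape (real responses carry extra fields, such as usage statistics, that vary by engine):

```python
# Extract the assistant's reply from a chat-completions style response.
import json

sample_response = """
{"choices": [{"message": {"role": "assistant",
  "content": "The capital of France is Paris."}}]}
"""

reply = json.loads(sample_response)["choices"][0]["message"]["content"]
print(reply)  # The capital of France is Paris.
```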
Testing via Python script
For more comprehensive testing, use Python with the OpenAI library (works with most inference engines):
from openai import OpenAI

# Point to your local server
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="glm-4.7-flash",
    messages=[{"role": "user", "content": "Explain what a binary search tree is."}],
    temperature=0.7,
    max_tokens=500,
)

print(response.choices[0].message.content)
Beyond checking that the model responds, evaluate these aspects:
Response speed: On a properly configured GPU setup, you should see 60-100 tokens per second. Slower speeds may indicate CPU fallback or insufficient VRAM.
Output quality: Test with coding tasks, reasoning questions, and creative writing to assess whether the model maintains quality across different domains.
Consistency: Run the same prompt multiple times with different temperature settings to ensure stable behavior.
Memory usage: Monitor GPU memory with nvidia-smi to confirm the model is loaded on the GPU and not swapping to system RAM.
If responses are slow, incoherent, or the server crashes, revisit your system requirements and ensure CUDA is properly configured. Most issues stem from insufficient VRAM or incorrect quantization selection for your hardware.
Even with careful setup, you might encounter issues when running GLM-4.7 Flash locally. Let’s look at the common problems and how to fix them.
Troubleshooting common issues when running GLM-4.7 Flash locally
Most issues relate to GPU configuration, memory constraints, or software compatibility. This section covers the most common problems and their solutions to help you get back on track quickly.
GPU not being detected
If your model runs on CPU instead of GPU or you see CUDA-related errors, the issue is usually with your CUDA installation or configuration.
Solutions:
- Verify CUDA installation with the nvidia-smi command; you should see your GPU listed
- Rebuild llama.cpp with the -DGGML_CUDA=ON flag if needed
- Update NVIDIA drivers to the latest version
- Ensure CUDA is in your system PATH
Slow inference speeds
When token generation falls below 20 tokens per second on a GPU, something is limiting performance.
Solutions:
- Confirm GPU usage with nvidia-smi while running; utilization should be above 70%
- Reduce context size to 2048 or lower with --ctx-size 2048
- Set --n-gpu-layers 99 to offload all layers to GPU
- Switch to Q4 quantization if running out of VRAM
- Close other GPU-intensive applications
Startup crashes or memory errors
If the server crashes during model loading or shows out-of-memory errors, your hardware is struggling with the model size.
Solutions:
- Check available VRAM with nvidia-smi; you need roughly 12GB free for Q4 models
- Use more aggressive quantization like Q4_K_M instead of Q5 or Q8
- Re-download model files if corruption is suspected
- Kill other processes consuming GPU memory
- Reduce --n-gpu-layers to offload fewer layers initially
Context length or quantization mismatches
Errors about incompatible context sizes or garbled output usually point to configuration problems.
Solutions:
- Start with --ctx-size 4096 or lower
- Verify you’re loading the correct GGUF file for your quantization level
- Use the --jinja flag with llama.cpp if required
- Try a different quantization variant if issues persist
Most problems stem from mismatched configurations or hardware limitations. With these troubleshooting tips in hand, you’re equipped to identify the root cause and keep GLM-4.7 Flash running smoothly on your local hardware.
Conclusion
Running GLM-4.7 Flash locally gives you a powerful AI coding assistant with zero API costs, complete privacy, and 60-100 tokens per second on consumer GPUs. The setup requires a compatible GPU with at least 16GB of VRAM, an inference engine like llama.cpp or Ollama, and about 30-60 minutes for installation. Once GLM-4.7 Flash is running locally, you can build coding agents, automated testing systems, private chatbots, and agentic workflows entirely on your own hardware.
Key advantages of running GLM-4.7 Flash locally include the 200K context window for handling large codebases, native tool calling support for agent workflows, and offline capability for complete data privacy. Performance ranges from 60-220 tokens per second depending on your GPU and quantization choice.
Ready to dive deeper into language models? Check out Codecademy’s Intro to Language Models in Python course to learn the fundamentals of how these models work and how to use them effectively in your projects.
Frequently asked questions
1. How long is the context window for GLM-4.7 Flash?
GLM-4.7 Flash supports up to 200,000 tokens. For local deployment on consumer hardware, start with 4,096 or 8,192 tokens to manage memory usage effectively.
2. How is the “Flash” version different from the standard GLM-4.7?
Flash is optimized for speed and local deployment. It activates only 3 billion parameters per token from its 30 billion total, making it faster and more resource-efficient for consumer hardware.
3. Is GLM-4.7-Flash the same as GLM-4.7?
No, GLM-4.7 Flash is a variant optimized for fast inference and lower hardware requirements. It provides the best balance of speed and capability for local deployment.
4. How does GLM-4.7 compare to Claude Sonnet?
GLM-4.7 Flash is open-weight and runs locally with no ongoing costs, unlike Claude Sonnet which requires API access and per-token fees. It excels at coding, tool calling, and offers full privacy and offline capability.
5. Why is GLM-4.7 called “Flash”?
The name reflects its speed, delivering 60-100 tokens per second on consumer GPUs. It’s designed for real-time applications like coding assistants without requiring enterprise infrastructure.
The Codecademy Team, composed of experienced educators and tech experts, is dedicated to making tech skills accessible to all. We empower learners worldwide with expert-reviewed content that develops and enhances the technical skills needed to advance and succeed in their careers.