Setup and Fine-Tune Qwen 3 with Ollama
This guide shows you how to set up and fine-tune Qwen 3 models using Ollama.
What you’ll learn:
- Install Ollama and download different Qwen3 models.
- Manage thinking vs non-thinking modes in Qwen 3 models for different use cases.
- Integrate Qwen3 with Python using APIs and Ollama client.
- Create fine-tuned models for tasks like sentiment analysis.
Let’s start by understanding what Qwen 3 is.
What is Qwen 3?
Qwen3 is the latest generation of large language models (LLMs) by Alibaba. It includes eight main LLM variants and six specialized models for retrieval and ranking tasks. Qwen3 models are trained on a web-scale dataset with 36 trillion tokens in 119 languages, making them useful for multilingual tasks and global customer-facing applications.
All the Qwen3 models are open-sourced under the Apache 2.0 license, and we can download, fine-tune, and use them in our LLM-based applications. We can use tools like HuggingFace, ModelScope, Kaggle, and Ollama for this. In this article, we will use Ollama. So, let’s first install Ollama and set up different Qwen3 models.
Set up Qwen 3 with Ollama
To download, run, and fine-tune Qwen 3 models, we will first install Ollama.
Install Ollama
You can install Ollama on a Windows machine using the guide on how to install Ollama on Windows.
On macOS, you can install Ollama with Homebrew by executing the following command in the command-line terminal:
brew install ollama
On a Linux machine, we can install and set up Ollama using snap by executing the following command in the command-line terminal:
sudo snap install ollama
Alternatively, you can use the following command to install Ollama using the official install script:
curl -fsSL https://ollama.com/install.sh | sh
After installing Ollama, let’s download the different Qwen 3 models.
Download Qwen3 models with Ollama
Ollama provides the ollama pull command for downloading LLMs. Qwen 3 models range in size from 0.6 billion to 235 billion parameters. We can download the 0.6B-parameter Qwen3 model using the ollama pull command as follows:
ollama pull qwen3:0.6b
Similarly, you can download the Qwen3 1.7B model as follows:
ollama pull qwen3:1.7b
After downloading the models, you can have a look at all the installed models using the ollama list command.
As we have two Qwen3 models installed, let’s run these models in Ollama.
Run Qwen3 models using Ollama
We can run an LLM in Ollama using the ollama run command and the model name. For example, we can run the Qwen3 0.6B model using the ollama run command as follows:
ollama run qwen3:0.6b
After executing this command, we can chat with the Qwen3 model as shown in the following image:
We can also directly run models in Ollama without downloading the model. When we run a model that hasn’t already been downloaded, Ollama first downloads it and then runs it. For instance, we haven’t yet downloaded the Qwen3 4B model. Let’s run it directly using ollama run:
ollama run qwen3:4b
After executing this command, Ollama downloads and runs the Qwen3 4B model as shown in the following image:
By default, Qwen 3 models run in thinking mode, which shows the model reasoning in the output. Thinking mode is helpful for multi-step tasks that require reasoning capabilities. However, it also adds latency while generating the output. We can turn the thinking mode on or off while running the models. Let’s discuss how to do so.
Manage thinking and non-thinking modes in Qwen3 models
We can run Qwen3 in thinking or non-thinking mode while starting the model. We can also toggle between thinking and non-thinking modes while chatting with a Qwen3 model, allowing us to switch between the two modes without restarting the model. Let’s discuss both approaches to using Qwen3 models in thinking and non-thinking modes.
Manage thinking and non-thinking mode while starting the model
By default, all the Qwen3 models run in thinking mode. To explicitly run a Qwen 3 model in thinking mode, you can use the --think flag while executing the ollama run command.
ollama run qwen3:0.6b --think
After starting the model in thinking mode, we get the model’s reasoning in the output:
If you want to run the Qwen3 model in thinking mode without showing the reasoning output, you can use the --hidethinking flag while executing the ollama run command.
ollama run qwen3:4b --hidethinking
To run the Qwen3 model in non-thinking mode, you can set the --think parameter to false.
ollama run qwen3:0.6b --think=false
After we start a Qwen3 model by setting the --think parameter to false, it doesn’t show the model reasoning in the output.
We can also toggle between thinking and non-thinking mode during chat, i.e., after executing the ollama run command. Let’s discuss how to do that.
Manage thinking and non-thinking mode during chat
During chat, you can enter /set think to activate thinking mode, or /set nothink to switch to non-thinking mode.
We don’t always use LLMs in the command-line interface. We need to access them using APIs to integrate these models into real-world applications. Let’s discuss how to use Qwen3 models with Ollama in Python.
How to use Qwen 3 LLM with Ollama in Python?
The third-party requests module lets us make HTTP calls to any server, and we will use it to access the Ollama server. Ollama also provides an official Python client, ollama, which helps us integrate Ollama models into Python applications. You can install both modules using pip by executing the following command in the command-line terminal:
pip install ollama requests
Now, let’s discuss how to access Qwen3 models using the requests module in Python.
Access Qwen3 models in Ollama using the requests module
We can access Ollama models using the /api/generate API endpoint on port 11434 of the Ollama server.
- As we have installed Ollama on our local machine, we can access the Qwen3 model through the URL http://localhost:11434/api/generate using the requests.post() function. The post() function takes the URL as its first argument and a JSON object as the request body in its second argument.
- In the request body, we pass the model name and query using the "model" and "prompt" keys. We also set the "stream" key to False so that Ollama returns the entire output at once instead of streaming it.
- We convert the output of the post() method to JSON and get the model response for our query using the "response" key of the output.
For instance, we can get the answer to the query “How Codecademy helps students learn AI? Explain concisely.” from the qwen3:4b model through Ollama and the requests module as follows:
```python
import requests

# Define the query
query = "How Codecademy helps students learn AI? Explain concisely."

# Define the API endpoint and payload
url = "http://localhost:11434/api/generate"
payload = {
    "model": "qwen3:4b",
    "prompt": query,
    "stream": False
}

# Send the POST request
output = requests.post(url, json=payload).json()

# Retrieve the model response
response = output["response"]

print("The query is:")
print(query)
print("The response from Qwen3 model is:")
print(response)
```
Output:
```
The query is:
How Codecademy helps students learn AI? Explain concisely.
The response from Qwen3 model is:
<think>
Okay, the user is asking how Codecademy helps students learn AI. I need to explain this concisely. Let me start by recalling what Codecademy offers.

I should mention the structured curriculum. Maybe they start with fundamentals like data structures and algorithms, then move into more advanced topics like machine learning.

Wait, the user wants it concise. So I need to highlight key points without going into too much detail. Maybe start with the interactive lessons, then the practical application, then the resources and community. Also, mention that they cover the necessary programming and tools for AI, like Python libraries. Make sure to keep it brief but informative.
</think>
Codecademy helps students learn AI through interactive, project-based courses that teach programming fundamentals, Python, and machine learning concepts. It offers structured lessons with hands-on coding exercises, real-world examples, and a supportive community, enabling learners to build AI skills progressively and apply them to practical problems.
```
In the output, you can see that the response also contains model reasoning as it runs in thinking mode by default.
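Since the reasoning is wrapped in `<think>…</think>` tags, a small post-processing step can separate it from the final answer. The helper below is our own sketch, not part of the Ollama API:

```python
import re

def strip_thinking(text):
    # Remove a leading <think>...</think> block, keeping only the final answer.
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

raw = "<think>Recall what Codecademy offers...</think>Codecademy offers interactive courses."
print(strip_thinking(raw))  # → Codecademy offers interactive courses.
```

This is handy when you want thinking mode's accuracy benefits but only need the final answer in your application's output.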
To run Qwen3 in non-thinking mode while generating responses using the requests module, you can set the "think" key to False in the payload passed to the post() method.
```python
import requests

# Define the query
query = "How Codecademy helps students learn AI? Explain concisely."

# Define the API endpoint and payload
url = "http://localhost:11434/api/generate"
payload = {
    "model": "qwen3:4b",
    "prompt": query,
    "stream": False,
    "think": False
}

# Send the POST request
output = requests.post(url, json=payload).json()

# Retrieve the model response
response = output["response"]

print("The query is:")
print(query)
print("The response from Qwen3 model is:")
print(response)
```
Output:
```
The query is:
How Codecademy helps students learn AI? Explain concisely.
The response from Qwen3 model is:
Codecademy helps students learn AI by offering interactive, project-based courses that introduce fundamental concepts like machine learning, neural networks, and data analysis. It provides hands-on coding practice with Python, allowing students to build real-world AI projects. The platform simplifies complex topics with clear explanations and practical examples, making AI accessible to beginners while also offering advanced content for those looking to deepen their skills.
```
We haven’t received the model reasoning in this output because the "think" field is set to False in the payload given to the /api/generate endpoint.
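Ollama also exposes a /api/chat endpoint that accepts a list of messages, which is useful for multi-turn conversations. The sketch below only builds the request body; the helper name is our own, and we assume the "think" key behaves the same way it does for /api/generate:

```python
# Assemble the JSON body for Ollama's /api/chat endpoint.
def build_chat_payload(model, messages, think=False, stream=False):
    return {"model": model, "messages": messages, "think": think, "stream": stream}

payload = build_chat_payload(
    "qwen3:4b",
    [{"role": "user", "content": "How Codecademy helps students learn AI?"}],
)

# With a local Ollama server running, send the request like this:
# import requests
# reply = requests.post("http://localhost:11434/api/chat", json=payload).json()
# print(reply["message"]["content"])
print(payload["model"])  # → qwen3:4b
```

To continue the conversation, append the model's reply and your next question to the messages list before the next call.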
Instead of using the requests module, we can use the ollama module in Python to get responses from Qwen3 models through Ollama. Let’s discuss how to do that.
Access Qwen3 models using Ollama client
We can use the ollama module to get responses from Qwen3 models in Ollama. For this, we will use the following steps:
- First, we will create an Ollama client using the Client() class defined in the ollama module.
- Then, we will generate responses from Ollama models using the chat() method. When invoked on the Client object, chat() takes the model name in its model parameter and a list of dictionaries containing past messages and the new query in its messages parameter.
- After executing the chat() method, we get the model response from the message.content attribute of the output.
For instance, we can get the answer to the query “How Codecademy helps students learn AI? Explain concisely.” using the ollama module in Python and the qwen3:4b model hosted in Ollama as follows:
```python
from ollama import Client

# Create the Ollama client
client = Client()

# Define the query
query = "How Codecademy helps students learn AI? Explain concisely."

# Get the model response
output = client.chat(
    model="qwen3:4b",
    messages=[{"role": "user", "content": query}],
)

# Retrieve the model response
response = output["message"]["content"]

print("The query is:")
print(query)
print("The response from Qwen3 model is:")
print(response)
```
Output:
```
The query is:
How Codecademy helps students learn AI? Explain concisely.
The response from Qwen3 model is:
Codecademy helps students learn AI by offering interactive, project-based courses that introduce fundamental concepts like machine learning, neural networks, and data analysis. It provides hands-on coding practice with Python, allowing students to build real-world AI projects. The platform simplifies complex topics with clear explanations and practical examples, making AI accessible to beginners while also offering advanced content for those looking to deepen their skills.
```
By default, the chat() method runs the Qwen3 models in thinking mode, but doesn’t show the model reasoning in the output. However, running the models in thinking mode adds latency while generating responses. Hence, if you want to run the Qwen3 models in non-thinking mode while using the chat() method, you can set the think parameter to False, as shown in the following code:
```python
from ollama import Client

# Create the Ollama client
client = Client()

# Define the query
query = "How Codecademy helps students learn AI? Explain concisely."

# Get the model response in non-thinking mode
output = client.chat(
    model="qwen3:4b",
    messages=[{"role": "user", "content": query}],
    think=False,
)

# Retrieve the model response
response = output["message"]["content"]

print("The response from Qwen3 model is:")
print(response)
```
Output:
```
The response from Qwen3 model is:
Codecademy helps students learn AI by offering interactive, project-based courses that introduce fundamental concepts like machine learning, neural networks, and data analysis. It provides hands-on coding practice with Python, allowing students to build real-world AI projects. The platform simplifies complex topics with clear explanations and practical examples, making AI accessible to beginners while also offering advanced content for those looking to deepen their skills.
```
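The ollama client can also stream tokens as they are generated by passing stream=True to chat(), in which case it yields partial chunks instead of one final message. The joining helper below is our own sketch; only the commented-out portion touches the Ollama server:

```python
# Helper that concatenates streamed chat chunks into one response string.
def join_stream(chunks):
    return "".join(chunk["message"]["content"] for chunk in chunks)

# Against a running Ollama server, streaming looks like this:
# from ollama import Client
# client = Client()
# stream = client.chat(
#     model="qwen3:4b",
#     messages=[{"role": "user", "content": "Explain recursion in one sentence."}],
#     stream=True,
# )
# for chunk in stream:
#     print(chunk["message"]["content"], end="", flush=True)

# The helper works on any iterable of chat-style chunks:
fake_chunks = [{"message": {"content": "Hello"}}, {"message": {"content": ", world"}}]
print(join_stream(fake_chunks))  # → Hello, world
```

Streaming improves perceived latency in chat UIs because users see the first tokens immediately instead of waiting for the full response.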
Now that we know how to install and use Qwen3 models using Ollama and Python, let’s discuss how to adapt them for specific use cases by fine-tuning.
Fine-tuning Qwen 3 with Ollama
Ollama provides a way to locally host and serve LLMs through APIs. We cannot fine-tune LLMs on new datasets using Ollama. However, we can create new models from existing models in Ollama by adding a system prompt that controls the model’s behavior, which is a lightweight alternative to fine-tuning. Let’s discuss how to tailor an LLM in Ollama using prompt engineering.
Using a system prompt, we will adapt the Qwen3 0.6B model for the sentiment analysis task. To do this, let’s first create a new folder named qwen3-sentiment where we will store the system prompt:
mkdir qwen3-sentiment
After creating the folder, go to the qwen3-sentiment folder and create a text file named Modelfile using any text editor.
cd qwen3-sentiment
gedit Modelfile
In the text file, we will first specify the source model we want to build on using the FROM instruction. Next, we will add a system prompt with a few-shot example that instructs the model to act as a sentiment analysis model. The Modelfile will have the following content:
```
FROM qwen3:0.6b

# Add a system prompt to decide the model behavior.
SYSTEM """You are a sentiment analysis assistant. Classify each text as Positive, Negative, or Neutral.

Examples:
Input: "I absolutely loved the product! It exceeded my expectations."
Sentiment: Positive

Input: "The service was okay, nothing special."
Sentiment: Neutral

Input: "This is the worst experience I’ve ever had."
Sentiment: Negative

Now analyze the following text:
Input: "{{text}}"
Sentiment:"""
```
After creating the Modelfile, we will create the fine-tuned sentiment analysis model using the ollama create command. The ollama create command takes the new model’s name as its first argument and the path to the Modelfile through the -f flag.
ollama create qwen3-sentiment-analyzer -f Modelfile
After executing the ollama create command, we get a new model named qwen3-sentiment-analyzer:
After creating the fine-tuned model, we can run it like any other Ollama model using the ollama run command:
ollama run qwen3-sentiment-analyzer
As we have created the sentiment analysis model, the qwen3-sentiment-analyzer model gives the sentiment of every input text, as shown in the following image:
We created the new model from the Qwen3 0.6B model, so qwen3-sentiment-analyzer inherits all the properties of the base model. As shown in the image, we can also switch the qwen3-sentiment-analyzer model between thinking and non-thinking modes.
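The custom model can also be called from Python through the same /api/generate endpoint used earlier. The payload builder below is a hypothetical helper of our own, not part of the Ollama client:

```python
# Hypothetical helper that builds a request for the custom sentiment model.
def build_sentiment_payload(text):
    return {
        "model": "qwen3-sentiment-analyzer",
        "prompt": text,
        "stream": False,
        "think": False,  # skip reasoning for faster, label-only output
    }

payload = build_sentiment_payload("The delivery was late and the box was damaged.")

# With a local Ollama server running:
# import requests
# output = requests.post("http://localhost:11434/api/generate", json=payload).json()
# print(output["response"].strip())
print(payload["model"])  # → qwen3-sentiment-analyzer
```

Because the system prompt is baked into the model, the request only needs to carry the text to classify.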
Fine-tune Qwen3 models using a new dataset
We cannot fine-tune a model using a new dataset in Ollama. To do this, you can use PyTorch, TensorFlow, HuggingFace, or ModelScope. To learn how to fine-tune Qwen3 using HuggingFace, you can go through this article on LLM fine-tuning using HuggingFace in Python.
Features of Qwen 3
Qwen3 comes with several features that make it suitable for everything from sentiment analysis and text summarization to advanced research and enterprise deployment:
- Hybrid architecture: Qwen 3 is available as different dense and mixture-of-experts (MoE) models. Dense models use every part of their architecture during inference: all layers and neurons are active for each input. MoE models split their architecture into specialized sub-networks, and during inference only a subset of these sub-networks participates in processing the input. The range of dense and MoE models provides scalability for diverse resource availability.
- Multiple parameter scales: Qwen 3 is available in multiple sizes, with dense models ranging from 0.6B to 32B parameters, plus 30B and 235B MoE models. Based on the available infrastructure and task complexity, we can use any of these models in our LLM-based applications.
- Hybrid reasoning: Qwen3 has thinking and non-thinking modes that we can use according to a specific task. Thinking mode is better for step-by-step reasoning, math, and coding problems but has computational overhead and a delay in response. Non-thinking mode is best suited for general-purpose tasks where we need outputs quickly.
- Multilingual support: Qwen 3 natively supports 119 languages and dialects, including many Indo-European, Afro-Asiatic, Austronesian, Dravidian, Turkic, and Sino-Tibetan languages, along with Japanese and Korean. Thus, we can use it in customer-facing applications and chatbots globally.
- Computational efficiency: Qwen 3 MoE models activate only a subset of the layers and neurons during inference as required for the given task. For example, Qwen3 235B only uses 22B parameters for a given task. This helps us achieve high performance without the need for massive computational resources.
Conclusion
Qwen3 provides a wide range of models for our LLM-based applications. In this article, we discussed how to install and use Qwen3 models with Ollama and how to fine-tune them by adding system prompts.
To learn more about LLMs, you can take this course on how to use ChatGPT. You might also like this course on OpenAI API coding with Python that discusses how to use chat completion methods for getting outputs from self-hosted large language models.
Frequently asked questions
1. How much RAM does Qwen 3 need?
Different Qwen3 models have different RAM requirements. For the Qwen3 0.6B and Qwen3 1.7B models, 8 GB RAM is sufficient for local inference. For the Qwen3 4B and 8B models, 16 GB RAM is sufficient, but larger contexts might need up to 32 GB. The Qwen3 14B, 30B MoE, and 32B models require at least 32 GB RAM for good performance, whereas the 235B model requires at least 128 GB RAM.
2. Does Qwen3 support image input?
The open-source Qwen3 models support only text input by default. However, you can provide image inputs to the Qwen3 models while using them through the official Qwen chat interface.
3. Can I run Ollama without a GPU?
Yes, you can run models through Ollama without a GPU for local inference and small projects. However, for production workloads, a GPU is typically required for acceptable throughput.
4. What is the difference between dense and MoE in Qwen3?
Dense and MoE (mixture of experts) are two different types of Qwen3 models. Dense models use all their layers and neurons to generate outputs for a query. In an MoE model, various subsets of layers and neurons are responsible for different tasks. Hence, only a subset of layers and neurons is used to generate outputs in an MoE model.
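To make the difference concrete, here is a toy sketch (not Qwen3’s actual routing code) of top-k expert selection: a gate scores each expert, and only the highest-scoring ones process the input, whereas a dense layer would use all of them.

```python
# Toy illustration of MoE routing: only the top-k experts (here 2 of 4)
# are activated for a given input.
def top_k_experts(gate_scores, k=2):
    ranked = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)
    return sorted(ranked[:k])

gate_scores = [0.1, 2.3, 0.7, 1.8]  # hypothetical gate logits for 4 experts
print(top_k_experts(gate_scores))  # → [1, 3]
```

This is why an MoE model like Qwen3 235B can hold 235B parameters yet activate only about 22B of them per token.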
5. Is Qwen a thinking model?
Yes, Qwen 3 is considered a thinking model because of its built-in reasoning and problem-solving abilities. It can understand and analyze natural language inputs, and we can run Qwen3 models in both thinking and non-thinking modes.