At V2 Digital, we can see AI is changing the world. It's disrupting industries, driving business value and lowering the barrier to entry in many fields of work. Our recent State of AI in Australia report speaks to many of these themes.
Developer Productivity is one of those areas that is showing real results with AI.
Like many of you reading this, I am a regular user of AI chat assistants (including free GPT-4) and have used tools such as Amazon Q Developer to accelerate code generation.
Before we get started talking about coding assistance, here is a quick end-to-end demo of what we will cover and build in this blog post.
https://www.youtube.com/watch?v=ntJxm5LNdk8
AI is useful and supercharges my productivity. It does not remove the need for developers and builders; today, you still need to steer these assistants to reach your outcome. But it is clearly impacting software development, and you know what, that's a good thing: it shifts effort from low-value to high-value work and enables all of us to focus on what counts. Developer time isn't free.
The most polished commercial offering, in my experience, is GitHub's Copilot.
GitHub is where the world stores its code (by and large), and 46% of code committed to GitHub is now created by AI.
GitHub recently released results of a survey, which aligns with my personal experiences.
Source: GitHub Copilot Research
Developers who use AI are more efficient. Statistics from a recent 2024 GitHub Copilot Research Report show that developers who use GitHub Copilot complete tasks 55% faster. Is this Marketing spin or fact? Your mileage may vary, but we hear values of 25%+ in our network, with developer experience being a large factor.
The quote that I took away from this survey was:
Developers who used GitHub Copilot completed the task significantly faster–55% faster than the developers who didn’t use GitHub Copilot
https://visualstudiomagazine.com/articles/2024/01/25/copilot-research.aspx
It's a claim from a vendor, so your mileage may vary depending on whether you are a battle-hardened developer or someone early in your journey. As someone who is not a classically trained developer, I believe it. It has helped me to learn new coding functions and accelerate my output.
Of course what AI can produce is only as good as the inputs it receives. To drive the AI you need to have a solid grasp on the fundamentals and a clear view on what you’re looking to achieve, then most of the heavy lifting can be performed by the AI - which is where the value lies.
Whilst I think GitHub Copilot and other commercial offerings are of great value, there are other options. As a consultant advising organisations on their AI strategy and a builder at heart, I wanted to taste and sample other Large Language Models (LLM), for a few reasons:
Customisation with Specific LLMs: Specialised LLMs can be more bespoke and therefore presumably deliver better outcomes. An LLM focused solely on Python coding should be better than GitHub Copilot for pure Python tasks, right?
Data Privacy: My data is my data; certain versions of GitHub Copilot will use your data for training purposes. Business and Enterprise versions, however, exclude your data from training.
Cost Considerations: The prices for the versions of GitHub Copilot that I would consider using range from $19 USD to $39 USD per month.
Sustainability and Green IT: Large language models are generally very capable, and while training has to happen in the cloud, inference doesn't; local inference is rapidly becoming a practical option, offering advantages from speed to reduced cost and carbon emissions. Even Microsoft's Scott Hanselman talks about the benefits of local inference, demonstrating it on a PC. Running LLMs in the cloud can be excessive for inference, which requires far fewer computational resources than training.
It would be naive of me to discount GitHub Copilot. Plenty of business benefits have already been widely shared, from efficiency to quality improvements. The biggest one that stands out to me (wearing my CTO/CIO hat) is Intellectual Property (IP) indemnity. You can read more about it in GitHub's own documentation, but the gist of it is:
"Both GitHub and Microsoft provide indemnity for their Copilot products. This means that if you use GitHub Copilot, you're backed by a contractual commitment: If any suggestion made by GitHub Copilot is challenged as infringing on third-party intellectual property (IP) rights and you have enabled the duplicate detection filter, Microsoft will assume legal liability."
For many organisations, especially whilst AI and its legalities are still in their infancy, the cost of Copilot at $19-$39 USD per developer per month could be considered cheap insurance.
Enter Open Source
Looking at functionality, what you can achieve with other LLMs and plugins overlaps heavily with GitHub Copilot, to the point that for my use (spoiler), I see little functional difference. I say 'my' use because, in this instance, my usage primarily consists of:
Chatting with my code
Commenting existing code
Debugging my code
Writing new code functions
So in this blog post, I am going to demonstrate how you can run your own LLM locally on Apple Silicon and configure Visual Studio Code to query this LLM.
Apple Silicon
Why, you ask, why not a PC? PCs with a discrete GPU have separate memory for the CPU and the GPU: one pool for your CPU and another for your GPU. Apple Silicon Macs combine DRAM accessible to both the CPU and GPU into a single System on a Chip, or SoC.
Building everything into one chip gives the system a UMA (Unified Memory Architecture). This means that the GPU and CPU are working over the same memory.
The benefit of this is that you don't need a $2000+ discrete GPU; your Mac will be able to run these LLMs. Until companies start building comparable integrated units for Windows laptops, your only option is a Mac or a decent discrete GPU (12GB+ of VRAM, ideally 16GB or more).
Our Software Stack
To build our own GitHub Copilot we need two pieces of software, LM Studio and Ollama, plus a VS Code plugin, Continue. LM Studio will feed our VS Code plugin, and Ollama will provide the 'tab' autocomplete function.
LM Studio - Installation
To get started, we will be using LM Studio. This application lets you run local, open-source models such as CodeLlama, Mistral and StarCoder2 on your computer.
It provides a Local Inference Server, based on the OpenAI API specification, which allows you to use any model running in LM Studio as a drop-in replacement for the OpenAI API.
Whether you are running CodeLlama, Mistral, StarCoder2 or any other model, tools and plugins built for the OpenAI API work unchanged. If you have built against OpenAI before, you will recognise how familiar this structure is:
[2024-03-21 20:55:52.061] [INFO] [LM STUDIO SERVER] Verbose server logs are ENABLED
[2024-03-21 20:55:52.064] [INFO] [LM STUDIO SERVER] Success! HTTP server listening on port 1234
[2024-03-21 20:55:52.064] [INFO] [LM STUDIO SERVER] Supported endpoints:
[2024-03-21 20:55:52.065] [INFO] [LM STUDIO SERVER] -> GET http://localhost:1234/v1/models
[2024-03-21 20:55:52.065] [INFO] [LM STUDIO SERVER] -> POST http://localhost:1234/v1/chat/completions
[2024-03-21 20:55:52.065] [INFO] [LM STUDIO SERVER] -> POST http://localhost:1234/v1/completions
[2024-03-21 20:55:52.065] [INFO] [LM STUDIO SERVER] Logs are saved into /tmp/lmstudio-server-log.txt
This also means if you are a builder you can simply enter
pip3 install openai
And your Python code will think it's looking at and talking to OpenAI.
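For example, here is a minimal sketch using the official openai Python package (version 1 or later) against LM Studio's default port of 1234; the model name below is a placeholder, as LM Studio serves whichever model you currently have loaded.

from openai import OpenAI

# Point the client at the LM Studio local inference server instead of api.openai.com.
# LM Studio ignores the API key, so any placeholder value will do.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="local-model",  # placeholder; LM Studio uses whichever model is loaded
    messages=[
        {"role": "user", "content": "Write a Python function that reverses a string."}
    ],
    temperature=0.2,
)

print(response.choices[0].message.content)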
The first thing we need to do is install LM Studio. You can download it from the LM Studio website; grab the DMG (note it is Apple Silicon only, no Intel Macs) and follow the prompts to install.
LM Studio - Model Selection
There is no perfect hardware platform; everything requires compromise. Whilst I talk about the virtues of Apple Silicon above, depending on who you speak to it is either a blessing or a curse.
LM Studio allows me to run LLMs on the laptop entirely offline, download compatible model files, and use models through the in-app chat UI or an OpenAI-compatible local server.
A drawback of this platform is that there is no way to upgrade the memory (or other components). On the plus side, the memory is part of the SoC and can be leveraged by both the GPU and CPU.
To get started, search for models and click download. Note that most model downloads are between 5GB and 20GB in size; I have close to 50GB of models on my laptop.
LM Studio - Performing Local Chat
The first thing you will need to do is load your model. At the very top of the page, select your model (the bar highlighted in purple) and load it. This will take 10-15 seconds, depending on the speed of your storage, and during this time you will see your memory consumption increase.
In this example, Starcoder2 15b Q8_0 is using 15.72 GB of memory. Ideally, you need 32GB of RAM or more to load the larger models.
However, as I start to run a local chat with my models, there is a noticeable increase in GPU usage, even when using the Apple Metal API.
LM Studio - Running An Inference Server
The part that excites me is not chat, but local inference. By running inference locally, I can leverage plugins and other solutions in place of OpenAI. Click on the <-> icon on the left-hand side, choose your model and start the server. On the right of this screen you can dial the GPU (Apple Metal) usage up or down and see some code samples.
Given that our inference server is running and you are now familiar with running an LLM, we will move on to how we can query our self-hosted LLM in VSCode.
Continue - Installation & Configuration
A great way to leverage LLMs, paid or otherwise, is the VS Code plugin called Continue. Continue is an open-source autopilot for VS Code and the easiest way to code with any LLM. It will let you use the LLM you have chosen in LM Studio from within VS Code.
Install Continue from Visual Studio Marketplace.
Features of Continue are outlined below. Those with a keen eye will notice that on the bottom right, there is a 'Continue' with a tick, which is very similar to GitHub Copilot when it is working.
Add highlighted code to context (chat with your code)
Fix this code
Optimise this code
Write a DocString for this code
Write Comments for this code
Continue does need to be configured to use LM Studio, which is pretty simple. Click on the Continue icon (the '>CD_') on the left-hand side, then the gears. This will bring up a JSON-based configuration file.
You will need to add JSON pointing the provider at LM Studio. Below is an example of some of the models I have added; the key part here is the provider being LM Studio.
If you are running just one model at a time in LM Studio, it doesn't really matter what you use for the title and model, as there is only one local inference server running. Newer versions of LM Studio (0.2.17 and above) allow you to run multiple models simultaneously.
My models are based on the configuration file, “Continue configuration”
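In text form, the relevant section of config.json looks something like this (a sketch only; the titles and model names are illustrative, so substitute the models you have actually downloaded in LM Studio):

"models": [
  {
    "title": "CodeLlama 13B (LM Studio)",
    "provider": "lmstudio",
    "model": "codellama-13b"
  },
  {
    "title": "Mistral 7B (LM Studio)",
    "provider": "lmstudio",
    "model": "mistral-7b"
  }
]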
Ollama - Adding in autocomplete
Continue with LMStudio does not provide the tab autocomplete function in VSCode. If you are not familiar with autocomplete, here is an example.
You might type in your IDE, 'Write me a function that finds the first 100 prime numbers in C++'; when I hit 'tab', I expect the function to appear. For that we need Ollama. To set it up, follow the steps below.
First, download and install Ollama from the Ollama website. Running Ollama yields nothing more than an icon in your menu bar.
That cute little alpaca icon is the only hint that Ollama is running.
We then need to head to the terminal. Ollama has the same 'Docker'-like vibes in its command-line syntax:
ollama run, ollama list.
To run a model, you use the Ollama run command. This means you can run a different model for autocomplete if you wish.
Model | Parameters | Size | Download |
Llama 2 | 7B | 3.8GB | ollama run llama2 |
Mistral | 7B | 4.1GB | ollama run mistral |
Dolphin Phi | 2.7B | 1.6GB | ollama run dolphin-phi |
Phi-2 | 2.7B | 1.7GB | ollama run phi |
Neural Chat | 7B | 4.1GB | ollama run neural-chat |
Starling | 7B | 4.1GB | ollama run starling-lm |
Code Llama | 7B | 3.8GB | ollama run codellama |
Llama 2 Uncensored | 7B | 3.8GB | ollama run llama2-uncensored |
Llama 2 13B | 13B | 7.3GB | ollama run llama2:13b |
Llama 2 70B | 70B | 39GB | ollama run llama2:70b |
Orca Mini | 3B | 1.9GB | ollama run orca-mini |
Vicuna | 7B | 3.8GB | ollama run vicuna |
LLaVA | 7B | 4.5GB | ollama run llava |
Gemma | 2B | 1.4GB | ollama run gemma:2b |
Gemma | 7B | 4.8GB | ollama run gemma:7b |
Ollama model list as of March 2024
To download a model, issue the ollama run command. For example, to run llama2, I issue:
ollama run llama2
Ollama will now automatically start the model you have selected to run, and this will persist across reboots.
Continue - Configuring AutoComplete
We need to go back to the Continue configuration in VS Code. Open the JSON configuration file and scroll to the bottom, where you will see the 'tabAutocompleteModel' brace. Enter Ollama as the provider; the model needs to match a model from the table above, and I suggest you use the model name as the title.
"tabAutocompleteModel": {
"title": "codellama 7b",
"provider": "ollama",
"model": "codellama-7b"
}
Congratulations, you have successfully configured Continue, LMStudio and Ollama, so what can you do? Let's go for a lap around the features.
Adding Code To Context & Chatting With Code
Select a block of code, right-click, select 'Continue' and add it to context. This shifts the code block into the context pane on the left of the screen, allowing you to chat with your code.
Fix This Code
Select the code, right click, select ‘Continue’ and ‘Fix this code’. This will send your code to LM Studio and you will be able to ‘accept or reject’.
Comment On / DocString My Code
Select the code, right-click, select 'Continue' and either Comment or DocString. I find this really handy when describing code I haven't written, especially when I don't understand the code base.
Optimise This Code
Select the code, right-click, select 'Continue', and select Optimise this code. I know much of my code is not perfect; I enjoy this function because it makes my code cleaner, and I often learn a trick or two.
Tab Autocomplete
Simply start typing a comment and press the tab button. This is perfect if you are adding a function or starting from scratch. It enables you to quickly get your solution's boilerplate up and running.
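As a purely illustrative sketch of that flow (the exact completion will vary by model), you type the comment, press tab, and the model proposes the function body:

# Write a function that returns the first 100 prime numbers
def first_100_primes():
    primes = []
    candidate = 2
    while len(primes) < 100:
        # A candidate is prime if no previously found prime divides it
        if all(candidate % p != 0 for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes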
How Good Is This Setup? - Let's Perform A Quick Test To Find Out
I have been using this setup for a week now and am blown away. My baseline is six months as a GitHub Copilot user. Is it better than GitHub Copilot? Well, that depends. This setup allows you to use specialist models designed for a given task; for example, Wizard Coder Python is a model designed purely for Python coding.
I am going to put this solution through its paces. I am going to ask LMStudio running Wizard Coder Python to "write Python code to parse https://v2.digital and return the top 10 posts".
Here is what Codellama replied with, and as someone who codes in Python, it looks pretty solid. It is using Beautiful Soup (BS4) and Requests.
It ran first time and without an issue. Granted, it did not mention that I needed to install the Python packages BS4 (Beautiful Soup) or Requests via pip first, but neither did the commercial offerings.
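I won't reproduce the generated code exactly, but it was along these lines (a hedged sketch: the 'h2' tag used to pick out post titles is an assumption about the site's markup, not part of the original output):

import requests
from bs4 import BeautifulSoup

# Fetch the page and parse it with Beautiful Soup
response = requests.get("https://v2.digital")
soup = BeautifulSoup(response.text, "html.parser")

# Collect headings as post titles; the 'h2' selector is illustrative only
titles = [h.get_text(strip=True) for h in soup.find_all("h2")]

# Print the first 10 posts found
for title in titles[:10]:
    print(title)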
My Thoughts & Summary
AI coding assistants are here to stay, accelerating low-value pieces of work. They are amazing, simple as that.
In this article, we illustrated that you can build a solution giving you much of the power of a commercial AI assistant, sitting on your laptop! Not only is this free, it runs locally, meaning I can use it anywhere without connectivity (flying, for example), and there is no chance of my data being used to train further foundational models.
Unless you are concerned about IP infringement (which is a valid concern), the benefits of commercial AI assistants are hard to justify for builder personas. Almost all of the value that GitHub Copilot provides is available in this combination of software and configuration, free of charge.
The barrier to entry with Apple Silicon is lower than ever. Apple Macs are common; $2500+ GPUs in desktops, less so.
But whilst the barrier is low for builders, it is still higher than using a commercial offering (starting at $19 USD a month). Even so, I would encourage you to step through this process, as it will give you a better understanding of how AI coding assistants work in general and of OpenAI's API model, which will have other synergies if you are planning to build an application against an LLM in the future.
New LLMs are appearing every day, each with its own value proposition. Whilst all LLMs are by nature incredibly capable, they still have their unique pros and cons.
This approach allows you to sample what's good and what's bad; it is not a one-size-fits-all choice. I have tested Wizard Coder Python, StarCoder2, Mistral and CodeLlama, and they each have their unique pros and cons.
Whilst I don't have an extensive set of test data, there are anecdotal reports of this pattern performing better than GitHub Copilot, and it makes sense: GitHub Copilot is a general model compared to domain-specific models.
As we look to decarbonise the world, play your part and run your models locally or on that EC2/Azure VM Instance (GPU based).
Everyone is on their own journey, and whilst this may work for some, it may not work for others. If you are looking to reduce operational overhead, there is plenty of value in a hosted solution and depending on your capabilities and risk profile, a self-hosted solution may be the way to go.