Data Processing with Large Language Models
In this post I’ll use a Large Language Model (LLM) to process an existing dataset into a new dataset. What would otherwise be a complex, time-consuming parsing script becomes fairly simple code that takes advantage of the LLM’s natural language processing capabilities.
The Initial Dataset
While looking for a list of job titles to use in another project, I found a number of references to the Occupational Employment and Wage Statistics datasets from the US Bureau of Labor Statistics. The National dataset has a great list of job titles, but unfortunately they are plural nouns and each line often includes more than one title.
Example data from national_M2022_dl.xlsx:
Top Executives
Chief Executives
Chief Executives
General and Operations Managers
General and Operations Managers
Legislators
Legislators
Advertising, Marketing, Promotions, Public Relations, and Sales Managers
Advertising and Promotions Managers
Advertising and Promotions Managers
Marketing and Sales Managers
Marketing Managers
Rather than first learning the structure and features of this dataset and then writing a parsing algorithm, I’m going to load an LLM locally and prompt it for the data I want. For speed this requires GPU hardware, but you could instead use remote APIs like Azure or the ChatGPT API, or load GGML models for (slower) CPU inference.
Loading A Model
Depending on the size of the model (and your hardware), initial loading can take some time, so my approach will be to load the LLM once and then query the loaded model iteratively, once for each line of the dataset. On my machine, loading a 7B model takes 2 seconds and loading a 13B model takes 10 seconds.
During development of the script I don’t want to load the model every time the script runs, so some kind of server is needed: load the model once, then start and restart the script as many times as needed while debugging. One option is to use a Jupyter Notebook with something like Transformers, where you can load the model in the first cell and use subsequent cells for writing the script.
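As a rough sketch of that notebook pattern (the model name below is just a placeholder, and this isn’t the route I take in this post), the first cell loads the model once and later cells can call a small helper as often as needed:

# Cell 1: load the model once -- this is the slow part.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TheBloke/Wizard-Vicuna-7B-Uncensored-HF"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")

# Cell 2: iterate on prompt/parsing code without reloading the model.
def query(prompt, max_new_tokens=200):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Return only the newly generated text, not the echoed prompt.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)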
Rather than learn the Transformers syntax, I’ll instead run oobabooga and make requests to its API endpoint. Starting oobabooga with an API is easy enough:
$ python server.py --api
Starting streaming server at ws://127.0.0.1:5005/api/v1/stream
Starting API at http://127.0.0.1:5000/api
Running on local URL: http://127.0.0.1:7860
Browsing to http://localhost:7860 > Models allows you to either download or load a model. Once the model is loaded, we’re ready to write the script. The api-example.py in the oobabooga source shows the approach to querying a loaded model, and the run function referenced below is a copy/paste of this.
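For reference, a stripped-down run along those lines might look like the sketch below. The endpoint and request fields follow the older blocking API that --api starts; field names can vary between oobabooga versions, and the real api-example.py passes many more generation parameters.

# api.py -- minimal sketch of run(), based on oobabooga's api-example.py.
import requests

URL = "http://127.0.0.1:5000/api/v1/generate"

def run(prompt):
    request = {
        "prompt": prompt,
        "max_new_tokens": 200,
        "do_sample": True,
        "temperature": 0.7,
    }
    response = requests.post(URL, json=request)
    response.raise_for_status()
    # The blocking API returns the generated continuation, not the prompt.
    return response.json()["results"][0]["text"]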
Writing The Script
I’ve reduced the dataset to a line-by-line listing of the occupation categories and saved that to datasets/occupation_categories.txt. We’ll read that file and, from each line, create a prompt to query the API with.
from api import run
import json

def read_categories():
    # One occupation category per line, taken from the BLS dataset.
    with open("datasets/occupation_categories.txt", "r") as file:
        categories = file.readlines()
    return categories

def create_prompt(category):
    # One-shot prompt: the Accountants/Auditors example shows the model
    # the JSON array format we expect back.
    return f'''Job title list. From the category, produce a list of job titles.
Do not include different levels of the same title like Junior or Senior.
Category: Accountants, Auditors
Job Title(s): ["Accountant", "Auditor"]
Category: {category}
Job Title(s):'''

def get_result(line):
    result = run(create_prompt(line))
    try:
        parsed = json.loads(result)
    except json.JSONDecodeError:
        print(f"Could not parse {result}")
        return None
    return parsed

categories = read_categories()
occupations = set()
for category in categories:
    result = get_result(category)
    if result is None:
        continue
    # Lower-case everything so the set removes duplicates across categories.
    lower_case = list(map(str.lower, result))
    print(lower_case)
    occupations.update(lower_case)

with open("datasets/occupations.txt", "w") as file:
    for x in occupations:
        file.write(x + "\n")
My initial approach included a request within the prompt for the response to be JSON; however, this often resulted in invalid JSON, or not JSON at all but markdown numbered or bulleted lists. The improved prompt includes an example of how the response should be structured, and this resulted in the majority of responses being valid JSON. To remove duplicates, the array items are converted to lower case and added to a set.
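For example, the prompt rendered for the Chief Executives line looks like this (plus a stray newline after the category, since readlines keeps each line ending); the model’s job is to complete the final line with a JSON array of titles:

Job title list. From the category, produce a list of job titles.
Do not include different levels of the same title like Junior or Senior.
Category: Accountants, Auditors
Job Title(s): ["Accountant", "Auditor"]
Category: Chief Executives
Job Title(s):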
Results
Running the code against TheBloke_Wizard-Vicuna-13B-Uncensored-GPTQ resulted in ~1,826 job titles and the script took 409 seconds to run. The same script against TheBloke_Wizard-Vicuna-7B-Uncensored-GPTQ took 361 seconds to run and resulted in ~2,291 job titles.
During processing I found the 13B model would often miss the closing ] on the array, making the JSON unparseable. The 7B model would often switch double quotes for single quotes, also making the JSON unparseable. Interestingly, for this task the 7B model appears to perform better: it’s both quicker and returns more job titles.
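Both failure modes could be patched around rather than discarded. As a sketch (not something the script above does), the parsing step could fall back to ast.literal_eval, which accepts single quotes, and retry with the missing closing bracket appended:

import ast
import json

def parse_titles(result):
    # Try the raw response, then the response with a closing bracket appended,
    # using JSON first and Python-literal parsing (accepts single quotes) second.
    for candidate in (result.strip(), result.strip() + "]"):
        for parser in (json.loads, ast.literal_eval):
            try:
                parsed = parser(candidate)
            except (ValueError, SyntaxError):
                continue
            if isinstance(parsed, list):
                return parsed
    return None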
Source
The source code for this post, including the datasets before and after, can be seen here.
After writing this post I found another dataset that required no processing. 🥲