Data Dictionary - RAW Data Package Feeds

Raw Job Records 

This is the core of our data. Models are generally built off of this data or aggregates of it. When using any other file, left join it onto this file, which drops any reference information that does not have a job record (i.e., companies with no job records).
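
The join pattern described above can be sketched in plain Python as follows. This is a minimal illustration, not a prescribed implementation: the sample rows, the `left_join` helper, and the choice of a ticker-style reference file are all hypothetical; only the field names come from this dictionary.

```python
# Hypothetical sample rows; real rows come from the feed files.
job_records = [
    {"hash": "h1", "company_id": "c1", "title": "Data Analyst"},
    {"hash": "h2", "company_id": "c2", "title": "Store Manager"},
]
reference_rows = [
    {"company_id": "c1", "ticker_symbol": "AAA"},
    {"company_id": "c3", "ticker_symbol": "CCC"},  # company with no job records
]


def left_join(left, right, key):
    """Keep every left row and attach the first matching right row, if any."""
    lookup = {}
    for row in right:
        lookup.setdefault(row[key], row)
    # Values from the left (job record) row win if both sides share a field.
    return [{**lookup.get(row[key], {}), **row} for row in left]


joined = left_join(job_records, reference_rows, "company_id")
```

Every job record survives the join, and the reference row for `c3` is dropped because no job record points to it.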

 

Daily File CSV Parquet
Location Standard/Feeds/Raw Daily Job Records/ Standard/Feeds/Parquet/Raw Daily Job Records/
 File Name   raw_daily_records_YYYY-MM-DD.csv.gz raw_daily_records_YYYY-MM-DD-part-###.parquet
Monthly File CSV Parquet
Location Standard/Feeds/Raw Full Job Records v2/ Standard/Feeds/Parquet/Raw Full Job Records v2/
File Name raw_job_archive_v2_full_YYYY-MM-DD.tar.gz raw_job_archive_v2_YYYY-MM-DD-part-####.parquet
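
As an illustration of the naming patterns above, the following sketch builds the daily CSV path for a given date and recognizes the matching Parquet part files. The helper names are hypothetical, and it assumes that `###` stands for a run of digits; how the storage root is accessed is not covered here.

```python
import re
from datetime import date

# Location and file name pattern taken from the Daily File table above.
DAILY_CSV_DIR = "Standard/Feeds/Raw Daily Job Records/"


def daily_csv_path(day: date) -> str:
    """Return the daily CSV path for one date (YYYY-MM-DD in the file name)."""
    return f"{DAILY_CSV_DIR}raw_daily_records_{day.isoformat()}.csv.gz"


def is_daily_parquet_part(file_name: str, day: date) -> bool:
    """Check a file name against the daily Parquet pattern.

    Assumption: '###' in the documented pattern is one or more digits.
    """
    pattern = rf"raw_daily_records_{re.escape(day.isoformat())}-part-\d+\.parquet"
    return re.fullmatch(pattern, file_name) is not None


path = daily_csv_path(date(2024, 1, 15))
```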

 

 

Job Records Table

 

Field Name Data Type Description
hash string/varchar The unique identifier for job records. This is used to join job descriptions to job records.
title string/varchar Job title scraped from the unique url for the job post
company_id string/varchar Unique identifier for a company-scrape. This is used to join reference files to job records.
company_name string/varchar The name of the company pertaining to the company_id.
city string/varchar City location for the job posting.
state string/varchar State or region for the job posting.
zip string/varchar Postal code for the job posting.
country string/varchar Country for the job posting.
created timestamp The first time this job was observed and scraped.
last_checked timestamp The most recent time this site was scraped and this job posting was observed.
last_updated timestamp The most recent time that this site was scraped and observed a change in the job posting.
delete_date timestamp The most recent time this site was scraped and this job posting was not found.
unmapped_location boolean If True, the job posting location could not be accurately identified.
url string The unique URL for the job posting.
base_hash string/varchar The unique identifier for job records on employer websites. This differs from hash as one base_hash might have multiple “hashes” as our dataset splits out jobs for each location listed on a job posting.
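
The relationship between hash and base_hash can be sketched as below. The rows are made up for illustration: one employer posting listed in two cities appears as two job records that share a base_hash.

```python
from collections import defaultdict

# Hypothetical rows: posting "b1" lists two locations, posting "b2" lists one.
job_records = [
    {"hash": "h1", "base_hash": "b1", "city": "Austin"},
    {"hash": "h2", "base_hash": "b1", "city": "Denver"},
    {"hash": "h3", "base_hash": "b2", "city": "Boston"},
]

# Group the per-location records back into employer postings.
locations_per_posting = defaultdict(list)
for row in job_records:
    locations_per_posting[row["base_hash"]].append(row["city"])

unique_postings = len(locations_per_posting)
```

Counting by base_hash gives the number of postings on the employer site, while counting by hash gives the number of posting-location pairs.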

 

Raw Job Record Structured Fields  

This table provides several additional fields typically found on a job portal or job record for any one job listing. Join this data to the job records table by hash.

 

What is a structured field? 

 A structured field is a data point found on a job portal or job record and is defined by the employer consistently across their job listings. The consistency of structured fields allows our web scrapes to capture additional data points not found in the job title or description.

 

Will every employer have all structured fields? 

The structured fields on any job listing are provided by the employer at their discretion. It is not expected that all employers will provide all structured field data. 

 

Daily File CSV Parquet
Location Standard/Feeds/Raw Daily Structured Fields/ Standard/Feeds/Parquet/Raw Daily Structured Fields/
 File Name  raw_daily_structured_fields_YYYY-MM-DD.csv.gz raw_daily_structured_fields_YYYY-MM-DD-part-###.parquet
Monthly File CSV Parquet
Location Standard/Feeds/Raw Full Structured Fields/ Standard/Feeds/Parquet/Raw Full Structured Fields/
File Name raw_full_structured_fields_YYYY-MM-DD.tar.gz raw_full_extended_fields_YYYY-MM-DD-part-####.parquet

 

 

Structured Fields Table

 

Field Name Data Type Description
hash string/varchar The unique identifier for job records. This is used to join job descriptions to job records.
reqid string/varchar The job ID or requisition ID for the job opening defined by the employer. This field can be used to track a job opening across multiple employer-owned career portals.
category string/varchar A classification or group which the job is categorized into by the employer on the career portal. Category could be a team, department, or brand defined by the employer on the job listing or career portal.
subcategory string/varchar Sub-category is a secondary or additional category to the primary category. It is defined by the employer as a secondary classification for their open roles. For example, where a category may be Retail, the sub-category could be Shift-Supervisor.
address string/varchar The address for the role provided on the job listing.
certifications string/varchar Any required certifications or licenses for the role provided on the job listing.
posted_date timestamp The date shown on the job listing as the posted date or the date the job became available on the career page in YYYY-MM-DD format.
close_date timestamp The date shown on the job listing as the close date or the date the employer stops accepting applications, in YYYY-MM-DD format.
commission_eligible string/varchar Information provided on the job listing about commission eligibility.
compensation string/varchar The compensation amount or range provided on the job listing.
contract_length string/varchar Information detailing the length of a contracted position.
division string/varchar The brand or division of a company the job listing falls under.
education_requirements string/varchar Educational requirements provided on the job listing.
employment_type string/varchar Describes the role as permanent, temporary, internship, contract, seasonal, or project-based work.
experience_required string/varchar Minimum experience required for the role.
signing_bonus string/varchar Provides any signing bonus offered by the employer on the job listing.
shift string/varchar What shift or shifts the job listing is hiring for.
site_id string/varchar The store number or facility associated with the job listing.
time_type string/varchar Describes the work schedule. Typically full-time and/or part-time.
travel_requirements string/varchar Any travel requirements for the role provided on the job listing.
vacancy_count string/varchar The count of hires being made for the job listing.
work_location string/varchar Describes where the work for this role takes place, such as on-site, hybrid, or work from home.
 

 

Raw Job Descriptions

This contains only the job descriptions. To use job descriptions, join them with job records on hash. For NLP applications, you would typically join this file with job records and use a combination of the job title from the job records file and the job description.
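
One possible way to assemble the text for an NLP pipeline is sketched below. The sample rows and the `nlp_text` helper are hypothetical, and placing the title on its own line ahead of the description is just one reasonable choice.

```python
# Hypothetical sample rows from the job records and job descriptions files.
job_records = [
    {"hash": "h1", "title": "Data Analyst"},
    {"hash": "h2", "title": "Store Manager"},  # no description available
]
descriptions = [
    {"hash": "h1", "description": "Build dashboards and reports."},
]

# Index descriptions by hash so each job record can look up its text.
description_by_hash = {row["hash"]: row["description"] for row in descriptions}


def nlp_text(record):
    """Combine job title and description into one text field."""
    description = description_by_hash.get(record["hash"], "")
    return f'{record["title"]}\n{description}'.strip()


texts = {record["hash"]: nlp_text(record) for record in job_records}
```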

 

Daily File XML Parquet
Location /Feeds/Raw Daily Job Descriptions/ /Feeds/Parquet/Raw Daily Job Descriptions/
 File Name  raw_daily_descriptions_YYYY-MM-DD.xml raw_daily_descriptions_YYYY-MM-DD.parquet
Monthly File XML Parquet
Location /Feeds/Raw Full Job Descriptions/  
File Name linkup_job_descriptions_YYYY-M-DD.tar.gz  

 

 

Raw Job Descriptions Table

 

Field Name Data Type Description
hash string/varchar The unique identifier for job records. This is used to join job descriptions to job records.
description string/varchar The full job description scraped from the website for the job posting.
 

Full-time/Part-time

 

The fulltime_parttime table is a point-in-time table used in the LinkUp dataset to identify whether a job posting contains language indicating that the job is full-time, part-time, or both.

 

The full-time/part-time logic uses a robust keyword analysis which is applied to a job’s title, description and structured fields for jobs written in English to determine how to categorize the job. 

 

 

Daily File CSV Parquet
Location Standard/Feeds/Raw Daily Fulltime Parttime/ Standard/Feeds/Parquet/Raw Daily Fulltime Parttime/
 File Name  raw_daily_fulltime_parttime_YYYY-MM-DD.csv.gz raw_daily_records_YYYY-MM-DD-part-###.parquet
Monthly File CSV Parquet
Location Standard/Feeds/Raw Full Fulltime Parttime/ Standard/Feeds/Parquet/Raw Full Fulltime Parttime/
File Name raw_full_fulltime_parttime_YYYY-MM-DD-part-###.csv.gz raw_full_fulltime_parttime_YYYY-MM-DD-part-###.parquet

 

 

Full-time/Part-time Table

 

Field Name Data Type Description
job_hash string/varchar The unique identifier for job records. This is used to join job descriptions to job records.
fulltime_parttime string One of the following values:
fulltime = The job title, description, or time_type structured field has language that the position is a full-time position and/or indicates at least 40 hours of work a week.
parttime = The job title, description, or time_type structured field has language that the position is a part-time position and/or indicates less than 40 hours of work a week.
fulltime_parttime = The job title, description, or time_type structured field has language that the position is either full-time or part-time.

start_date date The start date for this row. This is used for joining point-in-time information. 
end_date date The end date for this row. A NULL end date indicates that the row is the current or most recent full-time/part-time status.
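
A point-in-time lookup against this table might look like the sketch below. The rows and the `status_as_of` helper are hypothetical. It assumes that a row is valid from start_date up to, but not including, end_date; the dictionary does not state whether end_date is inclusive, so check this against your data.

```python
from datetime import date

# Hypothetical history for one job: part-time first, then full-time.
rows = [
    {"job_hash": "h1", "fulltime_parttime": "parttime",
     "start_date": date(2024, 1, 1), "end_date": date(2024, 3, 1)},
    {"job_hash": "h1", "fulltime_parttime": "fulltime",
     "start_date": date(2024, 3, 1), "end_date": None},  # NULL = current row
]


def status_as_of(rows, job_hash, as_of):
    """Return the status in effect on `as_of`, or None if no row applies."""
    for row in rows:
        if row["job_hash"] != job_hash:
            continue
        # Assumption: end_date is exclusive; a NULL end_date never ends.
        ended = row["end_date"] is not None and row["end_date"] <= as_of
        if row["start_date"] <= as_of and not ended:
            return row["fulltime_parttime"]
    return None


february_status = status_as_of(rows, "h1", date(2024, 2, 1))
march_status = status_as_of(rows, "h1", date(2024, 3, 1))
```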

 

 

 

Company Scrape Log

This is a record of when scrapes ran and when scrapes were changed. Its primary use is identifying outliers: an outlier can be classified as legitimate or as the result of a scrape break, which helps eliminate noise and false signals in job data.

 

Beyond reducing noise, scrape changes can carry meaningful information of their own. For example, after an influx of financing, a company may change Applicant Tracking Systems (e.g., Charming Charlie coming out of bankruptcy). Such a switch is visible because the scrape had to be modified to keep working, and the date of that modification is recorded in the scrape log.

 

 

Daily File CSV Parquet
Location /Feeds/Raw Daily Company Scrape Log/ /Feeds/Parquet/Raw Daily Company Scrape Log/
 File Name   raw_company_scrape_log_daily_YYYY-MM-DD.csv raw_company_scrape_log_daily_YYYY-MM-DD.parquet
Monthly File CSV Parquet
Location /Feeds/Raw Full Company Scrape Log/ /Feeds/Parquet/Raw Full Company Scrape Log/
File Name raw_company_scrape_log_full_YYYY-MM-DD.csv raw_company_scrape_log_daily_YYYY-MM-DD.parquet

 

 

Company Scrape Log Table

 

Field Name Data Type Description
company_id string/varchar The unique identifier for a company-scrape. This is used to join to job records, reference files, or aggregates.
date date The date the change occurred for the scrape.
scrape_run_complete boolean If True, this indicates the company_id was scraped on this date.
scrape_changed boolean If True, this indicates that the code was modified for the scrape on this date.
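
One way to use these flags when reviewing outliers is sketched below. The rows and the `suspect_dates` helper are hypothetical, and the rule itself (flag any date where the scrape did not complete or the scrape code changed) is an assumption for illustration, not a documented LinkUp method.

```python
from datetime import date

# Hypothetical scrape log rows for one company_id.
scrape_log = [
    {"company_id": "c1", "date": date(2024, 5, 1),
     "scrape_run_complete": True, "scrape_changed": False},
    {"company_id": "c1", "date": date(2024, 5, 2),
     "scrape_run_complete": False, "scrape_changed": False},
    {"company_id": "c1", "date": date(2024, 5, 3),
     "scrape_run_complete": True, "scrape_changed": True},
]


def suspect_dates(log, company_id):
    """Dates where a jump or drop in job counts may reflect the scrape itself."""
    return sorted(
        row["date"]
        for row in log
        if row["company_id"] == company_id
        and (not row["scrape_run_complete"] or row["scrape_changed"])
    )


flagged = suspect_dates(scrape_log, "c1")
```

An outlier in job counts that falls on a flagged date is a candidate for a scrape-related artifact; one that falls on an unflagged date is more likely to be a legitimate change.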

 

Company Ticker Reference

This file shows point-in-time ticker information for company ids. This can be joined to job records to understand mergers and acquisitions.

 

 

Daily File CSV Parquet
Location /Feeds/Company Ticker Reference/ /Feeds/Parquet/Company Ticker Reference/
 File Name   company_ticker_YYYY-MM-DD.csv company_ticker_YYYY-MM-DD.parquet

 

 

Company Ticker Reference Table

 

Field Name Data Type Description
company_id string/varchar The unique identifier for a company-scrape. This is used to join to job records, reference files, or aggregates.
start_date date The start date for this row. This is used for joining point-in-time information. Please see the join tutorial for assistance with joins.
end_date date The end date for this row. This is used for joining point-in-time information. Please see the join tutorial for assistance with joins.
ticker_symbol string/varchar The ticker symbol.
stock_exchange_country string/varchar The country of the stock exchange that the ticker symbol is traded on.
stock_exchange_name string/varchar The stock exchange symbol that the ticker is traded on.
primary_flag boolean If True, this is the primary ticker for the company_id.
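
As a rough sketch of how this table can surface corporate changes, the example below lists the primary ticker history for one company_id. The rows, ticker symbols, and the `primary_ticker_history` helper are made up, and the sketch assumes a current row carries a NULL end_date, as in the other point-in-time tables.

```python
from datetime import date

# Hypothetical rows: the primary ticker changes on 2022-06-01.
ticker_rows = [
    {"company_id": "c1", "start_date": date(2019, 1, 1),
     "end_date": date(2022, 6, 1), "ticker_symbol": "OLD", "primary_flag": True},
    {"company_id": "c1", "start_date": date(2022, 6, 1),
     "end_date": None, "ticker_symbol": "NEW", "primary_flag": True},
    {"company_id": "c1", "start_date": date(2022, 6, 1),
     "end_date": None, "ticker_symbol": "NEWX", "primary_flag": False},
]


def primary_ticker_history(rows, company_id):
    """Return (start_date, ticker_symbol) pairs for primary tickers, oldest first."""
    history = [
        (row["start_date"], row["ticker_symbol"])
        for row in rows
        if row["company_id"] == company_id and row["primary_flag"]
    ]
    return sorted(history)


history = primary_ticker_history(ticker_rows, "c1")
```

A change in the primary ticker between consecutive rows marks a date worth checking against known mergers, acquisitions, or relistings.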

 

 

Employer Type Reference

Employer Type can be used to filter the RAW dataset down to public, private, or multiple levels of government employers and jobs such as post-secondary or K-12 education and federal or local government employers.

 

The Employer Type table is point-in-time, starting in 2024 when we reviewed each active company_id and assigned an employer type. If a change is detected, a new row for the company_id will be created and denoted with a new start_date. An end_date is applied to the previous row to delineate that the row is not the most current.  

 

 

Daily File CSV Parquet
Location /Feeds/Raw Daily Employer Type Reference/ /Feeds/Parquet/Raw Daily Employer Type Reference/
 File Name  raw_daily_employer_type_YYYY-MM-DD.csv raw_daily_employer_type_YYYY-MM-DD.parquet
Monthly File   CSV Parquet
Location /Feeds/Raw Full Employer Type Reference/ /Feeds/Parquet/Raw Full Employer Type Reference/
File name raw_full_employer_type_YYYY-MM-DD.csv raw_full_employer_type_YYYY-MM-DD.parquet

 

 

Employer Type Table

 

Field Name Data Type Description
company_id string/varchar The unique identifier for a company-scrape. This is used to join to job records, reference files, or aggregates.
employer_type_id integer The numerical ID for employer type.
employer_type string/varchar The description of the employer_type_id.
start_date date The start date for this row. This is used for joining point-in-time information.  
end_date date The end date for this row. This is used for joining point-in-time information. 

 

 

Employer Types

 

Value employer_type_id Description
Public Company 1 For-profit, publicly traded corporations.
Private Company 2 For-profit, non-public corporations.
Post-secondary Education  3 Non-profit or governmental post-secondary educational institutions.
K-12 Education 4 Non-profit or governmental K-12 educational institutions.
Non-Profit 5 Non-profit corporations.
Federal Government 6 Federal government, military, and non-U.S. equivalent entities.
Local Government 7 Non-federal government employers at the state, county, territorial, city, township, parish, or similar level.
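
The employer type IDs above can be used as a filter, as in the following sketch. The sample rows are hypothetical; the ID-to-name mapping is copied from the table above, and a missing end_date is treated as the current row, as described for this table.

```python
from datetime import date

# Mapping copied from the Employer Types table.
EMPLOYER_TYPES = {
    1: "Public Company",
    2: "Private Company",
    3: "Post-secondary Education",
    4: "K-12 Education",
    5: "Non-Profit",
    6: "Federal Government",
    7: "Local Government",
}
GOVERNMENT_IDS = {6, 7}

# Hypothetical rows; c3 has an end_date, so that row is no longer current.
employer_type_rows = [
    {"company_id": "c1", "employer_type_id": 1, "end_date": None},
    {"company_id": "c2", "employer_type_id": 7, "end_date": None},
    {"company_id": "c3", "employer_type_id": 6, "end_date": date(2024, 9, 1)},
]

# company_ids currently classified as federal or local government.
current_government = {
    row["company_id"]
    for row in employer_type_rows
    if row["end_date"] is None and row["employer_type_id"] in GOVERNMENT_IDS
}
```

The resulting set of company_ids can then be used to keep or exclude job records by company_id.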

 

 

Remote Tag

The remote tag is a point-in-time table used to find all remote and non-remote work in the LinkUp dataset. Hybrid roles are considered remote work.

 

Remote tag is determined using a robust keyword analysis which is applied to job records written in English.   

 

LinkUp is working on providing an additional data element to distinguish between remote and hybrid positions. 

 

 

Daily File CSV Parquet
Location /Feeds/Raw Daily Remote Tag/ /Feeds/Parquet/Raw Daily Remote Tag/
 File Name  raw_daily_remote_tag_YYYY-MM-DD.csv raw_daily_remote_tag_YYYY-MM-DD.parquet
Monthly File CSV Parquet
Location /Feeds/Raw Full Remote Tag/ /Feeds/Parquet/Raw Full Remote Tag/
File name raw_full_remote_tag_YYYY-MM-DD.csv raw_full_remote_tag_YYYY-MM-DD.parquet

 

Remote Tag Table

 

Field Name Data Type Description
hash string/varchar The unique identifier for job records. This is used to join to job record data.
remote_status boolean One of the following values:
TRUE = The job title, description, or a structured field on the job listing contains keywords or phrases that indicate the role, in some way, is a remote position. This includes hybrid, telecommute, or work from home roles.
FALSE = The job title, description, or a structured field on the job listing either contains keywords that identify the job as a non-remote position, or contains no keywords/language in the description that indicates the role is remote, work from home, or hybrid capable.
remote_detail string The classification of Hybrid or Remote, where applicable.
start_date date The start date for this row. This is used for joining point-in-time information. 
end_date date The end date for this row. A NULL end date indicates that the row is the current or most recent remote_status.
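
A simple tally of current remote status could be sketched as follows. The rows are made up, and the exact strings stored in remote_detail ("Hybrid", "Remote") are an assumption based on the description above. Boolean values read from CSV arrive as text and would need to be parsed first.

```python
from collections import Counter
from datetime import date

# Hypothetical rows; h1 has an older, closed-out row plus a current one.
remote_rows = [
    {"hash": "h1", "remote_status": False, "remote_detail": None,
     "end_date": date(2024, 2, 1)},
    {"hash": "h1", "remote_status": True, "remote_detail": "Hybrid",
     "end_date": None},
    {"hash": "h2", "remote_status": True, "remote_detail": "Remote",
     "end_date": None},
    {"hash": "h3", "remote_status": False, "remote_detail": None,
     "end_date": None},
]


def label(row):
    """Collapse remote_status and remote_detail into one label."""
    if not row["remote_status"]:
        return "Non-remote"
    # Fall back when remote_detail is not populated for a remote row.
    return row["remote_detail"] or "Remote (unspecified)"


# A NULL end_date marks the current row for each hash.
current_rows = [row for row in remote_rows if row["end_date"] is None]
counts = Counter(label(row) for row in current_rows)
```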

 

Company ISIN* Reference

This file shows point-in-time ISIN (International Securities Identification Number), mapped via FactSet concordance. Please note, a license is required to receive this file. 

 

 

Daily File CSV Parquet
Location /Feeds/Company ISIN Reference/ /Feeds/Parquet/Company ISIN Reference/
 File Name  company_isin_YYYY-MM-DD.csv company_isin_YYYY-MM-DD.parquet

 

Company ISIN Reference Table

 

Field Name Data Type Description
company_id string/varchar The unique identifier for a company-scrape. This is used to join to job records, reference files, or aggregates.
start_date date The start date for this row. This is used for joining point-in-time information. Please see the join tutorial for assistance with joins.
end_date date The end date for this row. This is used for joining point-in-time information. Please see the join tutorial for assistance with joins.
isin string/varchar International Securities Identification Number, mapped via FactSet.
primary_flag boolean If True, this is the primary ISIN for the company_id.

 

 

Company CUSIP* Reference

This file shows point-in-time CUSIP (Committee on Uniform Securities Identification Procedures), mapped via FactSet concordance. This identifier is primarily used for publicly traded organizations in the United States.  Please note, a license is required to receive this file.

 

Daily File CSV Parquet
Location /Feeds/Company CUSIP Reference/ /Feeds/Parquet/Company CUSIP Reference/
 File Name  company_cusip_YYYY-MM-DD.csv  company_cusip_YYYY-MM-DD.parquet

 

 

Company CUSIP Reference Table

 

 

Field Name Data Type Description
company_id string/varchar The unique identifier for a company-scrape. This is used to join to job records, reference files, or aggregates.
start_date date The start date for this row. This is used for joining point-in-time information. Please see the join tutorial for assistance with joins.
end_date date The end date for this row. This is used for joining point-in-time information. Please see the join tutorial for assistance with joins.
cusip string/varchar Committee on Uniform Securities Identification Procedures number, mapped via FactSet.
primary_flag boolean If True, this is the primary CUSIP for the company_id.

 

 

Company SEDOL Reference

This file shows point-in-time SEDOL (Stock Exchange Daily Official List), mapped via FactSet concordance. This identifier is managed by the London Stock Exchange. Please note, a license is required to receive this file.

 

Daily File CSV Parquet
Location /Feeds/Company Sedol Reference/ /Feeds/Parquet/Company Sedol Reference/
 File Name  company_sedol_YYYY-MM-DD.csv company_sedol_YYYY-MM-DD.parquet

 

 

Company SEDOL Reference Table

 

Field Name Data Type Description
company_id string/varchar The unique identifier for a company-scrape. This is used to join to job records, reference files, or aggregates.
start_date date The start date for this row. This is used for joining point-in-time information. Please see the join tutorial for assistance with joins.
end_date date The end date for this row. This is used for joining point-in-time information. Please see the join tutorial for assistance with joins.
sedol string/varchar Stock Exchange Daily Official List number, mapped via FactSet.
primary_flag boolean If True, this is the primary SEDOL for the company_id.

 

 

PIT Company Reference

This file shows point-in-time company information for a company_id. It can be joined to job records to understand certain corporate changes (e.g., corporate name changes, URL changes, etc.). To use it, join it to job records or aggregates, dropping any company ids with no job records.

 

 

Daily File CSV Parquet
Location /Feeds/Raw Daily PIT Company Reference/ /Feeds/Parquet/Raw Daily PIT Company Reference/
 File Name   raw_pit_company_reference_daily_YYYY-MM-DD.csv raw_pit_company_reference_daily_YYYY-MM-DD.parquet
Monthly File CSV Parquet
Location /Feeds/Raw Full PIT Company Reference/ /Feeds/Parquet/Raw Full PIT Company Reference/
File Name raw_pit_company_reference_full_YYYY-MM-DD.csv raw_pit_company_reference_full_YYYY-MM-DD.parquet

 

 

PIT Company Reference Table

 

Field Name Data Type Description
company_id string/varchar The unique identifier for a company-scrape. This is used to join to job records, reference files, or aggregates.
start_date date The start date for this row. This is used for joining point-in-time information. Please see the join tutorial for assistance with joins.
end_date date The end date for this row. This is used for joining point-in-time information. Please see the join tutorial for assistance with joins.
company_name string/varchar The name for the company_id.
company_url string/varchar The URL for the company_id.
lei string/varchar Legal Entity Identifier.
open_perm_id string/varchar Open source company identifier used for joining to other data.
naics_code string/varchar Industry classification. This can be used to join to the Bureau of Labor Statistics salary data. 
 

 

O*Net-SOC Taxonomy 2019 Reference

This file provides the O*NET-SOC code by job hash, currently available in the 2010 and 2019 taxonomies; the 2019 taxonomy supports databases 25.1 through the latest release, 26.1. O*NET is the primary source of standardized occupation information in the US, covering over 1,000 occupations across the entire US economy.

NOTE: 2010 O*NET data is available up to October 1st, 2022.
 

Daily File CSV Parquet
Location /Feeds/ONet Taxonomy 2019 Daily v2/ /Feeds/Parquet/ONet Taxonomy 2019 Daily v2/
 File Name  onet_taxonomy_2019_daily_v2_YYYY-MM-DD.csv onet_taxonomy_2019_daily_YYYY-MM-DD.csv
Monthly File CSV Parquet
Location /Feeds/ONet Taxonomy 2019 Full v2/ /Feeds/Parquet/ONet Taxonomy 2019 Full v2/
File Name onet_taxonomy_2019_full_v2_YYYY-MM-DD.csv onet_taxonomy_2019_full_v2_YYYY-MM-DD-part-###.parquet

 

 

ONET Taxonomy Table

 

Field Name Data Type Description
hash string/varchar The unique identifier for job records. This is used to join job descriptions to job records.
onet_occupation_code string The ONET classification of the job. It is one way of normalizing job titles.
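
Because an O*NET-SOC code is a six-digit SOC code followed by a two-digit extension (for example, 15-1252.00), it can be split to roll jobs up to broader occupation groups. The sketch below is illustrative only: the `split_onet_code` helper is hypothetical, and it assumes onet_occupation_code is stored in that standard form.

```python
import re

# Standard O*NET-SOC layout: NN-NNNN.NN (SOC code plus two-digit extension).
ONET_SOC_PATTERN = re.compile(r"(\d{2})-(\d{4})\.(\d{2})")


def split_onet_code(code):
    """Split an O*NET-SOC code into SOC code, major group, and extension.

    Returns None when the value does not follow the standard layout.
    """
    match = ONET_SOC_PATTERN.fullmatch(code.strip())
    if match is None:
        return None
    major, detail, extension = match.groups()
    return {
        "soc_code": f"{major}-{detail}",
        "major_group": major,
        "extension": extension,
    }


parts = split_onet_code("15-1252.00")
```

Grouping job records by `major_group` or `soc_code` gives coarser occupation categories than the full code.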

 

 

*Copyright © 2020, American Bankers Association CUSIP Database provided by S&P Global Market Intelligence LLC. All rights reserved.
