Skip to content

Instantly share code, notes, and snippets.

@harrisoncramer
Last active June 18, 2022 20:17
Show Gist options
  • Save harrisoncramer/6ec98f7af528b3747909dab32fe7563a to your computer and use it in GitHub Desktop.
Save harrisoncramer/6ec98f7af528b3747909dab32fe7563a to your computer and use it in GitHub Desktop.
C2C_Weighted_Query

Problem Summary

We need to write two queries, that will return:

  1. The weighted rate of growth for graduates who studied certain fields
  2. The weighted rate of growth for jobs in those fields

We need these rates in order to calculate the "spread": the difference between the rate of growth for graduates and jobs per job type.

spread = weighted_job_growth_rate - weighted_job_growth_rate

For this technical document I've included the SQL commands alongside a copy command that will create CSVs on your desktop so that you can see the output along each step of the way.

The Weighted Job Growth Query

One half of our equation is the weighted job growth. To do this, we first have to get the total number of jobs in a given region for the selected fields.

The first subquery for this half of the equation looks like this:

\copy (
  SELECT SUM(employment_2019) as jobs, region
  FROM labor.data
  WHERE job_category_label IN ('Computer and Mathematical', 'Military Specific')
  GROUP BY region
) TO '~/Desktop/all_jobs.csv' With CSV DELIMITER ',' HEADER;

This query just returns the total number of jobs, grouped by region, for the fields selected.

Our next subquery will, for each field, multiply the number of jobs by the weighted job growth value, and group the results by region and job category:

\copy (
SELECT
  (employment_2019 * employment_five_year_growth_rate) AS weighted_value,
  SUM(employment_2019) AS total_2019,
  employment_five_year_growth_rate,
  region,
  job_category_label
FROM labor.data
WHERE job_category_label IN ('Computer and Mathematical', 'Military Specific')
GROUP BY employment_five_year_growth_rate, employment_2019, region, job_category_label
) TO weighted_job_growth.csv With CSV DELIMITER ',' HEADER;

For each row in the labor data that matches the fields we're interested in, we multiply the number of employees in 2019 by the five year growth rate, which returns a weighted value.

We then add up those weighted values, grouping by region and job type, and divide them by the total number of jobs in the region, which we get from our first subquery. We multiply that number by 100 to get a percentage. That gives us the weighted growth rate for a given region per job type.

Here's the full query:

\copy (
 WITH total_jobs AS (
   SELECT SUM(employment_2019) as jobs, region
   FROM labor.data
   WHERE job_category_label IN ('Computer and Mathematical', 'Military Specific')
   GROUP BY region
 ),
 jobs_with_weight AS (
   SELECT
   (employment_2019 * employment_five_year_growth_rate) AS weighted_value,
   SUM(employment_2019) AS total_2019,
   employment_five_year_growth_rate,
   region,
   job_category_label
   FROM labor.data
   WHERE job_category_label IN ('Computer and Mathematical', 'Military Specific')
   GROUP BY employment_five_year_growth_rate, employment_2019, region, job_category_label
 )
  SELECT (SUM(weighted_value) / SUM(total_jobs.jobs)) * 100 AS weighted_job_growth_rate,
  jobs_with_weight.region,
  jobs_with_weight.job_category_label
  FROM jobs_with_weight
  JOIN total_jobs
  ON total_jobs.region = jobs_with_weight.region
  GROUP BY jobs_with_weight.region, jobs_with_weight.job_category_label
) TO '~/Desktop/weighted_job_growth_per_region.csv' With CSV DELIMITER ',' HEADER;

Here's a quick peek at what the table produced by this query looks like:

┌─────────────────────────────┬───────────────────────────────────────────────────────────────────────────────────────┬───────────────────────────┐
│  weighted_job_growth_rate   │                                        region                                         │    job_category_label     │
├─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────┼───────────────────────────┤
│                      (NULL) │ The Villages, FL                                                                      │ Military Specific         │
│      8.08320185764483243600 │ Hot Springs, AR                                                                       │ Computer and Mathematical │
│      0.64561702837486221600 │ Hickory-Lenoir-Morganton, NC                                                          │ Computer and Mathematical │
│      2.53242205250346038000 │ Crestview-Fort Walton Beach-Destin, FL                                                │ Military Specific         │
│      0.15466346703692049000 │ Worcester, MA-CT                                                                      │ Military Specific         │
│                      (NULL) │ St. George, UT                                                                        │ Military Specific         │
│      0.44609576940604977900 │ Portland-Vancouver-Hillsboro, OR-WA                                                   │ Computer and Mathematical │
│     15.84707572911962775800 │ White, Jackson, Prairie & Woodruff Counties                                           │ Computer and Mathematical │
│     33.51955307262569832400 │ St. Lawrence County                                                                   │ Military Specific         │
│     16.48772053669391746400 │ Las Cruces, NM                                                                        │ Computer and Mathematical │
│      7.26077642020796139400 │ Jonesboro, AR                                                                         │ Computer and Mathematical │
│     11.76641183993670956000 │ Isabella, Gratiot & Clare Counties                                                    │ Computer and Mathematical │
│      7.44545324010031289900 │ Beaumont-Port Arthur, TX                                                              │ Computer and Mathematical │
│     15.91293232866185552100 │ Highland, Clinton & Adams Counties                                                    │ Computer and Mathematical │
│      5.12390292877144782900 │ Des Moines, Lee, Henry & Louisa Counties                                              │ Computer and Mathematical │
│      3.50914150154451035500 │ Jacksonville, NC                                                                      │ Military Specific         │
│      3.06273062730627306300 │ Lewiston, ID-WA                                                                       │ Military Specific         │
│                      (NULL) │ San German, PR                                                                        │ Military Specific         │
│      0.63090987525921571400 │ Greensboro-High Point, NC                                                             │ Military Specific         │
│      0.01181614084839891300 │ Green Bay, WI                                                                         │ Military Specific         │
│      2.98367951498535026000 │ Springfield, MO                                                                       │ Military Specific         │
│     12.10965522702029689500 │ Moore & Montgomery Counties                                                           │ Military Specific         │

We now need a similar query to get the weighted graduate growth, the other half of our equation.

The Weighted Graduate Growth

We still need the other half of our equation: the weighted graduate growth.

First, we'll get the total number of gradutes in each region for the supplied job types. This query is slightly more complicated than before because we have to use a crosswalk.

We need to use a crosswalk table to get the correct degrees, and then query all of our graduate data using that subquery. It looks like this:

\copy (
  WITH cip_codes AS (
    SELECT DISTINCT cip_4, job_category_label
    FROM c2c.crosswalk
    WHERE job_category_label IN ('Computer and Mathematical', 'Management')
  )
  SELECT
  SUM(ctotalt_2019) as graduates,
  region
  FROM c2c.field
  JOIN cip_codes ON cip_codes.cip_4 = field.cipcode_rf
  JOIN c2c.institution ON institution.unitid = field.unitid
  GROUP BY region
) TO '~/Desktop/total_graduates.csv' With CSV DELIMITER ',' HEADER;

Next, we can use the same crosswalk to get the weighted graduate value of graduates in each region, for each job type.

\copy (
  WITH cip_codes AS (
   SELECT DISTINCT cip_4, job_category_label
   FROM c2c.crosswalk
   WHERE job_category_label IN ('Computer and Mathematical', 'Management')
  )
  SELECT
    (ctotalt_2019 * tot_five_grw_rt) AS weighted_value,
    region,
    cip_codes.job_category_label
  FROM c2c.field
  JOIN cip_codes ON cip_codes.cip_4 = field.cipcode_rf
  JOIN c2c.institution ON institution.unitid = field.unitid
  WHERE institution.region =  'Mount Vernon-Anacortes, WA'
  GROUP BY tot_five_grw_rt, region, ctotalt_2019, cip_codes.job_category_label
) TO '~/Desktop/weighted_grad_growth.csv' With CSV DELIMITER ',' HEADER;

Finally, we want to aggregate all of these values into a single percentage per region and job type. To do this, according to the spreadsheets that Michael supplied, we should add up all the weighted values, grouping by region and job category label, and divide them by the total that we calculated earlier.

\copy (
  WITH cip_codes AS (
    SELECT DISTINCT cip_4, job_category_label
    FROM c2c.crosswalk
    WHERE job_category_label IN ('Computer and Mathematical', 'Management')
  ),
  total_grads AS (
    SELECT SUM(ctotalt_2019) as grads, region
    FROM c2c.field
    JOIN cip_codes ON cip_codes.cip_4 = field.cipcode_rf
    JOIN c2c.institution ON institution.unitid = field.unitid
    GROUP BY institution.region
  ),
  grads_with_weight AS (
    SELECT
      (ctotalt_2019 * tot_five_grw_rt) AS weighted_value,
      region,
      cip_codes.job_category_label
    FROM c2c.field
    JOIN cip_codes ON cip_codes.cip_4 = field.cipcode_rf
    JOIN c2c.institution ON institution.unitid = field.unitid
    GROUP BY tot_five_grw_rt, region, ctotalt_2019, cip_codes.job_category_label
  )
  SELECT
    (SUM(weighted_value) / SUM(total_grads.grads)) * 100 AS weighted_grad_growth_rate,
    grads_with_weight.region,
    grads_with_weight.job_category_label
  FROM grads_with_weight
  JOIN total_grads
  ON total_grads.region = grads_with_weight.region
  GROUP BY grads_with_weight.region, grads_with_weight.job_category_label
) TO '~/Desktop/weighted_grad_growth.csv' WITH CSV DELIMITER ',' HEADER;

Joining These Results

With the previous two queries, we've compiled rows of data that contain:

  • The weighted job growth rate, per region, per job type
  • The weighted graduate growth rate, per region, per job type

We can now aggregate all of this data together to produce the various numbers we need for our visualization.

\copy (
 WITH total_jobs AS (
   SELECT SUM(employment_2019) as jobs, region
   FROM labor.data
   WHERE job_category_label IN ('Computer and Mathematical', 'Management')
   GROUP BY region
   ),
 jobs_with_weight AS (
   SELECT
   (employment_2019 * employment_five_year_growth_rate) AS weighted_value,
   SUM(employment_2019) AS total_2019,
   employment_five_year_growth_rate,
   region,
   job_category_label
   FROM labor.data
  WHERE job_category_label IN ('Computer and Mathematical', 'Management')
   GROUP BY employment_five_year_growth_rate, employment_2019, region, job_category_label
   ),
 job_rate AS (
   SELECT (SUM(weighted_value) / SUM(total_jobs.jobs)) * 100 AS weighted_job_growth_rate,
   jobs_with_weight.region,
   jobs_with_weight.job_category_label
   FROM total_jobs
   JOIN jobs_with_weight
   ON total_jobs.region = jobs_with_weight.region
   GROUP BY jobs_with_weight.region, jobs_with_weight.job_category_label
   ),
 cip_codes AS (
     SELECT DISTINCT cip_4, job_category_label
     FROM c2c.crosswalk
     WHERE job_category_label IN ('Computer and Mathematical', 'Management')
     ),
 total_grads AS (
     SELECT SUM(ctotalt_2019) as grads, region
     FROM c2c.field
     JOIN cip_codes ON cip_codes.cip_4 = field.cipcode_rf
     JOIN c2c.institution ON institution.unitid = field.unitid
     GROUP BY institution.region
     ),
 grads_with_weight AS (
     SELECT
     (ctotalt_2019 * tot_five_grw_rt) AS weighted_value,
     region,
     cip_codes.job_category_label
     FROM c2c.field
     JOIN cip_codes ON cip_codes.cip_4 = field.cipcode_rf
     JOIN c2c.institution ON institution.unitid = field.unitid
     GROUP BY tot_five_grw_rt, region, ctotalt_2019, cip_codes.job_category_label
     ),
 grad_rate AS (
     SELECT
     (SUM(weighted_value) / SUM(total_grads.grads)) * 100 AS weighted_grad_growth_rate,
     grads_with_weight.region,
     grads_with_weight.job_category_label
     FROM grads_with_weight
     JOIN total_grads
     ON total_grads.region = grads_with_weight.region
     GROUP BY grads_with_weight.region, grads_with_weight.job_category_label
     )

   SELECT
   (grad_rate.weighted_grad_growth_rate - job_rate.weighted_job_growth_rate) AS spread,
   job_rate.*,
   grad_rate.weighted_grad_growth_rate
   FROM job_rate
   FULL JOIN grad_rate
   ON grad_rate.region = job_rate.region
   AND job_rate.job_category_label = grad_rate.job_category_label;
 ) TO '~/Desktop/weighted_final.csv' WITH CSV DELIMITER ',' HEADER;
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment