Using EdSurvey to Analyse PIAAC Data

Bailey, Paul; Lee, Michael; Nguyen, Trang; Zhang, Ting

doi:10.1007/978-3-030-47515-4_9


Part of the book series: Methodology of Educational Measurement and Assessment (MEMA)


Abstract

This chapter describes the R package EdSurvey and its use in analysing PIAAC data. The package allows users to download public-use PIAAC data, explore the codebooks and the data, read in and edit relevant variables, and run analyses such as regression, logistic regression, and gap analysis.

This publication was prepared for NCES under Contract No. ED-IES-12-D-0002 with the American Institutes for Research. Mention of trade names, commercial products, or organisations does not imply endorsement by the US government.


9.1 Introduction

The EdSurvey package is a collection of functions for the R programming language (R Core Team 2019) that helps users work with data from the National Center for Education Statistics (NCES) and international large-scale assessments. Developed by the American Institutes for Research and commissioned by NCES, the package manages the entire process of analysing Programme for the International Assessment of Adult Competencies (PIAAC) data: downloading the data, searching the codebook and other metadata, conducting exploratory data analysis, cleaning and manipulating the data, extracting variables of interest, and finally analysing the data. This chapter describes the use of EdSurvey for each activity, with a focus on PIAAC data.^{Footnote 1}, ^{Footnote 2}

Because of the scope and complexity of data from large-scale assessment programmes, such as PIAAC, the analysis of their data requires proper statistical methods—namely, the use of weights and plausible values. The EdSurvey package gives users intuitive one-line functions to perform analyses that account for these methods.

Given the size of large-scale data and the constraint of limited computer memory, the EdSurvey package is designed to minimise memory usage. Users with computers that have insufficient memory to read in entire datasets—the OECD Cycle 1 data are over a gigabyte once read in to R—can still perform analyses without having to write special code to limit the dataset. This is all addressed directly in the EdSurvey package—behind the scenes and without any additional intervention by the user—allowing researchers to more efficiently explore and analyse variables of interest.

EdSurvey analysis functions operate on a saved data connection, the edsurvey.data.frame, rather than on a dataset held in memory, and the results of these analyses can be stored or further manipulated. Alternatively, the getData function reads in selected variables of interest to generate an R data.frame. Individuals familiar with R programming might prefer to clean and explore their data using supplementary packages, which EdSurvey supports. These data.frames can then be used with all EdSurvey analytical functions.

The next section shows how to load EdSurvey and download and read in PIAAC data. The third section describes how you can see survey attributes in EdSurvey. The fourth deals with exploring PIAAC data. The fifth section describes data manipulation. The sixth section describes data analysis. The final section explains how to stay current with new developments in EdSurvey.

9.2 Getting Started

R is open-source software and can be downloaded free of charge from www.r-project.org/ (R Core Team 2019). The Comprehensive R Archive Network (CRAN) stores extensions to the base R functionality and can be used to install EdSurvey with the command install.packages(’EdSurvey’).

Once the EdSurvey package has been installed from CRAN, it must be loaded in every session with the command library(EdSurvey).

The user can then download the OECD 2012 files with the downloadPIAAC function, passing the target folder as its argument.
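The three steps above can be sketched as follows. This is a minimal sketch rather than the chapter's original listing; it assumes the target folder ’~/’ used throughout this chapter and an internet connection, and the download is large.

```r
# Install once from CRAN (needs an internet connection)
install.packages("EdSurvey")

# Load the package; this is required in every new R session
library(EdSurvey)

# Download the Cycle 1 public-use files; they are placed under
# "~/PIAAC/Cycle 1" (root is the target folder assumed in this chapter)
downloadPIAAC(root = "~/")
```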

When downloadPIAAC is run, the data are stored in a folder in the directory that the user specifies, here an operating system-defined folder called ’~/’. On all machines this is the user’s home folder. After the download is complete, users can manually change the folder structure. This chapter will assume that the download call used the folder ’~/’ and that the data were not subsequently moved from that folder. Within the target folder that the user specified (here ’~/’), the data will be stored in a subfolder named ‘PIAAC’. All data for participating countries in Cycle 1 will be stored in the subdirectory ‘PIAAC/Cycle 1’. At the time of writing, only Cycle 1 is available for download.

One can also manually download the desired PIAAC data from the Organisation for Economic Co-operation and Development (OECD) webpage^{Footnote 3}, including the 2012/2014 data, or acquire a data licence and access the restricted-use data files. When downloading manually, note that the PIAAC read-in function, readPIAAC, requires both the .csv files with the data and a codebook spreadsheet (.xlsx file) to be in the same folder.

The next step in running analysis is reading in the data. For PIAAC data, this is accomplished with the readPIAAC function, which creates an edsurvey.data.frame that stores information about the specific data files processed. This includes the location on disk, the file format and layout of those files, and the metadata that will allow EdSurvey to analyse the data. A PIAAC edsurvey.data.frame includes information for all variables at the individual level and any household-level variables.

Upon the first read-in, the EdSurvey package caches the data as a flat text file; in all future sessions, this flat file supplies the variables needed for any analysis. The PIAAC Cycle 1 data can be read in by pointing to the path of the PIAAC Cycle 1 data folder and specifying the country of interest. By setting countries = c(’ITA’) in a call to readPIAAC, an edsurvey.data.frame containing Cycle 1 data for Italy is created as the object ita:

The function uses the three-letter International Organization for Standardization (ISO) country code to select the countries to import (here, ‘ITA’). Section 9.6.3 describes how to read in and analyse data from multiple countries at once. For now, other countries can be read in and analysed separately by repeating the above command with the code of another country, such as the Netherlands:
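The two read-in calls described above can be sketched as follows, assuming the data were downloaded to ’~/PIAAC/Cycle 1’ as described earlier. The object name nld and the helper variable piaacPath are illustrative choices.

```r
library(EdSurvey)

# Folder created by downloadPIAAC(root = "~/"); adjust if the data were moved
piaacPath <- "~/PIAAC/Cycle 1"

# Italy: the first call builds the cache, later calls reuse it
ita <- readPIAAC(path = piaacPath, countries = c("ITA"))

# The Netherlands, read in the same way with its own country code
nld <- readPIAAC(path = piaacPath, countries = c("NLD"))
```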

9.3 Survey Design Attributes

When analysing data with EdSurvey, the package automatically accounts for the plausible values of scores as well as the sample survey design when conducting data analyses by storing metadata in the edsurvey.data.frame. There are four important survey design attributes that have a great influence on the output of later analysis: plausible values, weights, omitted levels, and achievement levels. This section describes these metadata elements and how users can display them.

PIAAC Cycle 1 data have ten plausible values for each domain (numeracy, literacy, and problem solving), as shown in the output of the showPlausibleValues function. The function lists the skill domains that the survey contains and, for each domain, the name by which its set of plausible values is referred to in EdSurvey analytical functions.

For example, the ten variables named pvlit1 to pvlit10 store an individual set of plausible values for the literacy scale score domain. These ten variables can simply be referred to by the name lit, and EdSurvey functions will correctly account for the plausible values in both estimation and variance estimation.

The PIAAC sample is a probability sample that was a single-stage sample in some countries but a multistage sample in other countries (Mohadjer et al. 2016). In addition, because of oversampling and nonresponse, the weights are informative. Users can print the available weights with the showWeights function.

As with other PIAAC Cycle 1 countries, only one full sample weight (spfwt0) is available for the Italian data, and the showWeights function displays it along with the 80 replicate weights associated with it. Because it is the default and only full sample weight, it is not necessary to specify the weight in EdSurvey analytical functions; spfwt0 will be used by default. In addition, the jackknife replicates associated with spfwt0 will be used by the variance estimation procedures without the user having to specify anything further.

By default, EdSurvey will show results from the analyses after listwise deletion of respondents with any special values, which are referred to as ‘omitted levels’ in EdSurvey. For any data, the omitted levels can be seen with the omittedLevels command.

Users wishing to include these levels in their analysis can do so, usually, by recoding them or setting omittedLevels=FALSE. More information is available in the help documentation for each respective function.

To see all this information at once, the user can simply ’show’ the data by typing the name of the edsurvey.data.frame object (i.e. ita) in the console
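The displays described in this section can be sketched as follows. The call to getAttributes is our assumption about how the omitted levels are retrieved; the other calls are the functions named in the text.

```r
library(EdSurvey)
ita <- readPIAAC(path = "~/PIAAC/Cycle 1", countries = "ITA")

# Plausible value sets and the short names (e.g. lit) used to refer to them
showPlausibleValues(ita)

# Full sample weight (spfwt0) and its jackknife replicate weights
showWeights(ita)

# Special values that are listwise deleted by default
# (assumption: retrieved through the generic attribute accessor)
omitted <- getAttributes(ita, "omittedLevels")
omitted

# Printing the object shows all of the above at once
ita
```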

9.4 Exploring PIAAC Data

Once the desired data have been read in, EdSurvey provides data exploration functions that users can use in combination with PIAAC codebooks and technical documents in preparation for analysis.

Many of the basic functions that work on a data.frame, such as dim, nrow, ncol, and $, also work on an edsurvey.data.frame and can be used for exploration. Editing data in an edsurvey.data.frame differs from editing a data.frame and is covered in Sect. 9.5.

To view the codebook, the user can use the showCodebook function. The output will be long, given the number of columns in the PIAAC data; use the function View to display it in spreadsheet format

Even with spreadsheet formatting, the codebook can be somewhat daunting to browse. The searchSDF function allows the user to search the codebook variable names and labels

Notice that the search is not case sensitive and uses regular expressions. The search can be refined by adding additional terms in a vector, using the c function; this refines the search to just those rows where all the strings named are present. This search refines the previous results to a single variable

Sometimes knowing the variable name and label is insufficient, and knowing the levels helps. Users can show these levels by setting the levels argument to TRUE
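A sketch of the codebook exploration described above follows. The search terms are assumptions chosen for illustration; in particular, the second term may need to be replaced by whichever word narrows the results to the variable of interest.

```r
library(EdSurvey)
ita <- readPIAAC(path = "~/PIAAC/Cycle 1", countries = "ITA")

# Full codebook; in an interactive session, View(codebook) opens it
# in spreadsheet format
codebook <- showCodebook(ita)

# Search variable names and labels (case insensitive, regular expressions)
hits <- searchSDF("income", data = ita)
hits

# All strings in the vector must match (second term is an assumption)
searchSDF(c("income", "annual"), data = ita)

# Also print the levels of the matching variables
searchSDF(c("income", "annual"), data = ita, levels = TRUE)
```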

To get an initial insight into a variable’s response frequencies, population estimated response frequencies, and response percentages, use the summary2 function. The function prints out weighted summary statistics using the default weight variable, which is picked up automatically by the readPIAAC function. The summary statistics for the variable ’d_q18a_t’ are shown in Table 9.1.

Table 9.1 Results from summary2(ita, ’d_q18a_t’)

Note that EdSurvey will show variables that OECD includes in the data, some of which will be entirely missing; summary2 will show this. An example of this is the d_q18a_t variable in Canada.

Similarly, summary2 can show summary statistics for continuous variables. The following example code shows the summary statistics for the set of plausible values for the literacy domain (’lit’), as shown in Table 9.2.

Table 9.2 Results from summary2(ita, ’lit’)

Another powerful exploratory function in the package is edsurveyTable. This function allows users to run weighted cross-tab analyses for any number of categorical variables, with or without an outcome (or continuous) variable.

The following example shows how to create a table of the literacy outcome by age group in 10-year intervals (ageg10lfs).

Similar to summary2, the edsurveyTable function returns the weighted percentage (PCT) and conditional means (MEAN) of a selected outcome variable—in this case the literacy score.

The results also can be broken down by multiple variables by using a plus (+) between variables. For example, we add c_d05, the current employment status, in the equation.
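The two tables discussed above can be sketched as follows, with the literacy plausible values (lit) as the outcome on the left-hand side of the formula. The object names are illustrative.

```r
library(EdSurvey)
ita <- readPIAAC(path = "~/PIAAC/Cycle 1", countries = "ITA")

# Weighted percentages and mean literacy score by 10-year age group
es1 <- edsurveyTable(lit ~ ageg10lfs, data = ita)
es1

# Add current employment status with a plus sign to cross the two variables
es2 <- edsurveyTable(lit ~ ageg10lfs + c_d05, data = ita)
es2
```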

Finally, the correlation function can help users explore associations between variables. The function cor.sdf allows for Pearson (for bivariate normal variables), Spearman (for two continuous variables), polyserial (for one continuous and one discrete variable), and polychoric (for two discrete variables) correlations.^{Footnote 4}
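A polyserial correlation between the literacy score and the income quintile variable might be requested as sketched below. The capitalised method name reflects our understanding of the accepted values and should be checked against ?cor.sdf.

```r
library(EdSurvey)
ita <- readPIAAC(path = "~/PIAAC/Cycle 1", countries = "ITA")

# One continuous variable (lit) and one ordered discrete variable (d_q18a_t);
# the default weight spfwt0 is applied automatically
cor1 <- cor.sdf("lit", "d_q18a_t", data = ita, method = "Polyserial")
cor1
```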

These results show a polyserial correlation between literacy and income quintile as .20 (after rounding), with weight spfwt0 applied by default. Because a correlation analysis assumes that the discrete outcome is ordered, the levels of the discrete variable d_q18a_t are shown to allow users to check that it moves in one direction; here, increasing from 1 to 6.

9.5 Accessing and Manipulating PIAAC Data

Typically, before performing an analysis, users edit data consistent with their research goals. This can happen in one of two ways in the EdSurvey package:

1. Clean and analyse data within the EdSurvey package functions.
2. Use getData to extract a data.frame to clean and edit with any R tool, and then use rebindAttributes to use EdSurvey functions to analyse the data.

This section describes these two ways of preparing data for an analysis for use in the EdSurvey package (see fig. 9.1 for an overview).

9.5.1 Cleaning Data in EdSurvey

EdSurvey provides three data manipulation functions: subset, recode, and rename.

The subset function limits the rows that are used in an analysis to those that meet a condition. For example, to return the summary statistics for the literacy variable, restricting the population of interest to Italian males, one could use subset. Note that the level label (e.g. the ‘MALE’ in the following code) needs to be consistent with the label that is in the data, which can be revealed through a call such as table(ita$gender_r).

The recode function allows us to change the labels of a discrete variable or to condense its levels. For example, the user may want to generate conditional means of the employment status variable (c_d05), wherein those individuals who are (a) ‘UNEMPLOYED’ or (b) ‘OUT OF THE LABOUR FORCE’ are condensed into one level to compare to the subgroup of employed individuals. This leaves a level (‘NOT KNOWN’) that is then removed with subset.

Finally, rename allows the user to adjust a variable’s name.
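The three manipulations can be sketched as follows. In the package the recode and rename functions are called recode.sdf and rename.sdf; the new label ’not employed’ and the new variable name emp_status are hypothetical choices made for this sketch.

```r
library(EdSurvey)
ita <- readPIAAC(path = "~/PIAAC/Cycle 1", countries = "ITA")

# subset: keep Italian males only, then summarise literacy
itaM <- subset(ita, gender_r %in% "MALE")
summary2(itaM, "lit")

# recode: collapse two employment levels into one (label is hypothetical)
ita2 <- recode.sdf(ita, recode = list(
  c_d05 = list(from = c("UNEMPLOYED", "OUT OF THE LABOUR FORCE"),
               to = "not employed")))

# drop the remaining 'NOT KNOWN' level
ita2 <- subset(ita2, !c_d05 %in% "NOT KNOWN")

# rename: give the variable a more descriptive (hypothetical) name
ita2 <- rename.sdf(ita2, oldnames = "c_d05", newnames = "emp_status")
```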

9.5.2 Using getData

Users may want to perform extensive recoding of variables but have preferred methods of recoding using specific R packages. The getData function allows users to select variables to read into memory, extract, and then edit freely. The rebindAttributes function allows the final data.frame to be used with EdSurvey analysis functions.

In this example, getData extracts the following:

two variables: gender_r and c_d05
ten plausible values associated with lit
the weight for this data frame: spfwt0

Some important things to note:

1. addAttributes is set to the default value of FALSE. Setting addAttributes = TRUE is one way in which the resultant data object (itaRaw) can be passed to other EdSurvey package functions.
2. All the jackknife replicate weights are returned automatically (spfwt1 to spfwt80).
3. omittedLevels is set to TRUE, the default, so that special values (such as multiple entries or NAs) are removed by getData. This setting removes values that are not typically included in regression analysis and cross-tabulation. Alternatively, omittedLevels can be set to FALSE so that the user can handle these values.
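The getData call described by the list and notes above can be sketched as follows. The argument names follow the EdSurvey version described in this chapter; later releases may name the omitted-levels argument differently, so check ?getData.

```r
library(EdSurvey)
ita <- readPIAAC(path = "~/PIAAC/Cycle 1", countries = "ITA")

# Extract two variables, the literacy plausible values, and the weight;
# the replicate weights spfwt1 to spfwt80 come along automatically
itaRaw <- getData(data = ita,
                  varnames = c("lit", "gender_r", "c_d05", "spfwt0"),
                  omittedLevels = TRUE,   # drop special values (default)
                  addAttributes = FALSE)  # return a plain data.frame (default)

# Preview the first rows of the first 15 columns
head(itaRaw[, 1:15])
```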

The itaRaw data object is of class data.frame, which allows it to be manipulated with any supplementary R function. For instance, the head function shows a preview of the data, focusing on Columns 1 through 15, revealing the requested variables and the first few rows of the resulting data.

To replicate the data manipulation from Sect. 9.5.1, gsub, a base R function that uses pattern matching to replace values in a variable, recodes the values in the variable c_d05. The base function subset then removes the level ‘NOT KNOWN’.

The rebindAttributes function allows us to reassign survey attributes so that EdSurvey package functions are accessible. Simply call the manipulated data frame and the edsurvey.data.frame containing the requisite attributes

Now we can apply EdSurvey functions, for example,
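The recode, rebind, and analysis steps can be sketched as follows. The replacement label ’not employed’, the object names, and the choice of a regression as the example analysis are assumptions made for illustration.

```r
library(EdSurvey)
ita <- readPIAAC(path = "~/PIAAC/Cycle 1", countries = "ITA")
itaRaw <- getData(data = ita,
                  varnames = c("lit", "gender_r", "c_d05", "spfwt0"))

# Base R recode: collapse two levels into one (label is hypothetical)
itaRaw$c_d05 <- gsub("UNEMPLOYED|OUT OF THE LABOUR FORCE",
                     "not employed", itaRaw$c_d05)

# Base R subset: remove the 'NOT KNOWN' level
itaRaw <- subset(itaRaw, c_d05 != "NOT KNOWN")

# Reattach the survey attributes stored in ita
itaRebound <- rebindAttributes(itaRaw, ita)

# EdSurvey functions now work on the edited data
lm2 <- lm.sdf(lit ~ c_d05, data = itaRebound)
summary(lm2)
```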

9.6 Data Analysis

9.6.1 Regression

Regression is a well-known and frequently used tool that EdSurvey provides in the lm.sdf function. Regression equations are typically written as

$$\displaystyle \begin{aligned} y_{i} = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \epsilon_{i} \end{aligned} $$

(9.1)

where y_i is the outcome for individual i, α is an intercept, x_ki is the level of the kth explanatory (exogenous) variable for individual i, β_k is the kth regression coefficient, and ε_i is the regression residual for individual i.

As an example, the outcome is the literacy score (lit), which is described as a function of income quintile (d_q18a_t) and age (age_r). See results in Table 9.3.

Table 9.3 Results from summary(lm1)

In R, the formula for this regression equation is written as y ~ x1 + x2. Note that there is no need to generate dummy codes for discrete variables like d_q18a_t.

The typical output contains a header similar to that of edsurveyTable, which is not shown for brevity. To explore the unprinted attributes, print summary(lm1) in the console.
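The regression described above can be sketched as follows; the object name lm1 matches the one used in the table caption.

```r
library(EdSurvey)
ita <- readPIAAC(path = "~/PIAAC/Cycle 1", countries = "ITA")

# Literacy as a function of income quintile and age; the discrete variable
# d_q18a_t is expanded into dummy codes automatically
lm1 <- lm.sdf(lit ~ d_q18a_t + age_r, data = ita)

# Header block plus coefficient table
summary(lm1)
```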

EdSurvey calculates the regression coefficients by running one weighted regression per plausible value:

$$\displaystyle \begin{aligned} \hat{\beta_k} = \frac{1}{P} \sum_{p=1}^{P} \beta_{k}^{(p)} {} \end{aligned} $$

(9.2)

where there are P plausible values, each indexed with a p, and the superscript (p) indicates the pth plausible value was used.

Variance estimation is complicated because of the presence of the plausible values and because many countries used a multistage, geography-based, sampling technique to form the PIAAC sample. Because of the geographic proximity between respondents, there is a correlation between respondents’ scores within a sampled group, relative to two randomly selected individuals. The variance estimator EdSurvey uses accounts for both of these using the variance estimator

$$\displaystyle \begin{aligned} V = V_{I} + V_{S} \end{aligned} $$

(9.3)

where V is the total variance of an estimator, V_I is the imputation variance, which accounts for the plausible values, and V_S is the sampling variance, which accounts for the covariance between geographically clustered individuals. V_I is estimated according to Rubin’s rule (Rubin 1987)

$$\displaystyle \begin{aligned} V_I=\frac{P+1}{P(P-1)} \sum_{p=1}^P \left(\beta_k^{(p)} - \beta_k \right)^2 \end{aligned} $$

(9.4)

where β_k is averaged across the plausible values (Eq. 9.2). The sampling variance frequently uses the jackknife variance estimator and can be estimated with each plausible value as

$$\displaystyle \begin{aligned} V_S^{(p)} = \sum_{j=1}^J \left(\beta_{kj}^{(p)} - \beta_k^{(p)} \right)^2 \end{aligned} $$

(9.5)

where $\beta_{kj}^{(p)}$ is the estimate of the kth coefficient obtained with the jth set of replicate weights and the pth plausible value, and $\beta_k^{(p)}$ is the corresponding estimate obtained with the full sample weights. In EdSurvey, the jrrIMax argument sets the number of plausible values used; any number is valid, but lower numbers are faster.

$$\displaystyle \begin{aligned} V_S = \frac{1}{ \text{jrrIMax}} \sum_{p=1}^{\text{jrrIMax}} V_S^{(p)} \end{aligned} $$

(9.6)

As a convenience, EdSurvey sets values larger than the number of plausible values equal to the number of plausible values, so using jrrIMax=Inf uses all plausible values.
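How the pieces of the variance estimator fit together can be illustrated with a small base R sketch on simulated numbers. This is not EdSurvey's internal code: the estimates below are made up, and the sketch uses the usual squared-deviation forms of Rubin's rule and of the jackknife.

```r
set.seed(1)
P <- 10   # number of plausible values
J <- 80   # number of jackknife replicate weights

# Made-up full-sample estimates of one coefficient, one per plausible value
betaP <- rnorm(P, mean = 2, sd = 0.1)

# Made-up replicate estimates: a J x P matrix, column p belongs to betaP[p]
betaJP <- sapply(betaP, function(b) b + rnorm(J, sd = 0.02))

# Point estimate: average over plausible values
betaHat <- mean(betaP)

# Imputation variance (Rubin's rule)
VI <- (P + 1) / (P * (P - 1)) * sum((betaP - betaHat)^2)

# Jackknife sampling variance for each plausible value
VSp <- colSums(sweep(betaJP, 2, betaP)^2)

# Average over the first jrrIMax plausible values (capped at P)
jrrIMax <- 1
VS <- mean(VSp[seq_len(min(jrrIMax, P))])

# Total variance and standard error
V <- VI + VS
se <- sqrt(V)
```

Raising jrrIMax only changes how many columns of replicate estimates enter the average, which is why lower values are faster.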

The EdSurvey package also can use a Taylor series variance estimator—available by adding the argument varMethod=’Taylor’ (Binder 1983). More details regarding variance estimation can be found in the EdSurvey Statistics vignette.

Although most of the model details are returned in the regression output, a few additional elements are available to inform interpretation of the results. First, there is a header block that describes the weight used (spfwt0), the variance method (jackknife), the number of jackknife replicates (80), the full data n-size (4,621), and the n-size for this regression (2,271). The latter n-size reflects the extent of listwise deletion.

The coefficients block has many typically displayed statistics, including the degrees of freedom (dof) by coefficient. This is calculated using the Welch-Satterthwaite equation (Satterthwaite 1946). For the kth coefficient, in the notation of Wikipedia Contributors (2019), k_i = 1 and s_i = β_kj − β_k, the difference between the value estimated with the jth jackknife replicate weight and the value estimated with the full sample weights (β_k). Because this statistic varies by coefficient, so do the degrees of freedom. EdSurvey applies the Rust and Johnson modification to the Welch-Satterthwaite equation, which multiplies the Welch-Satterthwaite degrees of freedom by a factor of $3.16 - \frac{2.77}{J^{1/2}}$, where J is the number of jackknife replicates (Rust and Johnson 1992).
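The degrees-of-freedom calculation can be sketched in base R as follows. The function names wsDof and rjFactor are hypothetical, the replicate estimates are simulated, and each replicate is assumed to contribute one degree of freedom.

```r
# Welch-Satterthwaite degrees of freedom with k_i = 1 and one degree of
# freedom per replicate; d holds the replicate-minus-full-sample differences
wsDof <- function(repEst, fullEst) {
  d <- repEst - fullEst
  sum(d^2)^2 / sum(d^4)
}

# Rust and Johnson multiplier for J jackknife replicates
rjFactor <- function(J) 3.16 - 2.77 / sqrt(J)

# Simulated example with J = 80 replicates (made-up numbers)
set.seed(2)
J <- 80
repEst <- 2 + rnorm(J, sd = 0.02)
dof <- rjFactor(J) * wsDof(repEst, fullEst = 2)
dof
```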

9.6.2 Binomial Regression

When a regression’s dependent variable (outcome) is binary—consisting of 1s and 0s or true and false—the regression is a binomial regression. EdSurvey allows for two such regressions: logistic regression and probit regression. The corresponding functions for these methods are logit.sdf and probit.sdf. This section focuses on logit.sdf, but most components also apply to probit.sdf.

An example of a binomial regression is to look at the outcome of income percentile being in the mid-quintile or higher as described by mother’s education (j_q06b) and own age (age_r). The user may first wish to inspect j_q06b (results in Table 9.4).^{Footnote 5}

Table 9.4 Results from summary2(ita, ’j_q06b’)

When a regression is run, EdSurvey will exclude the values other than ‘ISCED 1, 2, AND 3C SHORT’, ‘ISCED 3 (EXCLUDING 3C SHORT) AND 4’, and ‘ISCED 5 AND 6’; the first of these levels will be the omitted group and treated as the reference.

For binomial regression, we recommend explicitly dichotomising the dependent variable in the logit.sdf call so that the desired level has the ‘high state’ associated with positive regressors—this is done with the I( ⋅ ) function. Here, the function makes the dependent variable a 1 when the condition is TRUE and a 0 when the condition is FALSE; the results are shown in Table 9.5.

Table 9.5 Results from summary(logit1)

This regression shows that the contrast between individuals whose mother’s highest education is ‘ISCED 3 (EXCLUDING 3C SHORT) AND 4’ and the reference group (‘ISCED 1, 2, AND 3C SHORT’), at 0.62, is larger than the contrast between ‘ISCED 5 AND 6’ and the reference group, at 0.07; the former coefficient is statistically significant and the latter is not. Some researchers prefer odds ratios when interpreting regression results. The oddsRatio function can show these, along with their confidence intervals. The results are shown in Table 9.6.

Table 9.6 Results from oddsRatio(logit1)

The oddsRatio function works only for results from the logit.sdf function—not probit.sdf results—because only logistic regression has invariant odds ratios.

Although the t-test statistic in logistic regression output is a good test for an individual regressor (such as age_r), a Wald test is needed to conduct joint hypothesis testing. Typically, it is possible to use the Akaike information criterion (AIC) (Akaike 1974) or a likelihood-ratio test. However, the likelihood shown in the results is actually a pseudo-likelihood, or a population estimate likelihood for the model. Because the entire population was not sampled, deviance-based tests—such as those shown in McCullagh and Nelder (1989)—cannot be used. Although it would be possible to use Lumley and Scott (2015) to form an AIC comparison, that does not account for plausible values.^{Footnote 6}

For example, it would be reasonable to ask if the j_q06b variable is jointly significant. To test this, we can use a Wald test.

This is a test of both coefficients in j_q06b being zero. Two test results are shown: the chi-square test and the F-test. In the case of a well-known sample design, it probably makes more sense to use the F-test (Korn and Graubard 1990).
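The calls described in this section can be sketched as follows. The three income level labels are assumptions about how the quintiles are labelled in the data and should be checked with searchSDF(..., levels = TRUE) before use; the coefficient positions 2 and 3 assume the formula order shown.

```r
library(EdSurvey)
ita <- readPIAAC(path = "~/PIAAC/Cycle 1", countries = "ITA")

# Outcome is 1 when income is in the middle quintile or higher, 0 otherwise
# (level labels below are assumed, not taken from the text)
logit1 <- logit.sdf(
  I(d_q18a_t %in% c("MID-LEVEL QUINTILE",
                    "NEXT TO HIGHEST QUINTILE",
                    "HIGHEST QUINTILE")) ~ j_q06b + age_r,
  data = ita)
summary(logit1)

# Odds ratios with confidence intervals (logit models only)
oddsRatio(logit1)

# Joint test that both j_q06b coefficients are zero; positions 2 and 3
# follow the intercept in the coefficient vector
waldTest(model = logit1, coefficients = 2:3)
```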

9.6.3 Gap Analysis

A gap analysis compares the levels of two groups and tests if they are different. The gap function supports testing gaps in mean scores, survey responses, score percentiles, and achievement levels. In this section, we discuss gaps in mean scores.

The simplest gap is within a single survey on a score and requires a selection of two groups. In the following example, we compare literacy scores of the self-employed and those who are employees
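A sketch of this gap analysis follows. The variable name d_q04 is our assumption for the employee/self-employed item; confirm it with searchSDF before use. The level labels are those named in the discussion of the output.

```r
library(EdSurvey)
ita <- readPIAAC(path = "~/PIAAC/Cycle 1", countries = "ITA")

# Mean literacy score gap: self-employed (group A) minus employees (group B)
# d_q04 is an assumed variable name
gap1 <- gap(variable = "lit", data = ita,
            groupA = d_q04 %in% "SELF-EMPLOYED",
            groupB = d_q04 %in% "EMPLOYEE")
gap1
```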

The gap output contains three blocks: labels, percentage, and results.

In the first block, ‘labels’, the definition of the groups A and B is shown, along with a reminder of the full data n count (nFullData) and the n count of the number of individuals who are in the two subgroups with valid scores (nUsed).

The second block, ‘percentage’, shows the percentage of individuals who fall into each category, with omitted levels removed. In the preceding example, the estimated percentage of Italians who are self-employed (in Group A) is shown in the pctA column, and the percentage of employees (in Group B) is shown in the pctB column. In this case, the only nonomitted levels are ‘SELF-EMPLOYED’ and ‘EMPLOYEE’, so they add up to 100%. The other columns listed in the ‘percentage’ block regard uncertainty in those percentages and tests determining whether the two percentages are equal.

The third block, ‘results’, shows the estimated average literacy score for Italians who are self-employed (Group A) in column estimateA and the estimated average literacy score of Italians who are employees (Group B) in column estimateB. The diffAB column shows that the estimated difference between these two statistics is 3.04 literacy scale score points, whereas the diffABse column shows that the estimate has a standard error of 2.59 scale score points. A t-test for the difference being zero has a p-value of 0.24, shown in column diffABpValue.

Some software does not calculate a covariance between groups when the groups consist of distinct individuals. When survey collection was administered in such a way that respondents have more in common than randomly selected individuals—as in the Italian PIAAC sample—this is not consistent with the survey design. When there is no covariance between two units in the same variance estimation strata—as in the case of countries that use one-stage sampling—there is little harm in estimating the covariance, because it will be close to zero.

The gap output information listed is not exhaustive; similar to other EdSurvey functions, the user can see the list of output variables using the ? function and typing the function of interest.

The ‘Value’ section describes all columns contained in gap outputs.

Another type of gap compares results across samples. For example, the male/female gap in literacy scores can be compared between Italy and the Netherlands by forming an edsurvey.data.frame.list and running gap with that combined data.
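The cross-country comparison can be sketched as follows, with males as group A and females as group B to match the discussion of the output. The level labels are assumed to match those in the data.

```r
library(EdSurvey)
ita <- readPIAAC(path = "~/PIAAC/Cycle 1", countries = "ITA")
nld <- readPIAAC(path = "~/PIAAC/Cycle 1", countries = "NLD")

# Combine the two countries; the first element is the reference sample
itaNld <- edsurvey.data.frame.list(list(ita, nld))

# Male/female literacy gap within each country and across countries
gap2 <- gap(variable = "lit", data = itaNld,
            groupA = gender_r %in% "MALE",
            groupB = gender_r %in% "FEMALE")
gap2
```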

This output contains the same three blocks and columns as in the previous gap analysis. Several additional columns have been added, focusing on the contrasts between Italy and the Netherlands. The results block columns labelled with an AA, such as diffAA, compare Italian males to Dutch males. The columns labelled with a BB, such as diffBB, compare Italian females to Dutch females. Here the diffAA column has a value of − 36.7, indicating that Italian males have an average scale score 36.7 points less than Dutch males. The column diffAAse has a value of 1.83, indicating that the standard error of that difference is 1.83. The two samples were collected separately, so there is no covariance in these estimates, and the covAA column is zero.

It also is possible to compare the male/female gap in literacy scores within and across countries. Looking at the diffAB column, the gap is − 0.25 in Italy and 6.13 in the Netherlands, indicating that females outscore males in Italy, but males outscore females in the Netherlands. The diffABAB column shows that the difference in the gaps is − 6.39, with a standard error (taken from diffABABse) of 2.32, and an associated p-value of 0.007, taken from diffABABpValue.

9.6.4 Percentile Analysis

Discussions presented so far have focused on the mean and other measures of centrality. This section describes the percentile function, which calculates statistics regarding the distribution of continuous variables, namely, the percentiles of a numeric variable in the range 0 to 100 for a survey dataset. For example, to compare the literacy scale score (‘lit’) at the 10th, 25th, 50th, 75th, and 90th percentile, include these as integers in the percentiles argument; the results are shown in Table 9.7.

Table 9.7 Results from percentile(variable = ’lit’, percentiles = c(10, 25, 50, 75, 90), data = ita)

If researchers are interested in a comparison of percentile distributions between males and females, the subset function can be used together with the percentile function. Alternatively, EdSurvey’s gap function, covered in Sect. 9.6.3, can calculate distributions in percentiles. The results of the percentile by gender are shown in Table 9.8.
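The percentile calls can be sketched as follows, using subset for the breakdown by gender. The object names are illustrative.

```r
library(EdSurvey)
ita <- readPIAAC(path = "~/PIAAC/Cycle 1", countries = "ITA")

pcts <- c(10, 25, 50, 75, 90)

# Literacy percentiles for the full Italian sample
pctAll <- percentile(variable = "lit", percentiles = pcts, data = ita)
pctAll

# The same percentiles separately for males and females
pctMale <- percentile(variable = "lit", percentiles = pcts,
                      data = subset(ita, gender_r %in% "MALE"))
pctFemale <- percentile(variable = "lit", percentiles = pcts,
                        data = subset(ita, gender_r %in% "FEMALE"))
pctMale
pctFemale
```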

Table 9.8 Results from percentile by gender_r

9.6.5 Proficiency Level Analysis

Scale score averages and distributions have the advantage of being numeric expressions of respondent ability; they also have the disadvantage of being essentially impossible to interpret or compare to an external benchmark. Proficiency levels, developed by experts to compare scores with performance criteria, provide an external benchmark against which scale scores can be compared (PIAAC Numeracy Expert Group 2009).

In EdSurvey, users can see the proficiency level cutpoints with the showCutPoints function.

The achievementLevels function applies appropriate weights and the variance estimation method for each edsurvey.data.frame, with several arguments for customising the aggregation and output of the analysis results.^{Footnote 7} Namely, by using these optional arguments, users can

choose to generate the percentage of individuals performing at each proficiency level (discrete) or at or above each proficiency level (cumulative),
calculate the percentage distribution of individuals by proficiency level (discrete or cumulative) and selected characteristics (specified in aggregateBy), and
compute the percentage distribution of individuals by selected characteristics within a specific proficiency level.

The achievementLevels function also can produce statistics by both discrete and cumulative proficiency levels. By default, the achievementLevels function produces the results only for discrete proficiency levels. Setting the returnCumulative argument to TRUE generates results by both discrete and cumulative proficiency levels.

The achievementLevels function can calculate the overall cumulative proficiency level analysis of literacy. These results are shown in Table 9.9, where the term ‘Proficiency Level’ has been replaced by ‘PL’ for brevity.
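The cutpoint display and the cumulative analysis by gender can be sketched as follows; the achievementLevels arguments are those given in the caption of Table 9.9, and the object name al1 is illustrative.

```r
library(EdSurvey)
ita <- readPIAAC(path = "~/PIAAC/Cycle 1", countries = "ITA")

# Cutpoints that define the proficiency levels for each domain
showCutPoints(ita)

# Percentage at or above each literacy proficiency level, within gender
al1 <- achievementLevels(c("lit", "gender_r"), data = ita,
                         aggregateBy = "gender_r",
                         returnDiscrete = FALSE,
                         returnCumulative = TRUE)
al1
```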

Table 9.9 Results from achievementLevels(c(’lit’, ’gender_r’), data = ita, aggregateBy = ’gender_r’, returnDiscrete = FALSE, returnCumulative = TRUE)

This call requests that the Italian literacy proficiency levels be broken down by the gender_r variable: the aggregateBy argument is set to ‘gender_r’, and therefore the Percent column sums to 100 within each gender. The results show that 31% of Italian males are at or above Proficiency Level 3, whereas 28.8% of Italian females are at or above Proficiency Level 3. Note that proficiency levels are useful only if considered in the context of the descriptor, which is available from NCES at https://nces.ed.gov/surveys/piaac/litproficiencylevel.asp.

The advantage of cumulative proficiency levels is that increases are always unambiguously good. Conversely, discrete proficiency levels can change because individuals moved between levels, making their interpretation ambiguous, although increases in the highest and lowest proficiency levels are always unambiguously good (highest) or bad (lowest).
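
A small numeric sketch in base R makes this concrete. The two sets of percentages are hypothetical: between the two time points, five percentage points move from Level 1 up to Level 2 and ten move from Level 2 up to Level 3, so every movement is an improvement.

```r
# Cumulative ("at or above") percentages from discrete ones,
# dropping the lowest category, which is always 100.
at_or_above <- function(discrete) rev(cumsum(rev(discrete)))[-1]

# Hypothetical discrete percentages at two time points.
before <- c("Below Level 1" = 10, "Level 1" = 30, "Level 2" = 40, "Level 3" = 20)
after  <- c("Below Level 1" = 10, "Level 1" = 25, "Level 2" = 35, "Level 3" = 30)

change_discrete   <- after - before
change_cumulative <- at_or_above(after) - at_or_above(before)

print(change_discrete)    # Level 1 and Level 2 both fall despite the improvement
print(change_cumulative)  # no cumulative percentage falls
```

Read in isolation, the drop in the discrete Level 2 percentage could be mistaken for a decline, whereas the cumulative changes are all zero or positive.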

9.7 Expansion

The EdSurvey package continues to be developed, and new features are added in each subsequent release. To learn about current features, visit the EdSurvey webpage to see the latest version and most recent documentation.^{Footnote 8} The webpage also has many user guides and a complete explanation of the methodology involved in EdSurvey.

Notes

1.
EdSurvey 2.4 can also work with public and/or restricted use datasets from ECLS:K, ICCS, ICILS, NAEP, PIRLS, ePIRLS, PISA, TALIS, TIMSS, and TIMSS Advanced; more datasets are added with each release.
2.
EdSurvey uses a variety of other packages; for a complete list, see https://CRAN.R-project.org/package=EdSurvey.
3.
https://www.oecd.org/skills/piaac/data/
4.
For more details on the correlations and their computation, see vignette('wCorrFormulas', package = 'wCorr').
5.
In the tables the level ‘ISCED 3 (EXCLUDING 3C SHORT) AND 4’ is sometimes shortened to ‘ISCED 3 (EXCL 3C SHORT) AND 4’.
6.
The use of plausible values is allowed by logit.sdf and probit.sdf. An example of an outcome with plausible values would be an indicator of whether literacy scores are above a user-specified cutoff.
7.
The terms proficiency levels, benchmarks, and achievement levels are all operationalised in the same way: individuals above a cutpoint are regarded as having met that level of proficiency, benchmark, or achievement. EdSurvey calls all of these achievement levels in its function names, cutpoints, and documentation, but the difference is entirely semantic and so can be ignored.
8.
https://www.air.org/project/nces-data-r-project-edsurvey

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723.
Binder, D. A. (1983). On the variances of asymptotically normal estimators from complex surveys. International Statistical Review, 51(3), 279–292, doi: 10.2307/1402588.
Korn, E. L., & Graubard, B. I. (1990). Simultaneous testing of regression coefficients with complex survey data: Use of Bonferroni t statistics. The American Statistician, 44(4), 270–276, doi: 10.1080/00031305.1990.10475737.
Lumley, T., & Scott, A. (2015). AIC and BIC for modeling with complex survey data. Journal of Survey Statistics and Methodology, 3(1), 1–18, doi: 10.1093/jssam/smu021.
McCullagh, P., & Nelder, J. (1989). Generalized linear models (Chapman and Hall/CRC Monographs on statistics and applied probability series, 2nd ed.) Boca Raton: Chapman & Hall.
Mohadjer, L., Krenzke, T., Van de Kerckhove, W., & Li, L. (2016). Sampling design. In I. Kirsch, & W. Thorn (Eds.), Survey of adult skills technical report (2nd ed., chapter 14, pp. 14-1–14-36). Paris: OECD.
PIAAC Numeracy Expert Group. (2009). PIAAC numeracy: A conceptual framework. Technical Report 35. Paris: OECD Publishing.
R Core Team. (2019). R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. Hoboken: Wiley.
Rust, K., & Johnson, E. (1992). Sampling and weighting in the national assessment. Journal of Educational and Behavioral Statistics, 17(2), 111–129, doi: 10.3102/10769986017002111.
Satterthwaite, F. E. (1946). An approximate distribution of estimates of variance components. Biometrics Bulletin, 2(6), 110–114.
Wikipedia Contributors. (2019). Welch-Satterthwaite equation—Wikipedia, the free encyclopedia. [Online; Accessed 24 Feb 2019]

Author information

Authors and Affiliations

American Institutes for Research, Washington, DC, USA
Paul Bailey, Michael Lee & Ting Zhang
Tamr Inc., Cambridge, MA, USA
Trang Nguyen

Corresponding author

Correspondence to Paul Bailey.

Editor information

Editors and Affiliations

Survey Design and Methodology, GESIS – Leibniz Institute for the Social Sciences, Mannheim, Germany
Débora B. Maehler
Survey Design and Methodology, GESIS – Leibniz Institute for the Social Sciences, Mannheim, Germany
Beatrice Rammstedt

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

About this chapter

Cite this chapter

Bailey, P., Lee, M., Nguyen, T., Zhang, T. (2020). Using EdSurvey to Analyse PIAAC Data. In: Maehler, D., Rammstedt, B. (eds) Large-Scale Cognitive Assessment. Methodology of Educational Measurement and Assessment. Springer, Cham. https://doi.org/10.1007/978-3-030-47515-4_9

DOI: https://doi.org/10.1007/978-3-030-47515-4_9
Published: 28 July 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-47514-7
Online ISBN: 978-3-030-47515-4
eBook Packages: Education, Education (R0)
