I was at NIAS working on machine learning applications.
Mathijs van Dijk is a Professor of Finance. He wants to understand the functions of financial systems for society.
Institutional economists (e.g., Nobel Prize winner Douglass North) stress the importance of institutions for economic development.
Similar questions are addressed in anthropology (e.g., Service, 1975).
Similar questions in sociology (e.g., the functionalism of Parsons (1951), Lenski's Ecological-Evolutionary Theory (1974/2006); see also Sanderson (1990)).
What are the necessary preconditions for societal complexity / state formation?
Can we predict 'societal complexity' (or a proxy thereof)? How well does 'AI' do?
Relative importance of 'ecology' vs. 'culture' (a distinction which is potentially quite a misnomer...).
Macro-economists emphasize the importance of institutions. What is the evidence for this?
Technical variables (e.g., ID variable for the SCCS) were removed from the dataset.
Variables derived from other variables were also removed.
Variables that proved problematic in a preliminary screening were also removed (for example, due to a high percentage of missing cases, or nominal variables with nearly unique codes).
Analyses were run both with and without 7 variables (country_Codes, Area_Region, SubContinent_Region, Continent, Language_Continent, Old_New_Class, Region).
Two variables flagged in our preliminary analyses ('Mean_Size_of_Local_Communities' and 'Jurisdictional_Hierarchy_Beyond_Local_Community') --> excluded, as they overlap to a large degree with our dependent variable.
‘settlement complexity’ (Settlement_Patterns).
Treated as an ordinal variable with 8 categories in the key analyses; treated as continuous for some figures.
The eight categories (codebook order):
- Fully migratory or nomadic bands.
- Seminomadic communities whose members wander in bands for at least half of the year but occupy a fixed settlement at some season or seasons, e.g., recurrently occupied winter quarters.
- Semisedentary communities whose members shift from one to another fixed settlement at different seasons, or who occupy more or less permanently a single settlement from which a substantial proportion of the population departs seasonally to occupy shifting camps, e.g., during transhumance.
- Compact but impermanent settlements, i.e., villages whose location is shifted every few years.
- Neighborhoods of dispersed family homesteads.
- Separated hamlets where several such form a more or less permanent single community.
- Compact and relatively permanent settlements, i.e., nucleated villages or towns.
- Complex settlements consisting of a nucleated village or town with outlying homesteads or satellite hamlets.
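As an illustration only, a minimal sketch of this coding in R (the data frame name `sccs` and column names are assumptions, not the actual analysis code):

```r
# Hypothetical coding sketch: Settlement_Patterns as an ordered factor
# with 8 levels for the key analyses, plus a numeric copy for figures.
sccs$Settlement_Patterns <- factor(sccs$Settlement_Patterns,
                                   levels = 1:8, ordered = TRUE)
sccs$Settlement_Patterns_num <- as.numeric(sccs$Settlement_Patterns)
```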
Broad categories: Culture / Ecology / none (e.g., Language class)
Narrower categories: Culture: Culture / Culture: Ecology / Ecology: Climate / Ecology: Geography / Ecology: Both
Narrower coding of cultural variables ('none' = not a cultural variable).
"Conditional inference trees estimate a regression relationship by binary recursive partitioning in a conditional inference framework. Roughly, the algorithm works as follows:
1) Test the global null hypothesis of independence between any of the input variables and the response (which may be multivariate as well). Stop if this hypothesis cannot be rejected. Otherwise select the input variable with strongest association to the response. This association is measured by a p-value corresponding to a test for the partial null hypothesis of a single input variable and the response.
2) Implement a binary split in the selected input variable.
3) Recursively repeat steps 1) and 2)." (Hothorn et al., 2018, p. 8)
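A minimal sketch of a single conditional inference tree in R, assuming a data frame `sccs` with Settlement_Patterns as the response (illustrative only, not the authors' exact code):

```r
library(party)

# Fit one conditional inference tree: splits are chosen via the
# permutation tests described above, and growing stops when the
# global null hypothesis can no longer be rejected.
ct <- ctree(Settlement_Patterns ~ ., data = sccs)
plot(ct)  # inspect the selected splits and terminal nodes
```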
Some key benefits:
automatically detects interactions and non-linearities
no overfitting (no pruning)
BUT: individual trees are 'truly random' (they vary strongly from run to run) --> Forests.
Build a large number of these trees! (N=500).
A commonly used tool for small-N, large-P problems (here 137 predictors, with a random subset of 30 considered at each split: 'mtry' = 30).
Analyses suggest that random forest models are on a par with, or regularly outperform, other machine learning methods (e.g., Caruana et al., 2008). A lay explanation is available online.
We used the 'party' package in R.
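A hedged sketch of such a forest fit with party::cforest, using the ntree and mtry values reported above (the data frame name and seed are assumptions):

```r
library(party)
set.seed(1)  # arbitrary seed; see the seed-robustness check below

# 500 conditional inference trees; 30 candidate predictors per split.
cf <- cforest(Settlement_Patterns ~ .,
              data     = sccs,
              controls = cforest_unbiased(ntree = 500, mtry = 30))
```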
"More precisely, it measures the difference between the OOB error rate after and before permuting the values of the predictor of interest. ... The idea underlying this VIM is the following: If the predictor is not associated with the response, the permutation of its values has no influence on the classification, and thus also no influence on the error rate. The error rate of the forest is not substantially affected by the permutation and the VI of the predictor takes a value close to zero, indicating no association between the predictor and the response. In contrast, if response and predictor are associated, the permutation of the predictor values destroys this association. “Knocking out” this predictor by permuting its values results in a worse classification leading to an increased error rate. The difference in error rates before and after randomly permuting the predictor thus takes a positive value reflecting the high importance of this predictor" Janitza et al. 2013, BMC Bioinformatics: p. 3
"More precisely, it measures the difference between the OOB error rate after and before permuting the values of the predictor of interest. ... The idea underlying this VIM is the following: If the predictor is not associated with the response, the permutation of its values has no influence on the classification, and thus also no influence on the error rate. The error rate of the forest is not substantially affected by the permutation and the VI of the predictor takes a value close to zero, indicating no association between the predictor and the response. In contrast, if response and predictor are associated, the permutation of the predictor values destroys this association. “Knocking out” this predictor by permuting its values results in a worse classification leading to an increased error rate. The difference in error rates before and after randomly permuting the predictor thus takes a positive value reflecting the high importance of this predictor" Janitza et al. 2013, BMC Bioinformatics: p. 3
More information is available in the 'party' package manual.
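In 'party', this permutation importance is computed by varimp(); a minimal sketch, reusing the forest fit `cf` from the sketch above:

```r
# Permutation variable importance: mean change in OOB prediction
# accuracy after permuting each predictor, as described above.
vi <- varimp(cf)
head(sort(vi, decreasing = TRUE), 10)  # ten strongest predictors
```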
The Spearman \(\rho\) correlations between the observed and predicted values were .815 for run 1 and .812 for run 2 of the full model. For the reduced models, these correlations were .810 for run 1 and .808 for run 2, respectively. Is that good or bad?
Different starting seeds yield similar results: the correspondence between two runs for variable importance was very high (Full: Pearson r = .997, Spearman \(\rho\) = .938; Reduced: Pearson r = .996, Spearman \(\rho\) = .927).
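A sketch of how such observed-vs-predicted correlations can be computed from out-of-bag predictions (again reusing `cf` and `sccs` from the earlier sketches; the names are assumptions):

```r
# OOB predictions: each case is predicted only by trees that did not
# see it during fitting, giving an honest performance estimate.
pred <- predict(cf, OOB = TRUE)
obs  <- sccs$Settlement_Patterns
cor(as.numeric(obs), as.numeric(pred), method = "spearman")
```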
Varying 'mtry' (e.g., 46) and 'ntree' (e.g., N = 1,000) yields qualitatively the same results.
Forests are not readily interpretable; the model remains a black box.
A sample tree --> note: exploratory only. (Alternative: partial dependence plots.)
https://www.explainxkcd.com/wiki/index.php/1838:_Machine_Learning
Subsistence economy/agriculture are the best predictors. An unsurprising conclusion, perhaps?
Property rights and 'institutions' (or proxies thereof) have relatively little predictive influence.
'Ecology' is on a par with 'culture' (provided we can meaningfully disentangle the two).
Random forest: a useful tool in our toolbox?
Are agriculture/subsistence economy just the 'same thing' as settlement complexity?
'Garbage in, garbage out'.
Problems with modelling in these types of datasets (see Towner et al., 2016)
Machine learning challenge? How do other methods (e.g., support vector machines) do?
'Qualitative Comparative Analysis' (necessary and sufficient conditions for causal inference).
Phylogenetic work (Mathijs van Dijk was working with Andrew Meade)
The out-of-bag (oob) error estimate: "In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run, as follows:
Each tree is constructed using a different bootstrap sample from the original data. About one-third of the cases are left out of the bootstrap sample and not used in the construction of the kth tree.
Put each case left out in the construction of the kth tree down the kth tree to get a classification. In this way, a test set classification is obtained for each case in about one-third of the trees. At the end of the run, take j to be the class that got most of the votes every time case n was oob. The proportion of times that j is not equal to the true class of n averaged over all cases is the oob error estimate. This has proven to be unbiased in many tests." (Breiman & Cutler)
This approach does not overfit.
"This statistical approach ensures that the right sized tree is grown and no form of pruning or cross-validation or whatsoever is needed." (from here)