1
Supervised Categorization of JavaScript™ using Program Analysis Features
  • Wei Lu (luwei@nus.edu.sg): Computer Science Program, Singapore-MIT Alliance
  • Min-Yen Kan (kanmy@comp.nus.edu.sg): School of Computing, National University of Singapore


2
Why categorize JavaScript?
  • Most crawlers and indexers ignore crucial information conveyed by external programs like JavaScript™
  • Can these programs be summarized automatically?
3
Categorization Scheme
4
Outline
  • Introduction
  • Categorization Scheme
  • Feature Extraction Techniques
    • Related Work
    • Our Work
      • Lexical Analysis
      • Syntax Analysis
      • Metrics Analysis
      • Object Communication Analysis
      • Contextual Analysis
  • Conclusion


5
Related Work
  • Source code categorization: Ugurel et al. (2002)
    • Seminal work on the subject
    • Consists of two tasks
      • Programming language classification: identify the programming language from keywords, bi-grams, …; not an interesting problem for us
      • Topic classification: identify the topic of the code; relies heavily on external resources (e.g. README files, code headers) and analyzes the source code itself very little
  • Can we extract features from source code itself?
6
Text Categorization Baseline
  • Text categorization uses the Vector Space Model (VSM)
  • Bag-of-words approach
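
The bag-of-words idea can be illustrated with a minimal sketch. The term pattern, the raw-count weighting, and the function names `bagOfWords` and `cosineSimilarity` are assumptions made for illustration; the slides do not say how the baseline tokenizes or weights terms.

```javascript
// Build a term-frequency vector (term -> count) from raw text.
// Assumption: a term is any run of letters, digits, "_" or "$".
function bagOfWords(text) {
  const counts = new Map();
  for (const term of text.toLowerCase().match(/[a-z0-9_$]+/g) || []) {
    counts.set(term, (counts.get(term) || 0) + 1);
  }
  return counts;
}

// Cosine similarity between two sparse vectors in the vector space model.
function cosineSimilarity(a, b) {
  let dot = 0;
  for (const [term, weight] of a) {
    dot += weight * (b.get(term) || 0);
  }
  const norm = (v) =>
    Math.sqrt([...v.values()].reduce((sum, w) => sum + w * w, 0));
  const denominator = norm(a) * norm(b);
  return denominator === 0 ? 0 : dot / denominator;
}

// Usage: compare two small scripts as unordered bags of terms.
const scriptA = bagOfWords("window.status = msg;");
const scriptB = bagOfWords("window.open();");
console.log(cosineSimilarity(scriptA, scriptB));
```

Because word order is discarded, two scripts with the same terms in a different arrangement get identical vectors, which is one reason to look for program-aware features.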



  • Our baseline, using the text categorization bag-of-words approach: 87.47% accuracy
    • A high baseline, but it still leaves about 12.5 percentage points of room for improvement
    • It is also important to measure the error reduction rate
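
To make the error reduction rate concrete, here is a minimal sketch using the usual definition of relative error reduction. The function name and the 90% figure in the usage line are hypothetical; only the 87.47% baseline comes from the slide.

```javascript
// Relative error reduction: the fraction of the baseline's errors
// that the new system removes. Accuracies are given in [0, 1].
function relativeErrorReduction(baselineAccuracy, newAccuracy) {
  const baselineError = 1 - baselineAccuracy;
  const newError = 1 - newAccuracy;
  return (baselineError - newError) / baselineError;
}

// Usage: the reported baseline against a made-up accuracy of 90%.
// The 0.90 value is illustrative only, not a result from this work.
console.log(relativeErrorReduction(0.8747, 0.90));
```

With a baseline this high, a gain of a few accuracy points corresponds to a much larger share of the remaining errors, which is why the rate is worth reporting alongside raw accuracy.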

7
Lexical Analysis
  • How can JavaScript be tokenized more sensibly?
  • First attempt to improve on the baseline: we believe a compiler-based approach is more reasonable
    • Similar to POS tags in Natural Language Processing, we introduce a tagset for JavaScript tokenization/tagging
    • The tagging process also normalizes tokens
  • Tags used:
    • KEY: keyword; VAR: variable; SYM: symbol; NUM: number; STR: string; CMT: comment; REG: regexp
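
A rough sketch of how a tagger for the tagset above might look follows. It is not the authors' tool. The keyword subset, the pattern list, the rule for spotting regexp literals, and the normalization (collapsing literals and comments to their bare tag) are all assumptions for illustration.

```javascript
// Hypothetical keyword subset; a real tagger would use the full reserved-word list.
const KEYWORDS = new Set([
  "var", "function", "if", "else", "for", "while", "return", "new",
]);

// Ordered patterns: the first one that matches at the current position wins.
const PATTERNS = [
  ["CMT", /^(?:\/\/[^\n]*|\/\*[\s\S]*?\*\/)/],
  ["REG", /^\/(?:\\.|\[(?:\\.|[^\]\\\n])*\]|[^\/\\\n\[])+\/[a-z]*/],
  ["STR", /^(?:"(?:\\.|[^"\\\n])*"|'(?:\\.|[^'\\\n])*')/],
  ["NUM", /^(?:0[xX][0-9a-fA-F]+|\d+(?:\.\d+)?)/],
  ["VAR", /^[A-Za-z_$][\w$]*/],
  ["SYM", /^(?:===|!==|==|!=|<=|>=|&&|\|\||\+\+|--|[-+*\/%=<>!&|^~?:;,.(){}\[\]])/],
];

function tagTokens(source) {
  const tokens = [];
  let pos = 0;
  while (pos < source.length) {
    const rest = source.slice(pos);
    const whitespace = /^\s+/.exec(rest);
    if (whitespace) {
      pos += whitespace[0].length;
      continue;
    }
    // Simplified heuristic: "/" may start a regexp only where an operand
    // is expected, i.e. not directly after a name, a number, ")" or "]".
    const prev = tokens[tokens.length - 1];
    const regexAllowed =
      !prev ||
      prev.tag === "KEY" ||
      (prev.tag === "SYM" && prev.text !== ")" && prev.text !== "]");

    let hit = null;
    for (const [tag, pattern] of PATTERNS) {
      if (tag === "REG" && !regexAllowed) continue;
      const match = pattern.exec(rest);
      if (match) {
        hit = { tag, text: match[0] };
        break;
      }
    }
    // Unknown characters fall back to one-character symbols.
    if (!hit) hit = { tag: "SYM", text: rest[0] };
    if (hit.tag === "VAR" && KEYWORDS.has(hit.text)) hit.tag = "KEY";
    tokens.push(hit);
    pos += hit.text.length;
  }
  return tokens;
}

// Assumed normalization: literals and comments collapse to their tag,
// so "Welcome" and "Goodbye" both become the single feature STR.
function normalize(token) {
  return ["NUM", "STR", "CMT", "REG"].includes(token.tag)
    ? token.tag
    : `${token.tag}:${token.text}`;
}

// Usage: tag and normalize one statement.
console.log(tagTokens('var msg = "Welcome to this page";').map(normalize));
```

The normalized strings can then be counted exactly like words in the bag-of-words baseline, but without splitting string literals or comments into arbitrary terms.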


8
Evaluation on Lexical Analysis
9
Syntax Analysis
10
Evaluation on Syntax Analysis
11
Code Metrics Analysis
  • Metrics: measurements of program source code
    • Complexity: statistical measurements, e.g. CC, IFIN
      • We propose our own metrics in addition to published ones
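
As an illustration of a complexity metric, the sketch below approximates cyclomatic complexity, assuming that is what CC abbreviates here. It counts decision points in the source text and adds one. The function name and the set of counted constructs are assumptions, and a real tool would work on a parse tree instead of regular expressions.

```javascript
// Approximate cyclomatic complexity: decision points + 1.
function approxCyclomaticComplexity(source) {
  // Blank out string literals and comments in one pass so that
  // keywords inside them are not counted.
  const stripped = source.replace(
    /"(?:\\.|[^"\\\n])*"|'(?:\\.|[^'\\\n])*'|\/\*[\s\S]*?\*\/|\/\/[^\n]*/g,
    " "
  );
  // Assumption: each of these constructs adds one independent path.
  // "?" is treated as the ternary operator (no optional chaining or "??").
  const decisions =
    stripped.match(/\b(?:if|for|while|case|catch)\b|&&|\|\||\?/g) || [];
  return decisions.length + 1;
}

// Usage: one "if" and one "&&" give three independent paths.
console.log(
  approxCyclomaticComplexity(
    "function f(x) { if (x > 0 && x < 10) { return 1; } return 0; }"
  )
);
```

A single number like this can be appended to the feature vector directly, alongside token-based features.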
12
Evaluation on Metrics Analysis
13
Outline
  • Introduction
  • Categorization Scheme
  • Feature Extraction Techniques
    • Related Work
    • Our Work
      • Lexical Analysis
      • Syntax Analysis
      • Metrics Analysis
      • Object Communication Analysis
      • Contextual Analysis
  • Conclusion


14
Object Communication Analysis
  • var msg = "Welcome to this page";
  • banner(0);
  • function banner (index){
  •   var newWin = window.open();
  •   frm.txt.value="ok";
  •   window.status = msg.substring(0, index);
  •   index++;
  •   if (index >= msg.length) index = 0;
  •   window.setTimeout("banner(" + index + ")", 100);
  • }
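
One way to turn a script like the one above into object communication features is to record which browser objects it touches and how. The sketch below is an assumption about what such features could look like, not the extraction method used in this work. The host object list, the read/write/call distinction, and the function name are hypothetical.

```javascript
// Count accesses to a fixed list of browser host objects, labelled as
// read, write, or call. Assumption: direct "object.member" accesses only;
// strings and comments are not stripped in this sketch.
function objectCommunicationFeatures(source) {
  const pattern =
    /\b(window|document|navigator|location|history|screen)\.([A-Za-z_$][\w$]*)\s*(\(|=(?!=))?/g;
  const counts = {};
  let match;
  while ((match = pattern.exec(source)) !== null) {
    const [, object, member, marker] = match;
    const kind = marker === "(" ? "call" : marker === "=" ? "write" : "read";
    const key = `${object}.${member}:${kind}`;
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
}

// Usage: a line that writes to the status bar.
console.log(
  objectCommunicationFeatures("window.status = msg.substring(0, index);")
);
```

On the banner script, a scan of this kind would pick up the calls to `window.open` and `window.setTimeout` and the write to `window.status`, which hints at a status-bar scroller that also opens windows. The access through `frm` is missed because the form name is defined by the page, not by the browser.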
15
Contextual Analysis
  • Extracting information from the context of the enclosing web page
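
The slide does not list which contextual cues are used, so the following is only a guess at what such features might be: the words of the page title, and the event-handler attributes that invoke a given function. The function name, the choice of cues, and the regex-based HTML scanning are all assumptions; a real extractor would use a proper HTML parser.

```javascript
// Collect simple context features for one script function from the
// HTML of the enclosing page.
function contextFeatures(html, functionName) {
  // Words of the page title, lowercased.
  const titleMatch = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  const titleWords = titleMatch
    ? titleMatch[1].toLowerCase().match(/[a-z0-9]+/g) || []
    : [];

  // Event-handler attributes (onload, onclick, ...) whose code calls the function.
  const handlers = [];
  const attribute = /\b(on[a-z]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/gi;
  let match;
  while ((match = attribute.exec(html)) !== null) {
    const code = match[2] !== undefined ? match[2] : match[3];
    if (code.includes(functionName + "(")) {
      handlers.push(match[1].toLowerCase());
    }
  }
  return { titleWords, handlers };
}

// Usage: a hypothetical page that starts the banner on load.
const page =
  '<html><head><title>Welcome Banner Demo</title></head>' +
  '<body onload="banner(0)"><form name="frm"></form></body></html>';
console.log(contextFeatures(page, "banner"));
```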


16
Evaluation on Object Communication and Contextual Analysis
17
Outline
  • Introduction
  • Categorization Scheme
  • Feature Extraction Techniques
    • Related Work
    • Our Work
    • Overall Evaluation
  • Conclusion


18
Evaluation on All Components
19
Outline
  • Categorization Scheme
  • Feature Extraction Techniques
  • Conclusion
20
Contributions
  • Showed that program analysis can improve source code categorization performance
    • Both context-free and context-sensitive analysis
  • Case study of JavaScript categorization
    • New, functionality-based categorization
    • Tool for feature extraction from JavaScript
21
Conclusions
  • Limitations:
    • Annotator Agreement
    • Dynamic Analysis Incompleteness
    • Choice of Classifier


  • Future Work
    • Source code classification of other languages
    • Firefox extension / IE plug-in


  • Dataset and system prototype available at:
    • http://wing.comp.nus.edu.sg/~luwei/SMART
22
Questions?
  • Dataset and system prototype available at:
    • http://wing.comp.nus.edu.sg/~luwei/SMART
  • First author’s undergraduate honours year project thesis:
    • http://wing.comp.nus.edu.sg/publications/theses/weiLuThesis.pdf
  • Contacts:
    • Wei LU: luwei@nus.edu.sg
    • Min-Yen KAN: kanmy@comp.nus.edu.sg