Skip to content
Snippets Groups Projects
Commit 76c076f8 authored by Dmitriy Zaporozhets's avatar Dmitriy Zaporozhets
Browse files

Merge tag 'v2.9.5' of https://github.com/github/linguist

Linguist 2.9.5
parents c3d6fc5a a00967dd
No related branches found
No related tags found
No related merge requests found
Showing
with 42824 additions and 1137 deletions
source :rubygems
source 'https://rubygems.org'
gemspec
Copyright (c) 2011 GitHub, Inc.
Copyright (c) 2011-2013 GitHub, Inc.
 
Permission is hereby granted, free of charge, to any person
obtaining a copy of this software and associated documentation
Loading
Loading
Loading
Loading
@@ -8,44 +8,38 @@ We use this library at GitHub to detect blob languages, highlight code, ignore b
 
Linguist defines the list of all languages known to GitHub in a [yaml file](https://github.com/github/linguist/blob/master/lib/linguist/languages.yml). In order for a file to be highlighted, a language and lexer must be defined there.
 
Most languages are detected by their file extension. This is the fastest and most common situation. For script files, which are usually extensionless, we do "deep content inspection"™ and check the shebang of the file. Checking the file's contents may also be used for disambiguating languages. C, C++ and Obj-C all use `.h` files. Looking for common keywords, we are usually able to guess the correct language.
Most languages are detected by their file extension. This is the fastest and most common situation.
For disambiguating between files with common extensions, we use a [Bayesian classifier](https://github.com/github/linguist/blob/master/lib/linguist/classifier.rb). For an example, this helps us tell the difference between `.h` files which could be either C, C++, or Obj-C.
 
In the actual GitHub app we deal with `Grit::Blob` objects. For testing, there is a simple `FileBlob` API.
 
Linguist::FileBlob.new("lib/linguist.rb").language.name #=> "Ruby"
```ruby
Linguist::FileBlob.new("lib/linguist.rb").language.name #=> "Ruby"
 
Linguist::FileBlob.new("bin/linguist").language.name #=> "Ruby"
Linguist::FileBlob.new("bin/linguist").language.name #=> "Ruby"
```
 
See [lib/linguist/language.rb](https://github.com/github/linguist/blob/master/lib/linguist/language.rb) and [lib/linguist/languages.yml](https://github.com/github/linguist/blob/master/lib/linguist/languages.yml).
 
### Syntax Highlighting
 
The actual syntax highlighting is handled by our Pygments wrapper, [Albino](https://github.com/github/albino). Linguist provides a [Lexer abstraction](https://github.com/github/linguist/blob/master/lib/linguist/lexer.rb) that determines which highlighter should be used on a file.
We typically run on a prerelease version of Pygments to get early access to new lexers. The [lexers.yml](https://github.com/github/linguist/blob/master/lib/linguist/lexers.yml) file is a dump of the lexers we have available on our server. If there is a new lexer in pygments-main not on the list, [open an issue](https://github.com/github/linguist/issues) and we'll try to upgrade it soon.
### MIME type detection
Most of the MIME types handling is done by the Ruby [mime-types gem](https://github.com/halostatue/mime-types/blob/master/lib/mime/types.rb.data). But we have our own list of additions and overrides. To add or modify this list, see [lib/linguist/mimes.yml](https://github.com/github/linguist/blob/master/lib/linguist/mimes.yml).
MIME types are used to set the Content-Type of raw binary blobs which are served from a special `raw.github.com` domain. However, all text blobs are served as `text/plain` regardless of their type to ensure they open in the browser rather than downloading.
The MIME type also determines whether a blob is binary or plain text. So if you're seeing a blob that says "View Raw" and it is actually plain text, the mime type and encoding probably needs to be explicitly stated.
The actual syntax highlighting is handled by our Pygments wrapper, [pygments.rb](https://github.com/tmm1/pygments.rb). It also provides a [Lexer abstraction](https://github.com/tmm1/pygments.rb/blob/master/lib/pygments/lexer.rb) that determines which highlighter should be used on a file.
 
Linguist::FileBlob.new("linguist.zip").binary? #=> true
See [lib/linguist/mimes.yml](https://github.com/github/linguist/blob/master/lib/linguist/mimes.yml).
We typically run on a pre-release version of Pygments, [pygments.rb](https://github.com/tmm1/pygments.rb), to get early access to new lexers. The [languages.yml](https://github.com/github/linguist/blob/master/lib/linguist/languages.yml) file is a dump of the lexers we have available on our server.
 
### Stats
 
The [Language Graph](https://github.com/github/linguist/graphs/languages) is built by aggregating the languages of all repo's blobs. The top language in the graph determines the project's primary language. Collectively, these stats make up the [Top Languages](https://github.com/languages) page.
The Language Graph you see on every repository is built by aggregating the languages of all repo's blobs. The top language in the graph determines the project's primary language. Collectively, these stats make up the [Top Languages](https://github.com/languages) page.
 
The repository stats API can be used on a directory:
 
project = Linguist::Repository.from_directory(".")
project.language.name #=> "Ruby"
project.languages #=> { "Ruby" => 0.98,
"Shell" => 0.02 }
```ruby
project = Linguist::Repository.from_directory(".")
project.language.name #=> "Ruby"
project.languages #=> { "Ruby" => 0.98, "Shell" => 0.02 }
```
 
These stats are also printed out by the binary. Try running `linguist` on itself:
 
Loading
Loading
@@ -56,21 +50,27 @@ These stats are also printed out by the binary. Try running `linguist` on itself
 
Checking other code into your git repo is a common practice. But this often inflates your project's language stats and may even cause your project to be labeled as another language. We are able to identify some of these files and directories and exclude them.
 
Linguist::FileBlob.new("vendor/plugins/foo.rb").vendored? # => true
```ruby
Linguist::FileBlob.new("vendor/plugins/foo.rb").vendored? # => true
```
 
See [Linguist::BlobHelper#vendored?](https://github.com/github/linguist/blob/master/lib/linguist/blob_helper.rb) and [lib/linguist/vendor.yml](https://github.com/github/linguist/blob/master/lib/linguist/vendor.yml).
 
#### Generated file detection
 
Not all plain text files are true source files. Generated files like minified js and compiled CoffeeScript can be detected and excluded from language stats. As an extra bonus, these files are suppressed in Diffs.
Not all plain text files are true source files. Generated files like minified js and compiled CoffeeScript can be detected and excluded from language stats. As an extra bonus, these files are suppressed in diffs.
 
Linguist::FileBlob.new("underscore.min.js").generated? # => true
```ruby
Linguist::FileBlob.new("underscore.min.js").generated? # => true
```
 
See [Linguist::BlobHelper#generated?](https://github.com/github/linguist/blob/master/lib/linguist/blob_helper.rb).
See [Linguist::Generated#generated?](https://github.com/github/linguist/blob/master/lib/linguist/generated.rb).
 
## Installation
 
To get it, clone the repo and run [Bundler](http://gembundler.com/) to install its dependencies.
github.com is usually running the latest version of the `github-linguist` gem that is released on [RubyGems.org](http://rubygems.org/gems/github-linguist).
But for development you are going to want to checkout out the source. To get it, clone the repo and run [Bundler](http://gembundler.com/) to install its dependencies.
 
git clone https://github.com/github/linguist.git
cd linguist/
Loading
Loading
@@ -80,17 +80,16 @@ To run the tests:
 
bundle exec rake test
 
*Since this code is specific to GitHub, is not published as a official rubygem.*
## Contributing
 
If you are seeing errors like `StandardError: could not find any magic files!`, it means the CharlockHolmes gem didn’t install correctly. See the [installing section](https://github.com/brianmario/charlock_holmes/blob/master/README.md) of the CharlockHolmes README for more information.
The majority of patches won't need to touch any Ruby code at all. The [master language list](https://github.com/github/linguist/blob/master/lib/linguist/languages.yml) is just a configuration file.
 
## Contributing
We try to only add languages once they have some usage on GitHub, so please note in-the-wild usage examples in your pull request.
Almost all bug fixes or new language additions should come with some additional code samples. Just drop them under [`samples/`](https://github.com/github/linguist/tree/master/samples) in the correct subdirectory and our test suite will automatically test them. In most cases you shouldn't need to add any new assertions.
### Testing
Sometimes getting the tests running can be too much work, especially if you don't have much Ruby experience. It's okay, be lazy and let our build bot [Travis](http://travis-ci.org/#!/github/linguist) run the tests for you. Just open a pull request and the bot will start cranking away.
 
1. Fork it.
2. Create a branch (`git checkout -b detect-foo-language`)
3. Make your changes
4. Run the tests (`bundle install` then `bundle exec rake`)
5. Commit your changes (`git commit -am "Added detection for the new Foo language"`)
6. Push to the branch (`git push origin detect-foo-language`)
7. Create a [Pull Request](http://help.github.com/pull-requests/) from your branch.
8. Promote it. Get others to drop in and +1 it.
Here's our current build status, which is hopefully green: [![Build Status](https://secure.travis-ci.org/github/linguist.png?branch=master)](http://travis-ci.org/github/linguist)
require 'rake/clean'
require 'rake/testtask'
 
task :default => :test
 
Rake::TestTask.new do |t|
t.warning = true
Rake::TestTask.new
task :samples do
require 'linguist/samples'
require 'yajl'
data = Linguist::Samples.data
json = Yajl::Encoder.encode(data, :pretty => true)
File.open('lib/linguist/samples.json', 'w') { |io| io.write json }
end
namespace :classifier do
LIMIT = 1_000
desc "Run classifier against #{LIMIT} public gists"
task :test do
require 'linguist/classifier'
require 'linguist/samples'
total, correct, incorrect = 0, 0, 0
$stdout.sync = true
each_public_gist do |gist_url, file_url, file_language|
next if file_language.nil? || file_language == 'Text'
begin
data = open(file_url).read
guessed_language, score = Linguist::Classifier.classify(Linguist::Samples::DATA, data).first
total += 1
guessed_language == file_language ? correct += 1 : incorrect += 1
print "\r\e[0K%d:%d %g%%" % [correct, incorrect, (correct.to_f/total.to_f)*100]
$stdout.flush
rescue URI::InvalidURIError
else
break if total >= LIMIT
end
end
puts ""
end
def each_public_gist
require 'open-uri'
require 'json'
url = "https://api.github.com/gists/public"
loop do
resp = open(url)
url = resp.meta['link'][/<([^>]+)>; rel="next"/, 1]
gists = JSON.parse(resp.read)
for gist in gists
for filename, attrs in gist['files']
yield gist['url'], attrs['raw_url'], attrs['language']
end
end
end
end
end
#!/usr/bin/env ruby
 
# linguist — detect language type for a file, or, given a directory, determine language breakdown
#
# usage: linguist <path>
require 'linguist/file_blob'
require 'linguist/repository'
 
Loading
Loading
@@ -23,12 +27,11 @@ elsif File.file?(path)
 
puts "#{blob.name}: #{blob.loc} lines (#{blob.sloc} sloc)"
puts " type: #{type}"
puts " extension: #{blob.pathname.extname}"
puts " mime type: #{blob.mime_type}"
puts " language: #{blob.language}"
 
if blob.large?
puts " blob is to large to be shown"
puts " blob is too large to be shown"
end
 
if blob.generated?
Loading
Loading
Gem::Specification.new do |s|
s.name = 'linguist'
s.version = '1.0.0'
s.name = 'github-linguist'
s.version = '2.9.5'
s.summary = "GitHub Language detection"
 
s.authors = "GitHub"
s.authors = "GitHub"
s.homepage = "https://github.com/github/linguist"
 
s.files = Dir['lib/**/*']
s.executables << 'linguist'
 
s.add_dependency 'charlock_holmes', '~> 0.6.6'
s.add_dependency 'escape_utils', '~> 0.2.3'
s.add_dependency 'mime-types', '~> 1.18'
s.add_dependency 'pygments.rb', '~> 0.2.11'
s.add_dependency 'escape_utils', '~> 0.3.1'
s.add_dependency 'mime-types', '~> 1.19'
s.add_dependency 'pygments.rb', '~> 0.5.2'
s.add_development_dependency 'mocha'
s.add_development_dependency 'json'
s.add_development_dependency 'rake'
s.add_development_dependency 'yajl-ruby'
end
require 'linguist/blob_helper'
require 'linguist/generated'
require 'linguist/language'
require 'linguist/mime'
require 'linguist/pathname'
require 'linguist/repository'
require 'linguist/samples'
require 'linguist/generated'
require 'linguist/language'
require 'linguist/mime'
require 'linguist/pathname'
 
require 'charlock_holmes'
require 'escape_utils'
require 'mime/types'
require 'pygments'
require 'yaml'
 
module Linguist
# DEPRECATED Avoid mixing into Blob classes. Prefer functional interfaces
# like `Language.detect` over `Blob#language`. Functions are much easier to
# cache and compose.
#
# Avoid adding additional bloat to this module.
#
# BlobHelper is a mixin for Blobish classes that respond to "name",
# "data" and "size" such as Grit::Blob.
module BlobHelper
# Internal: Get a Pathname wrapper for Blob#name
#
# Returns a Pathname.
def pathname
Pathname.new(name || "")
end
# Public: Get the extname of the path
#
# Examples
Loading
Loading
@@ -27,7 +26,23 @@ module Linguist
#
# Returns a String
def extname
pathname.extname
File.extname(name.to_s)
end
# Internal: Lookup mime type for extension.
#
# Returns a MIME::Type
def _mime_type
if defined? @_mime_type
@_mime_type
else
guesses = ::MIME::Types.type_for(extname.to_s)
# Prefer text mime types over binary
@_mime_type = guesses.detect { |type| type.ascii? } ||
# Otherwise use the first guess
guesses.first
end
end
 
# Public: Get the actual blob mime type
Loading
Loading
@@ -39,7 +54,23 @@ module Linguist
#
# Returns a mime type String.
def mime_type
@mime_type ||= pathname.mime_type
_mime_type ? _mime_type.to_s : 'text/plain'
end
# Internal: Is the blob binary according to its mime type
#
# Return true or false
def binary_mime_type?
_mime_type ? _mime_type.binary? : false
end
# Internal: Is the blob binary according to its mime type,
# overriding it if we have better data from the languages.yml
# database.
#
# Return true or false
def likely_binary?
binary_mime_type? && !Language.find_by_filename(name)
end
 
# Public: Get the Content-Type header value
Loading
Loading
@@ -71,7 +102,7 @@ module Linguist
elsif name.nil?
"attachment"
else
"attachment; filename=#{EscapeUtils.escape_url(pathname.basename)}"
"attachment; filename=#{EscapeUtils.escape_url(File.basename(name))}"
end
end
 
Loading
Loading
@@ -90,15 +121,6 @@ module Linguist
@detect_encoding ||= CharlockHolmes::EncodingDetector.new.detect(data) if data
end
 
# Public: Is the blob binary according to its mime type
#
# Return true or false
def binary_mime_type?
if mime_type = Mime.lookup_mime_type_for(pathname.extname)
mime_type.binary?
end
end
# Public: Is the blob binary?
#
# Return true or false
Loading
Loading
@@ -132,23 +154,28 @@ module Linguist
#
# Return true or false
def image?
['.png', '.jpg', '.jpeg', '.gif'].include?(extname)
['.png', '.jpg', '.jpeg', '.gif'].include?(extname.downcase)
end
# Public: Is the blob a supported 3D model format?
#
# Return true or false
def solid?
extname.downcase == '.stl'
end
 
# Public: Is the blob a possible drupal php file?
# Public: Is this blob a CSV file?
#
# Return true or false
def drupal_extname?
['.module', '.install', '.test', '.inc'].include?(extname)
def csv?
text? && extname.downcase == '.csv'
end
 
# Public: Is the blob likely to have a shebang?
# Public: Is the blob a PDF?
#
# Return true or false
def shebang_extname?
extname.empty? &&
mode &&
(mode.to_i(8) & 05) == 05
def pdf?
extname.downcase == '.pdf'
end
 
MEGABYTE = 1024 * 1024
Loading
Loading
@@ -160,6 +187,28 @@ module Linguist
size.to_i > MEGABYTE
end
 
# Public: Is the blob safe to colorize?
#
# We use Pygments for syntax highlighting blobs. Pygments
# can be too slow for very large blobs or for certain
# corner-case blobs.
#
# Return true or false
def safe_to_colorize?
!large? && text? && !high_ratio_of_long_lines?
end
# Internal: Does the blob have a ratio of long lines?
#
# These types of files are usually going to make Pygments.rb
# angry if we try to colorize them.
#
# Return true or false
def high_ratio_of_long_lines?
return false if loc == 0
size / loc > 5000
end
# Public: Is the blob viewable?
#
# Non-viewable blobs will just show a "View Raw" link
Loading
Loading
@@ -190,7 +239,12 @@ module Linguist
#
# Returns an Array of lines
def lines
@lines ||= (viewable? && data) ? data.split("\n", -1) : []
@lines ||=
if viewable? && data
data.split(/\r\n|\r|\n/, -1)
else
[]
end
end
 
# Public: Get number of lines of code
Loading
Loading
@@ -211,153 +265,16 @@ module Linguist
lines.grep(/\S/).size
end
 
# Internal: Compute average line length.
#
# Returns Integer.
def average_line_length
if lines.any?
lines.inject(0) { |n, l| n += l.length } / lines.length
else
0
end
end
# Public: Is the blob a generated file?
#
# Generated source code is supressed in diffs and is ignored by
# Generated source code is suppressed in diffs and is ignored by
# language statistics.
#
# Requires Blob#data
#
# Includes:
# - XCode project XML files
# - Minified JavaScript
#
# Please add additional test coverage to
# `test/test_blob.rb#test_generated` if you make any changes.
# May load Blob#data
#
# Return true or false
def generated?
if xcode_project_file? || generated_net_docfile?
true
elsif generated_coffeescript? || minified_javascript?
true
elsif name == 'Gemfile.lock'
true
else
false
end
end
# Internal: Is the blob an XCode project file?
#
# Generated if the file extension is an XCode project
# file extension.
#
# Returns true of false.
def xcode_project_file?
['.xib', '.nib', '.pbxproj', '.xcworkspacedata', '.xcuserstate'].include?(extname)
end
# Internal: Is the blob minified JS?
#
# Consider JS minified if the average line length is
# greater then 100c.
#
# Returns true or false.
def minified_javascript?
return unless extname == '.js'
average_line_length > 100
end
# Internal: Is the blob JS generated by CoffeeScript?
#
# Requires Blob#data
#
# CoffeScript is meant to output JS that would be difficult to
# tell if it was generated or not. Look for a number of patterns
# outputed by the CS compiler.
#
# Return true or false
def generated_coffeescript?
return unless extname == '.js'
# CoffeeScript generated by > 1.2 include a comment on the first line
if lines[0] =~ /^\/\/ Generated by /
return true
end
if lines[0] == '(function() {' && # First line is module closure opening
lines[-2] == '}).call(this);' && # Second to last line closes module closure
lines[-1] == '' # Last line is blank
score = 0
lines.each do |line|
if line =~ /var /
# Underscored temp vars are likely to be Coffee
score += 1 * line.gsub(/(_fn|_i|_len|_ref|_results)/).count
# bind and extend functions are very Coffee specific
score += 3 * line.gsub(/(__bind|__extends|__hasProp|__indexOf|__slice)/).count
end
end
# Require a score of 3. This is fairly arbitrary. Consider
# tweaking later.
score >= 3
else
false
end
end
# Internal: Is this a generated documentation file for a .NET assembly?
#
# Requires Blob#data
#
# .NET developers often check in the XML Intellisense file along with an
# assembly - however, these don't have a special extension, so we have to
# dig into the contents to determine if it's a docfile. Luckily, these files
# are extremely structured, so recognizing them is easy.
#
# Returns true or false
def generated_net_docfile?
return false unless extname.downcase == ".xml"
return false unless lines.count > 3
# .NET Docfiles always open with <doc> and their first tag is an
# <assembly> tag
return lines[1].include?("<doc>") &&
lines[2].include?("<assembly>") &&
lines[-2].include?("</doc>")
end
# Public: Should the blob be indexed for searching?
#
# Excluded:
# - Files over 0.1MB
# - Non-text files
# - Langauges marked as not searchable
# - Generated source files
#
# Please add additional test coverage to
# `test/test_blob.rb#test_indexable` if you make any changes.
#
# Return true or false
def indexable?
if binary?
false
elsif language.nil?
false
elsif !language.searchable?
false
elsif generated?
false
elsif size > 100 * 1024
false
else
true
end
@_generated ||= Generated.generated?(name, lambda { data })
end
 
# Public: Detects the Language of the blob.
Loading
Loading
@@ -366,33 +283,15 @@ module Linguist
#
# Returns a Language or nil if none is detected
def language
if defined? @language
@language
return @language if defined? @language
if defined?(@data) && @data.is_a?(String)
data = @data
else
@language = guess_language
data = lambda { (binary_mime_type? || binary?) ? "" : self.data }
end
end
# Internal: Guess language
#
# Please add additional test coverage to
# `test/test_blob.rb#test_language` if you make any changes.
#
# Returns a Language or nil
def guess_language
return if binary_mime_type?
# Disambiguate between multiple language extensions
disambiguate_extension_language ||
# See if there is a Language for the extension
pathname.language ||
# Look for idioms in first line
first_line_language ||
 
# Try to detect Language from shebang line
shebang_language
@language = Language.detect(name.to_s, data, mode)
end
 
# Internal: Get the lexer of the blob.
Loading
Loading
@@ -402,269 +301,16 @@ module Linguist
language ? language.lexer : Pygments::Lexer.find_by_name('Text only')
end
 
# Internal: Disambiguates between multiple language extensions.
#
# Delegates to "guess_EXTENSION_language".
#
# Please add additional test coverage to
# `test/test_blob.rb#test_language` if you add another method.
#
# Returns a Language or nil.
def disambiguate_extension_language
if Language.ambiguous?(extname)
name = "guess_#{extname.sub(/^\./, '')}_language"
send(name) if respond_to?(name)
end
end
# Internal: Guess language of .cls files
#
# Returns a Language.
def guess_cls_language
if lines.grep(/^(%|\\)/).any?
Language['TeX']
elsif lines.grep(/^\s*(CLASS|METHOD|INTERFACE).*:\s*/i).any? || lines.grep(/^\s*(USING|DEFINE)/i).any?
Language['OpenEdge ABL']
elsif lines.grep(/\{$/).any? || lines.grep(/\}$/).any?
Language['Apex']
elsif lines.grep(/^(\'\*|Attribute|Option|Sub|Private|Protected|Public|Friend)/i).any?
Language['Visual Basic']
else
# The most common language should be the fallback
Language['TeX']
end
end
# Internal: Guess language of header files (.h).
#
# Returns a Language.
def guess_h_language
if lines.grep(/^@(interface|property|private|public|end)/).any?
Language['Objective-C']
elsif lines.grep(/^class |^\s+(public|protected|private):/).any?
Language['C++']
else
Language['C']
end
end
# Internal: Guess language of .m files.
#
# Objective-C heuristics:
# * Keywords
#
# Matlab heuristics:
# * Leading function keyword
# * "%" comments
#
# Returns a Language.
def guess_m_language
# Objective-C keywords
if lines.grep(/^#import|@(interface|implementation|property|synthesize|end)/).any?
Language['Objective-C']
# File function
elsif lines.first.to_s =~ /^function /
Language['Matlab']
# Matlab comment
elsif lines.grep(/^%/).any?
Language['Matlab']
# Fallback to Objective-C, don't want any Matlab false positives
else
Language['Objective-C']
end
end
# Internal: Guess language of .pl files
#
# The rules for disambiguation are:
#
# 1. Many perl files begin with a shebang
# 2. Most Prolog source files have a rule somewhere (marked by the :- operator)
# 3. Default to Perl, because it is more popular
#
# Returns a Language.
def guess_pl_language
if shebang_script == 'perl'
Language['Perl']
elsif lines.grep(/:-/).any?
Language['Prolog']
else
Language['Perl']
end
end
# Internal: Guess language of .r files.
#
# Returns a Language.
def guess_r_language
if lines.grep(/(rebol|(:\s+func|make\s+object!|^\s*context)\s*\[)/i).any?
Language['Rebol']
else
Language['R']
end
end
# Internal: Guess language of .t files.
#
# Returns a Language.
def guess_t_language
score = 0
score += 1 if lines.grep(/^% /).any?
score += data.gsub(/ := /).count
score += data.gsub(/proc |procedure |fcn |function /).count
score += data.gsub(/var \w+: \w+/).count
# Tell-tale signs its gotta be Perl
if lines.grep(/^(my )?(sub |\$|@|%)\w+/).any?
score = 0
end
if score >= 3
Language['Turing']
else
Language['Perl']
end
end
# Internal: Guess language of .v files.
#
# Returns a Language
def guess_v_language
if lines.grep(/^(\/\*|\/\/|module|parameter|input|output|wire|reg|always|initial|begin|\`)/).any?
Language['Verilog']
else
Language['Coq']
end
end
# Internal: Guess language of .gsp files.
#
# Returns a Language.
def guess_gsp_language
if lines.grep(/<%|<%@|\$\{|<%|<g:|<meta name="layout"|<r:/).any?
Language['Groovy Server Pages']
else
Language['Gosu']
end
end
# Internal: Guess language from the first line.
#
# Look for leading "<?php" in Drupal files
#
# Returns a Language.
def first_line_language
# Only check files with drupal php extensions
return unless drupal_extname?
# Fail fast if blob isn't viewable?
return unless viewable?
if lines.first.to_s =~ /^<\?php/
Language['PHP']
end
end
# Internal: Extract the script name from the shebang line
#
# Requires Blob#data
#
# Examples
#
# '#!/usr/bin/ruby'
# # => 'ruby'
#
# '#!/usr/bin/env ruby'
# # => 'ruby'
#
# '#!/usr/bash/python2.4'
# # => 'python'
#
# Please add additional test coverage to
# `test/test_blob.rb#test_shebang_script` if you make any changes.
#
# Returns a script name String or nil
def shebang_script
# Fail fast if blob isn't viewable?
return unless viewable?
if lines.any? && (match = lines[0].match(/(.+)\n?/)) && (bang = match[0]) =~ /^#!/
bang.sub!(/^#! /, '#!')
tokens = bang.split(' ')
pieces = tokens.first.split('/')
if pieces.size > 1
script = pieces.last
else
script = pieces.first.sub('#!', '')
end
script = script == 'env' ? tokens[1] : script
# python2.4 => python
if script =~ /((?:\d+\.?)+)/
script.sub! $1, ''
end
# Check for multiline shebang hacks that exec themselves
#
# #!/bin/sh
# exec foo "$0" "$@"
#
if script == 'sh' &&
lines[0...5].any? { |l| l.match(/exec (\w+).+\$0.+\$@/) }
script = $1
end
script
end
end
# Internal: Get Language for shebang script
#
# Returns the Language or nil
def shebang_language
# Skip file extensions unlikely to have shebangs
return unless shebang_extname?
if script = shebang_script
Language[script]
end
end
# Public: Highlight syntax of blob
#
# options - A Hash of options (defaults to {})
#
# Returns html String
def colorize(options = {})
return if !text? || large? || generated?
return unless safe_to_colorize?
options[:options] ||= {}
options[:options][:encoding] ||= encoding
lexer.highlight(data, options)
end
# Public: Highlight syntax of blob without the outer highlight div
# wrapper.
#
# options - A Hash of options (defaults to {})
#
# Returns html String
def colorize_without_wrapper(options = {})
if text = colorize(options)
text[%r{<div class="highlight"><pre>(.*?)</pre>\s*</div>}m, 1]
else
''
end
end
Language.overridden_extensions.each do |extension|
name = "guess_#{extension.sub(/^\./, '')}_language".to_sym
unless instance_methods.map(&:to_sym).include?(name)
warn "Language##{name} was not defined"
end
end
end
end
require 'linguist/tokenizer'
module Linguist
# Language bayesian classifier.
class Classifier
# Public: Train classifier that data is a certain language.
#
# db - Hash classifier database object
# language - String language of data
# data - String contents of file
#
# Examples
#
# Classifier.train(db, 'Ruby', "def hello; end")
#
# Returns nothing.
#
# Set LINGUIST_DEBUG=1 or =2 to see probabilities per-token,
# per-language. See also dump_all_tokens, below.
def self.train!(db, language, data)
tokens = Tokenizer.tokenize(data)
db['tokens_total'] ||= 0
db['languages_total'] ||= 0
db['tokens'] ||= {}
db['language_tokens'] ||= {}
db['languages'] ||= {}
tokens.each do |token|
db['tokens'][language] ||= {}
db['tokens'][language][token] ||= 0
db['tokens'][language][token] += 1
db['language_tokens'][language] ||= 0
db['language_tokens'][language] += 1
db['tokens_total'] += 1
end
db['languages'][language] ||= 0
db['languages'][language] += 1
db['languages_total'] += 1
nil
end
# Public: Guess language of data.
#
# db - Hash of classifier tokens database.
# data - Array of tokens or String data to analyze.
# languages - Array of language name Strings to restrict to.
#
# Examples
#
# Classifier.classify(db, "def hello; end")
# # => [ 'Ruby', 0.90], ['Python', 0.2], ... ]
#
# Returns sorted Array of result pairs. Each pair contains the
# String language name and a Float score.
def self.classify(db, tokens, languages = nil)
languages ||= db['languages'].keys
new(db).classify(tokens, languages)
end
# Internal: Initialize a Classifier.
def initialize(db = {})
@tokens_total = db['tokens_total']
@languages_total = db['languages_total']
@tokens = db['tokens']
@language_tokens = db['language_tokens']
@languages = db['languages']
end
# Internal: Guess language of data
#
# data - Array of tokens or String data to analyze.
# languages - Array of language name Strings to restrict to.
#
# Returns sorted Array of result pairs. Each pair contains the
# String language name and a Float score.
def classify(tokens, languages)
return [] if tokens.nil?
tokens = Tokenizer.tokenize(tokens) if tokens.is_a?(String)
scores = {}
if verbosity >= 2
dump_all_tokens(tokens, languages)
end
languages.each do |language|
scores[language] = tokens_probability(tokens, language) +
language_probability(language)
if verbosity >= 1
printf "%10s = %10.3f + %7.3f = %10.3f\n",
language, tokens_probability(tokens, language), language_probability(language), scores[language]
end
end
scores.sort { |a, b| b[1] <=> a[1] }.map { |score| [score[0], score[1]] }
end
# Internal: Probably of set of tokens in a language occurring - P(D | C)
#
# tokens - Array of String tokens.
# language - Language to check.
#
# Returns Float between 0.0 and 1.0.
def tokens_probability(tokens, language)
tokens.inject(0.0) do |sum, token|
sum += Math.log(token_probability(token, language))
end
end
# Internal: Probably of token in language occurring - P(F | C)
#
# token - String token.
# language - Language to check.
#
# Returns Float between 0.0 and 1.0.
def token_probability(token, language)
if @tokens[language][token].to_f == 0.0
1 / @tokens_total.to_f
else
@tokens[language][token].to_f / @language_tokens[language].to_f
end
end
# Internal: Probably of a language occurring - P(C)
#
# language - Language to check.
#
# Returns Float between 0.0 and 1.0.
def language_probability(language)
Math.log(@languages[language].to_f / @languages_total.to_f)
end
private
def verbosity
@verbosity ||= (ENV['LINGUIST_DEBUG'] || 0).to_i
end
# Internal: show a table of probabilities for each <token,language> pair.
#
# The number in each table entry is the number of "points" that each
# token contributes toward the belief that the file under test is a
# particular language. Points are additive.
#
# Points are the number of times a token appears in the file, times
# how much more likely (log of probability ratio) that token is to
# appear in one language vs. the least-likely language. Dashes
# indicate the least-likely language (and zero points) for each token.
def dump_all_tokens(tokens, languages)
maxlen = tokens.map { |tok| tok.size }.max
printf "%#{maxlen}s", ""
puts " #" + languages.map { |lang| sprintf("%10s", lang) }.join
tokmap = Hash.new(0)
tokens.each { |tok| tokmap[tok] += 1 }
tokmap.sort.each { |tok, count|
arr = languages.map { |lang| [lang, token_probability(tok, lang)] }
min = arr.map { |a,b| b }.min
minlog = Math.log(min)
if !arr.inject(true) { |result, n| result && n[1] == arr[0][1] }
printf "%#{maxlen}s%5d", tok, count
puts arr.map { |ent|
ent[1] == min ? " -" : sprintf("%10.3f", count * (Math.log(ent[1]) - minlog))
}.join
end
}
end
end
end
module Linguist
class Generated
# Public: Is the blob a generated file?
#
# name - String filename
# data - String blob data. A block also maybe passed in for lazy
# loading. This behavior is deprecated and you should always
# pass in a String.
#
# Return true or false
def self.generated?(name, data)
new(name, data).generated?
end
# Internal: Initialize Generated instance
#
# name - String filename
# data - String blob data
def initialize(name, data)
@name = name
@extname = File.extname(name)
@_data = data
end
attr_reader :name, :extname
# Lazy load blob data if block was passed in.
#
# Awful, awful stuff happening here.
#
# Returns String data.
def data
@data ||= @_data.respond_to?(:call) ? @_data.call() : @_data
end
# Public: Get each line of data
#
# Returns an Array of lines
def lines
# TODO: data should be required to be a String, no nils
@lines ||= data ? data.split("\n", -1) : []
end
# Internal: Is the blob a generated file?
#
# Generated source code is suppressed in diffs and is ignored by
# language statistics.
#
# Please add additional test coverage to
# `test/test_blob.rb#test_generated` if you make any changes.
#
# Return true or false
def generated?
name == 'Gemfile.lock' ||
minified_files? ||
compiled_coffeescript? ||
xcode_project_file? ||
generated_parser? ||
generated_net_docfile? ||
generated_net_designer_file? ||
generated_protocol_buffer?
end
# Internal: Is the blob an XCode project file?
#
# Generated if the file extension is an XCode project
# file extension.
#
# Returns true of false.
def xcode_project_file?
['.xib', '.nib', '.storyboard', '.pbxproj', '.xcworkspacedata', '.xcuserstate'].include?(extname)
end
# Internal: Is the blob minified files?
#
# Consider a file minified if it contains more than 5% spaces.
# Currently, only JS and CSS files are detected by this method.
#
# Returns true or false.
def minified_files?
return unless ['.js', '.css'].include? extname
if data && data.length > 200
(data.each_char.count{ |c| c <= ' ' } / data.length.to_f) < 0.05
else
false
end
end
# Internal: Is the blob of JS generated by CoffeeScript?
#
# CoffeeScript is meant to output JS that would be difficult to
# tell if it was generated or not. Look for a number of patterns
# output by the CS compiler.
#
# Return true or false
def compiled_coffeescript?
return false unless extname == '.js'
# CoffeeScript generated by > 1.2 include a comment on the first line
if lines[0] =~ /^\/\/ Generated by /
return true
end
if lines[0] == '(function() {' && # First line is module closure opening
lines[-2] == '}).call(this);' && # Second to last line closes module closure
lines[-1] == '' # Last line is blank
score = 0
lines.each do |line|
if line =~ /var /
# Underscored temp vars are likely to be Coffee
score += 1 * line.gsub(/(_fn|_i|_len|_ref|_results)/).count
# bind and extend functions are very Coffee specific
score += 3 * line.gsub(/(__bind|__extends|__hasProp|__indexOf|__slice)/).count
end
end
# Require a score of 3. This is fairly arbitrary. Consider
# tweaking later.
score >= 3
else
false
end
end
# Internal: Is this a generated documentation file for a .NET assembly?
#
# .NET developers often check in the XML Intellisense file along with an
# assembly - however, these don't have a special extension, so we have to
# dig into the contents to determine if it's a docfile. Luckily, these files
# are extremely structured, so recognizing them is easy.
#
# Returns true or false
def generated_net_docfile?
return false unless extname.downcase == ".xml"
return false unless lines.count > 3
# .NET Docfiles always open with <doc> and their first tag is an
# <assembly> tag
return lines[1].include?("<doc>") &&
lines[2].include?("<assembly>") &&
lines[-2].include?("</doc>")
end
# Internal: Is this a codegen file for a .NET project?
#
# Visual Studio often uses code generation to generate partial classes, and
# these files can be quite unwieldy. Let's hide them.
#
# Returns true or false
def generated_net_designer_file?
name.downcase =~ /\.designer\.cs$/
end
# Internal: Is the blob of JS a parser generated by PEG.js?
#
# PEG.js-generated parsers are not meant to be consumed by humans.
#
# Return true or false
def generated_parser?
return false unless extname == '.js'
# PEG.js-generated parsers include a comment near the top of the file
# that marks them as such.
if lines[0..4].join('') =~ /^(?:[^\/]|\/[^\*])*\/\*(?:[^\*]|\*[^\/])*Generated by PEG.js/
return true
end
false
end
# Internal: Is the blob a C++, Java or Python source file generated by the
# Protocol Buffer compiler?
#
# Returns true of false.
def generated_protocol_buffer?
return false unless ['.py', '.java', '.h', '.cc', '.cpp'].include?(extname)
return false unless lines.count > 1
return lines[0].include?("Generated by the protocol buffer compiler. DO NOT EDIT!")
end
end
end
Loading
Loading
@@ -2,6 +2,9 @@ require 'escape_utils'
require 'pygments'
require 'yaml'
 
require 'linguist/classifier'
require 'linguist/samples'
module Linguist
# Language names that are recognizable by GitHub. Defined languages
# can be highlighted, searched and listed under the Top Languages page.
Loading
Loading
@@ -9,28 +12,22 @@ module Linguist
# Languages are defined in `lib/linguist/languages.yml`.
class Language
@languages = []
@overrides = {}
@index = {}
@name_index = {}
@alias_index = {}
@extension_index = {}
@filename_index = {}
@extension_index = Hash.new { |h,k| h[k] = [] }
@filename_index = Hash.new { |h,k| h[k] = [] }
@primary_extension_index = {}
 
# Valid Languages types
TYPES = [:data, :markup, :programming]
 
# Internal: Test if extension maps to multiple Languages.
# Names of non-programming languages that we will still detect
#
# Returns true or false.
def self.ambiguous?(extension)
@overrides.include?(extension)
end
# Include?: Return overridden extensions.
#
# Returns extensions Array.
def self.overridden_extensions
@overrides.keys
# Returns an array
def self.detectable_markup
["AsciiDoc", "CSS", "Creole", "Less", "Markdown", "MediaWiki", "Org", "RDoc", "Sass", "Textile", "reStructuredText"]
end
 
# Internal: Create a new Language object
Loading
Loading
@@ -43,18 +40,18 @@ module Linguist
 
@languages << language
 
# All Language names should be unique. Warn if there is a duplicate.
# All Language names should be unique. Raise if there is a duplicate.
if @name_index.key?(language.name)
warn "Duplicate language name: #{language.name}"
raise ArgumentError, "Duplicate language name: #{language.name}"
end
 
# Language name index
@index[language.name] = @name_index[language.name] = language
 
language.aliases.each do |name|
# All Language aliases should be unique. Warn if there is a duplicate.
# All Language aliases should be unique. Raise if there is a duplicate.
if @alias_index.key?(name)
warn "Duplicate alias: #{name}"
raise ArgumentError, "Duplicate alias: #{name}"
end
 
@index[name] = @alias_index[name] = language
Loading
Loading
@@ -62,33 +59,56 @@ module Linguist
 
language.extensions.each do |extension|
if extension !~ /^\./
warn "Extension is missing a '.': #{extension.inspect}"
raise ArgumentError, "Extension is missing a '.': #{extension.inspect}"
end
 
unless ambiguous?(extension)
# Index the extension with a leading ".": ".rb"
@extension_index[extension] = language
# Index the extension without a leading ".": "rb"
@extension_index[extension.sub(/^\./, '')] = language
end
@extension_index[extension] << language
end
 
language.overrides.each do |extension|
if extension !~ /^\./
warn "Extension is missing a '.': #{extension.inspect}"
end
@overrides[extension] = language
if @primary_extension_index.key?(language.primary_extension)
raise ArgumentError, "Duplicate primary extension: #{language.primary_extension}"
end
 
@primary_extension_index[language.primary_extension] = language
language.filenames.each do |filename|
@filename_index[filename] = language
@filename_index[filename] << language
end
 
language
end
 
# Public: Detects the Language of the blob.
#
# name - String filename
# data - String blob data. A block also maybe passed in for lazy
# loading. This behavior is deprecated and you should always
# pass in a String.
# mode - Optional String mode (defaults to nil)
#
# Returns Language or nil.
def self.detect(name, data, mode = nil)
# A bit of an elegant hack. If the file is executable but extensionless,
# append a "magic" extension so it can be classified with other
# languages that have shebang scripts.
if File.extname(name).empty? && mode && (mode.to_i(8) & 05) == 05
name += ".script!"
end
possible_languages = find_by_filename(name)
if possible_languages.length > 1
data = data.call() if data.respond_to?(:call)
if data.nil? || data == ""
nil
elsif result = Classifier.classify(Samples::DATA, data, possible_languages.map(&:name)).first
Language[result[0]]
end
else
possible_languages.first
end
end
# Public: Get all Languages
#
# Returns an Array of Languages
Loading
Loading
@@ -124,33 +144,22 @@ module Linguist
@alias_index[name]
end
 
# Public: Look up Language by extension.
#
# extension - The extension String. May include leading "."
#
# Examples
#
# Language.find_by_extension('.rb')
# # => #<Language name="Ruby">
#
# Returns the Language or nil if none was found.
def self.find_by_extension(extension)
@extension_index[extension]
end
# Public: Look up Language by filename.
# Public: Look up Languages by filename.
#
# filename - The path String.
#
# Examples
#
# Language.find_by_filename('foo.rb')
# # => #<Language name="Ruby">
# # => [#<Language name="Ruby">]
#
# Returns the Language or nil if none was found.
# Returns all matching Languages or [] if none were found.
def self.find_by_filename(filename)
basename, extname = File.basename(filename), File.extname(filename)
@filename_index[basename] || @extension_index[extname]
langs = [@primary_extension_index[extname]] +
@filename_index[basename] +
@extension_index[extname]
langs.compact.uniq
end
 
# Public: Look up Language by its name or lexer.
Loading
Loading
@@ -231,16 +240,18 @@ module Linguist
raise(ArgumentError, "#{@name} is missing lexer")
 
@ace_mode = attributes[:ace_mode]
@wrap = attributes[:wrap] || false
 
# Set legacy search term
@search_term = attributes[:search_term] || default_alias_name
 
# Set extensions or default to [].
@extensions = attributes[:extensions] || []
@overrides = attributes[:overrides] || []
@filenames = attributes[:filenames] || []
 
@primary_extension = attributes[:primary_extension] || default_primary_extension || extensions.first
unless @primary_extension = attributes[:primary_extension]
raise ArgumentError, "#{@name} is missing primary extension"
end
 
# Prepend primary extension unless its already included
if primary_extension && !extensions.include?(primary_extension)
Loading
Loading
@@ -320,6 +331,11 @@ module Linguist
# Returns a String name or nil
attr_reader :ace_mode
 
# Public: Should language lines be wrapped
#
# Returns true or false
attr_reader :wrap
# Public: Get extensions
#
# Examples
Loading
Loading
@@ -331,7 +347,7 @@ module Linguist
 
# Deprecated: Get primary extension
#
# Defaults to the first extension but can be overriden
# Defaults to the first extension but can be overridden
# in the languages.yml.
#
# The primary extension can not be nil. Tests should verify this.
Loading
Loading
@@ -343,11 +359,6 @@ module Linguist
# Returns the extension String.
attr_reader :primary_extension
 
# Internal: Get overridden extensions.
#
# Returns the extensions Array.
attr_reader :overrides
# Public: Get filenames
#
# Examples
Loading
Loading
@@ -377,13 +388,6 @@ module Linguist
name.downcase.gsub(/\s/, '-')
end
 
# Internal: Get default primary extension.
#
# Returns the extension String.
def default_primary_extension
extensions.first
end
# Public: Get Language group
#
# Returns a Language
Loading
Loading
@@ -441,11 +445,36 @@ module Linguist
def hash
name.hash
end
def inspect
"#<#{self.class} name=#{name}>"
end
end
 
extensions = Samples::DATA['extnames']
filenames = Samples::DATA['filenames']
popular = YAML.load_file(File.expand_path("../popular.yml", __FILE__))
 
YAML.load_file(File.expand_path("../languages.yml", __FILE__)).each do |name, options|
options['extensions'] ||= []
options['filenames'] ||= []
if extnames = extensions[name]
extnames.each do |extname|
if !options['extensions'].include?(extname)
options['extensions'] << extname
end
end
end
if fns = filenames[name]
fns.each do |filename|
if !options['filenames'].include?(filename)
options['filenames'] << filename
end
end
end
Language.create(
:name => name,
:color => options['color'],
Loading
Loading
@@ -453,12 +482,12 @@ module Linguist
:aliases => options['aliases'],
:lexer => options['lexer'],
:ace_mode => options['ace_mode'],
:wrap => options['wrap'],
:group_name => options['group'],
:searchable => options.key?('searchable') ? options['searchable'] : true,
:search_term => options['search_term'],
:extensions => options['extensions'],
:extensions => options['extensions'].sort,
:primary_extension => options['primary_extension'],
:overrides => options['overrides'],
:filenames => options['filenames'],
:popular => popular.include?(name)
)
Loading
Loading
This diff is collapsed.
require 'digest/md5'
module Linguist
module MD5
# Public: Create deep nested digest of value object.
#
# Useful for object comparison.
#
# obj - Object to digest.
#
# Returns String hex digest
def self.hexdigest(obj)
digest = Digest::MD5.new
case obj
when String, Symbol, Integer
digest.update "#{obj.class}"
digest.update "#{obj}"
when TrueClass, FalseClass, NilClass
digest.update "#{obj.class}"
when Array
digest.update "#{obj.class}"
for e in obj
digest.update(hexdigest(e))
end
when Hash
digest.update "#{obj.class}"
for e in obj.map { |(k, v)| hexdigest([k, v]) }.sort
digest.update(e)
end
else
raise TypeError, "can't convert #{obj.inspect} into String"
end
digest.hexdigest
end
end
end
require 'mime/types'
require 'yaml'
class MIME::Type
attr_accessor :override
end
# Register additional mime type extensions
#
# Follows same format as mime-types data file
# https://github.com/halostatue/mime-types/blob/master/lib/mime/types.rb.data
File.read(File.expand_path("../mimes.yml", __FILE__)).lines.each do |line|
# Regexp was cargo culted from mime-types lib
next unless line =~ %r{^
#{MIME::Type::MEDIA_TYPE_RE}
(?:\s@([^\s]+))?
(?:\s:(#{MIME::Type::ENCODING_RE}))?
}x
mediatype = $1
subtype = $2
extensions = $3
encoding = $4
# Lookup existing mime type
mime_type = MIME::Types["#{mediatype}/#{subtype}"].first ||
# Or create a new instance
MIME::Type.new("#{mediatype}/#{subtype}")
if extensions
extensions.split(/,/).each do |extension|
mime_type.extensions << extension
end
end
if encoding
mime_type.encoding = encoding
end
mime_type.override = true
# Kind of hacky, but we need to reindex the mime type after making changes
MIME::Types.add_type_variant(mime_type)
MIME::Types.index_extensions(mime_type)
end
module Linguist
module Mime
# Internal: Look up mime type for extension.
#
# ext - The extension String. May include leading "."
#
# Examples
#
# Mime.mime_for('.html')
# # => 'text/html'
#
# Mime.mime_for('txt')
# # => 'text/plain'
#
# Return mime type String otherwise falls back to 'text/plain'.
def self.mime_for(ext)
mime_type = lookup_mime_type_for(ext)
mime_type ? mime_type.to_s : 'text/plain'
end
# Internal: Lookup mime type for extension or mime type
#
# ext_or_mime_type - A file extension ".txt" or mime type "text/plain".
#
# Returns a MIME::Type
def self.lookup_mime_type_for(ext_or_mime_type)
ext_or_mime_type ||= ''
if ext_or_mime_type =~ /\w+\/\w+/
guesses = ::MIME::Types[ext_or_mime_type]
else
guesses = ::MIME::Types.type_for(ext_or_mime_type)
end
# Use custom override first
guesses.detect { |type| type.override } ||
# Prefer text mime types over binary
guesses.detect { |type| type.ascii? } ||
# Otherwise use the first guess
guesses.first
end
end
end
# Additional types to add to MIME::Types
#
# MIME types are used to set the Content-Type of raw binary blobs. All text
# blobs are served as text/plain regardless of their type to ensure they
# open in the browser rather than downloading.
#
# The encoding helps determine whether a file should be treated as plain
# text or binary. By default, a mime type's encoding is base64 (binary).
# These types will show a "View Raw" link. To force a type to render as
# plain text, set it to 8bit for UTF-8. text/* types will be treated as
# text by default.
#
# <type> @<extensions> :<encoding>
#
# type - mediatype/subtype
# extensions - comma seperated extension list
# encoding - base64 (binary), 7bit (ASCII), 8bit (UTF-8), or
# quoted-printable (Printable ASCII).
#
# Follows same format as mime-types data file
# https://github.com/halostatue/mime-types/blob/master/lib/mime/types.rb.data
#
# Any additions or modifications (even trivial) should have corresponding
# test change in `test/test_mime.rb`.
# TODO: Lookup actual types
application/octet-stream @a,blend,gem,graffle,ipa,lib,mcz,nib,o,ogv,otf,pfx,pigx,plgx,psd,sib,spl,sqlite3,swc,ucode,xpi
# Please keep this list alphabetized
application/java-archive @ear,war
application/netcdf :8bit
application/ogg @ogg
application/postscript :base64
application/vnd.adobe.air-application-installer-package+zip @air
application/vnd.mozilla.xul+xml :8bit
application/vnd.oasis.opendocument.presentation @odp
application/vnd.oasis.opendocument.spreadsheet @ods
application/vnd.oasis.opendocument.text @odt
application/vnd.openofficeorg.extension @oxt
application/vnd.openxmlformats-officedocument.presentationml.presentation @pptx
application/x-chrome-extension @crx
application/x-iwork-keynote-sffkey @key
application/x-iwork-numbers-sffnumbers @numbers
application/x-iwork-pages-sffpages @pages
application/x-ms-xbap @xbap :8bit
application/x-parrot-bytecode @pbc
application/x-shockwave-flash @swf
application/x-silverlight-app @xap
application/x-supercollider @sc :8bit
application/x-troff-ms :8bit
application/x-wais-source :8bit
application/xaml+xml @xaml :8bit
image/x-icns @icns
text/cache-manifest @manifest
text/plain @cu,cxx
text/x-logtalk @lgt
text/x-nemerle @n
text/x-nimrod @nim
text/x-ocaml @ml,mli,mll,mly,sig,sml
text/x-rust @rs,rc
text/x-scheme @rkt,scm,sls,sps,ss
require 'linguist/language'
require 'linguist/mime'
require 'pygments'
module Linguist
# Similar to ::Pathname, Linguist::Pathname wraps a path string and
# provides helpful query methods. Its useful when you only have a
# filename but not a blob and need to figure out the language of the file.
class Pathname
# Public: Initialize a Pathname
#
# path - A filename String. The file may or maybe actually exist.
#
# Returns a Pathname.
def initialize(path)
@path = path
end
# Public: Get the basename of the path
#
# Examples
#
# Pathname.new('sub/dir/file.rb').basename
# # => 'file.rb'
#
# Returns a String.
def basename
File.basename(@path)
end
# Public: Get the extname of the path
#
# Examples
#
# Pathname.new('.rb').extname
# # => '.rb'
#
# Pathname.new('file.rb').extname
# # => '.rb'
#
# Returns a String.
def extname
File.extname(@path)
end
# Public: Get the language of the path
#
# The path extension name is the only heuristic used to detect the
# language name.
#
# Examples
#
# Pathname.new('file.rb').language
# # => Language['Ruby']
#
# Returns a Language or nil if none was found.
def language
@language ||= Language.find_by_filename(@path)
end
# Internal: Get the lexer of the path
#
# Returns a Lexer.
def lexer
language ? language.lexer : Pygments::Lexer.find_by_name('Text only')
end
# Public: Get the mime type
#
# Examples
#
# Pathname.new('index.html').mime_type
# # => 'text/html'
#
# Returns a mime type String.
def mime_type
@mime_type ||= Mime.mime_for(extname)
end
# Public: Return self as String
#
# Returns a String
def to_s
@path.dup
end
def eql?(other)
other.is_a?(self.class) && @path == other.to_s
end
alias_method :==, :eql?
end
end
Loading
Loading
@@ -8,6 +8,8 @@
- C#
- C++
- CSS
- Clojure
- CoffeeScript
- Common Lisp
- Diff
- Emacs Lisp
Loading
Loading
@@ -25,5 +27,3 @@
- SQL
- Scala
- Scheme
- TeX
- XML
Loading
Loading
@@ -67,20 +67,20 @@ module Linguist
return if @computed_stats
 
@enum.each do |blob|
# Skip binary file extensions
next if blob.binary_mime_type?
# Skip files that are likely binary
next if blob.likely_binary?
 
# Skip vendored or generated blobs
next if blob.vendored? || blob.generated? || blob.language.nil?
 
# Only include programming languages
if blob.language.type == :programming
# Only include programming languages and acceptable markup languages
if blob.language.type == :programming || Language.detectable_markup.include?(blob.language.name)
@sizes[blob.language.group] += blob.size
end
end
 
# Compute total size
@size = @sizes.inject(0) { |s,(k,v)| s + v }
@size = @sizes.inject(0) { |s,(_,v)| s + v }
 
# Get primary language
if primary = @sizes.max_by { |(_, size)| size }
Loading
Loading
This diff is collapsed.
require 'yaml'
require 'linguist/md5'
require 'linguist/classifier'
module Linguist
# Model for accessing classifier training data.
module Samples
# Path to samples root directory
ROOT = File.expand_path("../../../samples", __FILE__)
# Path for serialized samples db
PATH = File.expand_path('../samples.json', __FILE__)
# Hash of serialized samples object
if File.exist?(PATH)
DATA = YAML.load_file(PATH)
end
# Public: Iterate over each sample.
#
# &block - Yields Sample to block
#
# Returns nothing.
def self.each(&block)
Dir.entries(ROOT).each do |category|
next if category == '.' || category == '..'
# Skip text and binary for now
# Possibly reconsider this later
next if category == 'Text' || category == 'Binary'
dirname = File.join(ROOT, category)
Dir.entries(dirname).each do |filename|
next if filename == '.' || filename == '..'
if filename == 'filenames'
Dir.entries(File.join(dirname, filename)).each do |subfilename|
next if subfilename == '.' || subfilename == '..'
yield({
:path => File.join(dirname, filename, subfilename),
:language => category,
:filename => subfilename
})
end
else
if File.extname(filename) == ""
raise "#{File.join(dirname, filename)} is missing an extension, maybe it belongs in filenames/ subdir"
end
yield({
:path => File.join(dirname, filename),
:language => category,
:extname => File.extname(filename)
})
end
end
end
nil
end
# Public: Build Classifier from all samples.
#
# Returns trained Classifier.
def self.data
db = {}
db['extnames'] = {}
db['filenames'] = {}
each do |sample|
language_name = sample[:language]
if sample[:extname]
db['extnames'][language_name] ||= []
if !db['extnames'][language_name].include?(sample[:extname])
db['extnames'][language_name] << sample[:extname]
db['extnames'][language_name].sort!
end
end
if sample[:filename]
db['filenames'][language_name] ||= []
db['filenames'][language_name] << sample[:filename]
db['filenames'][language_name].sort!
end
data = File.read(sample[:path])
Classifier.train!(db, language_name, data)
end
db['md5'] = Linguist::MD5.hexdigest(db)
db
end
end
end
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment