- Increase logging signal to noise ratio.
- Allow to specify the `http.Client` for github requests. (This allows
the use of caching http.Clients).
- Clean up implementation.
Binary searches through different ranges of file sizes to create search
queries with fewer than 1000 results. This is required since github will
only return the first 1000 result of any search query.
The implementation handles the case where some files may be deleted
while the search is running, and (possibly artificially) assures that the
number of files increases monotonically as the filesize range grows.
The implementation also caches queries and is expected to make fewer
than O((#files/1000) * lg(max file size)) API calls to retrieve the
range queries that can be used to index all of the files.
In practice running the search splitting takes a few minutes, while
retrieving all of the data takes a few hours.
Currently I've left the search splitting by file size out of this commit
since it's ~200 lines of logic, and I think it's best to get it reviewed
separately.
In it's current state the crawler would only be able to get the last
1000 indexed files by Github, but it does show the general structure of
how the crawler is implemented.
To make this easier to read, use, and modify, I've abstracted the
important parts of the github query api into crawler/github/query.go
which allows to describe at a high level what is to be searched without
knowing the API syntax.
`crawler.Crawler` interface is defined, where a crawler has to implement
a `Crawl` method that forwards document found by the crawler to a channel.
A helper function that launches a list of crawlers concurrently and
merges their channels into one main output channel, forwarding errors is
also implemented.
Finally, a test that verifies the correctness and concurrency of the
helper method is provided.
This patch removes a layer of looping in the name reference candiate
resmap building process by not checking if the resources already exist
in the new resmap.