Commit Graph

30 Commits

Author SHA1 Message Date
Haiyan Meng
145ba0c7ff Avoid reprocess queries whose range size is 0 2020-06-23 11:36:29 -07:00
Haiyan Meng
f5419e9f72 Check the incomplete_results field of github query responses
Currently, we don't check the `incomplete_results` field of a github
query response, which is problematic when incomplete query results are
used to split the query ranges: the splitted query ranges will
be very wild.
2020-02-03 09:59:52 -08:00
Haiyan Meng
7a87c84403 Reprocess the github filesize search ranges which have more than 1000 items 2020-02-03 09:59:52 -08:00
Jeff Regan
0ce076758d Merge pull request #2150 from haiyanmeng/stats
Add `fileType` and `User` into the index
2020-01-28 09:18:31 -08:00
HowJMay
00f68c12a8 fix typos
Fix typos
2020-01-23 23:35:38 +08:00
Haiyan Meng
0820865e1d Retry FindRangesForRepoSearch 2020-01-22 10:13:57 -08:00
Haiyan Meng
1120c6bc7a Add a User field into Document to make it easy to aggregate on github
user level.
2020-01-21 10:09:52 -08:00
Haiyan Meng
f4636f8555 Add a fileType field into the index 2020-01-17 13:15:49 -08:00
Haiyan Meng
cf8d53a195 Move SeenMap to the utils dir 2020-01-15 15:29:16 -08:00
Haiyan Meng
2e895c147e Use log.Print* instead of fmt.Print* 2020-01-14 15:50:35 -08:00
Haiyan Meng
72eda992bd make seen a non-primitive type 2020-01-14 12:14:00 -08:00
Haiyan Meng
230e0ca752 Add two methods to type RangeQueryResult: Add and String 2020-01-14 12:14:00 -08:00
Haiyan Meng
14eb524b9e Add a command for searching for kustomize resource files 2020-01-14 12:14:00 -08:00
Haiyan Meng
81d62f90bf Improve the efficency of crawling github
Make sure a github file is crawled once
2020-01-14 12:14:00 -08:00
Haiyan Meng
b154af8be4 Check the error of closing response body 2020-01-08 10:32:12 -08:00
Haiyan Meng
ccd129f7a5 Check empty http response before accessing it 2020-01-08 10:24:00 -08:00
Haiyan Meng
142c105500 SKip the empty resource/base item in a kustomization file and set the
defaultBranch if needed
2020-01-06 12:06:18 -08:00
Haiyan Meng
ee659a70e4 Fix how to construct URLs for finding all the commits related to a
github file

The existing logic sets the creation time of a github file to the time
when the github repository was created.
The fix sets the creation time of a github file to the time when the
file was created.
2020-01-06 12:06:18 -08:00
Haiyan Meng
2c2aa928cc Delete non-existing documents from the index 2019-12-18 15:56:44 -08:00
Haiyan Meng
8c89f0946c Avoid to index a document if FetchDcoument or SetCreated fails 2019-12-18 15:56:44 -08:00
Haiyan Meng
e44d1298df Return errors if http Client.Do resp status code is not 2xx 2019-12-18 15:56:44 -08:00
Haiyan Meng
a9244f759e Add supports for crawling a specific git user or repo 2019-12-13 11:18:33 -08:00
Haiyan Meng
d9239104aa Escape spaces in the query paths of git commit requests 2019-12-12 10:03:15 -08:00
Haiyan Meng
0d79219e46 Avoid processing the nil pointer returned by kustomizationResultAdapter
Currently, the crawler job panics whenever a nil pointer is returned by
kustomizationResultAdapter.
2019-12-11 13:54:01 -08:00
Haiyan Meng
bffc0d7071 Mulitple improvements of the crawler
1) Set document IDs to avoid duplicating documents;
2) Set the `creationTime` field of each document in the index;
3) set the `values`, `kinds` and `identifiers` fields for all documents;
4) Add a `Copy` method into the `Document` struct: this fixes the issue
where all the documents existing in the index point to the same Document
object;
5) Avoid using keystore redis;
6) Set imagePullPolicy to `Always` for crawler jobs.
2019-12-11 11:10:48 -08:00
Jeffrey Regan
e9ab3da164 Fix some nits in the crawler and elsewhere. 2019-12-03 10:44:44 -08:00
Haiyan Meng
84b75afae4 Make the crawler work
1) add the crawler binary and fix the crawler library
2) remove the readiness probe in the search backend
3) add config for redis keystore
4) add github_api_secret.txt file with instructions
2019-11-26 09:50:51 -08:00
Haiyan Meng
9255c991f4 Replace the sigs.k8s.io/kustomize/hack/crawl/* import path with
`sigs.k8s.io/kustomize/api/internal/crawl/*`
2019-11-14 13:38:18 -08:00
Haiyan Meng
d08140d3f7 Remove api/internal/hack/crawl/crawler/git dir, use api/internal/git
instead.
2019-11-14 13:35:00 -08:00
Haiyan Meng
f69d2d2e69 Move hack/crawl under api/internal 2019-11-14 13:17:28 -08:00