Commit Graph

166 Commits

Author SHA1 Message Date
Yujun Zhang
ff6250cdb4 Allow loading file from http 2020-02-29 16:19:21 +08:00
Haiyan Meng
b7b7a5a79f Fix typo 2020-02-10 15:44:51 -08:00
Haiyan Meng
807ca9c1e3 Add notes on backup and restore 2020-02-10 08:30:08 -08:00
Haiyan Meng
baccf58ccf Avoid tracking the change in github_api_secret.txt
This helps prevent committing your GitHub personal access token into
GitHub by accident.
2020-02-05 12:06:21 -08:00
Haiyan Meng
c7bdb3fbe4 Add cmds to process the kustomize-stats log 2020-02-05 11:04:59 -08:00
Haiyan Meng
967fe44e3f Add curl commands for kustomize stats 2020-02-05 11:04:59 -08:00
Haiyan Meng
d0602c732b Remove the usage of github access token from the kustomize-stats job 2020-02-05 11:04:59 -08:00
Haiyan Meng
a4179fa87f Use the silent mode of curl 2020-02-05 11:04:59 -08:00
Haiyan Meng
c9bce3fc0a Add comments on backup and restore 2020-02-05 11:04:59 -08:00
Haiyan Meng
3ebeebabde Add comments for backup and restore 2020-02-03 12:37:18 -08:00
Haiyan Meng
a3b3449b1f Add curl commands for generator/transformer exploration 2020-02-03 09:59:52 -08:00
Haiyan Meng
1b8488da2c Add curl commands for snapshotting 2020-02-03 09:59:52 -08:00
Haiyan Meng
f5419e9f72 Check the incomplete_results field of github query responses
Currently, we don't check the `incomplete_results` field of a GitHub
query response, which is problematic when incomplete query results are
used to split the query ranges: the resulting ranges will be wildly
inaccurate.
2020-02-03 09:59:52 -08:00
Haiyan Meng
7a87c84403 Reprocess the github filesize search ranges which have more than 1000 items 2020-02-03 09:59:52 -08:00
Haiyan Meng
0fcb3a014c Add config for index backup and restore 2020-02-03 09:59:52 -08:00
Haiyan Meng
0b38e6d284 Improve the analysis on generator and transformer 2020-02-03 09:59:52 -08:00
Haiyan Meng
d5c66cb3d4 Add KustomizationDocument.Copy method 2020-02-03 09:59:52 -08:00
Haiyan Meng
b35b5aa73d Check the checksums of documents in the index 2020-02-03 09:59:52 -08:00
Haiyan Meng
bb409a5ea8 Set up cronjob to run crawler every 7 days 2020-02-03 09:59:52 -08:00
Haiyan Meng
74e1b5d54b Add GCP service account into ESCluster config
This is necessary for index backup into GCS and index recovery from GCS
2020-02-03 09:59:52 -08:00
Jeff Regan
0ce076758d Merge pull request #2150 from haiyanmeng/stats
Add `fileType` and `User` into the index
2020-01-28 09:18:31 -08:00
Haiyan Meng
154208d331 Improve the efficiency of crawling github by skipping the documents
already in the index
2020-01-24 19:55:56 -08:00
Haiyan Meng
b7b88cae76 Add curl commands for querying different filetypes 2020-01-23 16:04:55 -08:00
HowJMay
00f68c12a8 fix typos
Fix typos
2020-01-23 23:35:38 +08:00
Haiyan Meng
0820865e1d Retry FindRangesForRepoSearch 2020-01-22 10:13:57 -08:00
Haiyan Meng
1120c6bc7a Add a User field into Document to make it easy to aggregate at the
GitHub user level.
2020-01-21 10:09:52 -08:00
Haiyan Meng
96ee9e9146 Add curl ElasticSearch cmd for using filter and range together 2020-01-17 15:49:14 -08:00
Haiyan Meng
377eb5b66d Fix the regexp for determining kustomization file 2020-01-17 15:48:38 -08:00
Haiyan Meng
f4636f8555 Add a fileType field into the index 2020-01-17 13:15:49 -08:00
Haiyan Meng
9f80da28ae Refactor the stats code for generators and transformers 2020-01-16 09:20:24 -08:00
Haiyan Meng
5477bde7e5 Use an env variable for index name and fix the call to NewKustomizeIndex in backend 2020-01-15 15:29:17 -08:00
Haiyan Meng
3ead42fe27 Add --index flag to kustomize_stats config file 2020-01-15 15:29:16 -08:00
Haiyan Meng
cf8d53a195 Move SeenMap to the utils dir 2020-01-15 15:29:16 -08:00
Haiyan Meng
aaaba99389 Use Document.Path instead of its fields 2020-01-15 12:10:08 -08:00
Haiyan Meng
29e50ab476 Collect stats on generators and transformers 2020-01-15 12:10:08 -08:00
Haiyan Meng
3519cc56a1 Add support to get files referenced in the generators and transformers
fields
2020-01-15 12:10:08 -08:00
Haiyan Meng
2e895c147e Use log.Print* instead of fmt.Print* 2020-01-14 15:50:35 -08:00
Haiyan Meng
af131c7471 Use flags to specify crawling mode and github user/repo info 2020-01-14 15:36:12 -08:00
Haiyan Meng
7ac573ae51 Add a flag to specify the index name 2020-01-14 14:25:29 -08:00
Haiyan Meng
bb09f82f3c Remove kustomize-index-name setting 2020-01-14 13:53:16 -08:00
Haiyan Meng
72eda992bd make seen a non-primitive type 2020-01-14 12:14:00 -08:00
Haiyan Meng
230e0ca752 Add two methods to type RangeQueryResult: Add and String 2020-01-14 12:14:00 -08:00
Haiyan Meng
14eb524b9e Add a command for searching for kustomize resource files 2020-01-14 12:14:00 -08:00
Haiyan Meng
81d62f90bf Improve the efficiency of crawling GitHub
Make sure a GitHub file is crawled only once
2020-01-14 12:14:00 -08:00
Kubernetes Prow Robot
1a330f89d9 Merge pull request #2080 from yujunz/git-cloner
Simplify git cloner logic
2020-01-13 15:23:11 -08:00
Haiyan Meng
569fafba81 Add the Document ID pointing to a kustomization root into cache to
avoid crawling it repeatedly
2020-01-11 15:32:25 -08:00
Yujun Zhang
ae458d0c80 Simplify git cloner logic
Related to #2072
2020-01-11 20:40:55 +08:00
Haiyan Meng
c801958d40 Log response status code to help debug
Recently, the crawler job often fails after 10+ hours with the following
error (10.0.47.27:9200 is the ElasticSearch master):
dial tcp 10.0.47.27:9200: connect: connection refused
2020-01-10 11:37:22 -08:00
Haiyan Meng
f9a4d5a14e Track the crawling process 2020-01-10 11:10:38 -08:00
Jeff Regan
9555095de9 Merge pull request #2016 from haiyanmeng/stats
Add a binary for generating the stats of the index
2020-01-09 13:11:50 -08:00