diff --git a/internal/tools/README.md b/internal/tools/README.md new file mode 100644 index 000000000..058753d3f --- /dev/null +++ b/internal/tools/README.md @@ -0,0 +1,428 @@ +## What is this? +### In short +Be the GoDoc.org of k8s configuration files. + +### More explicitly +Support k8s document indexing from open-source configurations in order to make +it easy for people to learn to use a new feature, explore k8s configs in a +central hub, and see some metrics about kustomize use. + +We want people to be able to support three main classes of queries: + +1. Structured document queries: how should I use the following fields + - Grace periods: `spec:template:spec:terminationGracePeriod`? + - Kustomize inline patch: `patches:patch`? + +2. Key value queries: how should I use this more specific use case of a + structure configuration. + - HorizontalPodAutoScalers: `kind=HorizontalPodAutoScaler`? + - Patches on StatefulSets: `patches:target:kind=StatefulSet`? + +3. Full text search: search the comments and the document text from any + type of k8s config file. + +## Road map +There is a lot that can be added in order to improve the state of this +application. Some more details along with general thoughts and comments can be +found in the Roadmap.md file in this directory. This README contains only +what can be considered as mostly complete and iterable parts of this project. + +## Running this project +Everything is configured using kubernetes, so it should be easy for people to +spin this up on any k8s cluster. Everything should just work (TM). + +The config files live in the `config` directory. + +``` +config +├── base +│   └── kustomization.yaml +├── crawler +│   ├── base +│   │   ├── github_api_secret.txt +│   │   └── kustomization.yaml +│   ├── cronjob +│   │   ├── cronjob.yaml +│   │   └── kustomization.yaml +│   └── job +│   ├── job.yaml +│   └── kustomization.yaml +├── elastic +│   └── ... 
+├── redis +│   ├── document_keystore +│   │   ├── kustomization.yaml +│   │   ├── redis.yaml +│   │   └── service.yaml +│   └── http_cache +│      ├── kustomization.yaml +│      ├── redis.yaml +│      └── service.yaml +├── webapp +│ ├── backend +│ │   ├── deployment.yaml +│ │   ├── kustomization.yaml +│ │   └── service.yaml +│ └── frontend +│    ├── deployment.yaml +│    ├── kustomization.yaml +│    └── service.yaml +└── schema_files +    └── kustomization_index +       ├── es_index_mappings.json +       └── es_index_settings.json +``` + +To get everything up and running you have to: + +1. Get some instance of elasticsearch working... and configure the + configmapGenerator in `config/base` to point to the right endpoint(s). The + configurations that need this value to be populated are the following: + - `config/crawler/cronjob` to run periodic crawls. + - `config/crawler/job` to run crawls on demand. + - `config/webapp/backend` to run the search server. + +2. Configure the elasticsearch indices: +``` +kustomize build config/schema_files/kustomization_index | kubectl apply -f - +``` +This will run a `curl` command that reads json data from a ConfigMap. This will +setup the schema. If you want to make more complex modifications to the +schema, you should refer to the elastic docs to figure out whether the mapping +can be added to the current index, or whether you will need to copy the +existing index into a different one with the appropriate mappings. Modifications +can be made by using the elasticsearch go library and writing a simple program, +or it can be made with any http command to the appropriate server endpoint from +within the cluster. Unfortunately I did not have the time to write a few helper +tools for this. Feel free to contact me if you need help with modifying +elasticsearch configs, I'm by no means an expert, but I can try to help. + +3. 
(Optional) Run the redis http cache for the crawler: +``` +kubectl apply -k config/redis/http_cache +``` + This will create a deployment for the cache, and a service. The crawler should + be configured to connect to the `http_cache` if it exists, but you can always + check the logs to make sure it connects, and that the identifiers match in the + crawler configuration and for the service endpoint. + + Please be aware that the cache does not have a persistent volume. + +4. Configure the main redis instance: +``` +kubectl apply -k config/redis/document_keystore +``` + This will create a StatefulSet with a volume of 4GiB for a redis instance. + +5. Get an access token from GitHub. + +To be able to kindly ask GitHub for its data on k8s config files, you'll need +to create an access\_token. From my understanding, this is the only way to do +these code search queries (without first specifying a repository). + +To generate a token, go to your GitHub account, under Settings > Developer +Settings > Personal access tokens. It should look like this. + +![GitHub Token 1]( +https://sigs.k8s.io/kustomize/internal/tools/pictures/github_token.png) + +From here you want to generate a new token with the following +configuration: + +![GitHub Token 2]( +https://sigs.k8s.io/kustomize/internal/tools/pictures/token_config.png) + +If you have uses for any other data from this token (org data, or something +else) you can pick and choose, but be careful since it can grant this +application access to your notifications, etc. However, any such extension +is explicitly a non-goal and would not be maintained by this project. + +6. Launch the crawler: +``` +kustomize build config/crawler/cronjob | kubectl apply -f - +``` +This will run the crawler every day according to the cron timing +rules in the cronjob.yaml file. 
+ +Instead, to get the crawler running now, you can run: +``` +kustomize build config/crawler/job | kubectl apply -f - +``` +which will launch a non-periodic version of the crawler. It will take a few +minutes for the crawler to split the search, but then config files should +start to get populated within 20 minutes. It may take a while to do the +first crawl, since it has to fetch rate-limited endpoints for each new file it +finds. It should get significantly faster to update in the future. + +7. Launch the search backend +``` +kustomize build config/webapp/backend | kubectl apply -f - +``` + +8. Launch the search frontend +``` +kustomize build config/webapp/frontend | kubectl apply -f - +``` + +## Notes about the components + +### Elasticsearch +I will add a basic working setup soon. I just did the lazy thing and used an +already packaged solution. Most clouds will provide their own elastic +environments; however, Elasticsearch is also working on its own +implementation of a +[k8s operator](https://www.elastic.co/elasticsearch-kubernetes), which might +be worth checking out. Please note that it comes with its own license +agreement. + +### Redis +There are two Redis instances that are used in this application. + +One of them is configured to have on-disk persistence, so make sure to have +that set up in your kubernetes cluster. Also note that it is running on a +single master node (i.e. it does not automatically shard keys to multiple head +nodes as part of a highly available cluster). Since it's storing a sparse +graph, I can't imagine this being much of an issue, but it's probably worth +mentioning. + +The other Redis instance is running as an HTTP (RFC 7234) cache for etags from +GitHub (or any other document store from which we could crawl/index). This one +does not require full persistent storage on disk. The caching strategy is an +LRU cache which is probably a good starting point. 
It might be worth it to +investigate other cache policies, but I think LRU will work well since +documents may or may not expire anyway, and the amount of memory allocated for +keys is fairly large, so eviction of frequently used documents seems +unlikely. + +### Nginx + Angular +There is a Dockerfile included for generating the container image with Nginx +(using the default package) and adding all of the supporting compiled angular +files. Any modifications to the code-base should be compatible with this setup, +so all that's needed is to rebuild the container image, and possibly modify +the image tags in the k8s file. + +### Supporting Go binaries +There are a few go binaries that each have their own Dockerfile to build +containers in which to run them on k8s, namely the crawler and the search +service. Their configurations are not optimal (read: they need to be cleaned up), +but they are functional. + +## Technical details + +### Overall design and implementation + +There are a few components that are all running together in order to get +the overall application to work smoothly. This section will provide a brief +overview of each component with the following sections going into more details. + +The overall structure is outlined in the following figure: +![overview]( +https://sigs.k8s.io/kustomize/internal/tools/pictures/sys_arch.png) + +#### Crawler +The leftmost component, a crawler with an http cache of GitHub +queries, does two things. It first looks at the list of documents in +elasticsearch and tries to update them. In doing so, it maintains a set of +newly updated files to exclude them from other parts of the crawl. + +To find newly added documents, the crawler crawls any new dependencies +introduced in the document updating step and it also queries GitHub for the +most recently indexed kustomization.\* files. Each new file will be processed +for efficient text queries and put into the document index. 
Any new dependency +will also incur more crawl operations. Finally, a graph +representation of the documents and their dependencies is built in Redis to be +used for graph algorithms such as PageRank and component analysis. + +#### Data library +There are a few helper libraries for dealing with Elasticsearch, Redis and +documents. They are not persistent, nor are they centralized. They act as small +components that help to package common pieces of code. Eventually it may make +sense to merge all of it together and make a proper persistent model around +this while providing an external API for document insertion/deletion. But +that is definitely out of scope in terms of getting this to run. However, +there are limitations with the current model in terms of minimizing the +API surface for the different components of the application. For now this +problem is mostly mitigated by having the query server only connected to +a data node of the Elasticsearch cluster, but the problem of knowing what +is accessible and what isn't is left to the programmer instead of being +clearly and explicitly supported by the API. + +#### Server +Uses the data library to communicate with the data store and answer queries. +Processes the user-entered text queries into somewhat optimized elasticsearch +queries. Provides a few endpoints to get different metrics and to eventually +allow for registration of remote repositories. + +This application is exposed through a service so that users can submit +queries and receive the results. + +#### Nginx + Angular +Communicates directly with the backend server to forward user queries and +their results. Presents the results on an interface. It's still pretty simple +looking but it seems usable (to me). + + +### Crawling GitHub +With the use of API keys, GitHub allows account owners to search for files +using their API. + +The search endpoints allow for the use of metadata search +that is fairly useful/powerful. 
For instance, they provide a `filename:` keyword +that permits us to look for `kustomization.yaml`, `kustomization.yml`, etc. +This enables the fetching of a list of kustomization documents, from which +we can get the actual content from another endpoint +(raw.githubusercontent.com). + +However, the search API is fairly limited. There is a limit on the number +of documents that can be retrieved with this method. One possible way to +mitigate this would be to periodically query GitHub for results, sorted by the +last indexed time. This would allow you to collect most documents from this +point forwards. The downside to this is that it may require a large number of +requests to their API since you cannot know when new files will be added. +Furthermore, there is a possibility that you would not be able to get all of +the files either, depending on the velocity of growth. + +The approach that was taken to mitigate this is to use the `filesize:` keyword +and to shard the search space into contiguous buckets of appropriate size in +order to get all of the documents. This is fairly efficient, since you can find +a good enough way to shard the documents in +`lg(max file size) * number of documents / 1000` API queries. Moreover, since +queries are paginated with at most 100 results per query, this solution is +competitive with getting the optimal (non-contiguous) sharding of result sets. +Furthermore, filesize queries can be cached to minimize the total number of +queries made to the API in order to shard the search space. This is done by +querying for file size intervals that always start with 0..X and binary +searching over the `filesize:` space. This will allow you to reuse a lot of +queries when you're looking for the next range, since the search for the next +boundary is bounded above and below by ranges that have already been +queried, so fewer new queries are needed. 
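
The sharding idea can be sketched as follows. This is only an illustration under assumptions, not the crawler's actual code: `CountFunc` is a hypothetical stand-in for a search query that reports how many documents fall in the size interval `0..X`, `Shard` and `countFromSizes` are made-up names, and the per-shard `limit` would be the 1000-result figure mentioned above.

```go
package main

import "fmt"

// CountFunc is a hypothetical stand-in for an API query that reports how
// many documents have a file size in the interval 0..hi.
type CountFunc func(hi int) int

// Shard splits [0, maxSize] into contiguous size ranges holding at most
// limit documents each. Counts for 0..X are cached so later binary
// searches can reuse queries made by earlier ones.
func Shard(countUpTo CountFunc, maxSize, limit int) [][2]int {
	cache := map[int]int{}
	cached := func(hi int) int {
		if hi < 0 {
			return 0
		}
		if c, ok := cache[hi]; ok {
			return c
		}
		c := countUpTo(hi)
		cache[hi] = c
		return c
	}

	var shards [][2]int
	for lo := 0; lo <= maxSize; {
		base := cached(lo - 1)
		// Binary search for the largest hi whose range lo..hi fits the limit.
		// A single size is the smallest possible shard, even if it overflows.
		best := lo
		l, r := lo, maxSize
		for l <= r {
			mid := l + (r-l)/2
			if cached(mid)-base <= limit {
				best = mid
				l = mid + 1
			} else {
				r = mid - 1
			}
		}
		shards = append(shards, [2]int{lo, best})
		lo = best + 1
	}
	return shards
}

// countFromSizes fakes the remote API with an in-memory list of file sizes.
func countFromSizes(sizes []int) CountFunc {
	return func(hi int) int {
		n := 0
		for _, s := range sizes {
			if s <= hi {
				n++
			}
		}
		return n
	}
}

func main() {
	sizes := []int{1, 1, 2, 3, 5, 8, 8, 9}
	fmt.Println(Shard(countFromSizes(sizes), 10, 3)) // prints [[0 2] [3 7] [8 10]]
}
```

With the cache in place, each boundary search can reuse prefix counts that earlier searches already paid for.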
I think this is only true because filesizes are power law distributed, +so searches will typically require fewer queries as they progress from left to +right. + +However, this method in no way depends on intervals of the form 0..X, as +the number of documents in the many intervals of the range search could be +added together to also make this work. This approach just seemed simpler to +implement, maintain, and debug so it was preferred. + +To get an idea of how efficient this method is, to shard the search space of +7000 documents, it will only take ~90 API range queries which should only take +a few minutes, while actually fetching the documents and their relevant +metadata (creation time, etc.) will take several hours. Furthermore, this +could be made more efficient if a prior distribution is approximated. +This prior could be scaled to the number of documents that need to be fetched, +and then finding a shard that has an adequate number of requests will only +take a few queries per shard. It could probably be supported in a constant +number of size queries if the size of each shard is halved, which shouldn't +have a terrible performance impact on the retrieval. However, there were +more pressing things to implement. I might revisit this later. + +### Document Indexing and Processing +In order to support simple text queries the structured documents must be +processed in some way that makes searching them easy. The current method +is to recursively traverse the map of configurations to generate each sub-path +and each key-value pair for the leaf nodes of the recursion tree. + +However, note that this means that a document has to be valid yaml/json +in order for indexing to happen. The rest of the document is treated +as mostly text and uses default text settings from Elasticsearch. 
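
The recursive traversal described above can be sketched roughly as follows. This is a minimal illustration and not necessarily how the indexer is written: the names `flatten` and `Flatten` are hypothetical, the input is assumed to be a document already decoded into generic maps and slices, and list elements are assumed to share their parent's path.

```go
package main

import (
	"fmt"
	"sort"
)

// flatten walks a decoded yaml/json document and records every sub-path
// ("a:b:c") and every leaf "path=value" pair. The sets drop duplicates.
func flatten(node interface{}, path string, ids, vals map[string]struct{}) {
	switch n := node.(type) {
	case map[string]interface{}:
		for k, v := range n {
			child := k
			if path != "" {
				child = path + ":" + k
			}
			ids[child] = struct{}{}
			flatten(v, child, ids, vals)
		}
	case []interface{}:
		// List elements keep the path of the list itself.
		for _, v := range n {
			flatten(v, path, ids, vals)
		}
	default:
		if path != "" {
			vals[fmt.Sprintf("%s=%v", path, n)] = struct{}{}
		}
	}
}

// sortedKeys returns the members of a set in lexicographic order.
func sortedKeys(set map[string]struct{}) []string {
	out := make([]string, 0, len(set))
	for k := range set {
		out = append(out, k)
	}
	sort.Strings(out)
	return out
}

// Flatten returns the deduplicated, sorted identifiers and values of doc.
func Flatten(doc map[string]interface{}) (identifiers, values []string) {
	ids := map[string]struct{}{}
	vals := map[string]struct{}{}
	flatten(doc, "", ids, vals)
	return sortedKeys(ids), sortedKeys(vals)
}

func main() {
	doc := map[string]interface{}{
		"resources": []interface{}{"service.yaml", "deployment.yaml"},
		"patchesJson6902": []interface{}{
			map[string]interface{}{
				"target": map[string]interface{}{"kind": "StatefulSet"},
				"path":   "ss-patch.yaml",
			},
		},
	}
	ids, vals := Flatten(doc)
	fmt.Println(ids)
	fmt.Println(vals)
}
```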
+ +What this means is that for the following yaml document: + +```yaml +resources: +- service.yaml +- deployment.yaml + +configmapGenerator: +- name: app-configuration + files: + - config.yaml + +patchesJson6902: +- target: + version: v1 + kind: StatefulSet + name: ss-name + path: ss-patch.yaml +- target: + version: v1 + kind: Deployment + name: dep-name + path: dep-patch.yaml +``` + +the flattened structure would look like: +```json +{ + "identifiers": [ + "resources", + "configmapGenerator", + "configmapGenerator:name", + "configmapGenerator:files", + "patchesJson6902", + "patchesJson6902:target", + "patchesJson6902:target:version", + "patchesJson6902:target:kind", + "patchesJson6902:target:name", + "patchesJson6902:path" + ], + "values": [ + "resources=service.yaml", + "resources=deployment.yaml", + "configmapGenerator:name=app-configuration", + "configmapGenerator:files=config.yaml", + "patchesJson6902:target:version=v1", + "patchesJson6902:target:kind=StatefulSet", + "patchesJson6902:target:name=ss-name", + "patchesJson6902:path=ss-patch.yaml", + "patchesJson6902:target:kind=Deployment", + "patchesJson6902:target:name=dep-name", + "patchesJson6902:path=dep-patch.yaml" + ], + ... +} +``` + +Note that repeated paths and values are deduplicated. + +On the search side, exact queries will be prioritized, but the document paths +and key=value pairs will also be analyzed with 3-grams to have some amount of +fuzzy search. Levenshtein distance was not used instead because the search +covers multiple fields at the same time, which is a use case where +Elasticsearch does not support proper fuzzy searching. + +### Document Search +Given a text query, each token is considered separately. Each token will be fed +through a handful of analyzers on the Elasticsearch side, and will be compared +with the inverted index of each document field. Elasticsearch will then determine +the best matching documents. Text ordering is largely insignificant. 
This makes +sense for the structured search, but may leave room for improvement for the +text only search within the document. + +Each token _must_ be matched, so each white space character acts as a +conjunction of individual queries. There are also ways of telling +Elasticsearch that some things _should_ match, but I think for now it makes +more sense to leave it as is. + +I think this behavior is sufficient to make the search feel fairly intuitive +while providing support for fairly complex use cases. + +### Metrics Computation +From each kustomization document that is indexed, we can find its +resources that are publicly available. This includes other kustomizations. +From this, we can build a directed graph of dependencies and reverse +dependencies. + +This opens up the possibility to add a plethora of graph metrics that can +give the project maintainers feedback and insight into how people are using +their tools. + +Some of these are useful for getting an idea of how large the dependency +graphs actually grow in practice, and can be used to find _popular_ +kustomizations within the corpus. This lends itself to implementing PageRank +to help bubble up popular results as good search results. I unfortunately +did not have the time to implement the algorithm, but I do plan to revisit +this sometime soon to add a few good and efficient implementations of +useful graph algorithms. See ROADMAP.md for a more +complete list of features that could be added and how I think they could be +implemented. diff --git a/internal/tools/ROADMAP.md b/internal/tools/ROADMAP.md new file mode 100644 index 000000000..8fdcb0b81 --- /dev/null +++ b/internal/tools/ROADMAP.md @@ -0,0 +1,176 @@ +# Road map and comments about this work + +From working on this project, here is a collection of thoughts and suggestions +for future improvements. 
For any questions about this, or to request help, do +not hesitate to contact @damienr74 on GitHub; my email should be listed. + +I think this project has the potential to help the K8s community promote best +practices. If this becomes popular, it could become easier to find +*subjectively good* configurations. This can act as a way to guide newcomers +to k8s config features that are easy to maintain, practical, and tested in some +real world environment. However, a lot of work remains to be done if this is +to happen. Extracting and ranking semantic-level information from the open +source configuration files is definitely not trivial, and will require a lot of +thought and consideration from the experts and of the patterns that successful k8s +projects follow. This is outside of my scope, as I have little to no experience with +k8s other than working on this project; however, if you have ideas I can +probably suggest approaches in order to implement them, having worked a lot on +this project. + +### Improving configuration files and container configs +I did not have a lot of time to refactor the images to use configmaps for +everything. This is a good thing to improve, and it should be fairly easy. Another +thing that could improve the user experience of launching this would be to make all +of the go utilities subcommands of the same binary/container image. This +would reduce the number of things that would have to be rebuilt in order to get +it running, and it would make the application (and its components) more self +contained. (It also has some disadvantages, so I'll let someone else decide.) + +### Adding graph metrics +From the Redis graph representation, we are able to run a multitude of graph +algorithms (not all of which are implemented). + +The simplest one would be to run Kruskal's algorithm to find connected +components, and to compute graph metrics on each component. 
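
As a rough sketch of the connected-component step, here is a small union-find (the structure that Kruskal's algorithm is built on) over string node IDs. It is not taken from the project: the type and method names are hypothetical, edges are treated as undirected for the purpose of grouping, and the node IDs in `main` are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// UnionFind groups nodes into disjoint sets, one per connected component.
type UnionFind struct {
	parent map[string]string
	size   map[string]int
}

func NewUnionFind() *UnionFind {
	return &UnionFind{parent: map[string]string{}, size: map[string]int{}}
}

// Find returns the representative of x's component, registering x as a
// singleton if it has not been seen before. It applies path halving.
func (u *UnionFind) Find(x string) string {
	if _, ok := u.parent[x]; !ok {
		u.parent[x] = x
		u.size[x] = 1
	}
	for u.parent[x] != x {
		u.parent[x] = u.parent[u.parent[x]]
		x = u.parent[x]
	}
	return x
}

// Union merges the components of a and b, attaching the smaller to the larger.
func (u *UnionFind) Union(a, b string) {
	ra, rb := u.Find(a), u.Find(b)
	if ra == rb {
		return
	}
	if u.size[ra] < u.size[rb] {
		ra, rb = rb, ra
	}
	u.parent[rb] = ra
	u.size[ra] += u.size[rb]
}

// ComponentSizes returns the size of every component in ascending order.
func (u *UnionFind) ComponentSizes() []int {
	var sizes []int
	for node, p := range u.parent {
		if node == p { // roots carry the size of their component
			sizes = append(sizes, u.size[node])
		}
	}
	sort.Ints(sizes)
	return sizes
}

func main() {
	uf := NewUnionFind()
	// Hypothetical dependency edges between kustomization documents.
	for _, e := range [][2]string{{"a", "b"}, {"b", "c"}, {"d", "e"}} {
		uf.Union(e[0], e[1])
	}
	uf.Find("f") // a document with no dependencies
	fmt.Println(uf.ComponentSizes()) // prints [1 2 3]
}
```

The component sizes returned here are the kind of raw input the per-component metrics would be computed from.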
Here are some of the +metrics that may be useful: + ++ Average size and histogram of the sizes of the components. + ++ Average and histogram of the highest in-degree (rdeps) in + each component. + ++ Average and histogram of the number of repositories in a connected + component. + ++ Any other metric that may be helpful to measure the scale of the kustomize + import graph. + +Another cool thing that may be helpful would be to output the graph +representation of deps/rdeps. This should be fairly easy to do with graphviz/dot +so if anyone really wants this, I (damienr74) should be able to do it. Feel free +to send me an email or to @ mention me in an issue. + +Note: DFS could also be used to find connected components, but I think union +find is preferable, since the results can be stored and modified very +efficiently. The only challenging part would be to implement deleting of edges +and nodes from a component efficiently, but I know it is possible to support +these operations with a union find structure. + +### Implementing PageRank +The graph is set up to be able to efficiently compute PageRank since the edge +weights are real valued, and the graph representation is sparse, which means that +it will fit in the memory of a single machine, which will make the processing +much more efficient. + +It could also be implemented as a Redis script, but I feel like there's +something fundamentally wrong with implementing PageRank in Lua. :P + +### Implement feature tracking +Each day, when the crawler finds and indexes these structured documents, +it should insert aggregate data into a separate index. This data could look like the +following: + +```json +{ + "kind": "kustomization", + "added_identifiers": [ + { + "identifier": "some:new:k8s:feature", + "documents": [ + "docID1", + "docID100", + "docID45", + ... + ] + }, + { + "identifier": "another:k8s:feature", + "documents": [ + ... + ] + }, + ... 
+ ], + + "removed_identifiers": [ + { + "identifier": "some:deprecated:field", + "documents": [ + ... + ] + } + ] +} +``` + +This would make it fairly easy to get deep insight into: +- the speed at which things can effectively be deprecated. +- how many people are migrating to current best practices. +- how many documents get updated frequently/rarely. +- detailed cross sections of growth/regression over conjunctions of features. +- a world of possibilities. + +This is also something that I would be interested to work on sometime soon, so +feel free to contact me (damienr74) or ask questions about this. + +As needed, it could be a good idea to also aggregate past data with a larger +granularity. For instance, each month the past 30 days can be aggregated into +roughly week-long buckets, and every year these weekly aggregations can be converted +into monthly summaries, depending on how much data this ends up being and how +much you want to pay for the storage of this data. + +Another cool way to compress this data would be to dynamically fold it +into a logarithmic number of buckets with decreasing granularity. But it +seems like overkill for the amount of data that we'd likely get. + +### The UI probably needs a lot of work +I'm not much of a UI/UX person and have little to no experience in developing +these types of applications. If anyone with Angular experience wants to dive in +and completely restructure the app to make the UI/UX/Code health better that +would be greatly appreciated. + +### Queries probably still need tuning +I'm also not an expert in Elasticsearch. From what I could read in the docs, +I think I've made sane decisions in converting user queries into meaningful +Elasticsearch queries, but I'm sure there are a lot of improvements that remain +to be made in order to get more accurate results. 
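
As a starting point for tuning, the conjunction behavior described in the README (every whitespace-separated token must match) could be expressed as an Elasticsearch `bool` query with one `multi_match` clause per token. This is only a sketch and not the server's actual query builder: `BuildQuery` is a hypothetical name, and of the field names used in `main`, `identifiers` and `values` come from the flattened structure in the README while `document` is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// BuildQuery turns a user query into a bool query in which every
// whitespace-separated token must match at least one of the given fields.
func BuildQuery(userQuery string, fields []string) map[string]interface{} {
	must := []interface{}{}
	for _, tok := range strings.Fields(userQuery) {
		must = append(must, map[string]interface{}{
			"multi_match": map[string]interface{}{
				"query":  tok,
				"fields": fields,
			},
		})
	}
	return map[string]interface{}{
		"query": map[string]interface{}{
			"bool": map[string]interface{}{"must": must},
		},
	}
}

func main() {
	q := BuildQuery(
		"kind=StatefulSet patches:target",
		[]string{"identifiers", "values", "document"}, // "document" is assumed
	)
	out, err := json.MarshalIndent(q, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Moving some clauses from `must` to `should` would be one way to experiment with looser matching.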
+ + +### Some other signals that indicate the presence of a good configuration file +There are lots of heuristics that could be used to achieve this. Here are a +couple in no particular order: + ++ Penalize for the number of yaml `---` document splits. I'm not sure what the + general consensus is, but I think it's better to separate them, since it + makes git commits less noisy, it's a trivial transformation, and it makes + config files smaller. However, I can understand the argument that it's somewhat + practical to keep an overall view of the configurations together (maybe). + ++ Penalize the number of unique identifiers in a structured document. I think + this makes sense, since we don't want to have someone game the search engine + to match documents with every possible path from the k8s docs. PageRank might + help with this to some extent, but with a small corpus it would be fairly easy + to game. + ++ Assign weights to the usefulness of certain fields. It would be good to + promote documents that use `keyRefFromConfigMap`, liveness probes, etc. + +These are the main ones I can think of, but I'm sure there are a *ton* of +ways to achieve this. + +If the corpus gets large enough, we might even be able to use *blockchains*, +*machine learning*, and maybe even self-driving cars. + +### Add more support for indexing of other k8s/kustomize related data +One thing that jumps to mind is the use of kustomize plugins. They are easy +to track since they all have an unused global variable: `var KustomizePlugin`. +It would be easy to run the pluginator command and generate godocs for each +go file with this unique identifier. 
+ +For the sake of completeness, here is the full GitHub query that we can use to +find these: +`api.github.com/search/code?q=var+KustomizePlugin+extension%3A.go&access_token=access_token` + +Godoc will not show much, since most packages will be using package main, but +using pluginator we can make it a properly named package such that Godoc would +actually generate the relevant documentation. diff --git a/internal/tools/pictures/github_token.png b/internal/tools/pictures/github_token.png new file mode 100644 index 000000000..0292de9a1 Binary files /dev/null and b/internal/tools/pictures/github_token.png differ diff --git a/internal/tools/pictures/sys_arch.png b/internal/tools/pictures/sys_arch.png new file mode 100644 index 000000000..3991e3ba1 Binary files /dev/null and b/internal/tools/pictures/sys_arch.png differ diff --git a/internal/tools/pictures/token_config.png b/internal/tools/pictures/token_config.png new file mode 100644 index 000000000..4edde8a96 Binary files /dev/null and b/internal/tools/pictures/token_config.png differ